# Sori-4B

A speech-to-text model combining:
- Audio Encoder: Qwen3-Omni-30B-A3B-Instruct (pretrained weights)
- Language Model: Qwen3-4B-Instruct-2507
## Architecture
Audio (Mel Spectrogram) → Qwen3-Omni Audio Encoder → Continuous Features → Qwen3-4B LLM → Text
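The diagram leaves the encoder-to-LLM connector implicit. The sketch below illustrates one common way such a bridge is wired — a learned linear projection mapping the encoder's continuous features into the LLM's embedding width so they can be consumed as prefix embeddings. The class and attribute names (`AudioToLLMConnector`, `audio_proj`) are hypothetical, not taken from the released code.

```python
import torch
import torch.nn as nn

# Minimal sketch of an assumed encoder-to-LLM bridge, not the released
# implementation: a linear layer projects audio-encoder features into
# the LLM's hidden size so they can be prepended as prefix embeddings.
class AudioToLLMConnector(nn.Module):
    def __init__(self, encoder_dim: int, llm_dim: int):
        super().__init__()
        # Hypothetical learned projection between the two model widths.
        self.audio_proj = nn.Linear(encoder_dim, llm_dim)

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        # audio_features: (batch, frames, encoder_dim) from the audio encoder
        # returns:        (batch, frames, llm_dim) embeddings for the LLM
        return self.audio_proj(audio_features)
```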
## Usage
```python
import torch
import torchaudio
from transformers import AutoTokenizer

# Custom modeling code distributed with the checkpoint
from modeling_sori_speech import SoriSpeechForConditionalGeneration

# Load model
model = SoriSpeechForConditionalGeneration.from_pretrained(
    "Seungyoun/Sori-4B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("Seungyoun/Sori-4B")

# Load audio and resample to the 16 kHz rate the encoder expects
audio, sr = torchaudio.load("speech.wav")
if sr != 16000:
    audio = torchaudio.transforms.Resample(sr, 16000)(audio)

# Extract a 128-bin log-mel spectrogram
# (n_fft=400 and hop_length=160 at 16 kHz give a 25 ms window and 10 ms hop)
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=128
)
mel_spec = torch.log(mel_transform(audio) + 1e-6)

# Generate transcription
# ... (see full example in model card)
```
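The generation step itself is elided above; as a rough illustration of what it could look like, here is a minimal sketch. The `input_features` keyword and the bfloat16 cast are assumptions about the custom class's interface, not its confirmed API — defer to the full example in the model card.

```python
# Sketch only: `input_features` is an assumed argument name for the
# custom generate() interface; the model card's full example is authoritative.
with torch.no_grad():
    output_ids = model.generate(
        input_features=mel_spec.to(model.device, dtype=torch.bfloat16),
        max_new_tokens=256,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```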
## License
Apache 2.0 (following Qwen3-Omni and Qwen3 licenses)