Sori-4B

Speech-to-Text model combining:

  • Audio Encoder: Qwen3-Omni-30B-A3B-Instruct (pretrained weights)
  • Language Model: Qwen3-4B-Instruct-2507

Architecture

Audio (Mel Spectrogram) → Qwen3-Omni Audio Encoder → Continuous Features → Qwen3-4B LLM → Text
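
The encoder produces continuous audio features that are bridged into the LLM's embedding space and consumed as prefix embeddings. The sketch below illustrates that bridging step; the projector class, its name, and the dimensions are illustrative assumptions, not the actual modules in modeling_sori_speech.

import torch.nn as nn

# Hypothetical sketch of the audio-to-LLM bridge; names and dims are assumptions.
class AudioToLLMProjector(nn.Module):
    def __init__(self, encoder_dim=1280, llm_dim=2560):  # llm_dim assumed to match Qwen3-4B hidden size
        super().__init__()
        self.proj = nn.Linear(encoder_dim, llm_dim)

    def forward(self, audio_features):
        # audio_features: (batch, frames, encoder_dim) from the audio encoder
        # returns prefix embeddings of shape (batch, frames, llm_dim) for the LLM
        return self.proj(audio_features)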

Usage

from transformers import AutoTokenizer
from modeling_sori_speech import SoriSpeechForConditionalGeneration
import torch
import torchaudio

# Load model
model = SoriSpeechForConditionalGeneration.from_pretrained(
    "Seungyoun/Sori-4B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("Seungyoun/Sori-4B")

# Load audio and convert to 16 kHz mono
audio, sr = torchaudio.load("speech.wav")
if audio.shape[0] > 1:
    audio = audio.mean(dim=0, keepdim=True)  # downmix multi-channel audio to mono
if sr != 16000:
    audio = torchaudio.transforms.Resample(sr, 16000)(audio)

# Extract log-mel spectrogram (25 ms window, 10 ms hop at 16 kHz, 128 mel bins)
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=128
)
mel_spec = torch.log(mel_transform(audio) + 1e-6)
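# mel_spec shape: (channels, n_mels, frames), i.e. (1, 128, T) for 16 kHz mono input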

# Generate transcription
# ... (see full example in model card)
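
The generation step is elided above; for reference, a typical call for a trust_remote_code speech model is sketched below. The input_features keyword is an assumption; check modeling_sori_speech for the exact interface.

mel_spec = mel_spec.to(model.device, dtype=torch.bfloat16)

with torch.no_grad():
    # Hypothetical kwarg name; the real signature may differ.
    output_ids = model.generate(input_features=mel_spec, max_new_tokens=256)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))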

License

Apache 2.0 (following Qwen3-Omni and Qwen3 licenses)
