# Qwen3-TTS-Tokenizer-12Hz
This repository contains Qwen3-TTS-Tokenizer-12Hz, the speech tokenizer presented in the Qwen3-TTS Technical Report.

Qwen3-TTS-Tokenizer-12Hz achieves extreme bitrate reduction and ultra-low-latency streaming with immediate first-packet emission, enabled by its 12.5 Hz frame rate, 16-layer multi-codebook design, and a lightweight causal ConvNet.
- Paper: Qwen3-TTS Technical Report
- GitHub Repository: QwenLM/Qwen3-TTS
- Demo: Hugging Face Space
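To give a rough sense of the bitrate implied by the 12.5 Hz, 16-layer multi-codebook design, the back-of-the-envelope sketch below counts codes per second and converts them to bits. The per-codebook vocabulary size used here is an illustrative assumption, not a value taken from the report.

```python
import math

# Back-of-the-envelope bitrate for the 12.5 Hz, 16-codebook design.
frame_rate_hz = 12.5    # codec frame rate (from the model description)
num_codebooks = 16      # multi-codebook layers (from the model description)
codebook_size = 1024    # ASSUMED vocabulary size per codebook, for illustration only

codes_per_second = frame_rate_hz * num_codebooks        # 200 codes/s
bits_per_code = math.log2(codebook_size)                # 10 bits under the assumption
bitrate_kbps = codes_per_second * bits_per_code / 1000  # ~2.0 kbps

print(f"{codes_per_second:.0f} codes/s, ~{bitrate_kbps:.1f} kbps")
```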
## Quickstart

### Environment Setup
Install the qwen-tts Python package from PyPI:
```bash
pip install -U qwen-tts
```
### Tokenizer Encode and Decode
You can encode audio into discrete tokens for storage or transport and decode them back into speech using the snippet below:
```python
import soundfile as sf
from qwen_tts import Qwen3TTSTokenizer

tokenizer = Qwen3TTSTokenizer.from_pretrained(
    "Qwen/Qwen3-TTS-Tokenizer-12Hz",
    device_map="cuda:0",
)

# Encode audio from a URL (or a local path) into discrete codes
enc = tokenizer.encode("https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/tokenizer_demo_1.wav")

# Decode the codes back into waveforms
wavs, sr = tokenizer.decode(enc)
sf.write("decode_output.wav", wavs[0], sr)
```
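Since the codes are intended for storage or transport, the sketch below shows one way to persist them between the encode and decode steps. The structure of the object returned by `encode` is not documented here, so it is serialized wholesale with `pickle`; this is an assumption for illustration, not an officially supported path.

```python
import pickle

# ASSUMPTION: the object returned by tokenizer.encode(...) is picklable.
# If it holds GPU tensors, loading it again requires a compatible device.
with open("codes.pkl", "wb") as f:
    pickle.dump(enc, f)

# Later (or on another machine with the tokenizer loaded), restore and decode.
with open("codes.pkl", "rb") as f:
    enc_restored = pickle.load(f)

wavs, sr = tokenizer.decode(enc_restored)
sf.write("decode_from_stored_codes.wav", wavs[0], sr)
```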
## Overview

### Introduction
Qwen3-TTS covers 10 major languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian) as well as multiple dialectal voice profiles. Key features:
- Powerful Speech Representation: Powered by the self-developed Qwen3-TTS-Tokenizer-12Hz, it achieves efficient acoustic compression and high-dimensional semantic modeling of speech signals while fully preserving paralinguistic information and acoustic-environment features.
- Extreme Low-Latency Streaming Generation: Built on the Dual-Track hybrid streaming generation architecture, it can emit the first audio packet immediately after a single character is input, with end-to-end synthesis latency as low as 97 ms.
### Model Architecture

### Released Tokenizers
| Tokenizer Name | Description |
|---|---|
| Qwen3-TTS-Tokenizer-12Hz | Speech tokenizer that encodes input speech into discrete codes and decodes those codes back into speech. |
## Evaluation
For detailed evaluation results on speech generation consistency, speaker similarity, and tokenizer benchmarks (ASR tasks, PESQ, STOI, UTMOS), please refer to the technical report or the GitHub repository.
## Citation
```bibtex
@article{Qwen3-TTS,
  title={Qwen3-TTS Technical Report},
  author={Hangrui Hu and Xinfa Zhu and Ting He and Dake Guo and Bin Zhang and Xiong Wang and Zhifang Guo and Ziyue Jiang and Hongkun Hao and Zishan Guo and Xinyu Zhang and Pei Zhang and Baosong Yang and Jin Xu and Jingren Zhou and Junyang Lin},
  journal={arXiv preprint arXiv:2601.15621},
  year={2026}
}
```