StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
Paper: arXiv:2509.22220
StableToken is a noise-robust semantic speech tokenizer for discrete speech representation learning that achieves state-of-the-art stability in noisy environments.
For code and more detailed information, please refer to the corresponding GitHub repository.
| Attribute | Value |
|---|---|
| Frame Rate | 25 Hz |
| Codebook Size | 8,192 |
| BPS (Bits Per Second) | 325 |
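The 325 bps figure follows directly from the other two attributes: each token indexes one of 8,192 codebook entries, i.e. log2(8192) = 13 bits per token, emitted at 25 tokens per second. A quick check:

```python
import math

# Bitrate of a single-codebook tokenizer: tokens per second times
# the bits needed to index one codebook entry.
frame_rate = 25        # tokens per second
codebook_size = 8192   # entries in the codebook

bits_per_token = math.log2(codebook_size)  # 13.0
bps = frame_rate * bits_per_token          # 25 * 13 = 325.0
print(bits_per_token, bps)
```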
To use StableToken, clone the official repository and install the dependencies:

```shell
git clone --recursive https://github.com/Tencent/StableToken.git
cd StableToken && pip install -r requirements.txt
```
```python
import os
from huggingface_hub import snapshot_download
from transformers import WhisperFeatureExtractor
from src.model.modeling_whisper import WhisperLFQEncoder
from src.utils.flow_inference import AudioDecoder
from src.utils.utils import extract_speech_token, speech_token_to_wav

# 1. Download & Load Models
model_dir = snapshot_download("tencent/StableToken")

# Load Tokenizer
tokenizer = WhisperLFQEncoder.from_pretrained(os.path.join(model_dir, "tokenizer")).eval().cuda()
feature_extractor = WhisperFeatureExtractor.from_pretrained(os.path.join(model_dir, "tokenizer"))

# Load Decoder
decoder = AudioDecoder(
    config_path=os.path.join(model_dir, "decoder", "config.yaml"),
    flow_ckpt_path=os.path.join(model_dir, "decoder", "flow.pt"),
    hift_ckpt_path=os.path.join(model_dir, "decoder", "hift.pt"),
    device="cuda"
)

# 2. Tokenize
tokens = extract_speech_token(tokenizer, feature_extractor, ["/path/to/audio.wav"], device="cuda")[0]

# 3. Reconstruct
tts_speech, sampling_rate = speech_token_to_wav(decoder, tokens)
```
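The snippet above leaves the reconstructed waveform in memory. A minimal sketch for writing it to disk, assuming `speech_token_to_wav` returns a float waveform in [-1, 1] plus an integer sampling rate (here a synthetic sine wave stands in for the decoder output, so the sketch is self-contained):

```python
import math
import struct
import wave

# Stand-in for the decoder output: 0.5 s of a 440 Hz sine in [-1, 1].
sampling_rate = 22050
tts_speech = [math.sin(2 * math.pi * 440 * t / sampling_rate)
              for t in range(sampling_rate // 2)]

# Convert floats to 16-bit PCM and write a mono WAV with the stdlib.
pcm = struct.pack("<%dh" % len(tts_speech),
                  *(int(max(-1.0, min(1.0, x)) * 32767) for x in tts_speech))
with wave.open("reconstructed.wav", "wb") as f:
    f.setnchannels(1)       # mono
    f.setsampwidth(2)       # 16-bit samples
    f.setframerate(sampling_rate)
    f.writeframes(pcm)
```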
StableToken achieves roughly 60% lower UED (Unit Edit Distance) than the best existing supervised semantic tokenizers.
| Model | Frame Rate | Codebook Size | UED (%, ↓) |
|---|---|---|---|
| GLM-4-Voice-Tokenizer | 12.5 Hz | 16,384 | 31.10 |
| S3 Tokenizer | 25 Hz | 4,096 | 26.17 |
| CosyVoice2 | 25 Hz | 6,561 | 38.66 |
| StableToken | 25 Hz | 8,192 | **10.17** |
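UED quantifies how many token-level edits separate the sequences a tokenizer emits for clean versus perturbed audio. The paper's exact protocol is not reproduced here; a generic normalized Levenshtein distance conveys the idea:

```python
def unit_edit_distance(clean, noisy):
    """Normalized Levenshtein distance between two token sequences.

    Illustrative definition: insert/delete/substitute edits needed to
    turn `noisy` into `clean`, divided by the clean-sequence length.
    """
    m, n = len(clean), len(noisy)
    dp = list(range(n + 1))  # dp[j] = distance(clean[:i], noisy[:j])
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,      # delete a noisy token
                        dp[j - 1] + 1,  # insert a clean token
                        prev + (clean[i - 1] != noisy[j - 1]))  # substitute
            prev = cur
    return dp[n] / max(m, 1)

# Two of ten tokens flip under noise -> UED = 20%.
clean = [5, 812, 77, 4096, 13, 9, 2048, 6, 31, 100]
noisy = [5, 812, 70, 4096, 13, 9, 2049, 6, 31, 100]
print(unit_edit_distance(clean, noisy) * 100)  # 20.0
```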
Reconstruction quality (WER and MOS) is measured on the LibriSpeech (LS) and SEED benchmarks.
| Model | Frame Rate | BPS | WER (↓) LS-clean | WER (↓) LS-other | WER (↓) SEED-en | WER (↓) SEED-zh | MOS (↑) LS-clean | MOS (↑) LS-other | MOS (↑) SEED-en | MOS (↑) SEED-zh |
|---|---|---|---|---|---|---|---|---|---|---|
| GLM-4-Voice-Tokenizer | 12.5 Hz | 175 | 4.04 | 9.33 | 3.54 | 3.23 | 4.07 | 3.99 | 4.16 | 4.10 |
| S3 Tokenizer | 25 Hz | 300 | 5.78 | 13.38 | 5.91 | 4.26 | 3.40 | 3.31 | 3.40 | 3.31 |
| CosyVoice2 | 25 Hz | 325 | 4.25 | 9.68 | 4.34 | 2.75 | 3.36 | 3.25 | 3.31 | 3.58 |
| StableToken | 25 Hz | 325 | 3.84 | 7.99 | 3.44 | 2.62 | 4.09 | 3.83 | 4.01 | 4.18 |
```bibtex
@article{song2025stabletoken,
  title={StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs},
  author={Song, Yuhan and Zhang, Linhao and Wu, Chuhan and Liu, Aiwei and Jia, Wei and Wang, Houfeng and Zhou, Xiao},
  journal={arXiv preprint arXiv:2509.22220},
  year={2025}
}
```
This project is licensed under the License Term of StableToken.