[Transformers Integration] Understanding Voxtral Realtime architecture for porting
#17 by Seungyoun
Hi Mistral team 👋
I'm interested in contributing a Hugging Face Transformers integration for Voxtral Mini 4B Realtime 2602. After reading the vLLM implementation, here's my current understanding (see attached GIF):
- Audio → mel → Whisper-style conv + pooling (~80 ms / token) → causal / sliding-window audio encoder → adapter → audio_embeds (rough shape sketch after this list)
- LLM input is an element-wise sum (not concat): audio_embeds + text_embeds (plus delay/time conditioning)
- [STREAMING_PAD] fills the left-pad + initial delay window (e.g., ~480 ms ≈ 6 tokens at 80 ms / token)
- Decoding is AR text-only: only generated text tokens are fed back; audio continues streaming step-by-step
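To make the shapes concrete, here is a minimal sketch of how I picture the audio path. Every module, dimension, and the pooling factor below are my assumptions for illustration only, not the actual Voxtral modelling code (the vLLM implementation is the reference):

```python
import torch
import torch.nn as nn

# Placeholder dimensions for illustration, not the real Voxtral config.
N_MELS, ENC_DIM, LLM_DIM = 128, 1280, 3072
# 10 ms mel hop * (stride 2 conv) * (pool 4) = ~80 ms of audio per audio token.

class AudioToEmbeds(nn.Module):
    """Sketch: mel -> conv + pooling -> causal encoder -> adapter -> audio_embeds."""

    def __init__(self):
        super().__init__()
        # Whisper-style conv front-end plus extra pooling so each output frame covers ~80 ms.
        self.conv1 = nn.Conv1d(N_MELS, ENC_DIM, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(ENC_DIM, ENC_DIM, kernel_size=3, stride=2, padding=1)
        self.pool = nn.AvgPool1d(kernel_size=4)  # placeholder for the real pooling
        # Single layer standing in for the causal / sliding-window audio encoder
        # (trained from scratch, NOT the bi-directional Whisper encoder).
        self.encoder = nn.TransformerEncoderLayer(ENC_DIM, nhead=16, batch_first=True)
        self.adapter = nn.Linear(ENC_DIM, LLM_DIM)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, n_frames)
        x = self.pool(self.conv2(self.conv1(mel)))  # downsample in time by 8x
        x = x.transpose(1, 2)                       # (batch, T_audio, ENC_DIM)
        T = x.size(1)
        causal = torch.full((T, T), float("-inf"), device=x.device).triu(diagonal=1)
        x = self.encoder(x, src_mask=causal)        # causal attention over audio frames
        return self.adapter(x)                      # (batch, T_audio, LLM_DIM)

# The LLM input is then an element-wise sum, not a concatenation:
#   inputs_embeds = audio_embeds + text_embeds  (+ delay/time conditioning)
# so the two streams must be length-aligned, with [STREAMING_PAD] covering the
# left-pad + initial delay window (e.g. 480 ms / 80 ms-per-token = 6 pad tokens).
```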
Questions
- Is the summary above correct?
- Is the causal audio encoder documented anywhere beyond the vLLM code?
- For Transformers, would a custom streaming wrapper (step-wise max_new_tokens=1 + incremental audio chunks) be acceptable, or is there a preferred integration pattern? (rough sketch of what I mean after this list)
- Are [STREAMING_PAD] + the delay/time embedding baked into the weights/config, or mostly tokenizer-level handling?
- Any plans for a technical paper?
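To be concrete about the streaming question, this is roughly the loop I have in mind. It is purely a sketch: the model call signature, the feature-extractor usage, and the [STREAMING_PAD] handling via pad_token_id are all assumptions on my side, not an existing API:

```python
import torch

def stream_transcribe(model, feature_extractor, tokenizer, audio_chunks, device="cuda"):
    """Hypothetical step-wise loop: feed ~80 ms audio chunks, decode one text token per step.

    Only generated text tokens are fed back autoregressively; the audio side keeps
    streaming new chunks, so old audio is never re-encoded.
    """
    past_key_values = None
    generated_ids = [tokenizer.bos_token_id]  # assumed BOS handling
    transcript = []

    for chunk in audio_chunks:  # each chunk corresponds to ~one audio token (~80 ms)
        audio_inputs = feature_extractor(chunk, sampling_rate=16_000, return_tensors="pt")
        with torch.no_grad():
            out = model(  # assumed signature of the ported model
                input_ids=torch.tensor([[generated_ids[-1]]], device=device),
                input_features=audio_inputs.input_features.to(device),
                past_key_values=past_key_values,
                use_cache=True,
            )
        past_key_values = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1).item()  # greedy, i.e. max_new_tokens=1
        generated_ids.append(next_id)
        if next_id != tokenizer.pad_token_id:  # assumed [STREAMING_PAD] id
            transcript.append(tokenizer.decode([next_id]))

    return "".join(transcript)
```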
Happy to start a PR once the approach is aligned. Thanks!
cc @patrickvonplaten, @pandora-s, @iliasslasri, @juliendenize, @sebag90, @sanchit-gandhi
A technical paper will come out as well. Your animation above looks correct and is very nice. Will try to help bring the Transformers PR over the line.
Very nice animation @Seungyoun !
- Note that the encoder is a causal audio encoder trained from scratch (whereas Whisper's encoder is bi-directional), so new modelling code is required
- The time delay is embedded using a sin/cos embedding, then projected via an MLP and used to modulate the residual stream in the text decoder (a rough illustration follows this list)
- For both points above (encoder and time-delay conditioning), vLLM is the source of truth
- The paper will be out shortly to motivate these decisions
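As a very rough illustration of the time-delay point, the computation has roughly the shape below. All names, dimensions, and the additive form of the modulation are placeholders for discussion; the vLLM code and the upcoming paper remain authoritative:

```python
import math
import torch
import torch.nn as nn

class TimeDelayConditioning(nn.Module):
    """Sketch: scalar delay -> sin/cos embedding -> MLP -> modulation of the
    text-decoder residual stream (illustrative only, not the real modelling code)."""

    def __init__(self, dim: int = 3072, freq_dim: int = 256):
        super().__init__()
        self.freq_dim = freq_dim
        self.mlp = nn.Sequential(
            nn.Linear(freq_dim, dim),
            nn.SiLU(),
            nn.Linear(dim, dim),
        )

    def sincos(self, delay: torch.Tensor) -> torch.Tensor:
        # Standard sinusoidal features over the scalar delay, as in positional embeddings.
        half = self.freq_dim // 2
        freqs = torch.exp(-math.log(10_000.0) * torch.arange(half, device=delay.device) / half)
        angles = delay[:, None] * freqs[None, :]              # (batch, half)
        return torch.cat([angles.sin(), angles.cos()], dim=-1)  # (batch, freq_dim)

    def forward(self, hidden_states: torch.Tensor, delay: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, dim); delay: (batch,)
        cond = self.mlp(self.sincos(delay))                   # (batch, dim)
        # "Modulate the residual stream" is shown here as an additive shift; whether the
        # real model uses additive, scale, or AdaLN-style modulation is defined in vLLM.
        return hidden_states + cond[:, None, :]
```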
