[Transformers Integration] Understanding Voxtral Realtime architecture for porting
#17 by Seungyoun
Hi Mistral team 👋
I'm interested in contributing a Hugging Face Transformers integration for Voxtral Mini 4B Realtime 2602. After reading the vLLM implementation, here's my current understanding (see attached GIF):
- Audio → mel → Whisper-style conv + pooling (~80 ms / token) → causal / sliding-window audio encoder → adapter → audio_embeds (rough shape sketch after this list)
- LLM input is an element-wise sum (not concat): audio_embeds + text_embeds (plus delay/time conditioning)
- [STREAMING_PAD] fills the left-pad + initial delay window (e.g., ~480 ms ≈ 6 tokens at 80 ms / token)
- Decoding is AR text-only: only generated text tokens are fed back; audio continues streaming step-by-step
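To make the shapes concrete, here is a minimal sketch of how I picture the audio path. Every module, dimension, and the pooling factor below are my assumptions for illustration only, not the actual Voxtral modelling code (the vLLM implementation is the reference):

```python
import torch
import torch.nn as nn

# Placeholder dimensions for illustration, not the real Voxtral config.
N_MELS, ENC_DIM, LLM_DIM = 128, 1280, 3072
# 10 ms mel hop * (stride 2 conv) * (pool 4) = ~80 ms of audio per audio token.

class AudioToEmbeds(nn.Module):
    """Sketch: mel -> conv + pooling -> causal encoder -> adapter -> audio_embeds."""

    def __init__(self):
        super().__init__()
        # Whisper-style conv front-end plus extra pooling so each output frame covers ~80 ms.
        self.conv1 = nn.Conv1d(N_MELS, ENC_DIM, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(ENC_DIM, ENC_DIM, kernel_size=3, stride=2, padding=1)
        self.pool = nn.AvgPool1d(kernel_size=4)  # placeholder for the real pooling
        # Single layer standing in for the causal / sliding-window audio encoder
        # (trained from scratch, NOT the bi-directional Whisper encoder).
        self.encoder = nn.TransformerEncoderLayer(ENC_DIM, nhead=16, batch_first=True)
        self.adapter = nn.Linear(ENC_DIM, LLM_DIM)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, n_frames)
        x = self.pool(self.conv2(self.conv1(mel)))  # downsample in time by 8x
        x = x.transpose(1, 2)                       # (batch, T_audio, ENC_DIM)
        T = x.size(1)
        causal = torch.full((T, T), float("-inf"), device=x.device).triu(diagonal=1)
        x = self.encoder(x, src_mask=causal)        # causal attention over audio frames
        return self.adapter(x)                      # (batch, T_audio, LLM_DIM)

# The LLM input is then an element-wise sum, not a concatenation:
#   inputs_embeds = audio_embeds + text_embeds  (+ delay/time conditioning)
# so the two streams must be length-aligned, with [STREAMING_PAD] covering the
# left-pad + initial delay window (e.g. 480 ms / 80 ms-per-token = 6 pad tokens).
```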
Questions
- Is the summary above correct?
- Is the causal audio encoder documented anywhere beyond the vLLM code?
- For Transformers, would a custom streaming wrapper (step-wise max_new_tokens=1 + incremental audio chunks) be acceptable, or is there a preferred integration pattern? (rough sketch of what I mean after this list)
- Are [STREAMING_PAD] + the delay/time embedding baked into the weights/config, or mostly tokenizer-level handling?
- Any plans for a technical paper?
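To be concrete about the streaming question, this is roughly the loop I have in mind. It is purely a sketch: the model call signature, the feature-extractor usage, and the [STREAMING_PAD] handling via pad_token_id are all assumptions on my side, not an existing API:

```python
import torch

def stream_transcribe(model, feature_extractor, tokenizer, audio_chunks, device="cuda"):
    """Hypothetical step-wise loop: feed ~80 ms audio chunks, decode one text token per step.

    Only generated text tokens are fed back autoregressively; the audio side keeps
    streaming new chunks, so old audio is never re-encoded.
    """
    past_key_values = None
    generated_ids = [tokenizer.bos_token_id]  # assumed BOS handling
    transcript = []

    for chunk in audio_chunks:  # each chunk corresponds to ~one audio token (~80 ms)
        audio_inputs = feature_extractor(chunk, sampling_rate=16_000, return_tensors="pt")
        with torch.no_grad():
            out = model(  # assumed signature of the ported model
                input_ids=torch.tensor([[generated_ids[-1]]], device=device),
                input_features=audio_inputs.input_features.to(device),
                past_key_values=past_key_values,
                use_cache=True,
            )
        past_key_values = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1).item()  # greedy, i.e. max_new_tokens=1
        generated_ids.append(next_id)
        if next_id != tokenizer.pad_token_id:  # assumed [STREAMING_PAD] id
            transcript.append(tokenizer.decode([next_id]))

    return "".join(transcript)
```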
Happy to start a PR once the approach is aligned. Thanks!
cc @patrickvonplaten, @pandora-s, @iliasslasri, @juliendenize, @sebag90, @sanchit-gandhi
A technical paper will come out as well. Your animation above looks correct and is very nice. Will try to help bring the Transformers PR over the line.
Very nice animation @Seungyoun !
- Note that the encoder is a causal audio encoder trained from scratch (whereas Whisper's encoder is bi-directional), so new modelling code is required
- The time delay is embedded using a sin/cos embedding, then projected via an MLP and used to modulate the residual stream in the text decoder (a rough illustration follows this list)
- For both points above (encoder and time-delay conditioning), vLLM is the source of truth
- The paper will be out shortly to motivate these decisions
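As a very rough illustration of the time-delay point, the computation has roughly the shape below. All names, dimensions, and the additive form of the modulation are placeholders for discussion; the vLLM code and the upcoming paper remain authoritative:

```python
import math
import torch
import torch.nn as nn

class TimeDelayConditioning(nn.Module):
    """Sketch: scalar delay -> sin/cos embedding -> MLP -> modulation of the
    text-decoder residual stream (illustrative only, not the real modelling code)."""

    def __init__(self, dim: int = 3072, freq_dim: int = 256):
        super().__init__()
        self.freq_dim = freq_dim
        self.mlp = nn.Sequential(
            nn.Linear(freq_dim, dim),
            nn.SiLU(),
            nn.Linear(dim, dim),
        )

    def sincos(self, delay: torch.Tensor) -> torch.Tensor:
        # Standard sinusoidal features over the scalar delay, as in positional embeddings.
        half = self.freq_dim // 2
        freqs = torch.exp(-math.log(10_000.0) * torch.arange(half, device=delay.device) / half)
        angles = delay[:, None] * freqs[None, :]              # (batch, half)
        return torch.cat([angles.sin(), angles.cos()], dim=-1)  # (batch, freq_dim)

    def forward(self, hidden_states: torch.Tensor, delay: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, dim); delay: (batch,)
        cond = self.mlp(self.sincos(delay))                   # (batch, dim)
        # "Modulate the residual stream" is shown here as an additive shift; whether the
        # real model uses additive, scale, or AdaLN-style modulation is defined in vLLM.
        return hidden_states + cond[:, None, :]
```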
