[Transformers Integration] Understanding Voxtral Realtime architecture for porting

#17 · opened by Seungyoun

Hi Mistral team 👋

I'm interested in contributing a Hugging Face Transformers integration for Voxtral Mini 4B Realtime 2602. After reading the vLLM implementation, here's my current understanding (see attached GIF):

[GIF: voxtral-realtime-cropped]

  • Audio → mel spectrogram → Whisper-style conv + pooling (~80 ms / token) → causal / sliding-window audio encoder → adapter → audio_embeds
  • The LLM input is an element-wise sum (not a concatenation): audio_embeds + text_embeds (plus delay/time conditioning)
  • [STREAMING_PAD] fills the left pad and the initial delay window (e.g., 480 ms ≈ 6 tokens at 80 ms / token)
  • Decoding is autoregressive and text-only: only generated text tokens are fed back, while audio continues streaming step by step
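To make the fusion step concrete, here is a toy numerical sketch of my understanding. All shapes, sizes, and variable names are illustrative assumptions, not the actual Voxtral implementation:

```python
import numpy as np

D_MODEL = 8            # toy hidden size (illustrative only)
MS_PER_TOKEN = 80      # ~80 ms of audio per encoder output token
DELAY_MS = 480         # initial delay window

# Number of [STREAMING_PAD] tokens covering the delay window
n_pad = DELAY_MS // MS_PER_TOKEN
print(n_pad)  # 6

T = 10  # toy sequence length
rng = np.random.default_rng(0)
audio_embeds = rng.normal(size=(T, D_MODEL))  # from causal audio encoder + adapter
text_embeds = rng.normal(size=(T, D_MODEL))   # text embeddings ([STREAMING_PAD] for the first n_pad steps)

# LLM input is the element-wise sum, not a concatenation,
# so both streams must share the same sequence length and hidden size:
llm_input = audio_embeds + text_embeds
assert llm_input.shape == (T, D_MODEL)
```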

Questions

  1. Is the summary above correct?
  2. Is the causal audio encoder documented anywhere beyond the vLLM code?
  3. For Transformers, would a custom streaming wrapper (step-wise max_new_tokens=1 + incremental audio chunks) be acceptable, or is there a preferred integration pattern?
  4. Are [STREAMING_PAD] + delay/time embedding baked into weights/config, or mostly tokenizer-level handling?
  5. Any plans for a technical paper?
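For context on question 3, here is a minimal sketch of the streaming wrapper I have in mind: feed one new audio chunk, run one autoregressive step, repeat. `encode_chunk` and `decode_step` are hypothetical stand-ins, not a real Transformers API:

```python
def encode_chunk(chunk):
    # stand-in for incremental audio encoding (~80 ms per output token)
    return [f"emb({chunk})"]

def decode_step(audio_embeds, text_tokens):
    # stand-in for one autoregressive step (max_new_tokens=1);
    # only previously generated text tokens are fed back
    return f"tok{len(text_tokens)}"

audio_stream = ["a0", "a1", "a2"]  # incoming audio chunks
audio_embeds, text_tokens = [], []
for chunk in audio_stream:
    audio_embeds += encode_chunk(chunk)   # audio keeps streaming step by step
    text_tokens.append(decode_step(audio_embeds, text_tokens))

print(text_tokens)  # ['tok0', 'tok1', 'tok2']
```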

Happy to start a PR once the approach is aligned. Thanks!

cc. @patrickvonplaten , @pandora-s , @iliasslasri , @juliendenize , @sebag90 , @sanchit-gandhi

Oh, it has already appeared in Transformers: https://github.com/huggingface/transformers/pull/43769

Mistral AI_ org

A technical paper will come out as well. Your animation above looks correct and is very nice. We'll try to help bring the Transformers PR over the line.

Mistral AI_ org

Very nice animation @Seungyoun!

  1. Note that the encoder is a causal audio encoder trained from scratch (whereas Whisper's encoder is bidirectional), so new modelling code is required
  2. The time-delay is embedded using a sin/cos embedding, then projected via an MLP and used to modulate the residual stream in the text decoder
  3. For both 1 and 2, vLLM is the source of truth
  4. The paper will be out shortly to motivate these decisions
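Point 2 can be sketched as follows, assuming a standard sinusoidal embedding and an additive modulation of the residual stream; both are illustrative assumptions, and as noted above, vLLM remains the source of truth:

```python
import numpy as np

D = 8  # toy embedding size (illustrative only)

def sincos_embed(t_ms, dim=D):
    # standard sin/cos embedding of a scalar time delay
    freqs = 1.0 / (10000 ** (np.arange(dim // 2) / (dim // 2)))
    angles = t_ms * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(D, D)), rng.normal(size=(D, D))

def mlp(x):
    # two-layer MLP with ReLU projecting the time embedding
    return np.maximum(x @ W1, 0.0) @ W2

delay_cond = mlp(sincos_embed(480.0))  # 480 ms delay -> conditioning vector
residual = rng.normal(size=(D,))
residual = residual + delay_cond       # one possible way to modulate the residual stream
assert residual.shape == (D,)
```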
