Kyutai
non-profit
Papers
CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion
ARC-Encoder: learning compressed text representations for large language models
Pretrained ARC-Encoders and a fine-tuning dataset: context compression for unmodified LLMs.
https://kyutai.org/next/stt
MoshiVis is a vision-speech model, built as a perceptually augmented version of Moshi v0.1 for conversing about image inputs.
MLX, Candle & PyTorch model checkpoints from Kyutai's Moshi release. Run inference via: https://github.com/kyutai-labs/moshi
CASA: Cross-Attention as Self-Attention for Efficient Vision-Language Fusion on long-context streaming inputs
- CASA Gallery
  Video Gallery for CASA: Cross-Attention via Self-Attention
- CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion
  Paper • 2512.19535 • Published • 12
- kyutai/CASA-Helium1-VL-2B
  Image-Text-to-Text • 3B • Updated • 59 • 7
- kyutai/CASA-Qwen2_5-VL-3B
  Image-Text-to-Text • 4B • Updated • 263 • 2
https://kyutai.org/next/tts
Helium 1: a modular and multilingual LLM
Hibiki is a model for streaming speech translation, which can run on-device! See https://github.com/kyutai-labs/hibiki.