Text-only extraction of Qwen3.5-0.8B quantized for on-device LLM chat on Apple Silicon via MLX.
Qwen3.5 is a hybrid model with 24 layers:

[linear, linear, linear, full] × 6

| Variant | Size | Path |
|---|---|---|
| INT4 | 404 MB | int4/model.safetensors |
| INT8 | 786 MB | int8/model.safetensors |
Each variant includes config.json, tokenizer.json, and tokenizer_config.json.
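The hybrid layer schedule above can be written out explicitly; a minimal sketch (names are illustrative, not the model's actual API):

```python
# The 24-layer hybrid attention schedule: three linear-attention layers
# followed by one full-attention layer, repeated six times.
pattern = ["linear", "linear", "linear", "full"]
layer_types = pattern * 6  # 24 layers total, 6 of them full attention

print(len(layer_types), layer_types.count("full"))
```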
```swift
import Qwen3Chat

let model = try await Qwen35MLXChat.fromPretrained(quantization: .int4)
let response = try model.generate(
    messages: [ChatMessage(role: .user, content: "Hello!")],
    sampling: ChatSamplingConfig(temperature: 0.3, maxTokens: 100)
)
```
Part of the soniqo speech toolkit for Apple Silicon.
Quantized directly from Qwen/Qwen3.5-0.8B using mx.quantize() with group_size=64. The text model was extracted and the vision tower removed; norm weights were adjusted (+1) and Conv1d weights transposed to MLX's channels-last format.
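The norm and Conv1d adjustments can be sketched as follows; the helper names are hypothetical and only mirror the steps listed above:

```python
import numpy as np

def adjust_norm(w: np.ndarray) -> np.ndarray:
    # Some checkpoints store the norm weight as (w - 1); add 1 back so the
    # runtime can apply it directly ("Norm weights adjusted (+1)").
    return w + 1.0

def conv1d_to_mlx(w: np.ndarray) -> np.ndarray:
    # PyTorch Conv1d weight is (out_channels, in_channels, kernel_size);
    # MLX expects channels-last: (out_channels, kernel_size, in_channels).
    return np.transpose(w, (0, 2, 1))
```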
```bash
python scripts/convert_qwen35_chat_mlx.py --output int4/ --bits 4
python scripts/convert_qwen35_chat_mlx.py --output int8/ --bits 8
```
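For intuition, here is a pure-Python sketch of what group-wise quantization with group_size=64 does conceptually; mx.quantize itself is implemented in MLX and packs its results differently:

```python
def quantize_group(group, bits=4):
    # Each group of 64 weights shares one scale and one bias (the group
    # minimum); values map to integers in [0, 2**bits - 1].
    lo, hi = min(group), max(group)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = [round((w - lo) / scale) for w in group]
    return q, scale, lo

def dequantize_group(q, scale, bias):
    return [v * scale + bias for v in q]

weights = [i / 10 for i in range(64)]  # one group of 64 weights
q, scale, bias = quantize_group(weights, bits=4)
restored = dequantize_group(q, scale, bias)
# Rounding error is bounded by half the quantization step (scale / 2).
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```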