---
language:
- zh
- en
license: apache-2.0
library_name: mlx
pipeline_tag: text-to-speech
tags:
- mlx
- tts
- speech
- voice-conditioned
- multi-speaker
- long-form
- diffusion
- apple-silicon
- quantized
- 8bit
---

# VibeVoice — MLX

VibeVoice Large converted and quantized for native MLX inference on Apple Silicon. Hybrid LLM + diffusion architecture for long-form speech, multi-speaker dialogue, and voice cloning.

## Variants

| Path | Precision |
| --- | --- |
| `mlx-int8/` | int8 quantized weights |

## How to Get Started

**Single speaker:**
```bash
python scripts/generate/vibevoice.py \
  --text "Hello from VibeVoice." \
  --output outputs/vibevoice.wav
```

**Multi-speaker dialogue** — speaker labels are 0-based:
```bash
python scripts/generate/vibevoice.py \
  --text "Speaker 0: Have you tried VibeVoice?
Speaker 1: Not yet. Does it need PyTorch?
Speaker 0: No. Pure MLX, runs locally on Apple Silicon.
Speaker 1: That is impressive." \
  --output outputs/dialogue.wav
```

**Voice cloning** — one reference WAV per speaker:
```bash
python scripts/generate/vibevoice.py \
  --text "Speaker 0: This is cloned from the reference." \
  --reference-audio-speaker0 ref_speaker0.wav \
  --output outputs/clone.wav
```

Up to 4 speakers supported: `--reference-audio-speaker0` through `--reference-audio-speaker3`.

**Default generation settings** (matching upstream):
- Greedy decoding (deterministic)
- Seed: 42
- Diffusion steps: 20

Add `--no-greedy` to enable temperature + top-p sampling.

## Model Details

VibeVoice uses a 9B-parameter hybrid architecture combining a Qwen2 language model backbone with a continuous diffusion acoustic decoder. Converted to MLX with explicit weight remapping — no PyTorch at inference time.

See [mlx-speech](https://github.com/appautomaton/mlx-speech) for the full runtime and conversion code.

## License

Apache 2.0.