---
license: cc-by-4.0
library_name: coreml
base_model:
- nvidia/diar_streaming_sortformer_4spk-v2.1
base_model_relation: finetune
tags:
- speaker-diarization
- speech
- audio
- coreml
- apple
- ios
- macos
- sortformer
- streaming
pipeline_tag: automatic-speech-recognition
---
# Sortformer CoreML Models
Streaming speaker diarization models converted from NVIDIA's Sortformer to CoreML for Apple Silicon.
## Model Variants
| Variant | File | Latency | Use Case |
| -------------------- | ------------------------------------ | ------- | --------------------- |
| **Fastest v2** | `Sortformer_v2.mlmodelc` | ~1.04s | Low latency streaming |
| **Fastest v2.1** | `Sortformer_v2.1.mlmodelc` | ~1.04s | Low latency streaming |
| **NVIDIA Low v2** | `SortformerNvidiaLow_v2.mlmodelc` | ~1.04s | Low latency streaming |
| **NVIDIA Low v2.1** | `SortformerNvidiaLow_v2.1.mlmodelc` | ~1.04s | Low latency streaming |
| **NVIDIA High v2** | `SortformerNvidiaHigh_v2.mlmodelc` | ~30.4s | Best quality, offline |
| **NVIDIA High v2.1** | `SortformerNvidiaHigh_v2.1.mlmodelc` | ~30.4s | Best quality, offline |
`v2` and `v2.1` denote the version of the underlying NVIDIA model weights. According to NVIDIA, `v2.1` is more robust in meeting scenarios.
## Configuration Parameters
| Parameter | Default | NVIDIA Low | NVIDIA High |
| ------------------- | ------- | ---------- | ----------- |
| chunk_len | 6 | 6 | 340 |
| chunk_right_context | 7 | 7 | 40 |
| chunk_left_context | 1 | 1 | 1 |
| fifo_len | 40 | 188 | 40 |
| spkcache_len | 188 | 188 | 188 |
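The configuration parameters above fully determine the model's tensor shapes. As a sanity check, here is a minimal Python sketch (helper and config names are illustrative, not part of any shipped API) that derives the mel-feature frame count per chunk, where the factor of 8 comes from the 8 feature frames per model frame in the chunk input shape `[1, 8*(C+L+R), 128]`:

```python
# Illustrative sketch: derive per-config chunk input sizes from the
# configuration table. Names here are hypothetical, not a real API.
CONFIGS = {
    "default":     dict(chunk_len=6,   left=1, right=7,  fifo_len=40,  spkcache_len=188),
    "nvidia_low":  dict(chunk_len=6,   left=1, right=7,  fifo_len=188, spkcache_len=188),
    "nvidia_high": dict(chunk_len=340, left=1, right=40, fifo_len=40,  spkcache_len=188),
}

def chunk_input_frames(cfg):
    # 8 mel-spectrogram feature frames per model frame.
    return 8 * (cfg["chunk_len"] + cfg["left"] + cfg["right"])

for name, cfg in CONFIGS.items():
    print(name, chunk_input_frames(cfg))
# → default 112, nvidia_low 112, nvidia_high 3048
```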
## Model Input/Output Shapes
**General**:
| Input | Shape | Description |
| ---------------- | --------------------- | ------------------------ |
| chunk | `[1, 8*(C+L+R), 128]` | Mel spectrogram features |
| chunk_lengths | `[1]` | Actual chunk length |
| spkcache | `[1, S, 512]` | Speaker cache embeddings |
| spkcache_lengths | `[1]` | Actual cache length |
| fifo | `[1, F, 512]` | FIFO queue embeddings |
| fifo_lengths     | `[1]`                 | Actual FIFO length       |

| Output                    | Shape              | Description                           |
| ------------------------- | ------------------ | ------------------------------------- |
| speaker_preds             | `[1, C+L+R+S+F, 4]`   | Speaker probabilities (4 speakers)    |
| chunk_pre_encoder_embs    | `[1, C+L+R, 512]`     | Embeddings for state update           |
| chunk_pre_encoder_lengths | `[1]`                 | Actual embedding count                |
| nest_encoder_embs         | `[1, C+L+R+S+F, 192]` | Embeddings for speaker discrimination |
| nest_encoder_lengths      | `[1]`                 | Actual speaker embedding count        |
Note: `C = chunk_len`, `L = chunk_left_context`, `R = chunk_right_context`, `S = spkcache_len`, `F = fifo_len`.
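The total number of prediction frames per forward pass is `C + L + R + S + F`. A quick check (plain Python, no external API assumed) reproduces the configuration-specific output frame counts:

```python
# Total prediction frames per forward pass: C + L + R + S + F.
def total_frames(chunk_len, left, right, spkcache_len, fifo_len):
    return chunk_len + left + right + spkcache_len + fifo_len

print(total_frames(6, 1, 7, 188, 40))     # default → 242
print(total_frames(6, 1, 7, 188, 188))    # NVIDIA Low → 390
print(total_frames(340, 1, 40, 188, 40))  # NVIDIA High → 609
```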
**Configuration-Specific Shapes**:
| Input | Default | NVIDIA Low | NVIDIA High |
| ---------------- | --------------- | --------------- | ---------------- |
| chunk | `[1, 112, 128]` | `[1, 112, 128]` | `[1, 3048, 128]` |
| chunk_lengths | `[1]` | `[1]` | `[1]` |
| spkcache | `[1, 188, 512]` | `[1, 188, 512]` | `[1, 188, 512]` |
| spkcache_lengths | `[1]` | `[1]` | `[1]` |
| fifo | `[1, 40, 512]` | `[1, 188, 512]` | `[1, 40, 512]` |
| fifo_lengths     | `[1]`           | `[1]`           | `[1]`            |

| Output                    | Default         | NVIDIA Low      | NVIDIA High     |
| ------------------------- | --------------- | --------------- | --------------- |
| speaker_preds             | `[1, 242, 4]`   | `[1, 390, 4]`   | `[1, 609, 4]`   |
| chunk_pre_encoder_embs | `[1, 14, 512]` | `[1, 14, 512]` | `[1, 381, 512]` |
| chunk_pre_encoder_lengths | `[1]` | `[1]` | `[1]` |
| nest_encoder_embs | `[1, 242, 192]` | `[1, 390, 192]` | `[1, 609, 192]` |
| nest_encoder_lengths      | `[1]`           | `[1]`           | `[1]`           |

| Metric        | Default | NVIDIA High |
| ------------- | ------- | ----------- |
| Latency | ~1.12s | ~30.4s |
| RTFx (M4 Max) | ~5.7x | ~125.3x |
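RTFx is the real-time factor: seconds of audio processed per second of wall-clock compute. A one-line sketch (the figures are taken from the table above):

```python
# Real-time factor: audio duration divided by wall-clock processing time.
def rtfx(audio_seconds, wall_seconds):
    return audio_seconds / wall_seconds

# At ~125.3x RTFx, one hour of audio takes roughly:
print(3600 / 125.3)  # ≈ 28.7 seconds
```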
## Usage with FluidAudio (Swift)
```swift
import FluidAudio
// Initialize with default config (auto-downloads from HuggingFace)
let diarizer = SortformerDiarizer(config: .default)
let models = try await SortformerModels.loadFromHuggingFace(config: .default)
diarizer.initialize(models: models)
// Streaming processing
for audioChunk in audioStream {
if let result = try diarizer.processSamples(audioChunk) {
for frame in 0..<result.frameCount {
for speaker in 0..<4 {
let prob = result.getSpeakerPrediction(speaker: speaker, frame: frame)
}
}
}
}
// Or batch processing
let timeline = try diarizer.processComplete(audioSamples)
for (speakerIndex, segments) in timeline.segments.enumerated() {
for segment in segments {
print("Speaker \(speakerIndex): \(segment.startTime)s - \(segment.endTime)s")
}
}
```
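Between chunks, the streaming model carries state in two buffers: new chunk embeddings enter the FIFO, and frames evicted from the FIFO are folded into the longer-term speaker cache. The sketch below illustrates that flow in simplified form (it is not the actual FluidAudio/NeMo implementation, which additionally compresses the cache per speaker):

```python
from collections import deque

# Simplified illustration of the streaming state update (not the real
# implementation): chunk embeddings push into a bounded FIFO; frames
# evicted from the FIFO graduate into the bounded speaker cache.
FIFO_LEN, SPKCACHE_LEN = 40, 188

fifo = deque(maxlen=FIFO_LEN)          # most recent embeddings
spkcache = deque(maxlen=SPKCACHE_LEN)  # longer-term speaker memory

def update_state(chunk_embs):
    for emb in chunk_embs:
        if len(fifo) == FIFO_LEN:
            # Oldest FIFO frame moves into the speaker cache before
            # deque's maxlen drops it.
            spkcache.append(fifo[0])
        fifo.append(emb)

update_state(list(range(14)))  # 14 = C + L + R frames per default chunk
print(len(fifo), len(spkcache))  # → 14 0
```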
## Performance

Benchmarks: https://github.com/FluidInference/FluidAudio/blob/main/Documentation/Benchmarks.md
## Files

**Models**:
- `Sortformer.mlpackage` / `.mlmodelc` - Default config (low latency)
- `SortformerNvidiaLow.mlpackage` / `.mlmodelc` - NVIDIA low-latency config
- `SortformerNvidiaHigh.mlpackage` / `.mlmodelc` - NVIDIA high-latency config

**Scripts**:
- `convert_to_coreml.py` - PyTorch-to-CoreML conversion
- `streaming_inference.py` - Python streaming inference example
- `mic_inference.py` - Real-time microphone demo
## Source

Original model: https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1
## Credits & Acknowledgements

This project would not have been possible without the significant technical contributions of [GradientDescent2718](https://huggingface.co/GradientDescent2718), whose work was instrumental in:
- **Architecture Conversion**: developing the PyTorch-to-CoreML conversion pipeline for the 17-layer Fast-Conformer and 18-layer Transformer heads.
- **Build & Optimization**: engineering the static shape configurations that let the model reach ~120x RTF on Apple Silicon.
- **Logic Implementation**: porting the critical streaming state logic (speaker cache and FIFO management) to keep speaker identity tracking consistent.
This project was built upon the foundational work of the NVIDIA NeMo team. |