Moonshine-Tiny-DE: Fine-tuned German Speech Recognition

Fine-tuned UsefulSensors/moonshine-tiny for German automatic speech recognition.

Model Details

  • Base model: UsefulSensors/moonshine-tiny (27M parameters)
  • Language: German (de)
  • Training data: MLS German — 469,942 samples (~1,967 hours of audiobook speech)
  • WER: 36.7% on MLS German test set (3,394 samples)
  • Training: 10,000 steps, schedule-free AdamW, bf16, effective batch size 64
  • Hardware: Single NVIDIA RTX 5090 (32 GB), ~9.7 hours
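The WER figure above is the standard word error rate: word-level edit distance between reference and hypothesis, divided by the reference length. A minimal pure-Python sketch of the metric (illustrative only; the card does not specify which evaluation library was used):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming edit distance over words
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,          # deletion
                                   d[j - 1] + 1,      # insertion
                                   prev + (r != h))   # substitution
    return d[-1] / len(ref)

# One substituted word out of four -> WER 0.25
print(wer("das ist ein test", "das ist ein fest"))
```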

Usage

from transformers import pipeline

transcriber = pipeline("automatic-speech-recognition", model="dattazigzag/moonshine-tiny-de")
result = transcriber("german_audio.wav")
print(result["text"])

Batch processing

from pathlib import Path

audio_files = Path("./audio").glob("*.wav")
for audio in audio_files:
    result = transcriber(str(audio))
    print(f"{audio.name}: {result['text']}")

With explicit model loading

from transformers import AutoProcessor, MoonshineForConditionalGeneration
import torch

model = MoonshineForConditionalGeneration.from_pretrained("dattazigzag/moonshine-tiny-de")
processor = AutoProcessor.from_pretrained("dattazigzag/moonshine-tiny-de")
model.eval()

# Load 16kHz mono audio; soundfile is one common choice for reading WAV files
import soundfile as sf
audio_array, _ = sf.read("german_audio.wav")

inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=80)
text = processor.tokenizer.decode(generated_ids[0], skip_special_tokens=True)
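Moonshine expects 16 kHz mono input. If your source audio is at a different rate, resample it before calling the processor; a minimal sketch using scipy (an assumption here; any resampler works):

```python
import numpy as np
from scipy.signal import resample_poly

def to_16k(audio: np.ndarray, sr: int) -> np.ndarray:
    """Resample a mono float waveform to the 16 kHz rate Moonshine expects."""
    if sr == 16000:
        return audio
    return resample_poly(audio, up=16000, down=sr)

# 48 kHz input shrinks to a third of its length at 16 kHz
resampled = to_16k(np.zeros(48000, dtype=np.float32), 48000)
```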

Training Details

Approach

This model was not trained from scratch. We fine-tuned the English-only moonshine-tiny model on German data. The pre-trained model already provided audio feature extraction, attention patterns, and a tokenizer; we adapted it to German phonetics and vocabulary.

Configuration

Setting                  Value
Optimizer                schedule-free AdamW
Learning rate            3e-4 (constant after 300-step warmup)
Precision                bf16
Batch size               16 per device × 4 accumulation = 64 effective
Audio duration           4–20 seconds
Gradient checkpointing   Disabled (broken with Moonshine in transformers 4.49)
Curriculum learning      Disabled (simple first run)
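The effective batch size of 64 comes from gradient accumulation: gradients from 4 micro-batches of 16 samples each are summed (with each micro-batch's contribution scaled by 1/4) before a single optimizer step. A toy sketch of the arithmetic, with stand-in floats rather than real tensors:

```python
PER_DEVICE_BATCH = 16
ACCUM_STEPS = 4  # effective batch size = 16 * 4 = 64

def micro_batch_grad(samples):
    """Stand-in for a backward pass: mean 'gradient' over one micro-batch."""
    return sum(samples) / len(samples)

def accumulated_step(all_samples):
    """Accumulate ACCUM_STEPS micro-batch gradients into one optimizer step."""
    assert len(all_samples) == PER_DEVICE_BATCH * ACCUM_STEPS
    grad = 0.0
    for k in range(ACCUM_STEPS):
        chunk = all_samples[k * PER_DEVICE_BATCH:(k + 1) * PER_DEVICE_BATCH]
        # Scale each micro-batch by 1/ACCUM_STEPS so the sum equals
        # the mean gradient over the full effective batch of 64
        grad += micro_batch_grad(chunk) / ACCUM_STEPS
    return grad
```

Because the micro-batches are equal-sized, the accumulated result matches what a single batch of 64 would produce.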

Training curve

Step     Loss    WER
500      2.37    –
1,000    2.04    46.5%
5,000    ~1.65   ~39%
10,000   1.61    36.7%

Error patterns

  • Phonetically similar confusions: b/p, d/t, ck/x (classic German ASR challenges)
  • Compound word splitting errors: "herzaubern" → "herr sauben"
  • Longer sequences degrade more than shorter ones
  • Audiobook speech only — no conversational speech exposure
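Substitution patterns like the b/p and d/t confusions above can be surfaced by aligning reference and hypothesis words with edit-distance backtracking and collecting the substituted pairs. A minimal sketch (illustrative, not the actual analysis script):

```python
def substitutions(ref_words, hyp_words):
    """Return (ref, hyp) word pairs the optimal alignment marks as substitutions."""
    m, n = len(ref_words), len(hyp_words)
    # Full DP table so we can backtrack the alignment afterwards
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1]))
    # Backtrack from the bottom-right corner, recording substituted pairs
    pairs, i, j = [], m, n
    while i > 0 and j > 0:
        if d[i][j] == d[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1]):
            if ref_words[i - 1] != hyp_words[j - 1]:
                pairs.append((ref_words[i - 1], hyp_words[j - 1]))
            i, j = i - 1, j - 1
        elif d[i][j] == d[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return pairs

print(substitutions("ich bin da".split(), "ich pin da".split()))
```

Tallying these pairs over a full test set would reveal systematic confusions such as b/p.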

Limitations

  • Audiobook speech only — trained on MLS (read speech). May underperform on conversational, noisy, or accented German.
  • First training run — WER can likely be improved with curriculum learning, more training steps, or additional data sources (SWC, VoxPopuli, Bundestag).
  • No Common Voice data — Mozilla pulled it from HuggingFace in Oct 2025, so we lack speaker diversity.
  • HuggingFace transformers only — produces safetensors format, not the .ort format for the native moonshine-voice CLI. ONNX conversion is a planned next step.

Fine-tuning toolkit

Trained using a fork of Pierre Chéneau's finetune-moonshine-asr with German-specific adaptations.

Acknowledgments

Citation

@misc{datta2026moonshine-tiny-de,
  author = {Saurabh Datta},
  title = {Moonshine-Tiny-DE: Fine-tuned German Speech Recognition},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/dattazigzag/moonshine-tiny-de}
}