Fine-tuned Arabic Quran ASR model works in NeMo but produces empty or incorrect transcriptions in NVIDIA Riva (streaming & offline)

#1
by NovaCon-AI - opened

Summary

I am working on an Arabic ASR system for Quran recitation (tajweed, formal MSA Arabic).
The model works correctly when used directly in NeMo, but fails after conversion and deployment in NVIDIA Riva, where it produces empty outputs or only very short tokens.

I am trying to determine:
1. Which ASR model architecture is correct for Quran recitation and streaming
2. Whether my chosen model is supported for Riva streaming
3. What exact configuration (model type + riva-build flags) is required
4. Why NeMo inference works but Riva inference does not

My Goal
• Language: Arabic (ar-AR)
• Domain: Quran recitation
• Use case: Streaming ASR (low latency)
• Output: Full verse-level transcription, not partial tokens
• Deployment target: NVIDIA Riva

What Works
• Tokenizer is correct and verified:
  • SentencePiece tokenizer
  • Arabic text round-trips correctly (text → ids → text)
• Model inference in NeMo works correctly:
  • Full Arabic sentences are decoded
  • WER in NeMo training logs is as expected
• Fine-tuning completed successfully
• The .nemo model loads without error
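The round-trip check above can be expressed as a small helper. A real check would call the model's SentencePiece tokenizer's `encode`/`decode`; the toy word-level vocabulary below is a stand-in for illustration only, as is the sample phrase:

```python
# Sketch of the tokenizer round-trip check described above.
# The toy word-level vocab stands in for the real SentencePiece model (assumption).

def round_trip_ok(text, encode, decode):
    """Return True if text survives text -> ids -> text unchanged."""
    return decode(encode(text)) == text

vocab = {}

def encode(text):
    # Assign each unseen word the next free id.
    return [vocab.setdefault(w, len(vocab)) for w in text.split()]

def decode(ids):
    inv = {i: w for w, i in vocab.items()}
    return " ".join(inv[i] for i in ids)

sample = "بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ"  # example Quranic phrase (illustrative)
print(round_trip_ok(sample, encode, decode))  # True if the round-trip is lossless
```

Running the same kind of check against the actual tokenizer shipped inside the `.riva` artifact would confirm that the vocabulary survived conversion.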

What Fails

After converting the same model to Riva and deploying:
• Both streaming and offline Riva pipelines return:
  • empty transcripts, or
  • a single repeated token (e.g. “وَ”)
• No runtime crash
• The Riva server starts successfully
• Models appear loaded, but inference output is unusable
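For triage it helps to classify each Riva response as empty, degenerate (a single repeated token), or plausible. The helper below is my own convention for bucketing responses, not part of any Riva client API:

```python
def classify_transcript(text):
    """Crude triage for the failure modes above: empty, one repeated token, or plausible."""
    tokens = text.split()
    if not tokens:
        return "empty"
    if len(set(tokens)) == 1:
        return "degenerate"  # e.g. "وَ وَ وَ" or a lone "وَ"
    return "plausible"

print(classify_transcript(""))                      # empty
print(classify_transcript("وَ وَ وَ"))                # degenerate
print(classify_transcript("قُلْ هُوَ اللَّهُ أَحَدٌ"))  # plausible
```

Tallying these three buckets over a test set makes it easy to show whether a config change actually moved the needle.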

Model Details
• Model type: EncDecHybridRNNTCTCBPEModel
• Encoder: Conformer / FastConformer
• Decoder: RNNT + CTC
• Tokenizer: SentencePiece (1024 vocab)
• Language: Arabic (ar-AR)

Conversion & Deployment Steps Used

NeMo → Riva

nemo2riva \
  --out Speech_To_Text_Finetuning.riva \
  --max-dim 5000 \
  --max-batch 4 \
  --device cuda \
  Speech_To_Text_Finetuning.nemo
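One thing worth sanity-checking here: if `--max-dim` bounds the exported feature-sequence length in mel frames (my assumption, together with NeMo's default 10 ms feature hop), then 5000 frames caps the input at roughly 50 s of audio, which matters for verse-level recitation:

```python
# Rough capacity check for the --max-dim export bound.
# Assumptions: --max-dim counts mel-feature frames; feature hop is 10 ms (NeMo default).
hop_ms = 10          # assumed feature stride
max_dim = 5000       # value passed to nemo2riva above
max_audio_s = max_dim * hop_ms / 1000
print(max_audio_s)   # 50.0 -> utterances longer than ~50 s would exceed the bound
```

If long verses are expected, this arithmetic is worth redoing with the model's actual feature stride before blaming the decoder.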

Riva Build (Streaming)

riva-build speech_recognition \
  asr_streaming.rmir \
  Speech_To_Text_Finetuning.riva \
  --streaming=true \
  --decoder_type=greedy \
  --ms_per_timestep=40 \
  --chunk_size=4.8 \
  --left_padding_size=1.6 \
  --right_padding_size=1.6 \
  --max_batch_size=4 \
  --featurizer.use_utterance_norm_params=False \
  --featurizer.precalc_norm_time_steps=0 \
  --featurizer.precalc_norm_params=False \
  --language_code=ar-AR
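The streaming geometry implied by these flags can be sanity-checked with plain arithmetic: with `--ms_per_timestep=40`, the chunk and padding durations should each map to a whole number of encoder timesteps (whether a non-integer value would fail loudly or silently is exactly the kind of thing I cannot verify, hence the check):

```python
# Check that the riva-build chunk/padding values divide evenly into encoder timesteps.
ms_per_timestep = 40
chunk_s, left_pad_s, right_pad_s = 4.8, 1.6, 1.6  # values from the riva-build call above

def to_timesteps(seconds):
    ms = round(seconds * 1000)  # avoid float dust before the divisibility check
    assert ms % ms_per_timestep == 0, f"{seconds}s is not a whole number of timesteps"
    return ms // ms_per_timestep

chunk = to_timesteps(chunk_s)
window = chunk + to_timesteps(left_pad_s) + to_timesteps(right_pad_s)
print(chunk, window)  # 120 200 -> 8.0 s of total context per decoded chunk
```

Here everything divides cleanly (120 chunk timesteps, 200 in the full window), so a chunk/timestep mismatch is unlikely to be the culprit with these particular values.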

Same issue occurs with offline pipeline.

Known Related Threads (Symptoms Match Exactly)
• “Finetuned ASR conformer returns only empty transcripts”
• “Issue Deploying Fine-Tuned Arabic Conformer Model in NVIDIA Riva”
• “Riva providing empty transcriptions but NeMo does not”
• “Known issue with conformer models – try --nn.use_trt_fp32”
• FastConformer RNNT models reported as not officially supported for Riva streaming

Key Observations
1. NeMo works, Riva does not
2. Empty or near-empty output is a known Riva failure mode
3. Multiple threads suggest:
• Conformer / FastConformer RNNT streaming is fragile or unsupported
• TRT FP16 causes silent decoding failures
4. Canary models are offline-only
5. Parakeet models are designed for streaming but have limited Arabic coverage
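Observations 2 and 3 fit one mechanism: if the converted acoustic model's logits degenerate (e.g. under TRT FP16) so that the blank token wins at every frame, greedy CTC decoding collapses to an empty string with no error anywhere. A minimal greedy CTC decode over toy per-frame argmaxes (vocabulary and frames are illustrative assumptions) shows why:

```python
# Minimal greedy CTC decode: take argmax per frame, collapse repeats, drop blanks.
BLANK = 0
vocab = {1: "وَ", 2: "قُلْ"}  # toy vocabulary, illustration only

def ctc_greedy_decode(frame_argmax):
    out, prev = [], BLANK
    for tok in frame_argmax:
        if tok != BLANK and tok != prev:
            out.append(vocab[tok])
        prev = tok
    return " ".join(out)

healthy = [0, 2, 2, 0, 1, 1, 0]  # model emits real tokens between blanks
broken  = [0, 0, 0, 0, 0, 0, 0]  # blank wins every frame (matches the observed symptom)
print(repr(ctc_greedy_decode(healthy)))  # 'قُلْ وَ'
print(repr(ctc_greedy_decode(broken)))   # '' -> empty transcript, no crash
```

This is why a forced-FP32 build (`--nn.use_trt_fp32`) is a natural first experiment: it tests whether precision, rather than the decoder configuration, is flipping the argmax to blank.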

Questions (Core of This Issue)
1. Which ASR model architecture is officially supported for Arabic streaming ASR in Riva?
• Conformer-CTC?
• Citrinet?
• Parakeet?
• Something else?
2. Is EncDecHybridRNNTCTCBPEModel supported for streaming in Riva?
• If not, what is the recommended alternative?
3. Is Quran recitation a valid use case for Riva streaming ASR, or is offline decoding required?
4. Which riva-build flags are mandatory to avoid empty outputs?
• --nn.use_trt_fp32
• disabling VAD?
• different chunk/padding constraints?
5. Is there an official reference pipeline for Arabic ASR deployment in Riva?

Environment
• OS: Ubuntu 22.04
• GPU: RTX 3060 (6GB)
• CUDA: 12.x
• NeMo: recent version
• Riva: 2.x
• Audio: 16kHz mono WAV
• Language code: ar-AR
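Since the pipeline assumes 16 kHz mono input, a quick stdlib check rules out a sample-rate or channel mismatch as the source of empty transcripts (the file path in the usage line is a placeholder):

```python
import wave

def check_wav(path, want_rate=16000, want_channels=1):
    """Return (rate, channels, ok) where ok means the file matches the pipeline's expectation."""
    with wave.open(path, "rb") as w:
        rate, channels = w.getframerate(), w.getnchannels()
    return rate, channels, (rate == want_rate and channels == want_channels)

# Usage (placeholder path): print(check_wav("verse_001.wav"))
```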

What I Am Looking For
• A clear recommendation:
  • Correct model
  • Correct decoding mode (streaming vs offline)
  • Correct Riva configuration
• Confirmation whether my current approach is fundamentally incompatible
• A known-good Arabic Riva ASR deployment example

Thank you for your time.
I am happy to provide logs, configs, or a minimal reproduction if needed.
