---
title: Demo Voice Agent Data Eyond
emoji: 🌍
colorFrom: pink
colorTo: pink
sdk: docker
pinned: true
---

# Voice Agent Service

Real-time voice agent backend with WebSocket-based STT (Deepgram) and TTS (Cartesia). It receives an audio stream from the client, detects the wake word, then streams synthesized speech back.

**Current version: Phase 1 (Echo Mode).** The text after the wake word is echoed directly through TTS. Phase 2 (LLM + RAG) is planned but not yet implemented.

## Requirements

- Python 3.11+
- [uv](https://docs.astral.sh/uv/getting-started/installation/)
- Deepgram API key
- Cartesia API key + Voice ID

## Setup

**1. Clone & install dependencies**

```bash
uv sync
```

**2. Configure environment**

```bash
cp .env.example .env
```

Edit `.env` and fill in the API keys:

```env
DEEPGRAM_API_KEY=your_key
CARTESIA_API_KEY=your_key
CARTESIA_VOICE_ID=your_voice_id
```

**Optional configuration:**

```env
CARTESIA_MODEL=sonic-3           # Default: sonic-3
DEEPGRAM_LANGUAGE=id             # Default: id (Indonesian)
DEEPGRAM_ENDPOINTING_MS=300      # Default: 300ms
DEEPGRAM_UTTERANCE_END_MS=2000   # Default: 2000ms
SAMPLE_RATE=16000                # Default: 16000 Hz
WAKE_WORD=Hai EMA                # Default: "Hai EMA"
```

## Run

```bash
uv run uvicorn main:app --host 0.0.0.0 --port 7861
# or, with auto-reload:
uv run uvicorn main:app --host 0.0.0.0 --port 7861 --reload
```

The server runs at `http://localhost:7861`.

## Test

**Health check:**

```bash
curl http://localhost:7861/health
```

Expected response:

```json
{
  "status": "ok",
  "version": "1.1.0",
  "stt_ready": true,
  "tts_ready": true
}
```

If any API key is missing, the endpoint returns status `degraded` with HTTP 503.

**WebSocket test (send WAV audio, receive the TTS response):**

```bash
uv run python test_client.py --test audio --wav path/to/audio.wav --save-tts output.wav
```

> The WAV file must be **16kHz, 16-bit, mono PCM**.

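
Since the WebSocket test expects 16kHz, 16-bit, mono PCM input, it can help to verify a file before sending it. The following is a minimal sketch using Python's standard `wave` module; the helper name `check_wav_format` is hypothetical and is not part of this repository.

```python
import wave


def check_wav_format(path: str) -> list[str]:
    """Return a list of format problems; an empty list means the file matches.

    Hypothetical helper (not part of this repo). It checks the three
    properties named in the note above: 16 kHz, 16-bit, mono.
    The wave module itself only opens uncompressed PCM files and raises
    wave.Error for anything else.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != 16000:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected 16000")
        if wav.getsampwidth() != 2:  # sample width in bytes: 2 bytes = 16 bits
            problems.append(f"sample width is {wav.getsampwidth() * 8}-bit, expected 16-bit")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected mono")
    return problems
```

Calling `check_wav_format("path/to/audio.wav")` returns an empty list when the file matches the required format, and one message per mismatch otherwise.
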

**Specific tests:**

```bash
uv run python test_client.py --test health     # Health check
uv run python test_client.py --test ping       # Heartbeat ping/pong
uv run python test_client.py --test interrupt  # Cancel ongoing TTS
uv run python test_client.py --test stop       # Graceful disconnect
```

**Connectivity check (no audio file needed):**

```bash
uv run python test_client.py
```

**Convert M4A audio to WAV:**

```bash
uv run python convert_audio.py                    # Convert every file in playground/mp4/
uv run python convert_audio.py path/to/file.m4a   # Convert a single file
```

## Docker

**Build:**

```bash
docker build -t voice-agent .
```

**Run:**

```bash
docker run -p 7861:7861 --env-file .env voice-agent
```

## Wake Word

Default wake word: **"Hai EMA"** (Indonesian, case-insensitive).

Example: say _"Hai EMA, apa kabar?"_ and the agent replies via TTS with _"apa kabar"_.

The wake word can be changed via the `WAKE_WORD` environment variable.

## Architecture

### Current flow (Phase 1: Echo)

```
Client Audio Stream (PCM 16kHz 16-bit mono)
    ↓
Deepgram STT (nova-2, real-time streaming)
    ↓
Wake Word Detection
    ↓
Echo Response
    ↓
Cartesia TTS (streaming chunks)
    ↓
Client Audio Playback
```

### Planned flow (Phase 2: LLM + RAG)

```
Client Audio Stream
    ↓
Deepgram STT
    ↓
Wake Word Detection
    ↓
PDF Knowledge Base Retrieval (not yet implemented)
    ↓
LLM Answer Generation (not yet implemented)
    ↓
Cartesia TTS
    ↓
Client Audio Playback
```

## WebSocket Protocol

**Endpoint:** `ws://localhost:7861/ws/voice`

**Client → Server:**

| Type | Format | Description |
|------|--------|-------------|
| Binary | PCM audio chunk | Audio 16kHz, 16-bit, mono |
| Text | `{"action": "ping"}` | Heartbeat keep-alive |
| Text | `{"action": "stop"}` | Graceful disconnect |
| Text | `{"action": "interrupt"}` | Cancel ongoing TTS |

**Server → Client:**

| Type | Format | Description |
|------|--------|-------------|
| Binary | PCM audio chunk | TTS response audio |
| Text | `{"event": "transcript", "text": "..."}` | STT result |
| Text | `{"event": "reply", "text": "..."}` | Text after the wake word |
| Text | `{"event": "tts_end"}` | TTS finished |
| Text | `{"event": "interrupted"}` | TTS cancelled |
| Text | `{"event": "pong"}` | Reply to ping |
| Text | `{"event": "error", "code": "...", "message": "..."}` | Error |

See [API_CONTRACT.md](API_CONTRACT.md) for the full WebSocket protocol documentation.

## Project Structure

```
├── src/
│   ├── config.py                 # Configuration & environment variables
│   ├── pipeline.py               # Core voice pipeline (STT → Wake Word → TTS)
│   ├── stt/
│   │   ├── deepgram_client.py    # Deepgram real-time STT (active)
│   │   └── assemblyai_client.py  # AssemblyAI STT (alternative, unused)
│   ├── tts/
│   │   └── cartesia_client.py    # Cartesia TTS streaming
│   ├── llm/
│   │   └── answerer.py           # LLM answer generation (Phase 2, not yet implemented)
│   └── knowledge/
│       └── loader.py             # PDF loader & RAG (Phase 2, not yet implemented)
├── main.py                       # FastAPI entry point & WebSocket handler
├── test_client.py                # Test client
├── convert_audio.py              # M4A → WAV converter
├── playground/                   # Audio samples and TTS output
├── Dockerfile
├── .env.example
└── API_CONTRACT.md
```