---
title: Sage
emoji: 🦉
colorFrom: blue
colorTo: yellow
sdk: docker
app_port: 7860
---

# Sage

A recommendation system that refuses to hallucinate.

```json
{
  "query": "budget bluetooth headphones",
  "recommendations": [{
    "explanation": "Reviewers say \"For $18 Bluetooth headphones there is no better pair\" [review_141313]...",
    "confidence": {"hhem_score": 0.78, "is_grounded": true},
    "citations_verified": true
  }]
}
```

**Try it:** [vxa8502-sage.hf.space](https://vxa8502-sage.hf.space) (API explorer with Swagger UI)

---

## The Problem

Product recommendations without explanations are black boxes. Users see "You might like X" but never learn *why*. When you ask an LLM to explain, it confidently invents features and fabricates reviews.

**Sage is different:** every claim is a verified quote from real customer reviews. When evidence is sparse, it refuses rather than guesses. Human evaluation scored trust at **4.0/5** because honesty beats confident fabrication.

---

## Results

| Metric | Target | Achieved | Status |
|--------|--------|----------|--------|
| NDCG@10 (recommendation quality) | > 0.30 | 0.487 | Pass |
| Claim-level faithfulness (HHEM) | > 0.85 | 0.968 | Pass |
| Human evaluation (n=50) | > 3.5/5 | 3.6/5 | Pass |
| P99 latency (production) | < 500ms | 283ms | Pass |
| Median latency (cached) | < 100ms | 88ms | Pass |

**Grounding impact:** explanations generated WITH evidence score 73% on HHEM; WITHOUT evidence, 2.6%. RAG grounding reduces hallucination by roughly 70 percentage points.

---

## Architecture

```
User Query: "wireless earbuds for running"
                │
                ▼
┌─────────────────────────────────────────────────────────────┐
│                     SAGE API (FastAPI)                       │
├─────────────────────────────────────────────────────────────┤
│ 1. EMBED        │ E5-small (384-dim)                 ~20ms  │
│ 2. CACHE CHECK  │ Exact + semantic (0.92 sim)         ~1ms  │
│ 3. RETRIEVE     │ Qdrant vector search               ~50ms  │
│ 4. AGGREGATE    │ Chunk → Product (MAX score)         ~1ms  │
│ 5. EXPLAIN      │ Claude/GPT + evidence             ~300ms  │
│ 6. VERIFY       │ HHEM hallucination check           ~50ms  │
└─────────────────────────────────────────────────────────────┘
                │
                ▼
┌─────────────────────────────────────────────────────────────┐
│ Response:                                                    │
│ - Product ID + score                                         │
│ - Explanation with [citations]                               │
│ - HHEM confidence score                                      │
│ - Quote verification results                                 │
└─────────────────────────────────────────────────────────────┘
```

**Data flow:** 1M Amazon reviews → 5-core filter → 334K reviews → semantic chunking → 423K chunks in Qdrant.
*([pipeline.py](scripts/pipeline.py) | [Kaggle notebook](scripts/kaggle_pipeline.ipynb))*

---

## Why This Architecture?

The key insight: **hallucination happens when evidence is weak, not when the model is bad.** Give an LLM one short review as context and it fills the gaps with plausible-sounding fabrications. The solution is to refuse to explain when evidence is insufficient (a sketch of that gate follows the table below).

| Decision | Alternative | Why This Choice |
|----------|-------------|-----------------|
| **E5-small** (384-dim) | E5-large, BGE-large | Faster inference, same accuracy on product reviews. Latency > marginal gains. |
| **Qdrant** | Pinecone, Weaviate | Free cloud tier, payload filtering, clean Python SDK. |
| **Semantic chunking** | Fixed-window | Preserves complete arguments; better quote verification. |
| **HHEM** (Vectara) | GPT-4 judge, NLI models | Purpose-built for RAG hallucination; no API cost. |
| **Claim-level evaluation** | Full-explanation | Isolates which claims hallucinate; more actionable. |
| **Quality gate** (refuse) | Always answer | 64% refusal rate → 4.0/5 trust. Honesty > coverage. |
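
For illustration, here is a minimal sketch of what such a quality gate can look like in Python. The thresholds (≥ 2 evidence chunks, top retrieval score ≥ 0.7, HHEM ≥ 0.5) are the ones documented in the Failure Modes table below; the function and class names are hypothetical, not Sage's actual internals.

```python
from dataclasses import dataclass

# Thresholds taken from the Failure Modes table; everything else is illustrative.
MIN_EVIDENCE_CHUNKS = 2
MIN_TOP_SCORE = 0.7
MIN_HHEM_SCORE = 0.5


@dataclass
class Evidence:
    chunk_id: str
    text: str
    score: float  # retrieval similarity for this chunk


def should_explain(evidence: list[Evidence]) -> tuple[bool, str]:
    """Pre-generation gate: refuse rather than explain from weak evidence."""
    if len(evidence) < MIN_EVIDENCE_CHUNKS:
        return False, "insufficient_evidence"
    if max(e.score for e in evidence) < MIN_TOP_SCORE:
        return False, "low_relevance"
    return True, "ok"


def grade_explanation(hhem_score: float) -> str:
    """Post-generation check: flag explanations that HHEM scores as unsupported."""
    return "grounded" if hhem_score >= MIN_HHEM_SCORE else "uncertain"
```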
---

## Known Limitations

| Limitation | Impact | Mitigation |
|------------|--------|------------|
| **Single category** (Electronics) | Can't recommend across categories | Architecture supports multi-category; data constraint only |
| **No image features** | Misses visual product attributes | Could add CLIP embeddings in future |
| **English only** | Non-English reviews have lower retrieval quality | E5 is primarily English-trained |
| **Cache invalidation manual** | Stale explanations possible | TTL-based expiry (1 hour); manual `/cache/clear` |
| **Cold start latency** | First request ~10s (HF Spaces wake) | Subsequent requests P99 < 500ms; cache hits 88ms |
| **No user personalization** | Same results for all users | Would need user history for collaborative filtering |

---

## Quick Start

### Docker (recommended)

```bash
git clone https://github.com/vxa8502/sage-recommendations
cd sage-recommendations
cp .env.example .env
# Edit .env: add ANTHROPIC_API_KEY (or OPENAI_API_KEY)
docker compose up
curl http://localhost:8000/health
```

### Local Development

```bash
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,pipeline,api,anthropic]"
cp .env.example .env   # Edit .env: add API keys
make qdrant-up         # Start local Qdrant
make data              # Load data (or use Qdrant Cloud)
make serve             # Start API at localhost:8000
```

### Environment Variables

```bash
# Required (one of)
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
LLM_PROVIDER=anthropic   # or "openai"

# Optional: Qdrant Cloud (instead of local)
QDRANT_URL=https://xxx.cloud.qdrant.io
QDRANT_API_KEY=...
```

---

## API Reference

### POST /recommend

```bash
curl -X POST https://vxa8502-sage.hf.space/recommend \
  -H "Content-Type: application/json" \
  -d '{"query": "wireless earbuds for running", "k": 3, "explain": true}'
```

Returns ranked products with:

- Explanation grounded in customer reviews
- HHEM confidence score (0-1)
- Quote verification results
- Evidence chunks with citations

### POST /recommend/stream

Server-sent events for token-by-token explanation streaming.

### GET /health

```json
{"status": "healthy", "qdrant_connected": true, "llm_reachable": true}
```

### GET /metrics

Prometheus metrics including `sage_requests_total`, `sage_request_latency_seconds`, `sage_cache_events_total`, `sage_errors_total`.

### GET /cache/stats

```json
{
  "size": 42,
  "max_entries": 1000,
  "exact_hits": 10,
  "semantic_hits": 5,
  "misses": 27,
  "evictions": 0,
  "hit_rate": 0.35,
  "ttl_seconds": 3600.0,
  "similarity_threshold": 0.92,
  "avg_semantic_similarity": 0.95
}
```
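For Python callers, a minimal client for `POST /recommend` might look like the sketch below. It assumes the `requests` package and the field names shown in the sample response at the top of this README (`recommendations`, `explanation`, `confidence.hhem_score`, `confidence.is_grounded`); check the Swagger UI for the authoritative schema.

```python
import requests

BASE_URL = "https://vxa8502-sage.hf.space"  # or http://localhost:8000 for local dev

resp = requests.post(
    f"{BASE_URL}/recommend",
    json={"query": "wireless earbuds for running", "k": 3, "explain": True},
    timeout=30,  # first request on HF Spaces may hit the ~10s cold start
)
resp.raise_for_status()

for rec in resp.json()["recommendations"]:
    conf = rec["confidence"]
    print(f"grounded={conf['is_grounded']}  hhem={conf['hhem_score']:.2f}")
    print(f"  {rec['explanation']}")
```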
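The `exact_hits` / `semantic_hits` split in the cache stats reflects a two-level lookup: an exact match on the query string first, then a cosine-similarity match against cached query embeddings (threshold 0.92, 1-hour TTL, per the stats fields). The sketch below shows the idea only; the class and method names are hypothetical, and eviction by `max_entries` is omitted.

```python
import time
import numpy as np

SIMILARITY_THRESHOLD = 0.92  # mirrors similarity_threshold in /cache/stats
TTL_SECONDS = 3600.0         # mirrors ttl_seconds in /cache/stats


class SemanticCache:
    """Exact-then-semantic query cache (illustrative; no max_entries eviction)."""

    def __init__(self) -> None:
        self._entries: dict[str, tuple[np.ndarray, dict, float]] = {}

    def get(self, query: str, embedding: np.ndarray) -> dict | None:
        now = time.time()
        # Drop expired entries first.
        self._entries = {q: v for q, v in self._entries.items()
                         if now - v[2] < TTL_SECONDS}
        # 1. Exact hit on the raw query string.
        if query in self._entries:
            return self._entries[query][1]
        # 2. Semantic hit: cosine similarity against cached query embeddings.
        for cached_emb, response, _ in self._entries.values():
            sim = float(np.dot(embedding, cached_emb)
                        / (np.linalg.norm(embedding) * np.linalg.norm(cached_emb)))
            if sim >= SIMILARITY_THRESHOLD:
                return response
        return None

    def put(self, query: str, embedding: np.ndarray, response: dict) -> None:
        self._entries[query] = (embedding, response, time.time())
```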
---

## Evaluation

```bash
make eval       # ~5 min: standard pre-commit
make eval-full  # ~17 min: complete automated suite + load test
```

See `make help` for all targets (including `eval-quick`, `load-test`).

---

## Project Structure (Key Directories)

```
sage/
├── adapters/         # External integrations (Qdrant, LLM, HHEM)
├── api/              # FastAPI routes, middleware, Prometheus metrics
├── config/           # Settings, logging, query templates
├── core/             # Domain models, aggregation, verification, chunking
├── services/         # Business logic (retrieval, explanation, cache)
scripts/
├── pipeline.py       # Data ingestion and embedding
├── evaluation.py     # NDCG, precision, recall, novelty, baselines
├── faithfulness.py   # HHEM, RAGAS, grounding delta
├── human_eval.py     # Interactive human evaluation
├── load_test.py      # P99 latency benchmarking
```

---

## Failure Modes (By Design)

| Condition | System Behavior |
|-----------|-----------------|
| Insufficient evidence (< 2 chunks) | Refuses to explain |
| Low relevance (top score < 0.7) | Refuses to explain |
| Quote not found in evidence | Falls back to paraphrased claims |
| HHEM score < 0.5 | Flags as uncertain |

The system refuses to hallucinate rather than confidently stating unsupported claims.

---

## License

Academic/portfolio use only. Uses Amazon Reviews 2023 dataset.