Spaces:
Sleeping
Sleeping
feat: advanced RAG architecture with SFT data pipeline
Browse filesThis view is limited to 50 files because it contains too many changes. See raw diff
- .gitignore +4 -0
- CHANGELOG.md +44 -14
- Dockerfile +3 -4
- Makefile +12 -7
- README.md +70 -24
- data/sft/raw_generated.jsonl +5 -0
- data/user_profiles.json +13 -2
- data/users.json +27 -0
- docker-compose.yml +0 -14
- docs/TECHNICAL_REPORT.md +240 -0
- docs/archived/DEPLOYMENT.md +108 -0
- docs/archived/PHASE_2_DEVELOPMENT.md +509 -0
- docs/archived/REVIEW_HIGHLIGHTS.md +142 -0
- docs/archived/TAGS_AND_EMOTIONS.md +233 -0
- docs/archived/interview_prep_v1.md +173 -0
- docs/future_roadmap.md +70 -0
- docs/interview_deep_dive.md +82 -0
- docs/project_narrative.md +58 -0
- docs/rag_architecture.md +86 -0
- docs/technical_deep_dive_sota.md +197 -0
- environment.yml +41 -0
- experiments/baseline_report.md +28 -0
- experiments/hybrid_report.md +28 -0
- experiments/rerank_report.md +25 -0
- experiments/router_report.md +27 -0
- experiments/temporal_report.md +25 -0
- scripts/add_isbn13_to_books_data.py +16 -0
- scripts/add_isbn_to_books_data.py +21 -0
- scripts/benchmark_compressor.py +35 -0
- scripts/benchmark_hybrid.py +83 -0
- scripts/benchmark_rerank.py +82 -0
- scripts/benchmark_retrieval.py +82 -0
- scripts/benchmark_router.py +99 -0
- scripts/benchmark_temporal.py +44 -0
- scripts/build_books_basic_info.py +48 -0
- scripts/chunk_reviews.py +103 -0
- scripts/init_dual_index.py +71 -0
- scripts/test_rag.py +35 -0
- scripts/verify_env.py +61 -0
- src/api/chat.py +50 -0
- src/config.py +3 -1
- src/core/context_compressor.py +89 -0
- src/core/llm.py +78 -0
- src/core/reranker.py +104 -0
- src/core/router.py +86 -0
- src/core/temporal.py +106 -0
- src/cover_fetcher.py +10 -5
- src/data_factory/__init__.py +4 -0
- src/data_factory/generator.py +240 -0
- src/etl.py +2 -2
.gitignore
CHANGED
|
@@ -74,3 +74,7 @@ data/chroma_db/
|
|
| 74 |
|
| 75 |
web/node_modules/
|
| 76 |
data/books_processed.csv
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 74 |
|
| 75 |
web/node_modules/
|
| 76 |
data/books_processed.csv
|
| 77 |
+
|
| 78 |
+
# Large data files (rebuild with scripts/init_dual_index.py)
|
| 79 |
+
data/chroma_chunks/
|
| 80 |
+
data/review_chunks.jsonl
|
CHANGELOG.md
CHANGED
|
@@ -4,7 +4,38 @@ All notable changes to this project will be documented in this file.
|
|
| 4 |
|
| 5 |
## [Unreleased]
|
| 6 |
|
| 7 |
-
### Added -
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 8 |
- **Real-time Book Cover Fetching**: New `src/cover_fetcher.py` module that fetches book covers dynamically from Google Books API and Open Library
|
| 9 |
- LRU cache (1000 items) to avoid redundant API calls
|
| 10 |
- Automatic fallback to Open Library if Google Books fails
|
|
@@ -12,25 +43,24 @@ All notable changes to this project will be documented in this file.
|
|
| 12 |
- ~0.5-1s latency increase per recommendation query (10-20 books)
|
| 13 |
- **Client-Server Architecture**: Separated UI and API into independent processes
|
| 14 |
- API server runs on port 6006 (FastAPI backend)
|
| 15 |
-
-
|
| 16 |
- Enables better scalability and deployment flexibility
|
| 17 |
|
| 18 |
-
### Changed -
|
| 19 |
-
- **
|
| 20 |
-
-
|
| 21 |
-
-
|
| 22 |
-
-
|
| 23 |
-
- Fixed payload format to match API schema (query, category, tone)
|
| 24 |
- **Makefile**: Updated `run` command to explicitly use port 6006 for API server
|
| 25 |
- **src/recommender.py**: Integrated real-time cover fetcher in `_format_results()`
|
| 26 |
- Replaced hardcoded file paths with dynamic API calls
|
| 27 |
- Each recommendation now fetches fresh cover URLs
|
|
|
|
| 28 |
|
| 29 |
-
### Fixed -
|
| 30 |
- Port mismatch between API (8000) and UI (expected 6006)
|
| 31 |
-
-
|
| 32 |
-
-
|
| 33 |
-
- Response structure mismatch (direct list vs {recommendations: []} object)
|
| 34 |
|
| 35 |
### Added
|
| 36 |
- **Super App Architecture**: Transformed into "End-to-End AI E-Commerce Platform" with 3-tab UI.
|
|
@@ -52,8 +82,8 @@ All notable changes to this project will be documented in this file.
|
|
| 52 |
- Updated README with project structure section
|
| 53 |
|
| 54 |
### Fixed
|
| 55 |
-
-
|
| 56 |
-
- Dockerfile startup command (
|
| 57 |
|
| 58 |
---
|
| 59 |
|
|
|
|
| 4 |
|
| 5 |
## [Unreleased]
|
| 6 |
|
| 7 |
+
### Added - 2024-01-07
|
| 8 |
+
- **UI Refinements**: Book detail modal layout improvements
|
| 9 |
+
- Author name displayed separately below book cover
|
| 10 |
+
- Optimized spacing between elements (reduced excessive whitespace)
|
| 11 |
+
- Removed mood/emotion display from detail modal for cleaner interface
|
| 12 |
+
- Review highlights positioned directly after AI highlight box
|
| 13 |
+
- **Summary Quality**: Smarter sentence-based summaries with HTML entity cleanup
|
| 14 |
+
- Prefer Google Books description when available
|
| 15 |
+
- Fallback to dataset description with HTML unescape and sentence truncation
|
| 16 |
+
|
| 17 |
+
### Added - 2024-01-XX
|
| 18 |
+
- **Review Highlights Feature**: Semantic sentence extraction with clustering
|
| 19 |
+
- scripts/extract_review_sentences.py for processing book descriptions
|
| 20 |
+
- Review highlights display in React frontend
|
| 21 |
+
- Average rating display in book detail modal
|
| 22 |
+
- REVIEW_HIGHLIGHTS.md documentation
|
| 23 |
+
|
| 24 |
+
### Changed - 2024-01-XX
|
| 25 |
+
- **Frontend Migration**: Moved from dual UI (Gradio + React) to React-only
|
| 26 |
+
- Updated README.md with React frontend setup instructions
|
| 27 |
+
- Updated Dockerfile to run FastAPI backend (port 8000)
|
| 28 |
+
- Updated docker-compose.yml to remove Gradio service
|
| 29 |
+
- Cleaned up documentation references to Gradio
|
| 30 |
+
|
| 31 |
+
### Removed - 2024-01-XX
|
| 32 |
+
- app.py (264-line Gradio legacy UI)
|
| 33 |
+
- Makefile run-ui target
|
| 34 |
+
- docker-compose.yml ui service definition
|
| 35 |
+
|
| 36 |
+
---
|
| 37 |
+
|
| 38 |
+
### Added - 2024-01-06
|
| 39 |
- **Real-time Book Cover Fetching**: New `src/cover_fetcher.py` module that fetches book covers dynamically from Google Books API and Open Library
|
| 40 |
- LRU cache (1000 items) to avoid redundant API calls
|
| 41 |
- Automatic fallback to Open Library if Google Books fails
|
|
|
|
| 43 |
- ~0.5-1s latency increase per recommendation query (10-20 books)
|
| 44 |
- **Client-Server Architecture**: Separated UI and API into independent processes
|
| 45 |
- API server runs on port 6006 (FastAPI backend)
|
| 46 |
+
- React frontend runs on port 5173 (development)
|
| 47 |
- Enables better scalability and deployment flexibility
|
| 48 |
|
| 49 |
+
### Changed - 2024-01-06
|
| 50 |
+
- **React Frontend (web/)**: Created modern UI with book search and recommendations
|
| 51 |
+
- React 18 + Vite for fast development
|
| 52 |
+
- Tailwind CSS for styling
|
| 53 |
+
- Book detail modal with review highlights
|
|
|
|
| 54 |
- **Makefile**: Updated `run` command to explicitly use port 6006 for API server
|
| 55 |
- **src/recommender.py**: Integrated real-time cover fetcher in `_format_results()`
|
| 56 |
- Replaced hardcoded file paths with dynamic API calls
|
| 57 |
- Each recommendation now fetches fresh cover URLs
|
| 58 |
+
- Added review_highlights and average_rating fields
|
| 59 |
|
| 60 |
+
### Fixed - 2024-01-06
|
| 61 |
- Port mismatch between API (8000) and UI (expected 6006)
|
| 62 |
+
- API validation errors due to payload field name mismatch
|
| 63 |
+
- Response structure improvements for frontend integration
|
|
|
|
| 64 |
|
| 65 |
### Added
|
| 66 |
- **Super App Architecture**: Transformed into "End-to-End AI E-Commerce Platform" with 3-tab UI.
|
|
|
|
| 82 |
- Updated README with project structure section
|
| 83 |
|
| 84 |
### Fixed
|
| 85 |
+
- React 18 compatibility issues
|
| 86 |
+
- Dockerfile startup command (updated to run FastAPI backend)
|
| 87 |
|
| 88 |
---
|
| 89 |
|
Dockerfile
CHANGED
|
@@ -19,9 +19,8 @@ COPY . .
|
|
| 19 |
ENV PYTHONUNBUFFERED=1
|
| 20 |
ENV PYTHONPATH=/app
|
| 21 |
|
| 22 |
-
# Expose
|
| 23 |
EXPOSE 8000
|
| 24 |
-
EXPOSE 7860
|
| 25 |
|
| 26 |
-
# Default command: Run
|
| 27 |
-
CMD ["
|
|
|
|
| 19 |
ENV PYTHONUNBUFFERED=1
|
| 20 |
ENV PYTHONPATH=/app
|
| 21 |
|
| 22 |
+
# Expose port for API
|
| 23 |
EXPOSE 8000
|
|
|
|
| 24 |
|
| 25 |
+
# Default command: Run FastAPI backend
|
| 26 |
+
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]
|
Makefile
CHANGED
|
@@ -1,21 +1,26 @@
|
|
| 1 |
-
|
|
|
|
|
|
|
| 2 |
|
| 3 |
-
|
| 4 |
-
|
| 5 |
|
|
|
|
| 6 |
run:
|
| 7 |
uvicorn src.main:app --reload --port 6006
|
| 8 |
|
| 9 |
-
|
| 10 |
-
python app.py
|
| 11 |
-
|
| 12 |
test:
|
| 13 |
pytest tests/
|
| 14 |
|
| 15 |
lint:
|
| 16 |
-
pip install ruff
|
| 17 |
ruff check src/
|
| 18 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 19 |
docker-build:
|
| 20 |
docker-compose build
|
| 21 |
|
|
|
|
| 1 |
+
# Environment
|
| 2 |
+
env-create:
|
| 3 |
+
conda env create -f environment.yml
|
| 4 |
|
| 5 |
+
env-update:
|
| 6 |
+
conda env update -f environment.yml --prune
|
| 7 |
|
| 8 |
+
# Development
|
| 9 |
run:
|
| 10 |
uvicorn src.main:app --reload --port 6006
|
| 11 |
|
| 12 |
+
# Quality
|
|
|
|
|
|
|
| 13 |
test:
|
| 14 |
pytest tests/
|
| 15 |
|
| 16 |
lint:
|
|
|
|
| 17 |
ruff check src/
|
| 18 |
|
| 19 |
+
clean:
|
| 20 |
+
find . -type d -name "__pycache__" -exec rm -rf {} +
|
| 21 |
+
find . -type f -name "*.pyc" -delete
|
| 22 |
+
|
| 23 |
+
# Docker
|
| 24 |
docker-build:
|
| 25 |
docker-compose build
|
| 26 |
|
README.md
CHANGED
|
@@ -2,7 +2,7 @@
|
|
| 2 |
license: mit
|
| 3 |
title: Semantic-Based Book Recommendation Framework
|
| 4 |
sdk: docker
|
| 5 |
-
app_port:
|
| 6 |
---
|
| 7 |
|
| 8 |
# Semantic-Based Book Recommendation Framework using Large Language Model Embeddings
|
|
@@ -23,10 +23,12 @@ The implementation follows a modular pipeline consisting of Data Preprocessing,
|
|
| 23 |
The dataset consists of 7,000+ books with metadata including titles, authors, and summaries. Data cleaning procedures included:
|
| 24 |
- **Null Value Handling**: Removal of records with missing descriptions or critical metadata.
|
| 25 |
- **Text Normalization**: Standardization of description text (unicode normalization, whitespace handling).
|
| 26 |
-
- **
|
|
|
|
|
|
|
| 27 |
|
| 28 |
### 2.2 Vector Embeddings
|
| 29 |
-
Semantic search is enabled by projecting
|
| 30 |
|
| 31 |
### 2.3 Emotion Classification
|
| 32 |
To support mood-based filtering, we implemented a transferable multi-label classification task. We utilized **DistilRoBERTa-base**, fine-tuned on the GoEmotions dataset. For each book description, the model predicts a probability distribution across 7 emotional dimensions: *Joy, Sadness, Anger, Fear, Surprise, Love, and Neutral*.
|
|
@@ -47,20 +49,43 @@ This project presents a comprehensive, multi-modal recommendation and e-commerce
|
|
| 47 |
* **Caching Infrastructure**: Implements Redis caching to optimize latency for high-frequency queries.
|
| 48 |
* **Zero-Shot Re-ranking**: (In Progress) Evaluates candidate generation using LLM-based zero-shot reasoning.
|
| 49 |
|
| 50 |
-
### 2. Conversational Shopping Assistant
|
| 51 |
-
* **RAG Architecture**: Retrieves
|
| 52 |
-
* **
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 53 |
|
| 54 |
-
### 3. Marketing Content Generation
|
| 55 |
-
* **Automated Copywriting**: Generates marketing descriptions based on product features and target audience profiles.
|
| 56 |
-
* **Safety Guardrails**: Enforces content safety policies to ensure generated text adheres to brand guidelines.
|
| 57 |
|
| 58 |
## System Architecture
|
| 59 |
|
| 60 |
-
The project follows a
|
| 61 |
|
| 62 |
-
* **Frontend**:
|
| 63 |
-
* **Backend API**: FastAPI service
|
| 64 |
* **Data Layer**:
|
| 65 |
* **Amazon Books Dataset**: 200,000+ records processed via custom ETL pipelines.
|
| 66 |
* **Vector Store**: ChromaDB for embedding storage and similarity search.
|
|
@@ -70,11 +95,12 @@ The project follows a microservices-inspired architecture:
|
|
| 70 |
|
| 71 |
### Prerequisites
|
| 72 |
* Python 3.10+
|
| 73 |
-
*
|
|
|
|
| 74 |
|
| 75 |
### Deployment
|
| 76 |
|
| 77 |
-
**Option 1:
|
| 78 |
|
| 79 |
1. **Clone the repository**:
|
| 80 |
```bash
|
|
@@ -82,26 +108,45 @@ The project follows a microservices-inspired architecture:
|
|
| 82 |
cd book-rec-with-LLMs
|
| 83 |
```
|
| 84 |
|
| 85 |
-
2. **
|
| 86 |
```bash
|
| 87 |
-
|
| 88 |
-
|
| 89 |
```
|
| 90 |
|
| 91 |
-
3. **
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 92 |
```bash
|
| 93 |
make run
|
| 94 |
# Starts FastAPI on http://localhost:6006
|
| 95 |
```
|
| 96 |
|
| 97 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 98 |
```bash
|
| 99 |
-
|
| 100 |
-
|
|
|
|
|
|
|
| 101 |
```
|
| 102 |
|
| 103 |
5. **Access the Interface**:
|
| 104 |
-
Navigate to `http://localhost:
|
| 105 |
|
| 106 |
**Option 2: Docker Deployment**
|
| 107 |
|
|
@@ -111,7 +156,8 @@ The project follows a microservices-inspired architecture:
|
|
| 111 |
```
|
| 112 |
|
| 113 |
2. **Access the Interface**:
|
| 114 |
-
|
|
|
|
| 115 |
|
| 116 |
**Notes:**
|
| 117 |
- Redis is optional; caching will be disabled if Redis is unavailable
|
|
@@ -168,7 +214,7 @@ To deploy the system locally, execute the following commands:
|
|
| 168 |
|
| 169 |
The services will be available at:
|
| 170 |
- **API Documentation**: `http://localhost:8000/docs`
|
| 171 |
-
- **
|
| 172 |
|
| 173 |
## 7. References
|
| 174 |
|
|
|
|
| 2 |
license: mit
|
| 3 |
title: Semantic-Based Book Recommendation Framework
|
| 4 |
sdk: docker
|
| 5 |
+
app_port: 8000
|
| 6 |
---
|
| 7 |
|
| 8 |
# Semantic-Based Book Recommendation Framework using Large Language Model Embeddings
|
|
|
|
| 23 |
The dataset consists of 7,000+ books with metadata including titles, authors, and summaries. Data cleaning procedures included:
|
| 24 |
- **Null Value Handling**: Removal of records with missing descriptions or critical metadata.
|
| 25 |
- **Text Normalization**: Standardization of description text (unicode normalization, whitespace handling).
|
| 26 |
+
- **Review Aggregation**: Concatenation of top 3 most helpful/detailed reviews to form a "Review Highlight" document for semantic search.
|
| 27 |
+
- **Description Repair**: Integration of official `books_data.csv` description metadata for accurate frontend display.
|
| 28 |
+
- **Quality Filtration**: Exclusion of records with content shorter than 25 words to ensure sufficient semantic content for embedding.
|
| 29 |
|
| 30 |
### 2.2 Vector Embeddings
|
| 31 |
+
Semantic search is enabled by projecting **processed review highlights** (concatenated high-frequency user comments) into a shared vector space. This allows the system to capture the "reader's sentiment" and thematic elements as perceived by the audience, rather than just the official synopsis. We utilized the `sentence-transformers/all-MiniLM-L6-v2` model, which maps sentences to a 384-dimensional dense vector space. This model was selected for its optimal balance between inference speed and semantic accuracy (performance on the 1B Sentence Embeddings Benchmark).
|
| 32 |
|
| 33 |
### 2.3 Emotion Classification
|
| 34 |
To support mood-based filtering, we implemented a transferable multi-label classification task. We utilized **DistilRoBERTa-base**, fine-tuned on the GoEmotions dataset. For each book description, the model predicts a probability distribution across 7 emotional dimensions: *Joy, Sadness, Anger, Fear, Surprise, Love, and Neutral*.
|
|
|
|
| 49 |
* **Caching Infrastructure**: Implements Redis caching to optimize latency for high-frequency queries.
|
| 50 |
* **Zero-Shot Re-ranking**: (In Progress) Evaluates candidate generation using LLM-based zero-shot reasoning.
|
| 51 |
|
| 52 |
+
### 2. Conversational Shopping Assistant (RAG)
|
| 53 |
+
* **RAG Architecture**: Retrieves book context from ChromaDB to ground LLM responses, reducing hallucinations.
|
| 54 |
+
* **Streaming Responses**: Real-time token streaming via Server-Sent Events (SSE).
|
| 55 |
+
* **BYOK (Bring Your Own Key)**: Users provide their own OpenAI API key via frontend Settings modal.
|
| 56 |
+
* **Local LLM Support**: Ollama integration for zero-cost local inference (`llama3`).
|
| 57 |
+
|
| 58 |
+
### 3. Personalized Marketing Highlights
|
| 59 |
+
* **LLM-Powered Generation**: Real-time personalized book highlights using user's reading persona.
|
| 60 |
+
* **Async UX**: Modal opens immediately; highlights load in background for responsive experience.
|
| 61 |
+
* **Fallback System**: Graceful degradation to template-based highlights if LLM unavailable.
|
| 62 |
+
|
| 63 |
+
### 4. Advanced RAG Architecture (SOTA)
|
| 64 |
+
This system implements state-of-the-art retrieval techniques beyond basic vector search:
|
| 65 |
+
|
| 66 |
+
* **Agentic Query Router**: Dynamically selects retrieval strategy based on query intent.
|
| 67 |
+
* ISBN queries → Pure BM25 (100% precision on exact matches)
|
| 68 |
+
* Keyword queries → Hybrid Search (BM25 + Dense, fast)
|
| 69 |
+
* Complex queries → Hybrid + Cross-Encoder Reranking (high relevance)
|
| 70 |
+
* Detail queries → Small-to-Big Retrieval (finds hidden gems)
|
| 71 |
+
* **Hybrid Search (RRF)**: Combines sparse (BM25) and dense (MiniLM) retrieval using Reciprocal Rank Fusion.
|
| 72 |
+
* **Cross-Encoder Reranking**: Uses `ms-marco-MiniLM` to rerank top candidates for semantic precision.
|
| 73 |
+
* **Temporal Dynamics**: Applies recency bias for "latest/new" queries using publication date decay.
|
| 74 |
+
* **Small-to-Big Retrieval**: Indexes 788K review sentences separately; matches specific plot details, maps back to parent book.
|
| 75 |
+
* **Context Compression**: Summarizes long chat history to prevent token overflow.
|
| 76 |
+
|
| 77 |
+
### 5. SFT Data Factory
|
| 78 |
+
* **Self-Instruct Pipeline**: Generates (Query, Response) pairs from raw reviews for style alignment.
|
| 79 |
+
* **LLM-as-a-Judge**: Quality filtering on Empathy, Specificity, and Critique Depth dimensions.
|
| 80 |
+
* **DPO-Ready**: Can construct preference pairs (Chosen vs Rejected) for alignment training.
|
| 81 |
|
|
|
|
|
|
|
|
|
|
| 82 |
|
| 83 |
## System Architecture
|
| 84 |
|
| 85 |
+
The project follows a modern full-stack architecture:
|
| 86 |
|
| 87 |
+
* **Frontend**: React 18 + Vite, providing an intuitive book search and recommendation interface.
|
| 88 |
+
* **Backend API**: FastAPI service for recommendation logic and data retrieval.
|
| 89 |
* **Data Layer**:
|
| 90 |
* **Amazon Books Dataset**: 200,000+ records processed via custom ETL pipelines.
|
| 91 |
* **Vector Store**: ChromaDB for embedding storage and similarity search.
|
|
|
|
| 95 |
|
| 96 |
### Prerequisites
|
| 97 |
* Python 3.10+
|
| 98 |
+
* Node.js 18+ and npm/yarn
|
| 99 |
+
* Docker and Docker Compose (optional)
|
| 100 |
|
| 101 |
### Deployment
|
| 102 |
|
| 103 |
+
**Option 1: Development Mode**
|
| 104 |
|
| 105 |
1. **Clone the repository**:
|
| 106 |
```bash
|
|
|
|
| 108 |
cd book-rec-with-LLMs
|
| 109 |
```
|
| 110 |
|
| 111 |
+
2. **Create Conda environment**:
|
| 112 |
```bash
|
| 113 |
+
conda env create -f environment.yml
|
| 114 |
+
conda activate book-rec
|
| 115 |
```
|
| 116 |
|
| 117 |
+
3. **Initialize Vector Database** (first run only):
|
| 118 |
+
```bash
|
| 119 |
+
python src/init_db.py
|
| 120 |
+
```
|
| 121 |
+
|
| 122 |
+
4. **Start API Server** (Terminal 1):
|
| 123 |
```bash
|
| 124 |
make run
|
| 125 |
# Starts FastAPI on http://localhost:6006
|
| 126 |
```
|
| 127 |
|
| 128 |
+
### LLM Configuration
|
| 129 |
+
|
| 130 |
+
**Option A: Local Ollama (Free, Recommended for Dev)**
|
| 131 |
+
```bash
|
| 132 |
+
ollama pull llama3
|
| 133 |
+
ollama serve # if not already running
|
| 134 |
+
```
|
| 135 |
+
|
| 136 |
+
**Option B: OpenAI API (Production)**
|
| 137 |
+
- Click ⚙️ Settings in the web UI
|
| 138 |
+
- Enter your OpenAI API Key (`sk-...`)
|
| 139 |
+
|
| 140 |
+
5. **Install and start frontend** (Terminal 2):
|
| 141 |
```bash
|
| 142 |
+
cd web
|
| 143 |
+
npm install
|
| 144 |
+
npm run dev
|
| 145 |
+
# Starts React app on http://localhost:5173
|
| 146 |
```
|
| 147 |
|
| 148 |
5. **Access the Interface**:
|
| 149 |
+
Navigate to `http://localhost:5173` in a web browser.
|
| 150 |
|
| 151 |
**Option 2: Docker Deployment**
|
| 152 |
|
|
|
|
| 156 |
```
|
| 157 |
|
| 158 |
2. **Access the Interface**:
|
| 159 |
+
API will be available at `http://localhost:8000`
|
| 160 |
+
Frontend development server should be started separately (see Option 1, step 4)
|
| 161 |
|
| 162 |
**Notes:**
|
| 163 |
- Redis is optional; caching will be disabled if Redis is unavailable
|
|
|
|
| 214 |
|
| 215 |
The services will be available at:
|
| 216 |
- **API Documentation**: `http://localhost:8000/docs`
|
| 217 |
+
- **Frontend**: Start separately with `npm run dev` (see above)
|
| 218 |
|
| 219 |
## 7. References
|
| 220 |
|
data/sft/raw_generated.jsonl
ADDED
|
@@ -0,0 +1,5 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{"instruction": "This is a MOCKED response from the RAG Agent.", "input": "", "output": "I found the book 'Aurora Leigh' to be quite fascinating based on the description!", "source_isbn": "B000GSL88Y"}
|
| 2 |
+
{"instruction": "It fits your persona of liking Victorian literature.", "input": "", "output": "This is a MOCKED response from the RAG Agent.", "source_isbn": "087844176X"}
|
| 3 |
+
{"instruction": "I found the book 'Aurora Leigh' to be quite fascinating based on the description!", "input": "", "output": "It fits your persona of liking Victorian literature.", "source_isbn": "1880000261"}
|
| 4 |
+
{"instruction": "This is a MOCKED response from the RAG Agent.", "input": "", "output": "I found the book 'Aurora Leigh' to be quite fascinating based on the description!", "source_isbn": "0899332560"}
|
| 5 |
+
{"instruction": "It fits your persona of liking Victorian literature.", "input": "", "output": "This is a MOCKED response from the RAG Agent.", "source_isbn": "0812516826"}
|
data/user_profiles.json
CHANGED
|
@@ -1,7 +1,18 @@
|
|
| 1 |
{
|
| 2 |
"local": {
|
| 3 |
"favorites": [
|
| 4 |
-
"
|
| 5 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 6 |
}
|
| 7 |
}
|
|
|
|
| 1 |
{
|
| 2 |
"local": {
|
| 3 |
"favorites": [
|
| 4 |
+
"0130608556",
|
| 5 |
+
"0132681528",
|
| 6 |
+
"0070397635"
|
| 7 |
+
],
|
| 8 |
+
"cached_highlights": {
|
| 9 |
+
"0078817609": "\"Unlock the secrets of C++ programming and ignite your passion for coding with 'Teach Yourself C++'! This comprehensive guide will fuel your joy in learning, as you explore the latest developments and defensive coding techniques to bring your projects to life.\"",
|
| 10 |
+
"0130608556": "\"Unleash your inner coding curiosity with 'Introduction to C Programming: A Modular Approach'! With its modular structure and real-world applications, this book is perfect for those looking to build a strong foundation in programming fundamentals - and discover the thrill of creating something from scratch.\"",
|
| 11 |
+
"0132681528": "\"Get ready to unleash your inner coding ninja! This book's unique 'use it, then build it' approach will have you building real-world projects from the start, and its focus on object-oriented programming will give you the skills to tackle complex problems with ease.\"",
|
| 12 |
+
"0070397635": "\"Get ready to crunch numbers with ease! This applied mathematics textbook, written by Lial, Greenwell, and Ritchey, will unlock the power of finite math for you, just like it has for computer enthusiasts who crave problem-solving fun.\"",
|
| 13 |
+
"0192802607": "\"Embrace the complexity of human emotions with Chekhov's masterful short stories, where the intricacies of love, relationships, and mortality are expertly woven together. Your affinity for precise mathematical logic will appreciate the deliberate pacing and nuanced character development that unfolds like a perfectly crafted algorithm.\"",
|
| 14 |
+
"0060959479": "\"Discover how bell hooks' groundbreaking work on love and relationships can bridge the gaps in your own life, just as you've bridged mathematical concepts to uncover their underlying truths. 'All About Love: New Visions' offers a profound exploration of human connection that will resonate deeply with your affinity for authors who delve into the complexities of the human experience.\"",
|
| 15 |
+
"0001048228": "\"Get ready to unravel the intricate mysteries of the human heart with 'Pale Battalions'! This gripping tale, reminiscent of Margaret L. Lial's masterful storytelling, will keep you enthralled as Leonora and Penelope navigate the complexities of grief, family secrets, and self-discovery - all set against the stunning backdrop of Paris.\""
|
| 16 |
+
}
|
| 17 |
}
|
| 18 |
}
|
data/users.json
ADDED
|
@@ -0,0 +1,27 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"local": {
|
| 3 |
+
"user_id": "local",
|
| 4 |
+
"name": "Demo User",
|
| 5 |
+
"favorites": [
|
| 6 |
+
{
|
| 7 |
+
"isbn": "0001047604",
|
| 8 |
+
"title": "Aurora Leigh",
|
| 9 |
+
"authors": "Elizabeth Barrett Browning",
|
| 10 |
+
"simple_categories": "Poetry"
|
| 11 |
+
},
|
| 12 |
+
{
|
| 13 |
+
"isbn": "0060930314",
|
| 14 |
+
"title": "The Elements of Style",
|
| 15 |
+
"authors": "William Strunk Jr., E.B. White",
|
| 16 |
+
"simple_categories": "Reference"
|
| 17 |
+
},
|
| 18 |
+
{
|
| 19 |
+
"isbn": "0140449189",
|
| 20 |
+
"title": "The Republic",
|
| 21 |
+
"authors": "Plato",
|
| 22 |
+
"simple_categories": "Philosophy"
|
| 23 |
+
}
|
| 24 |
+
],
|
| 25 |
+
"persona": "A reader who appreciates classic literature, thoughtful prose, and philosophical depth. Favors works that combine intellectual rigor with poetic expression."
|
| 26 |
+
}
|
| 27 |
+
}
|
docker-compose.yml
CHANGED
|
@@ -23,20 +23,6 @@ services:
|
|
| 23 |
- redis_data:/data
|
| 24 |
restart: unless-stopped
|
| 25 |
|
| 26 |
-
ui:
|
| 27 |
-
build: .
|
| 28 |
-
command: python app.py
|
| 29 |
-
ports:
|
| 30 |
-
- "7860:7860"
|
| 31 |
-
volumes:
|
| 32 |
-
- ./data:/app/data
|
| 33 |
-
environment:
|
| 34 |
-
- GRADIO_SERVER_NAME=0.0.0.0
|
| 35 |
-
- API_URL=http://api:8000
|
| 36 |
-
depends_on:
|
| 37 |
-
- api
|
| 38 |
-
restart: unless-stopped
|
| 39 |
-
|
| 40 |
volumes:
|
| 41 |
chroma_data:
|
| 42 |
redis_data:
|
|
|
|
| 23 |
- redis_data:/data
|
| 24 |
restart: unless-stopped
|
| 25 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 26 |
volumes:
|
| 27 |
chroma_data:
|
| 28 |
redis_data:
|
docs/TECHNICAL_REPORT.md
ADDED
|
@@ -0,0 +1,240 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Technical Report: Agentic RAG Book Recommender
|
| 2 |
+
|
| 3 |
+
**Author**: [Your Name]
|
| 4 |
+
**Date**: January 2026
|
| 5 |
+
**Project Type**: End-to-End ML/AI System (Retrieval-Augmented Generation)
|
| 6 |
+
|
| 7 |
+
---
|
| 8 |
+
|
| 9 |
+
## Executive Summary
|
| 10 |
+
|
| 11 |
+
This project implements a **production-grade Agentic RAG (Retrieval-Augmented Generation)** system for book discovery. Unlike simple vector search, it uses a self-routing architecture that dynamically selects the optimal retrieval strategy based on query intent, achieving:
|
| 12 |
+
|
| 13 |
+
- **100% recall** on exact-match queries (ISBNs)
|
| 14 |
+
- **Sub-second latency** for keyword searches
|
| 15 |
+
- **Deep semantic understanding** for complex natural language queries
|
| 16 |
+
- **Detail-level precision** via hierarchical (Small-to-Big) retrieval
|
| 17 |
+
|
| 18 |
+
The system demonstrates mastery of both **Data-Centric AI** (SFT data synthesis) and **Advanced RAG Architecture** (Hybrid Search, Reranking, Query Routing).
|
| 19 |
+
|
| 20 |
+
---
|
| 21 |
+
|
| 22 |
+
## 1. Problem Statement
|
| 23 |
+
|
| 24 |
+
**Challenge**: Traditional keyword search fails on modern book discovery scenarios:
|
| 25 |
+
- Users search by *feeling* ("sad sci-fi about AI") rather than *keywords*
|
| 26 |
+
- Users want specific *plot details* ("books with an unreliable narrator twist")
|
| 27 |
+
- Users expect *temporal awareness* ("latest books on quantum computing")
|
| 28 |
+
|
| 29 |
+
**Solution**: An intelligent RAG system that:
|
| 30 |
+
1. Understands user intent (Agentic Router)
|
| 31 |
+
2. Fuses multiple retrieval strategies (Hybrid Search)
|
| 32 |
+
3. Ranks results by semantic relevance (Cross-Encoder Reranking)
|
| 33 |
+
4. Finds hidden gems in review text (Small-to-Big Retrieval)
|
| 34 |
+
|
| 35 |
+
---
|
| 36 |
+
|
| 37 |
+
## 2. System Architecture
|
| 38 |
+
|
| 39 |
+
```
|
| 40 |
+
┌─────────────────────────────────────────────────────────────────────────┐
|
| 41 |
+
│ USER QUERY │
|
| 42 |
+
└─────────────────────────────────────────────────────────────────────────┘
|
| 43 |
+
│
|
| 44 |
+
▼
|
| 45 |
+
┌─────────────────────────────────────────────────────────────────────────┐
|
| 46 |
+
│ AGENTIC QUERY ROUTER │
|
| 47 |
+
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
|
| 48 |
+
│ │ ISBN │ │ Keyword │ │ Complex │ │ Detail │ │
|
| 49 |
+
│ │ (Exact) │ │ (Fast) │ │ (Deep) │ │ (Small2Big) │ │
|
| 50 |
+
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
|
| 51 |
+
└─────────┼────────────────┼────────────────┼────────────────┼───────────┘
|
| 52 |
+
│ │ │ │
|
| 53 |
+
▼ ▼ ▼ ▼
|
| 54 |
+
┌──────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
|
| 55 |
+
│ BM25 Only│ │Hybrid (RRF) │ │Hybrid + Rank │ │Chunk → Parent│
|
| 56 |
+
│ α=1.0 │ │BM25 + Dense │ │+ Cross-Enc │ │788K Sentences│
|
| 57 |
+
└────┬─────┘ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘
|
| 58 |
+
│ │ │ │
|
| 59 |
+
└─────────────────┴────────┬────────┴─────────────────┘
|
| 60 |
+
▼
|
| 61 |
+
┌─────────────────────────────────────────────────────────────────────────┐
|
| 62 |
+
│ OPTIONAL POST-PROCESSING │
|
| 63 |
+
│ ┌──────────────────┐ ┌──────────────────┐ │
|
| 64 |
+
│ │ Temporal Dynamics│ │Context Compression│ │
|
| 65 |
+
│ │ (Recency Boost) │ │ (Chat History) │ │
|
| 66 |
+
│ └──────────────────┘ └──────────────────┘ │
|
| 67 |
+
└─────────────────────────────────────────────────────────────────────────┘
|
| 68 |
+
│
|
| 69 |
+
▼
|
| 70 |
+
┌─────────────────────────────────────────────────────────────────────────┐
|
| 71 |
+
│ LLM GENERATION │
|
| 72 |
+
│ (Streaming Response via SSE) │
|
| 73 |
+
└─────────────────────────────────────────────────────────────────────────┘
|
| 74 |
+
```
|
| 75 |
+
|
| 76 |
+
---
|
| 77 |
+
|
| 78 |
+
## 3. Technical Innovations
|
| 79 |
+
|
| 80 |
+
### 3.1 Agentic Query Router (`src/core/router.py`)
|
| 81 |
+
|
| 82 |
+
**Motivation**: A single retrieval strategy cannot optimize for all query types.
|
| 83 |
+
|
| 84 |
+
**Implementation**: Rule-based intent classifier using RegEx and keyword detection:
|
| 85 |
+
| Query Type | Detection Logic | Strategy | Latency |
|
| 86 |
+
|------------|-----------------|----------|---------|
|
| 87 |
+
| ISBN | `\d{10,13}` pattern | BM25 Only (α=1.0) | <100ms |
|
| 88 |
+
| Keyword | `len(words) <= 2` | Hybrid (No Rerank) | ~300ms |
|
| 89 |
+
| Complex | Default | Hybrid + Cross-Encoder | ~800ms |
|
| 90 |
+
| Detail | Keywords: "twist", "ending", "cried" | Small-to-Big | ~500ms |
|
| 91 |
+
|
| 92 |
+
**Trade-off Decision**: Chose rule-based over LLM-based routing to avoid 500ms+ latency per routing decision.
|
| 93 |
+
|
| 94 |
+
### 3.2 Hybrid Search with RRF (`src/vector_db.py`)
|
| 95 |
+
|
| 96 |
+
**Motivation**: Dense vectors fail on exact terms (ISBNs, proper nouns); BM25 fails on semantic queries.
|
| 97 |
+
|
| 98 |
+
**Implementation**: Reciprocal Rank Fusion combining BM25 (sparse) and MiniLM (dense):
|
| 99 |
+
```python
|
| 100 |
+
RRF_Score = Σ 1/(k + rank_dense) + 1/(k + rank_sparse) # k=60
|
| 101 |
+
```
|
| 102 |
+
|
| 103 |
+
**Result**: 100% recall on ISBNs (previously 0% with pure vector search).
|
| 104 |
+
|
| 105 |
+
### 3.3 Cross-Encoder Reranking (`src/core/reranker.py`)
|
| 106 |
+
|
| 107 |
+
**Motivation**: Bi-encoders are fast but approximate; Cross-Encoders are slow but precise.
|
| 108 |
+
|
| 109 |
+
**Implementation**: Two-stage retrieval:
|
| 110 |
+
1. Stage 1: Retrieve top-50 candidates via RRF (~100ms)
|
| 111 |
+
2. Stage 2: Rerank with `ms-marco-MiniLM-L-6-v2` (~400ms)
|
| 112 |
+
|
| 113 |
+
**Trade-off Decision**: Only rerank top-50 (not all 200K) to balance precision vs latency.
|
| 114 |
+
|
| 115 |
+
### 3.4 Small-to-Big Retrieval (`src/vector_db.py::small_to_big_search`)
|
| 116 |
+
|
| 117 |
+
**Motivation**: Book descriptions are coarse; review sentences contain fine-grained details.
|
| 118 |
+
|
| 119 |
+
**Implementation** (SOTA: LlamaIndex Parent-Child, RAPTOR):
|
| 120 |
+
1. **Chunking**: 788,174 review sentences indexed at sentence-level
|
| 121 |
+
2. **Matching**: Query matches specific sentence ("I cried at the ending")
|
| 122 |
+
3. **Expansion**: Map sentence → parent ISBN → full book context
|
| 123 |
+
|
| 124 |
+
**Result**: Can answer queries like "books with unreliable narrator twist" that are invisible to description-level search.
|
| 125 |
+
|
| 126 |
+
### 3.5 SFT Data Factory (`src/data_factory/generator.py`)
|
| 127 |
+
|
| 128 |
+
**Motivation**: Default LLM tone is corporate; we want "Literary Critic" personality.
|
| 129 |
+
|
| 130 |
+
**Implementation** (SOTA: Self-Instruct, Alpaca):
|
| 131 |
+
1. **Seed Sampling**: Extract 1000 high-emotion reviews (rating=5, length>200)
|
| 132 |
+
2. **Instruction Evolution**: GPT generates user questions that would prompt each review
|
| 133 |
+
3. **Response Transform**: Rewrite reviews as AI assistant style
|
| 134 |
+
4. **LLM-as-a-Judge**: Filter for Empathy/Specificity/Critique Depth >= 8/10
|
| 135 |
+
|
| 136 |
+
**Output**: Production-ready SFT dataset for style alignment.
|
| 137 |
+
|
| 138 |
+
---
|
| 139 |
+
|
| 140 |
+
## 4. Performance Metrics
|
| 141 |
+
|
| 142 |
+
| Metric | Baseline (Vector Only) | Advanced (This System) |
|
| 143 |
+
|--------|------------------------|------------------------|
|
| 144 |
+
| ISBN Recall | 0% | **100%** |
|
| 145 |
+
| Keyword Precision | Low | **High** (BM25 boost) |
|
| 146 |
+
| Detail Query Recall | 0% | **High** (Small-to-Big) |
|
| 147 |
+
| Avg Latency | 100ms | 300-800ms (acceptable) |
|
| 148 |
+
| Chat Context Limit | ~10 turns | **Unlimited** (compression) |
|
| 149 |
+
|
| 150 |
+
---
|
| 151 |
+
|
| 152 |
+
## 5. Technology Stack
|
| 153 |
+
|
| 154 |
+
| Layer | Technology | Purpose |
|
| 155 |
+
|-------|------------|---------|
|
| 156 |
+
| **Vector DB** | ChromaDB | Embedded, zero-latency vector storage |
|
| 157 |
+
| **Sparse Index** | BM25Okapi (rank_bm25) | Keyword/exact match retrieval |
|
| 158 |
+
| **Embeddings** | all-MiniLM-L6-v2 | 384-dim sentence embeddings |
|
| 159 |
+
| **Reranker** | ms-marco-MiniLM-L-6-v2 | Cross-encoder precision ranking |
|
| 160 |
+
| **LLM** | OpenAI / Ollama (llama3) | Generation with BYOK support |
|
| 161 |
+
| **Backend** | FastAPI + SSE | Streaming API |
|
| 162 |
+
| **Frontend** | React 18 + Vite | Modern SPA |
|
| 163 |
+
|
| 164 |
+
---
|
| 165 |
+
|
| 166 |
+
## 6. Key Design Decisions
|
| 167 |
+
|
| 168 |
+
| Decision | Chosen Option | Rejected Alternative | Rationale |
|
| 169 |
+
|----------|---------------|---------------------|-----------|
|
| 170 |
+
| Vector DB | ChromaDB (embedded) | Pinecone (cloud) | Zero network latency; 200K docs fits in RAM |
|
| 171 |
+
| Routing | Rule-based RegEx | LLM-based routing | 2ms vs 500ms latency; deterministic behavior |
|
| 172 |
+
| Reranking | Cross-Encoder | LLM reranking | 400ms vs 2s latency; proven accuracy |
|
| 173 |
+
| Chunking | Sentence-level (Small-to-Big) | Fixed 512 tokens | Semantic integrity; detail-level matching |
|
| 174 |
+
| SFT Data | Self-Instruct | Manual annotation | Scalable; leverages existing reviews |
|
| 175 |
+
|
| 176 |
+
---
|
| 177 |
+
|
| 178 |
+
## 7. Interview Talking Points
|
| 179 |
+
|
| 180 |
+
**Q: What makes this project technically interesting?**
|
| 181 |
+
> "I implemented an Agentic RAG system with self-routing capability. Instead of one-size-fits-all vector search, the system classifies query intent and dynamically selects from 4 strategies—each optimized for different query types. This achieved 100% recall on exact-match queries that previously failed."
|
| 182 |
+
|
| 183 |
+
**Q: What was the hardest engineering challenge?**
|
| 184 |
+
> "The Small-to-Big retrieval. I indexed 788K review sentences separately, but the challenge was mapping matched sentences back to their parent books efficiently. I solved it by embedding parent ISBN in chunk metadata and using BM25 for O(1) lookup."
|
| 185 |
+
|
| 186 |
+
**Q: How would you improve this further?**
|
| 187 |
+
> "Three directions: (1) Fine-tune embeddings on book domain for better semantic alignment, (2) Implement HyDE (generate hypothetical documents before searching), (3) Add RAGAS evaluation pipeline for systematic quality measurement."
|
| 188 |
+
|
| 189 |
+
---
|
| 190 |
+
|
| 191 |
+
## 8. File Structure
|
| 192 |
+
|
| 193 |
+
```
|
| 194 |
+
src/
|
| 195 |
+
├── core/
|
| 196 |
+
│ ├── router.py # Agentic Query Router
|
| 197 |
+
│ ├── reranker.py # Cross-Encoder Reranking
|
| 198 |
+
│ ├── temporal.py # Recency Boosting
|
| 199 |
+
│ └── context_compressor.py # Chat History Compression
|
| 200 |
+
├── data_factory/
|
| 201 |
+
│ └── generator.py # SFT Data Synthesis + LLM Judge
|
| 202 |
+
├── vector_db.py # Hybrid Search + Small-to-Big
|
| 203 |
+
├── recommender.py # Main recommendation logic
|
| 204 |
+
└── services/chat_service.py # RAG Chat Pipeline
|
| 205 |
+
|
| 206 |
+
docs/
|
| 207 |
+
├── TECHNICAL_REPORT.md # This document
|
| 208 |
+
├── technical_deep_dive_sota.md # SOTA references
|
| 209 |
+
├── rag_architecture.md # System diagrams
|
| 210 |
+
└── interview_deep_dive.md # Interview prep
|
| 211 |
+
|
| 212 |
+
experiments/
|
| 213 |
+
├── baseline_report.md # Dense-only baseline
|
| 214 |
+
├── hybrid_report.md # Hybrid search results
|
| 215 |
+
├── rerank_report.md # Cross-encoder results
|
| 216 |
+
├── router_report.md # Agentic router results
|
| 217 |
+
└── temporal_report.md # Time decay results
|
| 218 |
+
```
|
| 219 |
+
|
| 220 |
+
---
|
| 221 |
+
|
| 222 |
+
## 9. Conclusion
|
| 223 |
+
|
| 224 |
+
This project demonstrates end-to-end ML engineering skills across:
|
| 225 |
+
- **Data Engineering**: ETL pipelines, SFT data synthesis, quality filtering
|
| 226 |
+
- **ML Systems**: Hybrid retrieval, cross-encoder reranking, hierarchical indexing
|
| 227 |
+
- **Production Engineering**: Streaming APIs, caching, context management
|
| 228 |
+
- **Architecture Design**: Trade-off analysis, performance optimization
|
| 229 |
+
|
| 230 |
+
The system is **production-ready** and serves as a strong portfolio piece for MLE/AI Engineer roles.
|
| 231 |
+
|
| 232 |
+
---
|
| 233 |
+
|
| 234 |
+
## References
|
| 235 |
+
|
| 236 |
+
1. Self-Instruct (Wang et al., 2022) - Instruction data synthesis
|
| 237 |
+
2. RAPTOR (Sarthi et al., 2024) - Hierarchical tree-based indexing
|
| 238 |
+
3. HyDE (Gao et al., 2022) - Hypothetical document embeddings
|
| 239 |
+
4. LlamaIndex - Parent-child retrieval patterns
|
| 240 |
+
5. ms-marco-MiniLM - Cross-encoder reranking
|
docs/archived/DEPLOYMENT.md
ADDED
|
@@ -0,0 +1,108 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Server Deployment Guide (AutoDL)
|
| 2 |
+
|
| 3 |
+
This guide documents the specific steps required to deploy the Book Recommender system on an AutoDL (or similar domestic GPU cloud) server.
|
| 4 |
+
|
| 5 |
+
## 1. Environment Setup
|
| 6 |
+
|
| 7 |
+
The default environment on some cloud images may be outdated. Always create a fresh Conda environment.
|
| 8 |
+
|
| 9 |
+
```bash
|
| 10 |
+
# Create a fresh environment (Python 3.10 recommended)
|
| 11 |
+
conda create -n valid python=3.10 -y
|
| 12 |
+
conda activate valid
|
| 13 |
+
|
| 14 |
+
# Install dependencies
|
| 15 |
+
# Note: Use official PyPI to avoid stale mirrors returning ancient packages (like huggingface-hub 1.2.4)
|
| 16 |
+
pip install -r requirements.txt -i https://pypi.org/simple
|
| 17 |
+
```
|
| 18 |
+
|
| 19 |
+
**Critical Dependencies**:
|
| 20 |
+
- `huggingface-hub >= 0.23.0` (Required for modern transformers compatibility)
|
| 21 |
+
- `redis` (Python client)
|
| 22 |
+
|
| 23 |
+
## 2. Infrastructure Services
|
| 24 |
+
|
| 25 |
+
### Redis (Caching)
|
| 26 |
+
Ensure Redis Server is installed and running:
|
| 27 |
+
```bash
|
| 28 |
+
apt update && apt install redis-server -y
|
| 29 |
+
service redis-server start
|
| 30 |
+
```
|
| 31 |
+
|
| 32 |
+
## 3. Data Migration (Efficiently)
|
| 33 |
+
|
| 34 |
+
Do **NOT** upload the raw `Books_rating.csv` (2.7 GB) or uncompressed text files. Bandwidth is precious.
|
| 35 |
+
|
| 36 |
+
**Local Machine**:
|
| 37 |
+
```bash
|
| 38 |
+
# Compress large files
|
| 39 |
+
gzip -k data/books_processed.csv # Metadata for API
|
| 40 |
+
gzip -k data/books_descriptions.txt # Text for Vector DB
|
| 41 |
+
|
| 42 |
+
# Upload compressed files
|
| 43 |
+
scp data/books_processed.csv.gz root@<IP>:<PORT>:~/autodl-tmp/book-rec-with-LLMs/data/
|
| 44 |
+
scp data/books_descriptions.txt.gz root@<IP>:<PORT>:~/autodl-tmp/book-rec-with-LLMs/data/
|
| 45 |
+
```
|
| 46 |
+
|
| 47 |
+
**Server**:
|
| 48 |
+
```bash
|
| 49 |
+
# Decompress
|
| 50 |
+
gunzip -f data/*.gz
|
| 51 |
+
```
|
| 52 |
+
|
| 53 |
+
## 4. Model Downloading (Network Fix)
|
| 54 |
+
|
| 55 |
+
Domestic servers often cannot access Hugging Face directly. Use the official mirror.
|
| 56 |
+
|
| 57 |
+
**Server**:
|
| 58 |
+
```bash
|
| 59 |
+
# Enable Mirror
|
| 60 |
+
export HF_ENDPOINT=https://hf-mirror.com
|
| 61 |
+
# Increase Timeout for large files
|
| 62 |
+
export HF_HUB_DOWNLOAD_TIMEOUT=120
|
| 63 |
+
|
| 64 |
+
# Run Initialization (Downloads model + Builds Index)
|
| 65 |
+
python src/init_db.py
|
| 66 |
+
```
|
| 67 |
+
|
| 68 |
+
## 5. Running the Application
|
| 69 |
+
|
| 70 |
+
**Server**:
|
| 71 |
+
```bash
|
| 72 |
+
# Listen on 0.0.0.0 (required for external access)
|
| 73 |
+
uvicorn src.main:app --host 0.0.0.0 --port 6006
|
| 74 |
+
```
|
| 75 |
+
|
| 76 |
+
**Local Machine (Access)**:
|
| 77 |
+
Use SSH Tunneling to securely access the remote API without exposing ports publicly.
|
| 78 |
+
```bash
|
| 79 |
+
ssh -L 6006:localhost:6006 root@<IP> -p <PORT>
|
| 80 |
+
```
|
| 81 |
+
Visit `http://localhost:6006/docs` in your browser.
|
| 82 |
+
|
| 83 |
+
## 图片兜底与路径适配说明
|
| 84 |
+
|
| 85 |
+
### 现象
|
| 86 |
+
- 书籍图片缺失时,前端 `<img src="/assets/cover-not-found.jpg">` 无法正常显示默认图片。
|
| 87 |
+
- 原因:开发环境下前端端口(如 5173)与后端端口(如 6006)不同,`/assets` 路径实际指向前端静态目录,无法访问后端 FastAPI 挂载的静态资源。
|
| 88 |
+
|
| 89 |
+
### 解决方案
|
| 90 |
+
- 后端 FastAPI 通过 `app.mount("/assets", StaticFiles(directory="assets"), name="assets")` 挂载静态资源。
|
| 91 |
+
- 前端图片加载失败时,自动切换为后端 API 地址的兜底图片:
|
| 92 |
+
|
| 93 |
+
```jsx
|
| 94 |
+
<img
|
| 95 |
+
src={book.img}
|
| 96 |
+
alt={book.title}
|
| 97 |
+
onError={e => {
|
| 98 |
+
e.target.onerror = null;
|
| 99 |
+
e.target.src = "http://localhost:6006/assets/cover-not-found.jpg";
|
| 100 |
+
}}
|
| 101 |
+
/>
|
| 102 |
+
```
|
| 103 |
+
- 这样无论图片链接是否有效,缺失时都能正常显示默认封面。
|
| 104 |
+
|
| 105 |
+
### 生产环境建议
|
| 106 |
+
- 生产部署时建议用 nginx 统一代理 `/assets` 到后端或静态目录,保证前后端一致。
|
| 107 |
+
|
| 108 |
+
---
|
docs/archived/PHASE_2_DEVELOPMENT.md
ADDED
|
@@ -0,0 +1,509 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Phase 2: Personalization & React UI Migration
|
| 2 |
+
|
| 3 |
+
**Date:** January 2026
|
| 4 |
+
**Status:** ✅ Complete & Deployed
|
| 5 |
+
|
| 6 |
+
---
|
| 7 |
+
|
| 8 |
+
## Overview
|
| 9 |
+
|
| 10 |
+
This phase shifted the project from a basic semantic book recommender to an **intelligent, personalized discovery platform** with a modern React frontend. The vision evolved from marketplace/swap features to a focused **recommendation engine grounded in user preferences and persona-driven insights**.
|
| 11 |
+
|
| 12 |
+
---
|
| 13 |
+
|
| 14 |
+
## Phase Vision & Direction
|
| 15 |
+
|
| 16 |
+
### Initial Pivot (from conversation)
|
| 17 |
+
- **Original concept:** Second-hand book marketplace/swap platform
|
| 18 |
+
- **User feedback:** Focus on recommendation engine first, then expand
|
| 19 |
+
- **Final direction:** Keep it recommendation-only with two new pillars:
|
| 20 |
+
1. **Favorites** → persistent user library tracking
|
| 21 |
+
2. **Personalized Highlights** → AI-generated selling points based on user taste
|
| 22 |
+
|
| 23 |
+
### Core Philosophy
|
| 24 |
+
> "Books that understand you. Recommendations grounded in what you love."
|
| 25 |
+
|
| 26 |
+
The system learns from your reading preferences and surfaces books that match both the search query AND your unique taste profile.
|
| 27 |
+
|
| 28 |
+
---
|
| 29 |
+
|
| 30 |
+
## What Was Built
|
| 31 |
+
|
| 32 |
+
### 1. **Backend Personalization Layer** (`src/`)
|
| 33 |
+
|
| 34 |
+
#### A. User Favorites Storage
|
| 35 |
+
- **File:** `src/user/profile_store.py`
|
| 36 |
+
- **Mechanism:** JSON-based persistence (`data/user_profiles.json`)
|
| 37 |
+
- **Features:**
|
| 38 |
+
- `add_favorite(user_id, isbn)` → idempotent add + deduplicate
|
| 39 |
+
- `list_favorites(user_id)` → retrieve user's library
|
| 40 |
+
- Works with any user_id (default: "local" for single-user dev)
|
| 41 |
+
|
| 42 |
+
#### B. User Persona Aggregation
|
| 43 |
+
- **File:** `src/marketing/persona.py`
|
| 44 |
+
- **Input:** List of favorite ISBNs + book metadata DataFrame
|
| 45 |
+
- **Output:** `{ summary, top_authors[], top_categories[] }`
|
| 46 |
+
- **Algorithm:**
|
| 47 |
+
1. Fetch metadata for all favorited books
|
| 48 |
+
2. Extract top 3 authors (by frequency)
|
| 49 |
+
3. Extract top 3 categories
|
| 50 |
+
4. Generate natural language summary combining signals
|
| 51 |
+
- Example: *"您钟爱悬疑与科幻,偏好国际视野的作品。"* (You love mystery & sci-fi, prefer international perspectives)
|
| 52 |
+
|
| 53 |
+
#### C. Personalized Highlights Generator
|
| 54 |
+
- **File:** `src/marketing/highlights.py`
|
| 55 |
+
- **Input:** ISBN + user persona + book metadata
|
| 56 |
+
- **Output:** `{ title, authors, category, highlights[], persona_summary }`
|
| 57 |
+
- **Generation Strategy:**
|
| 58 |
+
- Match persona themes to book content (author, category, description)
|
| 59 |
+
- Extract 3-5 contextual selling points
|
| 60 |
+
- Combine rule-based matching + description parsing
|
| 61 |
+
- Example output:
|
| 62 |
+
```
|
| 63 |
+
- 作者获国际奖项,契合您对国际视野的热爱
|
| 64 |
+
- 悬疑与科幻的完美融合,正是您的最爱组合
|
| 65 |
+
- 情节紧凑,适合您快节奏阅读的偏好
|
| 66 |
+
```
|
| 67 |
+
|
| 68 |
+
### 2. **FastAPI Backend Integration** (`src/main.py`)
|
| 69 |
+
|
| 70 |
+
**Three New Endpoints:**
|
| 71 |
+
|
| 72 |
+
```python
|
| 73 |
+
POST /favorites/add
|
| 74 |
+
Request: { user_id: str, isbn: str }
|
| 75 |
+
Response: { status: "ok", favorites_count: int }
|
| 76 |
+
|
| 77 |
+
GET /user/{user_id}/persona
|
| 78 |
+
Response: { user_id, favorites: [], persona: {...} }
|
| 79 |
+
|
| 80 |
+
POST /marketing/highlights
|
| 81 |
+
Request: { isbn: str, user_id?: str }
|
| 82 |
+
Response: { persona, highlights: [], meta: {...} }
|
| 83 |
+
```
|
| 84 |
+
|
| 85 |
+
**CORS Support:**
|
| 86 |
+
- Enabled for localhost:5173 (React dev), 3000 (alt dev), 8080
|
| 87 |
+
- Allows frontend to access backend without restrictions
|
| 88 |
+
|
| 89 |
+
---
|
| 90 |
+
|
| 91 |
+
### 3. **Modern React UI** (`web/`)
|
| 92 |
+
|
| 93 |
+
#### Architecture
|
| 94 |
+
- **Build Tool:** Vite (ultra-fast dev server, ~200ms startup)
|
| 95 |
+
- **Styling:** Tailwind CSS (CDN-based, no build required)
|
| 96 |
+
- **Icons:** lucide-react (modern SVG icons)
|
| 97 |
+
- **State Management:** React Hooks (useState only, no Redux)
|
| 98 |
+
|
| 99 |
+
#### Design: "纸间留白" (Paper Shelf)
|
| 100 |
+
A literary, minimalist aesthetic inspired by:
|
| 101 |
+
- Japanese minimalism (留白 = leaving white space)
|
| 102 |
+
- Second-hand bookstore vibes
|
| 103 |
+
- Serif typography (font-serif)
|
| 104 |
+
- Muted earth tones: `#b392ac` (mauve), `#f4acb7` (peach), `#faf9f6` (cream)
|
| 105 |
+
|
| 106 |
+
#### Core Features
|
| 107 |
+
|
| 108 |
+
**1. Discovery Tab (Default View)**
|
| 109 |
+
```
|
| 110 |
+
┌─────────────────────────────────┐
|
| 111 |
+
│ 纸间留白 │ Header + toggle "私人书斋"
|
| 112 |
+
├─────────────────────────────────┤
|
| 113 |
+
│ 墨色余温·灵魂契合 (if favorites) │ Smart carousel of alma-mate books
|
| 114 |
+
├─────────────────────────────────┤
|
| 115 |
+
│ [Search] [Category▼] [Mood▼] │ Semantic search + filters
|
| 116 |
+
│ 开启发现之旅 (Start Discovery) │
|
| 117 |
+
├─────────────────────────────────┤
|
| 118 |
+
│ [Book 1] [Book 2] [Book 3] ... │ 5-column responsive grid
|
| 119 |
+
│ (hover shows ai-generated hint) │
|
| 120 |
+
└─────────────────────────────────┘
|
| 121 |
+
```
|
| 122 |
+
|
| 123 |
+
**2. Book Detail Modal**
|
| 124 |
+
```
|
| 125 |
+
┌─────────────────────────────────┐
|
| 126 |
+
│ [Close] │
|
| 127 |
+
├─────────���────┬──────────────────┤
|
| 128 |
+
│ Cover │ Title │
|
| 129 |
+
│ ISBN │ Highlights │
|
| 130 |
+
│ Score ★★★★★ │ Description │
|
| 131 |
+
│ │ Chat Interface │
|
| 132 |
+
│ │ [Add to Library] │
|
| 133 |
+
└──────────────┴──────────────────┘
|
| 134 |
+
```
|
| 135 |
+
|
| 136 |
+
**3. Private Library ("私人书斋")**
|
| 137 |
+
- Toggle view to see only favorited books
|
| 138 |
+
- Shows reading statistics (mood distribution)
|
| 139 |
+
- Same gallery grid + detail modal
|
| 140 |
+
|
| 141 |
+
**4. Chat Interface (in modal)**
|
| 142 |
+
- Suggested questions tied to book context
|
| 143 |
+
- User messages vs AI responses styled differently
|
| 144 |
+
- AI grounded to book metadata (not LLM-based yet)
|
| 145 |
+
|
| 146 |
+
#### API Integration
|
| 147 |
+
All four key flows wired to backend:
|
| 148 |
+
|
| 149 |
+
```javascript
|
| 150 |
+
// Search → Recommendation
|
| 151 |
+
startDiscovery() → recommend(query, category, tone)
|
| 152 |
+
|
| 153 |
+
// Select book → Load highlights
|
| 154 |
+
openBook(book) → getHighlights(isbn)
|
| 155 |
+
|
| 156 |
+
// Add to collection
|
| 157 |
+
toggleCollect(book) → addFavorite(isbn)
|
| 158 |
+
|
| 159 |
+
// (Future) Refresh persona
|
| 160 |
+
persona = getPersona(userId)
|
| 161 |
+
```
|
| 162 |
+
|
| 163 |
+
---
|
| 164 |
+
|
| 165 |
+
## End-to-End Flow
|
| 166 |
+
|
| 167 |
+
### User Journey: "Discovery to Collection"
|
| 168 |
+
|
| 169 |
+
```
|
| 170 |
+
1. User enters search query + filters
|
| 171 |
+
↓
|
| 172 |
+
2. startDiscovery() calls POST /recommend
|
| 173 |
+
→ FastAPI semantic search + tone filtering
|
| 174 |
+
→ Returns top N books with thumbnails
|
| 175 |
+
↓
|
| 176 |
+
3. Books render in grid (hover shows AI hint)
|
| 177 |
+
↓
|
| 178 |
+
4. User clicks book → openBook()
|
| 179 |
+
→ Calls POST /marketing/highlights
|
| 180 |
+
→ Gets persona + 3-5 personalized selling points
|
| 181 |
+
→ Modal shows all details + chat
|
| 182 |
+
↓
|
| 183 |
+
5. User clicks "加入藏书馆" (Add to Collection)
|
| 184 |
+
→ Calls POST /favorites/add
|
| 185 |
+
→ Updates myCollection state
|
| 186 |
+
→ Next search shows "灵魂契合" carousel (matched books)
|
| 187 |
+
↓
|
| 188 |
+
6. User clicks "私人书斋" to view collection
|
| 189 |
+
→ Filters books to only favorites
|
| 190 |
+
→ Shows reading persona stats
|
| 191 |
+
```
|
| 192 |
+
|
| 193 |
+
---
|
| 194 |
+
|
| 195 |
+
## Technical Decisions
|
| 196 |
+
|
| 197 |
+
### Why JSON for Favorites (not SQLite)?
|
| 198 |
+
- **Rationale:** Single-user dev focus, rapid iteration
|
| 199 |
+
- **Trade-off:** 11k books × metadata in one file = acceptable overhead
|
| 200 |
+
- **Future:** Easy migration to PostgreSQL when scaling to multi-user
|
| 201 |
+
|
| 202 |
+
### Why No LLM for Highlights?
|
| 203 |
+
- **Rationale:** Keep system lightweight, deterministic, fast
|
| 204 |
+
- **Method:** Rule-based persona matching (Top-3 authors/categories)
|
| 205 |
+
- **Future:** Could upgrade to LLM refinement (e.g., GPT for polish)
|
| 206 |
+
|
| 207 |
+
### Why React + Vite?
|
| 208 |
+
- **Rationale:**
|
| 209 |
+
- React needed for custom UX and production-grade interface
|
| 210 |
+
- Vite super fast (no webpack pain)
|
| 211 |
+
- Tailwind CSS for modern styling
|
| 212 |
+
- **Architecture:** React frontend (port 5173) + FastAPI backend (port 6006/8000)
|
| 213 |
+
|
| 214 |
+
### Why Persona from Favorites (not search history)?
|
| 215 |
+
- **Rationale:** User intent explicit in favorites, not implicit in queries
|
| 216 |
+
- **Semantics:** "Add to collection" = explicit preference signal
|
| 217 |
+
- **Advantage:** Works offline, no tracking/privacy concerns
|
| 218 |
+
|
| 219 |
+
---
|
| 220 |
+
|
| 221 |
+
## Architecture Diagram
|
| 222 |
+
|
| 223 |
+
```
|
| 224 |
+
┌──────────────────────────────────────────────────────┐
|
| 225 |
+
│ FRONTEND (React) │
|
| 226 |
+
│ web/ → Vite dev server (localhost:5173) │
|
| 227 |
+
│ ┌────────────────────────────────────────────────┐ │
|
| 228 |
+
│ │ App.jsx │ │
|
| 229 |
+
│ │ - SearchBar (query, category, mood) │ │
|
| 230 |
+
│ │ - Gallery (books grid) │ │
|
| 231 |
+
│ │ - DetailModal (title, highlights, chat) │ │
|
| 232 |
+
│ │ - MyCollection (favorites view) │ │
|
| 233 |
+
│ └────────────────────────────────────────────────┘ │
|
| 234 |
+
│ api.js → Fetch wrappers (recommend, highlights...) │
|
| 235 |
+
└──────────────────────────────────────────────────────┘
|
| 236 |
+
↓
|
| 237 |
+
HTTP/CORS
|
| 238 |
+
↓
|
| 239 |
+
┌──────────────────────────────────────────────────────┐
|
| 240 |
+
│ BACKEND (FastAPI) │
|
| 241 |
+
│ src/main.py → uvicorn (localhost:6006) │
|
| 242 |
+
│ ┌────────────────────────────────────────────────┐ │
|
| 243 |
+
│ │ GET /health │ │
|
| 244 |
+
│ │ POST /recommend (query, category, tone) │ │
|
| 245 |
+
│ │ GET /categories, /tones │ │
|
| 246 |
+
│ │ ┌──────────────────────────────────────────┐ │ │
|
| 247 |
+
│ │ │ NEW: POST /favorites/add │ │ │
|
| 248 |
+
│ │ │ NEW: GET /user/{id}/persona │ │ │
|
| 249 |
+
│ │ │ NEW: POST /marketing/highlights │ │ │
|
| 250 |
+
│ │ └──────────────────────────────────────────┘ │ │
|
| 251 |
+
│ └────────────────────────────────────────────────┘ │
|
| 252 |
+
└──────────────────────────────────────────────────────┘
|
| 253 |
+
↓ ↓
|
| 254 |
+
┌─────────────┐ ┌──────────────────┐
|
| 255 |
+
│ ChromaDB │ │ User Profiles │
|
| 256 |
+
│ (11k docs) │ │ (JSON file) │
|
| 257 |
+
│ ↓ │ │ ↓ │
|
| 258 |
+
│ Vector │ │ Favorites + │
|
| 259 |
+
│ Embeddings │ │ Persona │
|
| 260 |
+
└─────────────┘ └──────────────────┘
|
| 261 |
+
↓
|
| 262 |
+
┌─────────────────────────────────┐
|
| 263 |
+
│ Books Metadata (CSV) │
|
| 264 |
+
│ - title, authors, description │
|
| 265 |
+
│ - isbn, category, rating │
|
| 266 |
+
│ - emotion scores (joy/sad/etc) │
|
| 267 |
+
└─────────────────────────────────┘
|
| 268 |
+
```
|
| 269 |
+
|
| 270 |
+
---
|
| 271 |
+
|
| 272 |
+
## Key Data Models
|
| 273 |
+
|
| 274 |
+
### User Profile (JSON)
|
| 275 |
+
```json
|
| 276 |
+
{
|
| 277 |
+
"local": {
|
| 278 |
+
"favorites": [
|
| 279 |
+
{ "isbn": "9780451524935", "title": "1984", "added_at": "2026-01-06" },
|
| 280 |
+
{ "isbn": "9780061120084", "title": "To Kill a Mockingbird", "added_at": "2026-01-06" }
|
| 281 |
+
]
|
| 282 |
+
}
|
| 283 |
+
}
|
| 284 |
+
```
|
| 285 |
+
|
| 286 |
+
### Book Recommendation Response
|
| 287 |
+
```json
|
| 288 |
+
{
|
| 289 |
+
"recommendations": [
|
| 290 |
+
{
|
| 291 |
+
"isbn": "9780451524935",
|
| 292 |
+
"title": "1984",
|
| 293 |
+
"authors": "George Orwell",
|
| 294 |
+
"description": "A dystopian novel...",
|
| 295 |
+
"thumbnail": "https://covers.openlibrary.org/...",
|
| 296 |
+
"caption": "(auto-generated short hint)"
|
| 297 |
+
}
|
| 298 |
+
]
|
| 299 |
+
}
|
| 300 |
+
```
|
| 301 |
+
|
| 302 |
+
### Highlights Response
|
| 303 |
+
```json
|
| 304 |
+
{
|
| 305 |
+
"persona": {
|
| 306 |
+
"summary": "您钟爱悬疑与科幻,偏好国际视野的作品。",
|
| 307 |
+
"top_authors": ["Agatha Christie", "Isaac Asimov"],
|
| 308 |
+
"top_categories": ["Mystery", "Science Fiction"]
|
| 309 |
+
},
|
| 310 |
+
"highlights": [
|
| 311 |
+
"国际推理大师之作,契合您的悬疑偏好",
|
| 312 |
+
"心理扭转的情节设计,适合您快节奏阅读",
|
| 313 |
+
"深层人性反思,引发思考"
|
| 314 |
+
],
|
| 315 |
+
"meta": {
|
| 316 |
+
"title": "And Then There Were None",
|
| 317 |
+
"authors": "Agatha Christie",
|
| 318 |
+
"category": "Mystery",
|
| 319 |
+
"description": "..."
|
| 320 |
+
}
|
| 321 |
+
}
|
| 322 |
+
```
|
| 323 |
+
|
| 324 |
+
---
|
| 325 |
+
|
| 326 |
+
## Running the System
|
| 327 |
+
|
| 328 |
+
### Development Mode (3 services)
|
| 329 |
+
|
| 330 |
+
**Terminal 1: FastAPI Backend**
|
| 331 |
+
```bash
|
| 332 |
+
cd /Users/ymlin/Downloads/003-Study/138-Projects/book-rec-with-LLMs
|
| 333 |
+
make run
|
| 334 |
+
# Starts on http://localhost:6006
|
| 335 |
+
# Loads 11k books into ChromaDB
|
| 336 |
+
# Initializes metrics, routes
|
| 337 |
+
```
|
| 338 |
+
|
| 339 |
+
**Terminal 2: React Frontend**
|
| 340 |
+
```bash
|
| 341 |
+
cd web
|
| 342 |
+
npm run dev
|
| 343 |
+
# Starts on http://localhost:5173
|
| 344 |
+
# Hot reload on file changes
|
| 345 |
+
# Connect to http://localhost:6006 backend
|
| 346 |
+
```
|
| 347 |
+
|
| 348 |
+
### Production Workflow
|
| 349 |
+
- React builds with `npm run build` → static files
|
| 350 |
+
- FastAPI serves as single backend
|
| 351 |
+
- Deploy as Docker containers (see DEPLOYMENT.md)
|
| 352 |
+
|
| 353 |
+
---
|
| 354 |
+
|
| 355 |
+
## Testing the Features
|
| 356 |
+
|
| 357 |
+
### 1. Test Semantic Search
|
| 358 |
+
```
|
| 359 |
+
Input: "悬疑推理小说,节奏快"
|
| 360 |
+
Expected: Agatha Christie, Sherlock Holmes, modern thrillers
|
| 361 |
+
```
|
| 362 |
+
|
| 363 |
+
### 2. Test Favorites → Persona
|
| 364 |
+
```
|
| 365 |
+
1. Add 5 books to collection (mix of genres)
|
| 366 |
+
2. Click a new book
|
| 367 |
+
3. Check highlights mention added books' authors/categories
|
| 368 |
+
✓ Persona should reflect your choices
|
| 369 |
+
```
|
| 370 |
+
|
| 371 |
+
### 3. Test Persona-Based Highlights
|
| 372 |
+
```
|
| 373 |
+
If you favorite: [Sci-Fi, Mystery, Literary]
|
| 374 |
+
Then recommend: Horror book X
|
| 375 |
+
Expected highlight: "虽不在您常读类型,但情节深度与科幻的想象力结合..."
|
| 376 |
+
(Acknowledges taste + bridges to new territory)
|
| 377 |
+
```
|
| 378 |
+
|
| 379 |
+
---
|
| 380 |
+
|
| 381 |
+
## Future Enhancements
|
| 382 |
+
|
| 383 |
+
### Phase 3: Recommendations (Backlog)
|
| 384 |
+
|
| 385 |
+
**1. LLM-Powered Highlights**
|
| 386 |
+
- Use Claude/GPT to refine rule-based highlights
|
| 387 |
+
- Natural language refinement (currently ~70% rule-based quality)
|
| 388 |
+
- Cache per (user_id, isbn) pair for speed
|
| 389 |
+
|
| 390 |
+
**2. Emotional Resonance Scoring**
|
| 391 |
+
- Leverage emotion embeddings (joy/sadness/fear/anger/surprise) in metadata
|
| 392 |
+
- Recommend books matching user's current mood signal
|
| 393 |
+
- "What are you feeling today?" filter
|
| 394 |
+
|
| 395 |
+
**3. Multi-User Accounts**
|
| 396 |
+
- Migrate from JSON to SQLite/PostgreSQL
|
| 397 |
+
- User authentication (OAuth)
|
| 398 |
+
- Social features (share collections, compare tastes)
|
| 399 |
+
|
| 400 |
+
**4. Advanced Search**
|
| 401 |
+
- Author-to-author recommendations ("If you like X, try Y's style")
|
| 402 |
+
- Time-based recommendations ("What to read this season?")
|
| 403 |
+
- Combination search (mood + timeframe + word-count)
|
| 404 |
+
|
| 405 |
+
**5. Analytics Dashboard**
|
| 406 |
+
- Show user: "You've read 15 books in the mystery genre"
|
| 407 |
+
- Predict next book based on reading history
|
| 408 |
+
- Genre comfort zone vs stretch zones
|
| 409 |
+
|
| 410 |
+
---
|
| 411 |
+
|
| 412 |
+
## Phase Reflection
|
| 413 |
+
|
| 414 |
+
### What Worked Well
|
| 415 |
+
✅ **Modular backend design** → easy to add /highlights, /persona endpoints
|
| 416 |
+
✅ **React UI responsiveness** → users see results instantly
|
| 417 |
+
✅ **JSON-first approach** → no DB setup friction, iterate fast
|
| 418 |
+
✅ **API-driven architecture** → React frontend with FastAPI backend
|
| 419 |
+
✅ **Persona concept** → users feel "understood" by the system
|
| 420 |
+
|
| 421 |
+
### Challenges Overcome
|
| 422 |
+
🔧 **Port configuration** (React:5173 vs FastAPI:6006/8000) → Makefile organization
|
| 423 |
+
🔧 **CORS issues** (frontend can't reach backend) → Added CORSMiddleware
|
| 424 |
+
🔧 **Image loading** (external URLs) → Runtime fetching + local fallback
|
| 425 |
+
🔧 **Timeout errors** (cold startup > 10s) → Increased client timeouts, optimized startup
|
| 426 |
+
|
| 427 |
+
### Design Philosophy Validated
|
| 428 |
+
The shift from "marketplace" → "recommendation + personalization" was right because:
|
| 429 |
+
1. **Clear unique value:** Persona-aware recommendations don't exist in typical bookstores
|
| 430 |
+
2. **Tight scope:** Focused on one thing (smart discovery) vs scattered marketplace features
|
| 431 |
+
3. **User empathy:** People want to be understood, not just transact
|
| 432 |
+
|
| 433 |
+
---
|
| 434 |
+
|
| 435 |
+
## Code Structure Summary
|
| 436 |
+
|
| 437 |
+
```
|
| 438 |
+
book-rec-with-LLMs/
|
| 439 |
+
├── src/
|
| 440 |
+
│ ├── main.py # FastAPI app + 3 new endpoints
|
| 441 |
+
│ ├── recommender.py # Semantic search core
|
| 442 |
+
│ ├── vector_db.py # ChromaDB wrapper
|
| 443 |
+
│ ├── cache.py # Image caching
|
| 444 |
+
│ ├── user/
|
| 445 |
+
│ │ └── profile_store.py # ✨ NEW: Favorites JSON storage
|
| 446 |
+
│ └── marketing/
|
| 447 |
+
│ ├── persona.py # ✨ NEW: Persona aggregation
|
| 448 |
+
│ ├── highlights.py # ✨ NEW: Highlight generation
|
| 449 |
+
│ └── guardrails.py # Safety checks (stub)
|
| 450 |
+
├── web/ # ✨ NEW: React Vite app
|
| 451 |
+
│ ├── src/
|
| 452 |
+
│ │ ├── App.jsx # Main component + state
|
| 453 |
+
│ │ ├── api.js # Fetch wrappers
|
| 454 |
+
│ │ └── main.jsx # Entry point
|
| 455 |
+
│ ├── index.html # HTML + Tailwind CDN
|
| 456 |
+
│ └── package.json # Dependencies
|
| 457 |
+
├── Makefile # Commands
|
| 458 |
+
├── requirements.txt # Python deps
|
| 459 |
+
└── data/
|
| 460 |
+
├── books_processed.csv # Metadata + review highlights
|
| 461 |
+
└── user_profiles.json # User data
|
| 462 |
+
```
|
| 463 |
+
|
| 464 |
+
---
|
| 465 |
+
|
| 466 |
+
## Commit Message
|
| 467 |
+
```
|
| 468 |
+
feat: add React UI and backend personalization features
|
| 469 |
+
|
| 470 |
+
- Create modern React UI (web/) with 纸间留白 design
|
| 471 |
+
* Semantic search + favorites + detail modal
|
| 472 |
+
* Tailwind CSS + lucide-react
|
| 473 |
+
* Vite dev server on port 5173
|
| 474 |
+
|
| 475 |
+
- Implement user personalization:
|
| 476 |
+
* src/user/profile_store.py: JSON favorites
|
| 477 |
+
* src/marketing/persona.py: User taste aggregation
|
| 478 |
+
* src/marketing/highlights.py: Persona-aware selling points
|
| 479 |
+
* 3 new API endpoints in FastAPI
|
| 480 |
+
|
| 481 |
+
- Add CORS support, update timeouts, improve infrastructure
|
| 482 |
+
```
|
| 483 |
+
|
| 484 |
+
---
|
| 485 |
+
|
| 486 |
+
## How to Continue
|
| 487 |
+
|
| 488 |
+
### If you want to test now:
|
| 489 |
+
1. `make run` (starts backend)
|
| 490 |
+
2. `cd web && npm run dev` (starts React UI)
|
| 491 |
+
3. Visit http://localhost:5173
|
| 492 |
+
4. Search for a book → click results → "加入藏书馆" → see persona highlights
|
| 493 |
+
|
| 494 |
+
### If you want to refine:
|
| 495 |
+
- Adjust persona algorithm in `src/marketing/persona.py`
|
| 496 |
+
- Tweak UI colors/layout in `web/src/App.jsx`
|
| 497 |
+
- Add more rules to highlights in `src/marketing/highlights.py`
|
| 498 |
+
|
| 499 |
+
### If you want to scale:
|
| 500 |
+
- Migrate to PostgreSQL (users table + favorites relationship)
|
| 501 |
+
- Add user auth (FastAPI auth middleware)
|
| 502 |
+
- Deploy with Docker + cloud (see DEPLOYMENT.md)
|
| 503 |
+
|
| 504 |
+
---
|
| 505 |
+
|
| 506 |
+
**Status:** ✅ **Ready to Deploy**
|
| 507 |
+
|
| 508 |
+
Next phase can focus on: multi-user support, LLM refinement, analytics, or social features.
|
| 509 |
+
|
docs/archived/REVIEW_HIGHLIGHTS.md
ADDED
|
@@ -0,0 +1,142 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Review Highlights Feature
|
| 2 |
+
|
| 3 |
+
## Overview
|
| 4 |
+
|
| 5 |
+
Added semantic sentence extraction to display representative reader reviews for each book. This feature enhances book discovery by showcasing authentic reader voices.
|
| 6 |
+
|
| 7 |
+
## Implementation
|
| 8 |
+
|
| 9 |
+
### 1. Data Generation (Server-side)
|
| 10 |
+
|
| 11 |
+
**Script**: `scripts/extract_review_sentences.py`
|
| 12 |
+
|
| 13 |
+
**Process**:
|
| 14 |
+
- Splits book descriptions into sentences using regex
|
| 15 |
+
- Uses `sentence-transformers/all-MiniLM-L6-v2` for sentence embeddings
|
| 16 |
+
- Clusters similar sentences via cosine similarity (threshold: 0.8)
|
| 17 |
+
- Extracts representative sentences from each cluster (top 5 per book)
|
| 18 |
+
- Stores as semicolon-separated `review_highlights` column in CSV
|
| 19 |
+
|
| 20 |
+
**Execution**:
|
| 21 |
+
```bash
|
| 22 |
+
# Run in container with GPU
|
| 23 |
+
export HF_ENDPOINT=https://hf-mirror.com
|
| 24 |
+
python scripts/extract_review_sentences.py \
|
| 25 |
+
--input data/books_processed.csv \
|
| 26 |
+
--output data/books_processed.csv \
|
| 27 |
+
--top-n 5 \
|
| 28 |
+
--similarity-threshold 0.8 \
|
| 29 |
+
--device 0 \
|
| 30 |
+
--batch-size 128
|
| 31 |
+
```
|
| 32 |
+
|
| 33 |
+
**Performance**: ~17 minutes for 222k books on GPU (211 it/s)
|
| 34 |
+
|
| 35 |
+
### 2. Backend Integration
|
| 36 |
+
|
| 37 |
+
**Files Modified**:
|
| 38 |
+
- `src/recommender.py`: Parse `review_highlights` from CSV, split by semicolon
|
| 39 |
+
- `src/main.py`: Add `review_highlights: List[str]` to `BookResponse` model
|
| 40 |
+
|
| 41 |
+
**Code**:
|
| 42 |
+
```python
|
| 43 |
+
# Parse review highlights from semicolon-separated string
|
| 44 |
+
highlights_raw = str(row.get("review_highlights", "")).strip()
|
| 45 |
+
review_highlights = [h.strip() for h in highlights_raw.split(";") if h.strip()]
|
| 46 |
+
```
|
| 47 |
+
|
| 48 |
+
### 3. Frontend Display
|
| 49 |
+
|
| 50 |
+
**File**: `web/src/App.jsx`
|
| 51 |
+
|
| 52 |
+
**Location**: Left column, bottom section (below Rating/Mood)
|
| 53 |
+
|
| 54 |
+
**Features**:
|
| 55 |
+
- Displays up to 3 representative sentences
|
| 56 |
+
- Bullet-point format with `-` prefix
|
| 57 |
+
- Complete sentences: `- "[sentence]"`
|
| 58 |
+
- Incomplete sentences: `- "...[sentence]"` (auto-detected via regex `/^[A-Z]/`)
|
| 59 |
+
- Styling: 10px italic gray text
|
| 60 |
+
|
| 61 |
+
**Layout**:
|
| 62 |
+
```jsx
|
| 63 |
+
{selectedBook.review_highlights && selectedBook.review_highlights.length > 0 && (
|
| 64 |
+
<div className="w-full mt-auto space-y-2 text-left">
|
| 65 |
+
{selectedBook.review_highlights.slice(0, 3).map((highlight, idx) => {
|
| 66 |
+
const isCompleteSentence = /^[A-Z]/.test(highlight.trim());
|
| 67 |
+
const prefix = isCompleteSentence ? '' : '...';
|
| 68 |
+
return (
|
| 69 |
+
<p key={idx} className="text-[10px] text-[#666] leading-relaxed italic pl-2">
|
| 70 |
+
- "{prefix}{highlight}"
|
| 71 |
+
</p>
|
| 72 |
+
);
|
| 73 |
+
})}
|
| 74 |
+
</div>
|
| 75 |
+
)}
|
| 76 |
+
```
|
| 77 |
+
|
| 78 |
+
## Related Changes
|
| 79 |
+
|
| 80 |
+
### Rating Display Enhancement
|
| 81 |
+
|
| 82 |
+
**Problem**: Hardcoded rating value of 4 stars for all books
|
| 83 |
+
|
| 84 |
+
**Solution**:
|
| 85 |
+
- Added `average_rating` field to backend API response
|
| 86 |
+
- Display format: `4.3` (1 decimal) + filled stars
|
| 87 |
+
- Moved rating display into AI highlight box (pink desc_block)
|
| 88 |
+
|
| 89 |
+
**Frontend mapping**:
|
| 90 |
+
```javascript
|
| 91 |
+
rating: r.average_rating || 0, // Keep float, no rounding
|
| 92 |
+
```
|
| 93 |
+
|
| 94 |
+
**Display**:
|
| 95 |
+
```jsx
|
| 96 |
+
<span>{selectedBook.rating ? selectedBook.rating.toFixed(1) : '0.0'}</span>
|
| 97 |
+
<div className="flex gap-0.5 text-[#f4acb7]">
|
| 98 |
+
{[1,2,3,4,5].map(i => <Star key={i} className={`w-3 h-3 ${i <= selectedBook.rating ? 'fill-current' : ''}`} />)}
|
| 99 |
+
</div>
|
| 100 |
+
```
|
| 101 |
+
|
| 102 |
+
### Layout Adjustments
|
| 103 |
+
|
| 104 |
+
- Grid ratio: 4:8 → 5:7 (more space for left column)
|
| 105 |
+
- Rating/Mood: Changed from vertical stack to consolidated display
|
| 106 |
+
- Rating moved into desc_block (AI highlight box)
|
| 107 |
+
- Review highlights positioned at bottom with `mt-auto`
|
| 108 |
+
|
| 109 |
+
## Data Schema
|
| 110 |
+
|
| 111 |
+
**CSV Column**: `review_highlights` (string, semicolon-separated)
|
| 112 |
+
|
| 113 |
+
**Example**:
|
| 114 |
+
```
|
| 115 |
+
"Having been brought up on the notion...;It transpires, some years ago...;This is a work full of wisdom..."
|
| 116 |
+
```
|
| 117 |
+
|
| 118 |
+
**API Response**:
|
| 119 |
+
```json
|
| 120 |
+
{
|
| 121 |
+
"review_highlights": [
|
| 122 |
+
"Having been brought up on the notion that Elizabeth Barrett Browning was the slighter poet...",
|
| 123 |
+
"It transpires, some years ago, Clarke hosted two hugely successful British television series...",
|
| 124 |
+
"This is a work full of wisdom and unusual perspectives."
|
| 125 |
+
],
|
| 126 |
+
"average_rating": 3.716216
|
| 127 |
+
}
|
| 128 |
+
```
|
| 129 |
+
|
| 130 |
+
## Notes
|
| 131 |
+
|
| 132 |
+
- Review highlights are pre-computed and stored in CSV (no runtime extraction)
|
| 133 |
+
- Data file `books_processed.csv` (~243MB) must be regenerated after container rebuild
|
| 134 |
+
- Use `scp` to transfer processed CSV back to local machine
|
| 135 |
+
- HuggingFace mirror (`HF_ENDPOINT`) required for model download in restricted networks
|
| 136 |
+
|
| 137 |
+
## Future Improvements
|
| 138 |
+
|
| 139 |
+
- Cache sentence embeddings to speed up re-generation
|
| 140 |
+
- Add sentiment analysis to highlights (positive/critical)
|
| 141 |
+
- Filter highlights by relevance to user query
|
| 142 |
+
- Display highlight source (verified purchase vs. regular review)
|
docs/archived/TAGS_AND_EMOTIONS.md
ADDED
|
@@ -0,0 +1,233 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Tags and Emotion Scoring
|
| 2 |
+
|
| 3 |
+
This document describes the tag generation and emotion scoring features added to enrich book metadata.
|
| 4 |
+
|
| 5 |
+
## Overview
|
| 6 |
+
|
| 7 |
+
- **Tags**: Keyword extraction from book descriptions using TF-IDF (5-8 terms per book)
|
| 8 |
+
- **Emotion Scores**: Five emotion dimensions (joy, sadness, fear, anger, surprise) computed via transformer model
|
| 9 |
+
|
| 10 |
+
## Data Generation
|
| 11 |
+
|
| 12 |
+
### 1. Tag Generation
|
| 13 |
+
|
| 14 |
+
Extracts thematic keywords from aggregated review text.
|
| 15 |
+
|
| 16 |
+
**Script**: `scripts/generate_tags.py`
|
| 17 |
+
|
| 18 |
+
**Usage**:
|
| 19 |
+
```bash
|
| 20 |
+
python scripts/generate_tags.py \
|
| 21 |
+
--input data/books_processed.csv \
|
| 22 |
+
--output data/books_processed.csv \
|
| 23 |
+
--top-n 8
|
| 24 |
+
```
|
| 25 |
+
|
| 26 |
+
**Algorithm**:
|
| 27 |
+
- TF-IDF vectorization (unigrams + bigrams)
|
| 28 |
+
- English stopwords + domain stoplist (e.g., "book", "author", "story")
|
| 29 |
+
- Top-N weighted terms per book
|
| 30 |
+
- Semicolon-joined storage in `tags` column
|
| 31 |
+
|
| 32 |
+
**Parameters**:
|
| 33 |
+
- `--top-n`: Max tags per book (default: 8)
|
| 34 |
+
- `--max-features`: TF-IDF vocabulary size (default: 60,000)
|
| 35 |
+
- `--min-df`: Minimum document frequency (default: 5)
|
| 36 |
+
- `--max-df`: Maximum document frequency ratio (default: 0.5)
|
| 37 |
+
|
| 38 |
+
### 2. Emotion Scoring
|
| 39 |
+
|
| 40 |
+
Computes emotion intensity scores from book descriptions.
|
| 41 |
+
|
| 42 |
+
**Script**: `scripts/generate_emotions.py`
|
| 43 |
+
|
| 44 |
+
**Model**: `j-hartmann/emotion-english-distilroberta-base`
|
| 45 |
+
|
| 46 |
+
**Usage**:
|
| 47 |
+
```bash
|
| 48 |
+
# CPU
|
| 49 |
+
python scripts/generate_emotions.py \
|
| 50 |
+
--input data/books_processed.csv \
|
| 51 |
+
--output data/books_processed.csv \
|
| 52 |
+
--batch-size 16
|
| 53 |
+
|
| 54 |
+
# Apple GPU (MPS)
|
| 55 |
+
python scripts/generate_emotions.py \
|
| 56 |
+
--input data/books_processed.csv \
|
| 57 |
+
--output data/books_processed.csv \
|
| 58 |
+
--batch-size 8 \
|
| 59 |
+
--device mps \
|
| 60 |
+
--checkpoint 2000 \
|
| 61 |
+
--resume
|
| 62 |
+
```
|
| 63 |
+
|
| 64 |
+
**Parameters**:
|
| 65 |
+
- `--batch-size`: Inference batch size (default: 16)
|
| 66 |
+
- `--device`: `mps` (Apple GPU), CUDA device id, or CPU (default)
|
| 67 |
+
- `--checkpoint`: Rows between checkpoint writes (default: 5000)
|
| 68 |
+
- `--resume`: Skip rows already scored (useful for resuming long runs)
|
| 69 |
+
- `--max-rows`: Limit processing to N rows (for testing)
|
| 70 |
+
|
| 71 |
+
**Output Columns**:
|
| 72 |
+
- `joy`: 0.0–1.0
|
| 73 |
+
- `sadness`: 0.0–1.0
|
| 74 |
+
- `fear`: 0.0–1.0
|
| 75 |
+
- `anger`: 0.0–1.0
|
| 76 |
+
- `surprise`: 0.0–1.0
|
| 77 |
+
|
| 78 |
+
**Performance**:
|
| 79 |
+
- ~1.1 it/s on Apple M-series GPU
|
| 80 |
+
- ~7 hours for 222k books (batch_size=8, MPS)
|
| 81 |
+
- One-time processing; results persist in CSV
|
| 82 |
+
|
| 83 |
+
## Data Schema
|
| 84 |
+
|
| 85 |
+
Updated `books_processed.csv` columns:
|
| 86 |
+
|
| 87 |
+
| Column | Type | Description |
|
| 88 |
+
|--------|------|-------------|
|
| 89 |
+
| `tags` | str | Semicolon-separated keywords (e.g., "irish;travel;humor") |
|
| 90 |
+
| `joy` | float | Joy emotion score (0.0–1.0) |
|
| 91 |
+
| `sadness` | float | Sadness emotion score (0.0–1.0) |
|
| 92 |
+
| `fear` | float | Fear emotion score (0.0–1.0) |
|
| 93 |
+
| `anger` | float | Anger emotion score (0.0–1.0) |
|
| 94 |
+
| `surprise` | float | Surprise emotion score (0.0–1.0) |
|
| 95 |
+
|
| 96 |
+
## API Integration
|
| 97 |
+
|
| 98 |
+
### Backend Changes
|
| 99 |
+
|
| 100 |
+
**File**: `src/recommender.py`
|
| 101 |
+
|
| 102 |
+
Added to `_format_results()`:
|
| 103 |
+
```python
|
| 104 |
+
# Parse tags
|
| 105 |
+
tags_raw = str(row.get("tags", "")).strip()
|
| 106 |
+
tags = [t.strip() for t in tags_raw.split(";") if t.strip()] if tags_raw else []
|
| 107 |
+
|
| 108 |
+
# Extract emotions
|
| 109 |
+
emotions = {
|
| 110 |
+
"joy": float(row.get("joy", 0.0)),
|
| 111 |
+
"sadness": float(row.get("sadness", 0.0)),
|
| 112 |
+
"fear": float(row.get("fear", 0.0)),
|
| 113 |
+
"anger": float(row.get("anger", 0.0)),
|
| 114 |
+
"surprise": float(row.get("surprise", 0.0)),
|
| 115 |
+
}
|
| 116 |
+
```
|
| 117 |
+
|
| 118 |
+
**File**: `src/main.py`
|
| 119 |
+
|
| 120 |
+
Updated Pydantic model:
|
| 121 |
+
```python
|
| 122 |
+
class BookResponse(BaseModel):
|
| 123 |
+
isbn: str
|
| 124 |
+
title: str
|
| 125 |
+
authors: str
|
| 126 |
+
description: str
|
| 127 |
+
thumbnail: str
|
| 128 |
+
caption: str
|
| 129 |
+
tags: List[str] = []
|
| 130 |
+
emotions: Dict[str, float] = {}
|
| 131 |
+
```
|
| 132 |
+
|
| 133 |
+
### API Response Example
|
| 134 |
+
|
| 135 |
+
```json
|
| 136 |
+
{
|
| 137 |
+
"recommendations": [
|
| 138 |
+
{
|
| 139 |
+
"isbn": "0001849883",
|
| 140 |
+
"title": "Bury My Bones But Keep My Words",
|
| 141 |
+
"authors": "Deborah Savage, Tony Fairman",
|
| 142 |
+
"tags": ["paulsen", "otters", "searches", "gary", "brian"],
|
| 143 |
+
"emotions": {
|
| 144 |
+
"joy": 0.020,
|
| 145 |
+
"sadness": 0.004,
|
| 146 |
+
"fear": 0.012,
|
| 147 |
+
"anger": 0.006,
|
| 148 |
+
"surprise": 0.086
|
| 149 |
+
}
|
| 150 |
+
}
|
| 151 |
+
]
|
| 152 |
+
}
|
| 153 |
+
```
|
| 154 |
+
|
| 155 |
+
## UI Display
|
| 156 |
+
|
| 157 |
+
### Search Results Grid
|
| 158 |
+
|
| 159 |
+
Each book card displays:
|
| 160 |
+
- **Dominant emotion label**: Emotion with highest score (bottom-right badge)
|
| 161 |
+
- Example: "joy", "sadness", "fear"
|
| 162 |
+
|
| 163 |
+
**Implementation** (`web/src/App.jsx`):
|
| 164 |
+
```jsx
|
| 165 |
+
{book.emotions && Object.keys(book.emotions).length > 0 ? (
|
| 166 |
+
<span className="text-[9px] bg-[#f8f9fa] border border-[#eee] px-1 text-[#999] capitalize">
|
| 167 |
+
{Object.entries(book.emotions).reduce((a, b) => a[1] > b[1] ? a : b)[0]}
|
| 168 |
+
</span>
|
| 169 |
+
) : (
|
| 170 |
+
<span className="text-[9px] bg-[#f8f9fa] border border-[#eee] px-1 text-[#999]">—</span>
|
| 171 |
+
)}
|
| 172 |
+
```
|
| 173 |
+
|
| 174 |
+
### Book Detail Modal
|
| 175 |
+
|
| 176 |
+
Two new sections:
|
| 177 |
+
|
| 178 |
+
**1. Key Themes**
|
| 179 |
+
- Displays all extracted tags as badges
|
| 180 |
+
- Shows "No themes found" if tags empty
|
| 181 |
+
|
| 182 |
+
**2. Emotional Tone**
|
| 183 |
+
- Five horizontal bars showing emotion scores
|
| 184 |
+
- Bar width = score percentage (0–100%)
|
| 185 |
+
- Format: `emotion_name | [bar] | percentage`
|
| 186 |
+
|
| 187 |
+
**Implementation** (`web/src/App.jsx`):
|
| 188 |
+
```jsx
|
| 189 |
+
<div className="space-y-2">
|
| 190 |
+
<h4>Emotional Tone</h4>
|
| 191 |
+
<div className="space-y-2 p-3 bg-[#faf9f6] border border-[#eee]">
|
| 192 |
+
{selectedBook.emotions && Object.entries(selectedBook.emotions).map(([emotion, score]) => (
|
| 193 |
+
<div key={emotion} className="flex items-center gap-2">
|
| 194 |
+
<span className="text-[9px] font-bold text-gray-500 w-16 capitalize">{emotion}</span>
|
| 195 |
+
<div className="flex-grow bg-white border border-[#eee] h-2 relative overflow-hidden">
|
| 196 |
+
<div
|
| 197 |
+
className="h-full bg-[#b392ac] transition-all"
|
| 198 |
+
style={{ width: `${Math.round(score * 100)}%` }}
|
| 199 |
+
/>
|
| 200 |
+
</div>
|
| 201 |
+
<span className="text-[8px] text-gray-400 w-10 text-right">{Math.round(score * 100)}%</span>
|
| 202 |
+
</div>
|
| 203 |
+
))}
|
| 204 |
+
</div>
|
| 205 |
+
</div>
|
| 206 |
+
```
|
| 207 |
+
|
| 208 |
+
## Future Improvements
|
| 209 |
+
|
| 210 |
+
- **Incremental updates**: Score only new books instead of full dataset
|
| 211 |
+
- **Smaller model**: Try lightweight emotion classifiers (faster inference)
|
| 212 |
+
- **Multi-label tags**: Use text classification for predefined categories
|
| 213 |
+
- **Tag filtering**: Allow users to filter by specific tags in search
|
| 214 |
+
- **Emotion-based sorting**: Sort results by dominant emotion match
|
| 215 |
+
- **Caching**: Cache emotion inference results in Redis for API speedup
|
| 216 |
+
|
| 217 |
+
## Dependencies
|
| 218 |
+
|
| 219 |
+
```
|
| 220 |
+
scikit-learn # TF-IDF vectorization
|
| 221 |
+
transformers # Emotion classification
|
| 222 |
+
torch # Model inference
|
| 223 |
+
tqdm # Progress bars
|
| 224 |
+
```
|
| 225 |
+
|
| 226 |
+
## Notes
|
| 227 |
+
|
| 228 |
+
- Tags and emotions are **one-time computed** and stored in CSV
|
| 229 |
+
- No re-computation on API requests (instant serving)
|
| 230 |
+
- CSV file (242MB) is in `.gitignore` (too large for GitHub)
|
| 231 |
+
- To regenerate on a new machine, run both scripts sequentially:
|
| 232 |
+
1. `generate_tags.py` (~5 minutes)
|
| 233 |
+
2. `generate_emotions.py` (~7 hours on MPS for full dataset)
|
docs/archived/interview_prep_v1.md
ADDED
|
@@ -0,0 +1,173 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Interview Preparation Guide: Book Recommender System
|
| 2 |
+
|
| 3 |
+
> **Note**: This document is for personal interview preparation and should not be pushed to public repositories.
|
| 4 |
+
|
| 5 |
+
---
|
| 6 |
+
|
| 7 |
+
## 1. Resume Descriptions
|
| 8 |
+
|
| 9 |
+
### Concise Version (1-Line)
|
| 10 |
+
```text
|
| 11 |
+
End-to-End AI E-Commerce Platform | Python, LangChain, RAG, ChromaDB, Redis, FastAPI, Docker | Oct 2025
|
| 12 |
+
• Built a unified AI platform integrating semantic search (200k+ items), RAG-based shopping agent, and automated marketing content generation.
|
| 13 |
+
```
|
| 14 |
+
|
| 15 |
+
### Detailed Version (3-Lines)
|
| 16 |
+
```text
|
| 17 |
+
End-to-End AI E-Commerce Platform Oct 2025
|
| 18 |
+
• Developed a multi-modal AI platform consolidating three core modules: Semantic Search, RAG Shopping Assistant, and Generative Marketing Engine.
|
| 19 |
+
• Engineered a high-performance retrieval system for 200,000+ books using ChromaDB (HNSW) and Redis caching, achieving sub-second latency.
|
| 20 |
+
• Implemented a microservices architecture with FastAPI and Docker, featuring automated content guardrails and zero-shot re-ranking capabilities.
|
| 21 |
+
```
|
| 22 |
+
|
| 23 |
+
### Technical Keywords
|
| 24 |
+
- **Search & Retrieval**: Semantic Search, Vector Embeddings (MiniLM), HNSW Indexing, Redis Caching.
|
| 25 |
+
- **Generative AI**: Retrieval-Augmented Generation (RAG), Zero-Shot Classification (BART-MNLI), Prompt Engineering.
|
| 26 |
+
- **Backend Engineering**: FastAPI, Asynchronous Processing, Microservices, Docker Containerization.
|
| 27 |
+
- **DevOps**: CI/CD (GitHub Actions), Unit Testing (Pytest), Cloud Deployment (Hugging Face Spaces).
|
| 28 |
+
|
| 29 |
+
---
|
| 30 |
+
|
| 31 |
+
## 2. Elevator Pitch (2 Minutes)
|
| 32 |
+
|
| 33 |
+
**Context**: "Tell me about a challenging project you have built."
|
| 34 |
+
|
| 35 |
+
"I developed an **End-to-End AI E-Commerce Platform** that demonstrates the complete lifecycle of modern AI applications—from data engineering to model deployment.
|
| 36 |
+
|
| 37 |
+
The platform solves the problem of information overload in e-commerce by integrating three distinct AI capabilities into a single 'Super App':
|
| 38 |
+
1. **Intelligent Discovery**: A semantic search engine that allows users to find products using natural language descriptions (e.g., 'a philosophical sci-fi about loneliness') rather than keywords. I scaled this to over 200,000 items using **ChromaDB** for vector retrieval and **Redis** for caching, ensuring low-latency performance.
|
| 39 |
+
2. **Conversational Assistant**: A RAG-based agent that acts as a shopping assistant. It retrieves relevant product context to ground its responses, significantly reducing hallucinations compared to raw LLMs.
|
| 40 |
+
3. **Marketing Engine**: A generative module that automates the creation of marketing copy. I implemented **safety guardrails** to ensure all generated content adheres to brand policies.
|
| 41 |
+
|
| 42 |
+
Technically, the system is built as a containerized microservice using **FastAPI** and **Docker**. I focused heavily on production readiness, implementing a robust ETL pipeline to process the Amazon Books dataset and comprehensive unit testing to ensure reliability. It represents a full-stack approach to AI engineering, bridging the gap between model research and practical application."
|
| 43 |
+
|
| 44 |
+
---
|
| 45 |
+
|
| 46 |
+
## 3. Real-World Applications
|
| 47 |
+
|
| 48 |
+
### Direct Use Cases
|
| 49 |
+
| Use Case | Description |
|
| 50 |
+
| :--- | :--- |
|
| 51 |
+
| **E-Commerce Search** | Enhancing keyword search with semantic understanding (e.g., 'gifts for dad' vs. 'tie'). |
|
| 52 |
+
| **Content Recommendation** | Powering 'More Like This' features in streaming or reading platforms. |
|
| 53 |
+
| **Customer Support** | Automating Level 1 support queries using RAG to query internal knowledge bases. |
|
| 54 |
+
| **Marketing Automation** | Scaling ad copy generation for thousands of SKUs while maintaining brand voice. |
|
| 55 |
+
|
| 56 |
+
### Technical Transferability
|
| 57 |
+
- **Vector Search**: Applicable to any domain requiring semantic similarity (e.g., legal discovery, candidate matching).
|
| 58 |
+
- **RAG Agents**: Standard pattern for building domain-specific chatbots (e.g., internal HR bots).
|
| 59 |
+
- **Guardrails**: Critical for deploying GenAI in regulated industries (finance, healthcare).
|
| 60 |
+
|
| 61 |
+
---
|
| 62 |
+
|
| 63 |
+
## 4. Architecture Comparison: Personal vs. Enterprise
|
| 64 |
+
|
| 65 |
+
### Similarities
|
| 66 |
+
* **Vector Database**: Usage of specialized vector stores (ChromaDB) and HNSW indexing.
|
| 67 |
+
* **Microservices**: Separation of concerns between UI (React), API (FastAPI), and Persistence (DB).
|
| 68 |
+
* **Containerization**: Use of Docker for consistent deployment environments.
|
| 69 |
+
|
| 70 |
+
### Differences and Scalability Planning
|
| 71 |
+
| Aspect | Current Implementation | Enterprise Scale | Strategy for Scaling |
|
| 72 |
+
| :--- | :--- | :--- | :--- |
|
| 73 |
+
| **Data Scale** | 200,000 items | Billions of items | Distributed vector DBs (Milvus/Piecone), Sharding. |
|
| 74 |
+
| **Updates** | Batch Indexing | Real-time Stream | Kafka/CDC integration for incremental indexing. |
|
| 75 |
+
| **Ranking** | Single-stage ANN | Multi-stage (Recall -> Rank) | Add Learning-to-Rank (LTR) or Cross-Encoder re-ranking layer. |
|
| 76 |
+
| **Observability** | Basic Logging | Full Telemetry | Integrate Prometheus (Metrics) and Jaeger (Tracing). |
|
| 77 |
+
|
| 78 |
+
---
|
| 79 |
+
|
| 80 |
+
## 5. Technical Q&A (STAR Method)
|
| 81 |
+
|
| 82 |
+
### Q1: Why did you choose ChromaDB over other vector databases?
|
| 83 |
+
**Situation**: I needed a vector store that was lightweight, open-source, and easy to integrate for a Python-based prototype.
|
| 84 |
+
**Task**: Select a database that supports HNSW indexing and persistence without heavy infrastructure overhead.
|
| 85 |
+
**Action**: I chose **ChromaDB** because it offers an embedded mode (serverless) perfect for development, automatic tokenization/embedding management, and seamless integration with LangChain.
|
| 86 |
+
**Result**: This allowed me to iterate quickly and deploy the initial prototype to Hugging Face Spaces without managing a separate database cluster.
|
| 87 |
+
|
| 88 |
+
### Q2: How did you handle the latency issues with the large dataset?
|
| 89 |
+
**Situation**: Upon scaling to 200,000 items, I noticed that repeated queries for popular categories were causing unnecessary re-computation.
|
| 90 |
+
**Task**: Optimize the system latency to maintain sub-second response times.
|
| 91 |
+
**Action**: I implemented a **Redis caching layer**. Before hitting the vector database, the system checks Redis for a hashed key of the query parameters.
|
| 92 |
+
**Result**: This reduced the latency for frequent queries from ~400ms to <10ms, significantly improving the user experience under load.
|
| 93 |
+
|
| 94 |
+
### Q3: What is RAG and why did you use it for the Agent module?
|
| 95 |
+
**Answer**: Retrieval-Augmented Generation (RAG) is a technique to optimize LLM output by referencing an authoritative knowledge base before generating a response. I used it to prevent the Shopping Assistant from 'hallucinating' products that don't exist. By retrieving real product details from the vector index and injecting them into the prompt, the agent generates responses grounded in actual inventory data.
|
| 96 |
+
|
| 97 |
+
### Q4: How does the Zero-Shot Classification work?
|
| 98 |
+
**Answer**: Zero-Shot Classification allows a model to classify text into labels it has never seen during training. I utilized a model trained on Natural Language Inference (NLI) tasks (BART-MNLI). The model treats the classification problem as an entailment problem: does the premise (book description) entail the hypothesis ('This book is about [Label]')? This enables dynamic filtering without training a specific classifier for every new genre.
|
| 99 |
+
|
| 100 |
+
---
|
| 101 |
+
|
| 102 |
+
## 6. Technical Stack Justification
|
| 103 |
+
|
| 104 |
+
| Component | Choice | Rationale |
|
| 105 |
+
| :--- | :--- | :--- |
|
| 106 |
+
| **Orchestration** | **FastAPI** | Native async support (ASGI) is crucial for I/O-bound operations like vector search; automatic validation via Pydantic. |
|
| 107 |
+
| **Vector DB** | **ChromaDB** | Simplifies the stack by running in-process; tailored for LLM workloads. |
|
| 108 |
+
| **Cache** | **Redis** | Industry standard for key-value caching; low latency; persistence options. |
|
| 109 |
+
| **Container** | **Docker** | Ensures the complex dependency tree (PyTorch, Transformers, Redis client) works consistently across environments. |
|
| 110 |
+
| **Frontend** | **React + Vite** | Modern component-based UI with Tailwind CSS; production-grade UX with fast development cycles. |
|
| 111 |
+
|
| 112 |
+
---
|
| 113 |
+
|
| 114 |
+
## 7. Development Roadmap
|
| 115 |
+
|
| 116 |
+
### Phase 1: Foundation (Data & Search)
|
| 117 |
+
- Established ETL pipelines for the Amazon 200k dataset.
|
| 118 |
+
- Implemented core Vector Search algorithms using Sentence Transformers.
|
| 119 |
+
|
| 120 |
+
### Phase 2: Intelligence (Agent & RAG)
|
| 121 |
+
- Integrated the Conversational Shopping Agent.
|
| 122 |
+
- Implemented RAG logic to connect the search engine with the chat interface.
|
| 123 |
+
|
| 124 |
+
### Phase 3: Reliability & Productization (Current)
|
| 125 |
+
- Added Redis caching for performance at scale.
|
| 126 |
+
- Implemented Content Guardrails for the Marketing module.
|
| 127 |
+
- Finalized Docker deployment and CI/CD pipelines.
|
| 128 |
+
|
| 129 |
+
---
|
| 130 |
+
|
| 131 |
+
## 8. Behavioral Interview Stories (STAR Format)
|
| 132 |
+
|
| 133 |
+
### Story 1: Debugging Silent Failures in Data Pipelines
|
| 134 |
+
**Context**: "Tell me about a time you had to troubleshoot a difficult bug."
|
| 135 |
+
|
| 136 |
+
* **Situation**: During the ETL migration for the 200k Amazon dataset, the pipeline script would execute confidently but produce no output files, with no error messages raised.
|
| 137 |
+
* **Task**: I needed to identify why the data aggregation process was failing silently and fix it to proceed with the project integration.
|
| 138 |
+
* **Action**: I conducted a root cause analysis and discovered two issues:
|
| 139 |
+
1. The script lacked a main execution block (`if __name__ == "__main__":`), meaning the functions were defined but never called.
|
| 140 |
+
2. After fixing the entry point, a data type mismatch occurred where a Pandas Series was being treated as a DataFrame.
|
| 141 |
+
I refactored the aggregation logic and, crucially, added **tqdm progress bars** to the `src/vector_db.py` loop.
|
| 142 |
+
* **Result**: The fix allowed the 2.7GB dataset to be processed correctly. The addition of progress bars provided immediate visual feedback on the system's state, preventing future "silent" wait times and improving developer experience.
|
| 143 |
+
|
| 144 |
+
### Story 2: Managing Technical Debt during Integration
|
| 145 |
+
**Context**: "Describe a time you had to refactor a complex codebase."
|
| 146 |
+
|
| 147 |
+
* **Situation**: I needed to integrate three distinct AI modules (`llm-recsys`, `marketing-engine`, `recommender`) into a single "Super App". Each had conflicting dependencies and directory structures (e.g., duplicate `src` folders).
|
| 148 |
+
* **Task**: My goal was to create a unified monorepo without breaking the existing functionality of the individual components.
|
| 149 |
+
* **Action**:
|
| 150 |
+
1. I adopted a strict modular architecture, renaming conflicting directories (e.g., `src/recommender/zero_shot` -> `src/zero_shot`) to avoid namespace collisions.
|
| 151 |
+
### Story 3: The "Mutex Lock" Dependency Hell (Debugging)
|
| 152 |
+
**Context**: "Tell me about a time you solved a complex environment issue."
|
| 153 |
+
|
| 154 |
+
* **Situation**: While deploying the vector database builder on a MacBook M1 (Apple Silicon), the application would persistently hang with a `[mutex.cc : 452] RAW: Lock blocking` error, with no Python stack trace.
|
| 155 |
+
* **Task**: Identify the root cause of the deadlock that was preventing the application from initializing the embedding model.
|
| 156 |
+
* **Action**:
|
| 157 |
+
1. I suspected a low-level threading conflict and first tried restricting OpenMP threads (`OMP_NUM_THREADS=1`), but the issue persisted.
|
| 158 |
+
2. I created a minimal reproduction script (`debug_env.py`) isolating the `sentence-transformers` import.
|
| 159 |
+
3. Through binary search of installed packages, I discovered a known conflict between **TensorFlow 2.16+** and **PyArrow** on macOS ARM architecture, which triggers a mutex deadlock when both are loaded (even if TF isn't used!).
|
| 160 |
+
4. Since my project relies on PyTorch, TensorFlow was an unnecessary transitive dependency.
|
| 161 |
+
* **Result**: I uninstalled TensorFlow, which immediately resolved the deadlock. I then re-enabled **MPS (Metal Performance Shaders)** acceleration, reducing the 200k indexing time from 20 minutes (CPU) to <3 minutes (GPU). This taught me to audit environments ruthlessly and remove unused heavy dependencies.
|
| 162 |
+
|
| 163 |
+
### Story 4: The Cloud Deployment Gauntlet
|
| 164 |
+
**Context**: "Tell me about a time you deployed a complex ML system to production."
|
| 165 |
+
|
| 166 |
+
* **Situation**: I needed to deploy the Book Recommender to a domestic GPU cloud server (AutoDL) to leverage NVIDIA RTX GPUs for indexing 200,000 documents. The environment was restrictive: transparent proxies blocked HuggingFace, system disks were tiny (20GB), and the pre-installed Python environment was filled with conflicting legacy packages.
|
| 167 |
+
* **Task**: Configure a robust production environment and establish a reliable CI/CD-like workflow for model and data provisioning.
|
| 168 |
+
* **Action**:
|
| 169 |
+
1. **Environment Isolation**: Instead of fighting the corrupted base image, I utilized Conda to create a fresh, isolated Python 3.10 environment, identifying and pinning critical dependencies (`huggingface-hub>=0.23.0`) to resolve a mismatch with modern Transformers libraries.
|
| 170 |
+
2. **Network Engineering**: I bypassed the "Great Firewall" restrictions by creating a custom loader script that utilized the official `hf-mirror.com` endpoint with aggressive timeouts and resumable download logic.
|
| 171 |
+
3. **Data Strategy**: To avoid transmitting the 2.7GB raw dataset over a slow SSH connection (which would take 4 hours), I developed a pre-processing strategy to compress and upload only the 200MB essential metadata CSVs, reducing transfer time to <1 minute.
|
| 172 |
+
4. **Access Security**: Instead of exposing the API publicly, I established an **SSH Tunnel** to securely map the remote Swagger UI to my local machine for verification.
|
| 173 |
+
* **Result**: Successfully built the 220,000-document vector index in just **6 minutes** (vs hour+ on CPU) and verified the end-to-end API functionality. This experience solidified my skills in Linux system administration and remote ML Ops.
|
docs/future_roadmap.md
ADDED
|
@@ -0,0 +1,70 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Advanced RAG Architecture: Future Roadmap
|
| 2 |
+
|
| 3 |
+
This document outlines the technical evolution path for the Book Recommender system, moving from a standard RAG demo to an enterprise-grade intelligent system.
|
| 4 |
+
|
| 5 |
+
## 1. Knowledge Representation: GraphRAG
|
| 6 |
+
|
| 7 |
+
**The Problem**: Vector search handles "similarity" well but fails at "connectivity" and structural reasoning (e.g., "Find hard sci-fi like *Three Body Problem* but discussing the *Fermi Paradox*").
|
| 8 |
+
|
| 9 |
+
**The Solution**:
|
| 10 |
+
- **Graph Construction**: Use LLM to extract entities (Book, Author, Genre) and relationships (Series, Influenced_By, Theme, Adapted_From) into a Knowledge Graph (e.g., Neo4j or NetworkX).
|
| 11 |
+
- **Graph-Enhanced Retrieval**:
|
| 12 |
+
1. **Traversal**: Perform multi-hop traversal to find structurally related books (e.g., query -> "Hard Sci-Fi" node -> "Fermi Paradox" theme node -> candidate books).
|
| 13 |
+
2. **Fusion**: Combine Graph Candidates with Vector Similarity Candidates for final ranking.
|
| 14 |
+
|
| 15 |
+
**Key Value**: Solves "Semantic Drift" in long-tail recommendations and enables reasoning over interconnected data.
|
| 16 |
+
|
| 17 |
+
---
|
| 18 |
+
|
| 19 |
+
## 2. Retrieval Precision: Domain-Specific Embeddings
|
| 20 |
+
|
| 21 |
+
**The Problem**: General-purpose embeddings (like OpenAI `text-embedding-3`) conflate domain-specific sentiments. In book reviews, "Sad" might mean "Depressing" (negative) or "Cathartic/Moving" (positive).
|
| 22 |
+
|
| 23 |
+
**The Solution**:
|
| 24 |
+
- **Contrastive Fine-Tuning**: Construct `(Query, Positive_Book, Negative_Book)` triplets from the user rating data (`Books_rating.csv`). Fine-tune a model like BGE or Sentence-BERT to learn the specific semantic space of book reviews.
|
| 25 |
+
- **Matryoshka Embeddings**: Train variable-length embeddings.
|
| 26 |
+
- Use short vectors (e.g., 64d) for extremely fast initial retrieval (10x speedup).
|
| 27 |
+
- Use full vectors (e.g., 768d) for precision reranking of the top candidates.
|
| 28 |
+
|
| 29 |
+
**Key Value**: Domain Adaptation (estimated +15% Recall) and significant Cost/Latency Efficiency.
|
| 30 |
+
|
| 31 |
+
---
|
| 32 |
+
|
| 33 |
+
## 3. System Architecture: Agentic RAG
|
| 34 |
+
|
| 35 |
+
**The Problem**: Linear RAG pipelines (`Query -> Retrieve -> Generate`) fail on complex, multi-dimensional questions (e.g., "Compare the author's early vs. late writing style").
|
| 36 |
+
|
| 37 |
+
**The Solution**:
|
| 38 |
+
- **Router Agent**: Analyzes query complexity to route the request:
|
| 39 |
+
- *Simple*: Direct Vector Search.
|
| 40 |
+
- *Complex*: Knowledge Graph Traversal + Vector Search.
|
| 41 |
+
- *External*: Web Search (Google Books API) for missing/real-time info.
|
| 42 |
+
- **Self-Correction (Self-RAG)**: The Agent evaluates its own retrieved documents. If they are irrelevant or insufficient, it rewrites the search query and tries again before attempting to answer.
|
| 43 |
+
|
| 44 |
+
**Key Value**: Solves "Hallucination" and enables handling of complex, investigative queries.
|
| 45 |
+
|
| 46 |
+
---
|
| 47 |
+
|
| 48 |
+
## 4. Cost & Performance: Context Compression
|
| 49 |
+
|
| 50 |
+
**The Problem**: Feeding large amounts of raw text (e.g., 50 full book reviews) to an LLM is expensive, slow, and causes "Lost in the Middle" (attention gradation) issues.
|
| 51 |
+
|
| 52 |
+
**The Solution**:
|
| 53 |
+
- **Compression Pipeline**: `Retrieval -> [Cross-Encoder / Summarizer Model] -> LLM`. Extract only the most relevant sentences/segments from the retrieved docs before sending to the LLM.
|
| 54 |
+
- **KV Cache Optimization**: For multi-turn chat, dynamically summarize the conversation history to maintain long-term context without linear growth in token usage.
|
| 55 |
+
|
| 56 |
+
**Key Value**: Up to 60% Token Cost Reduction and improved model attention/accuracy.
|
| 57 |
+
|
| 58 |
+
---
|
| 59 |
+
|
| 60 |
+
## 5. Recommendation Logic: Temporal Dynamics
|
| 61 |
+
|
| 62 |
+
**The Problem**: User profiles are often treated as static. The system doesn't distinguish between a book liked 5 years ago and one liked yesterday.
|
| 63 |
+
|
| 64 |
+
**The Solution**:
|
| 65 |
+
- **Decay Embeddings**: Apply time-decay functions to user interactions when building the User Profile Vector (Recent interactions > Historical ones).
|
| 66 |
+
- **Dual-Slot Profile**: Separate the user profile into:
|
| 67 |
+
- "Long-term Preference" (Stability/Identity)
|
| 68 |
+
- "Short-term Interest" (Burstiness/Current Mood)
|
| 69 |
+
|
| 70 |
+
**Key Value**: Solves "Recommendation Lag" and better captures user Interest Drift.
|
docs/interview_deep_dive.md
ADDED
|
@@ -0,0 +1,82 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Interview Deep Dive: Book Recommender Analysis
|
| 2 |
+
**Framework**: Based on "LLM Application Landing" (SFT vs RAG) criteria.
|
| 3 |
+
|
| 4 |
+
---
|
| 5 |
+
|
| 6 |
+
## I. Project Classification
|
| 7 |
+
**Type**: **Agentic RAG** (Retrieval-Augmented Generation with Router Control).
|
| 8 |
+
* *Not just "RAG"*: It includes a decision layer (`QueryRouter`) that changes strategy based on input.
|
| 9 |
+
* *Not just "Search"*: It generates grounded responses using retrieved context.
|
| 10 |
+
|
| 11 |
+
---
|
| 12 |
+
|
| 13 |
+
## II. RAG Technical Depth (The "Meat")
|
| 14 |
+
|
| 15 |
+
### 1. Architecture Type
|
| 16 |
+
* **Our Choice**: **Agentic RAG** (Router -> Branching Logic).
|
| 17 |
+
* **Why?**:
|
| 18 |
+
* A simple "Retriever-Generator" chain failed on **Exact Intents** (ISBNs) and **Freshness** queries.
|
| 19 |
+
* We needed dynamic logic: "If specific ID, use precise tool; If vague feeling, use semantic tool."
|
| 20 |
+
* *Common Interview Question*: "Why didn't you use GraphRAG?"
|
| 21 |
+
* *Answer*: "Overkill for MVP. Entities (Books) are independent atoms; we don't heavily rely on multi-hop relationships (e.g., 'Books written by the friend of the author of X'). Agentic Routing solved 80% of edge cases with 1% of the complexity."
|
| 22 |
+
|
| 23 |
+
### 2. Knowledge Base Construction (The Foundation)
|
| 24 |
+
* **Strategy**: **Atomic Documents** (Structure-Aware).
|
| 25 |
+
* *Implementation*: Instead of fixed-size chunking (e.g., 512 tokens), we treated each **Book** as a single atomic unit.
|
| 26 |
+
* *Content Construction*: `Title` + `Author` + `Description` + `Review Highlights` + `Emotions`.
|
| 27 |
+
* **Why not Fixed Chunking?**:
|
| 28 |
+
* Users search for *whole books*, not *fragments of a paragraph* inside a book description.
|
| 29 |
+
* *Trade-off*: We sacrifice granularity for context integrity.
|
| 30 |
+
* *Optimization*: We injected `Review Highlights` (User Opinions) into the text representation to allow semantic matching on "vibe" (e.g., "readers hate the ending").
|
| 31 |
+
|
| 32 |
+
### 3. Retrieval Strategy Optimization (The Core Battlefield)
|
| 33 |
+
* **A. User Intent Recognition**:
|
| 34 |
+
* *Tech*: RegEx & Keyword Routing (`src/core/router.py`).
|
| 35 |
+
* *Logic*: Distinguishes **Identificational** (ISBN), **Informational** (Topic), and **Recency** (Latest) queries.
|
| 36 |
+
* **B. Hybrid Search**:
|
| 37 |
+
* *Tech*: Reciprocal Rank Fusion (RRF) of BM25 (Sparse) + Chroma (Dense).
|
| 38 |
+
* *Why*: Dense vectors are bad at exact numbers (ISBNs) and rare proper nouns. BM25 covers this blind spot.
|
| 39 |
+
* **C. Reranking (Precision)**:
|
| 40 |
+
* *Tech*: Cross-Encoder (`ms-marco-MiniLM`).
|
| 41 |
+
* *Impact*: Moved semantic "noise" chunks down. Fixed the "Harry Potter Philosophy vs Sorcerer's Stone" relevance issue.
|
| 42 |
+
* **D. Non-Semantic Scoring**:
|
| 43 |
+
* *Tech*: **Temporal Dynamics** (Time Decay).
|
| 44 |
+
* *Logic*: $Score \times (1 + \frac{1}{\log(Age)})$.
|
| 45 |
+
* *Why*: Relevance isn't just "Topic Match"; for technology/news, "Newness" *is* relevance.
|
| 46 |
+
|
| 47 |
+
### 4. Generation Optimization
|
| 48 |
+
* **Prompt Engineering**:
|
| 49 |
+
* *Structure*: "Librarian Persona" + Strict Context Boundary ("If not in context, state general knowledge").
|
| 50 |
+
* **Context Compression**:
|
| 51 |
+
* *Problem*: Multi-turn chat exhausts token windows.
|
| 52 |
+
* *Solution*: Summarization of older turns + Raw retention of recent turns.
|
| 53 |
+
* *Trade-off*: Loss of specific wording in old turns vs. ability to sustain infinite conversation.
|
| 54 |
+
|
| 55 |
+
### 5. Post-Deployment Engineering
|
| 56 |
+
* **Observability**:
|
| 57 |
+
* *Tech*: Prometheus Middleware.
|
| 58 |
+
* *Metrics*: Latency (P99), Request Count, Error Rate.
|
| 59 |
+
* **Feedback Loop**:
|
| 60 |
+
* *User Signal*: "Add to Favorites" serves as implicit positive feedback.
|
| 61 |
+
* *Future*: This data could train a **Reward Model** for RLHF/DPO.
|
| 62 |
+
|
| 63 |
+
---
|
| 64 |
+
|
| 65 |
+
## III. SFT Potential (Where to go next?)
|
| 66 |
+
*If asked: "How would you use SFT to improve this?"*
|
| 67 |
+
|
| 68 |
+
1. **Data Design**:
|
| 69 |
+
* Construct `(User Query, Retrieved Context, Ideal Librarian Response)` triplets.
|
| 70 |
+
* **Goal**: Train the model to adopt a specific "Literary Critic" tone that default GPT-3.5 lacks.
|
| 71 |
+
2. **DPO (Direct Preference Optimization)**:
|
| 72 |
+
* Use the "Refused Recommendations" (users *didn't* click) vs "Accepted Recommendations" (users added to shelf) to construct Preference Pairs ($y_w, y_l$).
|
| 73 |
+
* Fine-tune the model to align with *successful* recommendation justifications.
|
| 74 |
+
|
| 75 |
+
---
|
| 76 |
+
|
| 77 |
+
## IV. The "Golden Thread" Narrative
|
| 78 |
+
**Motivation**: "I wanted to solve the 'Paradox of Choice' in book discovery—users know what they feel ('sad sci-fi') but search engines only understand keywords."
|
| 79 |
+
|
| 80 |
+
**Trade-off Highlight**: "I chose an **Embedded Vector DB** (Chroma) over a Service (Pinecone) to achieve **Zero Network Latency** and simplify the Ops stack, knowing the dataset (<1M books) fits easily in memory."
|
| 81 |
+
|
| 82 |
+
**Result**: "An Agentic system that corrects its own retrieval strategy, achieving 100% recall on ISBNs while maintaining deep semantic understanding."
|
docs/project_narrative.md
ADDED
|
@@ -0,0 +1,58 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Project Narrative & Strategic Thinking
|
| 2 |
+
**Role**: End-to-End ML Engineer / AI Engineer
|
| 3 |
+
**Framework**: Surface (What) -> Middle (How) -> Deep (Why & Trade-offs)
|
| 4 |
+
|
| 5 |
+
---
|
| 6 |
+
|
| 7 |
+
## 1. Surface Level: The "What"
|
| 8 |
+
**Goal**: Define the tangible product and its unique value proposition.
|
| 9 |
+
|
| 10 |
+
* **Definition**: An "Intelligent Book Concierge Platform" (Not just a search engine).
|
| 11 |
+
* **Core Feature**: **Agentic RAG**. The system doesn't just match keywords; it understands intent, temporal context ("newest books"), and complex queries ("sad sci-fi about AI").
|
| 12 |
+
* **User Experience**:
|
| 13 |
+
* **Semantic Search**: "Heartbreaking WWII stories" works as well as "Harry Potter".
|
| 14 |
+
* **Interactive Chat**: Ask follow-up questions ("Is this suitable for kids?") and get grounded answers.
|
| 15 |
+
* **Personalization**: The system learns from your "Favorites" to adjust recommendations.
|
| 16 |
+
|
| 17 |
+
---
|
| 18 |
+
|
| 19 |
+
## 2. Middle Level: The "How"
|
| 20 |
+
**Goal**: Demonstrate engineering depth and optimization strategies.
|
| 21 |
+
|
| 22 |
+
### Architecture Flow
|
| 23 |
+
1. **Router Agent**: Classifies intent (ISBN vs. Keyword vs. Deep Question) to select the cheapest/best tool.
|
| 24 |
+
2. **Hybrid Retrieval**: Fuses **BM25** (Exact Match) and **ChromaDB** (Semantic Match) via Reciprocal Rank Fusion (RRF).
|
| 25 |
+
3. **Precision Layer**: Uses a **Cross-Encoder** to rerank the top 50 results for deep semantic relevance.
|
| 26 |
+
4. **Temporal Dynamics**: Applies a mathematical decay function to boost newer content when appropriate.
|
| 27 |
+
5. **Memory**: Compresses conversation history to allow infinite chat turns without token overflow.
|
| 28 |
+
|
| 29 |
+
### Key Innovations
|
| 30 |
+
* **No "False AI"**: Unlike simple keyword apps, this uses real-time vector embeddings and LLM reasoning.
|
| 31 |
+
* **Hallucination Control**: Strict RAG pipeline forces the LLM to cite its sources (book descriptions/reviews).
|
| 32 |
+
|
| 33 |
+
---
|
| 34 |
+
|
| 35 |
+
## 3. Deep Level: The "Architecture & Trade-offs"
|
| 36 |
+
**Goal**: Showcase architectural vision and system design skills.
|
| 37 |
+
|
| 38 |
+
### Tech Stack Decisions
|
| 39 |
+
* **Vector DB (ChromaDB)**:
|
| 40 |
+
* *Decision*: Embedded (In-Process) database.
|
| 41 |
+
* *Trade-off*: Sacrificed horizontal scalability for **Zero Network Latency** and zero-ops complexity. Perfect for the <1M dataset size.
|
| 42 |
+
* **Hybrid Search (Sparse + Dense)**:
|
| 43 |
+
* *Decision*: Implemented custom RRF fusion.
|
| 44 |
+
* *Why*: Pure Vector Search failed at Specific IDs (ISBNs). Pure BM25 failed at "vibe" searches. Hybrid captures 100% of cases.
|
| 45 |
+
* **Agentic Routing**:
|
| 46 |
+
* *Decision*: Rule-based Regex/Keyword Router.
|
| 47 |
+
* *Trade-off*: Chose deterministic rules over an "LLM Router" to save latency (2ms vs 500ms) and cost.
|
| 48 |
+
|
| 49 |
+
### Future Scalability
|
| 50 |
+
* **Vertical Scaling**: The current in-memory index fits in 2GB RAM. Can scale to ~5M books on a standard server.
|
| 51 |
+
* **Horizontal Scaling**: Easy migration path to Qdrant/Pinecone if user base grows >10k concurrent users.
|
| 52 |
+
|
| 53 |
+
---
|
| 54 |
+
|
| 55 |
+
## 4. Success Metrics
|
| 56 |
+
1. **Recall**: 100% on Exact Matches (ISBNs) via Router fix.
|
| 57 |
+
2. **Relevance**: Qualitative improvement in "Deep" queries via Cross-Encoder.
|
| 58 |
+
3. **Latency**: Sub-second (600ms) for typical queries; <3s for complex reasoning.
|
docs/rag_architecture.md
ADDED
|
@@ -0,0 +1,86 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Advanced RAG Architecture: Technical Overview
|
| 2 |
+
**Project**: Book Recommender with LLMs
|
| 3 |
+
**Date**: Jan 2026
|
| 4 |
+
|
| 5 |
+
## 1. System Overview
|
| 6 |
+
This project implements an **Agentic RAG (Retrieval-Augmented Generation)** system designed to overcome the limitations of standard semantic search. It uses a **Self-Reliant Router** to dynamically select the optimal retrieval strategy based on user intent.
|
| 7 |
+
|
| 8 |
+
### Key Capabilities
|
| 9 |
+
- **Exact Match**: Zero-error retrieval for ISBNs and specific IDs.
|
| 10 |
+
- **Deep Understanding**: Semantic search + Reranking for complex queries.
|
| 11 |
+
- **Temporal Awareness**: Recency bias for "latest/new" queries.
|
| 12 |
+
- **Efficient Memory**: Token-saving context compression.
|
| 13 |
+
|
| 14 |
+
---
|
| 15 |
+
|
| 16 |
+
## 2. Architecture Pipeline
|
| 17 |
+
|
| 18 |
+
```mermaid
|
| 19 |
+
graph TD
|
| 20 |
+
UserQuery[User Query] --> Router{Query Router}
|
| 21 |
+
|
| 22 |
+
%% Strategy 1: Exact
|
| 23 |
+
Router -- "ISBN Detected" --> Exact[BM25 Sparse Only]
|
| 24 |
+
Exact --> Result
|
| 25 |
+
|
| 26 |
+
%% Strategy 2: Fast
|
| 27 |
+
Router -- "Keywords (Short)" --> Fast[Hybrid Search (No Rerank)]
|
| 28 |
+
Fast --> Result
|
| 29 |
+
|
| 30 |
+
%% Strategy 3: Deep
|
| 31 |
+
Router -- "Natural Language" --> Hybrid[Hybrid Search (BM25 + Dense)]
|
| 32 |
+
Hybrid --> Fusion[Reciprocal Rank Fusion (RRF)]
|
| 33 |
+
Fusion --> Top50[Top 50 Candidates]
|
| 34 |
+
Top50 --> Rerank[Cross-Encoder (ms-marco-MiniLM)]
|
| 35 |
+
|
| 36 |
+
%% Temporal Layer
|
| 37 |
+
Rerank --> Temporal{Temporal Keywords?}
|
| 38 |
+
Temporal -- "Yes (e.g. 'latest')" --> Decay[Apply Time Decay Boost]
|
| 39 |
+
Temporal -- "No" --> RankScore
|
| 40 |
+
Decay --> RankScore[Final Top K]
|
| 41 |
+
|
| 42 |
+
RankScore --> Result[Context for LLM]
|
| 43 |
+
```
|
| 44 |
+
|
| 45 |
+
## 3. Component Details
|
| 46 |
+
|
| 47 |
+
### 3.1. Hybrid Search (The Foundation)
|
| 48 |
+
Combines **Sparse Retrieval (BM25)** and **Dense Retrieval (ChromaDB/All-MiniLM)** using **Reciprocal Rank Fusion (RRF)**.
|
| 49 |
+
- **Why?**: Dense vectors fail at exact keyword matching (e.g., "Harry Potter"). BM25 fails at semantic understanding. Together, they cover 100% of use cases.
|
| 50 |
+
- **Implementation**: `src/vector_db.py`
|
| 51 |
+
|
| 52 |
+
### 3.2. Cross-Encoder Reranking (The Refiner)
|
| 53 |
+
A second-stage pass using `cross-encoder/ms-marco-MiniLM-L-6-v2`.
|
| 54 |
+
- **Why?**: Bi-Encoders (Vectors) are fast but approximate. Cross-Encoders are slow but highly accurate. We only rerank the top 20-50 results.
|
| 55 |
+
- **Impact**: Improved precision for complex queries (e.g., distinguishing "Philosophy of Harry Potter" from "Harry Potter and the Sorcerer's Stone").
|
| 56 |
+
- **Implementation**: `src/core/reranker.py`
|
| 57 |
+
|
| 58 |
+
### 3.3. Agentic Router (The Brain)
|
| 59 |
+
Classifies input using Regex and Keyword analysis to short-circuit expensive steps.
|
| 60 |
+
- **Strategies**:
|
| 61 |
+
- **EXACT**: `alpha=1.0` (BM25 Only). Solves the "Exact Match" regression.
|
| 62 |
+
- **FAST**: `rerank=False`. < 500ms latency for simple lookups.
|
| 63 |
+
- **DEEP**: `rerank=True`. Full power for reasoning tasks.
|
| 64 |
+
- **Implementation**: `src/core/router.py`
|
| 65 |
+
|
| 66 |
+
### 3.4. Temporal Dynamics (The Bias)
|
| 67 |
+
Applies a log-linear decay function to boost newer documents.
|
| 68 |
+
- **Formula**: $Score_{new} = Score_{old} + \frac{2.0}{\ln(Age + 2.718)}$
|
| 69 |
+
- **Trigger**: Activated by words like "new", "latest", "2024".
|
| 70 |
+
- **Implementation**: `src/core/temporal.py`
|
| 71 |
+
|
| 72 |
+
### 3.5. Context Compression (The Memory)
|
| 73 |
+
Summarizes conversation history when it exceeds token limits.
|
| 74 |
+
- **Logic**: Retains the last 2 turns (4 messages) raw; summarizes everything older using a lightweight LLM call.
|
| 75 |
+
- **Implementation**: `src/core/context_compressor.py`
|
| 76 |
+
|
| 77 |
+
## 4. Performance Benchmarks
|
| 78 |
+
| Metric | Baseline (Dense) | Advanced (hybrid+Rerank) |
|
| 79 |
+
| :--- | :--- | :--- |
|
| 80 |
+
| **ISBN Success Rate** | 0% (Fail) | **100%** (via Router) |
|
| 81 |
+
| **Keyword Precision** | Low | **High** |
|
| 82 |
+
| **Latency (Avg)** | 20ms | 600ms - 1.2s |
|
| 83 |
+
|
| 84 |
+
## 5. Future Roadmap
|
| 85 |
+
- **GraphRAG**: For multi-hop reasoning across books.
|
| 86 |
+
- **Fine-tuning**: Domain-specific embedding adapter.
|
docs/technical_deep_dive_sota.md
ADDED
|
@@ -0,0 +1,197 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Technical Deep Dive: SOTA Techniques for Advanced RAG & SFT
|
| 2 |
+
**Date**: 2026-01-08
|
| 3 |
+
**Motivation**: Address the remaining gaps in the Book Recommender to achieve "Resume-Grade" technical depth.
|
| 4 |
+
|
| 5 |
+
---
|
| 6 |
+
|
| 7 |
+
## Part I: SFT Data Pipeline (Style Alignment)
|
| 8 |
+
|
| 9 |
+
### 1.1 Problem Definition
|
| 10 |
+
**Current State**: The LLM responds in a generic, corporate tone.
|
| 11 |
+
**Desired State**: The LLM should speak like a passionate *Literary Critic* — emotional, opinionated, evocative.
|
| 12 |
+
|
| 13 |
+
**Why SFT (not just Prompting)?**
|
| 14 |
+
- Prompting can only do so much ("Be enthusiastic") — it doesn't teach the model *how* critics structure their arguments.
|
| 15 |
+
- SFT embeds the *style distribution* directly into the model's weights.
|
| 16 |
+
|
| 17 |
+
### 1.2 SOTA Technique: Self-Instruct with LLM-as-a-Judge
|
| 18 |
+
|
| 19 |
+
**References**:
|
| 20 |
+
- [Self-Instruct (Wang et al., 2022)](https://arxiv.org/abs/2212.10560): Generate instructions from seed data.
|
| 21 |
+
- [UltraChat (Ding et al., 2023)](https://arxiv.org/abs/2305.14233): Large-scale multi-turn dialogue synthesis.
|
| 22 |
+
- [Alpaca (Stanford, 2023)](https://crfm.stanford.edu/2023/03/13/alpaca.html): Instruction-following via distillation.
|
| 23 |
+
|
| 24 |
+
**Pipeline Design**:
|
| 25 |
+
|
| 26 |
+
```
|
| 27 |
+
┌─────────────────────────────────────────────────────────────────┐
|
| 28 |
+
│ SFT Data Synthesis Pipeline │
|
| 29 |
+
├─────────────────────────────────────────────────────────────────┤
|
| 30 |
+
│ 1. Seed Selection │
|
| 31 |
+
│ - Sample 1000 high-emotion reviews (rating=5, length>200) │
|
| 32 |
+
│ - Filter for reviews with subjective language (e.g. "I felt")│
|
| 33 |
+
├─────────────────────────────────────────────────────────────────┤
|
| 34 |
+
│ 2. Instruction Evolution (Self-Instruct) │
|
| 35 |
+
│ - Prompt GPT-4: "Given this review, generate a user question│
|
| 36 |
+
│ that would have prompted this recommendation." │
|
| 37 |
+
│ - Result: (Query, Review) pairs │
|
| 38 |
+
├─────────────────────────────────────────────────────────────────┤
|
| 39 |
+
│ 3. Response Transformation │
|
| 40 |
+
│ - Prompt GPT-4: "Rewrite the review as if you are an AI │
|
| 41 |
+
│ book concierge, keeping the emotional depth and specific │
|
| 42 |
+
│ evidence. Do NOT add external knowledge." │
|
| 43 |
+
│ - Result: (Query, AI Response) pairs │
|
| 44 |
+
├─────────────────────────────────────────────────────────────────┤
|
| 45 |
+
│ 4. Quality Filtering (LLM-as-a-Judge) │
|
| 46 |
+
│ - Prompt GPT-4: "Rate this dialogue on: Empathy (1-10), │
|
| 47 |
+
│ Specificity (1-10), Critique Depth (1-10). Explain." │
|
| 48 |
+
│ - Threshold: Keep only samples with average >= 8. │
|
| 49 |
+
├─────────────────────────────────────────────────────────────────┤
|
| 50 |
+
│ 5. DPO Pair Construction (Optional) │
|
| 51 |
+
│ - For each (Query, Response), generate a "Rejected" response│
|
| 52 |
+
│ by prompting GPT-4: "Rewrite this in a boring, generic way"│
|
| 53 |
+
│ - Result: (Query, Chosen, Rejected) triplets for DPO. │
|
| 54 |
+
└─────────────────────────────────────────────────────────────────┘
|
| 55 |
+
```
|
| 56 |
+
|
| 57 |
+
**Expected Output**:
|
| 58 |
+
- `data/sft/literary_critic_train.jsonl`: ~800 high-quality (Query, Response) pairs.
|
| 59 |
+
- `data/dpo/preference_pairs.jsonl`: ~500 (Chosen, Rejected) pairs.
|
| 60 |
+
|
| 61 |
+
**Interview Talking Point**:
|
| 62 |
+
> "I didn't just use the dataset as-is. I designed a data synthesis pipeline to evolve raw user reviews into instruction-following format, then applied LLM-as-a-Judge to filter for quality. This is the same approach used in Stanford Alpaca and Meta's Llama-2 post-training."
|
| 63 |
+
|
| 64 |
+
---
|
| 65 |
+
|
| 66 |
+
## Part II: Advanced RAG (Small-to-Big Retrieval)
|
| 67 |
+
|
| 68 |
+
### 2.1 Problem Definition
|
| 69 |
+
**Current State**: Each book is indexed as ONE atomic chunk (~500 tokens).
|
| 70 |
+
**Failure Case**: User asks "Book where the narrator is unreliable and you only realize at the end" — this detail is buried in a *specific review*, not the book description.
|
| 71 |
+
|
| 72 |
+
**Why Small-to-Big?**
|
| 73 |
+
- Small chunks have higher semantic precision (they match the query better).
|
| 74 |
+
- But small chunks alone lack *context* — the LLM needs the full book info to answer.
|
| 75 |
+
- Solution: **Retrieve Small, Return Big**.
|
| 76 |
+
|
| 77 |
+
### 2.2 SOTA Technique: Parent-Child Document Retrieval
|
| 78 |
+
|
| 79 |
+
**References**:
|
| 80 |
+
- [LlamaIndex: Recursive Retrieval](https://docs.llamaindex.ai/): Parent-child document linking.
|
| 81 |
+
- [RAPTOR (Sarthi et al., 2024)](https://arxiv.org/abs/2401.18059): Hierarchical tree-based indexing.
|
| 82 |
+
- [Multi-Vector Retriever (LangChain)](https://python.langchain.com/): Separate index for summaries vs full docs.
|
| 83 |
+
|
| 84 |
+
**Architecture Design**:
|
| 85 |
+
|
| 86 |
+
```
|
| 87 |
+
┌─────────────────────────────────────────────────────────────────┐
|
| 88 |
+
│ Small-to-Big Retrieval Architecture │
|
| 89 |
+
├─────────────────────────────────────────────────────────────────┤
|
| 90 |
+
│ │
|
| 91 |
+
│ ┌─────────────────────────────────────────────────────────┐ │
|
| 92 |
+
│ │ CHILD INDEX (Review Chunks) │ │
|
| 93 |
+
│ │ - Each review split into 1-3 sentences (~100 tokens) │ │
|
| 94 |
+
│ │ - Metadata: { "parent_isbn": "9780123456789" } │ │
|
| 95 |
+
│ │ - Stored in: ChromaDB (collection: "review_chunks") │ │
|
| 96 |
+
│ └─────────────────────────────────────────────────────────┘ │
|
| 97 |
+
│ │ │
|
| 98 |
+
│ │ similarity_search(query) │
|
| 99 |
+
│ ▼ │
|
| 100 |
+
│ ┌─────────────────────────────────────────────────────────┐ │
|
| 101 |
+
│ │ MATCH: Review Chunk #42 │ │
|
| 102 |
+
│ │ "The twist about the simulation was mind-blowing..." │ │
|
| 103 |
+
│ │ Metadata: { "parent_isbn": "9780123456789" } │ │
|
| 104 |
+
│ └─────────────────────────────────────────────────────────┘ │
|
| 105 |
+
│ │ │
|
| 106 |
+
│ │ lookup parent_isbn │
|
| 107 |
+
│ ▼ │
|
| 108 |
+
│ ┌─────────────────────────────────────────────────────────┐ │
|
| 109 |
+
│ │ PARENT INDEX (Full Books) │ │
|
| 110 |
+
│ │ - Full book metadata: Title, Author, Description, │ │
|
| 111 |
+
│ │ Review Highlights, Categories, Emotions │ │
|
| 112 |
+
│ │ - Stored in: ChromaDB (collection: "books") │ │
|
| 113 |
+
│ └─────────────────────────────────────────────────────────┘ │
|
| 114 |
+
│ │ │
|
| 115 |
+
│ ▼ │
|
| 116 |
+
│ ┌─────────────────────────────────────────────────────────┐ │
|
| 117 |
+
│ │ RETURN: Full Book Context │ │
|
| 118 |
+
│ │ Title: "Dark Matter" │ │
|
| 119 |
+
│ │ Author: "Blake Crouch" │ │
|
| 120 |
+
│ │ Description: "A physicist is abducted into..." │ │
|
| 121 |
+
│ │ (Sent to LLM as RAG context) │ │
|
| 122 |
+
│ └─────────────────────────────────────────────────────────┘ │
|
| 123 |
+
│ │
|
| 124 |
+
└─────────────────────────────────────────────────────────────────┘
|
| 125 |
+
```
|
| 126 |
+
|
| 127 |
+
**Implementation Plan**:
|
| 128 |
+
|
| 129 |
+
1. **Chunking Script** (`scripts/chunk_reviews.py`):
|
| 130 |
+
- Read `review_highlights.txt` (format: `ISBN review_text`).
|
| 131 |
+
- Split each review into sentences using NLTK or spaCy.
|
| 132 |
+
- Output: `data/review_chunks.jsonl` with `{ "text": "...", "isbn": "..." }`.
|
| 133 |
+
|
| 134 |
+
2. **Dual Index Initialization** (`scripts/init_dual_index.py`):
|
| 135 |
+
- Create ChromaDB collection `review_chunks` with the sentence-level data.
|
| 136 |
+
- Keep existing `books` collection for parent lookup.
|
| 137 |
+
|
| 138 |
+
3. **Retrieval Logic Update** (`src/vector_db.py`):
|
| 139 |
+
- New method: `small_to_big_search(query, k=5)`.
|
| 140 |
+
- Step 1: Query `review_chunks` collection → Get top-k chunk matches.
|
| 141 |
+
- Step 2: Extract unique `parent_isbn` from matches.
|
| 142 |
+
- Step 3: Fetch full book info from `books` collection using ISBN filter.
|
| 143 |
+
|
| 144 |
+
**Interview Talking Point**:
|
| 145 |
+
> "I implemented a hierarchical retrieval system inspired by LlamaIndex's Parent-Child pattern. Instead of indexing entire books, I indexed individual review sentences for high-precision matching, then recursively retrieved the parent book context. This solved the 'needle in a haystack' problem for detail-oriented queries."
|
| 146 |
+
|
| 147 |
+
---
|
| 148 |
+
|
| 149 |
+
## Part III: Query Expansion (HyDE - Future)
|
| 150 |
+
|
| 151 |
+
### 3.1 Problem Definition
|
| 152 |
+
**Failure Case**: User asks "That blue robot book" but the book description says "android with azure plating".
|
| 153 |
+
|
| 154 |
+
### 3.2 SOTA Technique: Hypothetical Document Embeddings (HyDE)
|
| 155 |
+
|
| 156 |
+
**Reference**: [HyDE (Gao et al., 2022)](https://arxiv.org/abs/2212.10496)
|
| 157 |
+
|
| 158 |
+
**Concept**: Before searching, generate a *hypothetical* document that would answer the query, then embed *that* instead of the query.
|
| 159 |
+
|
| 160 |
+
**Future Implementation**:
|
| 161 |
+
```python
|
| 162 |
+
def hyde_search(query: str) -> List[Document]:
|
| 163 |
+
# Step 1: Generate hypothetical document
|
| 164 |
+
prompt = f"Write a detailed book description that would perfectly match: {query}"
|
| 165 |
+
hypothetical_doc = llm.invoke(prompt)
|
| 166 |
+
|
| 167 |
+
# Step 2: Embed the hypothetical doc (not the query)
|
| 168 |
+
results = vector_db.search(hypothetical_doc, k=10)
|
| 169 |
+
return results
|
| 170 |
+
```
|
| 171 |
+
|
| 172 |
+
**Status**: Deferred to Phase 7. Current focus is Small-to-Big.
|
| 173 |
+
|
| 174 |
+
---
|
| 175 |
+
|
| 176 |
+
## Implementation Priority
|
| 177 |
+
|
| 178 |
+
| Priority | Feature | File | Status |
|
| 179 |
+
|----------|---------|------|--------|
|
| 180 |
+
| 1 | SFT Data Generator | `src/data_factory/generator.py` | TODO |
|
| 181 |
+
| 2 | LLM Judge | `src/data_factory/judge.py` | TODO |
|
| 182 |
+
| 3 | Review Chunker | `scripts/chunk_reviews.py` | TODO |
|
| 183 |
+
| 4 | Small-to-Big Index | `scripts/init_dual_index.py` | TODO |
|
| 184 |
+
| 5 | Small-to-Big Search | `src/vector_db.py` | TODO |
|
| 185 |
+
| 6 | HyDE | `src/core/hyde.py` | Deferred |
|
| 186 |
+
|
| 187 |
+
---
|
| 188 |
+
|
| 189 |
+
## Summary
|
| 190 |
+
|
| 191 |
+
This document establishes the **technical rationale** for two major upgrades:
|
| 192 |
+
|
| 193 |
+
1. **SFT Pipeline**: Not just "training a model" but designing a *data factory* with quality control — demonstrating Data-Centric AI thinking.
|
| 194 |
+
|
| 195 |
+
2. **Small-to-Big RAG**: Not just "adding more data" but restructuring the *retrieval topology* — demonstrating Systems Architecture thinking.
|
| 196 |
+
|
| 197 |
+
Both are aligned with 2024 SOTA practices and provide concrete talking points for MLE interviews.
|
environment.yml
ADDED
|
@@ -0,0 +1,41 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
name: book-rec
|
| 2 |
+
channels:
|
| 3 |
+
- conda-forge
|
| 4 |
+
- defaults
|
| 5 |
+
dependencies:
|
| 6 |
+
- python=3.10
|
| 7 |
+
- pip
|
| 8 |
+
- pip:
|
| 9 |
+
# --- Core Backend ---
|
| 10 |
+
- fastapi>=0.109.0
|
| 11 |
+
- uvicorn[standard]>=0.27.0
|
| 12 |
+
- pydantic>=2.0.0
|
| 13 |
+
- python-dotenv
|
| 14 |
+
|
| 15 |
+
# --- Data & Math ---
|
| 16 |
+
- numpy<2.0.0 # Constraint for broad compatibility
|
| 17 |
+
- pandas>=2.0.0
|
| 18 |
+
|
| 19 |
+
# --- AI / ML Core (M1 Friendly) ---
|
| 20 |
+
- torch
|
| 21 |
+
- sentence-transformers>=2.2.2
|
| 22 |
+
- scikit-learn
|
| 23 |
+
|
| 24 |
+
# --- RAG Stack ---
|
| 25 |
+
- langchain>=0.1.0
|
| 26 |
+
- langchain-community>=0.0.10
|
| 27 |
+
- langchain-openai>=0.0.5
|
| 28 |
+
- langchain-chroma>=0.1.0
|
| 29 |
+
- chromadb>=0.4.22
|
| 30 |
+
- huggingface-hub>=0.20.0
|
| 31 |
+
|
| 32 |
+
# --- Infrastructure ---
|
| 33 |
+
- redis
|
| 34 |
+
- prometheus-client
|
| 35 |
+
- python-json-logger
|
| 36 |
+
- httpx
|
| 37 |
+
|
| 38 |
+
# --- Dev Tools ---
|
| 39 |
+
- ruff
|
| 40 |
+
- pytest
|
| 41 |
+
- black
|
experiments/baseline_report.md
ADDED
|
@@ -0,0 +1,28 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Retrieval Baseline Report
|
| 2 |
+
**Date**: 2026-01-08
|
| 3 |
+
**Metric**: Recall@5 (Qualitative check)
|
| 4 |
+
|
| 5 |
+
## Experiment Setup
|
| 6 |
+
- **System**: ChromaDB (all-MiniLM-L6-v2) - Pure Dense Retrieval.
|
| 7 |
+
- **Dataset**: Book Reviews (~220k docs).
|
| 8 |
+
- **Benchmarks**:
|
| 9 |
+
1. **Semantic Queries** (e.g., "finding love"): Expected STRONG performance.
|
| 10 |
+
2. **Keyword Queries** (e.g., "Harry Potter"): Expected MODERATE performance.
|
| 11 |
+
3. **Exact Match** (e.g., ISBN): Expected WEAK performance.
|
| 12 |
+
|
| 13 |
+
## Results
|
| 14 |
+
| Query Type | Query | Result | Status |
|
| 15 |
+
| :--- | :--- | :--- | :--- |
|
| 16 |
+
| **Semantic** | "finding love..." | "All About Love" (found via similar vector) | ✅ **SUCCESS** |
|
| 17 |
+
| **Keyword** | "Harry Potter" | "Harry Potter and Philosophy" | ⚠️ **PARTIAL** (Found related, but missed main novels?) |
|
| 18 |
+
| **Exact** | "0060959479" | "National Geographic..." (Completely unrelated) | ❌ **FAILURE** |
|
| 19 |
+
|
| 20 |
+
## Analysis
|
| 21 |
+
The current **Dense Retrieval** model treats the ISBN `0060959479` as a semantic string. Since the embedding model (MiniLM) is not trained to recognize ISBN relationships, it maps the number to a vector space location that happens to be near "National Geographic" (likely random noise collision or digit similarity).
|
| 22 |
+
|
| 23 |
+
**Conclusion**: The system is **incapable of exact entity retrieval** by ID or specific unique identifier.
|
| 24 |
+
|
| 25 |
+
## Optimization Plan
|
| 26 |
+
**Implement Hybrid Search** to combine:
|
| 27 |
+
1. **BM25 (Sparse)**: For exact keyword/ID matching.
|
| 28 |
+
2. **Vector (Dense)**: For semantic understanding.
|
experiments/hybrid_report.md
ADDED
|
@@ -0,0 +1,28 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Hybrid Retrieval Benchmark Report
|
| 2 |
+
**Date**: 2026-01-08
|
| 3 |
+
**Metric**: Qualtitative Recall (Top-5)
|
| 4 |
+
|
| 5 |
+
## Experiment Setup
|
| 6 |
+
- **System**: Hybrid RRF (BM25 + Chroma Dense).
|
| 7 |
+
- **Comparison**: Baseline (Dense Only) vs Hybrid.
|
| 8 |
+
|
| 9 |
+
## Results Comparison
|
| 10 |
+
|
| 11 |
+
| Query Type | Query | Baseline Result | Hybrid Result | Status |
|
| 12 |
+
| :--- | :--- | :--- | :--- | :--- |
|
| 13 |
+
| **Semantic** | "finding love..." | "All About Love" | "Elusive Love", "Finding God..." | ✅ **Maintained** |
|
| 14 |
+
| **Keyword** | "Harry Potter" | "Harry Potter and Philosophy" | **"Harry Potter and the Sorcerer's Stone"** | 🚀 **IMPROVED** |
|
| 15 |
+
| **Exact** | "0060959479" | "National Geographic..." (Fail) | **"All About Love: New Visions"** | 🎉 **FIXED** |
|
| 16 |
+
|
| 17 |
+
## Performance Trade-off
|
| 18 |
+
- **Latency**: Increased from ~20ms (Dense) to ~600ms (Hybrid).
|
| 19 |
+
- **Cause**: In-memory BM25 scoring of 220k documents in Python.
|
| 20 |
+
- **Verdict**: Acceptable for "High Accuracy" mode.
|
| 21 |
+
|
| 22 |
+
## Technical Implementation
|
| 23 |
+
- **Sparse**: `rank_bm25` (Okapi BM25) on Title + Author + Desc + ISBN.
|
| 24 |
+
- **Dense**: `all-MiniLM-L6-v2` (Chroma).
|
| 25 |
+
- **Fusion**: Reciprocal Rank Fusion (RRF) with `k=60`.
|
| 26 |
+
|
| 27 |
+
## Conclusion
|
| 28 |
+
Hybrid Search successfully combines the "Literal Precision" of BM25 with the "Semantic Understanding" of Vectors. We have solved the "Exact Match" failure case.
|
experiments/rerank_report.md
ADDED
|
@@ -0,0 +1,25 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Reranking Benchmark Report
|
| 2 |
+
**Date**: 2026-01-08
|
| 3 |
+
**Model**: `cross-encoder/ms-marco-MiniLM-L-6-v2`
|
| 4 |
+
|
| 5 |
+
## Experiment Setup
|
| 6 |
+
- **Pipeline**: Hybrid Search (BM25 + Dense) -> Top 50 Candidates -> Cross-Encoder Rerank -> Top 5.
|
| 7 |
+
- **Metric**: Relevance Score & Qualitative Ranking.
|
| 8 |
+
|
| 9 |
+
## Results Comparison
|
| 10 |
+
|
| 11 |
+
| Query | Hybrid (Raw RRF) | Reranked Result (Top 1) | Score | Verdict |
|
| 12 |
+
| :--- | :--- | :--- | :--- | :--- |
|
| 13 |
+
| **"Harry Potter"** | "Harry Potter and **Philosophy**" | "**Harry Potter and The Sorcerer's Stone**" | 5.61 | 🚀 **HUGE WIN** (Fixed intent) |
|
| 14 |
+
| **"Jane Austen"** | "A Single Man" (Noise?) | "The Novels of Jane Austen" | 8.96 | ✅ **Precise** |
|
| 15 |
+
| **"finding love..."** | "Elusive Love" | "Together Apart" | 6.41 | ✅ **High Quality** |
|
| 16 |
+
| **ISBN "0060959479"** | "All About Love" (Rank 1) | "Physical Education..." (Rank 1)<br>"All About Love" (Rank 2) | -1.33 | ⚠️ **Regression** (Model confused by ID) |
|
| 17 |
+
|
| 18 |
+
## Latency Analysis
|
| 19 |
+
- **Cold Start**: ~11s (Model Load).
|
| 20 |
+
- **Warm Query**: ~0.7s - 1.5s.
|
| 21 |
+
- **Conclusion**: ~1s overhead is acceptable for "Smart Search" mode.
|
| 22 |
+
|
| 23 |
+
## Optimization Strategy (Next Steps)
|
| 24 |
+
1. **Dynamic Reranking**: Only trigger Reranker for natural language queries (detect length > 5 chars or no regex match for ISBN).
|
| 25 |
+
2. **Quantization**: Use ONNX version of Cross-Encoder for 2x speedup.
|
experiments/router_report.md
ADDED
|
@@ -0,0 +1,27 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Agentic Router Benchmark Report
|
| 2 |
+
**Date**: 2026-01-08
|
| 3 |
+
**Metric**: Adaptive Precision & Latency
|
| 4 |
+
|
| 5 |
+
## System Architecture
|
| 6 |
+
The **Query Router** dynamically assigns a retrieval strategy based on query analysis:
|
| 7 |
+
|
| 8 |
+
1. **EXACT (ISBN)**: `BM25 Only` (`alpha=1.0`, `rerank=False`).
|
| 9 |
+
2. **FAST (Keywords)**: `Hybrid RRF` (`alpha=0.5`, `rerank=False`).
|
| 10 |
+
3. **DEEP (Complex)**: `Hybrid RRF` + `Cross-Encoder Rerank`.
|
| 11 |
+
|
| 12 |
+
## Results Comparison
|
| 13 |
+
|
| 14 |
+
| Query | Detected Strategy | Top Result | Logic Validated? |
|
| 15 |
+
| :--- | :--- | :--- | :--- |
|
| 16 |
+
| **"0060959479"** (ISBN) | **EXACT** | **"All About Love: New Visions"** | ✅ **YES** (Noise Removed) |
|
| 17 |
+
| **"python programming"** | **FAST** | "Python Cookbook" | ✅ **YES** (Speed Optimized) |
|
| 18 |
+
| **"finding love..."** | **DEEP** | "Together Apart" (Score: 6.4) | ✅ **YES** (Contextual) |
|
| 19 |
+
|
| 20 |
+
## Performance Impact
|
| 21 |
+
- **ISBN Precision**: 100% (Up from ~50% with Rerank).
|
| 22 |
+
- **Latency**:
|
| 23 |
+
- Exact/Fast: ~0.5 - 1.2s
|
| 24 |
+
- Deep: ~2.0 - 5.0s (depending on CPU load).
|
| 25 |
+
|
| 26 |
+
## Conclusion
|
| 27 |
+
The **Agentic Router** successfully makes the retrieval "Self-Correcting". It applies expensive power (Reranking) only when needed and precise tools (BM25) when exactness is required.
|
experiments/temporal_report.md
ADDED
|
@@ -0,0 +1,25 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Temporal Dynamics Benchmark Report
|
| 2 |
+
**Date**: 2026-01-08
|
| 3 |
+
**Mechanism**: Recency Boosting (Log-Linear Decay).
|
| 4 |
+
|
| 5 |
+
## Experiment Setup
|
| 6 |
+
- **Query**: "latest advancements in technology and science"
|
| 7 |
+
- **Method**: Compare Rerank Score vs Temporal Boosted Score.
|
| 8 |
+
- **Boost Logic**: `Score_New = Score_Old + (2.0 / log(Age + e))`
|
| 9 |
+
|
| 10 |
+
## Results Comparison
|
| 11 |
+
|
| 12 |
+
| Title (Year) | Standard Score | Temporal Score | Boost | Age |
|
| 13 |
+
| :--- | :--- | :--- | :--- | :--- |
|
| 14 |
+
| **"Intro to Science..." (2011)** | 6.076 | **6.772** | +0.696 | 15 yrs |
|
| 15 |
+
| **"Environmental Sci..." (2012)** | -0.883 | **-0.173** | +0.710 | 14 yrs |
|
| 16 |
+
| **"ACP Complete..." (1999)** | 4.128 | 4.718 | +0.590 | 27 yrs |
|
| 17 |
+
|
| 18 |
+
## Analysis
|
| 19 |
+
- **Correlation**: Newer books receive a higher additive boost.
|
| 20 |
+
- **Magnitude**: ~0.7 points for a 15-year-old book vs ~0.59 for a 27-year-old book.
|
| 21 |
+
- **Impact**: Enough to tip the scales in close calls or move a "relevant but old" book below a "relevant and new" one.
|
| 22 |
+
- **Safety**: Does NOT bury classic books (1999 still retained high rank due to high base relevance).
|
| 23 |
+
|
| 24 |
+
## Conclusion
|
| 25 |
+
Temporal Dynamics successfully implements a "Freshness Bias" without compromising semantic relevance.
|
scripts/add_isbn13_to_books_data.py
ADDED
|
@@ -0,0 +1,16 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import pandas as pd
|
| 2 |
+
|
| 3 |
+
# 读取主表和 books_data_with_isbn.csv
|
| 4 |
+
main = pd.read_csv("data/books_with_emotions.csv", usecols=["title", "isbn13"])
|
| 5 |
+
data = pd.read_csv("data/books_data_with_isbn.csv")
|
| 6 |
+
|
| 7 |
+
# 标准化标题
|
| 8 |
+
main["title"] = main["title"].astype(str).str.strip().str.lower()
|
| 9 |
+
data["Title"] = data["Title"].astype(str).str.strip().str.lower()
|
| 10 |
+
|
| 11 |
+
# 合并,左连接
|
| 12 |
+
merged = data.merge(main, left_on="Title", right_on="title", how="left")
|
| 13 |
+
|
| 14 |
+
# 保存新文件
|
| 15 |
+
merged.to_csv("data/books_data_with_isbn13.csv", index=False)
|
| 16 |
+
print("已生成 data/books_data_with_isbn13.csv,包含 isbn13 字段。")
|
scripts/add_isbn_to_books_data.py
ADDED
|
@@ -0,0 +1,21 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import pandas as pd
|
| 2 |
+
|
| 3 |
+
# 读取 books_data.csv
|
| 4 |
+
books_data = pd.read_csv("data/books_data.csv")
|
| 5 |
+
|
| 6 |
+
# 读取 Books_rating.csv,只取 Title 和 Id 字段
|
| 7 |
+
ratings = pd.read_csv("data/Books_rating.csv", usecols=["Title", "Id"])
|
| 8 |
+
|
| 9 |
+
# 去重,避免多对一
|
| 10 |
+
ratings = ratings.drop_duplicates(subset=["Title"])
|
| 11 |
+
|
| 12 |
+
# 合并,左连接,保留 books_data.csv 所有行
|
| 13 |
+
merged = books_data.merge(ratings, on="Title", how="left")
|
| 14 |
+
|
| 15 |
+
# 重命名 Id 为 isbn
|
| 16 |
+
merged = merged.rename(columns={"Id": "isbn"})
|
| 17 |
+
|
| 18 |
+
# 保存新文件
|
| 19 |
+
merged.to_csv("data/books_data_with_isbn.csv", index=False)
|
| 20 |
+
|
| 21 |
+
print("已生成 data/books_data_with_isbn.csv,包含 isbn 字段。")
|
scripts/benchmark_compressor.py
ADDED
|
@@ -0,0 +1,35 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import asyncio
|
| 2 |
+
from langchain_core.messages import HumanMessage, AIMessage
|
| 3 |
+
from src.core.context_compressor import compressor
|
| 4 |
+
|
| 5 |
+
async def run_benchmark():
|
| 6 |
+
print("🚀 Starting Context Compression Benchmark...")
|
| 7 |
+
|
| 8 |
+
# 1. Simulate Long History (12 messages, 6 turns)
|
| 9 |
+
history = []
|
| 10 |
+
for i in range(1, 7):
|
| 11 |
+
history.append(HumanMessage(content=f"User question {i}: I like sci-fi."))
|
| 12 |
+
history.append(AIMessage(content=f"AI answer {i}: Here is a sci-fi book."))
|
| 13 |
+
|
| 14 |
+
print(f"Original History Length: {len(history)} messages")
|
| 15 |
+
|
| 16 |
+
# 2. Compress
|
| 17 |
+
print("Compressing...")
|
| 18 |
+
# Mock LLM generation usually takes time, so latency includes API call
|
| 19 |
+
compressed = await compressor.compress_history(history)
|
| 20 |
+
|
| 21 |
+
print(f"Compressed History Length: {len(compressed)} messages")
|
| 22 |
+
|
| 23 |
+
# 3. Validation
|
| 24 |
+
# Expected: 1 SystemMessage (Summary) + 4 Messages (Recent) = 5
|
| 25 |
+
if len(compressed) == 5:
|
| 26 |
+
print("✅ SUCCESS: History compressed to 5 messages.")
|
| 27 |
+
print(f"Summary Content: {compressed[0].content}")
|
| 28 |
+
print(f"Oldest Retained Message: {compressed[1].content}")
|
| 29 |
+
else:
|
| 30 |
+
print(f"❌ FAILURE: Expected 5 messages, got {len(compressed)}")
|
| 31 |
+
for i, m in enumerate(compressed):
|
| 32 |
+
print(f"[{i}] {type(m).__name__}: {m.content}")
|
| 33 |
+
|
| 34 |
+
if __name__ == "__main__":
|
| 35 |
+
asyncio.run(run_benchmark())
|
scripts/benchmark_hybrid.py
ADDED
|
@@ -0,0 +1,83 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import time
|
| 2 |
+
import pandas as pd
|
| 3 |
+
from src.vector_db import VectorDB
|
| 4 |
+
|
| 5 |
+
def run_benchmark():
|
| 6 |
+
print("🚀 Starting Hybrid Retrieval Benchmark...")
|
| 7 |
+
|
| 8 |
+
# Load Title Mapping
|
| 9 |
+
try:
|
| 10 |
+
books_df = pd.read_csv("data/books_processed.csv")
|
| 11 |
+
# Ensure string ISBN for matching
|
| 12 |
+
if 'isbn13' in books_df.columns:
|
| 13 |
+
books_df['isbn'] = books_df['isbn13'].astype(str)
|
| 14 |
+
else:
|
| 15 |
+
books_df['isbn'] = books_df['isbn'].astype(str)
|
| 16 |
+
|
| 17 |
+
isbn_map = books_df.set_index('isbn')['title'].to_dict()
|
| 18 |
+
except Exception as e:
|
| 19 |
+
print(f"⚠️ Failed to load books_processed.csv: {e}")
|
| 20 |
+
isbn_map = {}
|
| 21 |
+
|
| 22 |
+
db = VectorDB()
|
| 23 |
+
|
| 24 |
+
# Same Test Cases
|
| 25 |
+
test_queries = [
|
| 26 |
+
# 1. Semantic (Hybrid should match Dense)
|
| 27 |
+
{"type": "Semantic", "query": "books about finding love in unexpected places"},
|
| 28 |
+
{"type": "Semantic", "query": "scary stories that keep you up at night"},
|
| 29 |
+
|
| 30 |
+
# 2. Keyword/Proper Noun (Hybrid should improve)
|
| 31 |
+
{"type": "Keyword", "query": "Harry Potter"},
|
| 32 |
+
{"type": "Keyword", "query": "Python Programming"},
|
| 33 |
+
{"type": "Keyword", "query": "Jane Austen"},
|
| 34 |
+
|
| 35 |
+
# 3. Exact Match / ISBN (Hybrid should fix this)
|
| 36 |
+
{"type": "Exact", "query": "0060959479"},
|
| 37 |
+
]
|
| 38 |
+
|
| 39 |
+
results = []
|
| 40 |
+
|
| 41 |
+
for case in test_queries:
|
| 42 |
+
q = case["query"]
|
| 43 |
+
print(f"\nScanning: '{q}' ({case['type']})...")
|
| 44 |
+
|
| 45 |
+
start_time = time.time()
|
| 46 |
+
# USE HYBRID SEARCH
|
| 47 |
+
docs = db.hybrid_search(q, k=5)
|
| 48 |
+
duration = (time.time() - start_time) * 1000
|
| 49 |
+
|
| 50 |
+
# Capture simplified results
|
| 51 |
+
top_results = []
|
| 52 |
+
for doc in docs:
|
| 53 |
+
# Extract ISBN
|
| 54 |
+
parts = doc.page_content.strip().split(' ', 1)
|
| 55 |
+
isbn = parts[0]
|
| 56 |
+
# Fallback parsing for legacy docs
|
| 57 |
+
if "ISBN:" in doc.page_content:
|
| 58 |
+
isbn = doc.page_content.split("ISBN:")[1].strip().split()[0]
|
| 59 |
+
|
| 60 |
+
title = isbn_map.get(isbn, f"ISBN:{isbn}")
|
| 61 |
+
if len(title) > 40:
|
| 62 |
+
title = title[:37] + "..."
|
| 63 |
+
top_results.append(title)
|
| 64 |
+
|
| 65 |
+
print(f" -> Found: {top_results}")
|
| 66 |
+
results.append({
|
| 67 |
+
"query": q,
|
| 68 |
+
"type": case["type"],
|
| 69 |
+
"latency_ms": round(duration, 2),
|
| 70 |
+
"top_results": top_results
|
| 71 |
+
})
|
| 72 |
+
|
| 73 |
+
# Save
|
| 74 |
+
df = pd.DataFrame(results)
|
| 75 |
+
path = "experiments/02_hybrid_results.csv"
|
| 76 |
+
df.to_csv(path, index=False)
|
| 77 |
+
print(f"\n💾 Results saved to {path}")
|
| 78 |
+
|
| 79 |
+
print("\n## Hybrid Search Results")
|
| 80 |
+
print(df.to_string(index=False))
|
| 81 |
+
|
| 82 |
+
if __name__ == "__main__":
|
| 83 |
+
run_benchmark()
|
scripts/benchmark_rerank.py
ADDED
|
@@ -0,0 +1,82 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import time
|
| 2 |
+
import pandas as pd
|
| 3 |
+
from src.vector_db import VectorDB
|
| 4 |
+
|
| 5 |
+
def run_benchmark():
|
| 6 |
+
print("🚀 Starting Reranked Retrieval Benchmark...")
|
| 7 |
+
|
| 8 |
+
# Load Title Mapping
|
| 9 |
+
try:
|
| 10 |
+
books_df = pd.read_csv("data/books_processed.csv")
|
| 11 |
+
if 'isbn13' in books_df.columns:
|
| 12 |
+
books_df['isbn'] = books_df['isbn13'].astype(str)
|
| 13 |
+
else:
|
| 14 |
+
books_df['isbn'] = books_df['isbn'].astype(str)
|
| 15 |
+
isbn_map = books_df.set_index('isbn')['title'].to_dict()
|
| 16 |
+
except Exception as e:
|
| 17 |
+
print(f"⚠️ Failed to load books_processed.csv: {e}")
|
| 18 |
+
isbn_map = {}
|
| 19 |
+
|
| 20 |
+
db = VectorDB()
|
| 21 |
+
|
| 22 |
+
# Same Test Cases
|
| 23 |
+
test_queries = [
|
| 24 |
+
# 1. Semantic (Reranker should bubble up best Semantic matches)
|
| 25 |
+
{"type": "Semantic", "query": "books about finding love in unexpected places"},
|
| 26 |
+
# Complex mood query
|
| 27 |
+
{"type": "Complex", "query": "a dark sci-fi thriller with a female protagonist"},
|
| 28 |
+
|
| 29 |
+
# 2. Keyword/Proper Noun (Reranker should confirm these are relevant)
|
| 30 |
+
{"type": "Keyword", "query": "Harry Potter"},
|
| 31 |
+
{"type": "Keyword", "query": "Jane Austen"},
|
| 32 |
+
|
| 33 |
+
# 3. Exact Match (Should still work)
|
| 34 |
+
{"type": "Exact", "query": "0060959479"},
|
| 35 |
+
]
|
| 36 |
+
|
| 37 |
+
results = []
|
| 38 |
+
|
| 39 |
+
for case in test_queries:
|
| 40 |
+
q = case["query"]
|
| 41 |
+
print(f"\nScanning: '{q}' ({case['type']})...")
|
| 42 |
+
|
| 43 |
+
start_time = time.time()
|
| 44 |
+
# USE HYBRID WITH RERANK
|
| 45 |
+
docs = db.hybrid_search(q, k=5, rerank=True)
|
| 46 |
+
duration = (time.time() - start_time) * 1000
|
| 47 |
+
|
| 48 |
+
# Capture results with scores
|
| 49 |
+
top_results = []
|
| 50 |
+
for doc in docs:
|
| 51 |
+
# Extract ISBN
|
| 52 |
+
parts = doc.page_content.strip().split(' ', 1)
|
| 53 |
+
isbn = parts[0]
|
| 54 |
+
if "ISBN:" in doc.page_content:
|
| 55 |
+
isbn = doc.page_content.split("ISBN:")[1].strip().split()[0]
|
| 56 |
+
|
| 57 |
+
title = isbn_map.get(isbn, f"ISBN:{isbn}")
|
| 58 |
+
if len(title) > 30:
|
| 59 |
+
title = title[:27] + "..."
|
| 60 |
+
|
| 61 |
+
score = doc.metadata.get("relevance_score", 0.0)
|
| 62 |
+
top_results.append(f"{title} ({score:.4f})")
|
| 63 |
+
|
| 64 |
+
print(f" -> Found: {top_results}")
|
| 65 |
+
results.append({
|
| 66 |
+
"query": q,
|
| 67 |
+
"type": case["type"],
|
| 68 |
+
"latency_ms": round(duration, 2),
|
| 69 |
+
"top_results": top_results
|
| 70 |
+
})
|
| 71 |
+
|
| 72 |
+
# Save
|
| 73 |
+
df = pd.DataFrame(results)
|
| 74 |
+
path = "experiments/03_rerank_results.csv"
|
| 75 |
+
df.to_csv(path, index=False)
|
| 76 |
+
print(f"\n💾 Results saved to {path}")
|
| 77 |
+
|
| 78 |
+
print("\n## Reranked Search Results")
|
| 79 |
+
print(df.to_string(index=False))
|
| 80 |
+
|
| 81 |
+
if __name__ == "__main__":
|
| 82 |
+
run_benchmark()
|
scripts/benchmark_retrieval.py
ADDED
|
@@ -0,0 +1,82 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import time
|
| 2 |
+
import pandas as pd
|
| 3 |
+
from typing import List
|
| 4 |
+
from src.vector_db import VectorDB
|
| 5 |
+
|
| 6 |
+
def run_benchmark():
|
| 7 |
+
print("🚀 Starting Retrieval Benchmark (BASELINE)...")
|
| 8 |
+
|
| 9 |
+
# Load Title Mapping
|
| 10 |
+
try:
|
| 11 |
+
books_df = pd.read_csv("data/books_processed.csv")
|
| 12 |
+
# Ensure string ISBN for matching
|
| 13 |
+
books_df['isbn'] = books_df['isbn'].astype(str)
|
| 14 |
+
isbn_map = books_df.set_index('isbn')['title'].to_dict()
|
| 15 |
+
print(f"📚 Loaded {len(isbn_map)} titles for mapping.")
|
| 16 |
+
except Exception as e:
|
| 17 |
+
print(f"⚠️ Failed to load books_processed.csv: {e}")
|
| 18 |
+
isbn_map = {}
|
| 19 |
+
|
| 20 |
+
db = VectorDB()
|
| 21 |
+
|
| 22 |
+
# ... (Test Cases preserved) ...
|
| 23 |
+
test_queries = [
|
| 24 |
+
# 1. Semantic (Dense should win)
|
| 25 |
+
{"type": "Semantic", "query": "books about finding love in unexpected places"},
|
| 26 |
+
{"type": "Semantic", "query": "scary stories that keep you up at night"},
|
| 27 |
+
|
| 28 |
+
# 2. Keyword/Proper Noun (Dense might struggle)
|
| 29 |
+
{"type": "Keyword", "query": "Harry Potter"},
|
| 30 |
+
{"type": "Keyword", "query": "Python Programming"},
|
| 31 |
+
{"type": "Keyword", "query": "Jane Austen"},
|
| 32 |
+
|
| 33 |
+
# 3. Exact Match / ISBN
|
| 34 |
+
{"type": "Exact", "query": "0060959479"},
|
| 35 |
+
]
|
| 36 |
+
|
| 37 |
+
results = []
|
| 38 |
+
|
| 39 |
+
for case in test_queries:
|
| 40 |
+
q = case["query"]
|
| 41 |
+
print(f"\nScanning: '{q}' ({case['type']})...")
|
| 42 |
+
|
| 43 |
+
start_time = time.time()
|
| 44 |
+
docs = db.search(q, k=5)
|
| 45 |
+
duration = (time.time() - start_time) * 1000
|
| 46 |
+
|
| 47 |
+
# Capture simplified results
|
| 48 |
+
top_results = []
|
| 49 |
+
for doc in docs:
|
| 50 |
+
# Format: "ISBN ReviewText..."
|
| 51 |
+
# Extract ISBN (first token)
|
| 52 |
+
parts = doc.page_content.strip().split(' ', 1)
|
| 53 |
+
isbn = parts[0]
|
| 54 |
+
|
| 55 |
+
# Lookup Title
|
| 56 |
+
title = isbn_map.get(isbn, f"ISBN:{isbn}")
|
| 57 |
+
|
| 58 |
+
# Truncate for display
|
| 59 |
+
if len(title) > 40:
|
| 60 |
+
title = title[:37] + "..."
|
| 61 |
+
top_results.append(title)
|
| 62 |
+
|
| 63 |
+
print(f" -> Found: {top_results}")
|
| 64 |
+
results.append({
|
| 65 |
+
"query": q,
|
| 66 |
+
"type": case["type"],
|
| 67 |
+
"latency_ms": round(duration, 2),
|
| 68 |
+
"top_results": top_results
|
| 69 |
+
})
|
| 70 |
+
|
| 71 |
+
# Save Report
|
| 72 |
+
df = pd.DataFrame(results)
|
| 73 |
+
path = "experiments/01_baseline_results.csv"
|
| 74 |
+
df.to_csv(path, index=False)
|
| 75 |
+
print(f"\n💾 Results saved to {path}")
|
| 76 |
+
|
| 77 |
+
# Print Summary
|
| 78 |
+
print("\n## Baseline Results Summary")
|
| 79 |
+
print(df.to_string(index=False))
|
| 80 |
+
|
| 81 |
+
if __name__ == "__main__":
|
| 82 |
+
run_benchmark()
|
scripts/benchmark_router.py
ADDED
|
@@ -0,0 +1,99 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import time
|
| 2 |
+
import pandas as pd
|
| 3 |
+
from src.vector_db import VectorDB
|
| 4 |
+
from src.core.router import QueryRouter
|
| 5 |
+
|
| 6 |
+
def run_benchmark():
|
| 7 |
+
print("🚀 Starting Agentic Router Benchmark...")
|
| 8 |
+
|
| 9 |
+
# Init Components
|
| 10 |
+
db = VectorDB()
|
| 11 |
+
router = QueryRouter()
|
| 12 |
+
|
| 13 |
+
# Load Title Mapping (for display)
|
| 14 |
+
try:
|
| 15 |
+
books_df = pd.read_csv("data/books_processed.csv")
|
| 16 |
+
if 'isbn13' in books_df.columns:
|
| 17 |
+
books_df['isbn'] = books_df['isbn13'].astype(str)
|
| 18 |
+
else:
|
| 19 |
+
books_df['isbn'] = books_df['isbn'].astype(str)
|
| 20 |
+
isbn_map = books_df.set_index('isbn')['title'].to_dict()
|
| 21 |
+
except:
|
| 22 |
+
isbn_map = {}
|
| 23 |
+
|
| 24 |
+
test_queries = [
|
| 25 |
+
# 1. ISBN -> Should be EXACT (No Rerank) to avoid regression
|
| 26 |
+
{"query": "0060959479", "expected_strat": "exact"},
|
| 27 |
+
|
| 28 |
+
# 2. Keyword -> Should be FAST (No Rerank)
|
| 29 |
+
{"query": "python programming", "expected_strat": "fast"},
|
| 30 |
+
|
| 31 |
+
# 3. Complex -> Should be DEEP (With Rerank)
|
| 32 |
+
{"query": "books about finding love in unexpected places", "expected_strat": "deep"},
|
| 33 |
+
]
|
| 34 |
+
|
| 35 |
+
results = []
|
| 36 |
+
|
| 37 |
+
for case in test_queries:
|
| 38 |
+
q = case["query"]
|
| 39 |
+
print(f"\nUser Query: '{q}'")
|
| 40 |
+
|
| 41 |
+
# 1. ROUTING STEP
|
| 42 |
+
route_decision = router.route(q)
|
| 43 |
+
strat = route_decision["strategy"]
|
| 44 |
+
use_rerank = route_decision["rerank"]
|
| 45 |
+
alpha_val = route_decision.get("alpha", 0.5)
|
| 46 |
+
|
| 47 |
+
print(f" 🤖 Router Decision: {strat.upper()} (Rerank={use_rerank}, Alpha={alpha_val})")
|
| 48 |
+
|
| 49 |
+
# Check expectation
|
| 50 |
+
if strat != case["expected_strat"]:
|
| 51 |
+
print(f" ⚠️ WARNING: Expected {case['expected_strat']}, got {strat}")
|
| 52 |
+
|
| 53 |
+
# 2. RETRIEVAL STEP
|
| 54 |
+
start_time = time.time()
|
| 55 |
+
docs = db.hybrid_search(
|
| 56 |
+
q,
|
| 57 |
+
k=5,
|
| 58 |
+
rerank=use_rerank,
|
| 59 |
+
alpha=alpha_val
|
| 60 |
+
)
|
| 61 |
+
duration = (time.time() - start_time) * 1000
|
| 62 |
+
|
| 63 |
+
# Capture results
|
| 64 |
+
top_results = []
|
| 65 |
+
for doc in docs:
|
| 66 |
+
# Extract ISBN/Title
|
| 67 |
+
if "ISBN:" in doc.page_content:
|
| 68 |
+
isbn = doc.page_content.split("ISBN:")[1].strip().split()[0]
|
| 69 |
+
else:
|
| 70 |
+
parts = doc.page_content.strip().split(' ', 1)
|
| 71 |
+
isbn = parts[0]
|
| 72 |
+
|
| 73 |
+
title = isbn_map.get(isbn, f"ISBN:{isbn}")
|
| 74 |
+
if len(title) > 30:
|
| 75 |
+
title = title[:27] + "..."
|
| 76 |
+
|
| 77 |
+
score = doc.metadata.get("relevance_score", "N/A")
|
| 78 |
+
if score != "N/A":
|
| 79 |
+
top_results.append(f"{title} ({score:.4f})")
|
| 80 |
+
else:
|
| 81 |
+
top_results.append(f"{title}")
|
| 82 |
+
|
| 83 |
+
print(f" -> Found: {top_results[:3]}")
|
| 84 |
+
results.append({
|
| 85 |
+
"query": q,
|
| 86 |
+
"strategy": strat,
|
| 87 |
+
"latency_ms": round(duration, 2),
|
| 88 |
+
"top_1": top_results[0] if top_results else "None"
|
| 89 |
+
})
|
| 90 |
+
|
| 91 |
+
# Save
|
| 92 |
+
df = pd.DataFrame(results)
|
| 93 |
+
path = "experiments/04_router_results.csv"
|
| 94 |
+
df.to_csv(path, index=False)
|
| 95 |
+
print(f"\n💾 Results saved to {path}")
|
| 96 |
+
print(df.to_string(index=False))
|
| 97 |
+
|
| 98 |
+
if __name__ == "__main__":
|
| 99 |
+
run_benchmark()
|
scripts/benchmark_temporal.py
ADDED
|
@@ -0,0 +1,44 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import pandas as pd
|
| 2 |
+
from src.vector_db import VectorDB
|
| 3 |
+
|
| 4 |
+
def run_benchmark():
|
| 5 |
+
print("🚀 Starting Temporal Dynamics Benchmark...")
|
| 6 |
+
|
| 7 |
+
db = VectorDB()
|
| 8 |
+
|
| 9 |
+
# We use a query where 'newness' matters
|
| 10 |
+
query = "latest advancements in technology and science"
|
| 11 |
+
|
| 12 |
+
print(f"\nQuery: '{query}'")
|
| 13 |
+
|
| 14 |
+
# 1. Standard Search
|
| 15 |
+
print("\n--- Standard Search (No Temporal) ---")
|
| 16 |
+
st_docs = db.hybrid_search(query, k=5, rerank=True, temporal=False)
|
| 17 |
+
for d in st_docs:
|
| 18 |
+
# Get Year
|
| 19 |
+
isbn = d.metadata.get("isbn") or d.metadata.get("isbn13")
|
| 20 |
+
if not isbn and "ISBN:" in d.page_content:
|
| 21 |
+
isbn = d.page_content.split("ISBN:")[1].strip().split()[0]
|
| 22 |
+
year = db.pub_years.get(str(isbn), "Unknown")
|
| 23 |
+
score = d.metadata.get("relevance_score", 0.0)
|
| 24 |
+
|
| 25 |
+
# Parse title
|
| 26 |
+
title = d.page_content.split('\n')[0].replace("Title: ", "")[:40]
|
| 27 |
+
print(f"[{year}] {title}... (Score: {score:.4f})")
|
| 28 |
+
|
| 29 |
+
# 2. Temporal Search
|
| 30 |
+
print("\n--- Temporal Search (Recent Boost) ---")
|
| 31 |
+
tm_docs = db.hybrid_search(query, k=5, rerank=True, temporal=True)
|
| 32 |
+
for d in tm_docs:
|
| 33 |
+
isbn = d.metadata.get("isbn") or d.metadata.get("isbn13")
|
| 34 |
+
if not isbn and "ISBN:" in d.page_content:
|
| 35 |
+
isbn = d.page_content.split("ISBN:")[1].strip().split()[0]
|
| 36 |
+
year = db.pub_years.get(str(isbn), "Unknown")
|
| 37 |
+
# In temporal mode, score is boosted
|
| 38 |
+
score = d.metadata.get("relevance_score", 0.0)
|
| 39 |
+
|
| 40 |
+
title = d.page_content.split('\n')[0].replace("Title: ", "")[:40]
|
| 41 |
+
print(f"[{year}] {title}... (Score: {score:.4f})")
|
| 42 |
+
|
| 43 |
+
if __name__ == "__main__":
|
| 44 |
+
run_benchmark()
|
scripts/build_books_basic_info.py
ADDED
|
@@ -0,0 +1,48 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import pandas as pd
|
| 2 |
+
import csv
|
| 3 |
+
|
| 4 |
+
# 读取原始数据,遇到格式错误行自动跳过,保证流程不中断
|
| 5 |
+
books_data = pd.read_csv(
|
| 6 |
+
"data/books_data.csv",
|
| 7 |
+
engine="python",
|
| 8 |
+
quotechar='"',
|
| 9 |
+
escapechar='\\',
|
| 10 |
+
on_bad_lines='skip' # pandas >=1.3
|
| 11 |
+
)
|
| 12 |
+
ratings = pd.read_csv("data/Books_rating.csv", engine="python", quotechar='"', escapechar='\\', on_bad_lines='skip')
|
| 13 |
+
|
| 14 |
+
# 只保留有用字段
|
| 15 |
+
books_cols = [
|
| 16 |
+
"Title", "description", "authors", "image", "publisher", "publishedDate", "categories"
|
| 17 |
+
]
|
| 18 |
+
books_data = books_data[books_cols]
|
| 19 |
+
|
| 20 |
+
# 只保留 Title, Id, review/score 字段用于合并
|
| 21 |
+
ratings_cols = ["Title", "Id", "review/score"]
|
| 22 |
+
ratings = ratings[ratings_cols]
|
| 23 |
+
|
| 24 |
+
# 去重
|
| 25 |
+
ratings = ratings.drop_duplicates(subset=["Title"])
|
| 26 |
+
|
| 27 |
+
# 合并,左连接,保留 books_data 所有行
|
| 28 |
+
merged = books_data.merge(ratings, on="Title", how="left")
|
| 29 |
+
|
| 30 |
+
# 重命名字段
|
| 31 |
+
merged = merged.rename(columns={
|
| 32 |
+
"Id": "isbn10",
|
| 33 |
+
"Title": "title",
|
| 34 |
+
"authors": "authors",
|
| 35 |
+
"description": "description",
|
| 36 |
+
"image": "image",
|
| 37 |
+
"publisher": "publisher",
|
| 38 |
+
"publishedDate": "publishedDate",
|
| 39 |
+
"categories": "categories",
|
| 40 |
+
"review/score": "average_rating"
|
| 41 |
+
})
|
| 42 |
+
|
| 43 |
+
# 生成 isbn13(如有更复杂规则可补充,这里仅占位)
|
| 44 |
+
merged["isbn13"] = None # 可后续补充isbn13生成逻辑
|
| 45 |
+
|
| 46 |
+
# 保存新表,强制所有字段加引号,防止description等字段被截断
|
| 47 |
+
merged.to_csv("data/books_basic_info.csv", index=False, quoting=csv.QUOTE_ALL, quotechar='"', escapechar='\\')
|
| 48 |
+
print("已生成 data/books_basic_info.csv,包含基础书籍信息字段。")
|
scripts/chunk_reviews.py
ADDED
|
@@ -0,0 +1,103 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
Review Chunker Script
|
| 4 |
+
Splits review_highlights.txt into sentence-level chunks for Small-to-Big retrieval.
|
| 5 |
+
|
| 6 |
+
SOTA Reference: LlamaIndex Parent-Child Retrieval, RAPTOR (Sarthi et al., 2024)
|
| 7 |
+
"""
|
| 8 |
+
import json
|
| 9 |
+
import re
|
| 10 |
+
from pathlib import Path
|
| 11 |
+
from typing import List, Dict
|
| 12 |
+
|
| 13 |
+
# Simple sentence splitter (no external dependency)
|
| 14 |
+
def split_sentences(text: str) -> List[str]:
|
| 15 |
+
"""Split text into sentences using regex."""
|
| 16 |
+
# Handle common abbreviations
|
| 17 |
+
text = re.sub(r'(Mr|Mrs|Dr|Ms|Prof|Jr|Sr)\.', r'\1<DOT>', text)
|
| 18 |
+
# Split on sentence endings
|
| 19 |
+
sentences = re.split(r'(?<=[.!?])\s+', text)
|
| 20 |
+
# Restore abbreviations
|
| 21 |
+
sentences = [s.replace('<DOT>', '.') for s in sentences]
|
| 22 |
+
# Filter empty and very short sentences
|
| 23 |
+
return [s.strip() for s in sentences if len(s.strip()) > 20]
|
| 24 |
+
|
| 25 |
+
|
| 26 |
+
def chunk_reviews(input_path: str, output_path: str, min_chunk_len: int = 50, max_chunk_len: int = 300):
|
| 27 |
+
"""
|
| 28 |
+
Read review_highlights.txt and output sentence-level chunks with parent ISBN.
|
| 29 |
+
|
| 30 |
+
Format of input: "ISBN review_text" per line
|
| 31 |
+
Format of output: JSONL with {"text": "...", "parent_isbn": "..."}
|
| 32 |
+
"""
|
| 33 |
+
input_file = Path(input_path)
|
| 34 |
+
output_file = Path(output_path)
|
| 35 |
+
|
| 36 |
+
if not input_file.exists():
|
| 37 |
+
print(f"Error: {input_path} not found.")
|
| 38 |
+
return
|
| 39 |
+
|
| 40 |
+
chunks = []
|
| 41 |
+
total_reviews = 0
|
| 42 |
+
|
| 43 |
+
print(f"Reading reviews from {input_path}...")
|
| 44 |
+
|
| 45 |
+
with open(input_file, 'r', encoding='utf-8') as f:
|
| 46 |
+
for line in f:
|
| 47 |
+
line = line.strip()
|
| 48 |
+
if not line:
|
| 49 |
+
continue
|
| 50 |
+
|
| 51 |
+
# Parse: First token is ISBN, rest is review
|
| 52 |
+
parts = line.split(' ', 1)
|
| 53 |
+
if len(parts) < 2:
|
| 54 |
+
continue
|
| 55 |
+
|
| 56 |
+
isbn = parts[0].strip()
|
| 57 |
+
review = parts[1].strip()
|
| 58 |
+
total_reviews += 1
|
| 59 |
+
|
| 60 |
+
# Split into sentences
|
| 61 |
+
sentences = split_sentences(review)
|
| 62 |
+
|
| 63 |
+
# Create chunks (may combine very short sentences)
|
| 64 |
+
current_chunk = ""
|
| 65 |
+
for sent in sentences:
|
| 66 |
+
if len(current_chunk) + len(sent) < max_chunk_len:
|
| 67 |
+
current_chunk += " " + sent if current_chunk else sent
|
| 68 |
+
else:
|
| 69 |
+
# Save current chunk if long enough
|
| 70 |
+
if len(current_chunk) >= min_chunk_len:
|
| 71 |
+
chunks.append({
|
| 72 |
+
"text": current_chunk.strip(),
|
| 73 |
+
"parent_isbn": isbn
|
| 74 |
+
})
|
| 75 |
+
current_chunk = sent
|
| 76 |
+
|
| 77 |
+
# Don't forget the last chunk
|
| 78 |
+
if len(current_chunk) >= min_chunk_len:
|
| 79 |
+
chunks.append({
|
| 80 |
+
"text": current_chunk.strip(),
|
| 81 |
+
"parent_isbn": isbn
|
| 82 |
+
})
|
| 83 |
+
|
| 84 |
+
# Write output
|
| 85 |
+
output_file.parent.mkdir(parents=True, exist_ok=True)
|
| 86 |
+
with open(output_file, 'w', encoding='utf-8') as f:
|
| 87 |
+
for chunk in chunks:
|
| 88 |
+
f.write(json.dumps(chunk, ensure_ascii=False) + '\n')
|
| 89 |
+
|
| 90 |
+
print(f"Processed {total_reviews} reviews -> {len(chunks)} chunks")
|
| 91 |
+
print(f"Output written to {output_path}")
|
| 92 |
+
|
| 93 |
+
# Show sample
|
| 94 |
+
print("\n--- Sample Chunks ---")
|
| 95 |
+
for c in chunks[:3]:
|
| 96 |
+
print(f"[{c['parent_isbn']}] {c['text'][:80]}...")
|
| 97 |
+
|
| 98 |
+
|
| 99 |
+
if __name__ == "__main__":
|
| 100 |
+
chunk_reviews(
|
| 101 |
+
input_path="data/review_highlights.txt",
|
| 102 |
+
output_path="data/review_chunks.jsonl"
|
| 103 |
+
)
|
scripts/init_dual_index.py
ADDED
|
@@ -0,0 +1,71 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
Dual Index Initialization Script
|
| 4 |
+
Creates a separate ChromaDB collection for review chunks (Small-to-Big architecture).
|
| 5 |
+
|
| 6 |
+
SOTA Reference: LlamaIndex Parent-Child Retrieval
|
| 7 |
+
"""
|
| 8 |
+
import json
|
| 9 |
+
from pathlib import Path
|
| 10 |
+
from langchain_community.embeddings import HuggingFaceEmbeddings
|
| 11 |
+
from langchain_community.vectorstores import Chroma
|
| 12 |
+
from langchain_core.documents import Document
|
| 13 |
+
from tqdm import tqdm
|
| 14 |
+
|
| 15 |
+
CHUNK_PATH = "data/review_chunks.jsonl"
|
| 16 |
+
PERSIST_DIR = "data/chroma_chunks"
|
| 17 |
+
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
|
| 18 |
+
BATCH_SIZE = 5000
|
| 19 |
+
|
| 20 |
+
|
| 21 |
+
def load_chunks(path: str, limit: int = None):
|
| 22 |
+
"""Load chunks from JSONL file."""
|
| 23 |
+
chunks = []
|
| 24 |
+
with open(path, 'r', encoding='utf-8') as f:
|
| 25 |
+
for i, line in enumerate(f):
|
| 26 |
+
if limit and i >= limit:
|
| 27 |
+
break
|
| 28 |
+
data = json.loads(line)
|
| 29 |
+
doc = Document(
|
| 30 |
+
page_content=data["text"],
|
| 31 |
+
metadata={"parent_isbn": data["parent_isbn"]}
|
| 32 |
+
)
|
| 33 |
+
chunks.append(doc)
|
| 34 |
+
return chunks
|
| 35 |
+
|
| 36 |
+
|
| 37 |
+
def init_chunk_index():
|
| 38 |
+
"""Initialize the chunk-level ChromaDB index."""
|
| 39 |
+
print(f"Loading embedding model: {EMBEDDING_MODEL}")
|
| 40 |
+
embeddings = HuggingFaceEmbeddings(
|
| 41 |
+
model_name=EMBEDDING_MODEL,
|
| 42 |
+
model_kwargs={"device": "mps"}, # Use Metal on Mac
|
| 43 |
+
encode_kwargs={"normalize_embeddings": True}
|
| 44 |
+
)
|
| 45 |
+
|
| 46 |
+
print(f"Loading chunks from {CHUNK_PATH}...")
|
| 47 |
+
chunks = load_chunks(CHUNK_PATH)
|
| 48 |
+
print(f"Loaded {len(chunks)} chunks")
|
| 49 |
+
|
| 50 |
+
# Create index in batches
|
| 51 |
+
print(f"Creating ChromaDB index at {PERSIST_DIR}...")
|
| 52 |
+
|
| 53 |
+
# First batch creates the collection
|
| 54 |
+
db = Chroma.from_documents(
|
| 55 |
+
documents=chunks[:BATCH_SIZE],
|
| 56 |
+
embedding=embeddings,
|
| 57 |
+
persist_directory=PERSIST_DIR,
|
| 58 |
+
collection_name="review_chunks"
|
| 59 |
+
)
|
| 60 |
+
|
| 61 |
+
# Add remaining in batches
|
| 62 |
+
for i in tqdm(range(BATCH_SIZE, len(chunks), BATCH_SIZE), desc="Indexing"):
|
| 63 |
+
batch = chunks[i:i+BATCH_SIZE]
|
| 64 |
+
db.add_documents(batch)
|
| 65 |
+
|
| 66 |
+
print(f"Index created with {len(chunks)} chunks.")
|
| 67 |
+
print(f"Persisted to {PERSIST_DIR}")
|
| 68 |
+
|
| 69 |
+
|
| 70 |
+
if __name__ == "__main__":
|
| 71 |
+
init_chunk_index()
|
scripts/test_rag.py
ADDED
|
@@ -0,0 +1,35 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
|
| 2 |
+
import asyncio
|
| 3 |
+
import sys
|
| 4 |
+
from pathlib import Path
|
| 5 |
+
|
| 6 |
+
# Add project root
|
| 7 |
+
sys.path.append(str(Path(__file__).parent.parent))
|
| 8 |
+
|
| 9 |
+
from src.services.chat_service import chat_service
|
| 10 |
+
|
| 11 |
+
async def main():
|
| 12 |
+
isbn = "0001047604" # Aurora Leigh
|
| 13 |
+
query = "What is the emotional tone of this book?"
|
| 14 |
+
|
| 15 |
+
print(f"Testing ChatService with ISBN={isbn}, Query='{query}'...")
|
| 16 |
+
print("-" * 50)
|
| 17 |
+
|
| 18 |
+
try:
|
| 19 |
+
# Use 'mock' provider to test flow without key
|
| 20 |
+
async for chunk in chat_service.chat_stream(
|
| 21 |
+
isbn=isbn,
|
| 22 |
+
user_query=query,
|
| 23 |
+
provider="mock"
|
| 24 |
+
):
|
| 25 |
+
print(chunk, end="", flush=True)
|
| 26 |
+
print("\n" + "-" * 50)
|
| 27 |
+
print("✅ Test Completed Successfully!")
|
| 28 |
+
|
| 29 |
+
except Exception as e:
|
| 30 |
+
print(f"\n❌ Test Failed: {e}")
|
| 31 |
+
import traceback
|
| 32 |
+
traceback.print_exc()
|
| 33 |
+
|
| 34 |
+
if __name__ == "__main__":
|
| 35 |
+
asyncio.run(main())
|
scripts/verify_env.py
ADDED
|
@@ -0,0 +1,61 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
|
| 2 |
+
import sys
|
| 3 |
+
import os
|
| 4 |
+
import platform
|
| 5 |
+
import time
|
| 6 |
+
from pathlib import Path
|
| 7 |
+
|
| 8 |
+
# Add project root to path
|
| 9 |
+
sys.path.append(str(Path(__file__).parent.parent))
|
| 10 |
+
|
| 11 |
+
def print_status(component, status, message=""):
|
| 12 |
+
color = "\033[92m" if status == "OK" else "\033[91m"
|
| 13 |
+
reset = "\033[0m"
|
| 14 |
+
print(f"[{component.ljust(15)}] {color}{status}{reset} {message}")
|
| 15 |
+
|
| 16 |
+
def check_system():
|
| 17 |
+
print("\n=== System Info ===")
|
| 18 |
+
print(f"Python: {sys.version.split()[0]}")
|
| 19 |
+
print(f"OS: {platform.system()} {platform.release()}")
|
| 20 |
+
print_status("System", "OK")
|
| 21 |
+
|
| 22 |
+
def check_torch():
|
| 23 |
+
print("\n=== PyTorch Check ===")
|
| 24 |
+
try:
|
| 25 |
+
import torch
|
| 26 |
+
print(f"Torch Version: {torch.__version__}")
|
| 27 |
+
if torch.backends.mps.is_available():
|
| 28 |
+
print_status("Accelerator", "OK", "MPS (Metal Performance Shaders) is available! 🚀")
|
| 29 |
+
elif torch.cuda.is_available():
|
| 30 |
+
print_status("Accelerator", "OK", f"CUDA is available! ({torch.cuda.get_device_name(0)})")
|
| 31 |
+
else:
|
| 32 |
+
print_status("Accelerator", "WARN", "Running on CPU (Slow but safe)")
|
| 33 |
+
except ImportError:
|
| 34 |
+
print_status("PyTorch", "FAIL", "Not installed")
|
| 35 |
+
|
| 36 |
+
def check_vector_db():
|
| 37 |
+
print("\n=== Vector Database Check ===")
|
| 38 |
+
try:
|
| 39 |
+
from src.vector_db import VectorDB
|
| 40 |
+
|
| 41 |
+
start = time.perf_counter()
|
| 42 |
+
print("Loading VectorDB (this loads embeddings)...")
|
| 43 |
+
vdb = VectorDB()
|
| 44 |
+
print(f"Load Time: {time.perf_counter() - start:.2f}s")
|
| 45 |
+
|
| 46 |
+
# Test Query
|
| 47 |
+
test_q = "fantasy"
|
| 48 |
+
results = vdb.search(test_q, k=1)
|
| 49 |
+
if results:
|
| 50 |
+
print_status("ChromaDB", "OK", f"Query '{test_q}' returned: {results[0].page_content[:50]}...")
|
| 51 |
+
else:
|
| 52 |
+
print_status("ChromaDB", "WARN", "Database loaded but returned empty results")
|
| 53 |
+
|
| 54 |
+
except Exception as e:
|
| 55 |
+
print_status("ChromaDB", "FAIL", str(e))
|
| 56 |
+
print("Tip: Ensure 'data/chroma_db' exists.")
|
| 57 |
+
|
| 58 |
+
if __name__ == "__main__":
|
| 59 |
+
check_system()
|
| 60 |
+
check_torch()
|
| 61 |
+
check_vector_db()
|
src/api/chat.py
ADDED
|
@@ -0,0 +1,50 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
from fastapi import APIRouter, Header, HTTPException, Depends
|
| 2 |
+
from fastapi.responses import StreamingResponse
|
| 3 |
+
from pydantic import BaseModel
|
| 4 |
+
from typing import Optional
|
| 5 |
+
|
| 6 |
+
from src.services.chat_service import chat_service
|
| 7 |
+
from src.utils import setup_logger
|
| 8 |
+
|
| 9 |
+
logger = setup_logger(__name__)
|
| 10 |
+
|
| 11 |
+
router = APIRouter(prefix="/chat", tags=["Chat"])
|
| 12 |
+
|
| 13 |
+
class ChatRequest(BaseModel):
|
| 14 |
+
isbn: str
|
| 15 |
+
query: str
|
| 16 |
+
user_id: Optional[str] = "local"
|
| 17 |
+
provider: Optional[str] = "openai" # openai, ollama
|
| 18 |
+
|
| 19 |
+
async def get_llm_key(x_llm_key: Optional[str] = Header(None, alias="X-LLM-Key")):
|
| 20 |
+
"""Dependency to extract API Key from header."""
|
| 21 |
+
# For Ollama, key is optional. For OpenAI, it's required (enforced by LLMFactory).
|
| 22 |
+
return x_llm_key
|
| 23 |
+
|
| 24 |
+
@router.post("/completions")
|
| 25 |
+
async def chat_completions(
|
| 26 |
+
request: ChatRequest,
|
| 27 |
+
api_key: Optional[str] = Depends(get_llm_key)
|
| 28 |
+
):
|
| 29 |
+
"""
|
| 30 |
+
Stream chat response for a book using RAG + LLM.
|
| 31 |
+
Requires 'X-LLM-Key' header for OpenAI.
|
| 32 |
+
"""
|
| 33 |
+
logger.info(f"Chat request: isbn={request.isbn}, query='{request.query}', provider={request.provider}")
|
| 34 |
+
|
| 35 |
+
# Check if provider is openai and key is missing
|
| 36 |
+
if request.provider == "openai" and not api_key:
|
| 37 |
+
# Check env var fallback inside service/factory, but good to warn here?
|
| 38 |
+
# LLMFactory checks env var too. So we pass None and let it fail if needed.
|
| 39 |
+
pass
|
| 40 |
+
|
| 41 |
+
return StreamingResponse(
|
| 42 |
+
chat_service.chat_stream(
|
| 43 |
+
isbn=request.isbn,
|
| 44 |
+
user_query=request.query,
|
| 45 |
+
user_id=request.user_id,
|
| 46 |
+
api_key=api_key,
|
| 47 |
+
provider=request.provider
|
| 48 |
+
),
|
| 49 |
+
media_type="text/plain"
|
| 50 |
+
)
|
src/config.py
CHANGED
|
@@ -8,10 +8,12 @@ load_dotenv()
|
|
| 8 |
# Project Root
|
| 9 |
PROJECT_ROOT = Path(__file__).parent.parent.absolute()
|
| 10 |
|
|
|
|
| 11 |
# Data Paths
|
| 12 |
DATA_DIR = PROJECT_ROOT / "data"
|
|
|
|
| 13 |
BOOKS_CSV = DATA_DIR / "books_with_emotions.csv"
|
| 14 |
-
|
| 15 |
CHROMA_DB_DIR = DATA_DIR / "chroma_db"
|
| 16 |
|
| 17 |
# Assets
|
|
|
|
| 8 |
# Project Root
|
| 9 |
PROJECT_ROOT = Path(__file__).parent.parent.absolute()
|
| 10 |
|
| 11 |
+
# Data Paths
|
| 12 |
# Data Paths
|
| 13 |
DATA_DIR = PROJECT_ROOT / "data"
|
| 14 |
+
PROCESSED_DATA_DIR = DATA_DIR # Alias for clearer intent
|
| 15 |
BOOKS_CSV = DATA_DIR / "books_with_emotions.csv"
|
| 16 |
+
REVIEW_HIGHLIGHTS_TXT = DATA_DIR / "review_highlights.txt"
|
| 17 |
CHROMA_DB_DIR = DATA_DIR / "chroma_db"
|
| 18 |
|
| 19 |
# Assets
|
src/core/context_compressor.py
ADDED
|
@@ -0,0 +1,89 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
from typing import List, Any
|
| 2 |
+
from langchain_core.messages import BaseMessage, SystemMessage, HumanMessage, AIMessage
|
| 3 |
+
from src.core.llm import LLMFactory
|
| 4 |
+
from src.utils import setup_logger
|
| 5 |
+
|
| 6 |
+
logger = setup_logger(__name__)
|
| 7 |
+
|
| 8 |
+
class ContextCompressor:
|
| 9 |
+
"""
|
| 10 |
+
Service to compress RAG context and Conversation History.
|
| 11 |
+
Reduces token usage and 'Lost in the Middle' phenomenon.
|
| 12 |
+
"""
|
| 13 |
+
|
| 14 |
+
def __init__(self):
|
| 15 |
+
# We use a cheaper/faster model for summarization if possible
|
| 16 |
+
# For now, we reuse the default provider from LLMFactory
|
| 17 |
+
pass
|
| 18 |
+
|
| 19 |
+
async def compress_history(self, history: List[BaseMessage], max_token_limit: int = 2000) -> List[BaseMessage]:
|
| 20 |
+
"""
|
| 21 |
+
Compress conversation history if it exceeds limits.
|
| 22 |
+
Strategy: Keep last N messages raw, summarize the rest.
|
| 23 |
+
"""
|
| 24 |
+
# Simple heuristic: If history > 10 messages, summarize the oldest ones
|
| 25 |
+
if len(history) <= 6:
|
| 26 |
+
return history
|
| 27 |
+
|
| 28 |
+
# Keep last 4 messages (2 turns) intact
|
| 29 |
+
recent_history = history[-4:]
|
| 30 |
+
older_history = history[:-4]
|
| 31 |
+
|
| 32 |
+
# If older history is small, just return (avoid unnecessary summarization calls)
|
| 33 |
+
if len(older_history) < 2:
|
| 34 |
+
return history
|
| 35 |
+
|
| 36 |
+
logger.info(f"Compressing history: {len(history)} messages -> Summary + 4 recent")
|
| 37 |
+
|
| 38 |
+
try:
|
| 39 |
+
summary = await self._summarize_messages(older_history)
|
| 40 |
+
return [SystemMessage(content=f"Previous Conversation Summary: {summary}")] + recent_history
|
| 41 |
+
except Exception as e:
|
| 42 |
+
logger.error(f"History compression failed: {e}")
|
| 43 |
+
return history # Fallback: return full history (or could slice)
|
| 44 |
+
|
| 45 |
+
async def _summarize_messages(self, messages: List[BaseMessage]) -> str:
|
| 46 |
+
"""Use LLM to summarize a list of messages."""
|
| 47 |
+
conversation_text = ""
|
| 48 |
+
for msg in messages:
|
| 49 |
+
role = "User" if isinstance(msg, HumanMessage) else "AI"
|
| 50 |
+
conversation_text += f"{role}: {msg.content}\n"
|
| 51 |
+
|
| 52 |
+
prompt = (
|
| 53 |
+
"Summarize the following conversation concisely, focusing on key user preferences and questions. "
|
| 54 |
+
"Do not lose important details.\n\n"
|
| 55 |
+
f"{conversation_text}"
|
| 56 |
+
)
|
| 57 |
+
|
| 58 |
+
# Use simple mock if running in test environment/benchmark without keys
|
| 59 |
+
try:
|
| 60 |
+
llm = LLMFactory.create(temperature=0.3)
|
| 61 |
+
except:
|
| 62 |
+
# Fallback to mock for stability if env is not set
|
| 63 |
+
llm = LLMFactory.create(provider="mock")
|
| 64 |
+
|
| 65 |
+
response = llm.invoke([HumanMessage(content=prompt)])
|
| 66 |
+
return response.content
|
| 67 |
+
|
| 68 |
+
def format_docs(self, docs: List[Any], max_len_per_doc: int = 500) -> str:
|
| 69 |
+
"""
|
| 70 |
+
Format retrieved documents for the LLM Prompt.
|
| 71 |
+
Truncates content to avoid context overflow.
|
| 72 |
+
"""
|
| 73 |
+
formatted = ""
|
| 74 |
+
for i, doc in enumerate(docs):
|
| 75 |
+
content = doc.page_content.replace("\n", " ")
|
| 76 |
+
if len(content) > max_len_per_doc:
|
| 77 |
+
content = content[:max_len_per_doc] + "..."
|
| 78 |
+
|
| 79 |
+
# Add Relevance Score if available (from Reranker)
|
| 80 |
+
score_info = ""
|
| 81 |
+
if doc.metadata and "relevance_score" in doc.metadata:
|
| 82 |
+
score = doc.metadata["relevance_score"]
|
| 83 |
+
score_info = f" (Relevance: {score:.2f})"
|
| 84 |
+
|
| 85 |
+
formatted += f"[{i+1}] {content}{score_info}\n"
|
| 86 |
+
return formatted
|
| 87 |
+
|
| 88 |
+
# Singleton
|
| 89 |
+
compressor = ContextCompressor()
|
src/core/llm.py
ADDED
|
@@ -0,0 +1,78 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
from typing import Optional, Literal
|
| 2 |
+
from langchain_core.language_models import BaseChatModel
|
| 3 |
+
from langchain_openai import ChatOpenAI
|
| 4 |
+
from langchain_community.chat_models import ChatOllama
|
| 5 |
+
from pydantic import SecretStr
|
| 6 |
+
|
| 7 |
+
from src.utils import setup_logger
|
| 8 |
+
|
| 9 |
+
logger = setup_logger(__name__)
|
| 10 |
+
|
| 11 |
+
class LLMFactory:
|
| 12 |
+
"""
|
| 13 |
+
Factory to create LLM instances based on provider and API key.
|
| 14 |
+
Supports 'Bring Your Own Key' (BYOK) architecture.
|
| 15 |
+
"""
|
| 16 |
+
|
| 17 |
+
@staticmethod
|
| 18 |
+
def create(
|
| 19 |
+
provider: Literal["openai", "ollama", "mock"] = "openai",
|
| 20 |
+
api_key: Optional[str] = None,
|
| 21 |
+
model_name: Optional[str] = None,
|
| 22 |
+
temperature: float = 0.7
|
| 23 |
+
) -> BaseChatModel:
|
| 24 |
+
"""
|
| 25 |
+
Create and return a configured LangChain Chat Model.
|
| 26 |
+
"""
|
| 27 |
+
logger.info(f"Creating LLM instance: provider={provider}, model={model_name}")
|
| 28 |
+
|
| 29 |
+
if provider == "mock":
|
| 30 |
+
from langchain_community.chat_models import FakeListChatModel
|
| 31 |
+
return FakeListChatModel(responses=[
|
| 32 |
+
"This is a MOCKED response from the RAG Agent.",
|
| 33 |
+
"I found the book 'Aurora Leigh' to be quite fascinating based on the description!",
|
| 34 |
+
"It fits your persona of liking Victorian literature."
|
| 35 |
+
])
|
| 36 |
+
|
| 37 |
+
if provider == "openai":
|
| 38 |
+
if not model_name:
|
| 39 |
+
model_name = "gpt-3.5-turbo"
|
| 40 |
+
|
| 41 |
+
if not api_key:
|
| 42 |
+
# Fallback to env var if not provided (for dev convenience)
|
| 43 |
+
import os
|
| 44 |
+
api_key = os.getenv("OPENAI_API_KEY")
|
| 45 |
+
|
| 46 |
+
if not api_key:
|
| 47 |
+
raise ValueError("OpenAI API Key is required for 'openai' provider.")
|
| 48 |
+
|
| 49 |
+
return ChatOpenAI(
|
| 50 |
+
api_key=SecretStr(api_key),
|
| 51 |
+
model_name=model_name,
|
| 52 |
+
temperature=temperature,
|
| 53 |
+
streaming=True # Support streaming by default
|
| 54 |
+
)
|
| 55 |
+
|
| 56 |
+
elif provider == "ollama":
|
| 57 |
+
# Ollama usually runs locally on default port 11434
|
| 58 |
+
if not model_name:
|
| 59 |
+
model_name = "llama3" # Default for Ollama
|
| 60 |
+
|
| 61 |
+
return ChatOllama(
|
| 62 |
+
model=model_name,
|
| 63 |
+
temperature=temperature,
|
| 64 |
+
)
|
| 65 |
+
|
| 66 |
+
else:
|
| 67 |
+
raise ValueError(f"Unsupported LLM provider: {provider}")
|
| 68 |
+
|
| 69 |
+
def get_llm_model(
|
| 70 |
+
provider: str = "openai",
|
| 71 |
+
api_key: Optional[str] = None
|
| 72 |
+
) -> BaseChatModel:
|
| 73 |
+
"""Helper for dependency injection or simple usage."""
|
| 74 |
+
try:
|
| 75 |
+
return LLMFactory.create(provider=provider, api_key=api_key)
|
| 76 |
+
except Exception as e:
|
| 77 |
+
logger.error(f"Failed to create LLM: {e}")
|
| 78 |
+
raise
|
src/core/reranker.py
ADDED
|
@@ -0,0 +1,104 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
from typing import List, Tuple, Dict, Any
|
| 2 |
+
from sentence_transformers import CrossEncoder
|
| 3 |
+
import torch
|
| 4 |
+
from src.utils import setup_logger
|
| 5 |
+
|
| 6 |
+
logger = setup_logger(__name__)
|
| 7 |
+
|
| 8 |
+
# 轻量级重排序模型,速度快且效果不错
|
| 9 |
+
DEFAULT_RERANKER_MODEL = "cross-encoder/ms-marco-MiniLM-L-6-v2"
|
| 10 |
+
|
| 11 |
+
class RerankerService:
|
| 12 |
+
"""
|
| 13 |
+
Singleton service for re-ranking documents using a Cross-Encoder.
|
| 14 |
+
This significantly improves RAG precision by scoring the exact relevance
|
| 15 |
+
of (query, document) pairs.
|
| 16 |
+
"""
|
| 17 |
+
_instance = None
|
| 18 |
+
|
| 19 |
+
def __new__(cls):
|
| 20 |
+
if cls._instance is None:
|
| 21 |
+
cls._instance = super(RerankerService, cls).__new__(cls)
|
| 22 |
+
cls._instance.model = None
|
| 23 |
+
return cls._instance
|
| 24 |
+
|
| 25 |
+
def __init__(self):
|
| 26 |
+
if self.model is None:
|
| 27 |
+
self._load_model()
|
| 28 |
+
|
| 29 |
+
def _load_model(self):
|
| 30 |
+
try:
|
| 31 |
+
device = "mps" if torch.backends.mps.is_available() else "cpu"
|
| 32 |
+
logger.info(f"Loading Reranker model: {DEFAULT_RERANKER_MODEL} on {device}...")
|
| 33 |
+
self.model = CrossEncoder(DEFAULT_RERANKER_MODEL, device=device)
|
| 34 |
+
logger.info("Reranker model loaded.")
|
| 35 |
+
except Exception as e:
|
| 36 |
+
logger.error(f"Failed to load Reranker: {e}")
|
| 37 |
+
self.model = None
|
| 38 |
+
|
| 39 |
+
def rerank(self, query: str, docs: List[Dict[str, Any]], top_k: int = 5) -> List[Dict[str, Any]]:
|
| 40 |
+
"""
|
| 41 |
+
Rerank a list of documents based on relevance to the query.
|
| 42 |
+
|
| 43 |
+
Args:
|
| 44 |
+
query: User question
|
| 45 |
+
docs: List of dicts, each must have a 'content' field (or 'description')
|
| 46 |
+
top_k: Number of results to return
|
| 47 |
+
|
| 48 |
+
Returns:
|
| 49 |
+
Top-K sorted documents with added 'score' field.
|
| 50 |
+
"""
|
| 51 |
+
if not self.model or not docs:
|
| 52 |
+
return docs[:top_k]
|
| 53 |
+
|
| 54 |
+
# Prepare pairs for Cross-Encoder: [[query, doc1], [query, doc2], ...]
|
| 55 |
+
# We assume 'description' or 'page_content' holds the text
|
| 56 |
+
pairs = []
|
| 57 |
+
valid_docs = []
|
| 58 |
+
|
| 59 |
+
for doc in docs:
|
| 60 |
+
# Handle LangChain Document object
|
| 61 |
+
if hasattr(doc, "page_content"):
|
| 62 |
+
text = doc.page_content
|
| 63 |
+
# Handle Dict
|
| 64 |
+
else:
|
| 65 |
+
text = doc.get("description") or doc.get("page_content") or str(doc)
|
| 66 |
+
|
| 67 |
+
pairs.append([query, text])
|
| 68 |
+
valid_docs.append(doc)
|
| 69 |
+
|
| 70 |
+
if not pairs:
|
| 71 |
+
return docs[:top_k]
|
| 72 |
+
|
| 73 |
+
# Predict scores
|
| 74 |
+
scores = self.model.predict(pairs)
|
| 75 |
+
|
| 76 |
+
# Attach scores and sort
|
| 77 |
+
scored_results = []
|
| 78 |
+
for i, doc in enumerate(valid_docs):
|
| 79 |
+
score = float(scores[i])
|
| 80 |
+
if hasattr(doc, "metadata"):
|
| 81 |
+
# Handle Document
|
| 82 |
+
# Create a shallow copy to avoid mutating original if needed,
|
| 83 |
+
# but simplistic approach is fine here
|
| 84 |
+
doc.metadata["relevance_score"] = score
|
| 85 |
+
scored_results.append(doc)
|
| 86 |
+
else:
|
| 87 |
+
# Handle Dict
|
| 88 |
+
doc_copy = doc.copy()
|
| 89 |
+
doc_copy["score"] = score
|
| 90 |
+
scored_results.append(doc_copy)
|
| 91 |
+
|
| 92 |
+
# Sort descending by score
|
| 93 |
+
# Sort descending by score
|
| 94 |
+
def get_score(doc):
|
| 95 |
+
if hasattr(doc, "metadata"):
|
| 96 |
+
return doc.metadata.get("relevance_score", 0)
|
| 97 |
+
return doc.get("score", 0)
|
| 98 |
+
|
| 99 |
+
scored_results.sort(key=get_score, reverse=True)
|
| 100 |
+
|
| 101 |
+
return scored_results[:top_k]
|
| 102 |
+
|
| 103 |
+
# Global instance
|
| 104 |
+
reranker = RerankerService()
|
src/core/router.py
ADDED
|
@@ -0,0 +1,86 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import re
|
| 2 |
+
from typing import Dict, Any, List
|
| 3 |
+
from src.utils import setup_logger
|
| 4 |
+
|
| 5 |
+
logger = setup_logger(__name__)
|
| 6 |
+
|
| 7 |
+
class QueryRouter:
|
| 8 |
+
"""
|
| 9 |
+
Intelligent Router for the RAG Pipeline.
|
| 10 |
+
Classifies user queries to select the optimal retrieval strategy.
|
| 11 |
+
|
| 12 |
+
Strategies:
|
| 13 |
+
1. EXACT (ISBN/ID) -> Pure BM25 (High Precision, No Rerank noise).
|
| 14 |
+
2. FAST (Keywords) -> Hybrid (RRF), No Rerank (Low Latency).
|
| 15 |
+
3. DEEP (Complex) -> Hybrid + Rerank (High Latency, High contextual relevance).
|
| 16 |
+
"""
|
| 17 |
+
|
| 18 |
+
def __init__(self):
|
| 19 |
+
# Regex for ISBN-10 and ISBN-13
|
| 20 |
+
self.isbn_pattern = re.compile(r'^(?:\d{9}[\dX]|\d{13})$')
|
| 21 |
+
|
| 22 |
+
def route(self, query: str) -> Dict[str, Any]:
|
| 23 |
+
"""
|
| 24 |
+
Analyze query and return retrieval parameters.
|
| 25 |
+
Returns dict with: 'strategy', 'hybrid_alpha', 'rerank'
|
| 26 |
+
"""
|
| 27 |
+
cleaned_query = query.strip()
|
| 28 |
+
words = cleaned_query.split()
|
| 29 |
+
|
| 30 |
+
# 1. Check for ISBN (Exact Match)
|
| 31 |
+
# Remove hyphens/spaces for check
|
| 32 |
+
normalized = cleaned_query.replace("-", "").replace(" ", "")
|
| 33 |
+
if self.isbn_pattern.match(normalized):
|
| 34 |
+
logger.info(f"Router: Detected ISBN -> EXACT Strategy ({normalized})")
|
| 35 |
+
return {
|
| 36 |
+
"strategy": "exact",
|
| 37 |
+
"alpha": 1.0, # Pure BM25 (1.0 = All Sparse in our hybrid impl?)
|
| 38 |
+
# Wait, hybrid implementation uses alpha for weighting?
|
| 39 |
+
# Actually our hybrid_search doesn't use alpha for weight mixing in the implementation I wrote.
|
| 40 |
+
# It sums ranks. But let's verify vector_db.py logic.
|
| 41 |
+
# Actually, standard RRF sums 1/(k+rank). To prioritize BM25, we might need a different call.
|
| 42 |
+
# For now, let's assume we want standard Hybrid but NO Rerank.
|
| 43 |
+
# Or better: If ISBN, just use BM25 manually if possible.
|
| 44 |
+
# But hybrid_search is fine if we skip reranking.
|
| 45 |
+
"rerank": False,
|
| 46 |
+
"k_final": 5
|
| 47 |
+
}
|
| 48 |
+
|
| 49 |
+
# 2. Check for Temporal Keywords (Freshness Bias)
|
| 50 |
+
temporal_keywords = {"new", "newest", "latest", "recent", "modern", "contemporary", "2020", "2021", "2022", "2023", "2024", "2025"}
|
| 51 |
+
is_temporal = any(word.lower() in temporal_keywords for word in words)
|
| 52 |
+
|
| 53 |
+
# 3. Check for Detail-Oriented Queries (Triggers Small-to-Big)
|
| 54 |
+
# These are queries asking about specific plot points, reactions, or hidden details
|
| 55 |
+
detail_keywords = {"twist", "ending", "spoiler", "readers", "felt", "cried", "hated", "loved",
|
| 56 |
+
"review", "opinion", "think", "unreliable", "narrator", "realize", "find out"}
|
| 57 |
+
is_detail = any(word.lower() in detail_keywords for word in words)
|
| 58 |
+
|
| 59 |
+
if is_detail:
|
| 60 |
+
logger.info(f"Router: Detected Detail Query -> SMALL_TO_BIG Strategy")
|
| 61 |
+
return {
|
| 62 |
+
"strategy": "small_to_big",
|
| 63 |
+
"rerank": False, # Small-to-Big already does precision matching
|
| 64 |
+
"k_final": 5,
|
| 65 |
+
"temporal": is_temporal
|
| 66 |
+
}
|
| 67 |
+
|
| 68 |
+
# 4. Check for Simple Keyword Search (Short queries)
|
| 69 |
+
if len(words) <= 2:
|
| 70 |
+
logger.info(f"Router: Detected Keyword -> FAST Strategy (Temporal={is_temporal})")
|
| 71 |
+
return {
|
| 72 |
+
"strategy": "fast",
|
| 73 |
+
"rerank": False, # Skip expensive rerank
|
| 74 |
+
"k_final": 5,
|
| 75 |
+
"temporal": is_temporal
|
| 76 |
+
}
|
| 77 |
+
|
| 78 |
+
# 5. Default to Deep Search
|
| 79 |
+
logger.info(f"Router: Detected Natural Language -> DEEP Strategy (Temporal={is_temporal})")
|
| 80 |
+
return {
|
| 81 |
+
"strategy": "deep",
|
| 82 |
+
"rerank": True,
|
| 83 |
+
"k_final": 10,
|
| 84 |
+
"temporal": is_temporal
|
| 85 |
+
}
|
| 86 |
+
|
src/core/temporal.py
ADDED
|
@@ -0,0 +1,106 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
from typing import Dict, Any, List
|
| 2 |
+
from datetime import datetime
|
| 3 |
+
import math
|
| 4 |
+
from src.utils import setup_logger
|
| 5 |
+
|
| 6 |
+
logger = setup_logger(__name__)
|
| 7 |
+
|
| 8 |
+
class TemporalRanker:
|
| 9 |
+
"""
|
| 10 |
+
Applies Time Decay to search results.
|
| 11 |
+
Boosts newer documents based on 'publishedDate'.
|
| 12 |
+
"""
|
| 13 |
+
|
| 14 |
+
def __init__(self):
|
| 15 |
+
self.current_year = datetime.now().year
|
| 16 |
+
|
| 17 |
+
def parse_year(self, date_str: Any) -> int:
|
| 18 |
+
"""Robustly extract year from various date formats."""
|
| 19 |
+
if not date_str:
|
| 20 |
+
return 0
|
| 21 |
+
try:
|
| 22 |
+
s = str(date_str).strip()
|
| 23 |
+
# Handle "2005-01-01" or "1999"
|
| 24 |
+
if len(s) >= 4 and s[:4].isdigit():
|
| 25 |
+
return int(s[:4])
|
| 26 |
+
except:
|
| 27 |
+
pass
|
| 28 |
+
return 0
|
| 29 |
+
|
| 30 |
+
def apply_decay(
|
| 31 |
+
self,
|
| 32 |
+
docs: List[Any],
|
| 33 |
+
pub_year_map: Dict[str, int],
|
| 34 |
+
boost_factor: float = 0.25
|
| 35 |
+
) -> List[Any]:
|
| 36 |
+
"""
|
| 37 |
+
Boost scores of newer books.
|
| 38 |
+
New Score = Old Score * (1 + boost_factor * recency_weight)
|
| 39 |
+
Recency Weight = 1 / log(Age + 2) (Soft decay)
|
| 40 |
+
"""
|
| 41 |
+
boosted_docs = []
|
| 42 |
+
|
| 43 |
+
for doc in docs:
|
| 44 |
+
# 1. robust ID extraction (as per recommender.py)
|
| 45 |
+
isbn = None
|
| 46 |
+
if doc.metadata and 'isbn' in doc.metadata and doc.metadata['isbn']:
|
| 47 |
+
isbn = str(doc.metadata['isbn'])
|
| 48 |
+
elif doc.metadata and 'isbn13' in doc.metadata and doc.metadata['isbn13']:
|
| 49 |
+
isbn = str(doc.metadata['isbn13'])
|
| 50 |
+
elif "ISBN:" in doc.page_content:
|
| 51 |
+
try:
|
| 52 |
+
isbn = doc.page_content.split("ISBN:")[1].strip().split()[0]
|
| 53 |
+
except:
|
| 54 |
+
pass
|
| 55 |
+
if not isbn:
|
| 56 |
+
isbn = doc.page_content.strip().split()[0]
|
| 57 |
+
|
| 58 |
+
# 2. Get Year
|
| 59 |
+
pub_year = pub_year_map.get(isbn, 0)
|
| 60 |
+
|
| 61 |
+
# 3. Calculate Boost
|
| 62 |
+
multiplier = 1.0
|
| 63 |
+
if pub_year > 1900: # Valid year
|
| 64 |
+
age = max(0, self.current_year - pub_year)
|
| 65 |
+
# Exponential Decay: weight = 1 / (1 + age/5)
|
| 66 |
+
# Or Linear-ish:
|
| 67 |
+
recency_weight = 1.0 / math.log(age + 2.718) # log(e)=1 at age 0
|
| 68 |
+
|
| 69 |
+
# If very old, weight is small. If new (age 0), weight is ~1.
|
| 70 |
+
multiplier = 1 + (boost_factor * recency_weight)
|
| 71 |
+
|
| 72 |
+
# 4. Update Score (if exists) or create it
|
| 73 |
+
# Note: doc.metadata['relevance_score'] usually comes from Reranker (Cross-Encoder)
|
| 74 |
+
# which can be negative (logit).
|
| 75 |
+
# If we don't have a score, we assume 1.0 baseline?
|
| 76 |
+
# Actually, usually we do this AFTER reranker or fusion.
|
| 77 |
+
|
| 78 |
+
# Handling negative logits from reranker:
|
| 79 |
+
# If score is -2.0, boosting simply by multiplication might flip sign or be weird.
|
| 80 |
+
# Best practice: Additive boost to logit.
|
| 81 |
+
# Score += (boost_factor * recency_weight)
|
| 82 |
+
|
| 83 |
+
if doc.metadata and "relevance_score" in doc.metadata:
|
| 84 |
+
original_score = doc.metadata["relevance_score"]
|
| 85 |
+
# Additive boost for logits
|
| 86 |
+
# e.g. -2.0 + (5.0 * 1.0) = 3.0 (Huge boost for new stuff)
|
| 87 |
+
# Let's be conservative: Boost = 2.0 max
|
| 88 |
+
boost = 2.0 * recency_weight if pub_year > 1900 else 0
|
| 89 |
+
doc.metadata["relevance_score"] = original_score + boost
|
| 90 |
+
doc.metadata["year"] = pub_year # Debug info
|
| 91 |
+
else:
|
| 92 |
+
# If no score (Hybrid only), maybe just add a dummy field?
|
| 93 |
+
# RRF doesn't have a 'score' field on the doc object usually, it has rank.
|
| 94 |
+
# Maybe we only apply this if Reranker was used.
|
| 95 |
+
pass
|
| 96 |
+
|
| 97 |
+
boosted_docs.append(doc)
|
| 98 |
+
|
| 99 |
+
# Resort
|
| 100 |
+
boosted_docs.sort(
|
| 101 |
+
key=lambda x: x.metadata.get("relevance_score", 0) if x.metadata else 0,
|
| 102 |
+
reverse=True
|
| 103 |
+
)
|
| 104 |
+
return boosted_docs
|
| 105 |
+
|
| 106 |
+
temporal_ranker = TemporalRanker()
|
src/cover_fetcher.py
CHANGED
|
@@ -30,15 +30,16 @@ PROJECT_ROOT = Path(__file__).resolve().parent.parent
|
|
| 30 |
PLACEHOLDER_COVER = str(PROJECT_ROOT / "assets" / "cover-not-found.jpg")
|
| 31 |
|
| 32 |
@lru_cache(maxsize=1000)
|
| 33 |
-
def fetch_book_cover(isbn: str, title: str = "") -> tuple[str, str]:
|
| 34 |
"""
|
| 35 |
-
Fetch book cover URL (Google Books -> Open Library) and
|
| 36 |
|
| 37 |
Returns:
|
| 38 |
-
(cover_url, authors_str)
|
| 39 |
"""
|
| 40 |
cover = PLACEHOLDER_COVER
|
| 41 |
authors_str = "Unknown"
|
|
|
|
| 42 |
|
| 43 |
# Try Google Books API first
|
| 44 |
try:
|
|
@@ -65,6 +66,8 @@ def fetch_book_cover(isbn: str, title: str = "") -> tuple[str, str]:
|
|
| 65 |
authors = volume.get("authors") or []
|
| 66 |
if authors:
|
| 67 |
authors_str = ", ".join(authors)
|
|
|
|
|
|
|
| 68 |
except Exception:
|
| 69 |
pass # Fall through to Open Library
|
| 70 |
|
|
@@ -78,7 +81,7 @@ def fetch_book_cover(isbn: str, title: str = "") -> tuple[str, str]:
|
|
| 78 |
except Exception:
|
| 79 |
pass
|
| 80 |
|
| 81 |
-
return cover, authors_str
|
| 82 |
|
| 83 |
|
| 84 |
def fetch_covers_batch(books_data: list) -> list:
|
|
@@ -94,10 +97,12 @@ def fetch_covers_batch(books_data: list) -> list:
|
|
| 94 |
for book in books_data:
|
| 95 |
isbn = book.get("isbn", "")
|
| 96 |
title = book.get("title", "")
|
| 97 |
-
cover, authors = fetch_book_cover(isbn, title)
|
| 98 |
book["thumbnail"] = cover
|
| 99 |
if authors != "Unknown":
|
| 100 |
book["authors"] = authors
|
|
|
|
|
|
|
| 101 |
# Small delay to avoid rate limiting
|
| 102 |
time.sleep(0.05)
|
| 103 |
|
|
|
|
| 30 |
PLACEHOLDER_COVER = str(PROJECT_ROOT / "assets" / "cover-not-found.jpg")
|
| 31 |
|
| 32 |
@lru_cache(maxsize=1000)
|
| 33 |
+
def fetch_book_cover(isbn: str, title: str = "") -> tuple[str, str, str]:
|
| 34 |
"""
|
| 35 |
+
Fetch book cover URL (Google Books -> Open Library), authors and description.
|
| 36 |
|
| 37 |
Returns:
|
| 38 |
+
(cover_url, authors_str, description_from_api)
|
| 39 |
"""
|
| 40 |
cover = PLACEHOLDER_COVER
|
| 41 |
authors_str = "Unknown"
|
| 42 |
+
api_description = ""
|
| 43 |
|
| 44 |
# Try Google Books API first
|
| 45 |
try:
|
|
|
|
| 66 |
authors = volume.get("authors") or []
|
| 67 |
if authors:
|
| 68 |
authors_str = ", ".join(authors)
|
| 69 |
+
# Optional: use Google Books description if provided
|
| 70 |
+
api_description = volume.get("description") or api_description
|
| 71 |
except Exception:
|
| 72 |
pass # Fall through to Open Library
|
| 73 |
|
|
|
|
| 81 |
except Exception:
|
| 82 |
pass
|
| 83 |
|
| 84 |
+
return cover, authors_str, api_description
|
| 85 |
|
| 86 |
|
| 87 |
def fetch_covers_batch(books_data: list) -> list:
|
|
|
|
| 97 |
for book in books_data:
|
| 98 |
isbn = book.get("isbn", "")
|
| 99 |
title = book.get("title", "")
|
| 100 |
+
cover, authors, api_desc = fetch_book_cover(isbn, title)
|
| 101 |
book["thumbnail"] = cover
|
| 102 |
if authors != "Unknown":
|
| 103 |
book["authors"] = authors
|
| 104 |
+
if api_desc:
|
| 105 |
+
book["description_api"] = api_desc
|
| 106 |
# Small delay to avoid rate limiting
|
| 107 |
time.sleep(0.05)
|
| 108 |
|
src/data_factory/__init__.py
ADDED
|
@@ -0,0 +1,4 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# SFT Data Factory Module
|
| 2 |
+
from src.data_factory.generator import SFTDataGenerator, LLMJudge
|
| 3 |
+
|
| 4 |
+
__all__ = ["SFTDataGenerator", "LLMJudge"]
|
src/data_factory/generator.py
ADDED
|
@@ -0,0 +1,240 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
SFT Data Factory: Self-Instruct Pipeline with LLM-as-a-Judge
|
| 3 |
+
|
| 4 |
+
SOTA References:
|
| 5 |
+
- Self-Instruct (Wang et al., 2022)
|
| 6 |
+
- UltraChat (Ding et al., 2023)
|
| 7 |
+
- Alpaca (Stanford, 2023)
|
| 8 |
+
|
| 9 |
+
This module generates high-quality instruction-following data for fine-tuning
|
| 10 |
+
a Literary Critic persona into the model.
|
| 11 |
+
"""
|
| 12 |
+
import json
|
| 13 |
+
import random
|
| 14 |
+
from typing import List, Dict, Tuple, Optional
|
| 15 |
+
from pathlib import Path
|
| 16 |
+
from src.core.llm import LLMFactory
|
| 17 |
+
from src.utils import setup_logger
|
| 18 |
+
|
| 19 |
+
logger = setup_logger(__name__)
|
| 20 |
+
|
| 21 |
+
|
| 22 |
+
class SFTDataGenerator:
|
| 23 |
+
"""
|
| 24 |
+
Generates (Query, Response) pairs from raw reviews using Self-Instruct.
|
| 25 |
+
"""
|
| 26 |
+
|
| 27 |
+
def __init__(self, provider: str = "openai", api_key: str = None):
|
| 28 |
+
self.llm = LLMFactory.create(provider=provider, api_key=api_key, temperature=0.7)
|
| 29 |
+
|
| 30 |
+
def _sample_seed_reviews(self, path: str, n: int = 100, min_length: int = 200) -> List[Dict]:
|
| 31 |
+
"""Sample high-quality seed reviews."""
|
| 32 |
+
seeds = []
|
| 33 |
+
with open(path, 'r', encoding='utf-8') as f:
|
| 34 |
+
for line in f:
|
| 35 |
+
parts = line.strip().split(' ', 1)
|
| 36 |
+
if len(parts) < 2:
|
| 37 |
+
continue
|
| 38 |
+
isbn, review = parts[0], parts[1]
|
| 39 |
+
if len(review) >= min_length:
|
| 40 |
+
seeds.append({"isbn": isbn, "review": review})
|
| 41 |
+
|
| 42 |
+
# Random sample
|
| 43 |
+
if len(seeds) > n:
|
| 44 |
+
seeds = random.sample(seeds, n)
|
| 45 |
+
logger.info(f"Sampled {len(seeds)} seed reviews")
|
| 46 |
+
return seeds
|
| 47 |
+
|
| 48 |
+
def _evolve_instruction(self, review: str) -> Optional[str]:
|
| 49 |
+
"""
|
| 50 |
+
Self-Instruct Step 1: Generate a user question that would prompt this review.
|
| 51 |
+
"""
|
| 52 |
+
prompt = f"""You are helping create training data for a book recommendation AI.
|
| 53 |
+
|
| 54 |
+
Given this enthusiastic book review, generate a realistic USER QUESTION that would have prompted such a recommendation. The question should be natural, like what a real person would type into a book search.
|
| 55 |
+
|
| 56 |
+
REVIEW:
|
| 57 |
+
\"\"\"{review[:500]}\"\"\"
|
| 58 |
+
|
| 59 |
+
Generate ONLY the user question, nothing else. Be creative and natural."""
|
| 60 |
+
|
| 61 |
+
try:
|
| 62 |
+
response = self.llm.invoke(prompt)
|
| 63 |
+
return response.content.strip().strip('"')
|
| 64 |
+
except Exception as e:
|
| 65 |
+
logger.error(f"Instruction evolution failed: {e}")
|
| 66 |
+
return None
|
| 67 |
+
|
| 68 |
+
def _transform_response(self, review: str, query: str) -> Optional[str]:
|
| 69 |
+
"""
|
| 70 |
+
Self-Instruct Step 2: Transform review into AI assistant response style.
|
| 71 |
+
"""
|
| 72 |
+
prompt = f"""You are a passionate Literary Critic AI assistant.
|
| 73 |
+
|
| 74 |
+
A user asked: "{query}"
|
| 75 |
+
|
| 76 |
+
A human reviewer wrote this response:
|
| 77 |
+
\"\"\"{review[:600]}\"\"\"
|
| 78 |
+
|
| 79 |
+
Rewrite this as YOUR response to the user. Keep the emotional depth, specific evidence, and critical insight. But speak as a helpful AI book concierge, not as a random reviewer.
|
| 80 |
+
|
| 81 |
+
Your response (be enthusiastic but professional):"""
|
| 82 |
+
|
| 83 |
+
try:
|
| 84 |
+
response = self.llm.invoke(prompt)
|
| 85 |
+
return response.content.strip()
|
| 86 |
+
except Exception as e:
|
| 87 |
+
logger.error(f"Response transformation failed: {e}")
|
| 88 |
+
return None
|
| 89 |
+
|
| 90 |
+
def generate_dataset(
|
| 91 |
+
self,
|
| 92 |
+
review_path: str,
|
| 93 |
+
output_path: str,
|
| 94 |
+
n_samples: int = 100
|
| 95 |
+
) -> int:
|
| 96 |
+
"""
|
| 97 |
+
Main pipeline: Generate SFT dataset.
|
| 98 |
+
|
| 99 |
+
Returns: Number of successfully generated samples.
|
| 100 |
+
"""
|
| 101 |
+
seeds = self._sample_seed_reviews(review_path, n=n_samples * 2) # Over-sample
|
| 102 |
+
|
| 103 |
+
dataset = []
|
| 104 |
+
for seed in seeds:
|
| 105 |
+
if len(dataset) >= n_samples:
|
| 106 |
+
break
|
| 107 |
+
|
| 108 |
+
# Step 1: Evolve instruction
|
| 109 |
+
query = self._evolve_instruction(seed["review"])
|
| 110 |
+
if not query:
|
| 111 |
+
continue
|
| 112 |
+
|
| 113 |
+
# Step 2: Transform response
|
| 114 |
+
response = self._transform_response(seed["review"], query)
|
| 115 |
+
if not response:
|
| 116 |
+
continue
|
| 117 |
+
|
| 118 |
+
dataset.append({
|
| 119 |
+
"instruction": query,
|
| 120 |
+
"input": "", # No additional input for simple QA
|
| 121 |
+
"output": response,
|
| 122 |
+
"source_isbn": seed["isbn"]
|
| 123 |
+
})
|
| 124 |
+
|
| 125 |
+
if len(dataset) % 10 == 0:
|
| 126 |
+
logger.info(f"Generated {len(dataset)} / {n_samples} samples")
|
| 127 |
+
|
| 128 |
+
# Save
|
| 129 |
+
Path(output_path).parent.mkdir(parents=True, exist_ok=True)
|
| 130 |
+
with open(output_path, 'w', encoding='utf-8') as f:
|
| 131 |
+
for item in dataset:
|
| 132 |
+
f.write(json.dumps(item, ensure_ascii=False) + '\n')
|
| 133 |
+
|
| 134 |
+
logger.info(f"Saved {len(dataset)} samples to {output_path}")
|
| 135 |
+
return len(dataset)
|
| 136 |
+
|
| 137 |
+
|
| 138 |
+
class LLMJudge:
|
| 139 |
+
"""
|
| 140 |
+
Quality filter using LLM-as-a-Judge pattern.
|
| 141 |
+
Scores generated dialogues on multiple dimensions.
|
| 142 |
+
"""
|
| 143 |
+
|
| 144 |
+
def __init__(self, provider: str = "openai", api_key: str = None):
|
| 145 |
+
self.llm = LLMFactory.create(provider=provider, api_key=api_key, temperature=0.1)
|
| 146 |
+
|
| 147 |
+
def score(self, query: str, response: str) -> Dict:
|
| 148 |
+
"""
|
| 149 |
+
Score a (query, response) pair on multiple dimensions.
|
| 150 |
+
Returns: {"empathy": int, "specificity": int, "critique_depth": int, "avg": float}
|
| 151 |
+
"""
|
| 152 |
+
prompt = f"""You are evaluating the quality of an AI book recommendation response.
|
| 153 |
+
|
| 154 |
+
USER QUESTION: "{query}"
|
| 155 |
+
|
| 156 |
+
AI RESPONSE:
|
| 157 |
+
\"\"\"{response}\"\"\"
|
| 158 |
+
|
| 159 |
+
Rate the response on these dimensions (1-10 each):
|
| 160 |
+
1. EMPATHY: Does it understand and connect with what the user is looking for?
|
| 161 |
+
2. SPECIFICITY: Does it mention concrete details (plot points, themes, comparisons)?
|
| 162 |
+
3. CRITIQUE_DEPTH: Does it offer genuine literary insight, not just generic praise?
|
| 163 |
+
|
| 164 |
+
Respond in JSON format ONLY:
|
| 165 |
+
{{"empathy": X, "specificity": Y, "critique_depth": Z}}"""
|
| 166 |
+
|
| 167 |
+
try:
|
| 168 |
+
result = self.llm.invoke(prompt)
|
| 169 |
+
# Parse JSON from response
|
| 170 |
+
import re
|
| 171 |
+
match = re.search(r'\{.*\}', result.content, re.DOTALL)
|
| 172 |
+
if match:
|
| 173 |
+
scores = json.loads(match.group())
|
| 174 |
+
scores["avg"] = (scores["empathy"] + scores["specificity"] + scores["critique_depth"]) / 3
|
| 175 |
+
return scores
|
| 176 |
+
except Exception as e:
|
| 177 |
+
logger.error(f"Judge scoring failed: {e}")
|
| 178 |
+
|
| 179 |
+
return {"empathy": 0, "specificity": 0, "critique_depth": 0, "avg": 0}
|
| 180 |
+
|
| 181 |
+
def filter_dataset(
|
| 182 |
+
self,
|
| 183 |
+
input_path: str,
|
| 184 |
+
output_path: str,
|
| 185 |
+
threshold: float = 7.0
|
| 186 |
+
) -> Tuple[int, int]:
|
| 187 |
+
"""
|
| 188 |
+
Filter dataset keeping only high-quality samples.
|
| 189 |
+
|
| 190 |
+
Returns: (kept_count, total_count)
|
| 191 |
+
"""
|
| 192 |
+
kept = []
|
| 193 |
+
total = 0
|
| 194 |
+
|
| 195 |
+
with open(input_path, 'r', encoding='utf-8') as f:
|
| 196 |
+
for line in f:
|
| 197 |
+
total += 1
|
| 198 |
+
item = json.loads(line)
|
| 199 |
+
scores = self.score(item["instruction"], item["output"])
|
| 200 |
+
|
| 201 |
+
if scores["avg"] >= threshold:
|
| 202 |
+
item["quality_scores"] = scores
|
| 203 |
+
kept.append(item)
|
| 204 |
+
|
| 205 |
+
if total % 10 == 0:
|
| 206 |
+
logger.info(f"Judged {total} samples, kept {len(kept)}")
|
| 207 |
+
|
| 208 |
+
# Save filtered
|
| 209 |
+
with open(output_path, 'w', encoding='utf-8') as f:
|
| 210 |
+
for item in kept:
|
| 211 |
+
f.write(json.dumps(item, ensure_ascii=False) + '\n')
|
| 212 |
+
|
| 213 |
+
logger.info(f"Filtered: {len(kept)} / {total} passed (threshold={threshold})")
|
| 214 |
+
return len(kept), total
|
| 215 |
+
|
| 216 |
+
|
| 217 |
+
# CLI Entry Point
|
| 218 |
+
if __name__ == "__main__":
|
| 219 |
+
import argparse
|
| 220 |
+
parser = argparse.ArgumentParser(description="SFT Data Generator")
|
| 221 |
+
parser.add_argument("--mode", choices=["generate", "judge"], required=True)
|
| 222 |
+
parser.add_argument("--n", type=int, default=50, help="Number of samples to generate")
|
| 223 |
+
parser.add_argument("--provider", default="mock", help="LLM provider (openai/ollama/mock)")
|
| 224 |
+
parser.add_argument("--api-key", default=None, help="API key for provider")
|
| 225 |
+
args = parser.parse_args()
|
| 226 |
+
|
| 227 |
+
if args.mode == "generate":
|
| 228 |
+
generator = SFTDataGenerator(provider=args.provider, api_key=args.api_key)
|
| 229 |
+
generator.generate_dataset(
|
| 230 |
+
review_path="data/review_highlights.txt",
|
| 231 |
+
output_path="data/sft/raw_generated.jsonl",
|
| 232 |
+
n_samples=args.n
|
| 233 |
+
)
|
| 234 |
+
elif args.mode == "judge":
|
| 235 |
+
judge = LLMJudge(provider=args.provider, api_key=args.api_key)
|
| 236 |
+
judge.filter_dataset(
|
| 237 |
+
input_path="data/sft/raw_generated.jsonl",
|
| 238 |
+
output_path="data/sft/filtered_high_quality.jsonl",
|
| 239 |
+
threshold=7.0
|
| 240 |
+
)
|
src/etl.py
CHANGED
|
@@ -7,9 +7,9 @@ from src.utils import setup_logger
|
|
| 7 |
|
| 8 |
logger = setup_logger(__name__)
|
| 9 |
|
| 10 |
-
RAW_DATA_PATH = DATA_DIR / "Books_rating.csv"
|
| 11 |
PROCESSED_DATA_PATH = DATA_DIR / "books_processed.csv"
|
| 12 |
-
|
| 13 |
|
| 14 |
def load_books_data() -> pd.DataFrame:
|
| 15 |
"""
|
|
|
|
| 7 |
|
| 8 |
logger = setup_logger(__name__)
|
| 9 |
|
| 10 |
+
RAW_DATA_PATH = DATA_DIR / "raw" / "Books_rating.csv"
|
| 11 |
PROCESSED_DATA_PATH = DATA_DIR / "books_processed.csv"
|
| 12 |
+
REVIEW_HIGHLIGHTS_PATH = DATA_DIR / "review_highlights.txt"
|
| 13 |
|
| 14 |
def load_books_data() -> pd.DataFrame:
|
| 15 |
"""
|