ymlin105 committed on
Commit
950f43a
·
1 Parent(s): 89c5a4f

chore: remove deprecated files and add legacy directory to .dockerignore

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. .cursor/rules/research-prototype-style.mdc +29 -0
  2. .dockerignore +3 -0
  3. PHASE_2_DEVELOPMENT.md +0 -518
  4. README.md +41 -192
  5. REVIEW_HIGHLIGHTS.md +0 -142
  6. TAGS_AND_EMOTIONS.md +0 -233
  7. cover-not-found.jpg +0 -0
  8. data-exploration.ipynb +0 -0
  9. docs/README.md +18 -14
  10. docs/TECHNICAL_REPORT.md +40 -23
  11. docs/archived/PHASE_2_DEVELOPMENT.md +18 -9
  12. DEPLOYMENT.md → docs/archived/graveyard/DEPLOYMENT.md +0 -0
  13. business_logic.md → docs/archived/graveyard/business_logic.md +0 -0
  14. docs/{future_roadmap.md → archived/graveyard/future_roadmap.md} +0 -0
  15. docs/{interview_deep_dive.md → archived/graveyard/interview_deep_dive.md} +0 -0
  16. interview_prep.md → docs/archived/graveyard/interview_prep.md +0 -0
  17. docs/{phase7_plan.md → archived/graveyard/phase7_plan.md} +0 -0
  18. project_analysis.md → docs/archived/graveyard/project_analysis.md +0 -0
  19. docs/{project_narrative.md → archived/graveyard/project_narrative.md} +0 -0
  20. docs/{rag_architecture.md → archived/graveyard/rag_architecture.md} +0 -0
  21. technical_architecture.md → docs/archived/graveyard/technical_architecture.md +0 -0
  22. docs/{technical_deep_dive_sota.md → archived/graveyard/technical_deep_dive_sota.md} +0 -0
  23. story_and_strategy.md → docs/archived/story_and_strategy.md +0 -0
  24. {experiments → docs/experiments/reports}/baseline_report.md +0 -0
  25. benchmarks/results.md → docs/experiments/reports/benchmark_results.md +0 -0
  26. {experiments → docs/experiments/reports}/hybrid_report.md +0 -0
  27. {experiments → docs/experiments/reports}/rerank_report.md +0 -0
  28. {experiments → docs/experiments/reports}/router_report.md +0 -0
  29. {experiments → docs/experiments/reports}/temporal_report.md +0 -0
  30. legacy/README.md +10 -0
  31. {src → legacy}/agent/agent_core.py +0 -0
  32. {src → legacy}/agent/data_loader.py +0 -0
  33. {src → legacy}/agent/dialogue_manager.py +0 -0
  34. {src → legacy}/agent/intent_parser.py +0 -0
  35. {src → legacy}/agent/llm_generator.py +0 -0
  36. {src → legacy}/agent/rag_indexer.py +0 -0
  37. {src → legacy}/agent/rag_retriever.py +0 -0
  38. app.py → legacy/app.py +1 -1
  39. deploy.sh → legacy/deploy.sh +0 -0
  40. download_fix.py → legacy/download_fix.py +0 -0
  41. requirements.txt +0 -1
  42. scripts/add_isbn13_to_books_data.py +0 -16
  43. scripts/add_isbn_to_books_data.py +0 -21
  44. scripts/benchmark_compressor.py +0 -35
  45. scripts/benchmark_hybrid.py +0 -83
  46. scripts/benchmark_rerank.py +0 -82
  47. scripts/benchmark_retrieval.py +0 -82
  48. scripts/benchmark_router.py +0 -99
  49. scripts/benchmark_temporal.py +0 -44
  50. scripts/build_books_basic_info.py +0 -48
.cursor/rules/research-prototype-style.mdc ADDED
@@ -0,0 +1,29 @@
1
+ ---
2
+ description: Research prototype coding style — graduate student, clarity over elegance
3
+ globs: src/**/*.py, scripts/**/*.py
4
+ alwaysApply: true
5
+ ---
6
+
7
+ # Research Prototype Style
8
+
9
+ Target author: graduate student, working alone, optimizing for clarity and debuggability. NOT for production or large-scale deployment.
10
+
11
+ ## Do
12
+
13
+ - **Simple, linear control flow** — avoid nested callbacks, deep inheritance
14
+ - **Minimal abstraction** — prefer direct implementation over factories/adapters
15
+ - **Inline comments** — explain *why*, not *what*
16
+ - **Strategic debug** — a few print/debug statements, preferably behind a flag (e.g. `if DEBUG:`)
17
+ - **Small duplication OK** — if it improves readability over DRY
18
+
19
+ ## Avoid
20
+
21
+ - Excessive logging or print statements
22
+ - Heavy configuration systems (YAML configs, env-driven feature flags)
23
+ - Comprehensive unit test suites (targeted tests for critical paths only)
24
+ - Complex class hierarchies, abstract base classes
25
+ - Premature optimization
26
+
27
+ ## When in Doubt
28
+
29
+ Prefer simplicity over elegance. Code should look like something a careful graduate student would realistically write and maintain.
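The "strategic debug" bullet above can be illustrated with a short sketch: flat control flow, a why-comment, and prints gated behind a flag. The `DEBUG` flag and `load_books` function are hypothetical examples of the style, not code from this repository:

```python
# Sketch of the "strategic debug" style the rule describes.
# DEBUG and load_books() are hypothetical illustrations, not repo code.
import os

DEBUG = os.environ.get("DEBUG", "0") == "1"

def load_books(path):
    # Why: rows with no ISBN break downstream joins, so drop them early.
    rows = []
    with open(path, encoding="utf-8") as f:
        header = f.readline().rstrip("\n").split(",")
        for line in f:
            row = dict(zip(header, line.rstrip("\n").split(",")))
            if row.get("isbn"):
                rows.append(row)
    if DEBUG:
        print(f"load_books: kept {len(rows)} rows")
    return rows
```

Toggle the extra output with `DEBUG=1 python script.py` — no logging framework or config system needed.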
.dockerignore CHANGED
@@ -26,3 +26,6 @@ build/
26
  dist/
27
  eggs/
28
  .eggs/
29
+
30
+ # Legacy (not used in main flow)
31
+ legacy/
PHASE_2_DEVELOPMENT.md DELETED
@@ -1,518 +0,0 @@
1
- # Phase 2: Personalization & React UI Migration
2
-
3
- **Date:** January 2026
4
- **Status:** ✅ Complete & Deployed
5
-
6
- ---
7
-
8
- ## Overview
9
-
10
- This phase shifted the project from a basic semantic book recommender to an **intelligent, personalized discovery platform** with a modern React frontend. The vision evolved from marketplace/swap features to a focused **recommendation engine grounded in user preferences and persona-driven insights**.
11
-
12
- ---
13
-
14
- ## Phase Vision & Direction
15
-
16
- ### Initial Pivot (from conversation)
17
- - **Original concept:** Second-hand book marketplace/swap platform
18
- - **User feedback:** Focus on recommendation engine first, then expand
19
- - **Final direction:** Keep it recommendation-only with two new pillars:
20
- 1. **Favorites** → persistent user library tracking
21
- 2. **Personalized Highlights** → AI-generated selling points based on user taste
22
-
23
- ### Core Philosophy
24
- > "Books that understand you. Recommendations grounded in what you love."
25
-
26
- The system learns from your reading preferences and surfaces books that match both the search query AND your unique taste profile.
27
-
28
- ---
29
-
30
- ## What Was Built
31
-
32
- ### 1. **Backend Personalization Layer** (`src/`)
33
-
34
- #### A. User Favorites Storage
35
- - **File:** `src/user/profile_store.py`
36
- - **Mechanism:** JSON-based persistence (`data/user_profiles.json`)
37
- - **Features:**
38
- - `add_favorite(user_id, isbn)` → idempotent add + deduplicate
39
- - `list_favorites(user_id)` → retrieve user's library
40
- - Works with any user_id (default: "local" for single-user dev)
41
-
42
- #### B. User Persona Aggregation
43
- - **File:** `src/marketing/persona.py`
44
- - **Input:** List of favorite ISBNs + book metadata DataFrame
45
- - **Output:** `{ summary, top_authors[], top_categories[] }`
46
- - **Algorithm:**
47
- 1. Fetch metadata for all favorited books
48
- 2. Extract top 3 authors (by frequency)
49
- 3. Extract top 3 categories
50
- 4. Generate natural language summary combining signals
51
- - Example: *"您钟爱悬疑与科幻,偏好国际视野的作品。"* (You love mystery & sci-fi, prefer international perspectives)
52
-
53
- #### C. Personalized Highlights Generator
54
- - **File:** `src/marketing/highlights.py`
55
- - **Input:** ISBN + user persona + book metadata
56
- - **Output:** `{ title, authors, category, highlights[], persona_summary }`
57
- - **Generation Strategy:**
58
- - Match persona themes to book content (author, category, description)
59
- - Extract 3-5 contextual selling points
60
- - Combine rule-based matching + description parsing
61
- - Example output:
62
- ```
63
- - 作者获国际奖项,契合您对国际视野的热爱
64
- - 悬疑与科幻的完美融合,正是您的最爱组合
65
- - 情节紧凑,适合您快节奏阅读的偏好
66
- ```
67
-
68
- ### 2. **FastAPI Backend Integration** (`src/main.py`)
69
-
70
- **Three New Endpoints:**
71
-
72
- ```python
73
- POST /favorites/add
74
- Request: { user_id: str, isbn: str }
75
- Response: { status: "ok", favorites_count: int }
76
-
77
- GET /user/{user_id}/persona
78
- Response: { user_id, favorites: [], persona: {...} }
79
-
80
- POST /marketing/highlights
81
- Request: { isbn: str, user_id?: str }
82
- Response: { persona, highlights: [], meta: {...} }
83
- ```
84
-
85
- **CORS Support:**
86
- - Enabled for localhost:5173 (React dev), 3000 (alt dev), 8080
87
- - Allows frontend to access backend without restrictions
88
-
89
- ---
90
-
91
- ### 3. **Modern React UI** (`web/`)
92
-
93
- #### Architecture
94
- - **Build Tool:** Vite (ultra-fast dev server, ~200ms startup)
95
- - **Styling:** Tailwind CSS (CDN-based, no build required)
96
- - **Icons:** lucide-react (modern SVG icons)
97
- - **State Management:** React Hooks (useState only, no Redux)
98
-
99
- #### Design: "纸间留白" (Paper Shelf)
100
- A literary, minimalist aesthetic inspired by:
101
- - Japanese minimalism (留白 = leaving white space)
102
- - Second-hand bookstore vibes
103
- - Serif typography (font-serif)
104
- - Muted earth tones: `#b392ac` (mauve), `#f4acb7` (peach), `#faf9f6` (cream)
105
-
106
- #### Core Features
107
-
108
- **1. Discovery Tab (Default View)**
109
- ```
110
- ┌─────────────────────────────────┐
111
- │ 纸间留白 │ Header + toggle "私人书斋"
112
- ├─────────────────────────────────┤
113
- │ 墨色余温·灵魂契合 (if favorites) │ Smart carousel of alma-mate books
114
- ├─────────────────────────────────┤
115
- │ [Search] [Category▼] [Mood▼] │ Semantic search + filters
116
- │ 开启发现之旅 (Start Discovery) │
117
- ├─────────────────────────────────┤
118
- │ [Book 1] [Book 2] [Book 3] ... │ 5-column responsive grid
119
- │ (hover shows ai-generated hint) │
120
- └─────────────────────────────────┘
121
- ```
122
-
123
- **2. Book Detail Modal**
124
- ```
125
- ┌─────────────────────────────────┐
126
- │ [Close] │
127
- ├──────────────┬──────────────────┤
128
- │ Cover │ Title │
129
- │ ISBN │ Highlights │
130
- │ Score ★★★★★ │ Description │
131
- │ │ Chat Interface │
132
- │ │ [Add to Library] │
133
- └──────────────┴──────────────────┘
134
- ```
135
-
136
- **3. Private Library ("私人书斋")**
137
- - Toggle view to see only favorited books
138
- - Shows reading statistics (mood distribution)
139
- - Same gallery grid + detail modal
140
-
141
- **4. Chat Interface (in modal)**
142
- - Suggested questions tied to book context
143
- - User messages vs AI responses styled differently
144
- - AI grounded to book metadata (not LLM-based yet)
145
-
146
- #### API Integration
147
- All four key flows wired to backend:
148
-
149
- ```javascript
150
- // Search → Recommendation
151
- startDiscovery() → recommend(query, category, tone)
152
-
153
- // Select book → Load highlights
154
- openBook(book) → getHighlights(isbn)
155
-
156
- // Add to collection
157
- toggleCollect(book) → addFavorite(isbn)
158
-
159
- // (Future) Refresh persona
160
- persona = getPersona(userId)
161
- ```
162
-
163
- ---
164
-
165
- ## End-to-End Flow
166
-
167
- ### User Journey: "Discovery to Collection"
168
-
169
- ```
170
- 1. User enters search query + filters
171
-
172
- 2. startDiscovery() calls POST /recommend
173
- → FastAPI semantic search + tone filtering
174
- → Returns top N books with thumbnails
175
-
176
- 3. Books render in grid (hover shows AI hint)
177
-
178
- 4. User clicks book → openBook()
179
- → Calls POST /marketing/highlights
180
- → Gets persona + 3-5 personalized selling points
181
- → Modal shows all details + chat
182
-
183
- 5. User clicks "加入藏书馆" (Add to Collection)
184
- → Calls POST /favorites/add
185
- → Updates myCollection state
186
- → Next search shows "灵魂契合" carousel (matched books)
187
-
188
- 6. User clicks "私人书斋" to view collection
189
- → Filters books to only favorites
190
- → Shows reading persona stats
191
- ```
192
-
193
- ---
194
-
195
- ## Technical Decisions
196
-
197
- ### Why JSON for Favorites (not SQLite)?
198
- - **Rationale:** Single-user dev focus, rapid iteration
199
- - **Trade-off:** 11k books × metadata in one file = acceptable overhead
200
- - **Future:** Easy migration to PostgreSQL when scaling to multi-user
201
-
202
- ### Why No LLM for Highlights?
203
- - **Rationale:** Keep system lightweight, deterministic, fast
204
- - **Method:** Rule-based persona matching (Top-3 authors/categories)
205
- - **Future:** Could upgrade to LLM refinement (e.g., GPT for polish)
206
-
207
- ### Why React + Vite (not Gradio)?
208
- - **Rationale:**
209
- - Gradio good for prototypes, React needed for custom UX
210
- - Vite super fast (no webpack pain)
211
- - Tailwind CDN avoids npm build complexity
212
- - **Fallback:** Gradio UI (app.py) still available on port 7860
213
-
214
- ### Why Persona from Favorites (not search history)?
215
- - **Rationale:** User intent explicit in favorites, not implicit in queries
216
- - **Semantics:** "Add to collection" = explicit preference signal
217
- - **Advantage:** Works offline, no tracking/privacy concerns
218
-
219
- ---
220
-
221
- ## Architecture Diagram
222
-
223
- ```
224
- ┌──────────────────────────────────────────────────────┐
225
- │ FRONTEND (React) │
226
- │ web/ → Vite dev server (localhost:5173) │
227
- │ ┌────────────────────────────────────────────────┐ │
228
- │ │ App.jsx │ │
229
- │ │ - SearchBar (query, category, mood) │ │
230
- │ │ - Gallery (books grid) │ │
231
- │ │ - DetailModal (title, highlights, chat) │ │
232
- │ │ - MyCollection (favorites view) │ │
233
- │ └────────────────────────────────────────────────┘ │
234
- │ api.js → Fetch wrappers (recommend, highlights...) │
235
- └──────────────────────────────────────────────────────┘
236
-
237
- HTTP/CORS
238
-
239
- ┌──────────────────────────────────────────────────────┐
240
- │ BACKEND (FastAPI) │
241
- │ src/main.py → uvicorn (localhost:6006) │
242
- │ ┌────────────────────────────────────────────────┐ │
243
- │ │ GET /health │ │
244
- │ │ POST /recommend (query, category, tone) │ │
245
- │ │ GET /categories, /tones │ │
246
- │ │ ┌──────────────────────────────────────────┐ │ │
247
- │ │ │ NEW: POST /favorites/add │ │ │
248
- │ │ │ NEW: GET /user/{id}/persona │ │ │
249
- │ │ │ NEW: POST /marketing/highlights │ │ │
250
- │ │ └──────────────────────────────────────────┘ │ │
251
- │ └────────────────────────────────────────────────┘ │
252
- └──────────────────────────────────────────────────────┘
253
- ↓ ↓
254
- ┌─────────────┐ ┌──────────────────┐
255
- │ ChromaDB │ │ User Profiles │
256
- │ (11k docs) │ │ (JSON file) │
257
- │ ↓ │ │ ↓ │
258
- │ Vector │ │ Favorites + │
259
- │ Embeddings │ │ Persona │
260
- └─────────────┘ └──────────────────┘
261
-
262
- ┌─────────────────────────────────┐
263
- │ Books Metadata (CSV) │
264
- │ - title, authors, description │
265
- │ - isbn, category, rating │
266
- │ - emotion scores (joy/sad/etc) │
267
- └─────────────────────────────────┘
268
- ```
269
-
270
- ---
271
-
272
- ## Key Data Models
273
-
274
- ### User Profile (JSON)
275
- ```json
276
- {
277
- "local": {
278
- "favorites": [
279
- { "isbn": "9780451524935", "title": "1984", "added_at": "2026-01-06" },
280
- { "isbn": "9780061120084", "title": "To Kill a Mockingbird", "added_at": "2026-01-06" }
281
- ]
282
- }
283
- }
284
- ```
285
-
286
- ### Book Recommendation Response
287
- ```json
288
- {
289
- "recommendations": [
290
- {
291
- "isbn": "9780451524935",
292
- "title": "1984",
293
- "authors": "George Orwell",
294
- "description": "A dystopian novel...",
295
- "thumbnail": "https://covers.openlibrary.org/...",
296
- "caption": "(auto-generated short hint)"
297
- }
298
- ]
299
- }
300
- ```
301
-
302
- ### Highlights Response
303
- ```json
304
- {
305
- "persona": {
306
- "summary": "您钟爱悬疑与科幻,偏好国际视野的作品。",
307
- "top_authors": ["Agatha Christie", "Isaac Asimov"],
308
- "top_categories": ["Mystery", "Science Fiction"]
309
- },
310
- "highlights": [
311
- "国际推理大师之作,契合您的悬疑偏好",
312
- "心理扭转的情节设计,适合您快节奏阅读",
313
- "深层人性反思,引发思考"
314
- ],
315
- "meta": {
316
- "title": "And Then There Were None",
317
- "authors": "Agatha Christie",
318
- "category": "Mystery",
319
- "description": "..."
320
- }
321
- }
322
- ```
323
-
324
- ---
325
-
326
- ## Running the System
327
-
328
- ### Development Mode (3 services)
329
-
330
- **Terminal 1: FastAPI Backend**
331
- ```bash
332
- cd /Users/ymlin/Downloads/003-Study/138-Projects/book-rec-with-LLMs
333
- make run
334
- # Starts on http://localhost:6006
335
- # Loads 11k books into ChromaDB
336
- # Initializes metrics, routes
337
- ```
338
-
339
- **Terminal 2: React Frontend**
340
- ```bash
341
- cd web
342
- npm run dev
343
- # Starts on http://localhost:5173
344
- # Hot reload on file changes
345
- # Connect to http://localhost:6006 backend
346
- ```
347
-
348
- **Terminal 3 (Optional): Gradio Legacy UI**
349
- ```bash
350
- python app.py
351
- # Starts on http://localhost:7860
352
- # Alternative UI for testing
353
- ```
354
-
355
- ### Production Workflow
356
- - React builds with `npm run build` → static files
357
- - FastAPI serves as single backend
358
- - Deploy as Docker containers (see DEPLOYMENT.md)
359
-
360
- ---
361
-
362
- ## Testing the Features
363
-
364
- ### 1. Test Semantic Search
365
- ```
366
- Input: "悬疑推理小说,节奏快"
367
- Expected: Agatha Christie, Sherlock Holmes, modern thrillers
368
- ```
369
-
370
- ### 2. Test Favorites → Persona
371
- ```
372
- 1. Add 5 books to collection (mix of genres)
373
- 2. Click a new book
374
- 3. Check highlights mention added books' authors/categories
375
- ✓ Persona should reflect your choices
376
- ```
377
-
378
- ### 3. Test Persona-Based Highlights
379
- ```
380
- If you favorite: [Sci-Fi, Mystery, Literary]
381
- Then recommend: Horror book X
382
- Expected highlight: "虽不在您常读类型,但情节深度与科幻的想象力结合..."
383
- (Acknowledges taste + bridges to new territory)
384
- ```
385
-
386
- ---
387
-
388
- ## Future Enhancements
389
-
390
- ### Phase 3: Recommendations (Backlog)
391
-
392
- **1. LLM-Powered Highlights**
393
- - Use Claude/GPT to refine rule-based highlights
394
- - Natural language refinement (currently ~70% rule-based quality)
395
- - Cache per (user_id, isbn) pair for speed
396
-
397
- **2. Emotional Resonance Scoring**
398
- - Leverage emotion embeddings (joy/sadness/fear/anger/surprise) in metadata
399
- - Recommend books matching user's current mood signal
400
- - "What are you feeling today?" filter
401
-
402
- **3. Multi-User Accounts**
403
- - Migrate from JSON to SQLite/PostgreSQL
404
- - User authentication (OAuth)
405
- - Social features (share collections, compare tastes)
406
-
407
- **4. Advanced Search**
408
- - Author-to-author recommendations ("If you like X, try Y's style")
409
- - Time-based recommendations ("What to read this season?")
410
- - Combination search (mood + timeframe + word-count)
411
-
412
- **5. Analytics Dashboard**
413
- - Show user: "You've read 15 books in the mystery genre"
414
- - Predict next book based on reading history
415
- - Genre comfort zone vs stretch zones
416
-
417
- ---
418
-
419
- ## Phase Reflection
420
-
421
- ### What Worked Well
422
- ✅ **Modular backend design** → easy to add /highlights, /persona endpoints
423
- ✅ **React UI responsiveness** → users see results instantly
424
- ✅ **JSON-first approach** → no DB setup friction, iterate fast
425
- ✅ **API-driven architecture** → Gradio + React both work
426
- ✅ **Persona concept** → users feel "understood" by the system
427
-
428
- ### Challenges Overcome
429
- 🔧 **Port conflicts** (Gradio:7860 vs React:5173 vs FastAPI:6006) → Makefile organization
430
- 🔧 **CORS issues** (frontend can't reach backend) → Added CORSMiddleware
431
- 🔧 **Image loading** (external URLs not allowed in Gradio) → Runtime fetching + local fallback
432
- 🔧 **Timeout errors** (cold startup > 10s) → Increased client timeouts, optimized startup
433
-
434
- ### Design Philosophy Validated
435
- The shift from "marketplace" → "recommendation + personalization" was right because:
436
- 1. **Clear unique value:** Persona-aware recommendations don't exist in typical bookstores
437
- 2. **Tight scope:** Focused on one thing (smart discovery) vs scattered marketplace features
438
- 3. **User empathy:** People want to be understood, not just transact
439
-
440
- ---
441
-
442
- ## Code Structure Summary
443
-
444
- ```
445
- book-rec-with-LLMs/
446
- ├── src/
447
- │ ├── main.py # FastAPI app + 3 new endpoints
448
- │ ├── recommender.py # Semantic search core
449
- │ ├── vector_db.py # ChromaDB wrapper
450
- │ ├── cache.py # Image caching
451
- │ ├── user/
452
- │ │ └── profile_store.py # ✨ NEW: Favorites JSON storage
453
- │ └── marketing/
454
- │ ├── persona.py # ✨ NEW: Persona aggregation
455
- │ ├── highlights.py # ✨ NEW: Highlight generation
456
- │ └── guardrails.py # Safety checks (stub)
457
- ├── web/ # ✨ NEW: React Vite app
458
- │ ├── src/
459
- │ │ ├── App.jsx # Main component + state
460
- │ │ ├── api.js # Fetch wrappers
461
- │ │ └── main.jsx # Entry point
462
- │ ├── index.html # HTML + Tailwind CDN
463
- │ └── package.json # Dependencies
464
- ├── app.py # Gradio UI (legacy)
465
- ├── Makefile # Commands
466
- ├── requirements.txt # Python deps
467
- └── data/
468
- ├── books_processed.csv # Metadata
469
- └── user_profiles.json # ✨ NEW: User data
470
- ```
471
-
472
- ---
473
-
474
- ## Commit Message
475
- ```
476
- feat: add React UI and backend personalization features
477
-
478
- - Create modern React UI (web/) with 纸间留白 design
479
- * Semantic search + favorites + detail modal
480
- * Tailwind CSS + lucide-react
481
- * Vite dev server on port 5173
482
-
483
- - Implement user personalization:
484
- * src/user/profile_store.py: JSON favorites
485
- * src/marketing/persona.py: User taste aggregation
486
- * src/marketing/highlights.py: Persona-aware selling points
487
- * 3 new API endpoints in FastAPI
488
-
489
- - Maintain Gradio UI as optional fallback
490
- - Add CORS support, update timeouts, improve infrastructure
491
- ```
492
-
493
- ---
494
-
495
- ## How to Continue
496
-
497
- ### If you want to test now:
498
- 1. `make run` (starts backend)
499
- 2. `cd web && npm run dev` (starts React UI)
500
- 3. Visit http://localhost:5173
501
- 4. Search for a book → click results → "加入藏书馆" → see persona highlights
502
-
503
- ### If you want to refine:
504
- - Adjust persona algorithm in `src/marketing/persona.py`
505
- - Tweak UI colors/layout in `web/src/App.jsx`
506
- - Add more rules to highlights in `src/marketing/highlights.py`
507
-
508
- ### If you want to scale:
509
- - Migrate to PostgreSQL (users table + favorites relationship)
510
- - Add user auth (FastAPI auth middleware)
511
- - Deploy with Docker + cloud (see DEPLOYMENT.md)
512
-
513
- ---
514
-
515
- **Status:** ✅ **Ready to Deploy**
516
-
517
- Next phase can focus on: multi-user support, LLM refinement, analytics, or social features.
518
-
README.md CHANGED
@@ -7,221 +7,70 @@ app_port: 8000
7
 
8
  # Intelligent Book Recommendation System
9
 
10
- > **Frozen at v2.6.0**This project is in maintenance mode for portfolio use. No new features, experiments, or optimizations. Documentation and bug fixes only.
11
 
12
- > A production-grade **Agentic RAG + RecSys** platform combining semantic search, personalized recommendations, and conversational AI.
13
 
14
- ## Highlights
15
 
16
- | Component | Technology | Achievement |
17
- |:---|:---|:---|
18
- | **Semantic Search** | ChromaDB + MiniLM-L6 | Sub-300ms retrieval on 200K+ books |
19
- | **Agentic Router** | Rule-based intent classification | 4 dynamic strategies (BM25, Hybrid, Rerank, Small-to-Big) |
20
- | **Personalized Rec** | 7-channel recall + LGBMRanker + Stacking | HR@10: 0.4545, MRR@5: 0.2893 |
21
- | **Conversational AI** | RAG + OpenAI/Ollama | Real-time streaming (Default: Local Ollama) |
22
 
23
- ---
24
 
25
- ## System Architecture
26
 
27
- ```
28
- ┌─────────────────────────────────────────────────────────────────┐
29
- │ Frontend (React)
30
- │ Search UI My Bookshelf Chat Recommendations │
31
- └──────────────────────────┬──────────────────────────────────────┘
32
- │ REST + SSE
33
- ┌──────────────────────────▼──────────────────────────────────────┐
34
- │ FastAPI Backend │
35
- │ ┌─────────────┐ ┌──────────────┐ ┌───────────────────────┐ │
36
- │ │ Query Router│→ │ RAG Pipeline │→ │ Personalized RecSys │ │
37
- │ └─────────────┘ └──────────────┘ └───────────────────────┘ │
38
- │ │ │ │ │
39
- │ Intent Class Hybrid Search 7-Channel Recall + RRF │
40
- │ (ISBN/Keyword + Cross-Encoder (ItemCF + UserCF + Swing │
41
- │ /Complex) Reranking + SASRec + Item2Vec...) │
42
- └──────────────────────────┬──────────────────────────────────────┘
43
-
44
- ┌──────────────────┼──────────────────┐
45
- ▼ ▼ ▼
46
- ┌─────────┐ ┌───────────┐ ┌──────────────┐
47
- │ChromaDB │ │LGBMRanker │ │ LLM Provider │
48
- │(Vectors)│ │(LambdaRank│ │ (Chat/Recs) │
49
- └─────────┘ └───────────┘ └──────────────┘
50
- ```
51
 
52
- ---
53
 
54
- ## Key Features
55
-
56
- ### 1. Agentic RAG with Dynamic Routing
57
- - **Query Intent Classification**: Automatically routes queries to optimal retrieval strategy
58
- - ISBN → Pure BM25 (100% precision)
59
- - Keywords → Hybrid Search (BM25 + Dense)
60
- - Complex queries → Cross-Encoder Reranking
61
- - Detail queries → Small-to-Big Retrieval (788K indexed sentences)
62
-
63
- ### 2. Personalized Recommendation Engine
64
- - **7-Channel Recall**: ItemCF (direction-weighted), UserCF, Swing, SASRec, Item2Vec, YoutubeDNN, Popularity
65
- - **RRF Fusion**: Reciprocal Rank Fusion merges candidates across all recall channels
66
- - **SASRec Sequential Model**: 64-dim Transformer embeddings (30 epochs), used as both recall source and ranking feature
67
- - **Model Stacking**: LGBMRanker (LambdaRank) + XGBClassifier → LogisticRegression meta-learner
68
- - **LGBMRanker (LambdaRank)**: Directly optimizes NDCG with 17 engineered features and hard negative sampling
69
- - **Evaluation**: HR@10 = 0.4545, MRR@5 = 0.2893 (n=2000, Leave-Last-Out, title-relaxed)
70
-
71
- ### 3. My Bookshelf (User Library)
72
- - **Rating System**: 5-star rating with persistence
73
- - **Reading Status**: Want to Read / Reading / Finished
74
- - **Statistics Dashboard**: Visual progress tracking
75
-
76
- ### 4. Conversational Shopping Assistant
77
- - **RAG-Grounded Responses**: Context from ChromaDB reduces hallucinations
78
- - **Streaming Output**: Real-time token streaming via SSE
79
- - **Flexible LLM**: Defaults to local **Ollama** (free/privacy), supports **OpenAI API** for production
80
 
81
- ---
 
82
 
83
  ## Quick Start
84
 
85
- ### Prerequisites
86
- - Python 3.10+ with Conda
87
- - Node.js 18+
88
-
89
- ### Installation
90
-
91
  ```bash
92
- # Clone and setup environment
93
  git clone https://github.com/sylvia-ymlin/book-rec-with-LLMs.git
94
  cd book-rec-with-LLMs
95
- conda env create -f environment.yml
96
- conda activate book-rec
97
-
98
- # Initialize vector database (first run)
99
- python src/init_db.py
100
 
101
- # Start API server
102
- make run # http://localhost:6006
 
103
 
104
- # Start frontend (new terminal)
105
- cd web && npm install && npm run dev # http://localhost:5173
106
  ```
107
 
108
- ### LLM Configuration
109
 
110
- | Provider | Setup | Use Case |
111
- |:---|:---|:---|
112
- | **Ollama** | `ollama pull llama3` | Free, local dev |
113
- | **OpenAI** | Set API key in UI Settings | Production |
114
-
115
- ---
116
-
117
- ## API Endpoints
118
-
119
- | Endpoint | Method | Description |
120
- |:---|:---|:---|
121
- | `/recommend` | POST | Semantic search with emotion/category filters |
122
- | `/api/recommend/personal` | GET | Personalized recommendations (RecSys) |
123
- | `/favorites/add` | POST | Add book to collection |
124
- | `/favorites/update` | PUT | Update rating/reading status |
125
- | `/user/{id}/stats` | GET | Reading statistics |
126
- | `/chat/completions` | POST | RAG-powered chat (streaming) |
127
- | `/health` | GET | Service health check |
128
-
129
- ---
130
-
131
- ## Project Structure
132
-
133
- ```
134
- src/
135
- ├── main.py # FastAPI application
136
- ├── recommender.py # RAG search orchestration
137
- ├── vector_db.py # ChromaDB wrapper
138
- ├── core/
139
- │ ├── router.py # Agentic query routing
140
- │ └── reranker.py # Cross-encoder reranking
141
- ├── recall/
142
- │ ├── itemcf.py # ItemCF with direction weight
143
- │ ├── usercf.py # UserCF (Jaccard + activity penalty)
144
- │ ├── swing.py # Swing (user-pair overlap weighting)
145
- │ ├── sasrec_recall.py # SASRec embedding dot-product recall
146
- │ ├── item2vec.py # Item2Vec (Word2Vec on sequences)
147
- │ ├── youtube_dnn.py # YoutubeDNN two-tower recall
148
- │ ├── popularity.py # Popularity with time decay
149
- │ └── fusion.py # RRF fusion of all channels
150
- ├── ranking/
151
- │ └── features.py # 17 ranking features
152
- ├── services/
153
- │ └── recommend_service.py # Recall → Rank → Dedup pipeline
154
- └── user/ # User profile storage
155
-
156
- web/
157
- ├── src/App.jsx # React UI
158
- └── src/api.js # API client
159
-
160
- scripts/
161
- ├── model/
162
- │ ├── train_sasrec.py # SASRec sequential model training
163
- │ ├── build_recall_models.py # ItemCF, UserCF, Swing, Popularity
164
- │ ├── train_ranker.py # LGBMRanker with hard negative sampling
165
- │ └── evaluate.py # HR@10, MRR@5 evaluation
166
- ├── deploy/ # Server deployment scripts
167
- └── data/ # Data processing pipelines
168
- ```
169
 
170
- ---
171
-
172
- ## Performance
173
-
174
-
175
- ### Recommendation Metrics (v2.6.0)
-
- | Metric | V2.5 | v2.6.0 | Method |
- |:---|:---|:---|:---|
- | **Hit Rate@10** | 0.2205 | **0.4545** | Leave-Last-Out, n=2000 |
- | **MRR@5** | 0.1584 | **0.2893** | Title-relaxed matching |
-
- v2.6.0 Key Enhancements:
- - **Item2Vec Recall**: Word2Vec (Skip-gram) on user interaction sequences, 7th recall channel.
- - **Model Stacking**: LGBMRanker + XGBClassifier → LogisticRegression meta-learner.
- - **Engineering**: Vectorized inference, non-blocking I/O, metadata singleton.
-
- | Dataset | Size |
- |:---|:---|
- | Training Set | 1,079,966 interactions |
- | Active Users | 167,968 |
- | Books | 221,998 |
-
- ### Latency Benchmarks
- | Operation | P50 Latency |
|:---|:---|
- | **Exact Search** | ~19ms |
- | **Hybrid Search** | ~230ms |
- | **Reranked Search** | ~710ms |
- | **Personal Rec (warm)** | ~19ms |
-
- ---
-
- ## Project Documentation
-
- | Document | Description |
- |:---|:---|
- | [Memory Optimization Report](docs/memory_optimization.md) | Engineering overhaul: Zero-RAM SQLite architecture |
- | [Experiment Archive](docs/experiments/experiment_archive.md) | All experimental results from V1.0 to v2.6.0 |
- | [Performance Debugging Report](docs/performance_debugging_report.md) | Root cause analysis of evaluation issues |
- | [Roadmap](docs/roadmap.md) | Technical evolution (v2.6.0 frozen, V3.0 planned) |
- | [Technical Report](docs/technical_report.md) | System architecture deep dive |
- | [Build Guide](docs/build_guide.md) | Build and deployment instructions |
-
- ## References
-
- 1. Kang, W., & McAuley, J. (2018). *Self-Attentive Sequential Recommendation*. ICDM.
- 2. Reimers, N., & Gurevych, I. (2019). *Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks*.
- 3. Ke, G., et al. (2017). *LightGBM: A Highly Efficient Gradient Boosting Decision Tree*. NeurIPS.
- 4. Gao, L., et al. (2022). *Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE)*.
- 5. Yang, J., et al. (2020). *Large-scale Product Graph Construction for Recommendation in E-commerce* (Swing algorithm).
-
- ---
 
## License

MIT
-
 

# Intelligent Book Recommendation System

+ *Frozen at v2.6.0 — maintenance mode for portfolio use.*
+
+ ## Problem
+
+ Readers often can't articulate what they want. Can one system both understand their vague descriptions and give personalized recommendations based on their reading history?
+
+ ## Method
+
+ Two parallel threads: **RAG** (Agentic Router → Hybrid Search → Reranking) for understanding vague queries; **RecSys** (7-channel recall → LGBMRanker → Stacking) for personalized recommendations from reading history.
+
+ ## Key Experiments
+
+ | Experiment | Before | After | Conclusion |
+ |:---|:---|:---|:---|
+ | **RAG: Exact match** | Pure vector search, ISBN → 0% recall | Hybrid (BM25 + Dense) + Router → 100% | Vector-only fails on exact entities; BM25 + routing fixes it |
+ | **RAG: Keyword intent** | "Harry Potter" → Philosophy book | Reranked → Sorcerer's Stone | Cross-encoder corrects semantic drift |
+ | **RecSys: Personalization** | Baseline 0.138 HR@10 | Item2Vec + LGBMRanker + Stacking → **0.4545** HR@10 | 7-channel recall + LambdaRank + ensemble beats single model |
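The Hybrid (BM25 + Dense) result above depends on how the two rankings are merged; the project's `fusion.py` applies Reciprocal Rank Fusion (RRF) across recall channels, and the same scheme is the standard way to fuse BM25 and dense hit lists. A minimal sketch, with hypothetical ISBN lists and the conventional k=60 smoothing constant:

```python
# Reciprocal Rank Fusion: each list contributes 1/(k + rank) per item.
# The item IDs are hypothetical; k=60 is the conventional constant.
def rrf_fuse(ranked_lists, k=60):
    scores = {}
    for ranking in ranked_lists:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["isbn_A", "isbn_B", "isbn_C"]   # keyword channel
dense_hits = ["isbn_B", "isbn_C", "isbn_A"]  # vector channel
print(rrf_fuse([bm25_hits, dense_hits]))     # → ['isbn_B', 'isbn_A', 'isbn_C']
```

Items ranked well by both lists float to the top without any score calibration, which is why RRF is robust to channels with incomparable score scales.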
 
+ *Evaluation: Leave-Last-Out, n=2000, title-relaxed. HR@10 = 0.4545, MRR@5 = 0.2893.*
+
+ ## Architecture
+
+ ```
+      Query                  No Query
+        │                        │
+        ▼                        ▼
+ ┌─────────────┐          ┌─────────────┐
+ │  RAG Path   │          │ RecSys Path │
+ │  Router →   │          │  7-Channel  │
+ │  Hybrid →   │          │  Recall →   │
+ │  Rerank     │          │  LGBMRanker │
+ └─────────────┘          └─────────────┘
+        │                        │
+        └────────────┬───────────┘
+                     ▼
+              Top-K Results
+ ```

## Quick Start

```bash
git clone https://github.com/sylvia-ymlin/book-rec-with-LLMs.git
cd book-rec-with-LLMs
+ conda env create -f environment.yml && conda activate book-rec
+
+ # First run (or use `make data-pipeline` for a full build)
+ python src/init_db.py                 # Chroma vector DB
+ python scripts/init_sqlite_db.py      # SQLite metadata (local build)
+
+ make run                              # API: http://localhost:6006
+ cd web && npm install && npm run dev  # UI: http://localhost:5173
```

+ **LLM**: Defaults to Ollama (`ollama pull llama3`); set an OpenAI API key in UI Settings for production use.

+ ## Documentation

+ | Doc | Purpose |
|:---|:---|
+ | [Technical Report](docs/TECHNICAL_REPORT.md) | Architecture, design decisions |
+ | [Experiment Archive](docs/experiments/experiment_archive.md) | Full experiment log (V1.0 → v2.6.0) |
+ | [Interview Guide](docs/interview_guide.md) | Q&A, STAR cases |
+ | [Build Guide](docs/build_guide.md) | Deployment instructions |

## License

MIT
 
REVIEW_HIGHLIGHTS.md DELETED
@@ -1,142 +0,0 @@
1
- # Review Highlights Feature
2
-
3
- ## Overview
4
-
5
- Added semantic sentence extraction to display representative reader reviews for each book. This feature enhances book discovery by showcasing authentic reader voices.
6
-
7
- ## Implementation
8
-
9
- ### 1. Data Generation (Server-side)
10
-
11
- **Script**: `scripts/extract_review_sentences.py`
12
-
13
- **Process**:
14
- - Splits book descriptions into sentences using regex
15
- - Uses `sentence-transformers/all-MiniLM-L6-v2` for sentence embeddings
16
- - Clusters similar sentences via cosine similarity (threshold: 0.8)
17
- - Extracts representative sentences from each cluster (top 5 per book)
18
- - Stores as semicolon-separated `review_highlights` column in CSV
19
-
20
- **Execution**:
21
- ```bash
22
- # Run in container with GPU
23
- export HF_ENDPOINT=https://hf-mirror.com
24
- python scripts/extract_review_sentences.py \
25
- --input data/books_processed.csv \
26
- --output data/books_processed.csv \
27
- --top-n 5 \
28
- --similarity-threshold 0.8 \
29
- --device 0 \
30
- --batch-size 128
31
- ```
32
-
33
- **Performance**: ~17 minutes for 222k books on GPU (211 it/s)
34
-
35
- ### 2. Backend Integration
36
-
37
- **Files Modified**:
38
- - `src/recommender.py`: Parse `review_highlights` from CSV, split by semicolon
39
- - `src/main.py`: Add `review_highlights: List[str]` to `BookResponse` model
40
-
41
- **Code**:
42
- ```python
43
- # Parse review highlights from semicolon-separated string
44
- highlights_raw = str(row.get("review_highlights", "")).strip()
45
- review_highlights = [h.strip() for h in highlights_raw.split(";") if h.strip()]
46
- ```
47
-
48
- ### 3. Frontend Display
49
-
50
- **File**: `web/src/App.jsx`
51
-
52
- **Location**: Left column, bottom section (below Rating/Mood)
53
-
54
- **Features**:
55
- - Displays up to 3 representative sentences
56
- - Bullet-point format with `-` prefix
57
- - Complete sentences: `- "[sentence]"`
58
- - Incomplete sentences: `- "...[sentence]"` (auto-detected via regex `/^[A-Z]/`)
59
- - Styling: 10px italic gray text
60
-
61
- **Layout**:
62
- ```jsx
63
- {selectedBook.review_highlights && selectedBook.review_highlights.length > 0 && (
64
- <div className="w-full mt-auto space-y-2 text-left">
65
- {selectedBook.review_highlights.slice(0, 3).map((highlight, idx) => {
66
- const isCompleteSentence = /^[A-Z]/.test(highlight.trim());
67
- const prefix = isCompleteSentence ? '' : '...';
68
- return (
69
- <p key={idx} className="text-[10px] text-[#666] leading-relaxed italic pl-2">
70
- - "{prefix}{highlight}"
71
- </p>
72
- );
73
- })}
74
- </div>
75
- )}
76
- ```
77
-
78
- ## Related Changes
79
-
80
- ### Rating Display Enhancement
81
-
82
- **Problem**: Hardcoded rating value of 4 stars for all books
83
-
84
- **Solution**:
85
- - Added `average_rating` field to backend API response
86
- - Display format: `4.3` (1 decimal) + filled stars
87
- - Moved rating display into AI highlight box (pink desc_block)
88
-
89
- **Frontend mapping**:
90
- ```javascript
91
- rating: r.average_rating || 0, // Keep float, no rounding
92
- ```
93
-
94
- **Display**:
95
- ```jsx
96
- <span>{selectedBook.rating ? selectedBook.rating.toFixed(1) : '0.0'}</span>
97
- <div className="flex gap-0.5 text-[#f4acb7]">
98
- {[1,2,3,4,5].map(i => <Star key={i} className={`w-3 h-3 ${i <= selectedBook.rating ? 'fill-current' : ''}`} />)}
99
- </div>
100
- ```
101
-
102
- ### Layout Adjustments
103
-
104
- - Grid ratio: 4:8 → 5:7 (more space for left column)
105
- - Rating/Mood: Changed from vertical stack to consolidated display
106
- - Rating moved into desc_block (AI highlight box)
107
- - Review highlights positioned at bottom with `mt-auto`
108
-
109
- ## Data Schema
110
-
111
- **CSV Column**: `review_highlights` (string, semicolon-separated)
112
-
113
- **Example**:
114
- ```
115
- "Having been brought up on the notion...;It transpires, some years ago...;This is a work full of wisdom..."
116
- ```
117
-
118
- **API Response**:
119
- ```json
120
- {
121
- "review_highlights": [
122
- "Having been brought up on the notion that Elizabeth Barrett Browning was the slighter poet...",
123
- "It transpires, some years ago, Clarke hosted two hugely successful British television series...",
124
- "This is a work full of wisdom and unusual perspectives."
125
- ],
126
- "average_rating": 3.716216
127
- }
128
- ```
129
-
130
- ## Notes
131
-
132
- - Review highlights are pre-computed and stored in CSV (no runtime extraction)
133
- - Data file `books_processed.csv` (~243MB) must be regenerated after container rebuild
134
- - Use `scp` to transfer processed CSV back to local machine
135
- - HuggingFace mirror (`HF_ENDPOINT`) required for model download in restricted networks
136
-
137
- ## Future Improvements
138
-
139
- - Cache sentence embeddings to speed up re-generation
140
- - Add sentiment analysis to highlights (positive/critical)
141
- - Filter highlights by relevance to user query
142
- - Display highlight source (verified purchase vs. regular review)
 
TAGS_AND_EMOTIONS.md DELETED
@@ -1,233 +0,0 @@
1
- # Tags and Emotion Scoring
2
-
3
- This document describes the tag generation and emotion scoring features added to enrich book metadata.
4
-
5
- ## Overview
6
-
7
- - **Tags**: Keyword extraction from book descriptions using TF-IDF (5-8 terms per book)
8
- - **Emotion Scores**: Five emotion dimensions (joy, sadness, fear, anger, surprise) computed via transformer model
9
-
10
- ## Data Generation
11
-
12
- ### 1. Tag Generation
13
-
14
- Extracts thematic keywords from aggregated review text.
15
-
16
- **Script**: `scripts/generate_tags.py`
17
-
18
- **Usage**:
19
- ```bash
20
- python scripts/generate_tags.py \
21
- --input data/books_processed.csv \
22
- --output data/books_processed.csv \
23
- --top-n 8
24
- ```
25
-
26
- **Algorithm**:
27
- - TF-IDF vectorization (unigrams + bigrams)
28
- - English stopwords + domain stoplist (e.g., "book", "author", "story")
29
- - Top-N weighted terms per book
30
- - Semicolon-joined storage in `tags` column
31
-
32
- **Parameters**:
33
- - `--top-n`: Max tags per book (default: 8)
34
- - `--max-features`: TF-IDF vocabulary size (default: 60,000)
35
- - `--min-df`: Minimum document frequency (default: 5)
36
- - `--max-df`: Maximum document frequency ratio (default: 0.5)
37
-
38
- ### 2. Emotion Scoring
39
-
40
- Computes emotion intensity scores from book descriptions.
41
-
42
- **Script**: `scripts/generate_emotions.py`
43
-
44
- **Model**: `j-hartmann/emotion-english-distilroberta-base`
45
-
46
- **Usage**:
47
- ```bash
48
- # CPU
49
- python scripts/generate_emotions.py \
50
- --input data/books_processed.csv \
51
- --output data/books_processed.csv \
52
- --batch-size 16
53
-
54
- # Apple GPU (MPS)
55
- python scripts/generate_emotions.py \
56
- --input data/books_processed.csv \
57
- --output data/books_processed.csv \
58
- --batch-size 8 \
59
- --device mps \
60
- --checkpoint 2000 \
61
- --resume
62
- ```
63
-
64
- **Parameters**:
65
- - `--batch-size`: Inference batch size (default: 16)
66
- - `--device`: `mps` (Apple GPU), CUDA device id, or CPU (default)
67
- - `--checkpoint`: Rows between checkpoint writes (default: 5000)
68
- - `--resume`: Skip rows already scored (useful for resuming long runs)
69
- - `--max-rows`: Limit processing to N rows (for testing)
70
-
71
- **Output Columns**:
72
- - `joy`: 0.0–1.0
73
- - `sadness`: 0.0–1.0
74
- - `fear`: 0.0–1.0
75
- - `anger`: 0.0–1.0
76
- - `surprise`: 0.0–1.0
77
-
78
- **Performance**:
79
- - ~1.1 it/s on Apple M-series GPU
80
- - ~7 hours for 222k books (batch_size=8, MPS)
81
- - One-time processing; results persist in CSV
82
-
83
- ## Data Schema
84
-
85
- Updated `books_processed.csv` columns:
86
-
87
- | Column | Type | Description |
88
- |--------|------|-------------|
89
- | `tags` | str | Semicolon-separated keywords (e.g., "irish;travel;humor") |
90
- | `joy` | float | Joy emotion score (0.0–1.0) |
91
- | `sadness` | float | Sadness emotion score (0.0–1.0) |
92
- | `fear` | float | Fear emotion score (0.0–1.0) |
93
- | `anger` | float | Anger emotion score (0.0–1.0) |
94
- | `surprise` | float | Surprise emotion score (0.0–1.0) |
95
-
96
- ## API Integration
97
-
98
- ### Backend Changes
99
-
100
- **File**: `src/recommender.py`
101
-
102
- Added to `_format_results()`:
103
- ```python
104
- # Parse tags
105
- tags_raw = str(row.get("tags", "")).strip()
106
- tags = [t.strip() for t in tags_raw.split(";") if t.strip()] if tags_raw else []
107
-
108
- # Extract emotions
109
- emotions = {
110
- "joy": float(row.get("joy", 0.0)),
111
- "sadness": float(row.get("sadness", 0.0)),
112
- "fear": float(row.get("fear", 0.0)),
113
- "anger": float(row.get("anger", 0.0)),
114
- "surprise": float(row.get("surprise", 0.0)),
115
- }
116
- ```
117
-
118
- **File**: `src/main.py`
119
-
120
- Updated Pydantic model:
121
- ```python
122
- class BookResponse(BaseModel):
123
- isbn: str
124
- title: str
125
- authors: str
126
- description: str
127
- thumbnail: str
128
- caption: str
129
- tags: List[str] = []
130
- emotions: Dict[str, float] = {}
131
- ```
132
-
133
- ### API Response Example
134
-
135
- ```json
136
- {
137
- "recommendations": [
138
- {
139
- "isbn": "0001849883",
140
- "title": "Bury My Bones But Keep My Words",
141
- "authors": "Deborah Savage, Tony Fairman",
142
- "tags": ["paulsen", "otters", "searches", "gary", "brian"],
143
- "emotions": {
144
- "joy": 0.020,
145
- "sadness": 0.004,
146
- "fear": 0.012,
147
- "anger": 0.006,
148
- "surprise": 0.086
149
- }
150
- }
151
- ]
152
- }
153
- ```
154
-
155
- ## UI Display
156
-
157
- ### Search Results Grid
158
-
159
- Each book card displays:
160
- - **Dominant emotion label**: Emotion with highest score (bottom-right badge)
161
- - Example: "joy", "sadness", "fear"
162
-
163
- **Implementation** (`web/src/App.jsx`):
164
- ```jsx
165
- {book.emotions && Object.keys(book.emotions).length > 0 ? (
166
- <span className="text-[9px] bg-[#f8f9fa] border border-[#eee] px-1 text-[#999] capitalize">
167
- {Object.entries(book.emotions).reduce((a, b) => a[1] > b[1] ? a : b)[0]}
168
- </span>
169
- ) : (
170
- <span className="text-[9px] bg-[#f8f9fa] border border-[#eee] px-1 text-[#999]">—</span>
171
- )}
172
- ```
173
-
174
- ### Book Detail Modal
175
-
176
- Two new sections:
177
-
178
- **1. Key Themes**
179
- - Displays all extracted tags as badges
180
- - Shows "No themes found" if tags empty
181
-
182
- **2. Emotional Tone**
183
- - Five horizontal bars showing emotion scores
184
- - Bar width = score percentage (0–100%)
185
- - Format: `emotion_name | [bar] | percentage`
186
-
187
- **Implementation** (`web/src/App.jsx`):
188
- ```jsx
189
- <div className="space-y-2">
190
- <h4>Emotional Tone</h4>
191
- <div className="space-y-2 p-3 bg-[#faf9f6] border border-[#eee]">
192
- {selectedBook.emotions && Object.entries(selectedBook.emotions).map(([emotion, score]) => (
193
- <div key={emotion} className="flex items-center gap-2">
194
- <span className="text-[9px] font-bold text-gray-500 w-16 capitalize">{emotion}</span>
195
- <div className="flex-grow bg-white border border-[#eee] h-2 relative overflow-hidden">
196
- <div
197
- className="h-full bg-[#b392ac] transition-all"
198
- style={{ width: `${Math.round(score * 100)}%` }}
199
- />
200
- </div>
201
- <span className="text-[8px] text-gray-400 w-10 text-right">{Math.round(score * 100)}%</span>
202
- </div>
203
- ))}
204
- </div>
205
- </div>
206
- ```
207
-
208
- ## Future Improvements
209
-
210
- - **Incremental updates**: Score only new books instead of full dataset
211
- - **Smaller model**: Try lightweight emotion classifiers (faster inference)
212
- - **Multi-label tags**: Use text classification for predefined categories
213
- - **Tag filtering**: Allow users to filter by specific tags in search
214
- - **Emotion-based sorting**: Sort results by dominant emotion match
215
- - **Caching**: Cache emotion inference results in Redis for API speedup
216
-
217
- ## Dependencies
218
-
219
- ```
220
- scikit-learn # TF-IDF vectorization
221
- transformers # Emotion classification
222
- torch # Model inference
223
- tqdm # Progress bars
224
- ```
225
-
226
- ## Notes
227
-
228
- - Tags and emotions are **one-time computed** and stored in CSV
229
- - No re-computation on API requests (instant serving)
230
- - CSV file (242MB) is in `.gitignore` (too large for GitHub)
231
- - To regenerate on a new machine, run both scripts sequentially:
232
- 1. `generate_tags.py` (~5 minutes)
233
- 2. `generate_emotions.py` (~7 hours on MPS for full dataset)
 
cover-not-found.jpg DELETED
Binary file (64.4 kB)
 
data-exploration.ipynb DELETED
The diff for this file is too large to render. See raw diff
 
docs/README.md CHANGED
@@ -1,26 +1,30 @@
# Project Documentation

- ## Core Documents
-
- | Document | Purpose | Audience |
- |:---|:---|:---|
- | [build_guide.md](build_guide.md) | Full build pipeline from zero to production | New developers |
- | [roadmap.md](roadmap.md) | Project status, architecture vision, implementation plan | Developers |
- | [technical_report.md](technical_report.md) | System architecture, technical details | Technical interviews |
- | [interview_guide.md](interview_guide.md) | Interview Q&A, STAR cases | Job preparation |
-
- ## Archives
-
| Document | Purpose |
|:---|:---|
- | [experiments/experiment_archive.md](experiments/experiment_archive.md) | All experiment results and benchmarks |
-
- ## Quick Links
-
- - **Frozen Version**: v2.6.0 (Agentic RAG + 7-Channel RecSys + Stacking)
- - **Performance**: HR@10 = 0.4545, MRR@5 = 0.2893 (n=2000, Leave-Last-Out)

---

- *Frozen January 2026 maintenance mode for portfolio use*
-
# Project Documentation

+ ## Layer 1 — Main Story (README, 5-min interview)

+ | Document | Purpose |
+ |:---|:---|
+ | [Technical Report](TECHNICAL_REPORT.md) | Architecture, design decisions, method storyline |
+ | [Experiment Archive](experiments/experiment_archive.md) | Consolidated experiment log (V1.0 → v2.6.0) |

+ ## Layer 2 — Capability Showcase (Resume, technical Q&A)

| Document | Purpose |
|:---|:---|
+ | [Interview Guide](interview_guide.md) | Q&A, STAR cases |
+ | [Memory Optimization](memory_optimization.md) | Zero-RAM SQLite, engineering decisions |
+ | [Performance Debugging](performance_debugging_report.md) | Root cause analysis |
+ | [Build Guide](build_guide.md) | Full build pipeline |
+ | [Hugging Face Deployment](huggingface_deployment.md) | HF Spaces deployment |

+ ## Archives

+ | Path | Contents |
+ |:---|:---|
+ | [archived/](archived/) | Deprecated docs (Phase 2, TAGS, REVIEW_HIGHLIGHTS, etc.) |
+ | [archived/graveyard/](archived/graveyard/) | Layer 3 — tried but not in main story (future_roadmap, interview_deep_dive, etc.) |
+ | [experiments/reports/](experiments/reports/) | Raw experiment reports (baseline, hybrid, rerank, router, temporal) |

---

+ **Frozen v2.6.0** — HR@10 = 0.4545, MRR@5 = 0.2893
 
docs/TECHNICAL_REPORT.md CHANGED
@@ -9,7 +9,7 @@
9
 
10
  ## Executive Summary
11
 
12
- This project implements a production-grade Agentic RAG (Retrieval-Augmented Generation) system for book discovery, combined with a personalized recommendation engine. Unlike simple vector search, it uses a self-routing architecture that dynamically selects the optimal retrieval strategy based on query intent.
13
 
14
  Key achievements:
15
  - 100% recall on exact-match queries (ISBNs)
@@ -18,7 +18,7 @@ Key achievements:
18
  - Detail-level precision via hierarchical (Small-to-Big) retrieval
19
  - Personalized recommendations using 7-channel recall (Item2Vec, Stacking) and LGBMRanker (LambdaRank)
20
 
21
- The system demonstrates mastery of both Data-Centric AI (SFT data synthesis) and Advanced RAG Architecture (Hybrid Search, Reranking, Query Routing).
22
 
23
  ---
24
 
@@ -83,11 +83,12 @@ USER REQUEST (No Query)
83
  |
84
  v
85
  +---------------------------+
86
- | 6-CHANNEL RECALL (RRF) |
87
  | - ItemCF (direction wt) |
88
  | - UserCF (Jaccard) |
89
  | - Swing (user-pair) |
90
  | - SASRec (embedding) |
 
91
  | - YoutubeDNN (two-tower) |
92
  | - Popularity (fallback) |
93
  +---------------------------+
@@ -161,7 +162,7 @@ Implementation (based on LlamaIndex Parent-Child, RAPTOR):
161
  2. Matching: Query matches specific sentence ("I cried at the ending")
162
3. Expansion: Map matched sentence → parent ISBN → full book context
163
 
164
- Result: Can answer queries like "books with unreliable narrator twist" that are invisible to description-level search.
165
 
166
  ### 3.5 Temporal Dynamics
167
 
@@ -187,7 +188,7 @@ Location: `src/core/context_compressor.py`
187
 
188
  ## 4. Personalized Recommendation System
189
 
190
- ### 4.1 Multi-Channel Recall (6 Channels)
191
 
192
  | Recall Channel | Algorithm | Weight | Purpose |
193
  |:---|:---|:---|:---|
@@ -195,6 +196,7 @@ Location: `src/core/context_compressor.py`
195
  | UserCF | User similarity (Jaccard + activity penalty) | 1.0 | Similar user preferences |
196
  | Swing | User-pair overlap weighting: `1/(α + \|I_u ∩ I_v\|)` | 1.0 | Substitute relationships |
197
  | SASRec | Dot-product retrieval from pre-computed embeddings | 1.0 | Sequential patterns |
 
198
  | YoutubeDNN | Two-tower user-item dot product | 0.1 | Deep learning recall |
199
  | Popularity | Rating count with time decay | 0.5 | Cold-start fallback |
200
 
@@ -215,9 +217,9 @@ Architecture: Self-Attentive Sequential Recommendation with Transformer blocks
215
  - Training: 30 epochs, 64-dim embeddings, BCE loss with negative sampling
216
  - Dual use: (1) ranking feature via `sasrec_score`, (2) independent recall channel via embedding dot-product
217
 
218
- ### 4.3 LGBMRanker (LambdaRank)
219
 
220
- Replaced XGBoost binary classifier with LightGBM LambdaRank that directly optimizes NDCG.
221
 
222
  **Training strategy**:
223
  - Hard negative sampling: negatives mined from recall results (not random items)
@@ -231,24 +233,25 @@ Replaced XGBoost binary classifier with LightGBM LambdaRank that directly optimi
231
  - Sequence: sasrec_score, sim_max, sim_min, sim_mean
232
  - CF scores: icf_sum, icf_max, ucf_sum
233
 
234
- Feature importance (V2.5 LGBMRanker):
235
 
236
  | Feature | Importance | Description |
237
  |:---|:---|:---|
238
- | i_cnt | 96 | Item popularity count |
239
- | sim_max | 91 | Last-N similarity max |
240
- | u_cnt | 80 | User activity count |
241
- | i_mean | 41 | Item average rating |
242
- | sasrec_score | 22 | SASRec embedding score |
243
- | icf_max | 23 | ItemCF max similarity |
244
 
245
  ### 4.4 Evaluation Results
246
 
247
- | Metric | V2.0 (XGBoost) | V2.5 (LGBMRanker) | Improvement |
 
 
248
  |:---|:---|:---|:---|
249
- | HR@10 | 0.1380 | **0.2205** | +59.8% |
250
- | MRR@5 | 0.1295 | **0.1584** | +22.3% |
251
- | Users Evaluated | 500 | 2,000 | |
252
  | Dataset | 167,968 active users, 221,998 books | | |
253
 
254
  ---
@@ -261,9 +264,9 @@ Feature importance (V2.5 LGBMRanker):
261
  |--------|------------------------|-------------|
262
  | ISBN Recall | 0% | 100% |
263
  | Keyword Precision | Low | High (BM25 boost) |
264
- | Detail Query Recall | 0% | High (Small-to-Big) |
265
  | Avg Latency | 100ms | 300-800ms |
266
- | Chat Context Limit | ~10 turns | Unlimited (compression) |
267
 
268
  ### 5.2 Latency Benchmarks
269
 
@@ -310,7 +313,9 @@ Feature importance (V2.5 LGBMRanker):
310
 
311
  ---
312
 
313
- ## 8. SFT Data Pipeline
 
 
314
 
315
  ### 8.1 Problem
316
 
@@ -328,6 +333,8 @@ Output:
328
  - `data/sft/literary_critic_train.jsonl`: ~800 high-quality (Query, Response) pairs
329
  - `data/dpo/preference_pairs.jsonl`: ~500 (Chosen, Rejected) pairs
330
 
 
 
331
  ---
332
 
333
  ## 9. File Structure
@@ -346,7 +353,8 @@ src/
346
  │ ├── sasrec_recall.py # SASRec Embedding Recall
347
  │ ├── popularity.py # Popularity Recall
348
  │ ├── youtube_dnn.py # Two-Tower Model
349
- └── fusion.py # RRF Fusion (6 channels)
 
350
  ├── ranking/
351
  │ └── features.py # 17 Ranking Features
352
  ├── data_factory/
@@ -359,7 +367,16 @@ src/
359
 
360
  ---
361
 
362
- ## 10. Scalability
 
 
 
 
 
 
 
 
 
363
 
364
  Current capacity:
365
  - In-memory index: 2GB RAM, ~200K books
 

## Executive Summary

+ This project implements an integrated Agentic RAG (Retrieval-Augmented Generation) system for book discovery, combined with a personalized recommendation engine. Unlike simple vector search, it uses a self-routing architecture that dynamically selects the optimal retrieval strategy based on query intent.

Key achievements:
- 100% recall on exact-match queries (ISBNs)
- Detail-level precision via hierarchical (Small-to-Big) retrieval
- Personalized recommendations using 7-channel recall (Item2Vec, Stacking) and LGBMRanker (LambdaRank)

+ The system demonstrates both an advanced RAG architecture (Hybrid Search, Reranking, Query Routing) and a multi-channel RecSys (Item2Vec, LGBMRanker, Stacking).

---
 
 
            |
            v
+---------------------------+
+ | 7-CHANNEL RECALL (RRF)    |
| - ItemCF (direction wt)   |
| - UserCF (Jaccard)        |
| - Swing (user-pair)       |
| - SASRec (embedding)      |
+ | - Item2Vec (Word2Vec)     |
| - YoutubeDNN (two-tower)  |
| - Popularity (fallback)   |
+---------------------------+
 
2. Matching: Query matches specific sentence ("I cried at the ending")
3. Expansion: Map matched sentence → parent ISBN → full book context

+ Result: Can answer queries like "books with unreliable narrator twist" that are invisible to description-level search. *RAG components (ISBN recall, reranking, Small-to-Big) were validated via curated examples and routing statistics; no large-scale human evaluation was conducted.*
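The match-then-expand steps above can be sketched as a child-sentence index mapped back to parent books. The keyword-overlap scorer and the two index rows are hypothetical stand-ins for the real sentence-embedding match:

```python
# Small-to-Big sketch: match the query against child sentences, then expand
# each hit to its parent book. Token overlap stands in for embedding similarity;
# all data here is illustrative, not the project's actual index.
child_index = [
    ("I cried at the ending", "isbn_001"),
    ("A twisty unreliable narrator kept me guessing", "isbn_002"),
]
books = {"isbn_001": "Book One", "isbn_002": "Book Two"}  # parent store

def small_to_big(query, top_k=1):
    q = set(query.lower().split())
    scored = [(len(q & set(s.lower().split())), isbn) for s, isbn in child_index]
    scored.sort(reverse=True)
    return [books[isbn] for score, isbn in scored[:top_k] if score > 0]

print(small_to_big("unreliable narrator twist"))  # → ['Book Two']
```

The key property is that the query only has to match one sentence-level chunk, while the answer returned is the full parent record.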

### 3.5 Temporal Dynamics

 

## 4. Personalized Recommendation System

+ ### 4.1 Multi-Channel Recall (7 Channels)

| Recall Channel | Algorithm | Weight | Purpose |
|:---|:---|:---|:---|
| UserCF | User similarity (Jaccard + activity penalty) | 1.0 | Similar user preferences |
| Swing | User-pair overlap weighting: `1/(α + \|I_u ∩ I_v\|)` | 1.0 | Substitute relationships |
| SASRec | Dot-product retrieval from pre-computed embeddings | 1.0 | Sequential patterns |
+ | Item2Vec | Word2Vec (Skip-gram) on user interaction sequences | 0.8 | Implicit co-occurrence |
| YoutubeDNN | Two-tower user-item dot product | 0.1 | Deep learning recall |
| Popularity | Rating count with time decay | 0.5 | Cold-start fallback |
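The Item2Vec channel in the table above treats each user's interaction sequence as a "sentence" and learns skip-gram embeddings over item IDs. The retrieval half can be sketched as follows; the 4-dim vectors and the mean-of-history user profile are illustrative assumptions (real embeddings would come from a trained skip-gram model, and the project's exact scoring may differ):

```python
import numpy as np

# Hypothetical 4-dim item embeddings (stand-ins for trained skip-gram vectors).
item_vecs = {
    "isbn_A": np.array([0.9, 0.1, 0.0, 0.0]),
    "isbn_B": np.array([0.8, 0.2, 0.1, 0.0]),
    "isbn_C": np.array([0.0, 0.1, 0.9, 0.3]),
}

def item2vec_recall(history, k=1):
    """Average the user's recent item vectors, rank unseen items by cosine."""
    profile = np.mean([item_vecs[i] for i in history], axis=0)
    def cos(v):
        return float(v @ profile / (np.linalg.norm(v) * np.linalg.norm(profile)))
    candidates = [(cos(v), i) for i, v in item_vecs.items() if i not in history]
    return [i for _, i in sorted(candidates, reverse=True)[:k]]

print(item2vec_recall(["isbn_A"]))  # → ['isbn_B'] (closest co-occurrence neighbor)
```

Because embeddings are pre-computed offline, this channel adds only a nearest-neighbor lookup at serving time.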

- Training: 30 epochs, 64-dim embeddings, BCE loss with negative sampling
- Dual use: (1) ranking feature via `sasrec_score`, (2) independent recall channel via embedding dot-product

+ ### 4.3 LGBMRanker (LambdaRank) + Model Stacking

+ Replaced the XGBoost binary classifier with LightGBM LambdaRank, which directly optimizes NDCG. In v2.6.0, a Stacking ensemble (LGBMRanker + XGBClassifier → LogisticRegression meta-learner) further improves ranking robustness.
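The stacking step can be sketched as follows: scores from the two base models form a level-1 feature matrix for a logistic-regression meta-learner. The scores below are synthetic stand-ins for real LGBMRanker/XGBClassifier outputs, and in practice they would be produced out-of-fold to avoid leakage:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 200
labels = rng.integers(0, 2, n)               # 1 = user interacted with the item
# Synthetic base-model scores, correlated with the label plus noise:
lgbm_score = labels + rng.normal(0, 0.4, n)  # stand-in for LGBMRanker output
xgb_score = labels + rng.normal(0, 0.5, n)   # stand-in for XGBClassifier output

meta_X = np.column_stack([lgbm_score, xgb_score])  # level-1 feature matrix
meta = LogisticRegression().fit(meta_X, labels)    # meta-learner

# Blend two new candidates: one scored high by both bases, one scored low.
probs = meta.predict_proba([[0.95, 0.90], [0.05, 0.10]])[:, 1]
print(probs[0] > probs[1])
```

The meta-learner learns how much to trust each base model, which is why the ensemble can beat either model alone.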
223
 
224
  **Training strategy**:
225
  - Hard negative sampling: negatives mined from recall results (not random items)
 
233
  - Sequence: sasrec_score, sim_max, sim_min, sim_mean
234
  - CF scores: icf_sum, icf_max, ucf_sum
235
 
236
+ Feature importance (v2.6.0 LGBMRanker, representative subset):
237
 
238
  | Feature | Importance | Description |
239
  |:---|:---|:---|
240
+ | u_cnt | 88 | User activity count |
241
+ | sim_max | 76 | Last-N similarity max |
242
+ | icf_max | 62 | ItemCF max similarity |
243
+ | i_cnt | 59 | Item popularity count |
244
+ | len_diff | 55 | Description complexity match |
245
+ | sasrec_score | 25 | SASRec embedding score |
246
 
247
  ### 4.4 Evaluation Results

+ *Protocol: Leave-Last-Out, n=2000 users, title-relaxed matching, filter_favorites=False.*
+
+ | Metric | V2.0 (XGBoost) | V2.5 (LGBMRanker) | v2.6.0 (+Item2Vec, Stacking) |
|:---|:---|:---|:---|
+ | HR@10 | 0.1380 | 0.2205 | **0.4545** |
+ | MRR@5 | 0.1295 | 0.1584 | **0.2893** |
| Dataset | 167,968 active users, 221,998 books | | |
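The Leave-Last-Out protocol used above holds out each user's chronologically last interaction as the test target, with the rest forming the training history. A minimal sketch over a hypothetical event log:

```python
# Leave-Last-Out split: the last item of each time-sorted user sequence
# becomes the held-out target. The event log here is hypothetical.
def leave_last_out(user_events):
    """user_events: {user: [(timestamp, isbn), ...]} -> (train, test)."""
    train, test = {}, {}
    for user, events in user_events.items():
        ordered = [isbn for _, isbn in sorted(events)]
        train[user] = ordered[:-1]
        test[user] = ordered[-1]
    return train, test

events = {"u1": [(3, "isbn_C"), (1, "isbn_A"), (2, "isbn_B")]}
train, test = leave_last_out(events)
print(train["u1"], test["u1"])  # → ['isbn_A', 'isbn_B'] isbn_C
```

HR@10 then counts the fraction of users whose held-out item appears in their top-10 recommendations, and MRR@5 averages the reciprocal rank of that item within the top 5.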
256
 
257
  ---
 
264
  |--------|------------------------|-------------|
265
  | ISBN Recall | 0% | 100% |
266
  | Keyword Precision | Low | High (BM25 boost) |
267
+ | Detail Query Recall | 0% | Demonstrated via curated examples (Small-to-Big) |
268
  | Avg Latency | 100ms | 300-800ms |
269
+ | Chat Context Limit | ~10 turns | Extended via compression (no formal limit) |
270
 
271
  ### 5.2 Latency Benchmarks
272
 
 
313
 
314
  ---
315
 
316
+ ## 8. SFT Data Pipeline (Supplementary)
317
+
318
+ *Not integrated into the main RAG flow in v2.6.0.* This pipeline was developed for potential future fine-tuning of chat tone.
319
 
320
  ### 8.1 Problem
321
 
 
333
  - `data/sft/literary_critic_train.jsonl`: ~800 high-quality (Query, Response) pairs
334
  - `data/dpo/preference_pairs.jsonl`: ~500 (Chosen, Rejected) pairs
335
 
336
+ See [Experiment Archive](experiments/experiment_archive.md) for full implementation details.
337
+
338
  ---
339
 
340
  ## 9. File Structure
 
353
  │ ├── sasrec_recall.py # SASRec Embedding Recall
354
  │ ├── popularity.py # Popularity Recall
355
  │ ├── youtube_dnn.py # Two-Tower Model
356
+ │ ├── item2vec.py # Item2Vec Recall (Word2Vec)
357
+ │ └── fusion.py # RRF Fusion (7 channels)
358
  ├── ranking/
359
  │ └── features.py # 17 Ranking Features
360
  ├── data_factory/
 
367
 
368
  ---
369
 
370
+ ## 10. Limitations
371
+
372
+ - **Single-dataset evaluation**: All RecSys metrics are on Amazon Books 200K; no cross-domain or external validation.
373
+ - **Rule-based router**: Intent classification uses heuristics (e.g., `len(words) <= 2` for keyword); may not generalize to other domains.
374
+ - **RAG evaluation**: RAG quality is demonstrated via curated examples (e.g., "Harry Potter", ISBN recall); no systematic human evaluation or large-scale relevance judgments.
375
+ - **Protocol sensitivity**: RecSys metrics can vary with evaluation protocol (e.g., ISBN-only vs title-relaxed matching); see [Experiment Archive](experiments/experiment_archive.md) for discussion.
376
+
377
+ ---
378
+
379
+ ## 11. Scalability
380
 
381
  Current capacity:
382
  - In-memory index: 2GB RAM, ~200K books
docs/archived/PHASE_2_DEVELOPMENT.md CHANGED
@@ -204,12 +204,12 @@ persona = getPersona(userId)
  - **Method:** Rule-based persona matching (Top-3 authors/categories)
  - **Future:** Could upgrade to LLM refinement (e.g., GPT for polish)

- ### Why React + Vite?
+ ### Why React + Vite (not Gradio)?
  - **Rationale:**
- - React needed for custom UX and production-grade interface
+ - Gradio good for prototypes, React needed for custom UX
  - Vite super fast (no webpack pain)
- - Tailwind CSS for modern styling
- - **Architecture:** React frontend (port 5173) + FastAPI backend (port 6006/8000)
+ - Tailwind CDN avoids npm build complexity
+ - **Fallback:** Gradio UI (app.py) still available on port 7860

  ### Why Persona from Favorites (not search history)?
  - **Rationale:** User intent explicit in favorites, not implicit in queries
@@ -345,6 +345,13 @@ npm run dev
  # Connect to http://localhost:6006 backend
  ```

+ **Terminal 3 (Optional): Gradio Legacy UI**
+ ```bash
+ python app.py
+ # Starts on http://localhost:7860
+ # Alternative UI for testing
+ ```
+
  ### Production Workflow
  - React builds with `npm run build` → static files
  - FastAPI serves as single backend
@@ -415,13 +422,13 @@ Expected highlight: "Not your usual genre, but its plot depth pairs with sci-fi imagination..."
  ✅ **Modular backend design** → easy to add /highlights, /persona endpoints
  ✅ **React UI responsiveness** → users see results instantly
  ✅ **JSON-first approach** → no DB setup friction, iterate fast
- ✅ **API-driven architecture** → React frontend with FastAPI backend
+ ✅ **API-driven architecture** → Gradio + React both work
  ✅ **Persona concept** → users feel "understood" by the system

  ### Challenges Overcome
- 🔧 **Port configuration** (React:5173 vs FastAPI:6006/8000) → Makefile organization
+ 🔧 **Port conflicts** (Gradio:7860 vs React:5173 vs FastAPI:6006) → Makefile organization
  🔧 **CORS issues** (frontend can't reach backend) → Added CORSMiddleware
- 🔧 **Image loading** (external URLs) → Runtime fetching + local fallback
+ 🔧 **Image loading** (external URLs not allowed in Gradio) → Runtime fetching + local fallback
  🔧 **Timeout errors** (cold startup > 10s) → Increased client timeouts, optimized startup

  ### Design Philosophy Validated
@@ -454,11 +461,12 @@ book-rec-with-LLMs/
  │ │ └── main.jsx # Entry point
  │ ├── index.html # HTML + Tailwind CDN
  │ └── package.json # Dependencies
+ ├── app.py # Gradio UI (legacy)
  ├── Makefile # Commands
  ├── requirements.txt # Python deps
  └── data/
- ├── books_processed.csv # Metadata + review highlights
- └── user_profiles.json # User data
+ ├── books_processed.csv # Metadata
+ └── user_profiles.json # ✨ NEW: User data
  ```

  ---
@@ -478,6 +486,7 @@ feat: add React UI and backend personalization features
  * src/marketing/highlights.py: Persona-aware selling points
  * 3 new API endpoints in FastAPI

+ - Maintain Gradio UI as optional fallback
  - Add CORS support, update timeouts, improve infrastructure
  ```

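The "Rule-based persona matching (Top-3 authors/categories)" referenced in the archived PHASE_2 notes above can be sketched as below. The `favorites` schema and field names are assumptions for illustration, not the project's actual data model.

```python
from collections import Counter


def build_persona(favorites: list[dict]) -> dict:
    """Illustrative Top-3 authors/categories persona from favorited books.

    Assumes each favorite carries 'authors' and 'categories' lists
    (hypothetical schema).
    """
    authors = Counter(a for b in favorites for a in b.get("authors", []))
    categories = Counter(c for b in favorites for c in b.get("categories", []))
    return {
        "top_authors": [a for a, _ in authors.most_common(3)],
        "top_categories": [c for c, _ in categories.most_common(3)],
    }
```

Counting over explicit favorites (rather than search history) keeps the signal intentional, which is the design rationale the notes give.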
DEPLOYMENT.md → docs/archived/graveyard/DEPLOYMENT.md RENAMED
File without changes
business_logic.md → docs/archived/graveyard/business_logic.md RENAMED
File without changes
docs/{future_roadmap.md → archived/graveyard/future_roadmap.md} RENAMED
File without changes
docs/{interview_deep_dive.md → archived/graveyard/interview_deep_dive.md} RENAMED
File without changes
interview_prep.md → docs/archived/graveyard/interview_prep.md RENAMED
File without changes
docs/{phase7_plan.md → archived/graveyard/phase7_plan.md} RENAMED
File without changes
project_analysis.md → docs/archived/graveyard/project_analysis.md RENAMED
File without changes
docs/{project_narrative.md → archived/graveyard/project_narrative.md} RENAMED
File without changes
docs/{rag_architecture.md → archived/graveyard/rag_architecture.md} RENAMED
File without changes
technical_architecture.md → docs/archived/graveyard/technical_architecture.md RENAMED
File without changes
docs/{technical_deep_dive_sota.md → archived/graveyard/technical_deep_dive_sota.md} RENAMED
File without changes
story_and_strategy.md → docs/archived/story_and_strategy.md RENAMED
File without changes
{experiments → docs/experiments/reports}/baseline_report.md RENAMED
File without changes
benchmarks/results.md → docs/experiments/reports/benchmark_results.md RENAMED
File without changes
{experiments → docs/experiments/reports}/hybrid_report.md RENAMED
File without changes
{experiments → docs/experiments/reports}/rerank_report.md RENAMED
File without changes
{experiments → docs/experiments/reports}/router_report.md RENAMED
File without changes
{experiments → docs/experiments/reports}/temporal_report.md RENAMED
File without changes
legacy/README.md ADDED
@@ -0,0 +1,10 @@
+ # Legacy — Not part of main architecture
+
+ Code moved here is preserved but not used in the main flow (src.main FastAPI + React).
+
+ | File | Note |
+ |:---|:---|
+ | app.py | Gradio UI (replaced by React + FastAPI) |
+ | agent/ | Shopping agent (broken imports, not used) |
+ | deploy.sh | Old Hugging Face deployment script |
+ | download_fix.py | Temporary fix script |
{src → legacy}/agent/agent_core.py RENAMED
File without changes
{src → legacy}/agent/data_loader.py RENAMED
File without changes
{src → legacy}/agent/dialogue_manager.py RENAMED
File without changes
{src → legacy}/agent/intent_parser.py RENAMED
File without changes
{src → legacy}/agent/llm_generator.py RENAMED
File without changes
{src → legacy}/agent/rag_indexer.py RENAMED
File without changes
{src → legacy}/agent/rag_retriever.py RENAMED
File without changes
app.py → legacy/app.py RENAMED
@@ -48,7 +48,7 @@ tones = fetch_tones()
  # Initialize Shopping Agent (Mock or Real)
  # Note: Real agent requires FAISS index. We'll handle checks later.
  try:
-     # from src.agent.agent_core import ShoppingAgent
+     # from legacy.agent.agent_core import ShoppingAgent
      # shopping_agent = ShoppingAgent(...)
      pass
  except ImportError:
deploy.sh → legacy/deploy.sh RENAMED
File without changes
download_fix.py → legacy/download_fix.py RENAMED
File without changes
requirements.txt CHANGED
@@ -40,7 +40,6 @@ scipy
  requests

  # LLM Agent & Fine-tuning
- langchain
  faiss-cpu
  diffusers
  openai
scripts/add_isbn13_to_books_data.py DELETED
@@ -1,16 +0,0 @@
- import pandas as pd
-
- # Read the main table and books_data_with_isbn.csv
- main = pd.read_csv("data/books_with_emotions.csv", usecols=["title", "isbn13"])
- data = pd.read_csv("data/books_data_with_isbn.csv")
-
- # Normalize titles
- main["title"] = main["title"].astype(str).str.strip().str.lower()
- data["Title"] = data["Title"].astype(str).str.strip().str.lower()
-
- # Left-join the two tables
- merged = data.merge(main, left_on="Title", right_on="title", how="left")
-
- # Save the new file
- merged.to_csv("data/books_data_with_isbn13.csv", index=False)
- print("Generated data/books_data_with_isbn13.csv with the isbn13 field.")
scripts/add_isbn_to_books_data.py DELETED
@@ -1,21 +0,0 @@
- import pandas as pd
-
- # Read books_data.csv
- books_data = pd.read_csv("data/books_data.csv")
-
- # Read Books_rating.csv, keeping only the Title and Id columns
- ratings = pd.read_csv("data/Books_rating.csv", usecols=["Title", "Id"])
-
- # Drop duplicates to avoid many-to-one joins
- ratings = ratings.drop_duplicates(subset=["Title"])
-
- # Left-join, keeping every row of books_data.csv
- merged = books_data.merge(ratings, on="Title", how="left")
-
- # Rename Id to isbn
- merged = merged.rename(columns={"Id": "isbn"})
-
- # Save the new file
- merged.to_csv("data/books_data_with_isbn.csv", index=False)
-
- print("Generated data/books_data_with_isbn.csv with the isbn field.")
scripts/benchmark_compressor.py DELETED
@@ -1,35 +0,0 @@
- import asyncio
- from langchain_core.messages import HumanMessage, AIMessage
- from src.core.context_compressor import compressor
-
- async def run_benchmark():
-     print("🚀 Starting Context Compression Benchmark...")
-
-     # 1. Simulate Long History (12 messages, 6 turns)
-     history = []
-     for i in range(1, 7):
-         history.append(HumanMessage(content=f"User question {i}: I like sci-fi."))
-         history.append(AIMessage(content=f"AI answer {i}: Here is a sci-fi book."))
-
-     print(f"Original History Length: {len(history)} messages")
-
-     # 2. Compress
-     print("Compressing...")
-     # Mock LLM generation usually takes time, so latency includes API call
-     compressed = await compressor.compress_history(history)
-
-     print(f"Compressed History Length: {len(compressed)} messages")
-
-     # 3. Validation
-     # Expected: 1 SystemMessage (Summary) + 4 Messages (Recent) = 5
-     if len(compressed) == 5:
-         print("✅ SUCCESS: History compressed to 5 messages.")
-         print(f"Summary Content: {compressed[0].content}")
-         print(f"Oldest Retained Message: {compressed[1].content}")
-     else:
-         print(f"❌ FAILURE: Expected 5 messages, got {len(compressed)}")
-         for i, m in enumerate(compressed):
-             print(f"[{i}] {type(m).__name__}: {m.content}")
-
- if __name__ == "__main__":
-     asyncio.run(run_benchmark())
scripts/benchmark_hybrid.py DELETED
@@ -1,83 +0,0 @@
- import time
- import pandas as pd
- from src.vector_db import VectorDB
-
- def run_benchmark():
-     print("🚀 Starting Hybrid Retrieval Benchmark...")
-
-     # Load Title Mapping
-     try:
-         books_df = pd.read_csv("data/books_processed.csv")
-         # Ensure string ISBN for matching
-         if 'isbn13' in books_df.columns:
-             books_df['isbn'] = books_df['isbn13'].astype(str)
-         else:
-             books_df['isbn'] = books_df['isbn'].astype(str)
-
-         isbn_map = books_df.set_index('isbn')['title'].to_dict()
-     except Exception as e:
-         print(f"⚠️ Failed to load books_processed.csv: {e}")
-         isbn_map = {}
-
-     db = VectorDB()
-
-     # Same Test Cases
-     test_queries = [
-         # 1. Semantic (Hybrid should match Dense)
-         {"type": "Semantic", "query": "books about finding love in unexpected places"},
-         {"type": "Semantic", "query": "scary stories that keep you up at night"},
-
-         # 2. Keyword/Proper Noun (Hybrid should improve)
-         {"type": "Keyword", "query": "Harry Potter"},
-         {"type": "Keyword", "query": "Python Programming"},
-         {"type": "Keyword", "query": "Jane Austen"},
-
-         # 3. Exact Match / ISBN (Hybrid should fix this)
-         {"type": "Exact", "query": "0060959479"},
-     ]
-
-     results = []
-
-     for case in test_queries:
-         q = case["query"]
-         print(f"\nScanning: '{q}' ({case['type']})...")
-
-         start_time = time.time()
-         # USE HYBRID SEARCH
-         docs = db.hybrid_search(q, k=5)
-         duration = (time.time() - start_time) * 1000
-
-         # Capture simplified results
-         top_results = []
-         for doc in docs:
-             # Extract ISBN
-             parts = doc.page_content.strip().split(' ', 1)
-             isbn = parts[0]
-             # Fallback parsing for legacy docs
-             if "ISBN:" in doc.page_content:
-                 isbn = doc.page_content.split("ISBN:")[1].strip().split()[0]
-
-             title = isbn_map.get(isbn, f"ISBN:{isbn}")
-             if len(title) > 40:
-                 title = title[:37] + "..."
-             top_results.append(title)
-
-         print(f" -> Found: {top_results}")
-         results.append({
-             "query": q,
-             "type": case["type"],
-             "latency_ms": round(duration, 2),
-             "top_results": top_results
-         })
-
-     # Save
-     df = pd.DataFrame(results)
-     path = "experiments/02_hybrid_results.csv"
-     df.to_csv(path, index=False)
-     print(f"\n💾 Results saved to {path}")
-
-     print("\n## Hybrid Search Results")
-     print(df.to_string(index=False))
-
- if __name__ == "__main__":
-     run_benchmark()
scripts/benchmark_rerank.py DELETED
@@ -1,82 +0,0 @@
- import time
- import pandas as pd
- from src.vector_db import VectorDB
-
- def run_benchmark():
-     print("🚀 Starting Reranked Retrieval Benchmark...")
-
-     # Load Title Mapping
-     try:
-         books_df = pd.read_csv("data/books_processed.csv")
-         if 'isbn13' in books_df.columns:
-             books_df['isbn'] = books_df['isbn13'].astype(str)
-         else:
-             books_df['isbn'] = books_df['isbn'].astype(str)
-         isbn_map = books_df.set_index('isbn')['title'].to_dict()
-     except Exception as e:
-         print(f"⚠️ Failed to load books_processed.csv: {e}")
-         isbn_map = {}
-
-     db = VectorDB()
-
-     # Same Test Cases
-     test_queries = [
-         # 1. Semantic (Reranker should bubble up best Semantic matches)
-         {"type": "Semantic", "query": "books about finding love in unexpected places"},
-         # Complex mood query
-         {"type": "Complex", "query": "a dark sci-fi thriller with a female protagonist"},
-
-         # 2. Keyword/Proper Noun (Reranker should confirm these are relevant)
-         {"type": "Keyword", "query": "Harry Potter"},
-         {"type": "Keyword", "query": "Jane Austen"},
-
-         # 3. Exact Match (Should still work)
-         {"type": "Exact", "query": "0060959479"},
-     ]
-
-     results = []
-
-     for case in test_queries:
-         q = case["query"]
-         print(f"\nScanning: '{q}' ({case['type']})...")
-
-         start_time = time.time()
-         # USE HYBRID WITH RERANK
-         docs = db.hybrid_search(q, k=5, rerank=True)
-         duration = (time.time() - start_time) * 1000
-
-         # Capture results with scores
-         top_results = []
-         for doc in docs:
-             # Extract ISBN
-             parts = doc.page_content.strip().split(' ', 1)
-             isbn = parts[0]
-             if "ISBN:" in doc.page_content:
-                 isbn = doc.page_content.split("ISBN:")[1].strip().split()[0]
-
-             title = isbn_map.get(isbn, f"ISBN:{isbn}")
-             if len(title) > 30:
-                 title = title[:27] + "..."
-
-             score = doc.metadata.get("relevance_score", 0.0)
-             top_results.append(f"{title} ({score:.4f})")
-
-         print(f" -> Found: {top_results}")
-         results.append({
-             "query": q,
-             "type": case["type"],
-             "latency_ms": round(duration, 2),
-             "top_results": top_results
-         })
-
-     # Save
-     df = pd.DataFrame(results)
-     path = "experiments/03_rerank_results.csv"
-     df.to_csv(path, index=False)
-     print(f"\n💾 Results saved to {path}")
-
-     print("\n## Reranked Search Results")
-     print(df.to_string(index=False))
-
- if __name__ == "__main__":
-     run_benchmark()
scripts/benchmark_retrieval.py DELETED
@@ -1,82 +0,0 @@
- import time
- import pandas as pd
- from typing import List
- from src.vector_db import VectorDB
-
- def run_benchmark():
-     print("🚀 Starting Retrieval Benchmark (BASELINE)...")
-
-     # Load Title Mapping
-     try:
-         books_df = pd.read_csv("data/books_processed.csv")
-         # Ensure string ISBN for matching
-         books_df['isbn'] = books_df['isbn'].astype(str)
-         isbn_map = books_df.set_index('isbn')['title'].to_dict()
-         print(f"📚 Loaded {len(isbn_map)} titles for mapping.")
-     except Exception as e:
-         print(f"⚠️ Failed to load books_processed.csv: {e}")
-         isbn_map = {}
-
-     db = VectorDB()
-
-     # ... (Test Cases preserved) ...
-     test_queries = [
-         # 1. Semantic (Dense should win)
-         {"type": "Semantic", "query": "books about finding love in unexpected places"},
-         {"type": "Semantic", "query": "scary stories that keep you up at night"},
-
-         # 2. Keyword/Proper Noun (Dense might struggle)
-         {"type": "Keyword", "query": "Harry Potter"},
-         {"type": "Keyword", "query": "Python Programming"},
-         {"type": "Keyword", "query": "Jane Austen"},
-
-         # 3. Exact Match / ISBN
-         {"type": "Exact", "query": "0060959479"},
-     ]
-
-     results = []
-
-     for case in test_queries:
-         q = case["query"]
-         print(f"\nScanning: '{q}' ({case['type']})...")
-
-         start_time = time.time()
-         docs = db.search(q, k=5)
-         duration = (time.time() - start_time) * 1000
-
-         # Capture simplified results
-         top_results = []
-         for doc in docs:
-             # Format: "ISBN ReviewText..."
-             # Extract ISBN (first token)
-             parts = doc.page_content.strip().split(' ', 1)
-             isbn = parts[0]
-
-             # Lookup Title
-             title = isbn_map.get(isbn, f"ISBN:{isbn}")
-
-             # Truncate for display
-             if len(title) > 40:
-                 title = title[:37] + "..."
-             top_results.append(title)
-
-         print(f" -> Found: {top_results}")
-         results.append({
-             "query": q,
-             "type": case["type"],
-             "latency_ms": round(duration, 2),
-             "top_results": top_results
-         })
-
-     # Save Report
-     df = pd.DataFrame(results)
-     path = "experiments/01_baseline_results.csv"
-     df.to_csv(path, index=False)
-     print(f"\n💾 Results saved to {path}")
-
-     # Print Summary
-     print("\n## Baseline Results Summary")
-     print(df.to_string(index=False))
-
- if __name__ == "__main__":
-     run_benchmark()
scripts/benchmark_router.py DELETED
@@ -1,99 +0,0 @@
- import time
- import pandas as pd
- from src.vector_db import VectorDB
- from src.core.router import QueryRouter
-
- def run_benchmark():
-     print("🚀 Starting Agentic Router Benchmark...")
-
-     # Init Components
-     db = VectorDB()
-     router = QueryRouter()
-
-     # Load Title Mapping (for display)
-     try:
-         books_df = pd.read_csv("data/books_processed.csv")
-         if 'isbn13' in books_df.columns:
-             books_df['isbn'] = books_df['isbn13'].astype(str)
-         else:
-             books_df['isbn'] = books_df['isbn'].astype(str)
-         isbn_map = books_df.set_index('isbn')['title'].to_dict()
-     except:
-         isbn_map = {}
-
-     test_queries = [
-         # 1. ISBN -> Should be EXACT (No Rerank) to avoid regression
-         {"query": "0060959479", "expected_strat": "exact"},
-
-         # 2. Keyword -> Should be FAST (No Rerank)
-         {"query": "python programming", "expected_strat": "fast"},
-
-         # 3. Complex -> Should be DEEP (With Rerank)
-         {"query": "books about finding love in unexpected places", "expected_strat": "deep"},
-     ]
-
-     results = []
-
-     for case in test_queries:
-         q = case["query"]
-         print(f"\nUser Query: '{q}'")
-
-         # 1. ROUTING STEP
-         route_decision = router.route(q)
-         strat = route_decision["strategy"]
-         use_rerank = route_decision["rerank"]
-         alpha_val = route_decision.get("alpha", 0.5)
-
-         print(f" 🤖 Router Decision: {strat.upper()} (Rerank={use_rerank}, Alpha={alpha_val})")
-
-         # Check expectation
-         if strat != case["expected_strat"]:
-             print(f" ⚠️ WARNING: Expected {case['expected_strat']}, got {strat}")
-
-         # 2. RETRIEVAL STEP
-         start_time = time.time()
-         docs = db.hybrid_search(
-             q,
-             k=5,
-             rerank=use_rerank,
-             alpha=alpha_val
-         )
-         duration = (time.time() - start_time) * 1000
-
-         # Capture results
-         top_results = []
-         for doc in docs:
-             # Extract ISBN/Title
-             if "ISBN:" in doc.page_content:
-                 isbn = doc.page_content.split("ISBN:")[1].strip().split()[0]
-             else:
-                 parts = doc.page_content.strip().split(' ', 1)
-                 isbn = parts[0]
-
-             title = isbn_map.get(isbn, f"ISBN:{isbn}")
-             if len(title) > 30:
-                 title = title[:27] + "..."
-
-             score = doc.metadata.get("relevance_score", "N/A")
-             if score != "N/A":
-                 top_results.append(f"{title} ({score:.4f})")
-             else:
-                 top_results.append(f"{title}")
-
-         print(f" -> Found: {top_results[:3]}")
-         results.append({
-             "query": q,
-             "strategy": strat,
-             "latency_ms": round(duration, 2),
-             "top_1": top_results[0] if top_results else "None"
-         })
-
-     # Save
-     df = pd.DataFrame(results)
-     path = "experiments/04_router_results.csv"
-     df.to_csv(path, index=False)
-     print(f"\n💾 Results saved to {path}")
-     print(df.to_string(index=False))
-
- if __name__ == "__main__":
-     run_benchmark()
scripts/benchmark_temporal.py DELETED
@@ -1,44 +0,0 @@
- import pandas as pd
- from src.vector_db import VectorDB
-
- def run_benchmark():
-     print("🚀 Starting Temporal Dynamics Benchmark...")
-
-     db = VectorDB()
-
-     # We use a query where 'newness' matters
-     query = "latest advancements in technology and science"
-
-     print(f"\nQuery: '{query}'")
-
-     # 1. Standard Search
-     print("\n--- Standard Search (No Temporal) ---")
-     st_docs = db.hybrid_search(query, k=5, rerank=True, temporal=False)
-     for d in st_docs:
-         # Get Year
-         isbn = d.metadata.get("isbn") or d.metadata.get("isbn13")
-         if not isbn and "ISBN:" in d.page_content:
-             isbn = d.page_content.split("ISBN:")[1].strip().split()[0]
-         year = db.pub_years.get(str(isbn), "Unknown")
-         score = d.metadata.get("relevance_score", 0.0)
-
-         # Parse title
-         title = d.page_content.split('\n')[0].replace("Title: ", "")[:40]
-         print(f"[{year}] {title}... (Score: {score:.4f})")
-
-     # 2. Temporal Search
-     print("\n--- Temporal Search (Recent Boost) ---")
-     tm_docs = db.hybrid_search(query, k=5, rerank=True, temporal=True)
-     for d in tm_docs:
-         isbn = d.metadata.get("isbn") or d.metadata.get("isbn13")
-         if not isbn and "ISBN:" in d.page_content:
-             isbn = d.page_content.split("ISBN:")[1].strip().split()[0]
-         year = db.pub_years.get(str(isbn), "Unknown")
-         # In temporal mode, score is boosted
-         score = d.metadata.get("relevance_score", 0.0)
-
-         title = d.page_content.split('\n')[0].replace("Title: ", "")[:40]
-         print(f"[{year}] {title}... (Score: {score:.4f})")
-
- if __name__ == "__main__":
-     run_benchmark()
scripts/build_books_basic_info.py DELETED
@@ -1,48 +0,0 @@
- import pandas as pd
- import csv
-
- # Read the raw data; skip malformed rows so the pipeline doesn't abort
- books_data = pd.read_csv(
-     "data/books_data.csv",
-     engine="python",
-     quotechar='"',
-     escapechar='\\',
-     on_bad_lines='skip'  # pandas >=1.3
- )
- ratings = pd.read_csv("data/Books_rating.csv", engine="python", quotechar='"', escapechar='\\', on_bad_lines='skip')
-
- # Keep only the useful columns
- books_cols = [
-     "Title", "description", "authors", "image", "publisher", "publishedDate", "categories"
- ]
- books_data = books_data[books_cols]
-
- # Keep only Title, Id, review/score for the merge
- ratings_cols = ["Title", "Id", "review/score"]
- ratings = ratings[ratings_cols]
-
- # Drop duplicates
- ratings = ratings.drop_duplicates(subset=["Title"])
-
- # Left-join, keeping every row of books_data
- merged = books_data.merge(ratings, on="Title", how="left")
-
- # Rename columns
- merged = merged.rename(columns={
-     "Id": "isbn10",
-     "Title": "title",
-     "authors": "authors",
-     "description": "description",
-     "image": "image",
-     "publisher": "publisher",
-     "publishedDate": "publishedDate",
-     "categories": "categories",
-     "review/score": "average_rating"
- })
-
- # Generate isbn13 (placeholder; proper generation logic can be added later)
- merged["isbn13"] = None
-
- # Save the new table, quoting every field so long descriptions aren't truncated
- merged.to_csv("data/books_basic_info.csv", index=False, quoting=csv.QUOTE_ALL, quotechar='"', escapechar='\\')
- print("Generated data/books_basic_info.csv with basic book info fields.")