chore: remove deprecated files and add legacy directory to .dockerignore
This view is limited to 50 files because it contains too many changes. See raw diff.
- .cursor/rules/research-prototype-style.mdc +29 -0
- .dockerignore +3 -0
- PHASE_2_DEVELOPMENT.md +0 -518
- README.md +41 -192
- REVIEW_HIGHLIGHTS.md +0 -142
- TAGS_AND_EMOTIONS.md +0 -233
- cover-not-found.jpg +0 -0
- data-exploration.ipynb +0 -0
- docs/README.md +18 -14
- docs/TECHNICAL_REPORT.md +40 -23
- docs/archived/PHASE_2_DEVELOPMENT.md +18 -9
- DEPLOYMENT.md → docs/archived/graveyard/DEPLOYMENT.md +0 -0
- business_logic.md → docs/archived/graveyard/business_logic.md +0 -0
- docs/{future_roadmap.md → archived/graveyard/future_roadmap.md} +0 -0
- docs/{interview_deep_dive.md → archived/graveyard/interview_deep_dive.md} +0 -0
- interview_prep.md → docs/archived/graveyard/interview_prep.md +0 -0
- docs/{phase7_plan.md → archived/graveyard/phase7_plan.md} +0 -0
- project_analysis.md → docs/archived/graveyard/project_analysis.md +0 -0
- docs/{project_narrative.md → archived/graveyard/project_narrative.md} +0 -0
- docs/{rag_architecture.md → archived/graveyard/rag_architecture.md} +0 -0
- technical_architecture.md → docs/archived/graveyard/technical_architecture.md +0 -0
- docs/{technical_deep_dive_sota.md → archived/graveyard/technical_deep_dive_sota.md} +0 -0
- story_and_strategy.md → docs/archived/story_and_strategy.md +0 -0
- {experiments → docs/experiments/reports}/baseline_report.md +0 -0
- benchmarks/results.md → docs/experiments/reports/benchmark_results.md +0 -0
- {experiments → docs/experiments/reports}/hybrid_report.md +0 -0
- {experiments → docs/experiments/reports}/rerank_report.md +0 -0
- {experiments → docs/experiments/reports}/router_report.md +0 -0
- {experiments → docs/experiments/reports}/temporal_report.md +0 -0
- legacy/README.md +10 -0
- {src → legacy}/agent/agent_core.py +0 -0
- {src → legacy}/agent/data_loader.py +0 -0
- {src → legacy}/agent/dialogue_manager.py +0 -0
- {src → legacy}/agent/intent_parser.py +0 -0
- {src → legacy}/agent/llm_generator.py +0 -0
- {src → legacy}/agent/rag_indexer.py +0 -0
- {src → legacy}/agent/rag_retriever.py +0 -0
- app.py → legacy/app.py +1 -1
- deploy.sh → legacy/deploy.sh +0 -0
- download_fix.py → legacy/download_fix.py +0 -0
- requirements.txt +0 -1
- scripts/add_isbn13_to_books_data.py +0 -16
- scripts/add_isbn_to_books_data.py +0 -21
- scripts/benchmark_compressor.py +0 -35
- scripts/benchmark_hybrid.py +0 -83
- scripts/benchmark_rerank.py +0 -82
- scripts/benchmark_retrieval.py +0 -82
- scripts/benchmark_router.py +0 -99
- scripts/benchmark_temporal.py +0 -44
- scripts/build_books_basic_info.py +0 -48
.cursor/rules/research-prototype-style.mdc
ADDED
@@ -0,0 +1,29 @@
---
description: Research prototype coding style — graduate student, clarity over elegance
globs: src/**/*.py, scripts/**/*.py
alwaysApply: true
---

# Research Prototype Style

Target author: graduate student, working alone, optimizing for clarity and debuggability. NOT for production or large-scale deployment.

## Do

- **Simple, linear control flow** — avoid nested callbacks, deep inheritance
- **Minimal abstraction** — prefer direct implementation over factories/adapters
- **Inline comments** — explain *why*, not *what*
- **Strategic debug** — a few print/debug statements, preferably behind a flag (e.g. `if DEBUG:`)
- **Small duplication OK** — if it improves readability over DRY

## Avoid

- Excessive logging or print statements
- Heavy configuration systems (YAML configs, env-driven feature flags)
- Comprehensive unit test suites (targeted tests for critical paths only)
- Complex class hierarchies, abstract base classes
- Premature optimization

## When in Doubt

Prefer simplicity over elegance. Code should look like something a careful graduate student would realistically write and maintain.
.dockerignore
CHANGED
@@ -26,3 +26,6 @@ build/
 dist/
 eggs/
 .eggs/
+
+# Legacy (not used in main flow)
+legacy/
PHASE_2_DEVELOPMENT.md
DELETED
@@ -1,518 +0,0 @@
# Phase 2: Personalization & React UI Migration

**Date:** January 2026
**Status:** ✅ Complete & Deployed

---

## Overview

This phase shifted the project from a basic semantic book recommender to an **intelligent, personalized discovery platform** with a modern React frontend. The vision evolved from marketplace/swap features to a focused **recommendation engine grounded in user preferences and persona-driven insights**.

---

## Phase Vision & Direction

### Initial Pivot (from conversation)
- **Original concept:** Second-hand book marketplace/swap platform
- **User feedback:** Focus on the recommendation engine first, then expand
- **Final direction:** Keep it recommendation-only with two new pillars:
  1. **Favorites** → persistent user library tracking
  2. **Personalized Highlights** → AI-generated selling points based on user taste

### Core Philosophy
> "Books that understand you. Recommendations grounded in what you love."

The system learns from your reading preferences and surfaces books that match both the search query AND your unique taste profile.

---

## What Was Built

### 1. **Backend Personalization Layer** (`src/`)

#### A. User Favorites Storage
- **File:** `src/user/profile_store.py`
- **Mechanism:** JSON-based persistence (`data/user_profiles.json`)
- **Features:**
  - `add_favorite(user_id, isbn)` → idempotent add + deduplicate
  - `list_favorites(user_id)` → retrieve user's library
  - Works with any user_id (default: "local" for single-user dev)
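The store is simple enough to sketch in full. Below is a minimal, single-user illustration of what `profile_store.py` does; the exact field layout is an assumption (the real store also records `title` and `added_at`, as the User Profile model later in this doc shows):

```python
import json
from pathlib import Path

PROFILE_PATH = Path("data/user_profiles.json")

def _load() -> dict:
    # No file yet means no profiles; start empty.
    if PROFILE_PATH.exists():
        return json.loads(PROFILE_PATH.read_text(encoding="utf-8"))
    return {}

def _save(profiles: dict) -> None:
    PROFILE_PATH.parent.mkdir(parents=True, exist_ok=True)
    PROFILE_PATH.write_text(
        json.dumps(profiles, ensure_ascii=False, indent=2), encoding="utf-8"
    )

def add_favorite(user_id: str, isbn: str) -> int:
    """Idempotent add: re-adding the same ISBN deduplicates to a no-op."""
    profiles = _load()
    favorites = profiles.setdefault(user_id, {}).setdefault("favorites", [])
    if isbn not in (f["isbn"] for f in favorites):
        favorites.append({"isbn": isbn})
    _save(profiles)
    return len(favorites)

def list_favorites(user_id: str) -> list:
    """Retrieve the user's library; unknown users get an empty list."""
    return _load().get(user_id, {}).get("favorites", [])
```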
#### B. User Persona Aggregation
- **File:** `src/marketing/persona.py`
- **Input:** List of favorite ISBNs + book metadata DataFrame
- **Output:** `{ summary, top_authors[], top_categories[] }`
- **Algorithm:**
  1. Fetch metadata for all favorited books
  2. Extract top 3 authors (by frequency)
  3. Extract top 3 categories
  4. Generate natural language summary combining signals
- Example: *"您钟爱悬疑与科幻,偏好国际视野的作品。"* (You love mystery & sci-fi, prefer international perspectives)
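A sketch of steps 1-4, assuming a pandas DataFrame with `isbn`, `authors`, and `category` columns; the column names are assumptions, and the English template stands in for the Chinese summary generation:

```python
from collections import Counter

import pandas as pd

def build_persona(favorite_isbns: list[str], books: pd.DataFrame) -> dict:
    # Step 1: fetch metadata for the favorited books only.
    rows = books[books["isbn"].isin(favorite_isbns)]
    # Steps 2-3: top-3 authors and categories by simple frequency.
    top_authors = [a for a, _ in Counter(rows["authors"]).most_common(3)]
    top_categories = [c for c, _ in Counter(rows["category"]).most_common(3)]
    # Step 4: combine the signals into a one-line summary.
    summary = (
        f"You favor {', '.join(top_categories)}, "
        f"especially books by {', '.join(top_authors)}."
    )
    return {
        "summary": summary,
        "top_authors": top_authors,
        "top_categories": top_categories,
    }
```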
#### C. Personalized Highlights Generator
- **File:** `src/marketing/highlights.py`
- **Input:** ISBN + user persona + book metadata
- **Output:** `{ title, authors, category, highlights[], persona_summary }`
- **Generation Strategy:**
  - Match persona themes to book content (author, category, description)
  - Extract 3-5 contextual selling points
  - Combine rule-based matching + description parsing
- Example output:
```
- 作者获国际奖项,契合您对国际视野的热爱 (The author has won international awards, matching your taste for global perspectives)
- 悬疑与科幻的完美融合,正是您的最爱组合 (A perfect fusion of mystery and sci-fi, exactly your favorite combination)
- 情节紧凑,适合您快节奏阅读的偏好 (Tight plotting, suited to your preference for fast-paced reads)
```
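The matching can be sketched roughly as follows; the real `highlights.py` has richer rules and Chinese phrasing, so the specific conditions here are illustrative:

```python
def generate_highlights(book: dict, persona: dict, max_points: int = 5) -> list[str]:
    """Rule-based matching of persona signals against one book's metadata."""
    points = []
    if book.get("authors") in persona["top_authors"]:
        points.append(f"By {book['authors']}, one of your most-favorited authors")
    if book.get("category") in persona["top_categories"]:
        points.append(f"A {book['category']} pick, squarely inside your taste profile")
    elif persona["top_categories"]:
        # Bridge rule: acknowledge the user's taste, then stretch it.
        points.append(
            f"Outside your usual {persona['top_categories'][0]} shelf, "
            "but a natural next step"
        )
    # Description parsing: surface the opening sentence as a teaser.
    first = book.get("description", "").split(". ")[0].strip()
    if first:
        points.append(first.rstrip(".") + ".")
    return points[:max_points]
```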
### 2. **FastAPI Backend Integration** (`src/main.py`)

**Three New Endpoints:**

```
POST /favorites/add
  Request:  { user_id: str, isbn: str }
  Response: { status: "ok", favorites_count: int }

GET /user/{user_id}/persona
  Response: { user_id, favorites: [], persona: {...} }

POST /marketing/highlights
  Request:  { isbn: str, user_id?: str }
  Response: { persona, highlights: [], meta: {...} }
```

**CORS Support:**
- Enabled for localhost:5173 (React dev), 3000 (alt dev), 8080
- Allows the frontend to access the backend without restrictions
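Wiring these specs into FastAPI with CORS takes only a few lines. This sketch reuses `add_favorite`, `list_favorites`, and `build_persona` from the sketches above; `BOOKS_DF` is an assumed module-level metadata DataFrame, not a confirmed name from `src/main.py`:

```python
import pandas as pd
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel

BOOKS_DF = pd.read_csv("data/books_processed.csv")  # assumed metadata source

app = FastAPI()
app.add_middleware(
    CORSMiddleware,
    allow_origins=[
        "http://localhost:5173",  # React dev
        "http://localhost:3000",  # alt dev
        "http://localhost:8080",
    ],
    allow_methods=["*"],
    allow_headers=["*"],
)

class FavoriteRequest(BaseModel):
    user_id: str = "local"  # single-user default, per the profile store
    isbn: str

@app.post("/favorites/add")
def favorites_add(req: FavoriteRequest):
    count = add_favorite(req.user_id, req.isbn)
    return {"status": "ok", "favorites_count": count}

@app.get("/user/{user_id}/persona")
def user_persona(user_id: str):
    favorites = list_favorites(user_id)
    persona = build_persona([f["isbn"] for f in favorites], BOOKS_DF)
    return {"user_id": user_id, "favorites": favorites, "persona": persona}
```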
---

### 3. **Modern React UI** (`web/`)

#### Architecture
- **Build Tool:** Vite (ultra-fast dev server, ~200ms startup)
- **Styling:** Tailwind CSS (CDN-based, no build required)
- **Icons:** lucide-react (modern SVG icons)
- **State Management:** React Hooks (useState only, no Redux)

#### Design: "纸间留白" (Paper Shelf)
A literary, minimalist aesthetic inspired by:
- Japanese minimalism (留白 = leaving white space)
- Second-hand bookstore vibes
- Serif typography (font-serif)
- Muted earth tones: `#b392ac` (mauve), `#f4acb7` (peach), `#faf9f6` (cream)

#### Core Features

**1. Discovery Tab (Default View)**
```
┌─────────────────────────────────┐
│ 纸间留白                         │ Header + toggle "私人书斋" (Private Library)
├─────────────────────────────────┤
│ 墨色余温·灵魂契合 (if favorites)  │ Smart carousel of soul-mate books
├─────────────────────────────────┤
│ [Search] [Category▼] [Mood▼]    │ Semantic search + filters
│ 开启发现之旅 (Start Discovery)    │
├─────────────────────────────────┤
│ [Book 1] [Book 2] [Book 3] ...  │ 5-column responsive grid
│ (hover shows AI-generated hint) │
└─────────────────────────────────┘
```

**2. Book Detail Modal**
```
┌─────────────────────────────────┐
│ [Close]                         │
├──────────────┬──────────────────┤
│ Cover        │ Title            │
│ ISBN         │ Highlights       │
│ Score ★★★★★  │ Description      │
│              │ Chat Interface   │
│              │ [Add to Library] │
└──────────────┴──────────────────┘
```

**3. Private Library ("私人书斋")**
- Toggle view to see only favorited books
- Shows reading statistics (mood distribution)
- Same gallery grid + detail modal

**4. Chat Interface (in modal)**
- Suggested questions tied to book context
- User messages vs AI responses styled differently
- AI grounded to book metadata (not LLM-based yet)

#### API Integration
All four key flows wired to backend:

```javascript
// Search → Recommendation
startDiscovery() → recommend(query, category, tone)

// Select book → Load highlights
openBook(book) → getHighlights(isbn)

// Add to collection
toggleCollect(book) → addFavorite(isbn)

// (Future) Refresh persona
persona = getPersona(userId)
```

---

## End-to-End Flow

### User Journey: "Discovery to Collection"

```
1. User enters search query + filters
   ↓
2. startDiscovery() calls POST /recommend
   → FastAPI semantic search + tone filtering
   → Returns top N books with thumbnails
   ↓
3. Books render in grid (hover shows AI hint)
   ↓
4. User clicks book → openBook()
   → Calls POST /marketing/highlights
   → Gets persona + 3-5 personalized selling points
   → Modal shows all details + chat
   ↓
5. User clicks "加入藏书馆" (Add to Collection)
   → Calls POST /favorites/add
   → Updates myCollection state
   → Next search shows "灵魂契合" (soul match) carousel (matched books)
   ↓
6. User clicks "私人书斋" (Private Library) to view collection
   → Filters books to only favorites
   → Shows reading persona stats
```

---

## Technical Decisions

### Why JSON for Favorites (not SQLite)?
- **Rationale:** Single-user dev focus, rapid iteration
- **Trade-off:** 11k books × metadata in one file = acceptable overhead
- **Future:** Easy migration to PostgreSQL when scaling to multi-user

### Why No LLM for Highlights?
- **Rationale:** Keep the system lightweight, deterministic, fast
- **Method:** Rule-based persona matching (Top-3 authors/categories)
- **Future:** Could upgrade to LLM refinement (e.g., GPT for polish)

### Why React + Vite (not Gradio)?
- **Rationale:**
  - Gradio is good for prototypes; React is needed for custom UX
  - Vite is very fast (no webpack pain)
  - Tailwind CDN avoids npm build complexity
- **Fallback:** Gradio UI (app.py) still available on port 7860

### Why Persona from Favorites (not search history)?
- **Rationale:** User intent is explicit in favorites, not implicit in queries
- **Semantics:** "Add to collection" = explicit preference signal
- **Advantage:** Works offline, no tracking/privacy concerns

---

## Architecture Diagram

```
┌──────────────────────────────────────────────────────┐
│                  FRONTEND (React)                    │
│  web/ → Vite dev server (localhost:5173)             │
│  ┌────────────────────────────────────────────────┐  │
│  │ App.jsx                                        │  │
│  │  - SearchBar (query, category, mood)           │  │
│  │  - Gallery (books grid)                        │  │
│  │  - DetailModal (title, highlights, chat)       │  │
│  │  - MyCollection (favorites view)               │  │
│  └────────────────────────────────────────────────┘  │
│  api.js → Fetch wrappers (recommend, highlights...)  │
└──────────────────────────────────────────────────────┘
                        ↓
                    HTTP/CORS
                        ↓
┌──────────────────────────────────────────────────────┐
│                  BACKEND (FastAPI)                   │
│  src/main.py → uvicorn (localhost:6006)              │
│  ┌────────────────────────────────────────────────┐  │
│  │ GET  /health                                   │  │
│  │ POST /recommend (query, category, tone)        │  │
│  │ GET  /categories, /tones                       │  │
│  │ ┌──────────────────────────────────────────┐   │  │
│  │ │ NEW: POST /favorites/add                 │   │  │
│  │ │ NEW: GET  /user/{id}/persona             │   │  │
│  │ │ NEW: POST /marketing/highlights          │   │  │
│  │ └──────────────────────────────────────────┘   │  │
│  └────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────┘
        ↓                          ↓
  ┌─────────────┐        ┌──────────────────┐
  │  ChromaDB   │        │  User Profiles   │
  │ (11k docs)  │        │  (JSON file)     │
  │     ↓       │        │       ↓          │
  │  Vector     │        │  Favorites +     │
  │  Embeddings │        │  Persona         │
  └─────────────┘        └──────────────────┘
        ↓
┌─────────────────────────────────┐
│ Books Metadata (CSV)            │
│  - title, authors, description  │
│  - isbn, category, rating       │
│  - emotion scores (joy/sad/etc) │
└─────────────────────────────────┘
```

---

## Key Data Models

### User Profile (JSON)
```json
{
  "local": {
    "favorites": [
      { "isbn": "9780451524935", "title": "1984", "added_at": "2026-01-06" },
      { "isbn": "9780061120084", "title": "To Kill a Mockingbird", "added_at": "2026-01-06" }
    ]
  }
}
```

### Book Recommendation Response
```json
{
  "recommendations": [
    {
      "isbn": "9780451524935",
      "title": "1984",
      "authors": "George Orwell",
      "description": "A dystopian novel...",
      "thumbnail": "https://covers.openlibrary.org/...",
      "caption": "(auto-generated short hint)"
    }
  ]
}
```

### Highlights Response
```json
{
  "persona": {
    "summary": "您钟爱悬疑与科幻,偏好国际视野的作品。",
    "top_authors": ["Agatha Christie", "Isaac Asimov"],
    "top_categories": ["Mystery", "Science Fiction"]
  },
  "highlights": [
    "国际推理大师之作,契合您的悬疑偏好",
    "心理扭转的情节设计,适合您快节奏阅读",
    "深层人性反思,引发思考"
  ],
  "meta": {
    "title": "And Then There Were None",
    "authors": "Agatha Christie",
    "category": "Mystery",
    "description": "..."
  }
}
```

---

## Running the System

### Development Mode (3 services)

**Terminal 1: FastAPI Backend**
```bash
cd /Users/ymlin/Downloads/003-Study/138-Projects/book-rec-with-LLMs
make run
# Starts on http://localhost:6006
# Loads 11k books into ChromaDB
# Initializes metrics, routes
```

**Terminal 2: React Frontend**
```bash
cd web
npm run dev
# Starts on http://localhost:5173
# Hot reload on file changes
# Connects to the http://localhost:6006 backend
```

**Terminal 3 (Optional): Gradio Legacy UI**
```bash
python app.py
# Starts on http://localhost:7860
# Alternative UI for testing
```

### Production Workflow
- React builds with `npm run build` → static files
- FastAPI serves as the single backend
- Deploy as Docker containers (see DEPLOYMENT.md)

---

## Testing the Features

### 1. Test Semantic Search
```
Input: "悬疑推理小说,节奏快" (fast-paced mystery/detective fiction)
Expected: Agatha Christie, Sherlock Holmes, modern thrillers
```

### 2. Test Favorites → Persona
```
1. Add 5 books to collection (mix of genres)
2. Click a new book
3. Check highlights mention added books' authors/categories
✓ Persona should reflect your choices
```

### 3. Test Persona-Based Highlights
```
If you favorite: [Sci-Fi, Mystery, Literary]
Then recommend: Horror book X
Expected highlight: "虽不在您常读类型,但情节深度与科幻的想象力结合..."
(Not your usual genre, but its narrative depth blends with sci-fi imagination...)
(Acknowledges taste + bridges to new territory)
```

---

## Future Enhancements

### Phase 3: Recommendations (Backlog)

**1. LLM-Powered Highlights**
- Use Claude/GPT to refine rule-based highlights
- Natural language refinement (currently ~70% rule-based quality)
- Cache per (user_id, isbn) pair for speed

**2. Emotional Resonance Scoring**
- Leverage emotion embeddings (joy/sadness/fear/anger/surprise) in metadata
- Recommend books matching the user's current mood signal
- "What are you feeling today?" filter

**3. Multi-User Accounts**
- Migrate from JSON to SQLite/PostgreSQL
- User authentication (OAuth)
- Social features (share collections, compare tastes)

**4. Advanced Search**
- Author-to-author recommendations ("If you like X, try Y's style")
- Time-based recommendations ("What to read this season?")
- Combination search (mood + timeframe + word-count)

**5. Analytics Dashboard**
- Show the user: "You've read 15 books in the mystery genre"
- Predict the next book based on reading history
- Genre comfort zones vs stretch zones

---

## Phase Reflection

### What Worked Well
✅ **Modular backend design** → easy to add /highlights, /persona endpoints
✅ **React UI responsiveness** → users see results instantly
✅ **JSON-first approach** → no DB setup friction, fast iteration
✅ **API-driven architecture** → Gradio + React both work
✅ **Persona concept** → users feel "understood" by the system

### Challenges Overcome
🔧 **Port conflicts** (Gradio:7860 vs React:5173 vs FastAPI:6006) → Makefile organization
🔧 **CORS issues** (frontend couldn't reach backend) → Added CORSMiddleware
🔧 **Image loading** (external URLs not allowed in Gradio) → Runtime fetching + local fallback
🔧 **Timeout errors** (cold startup > 10s) → Increased client timeouts, optimized startup

### Design Philosophy Validated
The shift from "marketplace" → "recommendation + personalization" was right because:
1. **Clear unique value:** Persona-aware recommendations don't exist in typical bookstores
2. **Tight scope:** Focused on one thing (smart discovery) vs scattered marketplace features
3. **User empathy:** People want to be understood, not just transact

---

## Code Structure Summary

```
book-rec-with-LLMs/
├── src/
│   ├── main.py              # FastAPI app + 3 new endpoints
│   ├── recommender.py       # Semantic search core
│   ├── vector_db.py         # ChromaDB wrapper
│   ├── cache.py             # Image caching
│   ├── user/
│   │   └── profile_store.py # ✨ NEW: Favorites JSON storage
│   └── marketing/
│       ├── persona.py       # ✨ NEW: Persona aggregation
│       ├── highlights.py    # ✨ NEW: Highlight generation
│       └── guardrails.py    # Safety checks (stub)
├── web/                     # ✨ NEW: React Vite app
│   ├── src/
│   │   ├── App.jsx          # Main component + state
│   │   ├── api.js           # Fetch wrappers
│   │   └── main.jsx         # Entry point
│   ├── index.html           # HTML + Tailwind CDN
│   └── package.json         # Dependencies
├── app.py                   # Gradio UI (legacy)
├── Makefile                 # Commands
├── requirements.txt         # Python deps
└── data/
    ├── books_processed.csv  # Metadata
    └── user_profiles.json   # ✨ NEW: User data
```

---

## Commit Message
```
feat: add React UI and backend personalization features

- Create modern React UI (web/) with 纸间留白 (Paper Shelf) design
  * Semantic search + favorites + detail modal
  * Tailwind CSS + lucide-react
  * Vite dev server on port 5173

- Implement user personalization:
  * src/user/profile_store.py: JSON favorites
  * src/marketing/persona.py: User taste aggregation
  * src/marketing/highlights.py: Persona-aware selling points
  * 3 new API endpoints in FastAPI

- Maintain Gradio UI as optional fallback
- Add CORS support, update timeouts, improve infrastructure
```

---

## How to Continue

### If you want to test now:
1. `make run` (starts backend)
2. `cd web && npm run dev` (starts React UI)
3. Visit http://localhost:5173
4. Search for a book → click results → "加入藏书馆" (Add to Collection) → see persona highlights

### If you want to refine:
- Adjust the persona algorithm in `src/marketing/persona.py`
- Tweak UI colors/layout in `web/src/App.jsx`
- Add more rules to highlights in `src/marketing/highlights.py`

### If you want to scale:
- Migrate to PostgreSQL (users table + favorites relationship)
- Add user auth (FastAPI auth middleware)
- Deploy with Docker + cloud (see DEPLOYMENT.md)

---

**Status:** ✅ **Ready to Deploy**

Next phase can focus on: multi-user support, LLM refinement, analytics, or social features.
README.md
CHANGED
@@ -7,221 +7,70 @@ app_port: 8000
 
 # Intelligent Book Recommendation System
 
-|:---|:---|:---|
-| **Semantic Search** | ChromaDB + MiniLM-L6 | Sub-300ms retrieval on 200K+ books |
-| **Agentic Router** | Rule-based intent classification | 4 dynamic strategies (BM25, Hybrid, Rerank, Small-to-Big) |
-| **Personalized Rec** | 7-channel recall + LGBMRanker + Stacking | HR@10: 0.4545, MRR@5: 0.2893 |
-| **Conversational AI** | RAG + OpenAI/Ollama | Real-time streaming (Default: Local Ollama) |
-
-##
-
-                           │ REST + SSE
-┌──────────────────────────▼──────────────────────────────────────┐
-│                        FastAPI Backend                           │
-│  ┌─────────────┐  ┌──────────────┐  ┌───────────────────────┐   │
-│  │ Query Router│→ │ RAG Pipeline │→ │  Personalized RecSys  │   │
-│  └─────────────┘  └──────────────┘  └───────────────────────┘   │
-│        │                 │                      │                │
-│   Intent Class     Hybrid Search      7-Channel Recall + RRF     │
-│  (ISBN/Keyword    + Cross-Encoder    (ItemCF + UserCF + Swing    │
-│   /Complex)         Reranking         + SASRec + Item2Vec...)    │
-└──────────────────────────┬──────────────────────────────────────┘
-                           │
-        ┌──────────────────┼──────────────────┐
-        ▼                  ▼                  ▼
-  ┌─────────┐        ┌───────────┐     ┌──────────────┐
-  │ChromaDB │        │LGBMRanker │     │ LLM Provider │
-  │(Vectors)│        │(LambdaRank│     │ (Chat/Recs)  │
-  └─────────┘        └───────────┘     └──────────────┘
-```
-
-##
-
-### 1. Agentic RAG with Dynamic Routing
-- **Query Intent Classification**: Automatically routes queries to the optimal retrieval strategy
-  - ISBN → Pure BM25 (100% precision)
-  - Keywords → Hybrid Search (BM25 + Dense)
-  - Complex queries → Cross-Encoder Reranking
-  - Detail queries → Small-to-Big Retrieval (788K indexed sentences)
-
-### 2. Personalized Recommendation Engine
-- **7-Channel Recall**: ItemCF (direction-weighted), UserCF, Swing, SASRec, Item2Vec, YoutubeDNN, Popularity
-- **RRF Fusion**: Reciprocal Rank Fusion merges candidates across all recall channels
-- **SASRec Sequential Model**: 64-dim Transformer embeddings (30 epochs), used as both recall source and ranking feature
-- **Model Stacking**: LGBMRanker (LambdaRank) + XGBClassifier → LogisticRegression meta-learner
-- **LGBMRanker (LambdaRank)**: Directly optimizes NDCG with 17 engineered features and hard negative sampling
-- **Evaluation**: HR@10 = 0.4545, MRR@5 = 0.2893 (n=2000, Leave-Last-Out, title-relaxed)
-
-### 3. My Bookshelf (User Library)
-- **Rating System**: 5-star rating with persistence
-- **Reading Status**: Want to Read / Reading / Finished
-- **Statistics Dashboard**: Visual progress tracking
-
-### 4. Conversational Shopping Assistant
-- **RAG-Grounded Responses**: Context from ChromaDB reduces hallucinations
-- **Streaming Output**: Real-time token streaming via SSE
-- **Flexible LLM**: Defaults to local **Ollama** (free/privacy), supports **OpenAI API** for production
+*Frozen at v2.6.0 — maintenance mode for portfolio use.*
+
+## Problem
+
+Readers often can't articulate what they want. Can one system both understand their vague descriptions and give personalized recommendations based on their reading history?
+
+## Method
+
+Two parallel threads: **RAG** (Agentic Router → Hybrid Search → Reranking) for understanding vague queries; **RecSys** (7-channel recall → LGBMRanker → Stacking) for personalized recommendations from reading history.
+
+## Key Experiments
+
+| Experiment | Before | After | Conclusion |
+|:---|:---|:---|:---|
+| **RAG: Exact match** | Pure vector search, ISBN → 0% recall | Hybrid (BM25 + Dense) + Router → 100% | Vector-only fails on exact entities; BM25 + routing fixes it |
+| **RAG: Keyword intent** | "Harry Potter" → Philosophy book | Reranked → Sorcerer's Stone | Cross-encoder corrects semantic drift |
+| **RecSys: Personalization** | Baseline 0.138 HR@10 | Item2Vec + LGBMRanker + Stacking → **0.4545** HR@10 | 7-channel recall + LambdaRank + ensemble beats single model |
+
+*Evaluation: Leave-Last-Out, n=2000, title-relaxed. HR@10 = 0.4545, MRR@5 = 0.2893.*
+
+## Architecture
+
+```
+    Query                  No Query
+      │                       │
+      ▼                       ▼
+┌─────────────┐        ┌─────────────┐
+│  RAG Path   │        │ RecSys Path │
+│  Router →   │        │  7-Channel  │
+│  Hybrid →   │        │  Recall →   │
+│  Rerank     │        │  LGBMRanker │
+└─────────────┘        └─────────────┘
+      │                       │
+      └──────────┬────────────┘
+                 ▼
+          Top-K Results
+```
 
 ## Quick Start
 
-### Prerequisites
-- Python 3.10+ with Conda
-- Node.js 18+
-
-### Installation
-
 ```bash
-# Clone and setup environment
 git clone https://github.com/sylvia-ymlin/book-rec-with-LLMs.git
 cd book-rec-with-LLMs
-conda env create -f environment.yml
-conda activate book-rec
-
-# Initialize vector database (first run)
-python src/init_db.py
-
-cd web && npm install && npm run dev
+conda env create -f environment.yml && conda activate book-rec
+
+# First run (or use make data-pipeline for full build)
+python src/init_db.py              # Chroma vector DB
+python scripts/init_sqlite_db.py   # SQLite metadata (local build)
+
+make run                               # API http://localhost:6006
+cd web && npm install && npm run dev   # UI  http://localhost:5173
 ```
 
-|:---|:---|:---|
-| **Ollama** | `ollama pull llama3` | Free, local dev |
-| **OpenAI** | Set API key in UI Settings | Production |
-
----
-
-## API Endpoints
-
-| Endpoint | Method | Description |
-|:---|:---|:---|
-| `/recommend` | POST | Semantic search with emotion/category filters |
-| `/api/recommend/personal` | GET | Personalized recommendations (RecSys) |
-| `/favorites/add` | POST | Add book to collection |
-| `/favorites/update` | PUT | Update rating/reading status |
-| `/user/{id}/stats` | GET | Reading statistics |
-| `/chat/completions` | POST | RAG-powered chat (streaming) |
-| `/health` | GET | Service health check |
-
----
-
-## Project Structure
-
-```
-src/
-├── main.py                    # FastAPI application
-├── recommender.py             # RAG search orchestration
-├── vector_db.py               # ChromaDB wrapper
-├── core/
-│   ├── router.py              # Agentic query routing
-│   └── reranker.py            # Cross-encoder reranking
-├── recall/
-│   ├── itemcf.py              # ItemCF with direction weight
-│   ├── usercf.py              # UserCF (Jaccard + activity penalty)
-│   ├── swing.py               # Swing (user-pair overlap weighting)
-│   ├── sasrec_recall.py       # SASRec embedding dot-product recall
-│   ├── item2vec.py            # Item2Vec (Word2Vec on sequences)
-│   ├── youtube_dnn.py         # YoutubeDNN two-tower recall
-│   ├── popularity.py          # Popularity with time decay
-│   └── fusion.py              # RRF fusion of all channels
-├── ranking/
-│   └── features.py            # 17 ranking features
-├── services/
-│   └── recommend_service.py   # Recall → Rank → Dedup pipeline
-└── user/                      # User profile storage
-
-web/
-├── src/App.jsx                # React UI
-└── src/api.js                 # API client
-
-scripts/
-├── model/
-│   ├── train_sasrec.py        # SASRec sequential model training
-│   ├── build_recall_models.py # ItemCF, UserCF, Swing, Popularity
-│   ├── train_ranker.py        # LGBMRanker with hard negative sampling
-│   └── evaluate.py            # HR@10, MRR@5 evaluation
-├── deploy/                    # Server deployment scripts
-└── data/                      # Data processing pipelines
-```
-
-## Performance
-
-### Recommendation Metrics (v2.6.0)
-
-| Metric | V2.5 | v2.6.0 | Method |
-|:---|:---|:---|:---|
-| **Hit Rate@10** | 0.2205 | **0.4545** | Leave-Last-Out, n=2000 |
-| **MRR@5** | 0.1584 | **0.2893** | Title-relaxed matching |
-
-v2.6.0 Key Enhancements:
-- **Item2Vec Recall**: Word2Vec (Skip-gram) on user interaction sequences, 7th recall channel.
-- **Model Stacking**: LGBMRanker + XGBClassifier → LogisticRegression meta-learner.
-- **Engineering**: Vectorized inference, non-blocking I/O, metadata singleton.
-
-| Dataset | Size |
-|:---|:---|
-| Training Set | 1,079,966 interactions |
-| Active Users | 167,968 |
-| Books | 221,998 |
-
-### Latency Benchmarks
-| Operation | P50 Latency |
-|:---|:---|
-
----
-
-## Project Documentation
-
-| Document | Description |
-|:---|:---|
-| [Memory Optimization Report](docs/memory_optimization.md) | Engineering overhaul: Zero-RAM SQLite architecture |
-| [Experiment Archive](docs/experiments/experiment_archive.md) | All experimental results from V1.0 to v2.6.0 |
-| [Performance Debugging Report](docs/performance_debugging_report.md) | Root cause analysis of evaluation issues |
-| [Roadmap](docs/roadmap.md) | Technical evolution (v2.6.0 frozen, V3.0 planned) |
-| [Technical Report](docs/technical_report.md) | System architecture deep dive |
-| [Build Guide](docs/build_guide.md) | Build and deployment instructions |
-
-## References
-
-1. Kang, W., & McAuley, J. (2018). *Self-Attentive Sequential Recommendation*. ICDM.
-2. Reimers, N., & Gurevych, I. (2019). *Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks*.
-3. Ke, G., et al. (2017). *LightGBM: A Highly Efficient Gradient Boosting Decision Tree*. NeurIPS.
-4. Gao, L., et al. (2022). *Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE)*.
-5. Yang, J., et al. (2020). *Large-scale Product Graph Construction for Recommendation in E-commerce* (Swing algorithm).
-
----
+**LLM**: Default Ollama (`ollama pull llama3`). OpenAI API key in UI Settings for production.
+
+## Documentation
+
+| Doc | Purpose |
+|:---|:---|
+| [Technical Report](docs/TECHNICAL_REPORT.md) | Architecture, design decisions |
+| [Experiment Archive](docs/experiments/experiment_archive.md) | Full experiment log (V1.0 → v2.6.0) |
+| [Interview Guide](docs/interview_guide.md) | Q&A, STAR cases |
+| [Build Guide](docs/build_guide.md) | Deployment instructions |
 
 ## License
 
 MIT
REVIEW_HIGHLIGHTS.md
DELETED
@@ -1,142 +0,0 @@
# Review Highlights Feature

## Overview

Added semantic sentence extraction to display representative reader reviews for each book. This feature enhances book discovery by showcasing authentic reader voices.

## Implementation

### 1. Data Generation (Server-side)

**Script**: `scripts/extract_review_sentences.py`

**Process**:
- Splits book descriptions into sentences using regex
- Uses `sentence-transformers/all-MiniLM-L6-v2` for sentence embeddings
- Clusters similar sentences via cosine similarity (threshold: 0.8)
- Extracts representative sentences from each cluster (top 5 per book)
- Stores as semicolon-separated `review_highlights` column in CSV
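A condensed sketch of that pipeline; the sentence-split regex and the greedy cluster-representative selection are stand-ins for the script's actual logic, not a copy of it:

```python
import re

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def extract_highlights(description: str, top_n: int = 5, threshold: float = 0.8) -> list[str]:
    # Split on sentence-ending punctuation; skip short fragments.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", description)
                 if len(s.strip()) > 20]
    if not sentences:
        return []
    emb = model.encode(sentences, normalize_embeddings=True)
    sim = emb @ emb.T  # cosine similarity, since embeddings are normalized
    picked = []
    used = np.zeros(len(sentences), dtype=bool)
    for i in np.argsort(-sim.sum(axis=1)):  # most "central" sentences first
        if used[i]:
            continue
        picked.append(sentences[i])
        used |= sim[i] >= threshold  # suppress the rest of this sentence's cluster
        if len(picked) == top_n:
            break
    return picked
```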
**Execution**:
```bash
# Run in container with GPU
export HF_ENDPOINT=https://hf-mirror.com
python scripts/extract_review_sentences.py \
  --input data/books_processed.csv \
  --output data/books_processed.csv \
  --top-n 5 \
  --similarity-threshold 0.8 \
  --device 0 \
  --batch-size 128
```

**Performance**: ~17 minutes for 222k books on GPU (211 it/s)

### 2. Backend Integration

**Files Modified**:
- `src/recommender.py`: Parse `review_highlights` from CSV, split by semicolon
- `src/main.py`: Add `review_highlights: List[str]` to `BookResponse` model

**Code**:
```python
# Parse review highlights from semicolon-separated string
highlights_raw = str(row.get("review_highlights", "")).strip()
review_highlights = [h.strip() for h in highlights_raw.split(";") if h.strip()]
```

### 3. Frontend Display

**File**: `web/src/App.jsx`

**Location**: Left column, bottom section (below Rating/Mood)

**Features**:
- Displays up to 3 representative sentences
- Bullet-point format with `-` prefix
- Complete sentences: `- "[sentence]"`
- Incomplete sentences: `- "...[sentence]"` (auto-detected via regex `/^[A-Z]/`)
- Styling: 10px italic gray text

**Layout**:
```jsx
{selectedBook.review_highlights && selectedBook.review_highlights.length > 0 && (
  <div className="w-full mt-auto space-y-2 text-left">
    {selectedBook.review_highlights.slice(0, 3).map((highlight, idx) => {
      const isCompleteSentence = /^[A-Z]/.test(highlight.trim());
      const prefix = isCompleteSentence ? '' : '...';
      return (
        <p key={idx} className="text-[10px] text-[#666] leading-relaxed italic pl-2">
          - "{prefix}{highlight}"
        </p>
      );
    })}
  </div>
)}
```

## Related Changes

### Rating Display Enhancement

**Problem**: Hardcoded rating value of 4 stars for all books

**Solution**:
- Added `average_rating` field to backend API response
- Display format: `4.3` (1 decimal) + filled stars
- Moved rating display into AI highlight box (pink desc_block)

**Frontend mapping**:
```javascript
rating: r.average_rating || 0, // Keep float, no rounding
```

**Display**:
```jsx
<span>{selectedBook.rating ? selectedBook.rating.toFixed(1) : '0.0'}</span>
<div className="flex gap-0.5 text-[#f4acb7]">
  {[1,2,3,4,5].map(i => <Star key={i} className={`w-3 h-3 ${i <= selectedBook.rating ? 'fill-current' : ''}`} />)}
</div>
```

### Layout Adjustments

- Grid ratio: 4:8 → 5:7 (more space for left column)
- Rating/Mood: Changed from vertical stack to consolidated display
- Rating moved into desc_block (AI highlight box)
- Review highlights positioned at bottom with `mt-auto`

## Data Schema

**CSV Column**: `review_highlights` (string, semicolon-separated)

**Example**:
```
"Having been brought up on the notion...;It transpires, some years ago...;This is a work full of wisdom..."
```

**API Response**:
```json
{
  "review_highlights": [
    "Having been brought up on the notion that Elizabeth Barrett Browning was the slighter poet...",
    "It transpires, some years ago, Clarke hosted two hugely successful British television series...",
    "This is a work full of wisdom and unusual perspectives."
  ],
  "average_rating": 3.716216
}
```

## Notes

- Review highlights are pre-computed and stored in CSV (no runtime extraction)
- Data file `books_processed.csv` (~243MB) must be regenerated after container rebuild
- Use `scp` to transfer the processed CSV back to the local machine
- HuggingFace mirror (`HF_ENDPOINT`) required for model download in restricted networks

## Future Improvements

- Cache sentence embeddings to speed up re-generation
- Add sentiment analysis to highlights (positive/critical)
- Filter highlights by relevance to user query
- Display highlight source (verified purchase vs. regular review)
TAGS_AND_EMOTIONS.md
DELETED
@@ -1,233 +0,0 @@
# Tags and Emotion Scoring

This document describes the tag generation and emotion scoring features added to enrich book metadata.

## Overview

- **Tags**: Keyword extraction from book descriptions using TF-IDF (5-8 terms per book)
- **Emotion Scores**: Five emotion dimensions (joy, sadness, fear, anger, surprise) computed via transformer model

## Data Generation

### 1. Tag Generation

Extracts thematic keywords from aggregated review text.

**Script**: `scripts/generate_tags.py`

**Usage**:
```bash
python scripts/generate_tags.py \
  --input data/books_processed.csv \
  --output data/books_processed.csv \
  --top-n 8
```

**Algorithm**:
- TF-IDF vectorization (unigrams + bigrams)
- English stopwords + domain stoplist (e.g., "book", "author", "story")
- Top-N weighted terms per book
- Semicolon-joined storage in `tags` column

**Parameters**:
- `--top-n`: Max tags per book (default: 8)
- `--max-features`: TF-IDF vocabulary size (default: 60,000)
- `--min-df`: Minimum document frequency (default: 5)
- `--max-df`: Maximum document frequency ratio (default: 0.5)
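With those defaults, the core of the script reduces to roughly the sketch below; the real domain stoplist is longer than the three words shown:

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

DOMAIN_STOPLIST = {"book", "author", "story"}  # illustrative; the real list is longer

def generate_tags(descriptions: list[str], top_n: int = 8) -> list[str]:
    vec = TfidfVectorizer(
        ngram_range=(1, 2),  # unigrams + bigrams
        stop_words=list(ENGLISH_STOP_WORDS | DOMAIN_STOPLIST),
        max_features=60_000,
        min_df=5,
        max_df=0.5,
    )
    X = vec.fit_transform(descriptions)
    vocab = vec.get_feature_names_out()
    tags = []
    for row in X:  # one sparse row per book
        weights = row.toarray().ravel()
        top = weights.argsort()[::-1][:top_n]
        # Semicolon-joined, matching the `tags` column format.
        tags.append(";".join(vocab[i] for i in top if weights[i] > 0))
    return tags
```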
### 2. Emotion Scoring

Computes emotion intensity scores from book descriptions.

**Script**: `scripts/generate_emotions.py`

**Model**: `j-hartmann/emotion-english-distilroberta-base`

**Usage**:
```bash
# CPU
python scripts/generate_emotions.py \
  --input data/books_processed.csv \
  --output data/books_processed.csv \
  --batch-size 16

# Apple GPU (MPS)
python scripts/generate_emotions.py \
  --input data/books_processed.csv \
  --output data/books_processed.csv \
  --batch-size 8 \
  --device mps \
  --checkpoint 2000 \
  --resume
```

**Parameters**:
- `--batch-size`: Inference batch size (default: 16)
- `--device`: `mps` (Apple GPU), CUDA device id, or CPU (default)
- `--checkpoint`: Rows between checkpoint writes (default: 5000)
- `--resume`: Skip rows already scored (useful for resuming long runs)
- `--max-rows`: Limit processing to N rows (for testing)

**Output Columns**:
- `joy`: 0.0–1.0
- `sadness`: 0.0–1.0
- `fear`: 0.0–1.0
- `anger`: 0.0–1.0
- `surprise`: 0.0–1.0
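A minimal sketch of the scoring loop using the Hugging Face `pipeline` API; the real script's checkpoint writes, `--resume` handling, and device selection are omitted:

```python
from transformers import pipeline

# top_k=None returns scores for every emotion label, not just the argmax.
clf = pipeline(
    "text-classification",
    model="j-hartmann/emotion-english-distilroberta-base",
    top_k=None,
    truncation=True,
)

WANTED = {"joy", "sadness", "fear", "anger", "surprise"}

def score_emotions(descriptions: list[str], batch_size: int = 16) -> list[dict]:
    rows = []
    for result in clf(descriptions, batch_size=batch_size):
        # The model also emits "neutral" and "disgust"; keep only the five stored columns.
        rows.append({r["label"]: round(r["score"], 3) for r in result if r["label"] in WANTED})
    return rows
```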
| 78 |
-
**Performance**:
|
| 79 |
-
- ~1.1 it/s on Apple M-series GPU
|
| 80 |
-
- ~7 hours for 222k books (batch_size=8, MPS)
|
| 81 |
-
- One-time processing; results persist in CSV
|
| 82 |
-
|
| 83 |
-
## Data Schema
|
| 84 |
-
|
| 85 |
-
Updated `books_processed.csv` columns:
|
| 86 |
-
|
| 87 |
-
| Column | Type | Description |
|
| 88 |
-
|--------|------|-------------|
|
| 89 |
-
| `tags` | str | Semicolon-separated keywords (e.g., "irish;travel;humor") |
|
| 90 |
-
| `joy` | float | Joy emotion score (0.0–1.0) |
|
| 91 |
-
| `sadness` | float | Sadness emotion score (0.0–1.0) |
|
| 92 |
-
| `fear` | float | Fear emotion score (0.0–1.0) |
|
| 93 |
-
| `anger` | float | Anger emotion score (0.0–1.0) |
|
| 94 |
-
| `surprise` | float | Surprise emotion score (0.0–1.0) |
|
| 95 |
-
|
| 96 |
-
## API Integration
|
| 97 |
-
|
| 98 |
-
### Backend Changes
|
| 99 |
-
|
| 100 |
-
**File**: `src/recommender.py`
|
| 101 |
-
|
| 102 |
-
Added to `_format_results()`:
|
| 103 |
-
```python
|
| 104 |
-
# Parse tags
|
| 105 |
-
tags_raw = str(row.get("tags", "")).strip()
|
| 106 |
-
tags = [t.strip() for t in tags_raw.split(";") if t.strip()] if tags_raw else []
|
| 107 |
-
|
| 108 |
-
# Extract emotions
|
| 109 |
-
emotions = {
|
| 110 |
-
"joy": float(row.get("joy", 0.0)),
|
| 111 |
-
"sadness": float(row.get("sadness", 0.0)),
|
| 112 |
-
"fear": float(row.get("fear", 0.0)),
|
| 113 |
-
"anger": float(row.get("anger", 0.0)),
|
| 114 |
-
"surprise": float(row.get("surprise", 0.0)),
|
| 115 |
-
}
|
| 116 |
-
```
|
| 117 |
-
|
| 118 |
-
**File**: `src/main.py`
|
-
-Updated Pydantic model:
-```python
-class BookResponse(BaseModel):
-    isbn: str
-    title: str
-    authors: str
-    description: str
-    thumbnail: str
-    caption: str
-    tags: List[str] = []
-    emotions: Dict[str, float] = {}
-```
-
-### API Response Example
-
-```json
-{
-  "recommendations": [
-    {
-      "isbn": "0001849883",
-      "title": "Bury My Bones But Keep My Words",
-      "authors": "Deborah Savage, Tony Fairman",
-      "tags": ["paulsen", "otters", "searches", "gary", "brian"],
-      "emotions": {
-        "joy": 0.020,
-        "sadness": 0.004,
-        "fear": 0.012,
-        "anger": 0.006,
-        "surprise": 0.086
-      }
-    }
-  ]
-}
-```
-
-## UI Display
-
-### Search Results Grid
-
-Each book card displays:
-- **Dominant emotion label**: Emotion with highest score (bottom-right badge)
-  - Example: "joy", "sadness", "fear"
-
-**Implementation** (`web/src/App.jsx`):
-```jsx
-{book.emotions && Object.keys(book.emotions).length > 0 ? (
-  <span className="text-[9px] bg-[#f8f9fa] border border-[#eee] px-1 text-[#999] capitalize">
-    {Object.entries(book.emotions).reduce((a, b) => a[1] > b[1] ? a : b)[0]}
-  </span>
-) : (
-  <span className="text-[9px] bg-[#f8f9fa] border border-[#eee] px-1 text-[#999]">—</span>
-)}
-```
-
-### Book Detail Modal
-
-Two new sections:
-
-**1. Key Themes**
-- Displays all extracted tags as badges
-- Shows "No themes found" if tags empty
-
-**2. Emotional Tone**
-- Five horizontal bars showing emotion scores
-- Bar width = score percentage (0–100%)
-- Format: `emotion_name | [bar] | percentage`
-
-**Implementation** (`web/src/App.jsx`):
-```jsx
-<div className="space-y-2">
-  <h4>Emotional Tone</h4>
-  <div className="space-y-2 p-3 bg-[#faf9f6] border border-[#eee]">
-    {selectedBook.emotions && Object.entries(selectedBook.emotions).map(([emotion, score]) => (
-      <div key={emotion} className="flex items-center gap-2">
-        <span className="text-[9px] font-bold text-gray-500 w-16 capitalize">{emotion}</span>
-        <div className="flex-grow bg-white border border-[#eee] h-2 relative overflow-hidden">
-          <div
-            className="h-full bg-[#b392ac] transition-all"
-            style={{ width: `${Math.round(score * 100)}%` }}
-          />
-        </div>
-        <span className="text-[8px] text-gray-400 w-10 text-right">{Math.round(score * 100)}%</span>
-      </div>
-    ))}
-  </div>
-</div>
-```
-
-## Future Improvements
-
-- **Incremental updates**: Score only new books instead of full dataset
-- **Smaller model**: Try lightweight emotion classifiers (faster inference)
-- **Multi-label tags**: Use text classification for predefined categories
-- **Tag filtering**: Allow users to filter by specific tags in search
-- **Emotion-based sorting**: Sort results by dominant emotion match
-- **Caching**: Cache emotion inference results in Redis for API speedup
-
-## Dependencies
-
-```
-scikit-learn    # TF-IDF vectorization
-transformers    # Emotion classification
-torch           # Model inference
-tqdm            # Progress bars
-```
-
-## Notes
-
-- Tags and emotions are **one-time computed** and stored in CSV
-- No re-computation on API requests (instant serving)
-- CSV file (242MB) is in `.gitignore` (too large for GitHub)
-- To regenerate on a new machine, run both scripts sequentially:
-  1. `generate_tags.py` (~5 minutes)
-  2. `generate_emotions.py` (~7 hours on MPS for full dataset)
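The dominant-emotion badge in the deleted doc above is just an argmax over the per-book emotion dict; the JSX `reduce` performs exactly that. For reference, the same logic in Python (a minimal sketch with a hypothetical helper name, not code from the repo):

```python
# Minimal sketch mirroring the JSX dominant-emotion badge logic above.
# `emotions` has the shape served by the API: {emotion_name: score}.

def dominant_emotion(emotions: dict[str, float]) -> str | None:
    """Return the highest-scoring emotion label, or None if empty."""
    if not emotions:
        return None  # the UI renders a dash placeholder in this case
    return max(emotions, key=emotions.get)

emotions = {"joy": 0.020, "sadness": 0.004, "fear": 0.012,
            "anger": 0.006, "surprise": 0.086}
print(dominant_emotion(emotions))               # -> "surprise"
print(f"{round(emotions['surprise'] * 100)}%")  # bar width, as in the modal
```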
cover-not-found.jpg
DELETED
Binary file (64.4 kB)

data-exploration.ipynb
DELETED
The diff for this file is too large to render.
docs/README.md
CHANGED
@@ -1,26 +1,30 @@
 # Project Documentation
 
-##
+## Layer 1 — Main Story (README, 5-min interview)
 
-| Document | Purpose |
-|
-| [
-| [
-| [technical_report.md](technical_report.md) | System architecture, technical details | Technical interviews |
-| [interview_guide.md](interview_guide.md) | Interview Q&A, STAR cases | Job preparation |
+| Document | Purpose |
+|:---|:---|
+| [Technical Report](TECHNICAL_REPORT.md) | Architecture, design decisions, method line |
+| [Experiment Archive](experiments/experiment_archive.md) | Consolidated experiment log (V1.0 → v2.6.0) |
 
-##
+## Layer 2 — Capability Showcase (Resume, technical Q&A)
 
 | Document | Purpose |
 |:---|:---|
-| [
+| [Interview Guide](interview_guide.md) | Q&A, STAR cases |
+| [Memory Optimization](memory_optimization.md) | Zero-RAM SQLite, engineering decisions |
+| [Performance Debugging](performance_debugging_report.md) | Root cause analysis |
+| [Build Guide](build_guide.md) | Full build pipeline |
+| [Hugging Face Deployment](huggingface_deployment.md) | HF Spaces deployment |
 
-##
-
-
-
+## Archives
+
+| Path | Contents |
+|:---|:---|
+| [archived/](archived/) | Deprecated docs (Phase 2, TAGS, REVIEW_HIGHLIGHTS, etc.) |
+| [archived/graveyard/](archived/graveyard/) | Layer 3 — tried but not in main story (future_roadmap, interview_deep_dive, etc.) |
+| [experiments/reports/](experiments/reports/) | Raw experiment reports (baseline, hybrid, rerank, router, temporal) |
 
 ---
 
-
-
+**Frozen v2.6.0** — HR@10 = 0.4545, MRR@5 = 0.2893
docs/TECHNICAL_REPORT.md
CHANGED
@@ -9,7 +9,7 @@
 
 ## Executive Summary
 
-This project implements
+This project implements an integrated Agentic RAG (Retrieval-Augmented Generation) system for book discovery, combined with a personalized recommendation engine. Unlike simple vector search, it uses a self-routing architecture that dynamically selects the optimal retrieval strategy based on query intent.
 
 Key achievements:
 - 100% recall on exact-match queries (ISBNs)
@@ -18,7 +18,7 @@ Key achievements:
 - Detail-level precision via hierarchical (Small-to-Big) retrieval
 - Personalized recommendations using 7-channel recall (Item2Vec, Stacking) and LGBMRanker (LambdaRank)
 
-The system demonstrates
+The system demonstrates both Advanced RAG Architecture (Hybrid Search, Reranking, Query Routing) and multi-channel RecSys (Item2Vec, LGBMRanker, Stacking).
 
 ---
 
@@ -83,11 +83,12 @@ USER REQUEST (No Query)
 
       v
 +---------------------------+
-|
+| 7-CHANNEL RECALL (RRF)    |
 | - ItemCF (direction wt)   |
 | - UserCF (Jaccard)        |
 | - Swing (user-pair)       |
 | - SASRec (embedding)      |
+| - Item2Vec (Word2Vec)     |
 | - YoutubeDNN (two-tower)  |
 | - Popularity (fallback)   |
 +---------------------------+
@@ -161,7 +162,7 @@ Implementation (based on LlamaIndex Parent-Child, RAPTOR):
 2. Matching: Query matches specific sentence ("I cried at the ending")
 3. Expansion: Map sentence to parent ISBN to full book context
 
-Result: Can answer queries like "books with unreliable narrator twist" that are invisible to description-level search.
+Result: Can answer queries like "books with unreliable narrator twist" that are invisible to description-level search. *RAG components (ISBN recall, reranking, Small-to-Big) were validated via curated examples and routing statistics; no large-scale human evaluation was conducted.*
 
 ### 3.5 Temporal Dynamics
 
@@ -187,7 +188,7 @@ Location: `src/core/context_compressor.py`
 
 ## 4. Personalized Recommendation System
 
-### 4.1 Multi-Channel Recall (
+### 4.1 Multi-Channel Recall (7 Channels)
 
 | Recall Channel | Algorithm | Weight | Purpose |
 |:---|:---|:---|:---|
@@ -195,6 +196,7 @@ Location: `src/core/context_compressor.py`
 | UserCF | User similarity (Jaccard + activity penalty) | 1.0 | Similar user preferences |
 | Swing | User-pair overlap weighting: `1/(α + \|I_u ∩ I_v\|)` | 1.0 | Substitute relationships |
 | SASRec | Dot-product retrieval from pre-computed embeddings | 1.0 | Sequential patterns |
+| Item2Vec | Word2Vec (Skip-gram) on user interaction sequences | 0.8 | Implicit co-occurrence |
 | YoutubeDNN | Two-tower user-item dot product | 0.1 | Deep learning recall |
 | Popularity | Rating count with time decay | 0.5 | Cold-start fallback |
 
@@ -215,9 +217,9 @@ Architecture: Self-Attentive Sequential Recommendation with Transformer blocks
 - Training: 30 epochs, 64-dim embeddings, BCE loss with negative sampling
 - Dual use: (1) ranking feature via `sasrec_score`, (2) independent recall channel via embedding dot-product
 
-### 4.3 LGBMRanker (LambdaRank)
+### 4.3 LGBMRanker (LambdaRank) + Model Stacking
 
-Replaced XGBoost binary classifier with LightGBM LambdaRank that directly optimizes NDCG.
+Replaced XGBoost binary classifier with LightGBM LambdaRank that directly optimizes NDCG. In v2.6.0, a Stacking ensemble (LGBMRanker + XGBClassifier → LogisticRegression meta-learner) further improves ranking robustness.
 
 **Training strategy**:
 - Hard negative sampling: negatives mined from recall results (not random items)
@@ -231,24 +233,25 @@ Replaced XGBoost binary classifier with LightGBM LambdaRank that directly optimizes NDCG.
 - Sequence: sasrec_score, sim_max, sim_min, sim_mean
 - CF scores: icf_sum, icf_max, ucf_sum
 
-Feature importance (
+Feature importance (v2.6.0 LGBMRanker, representative subset):
 
 | Feature | Importance | Description |
 |:---|:---|:---|
-|
-| sim_max |
-|
-|
-|
-|
+| u_cnt | 88 | User activity count |
+| sim_max | 76 | Last-N similarity max |
+| icf_max | 62 | ItemCF max similarity |
+| i_cnt | 59 | Item popularity count |
+| len_diff | 55 | Description complexity match |
+| sasrec_score | 25 | SASRec embedding score |
 
 ### 4.4 Evaluation Results
 
-
+*Protocol: Leave-Last-Out, n=2000 users, title-relaxed matching, filter_favorites=False.*
+
+| Metric | V2.0 (XGBoost) | V2.5 (LGBMRanker) | v2.6.0 (+Item2Vec, Stacking) |
 |:---|:---|:---|:---|
-| HR@10 | 0.1380 |
-| MRR@5 | 0.1295 |
-| Users Evaluated | 500 | 2,000 | |
+| HR@10 | 0.1380 | 0.2205 | **0.4545** |
+| MRR@5 | 0.1295 | 0.1584 | **0.2893** |
 | Dataset | 167,968 active users, 221,998 books | | |
 
 ---
@@ -261,9 +264,9 @@ Feature importance (V2.5 LGBMRanker):
 |--------|------------------------|-------------|
 | ISBN Recall | 0% | 100% |
 | Keyword Precision | Low | High (BM25 boost) |
-| Detail Query Recall | 0% |
+| Detail Query Recall | 0% | Demonstrated via curated examples (Small-to-Big) |
 | Avg Latency | 100ms | 300-800ms |
-| Chat Context Limit | ~10 turns |
+| Chat Context Limit | ~10 turns | Extended via compression (no formal limit) |
 
 ### 5.2 Latency Benchmarks
 
@@ -310,7 +313,9 @@
 
 ---
 
-## 8. SFT Data Pipeline
+## 8. SFT Data Pipeline (Supplementary)
+
+*Not integrated into the main RAG flow in v2.6.0.* This pipeline was developed for potential future fine-tuning of chat tone.
 
 ### 8.1 Problem
 
@@ -328,6 +333,8 @@ Output:
 - `data/sft/literary_critic_train.jsonl`: ~800 high-quality (Query, Response) pairs
 - `data/dpo/preference_pairs.jsonl`: ~500 (Chosen, Rejected) pairs
 
+See [Experiment Archive](experiments/experiment_archive.md) for full implementation details.
+
 ---
 
 ## 9. File Structure
@@ -346,7 +353,8 @@ src/
 │   ├── sasrec_recall.py      # SASRec Embedding Recall
 │   ├── popularity.py         # Popularity Recall
 │   ├── youtube_dnn.py        # Two-Tower Model
-│
+│   ├── item2vec.py           # Item2Vec Recall (Word2Vec)
+│   └── fusion.py             # RRF Fusion (7 channels)
 ├── ranking/
 │   └── features.py           # 17 Ranking Features
 ├── data_factory/
@@ -359,7 +367,16 @@ src/
 
 ---
 
-## 10.
+## 10. Limitations
+
+- **Single-dataset evaluation**: All RecSys metrics are on Amazon Books 200K; no cross-domain or external validation.
+- **Rule-based router**: Intent classification uses heuristics (e.g., `len(words) <= 2` for keyword); may not generalize to other domains.
+- **RAG evaluation**: RAG quality is demonstrated via curated examples (e.g., "Harry Potter", ISBN recall); no systematic human evaluation or large-scale relevance judgments.
+- **Protocol sensitivity**: RecSys metrics can vary with evaluation protocol (e.g., ISBN-only vs title-relaxed matching); see [Experiment Archive](experiments/experiment_archive.md) for discussion.
+
+---
+
+## 11. Scalability
 
 Current capacity:
 - In-memory index: 2GB RAM, ~200K books
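The diagram above fuses the per-channel candidate lists with Reciprocal Rank Fusion before ranking. A minimal sketch of that fusion step, assuming each channel returns a ranked ISBN list and the weights from the recall table (names are illustrative; `src/recall/fusion.py` may differ):

```python
# Sketch of weighted Reciprocal Rank Fusion over recall channels.
# RRF score: sum over channels of weight / (k + rank), rank starting at 1.
from collections import defaultdict

def rrf_fuse(channel_results: dict[str, list[str]],
             channel_weights: dict[str, float],
             k: int = 60) -> list[str]:
    """Fuse per-channel ranked ISBN lists into one ranking."""
    scores = defaultdict(float)
    for channel, isbns in channel_results.items():
        w = channel_weights.get(channel, 1.0)
        for rank, isbn in enumerate(isbns, start=1):
            scores[isbn] += w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf_fuse(
    {"itemcf": ["b1", "b2"], "sasrec": ["b2", "b3"], "popularity": ["b4"]},
    {"itemcf": 1.0, "sasrec": 1.0, "popularity": 0.5},
)
print(fused)  # b2 first: it appears in two channels
```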
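The Swing row in the recall table scores an item pair by summing `1/(α + |I_u ∩ I_v|)` over user pairs that interacted with both items, damping pairs of heavily overlapping users. A small worked sketch on toy data (not the project's implementation):

```python
# Sketch of Swing weighting: each user pair (u, v) that co-interacted with
# items i and j contributes 1 / (alpha + |I_u ∩ I_v|).
from itertools import combinations

def swing_score(item_users: dict[str, set], user_items: dict[str, set],
                i: str, j: str, alpha: float = 1.0) -> float:
    common_users = item_users[i] & item_users[j]
    score = 0.0
    for u, v in combinations(sorted(common_users), 2):
        overlap = len(user_items[u] & user_items[v])  # |I_u ∩ I_v|
        score += 1.0 / (alpha + overlap)
    return score

user_items = {"u1": {"i", "j", "k"}, "u2": {"i", "j"}, "u3": {"i", "j", "k", "l"}}
item_users = {"i": {"u1", "u2", "u3"}, "j": {"u1", "u2", "u3"}}
print(swing_score(item_users, user_items, "i", "j"))  # 1/3 + 1/4 + 1/3
```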
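Item2Vec, the channel added in v2.6.0, treats each user's interaction sequence as a "sentence" and trains skip-gram Word2Vec over ISBNs. A sketch with gensim, assuming gensim is available; the hyperparameters here are illustrative, not the project's exact config:

```python
# Item2Vec sketch: skip-gram Word2Vec over per-user ISBN sequences.
from gensim.models import Word2Vec

user_sequences = [
    ["isbn_a", "isbn_b", "isbn_c"],   # one user's chronological interactions
    ["isbn_b", "isbn_c", "isbn_d"],
    ["isbn_a", "isbn_c", "isbn_d"],
]

model = Word2Vec(
    sentences=user_sequences,
    vector_size=64,   # embedding dim (illustrative)
    window=5,
    sg=1,             # skip-gram, as stated in the recall table
    min_count=1,
    epochs=20,
)

# Recall for a user: nearest neighbours of recently interacted items.
print(model.wv.most_similar("isbn_c", topn=2))
```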
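The v2.6.0 stacking step feeds base-model scores into a LogisticRegression meta-learner. A compact sketch on synthetic data; a real pipeline would use the 17 ranking features and out-of-fold base predictions to avoid leakage:

```python
# Stacking sketch: LGBMRanker + XGBClassifier scores -> LogisticRegression.
import numpy as np
from lightgbm import LGBMRanker
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                          # candidate features
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)   # 1 = interacted
groups = [20] * 10                   # 10 users x 20 candidates (LambdaRank groups)

ranker = LGBMRanker(objective="lambdarank", n_estimators=50)
ranker.fit(X, y, group=groups)

clf = XGBClassifier(n_estimators=50, eval_metric="logloss")
clf.fit(X, y)

# Meta-features: each base model's score for every candidate.
meta_X = np.column_stack([ranker.predict(X), clf.predict_proba(X)[:, 1]])
meta = LogisticRegression().fit(meta_X, y)
print(meta.predict_proba(meta_X)[:3, 1])  # final ranking scores
```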
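The Leave-Last-Out protocol behind HR@10 and MRR@5 holds out each user's last interaction, ranks candidates from the remaining history, and checks where the held-out item lands. A toy sketch (the real protocol additionally relaxes matching to titles, per the note above):

```python
# Leave-Last-Out evaluation sketch; `recommend` stands in for the full
# recall + ranking pipeline and simply returns a ranked ISBN list.

def hit_rate_and_mrr(users, recommend, k_hr=10, k_mrr=5):
    hits, rr_sum = 0, 0.0
    for history in users:                     # history: chronological ISBNs
        held_out, context = history[-1], history[:-1]
        ranked = recommend(context)
        if held_out in ranked[:k_hr]:
            hits += 1                         # HR@10 numerator
        if held_out in ranked[:k_mrr]:
            rr_sum += 1.0 / (ranked.index(held_out) + 1)  # MRR@5 numerator
    n = len(users)
    return hits / n, rr_sum / n

users = [["a", "b", "c"], ["b", "d"], ["a", "e"]]
recommend = lambda ctx: ["c", "d", "x", "y", "z"]   # toy recommender
hr10, mrr5 = hit_rate_and_mrr(users, recommend)
print(f"HR@10={hr10:.4f}  MRR@5={mrr5:.4f}")
```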
docs/archived/PHASE_2_DEVELOPMENT.md
CHANGED
@@ -204,12 +204,12 @@ persona = getPersona(userId)
 - **Method:** Rule-based persona matching (Top-3 authors/categories)
 - **Future:** Could upgrade to LLM refinement (e.g., GPT for polish)
 
-### Why React + Vite?
+### Why React + Vite (not Gradio)?
 - **Rationale:**
-  - React needed for custom UX
+  - Gradio good for prototypes, React needed for custom UX
   - Vite super fast (no webpack pain)
-  - Tailwind
-  - **
+  - Tailwind CDN avoids npm build complexity
+  - **Fallback:** Gradio UI (app.py) still available on port 7860
 
 ### Why Persona from Favorites (not search history)?
 - **Rationale:** User intent explicit in favorites, not implicit in queries
@@ -345,6 +345,13 @@ npm run dev
 # Connect to http://localhost:6006 backend
 ```
 
+**Terminal 3 (Optional): Gradio Legacy UI**
+```bash
+python app.py
+# Starts on http://localhost:7860
+# Alternative UI for testing
+```
+
 ### Production Workflow
 - React builds with `npm run build` → static files
 - FastAPI serves as single backend
@@ -415,13 +422,13 @@ Expected highlight: "Not in your usual genres, but the depth of plot and sci-fi imagination
 ✅ **Modular backend design** → easy to add /highlights, /persona endpoints
 ✅ **React UI responsiveness** → users see results instantly
 ✅ **JSON-first approach** → no DB setup friction, iterate fast
-✅ **API-driven architecture** →
+✅ **API-driven architecture** → Gradio + React both work
 ✅ **Persona concept** → users feel "understood" by the system
 
 ### Challenges Overcome
-🔧 **Port
+🔧 **Port conflicts** (Gradio:7860 vs React:5173 vs FastAPI:6006) → Makefile organization
 🔧 **CORS issues** (frontend can't reach backend) → Added CORSMiddleware
-🔧 **Image loading** (external URLs) → Runtime fetching + local fallback
+🔧 **Image loading** (external URLs not allowed in Gradio) → Runtime fetching + local fallback
 🔧 **Timeout errors** (cold startup > 10s) → Increased client timeouts, optimized startup
 
 ### Design Philosophy Validated
@@ -454,11 +461,12 @@ book-rec-with-LLMs/
 │   │   └── main.jsx          # Entry point
 │   ├── index.html            # HTML + Tailwind CDN
 │   └── package.json          # Dependencies
+├── app.py                    # Gradio UI (legacy)
 ├── Makefile                  # Commands
 ├── requirements.txt          # Python deps
 └── data/
-    ├── books_processed.csv   # Metadata
-    └── user_profiles.json    # User data
+    ├── books_processed.csv   # Metadata
+    └── user_profiles.json    # ✨ NEW: User data
 ```
 
 ---
@@ -478,6 +486,7 @@ feat: add React UI and backend personalization features
 * src/marketing/highlights.py: Persona-aware selling points
 * 3 new API endpoints in FastAPI
 
+- Maintain Gradio UI as optional fallback
 - Add CORS support, update timeouts, improve infrastructure
 ```
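The CORSMiddleware fix mentioned under "Challenges Overcome" is a few lines in FastAPI. A sketch assuming the Vite dev origin from the dev workflow above (the project's actual settings may differ):

```python
# CORS sketch: allow the Vite dev server origin to call the FastAPI backend.
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:5173"],  # React (Vite) dev server
    allow_methods=["*"],
    allow_headers=["*"],
)
```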
DEPLOYMENT.md → docs/archived/graveyard/DEPLOYMENT.md
RENAMED (File without changes)

business_logic.md → docs/archived/graveyard/business_logic.md
RENAMED (File without changes)

docs/{future_roadmap.md → archived/graveyard/future_roadmap.md}
RENAMED (File without changes)

docs/{interview_deep_dive.md → archived/graveyard/interview_deep_dive.md}
RENAMED (File without changes)

interview_prep.md → docs/archived/graveyard/interview_prep.md
RENAMED (File without changes)

docs/{phase7_plan.md → archived/graveyard/phase7_plan.md}
RENAMED (File without changes)

project_analysis.md → docs/archived/graveyard/project_analysis.md
RENAMED (File without changes)

docs/{project_narrative.md → archived/graveyard/project_narrative.md}
RENAMED (File without changes)

docs/{rag_architecture.md → archived/graveyard/rag_architecture.md}
RENAMED (File without changes)

technical_architecture.md → docs/archived/graveyard/technical_architecture.md
RENAMED (File without changes)

docs/{technical_deep_dive_sota.md → archived/graveyard/technical_deep_dive_sota.md}
RENAMED (File without changes)

story_and_strategy.md → docs/archived/story_and_strategy.md
RENAMED (File without changes)

{experiments → docs/experiments/reports}/baseline_report.md
RENAMED (File without changes)

benchmarks/results.md → docs/experiments/reports/benchmark_results.md
RENAMED (File without changes)

{experiments → docs/experiments/reports}/hybrid_report.md
RENAMED (File without changes)

{experiments → docs/experiments/reports}/rerank_report.md
RENAMED (File without changes)

{experiments → docs/experiments/reports}/router_report.md
RENAMED (File without changes)

{experiments → docs/experiments/reports}/temporal_report.md
RENAMED (File without changes)
legacy/README.md
ADDED
@@ -0,0 +1,10 @@
+# Legacy — Not part of main architecture
+
+Code moved here is preserved but not used in the main flow (src.main FastAPI + React).
+
+| File | Note |
+|:---|:---|
+| app.py | Gradio UI (replaced by React + FastAPI) |
+| agent/ | Shopping agent (broken imports, not used) |
+| deploy.sh | Old Hugging Face deployment script |
+| download_fix.py | Temporary fix script |
{src → legacy}/agent/agent_core.py
RENAMED (File without changes)

{src → legacy}/agent/data_loader.py
RENAMED (File without changes)

{src → legacy}/agent/dialogue_manager.py
RENAMED (File without changes)

{src → legacy}/agent/intent_parser.py
RENAMED (File without changes)

{src → legacy}/agent/llm_generator.py
RENAMED (File without changes)

{src → legacy}/agent/rag_indexer.py
RENAMED (File without changes)

{src → legacy}/agent/rag_retriever.py
RENAMED (File without changes)
app.py → legacy/app.py
RENAMED
@@ -48,7 +48,7 @@ tones = fetch_tones()
 # Initialize Shopping Agent (Mock or Real)
 # Note: Real agent requires FAISS index. We'll handle checks later.
 try:
-    # from
+    # from legacy.agent.agent_core import ShoppingAgent
     # shopping_agent = ShoppingAgent(...)
     pass
 except ImportError:
deploy.sh → legacy/deploy.sh
RENAMED (File without changes)

download_fix.py → legacy/download_fix.py
RENAMED (File without changes)
requirements.txt
CHANGED
@@ -40,7 +40,6 @@ scipy
 requests
 
 # LLM Agent & Fine-tuning
-langchain
 faiss-cpu
 diffusers
 openai
scripts/add_isbn13_to_books_data.py
DELETED
@@ -1,16 +0,0 @@
-import pandas as pd
-
-# Read the main table and books_data_with_isbn.csv
-main = pd.read_csv("data/books_with_emotions.csv", usecols=["title", "isbn13"])
-data = pd.read_csv("data/books_data_with_isbn.csv")
-
-# Normalize titles
-main["title"] = main["title"].astype(str).str.strip().str.lower()
-data["Title"] = data["Title"].astype(str).str.strip().str.lower()
-
-# Merge with a left join
-merged = data.merge(main, left_on="Title", right_on="title", how="left")
-
-# Save the new file
-merged.to_csv("data/books_data_with_isbn13.csv", index=False)
-print("Generated data/books_data_with_isbn13.csv with the isbn13 column.")
scripts/add_isbn_to_books_data.py
DELETED
@@ -1,21 +0,0 @@
-import pandas as pd
-
-# Read books_data.csv
-books_data = pd.read_csv("data/books_data.csv")
-
-# Read Books_rating.csv, keeping only the Title and Id columns
-ratings = pd.read_csv("data/Books_rating.csv", usecols=["Title", "Id"])
-
-# Deduplicate to avoid many-to-one matches
-ratings = ratings.drop_duplicates(subset=["Title"])
-
-# Left join, keeping every row of books_data.csv
-merged = books_data.merge(ratings, on="Title", how="left")
-
-# Rename Id to isbn
-merged = merged.rename(columns={"Id": "isbn"})
-
-# Save the new file
-merged.to_csv("data/books_data_with_isbn.csv", index=False)
-
-print("Generated data/books_data_with_isbn.csv with the isbn column.")
scripts/benchmark_compressor.py
DELETED
@@ -1,35 +0,0 @@
-import asyncio
-from langchain_core.messages import HumanMessage, AIMessage
-from src.core.context_compressor import compressor
-
-async def run_benchmark():
-    print("🚀 Starting Context Compression Benchmark...")
-
-    # 1. Simulate Long History (12 messages, 6 turns)
-    history = []
-    for i in range(1, 7):
-        history.append(HumanMessage(content=f"User question {i}: I like sci-fi."))
-        history.append(AIMessage(content=f"AI answer {i}: Here is a sci-fi book."))
-
-    print(f"Original History Length: {len(history)} messages")
-
-    # 2. Compress
-    print("Compressing...")
-    # Mock LLM generation usually takes time, so latency includes API call
-    compressed = await compressor.compress_history(history)
-
-    print(f"Compressed History Length: {len(compressed)} messages")
-
-    # 3. Validation
-    # Expected: 1 SystemMessage (Summary) + 4 Messages (Recent) = 5
-    if len(compressed) == 5:
-        print("✅ SUCCESS: History compressed to 5 messages.")
-        print(f"Summary Content: {compressed[0].content}")
-        print(f"Oldest Retained Message: {compressed[1].content}")
-    else:
-        print(f"❌ FAILURE: Expected 5 messages, got {len(compressed)}")
-        for i, m in enumerate(compressed):
-            print(f"[{i}] {type(m).__name__}: {m.content}")
-
-if __name__ == "__main__":
-    asyncio.run(run_benchmark())
scripts/benchmark_hybrid.py
DELETED
@@ -1,83 +0,0 @@
-import time
-import pandas as pd
-from src.vector_db import VectorDB
-
-def run_benchmark():
-    print("🚀 Starting Hybrid Retrieval Benchmark...")
-
-    # Load Title Mapping
-    try:
-        books_df = pd.read_csv("data/books_processed.csv")
-        # Ensure string ISBN for matching
-        if 'isbn13' in books_df.columns:
-            books_df['isbn'] = books_df['isbn13'].astype(str)
-        else:
-            books_df['isbn'] = books_df['isbn'].astype(str)
-
-        isbn_map = books_df.set_index('isbn')['title'].to_dict()
-    except Exception as e:
-        print(f"⚠️ Failed to load books_processed.csv: {e}")
-        isbn_map = {}
-
-    db = VectorDB()
-
-    # Same Test Cases
-    test_queries = [
-        # 1. Semantic (Hybrid should match Dense)
-        {"type": "Semantic", "query": "books about finding love in unexpected places"},
-        {"type": "Semantic", "query": "scary stories that keep you up at night"},
-
-        # 2. Keyword/Proper Noun (Hybrid should improve)
-        {"type": "Keyword", "query": "Harry Potter"},
-        {"type": "Keyword", "query": "Python Programming"},
-        {"type": "Keyword", "query": "Jane Austen"},
-
-        # 3. Exact Match / ISBN (Hybrid should fix this)
-        {"type": "Exact", "query": "0060959479"},
-    ]
-
-    results = []
-
-    for case in test_queries:
-        q = case["query"]
-        print(f"\nScanning: '{q}' ({case['type']})...")
-
-        start_time = time.time()
-        # USE HYBRID SEARCH
-        docs = db.hybrid_search(q, k=5)
-        duration = (time.time() - start_time) * 1000
-
-        # Capture simplified results
-        top_results = []
-        for doc in docs:
-            # Extract ISBN
-            parts = doc.page_content.strip().split(' ', 1)
-            isbn = parts[0]
-            # Fallback parsing for legacy docs
-            if "ISBN:" in doc.page_content:
-                isbn = doc.page_content.split("ISBN:")[1].strip().split()[0]
-
-            title = isbn_map.get(isbn, f"ISBN:{isbn}")
-            if len(title) > 40:
-                title = title[:37] + "..."
-            top_results.append(title)
-
-        print(f" -> Found: {top_results}")
-        results.append({
-            "query": q,
-            "type": case["type"],
-            "latency_ms": round(duration, 2),
-            "top_results": top_results
-        })
-
-    # Save
-    df = pd.DataFrame(results)
-    path = "experiments/02_hybrid_results.csv"
-    df.to_csv(path, index=False)
-    print(f"\n💾 Results saved to {path}")
-
-    print("\n## Hybrid Search Results")
-    print(df.to_string(index=False))
-
-if __name__ == "__main__":
-    run_benchmark()
scripts/benchmark_rerank.py
DELETED
@@ -1,82 +0,0 @@
-import time
-import pandas as pd
-from src.vector_db import VectorDB
-
-def run_benchmark():
-    print("🚀 Starting Reranked Retrieval Benchmark...")
-
-    # Load Title Mapping
-    try:
-        books_df = pd.read_csv("data/books_processed.csv")
-        if 'isbn13' in books_df.columns:
-            books_df['isbn'] = books_df['isbn13'].astype(str)
-        else:
-            books_df['isbn'] = books_df['isbn'].astype(str)
-        isbn_map = books_df.set_index('isbn')['title'].to_dict()
-    except Exception as e:
-        print(f"⚠️ Failed to load books_processed.csv: {e}")
-        isbn_map = {}
-
-    db = VectorDB()
-
-    # Same Test Cases
-    test_queries = [
-        # 1. Semantic (Reranker should bubble up best Semantic matches)
-        {"type": "Semantic", "query": "books about finding love in unexpected places"},
-        # Complex mood query
-        {"type": "Complex", "query": "a dark sci-fi thriller with a female protagonist"},
-
-        # 2. Keyword/Proper Noun (Reranker should confirm these are relevant)
-        {"type": "Keyword", "query": "Harry Potter"},
-        {"type": "Keyword", "query": "Jane Austen"},
-
-        # 3. Exact Match (Should still work)
-        {"type": "Exact", "query": "0060959479"},
-    ]
-
-    results = []
-
-    for case in test_queries:
-        q = case["query"]
-        print(f"\nScanning: '{q}' ({case['type']})...")
-
-        start_time = time.time()
-        # USE HYBRID WITH RERANK
-        docs = db.hybrid_search(q, k=5, rerank=True)
-        duration = (time.time() - start_time) * 1000
-
-        # Capture results with scores
-        top_results = []
-        for doc in docs:
-            # Extract ISBN
-            parts = doc.page_content.strip().split(' ', 1)
-            isbn = parts[0]
-            if "ISBN:" in doc.page_content:
-                isbn = doc.page_content.split("ISBN:")[1].strip().split()[0]
-
-            title = isbn_map.get(isbn, f"ISBN:{isbn}")
-            if len(title) > 30:
-                title = title[:27] + "..."
-
-            score = doc.metadata.get("relevance_score", 0.0)
-            top_results.append(f"{title} ({score:.4f})")
-
-        print(f" -> Found: {top_results}")
-        results.append({
-            "query": q,
-            "type": case["type"],
-            "latency_ms": round(duration, 2),
-            "top_results": top_results
-        })
-
-    # Save
-    df = pd.DataFrame(results)
-    path = "experiments/03_rerank_results.csv"
-    df.to_csv(path, index=False)
-    print(f"\n💾 Results saved to {path}")
-
-    print("\n## Reranked Search Results")
-    print(df.to_string(index=False))
-
-if __name__ == "__main__":
-    run_benchmark()
scripts/benchmark_retrieval.py
DELETED
@@ -1,82 +0,0 @@
-import time
-import pandas as pd
-from typing import List
-from src.vector_db import VectorDB
-
-def run_benchmark():
-    print("🚀 Starting Retrieval Benchmark (BASELINE)...")
-
-    # Load Title Mapping
-    try:
-        books_df = pd.read_csv("data/books_processed.csv")
-        # Ensure string ISBN for matching
-        books_df['isbn'] = books_df['isbn'].astype(str)
-        isbn_map = books_df.set_index('isbn')['title'].to_dict()
-        print(f"📚 Loaded {len(isbn_map)} titles for mapping.")
-    except Exception as e:
-        print(f"⚠️ Failed to load books_processed.csv: {e}")
-        isbn_map = {}
-
-    db = VectorDB()
-
-    # ... (Test Cases preserved) ...
-    test_queries = [
-        # 1. Semantic (Dense should win)
-        {"type": "Semantic", "query": "books about finding love in unexpected places"},
-        {"type": "Semantic", "query": "scary stories that keep you up at night"},
-
-        # 2. Keyword/Proper Noun (Dense might struggle)
-        {"type": "Keyword", "query": "Harry Potter"},
-        {"type": "Keyword", "query": "Python Programming"},
-        {"type": "Keyword", "query": "Jane Austen"},
-
-        # 3. Exact Match / ISBN
-        {"type": "Exact", "query": "0060959479"},
-    ]
-
-    results = []
-
-    for case in test_queries:
-        q = case["query"]
-        print(f"\nScanning: '{q}' ({case['type']})...")
-
-        start_time = time.time()
-        docs = db.search(q, k=5)
-        duration = (time.time() - start_time) * 1000
-
-        # Capture simplified results
-        top_results = []
-        for doc in docs:
-            # Format: "ISBN ReviewText..."
-            # Extract ISBN (first token)
-            parts = doc.page_content.strip().split(' ', 1)
-            isbn = parts[0]
-
-            # Lookup Title
-            title = isbn_map.get(isbn, f"ISBN:{isbn}")
-
-            # Truncate for display
-            if len(title) > 40:
-                title = title[:37] + "..."
-            top_results.append(title)
-
-        print(f" -> Found: {top_results}")
-        results.append({
-            "query": q,
-            "type": case["type"],
-            "latency_ms": round(duration, 2),
-            "top_results": top_results
-        })
-
-    # Save Report
-    df = pd.DataFrame(results)
-    path = "experiments/01_baseline_results.csv"
-    df.to_csv(path, index=False)
-    print(f"\n💾 Results saved to {path}")
-
-    # Print Summary
-    print("\n## Baseline Results Summary")
-    print(df.to_string(index=False))
-
-if __name__ == "__main__":
-    run_benchmark()
scripts/benchmark_router.py
DELETED
@@ -1,99 +0,0 @@
-import time
-import pandas as pd
-from src.vector_db import VectorDB
-from src.core.router import QueryRouter
-
-def run_benchmark():
-    print("🚀 Starting Agentic Router Benchmark...")
-
-    # Init Components
-    db = VectorDB()
-    router = QueryRouter()
-
-    # Load Title Mapping (for display)
-    try:
-        books_df = pd.read_csv("data/books_processed.csv")
-        if 'isbn13' in books_df.columns:
-            books_df['isbn'] = books_df['isbn13'].astype(str)
-        else:
-            books_df['isbn'] = books_df['isbn'].astype(str)
-        isbn_map = books_df.set_index('isbn')['title'].to_dict()
-    except:
-        isbn_map = {}
-
-    test_queries = [
-        # 1. ISBN -> Should be EXACT (No Rerank) to avoid regression
-        {"query": "0060959479", "expected_strat": "exact"},
-
-        # 2. Keyword -> Should be FAST (No Rerank)
-        {"query": "python programming", "expected_strat": "fast"},
-
-        # 3. Complex -> Should be DEEP (With Rerank)
-        {"query": "books about finding love in unexpected places", "expected_strat": "deep"},
-    ]
-
-    results = []
-
-    for case in test_queries:
-        q = case["query"]
-        print(f"\nUser Query: '{q}'")
-
-        # 1. ROUTING STEP
-        route_decision = router.route(q)
-        strat = route_decision["strategy"]
-        use_rerank = route_decision["rerank"]
-        alpha_val = route_decision.get("alpha", 0.5)
-
-        print(f" 🤖 Router Decision: {strat.upper()} (Rerank={use_rerank}, Alpha={alpha_val})")
-
-        # Check expectation
-        if strat != case["expected_strat"]:
-            print(f" ⚠️ WARNING: Expected {case['expected_strat']}, got {strat}")
-
-        # 2. RETRIEVAL STEP
-        start_time = time.time()
-        docs = db.hybrid_search(
-            q,
-            k=5,
-            rerank=use_rerank,
-            alpha=alpha_val
-        )
-        duration = (time.time() - start_time) * 1000
-
-        # Capture results
-        top_results = []
-        for doc in docs:
-            # Extract ISBN/Title
-            if "ISBN:" in doc.page_content:
-                isbn = doc.page_content.split("ISBN:")[1].strip().split()[0]
-            else:
-                parts = doc.page_content.strip().split(' ', 1)
-                isbn = parts[0]
-
-            title = isbn_map.get(isbn, f"ISBN:{isbn}")
-            if len(title) > 30:
-                title = title[:27] + "..."
-
-            score = doc.metadata.get("relevance_score", "N/A")
-            if score != "N/A":
-                top_results.append(f"{title} ({score:.4f})")
-            else:
-                top_results.append(f"{title}")
-
-        print(f" -> Found: {top_results[:3]}")
-        results.append({
-            "query": q,
-            "strategy": strat,
-            "latency_ms": round(duration, 2),
-            "top_1": top_results[0] if top_results else "None"
-        })
-
-    # Save
-    df = pd.DataFrame(results)
-    path = "experiments/04_router_results.csv"
-    df.to_csv(path, index=False)
-    print(f"\n💾 Results saved to {path}")
-    print(df.to_string(index=False))
-
-if __name__ == "__main__":
-    run_benchmark()
scripts/benchmark_temporal.py
DELETED
@@ -1,44 +0,0 @@
-import pandas as pd
-from src.vector_db import VectorDB
-
-def run_benchmark():
-    print("🚀 Starting Temporal Dynamics Benchmark...")
-
-    db = VectorDB()
-
-    # We use a query where 'newness' matters
-    query = "latest advancements in technology and science"
-
-    print(f"\nQuery: '{query}'")
-
-    # 1. Standard Search
-    print("\n--- Standard Search (No Temporal) ---")
-    st_docs = db.hybrid_search(query, k=5, rerank=True, temporal=False)
-    for d in st_docs:
-        # Get Year
-        isbn = d.metadata.get("isbn") or d.metadata.get("isbn13")
-        if not isbn and "ISBN:" in d.page_content:
-            isbn = d.page_content.split("ISBN:")[1].strip().split()[0]
-        year = db.pub_years.get(str(isbn), "Unknown")
-        score = d.metadata.get("relevance_score", 0.0)
-
-        # Parse title
-        title = d.page_content.split('\n')[0].replace("Title: ", "")[:40]
-        print(f"[{year}] {title}... (Score: {score:.4f})")
-
-    # 2. Temporal Search
-    print("\n--- Temporal Search (Recent Boost) ---")
-    tm_docs = db.hybrid_search(query, k=5, rerank=True, temporal=True)
-    for d in tm_docs:
-        isbn = d.metadata.get("isbn") or d.metadata.get("isbn13")
-        if not isbn and "ISBN:" in d.page_content:
-            isbn = d.page_content.split("ISBN:")[1].strip().split()[0]
-        year = db.pub_years.get(str(isbn), "Unknown")
-        # In temporal mode, score is boosted
-        score = d.metadata.get("relevance_score", 0.0)
-
-        title = d.page_content.split('\n')[0].replace("Title: ", "")[:40]
-        print(f"[{year}] {title}... (Score: {score:.4f})")
-
-if __name__ == "__main__":
-    run_benchmark()
scripts/build_books_basic_info.py
DELETED
@@ -1,48 +0,0 @@
-import pandas as pd
-import csv
-
-# Read the raw data; skip malformed rows so the pipeline never aborts
-books_data = pd.read_csv(
-    "data/books_data.csv",
-    engine="python",
-    quotechar='"',
-    escapechar='\\',
-    on_bad_lines='skip'  # pandas >=1.3
-)
-ratings = pd.read_csv("data/Books_rating.csv", engine="python", quotechar='"', escapechar='\\', on_bad_lines='skip')
-
-# Keep only the useful columns
-books_cols = [
-    "Title", "description", "authors", "image", "publisher", "publishedDate", "categories"
-]
-books_data = books_data[books_cols]
-
-# Keep only Title, Id, review/score for the merge
-ratings_cols = ["Title", "Id", "review/score"]
-ratings = ratings[ratings_cols]
-
-# Deduplicate
-ratings = ratings.drop_duplicates(subset=["Title"])
-
-# Left join, keeping every row of books_data
-merged = books_data.merge(ratings, on="Title", how="left")
-
-# Rename columns
-merged = merged.rename(columns={
-    "Id": "isbn10",
-    "Title": "title",
-    "authors": "authors",
-    "description": "description",
-    "image": "image",
-    "publisher": "publisher",
-    "publishedDate": "publishedDate",
-    "categories": "categories",
-    "review/score": "average_rating"
-})
-
-# Generate isbn13 (placeholder; a real derivation rule can be added later)
-merged["isbn13"] = None  # isbn13 derivation logic can be filled in later
-
-# Save the new table; quote every field so long descriptions are not truncated
-merged.to_csv("data/books_basic_info.csv", index=False, quoting=csv.QUOTE_ALL, quotechar='"', escapechar='\\')
-print("Generated data/books_basic_info.csv with basic book metadata columns.")