ymlin105 committed on
Commit
950f43a
·
1 Parent(s): 89c5a4f

chore: remove deprecated files and add legacy directory to .dockerignore

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. .cursor/rules/research-prototype-style.mdc +29 -0
  2. .dockerignore +3 -0
  3. PHASE_2_DEVELOPMENT.md +0 -518
  4. README.md +41 -192
  5. REVIEW_HIGHLIGHTS.md +0 -142
  6. TAGS_AND_EMOTIONS.md +0 -233
  7. cover-not-found.jpg +0 -0
  8. data-exploration.ipynb +0 -0
  9. docs/README.md +18 -14
  10. docs/TECHNICAL_REPORT.md +40 -23
  11. docs/archived/PHASE_2_DEVELOPMENT.md +18 -9
  12. DEPLOYMENT.md → docs/archived/graveyard/DEPLOYMENT.md +0 -0
  13. business_logic.md → docs/archived/graveyard/business_logic.md +0 -0
  14. docs/{future_roadmap.md → archived/graveyard/future_roadmap.md} +0 -0
  15. docs/{interview_deep_dive.md → archived/graveyard/interview_deep_dive.md} +0 -0
  16. interview_prep.md → docs/archived/graveyard/interview_prep.md +0 -0
  17. docs/{phase7_plan.md → archived/graveyard/phase7_plan.md} +0 -0
  18. project_analysis.md → docs/archived/graveyard/project_analysis.md +0 -0
  19. docs/{project_narrative.md → archived/graveyard/project_narrative.md} +0 -0
  20. docs/{rag_architecture.md → archived/graveyard/rag_architecture.md} +0 -0
  21. technical_architecture.md → docs/archived/graveyard/technical_architecture.md +0 -0
  22. docs/{technical_deep_dive_sota.md → archived/graveyard/technical_deep_dive_sota.md} +0 -0
  23. story_and_strategy.md → docs/archived/story_and_strategy.md +0 -0
  24. {experiments → docs/experiments/reports}/baseline_report.md +0 -0
  25. benchmarks/results.md → docs/experiments/reports/benchmark_results.md +0 -0
  26. {experiments → docs/experiments/reports}/hybrid_report.md +0 -0
  27. {experiments → docs/experiments/reports}/rerank_report.md +0 -0
  28. {experiments → docs/experiments/reports}/router_report.md +0 -0
  29. {experiments → docs/experiments/reports}/temporal_report.md +0 -0
  30. legacy/README.md +10 -0
  31. {src → legacy}/agent/agent_core.py +0 -0
  32. {src → legacy}/agent/data_loader.py +0 -0
  33. {src → legacy}/agent/dialogue_manager.py +0 -0
  34. {src → legacy}/agent/intent_parser.py +0 -0
  35. {src → legacy}/agent/llm_generator.py +0 -0
  36. {src → legacy}/agent/rag_indexer.py +0 -0
  37. {src → legacy}/agent/rag_retriever.py +0 -0
  38. app.py → legacy/app.py +1 -1
  39. deploy.sh → legacy/deploy.sh +0 -0
  40. download_fix.py → legacy/download_fix.py +0 -0
  41. requirements.txt +0 -1
  42. scripts/add_isbn13_to_books_data.py +0 -16
  43. scripts/add_isbn_to_books_data.py +0 -21
  44. scripts/benchmark_compressor.py +0 -35
  45. scripts/benchmark_hybrid.py +0 -83
  46. scripts/benchmark_rerank.py +0 -82
  47. scripts/benchmark_retrieval.py +0 -82
  48. scripts/benchmark_router.py +0 -99
  49. scripts/benchmark_temporal.py +0 -44
  50. scripts/build_books_basic_info.py +0 -48
.cursor/rules/research-prototype-style.mdc ADDED
@@ -0,0 +1,29 @@
1
+ ---
2
+ description: Research prototype coding style — graduate student, clarity over elegance
3
+ globs: src/**/*.py, scripts/**/*.py
4
+ alwaysApply: true
5
+ ---
6
+
7
+ # Research Prototype Style
8
+
9
+ Target author: graduate student, working alone, optimizing for clarity and debuggability. NOT for production or large-scale deployment.
10
+
11
+ ## Do
12
+
13
+ - **Simple, linear control flow** — avoid nested callbacks, deep inheritance
14
+ - **Minimal abstraction** — prefer direct implementation over factories/adapters
15
+ - **Inline comments** — explain *why*, not *what*
16
+ - **Strategic debug** — a few print/debug statements, preferably behind a flag (e.g. `if DEBUG:`)
17
+ - **Small duplication OK** — if it improves readability over DRY
18
+
19
+ ## Avoid
20
+
21
+ - Excessive logging or print statements
22
+ - Heavy configuration systems (YAML configs, env-driven feature flags)
23
+ - Comprehensive unit test suites (targeted tests for critical paths only)
24
+ - Complex class hierarchies, abstract base classes
25
+ - Premature optimization
26
+
27
+ ## When in Doubt
28
+
29
+ Prefer simplicity over elegance. Code should look like something a careful graduate student would realistically write and maintain.
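The "strategic debug" bullet above can be illustrated with a short sketch: flat control flow, a why-comment, and prints gated behind a flag. The `DEBUG` flag and `load_books` function are hypothetical examples of the style, not code from this repository:

```python
# Sketch of the "strategic debug" style the rule describes.
# DEBUG and load_books() are hypothetical illustrations, not repo code.
import os

DEBUG = os.environ.get("DEBUG", "0") == "1"

def load_books(path):
    # Why: rows with no ISBN break downstream joins, so drop them early.
    rows = []
    with open(path, encoding="utf-8") as f:
        header = f.readline().rstrip("\n").split(",")
        for line in f:
            row = dict(zip(header, line.rstrip("\n").split(",")))
            if row.get("isbn"):
                rows.append(row)
    if DEBUG:
        print(f"load_books: kept {len(rows)} rows")
    return rows
```

Toggle the extra output with `DEBUG=1 python script.py` — no logging framework or config system needed.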
.dockerignore CHANGED
@@ -26,3 +26,6 @@ build/
26
  dist/
27
  eggs/
28
  .eggs/
29
+
30
+ # Legacy (not used in main flow)
31
+ legacy/
PHASE_2_DEVELOPMENT.md DELETED
@@ -1,518 +0,0 @@
1
- # Phase 2: Personalization & React UI Migration
2
-
3
- **Date:** January 2026
4
- **Status:** ✅ Complete & Deployed
5
-
6
- ---
7
-
8
- ## Overview
9
-
10
- This phase shifted the project from a basic semantic book recommender to an **intelligent, personalized discovery platform** with a modern React frontend. The vision evolved from marketplace/swap features to a focused **recommendation engine grounded in user preferences and persona-driven insights**.
11
-
12
- ---
13
-
14
- ## Phase Vision & Direction
15
-
16
- ### Initial Pivot (from conversation)
17
- - **Original concept:** Second-hand book marketplace/swap platform
18
- - **User feedback:** Focus on recommendation engine first, then expand
19
- - **Final direction:** Keep it recommendation-only with two new pillars:
20
- 1. **Favorites** → persistent user library tracking
21
- 2. **Personalized Highlights** → AI-generated selling points based on user taste
22
-
23
- ### Core Philosophy
24
- > "Books that understand you. Recommendations grounded in what you love."
25
-
26
- The system learns from your reading preferences and surfaces books that match both the search query AND your unique taste profile.
27
-
28
- ---
29
-
30
- ## What Was Built
31
-
32
- ### 1. **Backend Personalization Layer** (`src/`)
33
-
34
- #### A. User Favorites Storage
35
- - **File:** `src/user/profile_store.py`
36
- - **Mechanism:** JSON-based persistence (`data/user_profiles.json`)
37
- - **Features:**
38
- - `add_favorite(user_id, isbn)` → idempotent add + deduplicate
39
- - `list_favorites(user_id)` → retrieve user's library
40
- - Works with any user_id (default: "local" for single-user dev)
41
-
42
- #### B. User Persona Aggregation
43
- - **File:** `src/marketing/persona.py`
44
- - **Input:** List of favorite ISBNs + book metadata DataFrame
45
- - **Output:** `{ summary, top_authors[], top_categories[] }`
46
- - **Algorithm:**
47
- 1. Fetch metadata for all favorited books
48
- 2. Extract top 3 authors (by frequency)
49
- 3. Extract top 3 categories
50
- 4. Generate natural language summary combining signals
51
- - Example: *"您钟爱悬疑与科幻,偏好国际视野的作品。"* (You love mystery & sci-fi, prefer international perspectives)
52
-
53
- #### C. Personalized Highlights Generator
54
- - **File:** `src/marketing/highlights.py`
55
- - **Input:** ISBN + user persona + book metadata
56
- - **Output:** `{ title, authors, category, highlights[], persona_summary }`
57
- - **Generation Strategy:**
58
- - Match persona themes to book content (author, category, description)
59
- - Extract 3-5 contextual selling points
60
- - Combine rule-based matching + description parsing
61
- - Example output:
62
- ```
63
- - 作者获国际奖项,契合您对国际视野的热爱
64
- - 悬疑与科幻的完美融合,正是您的最爱组合
65
- - 情节紧凑,适合您快节奏阅读的偏好
66
- ```
67
-
68
- ### 2. **FastAPI Backend Integration** (`src/main.py`)
69
-
70
- **Three New Endpoints:**
71
-
72
- ```python
73
- POST /favorites/add
74
- Request: { user_id: str, isbn: str }
75
- Response: { status: "ok", favorites_count: int }
76
-
77
- GET /user/{user_id}/persona
78
- Response: { user_id, favorites: [], persona: {...} }
79
-
80
- POST /marketing/highlights
81
- Request: { isbn: str, user_id?: str }
82
- Response: { persona, highlights: [], meta: {...} }
83
- ```
84
-
85
- **CORS Support:**
86
- - Enabled for localhost:5173 (React dev), 3000 (alt dev), 8080
87
- - Allows frontend to access backend without restrictions
88
-
89
- ---
90
-
91
- ### 3. **Modern React UI** (`web/`)
92
-
93
- #### Architecture
94
- - **Build Tool:** Vite (ultra-fast dev server, ~200ms startup)
95
- - **Styling:** Tailwind CSS (CDN-based, no build required)
96
- - **Icons:** lucide-react (modern SVG icons)
97
- - **State Management:** React Hooks (useState only, no Redux)
98
-
99
- #### Design: "纸间留白" (Paper Shelf)
100
- A literary, minimalist aesthetic inspired by:
101
- - Japanese minimalism (留白 = leaving white space)
102
- - Second-hand bookstore vibes
103
- - Serif typography (font-serif)
104
- - Muted earth tones: `#b392ac` (mauve), `#f4acb7` (peach), `#faf9f6` (cream)
105
-
106
- #### Core Features
107
-
108
- **1. Discovery Tab (Default View)**
109
- ```
110
- ┌─────────────────────────────────┐
111
- │ 纸间留白 │ Header + toggle "私人书斋"
112
- ├─────────────────────────────────┤
113
- │ 墨色余温·灵魂契合 (if favorites) │ Smart carousel of alma-mate books
114
- ├─────────────────────────────────┤
115
- │ [Search] [Category▼] [Mood▼] │ Semantic search + filters
116
- │ 开启发现之旅 (Start Discovery) │
117
- ├─────────────────────────────────┤
118
- │ [Book 1] [Book 2] [Book 3] ... │ 5-column responsive grid
119
- │ (hover shows ai-generated hint) │
120
- └─────────────────────────────────┘
121
- ```
122
-
123
- **2. Book Detail Modal**
124
- ```
125
- ┌─────────────────────────────────┐
126
- │ [Close] │
127
- ├──────────────┬──────────────────┤
128
- │ Cover │ Title │
129
- │ ISBN │ Highlights │
130
- │ Score ★★★★★ │ Description │
131
- │ │ Chat Interface │
132
- │ │ [Add to Library] │
133
- └──────────────┴──────────────────┘
134
- ```
135
-
136
- **3. Private Library ("私人书斋")**
137
- - Toggle view to see only favorited books
138
- - Shows reading statistics (mood distribution)
139
- - Same gallery grid + detail modal
140
-
141
- **4. Chat Interface (in modal)**
142
- - Suggested questions tied to book context
143
- - User messages vs AI responses styled differently
144
- - AI grounded to book metadata (not LLM-based yet)
145
-
146
- #### API Integration
147
- All four key flows wired to backend:
148
-
149
- ```javascript
150
- // Search → Recommendation
151
- startDiscovery() → recommend(query, category, tone)
152
-
153
- // Select book → Load highlights
154
- openBook(book) → getHighlights(isbn)
155
-
156
- // Add to collection
157
- toggleCollect(book) → addFavorite(isbn)
158
-
159
- // (Future) Refresh persona
160
- persona = getPersona(userId)
161
- ```
162
-
163
- ---
164
-
165
- ## End-to-End Flow
166
-
167
- ### User Journey: "Discovery to Collection"
168
-
169
- ```
170
- 1. User enters search query + filters
171
-
172
- 2. startDiscovery() calls POST /recommend
173
- → FastAPI semantic search + tone filtering
174
- → Returns top N books with thumbnails
175
-
176
- 3. Books render in grid (hover shows AI hint)
177
-
178
- 4. User clicks book → openBook()
179
- → Calls POST /marketing/highlights
180
- → Gets persona + 3-5 personalized selling points
181
- → Modal shows all details + chat
182
-
183
- 5. User clicks "加入藏书馆" (Add to Collection)
184
- → Calls POST /favorites/add
185
- → Updates myCollection state
186
- → Next search shows "灵魂契合" carousel (matched books)
187
-
188
- 6. User clicks "私人书斋" to view collection
189
- → Filters books to only favorites
190
- → Shows reading persona stats
191
- ```
192
-
193
- ---
194
-
195
- ## Technical Decisions
196
-
197
- ### Why JSON for Favorites (not SQLite)?
198
- - **Rationale:** Single-user dev focus, rapid iteration
199
- - **Trade-off:** 11k books × metadata in one file = acceptable overhead
200
- - **Future:** Easy migration to PostgreSQL when scaling to multi-user
201
-
202
- ### Why No LLM for Highlights?
203
- - **Rationale:** Keep system lightweight, deterministic, fast
204
- - **Method:** Rule-based persona matching (Top-3 authors/categories)
205
- - **Future:** Could upgrade to LLM refinement (e.g., GPT for polish)
206
-
207
- ### Why React + Vite (not Gradio)?
208
- - **Rationale:**
209
- - Gradio good for prototypes, React needed for custom UX
210
- - Vite super fast (no webpack pain)
211
- - Tailwind CDN avoids npm build complexity
212
- - **Fallback:** Gradio UI (app.py) still available on port 7860
213
-
214
- ### Why Persona from Favorites (not search history)?
215
- - **Rationale:** User intent explicit in favorites, not implicit in queries
216
- - **Semantics:** "Add to collection" = explicit preference signal
217
- - **Advantage:** Works offline, no tracking/privacy concerns
218
-
219
- ---
220
-
221
- ## Architecture Diagram
222
-
223
- ```
224
- ┌──────────────────────────────────────────────────────┐
225
- │ FRONTEND (React) │
226
- │ web/ → Vite dev server (localhost:5173) │
227
- │ ┌────────────────────────────────────────────────┐ │
228
- │ │ App.jsx │ │
229
- │ │ - SearchBar (query, category, mood) │ │
230
- │ │ - Gallery (books grid) │ │
231
- │ │ - DetailModal (title, highlights, chat) │ │
232
- │ │ - MyCollection (favorites view) │ │
233
- │ └────────────────────────────────────────────────┘ │
234
- │ api.js → Fetch wrappers (recommend, highlights...) │
235
- └──────────────────────────────────────────────────────┘
236
-
237
- HTTP/CORS
238
-
239
- ┌──────────────────────────────────────────────────────┐
240
- │ BACKEND (FastAPI) │
241
- │ src/main.py → uvicorn (localhost:6006) │
242
- │ ┌────────────────────────────────────────────────┐ │
243
- │ │ GET /health │ │
244
- │ │ POST /recommend (query, category, tone) │ │
245
- │ │ GET /categories, /tones │ │
246
- │ │ ┌──────────────────────────────────────────┐ │ │
247
- │ │ │ NEW: POST /favorites/add │ │ │
248
- │ │ │ NEW: GET /user/{id}/persona │ │ │
249
- │ │ │ NEW: POST /marketing/highlights │ │ │
250
- │ │ └──────────────────────────────────────────┘ │ │
251
- │ └────────────────────────────────────────────────┘ │
252
- └──────────────────────────────────────────────────────┘
253
- ↓ ↓
254
- ┌─────────────┐ ┌──────────────────┐
255
- │ ChromaDB │ │ User Profiles │
256
- │ (11k docs) │ │ (JSON file) │
257
- │ ↓ │ │ ↓ │
258
- │ Vector │ │ Favorites + │
259
- │ Embeddings │ │ Persona │
260
- └─────────────┘ └──────────────────┘
261
-
262
- ┌─────────────────────────────────┐
263
- │ Books Metadata (CSV) │
264
- │ - title, authors, description │
265
- │ - isbn, category, rating │
266
- │ - emotion scores (joy/sad/etc) │
267
- └─────────────────────────────────┘
268
- ```
269
-
270
- ---
271
-
272
- ## Key Data Models
273
-
274
- ### User Profile (JSON)
275
- ```json
276
- {
277
- "local": {
278
- "favorites": [
279
- { "isbn": "9780451524935", "title": "1984", "added_at": "2026-01-06" },
280
- { "isbn": "9780061120084", "title": "To Kill a Mockingbird", "added_at": "2026-01-06" }
281
- ]
282
- }
283
- }
284
- ```
285
-
286
- ### Book Recommendation Response
287
- ```json
288
- {
289
- "recommendations": [
290
- {
291
- "isbn": "9780451524935",
292
- "title": "1984",
293
- "authors": "George Orwell",
294
- "description": "A dystopian novel...",
295
- "thumbnail": "https://covers.openlibrary.org/...",
296
- "caption": "(auto-generated short hint)"
297
- }
298
- ]
299
- }
300
- ```
301
-
302
- ### Highlights Response
303
- ```json
304
- {
305
- "persona": {
306
- "summary": "您钟爱悬疑与科幻,偏好国际视野的作品。",
307
- "top_authors": ["Agatha Christie", "Isaac Asimov"],
308
- "top_categories": ["Mystery", "Science Fiction"]
309
- },
310
- "highlights": [
311
- "国际推理大师之作,契合您的悬疑偏好",
312
- "心理扭转的情节设计,适合您快节奏阅读",
313
- "深层人性反思,引发思考"
314
- ],
315
- "meta": {
316
- "title": "And Then There Were None",
317
- "authors": "Agatha Christie",
318
- "category": "Mystery",
319
- "description": "..."
320
- }
321
- }
322
- ```
323
-
324
- ---
325
-
326
- ## Running the System
327
-
328
- ### Development Mode (3 services)
329
-
330
- **Terminal 1: FastAPI Backend**
331
- ```bash
332
- cd /Users/ymlin/Downloads/003-Study/138-Projects/book-rec-with-LLMs
333
- make run
334
- # Starts on http://localhost:6006
335
- # Loads 11k books into ChromaDB
336
- # Initializes metrics, routes
337
- ```
338
-
339
- **Terminal 2: React Frontend**
340
- ```bash
341
- cd web
342
- npm run dev
343
- # Starts on http://localhost:5173
344
- # Hot reload on file changes
345
- # Connect to http://localhost:6006 backend
346
- ```
347
-
348
- **Terminal 3 (Optional): Gradio Legacy UI**
349
- ```bash
350
- python app.py
351
- # Starts on http://localhost:7860
352
- # Alternative UI for testing
353
- ```
354
-
355
- ### Production Workflow
356
- - React builds with `npm run build` → static files
357
- - FastAPI serves as single backend
358
- - Deploy as Docker containers (see DEPLOYMENT.md)
359
-
360
- ---
361
-
362
- ## Testing the Features
363
-
364
- ### 1. Test Semantic Search
365
- ```
366
- Input: "悬疑推理小说,节奏快"
367
- Expected: Agatha Christie, Sherlock Holmes, modern thrillers
368
- ```
369
-
370
- ### 2. Test Favorites → Persona
371
- ```
372
- 1. Add 5 books to collection (mix of genres)
373
- 2. Click a new book
374
- 3. Check highlights mention added books' authors/categories
375
- ✓ Persona should reflect your choices
376
- ```
377
-
378
- ### 3. Test Persona-Based Highlights
379
- ```
380
- If you favorite: [Sci-Fi, Mystery, Literary]
381
- Then recommend: Horror book X
382
- Expected highlight: "虽不在您常读类型,但情节深度与科幻的想象力结合..."
383
- (Acknowledges taste + bridges to new territory)
384
- ```
385
-
386
- ---
387
-
388
- ## Future Enhancements
389
-
390
- ### Phase 3: Recommendations (Backlog)
391
-
392
- **1. LLM-Powered Highlights**
393
- - Use Claude/GPT to refine rule-based highlights
394
- - Natural language refinement (currently ~70% rule-based quality)
395
- - Cache per (user_id, isbn) pair for speed
396
-
397
- **2. Emotional Resonance Scoring**
398
- - Leverage emotion embeddings (joy/sadness/fear/anger/surprise) in metadata
399
- - Recommend books matching user's current mood signal
400
- - "What are you feeling today?" filter
401
-
402
- **3. Multi-User Accounts**
403
- - Migrate from JSON to SQLite/PostgreSQL
404
- - User authentication (OAuth)
405
- - Social features (share collections, compare tastes)
406
-
407
- **4. Advanced Search**
408
- - Author-to-author recommendations ("If you like X, try Y's style")
409
- - Time-based recommendations ("What to read this season?")
410
- - Combination search (mood + timeframe + word-count)
411
-
412
- **5. Analytics Dashboard**
413
- - Show user: "You've read 15 books in the mystery genre"
414
- - Predict next book based on reading history
415
- - Genre comfort zone vs stretch zones
416
-
417
- ---
418
-
419
- ## Phase Reflection
420
-
421
- ### What Worked Well
422
- ✅ **Modular backend design** → easy to add /highlights, /persona endpoints
423
- ✅ **React UI responsiveness** → users see results instantly
424
- ✅ **JSON-first approach** → no DB setup friction, iterate fast
425
- ✅ **API-driven architecture** → Gradio + React both work
426
- ✅ **Persona concept** → users feel "understood" by the system
427
-
428
- ### Challenges Overcome
429
- 🔧 **Port conflicts** (Gradio:7860 vs React:5173 vs FastAPI:6006) → Makefile organization
430
- 🔧 **CORS issues** (frontend can't reach backend) → Added CORSMiddleware
431
- 🔧 **Image loading** (external URLs not allowed in Gradio) → Runtime fetching + local fallback
432
- 🔧 **Timeout errors** (cold startup > 10s) → Increased client timeouts, optimized startup
433
-
434
- ### Design Philosophy Validated
435
- The shift from "marketplace" → "recommendation + personalization" was right because:
436
- 1. **Clear unique value:** Persona-aware recommendations don't exist in typical bookstores
437
- 2. **Tight scope:** Focused on one thing (smart discovery) vs scattered marketplace features
438
- 3. **User empathy:** People want to be understood, not just transact
439
-
440
- ---
441
-
442
- ## Code Structure Summary
443
-
444
- ```
445
- book-rec-with-LLMs/
446
- ├── src/
447
- │ ├── main.py # FastAPI app + 3 new endpoints
448
- │ ├── recommender.py # Semantic search core
449
- │ ├── vector_db.py # ChromaDB wrapper
450
- │ ├── cache.py # Image caching
451
- │ ├── user/
452
- │ │ └── profile_store.py # ✨ NEW: Favorites JSON storage
453
- │ └── marketing/
454
- │ ├── persona.py # ✨ NEW: Persona aggregation
455
- │ ├── highlights.py # ✨ NEW: Highlight generation
456
- │ └── guardrails.py # Safety checks (stub)
457
- ├── web/ # ✨ NEW: React Vite app
458
- │ ├── src/
459
- │ │ ├── App.jsx # Main component + state
460
- │ │ ├── api.js # Fetch wrappers
461
- │ │ └── main.jsx # Entry point
462
- │ ├── index.html # HTML + Tailwind CDN
463
- │ └── package.json # Dependencies
464
- ├── app.py # Gradio UI (legacy)
465
- ├── Makefile # Commands
466
- ├── requirements.txt # Python deps
467
- └── data/
468
- ├── books_processed.csv # Metadata
469
- └── user_profiles.json # ✨ NEW: User data
470
- ```
471
-
472
- ---
473
-
474
- ## Commit Message
475
- ```
476
- feat: add React UI and backend personalization features
477
-
478
- - Create modern React UI (web/) with 纸间留白 design
479
- * Semantic search + favorites + detail modal
480
- * Tailwind CSS + lucide-react
481
- * Vite dev server on port 5173
482
-
483
- - Implement user personalization:
484
- * src/user/profile_store.py: JSON favorites
485
- * src/marketing/persona.py: User taste aggregation
486
- * src/marketing/highlights.py: Persona-aware selling points
487
- * 3 new API endpoints in FastAPI
488
-
489
- - Maintain Gradio UI as optional fallback
490
- - Add CORS support, update timeouts, improve infrastructure
491
- ```
492
-
493
- ---
494
-
495
- ## How to Continue
496
-
497
- ### If you want to test now:
498
- 1. `make run` (starts backend)
499
- 2. `cd web && npm run dev` (starts React UI)
500
- 3. Visit http://localhost:5173
501
- 4. Search for a book → click results → "加入藏书馆" → see persona highlights
502
-
503
- ### If you want to refine:
504
- - Adjust persona algorithm in `src/marketing/persona.py`
505
- - Tweak UI colors/layout in `web/src/App.jsx`
506
- - Add more rules to highlights in `src/marketing/highlights.py`
507
-
508
- ### If you want to scale:
509
- - Migrate to PostgreSQL (users table + favorites relationship)
510
- - Add user auth (FastAPI auth middleware)
511
- - Deploy with Docker + cloud (see DEPLOYMENT.md)
512
-
513
- ---
514
-
515
- **Status:** ✅ **Ready to Deploy**
516
-
517
- Next phase can focus on: multi-user support, LLM refinement, analytics, or social features.
518
-
README.md CHANGED
@@ -7,221 +7,70 @@ app_port: 8000
7
 
8
  # Intelligent Book Recommendation System
9
 
10
- > **Frozen at v2.6.0**This project is in maintenance mode for portfolio use. No new features, experiments, or optimizations. Documentation and bug fixes only.
11
 
12
- > A production-grade **Agentic RAG + RecSys** platform combining semantic search, personalized recommendations, and conversational AI.
13
 
14
- ## Highlights
15
 
16
- | Component | Technology | Achievement |
17
- |:---|:---|:---|
18
- | **Semantic Search** | ChromaDB + MiniLM-L6 | Sub-300ms retrieval on 200K+ books |
19
- | **Agentic Router** | Rule-based intent classification | 4 dynamic strategies (BM25, Hybrid, Rerank, Small-to-Big) |
20
- | **Personalized Rec** | 7-channel recall + LGBMRanker + Stacking | HR@10: 0.4545, MRR@5: 0.2893 |
21
- | **Conversational AI** | RAG + OpenAI/Ollama | Real-time streaming (Default: Local Ollama) |
22
 
23
- ---
24
 
25
- ## System Architecture
26
 
27
- ```
28
- ┌─────────────────────────────────────────────────────────────────┐
29
- │ Frontend (React)
30
- │ Search UI My Bookshelf Chat Recommendations │
31
- └──────────────────────────┬──────────────────────────────────────┘
32
- │ REST + SSE
33
- ┌──────────────────────────▼──────────────────────────────────────┐
34
- │ FastAPI Backend │
35
- │ ┌─────────────┐ ┌──────────────┐ ┌───────────────────────┐ │
36
- │ │ Query Router│→ │ RAG Pipeline │→ │ Personalized RecSys │ │
37
- │ └─────────────┘ └──────────────┘ └───────────────────────┘ │
38
- │ │ │ │ │
39
- │ Intent Class Hybrid Search 7-Channel Recall + RRF │
40
- │ (ISBN/Keyword + Cross-Encoder (ItemCF + UserCF + Swing │
41
- │ /Complex) Reranking + SASRec + Item2Vec...) │
42
- └──────────────────────────┬──────────────────────────────────────┘
43
-
44
- ┌──────────────────┼──────────────────┐
45
- ▼ ▼ ▼
46
- ┌─────────┐ ┌───────────┐ ┌──────────────┐
47
- │ChromaDB │ │LGBMRanker │ │ LLM Provider │
48
- │(Vectors)│ │(LambdaRank│ │ (Chat/Recs) │
49
- └─────────┘ └───────────┘ └──────────────┘
50
- ```
51
 
52
- ---
53
 
54
- ## Key Features
55
-
56
- ### 1. Agentic RAG with Dynamic Routing
57
- - **Query Intent Classification**: Automatically routes queries to optimal retrieval strategy
58
- - ISBN → Pure BM25 (100% precision)
59
- - Keywords → Hybrid Search (BM25 + Dense)
60
- - Complex queries → Cross-Encoder Reranking
61
- - Detail queries → Small-to-Big Retrieval (788K indexed sentences)
62
-
63
- ### 2. Personalized Recommendation Engine
64
- - **7-Channel Recall**: ItemCF (direction-weighted), UserCF, Swing, SASRec, Item2Vec, YoutubeDNN, Popularity
65
- - **RRF Fusion**: Reciprocal Rank Fusion merges candidates across all recall channels
66
- - **SASRec Sequential Model**: 64-dim Transformer embeddings (30 epochs), used as both recall source and ranking feature
67
- - **Model Stacking**: LGBMRanker (LambdaRank) + XGBClassifier → LogisticRegression meta-learner
68
- - **LGBMRanker (LambdaRank)**: Directly optimizes NDCG with 17 engineered features and hard negative sampling
69
- - **Evaluation**: HR@10 = 0.4545, MRR@5 = 0.2893 (n=2000, Leave-Last-Out, title-relaxed)
70
-
71
- ### 3. My Bookshelf (User Library)
72
- - **Rating System**: 5-star rating with persistence
73
- - **Reading Status**: Want to Read / Reading / Finished
74
- - **Statistics Dashboard**: Visual progress tracking
75
-
76
- ### 4. Conversational Shopping Assistant
77
- - **RAG-Grounded Responses**: Context from ChromaDB reduces hallucinations
78
- - **Streaming Output**: Real-time token streaming via SSE
79
- - **Flexible LLM**: Defaults to local **Ollama** (free/privacy), supports **OpenAI API** for production
80
 
81
- ---
 
82
 
83
  ## Quick Start
84
 
85
- ### Prerequisites
86
- - Python 3.10+ with Conda
87
- - Node.js 18+
88
-
89
- ### Installation
90
-
91
  ```bash
92
- # Clone and setup environment
93
  git clone https://github.com/sylvia-ymlin/book-rec-with-LLMs.git
94
  cd book-rec-with-LLMs
95
- conda env create -f environment.yml
96
- conda activate book-rec
97
-
98
- # Initialize vector database (first run)
99
- python src/init_db.py
100
 
101
- # Start API server
102
- make run # http://localhost:6006
 
103
 
104
- # Start frontend (new terminal)
105
- cd web && npm install && npm run dev # http://localhost:5173
106
  ```
107
 
108
- ### LLM Configuration
109
 
110
- | Provider | Setup | Use Case |
111
- |:---|:---|:---|
112
- | **Ollama** | `ollama pull llama3` | Free, local dev |
113
- | **OpenAI** | Set API key in UI Settings | Production |
114
-
115
- ---
116
-
117
- ## API Endpoints
118
-
119
- | Endpoint | Method | Description |
120
- |:---|:---|:---|
121
- | `/recommend` | POST | Semantic search with emotion/category filters |
122
- | `/api/recommend/personal` | GET | Personalized recommendations (RecSys) |
123
- | `/favorites/add` | POST | Add book to collection |
124
- | `/favorites/update` | PUT | Update rating/reading status |
125
- | `/user/{id}/stats` | GET | Reading statistics |
126
- | `/chat/completions` | POST | RAG-powered chat (streaming) |
127
- | `/health` | GET | Service health check |
128
-
129
- ---
130
-
131
- ## Project Structure
132
-
133
- ```
134
- src/
135
- ├── main.py # FastAPI application
136
- ├── recommender.py # RAG search orchestration
137
- ├── vector_db.py # ChromaDB wrapper
138
- ├── core/
139
- │ ├── router.py # Agentic query routing
140
- │ └── reranker.py # Cross-encoder reranking
141
- ├── recall/
142
- │ ├── itemcf.py # ItemCF with direction weight
143
- │ ├── usercf.py # UserCF (Jaccard + activity penalty)
144
- │ ├── swing.py # Swing (user-pair overlap weighting)
145
- │ ├── sasrec_recall.py # SASRec embedding dot-product recall
146
- │ ├── item2vec.py # Item2Vec (Word2Vec on sequences)
147
- │ ├── youtube_dnn.py # YoutubeDNN two-tower recall
148
- │ ├── popularity.py # Popularity with time decay
149
- │ └── fusion.py # RRF fusion of all channels
150
- ├── ranking/
151
- │ └── features.py # 17 ranking features
152
- ├── services/
153
- │ └── recommend_service.py # Recall → Rank → Dedup pipeline
154
- └── user/ # User profile storage
155
-
156
- web/
157
- ├── src/App.jsx # React UI
158
- └── src/api.js # API client
159
-
160
- scripts/
161
- ├── model/
162
- │ ├── train_sasrec.py # SASRec sequential model training
163
- │ ├── build_recall_models.py # ItemCF, UserCF, Swing, Popularity
164
- │ ├── train_ranker.py # LGBMRanker with hard negative sampling
165
- │ └── evaluate.py # HR@10, MRR@5 evaluation
166
- ├── deploy/ # Server deployment scripts
167
- └── data/ # Data processing pipelines
168
- ```
169
 
170
- ---
171
-
172
- ## Performance
173
-
174
-
175
- ### Recommendation Metrics (v2.6.0)
-
- | Metric | V2.5 | v2.6.0 | Method |
- |:---|:---|:---|:---|
- | **Hit Rate@10** | 0.2205 | **0.4545** | Leave-Last-Out, n=2000 |
- | **MRR@5** | 0.1584 | **0.2893** | Title-relaxed matching |
-
- v2.6.0 Key Enhancements:
- - **Item2Vec Recall**: Word2Vec (Skip-gram) on user interaction sequences, 7th recall channel.
- - **Model Stacking**: LGBMRanker + XGBClassifier → LogisticRegression meta-learner.
- - **Engineering**: Vectorized inference, non-blocking I/O, metadata singleton.
-
- | Dataset | Size |
- |:---|:---|
- | Training Set | 1,079,966 interactions |
- | Active Users | 167,968 |
- | Books | 221,998 |
-
- ### Latency Benchmarks
- | Operation | P50 Latency |
|:---|:---|
- | **Exact Search** | ~19ms |
- | **Hybrid Search** | ~230ms |
- | **Reranked Search** | ~710ms |
- | **Personal Rec (warm)** | ~19ms |
-
- ---
-
- ## Project Documentation
-
- | Document | Description |
- |:---|:---|
- | [Memory Optimization Report](docs/memory_optimization.md) | Engineering overhaul: Zero-RAM SQLite architecture |
- | [Experiment Archive](docs/experiments/experiment_archive.md) | All experimental results from V1.0 to v2.6.0 |
- | [Performance Debugging Report](docs/performance_debugging_report.md) | Root cause analysis of evaluation issues |
- | [Roadmap](docs/roadmap.md) | Technical evolution (v2.6.0 frozen, V3.0 planned) |
- | [Technical Report](docs/technical_report.md) | System architecture deep dive |
- | [Build Guide](docs/build_guide.md) | Build and deployment instructions |
-
- ## References
-
- 1. Kang, W., & McAuley, J. (2018). *Self-Attentive Sequential Recommendation*. ICDM.
- 2. Reimers, N., & Gurevych, I. (2019). *Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks*.
- 3. Ke, G., et al. (2017). *LightGBM: A Highly Efficient Gradient Boosting Decision Tree*. NeurIPS.
- 4. Gao, L., et al. (2022). *Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE)*.
- 5. Yang, J., et al. (2020). *Large-scale Product Graph Construction for Recommendation in E-commerce* (Swing algorithm).
-
- ---
 
## License

MIT
-
 

# Intelligent Book Recommendation System

+ *Frozen at v2.6.0 — maintenance mode for portfolio use.*
+
+ ## Problem
+
+ Readers often can't articulate what they want. Can one system both understand their vague descriptions and give personalized recommendations based on their reading history?
+
+ ## Method
+
+ Two parallel threads: **RAG** (Agentic Router → Hybrid Search → Reranking) for understanding vague queries; **RecSys** (7-channel recall → LGBMRanker → Stacking) for personalized recommendations from reading history.
+
+ ## Key Experiments
+
+ | Experiment | Before | After | Conclusion |
+ |:---|:---|:---|:---|
+ | **RAG: Exact match** | Pure vector search, ISBN → 0% recall | Hybrid (BM25 + Dense) + Router → 100% | Vector-only fails on exact entities; BM25 + routing fixes it |
+ | **RAG: Keyword intent** | "Harry Potter" → Philosophy book | Reranked → Sorcerer's Stone | Cross-encoder corrects semantic drift |
+ | **RecSys: Personalization** | Baseline 0.138 HR@10 | Item2Vec + LGBMRanker + Stacking → **0.4545** HR@10 | 7-channel recall + LambdaRank + ensemble beats single model |
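The Hybrid (BM25 + Dense) result above depends on how the two rankings are merged; the project's `fusion.py` applies Reciprocal Rank Fusion (RRF) across recall channels, and the same scheme is the standard way to fuse BM25 and dense hit lists. A minimal sketch, with hypothetical ISBN lists and the conventional k=60 smoothing constant:

```python
# Reciprocal Rank Fusion: each list contributes 1/(k + rank) per item.
# The item IDs are hypothetical; k=60 is the conventional constant.
def rrf_fuse(ranked_lists, k=60):
    scores = {}
    for ranking in ranked_lists:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["isbn_A", "isbn_B", "isbn_C"]   # keyword channel
dense_hits = ["isbn_B", "isbn_C", "isbn_A"]  # vector channel
print(rrf_fuse([bm25_hits, dense_hits]))     # → ['isbn_B', 'isbn_A', 'isbn_C']
```

Items ranked well by both lists float to the top without any score calibration, which is why RRF is robust to channels with incomparable score scales.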
 
+ *Evaluation: Leave-Last-Out, n=2000, title-relaxed. HR@10 = 0.4545, MRR@5 = 0.2893.*
+
+ ## Architecture
+
+ ```
+      Query                  No Query
+        │                        │
+        ▼                        ▼
+ ┌─────────────┐          ┌─────────────┐
+ │  RAG Path   │          │ RecSys Path │
+ │  Router →   │          │  7-Channel  │
+ │  Hybrid →   │          │  Recall →   │
+ │  Rerank     │          │  LGBMRanker │
+ └─────────────┘          └─────────────┘
+        │                        │
+        └────────────┬───────────┘
+                     ▼
+              Top-K Results
+ ```

## Quick Start

```bash
git clone https://github.com/sylvia-ymlin/book-rec-with-LLMs.git
cd book-rec-with-LLMs
+ conda env create -f environment.yml && conda activate book-rec
+
+ # First run (or use `make data-pipeline` for a full build)
+ python src/init_db.py                 # Chroma vector DB
+ python scripts/init_sqlite_db.py      # SQLite metadata (local build)
+
+ make run                              # API: http://localhost:6006
+ cd web && npm install && npm run dev  # UI: http://localhost:5173
```

+ **LLM**: Defaults to Ollama (`ollama pull llama3`); set an OpenAI API key in UI Settings for production use.

+ ## Documentation

+ | Doc | Purpose |
|:---|:---|
+ | [Technical Report](docs/TECHNICAL_REPORT.md) | Architecture, design decisions |
+ | [Experiment Archive](docs/experiments/experiment_archive.md) | Full experiment log (V1.0 → v2.6.0) |
+ | [Interview Guide](docs/interview_guide.md) | Q&A, STAR cases |
+ | [Build Guide](docs/build_guide.md) | Deployment instructions |

## License

MIT
 
REVIEW_HIGHLIGHTS.md DELETED
@@ -1,142 +0,0 @@
1
- # Review Highlights Feature
2
-
3
- ## Overview
4
-
5
- Added semantic sentence extraction to display representative reader reviews for each book. This feature enhances book discovery by showcasing authentic reader voices.
6
-
7
- ## Implementation
8
-
9
- ### 1. Data Generation (Server-side)
10
-
11
- **Script**: `scripts/extract_review_sentences.py`
12
-
13
- **Process**:
14
- - Splits book descriptions into sentences using regex
15
- - Uses `sentence-transformers/all-MiniLM-L6-v2` for sentence embeddings
16
- - Clusters similar sentences via cosine similarity (threshold: 0.8)
17
- - Extracts representative sentences from each cluster (top 5 per book)
18
- - Stores as semicolon-separated `review_highlights` column in CSV
19
-
20
- **Execution**:
21
- ```bash
22
- # Run in container with GPU
23
- export HF_ENDPOINT=https://hf-mirror.com
24
- python scripts/extract_review_sentences.py \
25
- --input data/books_processed.csv \
26
- --output data/books_processed.csv \
27
- --top-n 5 \
28
- --similarity-threshold 0.8 \
29
- --device 0 \
30
- --batch-size 128
31
- ```
32
-
33
- **Performance**: ~17 minutes for 222k books on GPU (211 it/s)
34
-
35
- ### 2. Backend Integration
36
-
37
- **Files Modified**:
38
- - `src/recommender.py`: Parse `review_highlights` from CSV, split by semicolon
39
- - `src/main.py`: Add `review_highlights: List[str]` to `BookResponse` model
40
-
41
- **Code**:
42
- ```python
43
- # Parse review highlights from semicolon-separated string
44
- highlights_raw = str(row.get("review_highlights", "")).strip()
45
- review_highlights = [h.strip() for h in highlights_raw.split(";") if h.strip()]
46
- ```
47
-
48
- ### 3. Frontend Display
49
-
50
- **File**: `web/src/App.jsx`
51
-
52
- **Location**: Left column, bottom section (below Rating/Mood)
53
-
54
- **Features**:
55
- - Displays up to 3 representative sentences
56
- - Bullet-point format with `-` prefix
57
- - Complete sentences: `- "[sentence]"`
58
- - Incomplete sentences: `- "...[sentence]"` (auto-detected via regex `/^[A-Z]/`)
59
- - Styling: 10px italic gray text
60
-
61
- **Layout**:
62
- ```jsx
63
- {selectedBook.review_highlights && selectedBook.review_highlights.length > 0 && (
64
- <div className="w-full mt-auto space-y-2 text-left">
65
- {selectedBook.review_highlights.slice(0, 3).map((highlight, idx) => {
66
- const isCompleteSentence = /^[A-Z]/.test(highlight.trim());
67
- const prefix = isCompleteSentence ? '' : '...';
68
- return (
69
- <p key={idx} className="text-[10px] text-[#666] leading-relaxed italic pl-2">
70
- - "{prefix}{highlight}"
71
- </p>
72
- );
73
- })}
74
- </div>
75
- )}
76
- ```
77
-
78
- ## Related Changes
79
-
80
- ### Rating Display Enhancement
81
-
82
- **Problem**: Hardcoded rating value of 4 stars for all books
83
-
84
- **Solution**:
85
- - Added `average_rating` field to backend API response
86
- - Display format: `4.3` (1 decimal) + filled stars
87
- - Moved rating display into AI highlight box (pink desc_block)
88
-
89
- **Frontend mapping**:
90
- ```javascript
91
- rating: r.average_rating || 0, // Keep float, no rounding
92
- ```
93
-
94
- **Display**:
95
- ```jsx
96
- <span>{selectedBook.rating ? selectedBook.rating.toFixed(1) : '0.0'}</span>
97
- <div className="flex gap-0.5 text-[#f4acb7]">
98
- {[1,2,3,4,5].map(i => <Star key={i} className={`w-3 h-3 ${i <= selectedBook.rating ? 'fill-current' : ''}`} />)}
99
- </div>
100
- ```
101
-
102
- ### Layout Adjustments
103
-
104
- - Grid ratio: 4:8 → 5:7 (more space for left column)
105
- - Rating/Mood: Changed from vertical stack to consolidated display
106
- - Rating moved into desc_block (AI highlight box)
107
- - Review highlights positioned at bottom with `mt-auto`
108
-
109
- ## Data Schema
110
-
111
- **CSV Column**: `review_highlights` (string, semicolon-separated)
112
-
113
- **Example**:
114
- ```
115
- "Having been brought up on the notion...;It transpires, some years ago...;This is a work full of wisdom..."
116
- ```
117
-
118
- **API Response**:
119
- ```json
120
- {
121
- "review_highlights": [
122
- "Having been brought up on the notion that Elizabeth Barrett Browning was the slighter poet...",
123
- "It transpires, some years ago, Clarke hosted two hugely successful British television series...",
124
- "This is a work full of wisdom and unusual perspectives."
125
- ],
126
- "average_rating": 3.716216
127
- }
128
- ```
129
-
130
- ## Notes
131
-
132
- - Review highlights are pre-computed and stored in CSV (no runtime extraction)
133
- - Data file `books_processed.csv` (~243MB) must be regenerated after container rebuild
134
- - Use `scp` to transfer processed CSV back to local machine
135
- - HuggingFace mirror (`HF_ENDPOINT`) required for model download in restricted networks
136
-
137
- ## Future Improvements
138
-
139
- - Cache sentence embeddings to speed up re-generation
140
- - Add sentiment analysis to highlights (positive/critical)
141
- - Filter highlights by relevance to user query
142
- - Display highlight source (verified purchase vs. regular review)
 
TAGS_AND_EMOTIONS.md DELETED
@@ -1,233 +0,0 @@
1
- # Tags and Emotion Scoring
2
-
3
- This document describes the tag generation and emotion scoring features added to enrich book metadata.
4
-
5
- ## Overview
6
-
7
- - **Tags**: Keyword extraction from book descriptions using TF-IDF (5-8 terms per book)
8
- - **Emotion Scores**: Five emotion dimensions (joy, sadness, fear, anger, surprise) computed via transformer model
9
-
10
- ## Data Generation
11
-
12
- ### 1. Tag Generation
13
-
14
- Extracts thematic keywords from aggregated review text.
15
-
16
- **Script**: `scripts/generate_tags.py`
17
-
18
- **Usage**:
19
- ```bash
20
- python scripts/generate_tags.py \
21
- --input data/books_processed.csv \
22
- --output data/books_processed.csv \
23
- --top-n 8
24
- ```
25
-
26
- **Algorithm**:
27
- - TF-IDF vectorization (unigrams + bigrams)
28
- - English stopwords + domain stoplist (e.g., "book", "author", "story")
29
- - Top-N weighted terms per book
30
- - Semicolon-joined storage in `tags` column
31
-
32
- **Parameters**:
33
- - `--top-n`: Max tags per book (default: 8)
34
- - `--max-features`: TF-IDF vocabulary size (default: 60,000)
35
- - `--min-df`: Minimum document frequency (default: 5)
36
- - `--max-df`: Maximum document frequency ratio (default: 0.5)
37
-
38
- ### 2. Emotion Scoring
39
-
40
- Computes emotion intensity scores from book descriptions.
41
-
42
- **Script**: `scripts/generate_emotions.py`
43
-
44
- **Model**: `j-hartmann/emotion-english-distilroberta-base`
45
-
46
- **Usage**:
47
- ```bash
48
- # CPU
49
- python scripts/generate_emotions.py \
50
- --input data/books_processed.csv \
51
- --output data/books_processed.csv \
52
- --batch-size 16
53
-
54
- # Apple GPU (MPS)
55
- python scripts/generate_emotions.py \
56
- --input data/books_processed.csv \
57
- --output data/books_processed.csv \
58
- --batch-size 8 \
59
- --device mps \
60
- --checkpoint 2000 \
61
- --resume
62
- ```
63
-
64
- **Parameters**:
65
- - `--batch-size`: Inference batch size (default: 16)
66
- - `--device`: `mps` (Apple GPU), CUDA device id, or CPU (default)
67
- - `--checkpoint`: Rows between checkpoint writes (default: 5000)
68
- - `--resume`: Skip rows already scored (useful for resuming long runs)
69
- - `--max-rows`: Limit processing to N rows (for testing)
70
-
71
- **Output Columns**:
72
- - `joy`: 0.0–1.0
73
- - `sadness`: 0.0–1.0
74
- - `fear`: 0.0–1.0
75
- - `anger`: 0.0–1.0
76
- - `surprise`: 0.0–1.0
77
-
78
- **Performance**:
79
- - ~1.1 it/s on Apple M-series GPU
80
- - ~7 hours for 222k books (batch_size=8, MPS)
81
- - One-time processing; results persist in CSV
82
-
83
- ## Data Schema
84
-
85
- Updated `books_processed.csv` columns:
86
-
87
- | Column | Type | Description |
88
- |--------|------|-------------|
89
- | `tags` | str | Semicolon-separated keywords (e.g., "irish;travel;humor") |
90
- | `joy` | float | Joy emotion score (0.0–1.0) |
91
- | `sadness` | float | Sadness emotion score (0.0–1.0) |
92
- | `fear` | float | Fear emotion score (0.0–1.0) |
93
- | `anger` | float | Anger emotion score (0.0–1.0) |
94
- | `surprise` | float | Surprise emotion score (0.0–1.0) |
95
-
96
- ## API Integration
97
-
98
- ### Backend Changes
99
-
100
- **File**: `src/recommender.py`
101
-
102
- Added to `_format_results()`:
103
- ```python
104
- # Parse tags
105
- tags_raw = str(row.get("tags", "")).strip()
106
- tags = [t.strip() for t in tags_raw.split(";") if t.strip()] if tags_raw else []
107
-
108
- # Extract emotions
109
- emotions = {
110
- "joy": float(row.get("joy", 0.0)),
111
- "sadness": float(row.get("sadness", 0.0)),
112
- "fear": float(row.get("fear", 0.0)),
113
- "anger": float(row.get("anger", 0.0)),
114
- "surprise": float(row.get("surprise", 0.0)),
115
- }
116
- ```
117
-
118
- **File**: `src/main.py`
119
-
120
- Updated Pydantic model:
121
- ```python
122
- class BookResponse(BaseModel):
123
- isbn: str
124
- title: str
125
- authors: str
126
- description: str
127
- thumbnail: str
128
- caption: str
129
- tags: List[str] = []
130
- emotions: Dict[str, float] = {}
131
- ```
132
-
133
- ### API Response Example
134
-
135
- ```json
136
- {
137
- "recommendations": [
138
- {
139
- "isbn": "0001849883",
140
- "title": "Bury My Bones But Keep My Words",
141
- "authors": "Deborah Savage, Tony Fairman",
142
- "tags": ["paulsen", "otters", "searches", "gary", "brian"],
143
- "emotions": {
144
- "joy": 0.020,
145
- "sadness": 0.004,
146
- "fear": 0.012,
147
- "anger": 0.006,
148
- "surprise": 0.086
149
- }
150
- }
151
- ]
152
- }
153
- ```
154
-
155
- ## UI Display
156
-
157
- ### Search Results Grid
158
-
159
- Each book card displays:
160
- - **Dominant emotion label**: Emotion with highest score (bottom-right badge)
161
- - Example: "joy", "sadness", "fear"
162
-
163
- **Implementation** (`web/src/App.jsx`):
164
- ```jsx
165
- {book.emotions && Object.keys(book.emotions).length > 0 ? (
166
- <span className="text-[9px] bg-[#f8f9fa] border border-[#eee] px-1 text-[#999] capitalize">
167
- {Object.entries(book.emotions).reduce((a, b) => a[1] > b[1] ? a : b)[0]}
168
- </span>
169
- ) : (
170
- <span className="text-[9px] bg-[#f8f9fa] border border-[#eee] px-1 text-[#999]">—</span>
171
- )}
172
- ```
173
-
174
- ### Book Detail Modal
175
-
176
- Two new sections:
177
-
178
- **1. Key Themes**
179
- - Displays all extracted tags as badges
180
- - Shows "No themes found" if tags empty
181
-
182
- **2. Emotional Tone**
183
- - Five horizontal bars showing emotion scores
184
- - Bar width = score percentage (0–100%)
185
- - Format: `emotion_name | [bar] | percentage`
186
-
187
- **Implementation** (`web/src/App.jsx`):
188
- ```jsx
189
- <div className="space-y-2">
190
- <h4>Emotional Tone</h4>
191
- <div className="space-y-2 p-3 bg-[#faf9f6] border border-[#eee]">
192
- {selectedBook.emotions && Object.entries(selectedBook.emotions).map(([emotion, score]) => (
193
- <div key={emotion} className="flex items-center gap-2">
194
- <span className="text-[9px] font-bold text-gray-500 w-16 capitalize">{emotion}</span>
195
- <div className="flex-grow bg-white border border-[#eee] h-2 relative overflow-hidden">
196
- <div
197
- className="h-full bg-[#b392ac] transition-all"
198
- style={{ width: `${Math.round(score * 100)}%` }}
199
- />
200
- </div>
201
- <span className="text-[8px] text-gray-400 w-10 text-right">{Math.round(score * 100)}%</span>
202
- </div>
203
- ))}
204
- </div>
205
- </div>
206
- ```
207
-
208
- ## Future Improvements
209
-
210
- - **Incremental updates**: Score only new books instead of full dataset
211
- - **Smaller model**: Try lightweight emotion classifiers (faster inference)
212
- - **Multi-label tags**: Use text classification for predefined categories
213
- - **Tag filtering**: Allow users to filter by specific tags in search
214
- - **Emotion-based sorting**: Sort results by dominant emotion match
215
- - **Caching**: Cache emotion inference results in Redis for API speedup
216
-
217
- ## Dependencies
218
-
219
- ```
220
- scikit-learn # TF-IDF vectorization
221
- transformers # Emotion classification
222
- torch # Model inference
223
- tqdm # Progress bars
224
- ```
225
-
226
- ## Notes
227
-
228
- - Tags and emotions are **one-time computed** and stored in CSV
229
- - No re-computation on API requests (instant serving)
230
- - CSV file (242MB) is in `.gitignore` (too large for GitHub)
231
- - To regenerate on a new machine, run both scripts sequentially:
232
- 1. `generate_tags.py` (~5 minutes)
233
- 2. `generate_emotions.py` (~7 hours on MPS for full dataset)
 
cover-not-found.jpg DELETED
Binary file (64.4 kB)
 
data-exploration.ipynb DELETED
The diff for this file is too large to render. See raw diff
 
docs/README.md CHANGED
@@ -1,26 +1,30 @@
# Project Documentation

- ## Core Documents
-
- | Document | Purpose | Audience |
- |:---|:---|:---|
- | [build_guide.md](build_guide.md) | Full build pipeline from zero to production | New developers |
- | [roadmap.md](roadmap.md) | Project status, architecture vision, implementation plan | Developers |
- | [technical_report.md](technical_report.md) | System architecture, technical details | Technical interviews |
- | [interview_guide.md](interview_guide.md) | Interview Q&A, STAR cases | Job preparation |
-
- ## Archives
-
| Document | Purpose |
|:---|:---|
- | [experiments/experiment_archive.md](experiments/experiment_archive.md) | All experiment results and benchmarks |
-
- ## Quick Links
-
- - **Frozen Version**: v2.6.0 (Agentic RAG + 7-Channel RecSys + Stacking)
- - **Performance**: HR@10 = 0.4545, MRR@5 = 0.2893 (n=2000, Leave-Last-Out)

---

- *Frozen January 2026 maintenance mode for portfolio use*
-
# Project Documentation

+ ## Layer 1 — Main Story (README, 5-min interview)

+ | Document | Purpose |
+ |:---|:---|
+ | [Technical Report](TECHNICAL_REPORT.md) | Architecture, design decisions, method storyline |
+ | [Experiment Archive](experiments/experiment_archive.md) | Consolidated experiment log (V1.0 → v2.6.0) |

+ ## Layer 2 — Capability Showcase (Resume, technical Q&A)

| Document | Purpose |
|:---|:---|
+ | [Interview Guide](interview_guide.md) | Q&A, STAR cases |
+ | [Memory Optimization](memory_optimization.md) | Zero-RAM SQLite, engineering decisions |
+ | [Performance Debugging](performance_debugging_report.md) | Root cause analysis |
+ | [Build Guide](build_guide.md) | Full build pipeline |
+ | [Hugging Face Deployment](huggingface_deployment.md) | HF Spaces deployment |

+ ## Archives

+ | Path | Contents |
+ |:---|:---|
+ | [archived/](archived/) | Deprecated docs (Phase 2, TAGS, REVIEW_HIGHLIGHTS, etc.) |
+ | [archived/graveyard/](archived/graveyard/) | Layer 3 — tried but not in main story (future_roadmap, interview_deep_dive, etc.) |
+ | [experiments/reports/](experiments/reports/) | Raw experiment reports (baseline, hybrid, rerank, router, temporal) |

---

+ **Frozen v2.6.0** — HR@10 = 0.4545, MRR@5 = 0.2893
 
docs/TECHNICAL_REPORT.md CHANGED
@@ -9,7 +9,7 @@
9
 
10
  ## Executive Summary
11
 
12
- This project implements a production-grade Agentic RAG (Retrieval-Augmented Generation) system for book discovery, combined with a personalized recommendation engine. Unlike simple vector search, it uses a self-routing architecture that dynamically selects the optimal retrieval strategy based on query intent.
13
 
14
  Key achievements:
15
  - 100% recall on exact-match queries (ISBNs)
@@ -18,7 +18,7 @@ Key achievements:
18
  - Detail-level precision via hierarchical (Small-to-Big) retrieval
19
  - Personalized recommendations using 7-channel recall (Item2Vec, Stacking) and LGBMRanker (LambdaRank)
20
 
21
- The system demonstrates mastery of both Data-Centric AI (SFT data synthesis) and Advanced RAG Architecture (Hybrid Search, Reranking, Query Routing).
22
 
23
  ---
24
 
@@ -83,11 +83,12 @@ USER REQUEST (No Query)
83
  |
84
  v
85
  +---------------------------+
86
- | 6-CHANNEL RECALL (RRF) |
87
  | - ItemCF (direction wt) |
88
  | - UserCF (Jaccard) |
89
  | - Swing (user-pair) |
90
  | - SASRec (embedding) |
 
91
  | - YoutubeDNN (two-tower) |
92
  | - Popularity (fallback) |
93
  +---------------------------+
@@ -161,7 +162,7 @@ Implementation (based on LlamaIndex Parent-Child, RAPTOR):
161
  2. Matching: Query matches specific sentence ("I cried at the ending")
162
3. Expansion: Map matched sentence → parent ISBN → full book context
163
 
164
- Result: Can answer queries like "books with unreliable narrator twist" that are invisible to description-level search.
165
 
166
  ### 3.5 Temporal Dynamics
167
 
@@ -187,7 +188,7 @@ Location: `src/core/context_compressor.py`
187
 
188
  ## 4. Personalized Recommendation System
189
 
190
- ### 4.1 Multi-Channel Recall (6 Channels)
191
 
192
  | Recall Channel | Algorithm | Weight | Purpose |
193
  |:---|:---|:---|:---|
@@ -195,6 +196,7 @@ Location: `src/core/context_compressor.py`
195
  | UserCF | User similarity (Jaccard + activity penalty) | 1.0 | Similar user preferences |
196
  | Swing | User-pair overlap weighting: `1/(α + \|I_u ∩ I_v\|)` | 1.0 | Substitute relationships |
197
  | SASRec | Dot-product retrieval from pre-computed embeddings | 1.0 | Sequential patterns |
 
198
  | YoutubeDNN | Two-tower user-item dot product | 0.1 | Deep learning recall |
199
  | Popularity | Rating count with time decay | 0.5 | Cold-start fallback |
200
 
@@ -215,9 +217,9 @@ Architecture: Self-Attentive Sequential Recommendation with Transformer blocks
215
  - Training: 30 epochs, 64-dim embeddings, BCE loss with negative sampling
216
  - Dual use: (1) ranking feature via `sasrec_score`, (2) independent recall channel via embedding dot-product
217
 
218
- ### 4.3 LGBMRanker (LambdaRank)
219
 
220
- Replaced XGBoost binary classifier with LightGBM LambdaRank that directly optimizes NDCG.
221
 
222
  **Training strategy**:
223
  - Hard negative sampling: negatives mined from recall results (not random items)
@@ -231,24 +233,25 @@ Replaced XGBoost binary classifier with LightGBM LambdaRank that directly optimi
231
  - Sequence: sasrec_score, sim_max, sim_min, sim_mean
232
  - CF scores: icf_sum, icf_max, ucf_sum
233
 
234
- Feature importance (V2.5 LGBMRanker):
235
 
236
  | Feature | Importance | Description |
237
  |:---|:---|:---|
238
- | i_cnt | 96 | Item popularity count |
239
- | sim_max | 91 | Last-N similarity max |
240
- | u_cnt | 80 | User activity count |
241
- | i_mean | 41 | Item average rating |
242
- | sasrec_score | 22 | SASRec embedding score |
243
- | icf_max | 23 | ItemCF max similarity |
244
 
245
  ### 4.4 Evaluation Results
246
 
247
- | Metric | V2.0 (XGBoost) | V2.5 (LGBMRanker) | Improvement |
 
 
248
  |:---|:---|:---|:---|
249
- | HR@10 | 0.1380 | **0.2205** | +59.8% |
250
- | MRR@5 | 0.1295 | **0.1584** | +22.3% |
251
- | Users Evaluated | 500 | 2,000 | |
252
  | Dataset | 167,968 active users, 221,998 books | | |
253
 
254
  ---
@@ -261,9 +264,9 @@ Feature importance (V2.5 LGBMRanker):
261
  |--------|------------------------|-------------|
262
  | ISBN Recall | 0% | 100% |
263
  | Keyword Precision | Low | High (BM25 boost) |
264
- | Detail Query Recall | 0% | High (Small-to-Big) |
265
  | Avg Latency | 100ms | 300-800ms |
266
- | Chat Context Limit | ~10 turns | Unlimited (compression) |
267
 
268
  ### 5.2 Latency Benchmarks
269
 
@@ -310,7 +313,9 @@ Feature importance (V2.5 LGBMRanker):
310
 
311
  ---
312
 
313
- ## 8. SFT Data Pipeline
 
 
314
 
315
  ### 8.1 Problem
316
 
@@ -328,6 +333,8 @@ Output:
328
  - `data/sft/literary_critic_train.jsonl`: ~800 high-quality (Query, Response) pairs
329
  - `data/dpo/preference_pairs.jsonl`: ~500 (Chosen, Rejected) pairs
330
 
 
 
331
  ---
332
 
333
  ## 9. File Structure
@@ -346,7 +353,8 @@ src/
346
  │ ├── sasrec_recall.py # SASRec Embedding Recall
347
  │ ├── popularity.py # Popularity Recall
348
  │ ├── youtube_dnn.py # Two-Tower Model
349
- └── fusion.py # RRF Fusion (6 channels)
 
350
  ├── ranking/
351
  │ └── features.py # 17 Ranking Features
352
  ├── data_factory/
@@ -359,7 +367,16 @@ src/
359
 
360
  ---
361
 
362
- ## 10. Scalability
 
 
 
 
 
 
 
 
 
363
 
364
  Current capacity:
365
  - In-memory index: 2GB RAM, ~200K books
 

## Executive Summary

+ This project implements an integrated Agentic RAG (Retrieval-Augmented Generation) system for book discovery, combined with a personalized recommendation engine. Unlike simple vector search, it uses a self-routing architecture that dynamically selects the optimal retrieval strategy based on query intent.

Key achievements:
- 100% recall on exact-match queries (ISBNs)
- Detail-level precision via hierarchical (Small-to-Big) retrieval
- Personalized recommendations using 7-channel recall (Item2Vec, Stacking) and LGBMRanker (LambdaRank)

+ The system demonstrates both an advanced RAG architecture (Hybrid Search, Reranking, Query Routing) and a multi-channel RecSys (Item2Vec, LGBMRanker, Stacking).

---
 
 
            |
            v
+---------------------------+
+ | 7-CHANNEL RECALL (RRF)    |
| - ItemCF (direction wt)   |
| - UserCF (Jaccard)        |
| - Swing (user-pair)       |
| - SASRec (embedding)      |
+ | - Item2Vec (Word2Vec)     |
| - YoutubeDNN (two-tower)  |
| - Popularity (fallback)   |
+---------------------------+
 
2. Matching: Query matches specific sentence ("I cried at the ending")
3. Expansion: Map matched sentence → parent ISBN → full book context

+ Result: Can answer queries like "books with unreliable narrator twist" that are invisible to description-level search. *RAG components (ISBN recall, reranking, Small-to-Big) were validated via curated examples and routing statistics; no large-scale human evaluation was conducted.*
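The match-then-expand steps above can be sketched as a child-sentence index mapped back to parent books. The keyword-overlap scorer and the two index rows are hypothetical stand-ins for the real sentence-embedding match:

```python
# Small-to-Big sketch: match the query against child sentences, then expand
# each hit to its parent book. Token overlap stands in for embedding similarity;
# all data here is illustrative, not the project's actual index.
child_index = [
    ("I cried at the ending", "isbn_001"),
    ("A twisty unreliable narrator kept me guessing", "isbn_002"),
]
books = {"isbn_001": "Book One", "isbn_002": "Book Two"}  # parent store

def small_to_big(query, top_k=1):
    q = set(query.lower().split())
    scored = [(len(q & set(s.lower().split())), isbn) for s, isbn in child_index]
    scored.sort(reverse=True)
    return [books[isbn] for score, isbn in scored[:top_k] if score > 0]

print(small_to_big("unreliable narrator twist"))  # → ['Book Two']
```

The key property is that the query only has to match one sentence-level chunk, while the answer returned is the full parent record.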

### 3.5 Temporal Dynamics

 

## 4. Personalized Recommendation System

+ ### 4.1 Multi-Channel Recall (7 Channels)

| Recall Channel | Algorithm | Weight | Purpose |
|:---|:---|:---|:---|
| UserCF | User similarity (Jaccard + activity penalty) | 1.0 | Similar user preferences |
| Swing | User-pair overlap weighting: `1/(α + \|I_u ∩ I_v\|)` | 1.0 | Substitute relationships |
| SASRec | Dot-product retrieval from pre-computed embeddings | 1.0 | Sequential patterns |
+ | Item2Vec | Word2Vec (Skip-gram) on user interaction sequences | 0.8 | Implicit co-occurrence |
| YoutubeDNN | Two-tower user-item dot product | 0.1 | Deep learning recall |
| Popularity | Rating count with time decay | 0.5 | Cold-start fallback |
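The Item2Vec channel in the table above treats each user's interaction sequence as a "sentence" and learns skip-gram embeddings over item IDs. The retrieval half can be sketched as follows; the 4-dim vectors and the mean-of-history user profile are illustrative assumptions (real embeddings would come from a trained skip-gram model, and the project's exact scoring may differ):

```python
import numpy as np

# Hypothetical 4-dim item embeddings (stand-ins for trained skip-gram vectors).
item_vecs = {
    "isbn_A": np.array([0.9, 0.1, 0.0, 0.0]),
    "isbn_B": np.array([0.8, 0.2, 0.1, 0.0]),
    "isbn_C": np.array([0.0, 0.1, 0.9, 0.3]),
}

def item2vec_recall(history, k=1):
    """Average the user's recent item vectors, rank unseen items by cosine."""
    profile = np.mean([item_vecs[i] for i in history], axis=0)
    def cos(v):
        return float(v @ profile / (np.linalg.norm(v) * np.linalg.norm(profile)))
    candidates = [(cos(v), i) for i, v in item_vecs.items() if i not in history]
    return [i for _, i in sorted(candidates, reverse=True)[:k]]

print(item2vec_recall(["isbn_A"]))  # → ['isbn_B'] (closest co-occurrence neighbor)
```

Because embeddings are pre-computed offline, this channel adds only a nearest-neighbor lookup at serving time.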

- Training: 30 epochs, 64-dim embeddings, BCE loss with negative sampling
- Dual use: (1) ranking feature via `sasrec_score`, (2) independent recall channel via embedding dot-product

+ ### 4.3 LGBMRanker (LambdaRank) + Model Stacking

+ Replaced the XGBoost binary classifier with LightGBM LambdaRank, which directly optimizes NDCG. In v2.6.0, a Stacking ensemble (LGBMRanker + XGBClassifier → LogisticRegression meta-learner) further improves ranking robustness.
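The stacking step can be sketched as follows: scores from the two base models form a level-1 feature matrix for a logistic-regression meta-learner. The scores below are synthetic stand-ins for real LGBMRanker/XGBClassifier outputs, and in practice they would be produced out-of-fold to avoid leakage:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 200
labels = rng.integers(0, 2, n)               # 1 = user interacted with the item
# Synthetic base-model scores, correlated with the label plus noise:
lgbm_score = labels + rng.normal(0, 0.4, n)  # stand-in for LGBMRanker output
xgb_score = labels + rng.normal(0, 0.5, n)   # stand-in for XGBClassifier output

meta_X = np.column_stack([lgbm_score, xgb_score])  # level-1 feature matrix
meta = LogisticRegression().fit(meta_X, labels)    # meta-learner

# Blend two new candidates: one scored high by both bases, one scored low.
probs = meta.predict_proba([[0.95, 0.90], [0.05, 0.10]])[:, 1]
print(probs[0] > probs[1])
```

The meta-learner learns how much to trust each base model, which is why the ensemble can beat either model alone.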
223
 
224
  **Training strategy**:
225
  - Hard negative sampling: negatives mined from recall results (not random items)
 
233
  - Sequence: sasrec_score, sim_max, sim_min, sim_mean
234
  - CF scores: icf_sum, icf_max, ucf_sum
235
 
236
+ Feature importance (v2.6.0 LGBMRanker, representative subset):
237
 
238
  | Feature | Importance | Description |
239
  |:---|:---|:---|
240
+ | u_cnt | 88 | User activity count |
241
+ | sim_max | 76 | Last-N similarity max |
242
+ | icf_max | 62 | ItemCF max similarity |
243
+ | i_cnt | 59 | Item popularity count |
244
+ | len_diff | 55 | Description complexity match |
245
+ | sasrec_score | 25 | SASRec embedding score |
246
 
247
  ### 4.4 Evaluation Results

+ *Protocol: Leave-Last-Out, n=2000 users, title-relaxed matching, filter_favorites=False.*
+
+ | Metric | V2.0 (XGBoost) | V2.5 (LGBMRanker) | v2.6.0 (+Item2Vec, Stacking) |
|:---|:---|:---|:---|
+ | HR@10 | 0.1380 | 0.2205 | **0.4545** |
+ | MRR@5 | 0.1295 | 0.1584 | **0.2893** |
| Dataset | 167,968 active users, 221,998 books | | |
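The Leave-Last-Out protocol used above holds out each user's chronologically last interaction as the test target, with the rest forming the training history. A minimal sketch over a hypothetical event log:

```python
# Leave-Last-Out split: the last item of each time-sorted user sequence
# becomes the held-out target. The event log here is hypothetical.
def leave_last_out(user_events):
    """user_events: {user: [(timestamp, isbn), ...]} -> (train, test)."""
    train, test = {}, {}
    for user, events in user_events.items():
        ordered = [isbn for _, isbn in sorted(events)]
        train[user] = ordered[:-1]
        test[user] = ordered[-1]
    return train, test

events = {"u1": [(3, "isbn_C"), (1, "isbn_A"), (2, "isbn_B")]}
train, test = leave_last_out(events)
print(train["u1"], test["u1"])  # → ['isbn_A', 'isbn_B'] isbn_C
```

HR@10 then counts the fraction of users whose held-out item appears in their top-10 recommendations, and MRR@5 averages the reciprocal rank of that item within the top 5.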
256
 
257
  ---
 
264
  |--------|------------------------|-------------|
265
  | ISBN Recall | 0% | 100% |
266
  | Keyword Precision | Low | High (BM25 boost) |
267
+ | Detail Query Recall | 0% | Demonstrated via curated examples (Small-to-Big) |
268
  | Avg Latency | 100ms | 300-800ms |
269
+ | Chat Context Limit | ~10 turns | Extended via compression (no formal limit) |
270
 
271
  ### 5.2 Latency Benchmarks
272
 
 
313
 
314
  ---
315
 
316
+ ## 8. SFT Data Pipeline (Supplementary)
317
+
318
+ *Not integrated into the main RAG flow in v2.6.0.* This pipeline was developed for potential future fine-tuning of chat tone.
319
 
320
  ### 8.1 Problem
321
 
 
333
  - `data/sft/literary_critic_train.jsonl`: ~800 high-quality (Query, Response) pairs
334
  - `data/dpo/preference_pairs.jsonl`: ~500 (Chosen, Rejected) pairs
335
 
336
+ See [Experiment Archive](experiments/experiment_archive.md) for full implementation details.
337
+
338
  ---
339
 
340
  ## 9. File Structure
 
353
  │ ├── sasrec_recall.py # SASRec Embedding Recall
354
  │ ├── popularity.py # Popularity Recall
355
  │ ├── youtube_dnn.py # Two-Tower Model
356
+ │ ├── item2vec.py # Item2Vec Recall (Word2Vec)
357
+ │ └── fusion.py # RRF Fusion (7 channels)
358
  ├── ranking/
359
  │ └── features.py # 17 Ranking Features
360
  ├── data_factory/
 
367
 
368
  ---
369
 
370
+ ## 10. Limitations
371
+
372
+ - **Single-dataset evaluation**: All RecSys metrics are on Amazon Books 200K; no cross-domain or external validation.
373
+ - **Rule-based router**: Intent classification uses heuristics (e.g., `len(words) <= 2` for keyword); may not generalize to other domains.
374
+ - **RAG evaluation**: RAG quality is demonstrated via curated examples (e.g., "Harry Potter", ISBN recall); no systematic human evaluation or large-scale relevance judgments.
375
+ - **Protocol sensitivity**: RecSys metrics can vary with evaluation protocol (e.g., ISBN-only vs title-relaxed matching); see [Experiment Archive](experiments/experiment_archive.md) for discussion.
376
+
377
+ ---
378
+
379
+ ## 11. Scalability
380
 
381
  Current capacity:
382
  - In-memory index: 2GB RAM, ~200K books
docs/archived/PHASE_2_DEVELOPMENT.md CHANGED
@@ -204,12 +204,12 @@ persona = getPersona(userId)
  - **Method:** Rule-based persona matching (Top-3 authors/categories)
  - **Future:** Could upgrade to LLM refinement (e.g., GPT for polish)

- ### Why React + Vite?
+ ### Why React + Vite (not Gradio)?
  - **Rationale:**
- - React needed for custom UX and production-grade interface
+ - Gradio good for prototypes, React needed for custom UX
  - Vite super fast (no webpack pain)
- - Tailwind CSS for modern styling
- - **Architecture:** React frontend (port 5173) + FastAPI backend (port 6006/8000)
+ - Tailwind CDN avoids npm build complexity
+ - **Fallback:** Gradio UI (app.py) still available on port 7860

  ### Why Persona from Favorites (not search history)?
  - **Rationale:** User intent explicit in favorites, not implicit in queries
@@ -345,6 +345,13 @@ npm run dev
  # Connect to http://localhost:6006 backend
  ```

+ **Terminal 3 (Optional): Gradio Legacy UI**
+ ```bash
+ python app.py
+ # Starts on http://localhost:7860
+ # Alternative UI for testing
+ ```
+
  ### Production Workflow
  - React builds with `npm run build` → static files
  - FastAPI serves as single backend
@@ -415,13 +422,13 @@ Expected highlight: "Not your usual genre, but its plot depth pairs with sci-fi imagination..."
  ✅ **Modular backend design** → easy to add /highlights, /persona endpoints
  ✅ **React UI responsiveness** → users see results instantly
  ✅ **JSON-first approach** → no DB setup friction, iterate fast
- ✅ **API-driven architecture** → React frontend with FastAPI backend
+ ✅ **API-driven architecture** → Gradio + React both work
  ✅ **Persona concept** → users feel "understood" by the system

  ### Challenges Overcome
- 🔧 **Port configuration** (React:5173 vs FastAPI:6006/8000) → Makefile organization
+ 🔧 **Port conflicts** (Gradio:7860 vs React:5173 vs FastAPI:6006) → Makefile organization
  🔧 **CORS issues** (frontend can't reach backend) → Added CORSMiddleware
- 🔧 **Image loading** (external URLs) → Runtime fetching + local fallback
+ 🔧 **Image loading** (external URLs not allowed in Gradio) → Runtime fetching + local fallback
  🔧 **Timeout errors** (cold startup > 10s) → Increased client timeouts, optimized startup

  ### Design Philosophy Validated
@@ -454,11 +461,12 @@ book-rec-with-LLMs/
  │ │ └── main.jsx # Entry point
  │ ├── index.html # HTML + Tailwind CDN
  │ └── package.json # Dependencies
+ ├── app.py # Gradio UI (legacy)
  ├── Makefile # Commands
  ├── requirements.txt # Python deps
  └── data/
- ├── books_processed.csv # Metadata + review highlights
- └── user_profiles.json # User data
+ ├── books_processed.csv # Metadata
+ └── user_profiles.json # ✨ NEW: User data
  ```

  ---
@@ -478,6 +486,7 @@ feat: add React UI and backend personalization features
  * src/marketing/highlights.py: Persona-aware selling points
  * 3 new API endpoints in FastAPI

+ - Maintain Gradio UI as optional fallback
  - Add CORS support, update timeouts, improve infrastructure
  ```

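The "Rule-based persona matching (Top-3 authors/categories)" referenced in the archived PHASE_2 notes above can be sketched as below. The `favorites` schema and field names are assumptions for illustration, not the project's actual data model.

```python
from collections import Counter


def build_persona(favorites: list[dict]) -> dict:
    """Illustrative Top-3 authors/categories persona from favorited books.

    Assumes each favorite carries 'authors' and 'categories' lists
    (hypothetical schema).
    """
    authors = Counter(a for b in favorites for a in b.get("authors", []))
    categories = Counter(c for b in favorites for c in b.get("categories", []))
    return {
        "top_authors": [a for a, _ in authors.most_common(3)],
        "top_categories": [c for c, _ in categories.most_common(3)],
    }
```

Counting over explicit favorites (rather than search history) keeps the signal intentional, which is the design rationale the notes give.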
DEPLOYMENT.md → docs/archived/graveyard/DEPLOYMENT.md RENAMED
File without changes
business_logic.md → docs/archived/graveyard/business_logic.md RENAMED
File without changes
docs/{future_roadmap.md → archived/graveyard/future_roadmap.md} RENAMED
File without changes
docs/{interview_deep_dive.md → archived/graveyard/interview_deep_dive.md} RENAMED
File without changes
interview_prep.md → docs/archived/graveyard/interview_prep.md RENAMED
File without changes
docs/{phase7_plan.md → archived/graveyard/phase7_plan.md} RENAMED
File without changes
project_analysis.md → docs/archived/graveyard/project_analysis.md RENAMED
File without changes
docs/{project_narrative.md → archived/graveyard/project_narrative.md} RENAMED
File without changes
docs/{rag_architecture.md → archived/graveyard/rag_architecture.md} RENAMED
File without changes
technical_architecture.md → docs/archived/graveyard/technical_architecture.md RENAMED
File without changes
docs/{technical_deep_dive_sota.md → archived/graveyard/technical_deep_dive_sota.md} RENAMED
File without changes
story_and_strategy.md → docs/archived/story_and_strategy.md RENAMED
File without changes
{experiments → docs/experiments/reports}/baseline_report.md RENAMED
File without changes
benchmarks/results.md → docs/experiments/reports/benchmark_results.md RENAMED
File without changes
{experiments → docs/experiments/reports}/hybrid_report.md RENAMED
File without changes
{experiments → docs/experiments/reports}/rerank_report.md RENAMED
File without changes
{experiments → docs/experiments/reports}/router_report.md RENAMED
File without changes
{experiments → docs/experiments/reports}/temporal_report.md RENAMED
File without changes
legacy/README.md ADDED
@@ -0,0 +1,10 @@
+ # Legacy — Not part of main architecture
+
+ Code moved here is preserved but not used in the main flow (src.main FastAPI + React).
+
+ | File | Note |
+ |:---|:---|
+ | app.py | Gradio UI (replaced by React + FastAPI) |
+ | agent/ | Shopping agent (broken imports, not used) |
+ | deploy.sh | Old Hugging Face deployment script |
+ | download_fix.py | Temporary fix script |
{src → legacy}/agent/agent_core.py RENAMED
File without changes
{src → legacy}/agent/data_loader.py RENAMED
File without changes
{src → legacy}/agent/dialogue_manager.py RENAMED
File without changes
{src → legacy}/agent/intent_parser.py RENAMED
File without changes
{src → legacy}/agent/llm_generator.py RENAMED
File without changes
{src → legacy}/agent/rag_indexer.py RENAMED
File without changes
{src → legacy}/agent/rag_retriever.py RENAMED
File without changes
app.py → legacy/app.py RENAMED
@@ -48,7 +48,7 @@ tones = fetch_tones()
  # Initialize Shopping Agent (Mock or Real)
  # Note: Real agent requires FAISS index. We'll handle checks later.
  try:
-     # from src.agent.agent_core import ShoppingAgent
+     # from legacy.agent.agent_core import ShoppingAgent
      # shopping_agent = ShoppingAgent(...)
      pass
  except ImportError:
deploy.sh → legacy/deploy.sh RENAMED
File without changes
download_fix.py → legacy/download_fix.py RENAMED
File without changes
requirements.txt CHANGED
@@ -40,7 +40,6 @@ scipy
  requests

  # LLM Agent & Fine-tuning
- langchain
  faiss-cpu
  diffusers
  openai
scripts/add_isbn13_to_books_data.py DELETED
@@ -1,16 +0,0 @@
- import pandas as pd
-
- # Read the main table and books_data_with_isbn.csv
- main = pd.read_csv("data/books_with_emotions.csv", usecols=["title", "isbn13"])
- data = pd.read_csv("data/books_data_with_isbn.csv")
-
- # Normalize titles
- main["title"] = main["title"].astype(str).str.strip().str.lower()
- data["Title"] = data["Title"].astype(str).str.strip().str.lower()
-
- # Left-join the two tables
- merged = data.merge(main, left_on="Title", right_on="title", how="left")
-
- # Save the new file
- merged.to_csv("data/books_data_with_isbn13.csv", index=False)
- print("Generated data/books_data_with_isbn13.csv with the isbn13 field.")
scripts/add_isbn_to_books_data.py DELETED
@@ -1,21 +0,0 @@
- import pandas as pd
-
- # Read books_data.csv
- books_data = pd.read_csv("data/books_data.csv")
-
- # Read Books_rating.csv, keeping only the Title and Id columns
- ratings = pd.read_csv("data/Books_rating.csv", usecols=["Title", "Id"])
-
- # Drop duplicates to avoid many-to-one joins
- ratings = ratings.drop_duplicates(subset=["Title"])
-
- # Left-join, keeping every row of books_data.csv
- merged = books_data.merge(ratings, on="Title", how="left")
-
- # Rename Id to isbn
- merged = merged.rename(columns={"Id": "isbn"})
-
- # Save the new file
- merged.to_csv("data/books_data_with_isbn.csv", index=False)
-
- print("Generated data/books_data_with_isbn.csv with the isbn field.")
scripts/benchmark_compressor.py DELETED
@@ -1,35 +0,0 @@
- import asyncio
- from langchain_core.messages import HumanMessage, AIMessage
- from src.core.context_compressor import compressor
-
- async def run_benchmark():
-     print("🚀 Starting Context Compression Benchmark...")
-
-     # 1. Simulate Long History (12 messages, 6 turns)
-     history = []
-     for i in range(1, 7):
-         history.append(HumanMessage(content=f"User question {i}: I like sci-fi."))
-         history.append(AIMessage(content=f"AI answer {i}: Here is a sci-fi book."))
-
-     print(f"Original History Length: {len(history)} messages")
-
-     # 2. Compress
-     print("Compressing...")
-     # Mock LLM generation usually takes time, so latency includes API call
-     compressed = await compressor.compress_history(history)
-
-     print(f"Compressed History Length: {len(compressed)} messages")
-
-     # 3. Validation
-     # Expected: 1 SystemMessage (Summary) + 4 Messages (Recent) = 5
-     if len(compressed) == 5:
-         print("✅ SUCCESS: History compressed to 5 messages.")
-         print(f"Summary Content: {compressed[0].content}")
-         print(f"Oldest Retained Message: {compressed[1].content}")
-     else:
-         print(f"❌ FAILURE: Expected 5 messages, got {len(compressed)}")
-         for i, m in enumerate(compressed):
-             print(f"[{i}] {type(m).__name__}: {m.content}")
-
- if __name__ == "__main__":
-     asyncio.run(run_benchmark())
scripts/benchmark_hybrid.py DELETED
@@ -1,83 +0,0 @@
- import time
- import pandas as pd
- from src.vector_db import VectorDB
-
- def run_benchmark():
-     print("🚀 Starting Hybrid Retrieval Benchmark...")
-
-     # Load Title Mapping
-     try:
-         books_df = pd.read_csv("data/books_processed.csv")
-         # Ensure string ISBN for matching
-         if 'isbn13' in books_df.columns:
-             books_df['isbn'] = books_df['isbn13'].astype(str)
-         else:
-             books_df['isbn'] = books_df['isbn'].astype(str)
-
-         isbn_map = books_df.set_index('isbn')['title'].to_dict()
-     except Exception as e:
-         print(f"⚠️ Failed to load books_processed.csv: {e}")
-         isbn_map = {}
-
-     db = VectorDB()
-
-     # Same Test Cases
-     test_queries = [
-         # 1. Semantic (Hybrid should match Dense)
-         {"type": "Semantic", "query": "books about finding love in unexpected places"},
-         {"type": "Semantic", "query": "scary stories that keep you up at night"},
-
-         # 2. Keyword/Proper Noun (Hybrid should improve)
-         {"type": "Keyword", "query": "Harry Potter"},
-         {"type": "Keyword", "query": "Python Programming"},
-         {"type": "Keyword", "query": "Jane Austen"},
-
-         # 3. Exact Match / ISBN (Hybrid should fix this)
-         {"type": "Exact", "query": "0060959479"},
-     ]
-
-     results = []
-
-     for case in test_queries:
-         q = case["query"]
-         print(f"\nScanning: '{q}' ({case['type']})...")
-
-         start_time = time.time()
-         # USE HYBRID SEARCH
-         docs = db.hybrid_search(q, k=5)
-         duration = (time.time() - start_time) * 1000
-
-         # Capture simplified results
-         top_results = []
-         for doc in docs:
-             # Extract ISBN
-             parts = doc.page_content.strip().split(' ', 1)
-             isbn = parts[0]
-             # Fallback parsing for legacy docs
-             if "ISBN:" in doc.page_content:
-                 isbn = doc.page_content.split("ISBN:")[1].strip().split()[0]
-
-             title = isbn_map.get(isbn, f"ISBN:{isbn}")
-             if len(title) > 40:
-                 title = title[:37] + "..."
-             top_results.append(title)
-
-         print(f" -> Found: {top_results}")
-         results.append({
-             "query": q,
-             "type": case["type"],
-             "latency_ms": round(duration, 2),
-             "top_results": top_results
-         })
-
-     # Save
-     df = pd.DataFrame(results)
-     path = "experiments/02_hybrid_results.csv"
-     df.to_csv(path, index=False)
-     print(f"\n💾 Results saved to {path}")
-
-     print("\n## Hybrid Search Results")
-     print(df.to_string(index=False))
-
- if __name__ == "__main__":
-     run_benchmark()
scripts/benchmark_rerank.py DELETED
@@ -1,82 +0,0 @@
- import time
- import pandas as pd
- from src.vector_db import VectorDB
-
- def run_benchmark():
-     print("🚀 Starting Reranked Retrieval Benchmark...")
-
-     # Load Title Mapping
-     try:
-         books_df = pd.read_csv("data/books_processed.csv")
-         if 'isbn13' in books_df.columns:
-             books_df['isbn'] = books_df['isbn13'].astype(str)
-         else:
-             books_df['isbn'] = books_df['isbn'].astype(str)
-         isbn_map = books_df.set_index('isbn')['title'].to_dict()
-     except Exception as e:
-         print(f"⚠️ Failed to load books_processed.csv: {e}")
-         isbn_map = {}
-
-     db = VectorDB()
-
-     # Same Test Cases
-     test_queries = [
-         # 1. Semantic (Reranker should bubble up best Semantic matches)
-         {"type": "Semantic", "query": "books about finding love in unexpected places"},
-         # Complex mood query
-         {"type": "Complex", "query": "a dark sci-fi thriller with a female protagonist"},
-
-         # 2. Keyword/Proper Noun (Reranker should confirm these are relevant)
-         {"type": "Keyword", "query": "Harry Potter"},
-         {"type": "Keyword", "query": "Jane Austen"},
-
-         # 3. Exact Match (Should still work)
-         {"type": "Exact", "query": "0060959479"},
-     ]
-
-     results = []
-
-     for case in test_queries:
-         q = case["query"]
-         print(f"\nScanning: '{q}' ({case['type']})...")
-
-         start_time = time.time()
-         # USE HYBRID WITH RERANK
-         docs = db.hybrid_search(q, k=5, rerank=True)
-         duration = (time.time() - start_time) * 1000
-
-         # Capture results with scores
-         top_results = []
-         for doc in docs:
-             # Extract ISBN
-             parts = doc.page_content.strip().split(' ', 1)
-             isbn = parts[0]
-             if "ISBN:" in doc.page_content:
-                 isbn = doc.page_content.split("ISBN:")[1].strip().split()[0]
-
-             title = isbn_map.get(isbn, f"ISBN:{isbn}")
-             if len(title) > 30:
-                 title = title[:27] + "..."
-
-             score = doc.metadata.get("relevance_score", 0.0)
-             top_results.append(f"{title} ({score:.4f})")
-
-         print(f" -> Found: {top_results}")
-         results.append({
-             "query": q,
-             "type": case["type"],
-             "latency_ms": round(duration, 2),
-             "top_results": top_results
-         })
-
-     # Save
-     df = pd.DataFrame(results)
-     path = "experiments/03_rerank_results.csv"
-     df.to_csv(path, index=False)
-     print(f"\n💾 Results saved to {path}")
-
-     print("\n## Reranked Search Results")
-     print(df.to_string(index=False))
-
- if __name__ == "__main__":
-     run_benchmark()
scripts/benchmark_retrieval.py DELETED
@@ -1,82 +0,0 @@
- import time
- import pandas as pd
- from typing import List
- from src.vector_db import VectorDB
-
- def run_benchmark():
-     print("🚀 Starting Retrieval Benchmark (BASELINE)...")
-
-     # Load Title Mapping
-     try:
-         books_df = pd.read_csv("data/books_processed.csv")
-         # Ensure string ISBN for matching
-         books_df['isbn'] = books_df['isbn'].astype(str)
-         isbn_map = books_df.set_index('isbn')['title'].to_dict()
-         print(f"📚 Loaded {len(isbn_map)} titles for mapping.")
-     except Exception as e:
-         print(f"⚠️ Failed to load books_processed.csv: {e}")
-         isbn_map = {}
-
-     db = VectorDB()
-
-     # ... (Test Cases preserved) ...
-     test_queries = [
-         # 1. Semantic (Dense should win)
-         {"type": "Semantic", "query": "books about finding love in unexpected places"},
-         {"type": "Semantic", "query": "scary stories that keep you up at night"},
-
-         # 2. Keyword/Proper Noun (Dense might struggle)
-         {"type": "Keyword", "query": "Harry Potter"},
-         {"type": "Keyword", "query": "Python Programming"},
-         {"type": "Keyword", "query": "Jane Austen"},
-
-         # 3. Exact Match / ISBN
-         {"type": "Exact", "query": "0060959479"},
-     ]
-
-     results = []
-
-     for case in test_queries:
-         q = case["query"]
-         print(f"\nScanning: '{q}' ({case['type']})...")
-
-         start_time = time.time()
-         docs = db.search(q, k=5)
-         duration = (time.time() - start_time) * 1000
-
-         # Capture simplified results
-         top_results = []
-         for doc in docs:
-             # Format: "ISBN ReviewText..."
-             # Extract ISBN (first token)
-             parts = doc.page_content.strip().split(' ', 1)
-             isbn = parts[0]
-
-             # Lookup Title
-             title = isbn_map.get(isbn, f"ISBN:{isbn}")
-
-             # Truncate for display
-             if len(title) > 40:
-                 title = title[:37] + "..."
-             top_results.append(title)
-
-         print(f" -> Found: {top_results}")
-         results.append({
-             "query": q,
-             "type": case["type"],
-             "latency_ms": round(duration, 2),
-             "top_results": top_results
-         })
-
-     # Save Report
-     df = pd.DataFrame(results)
-     path = "experiments/01_baseline_results.csv"
-     df.to_csv(path, index=False)
-     print(f"\n💾 Results saved to {path}")
-
-     # Print Summary
-     print("\n## Baseline Results Summary")
-     print(df.to_string(index=False))
-
- if __name__ == "__main__":
-     run_benchmark()
scripts/benchmark_router.py DELETED
@@ -1,99 +0,0 @@
- import time
- import pandas as pd
- from src.vector_db import VectorDB
- from src.core.router import QueryRouter
-
- def run_benchmark():
-     print("🚀 Starting Agentic Router Benchmark...")
-
-     # Init Components
-     db = VectorDB()
-     router = QueryRouter()
-
-     # Load Title Mapping (for display)
-     try:
-         books_df = pd.read_csv("data/books_processed.csv")
-         if 'isbn13' in books_df.columns:
-             books_df['isbn'] = books_df['isbn13'].astype(str)
-         else:
-             books_df['isbn'] = books_df['isbn'].astype(str)
-         isbn_map = books_df.set_index('isbn')['title'].to_dict()
-     except:
-         isbn_map = {}
-
-     test_queries = [
-         # 1. ISBN -> Should be EXACT (No Rerank) to avoid regression
-         {"query": "0060959479", "expected_strat": "exact"},
-
-         # 2. Keyword -> Should be FAST (No Rerank)
-         {"query": "python programming", "expected_strat": "fast"},
-
-         # 3. Complex -> Should be DEEP (With Rerank)
-         {"query": "books about finding love in unexpected places", "expected_strat": "deep"},
-     ]
-
-     results = []
-
-     for case in test_queries:
-         q = case["query"]
-         print(f"\nUser Query: '{q}'")
-
-         # 1. ROUTING STEP
-         route_decision = router.route(q)
-         strat = route_decision["strategy"]
-         use_rerank = route_decision["rerank"]
-         alpha_val = route_decision.get("alpha", 0.5)
-
-         print(f" 🤖 Router Decision: {strat.upper()} (Rerank={use_rerank}, Alpha={alpha_val})")
-
-         # Check expectation
-         if strat != case["expected_strat"]:
-             print(f" ⚠️ WARNING: Expected {case['expected_strat']}, got {strat}")
-
-         # 2. RETRIEVAL STEP
-         start_time = time.time()
-         docs = db.hybrid_search(
-             q,
-             k=5,
-             rerank=use_rerank,
-             alpha=alpha_val
-         )
-         duration = (time.time() - start_time) * 1000
-
-         # Capture results
-         top_results = []
-         for doc in docs:
-             # Extract ISBN/Title
-             if "ISBN:" in doc.page_content:
-                 isbn = doc.page_content.split("ISBN:")[1].strip().split()[0]
-             else:
-                 parts = doc.page_content.strip().split(' ', 1)
-                 isbn = parts[0]
-
-             title = isbn_map.get(isbn, f"ISBN:{isbn}")
-             if len(title) > 30:
-                 title = title[:27] + "..."
-
-             score = doc.metadata.get("relevance_score", "N/A")
-             if score != "N/A":
-                 top_results.append(f"{title} ({score:.4f})")
-             else:
-                 top_results.append(f"{title}")
-
-         print(f" -> Found: {top_results[:3]}")
-         results.append({
-             "query": q,
-             "strategy": strat,
-             "latency_ms": round(duration, 2),
-             "top_1": top_results[0] if top_results else "None"
-         })
-
-     # Save
-     df = pd.DataFrame(results)
-     path = "experiments/04_router_results.csv"
-     df.to_csv(path, index=False)
-     print(f"\n💾 Results saved to {path}")
-     print(df.to_string(index=False))
-
- if __name__ == "__main__":
-     run_benchmark()
scripts/benchmark_temporal.py DELETED
@@ -1,44 +0,0 @@
- import pandas as pd
- from src.vector_db import VectorDB
-
- def run_benchmark():
-     print("🚀 Starting Temporal Dynamics Benchmark...")
-
-     db = VectorDB()
-
-     # We use a query where 'newness' matters
-     query = "latest advancements in technology and science"
-
-     print(f"\nQuery: '{query}'")
-
-     # 1. Standard Search
-     print("\n--- Standard Search (No Temporal) ---")
-     st_docs = db.hybrid_search(query, k=5, rerank=True, temporal=False)
-     for d in st_docs:
-         # Get Year
-         isbn = d.metadata.get("isbn") or d.metadata.get("isbn13")
-         if not isbn and "ISBN:" in d.page_content:
-             isbn = d.page_content.split("ISBN:")[1].strip().split()[0]
-         year = db.pub_years.get(str(isbn), "Unknown")
-         score = d.metadata.get("relevance_score", 0.0)
-
-         # Parse title
-         title = d.page_content.split('\n')[0].replace("Title: ", "")[:40]
-         print(f"[{year}] {title}... (Score: {score:.4f})")
-
-     # 2. Temporal Search
-     print("\n--- Temporal Search (Recent Boost) ---")
-     tm_docs = db.hybrid_search(query, k=5, rerank=True, temporal=True)
-     for d in tm_docs:
-         isbn = d.metadata.get("isbn") or d.metadata.get("isbn13")
-         if not isbn and "ISBN:" in d.page_content:
-             isbn = d.page_content.split("ISBN:")[1].strip().split()[0]
-         year = db.pub_years.get(str(isbn), "Unknown")
-         # In temporal mode, score is boosted
-         score = d.metadata.get("relevance_score", 0.0)
-
-         title = d.page_content.split('\n')[0].replace("Title: ", "")[:40]
-         print(f"[{year}] {title}... (Score: {score:.4f})")
-
- if __name__ == "__main__":
-     run_benchmark()
scripts/build_books_basic_info.py DELETED
@@ -1,48 +0,0 @@
- import pandas as pd
- import csv
-
- # Read the raw data; skip malformed rows so the pipeline doesn't abort
- books_data = pd.read_csv(
-     "data/books_data.csv",
-     engine="python",
-     quotechar='"',
-     escapechar='\\',
-     on_bad_lines='skip'  # pandas >=1.3
- )
- ratings = pd.read_csv("data/Books_rating.csv", engine="python", quotechar='"', escapechar='\\', on_bad_lines='skip')
-
- # Keep only the useful columns
- books_cols = [
-     "Title", "description", "authors", "image", "publisher", "publishedDate", "categories"
- ]
- books_data = books_data[books_cols]
-
- # Keep only Title, Id, review/score for the merge
- ratings_cols = ["Title", "Id", "review/score"]
- ratings = ratings[ratings_cols]
-
- # Drop duplicates
- ratings = ratings.drop_duplicates(subset=["Title"])
-
- # Left-join, keeping every row of books_data
- merged = books_data.merge(ratings, on="Title", how="left")
-
- # Rename columns
- merged = merged.rename(columns={
-     "Id": "isbn10",
-     "Title": "title",
-     "authors": "authors",
-     "description": "description",
-     "image": "image",
-     "publisher": "publisher",
-     "publishedDate": "publishedDate",
-     "categories": "categories",
-     "review/score": "average_rating"
- })
-
- # Generate isbn13 (placeholder; proper generation logic can be added later)
- merged["isbn13"] = None
-
- # Save the new table, quoting every field so long descriptions aren't truncated
- merged.to_csv("data/books_basic_info.csv", index=False, quoting=csv.QUOTE_ALL, quotechar='"', escapechar='\\')
- print("Generated data/books_basic_info.csv with basic book info fields.")