chore: update requirements and documentation for intent classifier and RAG evaluation
- data/rag_golden.README.md +32 -0
- docs/TECHNICAL_REPORT.md +8 -2
- docs/interview_guide.md +46 -17
- requirements.txt +3 -0
- scripts/data/build_sequences.py +41 -82
- scripts/data/fetch_new_books.py +322 -0
- scripts/data/validate_data.py +19 -1
- scripts/model/evaluate_rag.py +169 -0
- scripts/model/train_din_ranker.py +264 -0
- scripts/model/train_intent_router.py +156 -0
- scripts/model/train_ranker.py +5 -8
- scripts/run_pipeline.py +28 -1
- src/core/freshness_monitor.py +231 -0
- src/core/intent_classifier.py +204 -0
- src/core/metadata_store.py +142 -0
- src/core/router.py +151 -47
- src/core/web_search.py +323 -0
- src/ranking/din.py +212 -0
- src/recommender.py +137 -17
- src/services/recommend_service.py +30 -19
- src/vector_db.py +9 -5
data/rag_golden.README.md
ADDED
@@ -0,0 +1,32 @@
# RAG Golden Test Set

Human-annotated Query-Book pairs for quantitative RAG evaluation.

## Format

CSV with columns: `query`, `isbn`, `relevance`, `notes`

- **query**: User search string (e.g., "Harry Potter", "0060959479", "books about AI")
- **isbn**: Expected relevant book ISBN (from your catalog)
- **relevance**: 1 = relevant (filter rows with relevance=1)
- **notes**: Optional annotation note

Multiple rows per query = multiple relevant books (Recall@K counts all of them).

## Usage

```bash
# Copy the example file and extend it with ISBNs from your catalog
cp data/rag_golden.example.csv data/rag_golden.csv

# Run the evaluation
python scripts/model/evaluate_rag.py --golden data/rag_golden.csv
```

## Metrics

- **Accuracy@K**: Fraction of queries with at least one relevant book in the top-K
- **Recall@K**: Fraction of relevant books (across all queries) found in the top-K
- **MRR@K**: Mean reciprocal rank of the first relevant hit

Target: 500+ pairs for production-quality evaluation.
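For reference, the metric definitions above fit in a few lines of Python. This is a minimal sketch, not the project's `evaluate_rag.py`; the `retrieve` callable and the shape of the `golden` mapping are illustrative assumptions:

```python
# Minimal sketch of the three metrics defined above (illustrative names).
# golden: {query: set of relevant ISBNs}; retrieve(query) -> ranked ISBN list.
def evaluate(golden, retrieve, k=10):
    acc = recall = mrr = 0.0
    for query, relevant in golden.items():
        top = retrieve(query)[:k]
        # 1-based ranks of the relevant items that appear in the top-K
        ranks = [i for i, isbn in enumerate(top, start=1) if isbn in relevant]
        acc += 1.0 if ranks else 0.0             # Accuracy@K: any hit in top-K
        recall += len(ranks) / len(relevant)     # Recall@K: share of relevant found
        mrr += 1.0 / ranks[0] if ranks else 0.0  # MRR@K: first hit's reciprocal rank
    n = len(golden)
    return {"accuracy@k": acc / n, "recall@k": recall / n, "mrr@k": mrr / n}
```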
docs/TECHNICAL_REPORT.md
CHANGED
@@ -217,6 +217,8 @@ Architecture: Self-Attentive Sequential Recommendation with Transformer blocks
 - Training: 30 epochs, 64-dim embeddings, BCE loss with negative sampling
 - Dual use: (1) ranking feature via `sasrec_score`, (2) independent recall channel via embedding dot-product
 
+**Time-split (no leakage)**: SASRec is trained on `train.csv` only. `user_seq_emb` and `sas_item_emb` are computed from train-only sequences. When Ranking uses `sasrec_score` for val samples, the user's history contains only train interactions, never val/test. `build_sequences.py` and SASRec/YoutubeDNN all use train-only data.
+
 ### 4.3 LGBMRanker (LambdaRank) + Model Stacking
 
 Replaced the XGBoost binary classifier with a LightGBM LambdaRank model that directly optimizes NDCG. In v2.6.0, a Stacking ensemble (LGBMRanker + XGBClassifier → LogisticRegression meta-learner) further improves ranking robustness.

@@ -226,6 +228,8 @@ Replaced XGBoost binary classifier with LightGBM LambdaRank that directly optimi
 - 20K users sampled from the 168K validation set for training speed
 - 4× negative ratio per positive sample
 
+**Feature consistency**: Recall models (SASRec, ItemCF, etc.) are trained on train.csv. Ranking labels come from val.csv. Features like `sasrec_score` use train-only embeddings. Pipeline order: `split_rec_data` → `build_sequences` (train-only) → recall models (train) → ranker (val).
+
 **17 features** in 5 groups:
 - User statistics: u_cnt, u_mean, u_std
 - Item statistics: i_cnt, i_mean, i_std

@@ -264,10 +268,12 @@ Feature importance (v2.6.0 LGBMRanker, representative subset):
 |--------|------------------------|-------------|
 | ISBN Recall | 0% | 100% |
 | Keyword Precision | Low | High (BM25 boost) |
-| Detail Query Recall | 0% |
+| Detail Query Recall | 0% | Golden Test Set (Accuracy@K, Recall@K, MRR@K) |
 | Avg Latency | 100ms | 300-800ms |
 | Chat Context Limit | ~10 turns | Extended via compression (no formal limit) |
 
+**Golden Test Set**: Human-annotated Query-Book pairs (`data/rag_golden.csv`) replace curated examples. Run `python scripts/model/evaluate_rag.py` for Accuracy@K, Recall@K, and MRR@K. Extend with ~500+ pairs for production.
+
 ### 5.2 Latency Benchmarks
 
 | Operation | P50 Latency (Warm) | P95 Latency (Warm) |

@@ -371,7 +377,7 @@ src/
 
 - **Single-dataset evaluation**: All RecSys metrics are on Amazon Books 200K; no cross-domain or external validation.
 - **Rule-based router**: Intent classification uses heuristics (e.g., `len(words) <= 2` for keyword); may not generalize to other domains.
-- **RAG evaluation**:
+- **RAG evaluation**: Use the Golden Test Set (`data/rag_golden.csv`) for Accuracy@K, Recall@K, MRR@K. Extend to 500+ human-annotated Query-Book pairs for production.
 - **Protocol sensitivity**: RecSys metrics can vary with evaluation protocol (e.g., ISBN-only vs title-relaxed matching); see [Experiment Archive](experiments/experiment_archive.md) for discussion.
 
 ---
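As a side note, the Stacking ensemble from section 4.3 can be sketched as follows. This is a minimal illustration on synthetic data, with classifier base models standing in for the report's LGBMRanker (LambdaRank); all hyperparameters and data here are placeholder assumptions, not the project's training script:

```python
# Stacking sketch: out-of-fold scores from two base models feed a
# LogisticRegression meta-learner (synthetic data, illustrative settings).
import numpy as np
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(42)
X = rng.random((1000, 17))  # 17 ranking features, mirroring section 4.3
y = (X[:, 0] + 0.3 * rng.random(1000) > 0.65).astype(int)

# Out-of-fold probabilities keep the base-model scores honest: the
# meta-learner never sees predictions made on a base model's own training fold
p_lgb = cross_val_predict(LGBMClassifier(n_estimators=100), X, y,
                          cv=5, method="predict_proba")[:, 1]
p_xgb = cross_val_predict(XGBClassifier(n_estimators=100), X, y,
                          cv=5, method="predict_proba")[:, 1]

meta = LogisticRegression().fit(np.column_stack([p_lgb, p_xgb]), y)
print("meta-learner weights:", meta.coef_)
```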
docs/interview_guide.md
CHANGED
@@ -5,51 +5,80 @@
 ## 🌟 Core Highlights (Why this project?)
 
 ### 1. Architecture Depth
+
+* **Agentic RAG**: Not just plain vector retrieval; the system adds **Dynamic Routing** and picks the best retrieval strategy (BM25, Hybrid, Small-to-Big) from the user's intent (e.g., exact ISBN search vs. fuzzy semantic search), showing fine-grained control over the RAG pipeline.
+* **Stacking Ensemble**: The ranking stage does not stop at a single model; it implements a **LightGBM + XGBoost + Logistic Regression** stacking architecture, reflecting an understanding of model bias and variance and an engineering drive for the best possible recommendation quality.
+* **Vector Database**: Semantic search built on ChromaDB, in line with the current LLM + Vector Store trend.
 
 ### 2. Engineering Excellence
+
+* **Performance Optimization**:
+  * **Problem**: The system stuttered under concurrent load, and inference latency was high.
+  * **Fixes** (see the sketch after this section):
+    1. **Async/Await pitfall**: CPU-bound work (Pandas operations) ran inside FastAPI `async` routes and blocked the event loop. Adding `await` does not help; non-IO work must leave `async` or move to a thread pool. Switching to synchronous `def` handlers lets FastAPI use its thread pool automatically.
+    2. **Vectorization refactor**: Feature generation used native Python `for` loops. Rewriting it as vectorized Numpy/Pandas operations (exploiting SIMD) sped up inference roughly 10×.
+    3. **Singleton pattern**: A `MetadataStore` singleton avoids reloading the CSV on every request, significantly cutting memory and I/O overhead.
+* **Explainability**: Integrated **SHAP (SHapley Additive exPlanations)**. The recommender is no longer a "black box": it can explain in real time why a book was recommended (e.g., because you like author X, or because you mostly read this category), a key marker separating advanced projects from beginner ones.
 
 ### 3. Completeness
+
+* **Full Stack**: Frontend (React) + backend (FastAPI) + data flow (ETL) + model training (train scripts) + deployment (Docker).
+* **DevOps**: Ships with a Dockerfile and complete build scripts, ready for production deployment.
 
 ---
 
 ## 🗣️ Interview Talking Points & Q&A Strategy
 
 ### Q1: What was the biggest difficulty you hit in this project, and how did you solve it?
+
 **Suggested answer**:
+
 > "The part that left the deepest impression was **system performance optimization**.
 > The first version had very high inference latency under concurrent load and could even block the whole service.
 > I solved it on two levels:
+>
+> 1. **Architecture level**: Profiling showed that FastAPI `async` endpoints contained heavy Pandas processing. Because Python `async` is single-threaded and cooperative, CPU-bound work stalls the event loop outright. I refactored those endpoints into non-async handlers that run on FastAPI's thread pool, which removed the blocking.
+> 2. **Code level**: Feature engineering was originally written with Python loops. I rewrote it as **Numpy vectorized** operations, replacing per-element Python-interpreter overhead with C-level matrix math, which made feature generation more than 10× faster."
 
 ### Q2: Why a Stacking ensemble? Isn't LightGBM alone enough?
+
 **Suggested answer**:
+
 > "A single model always has limitations.
 > LightGBM is strong on categorical features and gradient boosting, while XGBoost handles regularization very well.
 > With Stacking, I use a simple Logistic Regression as the meta-learner over the outputs of these two strong models.
 > This exploits the strengths of different models (reducing bias and variance) and improves **robustness**. In my offline experiments, Stacking clearly beat single LightGBM on NDCG@10."
 
 ### Q3: What is special about your RAG system?
+
 **Suggested answer**:
+
 > "My RAG system is not a plain 'Retrieve then Generate'. I designed an **Agentic Router**.
 > It first classifies the user's intent: an ISBN lookup goes straight to exact matching; a fuzzy description goes through the semantic index; a complex query triggers Rerank.
 > This dynamic strategy resolves the classic RAG pain point of 'precise but incomplete vs. complete but imprecise'."
 
+**Possible follow-up questions an interviewer might ask:**
+
+**Q1. On the intuition behind the Swing algorithm:**
+
+> "I see you used Swing recall. Can you explain intuitively why Swing resists noise better than classic UserCF? In the formula `1 / (alpha + |I_u ∩ I_v|)`, what kind of user pair does the denominator penalize?"
+> *(What it probes: real understanding of the algorithm vs. just calling a library. The key is that Swing penalizes user pairs that are "already very similar", i.e., tight cliques, which promotes serendipity.)*
+
+**Q2. On RAG latency optimization:**
+
+> "Your report says Hybrid Search + Rerank takes about 800ms. If we deployed this system in Douyin's search box with a P99 latency budget of 200ms, which stages would you cut, or how would you optimize through engineering?"
+> *(What it probes: engineering thinking. Possible answers include parallel requests, vector-index quantization/HNSW, Rerank model distillation, caching hot queries, async loading of details, etc.)*
+
+**Q3. SASRec serving details:**
+
+> "In `src/model/sasrec.py` you use a Transformer. At inference time, if we refresh recommendations on every click, SASRec is expensive. How would you cache the user's embedding state so you don't recompute the whole sequence from scratch each time?"
+> *(What it probes: understanding of online inference optimization for deep models. Key ideas: KV cache or incremental computation.)*
 
 ---
 
 ## 📈 Key Metrics
+
+* **Hit Rate@10**: 0.4545 (v2.6.0, n=2000, Leave-Last-Out)
+* **MRR@5**: 0.2893 (Title-relaxed matching)
+* **Latency**: P99 < 50ms (Personalized Recs)
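The event-loop pitfall from the Q1 answer can be shown with a minimal FastAPI sketch; the route names and DataFrame are hypothetical, not the project's service code:

```python
# Sketch of the async pitfall: CPU-bound Pandas work inside `async def`
# blocks the single-threaded event loop, because `await` only yields at IO
# points. A plain `def` route is dispatched to FastAPI's thread pool instead.
from fastapi import FastAPI
import pandas as pd

app = FastAPI()
df = pd.DataFrame({"user_id": range(1_000_000), "score": 0.5})

@app.get("/bad/{user_id}")
async def bad(user_id: int):
    # CPU-bound filtering runs on the event loop and stalls all other requests
    return df[df["user_id"] == user_id].to_dict("records")

@app.get("/good/{user_id}")
def good(user_id: int):
    # Sync handler: FastAPI runs this on a worker thread automatically
    return df[df["user_id"] == user_id].to_dict("records")
```

And the Swing denominator from the first follow-up question, as a minimal sketch assuming plain dict inputs (not the project's recall implementation):

```python
# Swing similarity sketch: for an item pair (i, j), sum over user pairs (u, v)
# that both touched i and j. The 1 / (alpha + |I_u ∩ I_v|) term down-weights
# user pairs that already share many items (tight cliques), which is exactly
# the noise-resistance property the question asks about.
from itertools import combinations

def swing_similarity(item_users: dict[str, set[str]],
                     user_items: dict[str, set[str]],
                     i: str, j: str, alpha: float = 1.0) -> float:
    common_users = item_users[i] & item_users[j]
    score = 0.0
    for u, v in combinations(sorted(common_users), 2):
        overlap = len(user_items[u] & user_items[v])  # |I_u ∩ I_v|
        score += 1.0 / (alpha + overlap)
    return score
```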
requirements.txt
CHANGED
@@ -39,6 +39,9 @@ scikit-learn
 scipy
 requests
 
+# Intent classifier backends (optional)
+# fasttext  # Uncomment for the FastText backend: pip install fasttext
+
 # LLM Agent & Fine-tuning
 faiss-cpu
 diffusers
scripts/data/build_sequences.py
CHANGED
@@ -4,24 +4,23 @@ Build User Sequences for Sequential Models (SASRec, YoutubeDNN)
 
 Converts user interaction history into padded sequences for training.
 
+TIME-SPLIT (strict): Uses train.csv ONLY for sequences and item_map.
+This prevents leakage when Ranking uses SASRec embeddings as features:
+- Val/test samples must not appear in user history when computing sasrec_score.
+- Recall models (SASRec, YoutubeDNN) overwrite these with their own train-only output.
+
 Usage:
     python scripts/data/build_sequences.py
 
 Input:
-    - data/rec/train.csv
+    - data/rec/train.csv (val.csv and test.csv exist but are NOT used for sequences)
 
 Output:
-    - data/rec/user_sequences.pkl (Dict[user_id, List[item_id]])
-    - data/rec/item_map.pkl (Dict[isbn, item_id])
-
-Notes:
-    - Item IDs are 1-indexed (0 is reserved for padding)
-    - Sequences are truncated to max_len (default: 50)
-    - Test item is excluded from sequences (used for evaluation)
+    - data/rec/user_sequences.pkl (Dict[user_id, List[item_id]], train-only)
+    - data/rec/item_map.pkl (Dict[isbn, item_id], train-only)
 """
 
 import pandas as pd
-import numpy as np
 import pickle
 import logging
 from pathlib import Path

@@ -30,83 +29,43 @@ from tqdm import tqdm
 logging.basicConfig(level=logging.INFO)
 logger = logging.getLogger(__name__)
 
+
+def build_sequences(data_dir="data/rec", max_len=50):
     """
+    Build user sequences from train.csv only (strict time-split).
+    Val/test are excluded to avoid leakage in ranking features (sasrec_score).
     """
-    logger.info("Building user sequences...")
+    logger.info("Building user sequences (train-only, time-split)...")
+    train_df = pd.read_csv(f"{data_dir}/train.csv")
+
+    # 1. Item map from train only (matches SASRec/YoutubeDNN)
+    items = train_df["isbn"].unique()
+    item_map = {isbn: i + 1 for i, isbn in enumerate(items)}
+    logger.info("  Items (train): %d", len(item_map))
+
+    # 2. User history from train only (no val/test)
+    user_history = {}
+    if "timestamp" in train_df.columns:
+        train_df = train_df.sort_values(["user_id", "timestamp"])
+    for _, row in tqdm(train_df.iterrows(), total=len(train_df), desc="  Processing"):
+        u = str(row["user_id"])
+        item = item_map.get(row["isbn"])
+        if item is None:
+            continue
+        if u not in user_history:
+            user_history[u] = []
+        user_history[u].append(item)
+
+    final_seqs = {u: hist[-max_len:] for u, hist in user_history.items()}
+    logger.info("  Users: %d", len(final_seqs))
+
+    data_dir = Path(data_dir)
+    data_dir.mkdir(parents=True, exist_ok=True)
+    with open(data_dir / "item_map.pkl", "wb") as f:
         pickle.dump(item_map, f)
-
-    # 2. Group by User and Sort by Time
-    # Note: Our data split script ALREADY sorted by time.
-    # But let's be safe. We need original timestamps if possible.
-    # 'train.csv' doesn't have timestamp column? let me check split_rec_data.
-    # Ah, split_rec_data removed it. But rows are ordered.
-    # Actually, we can just group by user_id and assume rows are chronological
-    # IF we process train -> val -> test order.
-
-    # Let's reconstruct full history per user
-    logger.info("Grouping user history...")
-
-    # Optimization: processing via dictionary is faster than groupby on large df
-    user_history = {}  # user_id -> list of item_ids
-
-    def process_df(df):
-        for _, row in tqdm(df.iterrows(), total=len(df), desc="Processing"):
-            u = row['user_id']
-            item = item_map[row['isbn']]
-            if u not in user_history:
-                user_history[u] = []
-            user_history[u].append(item)
-
-    # Process in chronological order: Train -> Val -> Test
-    # Wait, split_rec_data puts last item in test, 2nd last in val.
-    # So correct order is: Train rows + Val row + Test row.
-
-    # BUT, train.csv has multiple rows per user. They are already sorted by time in split logic.
-    process_df(train_df)
-    process_df(val_df)
-
-    # We leave Test item out of the input sequence!
-    # Test item is the target for evaluation.
-    # For training SASRec, we use (seq[:-1]) -> predict (seq[1:]).
-
-    # 3. Create Dataset
-    # Output:
-    #   train_seqs: Dict[user_id, list_of_ints]
-
-    # Pad/Truncate
-    final_seqs = {}
-
-    for u, history in user_history.items():
-        # Truncate to max_len
-        seq = history[-max_len:]
-        final_seqs[u] = seq
-
-    logger.info(f"Processed {len(final_seqs)} users.")
-
-    # Save sequences
-    with open(f'{data_dir}/user_sequences.pkl', 'wb') as f:
+    with open(data_dir / "user_sequences.pkl", "wb") as f:
         pickle.dump(final_seqs, f)
-
-    logger.info("Sequence data saved.")
+    logger.info("Sequence data saved (train-only).")
 
 if __name__ == "__main__":
     build_sequences()
scripts/data/fetch_new_books.py
ADDED
@@ -0,0 +1,322 @@
#!/usr/bin/env python3
"""
Incremental Book Update Script.

Fetches recently published books from the Google Books API and adds them to the local database.
Can be run manually or scheduled via cron for periodic updates.

Usage:
    python scripts/data/fetch_new_books.py [--categories CATEGORIES] [--year YEAR] [--max MAX]

Examples:
    # Fetch new books from the current year (default behavior)
    python scripts/data/fetch_new_books.py --categories "fiction" --max 50

    # Fetch new books across multiple categories
    python scripts/data/fetch_new_books.py --categories "fiction,mystery,science fiction"

    # Explicitly specify a year filter
    python scripts/data/fetch_new_books.py --year 2026 --categories "thriller"

    # Dry run (show what would be added without actually adding)
    python scripts/data/fetch_new_books.py --dry-run --categories "thriller"
"""
import argparse
import sys
import time
from pathlib import Path
from datetime import datetime
from typing import Optional

# Add project root to path
PROJECT_ROOT = Path(__file__).resolve().parent.parent.parent
sys.path.insert(0, str(PROJECT_ROOT))

from src.utils import setup_logger
from src.core.web_search import search_new_books_by_category, search_google_books
from src.core.metadata_store import metadata_store
from src.recommender import BookRecommender

logger = setup_logger(__name__)

# Default categories to search
DEFAULT_CATEGORIES = [
    "fiction",
    "mystery",
    "thriller",
    "science fiction",
    "fantasy",
    "romance",
    "biography",
    "history",
    "self-help",
    "business",
]


def fetch_trending_books(
    categories: list[str],
    year: Optional[int] = None,
    max_per_category: int = 20,
    dry_run: bool = False,
) -> dict:
    """
    Fetch recently published books from Google Books for the given categories.

    Args:
        categories: List of book categories to search
        year: Filter by publication year (default: current year)
        max_per_category: Max books to fetch per category
        dry_run: If True, don't actually add books to the database

    Returns:
        Dict with stats: {added: int, skipped: int, errors: int, books: list}
    """
    if year is None:
        year = datetime.now().year

    stats = {
        "added": 0,
        "skipped": 0,
        "errors": 0,
        "books": [],
    }

    recommender = None
    if not dry_run:
        recommender = BookRecommender()

    for category in categories:
        logger.info(f"Fetching books for category: {category} (year >= {year})")

        try:
            books = search_new_books_by_category(
                category=category,
                year=year,
                max_results=max_per_category
            )

            logger.info(f"  Found {len(books)} books in '{category}'")

            for book in books:
                isbn = book.get("isbn13", "")
                if not isbn:
                    continue

                # Check if already exists
                if metadata_store.book_exists(isbn):
                    stats["skipped"] += 1
                    continue

                if dry_run:
                    logger.info(f"  [DRY RUN] Would add: {book.get('title', 'Unknown')} ({isbn})")
                    stats["books"].append(book)
                    stats["added"] += 1
                else:
                    result = recommender.add_new_book(
                        isbn=isbn,
                        title=book.get("title", ""),
                        author=book.get("authors", "Unknown"),
                        description=book.get("description", ""),
                        category=book.get("simple_categories", category),
                        thumbnail=book.get("thumbnail"),
                        published_date=book.get("publishedDate", ""),
                    )

                    if result:
                        stats["added"] += 1
                        stats["books"].append(book)
                        logger.info(f"  Added: {book.get('title', 'Unknown')} ({isbn})")
                    else:
                        stats["errors"] += 1

                # Rate limiting: avoid hitting API limits
                time.sleep(0.1)

            # Pause between categories
            time.sleep(0.5)

        except Exception as e:
            logger.error(f"Error fetching category '{category}': {e}")
            stats["errors"] += 1

    return stats


def fetch_by_query(
    queries: list[str],
    max_per_query: int = 20,
    dry_run: bool = False,
) -> dict:
    """
    Fetch books by specific search queries (e.g., "AI books 2024", "new thriller novels").

    Args:
        queries: List of search queries
        max_per_query: Max books per query
        dry_run: If True, don't actually add books

    Returns:
        Stats dict
    """
    stats = {
        "added": 0,
        "skipped": 0,
        "errors": 0,
        "books": [],
    }

    recommender = None
    if not dry_run:
        recommender = BookRecommender()

    for query in queries:
        logger.info(f"Searching: {query}")

        try:
            books = search_google_books(query, max_results=max_per_query)
            logger.info(f"  Found {len(books)} results")

            for book in books:
                isbn = book.get("isbn13", "")
                if not isbn:
                    continue

                if metadata_store.book_exists(isbn):
                    stats["skipped"] += 1
                    continue

                if dry_run:
                    logger.info(f"  [DRY RUN] Would add: {book.get('title', 'Unknown')}")
                    stats["books"].append(book)
                    stats["added"] += 1
                else:
                    result = recommender.add_new_book(
                        isbn=isbn,
                        title=book.get("title", ""),
                        author=book.get("authors", "Unknown"),
                        description=book.get("description", ""),
                        category=book.get("simple_categories", "General"),
                        thumbnail=book.get("thumbnail"),
                        published_date=book.get("publishedDate", ""),
                    )

                    if result:
                        stats["added"] += 1
                        stats["books"].append(book)
                    else:
                        stats["errors"] += 1

                time.sleep(0.1)

            time.sleep(0.5)

        except Exception as e:
            logger.error(f"Error with query '{query}': {e}")
            stats["errors"] += 1

    return stats


def print_stats(stats: dict, dry_run: bool = False):
    """Print summary statistics."""
    prefix = "[DRY RUN] " if dry_run else ""
    print(f"\n{prefix}=== Fetch Complete ===")
    print(f"  Books added:   {stats['added']}")
    print(f"  Books skipped: {stats['skipped']} (already in database)")
    print(f"  Errors:        {stats['errors']}")

    if stats["books"] and dry_run:
        print("\nBooks that would be added:")
        for book in stats["books"][:10]:
            print(f"  - {book.get('title', 'Unknown')} by {book.get('authors', 'Unknown')}")
        if len(stats["books"]) > 10:
            print(f"  ... and {len(stats['books']) - 10} more")


def main():
    parser = argparse.ArgumentParser(
        description="Fetch new books from the Google Books API",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog=__doc__
    )

    parser.add_argument(
        "--categories",
        type=str,
        default=None,
        help="Comma-separated list of categories (default: all common categories)"
    )
    parser.add_argument(
        "--queries",
        type=str,
        default=None,
        help="Comma-separated list of custom search queries"
    )
    parser.add_argument(
        "--year",
        type=int,
        default=None,
        help="Filter by publication year (default: current year)"
    )
    parser.add_argument(
        "--max",
        type=int,
        default=20,
        help="Max books per category/query (default: 20)"
    )
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="Show what would be added without actually adding"
    )
    parser.add_argument(
        "--verbose",
        "-v",
        action="store_true",
        help="Enable verbose logging"
    )

    args = parser.parse_args()

    # Parse categories
    if args.categories:
        categories = [c.strip() for c in args.categories.split(",")]
    else:
        categories = DEFAULT_CATEGORIES

    # Parse queries
    queries = None
    if args.queries:
        queries = [q.strip() for q in args.queries.split(",")]

    print("Book Fetch Configuration:")
    print(f"  Categories:   {categories if not queries else 'N/A (using queries)'}")
    print(f"  Queries:      {queries or 'N/A (using categories)'}")
    print(f"  Year filter:  >= {args.year or datetime.now().year}")
    print(f"  Max per item: {args.max}")
    print(f"  Dry run:      {args.dry_run}")
    print()

    # Fetch books
    if queries:
        stats = fetch_by_query(
            queries=queries,
            max_per_query=args.max,
            dry_run=args.dry_run,
        )
    else:
        stats = fetch_trending_books(
            categories=categories,
            year=args.year,
            max_per_category=args.max,
            dry_run=args.dry_run,
        )

    print_stats(stats, args.dry_run)

    return 0 if stats["errors"] == 0 else 1


if __name__ == "__main__":
    sys.exit(main())
scripts/data/validate_data.py
CHANGED
@@ -144,9 +144,27 @@ def validate_rec():
         print(f"  User sequences: {len(seqs):,}")
         avg_len = np.mean([len(s) for s in seqs.values()])
         print(f"  Avg sequence length: {avg_len:.1f}")
+
+        # Time-split: no val items in sequences (prevents sasrec_score leakage)
+        if ITEM_MAP.exists():
+            with open(ITEM_MAP, "rb") as f:
+                item_map = pickle.load(f)
+            leaked = 0
+            for _, row in val.iterrows():
+                uid, val_isbn = str(row["user_id"]), str(row["isbn"])
+                if uid not in seqs:
+                    continue
+                val_iid = item_map.get(val_isbn)
+                if val_iid is None:
+                    continue  # val item not in map (train-only) -> no leak possible
+                if val_iid in seqs[uid]:
+                    leaked += 1
+            check(leaked == 0, f"Time-split violation: {leaked} users have val items in sequence")
+            print("  ✅ Time-split OK (no val in sequences)")
     else:
         print("  ⚠️ User sequences not found (run build_sequences.py)")
+
     print("  ✅ Rec data validation passed")
     return True
scripts/model/evaluate_rag.py
ADDED
@@ -0,0 +1,169 @@
#!/usr/bin/env python3
"""
Evaluate RAG retrieval on a Golden Test Set.

Replaces "curated examples" with quantitative metrics: Accuracy@K, Recall@K, MRR@K.
Uses human-annotated Query-Book pairs for data-driven evaluation.

Usage:
    python scripts/model/evaluate_rag.py
    python scripts/model/evaluate_rag.py --golden data/rag_golden.csv --top_k 10

Golden set format (CSV): query, isbn, relevance
    - query: user search string
    - isbn: expected relevant book (relevance=1 means relevant)
    - Multiple rows per query = multiple relevant books
"""

import sys
from pathlib import Path

sys.path.insert(0, str(Path(__file__).resolve().parent.parent.parent))

import pandas as pd
import logging
from collections import defaultdict

from src.recommender import BookRecommender

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger(__name__)


def load_golden(path: Path) -> dict[str, set[str]]:
    """Load golden set: {query -> set of relevant isbns}."""
    df = pd.read_csv(path, comment="#")
    if "relevance" in df.columns:
        df = df[df["relevance"] == 1]  # Only relevant pairs
    golden = defaultdict(set)
    for _, row in df.iterrows():
        q = str(row["query"]).strip()
        isbn = str(row["isbn"]).strip().replace(".0", "")
        if q and isbn:
            golden[q].add(isbn)
    return dict(golden)


def evaluate_rag(
    golden_path: Path | str = "data/rag_golden.csv",
    top_k: int = 10,
    use_title_match: bool = True,
) -> dict:
    """
    Run RAG retrieval on the golden set and compute metrics.

    Returns: dict with accuracy_at_k, recall_at_k, mrr_at_k, n_queries
    """
    golden_path = Path(golden_path)
    if not golden_path.exists():
        # Fall back to the example file
        alt = Path("data/rag_golden.example.csv")
        if alt.exists():
            logger.warning("Golden set not found at %s, using %s", golden_path, alt)
            golden_path = alt
        else:
            raise FileNotFoundError(
                f"Golden set not found. Create {golden_path} with columns: query,isbn,relevance. "
                "See data/rag_golden.example.csv for the format."
            )

    golden = load_golden(golden_path)
    if not golden:
        raise ValueError("Golden set is empty")

    logger.info("Evaluating RAG on %d queries from %s", len(golden), golden_path)

    recommender = BookRecommender()
    isbn_to_title = {}
    if use_title_match:
        try:
            bp = Path("data/books_processed.csv")
            if not bp.exists():
                bp = Path(__file__).resolve().parent.parent.parent / "data" / "books_processed.csv"
            books = pd.read_csv(bp, usecols=["isbn13", "title"])
            books["isbn13"] = books["isbn13"].astype(str).str.replace(r"\.0$", "", regex=True)
            isbn_to_title = books.set_index("isbn13")["title"].to_dict()
        except Exception as e:
            logger.warning("Could not load title map: %s", e)
            use_title_match = False

    hits_acc = 0
    recall_sum = 0.0
    mrr_sum = 0.0

    for query, relevant_isbns in golden.items():
        try:
            recs = recommender.get_recommendations(query, top_k=top_k * 2)
            rec_isbns = [r.get("isbn") or r.get("isbn13") for r in recs if r]
            rec_isbns = [str(x).replace(".0", "") for x in rec_isbns if pd.notna(x)]
            rec_top = rec_isbns[:top_k]

            # Match: exact ISBN or (optionally) identical title
            def _match(target: str, cand_list: list) -> int:
                for i, c in enumerate(cand_list):
                    if str(c).strip() == str(target).strip():
                        return i
                    if use_title_match:
                        t_title = isbn_to_title.get(str(target), "").lower().strip()
                        c_title = isbn_to_title.get(str(c), "").lower().strip()
                        if t_title and c_title and t_title == c_title:
                            return i
                return -1

            # Accuracy@K: at least one relevant item in the top-K
            found_any = False
            first_rank = top_k + 1
            count_in_top = 0

            for rel in relevant_isbns:
                rk = _match(rel, rec_top)
                if rk >= 0:
                    found_any = True
                    count_in_top += 1
                    first_rank = min(first_rank, rk + 1)

            if found_any:
                hits_acc += 1
            recall_sum += count_in_top / len(relevant_isbns) if relevant_isbns else 0
            if first_rank <= top_k:
                mrr_sum += 1.0 / first_rank

        except Exception as e:
            logger.warning("Query %r failed: %s", query[:50], e)

    n = len(golden)
    return {
        "accuracy_at_k": hits_acc / n,
        "recall_at_k": recall_sum / n,
        "mrr_at_k": mrr_sum / n,
        "n_queries": n,
        "top_k": top_k,
    }


def main():
    import argparse
    parser = argparse.ArgumentParser(description="Evaluate RAG on the Golden Test Set")
    parser.add_argument("--golden", default="data/rag_golden.csv", help="Path to golden CSV")
    parser.add_argument("--top_k", type=int, default=10)
    parser.add_argument("--no-title-match", action="store_true", help="Disable relaxed title matching")
    args = parser.parse_args()

    m = evaluate_rag(
        golden_path=args.golden,
        top_k=args.top_k,
        use_title_match=not args.no_title_match,
    )

    print("\n" + "=" * 50)
    print("  RAG Golden Test Set Evaluation")
    print("=" * 50)
    print(f"  Queries: {m['n_queries']}")
    print(f"  Top-K:   {m['top_k']}")
    print(f"  Accuracy@{m['top_k']}: {m['accuracy_at_k']:.4f} (any relevant in top-K)")
    print(f"  Recall@{m['top_k']}:   {m['recall_at_k']:.4f} (fraction of relevant in top-K)")
    print(f"  MRR@{m['top_k']}:      {m['mrr_at_k']:.4f} (mean reciprocal rank)")
    print("=" * 50)


if __name__ == "__main__":
    main()
scripts/model/train_din_ranker.py
ADDED
@@ -0,0 +1,264 @@
#!/usr/bin/env python3
"""
Train DIN (Deep Interest Network) ranker.

Uses attention over the user behavior sequence w.r.t. the target item.
Reuses SASRec item embeddings as initialization when available.

Usage:
    python scripts/model/train_din_ranker.py
    python scripts/model/train_din_ranker.py --max_samples 10000 --epochs 10

Input:
    - data/rec/val.csv, train.csv
    - data/rec/user_sequences.pkl, item_map.pkl (from SASRec/YoutubeDNN)
    - data/model/rec/sasrec_model.pth (optional, for init)

Output:
    - data/model/ranking/din_ranker.pt
"""

import sys
import os

sys.path.append(os.getcwd())

import pickle
import logging
from pathlib import Path

import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm

from src.ranking.din import DIN
from src.recall.fusion import RecallFusion

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)


def build_din_data(
    data_dir: str = "data/rec",
    model_dir: str = "data/model/recall",
    neg_ratio: int = 4,
    max_samples: int = 20000,
) -> tuple[pd.DataFrame, dict, dict]:
    """
    Build (user_id, isbn, label) samples with hard negatives.
    Returns (df, user_sequences, item_map).
    """
    logger.info("Building DIN training data...")
    val_df = pd.read_csv(f"{data_dir}/val.csv")
    all_items = pd.read_csv(f"{data_dir}/train.csv")["isbn"].astype(str).unique()

    if len(val_df) > max_samples:
        val_df = val_df.sample(n=max_samples, random_state=42).reset_index(drop=True)

    fusion = RecallFusion(data_dir, model_dir)
    fusion.load_models()

    with open(f"{data_dir}/user_sequences.pkl", "rb") as f:
        user_sequences = pickle.load(f)
    with open(f"{data_dir}/item_map.pkl", "rb") as f:
        item_map = pickle.load(f)

    rows = []
    for _, row in tqdm(val_df.iterrows(), total=len(val_df), desc="Mining samples"):
        user_id = str(row["user_id"])
        pos_isbn = str(row["isbn"])

        user_rows = [{"user_id": user_id, "isbn": pos_isbn, "label": 1}]

        # Hard negatives: high-scoring recall items the user did not pick
        try:
            recall_items = fusion.get_recall_items(user_id, k=50)
            hard_negs = [item for item, _ in recall_items if item != pos_isbn][:neg_ratio]
        except Exception:
            hard_negs = []

        for neg_isbn in hard_negs:
            user_rows.append({"user_id": user_id, "isbn": str(neg_isbn), "label": 0})

        # Fill up with random negatives if recall returned too few
        n_remaining = neg_ratio - len(hard_negs)
        if n_remaining > 0:
            random_negs = np.random.choice(all_items, size=n_remaining, replace=False)
            for neg_isbn in random_negs:
                user_rows.append({"user_id": user_id, "isbn": str(neg_isbn), "label": 0})

        rows.extend(user_rows)

    df = pd.DataFrame(rows)
    logger.info(f"Built {len(df)} samples")
    return df, user_sequences, item_map


class DINDataset(Dataset):
    """Dataset for DIN: (user_hist, target_item_id, label) and optional aux features."""

    def __init__(
        self,
        df: pd.DataFrame,
        user_sequences: dict,
        item_map: dict,
        max_hist_len: int = 50,
        aux_df: pd.DataFrame | None = None,
        aux_cols: list[str] | None = None,
    ):
        self.samples = []
        self.max_hist_len = max_hist_len
        self.aux_df = aux_df
        self.aux_cols = aux_cols or []
        for idx, (_, row) in enumerate(df.iterrows()):
            user_id = str(row["user_id"])
            isbn = str(row["isbn"])
            label = int(row["label"])
            target_id = item_map.get(isbn, 0)
            if target_id == 0:
                continue

            hist = user_sequences.get(user_id, [])
            if hist and isinstance(hist[0], str):
                hist = [item_map.get(h, 0) for h in hist if item_map.get(h, 0) > 0]
            hist = [x for x in hist if x != target_id][-max_hist_len:]

            self.samples.append((hist, target_id, label, idx))

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, idx: int):
        hist, target_id, label, df_idx = self.samples[idx]
        padded = np.zeros(self.max_hist_len, dtype=np.int64)  # 0 = padding id
        padded[: len(hist)] = hist
        out = (
            torch.LongTensor(padded),
            torch.LongTensor([target_id]).squeeze(0),
            torch.FloatTensor([label]).squeeze(0),
        )
        if self.aux_df is not None and self.aux_cols:
            aux_row = self.aux_df.iloc[df_idx][self.aux_cols].values.astype(np.float32)
            out = out + (torch.FloatTensor(aux_row),)
        return out


def train_din(
    data_dir: str = "data/rec",
    model_dir: str = "data/model",
    recall_dir: str = "data/model/recall",
    max_samples: int = 20000,
    max_hist_len: int = 50,
    embed_dim: int = 64,
    epochs: int = 10,
    batch_size: int = 256,
    lr: float = 1e-3,
    use_aux: bool = False,
) -> None:
    rank_dir = Path(model_dir) / "ranking"
    rank_dir.mkdir(parents=True, exist_ok=True)

    df, user_sequences, item_map = build_din_data(
        data_dir, recall_dir, neg_ratio=4, max_samples=max_samples
    )
    num_items = len(item_map)

    aux_df = None
    aux_cols: list[str] = []
    if use_aux:
        from src.ranking.features import FeatureEngineer
        fe = FeatureEngineer(data_dir, recall_dir)
        fe.load_base_data()
        logger.info("Generating aux features for DIN...")
        aux_df = fe.create_dateset(df)
        aux_cols = [c for c in aux_df.columns if c not in ("label", "user_id", "isbn")]
        logger.info("Aux features: %s", aux_cols)

    num_aux = len(aux_cols)
    dataset = DINDataset(
        df, user_sequences, item_map,
        max_hist_len=max_hist_len,
        aux_df=aux_df,
        aux_cols=aux_cols if aux_cols else None,
    )
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, num_workers=0)

    # Optional warm start from SASRec item embeddings
    pretrained_emb = None
    sasrec_path = Path(model_dir) / "rec" / "sasrec_model.pth"
    if sasrec_path.exists():
        try:
            state = torch.load(sasrec_path, map_location="cpu", weights_only=False)
            emb = state.get("item_emb.weight")
            if emb is not None:
                pretrained_emb = emb.numpy()
                logger.info("Loaded SASRec item_emb for DIN init: %s", pretrained_emb.shape)
        except Exception as e:
            logger.warning("Could not load SASRec init: %s", e)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    if torch.backends.mps.is_available():
        device = torch.device("mps")

    model = DIN(
        num_items=num_items,
        embed_dim=embed_dim,
        max_hist_len=max_hist_len,
        num_aux=num_aux,
        pretrained_item_emb=pretrained_emb,
    ).to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)

    for ep in range(epochs):
        model.train()
        total_loss = 0.0
        n_batches = 0
        for batch in tqdm(loader, desc=f"Epoch {ep+1}/{epochs}"):
            hist = batch[0].to(device)
            target = batch[1].to(device)
            label = batch[2].to(device)
            aux = batch[3].to(device) if len(batch) > 3 else None
            opt.zero_grad()
            logits = model(hist, target, aux)
            loss = F.binary_cross_entropy_with_logits(logits, label)
            loss.backward()
            opt.step()
            total_loss += loss.item()
            n_batches += 1
        avg = total_loss / max(n_batches, 1)
        logger.info(f"Epoch {ep+1} loss: {avg:.4f}")

    ckpt = {
        "model": model,
        "item_map": item_map,
        "max_hist_len": max_hist_len,
        "aux_feature_names": aux_cols,
    }
    out_path = rank_dir / "din_ranker.pt"
    torch.save(ckpt, out_path)
    logger.info("DIN ranker saved to %s", out_path)

    # Rewrite user_sequences.pkl so serving uses exactly what DIN saw in training
    with open(Path(data_dir) / "user_sequences.pkl", "wb") as f:
        pickle.dump(user_sequences, f)


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Train DIN ranker")
    parser.add_argument("--max_samples", type=int, default=20000)
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch_size", type=int, default=256)
    parser.add_argument("--aux", action="store_true", help="Use aux features from FeatureEngineer")
    args = parser.parse_args()

    train_din(
        max_samples=args.max_samples,
        epochs=args.epochs,
        batch_size=args.batch_size,
        use_aux=args.aux,
    )
scripts/model/train_intent_router.py
ADDED
|
@@ -0,0 +1,156 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
+#!/usr/bin/env python3
+"""
+Train model-based intent classifier for Query Router.
+
+Replaces rule-based heuristics with TF-IDF + LogisticRegression (or FastText/DistilBERT).
+Uses synthetic seed data; extend with real labeled queries via --data CSV.
+
+Usage:
+    python scripts/model/train_intent_router.py
+    python scripts/model/train_intent_router.py --data data/intent_labels.csv
+    python scripts/model/train_intent_router.py --backend fasttext
+    python scripts/model/train_intent_router.py --backend distilbert
+
+Output:
+    data/model/intent_classifier.pkl (or .bin for fasttext)
+"""
+
+import sys
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent.parent))
+
+import joblib
+import logging
+
+import pandas as pd
+
+from src.core.intent_classifier import train_classifier, INTENTS
+
+logging.basicConfig(level=logging.INFO, format="%(message)s")
+logger = logging.getLogger(__name__)
+
+# Synthetic training data: (query, intent)
+# Extend with real user queries for better generalization
+SEED_DATA = [
+    # small_to_big: detail-oriented, plot/review focused
+    ("book with twist ending", "small_to_big"),
+    ("unreliable narrator", "small_to_big"),
+    ("spoiler about the ending", "small_to_big"),
+    ("what did readers think", "small_to_big"),
+    ("opinion on the book", "small_to_big"),
+    ("hidden details in the story", "small_to_big"),
+    ("did anyone cry reading this", "small_to_big"),
+    ("review of the book", "small_to_big"),
+    ("plot twist reveal", "small_to_big"),
+    ("unreliable narrator twist", "small_to_big"),
+    ("readers who loved the ending", "small_to_big"),
+    ("spoiler what happens at the end", "small_to_big"),
+    # fast: short keyword queries
+    ("AI book", "fast"),
+    ("Python", "fast"),
+    ("romance", "fast"),
+    ("machine learning", "fast"),
+    ("science fiction", "fast"),
+    ("best AI book", "fast"),
+    ("Python programming", "fast"),
+    ("self help", "fast"),
+    ("business", "fast"),
+    ("fiction", "fast"),
+    ("thriller", "fast"),
+    ("mystery novel", "fast"),
+    ("finance", "fast"),
+    ("history", "fast"),
+    ("psychology", "fast"),
+    ("data science", "fast"),
+    ("cooking", "fast"),
+    ("music", "fast"),
+    ("art", "fast"),
+    ("philosophy", "fast"),
+    # deep: natural language, complex queries
+    ("What are the best books about artificial intelligence for beginners", "deep"),
+    ("I'm looking for something similar to Harry Potter", "deep"),
+    ("Books that help you understand machine learning", "deep"),
+    ("Recommend me a book like Sapiens but about technology", "deep"),
+    ("I want to learn about psychology and human behavior", "deep"),
+    ("What should I read if I liked 1984", "deep"),
+    ("Looking for books on startup founding and entrepreneurship", "deep"),
+    ("Can you suggest books about climate change and sustainability", "deep"),
+    ("I need a book that explains quantum physics simply", "deep"),
+    ("Books for someone who wants to improve their writing skills", "deep"),
+    ("What are some good fiction books set in Japan", "deep"),
+    ("Recommendations for someone getting into philosophy", "deep"),
+    ("Books that discuss the future of work and automation", "deep"),
+    ("I'm interested in biographies of scientists", "deep"),
+    ("Something light and funny for a long flight", "deep"),
+    ("Books about the history of mathematics", "deep"),
+    ("Recommend me novels with strong female protagonists", "deep"),
+    ("What to read to understand economics", "deep"),
+    ("Books on meditation and mindfulness", "deep"),
+]
+
+
+def load_training_data(data_path: Path | None) -> tuple[list[str], list[str]]:
+    """Load (queries, labels) from SEED_DATA + optional CSV."""
+    queries = [q for q, _ in SEED_DATA]
+    labels = [l for _, l in SEED_DATA]
+
+    if data_path and data_path.exists():
+        df = pd.read_csv(data_path)
+        q_col = "query" if "query" in df.columns else df.columns[0]
+        l_col = "intent" if "intent" in df.columns else df.columns[1]
+        extra_q = df[q_col].astype(str).tolist()
+        extra_l = df[l_col].astype(str).tolist()
+        queries.extend(extra_q)
+        labels.extend(extra_l)
+        logger.info("Loaded %d extra samples from %s", len(extra_q), data_path)
+
+    return queries, labels
+
+
+def main():
+    import argparse
+    parser = argparse.ArgumentParser(description="Train intent classifier")
+    parser.add_argument("--data", type=Path, default=None, help="CSV with query,intent columns")
+    parser.add_argument("--backend", choices=["tfidf", "fasttext", "distilbert"], default="tfidf")
+    args = parser.parse_args()
+
+    project_root = Path(__file__).resolve().parent.parent.parent
+    out_dir = project_root / "data" / "model"
+    out_dir.mkdir(parents=True, exist_ok=True)
+
+    queries, labels = load_training_data(args.data)
+
+    logger.info("Training intent classifier (%s) on %d samples...", args.backend, len(queries))
+    result = train_classifier(queries, labels, backend=args.backend)
+
+    if args.backend == "fasttext":
+        out_path = out_dir / "intent_classifier.bin"
+        result.save_model(str(out_path))
+    else:
+        out_path = out_dir / "intent_classifier.pkl"
+        if args.backend == "distilbert":
+            joblib.dump(result, out_path)  # dict with pipeline, backend, etc.
+        else:
+            joblib.dump({"pipeline": result, "backend": "tfidf"}, out_path)
+
+    logger.info("Saved to %s", out_path)
+
+    # Quick sanity check
+    for intent in INTENTS:
+        sample = next((q for q, l in zip(queries, labels) if l == intent), None)
+        if sample:
+            if args.backend == "fasttext":
+                pred = result.predict(sample)[0][0].replace("__label__", "")
+            elif args.backend == "distilbert":
+                from transformers import pipeline
+                pipe = pipeline("zero-shot-classification", model="distilbert-base-uncased", device=-1)
+                pred = pipe(sample, INTENTS, multi_label=False)["labels"][0]
+            else:
+                pred = result.predict([sample])[0]
+            ok = "✓" if pred == intent else "✗"
+            logger.info("  %s %s: %r -> %s", ok, intent, sample[:40], pred)


+if __name__ == "__main__":
+    main()
scripts/model/train_ranker.py
CHANGED
@@ -15,18 +15,15 @@ Input:
     data/rec/train.csv (for fallback random negatives)
     data/model/recall/*.pkl (recall models for hard negative mining)
 
-
-
-
-
-
-    - data/model/ranking/xgb_ranker.json (full retrained XGB)
-    - data/model/ranking/stacking_meta.pkl (LogisticRegression meta-model)
+TIME-SPLIT (no leakage):
+    - Recall models (SASRec, etc.) are trained on train.csv only.
+    - Ranking uses val.csv for labels; recall for hard negatives.
+    - sasrec_score and user_seq_emb come from train-only SASRec.
+    - Pipeline order: split -> build_sequences(train-only) -> recall(train) -> ranker(val).
 
 Negative Sampling Strategy:
     - Hard negatives: items from recall results that are NOT the positive
     - Random negatives: fill remaining slots if recall returns too few
-    - This teaches the ranker to distinguish between "close but wrong" vs "right"
 """
 
 import sys
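
A sketch of the time-ordered split the TIME-SPLIT note describes, assuming an interactions table with a timestamp column (both the file name and the column name here are assumptions; the project's real split lives in its data scripts):

```python
import pandas as pd

# Sort by time so the train/val boundary is chronological, not random
df = pd.read_csv("data/rec/interactions.csv").sort_values("timestamp")

cut = int(len(df) * 0.9)
train, val = df.iloc[:cut], df.iloc[cut:]

# Recall models (e.g., SASRec) may only see `train`; the ranker takes its
# labels from `val`, so no future interaction leaks into recall features.
train.to_csv("data/rec/train.csv", index=False)
val.to_csv("data/rec/val.csv", index=False)
```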
scripts/run_pipeline.py
CHANGED
@@ -44,6 +44,7 @@ class Pipeline:
         skip_models: bool = False,
         skip_index: bool = False,
         stacking: bool = False,
+        train_din: bool = False,
     ):
         self.project_root = Path(project_root)
         self.data_dir = self.project_root / "data"
@@ -53,6 +54,7 @@ class Pipeline:
         self.skip_models = skip_models
         self.skip_index = skip_index
         self.stacking = stacking
+        self.train_din = train_din
 
     def _run_step(self, name: str, fn, *args, **kwargs):
         """Run a step with timing log."""
@@ -153,8 +155,19 @@ class Pipeline:
         from scripts.model.train_ranker import train_ranker, train_stacking
         self._run_step("Train Ranker", train_stacking if self.stacking else train_ranker)
 
+        from scripts.model.train_intent_router import main as train_intent
+        self._run_step("Train intent classifier", train_intent)
+
+        if getattr(self, "train_din", False):
+            from scripts.model.train_din_ranker import train_din
+            self._run_step("Train DIN ranker", lambda: train_din(
+                data_dir=str(self.rec_dir),
+                model_dir=str(self.model_dir),
+                recall_dir=str(self.model_dir / "recall"),
+            ))
+
     def run_evaluation(self) -> None:
-        """Stage 5: Validation."""
+        """Stage 5: Validation + RAG Golden Test Set (if exists)."""
         def _validate():
             from scripts.data.validate_data import (
                 validate_raw, validate_processed, validate_rec,
@@ -168,6 +181,18 @@ class Pipeline:
 
         self._run_step("Validate pipeline", _validate)
 
+        # RAG Golden Test Set evaluation (optional)
+        golden = self.rec_dir.parent / "rag_golden.csv"
+        if not golden.exists():
+            golden = self.rec_dir.parent / "rag_golden.example.csv"
+        if golden.exists():
+            def _run_rag_eval():
+                from scripts.model.evaluate_rag import evaluate_rag
+                m = evaluate_rag(str(golden))
+                logger.info("RAG Accuracy@%d: %.4f  Recall@%d: %.4f  MRR@%d: %.4f",
+                            m["top_k"], m["accuracy_at_k"], m["top_k"], m["recall_at_k"], m["top_k"], m["mrr_at_k"])
+            self._run_step("RAG Golden Test Set", _run_rag_eval)
+
     def run(self, stage: str = "all") -> None:
         """Execute full pipeline: Data Cleaning -> Training -> Evaluation."""
         logger.info("=" * 60)
@@ -201,6 +226,7 @@ def main():
     parser.add_argument("--validate-only", action="store_true", help="Only run validation")
     parser.add_argument("--device", default=None, help="Device for ML (cpu/cuda/mps)")
    parser.add_argument("--stacking", action="store_true", help="Enable stacking ranker")
+    parser.add_argument("--din", action="store_true", help="Train DIN ranker (deep model)")
     args = parser.parse_args()
 
     if args.validate_only:
@@ -213,6 +239,7 @@ def main():
         skip_models=args.skip_models,
         skip_index=args.skip_index,
         stacking=args.stacking,
+        train_din=args.din,
     )
     pipeline.run(stage=args.stage)
 
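
With the flag wired through, the deep ranker can be enabled from the CLI (`python scripts/run_pipeline.py --stacking --din`) or from the Python API; a small sketch (only `stacking`, `train_din`, and `project_root` are visible in this diff, so other constructor defaults are assumptions):

```python
from scripts.run_pipeline import Pipeline

pipeline = Pipeline(
    project_root=".",   # assumed acceptable; the class wraps it in Path()
    stacking=True,      # LGBMRanker + XGB -> LogisticRegression meta-learner
    train_din=True,     # additionally train the DIN ranker after the main ranker
)
pipeline.run(stage="all")
```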
src/core/freshness_monitor.py
ADDED
@@ -0,0 +1,231 @@
+"""
+Data Freshness Monitor.
+
+Provides insights into the freshness of the local book database:
+- Distribution of books by publication year
+- Detection of data staleness
+- Recommendations for when to trigger updates
+
+This module helps the system decide when to rely on local data vs.
+triggering external API fallbacks.
+"""
+from datetime import datetime
+from typing import Optional
+
+from src.core.metadata_store import metadata_store
+from src.utils import setup_logger
+
+logger = setup_logger(__name__)
+
+
+class FreshnessMonitor:
+    """
+    Monitor data freshness and provide staleness detection.
+
+    Usage:
+        monitor = FreshnessMonitor()
+        stats = monitor.get_data_stats()
+        if monitor.is_stale_for_query("latest 2025 books"):
+            # Trigger web search fallback
+    """
+
+    # Years considered "recent" for freshness calculations
+    RECENT_YEARS_THRESHOLD = 2
+
+    def __init__(self):
+        self._cache = {}
+        self._cache_timestamp = None
+        self._cache_ttl_seconds = 300  # 5 minutes
+
+    def _is_cache_valid(self) -> bool:
+        """Check if cached stats are still valid."""
+        if not self._cache or not self._cache_timestamp:
+            return False
+        age = (datetime.now() - self._cache_timestamp).total_seconds()
+        return age < self._cache_ttl_seconds
+
+    def get_data_stats(self, force_refresh: bool = False) -> dict:
+        """
+        Get comprehensive statistics about data freshness.
+
+        Returns:
+            Dict with:
+            - total_books: Total number of books in database
+            - newest_year: Year of most recently published book
+            - oldest_year: Year of oldest book
+            - books_by_year: Dict mapping year -> count
+            - recent_books_count: Books published in last N years
+            - data_cutoff_year: Effective "knowledge cutoff" year
+            - freshness_score: 0-100 score indicating data freshness
+        """
+        if not force_refresh and self._is_cache_valid():
+            return self._cache
+
+        stats = {
+            "total_books": 0,
+            "newest_year": None,
+            "oldest_year": None,
+            "books_by_year": {},
+            "recent_books_count": 0,
+            "data_cutoff_year": None,
+            "freshness_score": 0,
+            "last_checked": datetime.now().isoformat(),
+        }
+
+        try:
+            stats["total_books"] = metadata_store.get_book_count()
+            stats["books_by_year"] = metadata_store.get_books_by_year_distribution()
+
+            if stats["books_by_year"]:
+                years = sorted(stats["books_by_year"].keys())
+                stats["newest_year"] = max(years)
+                stats["oldest_year"] = min(years)
+                stats["data_cutoff_year"] = stats["newest_year"]
+
+                # Count recent books (last N years)
+                current_year = datetime.now().year
+                recent_threshold = current_year - self.RECENT_YEARS_THRESHOLD
+                stats["recent_books_count"] = sum(
+                    count for year, count in stats["books_by_year"].items()
+                    if year >= recent_threshold
+                )
+
+                # Calculate freshness score (0-100)
+                # Based on: newest year relative to current year
+                years_behind = current_year - (stats["newest_year"] or current_year)
+                stats["freshness_score"] = max(0, 100 - (years_behind * 25))
+
+            self._cache = stats
+            self._cache_timestamp = datetime.now()
+
+        except Exception as e:
+            logger.error(f"FreshnessMonitor.get_data_stats failed: {e}")
+
+        return stats
+
+    def is_stale(self, target_year: Optional[int] = None) -> bool:
+        """
+        Check if local data is too old for a given target year.
+
+        Args:
+            target_year: Year the user is asking about (default: current year)
+
+        Returns:
+            True if data is stale and web fallback should be triggered
+        """
+        if target_year is None:
+            target_year = datetime.now().year
+
+        stats = self.get_data_stats()
+        newest_year = stats.get("newest_year")
+
+        if newest_year is None:
+            return True  # No data at all
+
+        # Stale if target year is newer than our newest data
+        return target_year > newest_year
+
+    def is_stale_for_query(self, query: str) -> bool:
+        """
+        Analyze a query and determine if data is stale for it.
+
+        Args:
+            query: User's search query
+
+        Returns:
+            True if web fallback should be triggered
+        """
+        from src.core.web_search import extract_year_from_query
+
+        target_year = extract_year_from_query(query)
+
+        if target_year is None:
+            # No year requirement - check freshness score
+            stats = self.get_data_stats()
+            # Trigger fallback if data is more than 2 years old
+            return stats.get("freshness_score", 100) < 50
+
+        return self.is_stale(target_year)
+
+    def get_coverage_for_year(self, year: int) -> dict:
+        """
+        Get coverage statistics for a specific year.
+
+        Returns:
+            Dict with: count, percentage of total, is_well_covered
+        """
+        stats = self.get_data_stats()
+        year_count = stats["books_by_year"].get(year, 0)
+        total = stats["total_books"] or 1
+
+        return {
+            "year": year,
+            "count": year_count,
+            "percentage": round(year_count / total * 100, 2),
+            "is_well_covered": year_count >= 100,  # Arbitrary threshold
+        }
+
+    def recommend_update_categories(self) -> list[str]:
+        """
+        Recommend categories that should be updated.
+
+        Returns:
+            List of category names that need fresh data
+        """
+        # This would require category-level year tracking
+        # For now, return common categories that benefit from freshness
+        return [
+            "fiction",
+            "thriller",
+            "science fiction",
+            "fantasy",
+            "mystery",
+            "self-help",
+            "business",
+        ]
+
+    def get_summary(self) -> str:
+        """
+        Get a human-readable summary of data freshness.
+
+        Returns:
+            Formatted string describing data freshness status
+        """
+        stats = self.get_data_stats()
+
+        lines = [
+            "Data Freshness Report",
+            "=" * 40,
+            f"Total books: {stats['total_books']:,}",
+            f"Newest book year: {stats['newest_year'] or 'Unknown'}",
+            f"Data cutoff: {stats['data_cutoff_year'] or 'Unknown'}",
+            f"Recent books (last {self.RECENT_YEARS_THRESHOLD} years): {stats['recent_books_count']:,}",
+            f"Freshness score: {stats['freshness_score']}/100",
+        ]
+
+        current_year = datetime.now().year
+        if stats["newest_year"] and stats["newest_year"] < current_year:
+            years_behind = current_year - stats["newest_year"]
+            lines.append("")
+            lines.append(f"WARNING: Data is {years_behind} year(s) behind current year.")
+            lines.append(f"Consider running: python scripts/data/fetch_new_books.py --year {current_year}")
+
+        return "\n".join(lines)
+
+
+# Global instance
+freshness_monitor = FreshnessMonitor()
+
+
+# Convenience function for quick checks
+def is_data_fresh_enough(query: str) -> bool:
+    """
+    Quick check if local data is fresh enough for a query.
+
+    Args:
+        query: User's search query
+
+    Returns:
+        True if local data is sufficient, False if web fallback recommended
+    """
+    return not freshness_monitor.is_stale_for_query(query)
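
How the monitor is meant to be driven, mirroring the class docstring (the printed numbers depend on the local catalog):

```python
from src.core.freshness_monitor import freshness_monitor, is_data_fresh_enough

stats = freshness_monitor.get_data_stats()
print(f"{stats['total_books']} books, newest year {stats['newest_year']}, "
      f"score {stats['freshness_score']}/100")

# Route-level check: False means the caller should enable the web fallback
if not is_data_fresh_enough("best sci-fi novels of 2025"):
    print("Local data stale for this query -> trigger Google Books fallback")

print(freshness_monitor.get_summary())
```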
src/core/intent_classifier.py
ADDED
@@ -0,0 +1,204 @@
+"""
+Model-based intent classifier for Query Router.
+
+Replaces brittle rule-based heuristics with a trained classifier.
+Backends: tfidf (default), fasttext, distilbert.
+
+Intents: small_to_big (detail), fast (keyword), deep (natural language)
+"""
+
+import logging
+from pathlib import Path
+from typing import Optional
+
+import joblib
+from sklearn.feature_extraction.text import TfidfVectorizer
+from sklearn.linear_model import LogisticRegression
+from sklearn.pipeline import Pipeline
+
+logger = logging.getLogger(__name__)
+
+INTENTS = ["small_to_big", "fast", "deep"]
+
+
+class IntentClassifier:
+    """
+    Intent classifier with pluggable backends:
+    - tfidf: TF-IDF + LogisticRegression (~1–2ms)
+    - fasttext: FastText (~1ms, requires fasttext package)
+    - distilbert: Zero-shot DistilBERT (~50–100ms, higher accuracy)
+    """
+
+    def __init__(self, model_path: Optional[Path] = None):
+        self.pipeline: Optional[Pipeline] = None
+        self._fasttext_model = None
+        self._distilbert_pipeline = None
+        self._backend = "tfidf"
+        self.model_path = Path(model_path) if model_path else None
+
+    def load(self, path: Optional[Path] = None) -> bool:
+        """Load trained model from disk."""
+        p = path or self.model_path
+        if not p:
+            return False
+        p = Path(p)
+        base = p.parent if p.suffix in (".pkl", ".bin") else p
+        pkl_path = p if p.suffix == ".pkl" else base / "intent_classifier.pkl"
+        bin_path = p if p.suffix == ".bin" else base / "intent_classifier.bin"
+
+        # Try .pkl first (tfidf or distilbert)
+        if pkl_path.exists():
+            try:
+                data = joblib.load(pkl_path)
+                if isinstance(data, dict):
+                    self.pipeline = data.get("pipeline")
+                    self._backend = data.get("backend", "tfidf")
+                    if self._backend == "distilbert":
+                        self._load_distilbert(data)
+                else:
+                    self.pipeline = data
+                self.model_path = pkl_path
+                logger.info("Intent classifier loaded from %s (backend=%s)", pkl_path, self._backend)
+                return True
+            except Exception as e:
+                logger.warning("Failed to load intent classifier: %s", e)
+
+        # Try .bin (FastText)
+        if bin_path.exists():
+            try:
+                import fasttext
+                self._fasttext_model = fasttext.load_model(str(bin_path))
+                self._backend = "fasttext"
+                self.model_path = bin_path
+                logger.info("Intent classifier loaded from %s (FastText)", bin_path)
+                return True
+            except ImportError:
+                logger.warning("FastText not installed; pip install fasttext")
+            except Exception as e:
+                logger.warning("Failed to load FastText: %s", e)
+
+        return False
+
+    def _load_distilbert(self, data: dict) -> None:
+        """Lazy-load DistilBERT pipeline from saved config."""
+        model_name = data.get("distilbert_model", "distilbert-base-uncased")
+        try:
+            from transformers import pipeline
+            self._distilbert_pipeline = pipeline(
+                "zero-shot-classification",
+                model=model_name,
+                device=-1,
+            )
+        except Exception as e:
+            logger.warning("DistilBERT pipeline load failed: %s", e)
+        self.pipeline = None  # Use distilbert, not sklearn pipeline
+
+    def predict(self, query: str) -> str:
+        """Predict intent for a query. Returns one of small_to_big, fast, deep."""
+        q = query.strip()
+        if not q:
+            return "deep"
+
+        if self._fasttext_model is not None:
+            pred = self._fasttext_model.predict(q)
+            return pred[0][0].replace("__label__", "")
+
+        if self._distilbert_pipeline is not None:
+            out = self._distilbert_pipeline(q, INTENTS, multi_label=False)
+            return out["labels"][0]
+
+        if self.pipeline is None:
+            raise RuntimeError("Intent classifier not loaded; call load() first")
+        return str(self.pipeline.predict([q])[0])
+
+    def predict_proba(self, query: str) -> dict[str, float]:
+        """Return intent probabilities for debugging."""
+        q = query.strip()
+        if not q:
+            return {i: 1.0 / len(INTENTS) for i in INTENTS}
+
+        if self._fasttext_model is not None:
+            pred = self._fasttext_model.predict(q, k=len(INTENTS))
+            return dict(zip([l.replace("__label__", "") for l in pred[0]], pred[1]))
+
+        if self._distilbert_pipeline is not None:
+            out = self._distilbert_pipeline(q, INTENTS, multi_label=False)
+            return dict(zip(out["labels"], out["scores"]))
+
+        if self.pipeline is None:
+            raise RuntimeError("Intent classifier not loaded")
+        probs = self.pipeline.predict_proba([q])[0]
+        last_step = self.pipeline.steps[-1][1]
+        classes = getattr(last_step, "classes_", INTENTS)
+        return dict(zip(classes, probs))
+
+
+def train_classifier(
+    queries: list[str],
+    labels: list[str],
+    max_features: int = 5000,
+    C: float = 1.0,
+    backend: str = "tfidf",
+):
+    """
+    Train intent classifier. Returns pipeline (tfidf), model (fasttext), or dict (distilbert).
+    """
+    if backend == "fasttext":
+        return _train_fasttext(queries, labels)
+    if backend == "distilbert":
+        return _train_distilbert(queries, labels)
+    # tfidf default
+    pipeline = Pipeline([
+        ("tfidf", TfidfVectorizer(
+            max_features=max_features,
+            ngram_range=(1, 2),
+            min_df=1,
+            lowercase=True,
+        )),
+        ("clf", LogisticRegression(
+            C=C,
+            max_iter=500,
+            class_weight="balanced",
+            random_state=42,
+        )),
+    ])
+    pipeline.fit(queries, labels)
+    return pipeline
+
+
+def _train_fasttext(queries: list[str], labels: list[str]):
+    """Train FastText classifier. Requires fasttext package."""
+    try:
+        import fasttext
+        import tempfile
+        import os
+        with tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False) as f:
+            for q, l in zip(queries, labels):
+                line = q.replace("\n", " ").strip()
+                f.write(f"__label__{l} {line}\n")
+            path = f.name
+        model = fasttext.train_supervised(path, epoch=25, lr=0.5, wordNgrams=2)
+        os.unlink(path)
+        return model
+    except ImportError:
+        raise RuntimeError("FastText not installed: pip install fasttext")
+
+
+def _train_distilbert(queries: list[str], labels: list[str]) -> dict:
+    """DistilBERT zero-shot: creates pipeline (no training). Saves config for inference."""
+    try:
+        from transformers import pipeline
+        pipe = pipeline(
+            "zero-shot-classification",
+            model="distilbert-base-uncased",
+            device=-1,
+        )
+        return {
+            "backend": "distilbert",
+            "distilbert_model": "distilbert-base-uncased",
+            "intents": INTENTS,
+        }
+    except Exception as e:
+        raise RuntimeError(f"DistilBERT setup failed: {e}")
src/core/metadata_store.py
CHANGED
@@ -142,6 +142,148 @@ class MetadataStore:
             logger.error(f"MetadataStore insert_book failed: {e}")
             return False
 
+    def insert_book_with_fts(self, row: Dict[str, Any]) -> bool:
+        """
+        Insert a new book into both main table AND FTS5 index.
+
+        This enables incremental indexing - new books are immediately searchable
+        via keyword search without requiring a full index rebuild.
+
+        Args:
+            row: Book data dict with keys: isbn13, title, description, authors, simple_categories, etc.
+
+        Returns:
+            True if successful, False otherwise
+        """
+        conn = self.connection
+        if not conn:
+            return False
+
+        try:
+            # 1. Insert into main books table
+            if not self.insert_book(row):
+                return False
+
+            # 2. Insert into FTS5 index
+            # FTS5 columns: isbn13, title, description, authors, simple_categories
+            isbn13 = str(row.get("isbn13", ""))
+            title = str(row.get("title", ""))
+            description = str(row.get("description", ""))
+            authors = str(row.get("authors", ""))
+            categories = str(row.get("simple_categories", ""))
+
+            # Check if FTS5 table exists
+            cursor = conn.cursor()
+            cursor.execute(
+                "SELECT name FROM sqlite_master WHERE type='table' AND name='books_fts'"
+            )
+            if not cursor.fetchone():
+                logger.warning("MetadataStore: FTS5 table 'books_fts' not found. Skipping FTS index.")
+                return True  # Main insert succeeded, FTS just not available
+
+            # Insert into FTS5 (use INSERT OR REPLACE to handle updates)
+            cursor.execute(
+                """
+                INSERT OR REPLACE INTO books_fts (isbn13, title, description, authors, simple_categories)
+                VALUES (?, ?, ?, ?, ?)
+                """,
+                (isbn13, title, description, authors, categories)
+            )
+            conn.commit()
+
+            logger.info(f"MetadataStore: Inserted book {isbn13} into FTS5 index")
+            return True
+
+        except sqlite3.OperationalError as e:
+            # FTS5 might not support OR REPLACE, try without
+            if "REPLACE" in str(e):
+                try:
+                    cursor = conn.cursor()
+                    cursor.execute(
+                        """
+                        INSERT INTO books_fts (isbn13, title, description, authors, simple_categories)
+                        VALUES (?, ?, ?, ?, ?)
+                        """,
+                        (isbn13, title, description, authors, categories)
+                    )
+                    conn.commit()
+                    return True
+                except Exception as inner_e:
+                    logger.error(f"MetadataStore FTS5 insert failed: {inner_e}")
+                    return False
+            logger.error(f"MetadataStore FTS5 insert failed: {e}")
+            return False
+        except Exception as e:
+            logger.error(f"MetadataStore insert_book_with_fts failed: {e}")
+            return False
+
+    def book_exists(self, isbn: str) -> bool:
+        """Check if a book with given ISBN exists in the database."""
+        isbn = str(isbn).strip().replace(".0", "")
+        row = self._query_one(
+            "SELECT 1 FROM books WHERE isbn13 = ? OR isbn10 = ? LIMIT 1",
+            (isbn, isbn)
+        )
+        return row is not None
+
+    def get_newest_book_year(self) -> Optional[int]:
+        """Get the publication year of the newest book in the database."""
+        conn = self.connection
+        if not conn:
+            return None
+        try:
+            cursor = conn.cursor()
+            # Try publishedDate column
+            cursor.execute(
+                "SELECT publishedDate FROM books WHERE publishedDate IS NOT NULL "
+                "ORDER BY publishedDate DESC LIMIT 1"
+            )
+            row = cursor.fetchone()
+            if row and row[0]:
+                # Extract year from date string
+                date_str = str(row[0])
+                if len(date_str) >= 4:
+                    return int(date_str[:4])
+        except Exception as e:
+            logger.debug(f"get_newest_book_year failed: {e}")
+        return None
+
+    def get_book_count(self) -> int:
+        """Get total number of books in the database."""
+        conn = self.connection
+        if not conn:
+            return 0
+        try:
+            cursor = conn.cursor()
+            cursor.execute("SELECT COUNT(*) FROM books")
+            row = cursor.fetchone()
+            return row[0] if row else 0
+        except Exception as e:
+            logger.error(f"get_book_count failed: {e}")
+            return 0
+
+    def get_books_by_year_distribution(self) -> Dict[int, int]:
+        """Get distribution of books by publication year."""
+        conn = self.connection
+        if not conn:
+            return {}
+        try:
+            cursor = conn.cursor()
+            cursor.execute(
+                """
+                SELECT SUBSTR(publishedDate, 1, 4) as year, COUNT(*) as count
+                FROM books
+                WHERE publishedDate IS NOT NULL AND LENGTH(publishedDate) >= 4
+                GROUP BY year
+                ORDER BY year DESC
+                LIMIT 20
+                """
+            )
+            return {int(row[0]): row[1] for row in cursor.fetchall() if row[0].isdigit()}
+        except Exception as e:
+            logger.debug(f"get_books_by_year_distribution failed: {e}")
+            return {}
+
     def load_books_processed(self): pass
     def load_train_data(self): pass
 
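
How a fetch pipeline is expected to use these helpers: check existence first, then do the dual insert so keyword search picks the book up immediately (the `row` values below are illustrative placeholders, not a real book):

```python
from src.core.metadata_store import metadata_store

row = {
    "isbn13": "9790000000000",  # placeholder identifier for illustration
    "title": "Example Title",
    "authors": "Jane Doe",
    "description": "Placeholder description.",
    "simple_categories": "Fiction",
}

if not metadata_store.book_exists(row["isbn13"]):
    ok = metadata_store.insert_book_with_fts(row)  # books table + books_fts index
    print("inserted" if ok else "insert failed")

print(metadata_store.get_book_count(), metadata_store.get_newest_book_year())
```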
src/core/router.py
CHANGED
@@ -1,74 +1,178 @@
 import re
-from 
+from pathlib import Path
+from typing import Dict, Any, List, Optional
+
 from src.utils import setup_logger
 
 logger = setup_logger(__name__)
 
+
 class QueryRouter:
     """
     Intelligent Router for the RAG Pipeline.
     Classifies user queries to select the optimal retrieval strategy.
-
+
+    Uses model-based intent classifier when available; falls back to rule-based
+    heuristics when classifier not trained/loaded.
+
     Strategies:
     1. EXACT (ISBN/ID) -> Pure BM25 (High Precision, No Rerank noise).
     2. FAST (Keywords) -> Hybrid (RRF), No Rerank (Low Latency).
     3. DEEP (Complex) -> Hybrid + Rerank (High Latency, High contextual relevance).
-    """
-
-
-
 
+    Freshness-Aware Routing:
+    - Detects queries asking for "new", "latest", or specific years (2024, 2025, etc.)
+    - Sets freshness_fallback=True to enable Web Search when local results insufficient
+    """
+
+    # Keywords that indicate user wants fresh/recent content
+    # Note: Year numbers are detected dynamically in _detect_freshness()
+    FRESHNESS_KEYWORDS = {
+        "new", "newest", "latest", "recent", "modern", "contemporary", "current",
+    }
+
+    # Strong freshness indicators (always trigger fallback)
+    STRONG_FRESHNESS_KEYWORDS = {
+        "newest", "latest",
+    }
+
+    def __init__(self, model_dir: str | Path | None = None):
+        self.isbn_pattern = re.compile(r"^(?:\d{9}[\dX]|\d{13})$")
+        if model_dir is None:
+            from src.config import DATA_DIR
+            model_dir = DATA_DIR / "model"
+        self.model_dir = Path(model_dir)
+        self._classifier = None
+
+    def _get_classifier(self):
+        """Lazy-load intent classifier when first needed."""
+        if self._classifier is not None:
+            return self._classifier
+        try:
+            from src.core.intent_classifier import IntentClassifier
+            clf = IntentClassifier(self.model_dir / "intent_classifier.pkl")
+            if clf.load():
+                self._classifier = clf
+        except Exception as e:
+            logger.debug("Intent classifier not available: %s", e)
+        return self._classifier
+
+    def _detect_freshness(self, words: list) -> tuple[bool, bool, Optional[int]]:
+        """
+        Detect if query requires fresh content.
+
+        Returns:
+            (is_temporal, freshness_fallback, target_year)
+            - is_temporal: Should apply temporal boost to local results
+            - freshness_fallback: Should enable Web Search if local results insufficient
+            - target_year: Specific year user is looking for (if detected)
+        """
+        from datetime import datetime
+        current_year = datetime.now().year
+
+        lower_words = {w.lower() for w in words}
+
+        is_temporal = bool(lower_words & self.FRESHNESS_KEYWORDS)
+        freshness_fallback = bool(lower_words & self.STRONG_FRESHNESS_KEYWORDS)
+
+        # Extract explicit year from query
+        target_year = None
+        for word in words:
+            if word.isdigit() and len(word) == 4:
+                year = int(word)
+                if 2000 <= year <= 2050:
+                    target_year = year
+                    # Recent years (within last 3 years) trigger freshness
+                    if year >= current_year - 2:
+                        is_temporal = True
+                        freshness_fallback = True
+                    break
+
+        return is_temporal, freshness_fallback, target_year
+
+    def _route_by_rules(
+        self,
+        cleaned_query: str,
+        words: list,
+        is_temporal: bool,
+        freshness_fallback: bool = False,
+        target_year: Optional[int] = None
+    ) -> Dict[str, Any]:
+        """Fallback: rule-based routing (original logic + freshness)."""
+        detail_keywords = {
+            "twist", "ending", "spoiler", "readers", "felt", "cried", "hated", "loved",
+            "review", "opinion", "think", "unreliable", "narrator", "realize", "find out",
+        }
+
+        base_result = {
+            "temporal": is_temporal,
+            "freshness_fallback": freshness_fallback,
+            "freshness_threshold": 3,  # Trigger web search if < 3 results
+            "target_year": target_year,
+        }
+
+        if any(w.lower() in detail_keywords for w in words):
+            logger.info("Router (rules): Detail Query -> SMALL_TO_BIG")
+            return {**base_result, "strategy": "small_to_big", "alpha": 0.5, "rerank": False, "k_final": 5}
+        if len(words) <= 2:
+            logger.info("Router (rules): Keyword -> FAST (Temporal=%s, Freshness=%s)", is_temporal, freshness_fallback)
+            return {**base_result, "strategy": "fast", "alpha": 0.5, "rerank": False, "k_final": 5}
+        logger.info("Router (rules): Natural Language -> DEEP (Temporal=%s, Freshness=%s)", is_temporal, freshness_fallback)
+        return {**base_result, "strategy": "deep", "alpha": 0.5, "rerank": True, "k_final": 10}
+
     def route(self, query: str) -> Dict[str, Any]:
         """
         Analyze query and return retrieval parameters.
-
+
+        Returns dict with:
+        - 'strategy': 'exact' | 'fast' | 'deep' | 'small_to_big'
+        - 'alpha': float (hybrid search weight)
+        - 'rerank': bool (use cross-encoder reranking)
+        - 'k_final': int (number of results)
+        - 'temporal': bool (apply temporal boost)
+        - 'freshness_fallback': bool (enable web search if local results insufficient)
+        - 'freshness_threshold': int (min local results before triggering web search)
+        - 'target_year': int | None (specific year user requested)
         """
         cleaned_query = query.strip()
         words = cleaned_query.split()
-
-        # 1.
-        # Remove hyphens/spaces for check
+
+        # 1. ISBN: keep regex (deterministic, correct)
         normalized = cleaned_query.replace("-", "").replace(" ", "")
         if self.isbn_pattern.match(normalized):
-            logger.info(
-            return {"strategy": "exact", "alpha": 1.0, "rerank": False, "k_final": 5}
-
-        # 2. Check for Temporal Keywords (Freshness Bias)
-        temporal_keywords = {"new", "newest", "latest", "recent", "modern", "contemporary", "2020", "2021", "2022", "2023", "2024", "2025"}
-        is_temporal = any(word.lower() in temporal_keywords for word in words)
-
-        # 3. Check for Detail-Oriented Queries (Triggers Small-to-Big)
-        # These are queries asking about specific plot points, reactions, or hidden details
-        detail_keywords = {"twist", "ending", "spoiler", "readers", "felt", "cried", "hated", "loved",
-                           "review", "opinion", "think", "unreliable", "narrator", "realize", "find out"}
-        is_detail = any(word.lower() in detail_keywords for word in words)
-
-        if is_detail:
-            logger.info(f"Router: Detected Detail Query -> SMALL_TO_BIG Strategy")
+            logger.info("Router: ISBN -> EXACT (%s)", normalized)
             return {
-                "strategy": "
-                "
+                "strategy": "exact",
+                "alpha": 1.0,
+                "rerank": False,
                 "k_final": 5,
-                "temporal":
+                "temporal": False,
+                "freshness_fallback": False,
+                "freshness_threshold": 1,
+                "target_year": None,
             }
-
-        # ... (remaining rule-based branches; elided in the diff view)
+
+        # 2. Freshness detection (temporal boost + web fallback)
+        is_temporal, freshness_fallback, target_year = self._detect_freshness(words)
+
+        # 3. Model-based vs rule-based intent
+        clf = self._get_classifier()
+        if clf is not None:
+            try:
+                intent = clf.predict(cleaned_query)
+                logger.info("Router (model): %s (Freshness=%s)", intent.upper(), freshness_fallback)
+                return {
+                    "strategy": intent,
+                    "alpha": 0.5,
+                    "rerank": intent == "deep",
+                    "k_final": 10 if intent == "deep" else 5,
+                    "temporal": is_temporal,
+                    "freshness_fallback": freshness_fallback,
+                    "freshness_threshold": 3,
+                    "target_year": target_year,
+                }
+            except Exception as e:
+                logger.warning("Intent classifier failed, falling back to rules: %s", e)
+
+        return self._route_by_rules(cleaned_query, words, is_temporal, freshness_fallback, target_year)
 
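
What `route()` returns in practice, following the contract in its docstring (the freshness assertions are deterministic; the intent for the last query depends on whether the trained classifier is loaded):

```python
from src.core.router import QueryRouter

router = QueryRouter()

# Any 13-digit token matches the ISBN regex -> EXACT strategy, no rerank
print(router.route("9790000000000"))  # placeholder 13-digit identifier

params = router.route("latest sci-fi 2025")
assert params["freshness_fallback"] is True   # "latest" is a strong keyword
assert params["target_year"] == 2025          # explicit year token detected

print(router.route("Recommend me a book like Sapiens but about technology"))
# -> deep strategy with rerank=True and k_final=10 (model- or rule-based)
```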
src/core/web_search.py
ADDED
@@ -0,0 +1,323 @@
+"""
+Web Search Fallback Module for Data Freshness.
+
+Extends the existing Google Books integration in cover_fetcher.py to support
+full book metadata retrieval and keyword-based search for new books.
+
+This module provides:
+- search_google_books(): Search books by keyword query
+- fetch_book_by_isbn(): Get complete metadata for a single book
+- is_fresh_enough(): Evaluate if local search results meet freshness requirements
+
+API: Google Books API (free tier, no auth required for basic queries)
+Rate Limit: ~1000 requests/day (unofficial), implement conservative caching
+"""
+import requests
+from typing import Optional
+from functools import lru_cache
+from datetime import datetime
+
+from src.utils import setup_logger
+
+logger = setup_logger(__name__)
+
+# Google Books API endpoint
+GOOGLE_BOOKS_API = "https://www.googleapis.com/books/v1/volumes"
+
+# Request timeout (seconds)
+REQUEST_TIMEOUT = 5
+
+
+def _parse_isbn_from_identifiers(identifiers: list[dict]) -> tuple[str, str]:
+    """
+    Extract ISBN-13 and ISBN-10 from Google Books industryIdentifiers.
+
+    Returns:
+        (isbn13, isbn10) - empty strings if not found
+    """
+    isbn13, isbn10 = "", ""
+    for ident in identifiers:
+        id_type = ident.get("type", "")
+        id_value = ident.get("identifier", "")
+        if id_type == "ISBN_13":
+            isbn13 = id_value
+        elif id_type == "ISBN_10":
+            isbn10 = id_value
+    return isbn13, isbn10
+
+
+def _parse_volume_info(volume_info: dict) -> Optional[dict]:
+    """
+    Parse Google Books volumeInfo into our standard book format.
+
+    Returns:
+        dict with keys: isbn13, isbn10, title, authors, description,
+        publishedDate, thumbnail, categories
+        None if ISBN is missing (we can't index without ISBN)
+    """
+    identifiers = volume_info.get("industryIdentifiers", [])
+    isbn13, isbn10 = _parse_isbn_from_identifiers(identifiers)
+
+    # Skip books without ISBN - can't be indexed reliably
+    if not isbn13 and not isbn10:
+        return None
+
+    # Use isbn13 as primary, fallback to isbn10
+    primary_isbn = isbn13 if isbn13 else isbn10
+
+    # Extract image links (prefer larger sizes)
+    image_links = volume_info.get("imageLinks", {})
+    thumbnail = (
+        image_links.get("extraLarge") or
+        image_links.get("large") or
+        image_links.get("medium") or
+        image_links.get("small") or
+        image_links.get("thumbnail") or
+        ""
+    )
+    # Ensure HTTPS
+    if thumbnail.startswith("http://"):
+        thumbnail = thumbnail.replace("http://", "https://")
+
+    # Categories: Google returns list, we join to single string
+    categories = volume_info.get("categories", [])
+    category_str = categories[0] if categories else "General"
+
+    return {
+        "isbn13": primary_isbn,  # Primary key
+        "isbn10": isbn10 or (isbn13[:10] if isbn13 else ""),  # parenthesized so a lone ISBN-10 survives
+        "title": volume_info.get("title", "Unknown Title"),
+        "authors": ", ".join(volume_info.get("authors", ["Unknown"])),
+        "description": volume_info.get("description", ""),
+        "publishedDate": volume_info.get("publishedDate", ""),
+        "thumbnail": thumbnail,
+        "simple_categories": category_str,
+        "average_rating": volume_info.get("averageRating", 0.0),
+        "source": "google_books",  # Track data source
+    }
+
+
+def search_google_books(query: str, max_results: int = 10) -> list[dict]:
+    """
+    Search Google Books by keyword query.
+
+    Args:
+        query: Search query (e.g., "latest sci-fi 2024", "new fantasy novels")
+        max_results: Maximum number of results (1-40, Google's limit)
+
+    Returns:
+        List of book dicts in standard format, ordered by relevance
+    """
+    if not query or not query.strip():
+        return []
+
+    max_results = min(max_results, 40)  # Google's API limit
+
+    try:
+        params = {
+            "q": query,
+            "maxResults": max_results,
+            "printType": "books",
+            "orderBy": "relevance",
+        }
+
+        response = requests.get(
+            GOOGLE_BOOKS_API,
+            params=params,
+            timeout=REQUEST_TIMEOUT
+        )
+
+        if response.status_code != 200:
+            logger.warning(f"Google Books API returned {response.status_code}")
+            return []
+
+        data = response.json()
+        total_items = data.get("totalItems", 0)
+
+        if total_items == 0:
+            logger.info(f"No results for query: {query}")
+            return []
+
+        items = data.get("items", [])
+        results = []
+
+        for item in items:
+            volume_info = item.get("volumeInfo", {})
+            parsed = _parse_volume_info(volume_info)
+            if parsed:
+                results.append(parsed)
+
+        logger.info(f"Google Books search '{query}': {len(results)} valid results")
+        return results
+
+    except requests.Timeout:
+        logger.warning(f"Google Books API timeout for query: {query}")
+        return []
+    except requests.RequestException as e:
+        logger.error(f"Google Books API request failed: {e}")
+        return []
+    except Exception as e:
+        logger.error(f"Unexpected error in search_google_books: {e}")
+        return []
+
+
+@lru_cache(maxsize=500)
+def fetch_book_by_isbn(isbn: str) -> Optional[dict]:
+    """
+    Fetch complete book metadata by ISBN from Google Books.
+
+    Args:
+        isbn: ISBN-10 or ISBN-13
+
+    Returns:
+        Book dict in standard format, or None if not found
+    """
+    if not isbn or not isbn.strip():
+        return None
+
+    isbn = isbn.strip().replace("-", "")
+
+    try:
+        params = {
+            "q": f"isbn:{isbn}",
+            "maxResults": 1,
+        }
+
+        response = requests.get(
+            GOOGLE_BOOKS_API,
+            params=params,
+            timeout=REQUEST_TIMEOUT
+        )
+
+        if response.status_code != 200:
+            return None
+
+        data = response.json()
+        if data.get("totalItems", 0) == 0:
+            return None
+
+        items = data.get("items", [])
+        if not items:
+            return None
+
+        volume_info = items[0].get("volumeInfo", {})
+        return _parse_volume_info(volume_info)
+
+    except Exception as e:
+        logger.debug(f"fetch_book_by_isbn({isbn}) failed: {e}")
+        return None
+
+
+def search_new_books_by_category(
+    category: str,
+    year: Optional[int] = None,
+    max_results: int = 10
+) -> list[dict]:
+    """
+    Search for recently published books in a specific category.
+
+    Args:
+        category: Book category (e.g., "fiction", "science fiction", "mystery")
+        year: Filter by publication year (default: current year)
+        max_results: Maximum number of results
+
+    Returns:
+        List of book dicts, filtered to specified year or newer
+    """
+    if year is None:
+        year = datetime.now().year
|
| 229 |
+
|
| 230 |
+
# Build query with subject filter
|
| 231 |
+
query = f"subject:{category}"
|
| 232 |
+
|
| 233 |
+
# Get more results than needed, filter by year locally
|
| 234 |
+
raw_results = search_google_books(query, max_results=max_results * 2)
|
| 235 |
+
|
| 236 |
+
filtered = []
|
| 237 |
+
for book in raw_results:
|
| 238 |
+
pub_date = book.get("publishedDate", "")
|
| 239 |
+
if pub_date:
|
| 240 |
+
# Extract year from various date formats (YYYY, YYYY-MM, YYYY-MM-DD)
|
| 241 |
+
try:
|
| 242 |
+
pub_year = int(pub_date[:4])
|
| 243 |
+
if pub_year >= year:
|
| 244 |
+
filtered.append(book)
|
| 245 |
+
except (ValueError, IndexError):
|
| 246 |
+
continue
|
| 247 |
+
|
| 248 |
+
return filtered[:max_results]
|
| 249 |
+
|
| 250 |
+
|
| 251 |
+
def is_fresh_enough(
|
| 252 |
+
results: list[dict],
|
| 253 |
+
threshold: int = 3,
|
| 254 |
+
min_year: Optional[int] = None
|
| 255 |
+
) -> bool:
|
| 256 |
+
"""
|
| 257 |
+
Evaluate if local search results meet freshness requirements.
|
| 258 |
+
|
| 259 |
+
Args:
|
| 260 |
+
results: List of book dicts from local search
|
| 261 |
+
threshold: Minimum number of results required
|
| 262 |
+
min_year: If specified, count only books published >= this year
|
| 263 |
+
|
| 264 |
+
Returns:
|
| 265 |
+
True if results are sufficient, False if web fallback should be triggered
|
| 266 |
+
"""
|
| 267 |
+
if len(results) < threshold:
|
| 268 |
+
return False
|
| 269 |
+
|
| 270 |
+
if min_year is None:
|
| 271 |
+
return True
|
| 272 |
+
|
| 273 |
+
# Count books meeting year requirement
|
| 274 |
+
fresh_count = 0
|
| 275 |
+
for book in results:
|
| 276 |
+
pub_date = book.get("publishedDate", "") or book.get("published_date", "")
|
| 277 |
+
if pub_date:
|
| 278 |
+
try:
|
| 279 |
+
pub_year = int(str(pub_date)[:4])
|
| 280 |
+
if pub_year >= min_year:
|
| 281 |
+
fresh_count += 1
|
| 282 |
+
except (ValueError, IndexError):
|
| 283 |
+
continue
|
| 284 |
+
|
| 285 |
+
return fresh_count >= threshold
|
| 286 |
+
|
| 287 |
+
|
| 288 |
+
def extract_year_from_query(query: str) -> Optional[int]:
|
| 289 |
+
"""
|
| 290 |
+
Extract year requirement from user query.
|
| 291 |
+
|
| 292 |
+
Examples:
|
| 293 |
+
"books from 2024" -> 2024
|
| 294 |
+
"latest 2025 novels" -> 2025
|
| 295 |
+
"new sci-fi" -> current_year
|
| 296 |
+
"classic mystery" -> None
|
| 297 |
+
|
| 298 |
+
Returns:
|
| 299 |
+
Year as int, or None if no year requirement detected
|
| 300 |
+
"""
|
| 301 |
+
import re
|
| 302 |
+
|
| 303 |
+
# Explicit year patterns
|
| 304 |
+
year_patterns = [
|
| 305 |
+
r"\b(202[0-9])\b", # 2020-2029
|
| 306 |
+
r"\b(201[0-9])\b", # 2010-2019
|
| 307 |
+
]
|
| 308 |
+
|
| 309 |
+
for pattern in year_patterns:
|
| 310 |
+
match = re.search(pattern, query)
|
| 311 |
+
if match:
|
| 312 |
+
return int(match.group(1))
|
| 313 |
+
|
| 314 |
+
# Keywords implying "recent" = current year - 1
|
| 315 |
+
freshness_keywords = {
|
| 316 |
+
"new", "newest", "latest", "recent", "modern", "contemporary", "current"
|
| 317 |
+
}
|
| 318 |
+
|
| 319 |
+
words = set(query.lower().split())
|
| 320 |
+
if words & freshness_keywords:
|
| 321 |
+
return datetime.now().year - 1
|
| 322 |
+
|
| 323 |
+
return None
|
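Taken together, these helpers give the router a freshness-aware escape hatch. A minimal sketch of how they compose (the query is hypothetical, the call requires network access to the Google Books API, and live results will vary):

```python
from src.core.web_search import (
    extract_year_from_query,
    is_fresh_enough,
    search_google_books,
)

query = "latest sci-fi novels"             # hypothetical user query
min_year = extract_year_from_query(query)  # "latest" -> current year - 1

local_results = []  # pretend local recall came up empty
if not is_fresh_enough(local_results, threshold=3, min_year=min_year):
    # Local index too thin or too stale: fall back to Google Books
    for book in search_google_books(query, max_results=5):
        print(book["isbn13"], book["title"], book["publishedDate"])
```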
src/ranking/din.py
ADDED
@@ -0,0 +1,212 @@
"""
DIN (Deep Interest Network) for CTR/Ranking.

Uses attention over user behavior sequence w.r.t. target item to capture
user interest. Reuses SASRec item embeddings as initialization when available.

Reference: Zhou et al., "Deep Interest Network for Click-Through Rate Prediction" (KDD 2018)
"""

import logging
from pathlib import Path
from typing import Optional

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

logger = logging.getLogger(__name__)


class DIN(nn.Module):
    """
    Deep Interest Network: attention over user history w.r.t. target item.

    Input:
        - user_hist: [B, max_len] int64, padded behavior sequence (item_ids, 0=pad)
        - target_item: [B] int64, candidate item ids
        - aux_features: [B, num_aux] float32, optional scalar features

    Output: [B] logits for click/positive probability
    """

    def __init__(
        self,
        num_items: int,
        embed_dim: int = 64,
        max_hist_len: int = 50,
        mlp_dims: tuple = (128, 64, 32),
        dropout: float = 0.1,
        num_aux: int = 0,
        pretrained_item_emb: Optional[np.ndarray] = None,
    ):
        super().__init__()
        self.num_items = num_items
        self.embed_dim = embed_dim
        self.max_hist_len = max_hist_len
        self.num_aux = num_aux

        # Item embedding (1-indexed, 0=pad)
        self.item_emb = nn.Embedding(num_items + 1, embed_dim, padding_idx=0)

        if pretrained_item_emb is not None:
            self._init_from_pretrained(pretrained_item_emb)

        # Attention: local activation unit (DIN paper)
        self.attn_fc = nn.Sequential(
            nn.Linear(embed_dim * 4, 36),
            nn.ReLU(),
            nn.Linear(36, 1),
        )

        # MLP: [user_interest; target_emb; aux?] -> score
        mlp_in = embed_dim * 2 + num_aux
        layers = []
        for d in mlp_dims:
            layers.append(nn.Linear(mlp_in, d))
            layers.append(nn.ReLU())
            layers.append(nn.Dropout(dropout))
            mlp_in = d
        layers.append(nn.Linear(mlp_in, 1))
        self.mlp = nn.Sequential(*layers)

    def _init_from_pretrained(self, emb: np.ndarray) -> None:
        """Initialize item_emb from SASRec checkpoint."""
        if emb.shape[0] >= self.num_items + 1 and emb.shape[1] == self.embed_dim:
            with torch.no_grad():
                # Copy only the rows that fit the embedding table; copying all
                # emb.shape[0] rows would overflow when the checkpoint is larger
                self.item_emb.weight.data.copy_(
                    torch.from_numpy(emb[: self.num_items + 1])
                )
            logger.info("DIN: Initialized item_emb from pretrained (%d x %d)", *emb.shape)
        else:
            logger.warning("DIN: Pretrained shape %s mismatch, skipping init", emb.shape)

    def forward(
        self,
        user_hist: torch.Tensor,
        target_item: torch.Tensor,
        aux_features: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        """
        user_hist: [B, L]
        target_item: [B]
        aux_features: [B, num_aux] or None
        """
        # [B, L, E]
        hist_embs = self.item_emb(user_hist)
        # [B, E]
        target_emb = self.item_emb(target_item)

        # Attention: local activation
        # [B, L, E] -> expand target to [B, L, E]
        target_expand = target_emb.unsqueeze(1).expand(-1, user_hist.size(1), -1)
        attn_input = torch.cat([
            hist_embs,
            target_expand,
            hist_embs * target_expand,
            hist_embs - target_expand,
        ], dim=-1)  # [B, L, 4E]
        attn_scores = self.attn_fc(attn_input).squeeze(-1)  # [B, L]

        # Mask padding (0 = pad)
        mask = (user_hist != 0).float()
        attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        attn_weights = F.softmax(attn_scores, dim=1)  # [B, L]
        # When all zeros (no history), attn_weights can be nan; use mask to zero out
        attn_weights = torch.nan_to_num(attn_weights, nan=0.0)
        attn_weights = attn_weights * mask
        attn_weights = attn_weights / (attn_weights.sum(dim=1, keepdim=True) + 1e-9)

        # Weighted sum: user interest vector [B, E]
        user_interest = (hist_embs * attn_weights.unsqueeze(-1)).sum(dim=1)

        # MLP input
        mlp_in = torch.cat([user_interest, target_emb], dim=1)
        if aux_features is not None and self.num_aux > 0 and aux_features.size(-1) == self.num_aux:
            mlp_in = torch.cat([mlp_in, aux_features], dim=1)

        logits = self.mlp(mlp_in).squeeze(-1)
        return logits


class DINRanker:
    """
    Wrapper for DIN model: load, predict, compatible with RecommendationService.
    """

    def __init__(
        self,
        data_dir: str = "data/rec",
        model_dir: str = "data/model",
    ):
        self.data_dir = Path(data_dir)
        self.model_dir = Path(model_dir) / "ranking"
        self.model: Optional[DIN] = None
        self.item_map: dict = {}
        self.id_to_item: dict = {}
        self.user_sequences: dict = {}
        self.max_hist_len = 50
        self.aux_feature_names: list = []
        # Prefer CUDA, then Apple MPS, then CPU
        if torch.cuda.is_available():
            self.device = torch.device("cuda")
        elif torch.backends.mps.is_available():
            self.device = torch.device("mps")
        else:
            self.device = torch.device("cpu")

    def load(self) -> bool:
        """Load trained DIN and aux data."""
        import pickle

        model_path = self.model_dir / "din_ranker.pt"
        if not model_path.exists():
            return False

        try:
            ckpt = torch.load(model_path, map_location=self.device, weights_only=False)
            self.model = ckpt["model"]
            self.model.to(self.device)
            self.model.eval()
            self.item_map = ckpt.get("item_map", {})
            self.id_to_item = {v: k for k, v in self.item_map.items()}
            self.max_hist_len = ckpt.get("max_hist_len", 50)
            self.aux_feature_names = ckpt.get("aux_feature_names", [])

            with open(self.data_dir / "user_sequences.pkl", "rb") as f:
                seqs = pickle.load(f)
            # user_sequences: user_id -> list of item_ids (int)
            self.user_sequences = seqs

            logger.info("DIN ranker loaded from %s", model_path)
            return True
        except Exception as e:
            logger.error("Failed to load DIN ranker: %s", e)
            return False

    def predict(
        self,
        user_id: str,
        candidate_items: list[str],
        aux_features: Optional[np.ndarray] = None,
    ) -> np.ndarray:
        """Predict scores for (user_id, candidate_items). Returns [len(candidate_items)]."""
        if self.model is None:
            self.load()
        if self.model is None:
            return np.zeros(len(candidate_items))

        hist = self.user_sequences.get(user_id, [])
        if hist and isinstance(hist[0], str):
            hist = [self.item_map.get(h, 0) for h in hist]
        hist = hist[-self.max_hist_len:]
        padded = np.zeros(self.max_hist_len, dtype=np.int64)
        padded[: len(hist)] = hist

        target_ids = np.array([self.item_map.get(str(it), 0) for it in candidate_items], dtype=np.int64)

        hist_t = torch.LongTensor(padded).unsqueeze(0).expand(len(candidate_items), -1).to(self.device)
        target_t = torch.LongTensor(target_ids).to(self.device)

        aux_t = None
        if aux_features is not None and aux_features.size > 0:
            aux_t = torch.from_numpy(aux_features.astype(np.float32)).to(self.device)

        with torch.no_grad():
            logits = self.model(hist_t, target_t, aux_t)
        return logits.cpu().numpy()
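A quick shape check for the model above, with random inputs on a toy catalog (all sizes below are invented for illustration; `eval()` disables dropout so the check is deterministic in structure):

```python
import torch

from src.ranking.din import DIN

model = DIN(num_items=100, embed_dim=64, max_hist_len=50, num_aux=0)
model.eval()

user_hist = torch.randint(0, 101, (8, 50))   # [B=8, L=50]; id 0 acts as padding
target_item = torch.randint(1, 101, (8,))    # [B=8] candidate item ids

with torch.no_grad():
    logits = model(user_hist, target_item)

print(logits.shape)  # torch.Size([8]) - one logit per (user, candidate) pair
```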
src/recommender.py
CHANGED
@@ -1,4 +1,4 @@
-from typing import List, Dict, Any
+from typing import List, Dict, Any, Optional
 from src.vector_db import VectorDB
 from src.config import TOP_K_INITIAL, TOP_K_FINAL, DATA_DIR
 from src.cache import CacheManager
@@ -136,13 +136,108 @@ class BookRecommender:
                "emotions": emotions,
                "review_highlights": highlights,
                "persona_summary": "",
-               "average_rating": float(meta.get("average_rating", 0.0))
+               "average_rating": float(meta.get("average_rating", 0.0)),
+               "source": "local",  # Track data source
            })

            if len(results) >= TOP_K_FINAL:
                break
+
+       # 3. Web Search Fallback (Freshness-Aware)
+       # Triggered when: freshness_fallback=True AND local results < threshold
+       if decision.get("freshness_fallback", False):
+           threshold = decision.get("freshness_threshold", 3)
+           if len(results) < threshold:
+               web_results = self._fetch_from_web(query, TOP_K_FINAL - len(results), category)
+               results.extend(web_results)
+               logger.info(f"Web fallback added {len(web_results)} books")
+
+       # Cache the results
+       if results:
+           self.cache.set(cache_key, results)

        return results
+
+   def _fetch_from_web(
+       self,
+       query: str,
+       max_results: int,
+       category: str = "All"
+   ) -> List[Dict[str, Any]]:
+       """
+       Fetch books from Google Books API when local results are insufficient.
+       Auto-persists discovered books to the local database for future queries.
+
+       Args:
+           query: User's search query
+           max_results: Maximum number of results to fetch
+           category: Category filter (not passed to the web search; applied to its results)
+
+       Returns:
+           List of formatted book dicts ready for response
+       """
+       try:
+           from src.core.web_search import search_google_books
+       except ImportError:
+           logger.warning("Web search module not available")
+           return []
+
+       results = []
+
+       try:
+           web_books = search_google_books(query, max_results=max_results * 2)
+
+           for book in web_books:
+               isbn = book.get("isbn13", "")
+               if not isbn:
+                   continue
+
+               # Skip if already in local database
+               if metadata_store.book_exists(isbn):
+                   continue
+
+               # Category filter (if specified)
+               if category and category != "All":
+                   book_cat = book.get("simple_categories", "")
+                   if category.lower() not in book_cat.lower():
+                       continue
+
+               # Auto-persist to local database
+               added = self.add_new_book(
+                   isbn=isbn,
+                   title=book.get("title", ""),
+                   author=book.get("authors", "Unknown"),
+                   description=book.get("description", ""),
+                   category=book.get("simple_categories", "General"),
+                   thumbnail=book.get("thumbnail"),
+                   published_date=book.get("publishedDate", ""),
+               )
+
+               if added:
+                   results.append({
+                       "isbn": isbn,
+                       "title": book.get("title", ""),
+                       "authors": book.get("authors", "Unknown"),
+                       "description": book.get("description", ""),
+                       "thumbnail": book.get("thumbnail", ""),
+                       "caption": f"{book.get('title', '')} by {book.get('authors', 'Unknown')}",
+                       "tags": [],
+                       "emotions": {"joy": 0.0, "sadness": 0.0, "fear": 0.0, "anger": 0.0, "surprise": 0.0},
+                       "review_highlights": [],
+                       "persona_summary": "",
+                       "average_rating": float(book.get("average_rating", 0.0)),
+                       "source": "google_books",  # Track data source
+                   })
+
+               if len(results) >= max_results:
+                   break
+
+           logger.info(f"Web fallback: Found and persisted {len(results)} new books")
+           return results
+
+       except Exception as e:
+           logger.error(f"Web fallback failed: {e}")
+           return []

    def get_categories(self) -> List[str]:
        """Get unique book categories from SQLite."""
@@ -152,20 +247,47 @@ class BookRecommender:
        """Get available emotional tones."""
        return ["All", "Happy", "Sad", "Fear", "Anger", "Surprise"]

-   def add_new_book(
+   def add_new_book(
+       self,
+       isbn: str,
+       title: str,
+       author: str,
+       description: str,
+       category: str = "General",
+       thumbnail: Optional[str] = None,
+       published_date: Optional[str] = None,
+   ) -> Optional[Dict[str, Any]]:
        """
+       Add a new book to the system: CSV, SQLite (with FTS5), and ChromaDB.
+
+       Args:
+           isbn: ISBN-13 or ISBN-10
+           title: Book title
+           author: Author name(s)
+           description: Book description
+           category: Book category
+           thumbnail: Cover image URL
+           published_date: Publication date (YYYY, YYYY-MM, or YYYY-MM-DD)
+
+       Returns:
+           New book dictionary if successful, None otherwise
        """
        try:
            import pandas as pd

+           isbn_s = str(isbn).strip()
+
+           # Check if already exists
+           if metadata_store.book_exists(isbn_s):
+               logger.debug(f"Book {isbn} already exists. Skipping add.")
+               return None
+
            # 1. Update Persistent Storage (CSV)
            csv_path = DATA_DIR / "books_processed.csv"

            # Define new row with all expected columns
            new_row = {
+               "isbn13": isbn_s,
                "title": title,
                "authors": author,
                "description": description,
@@ -174,13 +296,10 @@ class BookRecommender:
                "average_rating": 0.0,
                "joy": 0.0, "sadness": 0.0, "fear": 0.0, "anger": 0.0, "surprise": 0.0,
                "tags": "", "review_highlights": "",
+               "isbn10": isbn_s[:10] if len(isbn_s) >= 10 else isbn_s,
+               "publishedDate": published_date or "",
+               "source": "google_books",  # Track data source
            }
-
-           isbn_s = str(isbn)
-           if metadata_store.get_book_metadata(isbn_s):
-               logger.warning(f"Book {isbn} already exists. Skipping add.")
-               return None

            # Append to CSV
            if csv_path.exists():
@@ -191,19 +310,20 @@ class BookRecommender:
                # Filter/Order new_row to match CSV structure
                ordered_row = {}
                for col in csv_columns:
                    ordered_row[col] = new_row.get(col, "")

                # Append to CSV
                pd.DataFrame([ordered_row]).to_csv(csv_path, mode='a', header=False, index=False)
            else:
+               pd.DataFrame([new_row]).to_csv(csv_path, index=False)

            new_row["large_thumbnail"] = new_row["thumbnail"]
+           new_row["image"] = new_row["thumbnail"]

+           # 2. Insert into SQLite with FTS5 (incremental indexing)
+           metadata_store.insert_book_with_fts(new_row)

+           # 3. Update Vector DB (ChromaDB)
            self.vector_db.add_book(new_row)

            logger.info(f"Successfully added book {isbn}: {title}")
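A sketch of the persistence path that `_fetch_from_web()` relies on. The constructor call and the ISBN below are assumptions for illustration (construction is not shown in this diff); the important parts are the dual-write contract (CSV, then SQLite/FTS5, then ChromaDB) and the idempotency guard:

```python
from src.recommender import BookRecommender

rec = BookRecommender()  # assumed default construction; not shown in this diff

book = rec.add_new_book(
    isbn="9781234567897",  # invented ISBN for illustration
    title="Example: A Novel",
    author="Jane Doe",
    description="Placeholder description.",
    category="Fiction",
    published_date="2024-05",
)
print(book is not None)  # True on the first insert, per the docstring contract

# A second call short-circuits on metadata_store.book_exists() and returns None,
# so repeated web-fallback passes can safely re-discover the same book.
print(rec.add_new_book(isbn="9781234567897", title="Example: A Novel",
                       author="Jane Doe", description="Placeholder description."))
```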
src/services/recommend_service.py
CHANGED
@@ -7,10 +7,12 @@ from pathlib import Path
 from src.recall.fusion import RecallFusion
 from src.ranking.features import FeatureEngineer
 from src.ranking.explainer import RankingExplainer
+from src.ranking.din import DINRanker
 from src.utils import setup_logger

 logger = setup_logger(__name__)

+
 class RecommendationService:
     def __init__(self, data_dir='data/rec', model_dir='data/model'):
         self.data_dir = Path(data_dir)
@@ -21,6 +23,8 @@

        self.ranker = None
        self.ranker_loaded = False
+       self.din_ranker = DINRanker(str(data_dir), str(model_dir))
+       self.din_ranker_loaded = False
        self.xgb_ranker = None
        self.meta_model = None
        self.use_stacking = False
@@ -34,7 +38,14 @@
        self.fusion.load_models()
        self.fe.load_base_data()

+       # Prefer DIN ranker when available (deep model)
+       din_path = self.model_dir / 'ranking/din_ranker.pt'
+       if din_path.exists():
+           if self.din_ranker.load():
+               self.din_ranker_loaded = True
+               logger.info("DIN ranker loaded — using deep model for ranking")
+
+       # Load LGBM ranker (fallback when DIN not available)
        ranker_path = self.model_dir / 'ranking/lgbm_ranker.txt'
        if ranker_path.exists():
            self.ranker = lgb.Booster(model_file=str(ranker_path))
@@ -119,29 +130,34 @@
        candidate_items = [item for item, score in candidates]

        # 2. Ranking
+       valid_candidates = [item for item in candidate_items if item not in fav_isbns]
+       if not valid_candidates:
+           return []

+       if self.din_ranker_loaded:
+           # DIN: deep model; optional aux features from FeatureEngineer
+           aux_arr = None
+           if self.din_ranker.aux_feature_names:
+               X_df = self.fe.generate_features_batch(user_id, valid_candidates)
+               for col in self.din_ranker.aux_feature_names:
+                   if col not in X_df.columns:
+                       X_df[col] = 0
+               aux_arr = X_df[self.din_ranker.aux_feature_names].values.astype(np.float32)
+           scores = self.din_ranker.predict(user_id, valid_candidates, aux_arr)
+           explanations_list = [[] for _ in valid_candidates]
+           final_scores = list(zip(valid_candidates, scores, explanations_list))
+           final_scores.sort(key=lambda x: x[1], reverse=True)
+       elif self.ranker_loaded:
+           # LGBM / stacking path
            X_df = self.fe.generate_features_batch(user_id, valid_candidates)
-
-           # Align features to match model
            model_features = self.ranker.feature_name()
            for col in model_features:
                if col not in X_df.columns:
                    X_df[col] = 0
            X_df = X_df[model_features]

-           # Predict
            if self.use_stacking and self.xgb_ranker is not None and self.meta_model is not None:
-               # Stacking: Level-1 predictions -> Level-2 meta-learner
                lgb_scores = self.ranker.predict(X_df)
-
-               # Check if XGB Ranker is a raw Booster or Sklearn Estimator
                if isinstance(self.xgb_ranker, xgb.Booster):
                    dtest = xgb.DMatrix(X_df)
                    xgb_scores = self.xgb_ranker.predict(dtest)
@@ -150,24 +166,19 @@
                meta_features = np.column_stack([lgb_scores, xgb_scores])
                scores = self.meta_model.predict_proba(meta_features)[:, 1]
            else:
-               # Fallback: LightGBM only (backward compatible)
                scores = self.ranker.predict(X_df)

-           # Compute SHAP explanations (V2.7)
            explanations_list = []
            if self.explainer is not None:
                try:
                    explanations_list = self.explainer.explain(X_df, top_k=3)
                except Exception as e:
-                   logger.warning(f"SHAP explanation failed: {e}")
                    explanations_list = [[] for _ in valid_candidates]
            else:
                explanations_list = [[] for _ in valid_candidates]

-           # Combine with explanations
            final_scores = list(zip(valid_candidates, scores, explanations_list))
            final_scores.sort(key=lambda x: x[1], reverse=True)
-
        else:
            # Fallback to recall scores, but filter
            final_scores = []
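For the stacking branch kept above, a minimal sketch of the level-2 step, assuming the meta-model is a scikit-learn classifier exposing `predict_proba` (the scores and labels below are toy values; in the service the meta-learner is fitted offline):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy level-1 scores for 4 candidates
lgb_scores = np.array([0.9, 0.2, 0.6, 0.4])
xgb_scores = np.array([0.8, 0.3, 0.5, 0.7])

# Same shape as in the service: one row per candidate, one column per base model
meta_features = np.column_stack([lgb_scores, xgb_scores])  # (4, 2)

# Fitted here on dummy labels just so the sketch runs end to end
meta_model = LogisticRegression().fit(meta_features, [1, 0, 1, 0])
scores = meta_model.predict_proba(meta_features)[:, 1]     # final ranking scores, (4,)
print(np.argsort(-scores))                                 # candidate order, best first
```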
src/vector_db.py
CHANGED
@@ -321,7 +321,14 @@

    def add_book(self, book_data: dict):
        """
+       Dynamically add a new book to the ChromaDB vector index.
+
+       Note: FTS5 incremental updates are handled separately via
+       metadata_store.insert_book_with_fts() called from BookRecommender.add_new_book().
+       This method only handles the dense vector index.
+
+       Args:
+           book_data: Dict with isbn13, title, authors, description, etc.
        """
        from langchain_core.documents import Document

@@ -330,7 +337,7 @@
        author = book_data.get("authors", "")
        description = book_data.get("description", "")

+       # Add to ChromaDB (dense vector index)
        content = f"Title: {title}\nAuthor: {author}\nDescription: {description}\nISBN: {isbn}"
        doc = Document(
            page_content=content,
@@ -346,7 +353,4 @@
        if self.db:
            self.db.add_documents([doc])
            logger.info(f"Added book {isbn} to ChromaDB")
-
-       if hasattr(self, 'fts_enabled') and self.fts_enabled:
-           logger.info("Note: FTS5 database updates are not implemented in add_book yet.")
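For reference, the document that `add_book()` pushes into ChromaDB looks roughly like this. The payload values and the metadata key are invented for illustration (the diff elides the actual `Document` metadata):

```python
from langchain_core.documents import Document

book_data = {
    "isbn13": "9781234567897",  # invented example payload
    "title": "Example: A Novel",
    "authors": "Jane Doe",
    "description": "Placeholder description.",
}

# Same page_content format as VectorDB.add_book()
content = (
    f"Title: {book_data['title']}\n"
    f"Author: {book_data['authors']}\n"
    f"Description: {book_data['description']}\n"
    f"ISBN: {book_data['isbn13']}"
)
doc = Document(page_content=content, metadata={"isbn13": book_data["isbn13"]})
print(doc.page_content)
```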