---
title: NLP RAG
emoji: 🏢
colorFrom: gray
colorTo: green
sdk: docker
pinned: false
license: mit
short_description: NLP Spring 2026 Project 1
---

# RAG-based Question-Answering System for Cognitive Behavior Therapy (CBT)

## Overview

This project is a Retrieval-Augmented Generation (RAG) system built to answer CBT-related questions using grounded evidence from source manuals instead of relying on generic model knowledge. It combines hybrid retrieval, re-ranking, and strict response constraints so the assistant stays accurate, clinically focused, and less prone to hallucinations.

## Index

- [Overview](#overview)
- [Live Demo and Repository](#live-demo-and-repository)
- [Live Web Interface](#live-web-interface)
- [Tech Stack](#tech-stack)
- [System Architecture](#system-architecture)
- [Key Features](#key-features)
- [Installation and Setup](#installation-and-setup)
- [Configuration](#configuration)
- [Testing](#testing)
- [Running the Main Pipeline](#running-the-main-pipeline)
- [Contributors](#contributors)

## Live Demo and Repository

- Live Demo: https://rag-as-3-nlp.vercel.app/
- Code Repository: https://github.com/ramailkk/RAG-AS3-NLP

## Live Web Interface

<img width="1895" height="986" alt="image" src="https://github.com/user-attachments/assets/95eeba40-10c6-4137-af1a-5d83cc1b3a3c" />

<img width="1908" height="990" alt="image" src="https://github.com/user-attachments/assets/d8746422-900d-4101-9d8a-287a0eb5a22f" />


## Tech Stack

- Frontend: Vercel (Node.js/React)
- Backend: Hugging Face Spaces (FastAPI)
- Vector Database: Pinecone
- Embeddings: jinaai/jina-embeddings-v2-small-en
- LLMs: Llama-3-8B (Primary), TinyAya, Mistral-7B, Qwen-2.5
- Re-ranking: Voyage AI (rerank-2.5) and Cross-Encoder (ms-marco-MiniLM-L-6-v2)
- Retrieval: Hybrid Search (Dense + BM25 Sparse)

## System Architecture

The system operates through a high-precision multi-stage pipeline to ensure clinical safety and data grounding:

- Hybrid Retrieval: Simultaneously queries dense vector indices for semantic intent and sparse BM25 indices for specific clinical terminology such as Socratic Questioning or Cognitive Distortions.
- Fusion & Re-ranking: Uses Reciprocal Rank Fusion (RRF) to merge results, followed by a Cross-Encoder stage to re-evaluate the relevance of chunks against the user query.
- Diversity Filtering (MMR): Implements Maximal Marginal Relevance to ensure the context provided to the LLM is not redundant.
- Prompt Engineering: Employs a specialized persona that acts as an empathetic CBT therapist with strict grounding constraints to prevent the use of outside knowledge.
- Automated Evaluation: An LLM-as-a-Judge framework calculates:
  - Faithfulness: Verifying claims against the source document.
  - Relevancy: Ensuring the answer directly addresses the user's query.
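
The fusion step above can be sketched in a few lines. This is a minimal illustration of Reciprocal Rank Fusion, not the project's actual implementation; `k=60` is the commonly used RRF constant, and the chunk IDs are made up.

```python
# Minimal Reciprocal Rank Fusion (RRF) sketch: merge two ranked lists of
# chunk IDs (dense and sparse) into one fused ranking. Each document gets
# 1 / (k + rank) from every list it appears in; k=60 is the usual default.
def rrf_fuse(dense_ranked, sparse_ranked, k=60):
    scores = {}
    for ranking in (dense_ranked, sparse_ranked):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

dense = ["c1", "c2", "c3"]    # hypothetical dense-retrieval ranking
sparse = ["c3", "c1", "c4"]   # hypothetical BM25 ranking
print(rrf_fuse(dense, sparse))  # → ['c1', 'c3', 'c2', 'c4']
```

Documents ranked well by both retrievers (here `c1`) float to the top, which is why RRF is a robust way to combine rankings whose raw scores are not comparable.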

## Key Features

- Clinical Domain Focus: Optimized for high-density information found in mental health manuals.
- Zero Tolerance for Hallucinations: Includes a fallback protocol to state when information is missing rather than inventing therapeutic advice.
- Advanced Chunking: Uses sentence-level and recursive character splitting to preserve the logical flow of therapeutic guidelines and patient transcripts.
- Multi-Model Support: Tested across multiple LLMs to find the best balance between latency and grounding.
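
To illustrate the chunking idea, here is a toy recursive character splitter. It shows the general technique only (split on the coarsest separator first, fall back to finer ones when a piece is still too long); the separators and size limit are assumptions, not the project's settings.

```python
# Toy recursive character splitter: prefer paragraph breaks, then line
# breaks, then sentence ends, then spaces, so chunks follow the text's
# logical structure instead of cutting mid-sentence.
def recursive_split(text, max_len=200, seps=("\n\n", "\n", ". ", " ")):
    if len(text) <= max_len or not seps:
        return [text]
    sep, rest = seps[0], seps[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:  # separator not present, try a finer one
        return recursive_split(text, max_len, rest)
    chunks = []
    for piece in pieces:
        if len(piece) <= max_len:
            chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, max_len, rest))
    return chunks
```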

## Installation and Setup

### Backend Setup

The backend handles document processing, Pinecone vector operations, and the hybrid retrieval logic.

1. Initialize Virtual Environment:

	```bash
	python -m venv .venv
	# Windows (Git Bash)
	source .venv/Scripts/activate
	# Windows (cmd)
	.venv\Scripts\activate
	# Linux/macOS
	source .venv/bin/activate
	```

2. Install Dependencies:

	```bash
	pip install -r requirements.txt
	```

3. Launch API Server:

	```bash
	uvicorn backend.api:app --reload --host 0.0.0.0 --port 8000
	```

### Frontend Setup

The frontend provides the interactive chat interface and real-time evaluation scores.

1. Navigate and Install:

	```bash
	cd frontend
	npm install
	```

2. Start Development Server:

	```bash
	npm run dev
	```

## Configuration

To replicate the system, ensure your environment variables contain valid API keys for:

- Pinecone for vector storage
- OpenRouter or Hugging Face Inference API for LLM access
- Voyage AI for re-ranking
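
An illustrative `.env` layout is shown below. The variable names are assumptions for illustration only; check the backend code for the exact names it reads.

```shell
# Hypothetical variable names — verify against the code before use.
PINECONE_API_KEY=your-pinecone-key
OPENROUTER_API_KEY=your-openrouter-key   # or a Hugging Face token for HF Inference
VOYAGE_API_KEY=your-voyage-key
```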

## Testing

Run `test.py` to benchmark the chunking strategies and retrieval configurations, then generate a complete Markdown report of the results.

```bash
python test.py
```

This script evaluates multiple test queries across the configured chunking techniques and retrieval strategies, then writes the full output to `retrieval_report.md`. Use that report to choose the best chunking strategy and retrieval configuration.

### Key variables you can change in `test.py`

- `test_queries`: the questions used for benchmarking.
- `CHUNKING_TECHNIQUES_FILTERED`: the chunking strategies included in the report.
- `RETRIEVAL_STRATEGIES`: the retrieval modes and MMR settings being compared.
- `index_name`: the Pinecone index that stores the chunked data.
- `top_k` and `final_k`: how many candidates are retrieved and how many are kept in the final context.
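
As a sketch of what a benchmark configuration might look like, the snippet below uses the variable names from the list above with invented example values; none of these values are the project's defaults.

```python
# Illustrative values only — the names match the README's list, but every
# value is a made-up example, not taken from test.py.
test_queries = [
    "What is Socratic questioning in CBT?",
    "How does a therapist identify cognitive distortions?",
]
CHUNKING_TECHNIQUES_FILTERED = ["sentence", "recursive_character"]
RETRIEVAL_STRATEGIES = [
    {"mode": "dense", "use_mmr": False},
    {"mode": "hybrid", "use_mmr": True},
]
index_name = "cbt-manuals"   # hypothetical Pinecone index name
top_k = 20                   # candidates retrieved per strategy
final_k = 5                  # chunks kept in the final context
```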

## Running the Main Pipeline

After testing, run `main.py` to reproduce the main experiment with the selected configuration and evaluate faithfulness and relevancy across the model set. This script is part of the reproducibility workflow, since changing its configuration lets you rerun the same evaluation under different chunking, retrieval, and model settings.

```bash
python main.py
```

This step runs the end-to-end comparison flow for all models, measures faithfulness and relevancy for each one, and writes the detailed findings to `rag_ablation_findings.md`.

### Key variables you can change in `main.py`

- `CHUNKING_TECHNIQUES` or the technique filter used in the script: controls which chunking methods are evaluated.
- `test_queries`: the query set used for the ablation study.
- `MODEL_MAP`: the model lineup being compared.
- `retrieval_strategy`: the retrieval mode, MMR setting, and label for each run.
- `top_k` and `final_k`: candidate retrieval depth and final context size.
- `temperature` in `cfg.gen`: generation randomness for the model outputs.
- `output_file`: the Markdown report written by the run, usually `rag_ablation_findings.md`.

## Contributors

- Ramail Khan ([ramailkk](https://github.com/ramailkk))
- Qamar Raza ([Qar-Raz](https://github.com/Qar-Raz))
- Muddasir Javed ([bsparx](https://github.com/bsparx))