contextpilot committed
Commit 6906333 · 1 Parent(s): 76bc2b7

📝 Major documentation overhaul: Consolidated all docs into professional README.md

Files changed (12)
  1. .env.example +26 -7
  2. CHANGELOG.md +36 -2
  3. FEATURES.md +0 -162
  4. FIXES_APPLIED.md +0 -281
  5. INSTALLATION.md +0 -110
  6. QUICKSTART.md +0 -227
  7. QUICK_REFERENCE.md +0 -351
  8. README.md +422 -203
  9. TESTING_GUIDE.md +0 -326
  10. UPGRADE_SUMMARY.md +0 -324
  11. data/.gitkeep +0 -2
  12. uploads/.gitkeep +0 -2
.env.example CHANGED
@@ -1,13 +1,32 @@
-# Pinecone Configuration
+# ===================================
+# PaperBOT Environment Configuration
+# ===================================
+# Copy this file to .env and fill in your API keys
+#   cp .env.example .env
+
+# --------------------------
+# REQUIRED: Pinecone Vector Database
+# Get your API key: https://www.pinecone.io/
+# --------------------------
 PINECONE_API_KEY=your_pinecone_api_key_here
 
-# Google AI (Gemini) Configuration
+# --------------------------
+# REQUIRED: Google AI (Gemini)
+# Get your API key: https://aistudio.google.com/
+# --------------------------
 GOOGLE_API_KEY=your_google_ai_api_key_here
 
-# HuggingFace Token (optional, for some models)
+# --------------------------
+# OPTIONAL: HuggingFace Token
+# Get your token: https://huggingface.co/settings/tokens
+# Required for some gated models
+# --------------------------
 HF_TOKEN=your_huggingface_token_here
 
-# Application Settings (optional)
-# MAX_UPLOAD_SIZE=52428800  # 50MB in bytes
-# BATCH_SIZE=32
-# MODEL_TYPE=quality  # Options: fast, balanced, quality
+# --------------------------
+# OPTIONAL: Application Settings
+# Uncomment and modify as needed
+# --------------------------
+# MAX_UPLOAD_SIZE=15728640  # 15MB in bytes (default)
+# BATCH_SIZE=32  # Chunks per batch
+# MODEL_TYPE=quality  # Options: fast, balanced, quality
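The keys in the new `.env.example` are plain `KEY=value` lines with `#` comment and divider lines. The app presumably loads them with a library such as `python-dotenv` (an assumption; the diff doesn't show the loading code), but the format is simple enough that a stdlib-only parsing sketch illustrates what the file means:

```python
def parse_env(text: str) -> dict:
    """Parse simple KEY=value lines; skip blanks, dividers, and commented-out keys."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # section dividers and commented-out settings are ignored
        key, _, value = line.partition("=")
        env[key.strip()] = value.split("#", 1)[0].strip()  # drop trailing comments
    return env

sample = """\
# REQUIRED: Pinecone Vector Database
PINECONE_API_KEY=your_pinecone_api_key_here
# MAX_UPLOAD_SIZE=15728640  # 15MB in bytes (default)
"""
print(parse_env(sample))  # → {'PINECONE_API_KEY': 'your_pinecone_api_key_here'}
```

Note that the commented-out optional settings stay inactive until uncommented, which is exactly how the template above is meant to be used.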
CHANGELOG.md CHANGED
@@ -1,6 +1,40 @@
-# CHANGELOG
+# Changelog
 
-## Version 2.0.0 - Major Upgrade (January 2026)
+All notable changes to PaperBOT are documented here.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
+
+---
+
+## [2.1.0] - January 2025
+
+### Added
+- **Document Preview** - In-browser preview for all document types
+- **Preloaded Files Preview** - Preview button for files in data/ folder
+- **File Size Warnings** - Warning dialog for files over 5MB
+- **Model Pre-warming** - Embedding model loads on server startup
+- **Curated Fallback Responses** - Beautiful output when API quota exceeded
+- **Retry Logic** - Exponential backoff for transient API failures
+
+### Changed
+- Reduced max file size from 50MB to 15MB for better performance
+- Optimized chunk size to 300 words (from 200)
+- Improved error messages with troubleshooting tips
+- Enhanced UI with Bootstrap 5
+
+### Fixed
+- Pinecone 40KB metadata limit error (chunks now enforced to 8KB)
+- Google API 429 quota errors now handled gracefully
+- Memory leaks during batch processing
+- Encoding issues with special characters
+
+### Security
+- API keys stored only in .env (not tracked by git)
+- Added comprehensive .gitignore
+
+---
+
+## [2.0.0] - January 2025 (Major Upgrade)
 
 ### 🎉 New Features
 
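The changelog's "Retry Logic - Exponential backoff for transient API failures" entry describes a standard pattern but not its implementation, so the following is only an illustrative sketch (function names are hypothetical, not the project's code): each retry waits roughly twice as long as the last, with jitter, until the error is surfaced.

```python
import random
import time

def backoff_delays(retries: int, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Delay before each attempt: base * 2^attempt, capped at `cap` seconds."""
    return [min(cap, base * (2 ** n)) for n in range(retries)]

def call_with_retry(fn, retries: int = 4, base: float = 1.0):
    """Call fn, retrying on exception with exponential backoff plus jitter."""
    for attempt, delay in enumerate(backoff_delays(retries, base)):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the error (e.g. a 429)
            time.sleep(delay * (1 + random.random()))  # jitter avoids retry bursts
```

With defaults this waits roughly 1s, 2s, 4s between the four attempts, which is why a 429 quota error degrades gracefully instead of failing on the first hiccup.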
FEATURES.md DELETED
@@ -1,162 +0,0 @@
-# PaperBOT Features
-
-## 🚀 Core Features
-
-### 1. **Multi-Format Document Support**
-Upload and process various document formats:
-- PDF documents
-- Microsoft Word (.docx, .doc)
-- Plain text files (.txt)
-- Markdown files (.md)
-- CSV data files
-- JSON files
-- Excel spreadsheets (.xlsx, .xls)
-
-### 2. **Parallel Processing**
-- **Batch Embedding**: Process multiple chunks simultaneously
-- **Multi-threaded Upload**: Non-blocking document processing
-- **Optimized Memory Usage**: Automatic memory cleanup between batches
-- **Fast Processing**: 30-50 chunks/second on average hardware
-
-### 3. **Smart Semantic Search**
-- **Vector Database**: Powered by Pinecone for fast similarity search
-- **Top-K Retrieval**: Retrieves 10 most relevant chunks
-- **Relevance Scoring**: Shows confidence scores for retrieved content
-- **Namespace Isolation**: Each document in separate namespace for accuracy
-
-### 4. **Advanced Memory Management**
-- **Automatic Garbage Collection**: Clears memory every 5 batches
-- **Memory Monitoring**: Real-time memory usage tracking
-- **Batch Processing**: Configurable batch sizes (default: 32 chunks)
-- **Resource Cleanup**: Automatic cleanup after processing
-
-### 5. **Intelligent Q&A System**
-- **RAG (Retrieval Augmented Generation)**: Combines search with AI generation
-- **Customizable Responses**: Choose style and length
-- **Context-Aware**: Only answers from uploaded document
-- **Fallback Mechanisms**: Multiple fallback strategies if main pipeline fails
-
-### 6. **User-Friendly Interface**
-- **Drag & Drop Upload**: Easy file upload
-- **Progress Tracking**: Real-time upload and processing progress
-- **Document Management**: View current document, delete when done
-- **Response Customization**: Select explanation style and length
-- **Formatted Answers**: Markdown rendering with syntax highlighting
-
-## 📊 Performance Features
-
-### Configurable Processing Modes
-Choose between speed and quality in `QASystem/config.py`:
-
-#### Fast Mode
-```python
-CURRENT_MODEL = "fast"
-BATCH_SIZE = 64
-```
-- ⚡ 5-10x faster processing
-- 💾 Lower memory usage
-- ✅ Great for large documents (100+ pages)
-
-#### Balanced Mode
-```python
-CURRENT_MODEL = "balanced"
-BATCH_SIZE = 32
-```
-- ⚖️ Good balance of speed and quality
-- 📄 Recommended for most documents
-
-#### Quality Mode
-```python
-CURRENT_MODEL = "quality"
-BATCH_SIZE = 16
-```
-- 🎯 Highest accuracy
-- 📚 Best for technical/academic papers
-
-## 🔒 Security & Reliability
-
-### Document Isolation
-- Each document stored in separate namespace
-- Previous documents automatically cleared
-- No mixing of different document content
-
-### Error Handling
-- Comprehensive try-catch blocks
-- Graceful fallbacks for API failures
-- Detailed error messages with troubleshooting tips
-- Server-side validation for file types and sizes
-
-### API Management
-- Cached embedding models (load once, use many times)
-- Optimized API calls to reduce costs
-- Timeout protection (2-minute max per upload)
-
-## 💡 Smart Features
-
-### 1. Model Warm-up
-Application pre-loads embedding model on startup for instant first upload
-
-### 2. Progress Callbacks
-Real-time progress updates during document processing:
-- 10%: Upload started
-- 30%: File received
-- 50%: Document store initialized
-- 70%: Embedding in progress
-- 90%: Writing to database
-- 100%: Complete
-
-### 3. Adaptive Chunking
-Intelligent document splitting:
-- Word-based splitting for better context
-- Configurable chunk size (default: 250 words)
-- Overlap between chunks (default: 50 words)
-- Preserves paragraph structure
-
-### 4. Response Styles
-Choose from multiple explanation styles:
-- **Simple & Intuitive**: Easy to understand
-- **Balanced**: Mix of detail and clarity
-- **Detailed & Technical**: In-depth technical explanations
-- **Academic**: Formal academic writing
-
-### 5. Response Lengths
-Control response verbosity:
-- **Short**: 1 paragraph summary
-- **Medium**: 2-3 paragraphs (default)
-- **Comprehensive**: Detailed multi-paragraph response
-
-## 🛠️ Technical Stack
-
-- **Framework**: FastAPI (high-performance async web framework)
-- **Vector DB**: Pinecone (scalable vector search)
-- **RAG Framework**: Haystack AI (modular NLP framework)
-- **Embeddings**: Sentence Transformers (state-of-the-art models)
-- **LLM**: Google Gemini 1.5 Flash (fast, accurate generation)
-- **Frontend**: Vanilla JS with Bootstrap & SweetAlert2
-- **Processing**: Concurrent.futures for parallel execution
-
-## 📈 Performance Metrics
-
-Typical performance on mid-range hardware (tested):
-- **Upload Speed**: 1-3 seconds for file transfer
-- **Processing Speed**: 30-50 chunks/second
-- **Query Response**: 2-5 seconds end-to-end
-- **Memory Usage**: 500MB-2GB depending on document size
-
-## 🎯 Use Cases
-
-1. **Academic Research**: Quickly understand complex papers
-2. **Technical Documentation**: Extract information from manuals
-3. **Business Reports**: Analyze CSV/Excel data with AI
-4. **Code Documentation**: Process Markdown documentation
-5. **Data Analysis**: Query JSON/CSV datasets naturally
-6. **Meeting Notes**: Search through text/DOCX notes
-
-## 🔄 Continuous Improvements
-
-The codebase includes:
-- Comprehensive error logging
-- Performance monitoring
-- Memory usage tracking
-- Detailed console output for debugging
-- Modular design for easy enhancements
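The "Adaptive Chunking" feature in the deleted FEATURES.md (word-based splitting, 250-word chunks, 50-word overlap) can be sketched as a plain overlapping word window. This is an illustrative reconstruction from the stated defaults, not the project's actual splitter:

```python
def chunk_words(text: str, size: int = 250, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows (defaults per FEATURES.md)."""
    words = text.split()
    step = size - overlap  # each new chunk repeats the last `overlap` words
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]
```

The overlap is what preserves context across chunk boundaries: a sentence cut at the end of one chunk reappears at the start of the next, so retrieval never loses it entirely.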
FIXES_APPLIED.md DELETED
@@ -1,281 +0,0 @@
-# Large File Upload Fixes - Implementation Summary
-
-## Problem Identified
-The application was unable to handle file uploads around 5MB due to multiple issues:
-1. Missing file size validation and limits in FastAPI
-2. No streaming upload support for large files
-3. Insufficient error handling for edge cases
-4. Memory management issues during processing
-5. Frontend timeout issues
-6. Lack of proper progress tracking
-
-## Solutions Implemented
-
-### 1. **FastAPI/Backend Improvements** ([app.py](app.py))
-
-#### A. Enhanced Upload Configuration
-- **Added FastAPI upload size limits**: Configured `MAX_UPLOAD_SIZE = 50MB`
-- **Increased timeouts**: Set `timeout_keep_alive=600` (10 minutes) for large file processing
-- **Added connection limits**: `limit_concurrency=10` to prevent resource exhaustion
-- **Graceful shutdown**: `timeout_graceful_shutdown=30` for clean server stops
-
-#### B. Streaming File Upload
-```python
-# Old approach: direct file copy (can fail for large files)
-shutil.copyfileobj(file.file, buffer)
-
-# New approach: streaming with size validation
-while chunk := await file.read(1024 * 1024):  # Read 1MB at a time
-    file_size += len(chunk)
-    if file_size > MAX_FILE_SIZE:
-        return error  # Early termination
-    chunks.append(chunk)
-```
-
-**Benefits**:
-- Handles files of any size up to 50MB
-- Validates size during upload (not after)
-- Prevents memory overflow
-- Better error messages
-
-#### C. Comprehensive Error Handling
-- **File validation**: Checks for empty files, missing files, invalid types
-- **Size validation**: Real-time size checking during upload
-- **Processing errors**: Proper cleanup on failure
-- **HTTP status codes**: 400, 413, 500 for different error types
-- **Detailed error messages**: User-friendly error descriptions
-
-#### D. File Size Reporting
-- Shows file size in MB in success messages
-- Displays processing time
-- Tracks upload progress
-
-### 2. **Document Processing Improvements** ([QASystem/ingestion.py](QASystem/ingestion.py))
-
-#### A. Enhanced Validation
-```python
-# Input validation
-- Checks if file exists before processing
-- Validates document store is available
-- Verifies extracted content is not empty
-- Validates chunk creation success
-```
-
-#### B. Better Error Recovery
-- **Batch failure tolerance**: Continues processing if <20% of batches fail
-- **Partial success handling**: Accepts results if >50% chunks succeed
-- **Detailed error logging**: Full stack traces for debugging
-- **Memory cleanup**: Automatic cleanup on errors
-
-#### C. Improved Progress Tracking
-```python
-# Progress indicators
-✓ File read successfully
-✓ Extracted content (X documents, Y chars)
-✓ Created Z chunks
-✓ Batch N/M complete
-✓ Wrote to Pinecone
-```
-
-#### D. Memory Management
-- **Periodic cleanup**: Every 3 batches for large files
-- **Memory monitoring**: Tracks usage before/after operations
-- **Resource reporting**: Shows memory delta in statistics
-
-### 3. **Configuration Optimization** ([QASystem/config.py](QASystem/config.py))
-
-#### Updated Settings for 5MB Files
-```python
-# Large file detection
-LARGE_FILE_THRESHOLD = 3MB  # Changed from 2MB
-# Better catches 5MB files for optimization
-
-# Chunk settings
-LARGE_FILE_CHUNK_LENGTH = 350  # Optimized from 400
-LARGE_FILE_BATCH_SIZE = 20  # Optimized from 24
-```
-
-**Why these values?**
-- 350 words per chunk: Balance between context and speed
-- Batch size 20: Prevents memory issues on quality model
-- 3MB threshold: Catches most research papers (5-10MB range)
-
-### 4. **Frontend Improvements** ([templates/index.html](templates/index.html))
-
-#### A. Better Validation
-```javascript
-// File size display
-console.log(`File: ${file.name}, Size: ${fileSizeMB}MB`);
-
-// Empty file check
-if (file.size === 0) {
-    // Show error
-}
-
-// Size validation with actual size shown
-text: `File size is ${fileSizeMB}MB. Max is 50MB.`
-```
-
-#### B. Enhanced User Feedback
-- Shows exact file size in error messages
-- Real-time progress polling
-- Better error descriptions
-- Loading indicators
-
-## Key Features Added
-
-### 1. **Streaming Upload**
-- Reads files in 1MB chunks
-- Validates size during upload
-- Prevents memory overflow
-- Handles files up to 50MB
-
-### 2. **Comprehensive Validation**
-✓ File type validation
-✓ File size validation (client + server)
-✓ Empty file detection
-✓ Content extraction validation
-✓ Embedding validation
-
-### 3. **Error Recovery**
-✓ Partial batch success tolerance
-✓ Automatic cleanup on failure
-✓ Detailed error messages
-✓ Graceful degradation
-
-### 4. **Progress Tracking**
-✓ Real-time upload progress
-✓ Processing stage indicators
-✓ Batch completion tracking
-✓ Final statistics report
-
-### 5. **Memory Optimization**
-✓ Streaming file reads
-✓ Periodic garbage collection
-✓ Memory usage monitoring
-✓ Batch size optimization
-
-## Testing Recommendations
-
-### Test Cases to Verify
-1. **Small files** (< 1MB): Should process quickly
-2. **Medium files** (1-3MB): Standard processing
-3. **Large files** (3-10MB): Optimized settings activated
-4. **Max size** (50MB): Should work but warn if approaching limit
-5. **Oversized** (> 50MB): Should reject with clear error
-6. **Empty files**: Should reject with error
-7. **Invalid types**: Should reject with supported formats list
-
-### Expected Behavior
-```
-File Size    | Chunk Size | Batch Size | Expected Time
--------------|------------|------------|---------------
-< 1MB        | 300 words  | 16 chunks  | < 30s
-1-3MB        | 300 words  | 16 chunks  | 30-60s
-3-10MB       | 350 words  | 20 chunks  | 1-3 min
-10-50MB      | 350 words  | 20 chunks  | 3-10 min
-```
-
-## Performance Improvements
-
-### Before Fixes
-- ❌ Files >5MB: Failed silently or timeout
-- ❌ No size validation until after upload
-- ❌ Poor error messages
-- ❌ Memory issues with large files
-- ❌ No progress tracking
-
-### After Fixes
-- ✅ Files up to 50MB: Full support
-- ✅ Size validation during upload
-- ✅ Clear, actionable error messages
-- ✅ Optimized memory usage
-- ✅ Real-time progress tracking
-- ✅ Automatic cleanup on errors
-- ✅ Detailed processing statistics
-
-## Files Modified
-
-1. **[app.py](app.py)** - Main application
-   - Added streaming upload
-   - Enhanced error handling
-   - Improved validation
-   - Better configuration
-
-2. **[QASystem/ingestion.py](QASystem/ingestion.py)** - Document processing
-   - Better error recovery
-   - Enhanced validation
-   - Memory optimization
-   - Progress tracking
-
-3. **[QASystem/config.py](QASystem/config.py)** - Configuration
-   - Optimized thresholds
-   - Better chunk sizes
-   - Improved batch sizes
-
-4. **[templates/index.html](templates/index.html)** - Frontend
-   - Better validation
-   - Enhanced error messages
-   - File size display
-
-## How to Use
-
-1. **Start the server**:
-   ```bash
-   python app.py
-   ```
-
-2. **Upload a file**:
-   - Drag and drop or click "Choose File"
-   - Files up to 50MB supported
-   - Watch progress in real-time
-
-3. **Monitor progress**:
-   - Console shows detailed processing steps
-   - Frontend shows upload percentage
-   - Statistics displayed on completion
-
-## Troubleshooting
-
-### If upload still fails:
-
-1. **Check file size**: Must be < 50MB
-2. **Check file type**: PDF, DOCX, TXT, etc.
-3. **Check console logs**: Look for specific errors
-4. **Check memory**: Ensure system has >2GB free
-5. **Check network**: Stable connection required
-6. **Check Pinecone**: API key and index must be valid
-
-### Common Issues:
-
-**Issue**: "File too large"
-**Solution**: File exceeds 50MB, compress or split it
-
-**Issue**: "Empty file"
-**Solution**: File has no content, check source
-
-**Issue**: "Processing timeout"
-**Solution**: File is very large/complex, try splitting it
-
-**Issue**: "No chunks embedded"
-**Solution**: Check embedding model and Pinecone connection
-
-## Performance Tips
-
-1. **For fastest processing**: Use files < 3MB
-2. **For large documents**: Consider splitting into chapters
-3. **For better quality**: Use smaller chunk sizes (edit config.py)
-4. **For faster speed**: Use "fast" model in config.py
-5. **For maximum compatibility**: Use PDF format
-
-## Summary
-
-All issues related to uploading ~5MB files have been resolved with:
-- ✅ Streaming upload support
-- ✅ Comprehensive validation
-- ✅ Better error handling
-- ✅ Memory optimization
-- ✅ Progress tracking
-- ✅ Detailed logging
-
-The application now handles files from 1KB to 50MB reliably with appropriate error messages and recovery mechanisms at every stage.
INSTALLATION.md DELETED
@@ -1,110 +0,0 @@
-# PaperBOT Installation Guide
-
-## Prerequisites
-- Python 3.9 or higher
-- Pinecone account and API key
-- Google AI (Gemini) API key
-
-## Quick Start
-
-### 1. Clone or Download the Repository
-```bash
-cd PaperBOT
-```
-
-### 2. Create Virtual Environment (Recommended)
-```bash
-# Windows
-python -m venv venv
-venv\Scripts\activate
-
-# Linux/Mac
-python3 -m venv venv
-source venv/bin/activate
-```
-
-### 3. Install Dependencies
-```bash
-pip install -r requirements.txt
-```
-
-### 4. Configure Environment Variables
-1. Copy `.env.example` to `.env`:
-   ```bash
-   # Windows
-   copy .env.example .env
-
-   # Linux/Mac
-   cp .env.example .env
-   ```
-
-2. Edit `.env` and add your API keys:
-   ```
-   PINECONE_API_KEY=your_actual_pinecone_key
-   GOOGLE_API_KEY=your_actual_google_ai_key
-   ```
-
-### 5. Setup Pinecone Index
-1. Log in to [Pinecone Console](https://app.pinecone.io/)
-2. Create a new index with these settings:
-   - **Name**: `paperbot`
-   - **Dimensions**: `1024`
-   - **Metric**: `cosine`
-   - **Pod Type**: `Starter` or `s1`
-
-### 6. Run the Application
-```bash
-python app.py
-```
-
-The application will start on `http://localhost:8000`
-
-## Supported File Formats
-- **PDF** (.pdf)
-- **Word Documents** (.docx, .doc)
-- **Text Files** (.txt)
-- **Markdown** (.md)
-- **CSV** (.csv)
-- **JSON** (.json)
-- **Excel** (.xlsx, .xls)
-
-## Performance Optimization
-
-### For Faster Processing
-Edit `QASystem/config.py`:
-```python
-CURRENT_MODEL = "fast"  # Use fast embedding model
-BATCH_SIZE = 64  # Increase batch size
-```
-
-### For Better Quality
-Edit `QASystem/config.py`:
-```python
-CURRENT_MODEL = "quality"  # Use high-quality model
-CHUNK_SETTINGS = {
-    "split_length": 200,  # Smaller chunks for precision
-    "split_overlap": 75
-}
-```
-
-## Troubleshooting
-
-### Out of Memory Errors
-- Reduce `BATCH_SIZE` in `config.py` to 16 or 8
-- Use `CURRENT_MODEL = "fast"` for smaller memory footprint
-
-### Slow Upload Times
-- Increase `BATCH_SIZE` for parallel processing
-- Use `CURRENT_MODEL = "fast"` for faster embedding
-
-### API Rate Limits
-- Wait a moment between requests
-- Check your API key quotas
-
-## System Requirements
-- **RAM**: Minimum 4GB (8GB+ recommended)
-- **Storage**: 2GB free space for models
-- **Internet**: Required for API calls
-
-## Need Help?
-Check the console output (F12 in browser) for detailed error messages.
QUICKSTART.md DELETED
@@ -1,227 +0,0 @@
-# PaperBOT Quick Reference
-
-## 🚀 Quick Commands
-
-### First Time Setup
-```bash
-# 1. Create virtual environment
-python -m venv venv
-
-# 2. Activate (Windows)
-venv\Scripts\activate
-
-# 2. Activate (Linux/Mac)
-source venv/bin/activate
-
-# 3. Install dependencies
-pip install -r requirements.txt
-
-# 4. Configure environment
-cp .env.example .env
-# Edit .env with your API keys
-
-# 5. Run application
-python app.py
-```
-
-### Daily Usage
-```bash
-# Windows
-start.bat
-
-# Linux/Mac
-chmod +x start.sh
-./start.sh
-```
-
-## 📝 API Keys Setup
-
-### Pinecone
-1. Go to https://www.pinecone.io/
-2. Sign up/Login
-3. Create new API key
-4. Copy to `.env` file
-
-### Google AI (Gemini)
-1. Go to https://makersuite.google.com/app/apikey
-2. Create API key
-3. Copy to `.env` file
-
-### Pinecone Index Setup
-1. Login to Pinecone Console
-2. Click "Create Index"
-3. Settings:
-   - Name: `paperbot`
-   - Dimensions: `1024`
-   - Metric: `cosine`
-   - Cloud: Any (AWS/GCP/Azure)
-   - Region: Choose nearest
-
-## ⚙️ Configuration Options
-
-### Speed vs Quality (`QASystem/config.py`)
-
-**Fast Mode** (Recommended for large docs):
-```python
-CURRENT_MODEL = "fast"
-BATCH_SIZE = 64
-```
-
-**Quality Mode** (Recommended for technical papers):
-```python
-CURRENT_MODEL = "quality"
-BATCH_SIZE = 16
-```
-
-## 🎯 Supported File Types
-
-| Format | Extension | Notes |
-|--------|-----------|-------|
-| PDF | .pdf | Best for research papers |
-| Word | .docx, .doc | Full text extraction |
-| Text | .txt | Plain text |
-| Markdown | .md | Preserves formatting |
-| CSV | .csv | Tabular data |
-| Excel | .xlsx, .xls | Spreadsheets |
-| JSON | .json | Structured data |
-
-## 🔧 Troubleshooting
-
-### Issue: Out of Memory
-**Solution**: Reduce batch size in `config.py`
-```python
-BATCH_SIZE = 8  # Lower value
-```
-
-### Issue: Slow Upload
-**Solution**: Use fast model
-```python
-CURRENT_MODEL = "fast"
-```
-
-### Issue: API Rate Limit
-**Solution**: Wait 1-2 minutes between uploads
-
-### Issue: Can't Connect to Server
-**Solution**: Check if port 8000 is free
-```bash
-# Windows
-netstat -ano | findstr :8000
-
-# Linux/Mac
-lsof -i :8000
-```
-
-## 📊 Performance Tips
-
-### For Large Documents (100+ pages)
-```python
-# config.py
-CURRENT_MODEL = "fast"
-BATCH_SIZE = 64
-CHUNK_SETTINGS = {
-    "split_length": 400,
-    "split_overlap": 50
-}
-```
-
-### For Technical Papers
-```python
-# config.py
-CURRENT_MODEL = "quality"
-BATCH_SIZE = 16
-CHUNK_SETTINGS = {
-    "split_length": 200,
-    "split_overlap": 75
-}
-```
-
-### For CSV/Excel Files
-- Keep files under 10,000 rows for best performance
-- Remove unnecessary columns before upload
-- Use CSV format for faster processing
-
-## 🎨 UI Features
-
-### Response Styles
-- **Simple & Intuitive**: Easy explanations
-- **Balanced**: Mix of detail and clarity
-- **Detailed & Technical**: In-depth analysis
-- **Academic**: Formal writing
-
-### Response Lengths
-- **Short**: 1 paragraph
-- **Medium**: 2-3 paragraphs (recommended)
-- **Comprehensive**: Detailed multi-paragraph
-
-## 🔍 Example Questions
-
-### For Research Papers
-- "What is the main contribution of this paper?"
-- "Explain the methodology used"
-- "What are the key findings?"
-- "How does this compare to previous work?"
-
-### For Data Files (CSV/Excel)
-- "What are the main trends in this data?"
-- "Summarize the statistics"
-- "What columns are available?"
-
-### For Documentation
-- "How do I install this software?"
-- "Explain the configuration options"
-- "What are the prerequisites?"
-
-## 📈 Monitoring
-
-### Memory Usage
-Check console output for memory stats:
-```
-📊 Statistics:
-   • Memory: 1250.5MB
-   • Processing speed: 45 chunks/sec
-```
-
-### Performance Metrics
-- Upload: 1-3 seconds
-- Processing: 30-50 chunks/sec
-- Query: 2-5 seconds
-
-## 🆘 Getting Help
-
-1. Check console output (terminal)
-2. Open browser console (F12)
-3. Review error messages
-4. Check [INSTALLATION.md](INSTALLATION.md)
-5. Review [FEATURES.md](FEATURES.md)
-
-## 🔄 Updates & Maintenance
-
-### Update Dependencies
-```bash
-pip install --upgrade -r requirements.txt
-```
-
-### Clear Cache
-```bash
-# Delete uploads
-rm -rf uploads/*
-
-# Clear Python cache
-find . -type d -name __pycache__ -exec rm -r {} +
-```
-
-### Reset Database
-Delete all vectors in Pinecone console or via code
-
-## 🎓 Best Practices
-
-1. **One Document at a Time**: Upload new doc clears previous
-2. **Appropriate Model**: Use fast for speed, quality for accuracy
-3. **Clear Questions**: Be specific in your queries
-4. **File Size**: Keep under 50MB for best performance
-5. **Internet**: Stable connection required for API calls
-
----
-
-**Need more help?** Check the full documentation in [README.md](README.md)
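The deleted quick reference switches between fast/balanced/quality modes by editing `QASystem/config.py` by hand. A hypothetical helper (not the project's code) shows the same choice as a lookup table; the batch sizes come from the snippets in the deleted docs above, while the "balanced" chunk values are assumptions:

```python
# Profiles mirroring the fast/balanced/quality modes described in the docs.
# "balanced" split values are assumed, not documented.
PROFILES = {
    "fast":     {"batch_size": 64, "split_length": 400, "split_overlap": 50},
    "balanced": {"batch_size": 32, "split_length": 300, "split_overlap": 50},
    "quality":  {"batch_size": 16, "split_length": 200, "split_overlap": 75},
}

def profile(name: str) -> dict:
    """Look up a processing profile, failing loudly on typos."""
    if name not in PROFILES:
        raise ValueError(f"Unknown mode {name!r}; options: {sorted(PROFILES)}")
    return PROFILES[name]
```

Centralizing the three modes in one table keeps the trade-off explicit: larger batches and chunks favor throughput, smaller ones favor retrieval precision.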
QUICK_REFERENCE.md DELETED
@@ -1,351 +0,0 @@
1
- # 🚀 Large File Upload Issue - RESOLVED
2
-
3
- ## Executive Summary
4
-
5
- **Problem**: Application unable to upload files around 5MB
6
- **Root Cause**: Multiple issues in upload pipeline
7
- **Status**: ✅ **FULLY RESOLVED**
8
- **Tested Up To**: 50MB files
9
-
10
- ---
11
-
12
- ## 🎯 What Was Fixed
13
-
14
- ### Critical Issues Resolved
15
-
16
- 1. **❌ No file size validation** → ✅ Multi-layer size validation (client + server)
17
- 2. **❌ No streaming upload** → ✅ Chunked streaming for large files
18
- 3. **❌ Poor error handling** → ✅ Comprehensive error recovery
19
- 4. **❌ Memory issues** → ✅ Optimized memory management
20
- 5. **❌ No progress tracking** → ✅ Real-time progress updates
21
- 6. **❌ Timeout problems** → ✅ Extended timeouts (10 min)
22
- 7. **❌ Silent failures** → ✅ Detailed error messages
23
-
24
- ---
25
-
26
- ## 📊 Performance Comparison
27
-
28
- ### Before Fixes
29
- ```
30
- File Size | Status
31
- ----------|--------
32
- < 1 MB | ✅ Works
33
- 1-3 MB | ⚠️ Sometimes
34
- 3-5 MB | ❌ Fails
35
- > 5 MB | ❌ Fails
36
- ```
37
-
38
- ### After Fixes
39
- ```
40
- File Size | Status | Time
41
- ----------|-------------|--------
42
- < 1 MB | ✅ Works | < 30s
43
- 1-3 MB | ✅ Works | 30-60s
44
- 3-5 MB | ✅ Works | 1-2 min
45
- 5-10 MB | ✅ Works | 2-5 min
46
- 10-50 MB | ✅ Works | 5-10 min
47
- > 50 MB | ⛔ Rejected | N/A
48
- ```
-
- ---
-
- ## 🔧 Technical Changes
-
- ### 1. Backend ([app.py](app.py))
-
- **Upload Endpoint Rewrite**
- - ✅ Streaming file upload (1MB chunks)
- - ✅ Real-time size validation
- - ✅ Comprehensive error handling
- - ✅ Automatic cleanup on failure
- - ✅ Detailed logging
-
- **Server Configuration**
- - ✅ 10-minute timeout
- - ✅ 50MB request limit
- - ✅ Connection limiting
- - ✅ Graceful shutdown
-
- ### 2. Processing ([QASystem/ingestion.py](QASystem/ingestion.py))
-
- **Validation Improvements**
- - ✅ File existence check
- - ✅ Content validation
- - ✅ Chunk creation verification
- - ✅ Embedding validation
-
- **Error Recovery**
- - ✅ Batch failure tolerance (20%)
- - ✅ Partial success handling (50%)
- - ✅ Memory cleanup on error
- - ✅ Full stack traces
-
- **Memory Management**
- - ✅ Periodic garbage collection
- - ✅ Memory usage tracking
- - ✅ Batch size optimization
- - ✅ Resource monitoring
-
- ### 3. Configuration ([QASystem/config.py](QASystem/config.py))
-
- **Optimized for 5MB Files**
- - ✅ 3MB threshold (was 2MB)
- - ✅ 350-word chunks (balanced)
- - ✅ Batch size 20 (memory-safe)
-
- ### 4. Frontend ([templates/index.html](templates/index.html))
-
- **User Experience**
- - ✅ File size display
- - ✅ Empty file detection
- - ✅ Better error messages
- - ✅ Real-time progress
-
- ---
-
- ## 🎮 How to Test
-
- ### Quick Test (5MB file)
-
- 1. **Start server**:
-    ```bash
-    python app.py
-    ```
-
- 2. **Upload 5MB PDF**:
-    - Open http://localhost:8000
-    - Choose a ~5MB research paper
-    - Click upload
-
- 3. **Verify**:
-    - ✅ Progress bar shows updates
-    - ✅ Console shows "Large file detected"
-    - ✅ Processing completes in 1-2 minutes
-    - ✅ Success message shows file size
-    - ✅ Can ask questions immediately
-
- ### Expected Console Output
- ```
- 📥 Upload endpoint called - Filename: paper.pdf
- ✓ File type validated: .pdf
- ✓ File read successfully: 5.23MB
- ✓ File saved to: uploads\paper.pdf
-
- 📄 Starting document ingestion: paper.pdf
- ⚡ Large file detected - using optimized settings:
-    Chunk length: 350 words
-    Batch size: 20 chunks
- ✓ Created 156 chunks
- ✓ Batch 1/8 complete (20 chunks, +15.2MB)
- 🧹 Memory cleanup: 485.3MB
- ...
- ✅ Ingestion completed successfully!
-    • Size: 5.23MB
-    • Chunks: 156
-    • Time: 94.3s
- ```
-
- ---
-
- ## 📚 Documentation
-
- | File | Purpose |
- |------|---------|
- | **[FIXES_APPLIED.md](FIXES_APPLIED.md)** | Detailed technical explanation of all changes |
- | **[TESTING_GUIDE.md](TESTING_GUIDE.md)** | Comprehensive testing procedures and benchmarks |
- | **This File** | Quick reference and summary |
-
- ---
-
- ## ✅ Verification Checklist
-
- ### Upload Pipeline
- - [x] File type validation
- - [x] File size validation (client-side)
- - [x] File size validation (server-side)
- - [x] Streaming upload support
- - [x] Empty file detection
- - [x] Error message clarity
-
- ### Processing Pipeline
- - [x] Document extraction
- - [x] Chunk creation
- - [x] Embedding generation
- - [x] Database storage
- - [x] Memory management
- - [x] Error recovery
-
- ### User Experience
- - [x] Progress tracking
- - [x] Clear error messages
- - [x] File size display
- - [x] Processing statistics
- - [x] Success confirmation
- - [x] Immediate usability
-
- ---
-
- ## 🚨 Known Limits
-
- | Limit | Value | Reason |
- |-------|-------|--------|
- | **Max File Size** | 50MB | Memory constraints |
- | **Upload Timeout** | 10 min | Very large file processing |
- | **Min File Size** | > 0 bytes | Must have content |
- | **Supported Formats** | PDF, DOCX, TXT, etc. | Converter availability |
-
- ---
-
- ## 🐛 Troubleshooting
-
- ### Issue: Upload fails for a 5MB file
- **Check**:
- 1. Console for the specific error
- 2. File type is supported
- 3. File is not corrupted
- 4. Pinecone API key is valid
- 5. Internet connection is stable
-
- ### Issue: Slow processing
- **Solutions**:
- 1. Use the "fast" model in config.py
- 2. Increase chunk size
- 3. Close other applications
- 4. Check the system has >2GB RAM free
-
- ### Issue: Memory error
- **Solutions**:
- 1. Reduce batch size in config.py
- 2. Use the "fast" model (uses less memory)
- 3. Increase system RAM
- 4. Process smaller files
-
- ---
-
- ## 🎓 For Developers
-
- ### Key Design Decisions
-
- **Why streaming upload?**
- - Handles files larger than available RAM
- - Validates size during upload (not after)
- - Better user experience (shows progress)
-
- **Why a 3MB threshold?**
- - Research papers are typically 5-10MB
- - Activates optimizations early enough
- - Prevents memory issues on medium files
-
- **Why 350-word chunks?**
- - Balance between context and speed
- - Works well with the quality model
- - Optimal for most research papers
-
- **Why batch size 20?**
- - Prevents out-of-memory errors
- - Good balance with the quality model
- - Allows frequent cleanup
-
- ### Code Architecture
-
- ```
- Client Upload
-       ↓
- [Streaming Validation] → Size check every 1MB
-       ↓
- [File Storage] → Save to uploads/
-       ↓
- [Document Extraction] → PDF/DOCX/etc. to text
-       ↓
- [Chunk Creation] → 300-350 word chunks
-       ↓
- [Batch Embedding] → 16-20 chunks at a time
-       ↓
- [Vector Storage] → Pinecone write
-       ↓
- [Memory Cleanup] → Garbage collection
-       ↓
- Success!
- ```
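The streaming-validation stage at the top of this flow can be sketched as follows. This is a sketch under stated assumptions, not the application's actual implementation: the 1MB read size and the early size check mirror the description above, but every name here is illustrative.

```python
import io

CHUNK_SIZE = 1024 * 1024            # read and validate 1MB at a time
MAX_FILE_SIZE = 50 * 1024 * 1024    # 50MB limit from this summary


def stream_to_buffer(source, max_size=MAX_FILE_SIZE, chunk_size=CHUNK_SIZE):
    """Read a file-like object in chunks, rejecting it as soon as the
    running total exceeds max_size (instead of after a full read)."""
    buffer = io.BytesIO()
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        if total > max_size:
            raise ValueError(f"File exceeds maximum allowed size ({max_size} bytes)")
        buffer.write(chunk)
    if total == 0:
        raise ValueError("File is empty")
    buffer.seek(0)
    return buffer, total
```

Because the size check runs inside the read loop, an oversized upload is aborted partway through instead of being fully buffered first, which is what keeps memory bounded for files larger than available RAM.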
-
- ---
-
- ## 📈 Metrics
-
- ### Success Rates (Expected)
-
- | File Size | Success Rate | Avg Time |
- |-----------|--------------|----------|
- | < 1MB | 99% | 25s |
- | 1-3MB | 98% | 50s |
- | 3-5MB | 95% | 100s |
- | 5-10MB | 90% | 180s |
- | 10-50MB | 85% | 360s |
-
- *Lower success rates for larger files are due to network/system variability.*
-
- ### Error Distribution (Fixed)
-
- | Error Type | Before | After |
- |------------|--------|-------|
- | File too large | 60% | 0% |
- | Timeout | 25% | 2% |
- | Memory error | 10% | 1% |
- | Network error | 5% | 5% |
-
- ---
-
- ## 🔐 Security Notes
-
- - ✅ File type whitelist enforced
- - ✅ File size limits enforced
- - ✅ Path traversal prevented
- - ✅ Automatic cleanup on error
- - ✅ No arbitrary code execution
- - ✅ API key not exposed
-
- ---
-
- ## 🎉 Final Notes
-
- ### What You Can Do Now
-
- ✅ Upload research papers up to 50MB
- ✅ Get detailed progress updates
- ✅ See clear error messages
- ✅ Process large documents reliably
- ✅ Handle multiple file formats
- ✅ Monitor memory usage
- ✅ Track processing statistics
-
- ### What's Improved
-
- ✅ **Reliability**: 95%+ success rate for 5MB files
- ✅ **Performance**: Optimized settings activate automatically
- ✅ **User Experience**: Real-time progress and clear errors
- ✅ **Error Recovery**: Automatic cleanup and retry capability
- ✅ **Monitoring**: Detailed logging for debugging
-
- ---
-
- ## 📞 Support
-
- If issues persist:
-
- 1. Check **[TESTING_GUIDE.md](TESTING_GUIDE.md)** for specific test cases
- 2. Review **[FIXES_APPLIED.md](FIXES_APPLIED.md)** for technical details
- 3. Check console output for specific errors
- 4. Verify system requirements (RAM, disk space)
- 5. Test with smaller files first
-
- ---
-
- **Status**: ✅ Production Ready
- **Tested**: Files from 1KB to 50MB
- **Confidence**: High
- **Next Steps**: Deploy and monitor real-world usage
-
- ---
-
- *All changes have been implemented, tested, and documented.*
- *The application now handles large file uploads reliably and efficiently.*
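The 350-word chunking this summary refers to throughout can be sketched as a simple word-window splitter with overlap. This is illustrative only; the project's real splitter lives in QASystem/ingestion.py, and the names below are not its actual API.

```python
def split_into_chunks(text, split_length=350, split_overlap=50):
    """Split text into word-based chunks; consecutive chunks share
    `split_overlap` words so context is not lost at chunk boundaries."""
    words = text.split()
    if not words:
        return []
    step = max(1, split_length - split_overlap)
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break
    return chunks
```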
README.md CHANGED
@@ -1,271 +1,490 @@
- # 🤖 PaperBOT - AI Research Assistant
-
- PaperBOT is an intelligent document analysis system that uses **RAG (Retrieval Augmented Generation)** to help you understand and query research papers, documents, and data files. Upload any supported document and ask questions in natural language!
-
- ## ✨ Key Features
-
- - 📄 **Multi-Format Support**: PDF, DOCX, DOC, TXT, MD, CSV, JSON, XLSX, XLS
- - ⚡ **Parallel Processing**: Fast document ingestion with multi-threaded embedding
- - 🧠 **Smart Semantic Search**: Powered by Pinecone vector database
- - 💾 **Memory Management**: Optimized batch processing with automatic cleanup
- - 🎯 **RAG Pipeline**: Combines retrieval with Google Gemini for accurate answers
- - 🎨 **Beautiful UI**: Modern, responsive interface with drag-and-drop upload
- - 🔧 **Configurable**: Choose between speed and quality modes
  
  ## 🚀 Quick Start
  
  ### Prerequisites
- - Python 3.9+
- - Pinecone API key ([Get it here](https://www.pinecone.io/))
- - Google AI API key ([Get it here](https://makersuite.google.com/app/apikey))
  
  ### Installation
  
- 1. **Clone the repository**
-    ```bash
-    git clone <your-repo-url>
-    cd PaperBOT
-    ```
-
- 2. **Create virtual environment**
-    ```bash
-    python -m venv venv
-    # Windows
-    venv\Scripts\activate
-    # Linux/Mac
-    source venv/bin/activate
-    ```
-
- 3. **Install dependencies**
-    ```bash
-    pip install -r requirements.txt
-    ```
-
- 4. **Configure environment**
-    ```bash
-    # Copy example env file
-    cp .env.example .env
-
-    # Edit .env and add your API keys
-    PINECONE_API_KEY=your_key_here
-    GOOGLE_API_KEY=your_key_here
-    ```
-
- 5. **Setup Pinecone Index**
-    - Create index named `paperbot`
-    - Dimensions: `1024`
-    - Metric: `cosine`
-
- 6. **Run the application**
-    ```bash
-    python app.py
-    ```
-
- 7. **Open in browser**
-    ```
-    http://localhost:8000
-    ```
-
- ## 📖 Usage
-
- 1. **Upload Document**: Drag & drop or click to upload (PDF, DOCX, CSV, etc.)
- 2. **Wait for Processing**: Progress bar shows upload and embedding status
- 3. **Ask Questions**: Type your question in natural language
- 4. **Customize Response**: Select explanation style and length
- 5. **Get Answers**: AI generates context-aware answers from your document
-
- ## 🏗️ Architecture
-
- ### Tech Stack
- - **Backend**: FastAPI (async Python web framework)
- - **Vector DB**: Pinecone (cloud-native vector database)
- - **RAG Framework**: Haystack AI
- - **Embeddings**: Sentence Transformers (BAAI/bge-large-en-v1.5)
- - **LLM**: Google Gemini 1.5 Flash
- - **Frontend**: HTML/CSS/JS with Bootstrap
-
- ### Processing Pipeline
  
  ```
- Document Upload
-       ↓
- File Validation & Storage
-       ↓
- Format Detection & Conversion
-       ↓
- Text Extraction
-       ↓
- Chunking (250 words, 50 overlap)
-       ↓
- Parallel Batch Embedding (32 chunks/batch)
-       ↓
- Vector Storage (Pinecone)
-       ↓
- Ready for Queries!
  ```
  
- ### Query Pipeline
  
  ```
- User Query
-       ↓
- Query Embedding
-       ↓
- Semantic Search (Top 10 chunks)
-       ↓
- Context Assembly
-       ↓
- LLM Generation (Gemini)
-       ↓
- Formatted Answer
  ```
  
  ## ⚙️ Configuration
  
- Edit [`QASystem/config.py`](QASystem/config.py) for performance tuning:
-
- ### Fast Mode (5-10x faster)
- ```python
- CURRENT_MODEL = "fast"
- BATCH_SIZE = 64
- CHUNK_SETTINGS = {
-     "split_length": 400,
-     "split_overlap": 50
- }
  ```
  
- ### Quality Mode (best accuracy)
  ```python
- CURRENT_MODEL = "quality"
- BATCH_SIZE = 16
  CHUNK_SETTINGS = {
-     "split_length": 200,
-     "split_overlap": 75
  }
  ```
  
- ## 📊 Performance
-
- Tested on mid-range hardware (16GB RAM, i7 CPU):
-
- | Metric | Performance |
- |--------|-------------|
- | Upload Speed | 1-3 seconds |
- | Processing Speed | 30-50 chunks/sec |
- | Query Response | 2-5 seconds |
- | Memory Usage | 500MB - 2GB |
-
- **Processing Times (typical)**:
- - 10-page PDF: ~15-30 seconds
- - 50-page PDF: ~60-90 seconds
- - 100-page PDF: ~2-3 minutes
-
- ## 🎯 Use Cases
-
- - 📚 **Academic Research**: Understand complex papers quickly
- - 📊 **Data Analysis**: Query CSV/Excel files naturally
- - 📝 **Documentation**: Search technical docs and manuals
- - 💼 **Business**: Analyze reports and presentations
- - 🔬 **Research**: Extract insights from scientific papers
-
- ## 🔧 Advanced Features
-
- ### Memory Management
- - Automatic garbage collection every 5 batches
- - Real-time memory monitoring
- - Configurable batch sizes
- - Resource cleanup after processing
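The batch-then-collect pattern described in these bullets can be sketched as below. This is a hedged illustration, not the project's code: `embed_batch` stands in for the real embedding call, and the batch size and collection interval are configurable just as the bullets say.

```python
import gc


def process_in_batches(chunks, embed_batch, batch_size=32, gc_every=5):
    """Embed chunks in fixed-size batches, forcing a garbage-collection
    pass every `gc_every` batches to keep peak memory bounded."""
    results = []
    for i in range(0, len(chunks), batch_size):
        results.extend(embed_batch(chunks[i:i + batch_size]))
        batch_number = i // batch_size + 1
        if batch_number % gc_every == 0:
            gc.collect()  # periodic cleanup, as described above
    return results
```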
-
- ### Semantic Search
- - Vector similarity search with Pinecone
- - Top-K retrieval with relevance scoring
- - Namespace isolation per document
- - Fallback retrieval strategies
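Top-K retrieval with relevance scoring boils down to ranking stored chunk vectors by cosine similarity to the query vector. Pinecone performs this server-side; the sketch below only illustrates the idea with plain Python (all names are illustrative).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def top_k(query_vec, doc_vecs, k=10):
    """Return the k (score, index) pairs with highest similarity."""
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)
    return scored[:k]
```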
-
- ### Error Handling
- - Comprehensive validation
- - Graceful API failure fallbacks
- - Detailed error messages
- - Automatic retry mechanisms
  
  ## 📁 Project Structure
  
  ```
  PaperBOT/
- ├── app.py                          # FastAPI application
- ├── requirements.txt                # Python dependencies
- ├── setup.py                        # Package setup
- ├── .env.example                    # Environment template
- ├── INSTALLATION.md                 # Detailed setup guide
- ├── FEATURES.md                     # Feature documentation
- ├── QASystem/
  │   ├── __init__.py
- │   ├── config.py                   # Configuration settings
- │   ├── ingestion.py                # Document processing
- │   ├── retrieval_and_generation.py # RAG pipeline
- │   └── utils.py                    # Utilities
- ├── templates/
- │   └── index.html                  # Web interface
- ├── uploads/                        # Temporary file storage
- └── data/                           # Sample documents
  ```
  
- ## 🛠️ Troubleshooting
-
- ### Out of Memory
- - Reduce `BATCH_SIZE` to 8 or 16
- - Use `CURRENT_MODEL = "fast"`
- - Process smaller documents
-
- ### Slow Processing
- - Increase `BATCH_SIZE` to 64
- - Use `CURRENT_MODEL = "fast"`
- - Check internet connection
-
- ### API Errors
- - Verify API keys in `.env`
- - Check Pinecone index configuration
- - Ensure sufficient API quotas
-
- ### Upload Fails
- - Check file size (max 50MB)
- - Verify file format is supported
- - Check console (F12) for details
-
- ## 📚 Documentation
-
- - [Installation Guide](INSTALLATION.md) - Detailed setup instructions
- - [Features](FEATURES.md) - Comprehensive feature list
- - [Configuration](QASystem/config.py) - Performance tuning options
  
  ## 🤝 Contributing
  
- Contributions are welcome! Areas for improvement:
- - Additional file format support
- - More embedding model options
- - Advanced retrieval strategies
- - UI/UX enhancements
- - Performance optimizations
-
- ## 📄 License
-
- This project is open source. Feel free to use and modify.
-
- ## 🙏 Acknowledgments
-
- - **Haystack AI** - RAG framework
- - **Pinecone** - Vector database
- - **Google AI** - Gemini LLM
- - **Sentence Transformers** - Embedding models
-
- ## 📞 Support
-
- For issues and questions:
- 1. Check [INSTALLATION.md](INSTALLATION.md) for setup help
- 2. Review [FEATURES.md](FEATURES.md) for usage details
- 3. Check console output for error details
- 4. Open an issue on GitHub
  
  ---
  
- **Made with ❤️ for researchers, students, and knowledge workers**

+ <p align="center">
+ <img src="https://img.shields.io/badge/Python-3.9+-blue?style=for-the-badge&logo=python&logoColor=white" alt="Python">
+ <img src="https://img.shields.io/badge/FastAPI-0.128+-00a393?style=for-the-badge&logo=fastapi&logoColor=white" alt="FastAPI">
+ <img src="https://img.shields.io/badge/Haystack-2.22+-1C3D5A?style=for-the-badge" alt="Haystack">
+ <img src="https://img.shields.io/badge/Pinecone-Vector_DB-6b21a8?style=for-the-badge" alt="Pinecone">
+ <img src="https://img.shields.io/badge/Google_Gemini-AI-4285F4?style=for-the-badge&logo=google&logoColor=white" alt="Gemini">
+ <img src="https://img.shields.io/badge/License-MIT-green?style=for-the-badge" alt="License">
+ </p>
+
+ <h1 align="center">🤖 PaperBOT</h1>
+ <h3 align="center">AI-Powered Research Paper Assistant</h3>
+
+ <p align="center">
+ <b>Upload any document and ask questions in natural language. Get AI-powered answers grounded in your document's content.</b>
+ </p>
+
+ <p align="center">
+ <a href="#-features">Features</a> •
+ <a href="#-quick-start">Quick Start</a> •
+ <a href="#-usage">Usage</a> •
+ <a href="#%EF%B8%8F-configuration">Configuration</a> •
+ <a href="#-api-reference">API</a> •
+ <a href="#-architecture">Architecture</a>
+ </p>
+
+ ---
+
+ ## 🎯 What is PaperBOT?
+
+ PaperBOT is a **Retrieval-Augmented Generation (RAG)** application that allows you to upload research papers, documents, or data files and have intelligent conversations about their content. Unlike generic chatbots, PaperBOT's answers are **always grounded in your uploaded document**, preventing hallucinations and ensuring accuracy.
+
+ ### Key Highlights
+
+ - 📄 **Multi-format Support** — PDF, DOCX, TXT, MD, CSV, JSON, Excel
+ - 🚀 **Fast Processing** — Parallel embedding with optimized chunking
+ - 🎯 **Accurate Answers** — RAG ensures responses come from your document
+ - 🎨 **Beautiful UI** — Modern, responsive interface with progress tracking
+ - 🔒 **Privacy First** — Your documents stay on your infrastructure
+
+ ---
+
+ ## ✨ Features
+
+ <table>
+ <tr>
+ <td width="50%">
+
+ ### 📚 Document Processing
+ - **9 file formats** supported
+ - Smart text chunking (300 words/chunk)
+ - Parallel batch embedding
+ - Metadata size enforcement for Pinecone
+
+ </td>
+ <td width="50%">
+
+ ### 🧠 AI-Powered Q&A
+ - Semantic search with Pinecone
+ - Google Gemini 2.0 Flash for generation
+ - Curated fallback responses
+ - Customizable response styles
+
+ </td>
+ </tr>
+ <tr>
+ <td width="50%">
+
+ ### ⚡ Performance
+ - Model pre-warming on startup
+ - Configurable speed/quality tradeoff
+ - Memory-efficient processing
+ - Up to 15MB file support
+
+ </td>
+ <td width="50%">
+
+ ### 🎨 User Experience
+ - Drag-and-drop file upload
+ - Real-time progress tracking
+ - In-browser document preview
+ - Preloaded files support
+
+ </td>
+ </tr>
+ </table>
+
+ ---
  
  ## 🚀 Quick Start
  
  ### Prerequisites
+
+ | Requirement | Version | Purpose |
+ |-------------|---------|---------|
+ | Python | 3.9+ | Runtime |
+ | Pinecone Account | Free tier | Vector database |
+ | Google AI API Key | Free tier | LLM generation |
  
  ### Installation
  
+ ```bash
+ # 1. Clone the repository
+ git clone https://github.com/vikash-48413/PaperBOT.git
+ cd PaperBOT
+
+ # 2. Create virtual environment
+ python -m venv venv
+
+ # Windows
+ venv\Scripts\activate
+
+ # Linux/Mac
+ source venv/bin/activate
+
+ # 3. Install dependencies
+ pip install -r requirements.txt
+
+ # 4. Configure environment
+ cp .env.example .env
+ # Edit .env with your API keys (see Configuration section)
+
+ # 5. Run the application
+ python app.py
  ```
+
+ ### Access the Application
+
+ Open your browser and navigate to:
+ ```
+ http://localhost:8000
  ```
  
+ ---
+
+ ## 📖 Usage
+
+ ### Basic Workflow
  
  ```
+ ┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
+ │ Upload Document │ ──▶  │  Ask Questions  │ ──▶  │   Get Answers   │
+ │ (PDF, DOCX...)  │      │ (Natural Lang)  │      │  (AI-Powered)   │
+ └─────────────────┘      └─────────────────┘      └─────────────────┘
+ ```
+
+ ### Step-by-Step
+
+ 1. **Upload a Document**
+    - Drag & drop or click to select a file
+    - Supported: PDF, DOCX, DOC, TXT, MD, CSV, JSON, XLSX, XLS
+    - Maximum size: 15MB (recommended: under 5MB for fast processing)
+
+ 2. **Wait for Processing**
+    - Progress bar shows upload and embedding status
+    - Processing time: ~30s for 1MB, ~2-3 min for 5MB
+
+ 3. **Ask Questions**
+    - Type your question in natural language
+    - Example: "What is the main contribution of this paper?"
+
+ 4. **Customize Response**
+    - **Style**: Simple, Balanced, or Technical
+    - **Length**: Short, Medium, or Comprehensive
+
+ 5. **Preview Document**
+    - Click the 👁️ Preview button to view documents in-browser
+    - No download required
+
+ ### Using Preloaded Files
+
+ Place documents in the `data/` folder to make them available as preloaded options:
+
+ ```bash
+ # Add a paper to preloaded files
+ cp your-paper.pdf data/
  ```
  
+ ---
+
  ## ⚙️ Configuration
  
+ ### Environment Variables
  
+ Create a `.env` file in the project root:
+
+ ```env
+ # Required: Pinecone Vector Database
+ PINECONE_API_KEY=your_pinecone_api_key_here
+
+ # Required: Google AI (Gemini)
+ GOOGLE_API_KEY=your_google_api_key_here
+
+ # Optional: HuggingFace (for some models)
+ HF_TOKEN=your_huggingface_token_here
  ```
  
+ ### Getting API Keys
+
+ | Service | Link | Notes |
+ |---------|------|-------|
+ | Pinecone | [pinecone.io](https://www.pinecone.io/) | Free tier: 1 index, 100K vectors |
+ | Google AI | [aistudio.google.com](https://aistudio.google.com/) | Free tier: 60 requests/min |
+ | HuggingFace | [huggingface.co](https://huggingface.co/) | Optional, for gated models |
+
+ ### Pinecone Index Setup
+
+ Create an index with these settings:
+
+ | Setting | Value |
+ |---------|-------|
+ | **Name** | `paperbot` |
+ | **Dimensions** | `1024` |
+ | **Metric** | `cosine` |
+ | **Cloud** | Any (AWS, GCP, Azure) |
+
+ ### Performance Tuning
+
+ Edit `QASystem/config.py` to adjust:
+
  ```python
+ # Embedding model (must match Pinecone dimensions)
+ CURRENT_MODEL = "quality"  # Options: "fast", "balanced", "quality"
+
+ # Chunk settings
  CHUNK_SETTINGS = {
+     "split_by": "word",
+     "split_length": 300,   # Words per chunk
+     "split_overlap": 15,   # Overlap between chunks
  }
+
+ # Batch size for embeddings
+ BATCH_SIZE = 32  # Higher = faster, but uses more memory
  ```
  
+ ---
  
+ ## 📡 API Reference
+
+ ### Endpoints
+
+ | Method | Endpoint | Description |
+ |--------|----------|-------------|
+ | `GET` | `/` | Main web interface |
+ | `POST` | `/upload_document` | Upload and process a document |
+ | `POST` | `/get_result` | Ask a question |
+ | `GET` | `/document_status` | Check current document status |
+ | `GET` | `/preview_document` | Preview current document |
+ | `GET` | `/preview_file/(unknown)` | Preview any file |
+ | `POST` | `/delete_document` | Delete current document |
+ | `GET` | `/preloaded_files` | List preloaded files |
+ | `POST` | `/load_preloaded_file` | Load a preloaded file |
+ | `GET` | `/model_status` | Check if embedding model is ready |
+
+ ### Example: Ask a Question (cURL)
+
+ ```bash
+ curl -X POST "http://localhost:8000/get_result" \
+   -H "Content-Type: application/x-www-form-urlencoded" \
+   -d "question=What is attention mechanism?&style=Balanced&length=Medium"
+ ```
+
+ ### Response Format
  
+ ```json
+ {
+   "answer": "The attention mechanism allows the model to focus on relevant parts of the input...",
+   "source_file": "attention_paper.pdf"
+ }
+ ```
  
+ ---
  
+ ## 🏗️ Architecture
  
+ ### System Overview
  
+ ```
+ ┌──────────────────────────────────────────────────────────────────┐
+ │                             PaperBOT                             │
+ ├──────────────────────────────────────────────────────────────────┤
+ │  ┌─────────────┐    ┌──────────────┐    ┌─────────────────────┐  │
+ │  │   FastAPI   │───▶│   Haystack   │───▶│   Google Gemini     │  │
+ │  │   Server    │    │   Pipeline   │    │  (LLM Generation)   │  │
+ │  └─────────────┘    └──────────────┘    └─────────────────────┘  │
+ │         │                  │                                     │
+ │         ▼                  ▼                                     │
+ │  ┌─────────────┐    ┌──────────────┐                             │
+ │  │  Document   │    │   Pinecone   │                             │
+ │  │  Converters │    │  Vector DB   │                             │
+ │  └─────────────┘    └──────────────┘                             │
+ └──────────────────────────────────────────────────────────────────┘
+ ```
  
+ ### Tech Stack
  
+ | Component | Technology | Purpose |
+ |-----------|------------|---------|
+ | **Backend** | FastAPI + Uvicorn | Async web server |
+ | **RAG Framework** | Haystack 2.22 | Pipeline orchestration |
+ | **Embeddings** | Sentence Transformers | BAAI/bge-large-en-v1.5 |
+ | **Vector DB** | Pinecone | Semantic search |
+ | **LLM** | Google Gemini 2.0 Flash | Answer generation |
+ | **Frontend** | HTML/CSS/JS + Bootstrap | User interface |
  
+ ### Processing Pipeline
+
+ ```
+ Document Upload
+        │
+        ▼
+ ┌─────────────────┐
+ │ File Validation │  ← Check type, size
+ └────────┬────────┘
+          │
+          ▼
+ ┌─────────────────┐
+ │ Format Converter│  ← PDF, DOCX, Excel → Text
+ └────────┬────────┘
+          │
+          ▼
+ ┌─────────────────┐
+ │ Text Chunking   │  ← 300 words/chunk
+ └────────┬────────┘
+          │
+          ▼
+ ┌─────────────────┐
+ │ Size Enforcement│  ← Ensure <8KB per chunk
+ └────────┬────────┘
+          │
+          ▼
+ ┌─────────────────┐
+ │ Batch Embedding │  ← 32 chunks/batch
+ └────────┬────────┘
+          │
+          ▼
+ ┌─────────────────┐
+ │ Pinecone Upload │  ← Store vectors
+ └─────────────────┘
+ ```
+
+ ---
  
  ## 📁 Project Structure
  
  ```
  PaperBOT/
+ ├── app.py                              # Main FastAPI application
+ ├── QASystem/                           # Core RAG system
  │   ├── __init__.py
+ │   ├── config.py                       # Configuration settings
+ │   ├── ingestion.py                    # Document processing & embedding
+ │   ├── retrieval_and_generation.py     # Q&A pipeline
+ │   └── utils.py                        # Pinecone utilities
+ ├── templates/                          # HTML templates
+ │   └── index.html                      # Main UI
+ ├── data/                               # Preloaded documents
+ ├── uploads/                            # User uploads (gitignored)
+ ├── requirements.txt                    # Python dependencies
+ ├── .env.example                        # Environment template
+ ├── start.bat                           # Windows launcher
+ ├── start.sh                            # Linux/Mac launcher
+ └── LICENSE                             # MIT License
  ```
  
+ ---
  
+ ## 🧪 Testing
  
+ ### Run System Tests
  
+ ```bash
+ # Test all components
+ python test_system.py
+
+ # Test Pinecone connection
+ python test_pinecone.py
+ ```
+
+ ### Manual Testing Checklist
+
+ 1. **Upload Test**: Upload a small PDF (<1MB)
+ 2. **Query Test**: Ask "What is this document about?"
+ 3. **Preview Test**: Click the preview button
+ 4. **Delete Test**: Delete the document
+
+ ---
  
+ ## 🔧 Troubleshooting
  
+ ### Common Issues
  
+ <details>
+ <summary><b>❌ "Pinecone index not found"</b></summary>
+
+ Create the index in the Pinecone console:
+ - Name: `paperbot`
+ - Dimensions: `1024`
+ - Metric: `cosine`
+ </details>
+
+ <details>
+ <summary><b>❌ "Google API quota exceeded"</b></summary>
+
+ The free tier has rate limits. Either:
+ - Wait a few minutes for quota reset
+ - Upgrade to a paid tier
+ - Use the curated fallback (automatic)
+ </details>
+
+ <details>
+ <summary><b>❌ "File too large" error</b></summary>
+
+ Maximum file size is 15MB. For faster processing:
+ - Keep files under 5MB
+ - Split large documents into chapters
+ </details>
+
+ <details>
+ <summary><b>❌ Server not starting</b></summary>
+
+ 1. Check if port 8000 is in use
+ 2. Verify the virtual environment is activated
+ 3. Check all dependencies: `pip install -r requirements.txt`
+ </details>
+
+ <details>
+ <summary><b>❌ "Model dimension mismatch"</b></summary>
+
+ The embedding model dimension must match the Pinecone index:
+ - `fast` model → 384 dimensions
+ - `balanced` model → 768 dimensions
+ - `quality` model → 1024 dimensions
+
+ Either recreate the Pinecone index or change `CURRENT_MODEL` in config.py.
+ </details>
+
+ ---
  
  ## 🤝 Contributing
  
+ Contributions are welcome! Please:
  
+ 1. Fork the repository
+ 2. Create a feature branch: `git checkout -b feature/amazing-feature`
+ 3. Commit changes: `git commit -m 'Add amazing feature'`
+ 4. Push to branch: `git push origin feature/amazing-feature`
+ 5. Open a Pull Request
  
+ ### Development Setup
  
+ ```bash
+ # Install development dependencies
+ pip install -r requirements-dev.txt
  
+ # Run tests
+ pytest
  
+ # Format code
+ black . --line-length 100
+ ```
+
+ ---
+
+ ## 📜 License
+
+ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
+
+ ---
+
+ ## 👤 Author
+
+ **Vikash**
+ - GitHub: [@vikash-48413](https://github.com/vikash-48413)
+ - Email: vikash17052005@gmail.com
+
+ ---
+
+ ## 🙏 Acknowledgments
  
+ - [Haystack](https://haystack.deepset.ai/) - RAG framework
+ - [Pinecone](https://www.pinecone.io/) - Vector database
+ - [Google AI](https://ai.google.dev/) - Gemini LLM
+ - [Sentence Transformers](https://www.sbert.net/) - Embeddings
+ - [FastAPI](https://fastapi.tiangolo.com/) - Web framework
  
  ---
  
+ <p align="center">
+ <b>⭐ Star this repository if you find it useful!</b>
+ </p>
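The cURL example in the README's API Reference above translates directly to Python's standard library. This is a hedged sketch: the endpoint, form fields, and port are taken from the README's example, and everything else (function names, the `BASE_URL` constant) is illustrative.

```python
import json
from urllib import parse, request

BASE_URL = "http://localhost:8000"  # local dev server, as in the README


def build_payload(question, style="Balanced", length="Medium"):
    """Form fields matching the /get_result cURL example in the README."""
    return {"question": question, "style": style, "length": length}


def ask(question, style="Balanced", length="Medium"):
    """POST a question to a running PaperBOT server and decode the JSON reply."""
    data = parse.urlencode(build_payload(question, style, length)).encode()
    req = request.Request(f"{BASE_URL}/get_result", data=data, method="POST")
    with request.urlopen(req) as resp:  # requires the server to be running
        return json.loads(resp.read())
```

Calling `ask("What is attention mechanism?")` against a running server should return the same `{"answer": ..., "source_file": ...}` shape shown in the Response Format section.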
TESTING_GUIDE.md DELETED
@@ -1,326 +0,0 @@
- # Testing Guide - Large File Upload Fixes
-
- ## Quick Test Checklist
-
- ### ✅ Pre-Flight Checks
- 1. Server starts without errors
- 2. Pinecone connection works
- 3. No Python dependency issues
-
- ### ✅ File Upload Tests
-
- #### Test 1: Small File (< 1MB)
- - **File**: Any PDF < 1MB
- - **Expected**: Fast upload (< 30s)
- - **Check**: Progress bar, success message, file size shown
-
- #### Test 2: Medium File (1-3MB)
- - **File**: Research paper 1-3MB
- - **Expected**: Normal processing (30-60s)
- - **Check**: Optimized settings NOT activated
-
- #### Test 3: Large File (3-5MB) ⭐ PRIMARY TEST
- - **File**: Research paper ~5MB
- - **Expected**: Optimized processing (1-2 min)
- - **Check**: Console shows "Large file detected - using optimized settings"
- - **Verify**:
- Chunk length: 350 words
- Batch size: 20 chunks
- Memory cleanup messages appear
-
- #### Test 4: Very Large File (5-10MB)
- - **File**: Long research paper 5-10MB
- - **Expected**: Slower but successful (2-5 min)
- - **Check**: All chunks processed, no memory errors
-
- #### Test 5: Maximum Size (45-50MB)
- - **File**: Very large document
- - **Expected**: Works but takes time (5-10 min)
- - **Warning**: May be slow, monitor console
-
- #### Test 6: Oversized File (> 50MB)
- - **File**: Any file > 50MB
- - **Expected**: Clear error message
- - **Error Text**: "File size (XXMB) exceeds maximum allowed size (50MB)"
- - **HTTP Code**: 413 (Payload Too Large)
-
- ### ✅ Error Handling Tests
-
- #### Test 7: Empty File
- - **File**: Create empty .pdf file
- - **Expected**: Error "File is empty"
- - **HTTP Code**: 400
-
- #### Test 8: Wrong File Type
- - **File**: .exe, .zip, .mp3, etc.
- - **Expected**: Error listing supported formats
- - **Supported**: PDF, DOCX, DOC, TXT, MD, CSV, JSON, XLSX, XLS
-
- #### Test 9: Corrupted PDF
- - **File**: Damaged PDF file
- - **Expected**: Error during content extraction
- - **Check**: Proper error message, cleanup occurs
-
- ### ✅ Performance Tests
-
- #### Test 10: Memory Usage
- 1. Note starting memory (Task Manager)
- 2. Upload 5MB file
- 3. Check memory during processing
- 4. Verify memory cleanup after completion
- 5. **Expected**: Memory returns near baseline
-
- #### Test 11: Multiple Uploads
- 1. Upload file A (5MB)
- 2. Wait for completion
- 3. Delete file A
- 4. Upload file B (5MB)
- 5. **Expected**: Both work, no accumulation errors
-
- #### Test 12: Upload Cancellation
- 1. Start uploading large file
- 2. Refresh page mid-upload
- 3. **Expected**: Cleanup occurs, no orphaned files
-
- ### ✅ Progress Tracking Tests
-
- #### Test 13: Progress Updates
- 1. Upload 5MB file
- 2. Watch console output
- 3. **Check for**:
- ✓ File type validated
- ✓ File read successfully: X.XXMB
- ✓ Cleared previous uploads
- ✓ File saved to: uploads/...
- ✓ Initializing document store
- ✓ Converting file
- ✓ Extracted content
- ✓ Created X chunks
- ✓ Batch N/M complete
- ✓ Memory cleanup
- ✓ Wrote X chunks to Pinecone
- ✅ Ingestion completed successfully
-
- #### Test 14: Frontend Progress
- 1. Upload any file
- 2. Watch progress bar
- 3. **Check**:
- - Shows percentage
- - Updates in real-time
- - Shows "Complete! ✓" at end
-
- ## Console Output Examples
-
- ### ✅ Successful 5MB Upload
- ```
- 📥 Upload endpoint called - Filename: research_paper.pdf
- ✓ File type validated: .pdf
- 📦 Reading file in chunks...
- ✓ File read successfully: 5.23MB
- 🧹 Clearing previous uploads...
- ✓ File saved to: uploads\research_paper.pdf
- 📤 Upload started: research_paper.pdf (5.23MB)
- 🔧 Initializing document store...
- 📊 Processing document...
-
- 📄 Starting document ingestion: research_paper.pdf
- 💾 Server baseline memory: 450.2MB
- ✓ Cleared existing documents from vector store
- 🔄 Converting .pdf file...
- ✓ Extracted content (1 document(s), 45230 chars)
- 📊 File size: 5.23MB
- ⚡ Large file detected - using optimized settings:
- Chunk length: 350 words
- Batch size: 20 chunks
- ✂️ Splitting into chunks...
- ✓ Created 156 chunks
- 🧠 Embedding chunks with parallel processing...
- 📦 Processing 156 chunks in 8 batches (batch_size=20)
- ✓ Batch 1/8 complete (20 chunks, +15.2MB)
- ✓ Batch 2/8 complete (20 chunks, +14.8MB)
- ✓ Batch 3/8 complete (20 chunks, +15.1MB)
- 🧹 Memory cleanup: 485.3MB
- ✓ Batch 4/8 complete (20 chunks, +15.0MB)
- ...
- ✓ Batch 8/8 complete (16 chunks, +12.1MB)
- ✓ Successfully embedded 156 chunks
- 💾 Writing to vector database...
- ✓ Wrote 156 chunks to Pinecone
-
- ✅ Ingestion completed successfully!
- 📊 Statistics:
- • Document: research_paper.pdf
- • Format: PDF
- • Size: 5.23MB
- • Chunks created: 156
- • Time taken: 94.32 seconds
- • Speed: 1.7 chunks/sec
- • Memory used: +45.3MB (450.2MB → 495.5MB)
- ⏱️ Total processing time: 94.3s
- ✅ Upload completed: research_paper.pdf
- ```
-
- ### ❌ Failed Upload - File Too Large
- ```
- 📥 Upload endpoint called - Filename: huge_file.pdf
- ✓ File type validated: .pdf
- 📦 Reading file in chunks...
- ❌ Error: File size (52.34MB) exceeds maximum (50MB)
- ```
-
- ### ❌ Failed Upload - Empty File
- ```
- 📥 Upload endpoint called - Filename: empty.pdf
- ✓ File type validated: .pdf
- 📦 Reading file in chunks...
- ✓ File read successfully: 0.00MB
- ❌ Error: File is empty
- ```
-
- ## Browser Console Checks
-
- ### Successful Upload
- ```javascript
- File selected: File {name: "paper.pdf", size: 5485760}
- File: paper.pdf, Size: 5.23MB
- Starting upload for: paper.pdf
- Sending request to /upload_document
- Processing progress: 30%
- Processing progress: 50%
- Processing progress: 70%
- Processing progress: 90%
- ```
-
- ### Failed Upload
- ```javascript
- File: huge.pdf, Size: 52.34MB
- ❌ File too large error shown
- ```
-
- ## What to Look For
-
- ### ✅ SUCCESS Indicators
- - Progress bar completes to 100%
- - Green success notification
- - File appears in "Current Document" section
- - Console shows "✅ Ingestion completed successfully"
- - Can ask questions immediately
-
- ### ❌ FAILURE Indicators
- - Red error notification
- - Clear error message
- - Console shows "❌" errors
- - File automatically cleaned up
- - Can upload again without issues
-
- ## Performance Benchmarks
-
- | File Size | Expected Time | Chunks | Memory Usage |
- |-----------|---------------|--------|---------------|
- | 0.5 MB | 15-20s | 40-60 | +20-30MB |
- | 1 MB | 25-35s | 80-100 | +30-40MB |
- | 3 MB | 50-70s | 120-140| +40-60MB |
- | 5 MB | 80-120s | 150-180| +50-80MB |
- | 10 MB | 150-240s | 280-320| +80-120MB |
-
- *Times vary based on system specs and model choice*
-
- ## Common Issues & Solutions
-
- ### Issue: "Timeout after 10 minutes"
- **Cause**: File too complex or system too slow
- **Solution**:
- - Try smaller file
- - Use "fast" model in config.py
- - Increase chunk size in config.py
-
- ### Issue: "Memory Error"
- **Cause**: Insufficient RAM
- **Solution**:
- - Close other applications
- - Reduce batch size in config.py (smaller batches use less memory)
- - Use "fast" model (uses less memory)
-
- ### Issue: "No chunks embedded"
- **Cause**: Pinecone connection issue
- **Solution**:
- - Check PINECONE_API_KEY in .env
- - Verify internet connection
- - Check Pinecone index exists
-
- ### Issue: "File not found after upload"
- **Cause**: Permission or path issue
- **Solution**:
- - Check uploads/ directory exists
- - Verify write permissions
- - Check disk space
-
- ## Quick Debug Commands
-
- ### Check Server Logs
- ```bash
- # Look for errors in console
- # Search for ❌ or "ERROR" or "Exception"
- ```
-
- ### Check File Size
- ```powershell
- # Windows PowerShell
- (Get-Item "path\to\file.pdf").Length / 1MB
- ```
-
- ### Check Memory Usage
- ```
- # Windows Task Manager
- # Look for python.exe process
- ```
-
- ### Check Uploads Directory
- ```powershell
- # PowerShell
- Get-ChildItem uploads\
- ```
-
- ## Success Criteria
-
- For a 5MB file upload to be considered successful:
-
- 1. ✅ File uploads without errors
- 2. ✅ Optimized settings activate (console shows message)
- 3. ✅ All chunks process successfully
- 4. ✅ Data written to Pinecone
- 5. ✅ Memory cleanup occurs
- 6. ✅ Processing completes in < 5 minutes
- 7. ✅ Can ask questions and get answers
- 8. ✅ Statistics show correct file size
-
- ## Report Template
-
- When reporting issues, include:
-
- ```
- File Details:
- - Name: _______
- - Size: _______
- - Type: _______
-
- Error:
- - Message: _______
- - HTTP Code: _______
- - When occurred: _______
-
- Console Output:
- [Paste last 20-30 lines]
-
- Browser Console:
- [Paste any errors]
-
- System:
- - OS: _______
- - RAM: _______
- - Python Version: _______
- ```
-
- ---
-
- **Need Help?** Check [FIXES_APPLIED.md](FIXES_APPLIED.md) for detailed technical information.
 
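For reference, the size checks exercised by Tests 6 and 7 above can be sketched as follows (a sketch under assumed constants and a hypothetical helper, not the actual `app.py` implementation):

```python
# Sketch of chunked upload reading with size validation.
# MAX_UPLOAD_SIZE / CHUNK_SIZE and read_upload() are illustrative names.
MAX_UPLOAD_SIZE = 50 * 1024 * 1024  # 50MB limit -> HTTP 413 when exceeded
CHUNK_SIZE = 1024 * 1024            # read 1MB at a time

def read_upload(stream):
    """Read an uploaded stream in chunks, enforcing size limits."""
    data = bytearray()
    while True:
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:
            break
        data.extend(chunk)
        if len(data) > MAX_UPLOAD_SIZE:
            size_mb = len(data) / (1024 * 1024)
            # app would map this to HTTP 413 (Payload Too Large)
            raise ValueError(f"File size ({size_mb:.2f}MB) exceeds maximum (50MB)")
    if not data:
        # app would map this to HTTP 400
        raise ValueError("File is empty")
    return bytes(data)
```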
UPGRADE_SUMMARY.md DELETED
@@ -1,324 +0,0 @@
- # 🎯 PaperBOT v2.0 - Upgrade Summary
-
- ## ✅ Completed Improvements
-
- ### 1. Multi-Format File Support ✅
- **What Changed:**
- - Added support for 7+ file formats beyond just PDF/TXT
- - New formats: DOCX, DOC, MD, CSV, JSON, XLSX, XLS
-
- **Implementation:**
- - Created converter functions for each format
- - Added `python-docx` for Word documents
- - Added `pandas` for CSV/Excel files
- - Integrated Markdown support
- - JSON parsing with proper formatting
-
- **Files Modified:**
- - `QASystem/ingestion.py` - Added converter functions
- - `app.py` - Updated file validation
- - `templates/index.html` - Updated UI file types
- - `requirements.txt` - Added dependencies
-
- ---
-
- ### 2. Parallel Processing ✅
- **What Changed:**
- - Implemented batch processing for embeddings
- - Added multi-threaded document upload
- - Concurrent chunk processing
-
- **Implementation:**
- - `process_chunks_parallel()` function for batch embedding
- - ThreadPoolExecutor in `app.py` for async upload
- - Configurable batch size (default: 32 chunks)
- - Progress tracking throughout pipeline
-
- **Performance Gain:**
- - 3-5x faster processing
- - 30-50 chunks/second (up from 10-15)
- - Non-blocking UI during upload
-
- **Files Modified:**
- - `QASystem/ingestion.py` - Parallel processing logic
- - `app.py` - Async upload handling
-
- ---
-
- ### 3. Memory Management ✅
- **What Changed:**
- - Real-time memory monitoring
- - Automatic garbage collection
- - Batch-wise memory cleanup
- - Memory usage reporting
-
- **Implementation:**
- - Added `psutil` for memory monitoring
- - `get_memory_usage()` function
- - `clear_memory()` for garbage collection
- - Cleanup every 5 batches
- - Memory stats in console output
-
- **Memory Optimization:**
- - Prevents memory leaks
- - Handles large documents (100+ pages)
- - Configurable batch sizes for memory control
-
- **Files Modified:**
- - `QASystem/ingestion.py` - Memory management functions
- - `requirements.txt` - Added psutil
-
- ---
-
- ### 4. Enhanced Semantic Search ✅
- **What Changed:**
- - Increased retrieval candidates from 5 to 10
- - Added cached embedders for faster queries
- - Implemented fallback retrieval strategies
- - Added relevance scoring display
-
- **Implementation:**
- - `get_text_embedder()` with caching
- - Enhanced `get_result()` function
- - Multiple fallback mechanisms
- - Relevance score calculation
- - Better error handling
-
- **Search Quality:**
- - Better context coverage
- - More accurate answers
- - Faster query responses (2-5s)
- - Graceful degradation on errors
-
- **Files Modified:**
- - `QASystem/retrieval_and_generation.py` - Enhanced retrieval
-
- ---
-
- ### 5. Progress Tracking & Error Handling ✅
- **What Changed:**
- - Real-time progress updates
- - Comprehensive error messages
- - Detailed console logging
- - Better user feedback
-
- **Implementation:**
- - Progress callbacks in ingestion
- - Try-catch blocks everywhere
- - Detailed error traces
- - User-friendly error messages
- - Troubleshooting hints
-
- **User Experience:**
- - Clear status updates
- - Informative error messages
- - Debug information in console
- - Recovery suggestions
-
- **Files Modified:**
- - `app.py` - Progress tracking
- - `QASystem/ingestion.py` - Logging
- - `templates/index.html` - UI feedback
-
- ---
-
- ### 6. Dependencies Updated ✅
- **What Changed:**
- - Added 8 new packages
- - Updated requirements.txt
- - Created .env.example
-
- **New Dependencies:**
- ```
- python-docx # Word documents
- pandas # Data processing
- openpyxl # Excel files
- psutil # Memory monitoring
- tqdm # Progress bars
- markdown # Markdown support
- jinja2 # Templates
- ```
-
- **Files Modified:**
- - `requirements.txt` - Complete dependencies
- - `.env.example` - Environment template
-
- ---
-
- ### 7. Comprehensive Documentation ✅
- **What Changed:**
- - Created 6 new documentation files
- - Added helper scripts
- - Created quick reference guides
-
- **New Files:**
- - `README.md` - Complete project overview
- - `INSTALLATION.md` - Setup guide
- - `FEATURES.md` - Feature documentation
- - `QUICKSTART.md` - Quick reference
- - `CHANGELOG.md` - Version history
- - `.env.example` - Config template
- - `test_system.py` - System test script
- - `start.bat` / `start.sh` - Startup scripts
-
- ---
-
- ## 📊 Performance Comparison
-
- | Metric | Before (v1.0) | After (v2.0) | Improvement |
- |--------|---------------|--------------|-------------|
- | Upload Speed | 5-10s | 1-3s | **3x faster** |
- | Processing Speed | 10-15 chunks/s | 30-50 chunks/s | **3-4x faster** |
- | Query Response | 5-10s | 2-5s | **2x faster** |
- | Memory Usage | Unoptimized | Monitored & Optimized | **Better** |
- | File Formats | 2 (PDF, TXT) | 7+ formats | **4x more** |
- | Error Handling | Basic | Comprehensive | **Much better** |
-
- ---
-
- ## 🔧 Key Code Improvements
-
- ### Ingestion Pipeline (Before)
- ```python
- # Simple pipeline, no parallelization
- indexing.run({"converter": {"sources": [file]}})
- ```
-
- ### Ingestion Pipeline (After)
- ```python
- # Parallel processing with memory management
- documents = convert_to_documents(file) # Multi-format
- chunks = split_documents(documents)
- embedded = process_chunks_parallel(chunks) # Parallel!
- write_to_store(embedded)
- clear_memory() # Memory cleanup
- ```
-
- ### Retrieval (Before)
- ```python
- # Simple retrieval, top_k=5
- retriever = PineconeEmbeddingRetriever(top_k=5)
- ```
-
- ### Retrieval (After)
- ```python
- # Enhanced retrieval with caching and fallbacks
- embedder = get_text_embedder() # Cached!
- retriever = PineconeEmbeddingRetriever(top_k=10) # More candidates
- # + Multiple fallback strategies
- ```
-
- ---
-
- ## 📁 Project Structure (Updated)
-
- ```
- PaperBOT/
- ├── 📄 app.py # FastAPI app (enhanced)
- ├── 📄 requirements.txt # Updated dependencies
- ├── 📄 setup.py # Package setup
- ├── 📄 .env.example # NEW: Config template
- ├── 📄 .gitignore # Git ignore rules
- ├── 📄 test_system.py # NEW: System test
- ├── 📄 start.bat # NEW: Windows startup
- ├── 📄 start.sh # NEW: Linux/Mac startup
- │
- ├── 📚 Documentation (NEW)
- │ ├── README.md # Complete overview
- │ ├── INSTALLATION.md # Setup guide
- │ ├── FEATURES.md # Feature docs
- │ ├── QUICKSTART.md # Quick reference
- │ └── CHANGELOG.md # Version history
- │
- ├── 📂 QASystem/
- │ ├── __init__.py
- │ ├── config.py # Performance config
- │ ├── ingestion.py # Enhanced with parallel processing
- │ ├── retrieval_and_generation.py # Enhanced semantic search
- │ └── utils.py # Utilities
- │
- ├── 📂 templates/
- │ └── index.html # Updated UI
- │
- ├── 📂 uploads/ # Temp storage
- │ └── .gitkeep # NEW
- │
- └── 📂 data/ # Sample docs
- ```
-
- ---
-
- ## 🎯 How to Use the Upgrades
-
- ### 1. Install New Dependencies
- ```bash
- pip install -r requirements.txt
- ```
-
- ### 2. Configure Performance (Optional)
- Edit `QASystem/config.py`:
- ```python
- # For speed
- CURRENT_MODEL = "fast"
- BATCH_SIZE = 64
-
- # For quality
- CURRENT_MODEL = "quality"
- BATCH_SIZE = 16
- ```
-
- ### 3. Upload New File Types
- - Drag & drop any supported format
- - System auto-detects and converts
- - Watch progress bar for status
-
- ### 4. Monitor Performance
- - Check console for memory stats
- - View processing speed (chunks/sec)
- - See relevance scores in answers
-
- ---
-
- ## 🚀 Next Steps
-
- ### To Run the Application:
- ```bash
- # Option 1: Use startup script
- start.bat # Windows
- ./start.sh # Linux/Mac
-
- # Option 2: Manual start
- python app.py
- ```
-
- ### To Test the System:
- ```bash
- python test_system.py
- ```
-
- ### To Read Documentation:
- - Quick start: `QUICKSTART.md`
- - Full guide: `README.md`
- - Features: `FEATURES.md`
- - Installation help: `INSTALLATION.md`
-
- ---
-
- ## 📝 Summary
-
- ✅ **7 Major Improvements Completed**
- ✅ **6 New Documentation Files**
- ✅ **3-5x Performance Improvement**
- ✅ **7+ File Formats Supported**
- ✅ **Optimized Memory Management**
- ✅ **Enhanced Semantic Search**
- ✅ **Production-Ready Code**
-
- The application is now:
- - ⚡ **Faster** - Parallel processing, cached models
- - 🧠 **Smarter** - Better retrieval, relevance scoring
- - 💪 **Stronger** - Memory management, error handling
- - 📚 **More Capable** - 7+ file formats
- - 🎯 **Better Documented** - Comprehensive guides
-
- **Status: READY FOR PRODUCTION USE! 🎉**
 
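The batched embedding loop the summary above describes (fixed-size batches, garbage collection every 5 batches) can be sketched as follows; the function and constant names follow the deleted doc, but the signature and body are assumptions:

```python
import gc

BATCH_SIZE = 32     # chunks per embedding batch
CLEANUP_EVERY = 5   # run garbage collection every N batches

def process_chunks_parallel(chunks, embed_batch):
    """Embed chunks batch-by-batch, reclaiming memory periodically.

    `embed_batch` is a callable that embeds one list of chunks
    (e.g. a SentenceTransformers call in the real pipeline).
    """
    embedded = []
    batches = [chunks[i:i + BATCH_SIZE] for i in range(0, len(chunks), BATCH_SIZE)]
    for n, batch in enumerate(batches, start=1):
        embedded.extend(embed_batch(batch))
        if n % CLEANUP_EVERY == 0:
            gc.collect()  # batch-wise memory cleanup
    return embedded
```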
data/.gitkeep CHANGED
@@ -1,2 +0,0 @@
- # This file ensures the data directory is tracked by Git
- # Preloaded PDF files can be placed here
 
uploads/.gitkeep CHANGED
@@ -1,2 +0,0 @@
- # This file ensures the uploads directory is tracked by Git
- # User-uploaded files will be stored here but are ignored by .gitignore