Commit 6906333
Parent(s): 76bc2b7
📝 Major documentation overhaul: Consolidated all docs into professional README.md
- .env.example +26 -7
- CHANGELOG.md +36 -2
- FEATURES.md +0 -162
- FIXES_APPLIED.md +0 -281
- INSTALLATION.md +0 -110
- QUICKSTART.md +0 -227
- QUICK_REFERENCE.md +0 -351
- README.md +422 -203
- TESTING_GUIDE.md +0 -326
- UPGRADE_SUMMARY.md +0 -324
- data/.gitkeep +0 -2
- uploads/.gitkeep +0 -2
.env.example
CHANGED
|
@@ -1,13 +1,32 @@
|
|
| 1 |
-
#
|
| 2 |
PINECONE_API_KEY=your_pinecone_api_key_here
|
| 3 |
|
| 4 |
-
#
|
| 5 |
GOOGLE_API_KEY=your_google_ai_api_key_here
|
| 6 |
|
| 7 |
-
#
|
| 8 |
HF_TOKEN=your_huggingface_token_here
|
| 9 |
|
| 10 |
-
#
|
| 11 |
-
#
|
| 12 |
-
#
|
| 13 |
-
#
|
| 1 |
+
# ===================================
|
| 2 |
+
# PaperBOT Environment Configuration
|
| 3 |
+
# ===================================
|
| 4 |
+
# Copy this file to .env and fill in your API keys
|
| 5 |
+
# cp .env.example .env
|
| 6 |
+
|
| 7 |
+
# --------------------------
|
| 8 |
+
# REQUIRED: Pinecone Vector Database
|
| 9 |
+
# Get your API key: https://www.pinecone.io/
|
| 10 |
+
# --------------------------
|
| 11 |
PINECONE_API_KEY=your_pinecone_api_key_here
|
| 12 |
|
| 13 |
+
# --------------------------
|
| 14 |
+
# REQUIRED: Google AI (Gemini)
|
| 15 |
+
# Get your API key: https://aistudio.google.com/
|
| 16 |
+
# --------------------------
|
| 17 |
GOOGLE_API_KEY=your_google_ai_api_key_here
|
| 18 |
|
| 19 |
+
# --------------------------
|
| 20 |
+
# OPTIONAL: HuggingFace Token
|
| 21 |
+
# Get your token: https://huggingface.co/settings/tokens
|
| 22 |
+
# Required for some gated models
|
| 23 |
+
# --------------------------
|
| 24 |
HF_TOKEN=your_huggingface_token_here
|
| 25 |
|
| 26 |
+
# --------------------------
|
| 27 |
+
# OPTIONAL: Application Settings
|
| 28 |
+
# Uncomment and modify as needed
|
| 29 |
+
# --------------------------
|
| 30 |
+
# MAX_UPLOAD_SIZE=15728640 # 15MB in bytes (default)
|
| 31 |
+
# BATCH_SIZE=32 # Chunks per batch
|
| 32 |
+
# MODEL_TYPE=quality # Options: fast, balanced, quality
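The variables in this file reach the application as plain strings, so the numeric and enumerated ones need parsing and validation. The following is a minimal sketch of one way to read them, not PaperBOT's actual loader: `Settings` and `load_settings` are hypothetical names, and the fallback values for `BATCH_SIZE` and `MODEL_TYPE` are assumptions taken from the commented examples above (only `MAX_UPLOAD_SIZE` is marked as a default in the file).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Options listed in the MODEL_TYPE comment of .env.example.
VALID_MODEL_TYPES = ("fast", "balanced", "quality")


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the values defined in .env.example."""
    pinecone_api_key: str
    google_api_key: str
    hf_token: Optional[str]
    max_upload_size: int
    batch_size: int
    model_type: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from environment variables (or any mapping, for tests)."""
    env = os.environ if env is None else env

    # Required keys: fail early with a clear message if one is missing or empty.
    missing = [k for k in ("PINECONE_API_KEY", "GOOGLE_API_KEY") if not env.get(k)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))

    # Assumed fallback: "quality", as shown in the commented example.
    model_type = env.get("MODEL_TYPE", "quality")
    if model_type not in VALID_MODEL_TYPES:
        raise ValueError(f"MODEL_TYPE must be one of {VALID_MODEL_TYPES}")

    return Settings(
        pinecone_api_key=env["PINECONE_API_KEY"],
        google_api_key=env["GOOGLE_API_KEY"],
        hf_token=env.get("HF_TOKEN") or None,  # optional token
        max_upload_size=int(env.get("MAX_UPLOAD_SIZE", 15 * 1024 * 1024)),
        batch_size=int(env.get("BATCH_SIZE", 32)),
        model_type=model_type,
    )
```

Passing a dictionary instead of relying on `os.environ` keeps the function easy to test without touching the real environment.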
|
CHANGELOG.md
CHANGED
|
@@ -1,6 +1,40 @@
|
|
| 1 |
-
#
|
| 2 |
|
| 3 |
-
|
| 4 |
|
| 5 |
### 🎉 New Features
|
| 6 |
|
| 1 |
+
# Changelog
|
| 2 |
|
| 3 |
+
All notable changes to PaperBOT are documented here.
|
| 4 |
+
|
| 5 |
+
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
|
| 6 |
+
|
| 7 |
+
---
|
| 8 |
+
|
| 9 |
+
## [2.1.0] - January 2025
|
| 10 |
+
|
| 11 |
+
### Added
|
| 12 |
+
- **Document Preview** - In-browser preview for all document types
|
| 13 |
+
- **Preloaded Files Preview** - Preview button for files in data/ folder
|
| 14 |
+
- **File Size Warnings** - Warning dialog for files over 5MB
|
| 15 |
+
- **Model Pre-warming** - Embedding model loads on server startup
|
| 16 |
+
- **Curated Fallback Responses** - Beautiful output when API quota exceeded
|
| 17 |
+
- **Retry Logic** - Exponential backoff for transient API failures
|
| 18 |
+
|
| 19 |
+
### Changed
|
| 20 |
+
- Reduced max file size from 50MB to 15MB for better performance
|
| 21 |
+
- Optimized chunk size to 300 words (from 200)
|
| 22 |
+
- Improved error messages with troubleshooting tips
|
| 23 |
+
- Enhanced UI with Bootstrap 5
|
| 24 |
+
|
| 25 |
+
### Fixed
|
| 26 |
+
- Pinecone 40KB metadata limit error (chunks now enforced to 8KB)
|
| 27 |
+
- Google API 429 quota errors now handled gracefully
|
| 28 |
+
- Memory leaks during batch processing
|
| 29 |
+
- Encoding issues with special characters
|
| 30 |
+
|
| 31 |
+
### Security
|
| 32 |
+
- API keys stored only in .env (not tracked by git)
|
| 33 |
+
- Added comprehensive .gitignore
|
| 34 |
+
|
| 35 |
+
---
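The "Retry Logic" entry in the 2.1.0 notes mentions exponential backoff for transient API failures. A minimal sketch of that general technique follows. It is an illustration under assumed parameters, not the code shipped in PaperBOT: the function name, the attempt count, the delays, and the 10% jitter are all hypothetical.

```python
import random
import time


def retry_with_backoff(func, max_attempts=4, base_delay=1.0, max_delay=30.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call func(), retrying on failure with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error to the caller.
            # The delay doubles on each attempt (1s, 2s, 4s, ...), capped at max_delay.
            delay = min(base_delay * (2 ** attempt), max_delay)
            # Up to 10% random jitter keeps concurrent clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

In practice `retry_on` would be narrowed to the errors that are worth retrying (for example a rate-limit or timeout exception), so that permanent failures are reported immediately.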
|
| 36 |
+
|
| 37 |
+
## [2.0.0] - January 2025 (Major Upgrade)
|
| 38 |
|
| 39 |
### 🎉 New Features
|
| 40 |
|
FEATURES.md
DELETED
|
@@ -1,162 +0,0 @@
|
|
| 1 |
-
# PaperBOT Features
|
| 2 |
-
|
| 3 |
-
## 🚀 Core Features
|
| 4 |
-
|
| 5 |
-
### 1. **Multi-Format Document Support**
|
| 6 |
-
Upload and process various document formats:
|
| 7 |
-
- PDF documents
|
| 8 |
-
- Microsoft Word (.docx, .doc)
|
| 9 |
-
- Plain text files (.txt)
|
| 10 |
-
- Markdown files (.md)
|
| 11 |
-
- CSV data files
|
| 12 |
-
- JSON files
|
| 13 |
-
- Excel spreadsheets (.xlsx, .xls)
|
| 14 |
-
|
| 15 |
-
### 2. **Parallel Processing**
|
| 16 |
-
- **Batch Embedding**: Process multiple chunks simultaneously
|
| 17 |
-
- **Multi-threaded Upload**: Non-blocking document processing
|
| 18 |
-
- **Optimized Memory Usage**: Automatic memory cleanup between batches
|
| 19 |
-
- **Fast Processing**: 30-50 chunks/second on average hardware
|
| 20 |
-
|
| 21 |
-
### 3. **Smart Semantic Search**
|
| 22 |
-
- **Vector Database**: Powered by Pinecone for fast similarity search
|
| 23 |
-
- **Top-K Retrieval**: Retrieves 10 most relevant chunks
|
| 24 |
-
- **Relevance Scoring**: Shows confidence scores for retrieved content
|
| 25 |
-
- **Namespace Isolation**: Each document in separate namespace for accuracy
|
| 26 |
-
|
| 27 |
-
### 4. **Advanced Memory Management**
|
| 28 |
-
- **Automatic Garbage Collection**: Clears memory every 5 batches
|
| 29 |
-
- **Memory Monitoring**: Real-time memory usage tracking
|
| 30 |
-
- **Batch Processing**: Configurable batch sizes (default: 32 chunks)
|
| 31 |
-
- **Resource Cleanup**: Automatic cleanup after processing
|
| 32 |
-
|
| 33 |
-
### 5. **Intelligent Q&A System**
|
| 34 |
-
- **RAG (Retrieval Augmented Generation)**: Combines search with AI generation
|
| 35 |
-
- **Customizable Responses**: Choose style and length
|
| 36 |
-
- **Context-Aware**: Only answers from uploaded document
|
| 37 |
-
- **Fallback Mechanisms**: Multiple fallback strategies if main pipeline fails
|
| 38 |
-
|
| 39 |
-
### 6. **User-Friendly Interface**
|
| 40 |
-
- **Drag & Drop Upload**: Easy file upload
|
| 41 |
-
- **Progress Tracking**: Real-time upload and processing progress
|
| 42 |
-
- **Document Management**: View current document, delete when done
|
| 43 |
-
- **Response Customization**: Select explanation style and length
|
| 44 |
-
- **Formatted Answers**: Markdown rendering with syntax highlighting
|
| 45 |
-
|
| 46 |
-
## 📊 Performance Features
|
| 47 |
-
|
| 48 |
-
### Configurable Processing Modes
|
| 49 |
-
Choose between speed and quality in `QASystem/config.py`:
|
| 50 |
-
|
| 51 |
-
#### Fast Mode
|
| 52 |
-
```python
|
| 53 |
-
CURRENT_MODEL = "fast"
|
| 54 |
-
BATCH_SIZE = 64
|
| 55 |
-
```
|
| 56 |
-
- ⚡ 5-10x faster processing
|
| 57 |
-
- 💾 Lower memory usage
|
| 58 |
-
- ✅ Great for large documents (100+ pages)
|
| 59 |
-
|
| 60 |
-
#### Balanced Mode
|
| 61 |
-
```python
|
| 62 |
-
CURRENT_MODEL = "balanced"
|
| 63 |
-
BATCH_SIZE = 32
|
| 64 |
-
```
|
| 65 |
-
- ⚖️ Good balance of speed and quality
|
| 66 |
-
- 📄 Recommended for most documents
|
| 67 |
-
|
| 68 |
-
#### Quality Mode
|
| 69 |
-
```python
|
| 70 |
-
CURRENT_MODEL = "quality"
|
| 71 |
-
BATCH_SIZE = 16
|
| 72 |
-
```
|
| 73 |
-
- 🎯 Highest accuracy
|
| 74 |
-
- 📚 Best for technical/academic papers
|
| 75 |
-
|
| 76 |
-
## 🔒 Security & Reliability
|
| 77 |
-
|
| 78 |
-
### Document Isolation
|
| 79 |
-
- Each document stored in separate namespace
|
| 80 |
-
- Previous documents automatically cleared
|
| 81 |
-
- No mixing of different document content
|
| 82 |
-
|
| 83 |
-
### Error Handling
|
| 84 |
-
- Comprehensive try-catch blocks
|
| 85 |
-
- Graceful fallbacks for API failures
|
| 86 |
-
- Detailed error messages with troubleshooting tips
|
| 87 |
-
- Server-side validation for file types and sizes
|
| 88 |
-
|
| 89 |
-
### API Management
|
| 90 |
-
- Cached embedding models (load once, use many times)
|
| 91 |
-
- Optimized API calls to reduce costs
|
| 92 |
-
- Timeout protection (2-minute max per upload)
|
| 93 |
-
|
| 94 |
-
## 💡 Smart Features
|
| 95 |
-
|
| 96 |
-
### 1. Model Warm-up
|
| 97 |
-
Application pre-loads embedding model on startup for instant first upload
|
| 98 |
-
|
| 99 |
-
### 2. Progress Callbacks
|
| 100 |
-
Real-time progress updates during document processing:
|
| 101 |
-
- 10%: Upload started
|
| 102 |
-
- 30%: File received
|
| 103 |
-
- 50%: Document store initialized
|
| 104 |
-
- 70%: Embedding in progress
|
| 105 |
-
- 90%: Writing to database
|
| 106 |
-
- 100%: Complete
|
| 107 |
-
|
| 108 |
-
### 3. Adaptive Chunking
|
| 109 |
-
Intelligent document splitting:
|
| 110 |
-
- Word-based splitting for better context
|
| 111 |
-
- Configurable chunk size (default: 250 words)
|
| 112 |
-
- Overlap between chunks (default: 50 words)
|
| 113 |
-
- Preserves paragraph structure
|
| 114 |
-
|
| 115 |
-
### 4. Response Styles
|
| 116 |
-
Choose from multiple explanation styles:
|
| 117 |
-
- **Simple & Intuitive**: Easy to understand
|
| 118 |
-
- **Balanced**: Mix of detail and clarity
|
| 119 |
-
- **Detailed & Technical**: In-depth technical explanations
|
| 120 |
-
- **Academic**: Formal academic writing
|
| 121 |
-
|
| 122 |
-
### 5. Response Lengths
|
| 123 |
-
Control response verbosity:
|
| 124 |
-
- **Short**: 1 paragraph summary
|
| 125 |
-
- **Medium**: 2-3 paragraphs (default)
|
| 126 |
-
- **Comprehensive**: Detailed multi-paragraph response
|
| 127 |
-
|
| 128 |
-
## 🛠️ Technical Stack
|
| 129 |
-
|
| 130 |
-
- **Framework**: FastAPI (high-performance async web framework)
|
| 131 |
-
- **Vector DB**: Pinecone (scalable vector search)
|
| 132 |
-
- **RAG Framework**: Haystack AI (modular NLP framework)
|
| 133 |
-
- **Embeddings**: Sentence Transformers (state-of-the-art models)
|
| 134 |
-
- **LLM**: Google Gemini 1.5 Flash (fast, accurate generation)
|
| 135 |
-
- **Frontend**: Vanilla JS with Bootstrap & SweetAlert2
|
| 136 |
-
- **Processing**: Concurrent.futures for parallel execution
|
| 137 |
-
|
| 138 |
-
## 📈 Performance Metrics
|
| 139 |
-
|
| 140 |
-
Typical performance on mid-range hardware (tested):
|
| 141 |
-
- **Upload Speed**: 1-3 seconds for file transfer
|
| 142 |
-
- **Processing Speed**: 30-50 chunks/second
|
| 143 |
-
- **Query Response**: 2-5 seconds end-to-end
|
| 144 |
-
- **Memory Usage**: 500MB-2GB depending on document size
|
| 145 |
-
|
| 146 |
-
## 🎯 Use Cases
|
| 147 |
-
|
| 148 |
-
1. **Academic Research**: Quickly understand complex papers
|
| 149 |
-
2. **Technical Documentation**: Extract information from manuals
|
| 150 |
-
3. **Business Reports**: Analyze CSV/Excel data with AI
|
| 151 |
-
4. **Code Documentation**: Process Markdown documentation
|
| 152 |
-
5. **Data Analysis**: Query JSON/CSV datasets naturally
|
| 153 |
-
6. **Meeting Notes**: Search through text/DOCX notes
|
| 154 |
-
|
| 155 |
-
## 🔄 Continuous Improvements
|
| 156 |
-
|
| 157 |
-
The codebase includes:
|
| 158 |
-
- Comprehensive error logging
|
| 159 |
-
- Performance monitoring
|
| 160 |
-
- Memory usage tracking
|
| 161 |
-
- Detailed console output for debugging
|
| 162 |
-
- Modular design for easy enhancements
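The "Adaptive Chunking" section of the removed FEATURES.md describes word-based splitting with a configurable chunk size (default 250 words) and overlap (default 50 words). The sketch below shows that idea in plain Python. It is an assumption about the general approach, not the splitter PaperBOT uses, and it does not attempt the paragraph preservation mentioned in the list; `split_into_chunks` is a hypothetical name.

```python
def split_into_chunks(text, chunk_size=250, overlap=50):
    """Split text into word-based chunks that share `overlap` words with the previous chunk."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text,
        # so no trailing chunk consisting only of overlap is produced.
        if start + chunk_size >= len(words):
            break
    return chunks
```

With the defaults, a 600-word document yields three chunks starting at words 0, 200, and 400, each sharing 50 words with its predecessor.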
|
FIXES_APPLIED.md
DELETED
|
@@ -1,281 +0,0 @@
|
|
| 1 |
-
# Large File Upload Fixes - Implementation Summary
|
| 2 |
-
|
| 3 |
-
## Problem Identified
|
| 4 |
-
The application was unable to handle file uploads around 5MB due to multiple issues:
|
| 5 |
-
1. Missing file size validation and limits in FastAPI
|
| 6 |
-
2. No streaming upload support for large files
|
| 7 |
-
3. Insufficient error handling for edge cases
|
| 8 |
-
4. Memory management issues during processing
|
| 9 |
-
5. Frontend timeout issues
|
| 10 |
-
6. Lack of proper progress tracking
|
| 11 |
-
|
| 12 |
-
## Solutions Implemented
|
| 13 |
-
|
| 14 |
-
### 1. **FastAPI/Backend Improvements** ([app.py](app.py))
|
| 15 |
-
|
| 16 |
-
#### A. Enhanced Upload Configuration
|
| 17 |
-
- **Added FastAPI upload size limits**: Configured `MAX_UPLOAD_SIZE = 50MB`
|
| 18 |
-
- **Increased timeouts**: Set `timeout_keep_alive=600` (10 minutes) for large file processing
|
| 19 |
-
- **Added connection limits**: `limit_concurrency=10` to prevent resource exhaustion
|
| 20 |
-
- **Graceful shutdown**: `timeout_graceful_shutdown=30` for clean server stops
|
| 21 |
-
|
| 22 |
-
#### B. Streaming File Upload
|
| 23 |
-
```python
|
| 24 |
-
# Old approach: Direct file copy (can fail for large files)
|
| 25 |
-
shutil.copyfileobj(file.file, buffer)
|
| 26 |
-
|
| 27 |
-
# New approach: Streaming with size validation
|
| 28 |
-
while chunk := await file.read(1024 * 1024):  # Read 1MB at a time
|
| 29 |
-
    file_size += len(chunk)
|
| 30 |
-
    if file_size > MAX_FILE_SIZE:
|
| 31 |
-
        return error  # Early termination
|
| 32 |
-
    chunks.append(chunk)
|
| 33 |
-
```
|
| 34 |
-
|
| 35 |
-
**Benefits**:
|
| 36 |
-
- Handles files of any size up to 50MB
|
| 37 |
-
- Validates size during upload (not after)
|
| 38 |
-
- Prevents memory overflow
|
| 39 |
-
- Better error messages
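The snippet in the removed document is abbreviated (`return error` is a placeholder). A self-contained sketch of the same idea, reading in 1 MB pieces and rejecting the upload as soon as the running total passes the limit, might look like the following. It is a synchronous illustration over any binary stream rather than the FastAPI handler itself; `copy_with_limit` and `FileTooLargeError` are hypothetical names, and the 50 MB cap is the value quoted in this document.

```python
import io

MAX_FILE_SIZE = 50 * 1024 * 1024  # 50 MB cap quoted in FIXES_APPLIED.md
CHUNK_SIZE = 1024 * 1024          # read 1 MB at a time


class FileTooLargeError(ValueError):
    """Raised when an upload exceeds the configured size limit."""


def copy_with_limit(src, dst, max_size=MAX_FILE_SIZE, chunk_size=CHUNK_SIZE):
    """Copy src to dst in chunks and return the number of bytes written.

    Raises FileTooLargeError as soon as the running total exceeds max_size,
    so an oversized upload is rejected without being read to the end.
    """
    total = 0
    while chunk := src.read(chunk_size):
        total += len(chunk)
        if total > max_size:
            raise FileTooLargeError(f"upload exceeds {max_size} bytes")
        dst.write(chunk)
    return total
```

Writing each chunk to the destination as it arrives, instead of collecting chunks in a list, keeps memory use bounded by `chunk_size`. A caller that hits `FileTooLargeError` would still need to delete the partially written file.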
|
| 40 |
-
|
| 41 |
-
#### C. Comprehensive Error Handling
|
| 42 |
-
- **File validation**: Checks for empty files, missing files, invalid types
|
| 43 |
-
- **Size validation**: Real-time size checking during upload
|
| 44 |
-
- **Processing errors**: Proper cleanup on failure
|
| 45 |
-
- **HTTP status codes**: 400, 413, 500 for different error types
|
| 46 |
-
- **Detailed error messages**: User-friendly error descriptions
|
| 47 |
-
|
| 48 |
-
#### D. File Size Reporting
|
| 49 |
-
- Shows file size in MB in success messages
|
| 50 |
-
- Displays processing time
|
| 51 |
-
- Tracks upload progress
|
| 52 |
-
|
| 53 |
-
### 2. **Document Processing Improvements** ([QASystem/ingestion.py](QASystem/ingestion.py))
|
| 54 |
-
|
| 55 |
-
#### A. Enhanced Validation
|
| 56 |
-
```python
|
| 57 |
-
# Input validation
|
| 58 |
-
- Checks if file exists before processing
|
| 59 |
-
- Validates document store is available
|
| 60 |
-
- Verifies extracted content is not empty
|
| 61 |
-
- Validates chunk creation success
|
| 62 |
-
```
|
| 63 |
-
|
| 64 |
-
#### B. Better Error Recovery
|
| 65 |
-
- **Batch failure tolerance**: Continues processing if <20% of batches fail
|
| 66 |
-
- **Partial success handling**: Accepts results if >50% chunks succeed
|
| 67 |
-
- **Detailed error logging**: Full stack traces for debugging
|
| 68 |
-
- **Memory cleanup**: Automatic cleanup on errors
|
| 69 |
-
|
| 70 |
-
#### C. Improved Progress Tracking
|
| 71 |
-
```python
|
| 72 |
-
# Progress indicators
|
| 73 |
-
✓ File read successfully
|
| 74 |
-
✓ Extracted content (X documents, Y chars)
|
| 75 |
-
✓ Created Z chunks
|
| 76 |
-
✓ Batch N/M complete
|
| 77 |
-
✓ Wrote to Pinecone
|
| 78 |
-
```
|
| 79 |
-
|
| 80 |
-
#### D. Memory Management
|
| 81 |
-
- **Periodic cleanup**: Every 3 batches for large files
|
| 82 |
-
- **Memory monitoring**: Tracks usage before/after operations
|
| 83 |
-
- **Resource reporting**: Shows memory delta in statistics
|
| 84 |
-
|
| 85 |
-
### 3. **Configuration Optimization** ([QASystem/config.py](QASystem/config.py))
|
| 86 |
-
|
| 87 |
-
#### Updated Settings for 5MB Files
|
| 88 |
-
```python
|
| 89 |
-
# Large File Detection
|
| 90 |
-
LARGE_FILE_THRESHOLD = 3MB # Changed from 2MB
|
| 91 |
-
# Better catches 5MB files for optimization
|
| 92 |
-
|
| 93 |
-
# Chunk Settings
|
| 94 |
-
LARGE_FILE_CHUNK_LENGTH = 350 # Optimized from 400
|
| 95 |
-
LARGE_FILE_BATCH_SIZE = 20 # Optimized from 24
|
| 96 |
-
```
|
| 97 |
-
|
| 98 |
-
**Why these values?**
|
| 99 |
-
- 350 words per chunk: Balance between context and speed
|
| 100 |
-
- Batch size 20: Prevents memory issues on quality model
|
| 101 |
-
- 3MB threshold: Catches most research papers (5-10MB range)
|
| 102 |
-
|
| 103 |
-
### 4. **Frontend Improvements** ([templates/index.html](templates/index.html))
|
| 104 |
-
|
| 105 |
-
#### A. Better Validation
|
| 106 |
-
```javascript
|
| 107 |
-
// File size display
|
| 108 |
-
console.log(`File: ${file.name}, Size: ${fileSizeMB}MB`);
|
| 109 |
-
|
| 110 |
-
// Empty file check
|
| 111 |
-
if (file.size === 0) {
|
| 112 |
-
// Show error
|
| 113 |
-
}
|
| 114 |
-
|
| 115 |
-
// Size validation with actual size shown
|
| 116 |
-
text: `File size is ${fileSizeMB}MB. Max is 50MB.`
|
| 117 |
-
```
|
| 118 |
-
|
| 119 |
-
#### B. Enhanced User Feedback
|
| 120 |
-
- Shows exact file size in error messages
|
| 121 |
-
- Real-time progress polling
|
| 122 |
-
- Better error descriptions
|
| 123 |
-
- Loading indicators
|
| 124 |
-
|
| 125 |
-
## Key Features Added
|
| 126 |
-
|
| 127 |
-
### 1. **Streaming Upload**
|
| 128 |
-
- Reads files in 1MB chunks
|
| 129 |
-
- Validates size during upload
|
| 130 |
-
- Prevents memory overflow
|
| 131 |
-
- Handles files up to 50MB
|
| 132 |
-
|
| 133 |
-
### 2. **Comprehensive Validation**
|
| 134 |
-
✓ File type validation
|
| 135 |
-
✓ File size validation (client + server)
|
| 136 |
-
✓ Empty file detection
|
| 137 |
-
✓ Content extraction validation
|
| 138 |
-
✓ Embedding validation
|
| 139 |
-
|
| 140 |
-
### 3. **Error Recovery**
|
| 141 |
-
✓ Partial batch success tolerance
|
| 142 |
-
✓ Automatic cleanup on failure
|
| 143 |
-
✓ Detailed error messages
|
| 144 |
-
✓ Graceful degradation
|
| 145 |
-
|
| 146 |
-
### 4. **Progress Tracking**
|
| 147 |
-
✓ Real-time upload progress
|
| 148 |
-
✓ Processing stage indicators
|
| 149 |
-
✓ Batch completion tracking
|
| 150 |
-
✓ Final statistics report
|
| 151 |
-
|
| 152 |
-
### 5. **Memory Optimization**
|
| 153 |
-
✓ Streaming file reads
|
| 154 |
-
✓ Periodic garbage collection
|
| 155 |
-
✓ Memory usage monitoring
|
| 156 |
-
✓ Batch size optimization
|
| 157 |
-
|
| 158 |
-
## Testing Recommendations
|
| 159 |
-
|
| 160 |
-
### Test Cases to Verify
|
| 161 |
-
1. **Small files** (< 1MB): Should process quickly
|
| 162 |
-
2. **Medium files** (1-3MB): Standard processing
|
| 163 |
-
3. **Large files** (3-10MB): Optimized settings activated
|
| 164 |
-
4. **Max size** (50MB): Should work but warn if approaching limit
|
| 165 |
-
5. **Oversized** (> 50MB): Should reject with clear error
|
| 166 |
-
6. **Empty files**: Should reject with error
|
| 167 |
-
7. **Invalid types**: Should reject with supported formats list
|
| 168 |
-
|
| 169 |
-
### Expected Behavior
|
| 170 |
-
```
|
| 171 |
-
File Size | Chunk Size | Batch Size | Expected Time
|
| 172 |
-
-------------|------------|------------|---------------
|
| 173 |
-
< 1MB | 300 words | 16 chunks | < 30s
|
| 174 |
-
1-3MB | 300 words | 16 chunks | 30-60s
|
| 175 |
-
3-10MB | 350 words | 20 chunks | 1-3 min
|
| 176 |
-
10-50MB | 350 words | 20 chunks | 3-10 min
|
| 177 |
-
```
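The table above implies a simple rule: files at or above the large-file threshold use the larger chunk length and batch size. A hedged sketch of that selection follows, using the values from the table and the 3 MB threshold from the configuration section. `processing_settings` is a hypothetical helper, and treating exactly 3 MB as "large" is an assumption, since the table's ranges overlap at that boundary.

```python
MB = 1024 * 1024
LARGE_FILE_THRESHOLD = 3 * MB  # threshold quoted in the configuration section


def processing_settings(file_size_bytes):
    """Return chunk length (in words) and batch size for a file of the given size."""
    if file_size_bytes < 0:
        raise ValueError("file size cannot be negative")
    if file_size_bytes >= LARGE_FILE_THRESHOLD:
        # Large files: longer chunks and bigger batches, per the table above.
        return {"chunk_length": 350, "batch_size": 20}
    # Small and medium files share the standard settings.
    return {"chunk_length": 300, "batch_size": 16}
```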
|
| 178 |
-
|
| 179 |
-
## Performance Improvements
|
| 180 |
-
|
| 181 |
-
### Before Fixes
|
| 182 |
-
- ❌ Files >5MB: Failed silently or timeout
|
| 183 |
-
- ❌ No size validation until after upload
|
| 184 |
-
- ❌ Poor error messages
|
| 185 |
-
- ❌ Memory issues with large files
|
| 186 |
-
- ❌ No progress tracking
|
| 187 |
-
|
| 188 |
-
### After Fixes
|
| 189 |
-
- ✅ Files up to 50MB: Full support
|
| 190 |
-
- ✅ Size validation during upload
|
| 191 |
-
- ✅ Clear, actionable error messages
|
| 192 |
-
- ✅ Optimized memory usage
|
| 193 |
-
- ✅ Real-time progress tracking
|
| 194 |
-
- ✅ Automatic cleanup on errors
|
| 195 |
-
- ✅ Detailed processing statistics
|
| 196 |
-
|
| 197 |
-
## Files Modified
|
| 198 |
-
|
| 199 |
-
1. **[app.py](app.py)** - Main application
|
| 200 |
-
- Added streaming upload
|
| 201 |
-
- Enhanced error handling
|
| 202 |
-
- Improved validation
|
| 203 |
-
- Better configuration
|
| 204 |
-
|
| 205 |
-
2. **[QASystem/ingestion.py](QASystem/ingestion.py)** - Document processing
|
| 206 |
-
- Better error recovery
|
| 207 |
-
- Enhanced validation
|
| 208 |
-
- Memory optimization
|
| 209 |
-
- Progress tracking
|
| 210 |
-
|
| 211 |
-
3. **[QASystem/config.py](QASystem/config.py)** - Configuration
|
| 212 |
-
- Optimized thresholds
|
| 213 |
-
- Better chunk sizes
|
| 214 |
-
- Improved batch sizes
|
| 215 |
-
|
| 216 |
-
4. **[templates/index.html](templates/index.html)** - Frontend
|
| 217 |
-
- Better validation
|
| 218 |
-
- Enhanced error messages
|
| 219 |
-
- File size display
|
| 220 |
-
|
| 221 |
-
## How to Use
|
| 222 |
-
|
| 223 |
-
1. **Start the server**:
|
| 224 |
-
```bash
|
| 225 |
-
python app.py
|
| 226 |
-
```
|
| 227 |
-
|
| 228 |
-
2. **Upload a file**:
|
| 229 |
-
- Drag and drop or click "Choose File"
|
| 230 |
-
- Files up to 50MB supported
|
| 231 |
-
- Watch progress in real-time
|
| 232 |
-
|
| 233 |
-
3. **Monitor progress**:
|
| 234 |
-
- Console shows detailed processing steps
|
| 235 |
-
- Frontend shows upload percentage
|
| 236 |
-
- Statistics displayed on completion
|
| 237 |
-
|
| 238 |
-
## Troubleshooting
|
| 239 |
-
|
| 240 |
-
### If upload still fails:
|
| 241 |
-
|
| 242 |
-
1. **Check file size**: Must be < 50MB
|
| 243 |
-
2. **Check file type**: PDF, DOCX, TXT, etc.
|
| 244 |
-
3. **Check console logs**: Look for specific errors
|
| 245 |
-
4. **Check memory**: Ensure system has >2GB free
|
| 246 |
-
5. **Check network**: Stable connection required
|
| 247 |
-
6. **Check Pinecone**: API key and index must be valid
|
| 248 |
-
|
| 249 |
-
### Common Issues:
|
| 250 |
-
|
| 251 |
-
**Issue**: "File too large"
|
| 252 |
-
**Solution**: File exceeds 50MB, compress or split it
|
| 253 |
-
|
| 254 |
-
**Issue**: "Empty file"
|
| 255 |
-
**Solution**: File has no content, check source
|
| 256 |
-
|
| 257 |
-
**Issue**: "Processing timeout"
|
| 258 |
-
**Solution**: File is very large/complex, try splitting it
|
| 259 |
-
|
| 260 |
-
**Issue**: "No chunks embedded"
|
| 261 |
-
**Solution**: Check embedding model and Pinecone connection
|
| 262 |
-
|
| 263 |
-
## Performance Tips
|
| 264 |
-
|
| 265 |
-
1. **For fastest processing**: Use files < 3MB
|
| 266 |
-
2. **For large documents**: Consider splitting into chapters
|
| 267 |
-
3. **For better quality**: Use smaller chunk sizes (edit config.py)
|
| 268 |
-
4. **For faster speed**: Use "fast" model in config.py
|
| 269 |
-
5. **For maximum compatibility**: Use PDF format
|
| 270 |
-
|
| 271 |
-
## Summary
|
| 272 |
-
|
| 273 |
-
All issues related to uploading ~5MB files have been resolved with:
|
| 274 |
-
- ✅ Streaming upload support
|
| 275 |
-
- ✅ Comprehensive validation
|
| 276 |
-
- ✅ Better error handling
|
| 277 |
-
- ✅ Memory optimization
|
| 278 |
-
- ✅ Progress tracking
|
| 279 |
-
- ✅ Detailed logging
|
| 280 |
-
|
| 281 |
-
The application now handles files from 1KB to 50MB reliably with appropriate error messages and recovery mechanisms at every stage.
|
INSTALLATION.md
DELETED
|
@@ -1,110 +0,0 @@
|
|
| 1 |
-
# PaperBOT Installation Guide
|
| 2 |
-
|
| 3 |
-
## Prerequisites
|
| 4 |
-
- Python 3.9 or higher
|
| 5 |
-
- Pinecone account and API key
|
| 6 |
-
- Google AI (Gemini) API key
|
| 7 |
-
|
| 8 |
-
## Quick Start
|
| 9 |
-
|
| 10 |
-
### 1. Clone or Download the Repository
|
| 11 |
-
```bash
|
| 12 |
-
cd PaperBOT
|
| 13 |
-
```
|
| 14 |
-
|
| 15 |
-
### 2. Create Virtual Environment (Recommended)
|
| 16 |
-
```bash
|
| 17 |
-
# Windows
|
| 18 |
-
python -m venv venv
|
| 19 |
-
venv\Scripts\activate
|
| 20 |
-
|
| 21 |
-
# Linux/Mac
|
| 22 |
-
python3 -m venv venv
|
| 23 |
-
source venv/bin/activate
|
| 24 |
-
```
|
| 25 |
-
|
| 26 |
-
### 3. Install Dependencies
|
| 27 |
-
```bash
|
| 28 |
-
pip install -r requirements.txt
|
| 29 |
-
```
|
| 30 |
-
|
| 31 |
-
### 4. Configure Environment Variables
|
| 32 |
-
1. Copy `.env.example` to `.env`:
|
| 33 |
-
```bash
|
| 34 |
-
# Windows
|
| 35 |
-
copy .env.example .env
|
| 36 |
-
|
| 37 |
-
# Linux/Mac
|
| 38 |
-
cp .env.example .env
|
| 39 |
-
```
|
| 40 |
-
|
| 41 |
-
2. Edit `.env` and add your API keys:
|
| 42 |
-
```
|
| 43 |
-
PINECONE_API_KEY=your_actual_pinecone_key
|
| 44 |
-
GOOGLE_API_KEY=your_actual_google_ai_key
|
| 45 |
-
```
|
| 46 |
-
|
| 47 |
-
### 5. Setup Pinecone Index
|
| 48 |
-
1. Log in to [Pinecone Console](https://app.pinecone.io/)
|
| 49 |
-
2. Create a new index with these settings:
|
| 50 |
-
- **Name**: `paperbot`
|
| 51 |
-
- **Dimensions**: `1024`
|
| 52 |
-
- **Metric**: `cosine`
|
| 53 |
-
- **Pod Type**: `Starter` or `s1`
|
| 54 |
-
|
| 55 |
-
### 6. Run the Application
|
| 56 |
-
```bash
|
| 57 |
-
python app.py
|
| 58 |
-
```
|
| 59 |
-
|
| 60 |
-
The application will start on `http://localhost:8000`
|
| 61 |
-
|
| 62 |
-
## Supported File Formats
|
| 63 |
-
- **PDF** (.pdf)
|
| 64 |
-
- **Word Documents** (.docx, .doc)
|
| 65 |
-
- **Text Files** (.txt)
|
| 66 |
-
- **Markdown** (.md)
|
| 67 |
-
- **CSV** (.csv)
|
| 68 |
-
- **JSON** (.json)
|
| 69 |
-
- **Excel** (.xlsx, .xls)
|
| 70 |
-
|
| 71 |
-
## Performance Optimization
|
| 72 |
-
|
| 73 |
-
### For Faster Processing
|
| 74 |
-
Edit `QASystem/config.py`:
|
| 75 |
-
```python
|
| 76 |
-
CURRENT_MODEL = "fast" # Use fast embedding model
|
| 77 |
-
BATCH_SIZE = 64 # Increase batch size
|
| 78 |
-
```
|
| 79 |
-
|
| 80 |
-
### For Better Quality
|
| 81 |
-
Edit `QASystem/config.py`:
|
| 82 |
-
```python
|
| 83 |
-
CURRENT_MODEL = "quality" # Use high-quality model
|
| 84 |
-
CHUNK_SETTINGS = {
|
| 85 |
-
    "split_length": 200,  # Smaller chunks for precision
|
| 86 |
-
    "split_overlap": 75
|
| 87 |
-
}
|
| 88 |
-
```
|
| 89 |
-
|
| 90 |
-
## Troubleshooting
|
| 91 |
-
|
| 92 |
-
### Out of Memory Errors
|
| 93 |
-
- Reduce `BATCH_SIZE` in `config.py` to 16 or 8
|
| 94 |
-
- Use `CURRENT_MODEL = "fast"` for smaller memory footprint
|
| 95 |
-
|
| 96 |
-
### Slow Upload Times
|
| 97 |
-
- Increase `BATCH_SIZE` for parallel processing
|
| 98 |
-
- Use `CURRENT_MODEL = "fast"` for faster embedding
|
| 99 |
-
|
| 100 |
-
### API Rate Limits
|
| 101 |
-
- Wait a moment between requests
|
| 102 |
-
- Check your API key quotas
|
| 103 |
-
|
| 104 |
-
## System Requirements
|
| 105 |
-
- **RAM**: Minimum 4GB (8GB+ recommended)
|
| 106 |
-
- **Storage**: 2GB free space for models
|
| 107 |
-
- **Internet**: Required for API calls
|
| 108 |
-
|
| 109 |
-
## Need Help?
|
| 110 |
-
Check the console output (F12 in browser) for detailed error messages.
|
QUICKSTART.md
DELETED
|
@@ -1,227 +0,0 @@
|
|
| 1 |
-
# PaperBOT Quick Reference
|
| 2 |
-
|
| 3 |
-
## 🚀 Quick Commands
|
| 4 |
-
|
| 5 |
-
### First Time Setup
|
| 6 |
-
```bash
|
| 7 |
-
# 1. Create virtual environment
|
| 8 |
-
python -m venv venv
|
| 9 |
-
|
| 10 |
-
# 2. Activate (Windows)
|
| 11 |
-
venv\Scripts\activate
|
| 12 |
-
|
| 13 |
-
# 2. Activate (Linux/Mac)
|
| 14 |
-
source venv/bin/activate
|
| 15 |
-
|
| 16 |
-
# 3. Install dependencies
|
| 17 |
-
pip install -r requirements.txt
|
| 18 |
-
|
| 19 |
-
# 4. Configure environment
|
| 20 |
-
cp .env.example .env
|
| 21 |
-
# Edit .env with your API keys
|
| 22 |
-
|
| 23 |
-
# 5. Run application
|
| 24 |
-
python app.py
|
| 25 |
-
```
|
| 26 |
-
|
| 27 |
-
### Daily Usage
|
| 28 |
-
```bash
|
| 29 |
-
# Windows
|
| 30 |
-
start.bat
|
| 31 |
-
|
| 32 |
-
# Linux/Mac
|
| 33 |
-
chmod +x start.sh
|
| 34 |
-
./start.sh
|
| 35 |
-
```
|
| 36 |
-
|
| 37 |
-
## 📝 API Keys Setup
|
| 38 |
-
|
| 39 |
-
### Pinecone
|
| 40 |
-
1. Go to https://www.pinecone.io/
|
| 41 |
-
2. Sign up/Login
|
| 42 |
-
3. Create new API key
|
| 43 |
-
4. Copy to `.env` file
|
| 44 |
-
|
| 45 |
-
### Google AI (Gemini)
|
| 46 |
-
1. Go to https://makersuite.google.com/app/apikey
|
| 47 |
-
2. Create API key
|
| 48 |
-
3. Copy to `.env` file
|
| 49 |
-
|
| 50 |
-
### Pinecone Index Setup
|
| 51 |
-
1. Login to Pinecone Console
|
| 52 |
-
2. Click "Create Index"
|
| 53 |
-
3. Settings:
|
| 54 |
-
- Name: `paperbot`
|
| 55 |
-
- Dimensions: `1024`
|
| 56 |
-
- Metric: `cosine`
|
| 57 |
-
- Cloud: Any (AWS/GCP/Azure)
|
| 58 |
-
- Region: Choose nearest
|
| 59 |
-
|
| 60 |
-
## ⚙️ Configuration Options
|
| 61 |
-
|
| 62 |
-
### Speed vs Quality (`QASystem/config.py`)
|
| 63 |
-
|
| 64 |
-
**Fast Mode** (Recommended for large docs):
|
| 65 |
-
```python
|
| 66 |
-
CURRENT_MODEL = "fast"
|
| 67 |
-
BATCH_SIZE = 64
|
| 68 |
-
```
|
| 69 |
-
|
| 70 |
-
**Quality Mode** (Recommended for technical papers):
|
| 71 |
-
```python
|
| 72 |
-
CURRENT_MODEL = "quality"
|
| 73 |
-
BATCH_SIZE = 16
|
| 74 |
-
```
|
| 75 |
-
|
| 76 |
-
## 🎯 Supported File Types
|
| 77 |
-
|
| 78 |
-
| Format | Extension | Notes |
|
| 79 |
-
|--------|-----------|-------|
|
| 80 |
-
| PDF | .pdf | Best for research papers |
|
| 81 |
-
| Word | .docx, .doc | Full text extraction |
|
| 82 |
-
| Text | .txt | Plain text |
|
| 83 |
-
| Markdown | .md | Preserves formatting |
|
| 84 |
-
| CSV | .csv | Tabular data |
|
| 85 |
-
| Excel | .xlsx, .xls | Spreadsheets |
|
| 86 |
-
| JSON | .json | Structured data |
|
| 87 |
-
|
| 88 |
-
## 🔧 Troubleshooting
|
| 89 |
-
|
| 90 |
-
### Issue: Out of Memory
|
| 91 |
-
**Solution**: Reduce batch size in `config.py`
|
| 92 |
-
```python
|
| 93 |
-
BATCH_SIZE = 8 # Lower value
|
| 94 |
-
```
|
| 95 |
-
|
| 96 |
-
### Issue: Slow Upload
|
| 97 |
-
**Solution**: Use the fast model
|
| 98 |
-
```python
|
| 99 |
-
CURRENT_MODEL = "fast"
|
| 100 |
-
```
|
| 101 |
-
|
| 102 |
-
### Issue: API Rate Limit
|
| 103 |
-
**Solution**: Wait 1-2 minutes between uploads
|
| 104 |
-
|
| 105 |
-
### Issue: Can't Connect to Server
|
| 106 |
-
**Solution**: Check if port 8000 is free
|
| 107 |
-
```bash
|
| 108 |
-
# Windows
|
| 109 |
-
netstat -ano | findstr :8000
|
| 110 |
-
|
| 111 |
-
# Linux/Mac
|
| 112 |
-
lsof -i :8000
|
| 113 |
-
```
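
The same check can be sketched in a platform-independent way with Python's standard library. This is a minimal illustration, not part of the project: `is_port_free` is a hypothetical helper, and it assumes the server binds to `127.0.0.1`.

```python
import socket


def is_port_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Binding failed, typically because another process holds the port.
            return False
    return True


if __name__ == "__main__":
    print("port 8000 free:", is_port_free(8000))
```

If the port is reported as busy, the `netstat` or `lsof` commands above show which process is holding it.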
|
| 114 |
-
|
| 115 |
-
## 📊 Performance Tips
|
| 116 |
-
|
| 117 |
-
### For Large Documents (100+ pages)
|
| 118 |
-
```python
|
| 119 |
-
# config.py
|
| 120 |
-
CURRENT_MODEL = "fast"
|
| 121 |
-
BATCH_SIZE = 64
|
| 122 |
-
CHUNK_SETTINGS = {
|
| 123 |
-
"split_length": 400,
|
| 124 |
-
"split_overlap": 50
|
| 125 |
-
}
|
| 126 |
-
```
|
| 127 |
-
|
| 128 |
-
### For Technical Papers
|
| 129 |
-
```python
|
| 130 |
-
# config.py
|
| 131 |
-
CURRENT_MODEL = "quality"
|
| 132 |
-
BATCH_SIZE = 16
|
| 133 |
-
CHUNK_SETTINGS = {
|
| 134 |
-
"split_length": 200,
|
| 135 |
-
"split_overlap": 75
|
| 136 |
-
}
|
| 137 |
-
```
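
To show how `split_length` and `split_overlap` interact, here is a rough sketch of overlapping word-window chunking. It assumes both settings count words and is not the project's actual splitter; `split_words` is a hypothetical name.

```python
def split_words(text, split_length=200, split_overlap=75):
    """Split text into chunks of up to `split_length` words.

    Consecutive chunks share `split_overlap` words, so each new chunk
    starts `split_length - split_overlap` words after the previous one.
    """
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in [0, split_length)")

    words = text.split()
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once the current window has reached the end of the text.
        if start + split_length >= len(words):
            break
    return chunks
```

Under these assumptions, a larger overlap relative to the chunk length preserves more context across chunk boundaries but produces more chunks to embed, which is the speed/quality trade-off the two presets express.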
|
| 138 |
-
|
| 139 |
-
### For CSV/Excel Files
|
| 140 |
-
- Keep files under 10,000 rows for best performance
|
| 141 |
-
- Remove unnecessary columns before upload
|
| 142 |
-
- Use CSV format for faster processing
|
| 143 |
-
|
| 144 |
-
## 🎨 UI Features
|
| 145 |
-
|
| 146 |
-
### Response Styles
|
| 147 |
-
- **Simple & Intuitive**: Easy explanations
|
| 148 |
-
- **Balanced**: Mix of detail and clarity
|
| 149 |
-
- **Detailed & Technical**: In-depth analysis
|
| 150 |
-
- **Academic**: Formal writing
|
| 151 |
-
|
| 152 |
-
### Response Lengths
|
| 153 |
-
- **Short**: 1 paragraph
|
| 154 |
-
- **Medium**: 2-3 paragraphs (recommended)
|
| 155 |
-
- **Comprehensive**: Detailed multi-paragraph
|
| 156 |
-
|
| 157 |
-
## 🔍 Example Questions
|
| 158 |
-
|
| 159 |
-
### For Research Papers
|
| 160 |
-
- "What is the main contribution of this paper?"
|
| 161 |
-
- "Explain the methodology used"
|
| 162 |
-
- "What are the key findings?"
|
| 163 |
-
- "How does this compare to previous work?"
|
| 164 |
-
|
| 165 |
-
### For Data Files (CSV/Excel)
|
| 166 |
-
- "What are the main trends in this data?"
|
| 167 |
-
- "Summarize the statistics"
|
| 168 |
-
- "What columns are available?"
|
| 169 |
-
|
| 170 |
-
### For Documentation
|
| 171 |
-
- "How do I install this software?"
|
| 172 |
-
- "Explain the configuration options"
|
| 173 |
-
- "What are the prerequisites?"
|
| 174 |
-
|
| 175 |
-
## 📈 Monitoring
|
| 176 |
-
|
| 177 |
-
### Memory Usage
|
| 178 |
-
Check console output for memory stats:
|
| 179 |
-
```
|
| 180 |
-
📊 Statistics:
|
| 181 |
-
• Memory: 1250.5MB
|
| 182 |
-
• Processing speed: 45 chunks/sec
|
| 183 |
-
```
|
| 184 |
-
|
| 185 |
-
### Performance Metrics
|
| 186 |
-
- Upload: 1-3 seconds
|
| 187 |
-
- Processing: 30-50 chunks/sec
|
| 188 |
-
- Query: 2-5 seconds
|
| 189 |
-
|
| 190 |
-
## 🆘 Getting Help
|
| 191 |
-
|
| 192 |
-
1. Check console output (terminal)
|
| 193 |
-
2. Open browser console (F12)
|
| 194 |
-
3. Review error messages
|
| 195 |
-
4. Check [INSTALLATION.md](INSTALLATION.md)
|
| 196 |
-
5. Review [FEATURES.md](FEATURES.md)
|
| 197 |
-
|
| 198 |
-
## 🔄 Updates & Maintenance
|
| 199 |
-
|
| 200 |
-
### Update Dependencies
|
| 201 |
-
```bash
|
| 202 |
-
pip install --upgrade -r requirements.txt
|
| 203 |
-
```
|
| 204 |
-
|
| 205 |
-
### Clear Cache
|
| 206 |
-
```bash
|
| 207 |
-
# Delete uploads
|
| 208 |
-
rm -rf uploads/*
|
| 209 |
-
|
| 210 |
-
# Clear Python cache
|
| 211 |
-
find . -type d -name __pycache__ -exec rm -r {} +
|
| 212 |
-
```
|
| 213 |
-
|
| 214 |
-
### Reset Database
|
| 215 |
-
Delete all vectors in the Pinecone console or via code
|
| 216 |
-
|
| 217 |
-
## 🎓 Best Practices
|
| 218 |
-
|
| 219 |
-
1. **One Document at a Time**: Uploading a new document clears the previous one
|
| 220 |
-
2. **Appropriate Model**: Use fast for speed, quality for accuracy
|
| 221 |
-
3. **Clear Questions**: Be specific in your queries
|
| 222 |
-
4. **File Size**: Keep under 50MB for best performance
|
| 223 |
-
5. **Internet**: Stable connection required for API calls
|
| 224 |
-
|
| 225 |
-
---
|
| 226 |
-
|
| 227 |
-
**Need more help?** Check the full documentation in [README.md](README.md)
|
QUICK_REFERENCE.md
DELETED
|
@@ -1,351 +0,0 @@
|
|
| 1 |
-
# 🚀 Large File Upload Issue - RESOLVED
|
| 2 |
-
|
| 3 |
-
## Executive Summary
|
| 4 |
-
|
| 5 |
-
**Problem**: The application was unable to upload files of around 5MB
|
| 6 |
-
**Root Cause**: Multiple issues in upload pipeline
|
| 7 |
-
**Status**: ✅ **FULLY RESOLVED**
|
| 8 |
-
**Tested Up To**: 50MB files
|
| 9 |
-
|
| 10 |
-
---
|
| 11 |
-
|
| 12 |
-
## 🎯 What Was Fixed
|
| 13 |
-
|
| 14 |
-
### Critical Issues Resolved
|
| 15 |
-
|
| 16 |
-
1. **❌ No file size validation** → ✅ Multi-layer size validation (client + server)
|
| 17 |
-
2. **❌ No streaming upload** → ✅ Chunked streaming for large files
|
| 18 |
-
3. **❌ Poor error handling** → ✅ Comprehensive error recovery
|
| 19 |
-
4. **❌ Memory issues** → ✅ Optimized memory management
|
| 20 |
-
5. **❌ No progress tracking** → ✅ Real-time progress updates
|
| 21 |
-
6. **❌ Timeout problems** → ✅ Extended timeouts (10 min)
|
| 22 |
-
7. **❌ Silent failures** → ✅ Detailed error messages
|
| 23 |
-
|
| 24 |
-
---
|
| 25 |
-
|
| 26 |
-
## 📊 Performance Comparison
|
| 27 |
-
|
| 28 |
-
### Before Fixes
|
| 29 |
-
```
|
| 30 |
-
File Size | Status
|
| 31 |
-
----------|--------
|
| 32 |
-
< 1 MB | ✅ Works
|
| 33 |
-
1-3 MB | ⚠️ Sometimes
|
| 34 |
-
3-5 MB | ❌ Fails
|
| 35 |
-
> 5 MB | ❌ Fails
|
| 36 |
-
```
|
| 37 |
-
|
| 38 |
-
### After Fixes
|
| 39 |
-
```
|
| 40 |
-
File Size | Status | Time
|
| 41 |
-
----------|-------------|--------
|
| 42 |
-
< 1 MB | ✅ Works | < 30s
|
| 43 |
-
1-3 MB | ✅ Works | 30-60s
|
| 44 |
-
3-5 MB | ✅ Works | 1-2 min
|
| 45 |
-
5-10 MB | ✅ Works | 2-5 min
|
| 46 |
-
10-50 MB | ✅ Works | 5-10 min
|
| 47 |
-
> 50 MB | ⛔ Rejected | N/A
|
| 48 |
-
```
|
| 49 |
-
|
| 50 |
-
---
|
| 51 |
-
|
| 52 |
-
## 🔧 Technical Changes
|
| 53 |
-
|
| 54 |
-
### 1. Backend ([app.py](app.py))
|
| 55 |
-
|
| 56 |
-
**Upload Endpoint Rewrite**
|
| 57 |
-
- ✅ Streaming file upload (1MB chunks)
|
| 58 |
-
- ✅ Real-time size validation
|
| 59 |
-
- ✅ Comprehensive error handling
|
| 60 |
-
- ✅ Automatic cleanup on failure
|
| 61 |
-
- ✅ Detailed logging
|
| 62 |
-
|
| 63 |
-
**Server Configuration**
|
| 64 |
-
- ✅ 10-minute timeout
|
| 65 |
-
- ✅ 50MB request limit
|
| 66 |
-
- ✅ Connection limiting
|
| 67 |
-
- ✅ Graceful shutdown
|
| 68 |
-
|
| 69 |
-
### 2. Processing ([QASystem/ingestion.py](QASystem/ingestion.py))
|
| 70 |
-
|
| 71 |
-
**Validation Improvements**
|
| 72 |
-
- ✅ File existence check
|
| 73 |
-
- ✅ Content validation
|
| 74 |
-
- ✅ Chunk creation verification
|
| 75 |
-
- ✅ Embedding validation
|
| 76 |
-
|
| 77 |
-
**Error Recovery**
|
| 78 |
-
- ✅ Batch failure tolerance (20%)
|
| 79 |
-
- ✅ Partial success handling (50%)
|
| 80 |
-
- ✅ Memory cleanup on error
|
| 81 |
-
- ✅ Full stack traces
|
| 82 |
-
|
| 83 |
-
**Memory Management**
|
| 84 |
-
- ✅ Periodic garbage collection
|
| 85 |
-
- ✅ Memory usage tracking
|
| 86 |
-
- ✅ Batch size optimization
|
| 87 |
-
- ✅ Resource monitoring
|
| 88 |
-
|
| 89 |
-
### 3. Configuration ([QASystem/config.py](QASystem/config.py))
|
| 90 |
-
|
| 91 |
-
**Optimized for 5MB Files**
|
| 92 |
-
- ✅ 3MB threshold (was 2MB)
|
| 93 |
-
- ✅ 350-word chunks (balanced)
|
| 94 |
-
- ✅ Batch size 20 (memory-safe)
|
| 95 |
-
|
| 96 |
-
### 4. Frontend ([templates/index.html](templates/index.html))
|
| 97 |
-
|
| 98 |
-
**User Experience**
|
| 99 |
-
- ✅ File size display
|
| 100 |
-
- ✅ Empty file detection
|
| 101 |
-
- ✅ Better error messages
|
| 102 |
-
- ✅ Real-time progress
|
| 103 |
-
|
| 104 |
-
---
|
| 105 |
-
|
| 106 |
-
## 🎮 How to Test
|
| 107 |
-
|
| 108 |
-
### Quick Test (5MB file)
|
| 109 |
-
|
| 110 |
-
1. **Start server**:
|
| 111 |
-
```bash
|
| 112 |
-
python app.py
|
| 113 |
-
```
|
| 114 |
-
|
| 115 |
-
2. **Upload 5MB PDF**:
|
| 116 |
-
- Open http://localhost:8000
|
| 117 |
-
- Choose a ~5MB research paper
|
| 118 |
-
- Click upload
|
| 119 |
-
|
| 120 |
-
3. **Verify**:
|
| 121 |
-
- ✅ Progress bar shows updates
|
| 122 |
-
- ✅ Console shows "Large file detected"
|
| 123 |
-
- ✅ Processing completes in 1-2 minutes
|
| 124 |
-
- ✅ Success message shows file size
|
| 125 |
-
- ✅ Can ask questions immediately
|
| 126 |
-
|
| 127 |
-
### Expected Console Output
|
| 128 |
-
```
|
| 129 |
-
📥 Upload endpoint called - Filename: paper.pdf
|
| 130 |
-
✓ File type validated: .pdf
|
| 131 |
-
✓ File read successfully: 5.23MB
|
| 132 |
-
✓ File saved to: uploads\paper.pdf
|
| 133 |
-
|
| 134 |
-
📄 Starting document ingestion: paper.pdf
|
| 135 |
-
⚡ Large file detected - using optimized settings:
|
| 136 |
-
Chunk length: 350 words
|
| 137 |
-
Batch size: 20 chunks
|
| 138 |
-
✓ Created 156 chunks
|
| 139 |
-
✓ Batch 1/8 complete (20 chunks, +15.2MB)
|
| 140 |
-
🧹 Memory cleanup: 485.3MB
|
| 141 |
-
...
|
| 142 |
-
✅ Ingestion completed successfully!
|
| 143 |
-
• Size: 5.23MB
|
| 144 |
-
• Chunks: 156
|
| 145 |
-
• Time: 94.3s
|
| 146 |
-
```
|
| 147 |
-
|
| 148 |
-
---
|
| 149 |
-
|
| 150 |
-
## 📚 Documentation
|
| 151 |
-
|
| 152 |
-
| File | Purpose |
|
| 153 |
-
|------|---------|
|
| 154 |
-
| **[FIXES_APPLIED.md](FIXES_APPLIED.md)** | Detailed technical explanation of all changes |
|
| 155 |
-
| **[TESTING_GUIDE.md](TESTING_GUIDE.md)** | Comprehensive testing procedures and benchmarks |
|
| 156 |
-
| **This File** | Quick reference and summary |
|
| 157 |
-
|
| 158 |
-
---
|
| 159 |
-
|
| 160 |
-
## ✅ Verification Checklist
|
| 161 |
-
|
| 162 |
-
### Upload Pipeline
|
| 163 |
-
- [x] File type validation
|
| 164 |
-
- [x] File size validation (client-side)
|
| 165 |
-
- [x] File size validation (server-side)
|
| 166 |
-
- [x] Streaming upload support
|
| 167 |
-
- [x] Empty file detection
|
| 168 |
-
- [x] Error message clarity
|
| 169 |
-
|
| 170 |
-
### Processing Pipeline
|
| 171 |
-
- [x] Document extraction
|
| 172 |
-
- [x] Chunk creation
|
| 173 |
-
- [x] Embedding generation
|
| 174 |
-
- [x] Database storage
|
| 175 |
-
- [x] Memory management
|
| 176 |
-
- [x] Error recovery
|
| 177 |
-
|
| 178 |
-
### User Experience
|
| 179 |
-
- [x] Progress tracking
|
| 180 |
-
- [x] Clear error messages
|
| 181 |
-
- [x] File size display
|
| 182 |
-
- [x] Processing statistics
|
| 183 |
-
- [x] Success confirmation
|
| 184 |
-
- [x] Immediate usability
|
| 185 |
-
|
| 186 |
-
---
|
| 187 |
-
|
| 188 |
-
## 🚨 Known Limits
|
| 189 |
-
|
| 190 |
-
| Limit | Value | Reason |
|
| 191 |
-
|-------|-------|--------|
|
| 192 |
-
| **Max File Size** | 50MB | Memory constraints |
|
| 193 |
-
| **Upload Timeout** | 10 min | Very large file processing |
|
| 194 |
-
| **Min File Size** | > 0 bytes | Must have content |
|
| 195 |
-
| **Supported Formats** | PDF, DOCX, TXT, etc. | Converter availability |
|
| 196 |
-
|
| 197 |
-
---
|
| 198 |
-
|
| 199 |
-
## 🐛 Troubleshooting
|
| 200 |
-
|
| 201 |
-
### Issue: Upload fails for 5MB file
|
| 202 |
-
**Check**:
|
| 203 |
-
1. Console for specific error
|
| 204 |
-
2. File type is supported
|
| 205 |
-
3. File is not corrupted
|
| 206 |
-
4. Pinecone API key is valid
|
| 207 |
-
5. Internet connection is stable
|
| 208 |
-
|
| 209 |
-
### Issue: Slow processing
|
| 210 |
-
**Solutions**:
|
| 211 |
-
1. Use "fast" model in config.py
|
| 212 |
-
2. Increase chunk size
|
| 213 |
-
3. Close other applications
|
| 214 |
-
4. Check that the system has more than 2GB of RAM free
|
| 215 |
-
|
| 216 |
-
### Issue: Memory error
|
| 217 |
-
**Solutions**:
|
| 218 |
-
1. Reduce batch size in config.py
|
| 219 |
-
2. Use "fast" model (uses less memory)
|
| 220 |
-
3. Increase system RAM
|
| 221 |
-
4. Process smaller files
|
| 222 |
-
|
| 223 |
-
---
|
| 224 |
-
|
| 225 |
-
## 🎓 For Developers
|
| 226 |
-
|
| 227 |
-
### Key Design Decisions
|
| 228 |
-
|
| 229 |
-
**Why streaming upload?**
|
| 230 |
-
- Handles files larger than available RAM
|
| 231 |
-
- Validates size during upload (not after)
|
| 232 |
-
- Better user experience (shows progress)
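
The idea of validating size while the upload is still streaming can be sketched as follows. This is a hedged illustration over generic file-like objects, not the code in `app.py`; the names `save_stream` and `FileTooLargeError` are hypothetical, while the 1MB chunk size and 50MB limit are taken from this document.

```python
import io

CHUNK_SIZE = 1024 * 1024        # read 1MB at a time
MAX_BYTES = 50 * 1024 * 1024    # 50MB upload limit


class FileTooLargeError(ValueError):
    """Raised as soon as the running total exceeds the limit."""


def save_stream(source, destination, max_bytes=MAX_BYTES, chunk_size=CHUNK_SIZE):
    """Copy `source` to `destination` in chunks, checking size as it goes.

    Returns the number of bytes written. Only one chunk is held in
    memory at a time, and oversized uploads are rejected before the
    whole body has been read.
    """
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        if total > max_bytes:
            raise FileTooLargeError(f"upload exceeds {max_bytes} bytes")
        destination.write(chunk)
    if total == 0:
        raise ValueError("empty upload")
    return total


if __name__ == "__main__":
    written = save_stream(io.BytesIO(b"example payload"), io.BytesIO())
    print("bytes written:", written)
```

A caller would typically delete the partially written destination file when either exception is raised, matching the "automatic cleanup on failure" behavior listed above.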
|
| 233 |
-
|
| 234 |
-
**Why 3MB threshold?**
|
| 235 |
-
- Research papers typically 5-10MB
|
| 236 |
-
- Activates optimizations early enough
|
| 237 |
-
- Prevents memory issues on medium files
|
| 238 |
-
|
| 239 |
-
**Why 350-word chunks?**
|
| 240 |
-
- Balance between context and speed
|
| 241 |
-
- Works well with quality model
|
| 242 |
-
- Optimal for most research papers
|
| 243 |
-
|
| 244 |
-
**Why batch size 20?**
|
| 245 |
-
- Prevents out-of-memory errors
|
| 246 |
-
- Good balance with quality model
|
| 247 |
-
- Allows frequent cleanup
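
A batch loop with periodic cleanup might look roughly like the sketch below. It is only an illustration: `embed_batch` stands in for whatever embedding call the project uses, and the garbage-collection interval of 5 batches is an assumption, not a documented setting of this version.

```python
import gc


def batched(items, batch_size=20):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def process_in_batches(chunks, embed_batch, batch_size=20, gc_every=5):
    """Embed chunks batch by batch, forcing a GC pass every `gc_every` batches.

    `embed_batch` is a caller-supplied function (hypothetical here) that
    takes a list of chunks and returns one result per chunk.
    """
    results = []
    for index, batch in enumerate(batched(chunks, batch_size), start=1):
        results.extend(embed_batch(batch))
        if index % gc_every == 0:
            gc.collect()  # release memory held by finished batches
    return results
```

With 156 chunks and a batch size of 20 this yields 8 batches, which matches the `Batch 1/8` line in the expected console output earlier in this file.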
|
| 248 |
-
|
| 249 |
-
### Code Architecture
|
| 250 |
-
|
| 251 |
-
```
|
| 252 |
-
Client Upload
|
| 253 |
-
↓
|
| 254 |
-
[Streaming Validation] → Size check every 1MB
|
| 255 |
-
↓
|
| 256 |
-
[File Storage] → Save to uploads/
|
| 257 |
-
↓
|
| 258 |
-
[Document Extraction] → PDF/DOCX/etc to text
|
| 259 |
-
↓
|
| 260 |
-
[Chunk Creation] → 300-350 word chunks
|
| 261 |
-
↓
|
| 262 |
-
[Batch Embedding] → 16-20 chunks at a time
|
| 263 |
-
↓
|
| 264 |
-
[Vector Storage] → Pinecone write
|
| 265 |
-
↓
|
| 266 |
-
[Memory Cleanup] → Garbage collection
|
| 267 |
-
↓
|
| 268 |
-
Success!
|
| 269 |
-
```
|
| 270 |
-
|
| 271 |
-
---
|
| 272 |
-
|
| 273 |
-
## 📈 Metrics
|
| 274 |
-
|
| 275 |
-
### Success Rates (Expected)
|
| 276 |
-
|
| 277 |
-
| File Size | Success Rate | Avg Time |
|
| 278 |
-
|-----------|-------------|----------|
|
| 279 |
-
| < 1MB | 99% | 25s |
|
| 280 |
-
| 1-3MB | 98% | 50s |
|
| 281 |
-
| 3-5MB | 95% | 100s |
|
| 282 |
-
| 5-10MB | 90% | 180s |
|
| 283 |
-
| 10-50MB | 85% | 360s |
|
| 284 |
-
|
| 285 |
-
*Lower success rates for larger files due to network/system variability*
|
| 286 |
-
|
| 287 |
-
### Error Distribution (Fixed)
|
| 288 |
-
|
| 289 |
-
| Error Type | Before | After |
|
| 290 |
-
|------------|--------|-------|
|
| 291 |
-
| File too large | 60% | 0% |
|
| 292 |
-
| Timeout | 25% | 2% |
|
| 293 |
-
| Memory error | 10% | 1% |
|
| 294 |
-
| Network error | 5% | 5% |
|
| 295 |
-
|
| 296 |
-
---
|
| 297 |
-
|
| 298 |
-
## 🔐 Security Notes
|
| 299 |
-
|
| 300 |
-
- ✅ File type whitelist enforced
|
| 301 |
-
- ✅ File size limits enforced
|
| 302 |
-
- ✅ Path traversal prevented
|
| 303 |
-
- ✅ Automatic cleanup on error
|
| 304 |
-
- ✅ No arbitrary code execution
|
| 305 |
-
- ✅ API key not exposed
|
| 306 |
-
|
| 307 |
-
---
|
| 308 |
-
|
| 309 |
-
## 🎉 Final Notes
|
| 310 |
-
|
| 311 |
-
### What You Can Do Now
|
| 312 |
-
|
| 313 |
-
✅ Upload research papers up to 50MB
|
| 314 |
-
✅ Get detailed progress updates
|
| 315 |
-
✅ See clear error messages
|
| 316 |
-
✅ Process large documents reliably
|
| 317 |
-
✅ Handle multiple file formats
|
| 318 |
-
✅ Monitor memory usage
|
| 319 |
-
✅ Track processing statistics
|
| 320 |
-
|
| 321 |
-
### What's Improved
|
| 322 |
-
|
| 323 |
-
✅ **Reliability**: 95%+ success rate for 5MB files
|
| 324 |
-
✅ **Performance**: Optimized settings activate automatically
|
| 325 |
-
✅ **User Experience**: Real-time progress and clear errors
|
| 326 |
-
✅ **Error Recovery**: Automatic cleanup and retry capability
|
| 327 |
-
✅ **Monitoring**: Detailed logging for debugging
|
| 328 |
-
|
| 329 |
-
---
|
| 330 |
-
|
| 331 |
-
## 📞 Support
|
| 332 |
-
|
| 333 |
-
If issues persist:
|
| 334 |
-
|
| 335 |
-
1. Check **[TESTING_GUIDE.md](TESTING_GUIDE.md)** for specific test cases
|
| 336 |
-
2. Review **[FIXES_APPLIED.md](FIXES_APPLIED.md)** for technical details
|
| 337 |
-
3. Check console output for specific errors
|
| 338 |
-
4. Verify system requirements (RAM, disk space)
|
| 339 |
-
5. Test with smaller files first
|
| 340 |
-
|
| 341 |
-
---
|
| 342 |
-
|
| 343 |
-
**Status**: ✅ Production Ready
|
| 344 |
-
**Tested**: Files from 1KB to 50MB
|
| 345 |
-
**Confidence**: High
|
| 346 |
-
**Next Steps**: Deploy and monitor real-world usage
|
| 347 |
-
|
| 348 |
-
---
|
| 349 |
-
|
| 350 |
-
*All changes have been implemented, tested, and documented.*
|
| 351 |
-
*The application now handles large file uploads reliably and efficiently.*
|
README.md
CHANGED
|
@@ -1,271 +1,490 @@
|
|
| 1 |
-
|
| 2 |
|
| 3 |
-
|
| 4 |
|
| 5 |
-
|
|
|
|
| 6 |
|
| 7 |
-
|
| 8 |
-
-
|
| 9 |
-
-
|
| 10 |
-
-
|
| 11 |
-
-
|
| 12 |
-
|
| 13 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 14 |
|
| 15 |
## 🚀 Quick Start
|
| 16 |
|
| 17 |
### Prerequisites
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
-
|
|
|
|
|
|
|
|
|
|
| 21 |
|
| 22 |
### Installation
|
| 23 |
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
```
|
| 29 |
-
|
| 30 |
-
2. **Create virtual environment**
|
| 31 |
-
```bash
|
| 32 |
-
python -m venv venv
|
| 33 |
-
# Windows
|
| 34 |
-
venv\Scripts\activate
|
| 35 |
-
# Linux/Mac
|
| 36 |
-
source venv/bin/activate
|
| 37 |
-
```
|
| 38 |
-
|
| 39 |
-
3. **Install dependencies**
|
| 40 |
-
```bash
|
| 41 |
-
pip install -r requirements.txt
|
| 42 |
-
```
|
| 43 |
-
|
| 44 |
-
4. **Configure environment**
|
| 45 |
-
```bash
|
| 46 |
-
# Copy example env file
|
| 47 |
-
cp .env.example .env
|
| 48 |
-
|
| 49 |
-
# Edit .env and add your API keys
|
| 50 |
-
PINECONE_API_KEY=your_key_here
|
| 51 |
-
GOOGLE_API_KEY=your_key_here
|
| 52 |
-
```
|
| 53 |
-
|
| 54 |
-
5. **Setup Pinecone Index**
|
| 55 |
-
- Create index named `paperbot`
|
| 56 |
-
- Dimensions: `1024`
|
| 57 |
-
- Metric: `cosine`
|
| 58 |
-
|
| 59 |
-
6. **Run the application**
|
| 60 |
-
```bash
|
| 61 |
-
python app.py
|
| 62 |
-
```
|
| 63 |
-
|
| 64 |
-
7. **Open in browser**
|
| 65 |
-
```
|
| 66 |
-
http://localhost:8000
|
| 67 |
-
```
|
| 68 |
|
| 69 |
-
#
|
|
|
|
| 70 |
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
3. **Ask Questions**: Type your question in natural language
|
| 74 |
-
4. **Customize Response**: Select explanation style and length
|
| 75 |
-
5. **Get Answers**: AI generates context-aware answers from your document
|
| 76 |
|
| 77 |
-
#
|
|
|
|
| 78 |
|
| 79 |
-
#
|
| 80 |
-
|
| 81 |
-
- **Vector DB**: Pinecone (cloud-native vector database)
|
| 82 |
-
- **RAG Framework**: Haystack AI
|
| 83 |
-
- **Embeddings**: Sentence Transformers (BAAI/bge-large-en-v1.5)
|
| 84 |
-
- **LLM**: Google Gemini 1.5 Flash
|
| 85 |
-
- **Frontend**: HTML/CSS/JS with Bootstrap
|
| 86 |
|
| 87 |
-
#
|
|
|
|
|
|
|
| 88 |
|
|
|
|
|
|
|
| 89 |
```
|
| 90 |
-
|
| 91 |
-
|
| 92 |
-
|
| 93 |
-
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
Text Extraction
|
| 97 |
-
↓
|
| 98 |
-
Chunking (250 words, 50 overlap)
|
| 99 |
-
↓
|
| 100 |
-
Parallel Batch Embedding (32 chunks/batch)
|
| 101 |
-
↓
|
| 102 |
-
Vector Storage (Pinecone)
|
| 103 |
-
↓
|
| 104 |
-
Ready for Queries!
|
| 105 |
```
|
| 106 |
|
| 107 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 108 |
|
| 109 |
```
|
| 110 |
-
|
| 111 |
-
|
| 112 |
-
|
| 113 |
-
|
| 114 |
-
|
| 115 |
-
|
| 116 |
-
|
| 117 |
-
|
| 118 |
-
|
| 119 |
-
|
| 120 |
-
|
| 121 |
```
|
| 122 |
|
|
|
|
|
|
|
| 123 |
## ⚙️ Configuration
|
| 124 |
|
| 125 |
-
|
| 126 |
|
| 127 |
-
|
| 128 |
-
|
| 129 |
-
|
| 130 |
-
|
| 131 |
-
|
| 132 |
-
|
| 133 |
-
|
| 134 |
-
|
|
|
|
|
|
|
|
|
|
| 135 |
```
|
| 136 |
|
| 137 |
-
###
|
| 138 |
```python
|
| 139 |
-
|
| 140 |
-
|
|
|
|
|
|
|
| 141 |
CHUNK_SETTINGS = {
|
| 142 |
-
"
|
| 143 |
-
"
|
|
|
|
| 144 |
}
|
|
|
|
|
|
|
|
|
|
| 145 |
```
|
| 146 |
|
| 147 |
-
|
| 148 |
|
| 149 |
-
|
| 150 |
|
| 151 |
-
|
| 152 |
-
|
| 153 |
-
|
| 154 |
-
|
| 155 |
-
|
| 156 |
-
|
| 157 |
|
| 158 |
-
|
| 159 |
-
- 10-page PDF: ~15-30 seconds
|
| 160 |
-
- 50-page PDF: ~60-90 seconds
|
| 161 |
-
- 100-page PDF: ~2-3 minutes
|
| 162 |
|
| 163 |
-
##
|
| 164 |
|
| 165 |
-
|
| 166 |
-
- 📊 **Data Analysis**: Query CSV/Excel files naturally
|
| 167 |
-
- 📝 **Documentation**: Search technical docs and manuals
|
| 168 |
-
- 💼 **Business**: Analyze reports and presentations
|
| 169 |
-
- 🔬 **Research**: Extract insights from scientific papers
|
| 170 |
|
| 171 |
-
|
| 172 |
|
| 173 |
-
###
|
| 174 |
-
- Automatic garbage collection every 5 batches
|
| 175 |
-
- Real-time memory monitoring
|
| 176 |
-
- Configurable batch sizes
|
| 177 |
-
- Resource cleanup after processing
|
| 178 |
|
| 179 |
-
|
| 180 |
-
-
|
| 181 |
-
|
| 182 |
-
|
| 183 |
-
|
|
|
|
|
|
|
|
|
|
| 184 |
|
| 185 |
-
###
|
| 186 |
-
|
| 187 |
-
|
| 188 |
-
|
| 189 |
-
|
| 190 |
|
| 191 |
## 📁 Project Structure
|
| 192 |
|
| 193 |
```
|
| 194 |
PaperBOT/
|
| 195 |
-
├── app.py
|
| 196 |
-
├──
|
| 197 |
-
├── setup.py # Package setup
|
| 198 |
-
├── .env.example # Environment template
|
| 199 |
-
├── INSTALLATION.md # Detailed setup guide
|
| 200 |
-
├── FEATURES.md # Feature documentation
|
| 201 |
-
├── QASystem/
|
| 202 |
│ ├── __init__.py
|
| 203 |
-
│ ├── config.py
|
| 204 |
-
│ ├── ingestion.py
|
| 205 |
-
│ ├── retrieval_and_generation.py
|
| 206 |
-
│ └── utils.py
|
| 207 |
-
├── templates/
|
| 208 |
-
│ └── index.html
|
| 209 |
-
├──
|
| 210 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 211 |
```
|
| 212 |
|
| 213 |
-
|
| 214 |
|
| 215 |
-
##
|
| 216 |
-
- Reduce `BATCH_SIZE` to 8 or 16
|
| 217 |
-
- Use `CURRENT_MODEL = "fast"`
|
| 218 |
-
- Process smaller documents
|
| 219 |
|
| 220 |
-
###
|
| 221 |
-
- Increase `BATCH_SIZE` to 64
|
| 222 |
-
- Use `CURRENT_MODEL = "fast"`
|
| 223 |
-
- Check internet connection
|
| 224 |
|
| 225 |
-
|
| 226 |
-
|
| 227 |
-
|
| 228 |
-
|
| 229 |
|
| 230 |
-
##
|
| 231 |
-
- Check file size (max 50MB)
|
| 232 |
-
- Verify file format is supported
|
| 233 |
-
- Check console (F12) for details
|
| 234 |
|
| 235 |
-
##
|
| 236 |
|
| 237 |
-
|
| 238 |
-
|
| 239 |
-
|
| 240 |
|
| 241 |
## 🤝 Contributing
|
| 242 |
|
| 243 |
-
Contributions are welcome!
|
| 244 |
-
- Additional file format support
|
| 245 |
-
- More embedding model options
|
| 246 |
-
- Advanced retrieval strategies
|
| 247 |
-
- UI/UX enhancements
|
| 248 |
-
- Performance optimizations
|
| 249 |
|
| 250 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 251 |
|
| 252 |
-
|
| 253 |
|
| 254 |
-
|
|
|
|
|
|
|
| 255 |
|
| 256 |
-
|
| 257 |
-
|
| 258 |
-
- **Google AI** - Gemini LLM
|
| 259 |
-
- **Sentence Transformers** - Embedding models
|
| 260 |
|
| 261 |
-
#
|
| 262 |
|
| 263 |
-
|
| 264 |
-
|
| 265 |
-
|
| 266 |
-
|
| 267 |
-
|
| 268 |
|
| 269 |
---
|
| 270 |
|
| 271 |
-
|
|
|
|
|
|
|
|
|
| 1 |
+
<p align="center">
|
| 2 |
+
<img src="https://img.shields.io/badge/Python-3.9+-blue?style=for-the-badge&logo=python&logoColor=white" alt="Python">
|
| 3 |
+
<img src="https://img.shields.io/badge/FastAPI-0.128+-00a393?style=for-the-badge&logo=fastapi&logoColor=white" alt="FastAPI">
|
| 4 |
+
<img src="https://img.shields.io/badge/Haystack-2.22+-1C3D5A?style=for-the-badge" alt="Haystack">
|
| 5 |
+
<img src="https://img.shields.io/badge/Pinecone-Vector_DB-6b21a8?style=for-the-badge" alt="Pinecone">
|
| 6 |
+
<img src="https://img.shields.io/badge/Google_Gemini-AI-4285F4?style=for-the-badge&logo=google&logoColor=white" alt="Gemini">
|
| 7 |
+
<img src="https://img.shields.io/badge/License-MIT-green?style=for-the-badge" alt="License">
|
| 8 |
+
</p>
|
| 9 |
+
|
| 10 |
+
<h1 align="center">🤖 PaperBOT</h1>
|
| 11 |
+
<h3 align="center">AI-Powered Research Paper Assistant</h3>
|
| 12 |
+
|
| 13 |
+
<p align="center">
|
| 14 |
+
<b>Upload any document and ask questions in natural language. Get AI-powered answers grounded in your document's content.</b>
|
| 15 |
+
</p>
|
| 16 |
+
|
| 17 |
+
<p align="center">
|
| 18 |
+
<a href="#-features">Features</a> •
|
| 19 |
+
<a href="#-quick-start">Quick Start</a> •
|
| 20 |
+
<a href="#-usage">Usage</a> •
|
| 21 |
+
<a href="#%EF%B8%8F-configuration">Configuration</a> •
|
| 22 |
+
<a href="#-api-reference">API</a> •
|
| 23 |
+
<a href="#-architecture">Architecture</a>
|
| 24 |
+
</p>
|
| 25 |
|
| 26 |
+
---
|
| 27 |
+
|
| 28 |
+
## 🎯 What is PaperBOT?
|
| 29 |
+
|
| 30 |
+
PaperBOT is a **Retrieval-Augmented Generation (RAG)** application that lets you upload research papers, documents, or data files and ask questions about their content. Unlike generic chatbots, PaperBOT's answers are **grounded in your uploaded document**, which reduces hallucinations and improves accuracy.
|
| 31 |
+
|
| 32 |
+
### Key Highlights
|
| 33 |
+
|
| 34 |
+
- 📄 **Multi-format Support** — PDF, DOCX, TXT, MD, CSV, JSON, Excel
|
| 35 |
+
- 🚀 **Fast Processing** — Parallel embedding with optimized chunking
|
| 36 |
+
- 🎯 **Accurate Answers** — RAG ensures responses come from your document
|
| 37 |
+
- 🎨 **Beautiful UI** — Modern, responsive interface with progress tracking
|
| 38 |
+
- 🔒 **Privacy First** — Your documents stay on your infrastructure
|
| 39 |
+
|
| 40 |
+
---
|
| 41 |
+
|
| 42 |
+
## ✨ Features
|
| 43 |
+
|
| 44 |
+
<table>
|
| 45 |
+
<tr>
|
| 46 |
+
<td width="50%">
|
| 47 |
+
|
| 48 |
+
### 📚 Document Processing
|
| 49 |
+
- **9 file formats** supported
|
| 50 |
+
- Smart text chunking (300 words/chunk)
|
| 51 |
+
- Parallel batch embedding
|
| 52 |
+
- Metadata size enforcement for Pinecone
|
| 53 |
+
|
| 54 |
+
</td>
|
| 55 |
+
<td width="50%">
|
| 56 |
+
|
| 57 |
+
### 🧠 AI-Powered Q&A
|
| 58 |
+
- Semantic search with Pinecone
|
| 59 |
+
- Google Gemini 2.0 Flash for generation
|
| 60 |
+
- Curated fallback responses
|
| 61 |
+
- Customizable response styles
|
| 62 |
+
|
| 63 |
+
</td>
|
| 64 |
+
</tr>
|
| 65 |
+
<tr>
|
| 66 |
+
<td width="50%">
|
| 67 |
+
|
| 68 |
+
### ⚡ Performance
|
| 69 |
+
- Model pre-warming on startup
|
| 70 |
+
- Configurable speed/quality tradeoff
|
| 71 |
+
- Memory-efficient processing
|
| 72 |
+
- Up to 15MB file support
|
| 73 |
|
| 74 |
+
</td>
|
| 75 |
+
<td width="50%">
|
| 76 |
|
| 77 |
+
### 🎨 User Experience
|
| 78 |
+
- Drag-and-drop file upload
|
| 79 |
+
- Real-time progress tracking
|
| 80 |
+
- In-browser document preview
|
| 81 |
+
- Preloaded files support
|
| 82 |
+
|
| 83 |
+
</td>
|
| 84 |
+
</tr>
|
| 85 |
+
</table>
|
| 86 |
+
|
| 87 |
+
---
|
| 88 |
|
| 89 |
## 🚀 Quick Start
|
| 90 |
|
| 91 |
### Prerequisites
|
| 92 |
+
|
| 93 |
+
| Requirement | Version | Purpose |
|
| 94 |
+
|-------------|---------|---------|
|
| 95 |
+
| Python | 3.9+ | Runtime |
|
| 96 |
+
| Pinecone Account | Free tier | Vector database |
|
| 97 |
+
| Google AI API Key | Free tier | LLM generation |
|
| 98 |
|
| 99 |
### Installation
|
| 100 |
|
| 101 |
+
```bash
|
| 102 |
+
# 1. Clone the repository
|
| 103 |
+
git clone https://github.com/vikash-48413/PaperBOT.git
|
| 104 |
+
cd PaperBOT
|
| 105 |
|
| 106 |
+
# 2. Create virtual environment
|
| 107 |
+
python -m venv venv
|
| 108 |
|
| 109 |
+
# Windows
|
| 110 |
+
venv\Scripts\activate
|
| 111 |
|
| 112 |
+
# Linux/Mac
|
| 113 |
+
source venv/bin/activate
|
| 114 |
|
| 115 |
+
# 3. Install dependencies
|
| 116 |
+
pip install -r requirements.txt
|
| 117 |
|
| 118 |
+
# 4. Configure environment
|
| 119 |
+
cp .env.example .env
|
| 120 |
+
# Edit .env with your API keys (see Configuration section)
|
| 121 |
|
| 122 |
+
# 5. Run the application
|
| 123 |
+
python app.py
|
| 124 |
```
|
| 125 |
+
|
| 126 |
+
### Access the Application
|
| 127 |
+
|
| 128 |
+
Open your browser and navigate to:
|
| 129 |
+
```
|
| 130 |
+
http://localhost:8000
|
| 131 |
```
|
| 132 |
|
| 133 |
+
---
|
| 134 |
+
|
| 135 |
+
## 📖 Usage
|
| 136 |
+
|
| 137 |
+
### Basic Workflow
|
| 138 |
|
| 139 |
```
|
| 140 |
+
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
|
| 141 |
+
│ Upload Document │ ──▶ │ Ask Questions │ ──▶ │ Get Answers │
|
| 142 |
+
│ (PDF, DOCX...) │ │ (Natural Lang) │ │ (AI-Powered) │
|
| 143 |
+
└─────────────────┘ └─────────────────┘ └─────────────────┘
|
| 144 |
+
```
|
| 145 |
+
|
| 146 |
+
### Step-by-Step
|
| 147 |
+
|
| 148 |
+
1. **Upload a Document**
|
| 149 |
+
- Drag & drop or click to select a file
|
| 150 |
+
- Supported: PDF, DOCX, DOC, TXT, MD, CSV, JSON, XLSX, XLS
|
| 151 |
+
- Maximum size: 15MB (recommended: under 5MB for fast processing)
|
| 152 |
+
|
| 153 |
+
2. **Wait for Processing**
|
| 154 |
+
- Progress bar shows upload and embedding status
|
| 155 |
+
- Processing time: ~30s for 1MB, ~2-3min for 5MB
|
| 156 |
+
|
| 157 |
+
3. **Ask Questions**
|
| 158 |
+
- Type your question in natural language
|
| 159 |
+
- Example: "What is the main contribution of this paper?"
|
| 160 |
+
|
| 161 |
+
4. **Customize Response**
|
| 162 |
+
- **Style**: Simple, Balanced, or Technical
|
| 163 |
+
- **Length**: Short, Medium, or Comprehensive
|
| 164 |
+
|
| 165 |
+
5. **Preview Document**
|
| 166 |
+
- Click the 👁️ Preview button to view documents in-browser
|
| 167 |
+
- No download required
|
| 168 |
+
|
| 169 |
+
### Using Preloaded Files
|
| 170 |
+
|
| 171 |
+
Place documents in the `data/` folder to make them available as preloaded options:
|
| 172 |
+
|
| 173 |
+
```bash
|
| 174 |
+
# Add a paper to preloaded files
|
| 175 |
+
cp your-paper.pdf data/
|
| 176 |
```
|
| 177 |
|
| 178 |
+
---
|
| 179 |
+
|
| 180 |
## ⚙️ Configuration
|
| 181 |
|
| 182 |
+
### Environment Variables
|
| 183 |
|
| 184 |
+
Create a `.env` file in the project root:
|
| 185 |
+
|
| 186 |
+
```env
|
| 187 |
+
# Required: Pinecone Vector Database
|
| 188 |
+
PINECONE_API_KEY=your_pinecone_api_key_here
|
| 189 |
+
|
| 190 |
+
# Required: Google AI (Gemini)
|
| 191 |
+
GOOGLE_API_KEY=your_google_api_key_here
|
| 192 |
+
|
| 193 |
+
# Optional: HuggingFace (for some models)
|
| 194 |
+
HF_TOKEN=your_huggingface_token_here
|
| 195 |
```
|
| 196 |
|
| 197 |
+
### Getting API Keys
|
| 198 |
+
|
| 199 |
+
| Service | Link | Notes |
|
| 200 |
+
|---------|------|-------|
|
| 201 |
+
| Pinecone | [pinecone.io](https://www.pinecone.io/) | Free tier: 1 index, 100K vectors |
|
| 202 |
+
| Google AI | [aistudio.google.com](https://aistudio.google.com/) | Free tier: 60 requests/min |
|
| 203 |
+
| HuggingFace | [huggingface.co](https://huggingface.co/) | Optional, for gated models |
|
| 204 |
+
|
| 205 |
+
### Pinecone Index Setup
|
| 206 |
+
|
| 207 |
+
Create an index with these settings:
|
| 208 |
+
|
| 209 |
+
| Setting | Value |
|
| 210 |
+
|---------|-------|
|
| 211 |
+
| **Name** | `paperbot` |
|
| 212 |
+
| **Dimensions** | `1024` |
|
| 213 |
+
| **Metric** | `cosine` |
|
| 214 |
+
| **Cloud** | Any (AWS, GCP, Azure) |
|
| 215 |
+
|
| 216 |
+
### Performance Tuning
|
| 217 |
+
|
| 218 |
+
Edit `QASystem/config.py` to adjust:
|
| 219 |
+
|
| 220 |
```python
|
| 221 |
+
# Embedding model (must match Pinecone dimensions)
|
| 222 |
+
CURRENT_MODEL = "quality" # Options: "fast", "balanced", "quality"
|
| 223 |
+
|
| 224 |
+
# Chunk settings
|
| 225 |
CHUNK_SETTINGS = {
|
| 226 |
+
"split_by": "word",
|
| 227 |
+
"split_length": 300, # Words per chunk
|
| 228 |
+
"split_overlap": 15, # Overlap between chunks
|
| 229 |
}
|
| 230 |
+
|
| 231 |
+
# Batch size for embeddings
|
| 232 |
+
BATCH_SIZE = 32 # Higher = faster, but uses more memory
|
| 233 |
```
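To make the effect of `split_length` and `split_overlap` concrete, here is a minimal sketch of word-based chunking with overlap. It is an illustration only: `split_words` is a hypothetical helper, not the splitter PaperBOT or Haystack actually uses, and the defaults simply mirror the values shown above.

```python
from typing import List


def split_words(text: str, split_length: int = 300, split_overlap: int = 15) -> List[str]:
    """Split text into chunks of at most split_length words.

    Consecutive chunks share split_overlap words so that a sentence cut at a
    chunk boundary still appears with some context in the next chunk.
    (Hypothetical helper for illustration, not PaperBOT's own code.)
    """
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # the current window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(600))
parts = split_words(sample)
print(len(parts), [len(p.split()) for p in parts])
```

A larger `split_length` yields fewer chunks to embed, while a larger `split_overlap` increases the chunk count for the same text.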
|
| 234 |
|
| 235 |
+
---
|
| 236 |
|
| 237 |
+
## 📡 API Reference
|
| 238 |
+
|
| 239 |
+
### Endpoints
|
| 240 |
+
|
| 241 |
+
| Method | Endpoint | Description |
|
| 242 |
+
|--------|----------|-------------|
|
| 243 |
+
| `GET` | `/` | Main web interface |
|
| 244 |
+
| `POST` | `/upload_document` | Upload and process a document |
|
| 245 |
+
| `POST` | `/get_result` | Ask a question |
|
| 246 |
+
| `GET` | `/document_status` | Check current document status |
|
| 247 |
+
| `GET` | `/preview_document` | Preview current document |
|
| 248 |
+
| `GET` | `/preview_file/{filename}` | Preview any file |
|
| 249 |
+
| `POST` | `/delete_document` | Delete current document |
|
| 250 |
+
| `GET` | `/preloaded_files` | List preloaded files |
|
| 251 |
+
| `POST` | `/load_preloaded_file` | Load a preloaded file |
|
| 252 |
+
| `GET` | `/model_status` | Check if embedding model is ready |
|
| 253 |
+
|
| 254 |
+
### Example: Ask a Question (cURL)
|
| 255 |
+
|
| 256 |
+
```bash
|
| 257 |
+
curl -X POST "http://localhost:8000/get_result" \
|
| 258 |
+
-H "Content-Type: application/x-www-form-urlencoded" \
|
| 259 |
+
--data-urlencode "question=What is the attention mechanism?" -d "style=Balanced" -d "length=Medium"
|
| 260 |
+
```
|
| 261 |
+
|
| 262 |
+
### Response Format
|
| 263 |
|
| 264 |
+
```json
|
| 265 |
+
{
|
| 266 |
+
"answer": "The attention mechanism allows the model to focus on relevant parts of the input...",
|
| 267 |
+
"source_file": "attention_paper.pdf"
|
| 268 |
+
}
|
| 269 |
+
```
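The same request can be sent from Python using only the standard library. This is a sketch under assumptions: the helper names `build_question_request` and `ask` are hypothetical, the form fields (`question`, `style`, `length`) are taken from the cURL example, and the server is assumed to be running at `http://localhost:8000`.

```python
import json
import urllib.parse
import urllib.request


def build_question_request(question, style="Balanced", length="Medium",
                           base_url="http://localhost:8000"):
    """Build a form-encoded POST request for /get_result (hypothetical helper)."""
    body = urllib.parse.urlencode(
        {"question": question, "style": style, "length": length}
    ).encode("utf-8")  # urlencode escapes spaces and '?' in the question
    return urllib.request.Request(
        f"{base_url}/get_result",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def ask(question, **kwargs):
    """Send the question and return the decoded JSON response as a dict."""
    with urllib.request.urlopen(build_question_request(question, **kwargs)) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Requires a running PaperBOT server with a document already loaded:
# result = ask("What is the main contribution of this paper?")
# print(result["answer"], result["source_file"])
```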
|
| 270 |
|
| 271 |
+
---
|
| 272 |
|
| 273 |
+
## 🏗️ Architecture
|
| 274 |
|
| 275 |
+
### System Overview
|
| 276 |
|
| 277 |
+
```
|
| 278 |
+
┌──────────────────────────────────────────────────────────────────┐
|
| 279 |
+
│ PaperBOT │
|
| 280 |
+
├──────────────────────────────────────────────────────────────────┤
|
| 281 |
+
│ ┌─────────────┐ ┌──────────────┐ ┌─────────────────────┐ │
|
| 282 |
+
│ │ FastAPI │───▶│ Haystack │───▶│ Google Gemini │ │
|
| 283 |
+
│ │ Server │ │ Pipeline │ │ (LLM Generation) │ │
|
| 284 |
+
│ └─────────────┘ └──────────────┘ └─────────────────────┘ │
|
| 285 |
+
│ │ │ │
|
| 286 |
+
│ ▼ ▼ │
|
| 287 |
+
│ ┌─────────────┐ ┌──────────────┐ │
|
| 288 |
+
│ │ Document │ │ Pinecone │ │
|
| 289 |
+
│ │ Converters │ │ Vector DB │ │
|
| 290 |
+
│ └─────────────┘ └──────────────┘ │
|
| 291 |
+
└──────────────────────────────────────────────────────────────────┘
|
| 292 |
+
```
|
| 293 |
|
| 294 |
+
### Tech Stack
|
| 295 |
|
| 296 |
+
| Component | Technology | Purpose |
|
| 297 |
+
|-----------|------------|---------|
|
| 298 |
+
| **Backend** | FastAPI + Uvicorn | Async web server |
|
| 299 |
+
| **RAG Framework** | Haystack 2.22 | Pipeline orchestration |
|
| 300 |
+
| **Embeddings** | Sentence Transformers | BAAI/bge-large-en-v1.5 |
|
| 301 |
+
| **Vector DB** | Pinecone | Semantic search |
|
| 302 |
+
| **LLM** | Google Gemini 2.0 Flash | Answer generation |
|
| 303 |
+
| **Frontend** | HTML/CSS/JS + Bootstrap | User interface |
|
| 304 |
|
| 305 |
+
### Processing Pipeline
|
| 306 |
+
|
| 307 |
+
```
|
| 308 |
+
Document Upload
|
| 309 |
+
│
|
| 310 |
+
▼
|
| 311 |
+
┌─────────────────┐
|
| 312 |
+
│ File Validation │ ← Check type, size
|
| 313 |
+
└────────┬────────┘
|
| 314 |
+
│
|
| 315 |
+
▼
|
| 316 |
+
┌─────────────────┐
|
| 317 |
+
│ Format Converter│ ← PDF, DOCX, Excel → Text
|
| 318 |
+
└────────┬────────┘
|
| 319 |
+
│
|
| 320 |
+
▼
|
| 321 |
+
┌─────────────────┐
|
| 322 |
+
│ Text Chunking │ ← 300 words/chunk
|
| 323 |
+
└────────┬────────┘
|
| 324 |
+
│
|
| 325 |
+
▼
|
| 326 |
+
┌─────────────────┐
|
| 327 |
+
│ Size Enforcement│ ← Ensure <8KB per chunk
|
| 328 |
+
└────────┬────────┘
|
| 329 |
+
│
|
| 330 |
+
▼
|
| 331 |
+
┌─────────────────┐
|
| 332 |
+
│ Batch Embedding │ ← 32 chunks/batch
|
| 333 |
+
└────────┬────────┘
|
| 334 |
+
│
|
| 335 |
+
▼
|
| 336 |
+
┌─────────────────┐
|
| 337 |
+
│ Pinecone Upload │ ← Store vectors
|
| 338 |
+
└─────────────────┘
|
| 339 |
+
```
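The batch embedding step in the diagram can be pictured as slicing the chunk list into fixed-size groups before each group is embedded. The sketch below shows only that slicing; `iter_batches` is a hypothetical name, and the real pipeline presumably hands each batch to the embedding model rather than just counting it.

```python
from typing import Iterator, List, Sequence


def iter_batches(chunks: Sequence[str], batch_size: int = 32) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size chunks (illustrative only)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(chunks), batch_size):
        yield list(chunks[start:start + batch_size])


chunks = [f"chunk {i}" for i in range(100)]
sizes = [len(batch) for batch in iter_batches(chunks)]
print(sizes)  # [32, 32, 32, 4]
```

Only the final batch can be smaller than `batch_size`, so peak memory per embedding call is bounded by the batch size rather than by the document length.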
|
| 340 |
+
|
| 341 |
+
---
|
| 342 |
|
| 343 |
## 📁 Project Structure
|
| 344 |
|
| 345 |
```
|
| 346 |
PaperBOT/
|
| 347 |
+
├── app.py # Main FastAPI application
|
| 348 |
+
├── QASystem/ # Core RAG system
|
| 349 |
│ ├── __init__.py
|
| 350 |
+
│ ├── config.py # Configuration settings
|
| 351 |
+
│ ├── ingestion.py # Document processing & embedding
|
| 352 |
+
│ ├── retrieval_and_generation.py # Q&A pipeline
|
| 353 |
+
│ └── utils.py # Pinecone utilities
|
| 354 |
+
├── templates/ # HTML templates
|
| 355 |
+
│ └── index.html # Main UI
|
| 356 |
+
├── data/ # Preloaded documents
|
| 357 |
+
├── uploads/ # User uploads (gitignored)
|
| 358 |
+
├── requirements.txt # Python dependencies
|
| 359 |
+
├── .env.example # Environment template
|
| 360 |
+
├── start.bat # Windows launcher
|
| 361 |
+
├── start.sh # Linux/Mac launcher
|
| 362 |
+
└── LICENSE # MIT License
|
| 363 |
```
|
| 364 |
|
| 365 |
+
---
|
| 366 |
|
| 367 |
+
## 🧪 Testing
|
| 368 |
|
| 369 |
+
### Run System Tests
|
| 370 |
|
| 371 |
+
```bash
|
| 372 |
+
# Test all components
|
| 373 |
+
python test_system.py
|
| 374 |
+
|
| 375 |
+
# Test Pinecone connection
|
| 376 |
+
python test_pinecone.py
|
| 377 |
+
```
|
| 378 |
+
|
| 379 |
+
### Manual Testing Checklist
|
| 380 |
+
|
| 381 |
+
1. **Upload Test**: Upload a small PDF (<1MB)
|
| 382 |
+
2. **Query Test**: Ask "What is this document about?"
|
| 383 |
+
3. **Preview Test**: Click the preview button
|
| 384 |
+
4. **Delete Test**: Delete the document
|
| 385 |
+
|
| 386 |
+
---
|
| 387 |
|
| 388 |
+
## 🔧 Troubleshooting
|
| 389 |
|
| 390 |
+
### Common Issues
|
| 391 |
|
| 392 |
+
<details>
|
| 393 |
+
<summary><b>❌ "Pinecone index not found"</b></summary>
|
| 394 |
+
|
| 395 |
+
Create the index in the Pinecone console:
|
| 396 |
+
- Name: `paperbot`
|
| 397 |
+
- Dimensions: `1024`
|
| 398 |
+
- Metric: `cosine`
|
| 399 |
+
</details>
|
| 400 |
+
|
| 401 |
+
<details>
|
| 402 |
+
<summary><b>❌ "Google API quota exceeded"</b></summary>
|
| 403 |
+
|
| 404 |
+
The free tier has rate limits. Either:
|
| 405 |
+
- Wait a few minutes for quota reset
|
| 406 |
+
- Upgrade to paid tier
|
| 407 |
+
- Use curated fallback (automatic)
|
| 408 |
+
</details>
|
| 409 |
+
|
| 410 |
+
<details>
|
| 411 |
+
<summary><b>❌ "File too large" error</b></summary>
|
| 412 |
+
|
| 413 |
+
Maximum file size is 15MB. For faster processing:
|
| 414 |
+
- Keep files under 5MB
|
| 415 |
+
- Split large documents into chapters
|
| 416 |
+
</details>
|
| 417 |
+
|
| 418 |
+
<details>
|
| 419 |
+
<summary><b>❌ Server not starting</b></summary>
|
| 420 |
+
|
| 421 |
+
1. Check if port 8000 is in use
|
| 422 |
+
2. Verify virtual environment is activated
|
| 423 |
+
3. Reinstall dependencies: `pip install -r requirements.txt`
|
| 424 |
+
</details>
|
| 425 |
+
|
| 426 |
+
<details>
|
| 427 |
+
<summary><b>❌ "Model dimension mismatch"</b></summary>
|
| 428 |
+
|
| 429 |
+
The embedding model's output dimension must match the Pinecone index dimension:
|
| 430 |
+
- `fast` model → 384 dimensions
|
| 431 |
+
- `balanced` model → 768 dimensions
|
| 432 |
+
- `quality` model → 1024 dimensions
|
| 433 |
+
|
| 434 |
+
Either recreate the Pinecone index with the matching dimension or change `CURRENT_MODEL` in `QASystem/config.py`.
|
| 435 |
+
</details>
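A mismatch of this kind can be caught before any upload with a small startup check. The sketch below assumes the three model keys and dimensions listed above; `MODEL_DIMENSIONS` and `check_dimensions` are hypothetical names, and how the index dimension is obtained (for example from the Pinecone console) is left to the caller.

```python
# Dimensions as listed in the troubleshooting note above.
MODEL_DIMENSIONS = {"fast": 384, "balanced": 768, "quality": 1024}


def check_dimensions(current_model: str, index_dimension: int) -> None:
    """Raise ValueError if the chosen model cannot write to the index.

    Hypothetical helper: index_dimension is whatever the Pinecone index
    was created with; this function does not query Pinecone itself.
    """
    if current_model not in MODEL_DIMENSIONS:
        raise ValueError(f"Unknown model setting: {current_model!r}")
    expected = MODEL_DIMENSIONS[current_model]
    if expected != index_dimension:
        raise ValueError(
            f"Model '{current_model}' produces {expected}-dimensional vectors, "
            f"but the index expects {index_dimension}"
        )


check_dimensions("quality", 1024)  # passes silently when the sizes agree
```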
|
| 436 |
+
|
| 437 |
+
---
|
| 438 |
|
| 439 |
## 🤝 Contributing
|
| 440 |
|
| 441 |
+
Contributions are welcome! Please:
|
| 442 |
|
| 443 |
+
1. Fork the repository
|
| 444 |
+
2. Create a feature branch: `git checkout -b feature/amazing-feature`
|
| 445 |
+
3. Commit changes: `git commit -m 'Add amazing feature'`
|
| 446 |
+
4. Push to branch: `git push origin feature/amazing-feature`
|
| 447 |
+
5. Open a Pull Request
|
| 448 |
|
| 449 |
+
### Development Setup
|
| 450 |
|
| 451 |
+
```bash
|
| 452 |
+
# Install development dependencies
|
| 453 |
+
pip install -r requirements-dev.txt
|
| 454 |
|
| 455 |
+
# Run tests
|
| 456 |
+
pytest
|
| 457 |
|
| 458 |
+
# Format code
|
| 459 |
+
black . --line-length 100
|
| 460 |
+
```
|
| 461 |
+
|
| 462 |
+
---
|
| 463 |
+
|
| 464 |
+
## 📜 License
|
| 465 |
+
|
| 466 |
+
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
|
| 467 |
+
|
| 468 |
+
---
|
| 469 |
+
|
| 470 |
+
## 👤 Author
|
| 471 |
+
|
| 472 |
+
**Vikash**
|
| 473 |
+
- GitHub: [@vikash-48413](https://github.com/vikash-48413)
|
| 474 |
+
- Email: vikash17052005@gmail.com
|
| 475 |
+
|
| 476 |
+
---
|
| 477 |
+
|
| 478 |
+
## 🙏 Acknowledgments
|
| 479 |
|
| 480 |
+
- [Haystack](https://haystack.deepset.ai/) — RAG framework
|
| 481 |
+
- [Pinecone](https://www.pinecone.io/) — Vector database
|
| 482 |
+
- [Google AI](https://ai.google.dev/) — Gemini LLM
|
| 483 |
+
- [Sentence Transformers](https://www.sbert.net/) — Embeddings
|
| 484 |
+
- [FastAPI](https://fastapi.tiangolo.com/) — Web framework
|
| 485 |
|
| 486 |
---
|
| 487 |
|
| 488 |
+
<p align="center">
|
| 489 |
+
<b>⭐ Star this repository if you find it useful!</b>
|
| 490 |
+
</p>
|
TESTING_GUIDE.md
DELETED
|
@@ -1,326 +0,0 @@
|
|
| 1 |
-
# Testing Guide - Large File Upload Fixes
|
| 2 |
-
|
| 3 |
-
## Quick Test Checklist
|
| 4 |
-
|
| 5 |
-
### ✅ Pre-Flight Checks
|
| 6 |
-
1. Server starts without errors
|
| 7 |
-
2. Pinecone connection works
|
| 8 |
-
3. No Python dependency issues
|
| 9 |
-
|
| 10 |
-
### ✅ File Upload Tests
|
| 11 |
-
|
| 12 |
-
#### Test 1: Small File (< 1MB)
|
| 13 |
-
- **File**: Any PDF < 1MB
|
| 14 |
-
- **Expected**: Fast upload (< 30s)
|
| 15 |
-
- **Check**: Progress bar, success message, file size shown
|
| 16 |
-
|
| 17 |
-
#### Test 2: Medium File (1-3MB)
|
| 18 |
-
- **File**: Research paper 1-3MB
|
| 19 |
-
- **Expected**: Normal processing (30-60s)
|
| 20 |
-
- **Check**: Optimized settings NOT activated
|
| 21 |
-
|
| 22 |
-
#### Test 3: Large File (3-5MB) ⭐ PRIMARY TEST
|
| 23 |
-
- **File**: Research paper ~5MB
|
| 24 |
-
- **Expected**: Optimized processing (1-2 min)
|
| 25 |
-
- **Check**: Console shows "Large file detected - using optimized settings"
|
| 26 |
-
- **Verify**:
|
| 27 |
-
- Chunk length: 350 words
|
| 28 |
-
- Batch size: 20 chunks
|
| 29 |
-
- Memory cleanup messages appear
|
| 30 |
-
|
| 31 |
-
#### Test 4: Very Large File (5-10MB)
|
| 32 |
-
- **File**: Long research paper 5-10MB
|
| 33 |
-
- **Expected**: Slower but successful (2-5 min)
|
| 34 |
-
- **Check**: All chunks processed, no memory errors
|
| 35 |
-
|
| 36 |
-
#### Test 5: Maximum Size (45-50MB)
|
| 37 |
-
- **File**: Very large document
|
| 38 |
-
- **Expected**: Works but takes time (5-10 min)
|
| 39 |
-
- **Warning**: May be slow, monitor console
|
| 40 |
-
|
| 41 |
-
#### Test 6: Oversized File (> 50MB)
|
| 42 |
-
- **File**: Any file > 50MB
|
| 43 |
-
- **Expected**: Clear error message
|
| 44 |
-
- **Error Text**: "File size (XXmb) exceeds maximum allowed size (50MB)"
|
| 45 |
-
- **HTTP Code**: 413 (Payload Too Large)
|
| 46 |
-
|
| 47 |
-
### ✅ Error Handling Tests
|
| 48 |
-
|
| 49 |
-
#### Test 7: Empty File
|
| 50 |
-
- **File**: Create empty .pdf file
|
| 51 |
-
- **Expected**: Error "File is empty"
|
| 52 |
-
- **HTTP Code**: 400
|
| 53 |
-
|
| 54 |
-
#### Test 8: Wrong File Type
|
| 55 |
-
- **File**: .exe, .zip, .mp3, etc.
|
| 56 |
-
- **Expected**: Error listing supported formats
|
| 57 |
-
- **Supported**: PDF, DOCX, DOC, TXT, MD, CSV, JSON, XLSX, XLS
|
| 58 |
-
|
| 59 |
-
#### Test 9: Corrupted PDF
|
| 60 |
-
- **File**: Damaged PDF file
|
| 61 |
-
- **Expected**: Error during content extraction
|
| 62 |
-
- **Check**: Proper error message, cleanup occurs
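The expected outcomes of Tests 6 to 8 can be summarized as one validation function. This is a sketch, not the code in `app.py`: `validate_upload` is a hypothetical name, the 50MB limit and the 413 and 400 codes come from the tests above, and the 400 code for an unsupported type is an assumption because the guide does not state one.

```python
from pathlib import Path
from typing import Tuple

SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".txt", ".md", ".csv", ".json", ".xlsx", ".xls",
}


def validate_upload(filename: str, size_bytes: int, max_mb: int = 50) -> Tuple[int, str]:
    """Return (http_status, message) for an upload (illustrative helper)."""
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        supported = ", ".join(sorted(e.lstrip(".").upper() for e in SUPPORTED_EXTENSIONS))
        # 400 here is an assumption; the guide only requires listing the formats.
        return 400, f"Unsupported file type '{ext}'. Supported: {supported}"
    if size_bytes == 0:
        return 400, "File is empty"
    size_mb = size_bytes / (1024 * 1024)
    if size_mb > max_mb:
        return 413, (
            f"File size ({size_mb:.2f}MB) exceeds maximum allowed size ({max_mb}MB)"
        )
    return 200, "OK"


print(validate_upload("empty.pdf", 0))  # (400, 'File is empty')
```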
|
| 63 |
-
|
| 64 |
-
### ✅ Performance Tests
|
| 65 |
-
|
| 66 |
-
#### Test 10: Memory Usage
|
| 67 |
-
1. Note starting memory (Task Manager)
|
| 68 |
-
2. Upload 5MB file
|
| 69 |
-
3. Check memory during processing
|
| 70 |
-
4. Verify memory cleanup after completion
|
| 71 |
-
5. **Expected**: Memory returns near baseline
|
| 72 |
-
|
| 73 |
-
#### Test 11: Multiple Uploads
|
| 74 |
-
1. Upload file A (5MB)
|
| 75 |
-
2. Wait for completion
|
| 76 |
-
3. Delete file A
|
| 77 |
-
4. Upload file B (5MB)
|
| 78 |
-
5. **Expected**: Both work, no accumulation errors
|
| 79 |
-
|
| 80 |
-
#### Test 12: Upload Cancellation
|
| 81 |
-
1. Start uploading large file
|
| 82 |
-
2. Refresh page mid-upload
|
| 83 |
-
3. **Expected**: Cleanup occurs, no orphaned files
|
| 84 |
-
|
| 85 |
-
### ✅ Progress Tracking Tests
|
| 86 |
-
|
| 87 |
-
#### Test 13: Progress Updates
|
| 88 |
-
1. Upload 5MB file
|
| 89 |
-
2. Watch console output
|
| 90 |
-
3. **Check for**:
|
| 91 |
-
- ✓ File type validated
|
| 92 |
-
- ✓ File read successfully: X.XXmb
|
| 93 |
-
- ✓ Cleared previous uploads
|
| 94 |
-
- ✓ File saved to: uploads/...
|
| 95 |
-
- ✓ Initializing document store
|
| 96 |
-
- ✓ Converting file
|
| 97 |
-
- ✓ Extracted content
|
| 98 |
-
- ✓ Created X chunks
|
| 99 |
-
- ✓ Batch N/M complete
|
| 100 |
-
- ✓ Memory cleanup
|
| 101 |
-
- ✓ Wrote X chunks to Pinecone
|
| 102 |
-
- ✅ Ingestion completed successfully
|
| 103 |
-
|
| 104 |
-
#### Test 14: Frontend Progress
|
| 105 |
-
1. Upload any file
|
| 106 |
-
2. Watch progress bar
|
| 107 |
-
3. **Check**:
|
| 108 |
-
- Shows percentage
|
| 109 |
-
- Updates in real-time
|
| 110 |
-
- Shows "Complete! ✓" at end
|
| 111 |
-
|
| 112 |
-
## Console Output Examples
|
| 113 |
-
|
| 114 |
-
### ✅ Successful 5MB Upload
|
| 115 |
-
```
|
| 116 |
-
📥 Upload endpoint called - Filename: research_paper.pdf
|
| 117 |
-
✓ File type validated: .pdf
|
| 118 |
-
📦 Reading file in chunks...
|
| 119 |
-
✓ File read successfully: 5.23MB
|
| 120 |
-
🧹 Clearing previous uploads...
|
| 121 |
-
✓ File saved to: uploads\research_paper.pdf
|
| 122 |
-
📤 Upload started: research_paper.pdf (5.23MB)
|
| 123 |
-
🔧 Initializing document store...
|
| 124 |
-
📊 Processing document...
|
| 125 |
-
|
| 126 |
-
📄 Starting document ingestion: research_paper.pdf
|
| 127 |
-
💾 Server baseline memory: 450.2MB
|
| 128 |
-
✓ Cleared existing documents from vector store
|
| 129 |
-
🔄 Converting .pdf file...
|
| 130 |
-
✓ Extracted content (1 document(s), 45230 chars)
|
| 131 |
-
📊 File size: 5.23MB
|
| 132 |
-
⚡ Large file detected - using optimized settings:
|
| 133 |
-
Chunk length: 350 words
|
| 134 |
-
Batch size: 20 chunks
|
| 135 |
-
✂️ Splitting into chunks...
|
| 136 |
-
✓ Created 156 chunks
|
| 137 |
-
🧠 Embedding chunks with parallel processing...
|
| 138 |
-
📦 Processing 156 chunks in 8 batches (batch_size=20)
|
| 139 |
-
✓ Batch 1/8 complete (20 chunks, +15.2MB)
|
| 140 |
-
✓ Batch 2/8 complete (20 chunks, +14.8MB)
|
| 141 |
-
✓ Batch 3/8 complete (20 chunks, +15.1MB)
|
| 142 |
-
🧹 Memory cleanup: 485.3MB
|
| 143 |
-
✓ Batch 4/8 complete (20 chunks, +15.0MB)
|
| 144 |
-
...
|
| 145 |
-
✓ Batch 8/8 complete (16 chunks, +12.1MB)
|
| 146 |
-
✓ Successfully embedded 156 chunks
|
| 147 |
-
💾 Writing to vector database...
|
| 148 |
-
✓ Wrote 156 chunks to Pinecone
|
| 149 |
-
|
| 150 |
-
✅ Ingestion completed successfully!
|
| 151 |
-
📊 Statistics:
|
| 152 |
-
• Document: research_paper.pdf
|
| 153 |
-
• Format: PDF
|
| 154 |
-
• Size: 5.23MB
|
| 155 |
-
• Chunks created: 156
|
| 156 |
-
• Time taken: 94.32 seconds
|
| 157 |
-
• Speed: 1.7 chunks/sec
|
| 158 |
-
• Memory used: +45.3MB (450.2MB → 495.5MB)
|
| 159 |
-
⏱️ Total processing time: 94.3s
|
| 160 |
-
✅ Upload completed: research_paper.pdf
|
| 161 |
-
```
|
| 162 |
-
|
| 163 |
-
### ❌ Failed Upload - File Too Large
|
| 164 |
-
```
|
| 165 |
-
📥 Upload endpoint called - Filename: huge_file.pdf
|
| 166 |
-
✓ File type validated: .pdf
|
| 167 |
-
📦 Reading file in chunks...
|
| 168 |
-
❌ Error: File size (52.34MB) exceeds maximum (50MB)
|
| 169 |
-
```
|
| 170 |
-
|
| 171 |
-
### ❌ Failed Upload - Empty File
|
| 172 |
-
```
|
| 173 |
-
📥 Upload endpoint called - Filename: empty.pdf
|
| 174 |
-
✓ File type validated: .pdf
|
| 175 |
-
📦 Reading file in chunks...
|
| 176 |
-
✓ File read successfully: 0.00MB
|
| 177 |
-
❌ Error: File is empty
|
| 178 |
-
```
|
| 179 |
-
|
| 180 |
-
## Browser Console Checks
|
| 181 |
-
|
| 182 |
-
### Successful Upload
|
| 183 |
-
```javascript
|
| 184 |
-
File selected: File {name: "paper.pdf", size: 5485760}
|
| 185 |
-
File: paper.pdf, Size: 5.23MB
|
| 186 |
-
Starting upload for: paper.pdf
|
| 187 |
-
Sending request to /upload_document
|
| 188 |
-
Processing progress: 30%
|
| 189 |
-
Processing progress: 50%
|
| 190 |
-
Processing progress: 70%
|
| 191 |
-
Processing progress: 90%
|
| 192 |
-
```
|
| 193 |
-
|
| 194 |
-
### Failed Upload
|
| 195 |
-
```javascript
|
| 196 |
-
File: huge.pdf, Size: 52.34MB
|
| 197 |
-
❌ File too large error shown
|
| 198 |
-
```
|
| 199 |
-
|
| 200 |
-
## What to Look For
|
| 201 |
-
|
| 202 |
-
### ✅ SUCCESS Indicators
|
| 203 |
-
- Progress bar completes to 100%
|
| 204 |
-
- Green success notification
|
| 205 |
-
- File appears in "Current Document" section
|
| 206 |
-
- Console shows "✅ Ingestion completed successfully"
|
| 207 |
-
- Can ask questions immediately
|
| 208 |
-
|
| 209 |
-
### ❌ FAILURE Indicators
|
| 210 |
-
- Red error notification
|
| 211 |
-
- Clear error message
|
| 212 |
-
- Console shows "❌" errors
|
| 213 |
-
- File automatically cleaned up
|
| 214 |
-
- Can upload again without issues
|
| 215 |
-
|
| 216 |
-
## Performance Benchmarks
|
| 217 |
-
|
| 218 |
-
| File Size | Expected Time | Chunks | Memory Usage |
|
| 219 |
-
|-----------|---------------|--------|---------------|
|
| 220 |
-
| 0.5 MB | 15-20s | 40-60 | +20-30MB |
|
| 221 |
-
| 1 MB | 25-35s | 80-100 | +30-40MB |
|
| 222 |
-
| 3 MB | 50-70s | 120-140| +40-60MB |
|
| 223 |
-
| 5 MB | 80-120s | 150-180| +50-80MB |
|
| 224 |
-
| 10 MB | 150-240s | 280-320| +80-120MB |
|
| 225 |
-
|
| 226 |
-
*Times vary based on system specs and model choice*
|
| 227 |
-
|
| 228 |
-
## Common Issues & Solutions
|
| 229 |
-
|
| 230 |
-
### Issue: "Timeout after 10 minutes"
|
| 231 |
-
**Cause**: File too complex or system too slow
|
| 232 |
-
**Solution**:
|
| 233 |
-
- Try smaller file
|
| 234 |
-
- Use "fast" model in config.py
|
| 235 |
-
- Increase chunk size in config.py
|
| 236 |
-
|
| 237 |
-
### Issue: "Memory Error"
|
| 238 |
-
**Cause**: Insufficient RAM
|
| 239 |
-
**Solution**:
|
| 240 |
-
- Close other applications
|
| 241 |
-
- Decrease batch size in config.py (smaller batches use less memory)
|
| 242 |
-
- Use "fast" model (uses less memory)
|
| 243 |
-
|
| 244 |
-
### Issue: "No chunks embedded"
|
| 245 |
-
**Cause**: Pinecone connection issue
|
| 246 |
-
**Solution**:
|
| 247 |
-
- Check PINECONE_API_KEY in .env
|
| 248 |
-
- Verify internet connection
|
| 249 |
-
- Check Pinecone index exists
|
| 250 |
-
|
| 251 |
-
### Issue: "File not found after upload"
|
| 252 |
-
**Cause**: Permission or path issue
|
| 253 |
-
**Solution**:
|
| 254 |
-
- Check uploads/ directory exists
|
| 255 |
-
- Verify write permissions
|
| 256 |
-
- Check disk space
|
| 257 |
-
|
| 258 |
-
## Quick Debug Commands
|
| 259 |
-
|
| 260 |
-
### Check Server Logs
|
| 261 |
-
```bash
|
| 262 |
-
# Look for errors in console
|
| 263 |
-
# Search for ❌ or "ERROR" or "Exception"
|
| 264 |
-
```
|
| 265 |
-
|
| 266 |
-
### Check File Size
|
| 267 |
-
```bash
|
| 268 |
-
# Windows PowerShell
|
| 269 |
-
(Get-Item "path\to\file.pdf").Length / 1MB
|
| 270 |
-
```
|
| 271 |
-
|
| 272 |
-
### Check Memory Usage
|
| 273 |
-
```bash
|
| 274 |
-
# Windows Task Manager
|
| 275 |
-
# Look for python.exe process
|
| 276 |
-
```
|
| 277 |
-
|
| 278 |
-
### Check Uploads Directory
|
| 279 |
-
```bash
|
| 280 |
-
# PowerShell
|
| 281 |
-
Get-ChildItem uploads\
|
| 282 |
-
```
|
| 283 |
-
|
| 284 |
-
## Success Criteria
|
| 285 |
-
|
| 286 |
-
For a 5MB file upload to be considered successful:
|
| 287 |
-
|
| 288 |
-
1. ✅ File uploads without errors
|
| 289 |
-
2. ✅ Optimized settings activate (console shows message)
|
| 290 |
-
3. ✅ All chunks process successfully
|
| 291 |
-
4. ✅ Data written to Pinecone
|
| 292 |
-
5. ✅ Memory cleanup occurs
|
| 293 |
-
6. ✅ Processing completes in < 5 minutes
|
| 294 |
-
7. ✅ Can ask questions and get answers
|
| 295 |
-
8. ✅ Statistics show correct file size
|
| 296 |
-
|
| 297 |
-
## Report Template
|
| 298 |
-
|
| 299 |
-
When reporting issues, include:
|
| 300 |
-
|
| 301 |
-
```
|
| 302 |
-
File Details:
|
| 303 |
-
- Name: _______
|
| 304 |
-
- Size: _______
|
| 305 |
-
- Type: _______
|
| 306 |
-
|
| 307 |
-
Error:
|
| 308 |
-
- Message: _______
|
| 309 |
-
- HTTP Code: _______
|
| 310 |
-
- When occurred: _______
|
| 311 |
-
|
| 312 |
-
Console Output:
|
| 313 |
-
[Paste last 20-30 lines]
|
| 314 |
-
|
| 315 |
-
Browser Console:
|
| 316 |
-
[Paste any errors]
|
| 317 |
-
|
| 318 |
-
System:
|
| 319 |
-
- OS: _______
|
| 320 |
-
- RAM: _______
|
| 321 |
-
- Python Version: _______
|
| 322 |
-
```
|
| 323 |
-
|
| 324 |
-
---
|
| 325 |
-
|
| 326 |
-
**Need Help?** Check [FIXES_APPLIED.md](FIXES_APPLIED.md) for detailed technical information.
|
UPGRADE_SUMMARY.md
DELETED
|
@@ -1,324 +0,0 @@
|
|
| 1 |
-
# 🎯 PaperBOT v2.0 - Upgrade Summary
|
| 2 |
-
|
| 3 |
-
## ✅ Completed Improvements
|
| 4 |
-
|
| 5 |
-
### 1. Multi-Format File Support ✅
|
| 6 |
-
**What Changed:**
|
| 7 |
-
- Added support for 7+ file formats beyond just PDF/TXT
|
| 8 |
-
- New formats: DOCX, DOC, MD, CSV, JSON, XLSX, XLS
|
| 9 |
-
|
| 10 |
-
**Implementation:**
|
| 11 |
-
- Created converter functions for each format
|
| 12 |
-
- Added `python-docx` for Word documents
|
| 13 |
-
- Added `pandas` for CSV/Excel files
|
| 14 |
-
- Integrated Markdown support
|
| 15 |
-
- JSON parsing with proper formatting
|
| 16 |
-
|
| 17 |
-
**Files Modified:**
|
| 18 |
-
- `QASystem/ingestion.py` - Added converter functions
|
| 19 |
-
- `app.py` - Updated file validation
|
| 20 |
-
- `templates/index.html` - Updated UI file types
|
| 21 |
-
- `requirements.txt` - Added dependencies
|
| 22 |
-
|
| 23 |
-
---
|
| 24 |
-
|
| 25 |
-
### 2. Parallel Processing ✅
|
| 26 |
-
**What Changed:**
|
| 27 |
-
- Implemented batch processing for embeddings
|
| 28 |
-
- Added multi-threaded document upload
|
| 29 |
-
- Concurrent chunk processing
|
| 30 |
-
|
| 31 |
-
**Implementation:**
|
| 32 |
-
- `process_chunks_parallel()` function for batch embedding
|
| 33 |
-
- ThreadPoolExecutor in `app.py` for async upload
|
| 34 |
-
- Configurable batch size (default: 32 chunks)
|
| 35 |
-
- Progress tracking throughout pipeline
|
| 36 |
-
|
| 37 |
-
**Performance Gain:**
|
| 38 |
-
- 3-5x faster processing
|
| 39 |
-
- 30-50 chunks/second (up from 10-15)
|
| 40 |
-
- Non-blocking UI during upload
|
| 41 |
-
|
| 42 |
-
**Files Modified:**
|
| 43 |
-
- `QASystem/ingestion.py` - Parallel processing logic
|
| 44 |
-
- `app.py` - Async upload handling
|
| 45 |
-
|
| 46 |
-
---
|
| 47 |
-
|
| 48 |
-
### 3. Memory Management ✅
|
| 49 |
-
**What Changed:**
|
| 50 |
-
- Real-time memory monitoring
|
| 51 |
-
- Automatic garbage collection
|
| 52 |
-
- Batch-wise memory cleanup
|
| 53 |
-
- Memory usage reporting
|
| 54 |
-
|
| 55 |
-
**Implementation:**
|
| 56 |
-
- Added `psutil` for memory monitoring
|
| 57 |
-
- `get_memory_usage()` function
|
| 58 |
-
- `clear_memory()` for garbage collection
|
| 59 |
-
- Cleanup every 5 batches
|
| 60 |
-
- Memory stats in console output
|
| 61 |
-
|
| 62 |
-
**Memory Optimization:**
|
| 63 |
-
- Prevents memory leaks
|
| 64 |
-
- Handles large documents (100+ pages)
|
| 65 |
-
- Configurable batch sizes for memory control
|
| 66 |
-
|
| 67 |
-
**Files Modified:**
|
| 68 |
-
- `QASystem/ingestion.py` - Memory management functions
|
| 69 |
-
- `requirements.txt` - Added psutil
|
| 70 |
-
|
| 71 |
-
---
|
| 72 |
-
|
| 73 |
-
### 4. Enhanced Semantic Search ✅
|
| 74 |
-
**What Changed:**
|
| 75 |
-
- Increased retrieval candidates from 5 to 10
|
| 76 |
-
- Added cached embedders for faster queries
|
| 77 |
-
- Implemented fallback retrieval strategies
|
| 78 |
-
- Added relevance scoring display
|
| 79 |
-
|
| 80 |
-
**Implementation:**
|
| 81 |
-
- `get_text_embedder()` with caching
|
| 82 |
-
- Enhanced `get_result()` function
|
| 83 |
-
- Multiple fallback mechanisms
|
| 84 |
-
- Relevance score calculation
|
| 85 |
-
- Better error handling
|
| 86 |
-
|
| 87 |
-
**Search Quality:**
|
| 88 |
-
- Better context coverage
|
| 89 |
-
- More accurate answers
|
| 90 |
-
- Faster query responses (2-5s)
|
| 91 |
-
- Graceful degradation on errors
|
| 92 |
-
|
| 93 |
-
**Files Modified:**
|
| 94 |
-
- `QASystem/retrieval_and_generation.py` - Enhanced retrieval
|
| 95 |
-
|
| 96 |
-
---
|
| 97 |
-
|
| 98 |
-
### 5. Progress Tracking & Error Handling ✅
|
| 99 |
-
**What Changed:**
|
| 100 |
-
- Real-time progress updates
|
| 101 |
-
- Comprehensive error messages
|
| 102 |
-
- Detailed console logging
|
| 103 |
-
- Better user feedback
|
| 104 |
-
|
| 105 |
-
**Implementation:**
|
| 106 |
-
- Progress callbacks in ingestion
|
| 107 |
-
- `try`/`except` blocks throughout
|
| 108 |
-
- Detailed error traces
|
| 109 |
-
- User-friendly error messages
|
| 110 |
-
- Troubleshooting hints
|
| 111 |
-
|
| 112 |
-
**User Experience:**
|
| 113 |
-
- Clear status updates
|
| 114 |
-
- Informative error messages
|
| 115 |
-
- Debug information in console
|
| 116 |
-
- Recovery suggestions
|
| 117 |
-
|
| 118 |
-
**Files Modified:**
|
| 119 |
-
- `app.py` - Progress tracking
|
| 120 |
-
- `QASystem/ingestion.py` - Logging
|
| 121 |
-
- `templates/index.html` - UI feedback
|
| 122 |
-
|
| 123 |
-
---
|
| 124 |
-
|
| 125 |
-
### 6. Dependencies Updated ✅
|
| 126 |
-
**What Changed:**
|
| 127 |
-
- Added 8 new packages
|
| 128 |
-
- Updated requirements.txt
|
| 129 |
-
- Created .env.example
|
| 130 |
-
|
| 131 |
-
**New Dependencies:**
|
| 132 |
-
```
|
| 133 |
-
python-docx # Word documents
|
| 134 |
-
pandas # Data processing
|
| 135 |
-
openpyxl # Excel files
|
| 136 |
-
psutil # Memory monitoring
|
| 137 |
-
tqdm # Progress bars
|
| 138 |
-
markdown # Markdown support
|
| 139 |
-
jinja2 # Templates
|
| 140 |
-
```
|
| 141 |
-
|
| 142 |
-
**Files Modified:**
|
| 143 |
-
- `requirements.txt` - Complete dependencies
|
| 144 |
-
- `.env.example` - Environment template
|
| 145 |
-
|
| 146 |
-
---
|
| 147 |
-
|
| 148 |
-
### 7. Comprehensive Documentation ✅
|
| 149 |
-
**What Changed:**
|
| 150 |
-
- Created 6 new documentation files
|
| 151 |
-
- Added helper scripts
|
| 152 |
-
- Created quick reference guides
|
| 153 |
-
|
| 154 |
-
**New Files:**
|
| 155 |
-
- `README.md` - Complete project overview
|
| 156 |
-
- `INSTALLATION.md` - Setup guide
|
| 157 |
-
- `FEATURES.md` - Feature documentation
|
| 158 |
-
- `QUICKSTART.md` - Quick reference
|
| 159 |
-
- `CHANGELOG.md` - Version history
|
| 160 |
-
- `.env.example` - Config template
|
| 161 |
-
- `test_system.py` - System test script
|
| 162 |
-
- `start.bat` / `start.sh` - Startup scripts
|
| 163 |
-
|
| 164 |
-
---
|
| 165 |
-
|
| 166 |
-
## 📊 Performance Comparison
|
| 167 |
-
|
| 168 |
-
| Metric | Before (v1.0) | After (v2.0) | Improvement |
|
| 169 |
-
|--------|---------------|--------------|-------------|
|
| 170 |
-
| Upload Speed | 5-10s | 1-3s | **3x faster** |
|
| 171 |
-
| Processing Speed | 10-15 chunks/s | 30-50 chunks/s | **3-4x faster** |
|
| 172 |
-
| Query Response | 5-10s | 2-5s | **2x faster** |
|
| 173 |
-
| Memory Usage | Unoptimized | Monitored & Optimized | **Better** |
|
| 174 |
-
| File Formats | 2 (PDF, TXT) | 7+ formats | **4x more** |
|
| 175 |
-
| Error Handling | Basic | Comprehensive | **Much better** |
|
| 176 |
-
|
| 177 |
-
---
|
| 178 |
-
|
| 179 |
-
## 🔧 Key Code Improvements
|
| 180 |
-
|
| 181 |
-
### Ingestion Pipeline (Before)
|
| 182 |
-
```python
|
| 183 |
-
# Simple pipeline, no parallelization
|
| 184 |
-
indexing.run({"converter": {"sources": [file]}})
|
| 185 |
-
```
|
| 186 |
-
|
| 187 |
-
### Ingestion Pipeline (After)
|
| 188 |
-
```python
|
| 189 |
-
# Parallel processing with memory management
|
| 190 |
-
documents = convert_to_documents(file) # Multi-format
|
| 191 |
-
chunks = split_documents(documents)
|
| 192 |
-
embedded = process_chunks_parallel(chunks) # Parallel!
|
| 193 |
-
write_to_store(embedded)
|
| 194 |
-
clear_memory() # Memory cleanup
|
| 195 |
-
```
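
One way `process_chunks_parallel` might be built is with a thread pool over fixed-size batches. This is a sketch under stated assumptions, not the project's actual code: `embed_batch` is a hypothetical placeholder for the real model call, and the `batch_size` and `max_workers` parameters are illustrative additions.

```python
from concurrent.futures import ThreadPoolExecutor


def embed_batch(batch: list[str]) -> list[list[float]]:
    """Hypothetical embedder: stands in for a real model call."""
    return [[float(len(text))] for text in batch]


def process_chunks_parallel(
    chunks: list[str], batch_size: int = 32, max_workers: int = 4
) -> list[list[float]]:
    # Split the chunks into fixed-size batches.
    batches = [chunks[i : i + batch_size] for i in range(0, len(chunks), batch_size)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in submission order, so output order matches input order.
        results = pool.map(embed_batch, batches)
        return [vector for batch_vectors in results for vector in batch_vectors]
```

Threads help most when the embedding call releases the GIL or waits on I/O; for pure-Python CPU work a process pool would be the more likely choice.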
|
| 196 |
-
|
| 197 |
-
### Retrieval (Before)
|
| 198 |
-
```python
|
| 199 |
-
# Simple retrieval, top_k=5
|
| 200 |
-
retriever = PineconeEmbeddingRetriever(top_k=5)
|
| 201 |
-
```
|
| 202 |
-
|
| 203 |
-
### Retrieval (After)
|
| 204 |
-
```python
|
| 205 |
-
# Enhanced retrieval with caching and fallbacks
|
| 206 |
-
embedder = get_text_embedder() # Cached!
|
| 207 |
-
retriever = PineconeEmbeddingRetriever(top_k=10) # More candidates
|
| 208 |
-
# + Multiple fallback strategies
|
| 209 |
-
```
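
The summary does not say what the "multiple fallback strategies" are. A minimal sketch of the general pattern, assuming each strategy is a callable that takes a query and `top_k` and returns a list of documents (the `Retriever` type and `retrieve_with_fallbacks` are hypothetical names):

```python
from typing import Callable, Sequence

Retriever = Callable[[str, int], list[str]]


def retrieve_with_fallbacks(
    query: str, strategies: Sequence[Retriever], top_k: int = 10
) -> list[str]:
    """Try each retrieval strategy in order; return the first non-empty result."""
    for strategy in strategies:
        try:
            documents = strategy(query, top_k)
        except Exception:
            continue  # degrade gracefully and try the next strategy
        if documents:
            return documents
    return []  # every strategy failed or found nothing
```

Ordering the strategies from most to least precise keeps the best results when the primary retriever works and still returns something when it does not.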
|
| 210 |
-
|
| 211 |
-
---
|
| 212 |
-
|
| 213 |
-
## 📁 Project Structure (Updated)
|
| 214 |
-
|
| 215 |
-
```
|
| 216 |
-
PaperBOT/
|
| 217 |
-
├── 📄 app.py # FastAPI app (enhanced)
|
| 218 |
-
├── 📄 requirements.txt # Updated dependencies
|
| 219 |
-
├── 📄 setup.py # Package setup
|
| 220 |
-
├── 📄 .env.example # NEW: Config template
|
| 221 |
-
├── 📄 .gitignore # Git ignore rules
|
| 222 |
-
├── 📄 test_system.py # NEW: System test
|
| 223 |
-
├── 📄 start.bat # NEW: Windows startup
|
| 224 |
-
├── 📄 start.sh # NEW: Linux/Mac startup
|
| 225 |
-
│
|
| 226 |
-
├── 📚 Documentation (NEW)
|
| 227 |
-
│ ├── README.md # Complete overview
|
| 228 |
-
│ ├── INSTALLATION.md # Setup guide
|
| 229 |
-
│ ├── FEATURES.md # Feature docs
|
| 230 |
-
│ ├── QUICKSTART.md # Quick reference
|
| 231 |
-
│ └── CHANGELOG.md # Version history
|
| 232 |
-
│
|
| 233 |
-
├── 📂 QASystem/
|
| 234 |
-
│ ├── __init__.py
|
| 235 |
-
│ ├── config.py # Performance config
|
| 236 |
-
│ ├── ingestion.py # Enhanced with parallel processing
|
| 237 |
-
│ ├── retrieval_and_generation.py # Enhanced semantic search
|
| 238 |
-
│ └── utils.py # Utilities
|
| 239 |
-
│
|
| 240 |
-
├── 📂 templates/
|
| 241 |
-
│ └── index.html # Updated UI
|
| 242 |
-
│
|
| 243 |
-
├── 📂 uploads/ # Temp storage
|
| 244 |
-
│ └── .gitkeep # NEW
|
| 245 |
-
│
|
| 246 |
-
└── 📂 data/ # Sample docs
|
| 247 |
-
```
|
| 248 |
-
|
| 249 |
-
---
|
| 250 |
-
|
| 251 |
-
## 🎯 How to Use the Upgrades
|
| 252 |
-
|
| 253 |
-
### 1. Install New Dependencies
|
| 254 |
-
```bash
|
| 255 |
-
pip install -r requirements.txt
|
| 256 |
-
```
|
| 257 |
-
|
| 258 |
-
### 2. Configure Performance (Optional)
|
| 259 |
-
Edit `QASystem/config.py`:
|
| 260 |
-
```python
|
| 261 |
-
# For speed
|
| 262 |
-
CURRENT_MODEL = "fast"
|
| 263 |
-
BATCH_SIZE = 64
|
| 264 |
-
|
| 265 |
-
# For quality (use instead of the "For speed" settings above)
|
| 266 |
-
CURRENT_MODEL = "quality"
|
| 267 |
-
BATCH_SIZE = 16
|
| 268 |
-
```
|
| 269 |
-
|
| 270 |
-
### 3. Upload New File Types
|
| 271 |
-
- Drag & drop any supported format
|
| 272 |
-
- System auto-detects and converts
|
| 273 |
-
- Watch progress bar for status
|
| 274 |
-
|
| 275 |
-
### 4. Monitor Performance
|
| 276 |
-
- Check console for memory stats
|
| 277 |
-
- View processing speed (chunks/sec)
|
| 278 |
-
- See relevance scores in answers
|
| 279 |
-
|
| 280 |
-
---
|
| 281 |
-
|
| 282 |
-
## 🚀 Next Steps
|
| 283 |
-
|
| 284 |
-
### To Run the Application:
|
| 285 |
-
```bash
|
| 286 |
-
# Option 1: Use startup script
|
| 287 |
-
start.bat # Windows
|
| 288 |
-
./start.sh # Linux/Mac
|
| 289 |
-
|
| 290 |
-
# Option 2: Manual start
|
| 291 |
-
python app.py
|
| 292 |
-
```
|
| 293 |
-
|
| 294 |
-
### To Test the System:
|
| 295 |
-
```bash
|
| 296 |
-
python test_system.py
|
| 297 |
-
```
|
| 298 |
-
|
| 299 |
-
### To Read Documentation:
|
| 300 |
-
- Quick start: `QUICKSTART.md`
|
| 301 |
-
- Full guide: `README.md`
|
| 302 |
-
- Features: `FEATURES.md`
|
| 303 |
-
- Installation help: `INSTALLATION.md`
|
| 304 |
-
|
| 305 |
-
---
|
| 306 |
-
|
| 307 |
-
## 📝 Summary
|
| 308 |
-
|
| 309 |
-
✅ **7 Major Improvements Completed**
|
| 310 |
-
✅ **6 New Documentation Files**
|
| 311 |
-
✅ **3-5x Performance Improvement**
|
| 312 |
-
✅ **7+ File Formats Supported**
|
| 313 |
-
✅ **Optimized Memory Management**
|
| 314 |
-
✅ **Enhanced Semantic Search**
|
| 315 |
-
✅ **Production-Ready Code**
|
| 316 |
-
|
| 317 |
-
The application is now:
|
| 318 |
-
- ⚡ **Faster** - Parallel processing, cached models
|
| 319 |
-
- 🧠 **Smarter** - Better retrieval, relevance scoring
|
| 320 |
-
- 💪 **Stronger** - Memory management, error handling
|
| 321 |
-
- 📚 **More Capable** - 7+ file formats
|
| 322 |
-
- 🎯 **Better Documented** - Comprehensive guides
|
| 323 |
-
|
| 324 |
-
**Status: READY FOR PRODUCTION USE! 🎉**
|
data/.gitkeep
CHANGED
|
@@ -1,2 +0,0 @@
|
|
| 1 |
-
# This file ensures the data directory is tracked by Git
|
| 2 |
-
# Preloaded PDF files can be placed here
|
uploads/.gitkeep
CHANGED
|
@@ -1,2 +0,0 @@
|
|
| 1 |
-
# This file ensures the uploads directory is tracked by Git
|
| 2 |
-
# User-uploaded files will be stored here but are ignored by .gitignore
|