| # ML Models Integration Guide |
|
|
| This document explains how to train and use the ML models for conflict prediction and package similarity. |
|
|
| ## Overview |
|
|
| The project includes two ML models: |
|
|
| 1. **Conflict Prediction Model**: A Random Forest classifier that predicts whether a set of dependencies will have conflicts |
| 2. **Package Embeddings**: Pre-computed semantic embeddings of common Python packages, used for similarity matching |
|
|
| ## Training the Models |
|
|
| ### Step 1: Install Training Dependencies |
|
|
| ```bash |
| pip install scikit-learn sentence-transformers numpy |
| ``` |
|
|
| ### Step 2: Train Conflict Prediction Model |
|
|
| ```bash |
| cd "code to upload" |
| python train_conflict_model.py |
| ``` |
|
|
| This will: |
| - Load the synthetic dataset (`synthetic_requirements_dataset.json`) |
| - Extract features from requirements |
| - Train a Random Forest classifier |
| - Save the model to `models/conflict_predictor.pkl` |
| - Display accuracy and feature importance |
|
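This guide does not show how `train_conflict_model.py` turns a requirements file into features, so the block below is only a sketch of what such a step could look like. The function name `extract_features` and every feature name are hypothetical, and the regular expressions are a simplification rather than a full requirement-specifier parser.

```python
import re

# Simplified patterns (assumption): a package name followed by an optional
# version specifier such as "==1.21.0" or ">=1.3,<2.0".
_REQ_LINE = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*(.*)$")
_OPERATOR = re.compile(r"(==|>=|<=|~=|!=|<|>)")


def extract_features(requirements_text: str) -> dict[str, int]:
    """Hypothetical feature extractor: counts simple properties of a requirements file."""
    names: list[str] = []
    pinned = ranged = unversioned = 0
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        match = _REQ_LINE.match(line)
        if match is None:  # e.g. option lines such as "-r other.txt"
            continue
        name, spec = match.group(1).lower(), match.group(2)
        names.append(name)
        operators = _OPERATOR.findall(spec)
        if not operators:
            unversioned += 1
        elif operators == ["=="]:
            pinned += 1
        else:
            ranged += 1
    return {
        "num_packages": len(names),
        "num_pinned": pinned,
        "num_ranged": ranged,
        "num_unversioned": unversioned,
        "num_duplicates": len(names) - len(set(names)),
    }


features = extract_features(
    "numpy==1.21.0\npandas>=1.3,<2.0\nrequests\nnumpy>=1.22  # listed twice\n"
)
print(features)
```

Count-based features like these are cheap to compute and give a tree-based classifier something to split on, but the real training script may use a different or richer feature set.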
|
| **Expected Output:** |
| - Model size: ~2-5 MB |
| - Test accuracy: ~85-95% (depending on dataset) |
|
|
| ### Step 3: Generate Package Embeddings |
|
|
| ```bash |
| python generate_embeddings.py |
| ``` |
|
|
| This will: |
| - Load a sentence transformer model |
| - Generate embeddings for common Python packages |
| - Save embeddings to `models/package_embeddings.json` |
| - Save model info to `models/embedding_info.json` |
|
|
| **Expected Output:** |
| - Embeddings file: ~5-10 MB |
| - Embedding dimension: 384 |
| - Number of packages: 100+ |
|
|
| ## Model Files Structure |
|
|
| After training, you should have: |
|
|
| ``` |
| code to upload/ |
| ├── models/ |
| │   ├── conflict_predictor.pkl    # Classification model |
| │   ├── package_embeddings.json   # Pre-computed embeddings |
| │   └── embedding_info.json       # Model metadata |
| ``` |
|
|
| ## Integration in Main App |
|
|
| When the model files are present, the app loads them automatically and uses them in two places: |
|
|
| 1. **Conflict Prediction**: Runs before detailed analysis to provide early warnings |
| 2. **Package Similarity**: Enhances spell-checking with semantic matching |
|
|
| ### Features |
|
|
| - **Graceful Fallback**: If the model files aren't available, the app falls back to rule-based methods |
| - **Lazy Loading**: Models load only when needed |
| - **Error Handling**: ML failures don't break the app |
|
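The three properties above can be combined in one small wrapper. The following is a minimal sketch, not the app's actual code: `LazyModel` and `check_conflicts` are hypothetical names, and the return values `"ml"` and `"rule-based"` only mark which path was taken.

```python
import pickle
from pathlib import Path
from typing import Any, Optional


class LazyModel:
    """Hypothetical wrapper: loads a pickled model on first use and never raises."""

    def __init__(self, path: str) -> None:
        self._path = Path(path)
        self._model: Optional[Any] = None
        self._attempted = False

    def get(self) -> Optional[Any]:
        """Return the model, or None if it is missing or cannot be loaded."""
        if not self._attempted:
            self._attempted = True  # try only once, even after a failure
            try:
                # Only unpickle files you trust: pickle can execute arbitrary code.
                with self._path.open("rb") as fh:
                    self._model = pickle.load(fh)
            except (OSError, pickle.UnpicklingError, EOFError, AttributeError, ImportError):
                self._model = None  # the caller falls back to rule-based checks
        return self._model


def check_conflicts(requirements_text: str, model: LazyModel) -> str:
    """Report which path would handle the request (placeholder logic only)."""
    if model.get() is None:
        return "rule-based"
    return "ml"


# The file does not exist, so loading fails quietly and the fallback path is used.
mode = check_conflicts("numpy==1.21.0", LazyModel("models/does_not_exist.pkl"))
print(mode)
```

Recording the load attempt in `_attempted` avoids retrying a failed load on every request, which keeps the fallback path as fast as the normal one.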
|
| ## Usage in Code |
|
|
| ### Conflict Prediction |
|
|
| ```python |
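| from ml_models import ConflictPredictor |
| |
| # Illustrative input: the raw text of a requirements.txt file |
| requirements_text = "numpy==1.21.0\npandas>=1.3\n" |
| |
| predictor = ConflictPredictor() |
| has_conflict, confidence = predictor.predict(requirements_text) |
| |
| if has_conflict: |
|     print(f"Conflict predicted with {confidence:.1%} confidence") |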
| ``` |
|
|
| ### Package Similarity |
|
|
| ```python |
| from ml_models import PackageEmbeddings |
| |
| embeddings = PackageEmbeddings() |
| similar = embeddings.find_similar("numpyy", top_k=3) |
| # Returns: [('numpy', 0.95), ('scipy', 0.72), ...] |
| |
| best_match = embeddings.get_best_match("pandaz") |
| # Returns: 'pandas' |
| ``` |
|
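How `find_similar` ranks candidates is not shown in this guide. A common approach with pre-computed vectors is cosine similarity, sketched below under that assumption. The helper names `cosine_similarity` and `rank_similar` are hypothetical, and the 3-dimensional toy vectors stand in for the real 384-dimensional embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_similar(
    query: list[float],
    embeddings: dict[str, list[float]],
    top_k: int = 3,
) -> list[tuple[str, float]]:
    """Return the top_k (name, score) pairs, highest similarity first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in embeddings.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy data (made up for illustration, not real package embeddings).
toy_embeddings = {
    "numpy": [1.0, 0.0, 0.0],
    "scipy": [0.8, 0.6, 0.0],
    "flask": [0.0, 0.0, 1.0],
}
ranked = rank_similar([1.0, 0.0, 0.0], toy_embeddings, top_k=2)
print(ranked)
```

For a misspelled query such as `"numpyy"`, the query vector would have to be computed on the fly by the sentence transformer, since only known package names have pre-computed vectors.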
|
| ## Hugging Face Spaces Deployment |
|
|
| ### Option 1: Include Models in Repo |
|
|
| 1. Train models locally |
| 2. Commit model files to the repo |
| 3. Models load automatically on Spaces |
|
|
| - **Pros**: Simple, no external dependencies |
| - **Cons**: Larger repo size (~10-15 MB) |
|
|
| ### Option 2: Upload to Hugging Face Hub |
|
|
| 1. Train models locally |
| 2. Upload to Hugging Face Hub: |
| ```python |
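| from huggingface_hub import upload_file |
| # upload_file takes keyword-only arguments; path_in_repo is the file name inside the Hub repo |
| upload_file(path_or_fileobj="models/conflict_predictor.pkl", |
|             path_in_repo="conflict_predictor.pkl", repo_id="your-username/conflict-predictor") |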
| ``` |
| 3. Load from Hub in app: |
| ```python |
| from huggingface_hub import hf_hub_download |
| model_path = hf_hub_download(repo_id="your-username/conflict-predictor", filename="conflict_predictor.pkl") |
| ``` |
|
|
| - **Pros**: Smaller repo, version control for models |
| - **Cons**: Requires internet connection at startup |
|
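The two options can also be combined so that the network is only needed when no bundled copy exists. The sketch below assumes the `models/` layout described earlier; `resolve_model_path` is a hypothetical helper, not part of the app or of `huggingface_hub`.

```python
from pathlib import Path


def resolve_model_path(filename: str, repo_id: str, local_dir: str = "models") -> str:
    """Return a path to the model file, downloading from the Hub only if needed."""
    local_path = Path(local_dir) / filename
    if local_path.is_file():
        return str(local_path)  # bundled copy found: no network access needed

    # Imported lazily so that a local copy works even without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=filename)


# Example call (repo_id is the placeholder used in this guide):
# model_path = resolve_model_path("conflict_predictor.pkl", "your-username/conflict-predictor")
```

`hf_hub_download` keeps downloaded files in a local cache, so later startups on the same machine can reuse the cached copy.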
|
| ## Performance |
|
|
| - **Conflict Prediction**: <10ms per prediction |
| - **Embedding Lookup**: <1ms (pre-computed) or ~50ms (on-the-fly) |
| - **Model Loading**: ~1-2 seconds at startup |
|
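These figures depend on hardware and model size, so it is worth measuring them in your own environment. The sketch below times an arbitrary callable; `measure_latency_ms` is a hypothetical helper, and the lambda is a stand-in workload rather than a real prediction.

```python
import statistics
import time
from typing import Callable


def measure_latency_ms(fn: Callable[[], object], repeats: int = 100) -> dict[str, float]:
    """Call fn repeatedly and report median and worst-case latency in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {"median_ms": statistics.median(samples), "max_ms": max(samples)}


# Stand-in workload; in the app this could be
# lambda: predictor.predict(requirements_text)
stats = measure_latency_ms(lambda: sum(range(1000)), repeats=50)
print(f"median {stats['median_ms']:.3f} ms, max {stats['max_ms']:.3f} ms")
```

The median is less sensitive than the mean to one-off spikes such as the first call after loading.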
|
| ## Troubleshooting |
|
|
| ### Models Not Loading |
|
|
| - Check that `models/` directory exists |
| - Verify model files are present |
| - Check file permissions |
|
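The checks above can be run in one pass. This is a minimal sketch that assumes the file names from the "Model Files Structure" section; `diagnose_models` is a hypothetical helper.

```python
import os
from pathlib import Path

# File names taken from the models/ layout described in this guide.
EXPECTED_FILES = (
    "conflict_predictor.pkl",
    "package_embeddings.json",
    "embedding_info.json",
)


def diagnose_models(models_dir: str = "models") -> list[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    root = Path(models_dir)
    if not root.is_dir():
        return [f"directory not found: {root}"]
    problems = []
    for name in EXPECTED_FILES:
        path = root / name
        if not path.is_file():
            problems.append(f"missing file: {path}")
        elif not os.access(path, os.R_OK):
            problems.append(f"file not readable (check permissions): {path}")
        elif path.stat().st_size == 0:
            problems.append(f"file is empty: {path}")
    return problems


for problem in diagnose_models():
    print(problem)
```

Run it from the directory that contains `models/`, since the default path is relative to the current working directory.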
|
| ### Low Prediction Accuracy |
|
|
| - Retrain with more data |
| - Adjust feature engineering |
| - Try different model parameters |
|
|
| ### Embeddings Not Working |
|
|
| - Ensure `sentence-transformers` is installed |
| - Check internet connection (for first-time model download) |
| - Verify embeddings file format |
|
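This guide does not specify the layout of `package_embeddings.json`. The sketch below ASSUMES a JSON object that maps each package name to a list of 384 numbers, matching the embedding dimension stated earlier. Adjust the checks if the actual file is structured differently. Both function names are hypothetical.

```python
import json

EXPECTED_DIM = 384  # embedding dimension stated in this guide


def validate_embeddings(data: object, expected_dim: int = EXPECTED_DIM) -> list[str]:
    """Check parsed JSON against the ASSUMED {name: [float, ...]} layout."""
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    problems = []
    for name, vector in data.items():
        if not isinstance(vector, list):
            problems.append(f"{name}: value is not a list")
        elif len(vector) != expected_dim:
            problems.append(f"{name}: expected {expected_dim} values, got {len(vector)}")
        elif not all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in vector):
            problems.append(f"{name}: contains non-numeric values")
    return problems


def validate_embeddings_file(path: str = "models/package_embeddings.json") -> list[str]:
    """Load the file and validate it; unreadable or invalid JSON is reported, not raised."""
    try:
        with open(path, encoding="utf-8") as fh:
            data = json.load(fh)
    except (OSError, ValueError) as exc:  # ValueError covers json.JSONDecodeError
        return [f"cannot read {path}: {exc}"]
    return validate_embeddings(data)


# Made-up sample: one valid entry and one with the wrong dimension.
sample = {"numpy": [0.0] * 384, "pandas": [0.1] * 10}
print(validate_embeddings(sample))  # ['pandas: expected 384 values, got 10']
```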
|
| ## Future Improvements |
|
|
| - [ ] Train on larger, real-world dataset |
| - [ ] Add version-specific embeddings |
| - [ ] Implement online learning |
| - [ ] Add confidence intervals |
| - [ ] Support custom model paths |
|
|
|
|