# CodeMode Core Scripts 🚀
This directory contains the **modular core logic** for the CodeMode pipeline. It is designed to be cleaner and more production-ready than the experimental notebooks.
## Structure
### 1. Ingestion (`scripts/core/ingestion`)
Handles data collection and processing.
- `ingest.py`: The Git Crawler (formerly `git_crawler.py`).
- `chunk.py`: The Universal Chunker (formerly `repo_chunker.py`).
- `generate_data.py`: Creates training triplets (formerly `pairs_triplets_generator.py`).
**Usage:**
```bash
# Example: Ingest a repo
python -m scripts.core.ingestion.ingest --url https://github.com/crewAIInc/crewAI
# Example: Generate Triplets
python -m scripts.core.ingestion.generate_data --chunks data/processed/chunks.jsonl --output data/training
```
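For illustration, here is a minimal sketch of how training triplets might be built from chunk records. The record fields (`repo`, `text`), the `make_triplets` helper, and the rule for choosing positives and negatives (same repo vs. a different repo) are assumptions made for this example; they are not taken from `generate_data.py`.

```python
import random
from collections import defaultdict


def make_triplets(chunks, seed=0):
    """Build (anchor, positive, negative) triplets from chunk records.

    Hypothetical rule: a positive comes from the same repo as the anchor,
    a negative comes from a different repo.
    """
    rng = random.Random(seed)
    by_repo = defaultdict(list)
    for chunk in chunks:
        by_repo[chunk["repo"]].append(chunk["text"])

    triplets = []
    for repo, texts in by_repo.items():
        # Pool of candidate negatives: every chunk from any other repo.
        others = [t for r, ts in by_repo.items() if r != repo for t in ts]
        if len(texts) < 2 or not others:
            continue  # need at least one positive and one negative
        for i, anchor in enumerate(texts):
            # Pick a positive from the same repo, excluding the anchor itself.
            positive = rng.choice(texts[:i] + texts[i + 1:])
            negative = rng.choice(others)
            triplets.append(
                {"anchor": anchor, "positive": positive, "negative": negative}
            )
    return triplets


# Toy records standing in for lines of data/processed/chunks.jsonl
# (the field names are assumed, not the real schema).
chunks = [
    {"repo": "a", "text": "def add(x, y): return x + y"},
    {"repo": "a", "text": "def sub(x, y): return x - y"},
    {"repo": "b", "text": "class Agent: pass"},
]
triplets = make_triplets(chunks)
```

Repos with a single chunk are skipped as anchors in this sketch because no positive can be drawn for them; a real generator may use a different pairing strategy.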
### 2. Training (`scripts/core/training`)
Handles model training and embedding generation.
- `train.py`: Command-line entry point for training (see the usage example below).
- `model.py`: The CodeEmbedder model architecture.
- `trainer.py`: The training loop logic.
**Usage:**
```bash
# Example: Train the model
python -m scripts.core.training.train --data_path data/training/triplets.jsonl --epochs 3
```
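As a rough sketch of how a module run with `python -m` could accept the two flags shown above, the example below uses the standard-library `argparse`. Only the flag names `--data_path` and `--epochs` come from the usage example; the `parse_args` and `main` helpers, the default of 3 epochs, and making `--data_path` required are assumptions, and the actual `train.py` may be organized differently.

```python
import argparse


def parse_args(argv=None):
    """Parse the two flags shown in the usage example."""
    parser = argparse.ArgumentParser(description="Train the embedding model (sketch).")
    # Assumption: the data path is mandatory.
    parser.add_argument("--data_path", required=True, help="Path to the triplets JSONL file.")
    # Assumption: 3 is used as the default number of epochs.
    parser.add_argument("--epochs", type=int, default=3, help="Number of training epochs.")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Placeholder: a real script would build the model and run the training loop here.
    print(f"Would train on {args.data_path} for {args.epochs} epoch(s)")
    return args


# In a real script this call would sit under an `if __name__ == "__main__":`
# guard and read sys.argv; an explicit list keeps the sketch self-contained.
args = main(["--data_path", "data/training/triplets.jsonl", "--epochs", "3"])
```

Passing `argv` explicitly also makes the entry point easy to exercise from tests without spawning a subprocess.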
## Why this structure?
- **Separation of Concerns:** Training logic doesn't depend on web scraping libraries.
- **Reusability:** You can import `CodeEmbedder` or `RepoChunker` in other projects easily.
- **Production-Ready:** Plain Python scripts that run from the command line, instead of notebooks.