# CodeMode Core Scripts 🚀

This directory contains the **modular core logic** for the CodeMode pipeline. It is designed to be cleaner and more production-ready than the experimental notebooks.

## Structure

### 1. Ingestion (`scripts/core/ingestion`)
Handles data collection and processing.
- `ingest.py`: The Git Crawler (formerly `git_crawler.py`).
- `chunk.py`: The Universal Chunker (formerly `repo_chunker.py`).
- `generate_data.py`: Creates training triplets (formerly `pairs_triplets_generator.py`).

**Usage:**
```bash
# Example: Ingest a repo
python -m scripts.core.ingestion.ingest --url https://github.com/crewAIInc/crewAI

# Example: Generate Triplets
python -m scripts.core.ingestion.generate_data --chunks data/processed/chunks.jsonl --output data/training
```
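
To illustrate what "training triplets" means here, the following is a minimal sketch and not the actual logic of `generate_data.py`. It assumes each chunk record has hypothetical `file_path` and `content` fields, and that two chunks from the same file count as related (anchor and positive) while a chunk from a different file serves as the negative. The real script may use different fields and a different pairing strategy.

```python
import random
from collections import defaultdict


def make_triplets(chunks, group_key="file_path", text_key="content", seed=0):
    """Build (anchor, positive, negative) triplets from chunk records.

    Assumption (not taken from the repo): chunks sharing `group_key`
    are treated as related; chunks from other groups are negatives.
    """
    rng = random.Random(seed)

    # Group chunk texts by their source (here: the file they came from).
    groups = defaultdict(list)
    for chunk in chunks:
        groups[chunk[group_key]].append(chunk[text_key])

    keys = list(groups)
    triplets = []
    for key in keys:
        members = groups[key]
        other_keys = [k for k in keys if k != key]
        if len(members) < 2 or not other_keys:
            continue  # need at least one positive and one negative
        for i, anchor in enumerate(members):
            # Positive: any other chunk from the same group.
            positive = rng.choice(members[:i] + members[i + 1:])
            # Negative: any chunk from a different group.
            negative = rng.choice(groups[rng.choice(other_keys)])
            triplets.append(
                {"anchor": anchor, "positive": positive, "negative": negative}
            )
    return triplets
```

Each returned dictionary can be written as one line of a JSONL file, which matches the `triplets.jsonl` path used in the training example.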

### 2. Training (`scripts/core/training`)
Handles model training and embedding generation.
- `train.py`: Command-line entry point that starts a training run.
- `model.py`: The CodeEmbedder model architecture.
- `trainer.py`: The training loop logic.

**Usage:**
```bash
# Example: Train the model
python -m scripts.core.training.train --data_path data/training/triplets.jsonl --epochs 3
```
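
This README does not state which loss `trainer.py` optimizes. A common choice for triplet data is a triplet margin loss, sketched below in plain Python under that assumption. The function names, the cosine distance, and the `margin` value are illustrative and are not taken from the repository.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


def triplet_margin_loss(anchor, positive, negative, margin=0.5):
    """Penalize triplets where the positive is not closer than the negative.

    The loss is zero once the anchor-negative distance exceeds the
    anchor-positive distance by at least `margin`.
    """
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

With embeddings produced by a model, minimizing this quantity pulls the anchor and positive together and pushes the negative away.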

## Why this structure?
- **Separation of Concerns:** Training logic doesn't depend on web scraping libraries.
- **Reusability:** You can import `CodeEmbedder` or `RepoChunker` in other projects easily.
- **Production Ready:** Plain Python modules that run from the command line, instead of notebooks.