# CodeMode Core Scripts 🚀

This directory contains the **modular core logic** for the CodeMode pipeline. It is designed to be cleaner and more production-ready than the experimental notebooks.

## Structure

### 1. Ingestion (`scripts/core/ingestion`)
Handles data collection and processing.
- `ingest.py`: The Git Crawler (formerly `git_crawler.py`).
- `chunk.py`: The Universal Chunker (formerly `repo_chunker.py`).
- `generate_data.py`: Creates training triplets (formerly `pairs_triplets_generator.py`).

**Usage:**
```bash
# Example: Ingest a repo
python -m scripts.core.ingestion.ingest --url https://github.com/crewAIInc/crewAI

# Example: Generate Triplets
python -m scripts.core.ingestion.generate_data --chunks data/processed/chunks.jsonl --output data/training
```
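To make the triplet-generation step concrete, here is a minimal sketch of how chunk records might be turned into (anchor, positive, negative) triplets. It is not the logic in `generate_data.py`: the record fields `file` and `text`, the function name `make_triplets`, and the same-file/other-file sampling rule are all assumptions chosen for illustration.

```python
import json
import random
from collections import defaultdict


def make_triplets(chunks, seed=0):
    """Build (anchor, positive, negative) triplets from chunk records.

    Assumed (hypothetical) record shape: {"file": str, "text": str}.
    Positive: another chunk from the same file.
    Negative: a chunk from a different file.
    """
    rng = random.Random(seed)

    # Group chunk texts by the file they came from.
    by_file = defaultdict(list)
    for chunk in chunks:
        by_file[chunk["file"]].append(chunk["text"])

    triplets = []
    for file, texts in by_file.items():
        # Chunks from every other file are negative candidates.
        negatives = [t for f, ts in by_file.items() if f != file for t in ts]
        if len(texts) < 2 or not negatives:
            continue  # need at least one positive and one negative
        for i, anchor in enumerate(texts):
            positive = rng.choice(texts[:i] + texts[i + 1:])
            negative = rng.choice(negatives)
            triplets.append(
                {"anchor": anchor, "positive": positive, "negative": negative}
            )
    return triplets


# Illustrative records only; a real chunks file may use a different schema.
chunks = [
    {"file": "a.py", "text": "def add(x, y): return x + y"},
    {"file": "a.py", "text": "def sub(x, y): return x - y"},
    {"file": "b.py", "text": "class Crawler: pass"},
]
triplets = make_triplets(chunks)
for triplet in triplets:
    print(json.dumps(triplet))  # one JSON object per line, as in a .jsonl file
```

Files with only one chunk are skipped as anchors in this sketch because they have no same-file positive.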

### 2. Training (`scripts/core/training`)
Handles model training and embedding generation.
- `train.py`: Command-line entry point for training.
- `model.py`: The CodeEmbedder model architecture.
- `trainer.py`: The training loop logic.

**Usage:**
```bash
# Example: Train the model
python -m scripts.core.training.train --data_path data/training/triplets.jsonl --epochs 3
```
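Models trained on triplets are commonly optimized with a triplet margin loss. This README does not say which loss `trainer.py` uses, so the following is only a dependency-free sketch of that general idea. The cosine distance, the `margin` value, and both function names are assumptions.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy 2-D "embeddings" (hypothetical values, not model output).
good = triplet_margin_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
bad = triplet_margin_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
print(good, bad)
```

A well-separated triplet contributes no loss, while a triplet whose negative sits closer to the anchor than its positive is penalized.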

## Why this structure?
- **Separation of Concerns:** Training logic doesn't depend on web scraping libraries.
- **Reusability:** You can import `CodeEmbedder` or `RepoChunker` in other projects easily.
- **Production-Ready:** Plain Python scripts instead of notebooks.
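Because each stage is a plain module with a command-line interface, the stages can be chained from a small driver script. The sketch below reuses only the module paths and flags shown in the usage examples above. The helper names `build_commands` and `run_pipeline` are hypothetical, and the chunking step is left out because its flags are not documented here.

```python
import shlex
import subprocess
import sys


def build_commands(repo_url, chunks_path, training_dir, triplets_path, epochs):
    """Return the documented CLI invocations as argument lists, in pipeline order."""
    py = sys.executable  # run the modules with the current interpreter
    return [
        [py, "-m", "scripts.core.ingestion.ingest", "--url", repo_url],
        [py, "-m", "scripts.core.ingestion.generate_data",
         "--chunks", chunks_path, "--output", training_dir],
        [py, "-m", "scripts.core.training.train",
         "--data_path", triplets_path, "--epochs", str(epochs)],
    ]


def run_pipeline(commands):
    """Run each stage in order; check=True stops at the first failing stage."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


commands = build_commands(
    repo_url="https://github.com/crewAIInc/crewAI",
    chunks_path="data/processed/chunks.jsonl",
    training_dir="data/training",
    triplets_path="data/training/triplets.jsonl",
    epochs=3,
)
for cmd in commands:
    print(shlex.join(cmd))  # preview only; call run_pipeline(commands) to execute
```

Calling `run_pipeline(commands)` from the repository root would execute the stages one after another, assuming the chunks file has already been produced.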