CodeMode Core Scripts
This directory contains the modular core logic for the CodeMode pipeline. It is designed to be cleaner and more production-ready than the experimental notebooks.
Structure
1. Ingestion (scripts/core/ingestion)
Handles data collection and processing.
- ingest.py: The Git Crawler (formerly git_crawler.py).
- chunk.py: The Universal Chunker (formerly repo_chunker.py).
- generate_data.py: Creates training triplets (formerly pairs_triplets_generator.py).
Usage:
# Example: Ingest a repo
python -m scripts.core.ingestion.ingest --url https://github.com/crewAIInc/crewAI
# Example: Generate Triplets
python -m scripts.core.ingestion.generate_data --chunks data/processed/chunks.jsonl --output data/training
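To illustrate what "training triplets" means here, the following is a minimal sketch and not the actual logic of generate_data.py. It assumes each chunk record has hypothetical `file` and `code` fields, treats another chunk from the same file as the positive, and treats a chunk from a different file as the negative. The real schema of chunks.jsonl and the real sampling strategy may differ.

```python
import json
import random
from collections import defaultdict


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def build_triplets(chunks, seed=0):
    """Build (anchor, positive, negative) triplets from chunk records.

    Assumption: every chunk is a dict with the hypothetical keys
    "file" and "code". These names are illustrative only.
    """
    rng = random.Random(seed)

    # Group chunk texts by the file they came from.
    by_file = defaultdict(list)
    for chunk in chunks:
        by_file[chunk["file"]].append(chunk["code"])

    triplets = []
    for file, codes in by_file.items():
        # Candidate negatives: every chunk that belongs to another file.
        others = [c for f, cs in by_file.items() if f != file for c in cs]
        if len(codes) < 2 or not others:
            continue  # need at least one positive and one negative
        for i, anchor in enumerate(codes):
            positive = rng.choice(codes[:i] + codes[i + 1:])
            negative = rng.choice(others)
            triplets.append(
                {"anchor": anchor, "positive": positive, "negative": negative}
            )
    return triplets
```

Files that contribute only one chunk are skipped because they cannot supply a positive example under this assumption.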
2. Training (scripts/core/training)
Handles model training and embedding generation.
- train.py: Main training script (the command-line entry point).
- model.py: The CodeEmbedder model architecture.
- trainer.py: The training loop logic.
Usage:
# Example: Train the model
python -m scripts.core.training.train --data_path data/training/triplets.jsonl --epochs 3
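This README does not say which loss train.py optimizes. A common objective for triplet data is the triplet margin loss, sketched below in plain Python on toy vectors. The function names and the default margin are assumptions for illustration, not the project's API.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Hypothetical helper: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the negative is farther from the anchor
    than the positive by at least `margin`.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

In a real training run the vectors would be embeddings produced by the model, and the loss would be computed on batched tensors so that gradients can flow back to the model.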
Why this structure?
- Separation of Concerns: Training logic doesn't depend on web scraping libraries.
- Reusability: You can import CodeEmbedder or RepoChunker in other projects easily.
- Production Ready: Direct Python scripts instead of notebooks.