# CodeMode Core Scripts 🚀

This directory contains the **modular core logic** for the CodeMode pipeline. It is designed to be cleaner and more production-ready than the experimental notebooks.

## Structure

### 1. Ingestion (`scripts/core/ingestion`)

Handles data collection and processing.

- `ingest.py`: The Git Crawler (formerly `git_crawler.py`).
- `chunk.py`: The Universal Chunker (formerly `repo_chunker.py`).
- `generate_data.py`: Creates training triplets (formerly `pairs_triplets_generator.py`).

**Usage:**

```bash
# Example: Ingest a repo
python -m scripts.core.ingestion.ingest --url https://github.com/crewAIInc/crewAI

# Example: Generate triplets
python -m scripts.core.ingestion.generate_data --chunks data/processed/chunks.jsonl --output data/training
```

### 2. Training (`scripts/core/training`)

Handles model training and embedding generation.

- `train.py`: Main training loop.
- `model.py`: The CodeEmbedder model architecture.
- `trainer.py`: The training loop logic.

**Usage:**

```bash
# Example: Train the model
python -m scripts.core.training.train --data_path data/training/triplets.jsonl --epochs 3
```

## Why this structure?

- **Separation of Concerns:** Training logic doesn't depend on web scraping libraries.
- **Reusability:** You can import `CodeEmbedder` or `RepoChunker` in other projects easily.
- **Production Ready:** Plain Python scripts instead of notebooks.
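
This README does not describe the layout of the triplets file that `generate_data.py` writes and `train.py` reads through `--data_path`. As a rough sketch only, assuming one JSON object per line with the hypothetical keys `anchor`, `positive`, and `negative` (check the real output before relying on this), such a file could be inspected with the standard library alone:

```python
import json
from pathlib import Path
from typing import Iterator, NamedTuple


class Triplet(NamedTuple):
    """One training example. The field names are assumptions, not the pipeline's schema."""
    anchor: str
    positive: str
    negative: str


def read_triplets(path: Path) -> Iterator[Triplet]:
    """Yield one Triplet per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            # Hypothetical keys; a KeyError here means the real schema differs.
            yield Triplet(record["anchor"], record["positive"], record["negative"])


if __name__ == "__main__":
    # Path taken from the training usage example in this README.
    triplets_path = Path("data/training/triplets.jsonl")
    if triplets_path.exists():
        for index, triplet in enumerate(read_triplets(triplets_path)):
            print(index, triplet.anchor[:60])
            if index >= 2:
                break  # show only the first three records
```

Streaming the file line by line keeps memory use flat for large datasets, which is why the reader is a generator rather than a function that returns a list.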