CodeMode Core Scripts 🚀

This directory contains the modular core logic for the CodeMode pipeline. It is designed to be cleaner and more production-ready than the experimental notebooks.

Structure

1. Ingestion (scripts/core/ingestion)

Handles data collection and processing.

  • ingest.py: The Git Crawler (formerly git_crawler.py).
  • chunk.py: The Universal Chunker (formerly repo_chunker.py).
  • generate_data.py: Creates training triplets (formerly pairs_triplets_generator.py).

Usage:

# Example: Ingest a repo
python -m scripts.core.ingestion.ingest --url https://github.com/crewAIInc/crewAI

# Example: Generate Triplets
python -m scripts.core.ingestion.generate_data --chunks data/processed/chunks.jsonl --output data/training
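
To make the triplet step concrete, here is a minimal sketch of how (anchor, positive, negative) triplets could be built from chunk records. The field names `file` and `text`, the helper `make_triplets`, and the same-file heuristic for choosing positives are assumptions for illustration; they are not taken from generate_data.py, whose actual schema and sampling strategy may differ.

```python
import random
from collections import defaultdict


def make_triplets(chunks, seed=0):
    """Build (anchor, positive, negative) triplets from chunk records.

    Hypothetical heuristic: a positive is another chunk from the same file,
    a negative is a chunk from a different file.
    """
    rng = random.Random(seed)
    by_file = defaultdict(list)
    for chunk in chunks:
        by_file[chunk["file"]].append(chunk["text"])

    triplets = []
    for file, texts in by_file.items():
        # Candidate negatives: every chunk that lives in another file.
        others = [t for f, ts in by_file.items() if f != file for t in ts]
        if len(texts) < 2 or not others:
            continue  # need at least one positive and one negative
        for i, anchor in enumerate(texts):
            positive = rng.choice(texts[:i] + texts[i + 1:])
            negative = rng.choice(others)
            triplets.append(
                {"anchor": anchor, "positive": positive, "negative": negative}
            )
    return triplets


# Toy chunk records (assumed schema, not the real chunks.jsonl format).
chunks = [
    {"file": "a.py", "text": "def add(x, y): return x + y"},
    {"file": "a.py", "text": "def sub(x, y): return x - y"},
    {"file": "b.py", "text": "class Agent: pass"},
]
triplets = make_triplets(chunks)
```

In this toy run only the two chunks from `a.py` become anchors, because `b.py` has no second chunk to serve as a positive.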

2. Training (scripts/core/training)

Handles model training and embedding generation.

  • train.py: Main training script (the command-line entry point).
  • model.py: The CodeEmbedder model architecture.
  • trainer.py: The training loop logic.

Usage:

# Example: Train the model
python -m scripts.core.training.train --data_path data/training/triplets.jsonl --epochs 3
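
Because the training data is a file of triplets, one plausible objective is a triplet margin loss over embeddings. The sketch below computes that loss with cosine distance in plain Python. The function names and the margin value are assumptions for illustration, and the loss actually used by trainer.py may differ.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy 2-D embeddings (hypothetical values).
anchor = [1.0, 0.0]
# Positive is parallel to the anchor, negative is orthogonal: well separated.
easy = triplet_margin_loss(anchor, [2.0, 0.0], [0.0, 1.0])
# Roles swapped: the "positive" is far away and the "negative" is close.
hard = triplet_margin_loss(anchor, [0.0, 1.0], [2.0, 0.0])
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so only triplets the model still confuses contribute to training.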

Why this structure?

  • Separation of Concerns: Training logic doesn't depend on web scraping libraries.
  • Reusability: You can import CodeEmbedder or RepoChunker in other projects easily.
  • Production Ready: Plain Python scripts that run from the command line instead of notebooks.