CodeMode Core Scripts 🚀

This directory contains the modular core logic for the CodeMode pipeline. It is designed to be cleaner and more production-ready than the experimental notebooks.

Structure

1. Ingestion (`scripts/core/ingestion`)

Handles data collection and processing.

ingest.py: The Git Crawler (formerly git_crawler.py).
chunk.py: The Universal Chunker (formerly repo_chunker.py).
generate_data.py: Creates training triplets (formerly pairs_triplets_generator.py).

Usage:

```bash
# Example: Ingest a repo
python -m scripts.core.ingestion.ingest --url https://github.com/crewAIInc/crewAI

# Example: Generate Triplets
python -m scripts.core.ingestion.generate_data --chunks data/processed/chunks.jsonl --output data/training
```
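A training triplet groups an anchor with a positive (similar) and a negative (dissimilar) example. The sketch below shows one way such triplets could be built from chunk records. It is not the logic in generate_data.py: the record fields (`path`, `text`), the "same file means similar" heuristic, and the `make_triplets` name are all assumptions made for illustration.

```python
import json
import random
from collections import defaultdict


def make_triplets(chunks, seed=0):
    """Build (anchor, positive, negative) triplets from chunk records.

    Assumed heuristic: two chunks from the same file are "similar";
    a chunk from any other file is "dissimilar".
    """
    rng = random.Random(seed)
    by_file = defaultdict(list)
    for chunk in chunks:
        by_file[chunk["path"]].append(chunk["text"])

    triplets = []
    for path, texts in by_file.items():
        # Candidate negatives: every chunk that lives in a different file.
        negatives = [t for p, ts in by_file.items() if p != path for t in ts]
        if len(texts) < 2 or not negatives:
            continue  # need a positive partner and at least one negative
        for i, anchor in enumerate(texts):
            positive = rng.choice(texts[:i] + texts[i + 1:])
            triplets.append({
                "anchor": anchor,
                "positive": positive,
                "negative": rng.choice(negatives),
            })
    return triplets


# Hypothetical chunk records; the real chunks.jsonl schema may differ.
chunks = [
    {"path": "a.py", "text": "def add(x, y): return x + y"},
    {"path": "a.py", "text": "def sub(x, y): return x - y"},
    {"path": "b.py", "text": "class Crawler: pass"},
]
triplets = make_triplets(chunks)
for t in triplets:
    print(json.dumps(t))
```

Files with a single chunk are skipped as anchors here because they have no positive partner, although their chunk can still serve as a negative for other files.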

2. Training (`scripts/core/training`)

Handles model training and embedding generation.

train.py: Command-line entry point that parses arguments and starts training.
model.py: The CodeEmbedder model architecture.
trainer.py: The training loop logic.

Usage:

```bash
# Example: Train the model
python -m scripts.core.training.train --data_path data/training/triplets.jsonl --epochs 3
```
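This README does not say which objective the training code optimizes. Because the data comes as triplets, a common choice is a triplet margin loss, sketched here in plain Python so it runs without dependencies. The cosine distance, the margin of 0.2, and the function names are assumptions, not CodeMode's actual loss.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Loss is zero once the positive is closer than the negative by `margin`."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy 2-D embeddings standing in for model outputs.
anchor = [1.0, 0.0]
positive = [1.0, 0.0]   # same direction as the anchor
negative = [0.0, 1.0]   # orthogonal to the anchor

easy = triplet_margin_loss(anchor, positive, negative)
hard = triplet_margin_loss(anchor, negative, positive)  # roles swapped
print(easy, hard)
```

The first call has nothing left to learn, so its loss is zero. The swapped call is penalized because the "positive" sits farther from the anchor than the "negative".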

Why this structure?

Separation of Concerns: Training logic doesn't depend on web scraping libraries.
Reusability: You can import CodeEmbedder or RepoChunker in other projects easily.
Production-ready: Plain Python modules run from the command line instead of notebooks.
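Running a script with `python -m` relies on the module exposing a command-line entry point. Below is a minimal sketch of one using the standard library's `argparse`, mirroring the `--data_path` and `--epochs` flags from the training example. The defaults, help strings, and the `build_parser` and `main` names are assumptions, not the contents of train.py.

```python
import argparse


def build_parser():
    """Declare the flags shown in the training usage example."""
    parser = argparse.ArgumentParser(description="Train on triplet data (sketch).")
    parser.add_argument("--data_path", required=True,
                        help="Path to a triplets JSONL file.")
    parser.add_argument("--epochs", type=int, default=3,
                        help="Number of passes over the data.")
    return parser


def main(argv=None):
    # With argv=None, argparse reads sys.argv[1:], the normal CLI behaviour.
    args = build_parser().parse_args(argv)
    print(f"Would train on {args.data_path} for {args.epochs} epoch(s).")
    return args


# Explicit arguments keep this demo self-contained. In a real module the
# call would be main() under an `if __name__ == "__main__":` guard.
args = main(["--data_path", "data/training/triplets.jsonl", "--epochs", "3"])
```

Accepting an optional `argv` list keeps the same function usable from tests and from other Python code, which fits the reusability goal above.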
