Spaces:

BART-ender
/

ml_trainer_env

Sleeping

App Files Files Community

ml_trainer_env / README.md

BART-ender

Upload folder using huggingface_hub

685c05b verified 16 days ago

preview code

raw

history blame contribute delete

9.42 kB

metadata

title: ML Training Optimizer
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 8000
base_path: /web

ML Training Optimizer — OpenEnv Environment

An OpenEnv environment where AI agents learn to optimize the training of real ML/DL models by tuning hyperparameters, selecting optimizers, managing learning rate schedules, and applying regularization techniques.

This environment trains REAL PyTorch models on CPU — not simulations. The agent observes actual training curves, loss values, and validation metrics from real forward/backward passes, then decides what to change next.

This models a real workflow ML practitioners perform every day: hyperparameter tuning under limited compute, noisy validation curves, and real overfitting risk.

Motivation

ML practitioners spend enormous time on hyperparameter tuning. This environment recreates that workflow:

Agent observes training metrics (loss, accuracy, convergence signals)
Agent decides what to change (optimizer, LR, regularization, etc.)
Agent runs more training and iterates

The small dataset subsets (5k–10k samples) make overfitting a real, tangible problem the agent must address — exactly like real low-data regimes practitioners face daily.

Vision & Scalability

The long-term vision for this environment is to teach AI agents to monitor and optimize the training of large-scale models on distributed systems — multi-GPU clusters, sharded data pipelines, and fault-tolerant training loops. In production ML, human engineers spend significant time babysitting training runs: watching for loss spikes, adjusting learning rates, restarting from checkpoints, and rebalancing resources across nodes. An agent that masters these skills could dramatically accelerate the development cycle of foundation models.

To fit within current compute constraints (and the OpenEnv specification), the environment currently operates on small models trainable on standard CPUs. However, the core abstractions — observing training curves, adjusting hyperparameters mid-run, detecting convergence/divergence, and deciding when to stop — are identical to those required at scale. An agent that learns effective optimization strategies here can transfer those skills to larger, distributed settings as the environment scales up.

Tasks

Task 1: MNIST Digit Classifier (Easy)

Model: 2-layer MLP (~100k params)
Dataset: MNIST 5k subset (4k train / 1k val)
Budget: 100 epochs
Goal: Maximize validation accuracy (target ≥ 96%)
Grading: Linear scale 88%→97.5% → score 0.0→1.0

Task 2: Fashion Item Classifier (Medium)

Model: Small CNN (~200k params)
Dataset: FashionMNIST 8k subset (6.5k train / 1.5k val)
Budget: 80 epochs
Goal: Maximize accuracy while keeping overfitting gap < 5%
Grading: 60% accuracy score + 40% generalization score

Task 3: CIFAR-10 Under Budget (Hard)

Model: Deeper CNN (~500k params)
Dataset: CIFAR-10 10k subset (8k train / 2k val)
Budget: 60 epochs
Goal: Maximize accuracy under tight budget
Grading: 50% accuracy + 30% efficiency + 20% stability

Action Space (MCP Tools)

Tool	Parameters	Description
`configure_training`	optimizer, learning_rate, batch_size, weight_decay, dropout, lr_schedule, warmup_epochs, augmentation, augmentation_strength	Set/update training config
`run_epochs`	num_epochs (1–20)	Run N epochs of real PyTorch training
`adjust_learning_rate`	new_lr	Change LR mid-training
`toggle_augmentation`	enabled, strength	Toggle data augmentation
`get_training_status`	—	Query current metrics
`submit_model`	—	Submit for final grading

Configuration Options

Optimizers: sgd (with momentum=0.9), adam, adamw

LR Schedules: constant, step (decay by 0.1 every T/3 epochs), cosine (cosine annealing), warmup_cosine (linear warmup + cosine)

Regularization: weight_decay (L2), dropout (0.0–0.5), augmentation (random transforms)

Batch Sizes: 32, 64, 128, 256

Observation Space

After each action, the agent receives:

{
  "current_epoch": 30,
  "max_epochs": 100,
  "remaining_budget": 70,
  "train_loss": 0.342,
  "val_loss": 0.401,
  "train_accuracy": 0.891,
  "val_accuracy": 0.864,
  "best_val_accuracy": 0.871,
  "best_val_epoch": 25,
  "loss_history_last_10": [0.45, 0.43, ...],
  "val_loss_history_last_10": [0.52, 0.49, ...],
  "convergence_signal": "improving",
  "is_diverged": false
}

Convergence signals: not_started, warming_up, improving, plateaued, overfitting, stalling, diverged

Reward Function

Rewards per step (not just at the end):

Progress reward: +0.3 × accuracy improvement above previous best
Convergence reward: +0.05 for decreasing validation loss
Divergence penalty: −0.2 if training diverges
Overfitting penalty: −0.05 × excess when gap > 8%
Submission bonus: Final grader score (0.0–1.0) added on submit

Setup & Usage

Install

uv sync

Run the server locally

uvicorn server.app:app --host 0.0.0.0 --port 8000

Run with Docker

docker build -f server/Dockerfile -t ml-trainer-env .
docker run -p 8000:8000 ml-trainer-env

Run the baseline inference

This baseline uses the OpenAI API by default. With the default LLM_RPM_LIMIT=5, it spaces requests to stay under free-tier quotas and uses a small, quota-aware decision budget per task.

Recommended .env:

OPENAI_API_KEY=sk-proj-...
MODEL_NAME=gpt-4o-mini
ENV_URL=http://localhost:8000
LLM_RPM_LIMIT=5
LLM_MAX_RETRIES=3
LLM_REASONING_EFFORT=minimal
LLM_MAX_STEPS_EASY=5
LLM_MAX_STEPS_MEDIUM=6
LLM_MAX_STEPS_HARD=7

Then run:

export ENV_URL=http://localhost:8000
uv run inference.py

The script uses the OpenAI Python client against the official OpenAI API by default. You can also point it at other OpenAI-compatible providers (like OpenRouter or Gemini) by setting corresponding API_BASE_URL, OPENROUTER_API_KEY, or GEMINI_API_KEY in your .env.

Required environment variables

Optional inference tuning variables

Variable	Default	Purpose
`LLM_RPM_LIMIT`	`5`	Hard request cap used by the scheduler
`LLM_MAX_RETRIES`	`3`	Rate-limit retries per model request
`LLM_REASONING_EFFORT`	`minimal`	Gemini reasoning effort
`LLM_MAX_STEPS_EASY`	`5`	Max model decisions for `easy_mnist`
`LLM_MAX_STEPS_MEDIUM`	`6`	Max model decisions for `medium_fashion`
`LLM_MAX_STEPS_HARD`	`7`	Max model decisions for `hard_cifar`

Interact via Python client

from ml_trainer_env import MLTrainerEnv

with MLTrainerEnv(base_url="http://localhost:8000") as env:
    env.reset(task_id="easy_mnist")
    tools = env.list_tools()
    
    result = env.call_tool("configure_training",
        optimizer="adam", learning_rate=0.001, batch_size=64)
    
    result = env.call_tool("run_epochs", num_epochs=10)
    print(result)  # Real training metrics!
    
    result = env.call_tool("submit_model")
    print(result)  # Final score

Baseline Scores

Task	Expected Score Range	Notes
easy_mnist	0.6 – 0.9	Most models solve this well
medium_fashion	0.4 – 0.7	Requires regularization awareness
hard_cifar	0.2 – 0.5	Genuinely challenging under budget

Scores are reported as expected ranges rather than exact fixed values because training remains real, even though seeds and data subsets are deterministic.

Architecture

openenv-hack/
├── __init__.py           # Package exports
├── models.py             # Pydantic Action/Observation models
├── client.py             # MCPToolClient subclass
├── openenv.yaml          # OpenEnv manifest
├── pyproject.toml        # Dependencies
├── inference.py          # Baseline inference script
├── README.md
├── server/
│   ├── app.py            # FastAPI server
│   ├── ml_trainer_environment.py  # MCPEnvironment with tools
│   ├── trainer.py        # Real PyTorch training engine
│   ├── models_nn.py      # Neural network architectures
│   ├── datasets.py       # Dataset loading & subsetting
│   ├── tasks.py          # Task definitions & graders
│   └── Dockerfile
└── outputs/
    ├── logs/
    └── evals/

Technical Details

Real training: Actual PyTorch forward/backward passes on CPU
Deterministic: torch.manual_seed() ensures reproducible results
Constrained: torch.set_num_threads(2) matches 2 vCPU limit
Fast: ~0.5–3s per epoch depending on task
Pre-cached: Datasets downloaded at Docker build time
Quota-aware baseline: inference.py is optimized for low-RPM Gemini quotas and uses function calling with compact state summaries