ml_trainer_env / README.md
BART-ender's picture
Upload folder using huggingface_hub
685c05b verified
metadata
title: ML Training Optimizer
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 8000
base_path: /web

ML Training Optimizer β€” OpenEnv Environment

An OpenEnv environment where AI agents learn to optimize the training of real ML/DL models by tuning hyperparameters, selecting optimizers, managing learning rate schedules, and applying regularization techniques.

This environment trains REAL PyTorch models on CPU β€” not simulations. The agent observes actual training curves, loss values, and validation metrics from real forward/backward passes, then decides what to change next.

This models a real workflow ML practitioners perform every day: hyperparameter tuning under limited compute, noisy validation curves, and real overfitting risk.

Motivation

ML practitioners spend enormous time on hyperparameter tuning. This environment recreates that workflow:

  • Agent observes training metrics (loss, accuracy, convergence signals)
  • Agent decides what to change (optimizer, LR, regularization, etc.)
  • Agent runs more training and iterates

The small dataset subsets (5k–10k samples) make overfitting a real, tangible problem the agent must address β€” exactly like real low-data regimes practitioners face daily.

Vision & Scalability

The long-term vision for this environment is to teach AI agents to monitor and optimize the training of large-scale models on distributed systems β€” multi-GPU clusters, sharded data pipelines, and fault-tolerant training loops. In production ML, human engineers spend significant time babysitting training runs: watching for loss spikes, adjusting learning rates, restarting from checkpoints, and rebalancing resources across nodes. An agent that masters these skills could dramatically accelerate the development cycle of foundation models.

To fit within current compute constraints (and the OpenEnv specification), the environment currently operates on small models trainable on standard CPUs. However, the core abstractions β€” observing training curves, adjusting hyperparameters mid-run, detecting convergence/divergence, and deciding when to stop β€” are identical to those required at scale. An agent that learns effective optimization strategies here can transfer those skills to larger, distributed settings as the environment scales up.

Tasks

Task 1: MNIST Digit Classifier (Easy)

  • Model: 2-layer MLP (~100k params)
  • Dataset: MNIST 5k subset (4k train / 1k val)
  • Budget: 100 epochs
  • Goal: Maximize validation accuracy (target β‰₯ 96%)
  • Grading: Linear scale 88%β†’97.5% β†’ score 0.0β†’1.0

Task 2: Fashion Item Classifier (Medium)

  • Model: Small CNN (~200k params)
  • Dataset: FashionMNIST 8k subset (6.5k train / 1.5k val)
  • Budget: 80 epochs
  • Goal: Maximize accuracy while keeping overfitting gap < 5%
  • Grading: 60% accuracy score + 40% generalization score

Task 3: CIFAR-10 Under Budget (Hard)

  • Model: Deeper CNN (~500k params)
  • Dataset: CIFAR-10 10k subset (8k train / 2k val)
  • Budget: 60 epochs
  • Goal: Maximize accuracy under tight budget
  • Grading: 50% accuracy + 30% efficiency + 20% stability

Action Space (MCP Tools)

Tool Parameters Description
configure_training optimizer, learning_rate, batch_size, weight_decay, dropout, lr_schedule, warmup_epochs, augmentation, augmentation_strength Set/update training config
run_epochs num_epochs (1–20) Run N epochs of real PyTorch training
adjust_learning_rate new_lr Change LR mid-training
toggle_augmentation enabled, strength Toggle data augmentation
get_training_status β€” Query current metrics
submit_model β€” Submit for final grading

Configuration Options

Optimizers: sgd (with momentum=0.9), adam, adamw

LR Schedules: constant, step (decay by 0.1 every T/3 epochs), cosine (cosine annealing), warmup_cosine (linear warmup + cosine)

Regularization: weight_decay (L2), dropout (0.0–0.5), augmentation (random transforms)

Batch Sizes: 32, 64, 128, 256

Observation Space

After each action, the agent receives:

{
  "current_epoch": 30,
  "max_epochs": 100,
  "remaining_budget": 70,
  "train_loss": 0.342,
  "val_loss": 0.401,
  "train_accuracy": 0.891,
  "val_accuracy": 0.864,
  "best_val_accuracy": 0.871,
  "best_val_epoch": 25,
  "loss_history_last_10": [0.45, 0.43, ...],
  "val_loss_history_last_10": [0.52, 0.49, ...],
  "convergence_signal": "improving",
  "is_diverged": false
}

Convergence signals: not_started, warming_up, improving, plateaued, overfitting, stalling, diverged

Reward Function

Rewards per step (not just at the end):

  • Progress reward: +0.3 Γ— accuracy improvement above previous best
  • Convergence reward: +0.05 for decreasing validation loss
  • Divergence penalty: βˆ’0.2 if training diverges
  • Overfitting penalty: βˆ’0.05 Γ— excess when gap > 8%
  • Submission bonus: Final grader score (0.0–1.0) added on submit

Setup & Usage

Install

uv sync

Run the server locally

uvicorn server.app:app --host 0.0.0.0 --port 8000

Run with Docker

docker build -f server/Dockerfile -t ml-trainer-env .
docker run -p 8000:8000 ml-trainer-env

Run the baseline inference

This baseline uses the OpenAI API by default. With the default LLM_RPM_LIMIT=5, it spaces requests to stay under free-tier quotas and uses a small, quota-aware decision budget per task.

Recommended .env:

OPENAI_API_KEY=sk-proj-...
MODEL_NAME=gpt-4o-mini
ENV_URL=http://localhost:8000
LLM_RPM_LIMIT=5
LLM_MAX_RETRIES=3
LLM_REASONING_EFFORT=minimal
LLM_MAX_STEPS_EASY=5
LLM_MAX_STEPS_MEDIUM=6
LLM_MAX_STEPS_HARD=7

Then run:

export ENV_URL=http://localhost:8000
uv run inference.py

The script uses the OpenAI Python client against the official OpenAI API by default. You can also point it at other OpenAI-compatible providers (like OpenRouter or Gemini) by setting corresponding API_BASE_URL, OPENROUTER_API_KEY, or GEMINI_API_KEY in your .env.

Required environment variables

| API_BASE_URL | no (defaults to OpenAI) | LLM API endpoint | | MODEL_NAME | yes | Model identifier (default: gpt-4o-mini) | | ENV_URL | yes | URL of the running OpenEnv environment | | OPENAI_API_KEY | yes | Auth for OpenAI (preferred) | | OPENROUTER_API_KEY or GEMINI_API_KEY | yes | Fallback Auth for alternative providers | | HF_TOKEN | needed for HF deployment workflows | Hugging Face auth token |

Optional inference tuning variables

Variable Default Purpose
LLM_RPM_LIMIT 5 Hard request cap used by the scheduler
LLM_MAX_RETRIES 3 Rate-limit retries per model request
LLM_REASONING_EFFORT minimal Gemini reasoning effort
LLM_MAX_STEPS_EASY 5 Max model decisions for easy_mnist
LLM_MAX_STEPS_MEDIUM 6 Max model decisions for medium_fashion
LLM_MAX_STEPS_HARD 7 Max model decisions for hard_cifar

Interact via Python client

from ml_trainer_env import MLTrainerEnv

with MLTrainerEnv(base_url="http://localhost:8000") as env:
    env.reset(task_id="easy_mnist")
    tools = env.list_tools()
    
    result = env.call_tool("configure_training",
        optimizer="adam", learning_rate=0.001, batch_size=64)
    
    result = env.call_tool("run_epochs", num_epochs=10)
    print(result)  # Real training metrics!
    
    result = env.call_tool("submit_model")
    print(result)  # Final score

Baseline Scores

Task Expected Score Range Notes
easy_mnist 0.6 – 0.9 Most models solve this well
medium_fashion 0.4 – 0.7 Requires regularization awareness
hard_cifar 0.2 – 0.5 Genuinely challenging under budget

Scores are reported as expected ranges rather than exact fixed values because training remains real, even though seeds and data subsets are deterministic.

Architecture

openenv-hack/
β”œβ”€β”€ __init__.py           # Package exports
β”œβ”€β”€ models.py             # Pydantic Action/Observation models
β”œβ”€β”€ client.py             # MCPToolClient subclass
β”œβ”€β”€ openenv.yaml          # OpenEnv manifest
β”œβ”€β”€ pyproject.toml        # Dependencies
β”œβ”€β”€ inference.py          # Baseline inference script
β”œβ”€β”€ README.md
β”œβ”€β”€ server/
β”‚   β”œβ”€β”€ app.py            # FastAPI server
β”‚   β”œβ”€β”€ ml_trainer_environment.py  # MCPEnvironment with tools
β”‚   β”œβ”€β”€ trainer.py        # Real PyTorch training engine
β”‚   β”œβ”€β”€ models_nn.py      # Neural network architectures
β”‚   β”œβ”€β”€ datasets.py       # Dataset loading & subsetting
β”‚   β”œβ”€β”€ tasks.py          # Task definitions & graders
β”‚   └── Dockerfile
└── outputs/
    β”œβ”€β”€ logs/
    └── evals/

Technical Details

  • Real training: Actual PyTorch forward/backward passes on CPU
  • Deterministic: torch.manual_seed() ensures reproducible results
  • Constrained: torch.set_num_threads(2) matches 2 vCPU limit
  • Fast: ~0.5–3s per epoch depending on task
  • Pre-cached: Datasets downloaded at Docker build time
  • Quota-aware baseline: inference.py is optimized for low-RPM Gemini quotas and uses function calling with compact state summaries