title: ML Training Optimizer
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 8000
base_path: /web
ML Training Optimizer: OpenEnv Environment
An OpenEnv environment where AI agents learn to optimize the training of real ML/DL models by tuning hyperparameters, selecting optimizers, managing learning rate schedules, and applying regularization techniques.
This environment trains real PyTorch models on CPU rather than simulating them. The agent observes actual training curves, loss values, and validation metrics from real forward/backward passes, then decides what to change next.
This models a real workflow ML practitioners perform every day: hyperparameter tuning under limited compute, noisy validation curves, and real overfitting risk.
Motivation
ML practitioners spend enormous time on hyperparameter tuning. This environment recreates that workflow:
- Agent observes training metrics (loss, accuracy, convergence signals)
- Agent decides what to change (optimizer, LR, regularization, etc.)
- Agent runs more training and iterates
The small dataset subsets (5k–10k samples) make overfitting a real, tangible problem the agent must address, just as in the low-data regimes practitioners routinely face.
Vision & Scalability
The long-term vision for this environment is to teach AI agents to monitor and optimize the training of large-scale models on distributed systems: multi-GPU clusters, sharded data pipelines, and fault-tolerant training loops. In production ML, human engineers spend significant time babysitting training runs: watching for loss spikes, adjusting learning rates, restarting from checkpoints, and rebalancing resources across nodes. An agent that masters these skills could dramatically accelerate the development cycle of foundation models.
To fit within current compute constraints (and the OpenEnv specification), the environment currently operates on small models trainable on standard CPUs. However, the core abstractions (observing training curves, adjusting hyperparameters mid-run, detecting convergence/divergence, and deciding when to stop) are the same ones required at scale. An agent that learns effective optimization strategies here could transfer those skills to larger, distributed settings as the environment scales up.
Tasks
Task 1: MNIST Digit Classifier (Easy)
- Model: 2-layer MLP (~100k params)
- Dataset: MNIST 5k subset (4k train / 1k val)
- Budget: 100 epochs
- Goal: Maximize validation accuracy (target ≥ 96%)
- Grading: Linear scale, 88%–97.5% validation accuracy → score 0.0–1.0
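The linear grading scale for Task 1 can be sketched as below. This is a minimal illustration, not the environment's grader: the function name `linear_score` is hypothetical, and clamping accuracies outside the 88%–97.5% band to 0.0 or 1.0 is an assumption.

```python
def linear_score(val_accuracy: float, low: float = 0.88, high: float = 0.975) -> float:
    """Map a validation accuracy (as a fraction) linearly onto a 0.0-1.0 score.

    Assumption: values below `low` score 0.0 and values above `high` score 1.0.
    """
    raw = (val_accuracy - low) / (high - low)
    return max(0.0, min(1.0, raw))
```

Under these assumptions an accuracy halfway through the band (92.75%) scores 0.5, and the 96% target lands a little above 0.84.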
Task 2: Fashion Item Classifier (Medium)
- Model: Small CNN (~200k params)
- Dataset: FashionMNIST 8k subset (6.5k train / 1.5k val)
- Budget: 80 epochs
- Goal: Maximize accuracy while keeping overfitting gap < 5%
- Grading: 60% accuracy score + 40% generalization score
Task 3: CIFAR-10 Under Budget (Hard)
- Model: Deeper CNN (~500k params)
- Dataset: CIFAR-10 10k subset (8k train / 2k val)
- Budget: 60 epochs
- Goal: Maximize accuracy under tight budget
- Grading: 50% accuracy + 30% efficiency + 20% stability
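The weighted grading used by Tasks 2 and 3 can be illustrated with a small helper. This sketch only shows how the stated weights combine; how each component (accuracy, generalization, efficiency, stability) is computed is not specified here, so the components are taken as inputs already scaled to 0.0–1.0. The names `weighted_score`, `MEDIUM_WEIGHTS`, and `HARD_WEIGHTS` are hypothetical.

```python
# Weights as stated in the task descriptions.
MEDIUM_WEIGHTS = {"accuracy": 0.6, "generalization": 0.4}
HARD_WEIGHTS = {"accuracy": 0.5, "efficiency": 0.3, "stability": 0.2}


def weighted_score(components: dict, weights: dict) -> float:
    """Combine component scores into one grade using fixed weights.

    Assumption: every component is already a score in [0, 1]; out-of-range
    values are clamped before weighting.
    """
    if set(components) != set(weights):
        raise ValueError("components and weights must have the same keys")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    return sum(w * max(0.0, min(1.0, components[name])) for name, w in weights.items())
```

For example, a Task 2 run with a perfect accuracy score but a generalization score of 0.5 would grade 0.6 × 1.0 + 0.4 × 0.5 = 0.8 under this scheme.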
Action Space (MCP Tools)
| Tool | Parameters | Description |
|---|---|---|
| `configure_training` | `optimizer`, `learning_rate`, `batch_size`, `weight_decay`, `dropout`, `lr_schedule`, `warmup_epochs`, `augmentation`, `augmentation_strength` | Set/update training config |
| `run_epochs` | `num_epochs` (1–20) | Run N epochs of real PyTorch training |
| `adjust_learning_rate` | `new_lr` | Change LR mid-training |
| `toggle_augmentation` | `enabled`, `strength` | Toggle data augmentation |
| `get_training_status` | none | Query current metrics |
| `submit_model` | none | Submit for final grading |
Configuration Options
Optimizers: sgd (with momentum=0.9), adam, adamw
LR Schedules: constant, step (decay by 0.1 every T/3 epochs), cosine (cosine annealing), warmup_cosine (linear warmup + cosine)
Regularization: weight_decay (L2), dropout (0.0–0.5), augmentation (random transforms)
Batch Sizes: 32, 64, 128, 256
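The four LR schedules can be sketched as a multiplier applied to the base learning rate. This is an illustration under stated assumptions, not the code in `trainer.py`: `lr_multiplier` is a hypothetical name, epochs are 0-indexed, and T is taken to mean the total number of epochs in the run.

```python
import math


def lr_multiplier(schedule: str, epoch: int, total_epochs: int, warmup_epochs: int = 0) -> float:
    """Return the factor applied to the base LR at a given 0-indexed epoch."""
    if schedule == "constant":
        return 1.0
    if schedule == "step":
        # Decay by 0.1 every T/3 epochs (assumption: T = total_epochs).
        step_size = max(1, total_epochs // 3)
        return 0.1 ** (epoch // step_size)
    if schedule == "cosine":
        # Cosine annealing from 1.0 toward 0.0 over the whole run.
        return 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    if schedule == "warmup_cosine":
        if epoch < warmup_epochs:
            # Linear warmup: ramp up to the full base LR.
            return (epoch + 1) / warmup_epochs
        progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    raise ValueError(f"unknown schedule: {schedule}")
```

With a 90-epoch run, the step schedule in this sketch drops the LR tenfold at epochs 30 and 60, while the cosine schedule reaches half the base LR at epoch 45.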
Observation Space
After each action, the agent receives:
{
  "current_epoch": 30,
  "max_epochs": 100,
  "remaining_budget": 70,
  "train_loss": 0.342,
  "val_loss": 0.401,
  "train_accuracy": 0.891,
  "val_accuracy": 0.864,
  "best_val_accuracy": 0.871,
  "best_val_epoch": 25,
  "loss_history_last_10": [0.45, 0.43, ...],
  "val_loss_history_last_10": [0.52, 0.49, ...],
  "convergence_signal": "improving",
  "is_diverged": false
}
Convergence signals: not_started, warming_up, improving, plateaued, overfitting, stalling, diverged
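To show how an agent might act on an observation, here is a small rule-based policy that maps the convergence signal to one of the MCP tools. It is a hypothetical heuristic, not the baseline in `inference.py`: the function name `choose_action`, the LR scaling factors, the chunk size of 5 epochs, and the augmentation strength of 0.5 are all illustrative assumptions. The observation does not include the current learning rate, so the caller passes it in.

```python
def choose_action(obs: dict, current_lr: float) -> tuple:
    """Pick the next tool call (name, arguments) from a single observation."""
    remaining = obs["remaining_budget"]
    signal = obs["convergence_signal"]

    if remaining <= 0:
        # No epochs left: the only useful move is to submit.
        return "submit_model", {}
    if obs["is_diverged"] or signal == "diverged":
        # Assumed recovery: cut the learning rate by 10x.
        return "adjust_learning_rate", {"new_lr": current_lr * 0.1}
    if signal == "overfitting":
        # Assumed strength value; the valid range is not documented here.
        return "toggle_augmentation", {"enabled": True, "strength": 0.5}
    if signal in ("plateaued", "stalling"):
        # Assumed response: halve the learning rate.
        return "adjust_learning_rate", {"new_lr": current_lr * 0.5}
    # not_started, warming_up, improving: keep training in small chunks.
    return "run_epochs", {"num_epochs": min(5, remaining)}
```

A real loop would follow any configuration change with a `run_epochs` call before consulting the signal again, since the signal only updates after more training.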
Reward Function
Rewards are issued at every step, not just at the end:
- Progress reward: +0.3 × accuracy improvement above previous best
- Convergence reward: +0.05 for decreasing validation loss
- Divergence penalty: −0.2 if training diverges
- Overfitting penalty: −0.05 × excess when gap > 8%
- Submission bonus: Final grader score (0.0–1.0) added on submit
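The per-step terms above can be combined as in the following sketch. It is an approximation under stated assumptions rather than the environment's reward code: `step_reward` is a hypothetical name, accuracies are fractions in [0, 1], the gap is train accuracy minus validation accuracy, and "excess" is taken to mean the amount by which the gap exceeds 0.08. The submission bonus is left out because it depends on the task grader.

```python
def step_reward(
    prev_best_acc: float,
    val_acc: float,
    prev_val_loss: float,
    val_loss: float,
    train_acc: float,
    diverged: bool,
    gap_threshold: float = 0.08,
) -> float:
    """Sum the per-step reward terms for one training step."""
    reward = 0.0
    # Progress: only improvement beyond the previous best counts.
    reward += 0.3 * max(0.0, val_acc - prev_best_acc)
    # Convergence: flat bonus when validation loss went down.
    if val_loss < prev_val_loss:
        reward += 0.05
    # Divergence: flat penalty.
    if diverged:
        reward -= 0.2
    # Overfitting: penalize the part of the gap above the threshold (assumed meaning of "excess").
    gap = train_acc - val_acc
    if gap > gap_threshold:
        reward -= 0.05 * (gap - gap_threshold)
    return reward
```

For instance, raising the best validation accuracy from 0.80 to 0.90 while validation loss falls yields 0.3 × 0.10 + 0.05 = 0.08 in this sketch.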
Setup & Usage
Install
uv sync
Run the server locally
uvicorn server.app:app --host 0.0.0.0 --port 8000
Run with Docker
docker build -f server/Dockerfile -t ml-trainer-env .
docker run -p 8000:8000 ml-trainer-env
Run the baseline inference
This baseline uses the OpenAI API by default. With the default LLM_RPM_LIMIT=5, it spaces requests to stay under free-tier quotas and uses a small, quota-aware decision budget per task.
Recommended .env:
OPENAI_API_KEY=sk-proj-...
MODEL_NAME=gpt-4o-mini
ENV_URL=http://localhost:8000
LLM_RPM_LIMIT=5
LLM_MAX_RETRIES=3
LLM_REASONING_EFFORT=minimal
LLM_MAX_STEPS_EASY=5
LLM_MAX_STEPS_MEDIUM=6
LLM_MAX_STEPS_HARD=7
Then run:
export ENV_URL=http://localhost:8000
uv run inference.py
The script uses the OpenAI Python client against the official OpenAI API by default. You can also point it at other OpenAI-compatible providers (such as OpenRouter or Gemini) by setting the corresponding API_BASE_URL together with OPENROUTER_API_KEY or GEMINI_API_KEY in your .env.
Required environment variables
| Variable | Required | Description |
|---|---|---|
| API_BASE_URL | no (defaults to OpenAI) | LLM API endpoint |
| MODEL_NAME | yes | Model identifier (default: gpt-4o-mini) |
| ENV_URL | yes | URL of the running OpenEnv environment |
| OPENAI_API_KEY | yes | Auth for OpenAI (preferred) |
| OPENROUTER_API_KEY or GEMINI_API_KEY | only with an alternative provider | Fallback auth for alternative providers |
| HF_TOKEN | needed for HF deployment workflows | Hugging Face auth token |
Optional inference tuning variables
| Variable | Default | Purpose |
|---|---|---|
| `LLM_RPM_LIMIT` | `5` | Hard request cap (requests per minute) used by the scheduler |
| `LLM_MAX_RETRIES` | `3` | Rate-limit retries per model request |
| `LLM_REASONING_EFFORT` | `minimal` | Gemini reasoning effort |
| `LLM_MAX_STEPS_EASY` | `5` | Max model decisions for easy_mnist |
| `LLM_MAX_STEPS_MEDIUM` | `6` | Max model decisions for medium_fashion |
| `LLM_MAX_STEPS_HARD` | `7` | Max model decisions for hard_cifar |
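As an illustration of how these variables might be consumed, the sketch below reads them with the documented defaults and derives the spacing between requests implied by the RPM cap. It is not taken from `inference.py`: `InferenceConfig`, `load_config`, and `min_request_interval` are hypothetical names, and even spacing (60 / RPM seconds) is an assumption about how the scheduler stays under the cap.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceConfig:
    rpm_limit: int
    max_retries: int
    max_steps: dict  # task id -> max model decisions


def load_config(env=None) -> InferenceConfig:
    """Read tuning variables from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    return InferenceConfig(
        rpm_limit=int(env.get("LLM_RPM_LIMIT", "5")),
        max_retries=int(env.get("LLM_MAX_RETRIES", "3")),
        max_steps={
            "easy_mnist": int(env.get("LLM_MAX_STEPS_EASY", "5")),
            "medium_fashion": int(env.get("LLM_MAX_STEPS_MEDIUM", "6")),
            "hard_cifar": int(env.get("LLM_MAX_STEPS_HARD", "7")),
        },
    )


def min_request_interval(rpm_limit: int) -> float:
    """Seconds to wait between requests so that at most rpm_limit are sent per minute."""
    if rpm_limit <= 0:
        raise ValueError("rpm_limit must be positive")
    return 60.0 / rpm_limit
```

With the default cap of 5 requests per minute, evenly spaced requests would be 12 seconds apart, which is why the per-task decision budgets are kept small.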
Interact via Python client
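from ml_trainer_env import MLTrainerEnv

with MLTrainerEnv(base_url="http://localhost:8000") as env:
    # Start a fresh episode on the easy task and list the available MCP tools
    env.reset(task_id="easy_mnist")
    tools = env.list_tools()

    # Set the initial training configuration
    result = env.call_tool("configure_training",
                           optimizer="adam", learning_rate=0.001, batch_size=64)

    # Train for 10 epochs and inspect the returned metrics
    result = env.call_tool("run_epochs", num_epochs=10)
    print(result)  # Real training metrics!

    # Submit the model for final grading
    result = env.call_tool("submit_model")
    print(result)  # Final score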
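A slightly longer interaction trains in chunks until the epoch budget runs out and then submits. The sketch below relies only on the `call_tool` method shown above, but it makes an unverified assumption: that `call_tool` returns a dict containing the observation fields `remaining_budget` and `is_diverged`. The function name `run_budgeted` and its parameters are hypothetical.

```python
def run_budgeted(env, chunk: int = 10, max_calls: int = 50):
    """Run training in chunks of at most `chunk` epochs, then submit.

    Assumption: tool results are dicts exposing `remaining_budget` and
    `is_diverged`, as in the documented observation.
    """
    status = env.call_tool("get_training_status")
    for _ in range(max_calls):  # hard cap so the loop always terminates
        remaining = status["remaining_budget"]
        if remaining <= 0 or status["is_diverged"]:
            break
        # Never request more epochs than the budget allows.
        status = env.call_tool("run_epochs", num_epochs=min(chunk, remaining))
    return env.call_tool("submit_model")
```

Keeping `chunk` within the 1–20 range accepted by `run_epochs` leaves room to inspect metrics between chunks instead of spending the whole budget in a single call.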
Baseline Scores
| Task | Expected Score Range | Notes |
|---|---|---|
| easy_mnist | 0.6–0.9 | Most models solve this well |
| medium_fashion | 0.4–0.7 | Requires regularization awareness |
| hard_cifar | 0.2–0.5 | Genuinely challenging under budget |
Scores are given as expected ranges rather than fixed values: although seeds and data subsets are deterministic, every run performs real training driven by the agent's own decisions.
Architecture
openenv-hack/
├── __init__.py                      # Package exports
├── models.py                        # Pydantic Action/Observation models
├── client.py                        # MCPToolClient subclass
├── openenv.yaml                     # OpenEnv manifest
├── pyproject.toml                   # Dependencies
├── inference.py                     # Baseline inference script
├── README.md
├── server/
│   ├── app.py                       # FastAPI server
│   ├── ml_trainer_environment.py    # MCPEnvironment with tools
│   ├── trainer.py                   # Real PyTorch training engine
│   ├── models_nn.py                 # Neural network architectures
│   ├── datasets.py                  # Dataset loading & subsetting
│   ├── tasks.py                     # Task definitions & graders
│   └── Dockerfile
└── outputs/
    ├── logs/
    └── evals/
Technical Details
- Real training: Actual PyTorch forward/backward passes on CPU
- Deterministic: `torch.manual_seed()` ensures reproducible results
- Constrained: `torch.set_num_threads(2)` matches 2 vCPU limit
- Fast: ~0.5–3s per epoch depending on task
- Pre-cached: Datasets downloaded at Docker build time
- Quota-aware baseline: `inference.py` is optimized for low-RPM Gemini quotas and uses function calling with compact state summaries