# Reward Model Environment
An environment that uses an external reward model hosted via vLLM to train LLMs. This environment communicates with a reward model API, formats conversations using chat templates, batches requests for efficiency, and includes retry logic for robustness.
## Features
- External Reward Model Integration: Connects to reward models hosted via vLLM's `/classify` endpoint
- Automatic Model Discovery: Fetches the reward model name from `/v1/models`
- Batched Requests: Sends all rollouts in a single batch request for efficiency
- Retry Logic: Automatically retries failed requests with exponential backoff
- Chat Template Support: Properly formats conversations using tokenizer chat templates
- Sanity Checks: Logs statistics and warnings for reward values to ensure proper scaling
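As a rough illustration of the retry behavior listed above, here is a minimal sketch; the function name, defaults, and broad `except Exception` are illustrative, not the environment's actual internals:

```python
import time

def call_with_retries(fn, max_retries=3, retry_delay=1.0):
    """Call fn(), retrying failures with exponential backoff.

    Hypothetical sketch: the delay doubles after each failed
    attempt (retry_delay, 2*retry_delay, 4*retry_delay, ...).
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            time.sleep(retry_delay * (2 ** attempt))
```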
## Installation

```bash
uv run vf-install reward-model-env
```
## Usage

### Basic Example
```python
import verifiers as vf

# Load the environment
vf_env = vf.load_environment(
    "reward-model-env",
    dataset_name="gsm8k",  # HF dataset with 'prompt' or 'question' column
    dataset_config="main",  # Optional: dataset config name (required for some datasets)
    reward_model_url="http://localhost:8002",  # URL where your reward model is hosted
    tokenizer_path="./tokenizer.json",  # Optional: path to tokenizer for chat template
    num_train_examples=100,  # Optional: limit training examples
)

# Evaluate with an OpenAI-compatible model
from openai import AsyncOpenAI

results = await vf_env.evaluate(
    client=AsyncOpenAI(base_url="http://localhost:8000/v1"),
    model="your-model",
    num_examples=10,
    rollouts_per_example=1,
)
```
See `example.py` for a complete working example.
## Environment Variables

Set `REWARD_MODEL_URL` to avoid passing it as an argument:

```bash
export REWARD_MODEL_URL="http://localhost:8002"
```
## Reward Model Setup

This environment expects a reward model hosted via vLLM with the classification API enabled. Example setup:

```bash
# Start vLLM with a reward model
vllm serve Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 \
    --port 8002 \
    --enable-classification
```
## API Format
The environment expects the following API endpoints:
### /v1/models (GET)

Returns available models:

```json
{
  "data": [
    {"id": "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2"}
  ]
}
```
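Model discovery against this endpoint can be sketched with the standard library alone; `discover_model` and `first_model_id` are hypothetical helper names, assuming the response shape shown above:

```python
import json
from urllib.request import urlopen

def first_model_id(payload: dict) -> str:
    """Pick the first model id out of a /v1/models-style response."""
    return payload["data"][0]["id"]

def discover_model(base_url: str) -> str:
    """GET {base_url}/v1/models and return the first advertised model."""
    with urlopen(f"{base_url}/v1/models") as resp:
        return first_model_id(json.load(resp))
```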
### /classify (POST)

Request:

```json
{
  "model": "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2",
  "input": [
    "<s>[INST]question[/INST]answer</s>"
  ]
}
```
Response:
```json
{
  "data": [
    {
      "index": 0,
      "label": "LABEL_0",
      "probs": [0.85],
      "num_classes": 1
    }
  ]
}
```
The `probs[0]` value is used as the reward.
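A minimal client for this endpoint might look like the following sketch; `classify` and `extract_rewards` are hypothetical names, and the request/response shapes are assumed to match the examples above:

```python
import json
from urllib.request import Request, urlopen

def extract_rewards(response: dict) -> list:
    """Return probs[0] for each result, ordered by the 'index' field."""
    items = sorted(response["data"], key=lambda d: d["index"])
    return [item["probs"][0] for item in items]

def classify(base_url: str, model: str, inputs: list) -> list:
    """POST a batch of formatted conversations to /classify."""
    req = Request(
        f"{base_url}/classify",
        data=json.dumps({"model": model, "input": inputs}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        return extract_rewards(json.load(resp))
```

Sorting by `index` guards against results arriving out of order relative to the batched inputs.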
## Chat Template Formatting
The environment properly formats multi-turn conversations for the reward model:
```python
# Input conversation
[
    {"role": "user", "content": "lets do python coding"},
    {"role": "assistant", "content": "Sure! How'd you like to get started?"},
]

# Formatted output (using Llama-style template)
"<s>[INST]lets do python coding[/INST]Sure! How'd you like to get started?</s>"
```
If you provide a `tokenizer_path`, the environment uses the tokenizer's native chat template. Otherwise, it falls back to a simple Llama-style format.
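The fallback path can be sketched as a small pure function; `format_llama_style` is a hypothetical name, and the real environment may handle system messages or other roles differently:

```python
def format_llama_style(messages):
    """Fallback Llama-style formatting when no tokenizer is provided.

    Sketch only: wraps user turns in [INST]...[/INST] and appends
    assistant turns verbatim; other roles are ignored here.
    """
    out = "<s>"
    for msg in messages:
        if msg["role"] == "user":
            out += f"[INST]{msg['content']}[/INST]"
        elif msg["role"] == "assistant":
            out += msg["content"]
    return out + "</s>"
```

With a `tokenizer_path`, the equivalent step would instead call the tokenizer's own chat template (e.g. `apply_chat_template` in Hugging Face tokenizers).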
## Configuration Options

- `dataset_name` (str): Hugging Face dataset name
- `reward_model_url` (str): Base URL for the reward model API
- `dataset_config` (str | None): Dataset config name (e.g., `"main"` for gsm8k; optional)
- `tokenizer_path` (str | None): Path to `tokenizer.json` for chat template formatting
- `system_prompt` (str): System prompt for the environment (default: `"You are a helpful assistant."`)
- `num_train_examples` (int): Number of training examples (`-1` for all)
- `num_eval_examples` (int): Number of eval examples (`-1` for all)
- `max_retries` (int): Maximum retry attempts for API calls (default: `3`)
- `retry_delay` (float): Base delay between retries in seconds (default: `1.0`)
- `timeout` (float): Request timeout in seconds (default: `120.0`)
## Sanity Checks
The environment includes several sanity checks:
- Reward Range Logging: Logs min, max, mean, and median rewards for each batch
- Small Value Warnings: Warns if rewards are extremely small (< 1e-10) to help detect truncation issues
- Response Validation: Ensures the API response structure is correct and matches the input
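The statistics side of these checks can be approximated as below; `log_reward_stats` and the exact fields are illustrative, with the `1e-10` threshold taken from the warning described above:

```python
import statistics

def log_reward_stats(rewards, small_threshold=1e-10):
    """Summarize a batch of rewards and warn on suspiciously small values."""
    stats = {
        "min": min(rewards),
        "max": max(rewards),
        "mean": statistics.mean(rewards),
        "median": statistics.median(rewards),
    }
    if stats["max"] < small_threshold:
        # All rewards are effectively zero: likely truncation or scaling bug
        print("WARNING: rewards < 1e-10; check input truncation and scaling")
    return stats
```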
## Training Example

Use with `vf-rl` for reinforcement learning:
```toml
# configs/rl/reward_model.toml
model = "Qwen/Qwen3-4B-Instruct-2507"

[env]
id = "reward-model-env"
reward_model_url = "http://localhost:8002"
dataset_name = "your-dataset"
tokenizer_path = "./tokenizer.json"

[inference]
gpus = 1

[trainer]
gpus = 1
use_lora = true
learning_rate = 1e-5
max_steps = 100
```

Then run:

```bash
uv run vf-rl @ configs/rl/reward_model.toml
```
## Troubleshooting

### Connection Issues
- Ensure your reward model is running and accessible at the specified URL
- Check firewall settings if connecting to a remote server
- Verify that the `/v1/models` endpoint returns valid data
### Reward Scaling
- Check the logged reward statistics to ensure values are in the expected range
- If rewards are too small, they might not provide sufficient training signal
- Consider normalizing or scaling rewards based on your use case
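One common normalization option (not built into the environment) is a per-batch z-score; `zscore_normalize` is a hypothetical helper:

```python
import statistics

def zscore_normalize(rewards):
    """Shift a batch of rewards to mean 0 and (population) std 1.

    Illustrative only: a constant batch has zero std, so we fall
    back to dividing by 1.0 to avoid a ZeroDivisionError.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]
```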
### Chat Template Issues
- If using a tokenizer, ensure it has a chat template defined
- The fallback simple formatting works for most Llama-style models
- Check the logged sample conversation to verify formatting is correct