
Reward Model Environment

An environment that scores rollouts with an external reward model hosted via vLLM, for use in LLM training. It communicates with the reward model API, formats conversations using chat templates, batches requests for efficiency, and includes retry logic for robustness.

Features

  • External Reward Model Integration: Connects to reward models hosted via vLLM's /classify endpoint
  • Automatic Model Discovery: Fetches the reward model name from /v1/models
  • Batched Requests: Sends all rollouts in a single batch request for efficiency
  • Retry Logic: Automatically retries failed requests with exponential backoff
  • Chat Template Support: Properly formats conversations using tokenizer chat templates
  • Sanity Checks: Logs statistics and warnings for reward values to ensure proper scaling

Installation

uv run vf-install reward-model-env

Usage

Basic Example

import verifiers as vf

# Load the environment
vf_env = vf.load_environment(
    "reward-model-env",
    dataset_name="gsm8k",  # HF dataset with 'prompt' or 'question' column
    dataset_config="main",  # Optional: dataset config name (required for some datasets)
    reward_model_url="http://localhost:8002",  # URL where your reward model is hosted
    tokenizer_path="./tokenizer.json",  # Optional: path to tokenizer for chat template
    num_train_examples=100,  # Optional: limit training examples
)

# Evaluate with an OpenAI-compatible model
from openai import AsyncOpenAI

results = await vf_env.evaluate(
    client=AsyncOpenAI(base_url="http://localhost:8000/v1"),
    model="your-model",
    num_examples=10,
    rollouts_per_example=1,
)

See example.py for a complete working example.

Environment Variables

Set REWARD_MODEL_URL to avoid passing it as an argument:

export REWARD_MODEL_URL="http://localhost:8002"

Reward Model Setup

This environment expects a reward model hosted via vLLM with the classification API enabled. Example setup:

# Start vLLM with a reward model
vllm serve Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 \
  --port 8002 \
  --enable-classification

API Format

The environment expects the following API endpoints:

/v1/models (GET)

Returns available models:

{
  "data": [
    {"id": "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2"}
  ]
}
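Model discovery amounts to taking the first id from that payload. A minimal sketch of the parsing step (the helper name is hypothetical, not the environment's internal API):

```python
# Hypothetical helper: pick the reward model id out of a /v1/models payload.
def first_model_id(payload: dict) -> str:
    models = payload.get("data", [])
    if not models:
        raise ValueError("no models returned by /v1/models")
    return models[0]["id"]

payload = {"data": [{"id": "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2"}]}
model_id = first_model_id(payload)
```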

/classify (POST)

Request:

{
  "model": "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2",
  "input": [
    "<s>[INST]question[/INST]answer</s>"
  ]
}

Response:

{
  "data": [
    {
      "index": 0,
      "label": "LABEL_0",
      "probs": [0.85],
      "num_classes": 1
    }
  ]
}

The probs[0] value is used as the reward.
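Extracting rewards from a batched response then reduces to validating the result count and ordering by index. A sketch under those assumptions (function name hypothetical):

```python
# Hypothetical parser: pull per-rollout rewards out of a /classify response,
# checking that the response covers every batched input.
def rewards_from_response(response: dict, num_inputs: int) -> list[float]:
    data = response.get("data", [])
    if len(data) != num_inputs:
        raise ValueError(f"expected {num_inputs} results, got {len(data)}")
    # order by index so rewards line up with the batched inputs
    ordered = sorted(data, key=lambda d: d["index"])
    return [d["probs"][0] for d in ordered]

response = {
    "data": [
        {"index": 1, "label": "LABEL_0", "probs": [0.40], "num_classes": 1},
        {"index": 0, "label": "LABEL_0", "probs": [0.85], "num_classes": 1},
    ]
}
rewards = rewards_from_response(response, num_inputs=2)
```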

Chat Template Formatting

The environment properly formats multi-turn conversations for the reward model:

# Input conversation
[
  {"role": "user", "content": "lets do python coding"},
  {"role": "assistant", "content": "Sure! How'd you like to get started?"}
]

# Formatted output (using Llama-style template)
"<s>[INST]lets do python coding[/INST]Sure! How'd you like to get started?</s>"

If you provide a tokenizer_path, it will use the tokenizer's native chat template. Otherwise, it falls back to a simple Llama-style format.
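The fallback formatting above can be sketched as a small pure function; this is an illustrative reconstruction of the simple Llama-style format, not the environment's actual code (handling of system messages here is an assumption):

```python
# Hypothetical fallback formatter mirroring the simple Llama-style template:
# user turns are wrapped in [INST]...[/INST], assistant turns follow verbatim.
def format_llama(messages: list[dict]) -> str:
    parts = ["<s>"]
    for m in messages:
        if m["role"] == "user":
            parts.append(f"[INST]{m['content']}[/INST]")
        elif m["role"] == "assistant":
            parts.append(m["content"])
    parts.append("</s>")
    return "".join(parts)

conversation = [
    {"role": "user", "content": "lets do python coding"},
    {"role": "assistant", "content": "Sure! How'd you like to get started?"},
]
formatted = format_llama(conversation)
```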

Configuration Options

  • dataset_name (str): Hugging Face dataset name
  • reward_model_url (str): Base URL for the reward model API
  • dataset_config (str | None): Dataset config name (e.g., "main" for gsm8k, optional)
  • tokenizer_path (str | None): Path to tokenizer.json for chat template formatting
  • system_prompt (str): System prompt for the environment (default: "You are a helpful assistant.")
  • num_train_examples (int): Number of training examples (-1 for all)
  • num_eval_examples (int): Number of eval examples (-1 for all)
  • max_retries (int): Maximum retry attempts for API calls (default: 3)
  • retry_delay (float): Base delay between retries in seconds (default: 1.0)
  • timeout (float): Request timeout in seconds (default: 120.0)
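The max_retries and retry_delay options drive exponential backoff on failed API calls. A minimal sketch of that pattern, assuming the delay doubles after each failed attempt (wrapper name hypothetical):

```python
import time

# Hypothetical retry wrapper matching max_retries / retry_delay semantics:
# the sleep doubles after each failed attempt (exponential backoff).
def with_retries(fn, max_retries: int = 3, retry_delay: float = 1.0):
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts, surface the error
            time.sleep(retry_delay * (2 ** attempt))

attempts = {"count": 0}

def flaky():
    # fails twice, then succeeds, to exercise the backoff path
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient failure")
    return 42

result = with_retries(flaky, max_retries=3, retry_delay=0.01)
```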

Sanity Checks

The environment includes several sanity checks:

  1. Reward Range Logging: Logs min, max, mean, and median rewards for each batch
  2. Small Value Warnings: Warns if rewards are extremely small (< 1e-10) to help detect truncation issues
  3. Response Validation: Ensures the API response structure is correct and matches the input
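Checks 1 and 2 above can be sketched as a small statistics pass over each reward batch (function name and return shape are illustrative, not the environment's internal API):

```python
import statistics

# Hypothetical sketch of the reward sanity checks: batch statistics plus a
# warning flag for suspiciously tiny values that may indicate truncation.
def reward_stats(rewards: list[float], tiny: float = 1e-10) -> dict:
    return {
        "min": min(rewards),
        "max": max(rewards),
        "mean": statistics.mean(rewards),
        "median": statistics.median(rewards),
        "warn_tiny": any(abs(r) < tiny for r in rewards),
    }

stats = reward_stats([0.12, 0.85, 0.47, 0.93])
```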

Training Example

Use with vf-rl for reinforcement learning:

# configs/rl/reward_model.toml
model = "Qwen/Qwen3-4B-Instruct-2507"

[env]
id = "reward-model-env"
reward_model_url = "http://localhost:8002"
dataset_name = "your-dataset"
tokenizer_path = "./tokenizer.json"

[inference]
gpus = 1

[trainer]
gpus = 1
use_lora = true
learning_rate = 1e-5
max_steps = 100

Then launch training:

uv run vf-rl @ configs/rl/reward_model.toml

Troubleshooting

Connection Issues

  • Ensure your reward model is running and accessible at the specified URL
  • Check firewall settings if connecting to a remote server
  • Verify the /v1/models endpoint returns valid data

Reward Scaling

  • Check the logged reward statistics to ensure values are in the expected range
  • If rewards are too small, they might not provide sufficient training signal
  • Consider normalizing or scaling rewards based on your use case

Chat Template Issues

  • If using a tokenizer, ensure it has a chat template defined
  • The fallback simple formatting works for most Llama-style models
  • Check the logged sample conversation to verify formatting is correct