
Reward Model Environment

An environment that scores rollouts with an external reward model hosted via vLLM, for use in LLM training. It communicates with the reward model API, formats conversations using chat templates, batches requests for efficiency, and includes retry logic for robustness.

Features

  • External Reward Model Integration: Connects to reward models hosted via vLLM's /classify endpoint
  • Automatic Model Discovery: Fetches the reward model name from /v1/models
  • Batched Requests: Sends all rollouts in a single batch request for efficiency
  • Retry Logic: Automatically retries failed requests with exponential backoff
  • Chat Template Support: Properly formats conversations using tokenizer chat templates
  • Sanity Checks: Logs statistics and warnings for reward values to ensure proper scaling

Installation

uv run vf-install reward-model-env

Usage

Basic Example

import verifiers as vf

# Load the environment
vf_env = vf.load_environment(
    "reward-model-env",
    dataset_name="gsm8k",  # HF dataset with 'prompt' or 'question' column
    dataset_config="main",  # Optional: dataset config name (required for some datasets)
    reward_model_url="http://localhost:8002",  # URL where your reward model is hosted
    tokenizer_path="./tokenizer.json",  # Optional: path to tokenizer for chat template
    num_train_examples=100,  # Optional: limit training examples
)

# Evaluate with an OpenAI-compatible model
from openai import AsyncOpenAI

results = await vf_env.evaluate(
    client=AsyncOpenAI(base_url="http://localhost:8000/v1"),
    model="your-model",
    num_examples=10,
    rollouts_per_example=1,
)

See example.py for a complete working example.

Environment Variables

Set REWARD_MODEL_URL to avoid passing it as an argument:

export REWARD_MODEL_URL="http://localhost:8002"

Reward Model Setup

This environment expects a reward model hosted via vLLM with the classification API enabled. Example setup:

# Start vLLM with a reward model
vllm serve Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 \
  --port 8002 \
  --enable-classification

API Format

The environment expects the following API endpoints:

/v1/models (GET)

Returns available models:

{
  "data": [
    {"id": "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2"}
  ]
}
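Model discovery amounts to taking the first id from that payload. A minimal sketch of the parsing step (the helper name is hypothetical, not the environment's internal API):

```python
# Hypothetical helper: pick the reward model id out of a /v1/models payload.
def first_model_id(payload: dict) -> str:
    models = payload.get("data", [])
    if not models:
        raise ValueError("no models returned by /v1/models")
    return models[0]["id"]

payload = {"data": [{"id": "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2"}]}
model_id = first_model_id(payload)
```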

/classify (POST)

Request:

{
  "model": "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2",
  "input": [
    "<s>[INST]question[/INST]answer</s>"
  ]
}

Response:

{
  "data": [
    {
      "index": 0,
      "label": "LABEL_0",
      "probs": [0.85],
      "num_classes": 1
    }
  ]
}

The probs[0] value is used as the reward.
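Extracting rewards from a batched response then reduces to validating the result count and ordering by index. A sketch under those assumptions (function name hypothetical):

```python
# Hypothetical parser: pull per-rollout rewards out of a /classify response,
# checking that the response covers every batched input.
def rewards_from_response(response: dict, num_inputs: int) -> list[float]:
    data = response.get("data", [])
    if len(data) != num_inputs:
        raise ValueError(f"expected {num_inputs} results, got {len(data)}")
    # order by index so rewards line up with the batched inputs
    ordered = sorted(data, key=lambda d: d["index"])
    return [d["probs"][0] for d in ordered]

response = {
    "data": [
        {"index": 1, "label": "LABEL_0", "probs": [0.40], "num_classes": 1},
        {"index": 0, "label": "LABEL_0", "probs": [0.85], "num_classes": 1},
    ]
}
rewards = rewards_from_response(response, num_inputs=2)
```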

Chat Template Formatting

The environment properly formats multi-turn conversations for the reward model:

# Input conversation
[
  {"role": "user", "content": "lets do python coding"},
  {"role": "assistant", "content": "Sure! How'd you like to get started?"}
]

# Formatted output (using Llama-style template)
"<s>[INST]lets do python coding[/INST]Sure! How'd you like to get started?</s>"

If you provide a tokenizer_path, it will use the tokenizer's native chat template. Otherwise, it falls back to a simple Llama-style format.
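The fallback formatting above can be sketched as a small pure function; this is an illustrative reconstruction of the simple Llama-style format, not the environment's actual code (handling of system messages here is an assumption):

```python
# Hypothetical fallback formatter mirroring the simple Llama-style template:
# user turns are wrapped in [INST]...[/INST], assistant turns follow verbatim.
def format_llama(messages: list[dict]) -> str:
    parts = ["<s>"]
    for m in messages:
        if m["role"] == "user":
            parts.append(f"[INST]{m['content']}[/INST]")
        elif m["role"] == "assistant":
            parts.append(m["content"])
    parts.append("</s>")
    return "".join(parts)

conversation = [
    {"role": "user", "content": "lets do python coding"},
    {"role": "assistant", "content": "Sure! How'd you like to get started?"},
]
formatted = format_llama(conversation)
```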

Configuration Options

  • dataset_name (str): Hugging Face dataset name
  • reward_model_url (str): Base URL for the reward model API
  • dataset_config (str | None): Dataset config name (e.g., "main" for gsm8k, optional)
  • tokenizer_path (str | None): Path to tokenizer.json for chat template formatting
  • system_prompt (str): System prompt for the environment (default: "You are a helpful assistant.")
  • num_train_examples (int): Number of training examples (-1 for all)
  • num_eval_examples (int): Number of eval examples (-1 for all)
  • max_retries (int): Maximum retry attempts for API calls (default: 3)
  • retry_delay (float): Base delay between retries in seconds (default: 1.0)
  • timeout (float): Request timeout in seconds (default: 120.0)
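The max_retries and retry_delay options drive exponential backoff on failed API calls. A minimal sketch of that pattern, assuming the delay doubles after each failed attempt (wrapper name hypothetical):

```python
import time

# Hypothetical retry wrapper matching max_retries / retry_delay semantics:
# the sleep doubles after each failed attempt (exponential backoff).
def with_retries(fn, max_retries: int = 3, retry_delay: float = 1.0):
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts, surface the error
            time.sleep(retry_delay * (2 ** attempt))

attempts = {"count": 0}

def flaky():
    # fails twice, then succeeds, to exercise the backoff path
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient failure")
    return 42

result = with_retries(flaky, max_retries=3, retry_delay=0.01)
```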

Sanity Checks

The environment includes several sanity checks:

  1. Reward Range Logging: Logs min, max, mean, and median rewards for each batch
  2. Small Value Warnings: Warns if rewards are extremely small (< 1e-10) to help detect truncation issues
  3. Response Validation: Ensures the API response structure is correct and matches the input
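Checks 1 and 2 above can be sketched as a small statistics pass over each reward batch (function name and return shape are illustrative, not the environment's internal API):

```python
import statistics

# Hypothetical sketch of the reward sanity checks: batch statistics plus a
# warning flag for suspiciously tiny values that may indicate truncation.
def reward_stats(rewards: list[float], tiny: float = 1e-10) -> dict:
    return {
        "min": min(rewards),
        "max": max(rewards),
        "mean": statistics.mean(rewards),
        "median": statistics.median(rewards),
        "warn_tiny": any(abs(r) < tiny for r in rewards),
    }

stats = reward_stats([0.12, 0.85, 0.47, 0.93])
```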

Training Example

Use with vf-rl for reinforcement learning:

# configs/rl/reward_model.toml
model = "Qwen/Qwen3-4B-Instruct-2507"

[env]
id = "reward-model-env"
reward_model_url = "http://localhost:8002"
dataset_name = "your-dataset"
tokenizer_path = "./tokenizer.json"

[inference]
gpus = 1

[trainer]
gpus = 1
use_lora = true
learning_rate = 1e-5
max_steps = 100

Then launch training:

uv run vf-rl @ configs/rl/reward_model.toml

Troubleshooting

Connection Issues

  • Ensure your reward model is running and accessible at the specified URL
  • Check firewall settings if connecting to a remote server
  • Verify the /v1/models endpoint returns valid data

Reward Scaling

  • Check the logged reward statistics to ensure values are in the expected range
  • If rewards are too small, they might not provide sufficient training signal
  • Consider normalizing or scaling rewards based on your use case

Chat Template Issues

  • If using a tokenizer, ensure it has a chat template defined
  • The fallback simple formatting works for most Llama-style models
  • Check the logged sample conversation to verify formatting is correct