Reward Modeling

Guide to training reward models with TRL for RLHF pipelines.

Overview

A reward model assigns a scalar score to a completion and is trained so that responses humans prefer score higher. Typical uses:

PPO training (RL feedback)
GRPO online RL
Completion ranking

Basic Training

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardTrainer, RewardConfig
from datasets import load_dataset

# Load model (num_labels=1 for single reward score)
model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    num_labels=1
)
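tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
# Sequence-classification heads need a pad token id to handle padded batches
model.config.pad_token_id = tokenizer.pad_token_id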

# Load preference dataset (chosen/rejected pairs)
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# Configure
config = RewardConfig(
    output_dir="Qwen2.5-Reward",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    learning_rate=1e-5
)

# Train
trainer = RewardTrainer(
    model=model,
    args=config,
    processing_class=tokenizer,
    train_dataset=dataset
)
trainer.train()

Dataset Format

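Each example is a preference pair. With an explicit prompt, the fields are:

{
  "prompt": "Question or instruction",
  "chosen": "Better response",
  "rejected": "Worse response"
}

TRL also accepts the implicit-prompt form, in which "prompt" is omitted and "chosen" and "rejected" each contain the full prompt plus response (trl-lib/ultrafeedback_binarized appears to use this layout, stored as chat messages).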
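As a quick sanity check before training, records can be validated against the explicit-prompt layout. The sketch below is illustrative only: `validate_preference_record` is a hypothetical helper, not part of TRL, and it assumes plain-string fields rather than chat-message lists.

```python
# Hypothetical helper (not a TRL API): checks one explicit-prompt record.
REQUIRED_FIELDS = ("prompt", "chosen", "rejected")


def validate_preference_record(record: dict) -> list:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        # Each field must be a non-empty string in this assumed layout
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    # A pair with identical responses carries no preference signal
    if not problems and record["chosen"] == record["rejected"]:
        problems.append("chosen and rejected are identical")
    return problems


example = {
    "prompt": "Question or instruction",
    "chosen": "Better response",
    "rejected": "Worse response",
}
print(validate_preference_record(example))  # prints []
```

Running a check like this over the whole dataset before calling the trainer surfaces malformed pairs early, which is cheaper than discovering them mid-training.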

Bradley-Terry Loss

Default loss function:

loss = -log(sigmoid(reward_chosen - reward_rejected))

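Minimizing this loss pushes the reward of the chosen response above that of the rejected one. Only the difference between the two rewards matters, not their absolute values.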
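To make the formula concrete, here is a minimal pure-Python sketch of the pairwise loss for a single pair. The function name `bradley_terry_loss` is an illustrative choice, and the numerically stable softplus form is one reasonable way to compute it, not necessarily how TRL does internally.

```python
import math


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Compute -log(sigmoid(reward_chosen - reward_rejected)) for one pair."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow
    # for large negative margins.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


print(round(bradley_terry_loss(0.0, 0.0), 4))  # 0.6931 (log 2: no preference learned)
print(round(bradley_terry_loss(2.0, 0.0), 4))  # 0.1269 (correct ordering, low loss)
print(round(bradley_terry_loss(0.0, 2.0), 4))  # 2.1269 (wrong ordering, high loss)
```

The loss depends only on the margin, so shifting both rewards by the same constant leaves it unchanged.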

Using Reward Models

Inference

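from transformers import pipeline

# Load trained reward model; function_to_apply="none" returns the raw reward
# instead of a sigmoid-squashed value
reward_pipe = pipeline("text-classification", model="Qwen2.5-Reward", function_to_apply="none")

# Score completions (format inputs the same way as the training data)
texts = ["Good answer", "Bad answer"]
scores = reward_pipe(texts)
print(scores)  # Higher score = better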
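For completion ranking, the pipeline output can be sorted by score. The sketch below assumes the usual text-classification output shape (one dict with "label" and "score" keys per input); `rank_completions` is a hypothetical helper, and the scores shown are placeholder values, not real model output.

```python
def rank_completions(texts, pipeline_outputs):
    """Return texts sorted from highest to lowest reward score."""
    # Pair each text with the score the pipeline reported for it
    scored = [(out["score"], text) for text, out in zip(texts, pipeline_outputs)]
    # Sort on the score only, best first
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored]


# Placeholder scores standing in for reward_pipe(texts)
texts = ["Bad answer", "Good answer", "Okay answer"]
fake_outputs = [
    {"label": "LABEL_0", "score": -1.2},
    {"label": "LABEL_0", "score": 2.3},
    {"label": "LABEL_0", "score": 0.4},
]
print(rank_completions(texts, fake_outputs))
# ['Good answer', 'Okay answer', 'Bad answer']
```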

In PPO

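from transformers import AutoModelForSequenceClassification
from trl import PPOTrainer, PPOConfig

config = PPOConfig(
    reward_model_path="Qwen2.5-Reward"  # Use trained reward model
)

# Recent TRL releases expect the reward and value models to be passed in
# explicitly; argument names differ across versions, so check your release.
# policy_model, ref_model, tokenizer and train_dataset are assumed to exist.
reward_model = AutoModelForSequenceClassification.from_pretrained(config.reward_model_path, num_labels=1)
value_model = AutoModelForSequenceClassification.from_pretrained(config.reward_model_path, num_labels=1)

trainer = PPOTrainer(
    args=config,
    processing_class=tokenizer,
    model=policy_model,
    ref_model=ref_model,
    reward_model=reward_model,
    value_model=value_model,
    train_dataset=train_dataset
)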

Hyperparameters

Model Size	Learning Rate	Batch Size	Epochs
<1B	2e-5	4-8	1-2
1-7B	1e-5	2-4	1
7-13B	5e-6	1-2	1
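The table can be encoded as a small lookup for use in training scripts. This is only a sketch: `suggest_hyperparameters` is a hypothetical helper, and the handling of the boundaries (1B and 7B fall into the larger bucket here) is an assumption, since the table leaves them ambiguous.

```python
def suggest_hyperparameters(num_params_billions: float) -> dict:
    """Map a model size (in billions of parameters) to the table's starting values.

    Batch size and epochs are returned as (low, high) ranges.
    """
    if num_params_billions <= 0:
        raise ValueError("model size must be positive")
    if num_params_billions < 1:
        return {"learning_rate": 2e-5, "batch_size": (4, 8), "epochs": (1, 2)}
    if num_params_billions < 7:  # assumption: exactly 1B uses the 1-7B row
        return {"learning_rate": 1e-5, "batch_size": (2, 4), "epochs": (1, 1)}
    if num_params_billions <= 13:  # assumption: exactly 7B uses the 7-13B row
        return {"learning_rate": 5e-6, "batch_size": (1, 2), "epochs": (1, 1)}
    # The table gives no guidance above 13B
    raise ValueError("no suggestion available for models larger than 13B")


print(suggest_hyperparameters(0.5))
# {'learning_rate': 2e-05, 'batch_size': (4, 8), 'epochs': (1, 2)}
```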

Evaluation

Check reward separation:

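import torch

# chosen_inputs / rejected_inputs: tokenized batches of the same prompts
# paired with the chosen and rejected responses
model.eval()
with torch.no_grad():
    chosen_rewards = model(**chosen_inputs).logits.squeeze(-1)
    rejected_rewards = model(**rejected_inputs).logits.squeeze(-1)

# Fraction of pairs where the chosen response scores higher
accuracy = (chosen_rewards > rejected_rewards).float().mean().item()
print(f"Accuracy: {accuracy:.2%}")  # Target: >80%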
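Accuracy alone does not show how far apart the two groups are, so it can help to report the mean margin as well. The sketch below works on plain lists of floats; `separation_report` is a hypothetical helper and the reward values are placeholders, not measured results.

```python
def separation_report(chosen_rewards, rejected_rewards):
    """Summarize pairwise accuracy and mean reward margin for paired scores."""
    if len(chosen_rewards) != len(rejected_rewards) or not chosen_rewards:
        raise ValueError("need two non-empty lists of equal length")
    # Margin per pair: positive means the chosen response scored higher
    margins = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]
    accuracy = sum(m > 0 for m in margins) / len(margins)
    mean_margin = sum(margins) / len(margins)
    return {"accuracy": accuracy, "mean_margin": mean_margin}


# Placeholder rewards for four preference pairs
chosen = [1.5, 0.5, -0.5, 2.0]
rejected = [0.5, 1.0, -1.5, 0.0]
print(separation_report(chosen, rejected))
# {'accuracy': 0.75, 'mean_margin': 0.875}
```

A high accuracy with a near-zero mean margin suggests the model orders pairs correctly but only barely, which can make the reward signal noisy downstream.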

References

InstructGPT paper: https://arxiv.org/abs/2203.02155
TRL docs: https://huggingface.co/docs/trl/reward_trainer
