Reward Modeling
Guide to training reward models with TRL for RLHF pipelines.
Overview
Reward models score completions according to human preferences. They are used in:
- PPO training (RL feedback)
- GRPO online RL
- Completion ranking
Basic Training
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardTrainer, RewardConfig
from datasets import load_dataset
# Load model (num_labels=1 for single reward score)
model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    num_labels=1
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
# Load preference dataset (chosen/rejected pairs)
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
# Configure
config = RewardConfig(
    output_dir="Qwen2.5-Reward",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    learning_rate=1e-5
)
# Train
trainer = RewardTrainer(
    model=model,
    args=config,
    processing_class=tokenizer,
    train_dataset=dataset
)
trainer.train()
Dataset Format
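Each example is a preference pair with these fields:
{
    "prompt": "Question or instruction",
    "chosen": "Better response",
    "rejected": "Worse response"
}
Some TRL versions instead expect the implicit-prompt format, with only `chosen` and `rejected` and the prompt included in both; check the RewardTrainer docs for your version.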
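Malformed pairs are easier to catch before training than after. The following is a minimal sketch of a sanity check, assuming plain-string examples with the three fields listed above; `validate_preference_example` and its `required_fields` parameter are hypothetical helpers, not part of TRL.

```python
def validate_preference_example(example, required_fields=("prompt", "chosen", "rejected")):
    """Return a list of problems found in one preference example (empty if it looks valid)."""
    problems = []
    for field in required_fields:
        value = example.get(field)
        if value is None:
            problems.append(f"missing field: {field}")
        elif isinstance(value, str) and not value.strip():
            problems.append(f"empty field: {field}")
    # A pair with identical responses carries no preference signal
    if not problems and example.get("chosen") == example.get("rejected"):
        problems.append("chosen and rejected are identical")
    return problems


example = {
    "prompt": "Question or instruction",
    "chosen": "Better response",
    "rejected": "Worse response",
}
print(validate_preference_example(example))
```

For an implicit-prompt dataset, the same check can be run with `required_fields=("chosen", "rejected")`.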
Bradley-Terry Loss
Default loss function:
loss = -log(sigmoid(reward_chosen - reward_rejected))
Minimizing this loss pushes the model to assign a higher reward to the chosen response than to the rejected one.
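To make the formula concrete, here is a minimal pure-Python sketch of the loss for a single pair. It is only an illustration of the formula above, not TRL's implementation; the function names and the example reward values are made up.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bradley_terry_loss(reward_chosen, reward_rejected):
    """loss = -log(sigmoid(reward_chosen - reward_rejected)) for one preference pair."""
    return -log_sigmoid(reward_chosen - reward_rejected)


# Illustrative reward values: correct ordering, tie, wrong ordering
for chosen, rejected in [(2.0, -1.0), (0.0, 0.0), (-1.0, 2.0)]:
    loss = bradley_terry_loss(chosen, rejected)
    print(f"chosen={chosen:+.1f} rejected={rejected:+.1f} loss={loss:.4f}")
```

When the two rewards are equal the loss is log 2 (about 0.693). It approaches 0 as the chosen reward pulls ahead and grows roughly linearly with the margin when the ordering is wrong.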
Using Reward Models
Inference
from transformers import pipeline
# Load trained reward model
reward_pipe = pipeline("text-classification", model="Qwen2.5-Reward")
# Score completions
texts = ["Good answer", "Bad answer"]
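# With a single-logit model the pipeline may squash the output with a sigmoid;
# pass function_to_apply="none" to get the raw reward logits instead
scores = reward_pipe(texts)
print(scores)  # One {"label": ..., "score": ...} dict per text; higher score = better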
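The scores can then be used to rank candidate completions. The sketch below assumes the pipeline returns one dict with a `score` key per input text; the scores are hard-coded hypothetical values so the example runs without a model, and `rank_by_score` is an illustrative helper.

```python
def rank_by_score(texts, scores):
    """Pair each text with its reward score and sort best-first."""
    pairs = [(text, result["score"]) for text, result in zip(texts, scores)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Hypothetical pipeline output (not produced by a real model)
texts = ["Good answer", "Bad answer", "Okay answer"]
scores = [
    {"label": "LABEL_0", "score": 0.91},
    {"label": "LABEL_0", "score": 0.12},
    {"label": "LABEL_0", "score": 0.55},
]

ranked = rank_by_score(texts, scores)
best_text, best_score = ranked[0]
print(best_text, best_score)
```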
In PPO
from trl import PPOTrainer, PPOConfig
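config = PPOConfig(
    reward_model_path="Qwen2.5-Reward"  # Path to the trained reward model
)
trainer = PPOTrainer(
    model=policy_model,  # policy_model: the language model being optimized, loaded beforehand
    config=config,
    # PPOTrainer arguments differ across TRL versions; some require passing the
    # reward model (and reference/value models) explicitly. Check the docs for your version.
)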
Hyperparameters
| Model Size | Learning Rate | Batch Size | Epochs |
|---|---|---|---|
| <1B | 2e-5 | 4-8 | 1-2 |
| 1-7B | 1e-5 | 2-4 | 1 |
| 7-13B | 5e-6 | 1-2 | 1 |
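The table can be encoded as a small lookup for use in training scripts. This is a sketch only: `suggest_hyperparameters` is a hypothetical helper, ranges are returned as `(low, high)` tuples, and placing a 7B model in the 1-7B row is an assumption, since the table's boundaries overlap.

```python
def suggest_hyperparameters(num_params_billions):
    """Return the table row for a model size given in billions of parameters."""
    if num_params_billions <= 0:
        raise ValueError("model size must be positive")
    if num_params_billions < 1:
        return {"learning_rate": 2e-5, "batch_size": (4, 8), "epochs": (1, 2)}
    if num_params_billions <= 7:  # Assumption: exactly 7B falls in the 1-7B row
        return {"learning_rate": 1e-5, "batch_size": (2, 4), "epochs": (1, 1)}
    if num_params_billions <= 13:
        return {"learning_rate": 5e-6, "batch_size": (1, 2), "epochs": (1, 1)}
    raise ValueError("the table does not cover models above 13B")


print(suggest_hyperparameters(0.5))
```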
Evaluation
Check reward separation:
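import torch
# chosen_inputs / rejected_inputs: tokenized batches of matching prompt + response pairs
# Chosen should score higher than rejected
with torch.no_grad():
    chosen_rewards = model(**chosen_inputs).logits.squeeze(-1)
    rejected_rewards = model(**rejected_inputs).logits.squeeze(-1)
accuracy = (chosen_rewards > rejected_rewards).float().mean().item()
print(f"Accuracy: {accuracy:.2%}")  # Target: >80%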
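Accuracy alone does not show how far apart the scores are, so it can help to report the mean margin alongside it. The following is a framework-free sketch over plain lists of floats; the helper names and the reward values are hypothetical.

```python
def pairwise_accuracy(chosen_rewards, rejected_rewards):
    """Fraction of pairs where the chosen response scores strictly higher."""
    if len(chosen_rewards) != len(rejected_rewards):
        raise ValueError("reward lists must have the same length")
    if not chosen_rewards:
        raise ValueError("reward lists must not be empty")
    wins = sum(c > r for c, r in zip(chosen_rewards, rejected_rewards))
    return wins / len(chosen_rewards)


def mean_margin(chosen_rewards, rejected_rewards):
    """Average of (chosen - rejected); larger means clearer separation."""
    if len(chosen_rewards) != len(rejected_rewards):
        raise ValueError("reward lists must have the same length")
    if not chosen_rewards:
        raise ValueError("reward lists must not be empty")
    margins = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]
    return sum(margins) / len(margins)


# Hypothetical rewards for four preference pairs
chosen = [1.2, 0.3, -0.5, 2.0]
rejected = [0.4, 0.9, -1.1, 0.5]
print(f"Accuracy: {pairwise_accuracy(chosen, rejected):.2%}")
print(f"Mean margin: {mean_margin(chosen, rejected):.3f}")
```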
References
- InstructGPT paper: https://arxiv.org/abs/2203.02155
- TRL docs: https://huggingface.co/docs/trl/reward_trainer