# Online RL Methods

Guide to online reinforcement learning with PPO, GRPO, RLOO, and OnlineDPO.

## Overview

Online RL generates completions during training and optimizes based on rewards: at each step, the current policy samples completions for a batch of prompts, a reward signal scores them, and the policy is updated to make high-reward completions more likely. The sketch below illustrates this loop.

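A minimal, illustrative sketch of that loop with toy stand-ins (nothing here is TRL API; `generate`, `reward`, and the update step are placeholders):

```python
import random

def generate(prompt, phrases):
    # Stand-in for policy sampling (model.generate in practice)
    return prompt + " " + random.choice(phrases)

def reward(completion):
    # Stand-in for a reward model: here, count unique words
    return len(set(completion.split()))

phrases = ["a b", "a a", "x y z"]
prompts = ["Summarize:", "Explain:"]

for step in range(3):
    # 1. Generate completions online with the current policy
    completions = [generate(p, phrases) for p in prompts]
    # 2. Score them with the reward signal
    rewards = [reward(c) for c in completions]
    # 3. A real trainer (PPO/GRPO/RLOO/OnlineDPO) would update the policy here
    print(step, list(zip(completions, rewards)))
```

Each method below differs mainly in step 3: how the scored completions are turned into a policy update.
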
## PPO (Proximal Policy Optimization)

The classic RL algorithm for LLM alignment: PPO optimizes a clipped surrogate objective with a learned value function, while a KL penalty keeps the policy close to a reference model.

### Basic Usage

```bash
python -m trl.scripts.ppo \
    --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
    --reward_model_path reward-model \
    --dataset_name trl-internal-testing/descriptiveness-sentiment-trl-style \
    --output_dir model-ppo \
    --learning_rate 3e-6 \
    --per_device_train_batch_size 64 \
    --total_episodes 10000 \
    --num_ppo_epochs 4 \
    --kl_coef 0.05
```

### Key Parameters

- `kl_coef`: coefficient of the KL penalty against the reference model (typically 0.05-0.2)
- `num_ppo_epochs`: optimization epochs per batch of generated completions (typically 2-4)
- `cliprange`: clipping range for the policy ratio (typically 0.1-0.3); see the sketch after this list
- `vf_coef`: weight of the value function loss (commonly 0.1)

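To make `cliprange` concrete, here is a minimal sketch of the standard clipped surrogate loss from the PPO paper (simplified and self-contained; not TRL's internal implementation):

```python
import torch

def ppo_policy_loss(logp_new, logp_old, advantages, cliprange=0.2):
    # Probability ratio between the current and the old policy
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    # Clipping removes the incentive to push the ratio outside
    # [1 - cliprange, 1 + cliprange]
    clipped = torch.clamp(ratio, 1 - cliprange, 1 + cliprange) * advantages
    # Pessimistic (elementwise min) bound, negated because we minimize
    return -torch.min(unclipped, clipped).mean()

# Toy values
logp_new = torch.tensor([-1.0, -0.5, -2.0])
logp_old = torch.tensor([-1.2, -0.6, -1.5])
advantages = torch.tensor([0.5, -0.3, 1.0])
print(ppo_policy_loss(logp_new, logp_old, advantages))
```

Larger `cliprange` allows bigger policy updates per batch, at some cost in stability.
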
## GRPO (Group Relative Policy Optimization)

Memory-efficient online RL: GRPO drops PPO's learned value function and instead estimates each completion's advantage from the rewards of the other completions sampled for the same prompt, as sketched below.

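A minimal sketch of that group-relative advantage, following the normalization described in the GRPO paper (simplified; not TRL internals):

```python
import torch

def group_advantages(rewards, num_generations):
    # rewards holds one scalar per completion, grouped by prompt:
    # [prompt0_gen0, ..., prompt0_genN, prompt1_gen0, ...]
    r = rewards.view(-1, num_generations)
    mean = r.mean(dim=1, keepdim=True)
    std = r.std(dim=1, keepdim=True)
    # z-score each completion's reward against its own group
    # (small epsilon avoids division by zero)
    return ((r - mean) / (std + 1e-4)).view(-1)

# Two prompts, four completions each
rewards = torch.tensor([1.0, 3.0, 2.0, 0.0, 4.0, 4.0, 1.0, 3.0])
print(group_advantages(rewards, num_generations=4))
```

Because no value network is trained or held in memory, GRPO needs substantially less GPU memory than PPO (see the comparison table below).
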
### Basic Usage

```python
from trl import GRPOTrainer, GRPOConfig
from datasets import load_dataset

# Reward function: score each completion by its number of unique words
def reward_func(completions, **kwargs):
    return [len(set(c.split())) for c in completions]

config = GRPOConfig(
    output_dir="model-grpo",
    num_generations=4,          # completions sampled per prompt
    max_completion_length=128,  # max tokens generated per completion
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_func,
    args=config,
    train_dataset=load_dataset("trl-lib/tldr", split="train"),
)
trainer.train()
```

### Key Parameters

- `num_generations`: completions sampled per prompt (typically 2-8)
- `max_completion_length`: maximum tokens generated per completion (typically 64-256)
- Learning rate: 1e-5 to 1e-4

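Reward functions are plain callables with the same `(completions, **kwargs) -> list[float]` shape as `reward_func` above, so they are easy to swap. A hedged sketch of a length-based alternative (the 100-character threshold is arbitrary, for illustration):

```python
def length_reward(completions, **kwargs):
    # Favor concise completions: +1 at or under the threshold, -1 over it
    target_chars = 100  # arbitrary illustrative threshold
    return [1.0 if len(c) <= target_chars else -1.0 for c in completions]
```
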
## Memory Comparison

| Method    | Memory (7B) | Speed  | Use Case               |
|-----------|-------------|--------|------------------------|
| PPO       | 40GB        | Medium | Maximum control        |
| GRPO      | 24GB        | Fast   | **Memory-constrained** |
| OnlineDPO | 28GB        | Fast   | No reward model        |

## References

- PPO paper: https://arxiv.org/abs/1707.06347
- GRPO paper: https://arxiv.org/abs/2402.03300
- TRL docs: https://huggingface.co/docs/trl/