Sync all skills and memories 2026-04-14 07:27

2026-04-14 07:27:20 +09:00
parent 516bb44fe6
commit 1eba2bca95
386 changed files with 167655 additions and 0 deletions
--- a/skills/mlops/training/trl-fine-tuning/references/online-rl.md
+++ b/skills/mlops/training/trl-fine-tuning/references/online-rl.md
@@ -0,0 +1,82 @@
+# Online RL Methods
+
+Guide to online reinforcement learning with PPO, GRPO, RLOO, and OnlineDPO.
+
+## Overview
+
+Online RL generates completions during training and optimizes based on rewards.
+
+## PPO (Proximal Policy Optimization)
+
+Classic RL algorithm for LLM alignment.
+
+### Basic Usage
+
+```bash
+python -m trl.scripts.ppo \
+    --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
+    --reward_model_path reward-model \
+    --dataset_name trl-internal-testing/descriptiveness-sentiment-trl-style \
+    --output_dir model-ppo \
+    --learning_rate 3e-6 \
+    --per_device_train_batch_size 64 \
+    --total_episodes 10000 \
+    --num_ppo_epochs 4 \
+    --kl_coef 0.05
+```
+
+### Key Parameters
+
+- `kl_coef`: KL penalty (0.05-0.2)
+- `num_ppo_epochs`: Epochs per batch (2-4)
+- `cliprange`: PPO clip (0.1-0.3)
+- `vf_coef`: Value function coef (0.1)
+
+## GRPO (Group Relative Policy Optimization)
+
+Memory-efficient online RL.
+
+### Basic Usage
+
+```python
+from trl import GRPOTrainer, GRPOConfig
+from datasets import load_dataset
+
+# Define reward function
+def reward_func(completions, **kwargs):
+    return [len(set(c.split())) for c in completions]
+
+config = GRPOConfig(
+    output_dir="model-grpo",
+    num_generations=4,  # Completions per prompt
+    max_new_tokens=128
+)
+
+trainer = GRPOTrainer(
+    model="Qwen/Qwen2-0.5B-Instruct",
+    reward_funcs=reward_func,
+    args=config,
+    train_dataset=load_dataset("trl-lib/tldr", split="train")
+)
+trainer.train()
+```
+
+### Key Parameters
+
+- `num_generations`: 2-8 completions
+- `max_new_tokens`: 64-256
+- Learning rate: 1e-5 to 1e-4
+
+## Memory Comparison
+
+| Method | Memory (7B) | Speed | Use Case |
+|--------|-------------|-------|----------|
+| PPO | 40GB | Medium | Maximum control |
+| GRPO | 24GB | Fast | **Memory-constrained** |
+| OnlineDPO | 28GB | Fast | No reward model |
+
+## References
+
+- PPO paper: https://arxiv.org/abs/1707.06347
+- GRPO paper: https://arxiv.org/abs/2402.03300
+- TRL docs: https://huggingface.co/docs/trl/