skills/mlops/training/grpo-rl-training/README.md

# GRPO/RL Training Skill

**Expert-level guidance for Group Relative Policy Optimization with TRL**

## 📁 Skill Structure

```
grpo-rl-training/
├── SKILL.md                              # Main skill documentation (READ THIS FIRST)
├── README.md                             # This file
├── templates/
│   └── basic_grpo_training.py            # Production-ready training template
└── examples/
    └── reward_functions_library.py       # 20+ reward function examples
```

## 🚀 Quick Start

1. **Read SKILL.md** - Comprehensive guide with all concepts and patterns
2. **Copy `templates/basic_grpo_training.py`** - Start with working code
3. **Browse `examples/reward_functions_library.py`** - Pick reward functions for your task
4. **Modify for your use case** - Adapt dataset, rewards, and config

## 💡 What's Inside

### SKILL.md (Main Documentation)
- Core GRPO concepts and algorithm fundamentals
- Complete implementation workflow (dataset → rewards → training → deployment)
- 10+ reward function examples with code
- Hyperparameter tuning guide
- Training insights (loss behavior, metrics, debugging)
- Troubleshooting guide
- Production best practices

### Templates
- **basic_grpo_training.py**: Minimal, production-ready training script
  - Uses Qwen 2.5 1.5B Instruct
  - 3 reward functions (format + correctness)
  - LoRA for efficient training
  - Fully documented and ready to run

### Examples
- **reward_functions_library.py**: 20+ battle-tested reward functions
  - Correctness rewards (exact match, fuzzy match, numeric, code execution)
  - Format rewards (XML, JSON, strict/soft)
  - Length rewards (ideal length, min/max)
  - Style rewards (reasoning quality, citations, repetition penalty)
  - Combined rewards (multi-objective optimization)
  - Preset collections for common tasks

## 📖 Usage for Agents

When this skill is loaded in your agent's context:

1. **Always read SKILL.md first** before implementing
2. **Start simple** - Use length-based reward to validate setup
3. **Build incrementally** - Add one reward function at a time
4. **Reference examples** - Copy patterns from reward_functions_library.py
5. **Monitor training** - Watch reward metrics (not loss!)

## 🎯 Common Use Cases

| Task Type | Recommended Rewards | Template |
|-----------|---------------------|----------|
| Math reasoning | `MATH_REASONING_REWARDS` preset | basic_grpo_training.py |
| Code generation | `CODE_GENERATION_REWARDS` preset | Modify dataset in template |
| Summarization | `SUMMARIZATION_REWARDS` preset | Adjust prompts + rewards |
| Q&A | `QA_REWARDS` preset | Use fuzzy match + citations |

## ⚠️ Critical Reminders

- **Loss goes UP during training** - This is normal (it's KL divergence)
- **Use 3-5 reward functions** - Single rewards often fail
- **Test rewards before training** - Debug each function independently
- **Monitor reward_std** - Should stay > 0.1 (avoid mode collapse)
- **Start with num_generations=4-8** - Scale up if GPU allows

## 🔗 External Resources

- [TRL Documentation](https://huggingface.co/docs/trl)
- [DeepSeek R1 Paper](https://arxiv.org/abs/2501.12948)
- [Open R1 Implementation](https://github.com/huggingface/open-r1)
- [Unsloth (2-3x faster)](https://docs.unsloth.ai/)

## 📝 Version

**v1.0.0** - Initial release (January 2025)

## 👨‍💻 Maintained By

Orchestra Research
For questions or improvements, see https://orchestra.com

---

**License:** MIT
**Last Updated:** January 2025
Sync all skills and memories 2026-04-14 07:27 2026-04-14 07:27:20 +09:00			`# GRPO/RL Training Skill`

			`Expert-level guidance for Group Relative Policy Optimization with TRL`

			`## 📁 Skill Structure`

			```
			`grpo-rl-training/`
			`├── SKILL.md # Main skill documentation (READ THIS FIRST)`
			`├── README.md # This file`
			`├── templates/`
			`│ └── basic_grpo_training.py # Production-ready training template`
			`└── examples/`
			`└── reward_functions_library.py # 20+ reward function examples`
			```

			`## 🚀 Quick Start`

			`1. Read SKILL.md - Comprehensive guide with all concepts and patterns`
			2. Copy `templates/basic_grpo_training.py` - Start with working code
			3. Browse `examples/reward_functions_library.py` - Pick reward functions for your task
			`4. Modify for your use case - Adapt dataset, rewards, and config`

			`## 💡 What's Inside`

			`### SKILL.md (Main Documentation)`
			`- Core GRPO concepts and algorithm fundamentals`
			`- Complete implementation workflow (dataset → rewards → training → deployment)`
			`- 10+ reward function examples with code`
			`- Hyperparameter tuning guide`
			`- Training insights (loss behavior, metrics, debugging)`
			`- Troubleshooting guide`
			`- Production best practices`

			`### Templates`
			`- basic_grpo_training.py: Minimal, production-ready training script`
			`- Uses Qwen 2.5 1.5B Instruct`
			`- 3 reward functions (format + correctness)`
			`- LoRA for efficient training`
			`- Fully documented and ready to run`

			`### Examples`
			`- reward_functions_library.py: 20+ battle-tested reward functions`
			`- Correctness rewards (exact match, fuzzy match, numeric, code execution)`
			`- Format rewards (XML, JSON, strict/soft)`
			`- Length rewards (ideal length, min/max)`
			`- Style rewards (reasoning quality, citations, repetition penalty)`
			`- Combined rewards (multi-objective optimization)`
			`- Preset collections for common tasks`

			`## 📖 Usage for Agents`

			`When this skill is loaded in your agent's context:`

			`1. Always read SKILL.md first before implementing`
			`2. Start simple - Use length-based reward to validate setup`
			`3. Build incrementally - Add one reward function at a time`
			`4. Reference examples - Copy patterns from reward_functions_library.py`
			`5. Monitor training - Watch reward metrics (not loss!)`

			`## 🎯 Common Use Cases`

			`\| Task Type \| Recommended Rewards \| Template \|`
			`\|-----------\|---------------------\|----------\|`
			\| Math reasoning \| `MATH_REASONING_REWARDS` preset \| basic_grpo_training.py \|
			\| Code generation \| `CODE_GENERATION_REWARDS` preset \| Modify dataset in template \|
			\| Summarization \| `SUMMARIZATION_REWARDS` preset \| Adjust prompts + rewards \|
			\| Q&A \| `QA_REWARDS` preset \| Use fuzzy match + citations \|

			`## ⚠️ Critical Reminders`

			`- Loss goes UP during training - This is normal (it's KL divergence)`
			`- Use 3-5 reward functions - Single rewards often fail`
			`- Test rewards before training - Debug each function independently`
			`- Monitor reward_std - Should stay > 0.1 (avoid mode collapse)`
			`- Start with num_generations=4-8 - Scale up if GPU allows`

			`## 🔗 External Resources`

			`- [TRL Documentation](https://huggingface.co/docs/trl)`
			`- [DeepSeek R1 Paper](https://arxiv.org/abs/2501.12948)`
			`- [Open R1 Implementation](https://github.com/huggingface/open-r1)`
			`- [Unsloth (2-3x faster)](https://docs.unsloth.ai/)`

			`## 📝 Version`

			`v1.0.0 - Initial release (January 2025)`

			`## 👨‍💻 Maintained By`

			`Orchestra Research`
			`For questions or improvements, see https://orchestra.com`

			`---`

			`License: MIT`
			`Last Updated: January 2025`