---
name: grpo-rl-training
description: Expert guidance for GRPO/RL fine-tuning with TRL for reasoning and task-specific model training
version: 1.0.0
author: Orchestra Research
license: MIT
dependencies: [transformers>=4.47.0, trl>=0.14.0, datasets>=3.2.0, peft>=0.14.0, torch]
metadata:
  hermes:
    tags: [Post-Training, Reinforcement Learning, GRPO, TRL, RLHF, Reward Modeling, Reasoning, DPO, PPO, Structured Output]
---

# GRPO/RL Training with TRL

Expert-level guidance for implementing Group Relative Policy Optimization (GRPO) using the Transformer Reinforcement Learning (TRL) library. This skill provides battle-tested patterns, critical insights, and production-ready workflows for fine-tuning language models with custom reward functions.

## When to Use This Skill

Use GRPO training when you need to:
- **Enforce specific output formats** (e.g., XML tags, JSON, structured reasoning)
- **Teach verifiable tasks** with objective correctness metrics (math, coding, fact-checking)
- **Improve reasoning capabilities** by rewarding chain-of-thought patterns
- **Align models to domain-specific behaviors** without labeled preference data
- **Optimize for multiple objectives** simultaneously (format + correctness + style)

**Do NOT use GRPO for:**
- Simple supervised fine-tuning tasks (use SFT instead)
- Tasks without clear reward signals
- Tasks where you already have high-quality preference pairs (use DPO or PPO instead)

---

## Core Concepts

### 1. GRPO Algorithm Fundamentals

**Key Mechanism:**
- Generates **multiple completions** for each prompt (group size: 4-16)
- Compares completions within each group using reward functions
- Updates the policy to favor higher-rewarded responses relative to the rest of the group

**Critical Differences from PPO:**
- No separate reward model needed
- More sample-efficient (learns from within-group comparisons)
- Simpler to implement and debug

**Mathematical Intuition:**
```
For each prompt p:
  1. Generate N completions: {c₁, c₂, ..., cₙ}
  2. Compute rewards: {r₁, r₂, ..., rₙ}
  3. Learn to increase the probability of high-reward completions
     relative to low-reward ones in the same group
```
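
The same group-relative idea in code: a minimal sketch (not the TRL internals; the helper name and exact normalization are assumptions) of how rewards are standardized within each prompt's group, so completions are only ever compared against their own group.

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Standardize one group's rewards; positive values mark above-average completions."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a group where all rewards are equal
    return [(r - mean) / std for r in rewards]

print(group_relative_advantages([2.0, 0.0, 0.5, 2.0]))  # the 2.0-reward completions get positive advantage
```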

### 2. Reward Function Design Philosophy

**Golden Rules:**
1. **Compose multiple reward functions** - Each handles one aspect (format, correctness, style)
2. **Scale rewards appropriately** - Higher weight = stronger signal
3. **Use incremental rewards** - Partial credit for partial compliance
4. **Test rewards independently** - Debug each reward function in isolation

**Reward Function Types:**

| Type | Use Case | Example Weight |
|------|----------|----------------|
| **Correctness** | Verifiable tasks (math, code) | 2.0 (highest) |
| **Format** | Strict structure enforcement | 0.5-1.0 |
| **Length** | Encourage verbosity/conciseness | 0.1-0.5 |
| **Style** | Penalize unwanted patterns | -0.5 to 0.5 |
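
The weights in this table can be applied without relying on any trainer feature: a hedged sketch that scales a reward function's output before the trainer combines the signals (the `weighted` wrapper is illustrative, not a TRL API).

```python
def weighted(reward_func, weight):
    """Scale a reward function's output by a fixed weight (illustrative helper)."""
    def wrapper(*args, **kwargs):
        return [weight * r for r in reward_func(*args, **kwargs)]
    wrapper.__name__ = f"{reward_func.__name__}_x{weight}"  # keeps per-reward logging readable
    return wrapper

# Hypothetical usage with the reward functions defined later in this skill:
# reward_funcs = [weighted(correctness_reward, 2.0), weighted(format_reward, 1.0)]
```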

---

## Implementation Workflow

### Step 1: Dataset Preparation

**Critical Requirements:**
- Prompts in chat format (list of dicts with 'role' and 'content')
- Include system prompts to set expectations
- For verifiable tasks, include ground truth answers as additional columns

**Example Structure:**
```python
from datasets import load_dataset, Dataset

SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
[Your step-by-step thinking]
</reasoning>
<answer>
[Final answer]
</answer>
"""

def prepare_dataset(raw_data):
    """
    Transform raw data into GRPO-compatible format.

    Returns: Dataset with columns:
    - 'prompt': List[Dict] with role/content (system + user messages)
    - 'answer': str (ground truth, optional but recommended)
    """
    return raw_data.map(lambda x: {
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_answer(x['raw_answer'])  # extract_answer: your task-specific gold-answer parser
    })
```

**Pro Tips:**
- Use one-shot or few-shot examples in the system prompt for complex formats
- Keep prompts concise (max_prompt_length: 256-512 tokens); see the length check below
- Validate data quality before training (garbage in = garbage out)
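
A quick way to confirm that `max_prompt_length` actually covers your prompts after chat templating; this is a hedged sketch that assumes the `dataset` produced above and the `tokenizer` loaded in Step 4.

```python
def check_prompt_lengths(dataset, tokenizer, limit=256):
    """Report the longest tokenized prompt and how many exceed the configured limit."""
    lengths = [
        len(tokenizer.apply_chat_template(example['prompt'], tokenize=True, add_generation_prompt=True))
        for example in dataset
    ]
    over = sum(length > limit for length in lengths)
    print(f"max prompt tokens: {max(lengths)}, over limit: {over}/{len(lengths)}")

# check_prompt_lengths(dataset, tokenizer, limit=256)
```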

### Step 2: Reward Function Implementation

**Template Structure:**
```python
def reward_function_name(
    prompts,        # List[List[Dict]]: Original prompts
    completions,    # List[List[Dict]]: Model generations
    answer=None,    # Optional: Ground truth from dataset
    **kwargs        # Additional dataset columns
) -> list[float]:
    """
    Evaluate completions and return rewards.

    Returns: List of floats (one per completion)
    """
    # Extract completion text
    responses = [comp[0]['content'] for comp in completions]

    # Compute rewards
    rewards = []
    for response in responses:
        score = compute_score(response)
        rewards.append(score)

    return rewards
```

**Example 1: Correctness Reward (Math/Coding)**
```python
def correctness_reward(prompts, completions, answer, **kwargs):
    """Reward correct answers with a high score."""
    responses = [comp[0]['content'] for comp in completions]
    extracted = [extract_final_answer(r) for r in responses]
    return [2.0 if ans == gt else 0.0
            for ans, gt in zip(extracted, answer)]
```
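
`extract_final_answer` is referenced above but never defined in this skill; a minimal sketch for the `<answer>` format used here (adapt the parsing to your own task).

```python
import re

def extract_final_answer(text: str) -> str:
    """Return the text between <answer> tags, or an empty string if the tags are missing."""
    match = re.search(r'<answer>(.*?)</answer>', text, re.DOTALL)
    return match.group(1).strip() if match else ""
```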

**Example 2: Format Reward (Structured Output)**
```python
import re

def format_reward(completions, **kwargs):
    """Reward XML-like structured format."""
    pattern = r'<reasoning>.*?</reasoning>\s*<answer>.*?</answer>'
    responses = [comp[0]['content'] for comp in completions]
    return [1.0 if re.search(pattern, r, re.DOTALL) else 0.0
            for r in responses]
```

**Example 3: Incremental Format Reward (Partial Credit)**
```python
def incremental_format_reward(completions, **kwargs):
    """Award partial credit for format compliance."""
    responses = [comp[0]['content'] for comp in completions]
    rewards = []

    for r in responses:
        score = 0.0
        if '<reasoning>' in r:
            score += 0.25
        if '</reasoning>' in r:
            score += 0.25
        if '<answer>' in r:
            score += 0.25
        if '</answer>' in r:
            score += 0.25
        # Penalize extra text after the closing tag
        if r.count('</answer>') == 1:
            extra_text = r.split('</answer>')[-1].strip()
            score -= len(extra_text) * 0.001
        rewards.append(score)

    return rewards
```

**Critical Insight:**
Combine 3-5 reward functions for robust training. Order matters less than diversity of signals. A quick way to sanity-check each function before training is sketched below.
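
Following the "test rewards independently" rule, a small self-contained check that mimics the conversational completion structure GRPOTrainer passes to reward functions; the expected values assume the reward functions above and the `extract_final_answer` sketch from Example 1.

```python
fake_prompts = [[{'role': 'user', 'content': 'What is 2 + 2?'}]]
fake_completions = [[{'role': 'assistant',
                      'content': '<reasoning>2 + 2 = 4</reasoning>\n<answer>4</answer>'}]]

print(format_reward(completions=fake_completions))                       # expected: [1.0]
print(incremental_format_reward(completions=fake_completions))           # expected: [1.0]
print(correctness_reward(fake_prompts, fake_completions, answer=['4']))  # expected: [2.0]
```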

### Step 3: Training Configuration

**Memory-Optimized Config (Small GPU)**
```python
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="outputs/grpo-model",

    # Learning rate
    learning_rate=5e-6,             # Lower = more stable
    adam_beta1=0.9,
    adam_beta2=0.99,
    weight_decay=0.1,
    warmup_ratio=0.1,
    lr_scheduler_type='cosine',

    # Batch settings
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,  # Effective batch = 4

    # GRPO-specific
    num_generations=8,              # Group size: 8-16 recommended
    max_prompt_length=256,
    max_completion_length=512,

    # Training duration
    num_train_epochs=1,
    max_steps=-1,                   # Or set a fixed number of steps (e.g., 500)

    # Optimization
    bf16=True,                      # Faster on A100/H100
    optim="adamw_8bit",             # Memory-efficient optimizer
    max_grad_norm=0.1,

    # Logging
    logging_steps=1,
    save_steps=100,
    report_to="wandb",              # Or "none" to disable logging
)
```

**High-Performance Config (Large GPU)**
```python
training_args = GRPOConfig(
    output_dir="outputs/grpo-model",
    learning_rate=1e-5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    num_generations=16,            # Larger groups = better signal
    max_prompt_length=512,
    max_completion_length=1024,
    num_train_epochs=1,
    bf16=True,
    use_vllm=True,                 # Fast generation with vLLM
    logging_steps=10,
)
```

**Critical Hyperparameters:**

| Parameter | Impact | Tuning Advice |
|-----------|--------|---------------|
| `num_generations` | Group size for comparison | Start with 8, increase to 16 if GPU allows |
| `learning_rate` | Convergence speed/stability | 5e-6 (safe), 1e-5 (faster, riskier) |
| `max_completion_length` | Output verbosity | Match your task (512 for reasoning, 256 for short answers) |
| `gradient_accumulation_steps` | Effective batch size | Increase if GPU memory is limited |

### Step 4: Model Setup and Training

**Standard Setup (Transformers)**
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from trl import GRPOTrainer

# Load model
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # 2-3x faster
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Optional: LoRA for parameter-efficient training
peft_config = LoraConfig(
    r=16,                 # Rank (higher = more capacity)
    lora_alpha=32,        # Scaling factor (typically 2*r)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    task_type="CAUSAL_LM",
    lora_dropout=0.05,
)

# Initialize trainer
trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        incremental_format_reward,
        format_reward,
        correctness_reward,
    ],
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,  # Remove for full fine-tuning
)

# Train
trainer.train()

# Save
trainer.save_model("final_model")
```

**Unsloth Setup (2-3x Faster)**
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="google/gemma-3-1b-it",
    max_seq_length=1024,
    load_in_4bit=True,
    fast_inference=True,
    max_lora_rank=32,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,
    use_gradient_checkpointing="unsloth",
)

# The rest is identical to the standard setup
trainer = GRPOTrainer(model=model, ...)
trainer.train()
```

---

## Critical Training Insights

### 1. Loss Behavior (EXPECTED PATTERN)
- **Loss starts near 0 and INCREASES during training**
- This is CORRECT - the loss measures KL divergence from the initial policy
- The model is learning (diverging from its original behavior to optimize rewards)
- Monitor reward metrics instead of loss to track progress

### 2. Reward Tracking
Key metrics to watch:
- `reward`: Average across all completions
- `reward_std`: Diversity within groups (should remain > 0)
- `kl`: KL divergence from the reference model (should grow moderately)

**Healthy Training Pattern:**
```
Step    Reward    Reward_Std    KL
100     0.5       0.3           0.02
200     0.8       0.25          0.05
300     1.2       0.2           0.08   ← Good progression
400     1.5       0.15          0.12
```

**Warning Signs:**
- Reward std → 0: the model is collapsing to a single response
- KL exploding (> 0.5): diverging too much; reduce the learning rate
- Reward stuck: reward functions too harsh, or a model capacity issue

A monitoring hook for these signals is sketched below.
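
A hedged sketch of such a hook built on the standard `transformers.TrainerCallback` API; the metric keys (`reward_std`, `kl`) are assumptions based on the names above and may differ between TRL versions.

```python
from transformers import TrainerCallback

class RewardMonitorCallback(TrainerCallback):
    """Print a warning when logged metrics match the warning signs above."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        if not logs:
            return
        reward_std = logs.get("reward_std")
        kl = logs.get("kl")
        if reward_std is not None and reward_std < 0.05:
            print(f"[step {state.global_step}] reward_std={reward_std:.3f} -> possible mode collapse")
        if kl is not None and kl > 0.5:
            print(f"[step {state.global_step}] kl={kl:.3f} -> consider lowering the learning rate")

# Usage (GRPOTrainer inherits the callbacks argument from transformers.Trainer):
# trainer = GRPOTrainer(..., callbacks=[RewardMonitorCallback()])
```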

### 3. Common Pitfalls and Solutions

| Problem | Symptom | Solution |
|---------|---------|----------|
| **Mode collapse** | All completions identical | Increase `num_generations`, add diversity penalty |
| **No learning** | Flat rewards | Check reward function logic, increase LR |
| **OOM errors** | GPU memory exceeded | Reduce `num_generations`, enable gradient checkpointing |
| **Slow training** | < 1 it/s | Enable `use_vllm=True`, use Unsloth, reduce seq length |
| **Format ignored** | Model doesn't follow structure | Increase format reward weight, add incremental rewards |
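
For the OOM row specifically, the usual first levers are ordinary `GRPOConfig`/`TrainingArguments` fields; a hedged starting point with illustrative values.

```python
# Illustrative memory-saving overrides; tune to your GPU.
training_args = GRPOConfig(
    output_dir="outputs/grpo-model",
    num_generations=4,               # smaller groups mean fewer completions per prompt
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # keep the effective batch size up
    gradient_checkpointing=True,     # trade compute for activation memory
    max_completion_length=256,       # shorter generations are the biggest single saving
    bf16=True,
    optim="adamw_8bit",
)
```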

---

## Advanced Patterns

### 1. Multi-Stage Training
For complex tasks, train in stages:

```python
# Stage 1: Format compliance (epochs=1)
trainer_stage1 = GRPOTrainer(
    model=model,
    reward_funcs=[incremental_format_reward, format_reward],
    ...
)
trainer_stage1.train()

# Stage 2: Correctness (epochs=1)
trainer_stage2 = GRPOTrainer(
    model=model,
    reward_funcs=[format_reward, correctness_reward],
    ...
)
trainer_stage2.train()
```

### 2. Adaptive Reward Scaling
```python
class AdaptiveReward:
    def __init__(self, base_reward_func, initial_weight=1.0):
        self.func = base_reward_func
        self.weight = initial_weight

    def __call__(self, *args, **kwargs):
        rewards = self.func(*args, **kwargs)
        return [r * self.weight for r in rewards]

    def adjust_weight(self, success_rate):
        """Increase weight if model struggling, decrease if succeeding."""
        if success_rate < 0.3:
            self.weight *= 1.2
        elif success_rate > 0.8:
            self.weight *= 0.9
```
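
Hypothetical usage of the wrapper above; the names are illustrative, and the `__name__` assignment is an assumption to keep per-reward logging readable, since TRL labels metrics by reward function name.

```python
adaptive_format = AdaptiveReward(format_reward, initial_weight=1.0)
adaptive_format.__name__ = "adaptive_format_reward"  # assumption: TRL may look up __name__ for logging

# Pass it in place of the plain function:
# trainer = GRPOTrainer(..., reward_funcs=[adaptive_format, correctness_reward])

# Between stages or evaluation rounds, nudge the weight based on holdout format compliance:
adaptive_format.adjust_weight(success_rate=0.25)  # struggling -> weight grows to 1.2
adaptive_format.adjust_weight(success_rate=0.85)  # succeeding -> weight shrinks again
```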

### 3. Custom Dataset Integration
```python
def load_custom_knowledge_base(csv_path):
    """Example: School communication platform docs."""
    import pandas as pd
    df = pd.read_csv(csv_path)

    dataset = Dataset.from_pandas(df).map(lambda x: {
        'prompt': [
            {'role': 'system', 'content': CUSTOM_SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': x['expert_answer']
    })
    return dataset
```

---

## Deployment and Inference

### Save and Merge LoRA
```python
# Merge LoRA adapters into the base model
if hasattr(trainer.model, 'merge_and_unload'):
    merged_model = trainer.model.merge_and_unload()
    merged_model.save_pretrained("production_model")
    tokenizer.save_pretrained("production_model")
```
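
Reloading the merged checkpoint later goes through the standard `transformers` loading path; a brief sketch.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "production_model",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("production_model")
```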

### Inference Example
```python
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="production_model",
    tokenizer=tokenizer
)

result = generator(
    [
        {'role': 'system', 'content': SYSTEM_PROMPT},
        {'role': 'user', 'content': "What is 15 + 27?"}
    ],
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)
print(result[0]['generated_text'])
```
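
To pull out just the model's `<answer>` span, the parsing helper sketched in Step 2 applies here too; this assumes that, with chat-format input, the pipeline returns the conversation as a list of messages (behavior can vary across transformers versions).

```python
# Assumption: result[0]['generated_text'] is the message list when the input is chat-formatted.
reply = result[0]['generated_text'][-1]['content']
print(extract_final_answer(reply))  # helper sketched in Step 2
```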

---

## Best Practices Checklist

**Before Training:**
- [ ] Validate dataset format (prompts as List[Dict])
- [ ] Test reward functions on sample data
- [ ] Calculate the expected max_prompt_length from your data
- [ ] Choose an appropriate num_generations for your GPU memory
- [ ] Set up logging (wandb recommended)

**During Training:**
- [ ] Monitor reward progression (it should increase)
- [ ] Check reward_std (it should stay > 0.1)
- [ ] Watch for OOM errors (reduce batch size if needed)
- [ ] Sample generations every 50-100 steps
- [ ] Validate format compliance on a holdout set

**After Training:**
- [ ] Merge LoRA weights if using PEFT
- [ ] Test on diverse prompts
- [ ] Compare against the baseline model
- [ ] Document reward weights and hyperparameters
- [ ] Save a reproducibility config

---

## Troubleshooting Guide

### Debugging Workflow
1. **Isolate reward functions** - Test each independently
2. **Check data distribution** - Ensure diversity in prompts
3. **Reduce complexity** - Start with a single reward, add more gradually
4. **Monitor generations** - Print samples every N steps
5. **Validate extraction logic** - Ensure answer parsing works

### Quick Fixes
```python
# Debug reward function
def debug_reward(completions, **kwargs):
    responses = [comp[0]['content'] for comp in completions]
    for i, r in enumerate(responses[:2]):  # Print the first 2
        print(f"Response {i}: {r[:200]}...")
    return [1.0] * len(responses)  # Dummy rewards

# Smoke test: run a few optimizer steps with only the debug reward
# (set max_steps to a small value in GRPOConfig) before launching a full run.
trainer = GRPOTrainer(..., reward_funcs=[debug_reward])
trainer.train()
```

---

## References and Resources

**Official Documentation:**
- TRL GRPO Trainer: https://huggingface.co/docs/trl/grpo_trainer
- DeepSeek R1 Paper: https://arxiv.org/abs/2501.12948
- Unsloth Docs: https://docs.unsloth.ai/

**Example Repositories:**
- Open R1 Implementation: https://github.com/huggingface/open-r1
- TRL Examples: https://github.com/huggingface/trl/tree/main/examples

**Recommended Reading:**
- Progressive Disclosure Pattern for agent instructions
- Reward shaping in RL (Ng et al., 1999)
- LoRA paper (Hu et al., 2021)

---

## Usage Instructions for Agents

When this skill is loaded:

1. **Read this entire file** before implementing GRPO training
2. **Start with the simplest reward function** (e.g., length-based) to validate the setup
3. **Use the templates** in the `templates/` directory as starting points
4. **Reference the examples** in `examples/` for task-specific implementations
5. **Follow the workflow** sequentially (don't skip steps)
6. **Debug incrementally** - add one reward function at a time

**Critical Reminders:**
- Always use multiple reward functions (3-5 is optimal)
- Monitor reward metrics, not loss
- Test reward functions before training
- Start small (num_generations=4) and scale up gradually
- Save checkpoints frequently (every 100 steps)

This skill is designed for **expert-level implementation**. Beginners should start with supervised fine-tuning before attempting GRPO.