# DPO Variants
Complete guide to Direct Preference Optimization loss variants in TRL.
## Overview
DPO optimizes a model directly on preference data (chosen/rejected response pairs) without training a separate reward model. TRL supports more than ten loss variants for different scenarios, selected via `loss_type` in `DPOConfig`.
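For orientation, here is a minimal end-to-end sketch along the lines of TRL's documented quickstart (the model and dataset names are examples; the dataset is assumed to be in a preference format with chosen/rejected pairs, and older TRL versions take `tokenizer=` instead of `processing_class=`):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="dpo-model", loss_type="sigmoid", beta=0.1),
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```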
## Loss Types
### 1. Sigmoid (Standard DPO)

**Formula**: `-log(sigmoid(β * logits))`, where `logits` is the policy's chosen-vs-rejected log-probability ratio minus the reference model's

**When to use**: Default choice for general preference alignment

**Config**:

```python
DPOConfig(
    loss_type="sigmoid",
    beta=0.1,  # KL penalty
    per_device_train_batch_size=64,
    learning_rate=1e-6,
)
```
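To make the formula concrete, here is a minimal sketch of the sigmoid loss (illustrative, not TRL's exact code; the `*_logps` arguments are assumed to be per-sequence summed log-probabilities):

```python
import torch.nn.functional as F

def dpo_sigmoid_loss(policy_chosen_logps, policy_rejected_logps,
                     ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # logits = policy log-ratio minus reference log-ratio
    logits = (policy_chosen_logps - policy_rejected_logps) - (
        ref_chosen_logps - ref_rejected_logps
    )
    return -F.logsigmoid(beta * logits)  # one loss per preference pair
```

The later sketches reuse this `logits` definition.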
### 2. IPO (Identity Preference Optimization)

**Formula**: `(logits - 1/(2β))²`

**When to use**: When you want a stronger theoretical grounding or need to reduce overfitting to the preference data

**Config**:

```python
DPOConfig(
    loss_type="ipo",
    beta=0.1,
    per_device_train_batch_size=90,
    learning_rate=1e-2,
)
```
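IPO replaces the logistic loss with a squared regression toward a fixed margin, which bounds the objective and curbs overfitting. A minimal sketch, reusing the `logits` from the sigmoid example:

```python
def ipo_loss(logits, beta=0.1):
    # regress the log-ratio difference toward the target margin 1/(2*beta)
    return (logits - 1 / (2 * beta)) ** 2
```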
### 3. Hinge (SLiC)

**Formula**: `ReLU(1 - β * logits)`

**When to use**: When you want a margin-based objective (from the SLiC paper)

**Config**:

```python
DPOConfig(
    loss_type="hinge",
    beta=0.1,
    per_device_train_batch_size=512,
    learning_rate=1e-4,
)
```
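The hinge variant yields zero loss once the scaled margin `β * logits` exceeds 1, so confidently ranked pairs stop contributing gradient. A minimal sketch with the same `logits` as above:

```python
import torch

def hinge_loss(logits, beta=0.1):
    # zero loss (and zero gradient) once beta * logits >= 1
    return torch.relu(1 - beta * logits)
```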
### 4. Robust DPO

**Formula**: Sigmoid loss with label smoothing, debiased for label noise

**When to use**: Noisy preference labels

**Config**:

```python
DPOConfig(
    loss_type="robust",
    beta=0.01,
    label_smoothing=0.1,  # assumed probability of a flipped label
    per_device_train_batch_size=16,
    learning_rate=1e-3,
    max_prompt_length=128,
    max_length=512,
)
```
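A sketch of the robust objective, assuming each preference label is flipped with probability `eps = label_smoothing` (following the robust/conservative DPO formulation; illustrative, and `eps` must stay strictly below 0.5):

```python
import torch.nn.functional as F

def robust_dpo_loss(logits, beta=0.01, eps=0.1):
    # debiased mix of the standard loss and its label-flipped counterpart
    return (
        -F.logsigmoid(beta * logits) * (1 - eps)
        + F.logsigmoid(-beta * logits) * eps
    ) / (1 - 2 * eps)
```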
### 5. BCO Pair (Binary Classification)

**Formula**: Trains the implicit reward as a binary classifier (chosen → 1, rejected → 0)

**When to use**: Paired preference data (for unpaired data, TRL provides a separate `BCOTrainer`)

**Config**:

```python
DPOConfig(
    loss_type="bco_pair",
    beta=0.01,
    per_device_train_batch_size=128,
    learning_rate=5e-7,
    max_prompt_length=1536,
    max_completion_length=512,
)
```
### 6. SPPO Hard

**Formula**: Pushes the chosen reward toward +0.5 and the rejected reward toward -0.5 (hard probability targets)

**When to use**: Nash-equilibrium-style (self-play) alignment, sparse preference data

**Config**:

```python
DPOConfig(
    loss_type="sppo_hard",
    beta=0.1,
)
```
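Since the implicit reward is `β * log-ratio`, the ±0.5 targets translate into regressing each log-ratio toward ±1/(2β). A minimal sketch along the lines of TRL's formulation (illustrative):

```python
def sppo_hard_loss(policy_chosen_logps, policy_rejected_logps,
                   ref_chosen_logps, ref_rejected_logps, beta=0.1):
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # reward = beta * logratio, so targets of +-0.5 become +-1/(2*beta)
    return (chosen_logratio - 0.5 / beta) ** 2 + (
        rejected_logratio + 0.5 / beta
    ) ** 2
```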
### 7. DiscoPOP

**Formula**: Log-Ratio Modulated Loss

**When to use**: Trying an objective found by automated loss discovery

**Config**:

```python
DPOConfig(
    loss_type="discopop",
    beta=0.05,
    discopop_tau=0.05,  # temperature shaping the modulation
    per_device_train_batch_size=64,
    learning_rate=5e-7,
)
```
### 8. APO Zero

**Formula**: Increases chosen likelihood, decreases rejected likelihood

**When to use**: When the model performs worse than the winning (chosen) outputs

**Config**:

```python
DPOConfig(
    loss_type="apo_zero",
    beta=0.1,
    per_device_train_batch_size=64,
    learning_rate=2e-7,
    max_prompt_length=512,
    max_completion_length=512,
)
```
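A sketch of the zero-anchored APO objective (based on my reading of the APO formulation; illustrative, with `*_logratio` meaning policy log-probability minus reference log-probability):

```python
import torch

def apo_zero_loss(chosen_logratio, rejected_logratio, beta=0.1):
    loss_chosen = 1 - torch.sigmoid(beta * chosen_logratio)  # push chosen up
    loss_rejected = torch.sigmoid(beta * rejected_logratio)  # push rejected down
    return loss_chosen + loss_rejected
```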
### 9. APO Down

**Formula**: Decreases both likelihoods, emphasizing the reduction of rejected outputs

**When to use**: When the model already performs better than the winning outputs

**Config**:

```python
DPOConfig(
    loss_type="apo_down",
    beta=0.1,
    # same remaining hyperparameters as apo_zero
)
```
### 10. AOT & AOT Pair

**Formula**: Distributional alignment via stochastic dominance (optimal transport)

**When to use**:

- `aot`: paired preference data (chosen and rejected responses for the same prompt)
- `aot_pair`: unpaired data

**Config**:

```python
DPOConfig(
    loss_type="aot_pair",  # or "aot"
    beta=0.1,
    label_smoothing=0.0,
)
```
## Multi-Loss Training

Combine multiple losses into one weighted objective:

```python
DPOConfig(
    loss_type=["sigmoid", "ipo"],
    loss_weights=[0.7, 0.3],  # weighted combination
    beta=0.1,
)
```
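The result behaves like a weighted sum of the individual losses. A minimal illustration using the two formulas above (reusing the `logits` definition from the sigmoid sketch):

```python
import torch.nn.functional as F

def combined_loss(logits, beta=0.1, weights=(0.7, 0.3)):
    # weighted sum of the sigmoid and IPO objectives
    sigmoid_loss = -F.logsigmoid(beta * logits)
    ipo_loss = (logits - 1 / (2 * beta)) ** 2
    return weights[0] * sigmoid_loss + weights[1] * ipo_loss
```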
## Key Parameters

### Beta (β)

Controls how far the policy may deviate from the reference model (see the sketch after this list):

- **Higher** (e.g., 0.5): more conservative, stays close to the reference
- **Lower** (e.g., 0.01): more aggressive alignment
- **Default**: 0.1
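β scales DPO's implicit reward, the log-probability ratio between policy and reference: a larger β makes the same deviation from the reference count for more, which keeps the policy closer to it. A minimal illustration (TRL's logged `rewards/chosen` and `rewards/rejected` metrics are computed in this spirit):

```python
def implicit_reward(policy_logps, ref_logps, beta=0.1):
    # beta-scaled log-ratio to the reference model
    return beta * (policy_logps - ref_logps)
```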
### Label Smoothing

For robust DPO (`loss_type="robust"`):

- **0.0**: no smoothing (default)
- **0.1-0.3**: moderate noise robustness
- **approaching 0.5**: maximum noise tolerance (the value must stay strictly below 0.5, since the robust loss divides by `1 - 2 * label_smoothing`)
### Max Lengths

Typical ranges (a combined example follows this list):

- `max_prompt_length`: 128-1536
- `max_completion_length`: 128-512
- `max_length`: total sequence length (1024-2048)
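A hypothetical configuration combining the three limits (the values are illustrative picks from the ranges above):

```python
DPOConfig(
    max_prompt_length=512,       # truncate prompts beyond this many tokens
    max_completion_length=512,   # truncate each completion
    max_length=1024,             # overall cap on prompt + completion
)
```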
## Comparison Table
| Loss | Speed | Stability | Best For |
|------|-------|-----------|----------|
| Sigmoid | Fast | Good | **General use** |
| IPO | Fast | Better | Overfitting issues |
| Hinge | Fast | Good | Margin objectives |
| Robust | Fast | Best | Noisy data |
| BCO | Medium | Good | Binary classification |
| DiscoPOP | Fast | Good | Discovered objective |
| APO | Fast | Good | Matching the loss to model quality |
## References
- DPO paper: https://arxiv.org/abs/2305.18290
- IPO paper: https://arxiv.org/abs/2310.12036
- TRL docs: https://huggingface.co/docs/trl/dpo_trainer