skills/mlops/training/peft/SKILL.md

---
name: peft-fine-tuning
description: Parameter-efficient fine-tuning for LLMs using LoRA, QLoRA, and 25+ methods. Use when fine-tuning large models (7B-70B) with limited GPU memory, when you need to train <1% of parameters with minimal accuracy loss, or for multi-adapter serving. HuggingFace's official library integrated with transformers ecosystem.
version: 1.0.0
author: Orchestra Research
license: MIT
dependencies: [peft>=0.13.0, transformers>=4.45.0, torch>=2.0.0, bitsandbytes>=0.43.0]
metadata:
  hermes:
    tags: [Fine-Tuning, PEFT, LoRA, QLoRA, Parameter-Efficient, Adapters, Low-Rank, Memory Optimization, Multi-Adapter]

---

# PEFT (Parameter-Efficient Fine-Tuning)

Fine-tune LLMs by training <1% of parameters using LoRA, QLoRA, and 25+ adapter methods.

## When to use PEFT

**Use PEFT/LoRA when:**
- Fine-tuning 7B-70B models on consumer GPUs (RTX 4090, A100)
- Need to train <1% parameters (6MB adapters vs 14GB full model)
- Want fast iteration with multiple task-specific adapters
- Deploying multiple fine-tuned variants from one base model

**Use QLoRA (PEFT + quantization) when:**
- Fine-tuning 70B models on single 24GB GPU
- Memory is the primary constraint
- Can accept ~5% quality trade-off vs full fine-tuning

**Use full fine-tuning instead when:**
- Training small models (<1B parameters)
- Need maximum quality and have compute budget
- Significant domain shift requires updating all weights

## Quick start

### Installation

```bash
# Basic installation
pip install peft

# With quantization support (recommended)
pip install peft bitsandbytes

# Full stack
pip install peft transformers accelerate bitsandbytes datasets
```

### LoRA fine-tuning (standard)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from peft import get_peft_model, LoraConfig, TaskType
from datasets import load_dataset

# Load base model
model_name = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                          # Rank (8-64, higher = more capacity)
    lora_alpha=32,                 # Scaling factor (typically 2*r)
    lora_dropout=0.05,             # Dropout for regularization
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # Attention layers
    bias="none"                    # Don't train biases
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 13,631,488 || all params: 8,043,307,008 || trainable%: 0.17%

# Prepare dataset
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

def tokenize(example):
    text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}"
    return tokenizer(text, truncation=True, max_length=512, padding="max_length")

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

# Training
training_args = TrainingArguments(
    output_dir="./lora-llama",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=lambda data: {"input_ids": torch.stack([f["input_ids"] for f in data]),
                                 "attention_mask": torch.stack([f["attention_mask"] for f in data]),
                                 "labels": torch.stack([f["input_ids"] for f in data])}
)

trainer.train()

# Save adapter only (6MB vs 16GB)
model.save_pretrained("./lora-llama-adapter")
```

### QLoRA fine-tuning (memory-efficient)

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model, LoraConfig, prepare_model_for_kbit_training

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4 (best for LLMs)
    bnb_4bit_compute_dtype="bfloat16",   # Compute in bf16
    bnb_4bit_use_double_quant=True       # Nested quantization
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",
    quantization_config=bnb_config,
    device_map="auto"
)

# Prepare for training (enables gradient checkpointing)
model = prepare_model_for_kbit_training(model)

# LoRA config for QLoRA
lora_config = LoraConfig(
    r=64,                              # Higher rank for 70B
    lora_alpha=128,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
# 70B model now fits on single 24GB GPU!
```

## LoRA parameter selection

### Rank (r) - capacity vs efficiency

| Rank | Trainable Params | Memory | Quality | Use Case |
|------|-----------------|--------|---------|----------|
| 4 | ~3M | Minimal | Lower | Simple tasks, prototyping |
| **8** | ~7M | Low | Good | **Recommended starting point** |
| **16** | ~14M | Medium | Better | **General fine-tuning** |
| 32 | ~27M | Higher | High | Complex tasks |
| 64 | ~54M | High | Highest | Domain adaptation, 70B models |

### Alpha (lora_alpha) - scaling factor

```python
# Rule of thumb: alpha = 2 * rank
LoraConfig(r=16, lora_alpha=32)  # Standard
LoraConfig(r=16, lora_alpha=16)  # Conservative (lower learning rate effect)
LoraConfig(r=16, lora_alpha=64)  # Aggressive (higher learning rate effect)
```

### Target modules by architecture

```python
# Llama / Mistral / Qwen
target_modules = ["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]

# GPT-2 / GPT-Neo
target_modules = ["c_attn", "c_proj", "c_fc"]

# Falcon
target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]

# BLOOM
target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]

# Auto-detect all linear layers
target_modules = "all-linear"  # PEFT 0.6.0+
```

## Loading and merging adapters

### Load trained adapter

```python
from peft import PeftModel, AutoPeftModelForCausalLM
from transformers import AutoModelForCausalLM

# Option 1: Load with PeftModel
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model = PeftModel.from_pretrained(base_model, "./lora-llama-adapter")

# Option 2: Load directly (recommended)
model = AutoPeftModelForCausalLM.from_pretrained(
    "./lora-llama-adapter",
    device_map="auto"
)
```

### Merge adapter into base model

```python
# Merge for deployment (no adapter overhead)
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("./llama-merged")
tokenizer.save_pretrained("./llama-merged")

# Push to Hub
merged_model.push_to_hub("username/llama-finetuned")
```

### Multi-adapter serving

```python
from peft import PeftModel

# Load base with first adapter
model = AutoPeftModelForCausalLM.from_pretrained("./adapter-task1")

# Load additional adapters
model.load_adapter("./adapter-task2", adapter_name="task2")
model.load_adapter("./adapter-task3", adapter_name="task3")

# Switch between adapters at runtime
model.set_adapter("task1")  # Use task1 adapter
output1 = model.generate(**inputs)

model.set_adapter("task2")  # Switch to task2
output2 = model.generate(**inputs)

# Disable adapters (use base model)
with model.disable_adapter():
    base_output = model.generate(**inputs)
```

## PEFT methods comparison

| Method | Trainable % | Memory | Speed | Best For |
|--------|------------|--------|-------|----------|
| **LoRA** | 0.1-1% | Low | Fast | General fine-tuning |
| **QLoRA** | 0.1-1% | Very Low | Medium | Memory-constrained |
| AdaLoRA | 0.1-1% | Low | Medium | Automatic rank selection |
| IA3 | 0.01% | Minimal | Fastest | Few-shot adaptation |
| Prefix Tuning | 0.1% | Low | Medium | Generation control |
| Prompt Tuning | 0.001% | Minimal | Fast | Simple task adaptation |
| P-Tuning v2 | 0.1% | Low | Medium | NLU tasks |

### IA3 (minimal parameters)

```python
from peft import IA3Config

ia3_config = IA3Config(
    target_modules=["q_proj", "v_proj", "k_proj", "down_proj"],
    feedforward_modules=["down_proj"]
)
model = get_peft_model(model, ia3_config)
# Trains only 0.01% of parameters!
```

### Prefix Tuning

```python
from peft import PrefixTuningConfig

prefix_config = PrefixTuningConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=20,      # Prepended tokens
    prefix_projection=True       # Use MLP projection
)
model = get_peft_model(model, prefix_config)
```

## Integration patterns

### With TRL (SFTTrainer)

```python
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear")

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="./output", max_seq_length=512),
    train_dataset=dataset,
    peft_config=lora_config,  # Pass LoRA config directly
)
trainer.train()
```

### With Axolotl (YAML config)

```yaml
# axolotl config.yaml
adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj
lora_target_linear: true  # Target all linear layers
```

### With vLLM (inference)

```python
from vllm import LLM
from vllm.lora.request import LoRARequest

# Load base model with LoRA support
llm = LLM(model="meta-llama/Llama-3.1-8B", enable_lora=True)

# Serve with adapter
outputs = llm.generate(
    prompts,
    lora_request=LoRARequest("adapter1", 1, "./lora-adapter")
)
```

## Performance benchmarks

### Memory usage (Llama 3.1 8B)

| Method | GPU Memory | Trainable Params |
|--------|-----------|------------------|
| Full fine-tuning | 60+ GB | 8B (100%) |
| LoRA r=16 | 18 GB | 14M (0.17%) |
| QLoRA r=16 | 6 GB | 14M (0.17%) |
| IA3 | 16 GB | 800K (0.01%) |

### Training speed (A100 80GB)

| Method | Tokens/sec | vs Full FT |
|--------|-----------|------------|
| Full FT | 2,500 | 1x |
| LoRA | 3,200 | 1.3x |
| QLoRA | 2,100 | 0.84x |

### Quality (MMLU benchmark)

| Model | Full FT | LoRA | QLoRA |
|-------|---------|------|-------|
| Llama 2-7B | 45.3 | 44.8 | 44.1 |
| Llama 2-13B | 54.8 | 54.2 | 53.5 |

## Common issues

### CUDA OOM during training

```python
# Solution 1: Enable gradient checkpointing
model.gradient_checkpointing_enable()

# Solution 2: Reduce batch size + increase accumulation
TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16
)

# Solution 3: Use QLoRA
from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
```

### Adapter not applying

```python
# Verify adapter is active
print(model.active_adapters)  # Should show adapter name

# Check trainable parameters
model.print_trainable_parameters()

# Ensure model in training mode
model.train()
```

### Quality degradation

```python
# Increase rank
LoraConfig(r=32, lora_alpha=64)

# Target more modules
target_modules = "all-linear"

# Use more training data and epochs
TrainingArguments(num_train_epochs=5)

# Lower learning rate
TrainingArguments(learning_rate=1e-4)
```

## Best practices

1. **Start with r=8-16**, increase if quality insufficient
2. **Use alpha = 2 * rank** as starting point
3. **Target attention + MLP layers** for best quality/efficiency
4. **Enable gradient checkpointing** for memory savings
5. **Save adapters frequently** (small files, easy rollback)
6. **Evaluate on held-out data** before merging
7. **Use QLoRA for 70B+ models** on consumer hardware

## References

- **[Advanced Usage](references/advanced-usage.md)** - DoRA, LoftQ, rank stabilization, custom modules
- **[Troubleshooting](references/troubleshooting.md)** - Common errors, debugging, optimization

## Resources

- **GitHub**: https://github.com/huggingface/peft
- **Docs**: https://huggingface.co/docs/peft
- **LoRA Paper**: arXiv:2106.09685
- **QLoRA Paper**: arXiv:2305.14314
- **Models**: https://huggingface.co/models?library=peft
Sync all skills and memories 2026-04-14 07:27 2026-04-14 07:27:20 +09:00			`---`
			`name: peft-fine-tuning`
			`description: Parameter-efficient fine-tuning for LLMs using LoRA, QLoRA, and 25+ methods. Use when fine-tuning large models (7B-70B) with limited GPU memory, when you need to train <1% of parameters with minimal accuracy loss, or for multi-adapter serving. HuggingFace's official library integrated with transformers ecosystem.`
			`version: 1.0.0`
			`author: Orchestra Research`
			`license: MIT`
			`dependencies: [peft>=0.13.0, transformers>=4.45.0, torch>=2.0.0, bitsandbytes>=0.43.0]`
			`metadata:`
			`hermes:`
			`tags: [Fine-Tuning, PEFT, LoRA, QLoRA, Parameter-Efficient, Adapters, Low-Rank, Memory Optimization, Multi-Adapter]`

			`---`

			`# PEFT (Parameter-Efficient Fine-Tuning)`

			`Fine-tune LLMs by training <1% of parameters using LoRA, QLoRA, and 25+ adapter methods.`

			`## When to use PEFT`

			`Use PEFT/LoRA when:`
			`- Fine-tuning 7B-70B models on consumer GPUs (RTX 4090, A100)`
			`- Need to train <1% parameters (6MB adapters vs 14GB full model)`
			`- Want fast iteration with multiple task-specific adapters`
			`- Deploying multiple fine-tuned variants from one base model`

			`Use QLoRA (PEFT + quantization) when:`
			`- Fine-tuning 70B models on single 24GB GPU`
			`- Memory is the primary constraint`
			`- Can accept ~5% quality trade-off vs full fine-tuning`

			`Use full fine-tuning instead when:`
			`- Training small models (<1B parameters)`
			`- Need maximum quality and have compute budget`
			`- Significant domain shift requires updating all weights`

			`## Quick start`

			`### Installation`

			```bash
			`# Basic installation`
			`pip install peft`

			`# With quantization support (recommended)`
			`pip install peft bitsandbytes`

			`# Full stack`
			`pip install peft transformers accelerate bitsandbytes datasets`
			```

			`### LoRA fine-tuning (standard)`

			```python
			`from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer`
			`from peft import get_peft_model, LoraConfig, TaskType`
			`from datasets import load_dataset`

			`# Load base model`
			`model_name = "meta-llama/Llama-3.1-8B"`
			`model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")`
			`tokenizer = AutoTokenizer.from_pretrained(model_name)`
			`tokenizer.pad_token = tokenizer.eos_token`

			`# LoRA configuration`
			`lora_config = LoraConfig(`
			`task_type=TaskType.CAUSAL_LM,`
			`r=16, # Rank (8-64, higher = more capacity)`
			`lora_alpha=32, # Scaling factor (typically 2*r)`
			`lora_dropout=0.05, # Dropout for regularization`
			`target_modules=["q_proj", "v_proj", "k_proj", "o_proj"], # Attention layers`
			`bias="none" # Don't train biases`
			`)`

			`# Apply LoRA`
			`model = get_peft_model(model, lora_config)`
			`model.print_trainable_parameters()`
			`# Output: trainable params: 13,631,488 \|\| all params: 8,043,307,008 \|\| trainable%: 0.17%`

			`# Prepare dataset`
			`dataset = load_dataset("databricks/databricks-dolly-15k", split="train")`

			`def tokenize(example):`
			`text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}"`
			`return tokenizer(text, truncation=True, max_length=512, padding="max_length")`

			`tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)`

			`# Training`
			`training_args = TrainingArguments(`
			`output_dir="./lora-llama",`
			`num_train_epochs=3,`
			`per_device_train_batch_size=4,`
			`gradient_accumulation_steps=4,`
			`learning_rate=2e-4,`
			`fp16=True,`
			`logging_steps=10,`
			`save_strategy="epoch"`
			`)`

			`trainer = Trainer(`
			`model=model,`
			`args=training_args,`
			`train_dataset=tokenized,`
			`data_collator=lambda data: {"input_ids": torch.stack([f["input_ids"] for f in data]),`
			`"attention_mask": torch.stack([f["attention_mask"] for f in data]),`
			`"labels": torch.stack([f["input_ids"] for f in data])}`
			`)`

			`trainer.train()`

			`# Save adapter only (6MB vs 16GB)`
			`model.save_pretrained("./lora-llama-adapter")`
			```

			`### QLoRA fine-tuning (memory-efficient)`

			```python
			`from transformers import AutoModelForCausalLM, BitsAndBytesConfig`
			`from peft import get_peft_model, LoraConfig, prepare_model_for_kbit_training`

			`# 4-bit quantization config`
			`bnb_config = BitsAndBytesConfig(`
			`load_in_4bit=True,`
			`bnb_4bit_quant_type="nf4", # NormalFloat4 (best for LLMs)`
			`bnb_4bit_compute_dtype="bfloat16", # Compute in bf16`
			`bnb_4bit_use_double_quant=True # Nested quantization`
			`)`

			`# Load quantized model`
			`model = AutoModelForCausalLM.from_pretrained(`
			`"meta-llama/Llama-3.1-70B",`
			`quantization_config=bnb_config,`
			`device_map="auto"`
			`)`

			`# Prepare for training (enables gradient checkpointing)`
			`model = prepare_model_for_kbit_training(model)`

			`# LoRA config for QLoRA`
			`lora_config = LoraConfig(`
			`r=64, # Higher rank for 70B`
			`lora_alpha=128,`
			`lora_dropout=0.1,`
			`target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],`
			`bias="none",`
			`task_type="CAUSAL_LM"`
			`)`

			`model = get_peft_model(model, lora_config)`
			`# 70B model now fits on single 24GB GPU!`
			```

			`## LoRA parameter selection`

			`### Rank (r) - capacity vs efficiency`

			`\| Rank \| Trainable Params \| Memory \| Quality \| Use Case \|`
			`\|------\|-----------------\|--------\|---------\|----------\|`
			`\| 4 \| ~3M \| Minimal \| Lower \| Simple tasks, prototyping \|`
			`\| 8 \| ~7M \| Low \| Good \| Recommended starting point \|`
			`\| 16 \| ~14M \| Medium \| Better \| General fine-tuning \|`
			`\| 32 \| ~27M \| Higher \| High \| Complex tasks \|`
			`\| 64 \| ~54M \| High \| Highest \| Domain adaptation, 70B models \|`

			`### Alpha (lora_alpha) - scaling factor`

			```python
			`# Rule of thumb: alpha = 2 * rank`
			`LoraConfig(r=16, lora_alpha=32) # Standard`
			`LoraConfig(r=16, lora_alpha=16) # Conservative (lower learning rate effect)`
			`LoraConfig(r=16, lora_alpha=64) # Aggressive (higher learning rate effect)`
			```

			`### Target modules by architecture`

			```python
			`# Llama / Mistral / Qwen`
			`target_modules = ["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]`

			`# GPT-2 / GPT-Neo`
			`target_modules = ["c_attn", "c_proj", "c_fc"]`

			`# Falcon`
			`target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]`

			`# BLOOM`
			`target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]`

			`# Auto-detect all linear layers`
			`target_modules = "all-linear" # PEFT 0.6.0+`
			```

			`## Loading and merging adapters`

			`### Load trained adapter`

			```python
			`from peft import PeftModel, AutoPeftModelForCausalLM`
			`from transformers import AutoModelForCausalLM`

			`# Option 1: Load with PeftModel`
			`base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")`
			`model = PeftModel.from_pretrained(base_model, "./lora-llama-adapter")`

			`# Option 2: Load directly (recommended)`
			`model = AutoPeftModelForCausalLM.from_pretrained(`
			`"./lora-llama-adapter",`
			`device_map="auto"`
			`)`
			```

			`### Merge adapter into base model`

			```python
			`# Merge for deployment (no adapter overhead)`
			`merged_model = model.merge_and_unload()`

			`# Save merged model`
			`merged_model.save_pretrained("./llama-merged")`
			`tokenizer.save_pretrained("./llama-merged")`

			`# Push to Hub`
			`merged_model.push_to_hub("username/llama-finetuned")`
			```

			`### Multi-adapter serving`

			```python
			`from peft import PeftModel`

			`# Load base with first adapter`
			`model = AutoPeftModelForCausalLM.from_pretrained("./adapter-task1")`

			`# Load additional adapters`
			`model.load_adapter("./adapter-task2", adapter_name="task2")`
			`model.load_adapter("./adapter-task3", adapter_name="task3")`

			`# Switch between adapters at runtime`
			`model.set_adapter("task1") # Use task1 adapter`
			`output1 = model.generate(**inputs)`

			`model.set_adapter("task2") # Switch to task2`
			`output2 = model.generate(**inputs)`

			`# Disable adapters (use base model)`
			`with model.disable_adapter():`
			`base_output = model.generate(**inputs)`
			```

			`## PEFT methods comparison`

			`\| Method \| Trainable % \| Memory \| Speed \| Best For \|`
			`\|--------\|------------\|--------\|-------\|----------\|`
			`\| LoRA \| 0.1-1% \| Low \| Fast \| General fine-tuning \|`
			`\| QLoRA \| 0.1-1% \| Very Low \| Medium \| Memory-constrained \|`
			`\| AdaLoRA \| 0.1-1% \| Low \| Medium \| Automatic rank selection \|`
			`\| IA3 \| 0.01% \| Minimal \| Fastest \| Few-shot adaptation \|`
			`\| Prefix Tuning \| 0.1% \| Low \| Medium \| Generation control \|`
			`\| Prompt Tuning \| 0.001% \| Minimal \| Fast \| Simple task adaptation \|`
			`\| P-Tuning v2 \| 0.1% \| Low \| Medium \| NLU tasks \|`

			`### IA3 (minimal parameters)`

			```python
			`from peft import IA3Config`

			`ia3_config = IA3Config(`
			`target_modules=["q_proj", "v_proj", "k_proj", "down_proj"],`
			`feedforward_modules=["down_proj"]`
			`)`
			`model = get_peft_model(model, ia3_config)`
			`# Trains only 0.01% of parameters!`
			```

			`### Prefix Tuning`

			```python
			`from peft import PrefixTuningConfig`

			`prefix_config = PrefixTuningConfig(`
			`task_type="CAUSAL_LM",`
			`num_virtual_tokens=20, # Prepended tokens`
			`prefix_projection=True # Use MLP projection`
			`)`
			`model = get_peft_model(model, prefix_config)`
			```

			`## Integration patterns`

			`### With TRL (SFTTrainer)`

			```python
			`from trl import SFTTrainer, SFTConfig`
			`from peft import LoraConfig`

			`lora_config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear")`

			`trainer = SFTTrainer(`
			`model=model,`
			`args=SFTConfig(output_dir="./output", max_seq_length=512),`
			`train_dataset=dataset,`
			`peft_config=lora_config, # Pass LoRA config directly`
			`)`
			`trainer.train()`
			```

			`### With Axolotl (YAML config)`

			```yaml
			`# axolotl config.yaml`
			`adapter: lora`
			`lora_r: 16`
			`lora_alpha: 32`
			`lora_dropout: 0.05`
			`lora_target_modules:`
			`- q_proj`
			`- v_proj`
			`- k_proj`
			`- o_proj`
			`lora_target_linear: true # Target all linear layers`
			```

			`### With vLLM (inference)`

			```python
			`from vllm import LLM`
			`from vllm.lora.request import LoRARequest`

			`# Load base model with LoRA support`
			`llm = LLM(model="meta-llama/Llama-3.1-8B", enable_lora=True)`

			`# Serve with adapter`
			`outputs = llm.generate(`
			`prompts,`
			`lora_request=LoRARequest("adapter1", 1, "./lora-adapter")`
			`)`
			```

			`## Performance benchmarks`

			`### Memory usage (Llama 3.1 8B)`

			`\| Method \| GPU Memory \| Trainable Params \|`
			`\|--------\|-----------\|------------------\|`
			`\| Full fine-tuning \| 60+ GB \| 8B (100%) \|`
			`\| LoRA r=16 \| 18 GB \| 14M (0.17%) \|`
			`\| QLoRA r=16 \| 6 GB \| 14M (0.17%) \|`
			`\| IA3 \| 16 GB \| 800K (0.01%) \|`

			`### Training speed (A100 80GB)`

			`\| Method \| Tokens/sec \| vs Full FT \|`
			`\|--------\|-----------\|------------\|`
			`\| Full FT \| 2,500 \| 1x \|`
			`\| LoRA \| 3,200 \| 1.3x \|`
			`\| QLoRA \| 2,100 \| 0.84x \|`

			`### Quality (MMLU benchmark)`

			`\| Model \| Full FT \| LoRA \| QLoRA \|`
			`\|-------\|---------\|------\|-------\|`
			`\| Llama 2-7B \| 45.3 \| 44.8 \| 44.1 \|`
			`\| Llama 2-13B \| 54.8 \| 54.2 \| 53.5 \|`

			`## Common issues`

			`### CUDA OOM during training`

			```python
			`# Solution 1: Enable gradient checkpointing`
			`model.gradient_checkpointing_enable()`

			`# Solution 2: Reduce batch size + increase accumulation`
			`TrainingArguments(`
			`per_device_train_batch_size=1,`
			`gradient_accumulation_steps=16`
			`)`

			`# Solution 3: Use QLoRA`
			`from transformers import BitsAndBytesConfig`
			`bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")`
			```

			`### Adapter not applying`

			```python
			`# Verify adapter is active`
			`print(model.active_adapters) # Should show adapter name`

			`# Check trainable parameters`
			`model.print_trainable_parameters()`

			`# Ensure model in training mode`
			`model.train()`
			```

			`### Quality degradation`

			```python
			`# Increase rank`
			`LoraConfig(r=32, lora_alpha=64)`

			`# Target more modules`
			`target_modules = "all-linear"`

			`# Use more training data and epochs`
			`TrainingArguments(num_train_epochs=5)`

			`# Lower learning rate`
			`TrainingArguments(learning_rate=1e-4)`
			```

			`## Best practices`

			`1. Start with r=8-16, increase if quality insufficient`
			`2. *Use alpha = 2 rank** as starting point`
			`3. Target attention + MLP layers for best quality/efficiency`
			`4. Enable gradient checkpointing for memory savings`
			`5. Save adapters frequently (small files, easy rollback)`
			`6. Evaluate on held-out data before merging`
			`7. Use QLoRA for 70B+ models on consumer hardware`

			`## References`

			`- [Advanced Usage](references/advanced-usage.md) - DoRA, LoftQ, rank stabilization, custom modules`
			`- [Troubleshooting](references/troubleshooting.md) - Common errors, debugging, optimization`

			`## Resources`

			`- GitHub: https://github.com/huggingface/peft`
			`- Docs: https://huggingface.co/docs/peft`
			`- LoRA Paper: arXiv:2106.09685`
			`- QLoRA Paper: arXiv:2305.14314`
			`- Models: https://huggingface.co/models?library=peft`