# PEFT Troubleshooting Guide

## Installation Issues

### bitsandbytes CUDA Error

**Error**: `CUDA Setup failed despite GPU being available`

**Fix**:

```bash
# Check CUDA version
nvcc --version

# Install matching bitsandbytes
pip uninstall bitsandbytes
pip install bitsandbytes --no-cache-dir

# Or compile from source for specific CUDA
git clone https://github.com/TimDettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=118 make cuda11x  # Adjust for your CUDA
pip install .
```
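
If the error persists, recent bitsandbytes releases bundle a diagnostic entry point that prints the detected CUDA setup and runs a sanity check (availability depends on your installed version):

```bash
python -m bitsandbytes
```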

### Triton Import Error

**Error**: `ModuleNotFoundError: No module named 'triton'`

**Fix**:

```bash
# Install triton (Linux only)
pip install triton

# Windows: official Triton wheels target Linux, so Triton-backed
# kernels are unavailable; run under WSL2 or use code paths that
# do not require Triton. (Note: CUDA_VISIBLE_DEVICES only selects
# a GPU; it does not disable Triton.)
```

### PEFT Version Conflicts

**Error**: `AttributeError: 'LoraConfig' object has no attribute 'use_dora'`

**Fix**:

```bash
# Upgrade to latest PEFT (quote the spec so the shell
# does not treat >= as a redirection)
pip install "peft>=0.13.0" --upgrade

# Check version
python -c "import peft; print(peft.__version__)"
```

## Training Issues

### CUDA Out of Memory

**Error**: `torch.cuda.OutOfMemoryError: CUDA out of memory`

**Solutions**:

1. **Enable gradient checkpointing**:
```python
from peft import prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
```

2. **Reduce batch size**:
```python
TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16  # Maintain effective batch size
)
```

3. **Use QLoRA**:
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
```

4. **Lower LoRA rank**:
```python
LoraConfig(r=8)  # Instead of r=16 or higher
```

5. **Target fewer modules**:
```python
target_modules=["q_proj", "v_proj"]  # Instead of all-linear
```
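
Several of these combine well. Below is a minimal sketch of a memory-frugal QLoRA setup, assuming recent transformers/peft/bitsandbytes versions; `model_name` is a hypothetical placeholder for your checkpoint:

```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments

model_name = "your-base-model"  # hypothetical placeholder

# 4-bit NF4 base weights with double quantization (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)

# Gradient checkpointing trades compute for activation memory
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

# Low rank, few target modules
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"]))

# Tiny per-device batch; accumulation keeps the effective batch size
args = TrainingArguments(
    output_dir="./out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
)
```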

### Loss Not Decreasing

**Problem**: Training loss stays flat or increases.

**Solutions**:

1. **Check learning rate**:
```python
# Start lower
TrainingArguments(learning_rate=1e-4)  # Not 2e-4 or higher
```

2. **Verify adapter is active**:
```python
model.print_trainable_parameters()
# Should show >0 trainable params

# Check adapter applied
print(model.peft_config)
```

3. **Check data formatting**:
```python
# Verify tokenization
sample = dataset[0]
decoded = tokenizer.decode(sample["input_ids"])
print(decoded)  # Should look correct
```

4. **Increase rank**:
```python
LoraConfig(r=32, lora_alpha=64)  # More capacity
```
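
If none of these help, confirm that gradients actually reach the LoRA weights after a backward pass. A quick sketch, assuming `model` and a tokenized `batch` are already on the same device:

```python
loss = model(**batch).loss
loss.backward()

# LoRA parameters should show non-zero gradient norms
for name, param in model.named_parameters():
    if param.requires_grad and param.grad is not None:
        print(f"{name}: grad norm {param.grad.norm().item():.6f}")

model.zero_grad()
```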

### NaN Loss

**Error**: `Loss is NaN`

**Fix**:

```python
# Use bf16 instead of fp16 (wider dynamic range, less overflow)
TrainingArguments(bf16=True, fp16=False)

# If you must use fp16, the Trainer applies dynamic loss scaling
# automatically; fp16_full_eval additionally runs evaluation in fp16
TrainingArguments(fp16=True, fp16_full_eval=True)

# Lower learning rate
TrainingArguments(learning_rate=5e-5)

# Check for data issues: input_ids are integers and cannot be NaN,
# but out-of-range token ids produce bad losses
for batch in dataloader:
    if (batch["input_ids"] >= len(tokenizer)).any():
        print("Out-of-range token id in input!")
```

### Adapter Not Training

**Problem**: `trainable params: 0` or model not updating.

**Fix**:

```python
# Verify LoRA applied to correct modules
for name, module in model.named_modules():
    if "lora" in name.lower():
        print(f"Found LoRA: {name}")

# Check target_modules match model architecture
from peft.utils import TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING
print(TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING.get(model.config.model_type))

# Ensure model in training mode
model.train()

# Check requires_grad
for name, param in model.named_parameters():
    if param.requires_grad:
        print(f"Trainable: {name}")
```
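
If the diagnosis shows that `target_modules` did not match anything, re-create the config with module names that actually exist in the model. A sketch, assuming `base_model` is the un-wrapped model and `q_proj`/`v_proj` exist in its architecture:

```python
from peft import LoraConfig, get_peft_model

config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base_model, config)
model.print_trainable_parameters()  # should now report > 0 trainable params
```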

## Loading Issues

### Adapter Loading Fails

**Error**: `ValueError: Can't find adapter weights`

**Fix**:

```python
# Check adapter files exist
import os
print(os.listdir("./adapter-path"))
# Should contain: adapter_config.json, adapter_model.safetensors

# Load with correct structure
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM

# Check config
config = PeftConfig.from_pretrained("./adapter-path")
print(config)

# Load base model first
base_model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(base_model, "./adapter-path")
```

### Base Model Mismatch

**Error**: `RuntimeError: size mismatch`

**Fix**:

```python
# Ensure base model matches adapter
from peft import PeftConfig
from transformers import AutoModelForCausalLM

config = PeftConfig.from_pretrained("./adapter-path")
print(f"Base model: {config.base_model_name_or_path}")

# Load exact same base model
base_model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path)
```

### Safetensors vs PyTorch Format

**Error**: `ValueError: We couldn't connect to 'https://huggingface.co'`

**Fix**:

```python
from peft import PeftModel

# Force local loading
model = PeftModel.from_pretrained(
    base_model,
    "./adapter-path",
    local_files_only=True
)

# Or specify the serialization format when saving
model.save_pretrained("./adapter", safe_serialization=True)   # safetensors
model.save_pretrained("./adapter", safe_serialization=False)  # pytorch
```
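
Offline behavior can also be forced globally through environment variables honored by huggingface_hub and transformers (the files must already be in the local cache or on disk):

```bash
export HF_HUB_OFFLINE=1
export TRANSFORMERS_OFFLINE=1
```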

## Inference Issues

### Slow Generation

**Problem**: Inference much slower than expected.

**Solutions**:

1. **Merge adapter for deployment**:
```python
merged_model = model.merge_and_unload()
# No adapter overhead during inference
```

2. **Use optimized inference engine**:
```python
from vllm import LLM
llm = LLM(model="./merged-model", dtype="half")
```

3. **Enable Flash Attention**:
```python
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    attn_implementation="flash_attention_2"
)
```
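
After merging (option 1 above), persist the merged weights together with the tokenizer so the deployment path needs no PEFT at all. A minimal sketch, assuming `tokenizer` is the one used during training:

```python
merged_model.save_pretrained("./merged-model")
tokenizer.save_pretrained("./merged-model")

# Reload later as a plain transformers model
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("./merged-model")
```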

### Output Quality Issues

**Problem**: Fine-tuned model produces worse outputs.

**Solutions**:

1. **Check evaluation without adapter**:
```python
with model.disable_adapter():
    base_output = model.generate(**inputs)
# Compare with adapter output
```

2. **Reduce sampling randomness during eval**:
```python
# Greedy decoding (temperature is ignored when do_sample=False)
model.generate(**inputs, do_sample=False)
# Or sample with a low temperature
model.generate(**inputs, do_sample=True, temperature=0.1)
```

3. **Retrain with more data**: increase the number of training samples, use higher-quality data, or train for more epochs.

### Wrong Adapter Active

**Problem**: Model using wrong adapter or no adapter.

**Fix**:

```python
# Check active adapters
print(model.active_adapters)

# Explicitly set adapter
model.set_adapter("your-adapter-name")

# List all adapters
print(model.peft_config.keys())
```

## QLoRA Specific Issues

### Quantization Errors

**Error**: `RuntimeError: mat1 and mat2 shapes cannot be multiplied`

**Fix**:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Ensure compute dtype matches
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # Match model dtype
    bnb_4bit_quant_type="nf4"
)

# Load with correct dtype
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16
)
```

### QLoRA OOM

**Error**: OOM even with 4-bit quantization.

**Fix**:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Enable double quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True  # Further memory reduction
)

# Use offloading
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    max_memory={0: "20GB", "cpu": "100GB"}
)
```
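
Optimizer state is another common OOM source with QLoRA; the paged 8-bit AdamW from bitsandbytes keeps that state in pageable memory. A sketch, assuming a recent transformers release that exposes this `optim` choice:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./out",
    optim="paged_adamw_8bit",  # pages optimizer state to CPU under memory pressure
)
```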

### QLoRA Merge Fails

**Error**: `RuntimeError: expected scalar type BFloat16 but found Float`

**Fix**:

```python
# Dequantize before merging
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load in higher precision for merging
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,  # Not quantized
    device_map="auto"
)

# Load adapter
model = PeftModel.from_pretrained(base_model, "./qlora-adapter")

# Now merge
merged = model.merge_and_unload()
```

## Multi-Adapter Issues

### Adapter Conflict

**Error**: `ValueError: Adapter with name 'default' already exists`

**Fix**:

```python
# Use unique names
model.load_adapter("./adapter1", adapter_name="task1")
model.load_adapter("./adapter2", adapter_name="task2")

# Or delete existing
model.delete_adapter("default")
```

### Mixed Precision Adapters

**Problem**: Adapters trained with different dtypes.

**Fix**:

```python
import torch
from peft import PeftModel

# Convert adapter precision
model = PeftModel.from_pretrained(base_model, "./adapter")
model = model.to(torch.bfloat16)

# Or load with specific dtype
model = PeftModel.from_pretrained(
    base_model,
    "./adapter",
    torch_dtype=torch.bfloat16
)
```

## Performance Optimization

### Memory Profiling

```python
import torch

def print_memory():
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1e9
        reserved = torch.cuda.memory_reserved() / 1e9
        print(f"Allocated: {allocated:.2f}GB, Reserved: {reserved:.2f}GB")

# Profile during training
print_memory()  # Before
model.train()
loss = model(**batch).loss
loss.backward()
print_memory()  # After
```
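
Peak usage is usually more informative than a point-in-time reading; PyTorch tracks it directly (same assumptions as above about `model` and `batch`):

```python
torch.cuda.reset_peak_memory_stats()
loss = model(**batch).loss
loss.backward()
peak = torch.cuda.max_memory_allocated() / 1e9
print(f"Peak allocated: {peak:.2f}GB")
```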

### Speed Profiling

```python
import time
import torch

def benchmark_generation(model, tokenizer, prompt, n_runs=5):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Warmup
    model.generate(**inputs, max_new_tokens=10)
    torch.cuda.synchronize()

    # Benchmark
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        outputs = model.generate(**inputs, max_new_tokens=100)
        torch.cuda.synchronize()
        times.append(time.perf_counter() - start)

    tokens = outputs.shape[1] - inputs.input_ids.shape[1]
    avg_time = sum(times) / len(times)
    print(f"Speed: {tokens/avg_time:.2f} tokens/sec")

# Compare adapter vs merged
benchmark_generation(adapter_model, tokenizer, "Hello")
benchmark_generation(merged_model, tokenizer, "Hello")
```

## Getting Help

1. **Check PEFT GitHub Issues**: https://github.com/huggingface/peft/issues
2. **HuggingFace Forums**: https://discuss.huggingface.co/
3. **PEFT Documentation**: https://huggingface.co/docs/peft

### Debugging Template

When reporting issues, include:

```python
# System info
import peft
import transformers
import torch

print(f"PEFT: {peft.__version__}")
print(f"Transformers: {transformers.__version__}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA: {torch.version.cuda}")
print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'N/A'}")

# Config
print(model.peft_config)
model.print_trainable_parameters()
```