# Comprehensive Hyperparameter Sweeps Guide

Complete guide to hyperparameter optimization with W&B Sweeps.

## Table of Contents

- Sweep Configuration
- Search Strategies
- Parameter Distributions
- Early Termination
- Parallel Execution
- Advanced Patterns
- Real-World Examples

## Sweep Configuration

### Basic Sweep Config

```python
import wandb

sweep_config = {
    'method': 'bayes',  # Search strategy
    'metric': {
        'name': 'val/accuracy',
        'goal': 'maximize'  # or 'minimize'
    },
    'parameters': {
        'learning_rate': {
            'distribution': 'log_uniform_values',
            'min': 1e-5,
            'max': 1e-1
        },
        'batch_size': {
            'values': [16, 32, 64, 128]
        }
    }
}

# Initialize sweep
sweep_id = wandb.sweep(sweep_config, project="my-project")
```

### Complete Config Example

```python
sweep_config = {
    # Required: Search method
    'method': 'bayes',

    # Required: Optimization metric
    'metric': {
        'name': 'val/f1_score',
        'goal': 'maximize'
    },

    # Required: Parameters to search
    'parameters': {
        # Continuous parameter
        'learning_rate': {
            'distribution': 'log_uniform_values',
            'min': 1e-5,
            'max': 1e-1
        },

        # Discrete values
        'batch_size': {
            'values': [16, 32, 64, 128]
        },

        # Categorical
        'optimizer': {
            'values': ['adam', 'sgd', 'rmsprop', 'adamw']
        },

        # Uniform distribution
        'dropout': {
            'distribution': 'uniform',
            'min': 0.1,
            'max': 0.5
        },

        # Integer range
        'num_layers': {
            'distribution': 'int_uniform',
            'min': 2,
            'max': 10
        },

        # Fixed value (constant across runs)
        'epochs': {
            'value': 50
        }
    },

    # Optional: Early termination
    'early_terminate': {
        'type': 'hyperband',
        'min_iter': 5,
        's': 2,
        'eta': 3,
        'max_iter': 27
    }
}
```

## Search Strategies

### 1. Grid Search

Exhaustively search all combinations.

```python
sweep_config = {
    'method': 'grid',
    'parameters': {
        'learning_rate': {
            'values': [0.001, 0.01, 0.1]
        },
        'batch_size': {
            'values': [16, 32, 64]
        },
        'optimizer': {
            'values': ['adam', 'sgd']
        }
    }
}

# Total runs: 3 × 3 × 2 = 18 runs
```
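
The run count is just the product of the value-list lengths. A quick way to sanity-check a grid's size locally before launching the sweep (plain `itertools`, independent of W&B):

```python
from itertools import product

# Same value lists as the grid config above
grid = {
    'learning_rate': [0.001, 0.01, 0.1],
    'batch_size': [16, 32, 64],
    'optimizer': ['adam', 'sgd'],
}

# Enumerate every combination the grid sweep will run
combinations = [dict(zip(grid, vals)) for vals in product(*grid.values())]
print(len(combinations))  # 18
```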

**Pros:**

- Comprehensive search
- Reproducible results
- No randomness

**Cons:**

- Exponential growth with parameters
- Inefficient for continuous parameters
- Not scalable beyond 3-4 parameters

**When to use:**

- Few parameters (< 4)
- All discrete values
- Need complete coverage

### 2. Random Search

Randomly sample parameter combinations.

```python
sweep_config = {
    'method': 'random',
    'parameters': {
        'learning_rate': {
            'distribution': 'log_uniform_values',
            'min': 1e-5,
            'max': 1e-1
        },
        'batch_size': {
            'values': [16, 32, 64, 128, 256]
        },
        'dropout': {
            'distribution': 'uniform',
            'min': 0.0,
            'max': 0.5
        },
        'num_layers': {
            'distribution': 'int_uniform',
            'min': 2,
            'max': 8
        }
    }
}

# Run 100 random trials
wandb.agent(sweep_id, function=train, count=100)
```

**Pros:**

- Scales to many parameters
- Can run indefinitely
- Often finds good solutions quickly

**Cons:**

- No learning from previous runs
- May miss the optimal region
- Results vary with random seed

**When to use:**

- Many parameters (> 4)
- Quick exploration
- Limited budget

### 3. Bayesian Optimization (Recommended)

Learn from previous trials to sample promising regions.

```python
sweep_config = {
    'method': 'bayes',
    'metric': {
        'name': 'val/loss',
        'goal': 'minimize'
    },
    'parameters': {
        'learning_rate': {
            'distribution': 'log_uniform_values',
            'min': 1e-5,
            'max': 1e-1
        },
        'weight_decay': {
            'distribution': 'log_uniform_values',
            'min': 1e-6,
            'max': 1e-2
        },
        'dropout': {
            'distribution': 'uniform',
            'min': 0.1,
            'max': 0.5
        },
        'num_layers': {
            'values': [2, 3, 4, 5, 6]
        }
    }
}
```

**Pros:**

- Most sample-efficient
- Learns from past trials
- Focuses on promising regions

**Cons:**

- Initial random exploration phase
- May get stuck in local optima
- Slower per iteration

**When to use:**

- Expensive training runs
- Need best performance
- Limited compute budget

## Parameter Distributions

### Continuous Distributions

```python
# Log-uniform: good for learning rates, regularization
'learning_rate': {
    'distribution': 'log_uniform_values',
    'min': 1e-6,
    'max': 1e-1
}

# Uniform: good for dropout, momentum
'dropout': {
    'distribution': 'uniform',
    'min': 0.0,
    'max': 0.5
}

# Normal distribution
'parameter': {
    'distribution': 'normal',
    'mu': 0.5,
    'sigma': 0.1
}

# Log-normal distribution
'parameter': {
    'distribution': 'log_normal',
    'mu': 0.0,
    'sigma': 1.0
}
```

### Discrete Distributions

```python
# Fixed set of values
'batch_size': {
    'values': [16, 32, 64, 128, 256]
}

# Integer uniform
'num_layers': {
    'distribution': 'int_uniform',
    'min': 2,
    'max': 10
}

# Quantized uniform (step size)
'layer_size': {
    'distribution': 'q_uniform',
    'min': 32,
    'max': 512,
    'q': 32  # Step by 32: 32, 64, 96, 128...
}

# Quantized log-uniform
'hidden_size': {
    'distribution': 'q_log_uniform_values',
    'min': 32,
    'max': 1024,
    'q': 32
}
```
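
Quantized distributions draw a continuous sample and then round it to the nearest multiple of `q`. A minimal sketch of the `q_uniform` behavior described above (my own illustration of the rounding, not W&B's internal code):

```python
import random

def sample_q_uniform(lo, hi, q, rng=random):
    # Draw uniformly in [lo, hi], then snap to the nearest multiple of q
    x = rng.uniform(lo, hi)
    return round(x / q) * q

# Every sample lands on the grid 32, 64, 96, ... 512
samples = [sample_q_uniform(32, 512, 32) for _ in range(5)]
print(samples)
```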

### Categorical Parameters

```python
# Optimizers
'optimizer': {
    'values': ['adam', 'sgd', 'rmsprop', 'adamw']
}

# Model architectures
'model': {
    'values': ['resnet18', 'resnet34', 'resnet50', 'efficientnet_b0']
}

# Activation functions
'activation': {
    'values': ['relu', 'gelu', 'silu', 'leaky_relu']
}
```

## Early Termination

Stop underperforming runs early to save compute.

### Hyperband

```python
sweep_config = {
    'method': 'bayes',
    'metric': {'name': 'val/accuracy', 'goal': 'maximize'},
    'parameters': {...},

    # Hyperband early termination
    'early_terminate': {
        'type': 'hyperband',
        'min_iter': 3,   # Minimum iterations before termination
        's': 2,          # Bracket count
        'eta': 3,        # Downsampling rate
        'max_iter': 27   # Maximum iterations
    }
}
```

**How it works:**

- Runs trials in brackets
- Keeps the top 1/eta performers each round
- Eliminates the bottom performers early
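
With `min_iter` and `eta`, the comparison points fall at `min_iter * eta^k`. A small sketch of the schedule implied by the config above (`min_iter=3`, `eta=3`, `max_iter=27`) — my own illustration of the bracket arithmetic, not W&B internals:

```python
def hyperband_checkpoints(min_iter, eta, max_iter):
    """Iterations at which runs are compared and the bottom ones stopped."""
    checkpoints = []
    it = min_iter
    while it <= max_iter:
        checkpoints.append(it)
        it *= eta
    return checkpoints

print(hyperband_checkpoints(3, 3, 27))  # [3, 9, 27]
```

So a run in this configuration can be terminated after 3, 9, or 27 iterations, depending on how it ranks against its peers at each checkpoint.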

### Custom Termination

```python
def train():
    run = wandb.init()
    best_acc = 0.0

    for epoch in range(MAX_EPOCHS):
        loss = train_epoch()
        val_acc = validate()

        wandb.log({'val/accuracy': val_acc, 'epoch': epoch})

        # Custom early stopping
        if epoch > 5 and val_acc < 0.5:
            print("Early stop: Poor performance")
            break

        if epoch > 10 and val_acc < best_acc + 0.01:
            print("Early stop: No improvement")
            break

        best_acc = max(best_acc, val_acc)
```

## Training Function

### Basic Template

```python
def train():
    # Initialize W&B run
    run = wandb.init()

    # Get hyperparameters
    config = wandb.config

    # Build model with config
    model = build_model(
        hidden_size=config.hidden_size,
        num_layers=config.num_layers,
        dropout=config.dropout
    )

    # Create optimizer
    optimizer = create_optimizer(
        model.parameters(),
        name=config.optimizer,
        lr=config.learning_rate,
        weight_decay=config.weight_decay
    )

    # Training loop
    for epoch in range(config.epochs):
        # Train
        train_loss, train_acc = train_epoch(
            model, optimizer, train_loader, config.batch_size
        )

        # Validate
        val_loss, val_acc = validate(model, val_loader)

        # Log metrics
        wandb.log({
            'train/loss': train_loss,
            'train/accuracy': train_acc,
            'val/loss': val_loss,
            'val/accuracy': val_acc,
            'epoch': epoch
        })

    # Save final model
    torch.save(model.state_dict(), 'model.pth')
    wandb.save('model.pth')

    # Finish run
    wandb.finish()
```

### With PyTorch

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import wandb

def train():
    run = wandb.init()
    config = wandb.config

    # Data
    train_loader = DataLoader(
        train_dataset,
        batch_size=config.batch_size,
        shuffle=True
    )

    # Model
    model = ResNet(
        num_classes=config.num_classes,
        dropout=config.dropout
    ).to(device)

    # Optimizer
    if config.optimizer == 'adam':
        optimizer = torch.optim.Adam(
            model.parameters(),
            lr=config.learning_rate,
            weight_decay=config.weight_decay
        )
    elif config.optimizer == 'sgd':
        optimizer = torch.optim.SGD(
            model.parameters(),
            lr=config.learning_rate,
            momentum=config.momentum,
            weight_decay=config.weight_decay
        )

    # Scheduler
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=config.epochs
    )

    # Training
    for epoch in range(config.epochs):
        model.train()
        train_loss = 0.0

        for data, target in train_loader:
            data, target = data.to(device), target.to(device)

            optimizer.zero_grad()
            output = model(data)
            loss = nn.CrossEntropyLoss()(output, target)
            loss.backward()
            optimizer.step()

            train_loss += loss.item()

        # Validation
        model.eval()
        val_loss, val_acc = validate(model, val_loader)

        # Step scheduler
        scheduler.step()

        # Log
        wandb.log({
            'train/loss': train_loss / len(train_loader),
            'val/loss': val_loss,
            'val/accuracy': val_acc,
            'learning_rate': scheduler.get_last_lr()[0],
            'epoch': epoch
        })
```

## Parallel Execution

### Multiple Agents

Run sweep agents in parallel to speed up the search.

```python
# Initialize the sweep once
sweep_id = wandb.sweep(sweep_config, project="my-project")

# Then run multiple agents against it, one per terminal:

# Agent 1 (Terminal 1)
wandb.agent(sweep_id, function=train, count=20)

# Agent 2 (Terminal 2)
wandb.agent(sweep_id, function=train, count=20)

# Agent 3 (Terminal 3)
wandb.agent(sweep_id, function=train, count=20)

# Total: 60 runs across 3 agents
```

### Multi-GPU Execution

```python
import torch
import wandb

def train():
    run = wandb.init()
    config = wandb.config

    # CUDA_VISIBLE_DEVICES restricts which GPU this process can see,
    # so inside the process the assigned GPU is always cuda:0
    device = torch.device('cuda')
    model = model.to(device)

    # ... rest of training ...

# Run agents on different GPUs:

# Terminal 1
# CUDA_VISIBLE_DEVICES=0 wandb agent sweep_id

# Terminal 2
# CUDA_VISIBLE_DEVICES=1 wandb agent sweep_id

# Terminal 3
# CUDA_VISIBLE_DEVICES=2 wandb agent sweep_id
```

## Advanced Patterns

### Nested Parameters

```python
sweep_config = {
    'method': 'bayes',
    'metric': {'name': 'val/accuracy', 'goal': 'maximize'},
    'parameters': {
        'model': {
            'parameters': {
                'type': {
                    'values': ['resnet', 'efficientnet']
                },
                'size': {
                    'values': ['small', 'medium', 'large']
                }
            }
        },
        'optimizer': {
            'parameters': {
                'type': {
                    'values': ['adam', 'sgd']
                },
                'lr': {
                    'distribution': 'log_uniform_values',
                    'min': 1e-5,
                    'max': 1e-1
                }
            }
        }
    }
}

# Access nested config (nested groups come back as dictionaries)
def train():
    run = wandb.init()
    model_type = wandb.config.model['type']
    model_size = wandb.config.model['size']
    opt_type = wandb.config.optimizer['type']
    lr = wandb.config.optimizer['lr']
```

### Conditional Parameters

```python
sweep_config = {
    'method': 'bayes',
    'parameters': {
        'optimizer': {
            'values': ['adam', 'sgd']
        },
        'learning_rate': {
            'distribution': 'log_uniform_values',
            'min': 1e-5,
            'max': 1e-1
        },
        # Only used if optimizer == 'sgd'
        'momentum': {
            'distribution': 'uniform',
            'min': 0.5,
            'max': 0.99
        }
    }
}

def train():
    run = wandb.init()
    config = wandb.config

    if config.optimizer == 'adam':
        optimizer = torch.optim.Adam(
            model.parameters(),
            lr=config.learning_rate
        )
    elif config.optimizer == 'sgd':
        optimizer = torch.optim.SGD(
            model.parameters(),
            lr=config.learning_rate,
            momentum=config.momentum  # Conditional parameter
        )
```

## Real-World Examples

### Image Classification

```python
sweep_config = {
    'method': 'bayes',
    'metric': {
        'name': 'val/top1_accuracy',
        'goal': 'maximize'
    },
    'parameters': {
        # Model
        'architecture': {
            'values': ['resnet50', 'resnet101', 'efficientnet_b0', 'efficientnet_b3']
        },
        'pretrained': {
            'values': [True, False]
        },

        # Training
        'learning_rate': {
            'distribution': 'log_uniform_values',
            'min': 1e-5,
            'max': 1e-2
        },
        'batch_size': {
            'values': [16, 32, 64, 128]
        },
        'optimizer': {
            'values': ['adam', 'sgd', 'adamw']
        },
        'weight_decay': {
            'distribution': 'log_uniform_values',
            'min': 1e-6,
            'max': 1e-2
        },

        # Regularization
        'dropout': {
            'distribution': 'uniform',
            'min': 0.0,
            'max': 0.5
        },
        'label_smoothing': {
            'distribution': 'uniform',
            'min': 0.0,
            'max': 0.2
        },

        # Data augmentation
        'mixup_alpha': {
            'distribution': 'uniform',
            'min': 0.0,
            'max': 1.0
        },
        'cutmix_alpha': {
            'distribution': 'uniform',
            'min': 0.0,
            'max': 1.0
        }
    },
    'early_terminate': {
        'type': 'hyperband',
        'min_iter': 5
    }
}
```

### NLP Fine-Tuning

```python
sweep_config = {
    'method': 'bayes',
    'metric': {'name': 'eval/f1', 'goal': 'maximize'},
    'parameters': {
        # Model
        'model_name': {
            'values': ['bert-base-uncased', 'roberta-base', 'distilbert-base-uncased']
        },

        # Training
        'learning_rate': {
            'distribution': 'log_uniform_values',
            'min': 1e-6,
            'max': 1e-4
        },
        'per_device_train_batch_size': {
            'values': [8, 16, 32]
        },
        'num_train_epochs': {
            'values': [3, 4, 5]
        },
        'warmup_ratio': {
            'distribution': 'uniform',
            'min': 0.0,
            'max': 0.1
        },
        'weight_decay': {
            'distribution': 'log_uniform_values',
            'min': 1e-4,
            'max': 1e-1
        },

        # Optimizer
        'adam_beta1': {
            'distribution': 'uniform',
            'min': 0.8,
            'max': 0.95
        },
        'adam_beta2': {
            'distribution': 'uniform',
            'min': 0.95,
            'max': 0.999
        }
    }
}
```

## Best Practices

### 1. Start Small

```python
# Initial exploration: random search, 20 runs
sweep_config_v1 = {
    'method': 'random',
    'parameters': {...}
}
sweep_id_v1 = wandb.sweep(sweep_config_v1, project="my-project")
wandb.agent(sweep_id_v1, train, count=20)

# Refined search: Bayes over the narrowed ranges
sweep_config_v2 = {
    'method': 'bayes',
    'parameters': {
        'learning_rate': {
            'min': 5e-5,  # Narrowed from the initial 1e-6 to 1e-4 range
            'max': 1e-4
        }
    }
}
```

### 2. Use Log Scales

```python
# ✅ Good: log scale for learning rate
'learning_rate': {
    'distribution': 'log_uniform_values',
    'min': 1e-6,
    'max': 1e-2
}

# ❌ Bad: linear scale
'learning_rate': {
    'distribution': 'uniform',
    'min': 0.000001,
    'max': 0.01
}
```
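
The reason: on a linear scale over [1e-6, 1e-2], nearly every sample lands in the top decade, so the small learning rates are almost never tried. A quick self-contained demonstration (standard library only, not W&B's sampler):

```python
import math
import random

rng = random.Random(0)
lo, hi = 1e-6, 1e-2

# 10,000 samples from each scheme
uniform = [rng.uniform(lo, hi) for _ in range(10_000)]
log_uniform = [
    math.exp(rng.uniform(math.log(lo), math.log(hi))) for _ in range(10_000)
]

# Fraction of samples below 1e-4 (the bottom half of the log range)
frac_uniform = sum(x < 1e-4 for x in uniform) / len(uniform)
frac_log = sum(x < 1e-4 for x in log_uniform) / len(log_uniform)

print(f"uniform: {frac_uniform:.3f}, log-uniform: {frac_log:.3f}")
```

Linear sampling puts roughly 1% of trials below 1e-4, while log-uniform sampling covers each decade evenly.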

### 3. Set Reasonable Ranges

```python
# Base ranges on prior knowledge
'learning_rate': {'min': 1e-5, 'max': 1e-3},  # Typical for Adam
'batch_size': {'values': [16, 32, 64]},       # GPU memory limits
'dropout': {'min': 0.1, 'max': 0.5}           # Too high hurts training
```

### 4. Monitor Resource Usage

```python
def train():
    run = wandb.init()

    # Log system metrics
    wandb.log({
        'system/gpu_memory_allocated': torch.cuda.memory_allocated(),
        'system/gpu_memory_reserved': torch.cuda.memory_reserved()
    })
```

### 5. Save Best Models

```python
def train():
    run = wandb.init()
    best_acc = 0.0

    for epoch in range(config.epochs):
        val_acc = validate(model)

        if val_acc > best_acc:
            best_acc = val_acc
            # Save best checkpoint
            torch.save(model.state_dict(), 'best_model.pth')
            wandb.save('best_model.pth')
```

## Resources

- **Sweeps Documentation**: https://docs.wandb.ai/guides/sweeps
- **Configuration Reference**: https://docs.wandb.ai/guides/sweeps/configuration
- **Examples**: https://github.com/wandb/examples/tree/master/examples/wandb-sweeps