11 KiB
Distributed Evaluation
Guide to running evaluation across multiple GPUs using data parallelism and tensor/pipeline parallelism.
Overview
Distributed evaluation speeds up benchmarking by:
- Data Parallelism: Split evaluation samples across GPUs (each GPU has full model copy)
- Tensor Parallelism: Split model weights across GPUs (for large models)
- Pipeline Parallelism: Split model layers across GPUs (for very large models)
When to use:
- Data Parallel: Model fits on single GPU, want faster evaluation
- Tensor/Pipeline Parallel: Model too large for single GPU
HuggingFace Models (hf)
Data Parallelism (Recommended)
Each GPU loads a full copy of the model and processes a subset of evaluation data.
Single Node (8 GPUs):
accelerate launch --multi_gpu --num_processes 8 \
-m lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf,dtype=bfloat16 \
--tasks mmlu,gsm8k,hellaswag \
--batch_size 16
Speedup: Near-linear (8 GPUs = ~8× faster)
Memory: Each GPU needs full model (7B model ≈ 14GB × 8 = 112GB total)
Tensor Parallelism (Model Sharding)
Split model weights across GPUs for models too large for single GPU.
Without accelerate launcher:
lm_eval --model hf \
--model_args \
pretrained=meta-llama/Llama-2-70b-hf,\
parallelize=True,\
dtype=bfloat16 \
--tasks mmlu,gsm8k \
--batch_size 8
With 8 GPUs: 70B model (140GB) / 8 = 17.5GB per GPU ✅
Advanced sharding:
lm_eval --model hf \
--model_args \
pretrained=meta-llama/Llama-2-70b-hf,\
parallelize=True,\
device_map_option=auto,\
max_memory_per_gpu=40GB,\
max_cpu_memory=100GB,\
dtype=bfloat16 \
--tasks mmlu
Options:
device_map_option:"auto"(default),"balanced","balanced_low_0"max_memory_per_gpu: Max memory per GPU (e.g.,"40GB")max_cpu_memory: Max CPU memory for offloadingoffload_folder: Disk offloading directory
Combined Data + Tensor Parallelism
Use both for very large models.
Example: 70B model on 16 GPUs (2 copies, 8 GPUs each):
accelerate launch --multi_gpu --num_processes 2 \
-m lm_eval --model hf \
--model_args \
pretrained=meta-llama/Llama-2-70b-hf,\
parallelize=True,\
dtype=bfloat16 \
--tasks mmlu \
--batch_size 8
Result: 2× speedup from data parallelism, 70B model fits via tensor parallelism
Configuration with accelerate config
Create ~/.cache/huggingface/accelerate/default_config.yaml:
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_machines: 1
num_processes: 8
gpu_ids: all
mixed_precision: bf16
Then run:
accelerate launch -m lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks mmlu
vLLM Models (vllm)
vLLM provides highly optimized distributed inference.
Tensor Parallelism
Single Node (4 GPUs):
lm_eval --model vllm \
--model_args \
pretrained=meta-llama/Llama-2-70b-hf,\
tensor_parallel_size=4,\
dtype=auto,\
gpu_memory_utilization=0.9 \
--tasks mmlu,gsm8k \
--batch_size auto
Memory: 70B model split across 4 GPUs = ~35GB per GPU
Data Parallelism
Multiple model replicas:
lm_eval --model vllm \
--model_args \
pretrained=meta-llama/Llama-2-7b-hf,\
data_parallel_size=4,\
dtype=auto,\
gpu_memory_utilization=0.8 \
--tasks hellaswag,arc_challenge \
--batch_size auto
Result: 4 model replicas = 4× throughput
Combined Tensor + Data Parallelism
Example: 8 GPUs = 4 TP × 2 DP:
lm_eval --model vllm \
--model_args \
pretrained=meta-llama/Llama-2-70b-hf,\
tensor_parallel_size=4,\
data_parallel_size=2,\
dtype=auto,\
gpu_memory_utilization=0.85 \
--tasks mmlu \
--batch_size auto
Result: 70B model fits (TP=4), 2× speedup (DP=2)
Multi-Node vLLM
vLLM doesn't natively support multi-node. Use Ray:
# Start Ray cluster
ray start --head --port=6379
# Run evaluation
lm_eval --model vllm \
--model_args \
pretrained=meta-llama/Llama-2-70b-hf,\
tensor_parallel_size=8,\
dtype=auto \
--tasks mmlu
NVIDIA NeMo Models (nemo_lm)
Data Replication
8 replicas on 8 GPUs:
torchrun --nproc-per-node=8 --no-python \
lm_eval --model nemo_lm \
--model_args \
path=/path/to/model.nemo,\
devices=8 \
--tasks hellaswag,arc_challenge \
--batch_size 32
Speedup: Near-linear (8× faster)
Tensor Parallelism
4-way tensor parallelism:
torchrun --nproc-per-node=4 --no-python \
lm_eval --model nemo_lm \
--model_args \
path=/path/to/70b_model.nemo,\
devices=4,\
tensor_model_parallel_size=4 \
--tasks mmlu,gsm8k \
--batch_size 16
Pipeline Parallelism
2 TP × 2 PP on 4 GPUs:
torchrun --nproc-per-node=4 --no-python \
lm_eval --model nemo_lm \
--model_args \
path=/path/to/model.nemo,\
devices=4,\
tensor_model_parallel_size=2,\
pipeline_model_parallel_size=2 \
--tasks mmlu \
--batch_size 8
Constraint: devices = TP × PP
Multi-Node NeMo
Currently not supported by lm-evaluation-harness.
SGLang Models (sglang)
Tensor Parallelism
lm_eval --model sglang \
--model_args \
pretrained=meta-llama/Llama-2-70b-hf,\
tp_size=4,\
dtype=auto \
--tasks gsm8k \
--batch_size auto
Data Parallelism (Deprecated)
Note: SGLang is deprecating data parallelism. Use tensor parallelism instead.
lm_eval --model sglang \
--model_args \
pretrained=meta-llama/Llama-2-7b-hf,\
dp_size=4,\
dtype=auto \
--tasks mmlu
Performance Comparison
70B Model Evaluation (MMLU, 5-shot)
| Method | GPUs | Time | Memory/GPU | Notes |
|---|---|---|---|---|
| HF (no parallel) | 1 | 8 hours | 140GB (OOM) | Won't fit |
| HF (TP=8) | 8 | 2 hours | 17.5GB | Slower, fits |
| HF (DP=8) | 8 | 1 hour | 140GB (OOM) | Won't fit |
| vLLM (TP=4) | 4 | 30 min | 35GB | Fast! |
| vLLM (TP=4, DP=2) | 8 | 15 min | 35GB | Fastest |
7B Model Evaluation (Multiple Tasks)
| Method | GPUs | Time | Speedup |
|---|---|---|---|
| HF (single) | 1 | 4 hours | 1× |
| HF (DP=4) | 4 | 1 hour | 4× |
| HF (DP=8) | 8 | 30 min | 8× |
| vLLM (DP=8) | 8 | 15 min | 16× |
Takeaway: vLLM is significantly faster than HuggingFace for inference.
Choosing Parallelism Strategy
Decision Tree
Model fits on single GPU?
├─ YES: Use data parallelism
│ ├─ HF: accelerate launch --multi_gpu --num_processes N
│ └─ vLLM: data_parallel_size=N (fastest)
│
└─ NO: Use tensor/pipeline parallelism
├─ Model < 70B:
│ └─ vLLM: tensor_parallel_size=4
├─ Model 70-175B:
│ ├─ vLLM: tensor_parallel_size=8
│ └─ Or HF: parallelize=True
└─ Model > 175B:
└─ Contact framework authors
Memory Estimation
Rule of thumb:
Memory (GB) = Parameters (B) × Precision (bytes) × 1.2 (overhead)
Examples:
- 7B FP16: 7 × 2 × 1.2 = 16.8GB ✅ Fits A100 40GB
- 13B FP16: 13 × 2 × 1.2 = 31.2GB ✅ Fits A100 40GB
- 70B FP16: 70 × 2 × 1.2 = 168GB ❌ Need TP=4 or TP=8
- 70B BF16: 70 × 2 × 1.2 = 168GB (same as FP16)
With tensor parallelism:
Memory per GPU = Total Memory / TP
- 70B on 4 GPUs: 168GB / 4 = 42GB per GPU ✅
- 70B on 8 GPUs: 168GB / 8 = 21GB per GPU ✅
Multi-Node Evaluation
HuggingFace with SLURM
Submit job:
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=1
srun accelerate launch --multi_gpu \
--num_processes $((SLURM_NNODES * 8)) \
-m lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks mmlu,gsm8k,hellaswag \
--batch_size 16
Submit:
sbatch eval_job.sh
Manual Multi-Node Setup
On each node, run:
accelerate launch \
--multi_gpu \
--num_machines 4 \
--num_processes 32 \
--main_process_ip $MASTER_IP \
--main_process_port 29500 \
--machine_rank $NODE_RANK \
-m lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks mmlu
Environment variables:
MASTER_IP: IP of rank 0 nodeNODE_RANK: 0, 1, 2, 3 for each node
Best Practices
1. Start Small
Test on small sample first:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-70b-hf,parallelize=True \
--tasks mmlu \
--limit 100 # Just 100 samples
2. Monitor GPU Usage
# Terminal 1: Run evaluation
lm_eval --model hf ...
# Terminal 2: Monitor
watch -n 1 nvidia-smi
Look for:
- GPU utilization > 90%
- Memory usage stable
- All GPUs active
3. Optimize Batch Size
# Auto batch size (recommended)
--batch_size auto
# Or tune manually
--batch_size 16 # Start here
--batch_size 32 # Increase if memory allows
4. Use Mixed Precision
--model_args dtype=bfloat16 # Faster, less memory
5. Check Communication
For data parallelism, check network bandwidth:
# Should see InfiniBand or high-speed network
nvidia-smi topo -m
Troubleshooting
"CUDA out of memory"
Solutions:
-
Increase tensor parallelism:
--model_args tensor_parallel_size=8 # Was 4 -
Reduce batch size:
--batch_size 4 # Was 16 -
Lower precision:
--model_args dtype=int8 # Quantization
"NCCL error" or Hanging
Check:
- All GPUs visible:
nvidia-smi - NCCL installed:
python -c "import torch; print(torch.cuda.nccl.version())" - Network connectivity between nodes
Fix:
export NCCL_DEBUG=INFO # Enable debug logging
export NCCL_IB_DISABLE=0 # Use InfiniBand if available
Slow Evaluation
Possible causes:
- Data loading bottleneck: Preprocess dataset
- Low GPU utilization: Increase batch size
- Communication overhead: Reduce parallelism degree
Profile:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks mmlu \
--limit 100 \
--log_samples # Check timing
GPUs Imbalanced
Symptom: GPU 0 at 100%, others at 50%
Solution: Use device_map_option=balanced:
--model_args parallelize=True,device_map_option=balanced
Example Configurations
Small Model (7B) - Fast Evaluation
# 8 A100s, data parallel
accelerate launch --multi_gpu --num_processes 8 \
-m lm_eval --model hf \
--model_args \
pretrained=meta-llama/Llama-2-7b-hf,\
dtype=bfloat16 \
--tasks mmlu,gsm8k,hellaswag,arc_challenge \
--num_fewshot 5 \
--batch_size 32
# Time: ~30 minutes
Large Model (70B) - vLLM
# 8 H100s, tensor parallel
lm_eval --model vllm \
--model_args \
pretrained=meta-llama/Llama-2-70b-hf,\
tensor_parallel_size=8,\
dtype=auto,\
gpu_memory_utilization=0.9 \
--tasks mmlu,gsm8k,humaneval \
--num_fewshot 5 \
--batch_size auto
# Time: ~1 hour
Very Large Model (175B+)
Requires specialized setup - contact framework maintainers
References
- HuggingFace Accelerate: https://huggingface.co/docs/accelerate/
- vLLM docs: https://docs.vllm.ai/
- NeMo docs: https://docs.nvidia.com/nemo-framework/
- lm-eval distributed guide:
docs/model_guide.md