Troubleshooting Guide
Contents
- Out of memory (OOM) errors
- Performance issues
- Model loading errors
- Network and connection issues
- Quantization problems
- Distributed serving issues
- Debugging tools and commands
Out of memory (OOM) errors
Symptom: torch.cuda.OutOfMemoryError during model loading
Cause: Model + KV cache exceeds available VRAM
Solutions (try in order):
- Reduce GPU memory utilization:
vllm serve MODEL --gpu-memory-utilization 0.7 # Try 0.7, 0.75, 0.8
- Reduce max sequence length:
vllm serve MODEL --max-model-len 4096 # Instead of 8192
- Enable quantization:
vllm serve MODEL --quantization awq # ~4x memory reduction vs FP16
- Use tensor parallelism (multiple GPUs):
vllm serve MODEL --tensor-parallel-size 2 # Split across 2 GPUs
- Reduce max concurrent sequences:
vllm serve MODEL --max-num-seqs 128 # Default is 256
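The flags above all trade capacity for memory. As a rough sanity check, the KV cache cost per token can be estimated from the model config. A sketch; the 8B-class numbers below are assumptions, read the real ones from your model's config.json:

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # 2x for keys and values, per layer, per KV head
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Assumed Llama-3-8B-like config: 32 layers, 8 KV heads (GQA), head_dim 128, fp16
per_token = kv_cache_bytes_per_token(32, 8, 128)
per_seq_mib = per_token * 4096 / 2**20  # one full 4096-token sequence
print(per_token, round(per_seq_mib))    # prints: 131072 512
```

At ~512 MiB of cache per max-length sequence, it is clear why halving --max-model-len or --max-num-seqs has such a large effect.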
Symptom: OOM during inference (not model loading)
Cause: KV cache fills up during generation
Solutions:
# Leave headroom by lowering GPU memory utilization (default is 0.9)
vllm serve MODEL --gpu-memory-utilization 0.85
# Reduce batch size
vllm serve MODEL --max-num-seqs 64
# Reduce max tokens per request
# Set in client request: max_tokens=512
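The max_tokens cap is purely client-side. A sketch of the request body as it would be POSTed to /v1/completions; the model name and prompt are placeholders:

```python
import json

payload = {
    "model": "MODEL",
    "prompt": "Hello",
    "max_tokens": 512,  # hard cap on generated tokens for this request,
                        # which also bounds how far its KV cache can grow
}
body = json.dumps(payload)
# POST `body` to http://localhost:8000/v1/completions with
# Content-Type: application/json (via curl, requests, or the OpenAI client)
```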
Symptom: OOM with quantized model
Cause: Quantization overhead or incorrect configuration
Solution:
# Ensure quantization flag matches model
vllm serve TheBloke/Llama-2-70B-AWQ --quantization awq # Must specify
# Try different dtype
vllm serve MODEL --quantization awq --dtype float16
Performance issues
Symptom: Low throughput (e.g. <50 req/s where >100 is expected)
Diagnostic steps:
- Check GPU utilization:
watch -n 1 nvidia-smi
# GPU utilization should be >80%
If <80%, increase concurrent requests:
vllm serve MODEL --max-num-seqs 512 # Increase from 256
- Check if memory-bound:
# If memory at 100% but GPU <80%, reduce sequence length
vllm serve MODEL --max-model-len 4096
- Enable optimizations:
vllm serve MODEL \
--enable-prefix-caching \
--enable-chunked-prefill \
--max-num-seqs 512
- Check tensor parallelism settings:
# Tensor-parallel size must evenly divide the model's attention head count (powers of 2 are safest)
vllm serve MODEL --tensor-parallel-size 4 # Not 3 or 5
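The GPU and memory checks in steps 1 and 2 can be scripted. A sketch that parses one line of `nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits` output and applies the rules of thumb above; the 80% and 90% thresholds are assumptions:

```python
def diagnose(csv_line):
    """Return a tuning hint from one nvidia-smi CSV line: 'util, used, total'."""
    util, used, total = (float(x) for x in csv_line.split(","))
    if util < 80 and used / total < 0.9:
        return "underutilized: raise --max-num-seqs"
    if util < 80:
        return "memory-bound: lower --max-model-len"
    return "compute-bound: GPU is busy"

print(diagnose("45, 20000, 81920"))  # idle GPU, plenty of memory free
print(diagnose("60, 80000, 81920"))  # memory nearly full, GPU waiting
```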
Symptom: High TTFT (time to first token >1 second)
Causes and solutions:
Long prompts:
vllm serve MODEL --enable-chunked-prefill
No prefix caching:
vllm serve MODEL --enable-prefix-caching # For repeated prompts
Too many concurrent requests:
vllm serve MODEL --max-num-seqs 64 # Reduce to prioritize latency
Model too large for single GPU:
vllm serve MODEL --tensor-parallel-size 2 # Parallelize prefill
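TTFT can be measured directly from a streaming response. A sketch that times the first chunk of any iterator, so it works with the OpenAI client's stream=True chunks; the fake stream below stands in for a real server:

```python
import time

def time_to_first_token(stream):
    """Return (ttft_seconds, list_of_chunks) for any chunk iterator."""
    start = time.perf_counter()
    ttft = None
    chunks = []
    for chunk in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first token arrived
        chunks.append(chunk)
    return ttft, chunks

def fake_stream():
    time.sleep(0.05)  # simulated prefill delay
    yield "Hello"
    yield " world"

ttft, chunks = time_to_first_token(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, {len(chunks)} chunks")
```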
Symptom: Slow token generation (low tokens/sec)
Diagnostic:
# Confirm the loaded model size and dtype in the startup logs
vllm serve MODEL
# Try speculative decoding with a small draft model
vllm serve MODEL --speculative-model DRAFT_MODEL
For H100 GPUs, enable FP8:
vllm serve MODEL --quantization fp8
Model loading errors
Symptom: OSError: MODEL not found
Causes:
- Model name typo:
# Check exact model name on HuggingFace
vllm serve meta-llama/Meta-Llama-3-8B-Instruct # Exact repo name, including capitalization
- Private/gated model:
# Login to HuggingFace first
huggingface-cli login
# Then run vLLM
vllm serve meta-llama/Meta-Llama-3-70B-Instruct
- Custom model needs trust flag:
vllm serve MODEL --trust-remote-code
Symptom: ValueError: Tokenizer not found
Solution:
# Download model manually first
python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('MODEL')"
# Then launch vLLM
vllm serve MODEL
Symptom: ImportError: No module named 'flash_attn'
Solution:
# Install flash attention
pip install flash-attn --no-build-isolation
# Or fall back to a different attention backend
VLLM_ATTENTION_BACKEND=XFORMERS vllm serve MODEL
Network and connection issues
Symptom: Connection refused when querying server
Diagnostic:
- Check server is running:
curl http://localhost:8000/health
- Check port binding:
# Bind to all interfaces for remote access
vllm serve MODEL --host 0.0.0.0 --port 8000
# Check if port is in use
lsof -i :8000
- Check firewall:
# Allow port through firewall
sudo ufw allow 8000
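The checks above can be condensed into one TCP probe, stdlib only; host and port are whatever you passed to --host/--port:

```python
import socket

def port_reachable(host, port, timeout=2.0):
    """True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. port_reachable("localhost", 8000) before blaming the client code
```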
Symptom: Slow response times over network
Solutions:
- Increase timeout:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="EMPTY",
timeout=300.0 # 5 minute timeout
)
- Check network latency:
ping SERVER_IP # Should be <10ms for local network
- Use connection pooling with retries:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
session = requests.Session() # Reuses TCP connections across requests
retries = Retry(total=3, backoff_factor=1, status_forcelist=[502, 503, 504])
session.mount('http://', HTTPAdapter(max_retries=retries))
# Use session.post(...) instead of requests.post(...)
Quantization problems
Symptom: RuntimeError: Quantization format not supported
Solution:
# Ensure correct quantization method
vllm serve MODEL --quantization awq # For AWQ models
vllm serve MODEL --quantization gptq # For GPTQ models
# Check model card for quantization type
Symptom: Poor quality outputs after quantization
Diagnostic:
- Verify model is correctly quantized:
# Check the model's config.json for a quantization_config section
cat ~/.cache/huggingface/hub/models--ORG--MODEL/snapshots/*/config.json
- Try different quantization method:
# If AWQ quality issues, try FP8 (H100 only)
vllm serve MODEL --quantization fp8
# Or use less aggressive quantization
vllm serve MODEL # No quantization
- Increase temperature for better diversity:
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
Distributed serving issues
Symptom: RuntimeError: Distributed init failed
Diagnostic:
- Check environment variables:
# On all nodes
echo $MASTER_ADDR # Should be same
echo $MASTER_PORT # Should be same
echo $RANK # Should be unique per node (0, 1, 2, ...)
echo $WORLD_SIZE # Should be same (total nodes)
- Check network connectivity:
# From node 1 to node 2
ping NODE2_IP
nc -zv NODE2_IP 29500 # Check port accessibility
- Check NCCL settings:
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=eth0 # Or your network interface
vllm serve MODEL --tensor-parallel-size 8
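The env-var check in step 1 is easy to get wrong across many nodes. A sketch that validates the four variables as a plain dict (run against dict(os.environ) on each node); the rules mirror the comments above:

```python
def check_dist_env(env):
    """Return a list of problems with the four distributed-init variables."""
    problems = []
    for key in ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE"):
        if not env.get(key):
            problems.append(f"{key} is unset")
    if not problems:
        if not (env["RANK"].isdigit() and env["WORLD_SIZE"].isdigit()):
            problems.append("RANK/WORLD_SIZE must be integers")
        elif int(env["RANK"]) >= int(env["WORLD_SIZE"]):
            problems.append("RANK must be < WORLD_SIZE")
    return problems

# Expected setup for node 1 of a 2-node cluster (values are examples)
ok = {"MASTER_ADDR": "10.0.0.1", "MASTER_PORT": "29500",
      "RANK": "1", "WORLD_SIZE": "2"}
print(check_dist_env(ok))                    # []
print(check_dist_env({**ok, "RANK": "2"}))   # rank out of range
```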
Symptom: NCCL error: unhandled cuda error
Solutions:
# Set NCCL to use correct network interface
export NCCL_SOCKET_IFNAME=eth0 # Replace with your interface
# Increase timeout
export NCCL_TIMEOUT=1800 # 30 minutes
# Disable P2P transfers to work around some driver/CUDA errors
export NCCL_P2P_DISABLE=1
Debugging tools and commands
Enable debug logging
export VLLM_LOGGING_LEVEL=DEBUG
vllm serve MODEL
Monitor GPU usage
# Real-time GPU monitoring
watch -n 1 nvidia-smi
# Memory breakdown
nvidia-smi --query-gpu=memory.used,memory.free --format=csv -l 1
Profile performance
# Built-in benchmarking
vllm bench throughput \
--model MODEL \
--input-len 128 \
--output-len 256 \
--num-prompts 100
vllm bench latency \
--model MODEL \
--input-len 128 \
--output-len 256 \
--batch-size 8
Check metrics
# Prometheus metrics are exposed by the vLLM server itself
curl http://localhost:8000/metrics
# Filter for specific metrics
curl http://localhost:8000/metrics | grep vllm_time_to_first_token
# Key metrics to monitor:
# - vllm_time_to_first_token_seconds
# - vllm_time_per_output_token_seconds
# - vllm_num_requests_running
# - vllm_gpu_cache_usage_perc
# - vllm_request_success_total
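When grep isn't enough, the Prometheus text format parses easily. A sketch that extracts unlabeled samples from a /metrics response body; the sample text below is made up:

```python
def parse_metrics(text):
    """Map metric name -> float value; skips comments and labeled series."""
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "{" in line:
            continue  # ignore HELP/TYPE lines and labeled samples
        name, _, value = line.rpartition(" ")
        try:
            out[name] = float(value)
        except ValueError:
            pass  # skip anything that is not 'name value'
    return out

sample = """# TYPE vllm_num_requests_running gauge
vllm_num_requests_running 3.0
vllm_gpu_cache_usage_perc 0.42"""
print(parse_metrics(sample))
```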
Test server health
# Health check
curl http://localhost:8000/health
# Model info
curl http://localhost:8000/v1/models
# Test completion
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MODEL",
"prompt": "Hello",
"max_tokens": 10
}'
Common environment variables
# CUDA settings
export CUDA_VISIBLE_DEVICES=0,1,2,3 # Limit to specific GPUs
# vLLM settings
export VLLM_LOGGING_LEVEL=DEBUG
export VLLM_TRACE_FUNCTION=1 # Trace every function call (very slow; debugging only)
export VLLM_USE_V1=1 # Use the V1 engine (faster)
# NCCL settings (distributed)
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=0 # Enable InfiniBand
Collect diagnostic info for bug reports
# System info
nvidia-smi
python --version
pip show vllm
# vLLM version and config
vllm --version
python -c "import vllm; print(vllm.__version__)"
# Run with debug logging
export VLLM_LOGGING_LEVEL=DEBUG
vllm serve MODEL 2>&1 | tee vllm_debug.log
# Include in bug report:
# - vllm_debug.log
# - nvidia-smi output
# - Full command used
# - Expected vs actual behavior
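The collection steps can be scripted. A stdlib-only sketch that degrades gracefully when nvidia-smi or vllm is not on PATH:

```python
import json
import platform
import shutil
import subprocess
import sys

def collect_diagnostics():
    """Gather the bug-report info listed above into one dict."""
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "command": " ".join(sys.argv),  # the full command used
    }
    # Optional tools: capture output when installed, note absence otherwise
    for name, cmd in {"nvidia_smi": ["nvidia-smi"],
                      "vllm_version": ["vllm", "--version"]}.items():
        if shutil.which(cmd[0]):
            info[name] = subprocess.run(cmd, capture_output=True,
                                        text=True).stdout.strip()
        else:
            info[name] = "not installed"
    return info

print(json.dumps(collect_diagnostics(), indent=2))
```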