# Troubleshooting Guide

## Contents

- Out of memory (OOM) errors
- Performance issues
- Model loading errors
- Network and connection issues
- Quantization problems
- Distributed serving issues
- Debugging tools and commands

## Out of memory (OOM) errors

### Symptom: `torch.cuda.OutOfMemoryError` during model loading

**Cause**: Model weights plus the KV cache exceed available VRAM.

**Solutions (try in order)**:

1. **Reduce GPU memory utilization**:
   ```bash
   vllm serve MODEL --gpu-memory-utilization 0.7  # Default is 0.9; try 0.7, 0.75, 0.8
   ```

2. **Reduce max sequence length**:
   ```bash
   vllm serve MODEL --max-model-len 4096  # Instead of 8192
   ```

3. **Enable quantization**:
   ```bash
   vllm serve MODEL --quantization awq  # 4-bit weights, roughly 4x smaller than FP16; needs an AWQ checkpoint
   ```

4. **Use tensor parallelism** (multiple GPUs):
   ```bash
   vllm serve MODEL --tensor-parallel-size 2  # Split across 2 GPUs
   ```

5. **Reduce max concurrent sequences**:
   ```bash
   vllm serve MODEL --max-num-seqs 128  # Default is 256
   ```
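
To judge which of the options above is likely to help, it can be useful to estimate the memory budget first. The sketch below is a rough worst-case calculation, not vLLM's own accounting: it assumes every sequence reserves a full-length KV cache, and the model shape (32 layers, 32 KV heads, head dimension 128, FP16) is an assumed 7B-class example. The function names are hypothetical.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def estimate_vram_gib(num_params, num_layers, num_kv_heads, head_dim,
                      max_model_len, max_num_seqs, weight_bytes=2, kv_bytes=2):
    """Worst-case estimate: weights plus a full-length KV cache for every sequence."""
    weights = num_params * weight_bytes
    per_token = kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, kv_bytes)
    kv_cache = per_token * max_model_len * max_num_seqs
    return (weights + kv_cache) / 2**30


# Assumed 7B-class shape; compare the two sequence lengths from step 2.
for max_len in (8192, 4096):
    gib = estimate_vram_gib(7e9, 32, 32, 128, max_model_len=max_len, max_num_seqs=8)
    print(f"max_model_len={max_len}: ~{gib:.1f} GiB")
```

Because the KV cache term scales with both `max_model_len` and `max_num_seqs`, halving either one halves that part of the estimate, while quantization only shrinks the weights term.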

### Symptom: OOM during inference (not model loading)

**Cause**: The KV cache and per-batch activations leave too little free memory during generation.

**Solutions**:

```bash
# Lower the memory fraction vLLM claims; this shrinks the KV cache and leaves more headroom
vllm serve MODEL --gpu-memory-utilization 0.85

# Reduce batch size
vllm serve MODEL --max-num-seqs 64

# Reduce max tokens per request
# Set in client request: max_tokens=512
```
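
The last item is a client-side setting. As a minimal sketch, assuming the OpenAI-compatible `/v1/completions` endpoint used elsewhere in this guide, a request that caps `max_tokens` could be built like this with only the standard library. `build_completion_request` is a hypothetical helper, and the request is constructed but not sent.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=512):
    """Build (but do not send) a POST to the OpenAI-compatible completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_completion_request("http://localhost:8000", "MODEL", "Hello")
print(request.full_url, json.loads(request.data))
# To send it against a running server: urllib.request.urlopen(request, timeout=60)
```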

### Symptom: OOM with quantized model

**Cause**: Quantization overhead or incorrect configuration.

**Solution**:
```bash
# Ensure the quantization flag matches the checkpoint format
vllm serve TheBloke/Llama-2-70B-AWQ --quantization awq  # Set explicitly if it is not detected from the model config

# Try a different dtype (AWQ kernels generally expect float16)
vllm serve MODEL --quantization awq --dtype float16
```

## Performance issues

### Symptom: Low throughput (e.g. <50 req/sec when >100 is expected)

**Diagnostic steps**:

1. **Check GPU utilization**:
   ```bash
   watch -n 1 nvidia-smi
   # GPU utilization should be >80%
   ```

   If utilization is below 80%, increase concurrent requests:
   ```bash
   vllm serve MODEL --max-num-seqs 512  # Increase from 256
   ```

2. **Check if memory-bound**:
   ```bash
   # If memory is at 100% but GPU utilization is <80%, reduce sequence length
   vllm serve MODEL --max-model-len 4096
   ```

3. **Enable optimizations**:
   ```bash
   vllm serve MODEL \
     --enable-prefix-caching \
     --enable-chunked-prefill \
     --max-num-seqs 512
   ```

4. **Check tensor parallelism settings**:
   ```bash
   # The tensor-parallel size must evenly divide the model's attention-head count;
   # powers of 2 usually work
   vllm serve MODEL --tensor-parallel-size 4  # Sizes like 3 or 5 rarely divide evenly
   ```
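
The first two diagnostic steps can be scripted. The sketch below parses the CSV output of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` and applies the rules of thumb above. The 80% utilization threshold comes from this guide; the 95% memory threshold is an assumption standing in for "memory at 100%", and the sample rows are made up so the example runs without a GPU.

```python
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpu_stats(csv_text):
    """Turn 'util, used, total' rows into dicts holding percentages."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (float(field) for field in line.split(","))
        stats.append({"util_pct": util, "mem_pct": 100.0 * used / total})
    return stats


def diagnose(stat, util_threshold=80.0, mem_threshold=95.0):
    """Apply the rules of thumb from the steps above to one GPU."""
    if stat["util_pct"] >= util_threshold:
        return "busy: GPU is well utilized"
    if stat["mem_pct"] >= mem_threshold:
        return "memory-bound: consider a smaller --max-model-len"
    return "underused: consider a larger --max-num-seqs"


def read_gpu_stats():
    """Query the local GPUs; requires nvidia-smi on PATH."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_stats(out)


# Illustrative sample rows (made up): utilization %, memory used MiB, memory total MiB.
sample = "45, 79000, 81559\n92, 60000, 81559\n30, 20000, 81559\n"
for index, stat in enumerate(parse_gpu_stats(sample)):
    print(f"GPU {index}: {diagnose(stat)}")
```

On a machine with GPUs, replacing `parse_gpu_stats(sample)` with `read_gpu_stats()` applies the same checks to live readings.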

### Symptom: High TTFT (time to first token >1 second)

**Causes and solutions**:

**Long prompts**:
```bash
vllm serve MODEL --enable-chunked-prefill
```

**No prefix caching**:
```bash
vllm serve MODEL --enable-prefix-caching  # For repeated prompts
```

**Too many concurrent requests**:
```bash
vllm serve MODEL --max-num-seqs 64  # Reduce to prioritize latency
```

**Model too large for single GPU**:
```bash
vllm serve MODEL --tensor-parallel-size 2  # Parallelize prefill
```
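
Before and after changing any of these flags, it helps to measure TTFT from the client side. The following is a minimal sketch that times any iterable of streamed chunks; `fake_stream` is a stand-in with made-up delays so the example runs without a server. In practice the iterable would be a streaming response from your client library.

```python
import time


def measure_stream(chunks):
    """Consume a stream of chunks; return (ttft_seconds, total_seconds, n_chunks)."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in chunks:
        if first is None:
            first = time.perf_counter()  # Moment the first chunk arrived
        count += 1
    end = time.perf_counter()
    ttft = (first - start) if first is not None else None
    return ttft, end - start, count


def fake_stream(prefill_s=0.05, per_token_s=0.01, n_tokens=5):
    """Stand-in for a streaming response: a slow first chunk, then steady ones."""
    time.sleep(prefill_s)
    for i in range(n_tokens):
        if i:
            time.sleep(per_token_s)
        yield f"tok{i}"


ttft, total, count = measure_stream(fake_stream())
print(f"TTFT={ttft:.3f}s total={total:.3f}s chunks={count}")
```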

### Symptom: Slow token generation (low tokens/sec)

**Diagnostic**:
```bash
# Check that the expected model was loaded
vllm serve MODEL  # The model size appears in the startup logs

# Try speculative decoding (the flag name varies by version; see `vllm serve --help`)
vllm serve MODEL --speculative-model DRAFT_MODEL
```

**For H100 GPUs**, enable FP8:
```bash
vllm serve MODEL --quantization fp8
```

## Model loading errors

### Symptom: `OSError: MODEL not found`

**Causes**:

1. **Model name typo**:
   ```bash
   # Check the exact model name on Hugging Face
   vllm serve meta-llama/Meta-Llama-3-8B-Instruct  # Correct capitalization
   ```

2. **Private/gated model**:
   ```bash
   # Log in to Hugging Face first
   huggingface-cli login
   # Then run vLLM
   vllm serve meta-llama/Meta-Llama-3-70B-Instruct
   ```

3. **Custom model needs trust flag**:
   ```bash
   vllm serve MODEL --trust-remote-code
   ```

### Symptom: `ValueError: Tokenizer not found`

**Solution**:
```bash
# Download the tokenizer first to confirm it resolves
python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('MODEL')"

# Then launch vLLM
vllm serve MODEL
```

### Symptom: `ImportError: No module named 'flash_attn'`

**Solution**:
```bash
# Install flash attention
pip install flash-attn --no-build-isolation

# Or select a different attention backend
# (available backend names vary by vLLM version; XFORMERS is one example)
VLLM_ATTENTION_BACKEND=XFORMERS vllm serve MODEL
```

## Network and connection issues

### Symptom: `Connection refused` when querying server

**Diagnostic**:

1. **Check server is running**:
   ```bash
   curl http://localhost:8000/health
   ```

2. **Check port binding**:
   ```bash
   # Bind to all interfaces for remote access
   vllm serve MODEL --host 0.0.0.0 --port 8000

   # Check if port is in use
   lsof -i :8000
   ```

3. **Check firewall**:
   ```bash
   # Allow port through firewall
   sudo ufw allow 8000
   ```
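
When `curl` or `lsof` is not available on the client machine, the same reachability check can be done from Python. This is a small sketch using only the standard library; `is_port_open` is a hypothetical helper, and the demo listens on a throwaway local port so it runs without a vLLM server.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Self-contained demo: listen on an ephemeral local port, then probe it.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
demo_port = listener.getsockname()[1]
open_while_listening = is_port_open("127.0.0.1", demo_port)
listener.close()
open_after_close = is_port_open("127.0.0.1", demo_port)
print(open_while_listening, open_after_close)
```

Against a real deployment, `is_port_open("SERVER_IP", 8000)` returning `False` from a remote machine but `True` on the server itself points to the host binding or the firewall rather than to vLLM.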

### Symptom: Slow response times over network

**Solutions**:

1. **Increase timeout**:
   ```python
   from openai import OpenAI

   client = OpenAI(
       base_url="http://localhost:8000/v1",
       api_key="EMPTY",
       timeout=300.0,  # 5 minute timeout
   )
   ```

2. **Check network latency**:
   ```bash
   ping SERVER_IP  # Should be <10ms for local network
   ```

3. **Use connection pooling**:
   ```python
   import requests
   from requests.adapters import HTTPAdapter
   from urllib3.util.retry import Retry

   # A Session keeps TCP connections open and reuses them across requests
   session = requests.Session()
   # Retry failed requests up to 3 times with exponential backoff
   retries = Retry(total=3, backoff_factor=1)
   session.mount('http://', HTTPAdapter(max_retries=retries))
   ```

## Quantization problems

### Symptom: `RuntimeError: Quantization format not supported`

**Solution**:
```bash
# Ensure correct quantization method
vllm serve MODEL --quantization awq   # For AWQ models
vllm serve MODEL --quantization gptq  # For GPTQ models

# Check model card for quantization type
```

### Symptom: Poor quality outputs after quantization

**Diagnostic**:

1. **Verify model is correctly quantized**:
   ```bash
   # Check model config.json for quantization_config
   # (the Hugging Face cache stores it under models--ORG--NAME/snapshots/<revision>/)
   cat ~/.cache/huggingface/hub/models--ORG--NAME/snapshots/*/config.json
   ```

2. **Try different quantization method**:
   ```bash
   # If AWQ quality issues, try FP8 (H100 only)
   vllm serve MODEL --quantization fp8

   # Or fall back to the unquantized model as a quality baseline
   vllm serve MODEL  # No quantization
   ```

3. **Increase temperature for better diversity**:
   ```python
   from vllm import SamplingParams

   sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
   ```
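
The check in step 1 can also be done programmatically. The sketch below reads `quantization_config` from a `config.json` file and returns its `quant_method` entry. That key name follows the usual Hugging Face convention and is an assumption here, as is the helper name. The demo writes a made-up config to a temporary directory so it runs without a downloaded model.

```python
import json
import tempfile
from pathlib import Path


def quant_method(config_path):
    """Return the quantization method recorded in config.json, or None if absent."""
    config = json.loads(Path(config_path).read_text())
    return (config.get("quantization_config") or {}).get("quant_method")


# Demo with made-up configs so the sketch runs without a downloaded model.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "config.json"
    path.write_text(json.dumps({"quantization_config": {"quant_method": "awq", "bits": 4}}))
    detected = quant_method(path)
    path.write_text(json.dumps({"model_type": "llama"}))
    plain = quant_method(path)

print(detected, plain)
```

A result of `None` for a checkpoint that is supposed to be quantized suggests the wrong repository or revision was downloaded.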

## Distributed serving issues

### Symptom: `RuntimeError: Distributed init failed`

**Diagnostic**:

1. **Check environment variables**:
   ```bash
   # On all nodes
   echo $MASTER_ADDR  # Should be the same on every node
   echo $MASTER_PORT  # Should be the same on every node
   echo $RANK         # Should be unique per process (0, 1, 2, ...)
   echo $WORLD_SIZE   # Should be the same everywhere (total number of processes)
   ```

2. **Check network connectivity**:
   ```bash
   # From node 1 to node 2
   ping NODE2_IP
   nc -zv NODE2_IP 29500  # Check port accessibility
   ```

3. **Check NCCL settings**:
   ```bash
   export NCCL_DEBUG=INFO
   export NCCL_SOCKET_IFNAME=eth0  # Or your network interface
   vllm serve MODEL --tensor-parallel-size 8
   ```
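
The environment-variable check in step 1 is easy to get wrong by eye across several nodes. As a sketch, a small validator like the one below could be run on each node before launching. `check_dist_env` is a hypothetical helper, and the rules it applies (all four variables set, numeric values, rank below world size) are basic sanity checks rather than anything vLLM itself enforces.

```python
import os

REQUIRED = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")


def check_dist_env(env):
    """Return a list of problems; an empty list means the basics look sane."""
    problems = [f"{name} is not set" for name in REQUIRED if not env.get(name)]
    if problems:
        return problems
    for name in ("MASTER_PORT", "RANK", "WORLD_SIZE"):
        if not env[name].isdigit():
            problems.append(f"{name}={env[name]!r} is not a non-negative integer")
    if not problems and int(env["RANK"]) >= int(env["WORLD_SIZE"]):
        problems.append("RANK must be smaller than WORLD_SIZE")
    return problems


# Made-up example environments.
good = {"MASTER_ADDR": "10.0.0.1", "MASTER_PORT": "29500", "RANK": "1", "WORLD_SIZE": "2"}
bad = {"MASTER_ADDR": "10.0.0.1", "MASTER_PORT": "29500", "RANK": "2", "WORLD_SIZE": "2"}
print(check_dist_env(good))
print(check_dist_env(bad))
print(check_dist_env(dict(os.environ)))  # Check the current shell's variables
```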

### Symptom: `NCCL error: unhandled cuda error`

**Solutions**:

```bash
# Set NCCL to use correct network interface
export NCCL_SOCKET_IFNAME=eth0  # Replace with your interface

# Increase timeout
export NCCL_TIMEOUT=1800  # 30 minutes

# Disable P2P to rule out peer-to-peer transport problems while debugging
export NCCL_P2P_DISABLE=1
```

## Debugging tools and commands

### Enable debug logging

```bash
export VLLM_LOGGING_LEVEL=DEBUG
vllm serve MODEL
```

### Monitor GPU usage

```bash
# Real-time GPU monitoring
watch -n 1 nvidia-smi

# Memory breakdown
nvidia-smi --query-gpu=memory.used,memory.free --format=csv -l 1
```

### Profile performance

```bash
# Built-in benchmarking
vllm bench throughput \
  --model MODEL \
  --input-len 128 \
  --output-len 256 \
  --num-prompts 100

vllm bench latency \
  --model MODEL \
  --input-len 128 \
  --output-len 256 \
  --batch-size 8
```

### Check metrics

```bash
# Prometheus metrics (served by the API server on its own port)
curl http://localhost:8000/metrics

# Filter for specific metrics
curl -s http://localhost:8000/metrics | grep "vllm:time_to_first_token"

# Key metrics to monitor (exact names can differ between versions):
# - vllm:time_to_first_token_seconds
# - vllm:time_per_output_token_seconds
# - vllm:num_requests_running
# - vllm:gpu_cache_usage_perc
# - vllm:request_success_total
```
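
For scripted checks, the metrics text can be parsed instead of grepped. The sketch below handles the simple `name{labels} value` lines of the Prometheus text format and assumes no trailing timestamps. `parse_metrics` is a hypothetical helper, and the sample text is made up; metric names on your server may differ.

```python
def parse_metrics(text, prefix="vllm"):
    """Collect samples whose series name starts with prefix, keyed by the full series."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # Skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        if series.startswith(prefix):
            samples[series] = float(value)
    return samples


# Made-up sample of a /metrics response.
sample = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="MODEL"} 3.0
vllm:gpu_cache_usage_perc{model_name="MODEL"} 0.42
process_cpu_seconds_total 12.5
"""

for series, value in parse_metrics(sample).items():
    print(series, value)
```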

### Test server health

```bash
# Health check
curl http://localhost:8000/health

# Model info
curl http://localhost:8000/v1/models

# Test completion
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MODEL",
    "prompt": "Hello",
    "max_tokens": 10
  }'
```
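
Model loading can take minutes, so scripts that start a server often need to wait for the health endpoint before sending requests. The following is a minimal polling sketch using only the standard library. The helper names and the default deadline are assumptions, and the demo uses a fake probe so it runs without a server.

```python
import time
import urllib.error
import urllib.request


def is_healthy(base_url, timeout=2.0):
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(base_url, deadline_s=60.0, interval_s=1.0, probe=is_healthy):
    """Poll the probe until it succeeds or the deadline passes."""
    give_up_at = time.monotonic() + deadline_s
    while time.monotonic() < give_up_at:
        if probe(base_url):
            return True
        time.sleep(interval_s)
    return False


# Demo with a fake probe that succeeds on the third attempt.
attempts = []


def flaky_probe(url):
    attempts.append(url)
    return len(attempts) >= 3


ready = wait_until_healthy("http://localhost:8000", deadline_s=5.0,
                           interval_s=0.01, probe=flaky_probe)
print(ready, len(attempts))
```

With a real server, calling `wait_until_healthy("http://localhost:8000", deadline_s=600)` uses the default HTTP probe.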

### Common environment variables

```bash
# CUDA settings
export CUDA_VISIBLE_DEVICES=0,1,2,3  # Limit to specific GPUs

# vLLM settings
export VLLM_LOGGING_LEVEL=DEBUG
export VLLM_TRACE_FUNCTION=1  # Trace function calls (very verbose; debugging only)
export VLLM_USE_V1=1          # Select the V1 engine on releases where it is opt-in

# NCCL settings (distributed)
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=0  # 0 leaves InfiniBand enabled; 1 disables it
```

### Collect diagnostic info for bug reports

```bash
# System info
nvidia-smi
python --version
pip show vllm

# vLLM version and config
vllm --version
python -c "import vllm; print(vllm.__version__)"

# Run with debug logging
export VLLM_LOGGING_LEVEL=DEBUG
vllm serve MODEL 2>&1 | tee vllm_debug.log

# Include in bug report:
# - vllm_debug.log
# - nvidia-smi output
# - Full command used
# - Expected vs actual behavior
```
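
The system-info commands above can be gathered into a single report with a short script. This is a sketch using only the standard library. The helper names are hypothetical, and missing tools are recorded as a note rather than raising, so the script also runs on machines without a GPU.

```python
import shutil
import subprocess
import sys


def run_or_note(command):
    """Run a command and return its combined output, or a note if it cannot run."""
    if shutil.which(command[0]) is None:
        return f"[{command[0]} not found on PATH]"
    try:
        result = subprocess.run(command, capture_output=True, text=True, timeout=60)
    except subprocess.TimeoutExpired:
        return f"[{command[0]} timed out]"
    return (result.stdout + result.stderr).strip()


def collect_report(commands):
    """Build one text report with a header line per command."""
    sections = []
    for command in commands:
        sections.append(f"$ {' '.join(command)}\n{run_or_note(command)}")
    return "\n\n".join(sections)


COMMANDS = [
    ["nvidia-smi"],
    [sys.executable, "--version"],
    [sys.executable, "-m", "pip", "show", "vllm"],
]

report = collect_report(COMMANDS)
print(report)
# To save it next to vllm_debug.log: open("vllm_report.txt", "w").write(report)
```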