# Troubleshooting Guide

## Contents
- Out of memory (OOM) errors
- Performance issues
- Model loading errors
- Network and connection issues
- Quantization problems
- Distributed serving issues
- Debugging tools and commands

## Out of memory (OOM) errors

### Symptom: `torch.cuda.OutOfMemoryError` during model loading

**Cause**: Model + KV cache exceeds available VRAM

**Solutions (try in order)**:

1. **Reduce GPU memory utilization**:
```bash
vllm serve MODEL --gpu-memory-utilization 0.7  # Try 0.7, 0.75, 0.8
```

2. **Reduce max sequence length**:
```bash
vllm serve MODEL --max-model-len 4096  # Instead of 8192
```

3. **Enable quantization**:
```bash
vllm serve MODEL --quantization awq  # 4-bit weights, roughly 4x smaller than FP16; MODEL must be an AWQ checkpoint
```

4. **Use tensor parallelism** (multiple GPUs):
```bash
vllm serve MODEL --tensor-parallel-size 2  # Split across 2 GPUs
```

5. **Reduce max concurrent sequences**:
```bash
vllm serve MODEL --max-num-seqs 128  # Default is 256
```
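
To judge how much `--max-model-len` and `--max-num-seqs` matter for a given model, it can help to estimate the KV cache footprint by hand. The sketch below uses the usual per-token formula (one key and one value vector per layer); the helper names are hypothetical and the model shape is an assumed Llama-2-7B-style configuration in FP16, so substitute the values from your model's `config.json`.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_gib(max_model_len: int, max_num_seqs: int, **model_shape) -> float:
    """Worst-case KV cache size in GiB if every sequence reaches max_model_len."""
    total = kv_cache_bytes_per_token(**model_shape) * max_model_len * max_num_seqs
    return total / 2**30


# Assumed shape (not read from a real checkpoint): 32 layers, 32 KV heads, head_dim 128, FP16.
shape = dict(num_layers=32, num_kv_heads=32, head_dim=128, dtype_bytes=2)

print(kv_cache_bytes_per_token(**shape))  # 524288 bytes, i.e. 0.5 MiB per token
print(kv_cache_gib(4096, 1, **shape))     # 2.0 GiB for one full-length sequence
print(kv_cache_gib(8192, 1, **shape))     # 4.0 GiB when the context is doubled
```

This is a worst-case bound rather than what vLLM actually allocates: vLLM sizes its KV cache pool from `--gpu-memory-utilization`, so the estimate mainly shows how many full-length sequences that pool can hold.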

### Symptom: OOM during inference (not model loading)

**Cause**: KV cache fills up during generation

**Solutions**:

```bash
# Shrink vLLM's memory pool (default 0.9) to leave headroom; this also shrinks the KV cache
vllm serve MODEL --gpu-memory-utilization 0.85

# Reduce batch size
vllm serve MODEL --max-num-seqs 64

# Reduce max tokens per request
# Set in client request: max_tokens=512
```
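
The `max_tokens` cap is set per request, not on the server. As a minimal sketch using only the standard library, the hypothetical helper below builds a request against the OpenAI-compatible `/v1/completions` endpoint with a default cap of 512 tokens; the base URL and `MODEL` are placeholders.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 512) -> urllib.request.Request:
    """Build a POST to the OpenAI-compatible /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_completion_request("http://localhost:8000", "MODEL", "Hello")
print(req.full_url, json.loads(req.data))

# Sending the request needs a running server:
# with urllib.request.urlopen(req, timeout=60) as resp:
#     print(json.load(resp))
```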

### Symptom: OOM with quantized model

**Cause**: Quantization overhead or incorrect configuration

**Solution**:
```bash
# Ensure quantization flag matches model
vllm serve TheBloke/Llama-2-70B-AWQ --quantization awq  # Must specify

# Try different dtype
vllm serve MODEL --quantization awq --dtype float16
```

## Performance issues

### Symptom: Low throughput (e.g., <50 req/sec when >100 is expected)

**Diagnostic steps**:

1. **Check GPU utilization**:
```bash
watch -n 1 nvidia-smi
# GPU utilization should be >80%
```

If <80%, increase concurrent requests:
```bash
vllm serve MODEL --max-num-seqs 512  # Increase from 256
```

2. **Check if memory-bound**:
```bash
# If memory at 100% but GPU <80%, reduce sequence length
vllm serve MODEL --max-model-len 4096
```

3. **Enable optimizations**:
```bash
vllm serve MODEL \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-num-seqs 512
```

4. **Check tensor parallelism settings**:
```bash
# Tensor parallel size must evenly divide the model's attention head count
vllm serve MODEL --tensor-parallel-size 4  # Usually 1, 2, 4, or 8; sizes like 3 rarely divide evenly
```
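
When picking a tensor parallel size, a quick check is which sizes divide the model's attention head count evenly. The sketch below assumes that divisibility is the binding constraint (other dimensions of the model may impose further limits); the function name is hypothetical and the head counts are example values, so read `num_attention_heads` from your model's `config.json`.

```python
def valid_tensor_parallel_sizes(num_attention_heads: int, num_gpus: int) -> list[int]:
    """Tensor parallel sizes up to num_gpus that split the heads evenly."""
    return [size for size in range(1, num_gpus + 1)
            if num_attention_heads % size == 0]


print(valid_tensor_parallel_sizes(32, 8))  # [1, 2, 4, 8]
print(valid_tensor_parallel_sizes(40, 8))  # [1, 2, 4, 5, 8]
```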

### Symptom: High TTFT (time to first token >1 second)

**Causes and solutions**:

**Long prompts**:
```bash
vllm serve MODEL --enable-chunked-prefill
```

**No prefix caching**:
```bash
vllm serve MODEL --enable-prefix-caching  # For repeated prompts
```

**Too many concurrent requests**:
```bash
vllm serve MODEL --max-num-seqs 64  # Reduce to prioritize latency
```

**Model too large for single GPU**:
```bash
vllm serve MODEL --tensor-parallel-size 2  # Parallelize prefill
```
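
To confirm that one of these changes actually lowers TTFT, time the first chunk of a streamed response before and after. The sketch below works on any iterable of chunks, such as the object an OpenAI-compatible client returns when streaming is enabled; `measure_ttft` and `fake_stream` are hypothetical names, and the fake stream only stands in for a real server.

```python
import time
from typing import Iterable, Tuple


def measure_ttft(stream: Iterable) -> Tuple[float, float, int]:
    """Return (seconds to first chunk, total seconds, chunk count) for a stream."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in stream:
        if first is None:
            first = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    # An empty stream has no first chunk; report the total wait instead.
    return (first if first is not None else total), total, count


def fake_stream():
    """Stand-in for a streamed response: slow first chunk, then fast ones."""
    time.sleep(0.05)
    yield "Hello"
    for word in (" world", "!"):
        time.sleep(0.01)
        yield word


ttft, total, n = measure_ttft(fake_stream())
print(f"TTFT={ttft:.3f}s total={total:.3f}s chunks={n}")
```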

### Symptom: Slow token generation (low tokens/sec)

**Diagnostic**:
```bash
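# Confirm the intended model and dtype were loaded (startup logs report weight memory)
vllm serve MODEL

# Try speculative decoding with a smaller draft model
# (flag names vary by version; newer releases take --speculative-config instead)
vllm serve MODEL --speculative-model DRAFT_MODEL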
```

**For H100 GPUs**, enable FP8:
```bash
vllm serve MODEL --quantization fp8
```

## Model loading errors

### Symptom: `OSError: MODEL not found`

**Causes**:

1. **Model name typo**:
```bash
# Copy the exact repository ID from the Hugging Face model page
vllm serve meta-llama/Meta-Llama-3-8B-Instruct  # IDs are case-sensitive
```

2. **Private/gated model**:
```bash
# Login to HuggingFace first
huggingface-cli login
# Then run vLLM
vllm serve meta-llama/Meta-Llama-3-70B-Instruct
```

3. **Custom model needs trust flag**:
```bash
vllm serve MODEL --trust-remote-code
```

### Symptom: `ValueError: Tokenizer not found`

**Solution**:
```bash
# Download the tokenizer manually first (replace MODEL with the real repository ID)
python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('MODEL')"

# Then launch vLLM
vllm serve MODEL
```

### Symptom: `ImportError: No module named 'flash_attn'`

**Solution**:
```bash
# Install flash attention
pip install flash-attn --no-build-isolation

# Or select a different attention backend (valid values depend on your vLLM version)
VLLM_ATTENTION_BACKEND=XFORMERS vllm serve MODEL
```

## Network and connection issues

### Symptom: `Connection refused` when querying server

**Diagnostic**:

1. **Check server is running**:
```bash
curl http://localhost:8000/health
```

2. **Check port binding**:
```bash
# Bind to all interfaces for remote access
vllm serve MODEL --host 0.0.0.0 --port 8000

# Check if port is in use
lsof -i :8000
```

3. **Check firewall**:
```bash
# Allow port through firewall
sudo ufw allow 8000
```
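
The same port check can be run from the client machine in Python, which helps separate "server not listening" from "blocked in between". This is a minimal sketch with a hypothetical helper name; `localhost` and port 8000 are the placeholders used elsewhere in this guide.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


print(port_is_open("localhost", 8000))
```

If this prints `False` on the server host itself, the server is not listening on that port; if it is `True` locally but `False` remotely, look at `--host` binding and the firewall.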

### Symptom: Slow response times over network

**Solutions**:

1. **Increase timeout**:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
    timeout=300.0  # 5 minute timeout
)
```

2. **Check network latency**:
```bash
ping SERVER_IP  # Should be <10ms for local network
```

3. **Use connection pooling**:
```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(total=3, backoff_factor=1)
session.mount('http://', HTTPAdapter(max_retries=retries))
```

## Quantization problems

### Symptom: `RuntimeError: Quantization format not supported`

**Solution**:
```bash
# Ensure correct quantization method
vllm serve MODEL --quantization awq  # For AWQ models
vllm serve MODEL --quantization gptq  # For GPTQ models

# Check model card for quantization type
```
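
Besides the model card, the quantization type is usually recorded in the checkpoint's `config.json` under `quantization_config`. The sketch below assumes that layout with a `quant_method` field, which is common for Hugging Face AWQ and GPTQ checkpoints but not guaranteed; the function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Optional


def detect_quant_method(config: dict) -> Optional[str]:
    """Return the quant_method recorded in a Hugging Face config, or None."""
    quant_cfg = config.get("quantization_config") or {}
    return quant_cfg.get("quant_method")


def detect_from_file(config_path: str) -> Optional[str]:
    """Load a local config.json and report its quantization method."""
    return detect_quant_method(json.loads(Path(config_path).read_text()))


# Inline examples; real values come from the checkpoint's config.json.
print(detect_quant_method({"quantization_config": {"quant_method": "awq", "bits": 4}}))
print(detect_quant_method({"model_type": "llama"}))  # unquantized: None
```

The returned string is the value to pass to `--quantization`; `None` suggests the checkpoint is not pre-quantized.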

### Symptom: Poor quality outputs after quantization

**Diagnostic**:

1. **Verify model is correctly quantized**:
```bash
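# Check the model's config.json for quantization_config
# (the Hugging Face cache stores it under models--ORG--NAME/snapshots/<revision>/)
cat ~/.cache/huggingface/hub/models--ORG--NAME/snapshots/*/config.json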
```

2. **Try different quantization method**:
```bash
# If AWQ output quality is poor, try FP8 (needs FP8-capable hardware such as H100)
vllm serve MODEL --quantization fp8

# Or fall back to the unquantized checkpoint
vllm serve MODEL  # No quantization
```

3. **Increase temperature for better diversity**:
```python
from vllm import SamplingParams
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
```

## Distributed serving issues

### Symptom: `RuntimeError: Distributed init failed`

**Diagnostic**:

1. **Check environment variables**:
```bash
# On all nodes
echo $MASTER_ADDR  # Must be identical on every node
echo $MASTER_PORT  # Must be identical on every node
echo $RANK  # Must be unique per process (0, 1, 2, ...)
echo $WORLD_SIZE  # Must be identical everywhere (total number of processes)
```

2. **Check network connectivity**:
```bash
# From node 1 to node 2
ping NODE2_IP
nc -zv NODE2_IP 29500  # Check port accessibility
```

3. **Check NCCL settings**:
```bash
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=eth0  # Or your network interface
vllm serve MODEL --tensor-parallel-size 8
```
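
The environment-variable check from step 1 can be scripted so it runs identically on every node. The sketch below only validates what is visible locally (presence, integer format, rank within range); comparing values across nodes still has to be done by hand. The function name is hypothetical.

```python
import os
from typing import List, Mapping

REQUIRED = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")


def check_dist_env(env: Mapping[str, str]) -> List[str]:
    """Return a list of problems found in the distributed environment variables."""
    problems = [f"{name} is not set" for name in REQUIRED if not env.get(name)]
    if problems:
        return problems
    for name in ("MASTER_PORT", "RANK", "WORLD_SIZE"):
        if not env[name].isdigit():
            problems.append(f"{name} must be a non-negative integer, got {env[name]!r}")
    if not problems and int(env["RANK"]) >= int(env["WORLD_SIZE"]):
        problems.append("RANK must be smaller than WORLD_SIZE")
    return problems


for problem in check_dist_env(os.environ) or ["environment looks consistent"]:
    print(problem)
```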

### Symptom: `NCCL error: unhandled cuda error`

**Solutions**:

```bash
# Set NCCL to use correct network interface
export NCCL_SOCKET_IFNAME=eth0  # Replace with your interface

# Increase timeout
export NCCL_TIMEOUT=1800  # 30 minutes

# Disable GPU peer-to-peer transport to rule out P2P problems (debugging only; slower)
export NCCL_P2P_DISABLE=1
```

## Debugging tools and commands

### Enable debug logging

```bash
export VLLM_LOGGING_LEVEL=DEBUG
vllm serve MODEL
```

### Monitor GPU usage

```bash
# Real-time GPU monitoring
watch -n 1 nvidia-smi

# Memory breakdown
nvidia-smi --query-gpu=memory.used,memory.free --format=csv -l 1
```
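
For logging or alerting, the same memory query can be parsed in Python. The sketch below assumes the `csv,noheader,nounits` output format, where each GPU is one `used, free` row in MiB; the sample rows are made up for illustration, and `query_gpus` needs an NVIDIA driver to run. Function names are hypothetical.

```python
import subprocess
from typing import Dict, List

QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.free",
         "--format=csv,noheader,nounits"]


def parse_memory_csv(output: str) -> List[Dict[str, float]]:
    """Parse 'used, free' rows (MiB) into per-GPU dicts with a used fraction."""
    gpus = []
    for line in output.strip().splitlines():
        used, free = (float(field) for field in line.split(","))
        # used + free approximates total memory (driver-reserved memory is ignored).
        gpus.append({"used_mib": used, "free_mib": free,
                     "used_frac": used / (used + free)})
    return gpus


def query_gpus() -> List[Dict[str, float]]:
    """Run nvidia-smi and parse its output; raises if nvidia-smi is unavailable."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory_csv(out)


sample = "20480, 61440\n73728, 8192\n"  # illustrative rows, not real measurements
for index, gpu in enumerate(parse_memory_csv(sample)):
    print(f"GPU {index}: {gpu['used_frac']:.0%} used")
```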

### Profile performance

```bash
# Built-in benchmarking
vllm bench throughput \
  --model MODEL \
  --input-len 128 \
  --output-len 256 \
  --num-prompts 100

vllm bench latency \
  --model MODEL \
  --input-len 128 \
  --output-len 256 \
  --batch-size 8
```

### Check metrics

```bash
# Prometheus metrics are served by the vLLM API server itself, on the API port
curl http://localhost:8000/metrics

# Filter for specific metrics
curl -s http://localhost:8000/metrics | grep vllm:time_to_first_token

# Key metrics to monitor (names use a "vllm:" prefix; exact names vary by version):
# - vllm:time_to_first_token_seconds
# - vllm:time_per_output_token_seconds
# - vllm:num_requests_running
# - vllm:gpu_cache_usage_perc
# - vllm:request_success_total
```
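
When `grep` is not enough, the metrics text can be parsed into a dictionary for scripted checks. This is a minimal sketch of a Prometheus text-format reader that assumes sample lines carry no trailing timestamp; the sample text is made up for illustration, and the function name is hypothetical.

```python
from typing import Dict


def parse_prometheus(text: str, prefix: str = "vllm") -> Dict[str, float]:
    """Map each sample line (name plus labels) starting with prefix to its value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, HELP/TYPE comments, and metrics from other exporters.
        if not line or line.startswith("#") or not line.startswith(prefix):
            continue
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


# Illustrative payload, shaped like the output of the /metrics endpoint.
sample = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="MODEL"} 3.0
vllm:gpu_cache_usage_perc{model_name="MODEL"} 0.42
"""

for name, value in parse_prometheus(sample).items():
    print(name, value)
```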

### Test server health

```bash
# Health check
curl http://localhost:8000/health

# Model info
curl http://localhost:8000/v1/models

# Test completion
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MODEL",
    "prompt": "Hello",
    "max_tokens": 10
  }'
```

### Common environment variables

```bash
# CUDA settings
export CUDA_VISIBLE_DEVICES=0,1,2,3  # Limit to specific GPUs

# vLLM settings
export VLLM_LOGGING_LEVEL=DEBUG
export VLLM_TRACE_FUNCTION=1  # Log every function call (very verbose; for debugging hangs)
export VLLM_USE_V1=1  # Select the V1 engine (already the default in recent releases)

# NCCL settings (distributed)
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=0  # Enable InfiniBand
```

### Collect diagnostic info for bug reports

```bash
# System info
nvidia-smi
python --version
pip show vllm

# vLLM version and config
vllm --version
python -c "import vllm; print(vllm.__version__)"

# Run with debug logging
export VLLM_LOGGING_LEVEL=DEBUG
vllm serve MODEL 2>&1 | tee vllm_debug.log

# Include in bug report:
# - vllm_debug.log
# - nvidia-smi output
# - Full command used
# - Expected vs actual behavior
```