Troubleshooting Guide
Contents
- Out of memory (OOM) errors
- Performance issues
- Model loading errors
- Network and connection issues
- Quantization problems
- Distributed serving issues
- Debugging tools and commands
Out of memory (OOM) errors
Symptom: torch.cuda.OutOfMemoryError during model loading
Cause: Model + KV cache exceeds available VRAM
Solutions (try in order):
- Reduce GPU memory utilization:
vllm serve MODEL --gpu-memory-utilization 0.7 # Try 0.7, 0.75, 0.8
- Reduce max sequence length:
vllm serve MODEL --max-model-len 4096 # Instead of 8192
- Enable quantization:
vllm serve MODEL --quantization awq # ~4x memory reduction vs FP16
- Use tensor parallelism (multiple GPUs):
vllm serve MODEL --tensor-parallel-size 2 # Split across 2 GPUs
- Reduce max concurrent sequences:
vllm serve MODEL --max-num-seqs 128 # Default is 256
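The flags above all trade capacity for memory. As a rough sanity check, the KV cache cost per token can be estimated from the model config. A sketch; the 8B-class numbers below are assumptions, read the real ones from your model's config.json:

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # 2x for keys and values, per layer, per KV head
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Assumed Llama-3-8B-like config: 32 layers, 8 KV heads (GQA), head_dim 128, fp16
per_token = kv_cache_bytes_per_token(32, 8, 128)
per_seq_mib = per_token * 4096 / 2**20  # one full 4096-token sequence
print(per_token, round(per_seq_mib))    # prints: 131072 512
```

At ~512 MiB of cache per max-length sequence, it is clear why halving --max-model-len or --max-num-seqs has such a large effect.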
Symptom: OOM during inference (not model loading)
Cause: KV cache fills up during generation
Solutions:
# Leave headroom by lowering GPU memory utilization (default is 0.9)
vllm serve MODEL --gpu-memory-utilization 0.85
# Reduce batch size
vllm serve MODEL --max-num-seqs 64
# Reduce max tokens per request
# Set in client request: max_tokens=512
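The max_tokens cap is purely client-side. A sketch of the request body as it would be POSTed to /v1/completions; the model name and prompt are placeholders:

```python
import json

payload = {
    "model": "MODEL",
    "prompt": "Hello",
    "max_tokens": 512,  # hard cap on generated tokens for this request,
                        # which also bounds how far its KV cache can grow
}
body = json.dumps(payload)
# POST `body` to http://localhost:8000/v1/completions with
# Content-Type: application/json (via curl, requests, or the OpenAI client)
```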
Symptom: OOM with quantized model
Cause: Quantization overhead or incorrect configuration
Solution:
# Ensure quantization flag matches model
vllm serve TheBloke/Llama-2-70B-AWQ --quantization awq # Must specify
# Try different dtype
vllm serve MODEL --quantization awq --dtype float16
Performance issues
Symptom: Low throughput (e.g. <50 req/s where >100 is expected)
Diagnostic steps:
- Check GPU utilization:
watch -n 1 nvidia-smi
# GPU utilization should be >80%
If <80%, increase concurrent requests:
vllm serve MODEL --max-num-seqs 512 # Increase from 256
- Check if memory-bound:
# If memory at 100% but GPU <80%, reduce sequence length
vllm serve MODEL --max-model-len 4096
- Enable optimizations:
vllm serve MODEL \
--enable-prefix-caching \
--enable-chunked-prefill \
--max-num-seqs 512
- Check tensor parallelism settings:
# Tensor-parallel size must evenly divide the model's attention head count (powers of 2 are safest)
vllm serve MODEL --tensor-parallel-size 4 # Not 3 or 5
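The GPU and memory checks in steps 1 and 2 can be scripted. A sketch that parses one line of `nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits` output and applies the rules of thumb above; the 80% and 90% thresholds are assumptions:

```python
def diagnose(csv_line):
    """Return a tuning hint from one nvidia-smi CSV line: 'util, used, total'."""
    util, used, total = (float(x) for x in csv_line.split(","))
    if util < 80 and used / total < 0.9:
        return "underutilized: raise --max-num-seqs"
    if util < 80:
        return "memory-bound: lower --max-model-len"
    return "compute-bound: GPU is busy"

print(diagnose("45, 20000, 81920"))  # idle GPU, plenty of memory free
print(diagnose("60, 80000, 81920"))  # memory nearly full, GPU waiting
```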
Symptom: High TTFT (time to first token >1 second)
Causes and solutions:
Long prompts:
vllm serve MODEL --enable-chunked-prefill
No prefix caching:
vllm serve MODEL --enable-prefix-caching # For repeated prompts
Too many concurrent requests:
vllm serve MODEL --max-num-seqs 64 # Reduce to prioritize latency
Model too large for single GPU:
vllm serve MODEL --tensor-parallel-size 2 # Parallelize prefill
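TTFT can be measured directly from a streaming response. A sketch that times the first chunk of any iterator, so it works with the OpenAI client's stream=True chunks; the fake stream below stands in for a real server:

```python
import time

def time_to_first_token(stream):
    """Return (ttft_seconds, list_of_chunks) for any chunk iterator."""
    start = time.perf_counter()
    ttft = None
    chunks = []
    for chunk in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first token arrived
        chunks.append(chunk)
    return ttft, chunks

def fake_stream():
    time.sleep(0.05)  # simulated prefill delay
    yield "Hello"
    yield " world"

ttft, chunks = time_to_first_token(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, {len(chunks)} chunks")
```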
Symptom: Slow token generation (low tokens/sec)
Diagnostic:
# Confirm the loaded model size and dtype in the startup logs
vllm serve MODEL
# Try speculative decoding with a small draft model
vllm serve MODEL --speculative-model DRAFT_MODEL
For H100 GPUs, enable FP8:
vllm serve MODEL --quantization fp8
Model loading errors
Symptom: OSError: MODEL not found
Causes:
- Model name typo:
# Check exact model name on HuggingFace
vllm serve meta-llama/Meta-Llama-3-8B-Instruct # Exact repo name, including capitalization
- Private/gated model:
# Login to HuggingFace first
huggingface-cli login
# Then run vLLM
vllm serve meta-llama/Meta-Llama-3-70B-Instruct
- Custom model needs trust flag:
vllm serve MODEL --trust-remote-code
Symptom: ValueError: Tokenizer not found
Solution:
# Download model manually first
python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('MODEL')"
# Then launch vLLM
vllm serve MODEL
Symptom: ImportError: No module named 'flash_attn'
Solution:
# Install flash attention
pip install flash-attn --no-build-isolation
# Or fall back to a different attention backend
VLLM_ATTENTION_BACKEND=XFORMERS vllm serve MODEL
Network and connection issues
Symptom: Connection refused when querying server
Diagnostic:
- Check server is running:
curl http://localhost:8000/health
- Check port binding:
# Bind to all interfaces for remote access
vllm serve MODEL --host 0.0.0.0 --port 8000
# Check if port is in use
lsof -i :8000
- Check firewall:
# Allow port through firewall
sudo ufw allow 8000
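The checks above can be condensed into one TCP probe, stdlib only; host and port are whatever you passed to --host/--port:

```python
import socket

def port_reachable(host, port, timeout=2.0):
    """True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. port_reachable("localhost", 8000) before blaming the client code
```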
Symptom: Slow response times over network
Solutions:
- Increase timeout:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="EMPTY",
timeout=300.0 # 5 minute timeout
)
- Check network latency:
ping SERVER_IP # Should be <10ms for local network
- Use connection pooling with retries:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
session = requests.Session() # Reuses TCP connections across requests
retries = Retry(total=3, backoff_factor=1, status_forcelist=[502, 503, 504])
session.mount('http://', HTTPAdapter(max_retries=retries))
# Use session.post(...) instead of requests.post(...)
Quantization problems
Symptom: RuntimeError: Quantization format not supported
Solution:
# Ensure correct quantization method
vllm serve MODEL --quantization awq # For AWQ models
vllm serve MODEL --quantization gptq # For GPTQ models
# Check model card for quantization type
Symptom: Poor quality outputs after quantization
Diagnostic:
- Verify model is correctly quantized:
# Check the model's config.json for a quantization_config section
cat ~/.cache/huggingface/hub/models--ORG--MODEL/snapshots/*/config.json
- Try different quantization method:
# If AWQ quality issues, try FP8 (H100 only)
vllm serve MODEL --quantization fp8
# Or use less aggressive quantization
vllm serve MODEL # No quantization
- Increase temperature for better diversity:
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
Distributed serving issues
Symptom: RuntimeError: Distributed init failed
Diagnostic:
- Check environment variables:
# On all nodes
echo $MASTER_ADDR # Should be same
echo $MASTER_PORT # Should be same
echo $RANK # Should be unique per node (0, 1, 2, ...)
echo $WORLD_SIZE # Should be same (total nodes)
- Check network connectivity:
# From node 1 to node 2
ping NODE2_IP
nc -zv NODE2_IP 29500 # Check port accessibility
- Check NCCL settings:
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=eth0 # Or your network interface
vllm serve MODEL --tensor-parallel-size 8
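The env-var check in step 1 is easy to get wrong across many nodes. A sketch that validates the four variables as a plain dict (run against dict(os.environ) on each node); the rules mirror the comments above:

```python
def check_dist_env(env):
    """Return a list of problems with the four distributed-init variables."""
    problems = []
    for key in ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE"):
        if not env.get(key):
            problems.append(f"{key} is unset")
    if not problems:
        if not (env["RANK"].isdigit() and env["WORLD_SIZE"].isdigit()):
            problems.append("RANK/WORLD_SIZE must be integers")
        elif int(env["RANK"]) >= int(env["WORLD_SIZE"]):
            problems.append("RANK must be < WORLD_SIZE")
    return problems

# Expected setup for node 1 of a 2-node cluster (values are examples)
ok = {"MASTER_ADDR": "10.0.0.1", "MASTER_PORT": "29500",
      "RANK": "1", "WORLD_SIZE": "2"}
print(check_dist_env(ok))                    # []
print(check_dist_env({**ok, "RANK": "2"}))   # rank out of range
```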
Symptom: NCCL error: unhandled cuda error
Solutions:
# Set NCCL to use correct network interface
export NCCL_SOCKET_IFNAME=eth0 # Replace with your interface
# Increase timeout
export NCCL_TIMEOUT=1800 # 30 minutes
# Disable P2P transfers to work around some driver/CUDA errors
export NCCL_P2P_DISABLE=1
Debugging tools and commands
Enable debug logging
export VLLM_LOGGING_LEVEL=DEBUG
vllm serve MODEL
Monitor GPU usage
# Real-time GPU monitoring
watch -n 1 nvidia-smi
# Memory breakdown
nvidia-smi --query-gpu=memory.used,memory.free --format=csv -l 1
Profile performance
# Built-in benchmarking
vllm bench throughput \
--model MODEL \
--input-len 128 \
--output-len 256 \
--num-prompts 100
vllm bench latency \
--model MODEL \
--input-len 128 \
--output-len 256 \
--batch-size 8
Check metrics
# Prometheus metrics are exposed by the vLLM server itself
curl http://localhost:8000/metrics
# Filter for specific metrics
curl http://localhost:8000/metrics | grep vllm_time_to_first_token
# Key metrics to monitor:
# - vllm_time_to_first_token_seconds
# - vllm_time_per_output_token_seconds
# - vllm_num_requests_running
# - vllm_gpu_cache_usage_perc
# - vllm_request_success_total
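When grep isn't enough, the Prometheus text format parses easily. A sketch that extracts unlabeled samples from a /metrics response body; the sample text below is made up:

```python
def parse_metrics(text):
    """Map metric name -> float value; skips comments and labeled series."""
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "{" in line:
            continue  # ignore HELP/TYPE lines and labeled samples
        name, _, value = line.rpartition(" ")
        try:
            out[name] = float(value)
        except ValueError:
            pass  # skip anything that is not 'name value'
    return out

sample = """# TYPE vllm_num_requests_running gauge
vllm_num_requests_running 3.0
vllm_gpu_cache_usage_perc 0.42"""
print(parse_metrics(sample))
```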
Test server health
# Health check
curl http://localhost:8000/health
# Model info
curl http://localhost:8000/v1/models
# Test completion
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MODEL",
"prompt": "Hello",
"max_tokens": 10
}'
Common environment variables
# CUDA settings
export CUDA_VISIBLE_DEVICES=0,1,2,3 # Limit to specific GPUs
# vLLM settings
export VLLM_LOGGING_LEVEL=DEBUG
export VLLM_TRACE_FUNCTION=1 # Trace every function call (very slow; debugging only)
export VLLM_USE_V1=1 # Use the V1 engine (faster)
# NCCL settings (distributed)
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=0 # Enable InfiniBand
Collect diagnostic info for bug reports
# System info
nvidia-smi
python --version
pip show vllm
# vLLM version and config
vllm --version
python -c "import vllm; print(vllm.__version__)"
# Run with debug logging
export VLLM_LOGGING_LEVEL=DEBUG
vllm serve MODEL 2>&1 | tee vllm_debug.log
# Include in bug report:
# - vllm_debug.log
# - nvidia-smi output
# - Full command used
# - Expected vs actual behavior
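The collection steps can be scripted. A stdlib-only sketch that degrades gracefully when nvidia-smi or vllm is not on PATH:

```python
import json
import platform
import shutil
import subprocess
import sys

def collect_diagnostics():
    """Gather the bug-report info listed above into one dict."""
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "command": " ".join(sys.argv),  # the full command used
    }
    # Optional tools: capture output when installed, note absence otherwise
    for name, cmd in {"nvidia_smi": ["nvidia-smi"],
                      "vllm_version": ["vllm", "--version"]}.items():
        if shutil.which(cmd[0]):
            info[name] = subprocess.run(cmd, capture_output=True,
                                        text=True).stdout.strip()
        else:
            info[name] = "not installed"
    return info

print(json.dumps(collect_diagnostics(), indent=2))
```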