# Quantization Guide

## Contents
- Quantization methods comparison
- AWQ setup and usage
- GPTQ setup and usage
- FP8 quantization (H100)
- Model preparation
- Accuracy vs compression trade-offs

## Quantization methods comparison

| Method | Compression | Accuracy Loss | Speed | Best For |
|--------|-------------|---------------|-------|----------|
| **AWQ** | 4-bit (75%) | <1% | Fast | 70B models, production |
| **GPTQ** | 4-bit (75%) | 1-2% | Fast | Wide model support |
| **FP8** | 8-bit (50%) | <0.5% | Fastest | H100 GPUs only |
| **SqueezeLLM** | 3-4 bit (75-80%) | 2-3% | Medium | Extreme compression |

**Recommendation**:
- **Production**: Use AWQ for 70B models
- **H100 GPUs**: Use FP8 for best speed
- **Maximum compatibility**: Use GPTQ
- **Extreme compression**: Use SqueezeLLM

## AWQ setup and usage

**AWQ** (Activation-aware Weight Quantization) achieves the best accuracy among the 4-bit methods compared above.

**Step 1: Find pre-quantized model**

Search HuggingFace for AWQ models:
```bash
# Example: TheBloke/Llama-2-70B-AWQ
# Example: TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ
```

**Step 2: Launch with AWQ**

```bash
vllm serve TheBloke/Llama-2-70B-AWQ \
  --quantization awq \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.95
```

**Memory savings** (weights only; the KV cache and activations need additional VRAM):
```
Llama 2 70B fp16: 140GB VRAM (4x A100 needed)
Llama 2 70B AWQ: 35GB VRAM (1x A100 40GB)
= 4x memory reduction
```
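
The figures above follow from parameter count times bits per weight. A minimal sketch of that arithmetic, assuming 1 GB = 10^9 bytes and counting weights only (no KV cache, activations, or quantization metadata such as scales and zero points). `weight_memory_gb` is a hypothetical helper name, not part of vLLM:

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring scales and zero points."""
    return num_params * bits_per_weight / 8 / 1e9


fp16_gb = weight_memory_gb(70e9, 16)
awq_gb = weight_memory_gb(70e9, 4)
print(f"fp16: {fp16_gb:.0f} GB, 4-bit: {awq_gb:.0f} GB, reduction: {fp16_gb / awq_gb:.0f}x")
```

Real checkpoints come out slightly larger than this estimate because group-wise scales and any unquantized layers are stored alongside the 4-bit weights.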

**Step 3: Verify performance**

Test that outputs are acceptable:
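```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; any placeholder key works
# unless the server was started with --api-key
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Test complex reasoning. Llama-2-70B-AWQ is a base (non-chat) model, so use the
# plain completions endpoint, which does not require a chat template.
response = client.completions.create(
    model="TheBloke/Llama-2-70B-AWQ",
    prompt="Explain quantum entanglement in simple terms:",
    max_tokens=256,
)

print(response.choices[0].text)
# Verify quality matches your requirements
```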

**Quantize your own model** (requires GPU with 80GB+ VRAM):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-70b-hf"
quant_path = "llama-2-70b-awq"

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantize
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4}
model.quantize(tokenizer, quant_config=quant_config)

# Save
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

## GPTQ setup and usage

**GPTQ** has the widest model support and good compression.

**Step 1: Find GPTQ model**

```bash
# Example: TheBloke/Llama-2-13B-GPTQ
# Example: TheBloke/CodeLlama-34B-GPTQ
```

**Step 2: Launch with GPTQ**

```bash
vllm serve TheBloke/Llama-2-13B-GPTQ \
  --quantization gptq \
  --dtype float16
```

**GPTQ configuration options**:
```bash
# GPTQ parameters (bits, group size, activation ordering) are read from the
# checkpoint's quantization config, so no extra flags are needed at serve time
vllm serve MODEL \
  --quantization gptq \
  --dtype float16
```
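
To see which GPTQ settings a downloaded checkpoint was built with, one option is to read the `quantize_config.json` file that AutoGPTQ writes next to the weights. The sketch below assumes the checkpoint is in a local directory and uses that file layout. Some checkpoints keep the same fields under `quantization_config` in `config.json` instead, which this sketch does not handle. `read_gptq_config` and the example path are hypothetical:

```python
import json
from pathlib import Path


def read_gptq_config(model_dir: str) -> dict:
    """Return the GPTQ settings stored next to the weights, or {} if the file is absent."""
    config_file = Path(model_dir) / "quantize_config.json"
    if not config_file.exists():
        return {}
    with config_file.open() as f:
        config = json.load(f)
    # Keep only the fields discussed in this guide
    return {key: config.get(key) for key in ("bits", "group_size", "desc_act")}


# Hypothetical local path; prints {} if the directory or file does not exist
print(read_gptq_config("./Llama-2-13B-GPTQ"))
```

A `desc_act` value of `true` means the checkpoint was quantized with activation ordering.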

**Quantize your own model**:

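```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_name = "meta-llama/Llama-2-13b-hf"
quantized_name = "llama-2-13b-gptq"

# Define the quantization config first: from_pretrained needs it
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=True  # Activation ordering
)

# Load model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config)

# Prepare calibration data: quantize() expects tokenized examples, not raw strings
calib_texts = [...]  # List of sample texts
calib_data = [tokenizer(text) for text in calib_texts]

# Quantize
model.quantize(calib_data)

# Save the quantized weights and the tokenizer together
model.save_quantized(quantized_name)
tokenizer.save_pretrained(quantized_name)
```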

## FP8 quantization (H100)

**FP8** (8-bit floating point) offers best speed on H100 GPUs with minimal accuracy loss.

**Requirements**:
- Hopper-architecture GPU (H100 or H800)
- CUDA 12.3+ (12.8 recommended)
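
The GPU requirement can be checked programmatically. H100 and H800 report CUDA compute capability 9.0, so a minimal sketch is to compare a `(major, minor)` tuple against that value. `meets_fp8_requirement` is a hypothetical helper, and treating anything at or above 9.0 as qualifying is an assumption made here, not a statement about vLLM's own checks:

```python
HOPPER_CAPABILITY = (9, 0)  # H100 and H800 report compute capability 9.0


def meets_fp8_requirement(capability: tuple) -> bool:
    """True if a (major, minor) compute capability is Hopper or newer."""
    return tuple(capability) >= HOPPER_CAPABILITY


# With PyTorch installed, the tuple can come from torch.cuda.get_device_capability()
print(meets_fp8_requirement((9, 0)))  # H100
print(meets_fp8_requirement((8, 0)))  # A100
```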

**Step 1: Enable FP8**

```bash
vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
  --quantization fp8 \
  --tensor-parallel-size 2
```

**Performance gains on H100**:
```
fp16: 180 tokens/sec
FP8: 320 tokens/sec
= 1.8x speedup
```

**Step 2: Verify accuracy**

FP8 typically has <0.5% accuracy degradation:
```python
# Run evaluation suite
# Compare FP8 vs FP16 on your tasks
# Verify acceptable accuracy
```

**Dynamic FP8 quantization** (no pre-quantized model needed):

```bash
# vLLM automatically quantizes at runtime
vllm serve MODEL --quantization fp8
# No model preparation required
```

## Model preparation

**Pre-quantized models (easiest)**:

1. Search HuggingFace: `[model name] AWQ` or `[model name] GPTQ`
2. Download or use directly: `TheBloke/[Model]-AWQ`
3. Launch with appropriate `--quantization` flag

**Quantize your own model**:

**AWQ**:
```bash
# Install AutoAWQ
pip install autoawq

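# Run your own quantization script (e.g. the AWQ snippet above saved as
# quantize_awq.py; the script name and its arguments are placeholders)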
python quantize_awq.py --model MODEL --output OUTPUT
```

**GPTQ**:
```bash
# Install AutoGPTQ
pip install auto-gptq

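# Run your own quantization script (e.g. the GPTQ snippet above saved as
# quantize_gptq.py; the script name and its arguments are placeholders)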
python quantize_gptq.py --model MODEL --output OUTPUT
```

**Calibration data**:
- Use 128-512 diverse examples from target domain
- Representative of production inputs
- Higher quality calibration = better accuracy
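
One way to assemble such a set from a pool of logged or collected texts is to deduplicate, drop very short entries, and draw a fixed-seed random sample. This is a sketch under assumptions: the function name, the `min_chars` cutoff, and the synthetic corpus are illustrative, and random sampling is only a rough stand-in for diversity:

```python
import random


def select_calibration_samples(texts, num_samples=128, min_chars=50, seed=0):
    """Deduplicate, drop very short texts, and draw a reproducible random sample."""
    unique = list(dict.fromkeys(t.strip() for t in texts))  # preserves order
    candidates = [t for t in unique if len(t) >= min_chars]
    if len(candidates) <= num_samples:
        return candidates
    return random.Random(seed).sample(candidates, num_samples)


# Synthetic stand-in for texts drawn from your target domain
corpus = [f"Sample production request {i}: summarize the attached report." for i in range(1000)]
calib_texts = select_calibration_samples(corpus, num_samples=128)
print(len(calib_texts))
```

The resulting list can be passed as the sample texts in the GPTQ quantization snippet earlier in this guide.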

## Accuracy vs compression trade-offs

**Empirical results** (Llama 2 70B on MMLU benchmark):

| Quantization | Accuracy (relative to FP16) | Memory | Speed | Production-Ready |
|--------------|----------|--------|-------|------------------|
| FP16 (baseline) | 100% | 140GB | 1.0x | ✅ (if memory available) |
| FP8 | 99.5% | 70GB | 1.8x | ✅ (H100 only) |
| AWQ 4-bit | 99.0% | 35GB | 1.5x | ✅ (best for 70B) |
| GPTQ 4-bit | 98.5% | 35GB | 1.5x | ✅ (good compatibility) |
| SqueezeLLM 3-bit | 96.0% | 26GB | 1.3x | ⚠️ (check accuracy) |

**When to use each**:

**No quantization (FP16)**:
- Have sufficient GPU memory
- Need absolute best accuracy
- Model <13B parameters

**FP8**:
- Using H100/H800 GPUs
- Need best speed with minimal accuracy loss
- Production deployment

**AWQ 4-bit**:
- Need to fit 70B model in 40GB GPU
- Production deployment
- <1% accuracy loss acceptable

**GPTQ 4-bit**:
- Wide model support needed
- Not on H100 (use FP8 instead)
- 1-2% accuracy loss acceptable
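
The guidance above can be condensed into a small decision helper. This is one possible encoding, not an official rule set: the function name, its arguments, and the order in which the rules are checked are assumptions. The returned strings match the values passed to `--quantization` elsewhere in this guide, and `None` means serving without quantization:

```python
from typing import Optional


def recommend_quantization(gpu: str, model_params_b: float, fits_in_fp16: bool) -> Optional[str]:
    """Map the guidance above onto a single suggestion (simplified)."""
    if gpu.upper() in {"H100", "H800"}:
        return "fp8"   # Hopper GPUs: best speed with minimal accuracy loss
    if fits_in_fp16 and model_params_b < 13:
        return None    # small model with enough memory: keep FP16
    if model_params_b >= 70:
        return "awq"   # fit a 70B model on a 40GB GPU
    return "gptq"      # widest model support off Hopper


print(recommend_quantization("A100", 70, fits_in_fp16=False))
```

Whatever the helper suggests, the accuracy check described next should still decide whether the quantized model is acceptable.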

**Testing strategy**:

1. **Baseline**: Measure FP16 accuracy on your evaluation set
2. **Quantize**: Create quantized version
3. **Evaluate**: Compare quantized vs baseline on same tasks
4. **Decide**: Accept if degradation < threshold (typically 1-2%)

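**Example evaluation**:
```python
from openai import OpenAI

def evaluate(base_url, model, eval_suite):
    """Illustrative helper (not part of vLLM): fraction of (prompt, answer)
    pairs whose completion contains the expected answer."""
    client = OpenAI(base_url=base_url, api_key="EMPTY")
    correct = 0
    for prompt, answer in eval_suite:
        response = client.completions.create(
            model=model, prompt=prompt, max_tokens=32, temperature=0
        )
        correct += answer.lower() in response.choices[0].text.lower()
    return correct / len(eval_suite)

# Placeholder evaluation set: replace with prompts and answers from your tasks
eval_suite = [("The capital of France is", "Paris")]

# Run on FP16 baseline (assumed to be served on port 8000)
baseline_score = evaluate("http://localhost:8000/v1", "meta-llama/Llama-2-70b-hf", eval_suite)

# Run on quantized (assumed to be served on port 8001)
quant_score = evaluate("http://localhost:8001/v1", "TheBloke/Llama-2-70B-AWQ", eval_suite)

# Compare (relative degradation in percent)
degradation = (baseline_score - quant_score) / baseline_score * 100
print(f"Accuracy degradation: {degradation:.2f}%")

# Decision
if degradation < 1.0:
    print("✅ Quantization acceptable for production")
else:
    print("⚠️ Review accuracy loss")
```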