
Quantization Guide

Contents

  • Quantization methods comparison
  • AWQ setup and usage
  • GPTQ setup and usage
  • FP8 quantization (H100)
  • Model preparation
  • Accuracy vs compression trade-offs

Quantization methods comparison

Method       Compression         Accuracy Loss   Speed     Best For
AWQ          4-bit (75%)         <1%             Fast      70B models, production
GPTQ         4-bit (75%)         1-2%            Fast      Wide model support
FP8          8-bit (50%)         <0.5%           Fastest   H100 GPUs only
SqueezeLLM   3-4 bit (75-80%)    2-3%            Medium    Extreme compression

Recommendation:

  • Production: Use AWQ for 70B models
  • H100 GPUs: Use FP8 for best speed
  • Maximum compatibility: Use GPTQ
  • Extreme compression: Use SqueezeLLM
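
The recommendations above can be captured as a simple heuristic. Below is an illustrative sketch; the function name and thresholds are assumptions for this guide, not part of vLLM:

def pick_quantization(gpu: str, model_params_b: float, vram_gb: float) -> str:
    """Rough heuristic mirroring the recommendations above (illustrative only)."""
    weights_fp16_gb = model_params_b * 2  # FP16 weights: ~2 bytes per parameter
    if gpu.upper() in {"H100", "H800"}:
        return "fp8"                      # Hopper GPUs: FP8 is fastest
    if weights_fp16_gb <= vram_gb * 0.9:
        return "none (FP16 fits)"         # enough memory: skip quantization
    return "awq"                          # otherwise 4-bit; fall back to GPTQ if no AWQ checkpoint

print(pick_quantization("A100", 70, 40))   # awq
print(pick_quantization("H100", 70, 80))   # fp8
print(pick_quantization("A100", 13, 40))   # none (FP16 fits)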

AWQ setup and usage

AWQ (Activation-aware Weight Quantization) typically achieves the best accuracy among 4-bit methods.

Step 1: Find pre-quantized model

Search HuggingFace for AWQ models:

# Example: TheBloke/Llama-2-70B-AWQ
# Example: TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ

Step 2: Launch with AWQ

vllm serve TheBloke/Llama-2-70B-AWQ \
  --quantization awq \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.95
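
For batch workloads, the same settings can be used through vLLM's offline Python API. A minimal sketch, assuming the AWQ checkpoint above:

from vllm import LLM, SamplingParams

# Offline (non-server) equivalent of the serve command above
llm = LLM(
    model="TheBloke/Llama-2-70B-AWQ",
    quantization="awq",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.95,
)

outputs = llm.generate(
    ["Explain quantum entanglement"],
    SamplingParams(max_tokens=256),
)
print(outputs[0].outputs[0].text)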

Memory savings:

Llama 2 70B FP16: 140GB VRAM (4x A100 needed)
Llama 2 70B AWQ: 35GB VRAM (1x A100 40GB)
= 4x memory reduction
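
The savings follow directly from bytes per parameter; a quick back-of-envelope check (weights only, excluding KV cache and activations):

params_billion = 70

fp16_gb = params_billion * 2        # FP16: 2 bytes per parameter
awq_4bit_gb = params_billion * 0.5  # 4-bit: ~0.5 bytes per parameter

print(f"FP16 weights:      ~{fp16_gb:.0f} GB")      # ~140 GB
print(f"AWQ 4-bit weights: ~{awq_4bit_gb:.0f} GB")  # ~35 GB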

Step 3: Verify performance

Test that outputs are acceptable:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Test complex reasoning
response = client.chat.completions.create(
    model="TheBloke/Llama-2-70B-AWQ",
    messages=[{"role": "user", "content": "Explain quantum entanglement"}]
)

print(response.choices[0].message.content)
# Verify quality matches your requirements

Quantize your own model (requires GPU with 80GB+ VRAM):

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-70b-hf"
quant_path = "llama-2-70b-awq"

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantize
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4}
model.quantize(tokenizer, quant_config=quant_config)

# Save
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

GPTQ setup and usage

GPTQ has the widest model support and good compression.

Step 1: Find GPTQ model

# Example: TheBloke/Llama-2-13B-GPTQ
# Example: TheBloke/CodeLlama-34B-GPTQ

Step 2: Launch with GPTQ

vllm serve TheBloke/Llama-2-13B-GPTQ \
  --quantization gptq \
  --dtype float16

GPTQ configuration notes:

# GPTQ parameters (bits, group size, act-order) are typically read from the
# model's quantize_config.json, so no extra flags are needed beyond:
vllm serve MODEL \
  --quantization gptq \
  --dtype float16

Quantize your own model:

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_name = "meta-llama/Llama-2-13b-hf"
quantized_name = "llama-2-13b-gptq"

# Define quantization settings before loading the model
quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit weights
    group_size=128,  # quantization group size
    desc_act=True    # activation ordering (act-order)
)

# Load tokenizer and model with the quantization config
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config)

# Prepare calibration data: tokenize sample texts from your target domain
calib_texts = [...]  # List of sample texts
examples = [tokenizer(text, return_tensors="pt") for text in calib_texts]

# Quantize using the calibration examples
model.quantize(examples)

# Save quantized weights and tokenizer
model.save_quantized(quantized_name)
tokenizer.save_pretrained(quantized_name)

FP8 quantization (H100)

FP8 (8-bit floating point) offers the best speed on H100 GPUs with minimal accuracy loss.

Requirements:

  • H100 or H800 GPU
  • CUDA 12.3+ (12.8 recommended)
  • Hopper architecture support

Step 1: Enable FP8

vllm serve meta-llama/Llama-3-70B-Instruct \
  --quantization fp8 \
  --tensor-parallel-size 2

Performance gains on H100:

FP16: 180 tokens/sec
FP8: 320 tokens/sec
= 1.8x speedup
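
To confirm the speedup on your own hardware, a rough throughput check can be run against the server with the OpenAI client. A minimal sketch (the prompt and token counts are arbitrary):

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.time()
response = client.chat.completions.create(
    model="meta-llama/Llama-3-70B-Instruct",
    messages=[{"role": "user", "content": "Write a short essay about GPUs."}],
    max_tokens=512,
)
elapsed = time.time() - start

# completion_tokens / wall-clock time gives a rough single-request tokens/sec
print(f"{response.usage.completion_tokens / elapsed:.1f} tokens/sec")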

Step 2: Verify accuracy

FP8 typically has <0.5% accuracy degradation:

# Run evaluation suite
# Compare FP8 vs FP16 on your tasks
# Verify acceptable accuracy

Dynamic FP8 quantization (no pre-quantized model needed):

# vLLM automatically quantizes at runtime
vllm serve MODEL --quantization fp8
# No model preparation required

Model preparation

Pre-quantized models (easiest):

  1. Search HuggingFace: [model name] AWQ or [model name] GPTQ
  2. Download or use directly: TheBloke/[Model]-AWQ
  3. Launch with appropriate --quantization flag
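
The search can also be scripted with the huggingface_hub client; a small sketch (the query string is just an example):

from huggingface_hub import HfApi

api = HfApi()

# List the most-downloaded AWQ checkpoints matching the query
for m in api.list_models(search="Llama-2-70B AWQ", sort="downloads", direction=-1, limit=5):
    print(m.id)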

Quantize your own model:

AWQ:

# Install AutoAWQ
pip install autoawq

# Run quantization script
python quantize_awq.py --model MODEL --output OUTPUT

GPTQ:

# Install AutoGPTQ
pip install auto-gptq

# Run quantization script
python quantize_gptq.py --model MODEL --output OUTPUT

Calibration data:

  • Use 128-512 diverse examples from target domain
  • Representative of production inputs
  • Higher quality calibration = better accuracy
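
A minimal sketch of assembling calibration texts with the datasets library (wikitext is just a stand-in; use data that resembles your production traffic):

from datasets import load_dataset

# Pull a few hundred reasonably long samples from a public corpus
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
calib_texts = [t for t in dataset["text"] if len(t) > 200][:256]

print(f"{len(calib_texts)} calibration samples")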

Accuracy vs compression trade-offs

Empirical results (Llama 2 70B on MMLU benchmark):

Quantization       Accuracy   Memory   Speed   Production-Ready
FP16 (baseline)    100%       140GB    1.0x    ✅ (if memory available)
FP8                99.5%      70GB     1.8x    ✅ (H100 only)
AWQ 4-bit          99.0%      35GB     1.5x    ✅ (best for 70B)
GPTQ 4-bit         98.5%      35GB     1.5x    ✅ (good compatibility)
SqueezeLLM 3-bit   96.0%      26GB     1.3x    ⚠️ (check accuracy)

When to use each:

No quantization (FP16):

  • Have sufficient GPU memory
  • Need absolute best accuracy
  • Model <13B parameters

FP8:

  • Using H100/H800 GPUs
  • Need best speed with minimal accuracy loss
  • Production deployment

AWQ 4-bit:

  • Need to fit 70B model in 40GB GPU
  • Production deployment
  • <1% accuracy loss acceptable

GPTQ 4-bit:

  • Wide model support needed
  • Not on H100 (use FP8 instead)
  • 1-2% accuracy loss acceptable

Testing strategy:

  1. Baseline: Measure FP16 accuracy on your evaluation set
  2. Quantize: Create quantized version
  3. Evaluate: Compare quantized vs baseline on same tasks
  4. Decide: Accept if degradation < threshold (typically 1-2%)

Example evaluation:

# Pseudocode: evaluate() and eval_suite stand in for your own evaluation
# harness (e.g. lm-eval-harness) pointed at each running server.

# Run on FP16 baseline
baseline_score = evaluate(model_fp16, eval_suite)

# Run on quantized
quant_score = evaluate(model_awq, eval_suite)

# Compare
degradation = (baseline_score - quant_score) / baseline_score * 100
print(f"Accuracy degradation: {degradation:.2f}%")

# Decision
if degradation < 1.0:
    print("✅ Quantization acceptable for production")
else:
    print("⚠️ Review accuracy loss")