---
name: gguf-quantization
description: GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer hardware, Apple Silicon, or when needing flexible quantization from 2-8 bit without GPU requirements.
version: 1.0.0
author: Orchestra Research
license: MIT
dependencies: [llama-cpp-python>=0.2.0]
metadata:
  hermes:
    tags: [GGUF, Quantization, llama.cpp, CPU Inference, Apple Silicon, Model Compression, Optimization]
---

# GGUF - Quantization Format for llama.cpp

GGUF (GPT-Generated Unified Format) is the standard file format for llama.cpp. It enables efficient inference on CPUs, Apple Silicon, and GPUs with flexible quantization options.
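
To make the file format a little more concrete, here is a minimal sketch of reading the fixed-size header at the start of a GGUF file. It assumes format version 2 or 3 with little-endian fields (a 4-byte `GGUF` magic, a `uint32` version, then `uint64` tensor and metadata counts). The function name `read_gguf_header` and the sample counts are illustrative and not part of llama.cpp.

```python
import io
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(stream):
    """Parse the fixed-size GGUF header (assumes version 2 or 3, little-endian)."""
    magic = stream.read(4)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    # uint32 version, uint64 tensor count, uint64 metadata key/value count
    version, tensor_count, kv_count = struct.unpack("<IQQ", stream.read(20))
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Demo on a synthetic in-memory header rather than a real model file
fake = io.BytesIO(GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24))
print(read_gguf_header(fake))
```

For a real model, open the file in binary mode and pass the handle, for example `read_gguf_header(open("model-q4_k_m.gguf", "rb"))`. The metadata key/value pairs and tensor descriptors follow the header and are not parsed here.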

## When to use GGUF

**Use GGUF when:**
- Deploying on consumer hardware (laptops, desktops)
- Running on Apple Silicon (M1/M2/M3) with Metal acceleration
- Needing CPU inference without a GPU
- Wanting flexible quantization (Q2_K to Q8_0)
- Using local AI tools (LM Studio, Ollama, text-generation-webui)

**Key advantages:**
- **Universal hardware**: CPU, Apple Silicon, NVIDIA, AMD support
- **No Python runtime**: Pure C/C++ inference
- **Flexible quantization**: 2-8 bit with various methods (K-quants)
- **Ecosystem support**: LM Studio, Ollama, koboldcpp, and more
- **imatrix**: Importance matrix for better low-bit quality

**Use alternatives instead:**
- **AWQ/GPTQ**: Maximum accuracy with calibration on NVIDIA GPUs
- **HQQ**: Fast calibration-free quantization for HuggingFace
- **bitsandbytes**: Simple integration with transformers library
- **TensorRT-LLM**: Production NVIDIA deployment with maximum speed
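
## Quick start

### Installation

```bash
# Clone llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Build (CPU). Recent llama.cpp releases build with CMake; the Makefile build is deprecated
cmake -B build
cmake --build build --config Release

# Build with CUDA (NVIDIA)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# Build with Metal (Apple Silicon): Metal is enabled by default on macOS,
# so the plain CPU build commands above are sufficient

# Binaries are written to build/bin/. The ./llama-* commands in this guide
# assume they are run from that directory; adjust paths otherwise.

# Install Python bindings (optional)
pip install llama-cpp-python
```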

### Convert model to GGUF

```bash
# Install requirements (run from the llama.cpp repository root)
pip install -r requirements.txt

# Convert HuggingFace model to GGUF (FP16)
python convert_hf_to_gguf.py ./path/to/model --outfile model-f16.gguf

# Or specify output type
python convert_hf_to_gguf.py ./path/to/model \
    --outfile model-f16.gguf \
    --outtype f16
```

### Quantize model

```bash
# Basic quantization to Q4_K_M
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# Quantize with importance matrix (better quality)
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
```

### Run inference

```bash
# CLI inference
./llama-cli -m model-q4_k_m.gguf -p "Hello, how are you?"

# Interactive mode
./llama-cli -m model-q4_k_m.gguf --interactive

# With GPU offload
./llama-cli -m model-q4_k_m.gguf -ngl 35 -p "Hello!"
```

## Quantization types

### K-quant methods (recommended)

| Type | Bits | Size (7B) | Quality | Use Case |
|------|------|-----------|---------|----------|
| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression |
| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained |
| Q3_K_M | 3.3 | ~3.3 GB | Medium | Balance |
| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Good balance |
| Q4_K_M | 4.5 | ~4.1 GB | High | **Recommended default** |
| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused |
| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality |
| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original |
| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality |

Bit counts are nominal; the effective bits per weight are somewhat higher because each block also stores scale data. Q8_0 is not a K-quant and is listed here for comparison.

### Legacy methods

| Type | Description |
|------|-------------|
| Q4_0 | 4-bit, per-block scale only |
| Q4_1 | 4-bit, per-block scale and minimum |
| Q5_0 | 5-bit, per-block scale only |
| Q5_1 | 5-bit, per-block scale and minimum |

**Recommendation**: Use K-quant methods (Q4_K_M, Q5_K_M) for best quality/size ratio.
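
As a rough cross-check of the sizes in the K-quant table, file size can be approximated as parameter count times bits per weight, divided by 8. The sketch below uses the nominal bit widths from that table and an assumed 7e9 parameters. The helper name `estimate_gguf_size_gb` is illustrative. Expect real files to be somewhat larger, since effective bits per weight exceed the nominal figure and some tensors are typically stored at higher precision.

```python
def estimate_gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough file size in GB (10^9 bytes): parameters x bits per weight / 8."""
    return n_params * bits_per_weight / 8 / 1e9


# Nominal bit widths copied from the K-quant table
NOMINAL_BITS = {"Q2_K": 2.5, "Q4_K_M": 4.5, "Q5_K_M": 5.5, "Q8_0": 8.0}

for name, bits in NOMINAL_BITS.items():
    # 7e9 parameters is an assumed model size for illustration
    print(f"{name}: ~{estimate_gguf_size_gb(7e9, bits):.1f} GB")
```

The same function gives a quick first answer to "will this quantization fit in my RAM or VRAM" before downloading or converting anything.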

## Conversion workflows

### Workflow 1: HuggingFace to GGUF

```bash
# 1. Download model (gated repo: accept the license on the model page
#    and authenticate with `huggingface-cli login` first)
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b

# 2. Convert to GGUF (FP16)
python convert_hf_to_gguf.py ./llama-3.1-8b \
    --outfile llama-3.1-8b-f16.gguf \
    --outtype f16

# 3. Quantize
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M

# 4. Test
./llama-cli -m llama-3.1-8b-q4_k_m.gguf -p "Hello!" -n 50
```

### Workflow 2: With importance matrix (better quality)

```bash
# 1. Convert to GGUF
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf

# 2. Create calibration text (diverse samples)
#    The three sentences below are placeholders; a real run needs far more text.
cat > calibration.txt << 'EOF'
The quick brown fox jumps over the lazy dog.
Machine learning is a subset of artificial intelligence.
Python is a popular programming language.
EOF
# Append more diverse text samples to calibration.txt here

# 3. Generate importance matrix
#    -c sets the context size, which is also the chunk size in tokens.
#    (--chunk N means "start from chunk N", so it is not used here.)
./llama-imatrix -m model-f16.gguf \
    -f calibration.txt \
    -c 512 \
    -o model.imatrix \
    -ngl 35  # GPU layers if available

# 4. Quantize with imatrix
./llama-quantize --imatrix model.imatrix \
    model-f16.gguf \
    model-q4_k_m.gguf \
    Q4_K_M
```

### Workflow 3: Multiple quantizations

```bash
#!/bin/bash
MODEL="llama-3.1-8b-f16.gguf"
IMATRIX="llama-3.1-8b.imatrix"

# Generate imatrix once (wiki.txt is your calibration corpus)
./llama-imatrix -m "$MODEL" -f wiki.txt -o "$IMATRIX" -ngl 35

# Create multiple quantizations
for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
    # Lowercase with tr: ${QUANT,,} needs bash 4+, but macOS ships bash 3.2
    OUTPUT="llama-3.1-8b-$(echo "$QUANT" | tr '[:upper:]' '[:lower:]').gguf"
    ./llama-quantize --imatrix "$IMATRIX" "$MODEL" "$OUTPUT" "$QUANT"
    echo "Created: $OUTPUT ($(du -h "$OUTPUT" | cut -f1))"
done
```
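
If a shell loop is inconvenient (for example on Windows), the same batch can be driven from Python. The sketch below only builds the argument lists, reusing the file names from the script above. The helper name `quantize_commands` is illustrative, and the `./llama-quantize` path is an assumption about where the binary lives.

```python
def quantize_commands(model, imatrix, quant_types, prefix):
    """Return one llama-quantize argv list per quantization type."""
    commands = []
    for quant in quant_types:
        output = f"{prefix}-{quant.lower()}.gguf"
        # Assumed binary location; adjust to wherever llama-quantize was built
        commands.append(
            ["./llama-quantize", "--imatrix", imatrix, model, output, quant]
        )
    return commands


cmds = quantize_commands(
    "llama-3.1-8b-f16.gguf",
    "llama-3.1-8b.imatrix",
    ["Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0"],
    "llama-3.1-8b",
)
for argv in cmds:
    print(" ".join(argv))
```

To actually run the batch, pass each list to `subprocess.run(argv, check=True)`. Building the commands separately from executing them makes it easy to review them first.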

## Python usage

### llama-cpp-python

```python
from llama_cpp import Llama

# Load model
llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,        # Context window
    n_gpu_layers=35,   # GPU offload (0 for CPU only)
    n_threads=8        # CPU threads
)

# Generate
output = llm(
    "What is machine learning?",
    max_tokens=256,
    temperature=0.7,
    stop=["</s>", "\n\n"]
)
print(output["choices"][0]["text"])
```

### Chat completion

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,
    chat_format="llama-3"  # Or "chatml", "mistral", etc.
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"}
]

response = llm.create_chat_completion(
    messages=messages,
    max_tokens=256,
    temperature=0.7
)
print(response["choices"][0]["message"]["content"])
```

### Streaming

```python
from llama_cpp import Llama

llm = Llama(model_path="./model-q4_k_m.gguf", n_gpu_layers=35)

# Stream tokens
for chunk in llm(
    "Explain quantum computing:",
    max_tokens=256,
    stream=True
):
    print(chunk["choices"][0]["text"], end="", flush=True)
```

## Server mode

### Start OpenAI-compatible server

```bash
# Start server (0.0.0.0 listens on all interfaces; use 127.0.0.1 for local-only access)
./llama-server -m model-q4_k_m.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 35 \
    -c 4096

# Or with Python bindings (requires: pip install 'llama-cpp-python[server]')
python -m llama_cpp.server \
    --model model-q4_k_m.gguf \
    --n_gpu_layers 35 \
    --host 0.0.0.0 \
    --port 8080
```

### Use with OpenAI client

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256
)
print(response.choices[0].message.content)
```

## Hardware optimization

### Apple Silicon (Metal)

```bash
# Build with Metal (enabled by default in macOS CMake builds; binaries land in build/bin/)
cmake -B build
cmake --build build --config Release

# Run with Metal acceleration
./llama-cli -m model.gguf -ngl 99 -p "Hello"

# Python with Metal
python - <<'EOF'
from llama_cpp import Llama

llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=99,  # Offload all layers
    n_threads=1       # Metal handles parallelism
)
EOF
```

### NVIDIA CUDA

```bash
# Build with CUDA (binaries land in build/bin/)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# Run with CUDA
./llama-cli -m model.gguf -ngl 35 -p "Hello"

# Specify GPU
CUDA_VISIBLE_DEVICES=0 ./llama-cli -m model.gguf -ngl 35
```

### CPU optimization

```bash
# Build for the host CPU (a default build targets the native CPU,
# so AVX2/AVX-512 are used where the processor supports them)
cmake -B build
cmake --build build --config Release

# Run with optimal threads
./llama-cli -m model.gguf -t 8 -p "Hello"

# Python CPU config
python - <<'EOF'
from llama_cpp import Llama

llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=0,  # CPU only
    n_threads=8,     # Match physical cores
    n_batch=512      # Batch size for prompt processing
)
EOF
```

## Integration with tools

### Ollama

```bash
# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./model-q4_k_m.gguf
TEMPLATE """{{ .System }}
{{ .Prompt }}"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF

# Create Ollama model
ollama create mymodel -f Modelfile

# Run
ollama run mymodel "Hello!"
```
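
### LM Studio

1. Place the GGUF file in LM Studio's models directory (`~/.cache/lm-studio/models/` in older releases; newer releases may use a different location, so check the path shown in the app)
2. Open LM Studio and select the model
3. Configure context length and GPU offload
4. Start inference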

### text-generation-webui

```bash
# Place in models folder
cp model-q4_k_m.gguf text-generation-webui/models/

# Start with llama.cpp loader
python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35
```

## Best practices

1. **Use K-quants**: Q4_K_M offers best quality/size balance
2. **Use imatrix**: Always use importance matrix for Q4 and below
3. **GPU offload**: Offload as many layers as VRAM allows
4. **Context length**: Start with 4096, increase if needed
5. **Thread count**: Match physical CPU cores, not logical
6. **Batch size**: Increase n_batch for faster prompt processing
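
For the GPU offload guideline, a starting value for `-ngl` / `n_gpu_layers` can be estimated from the model file size and available VRAM. This is a back-of-the-envelope sketch, not how llama.cpp allocates memory. It assumes all layers are the same size, and `reserve_gb` is a hypothetical allowance for the KV cache and scratch buffers that should be tuned for your context length.

```python
import math


def estimate_gpu_layers(model_size_gb, n_layers, vram_gb, reserve_gb=1.5):
    """Estimate how many layers fit in VRAM, assuming equal-sized layers.

    reserve_gb is a hypothetical allowance for KV cache and scratch buffers.
    """
    usable = vram_gb - reserve_gb
    if usable <= 0:
        return 0
    per_layer_gb = model_size_gb / n_layers
    return min(n_layers, math.floor(usable / per_layer_gb))


# Assumed example: a ~4.1 GB Q4_K_M file with 32 layers
print(estimate_gpu_layers(4.1, 32, 8.0))  # 8 GB card: all 32 layers fit
print(estimate_gpu_layers(4.1, 32, 4.0))  # 4 GB card: partial offload
```

Treat the result as a first guess: if loading still fails with an out-of-memory error, lower the value by a few layers and retry.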

## Common issues

**Model loads slowly:**
```bash
# mmap is enabled by default and speeds up loading;
# make sure you are not passing --no-mmap, which disables it
./llama-cli -m model.gguf -p "Hello"
```

**Out of memory:**
```bash
# Reduce GPU layers
./llama-cli -m model.gguf -ngl 20  # Reduce from 35

# Or use smaller quantization
./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M
```

**Poor quality at low bits:**
```bash
# Always use imatrix for Q4 and below
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
```

## References

- **[Advanced Usage](references/advanced-usage.md)** - Batching, speculative decoding, custom builds
- **[Troubleshooting](references/troubleshooting.md)** - Common issues, debugging, benchmarks

## Resources

- **Repository**: https://github.com/ggml-org/llama.cpp
- **Python Bindings**: https://github.com/abetlen/llama-cpp-python
- **Pre-quantized Models**: https://huggingface.co/TheBloke
- **GGUF Converter**: https://huggingface.co/spaces/ggml-org/gguf-my-repo
- **License**: MIT