# GGUF Troubleshooting Guide

## Installation Issues
### Build Fails

**Error:** `make: *** No targets specified and no makefile found`

**Fix:**

```bash
# Ensure you're in the llama.cpp directory
cd llama.cpp
make
```

Note that recent llama.cpp releases have moved to CMake; if `make` still fails from the repository root, try `cmake -B build && cmake --build build --config Release` instead.
**Error:** `fatal error: cuda_runtime.h: No such file or directory`

**Fix:**

```bash
# Install the CUDA toolkit (Ubuntu)
sudo apt install nvidia-cuda-toolkit

# Or point the build at an existing CUDA install
export CUDA_PATH=/usr/local/cuda
export PATH=$CUDA_PATH/bin:$PATH
make GGML_CUDA=1
```
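Before rebuilding, it can help to confirm what the build will actually see. A minimal stdlib-only diagnostic sketch (the `/usr/local/cuda` default path is an assumption; adjust it for your install):

```python
import os
import shutil

def cuda_toolkit_status(cuda_path="/usr/local/cuda"):
    """Report whether the CUDA toolkit is visible to a build."""
    nvcc = shutil.which("nvcc")  # is nvcc reachable via PATH?
    header = os.path.join(cuda_path, "include", "cuda_runtime.h")
    return {
        "nvcc_on_path": nvcc is not None,
        "nvcc_location": nvcc,
        "cuda_runtime_header": os.path.isfile(header),
    }

print(cuda_toolkit_status())
```

If `nvcc_on_path` is false but the header exists, exporting `PATH` as above is usually enough; if both are false, install the toolkit first.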
### Python Bindings Issues

**Error:** `ERROR: Failed building wheel for llama-cpp-python`

**Fix:**

```bash
# Install build dependencies
pip install cmake scikit-build-core

# For CUDA support
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

# For Metal (macOS)
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
```
**Error:** `ImportError: libcudart.so.XX: cannot open shared object file`

**Fix:**

```bash
# Add the CUDA libraries to the loader path
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# Or reinstall against the correct CUDA version
pip uninstall llama-cpp-python
CUDACXX=/usr/local/cuda/bin/nvcc CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
```
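To check whether the loader can actually find a CUDA runtime before reinstalling, a stdlib-only sketch (the extra search directory is an assumption; adjust it for your install):

```python
import glob
import os

def find_libcudart(extra_dirs=("/usr/local/cuda/lib64",)):
    """Scan LD_LIBRARY_PATH (plus common CUDA dirs) for libcudart.so.*"""
    dirs = os.environ.get("LD_LIBRARY_PATH", "").split(os.pathsep)
    hits = []
    for d in list(dirs) + list(extra_dirs):
        if d:
            hits += glob.glob(os.path.join(d, "libcudart.so*"))
    return sorted(set(hits))

print(find_libcudart() or "libcudart not found -- extend LD_LIBRARY_PATH")
```

If the library turns up only outside `LD_LIBRARY_PATH`, exporting that directory as shown above should fix the import.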
## Conversion Issues

### Model Not Supported

**Error:** `KeyError: 'model.embed_tokens.weight'`

**Fix:**

```bash
# Check the model architecture
python -c "from transformers import AutoConfig; print(AutoConfig.from_pretrained('./model').architectures)"

# For most models:
python convert_hf_to_gguf.py ./model --outfile model.gguf

# Older models may need a legacy conversion script
```
### Vocabulary Mismatch

**Error:** `RuntimeError: Vocabulary size mismatch`

**Fix:**

```python
# Ensure the tokenizer matches the model
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("./model")
model = AutoModelForCausalLM.from_pretrained("./model")
print(f"Tokenizer vocab size: {len(tokenizer)}")
print(f"Model vocab size: {model.config.vocab_size}")

# If they differ, resize the embeddings before conversion
model.resize_token_embeddings(len(tokenizer))
model.save_pretrained("./model-fixed")
```
### Out of Memory During Conversion

**Error:** `torch.cuda.OutOfMemoryError` during conversion

**Fix:**

```bash
# Convert on the CPU
CUDA_VISIBLE_DEVICES="" python convert_hf_to_gguf.py ./model --outfile model.gguf

# Or write F16 output directly to reduce peak memory
python convert_hf_to_gguf.py ./model --outfile model.gguf --outtype f16
```
## Quantization Issues

### Wrong Output File Size

**Problem:** The quantized file is larger than expected

**Check:**

```bash
# Verify the quantization type
./llama-cli -m model.gguf --verbose
```

Approximate sizes for a 7B model:

| Type   | Size     |
|--------|----------|
| Q4_K_M | ~4.1 GB  |
| Q5_K_M | ~4.8 GB  |
| Q8_0   | ~7.2 GB  |
| F16    | ~13.5 GB |
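These sizes follow from a simple rule of thumb: file size ≈ parameter count × effective bits per weight / 8. A sketch of that arithmetic (the bits-per-weight figures and the 6.74B parameter count for a 7B Llama-style model are approximations, not exact values):

```python
# Effective bits per weight, including quantization scales (approximate)
BPW = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q8_0": 8.5, "F16": 16.0}

def estimate_gguf_size_gb(n_params, quant):
    """Rough GGUF file size: parameters * effective bits-per-weight / 8."""
    return n_params * BPW[quant] / 8 / 1e9

for q in BPW:
    print(f"{q:7s} ~{estimate_gguf_size_gb(6.74e9, q):.1f} GB")
```

If your file deviates from this estimate by more than a few hundred MB, the wrong quantization type was probably applied.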
### Quantization Crashes

**Error:** Segmentation fault during quantization

**Fix:**

```bash
# Increase the stack size
ulimit -s unlimited

# Or use fewer threads (thread count is a trailing positional argument)
./llama-quantize model-f16.gguf model-q4.gguf Q4_K_M 4
```
### Poor Quality After Quantization

**Problem:** The model outputs gibberish after quantization

**Solutions:**

- Use an importance matrix:

  ```bash
  # Generate an imatrix from good calibration data
  ./llama-imatrix -m model-f16.gguf \
      -f wiki_sample.txt \
      --chunk 512 \
      -o model.imatrix

  # Quantize with the imatrix
  ./llama-quantize --imatrix model.imatrix \
      model-f16.gguf model-q4_k_m.gguf Q4_K_M
  ```

- Try higher precision:

  ```bash
  # Use Q5_K_M or Q6_K instead of Q4
  ./llama-quantize model-f16.gguf model-q5_k_m.gguf Q5_K_M
  ```

- Check the original model:

  ```bash
  # Test the FP16 version first
  ./llama-cli -m model-f16.gguf -p "Hello, how are you?" -n 50
  ```
## Inference Issues

### Slow Generation

**Problem:** Generation is slower than expected

**Solutions:**

- Enable GPU offload:

  ```bash
  ./llama-cli -m model.gguf -ngl 35 -p "Hello"
  ```

- Optimize the batch size:

  ```python
  llm = Llama(
      model_path="model.gguf",
      n_batch=512,  # larger batches speed up prompt processing
      n_gpu_layers=35,
  )
  ```

- Use an appropriate thread count:

  ```bash
  # Match physical cores, not logical ones
  ./llama-cli -m model.gguf -t 8 -p "Hello"
  ```

- Enable Flash Attention (if supported):

  ```bash
  ./llama-cli -m model.gguf -ngl 35 --flash-attn -p "Hello"
  ```
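Picking the `-t` value programmatically: a sketch that counts unique (physical id, core id) pairs in `/proc/cpuinfo` on Linux, and otherwise falls back to assuming 2-way SMT (a heuristic, not a guarantee):

```python
import os

def physical_core_guess():
    """Guess the physical core count for llama.cpp's -t flag."""
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
        ids = set()
        for block in text.split("\n\n"):
            info = dict(
                line.split(":", 1) for line in block.splitlines() if ":" in line
            )
            info = {k.strip(): v.strip() for k, v in info.items()}
            if "physical id" in info and "core id" in info:
                ids.add((info["physical id"], info["core id"]))
        if ids:
            return len(ids)
    except OSError:
        pass
    # Fallback: assume half the logical CPUs are physical cores
    return max(1, (os.cpu_count() or 2) // 2)

print(f"-t {physical_core_guess()}")
```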
### Out of Memory

**Error:** CUDA out of memory, or the system freezes

**Solutions:**

- Reduce GPU layers:

  ```python
  # Start low and increase
  llm = Llama(model_path="model.gguf", n_gpu_layers=10)
  ```

- Use a smaller quantization:

  ```bash
  ./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M
  ```

- Reduce the context length:

  ```python
  llm = Llama(
      model_path="model.gguf",
      n_ctx=2048,  # reduced from 4096
      n_gpu_layers=35,
  )
  ```

- Quantize the KV cache:

  ```python
  llm = Llama(
      model_path="model.gguf",
      type_k=2,  # Q4_0 for the K cache
      type_v=2,  # Q4_0 for the V cache
      n_gpu_layers=35,
  )
  ```
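The KV cache is often the hidden memory cost, since it grows linearly with context length. A back-of-the-envelope sketch (the Llama-2-7B-like shape of 32 layers, 32 KV heads, head dimension 128, and the ~4.5 bits/element figure for a Q4_0 cache are assumptions):

```python
def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_elt=2):
    """K and V caches: 2 * layers * context * kv_heads * head_dim * element size."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elt

# F16 cache (2 bytes/element) vs. Q4_0 cache (~4.5 bits = 0.5625 bytes/element)
f16 = kv_cache_bytes(32, 4096, 32, 128, bytes_per_elt=2)
q4 = kv_cache_bytes(32, 4096, 32, 128, bytes_per_elt=0.5625)
print(f"F16 KV cache @ 4096 ctx:  {f16 / 2**30:.2f} GiB")
print(f"Q4_0 KV cache @ 4096 ctx: {q4 / 2**30:.2f} GiB")
```

This is why halving `n_ctx` or quantizing the cache can free a substantial slice of VRAM without touching the model weights. Note that models using grouped-query attention have far fewer KV heads than attention heads, which shrinks the cache accordingly.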
### Garbage Output

**Problem:** The model outputs random characters or nonsense

**Diagnose:**

```python
# Check model loading
llm = Llama(model_path="model.gguf", verbose=True)

# Test with a simple prompt
output = llm("1+1=", max_tokens=5, temperature=0)
print(output)
```

**Solutions:**

- Check model integrity:

  ```bash
  # Verify the GGUF file
  ./llama-cli -m model.gguf --verbose 2>&1 | head -50
  ```

- Use the correct chat format:

  ```python
  llm = Llama(
      model_path="model.gguf",
      chat_format="llama-3",  # match your model: chatml, mistral, etc.
  )
  ```

- Check the temperature:

  ```python
  # Lower temperature gives more deterministic output
  output = llm("Hello", max_tokens=50, temperature=0.1)
  ```
### Token Issues

**Error:** `RuntimeError: unknown token` or encoding errors

**Fix:**

```python
# Ensure the prompt is valid UTF-8
prompt = "Hello, world!".encode('utf-8').decode('utf-8')
output = llm(prompt, max_tokens=50)
```
## Server Issues

### Connection Refused

**Error:** Connection refused when accessing the server

**Fix:**

```bash
# Bind to all interfaces
./llama-server -m model.gguf --host 0.0.0.0 --port 8080

# Check whether the port is already in use
lsof -i :8080
```
### Server Crashes Under Load

**Problem:** The server crashes with multiple concurrent requests

**Solutions:**

- Limit parallelism:

  ```bash
  ./llama-server -m model.gguf \
      --parallel 2 \
      -c 4096 \
      --cont-batching
  ```

- Add a request timeout:

  ```bash
  ./llama-server -m model.gguf --timeout 300
  ```

- Monitor memory:

  ```bash
  watch -n 1 nvidia-smi  # GPU
  watch -n 1 free -h     # RAM
  ```
### API Compatibility Issues

**Problem:** The OpenAI client doesn't work with the server

**Fix:**

```python
from openai import OpenAI

# Use the correct base URL format
client = OpenAI(
    base_url="http://localhost:8080/v1",  # include /v1
    api_key="not-needed",
)

# Use the correct model name
response = client.chat.completions.create(
    model="local",  # or the actual model name
    messages=[{"role": "user", "content": "Hello"}],
)
```
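To rule out client-library problems entirely, you can hit the endpoint with nothing but the standard library. A sketch assuming a llama-server listening on localhost:8080:

```python
import json
import urllib.request

def chat_payload(prompt, model="local"):
    """Build an OpenAI-compatible chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def ask(prompt, base_url="http://localhost:8080"):
    """POST to the server's chat endpoint (requires a running llama-server)."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

If this raw request succeeds while the OpenAI client fails, the problem is almost certainly the `base_url` or model name passed to the client.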
## Apple Silicon Issues

### Metal Not Working

**Problem:** Metal acceleration is not enabled

**Check:**

```bash
# Verify Metal support
./llama-cli -m model.gguf --verbose 2>&1 | grep -i metal
```

**Fix:**

```bash
# Rebuild with Metal
make clean
make GGML_METAL=1

# Python bindings
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall
```
### Incorrect Memory Usage on M1/M2

**Problem:** The model uses too much unified memory

**Fix:**

```python
# Offload all layers to Metal
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=99,  # offload everything
    n_threads=1,      # Metal handles parallelism
)
```
## Debugging

### Enable Verbose Output

```bash
# CLI verbose mode
./llama-cli -m model.gguf --verbose -p "Hello" -n 50
```

```python
# Python verbose mode
llm = Llama(model_path="model.gguf", verbose=True)
```

### Check Model Metadata

```bash
# View GGUF metadata
./llama-cli -m model.gguf --verbose 2>&1 | head -100
```
### Validate a GGUF File

```python
import struct

def validate_gguf(filepath):
    with open(filepath, 'rb') as f:
        magic = f.read(4)
        if magic != b'GGUF':
            print(f"Invalid magic: {magic}")
            return False
        version = struct.unpack('<I', f.read(4))[0]
        print(f"GGUF version: {version}")
        # v2+ layout: counts are 64-bit little-endian (v1 used 32-bit)
        tensor_count = struct.unpack('<Q', f.read(8))[0]
        metadata_count = struct.unpack('<Q', f.read(8))[0]
        print(f"Tensors: {tensor_count}, Metadata: {metadata_count}")
        return True

validate_gguf("model.gguf")
```
## Getting Help

- GitHub Issues: https://github.com/ggml-org/llama.cpp/issues
- Discussions: https://github.com/ggml-org/llama.cpp/discussions
- Reddit: r/LocalLLaMA

### Reporting Issues

Include:

- llama.cpp version/commit hash
- Build command used
- Model name and quantization
- Full error message/stack trace
- Hardware: CPU/GPU model, RAM, VRAM
- OS version
- Minimal reproduction steps