GGUF Troubleshooting Guide

Installation Issues

Build Fails

Error: make: *** No targets specified and no makefile found

Fix:

# Ensure you're in llama.cpp directory
cd llama.cpp
make

Error: fatal error: cuda_runtime.h: No such file or directory

Fix:

# Install CUDA toolkit
# Ubuntu
sudo apt install nvidia-cuda-toolkit

# Or set CUDA path
export CUDA_PATH=/usr/local/cuda
export PATH=$CUDA_PATH/bin:$PATH
make GGML_CUDA=1

Python Bindings Issues

Error: ERROR: Failed building wheel for llama-cpp-python

Fix:

# Install build dependencies
pip install cmake scikit-build-core

# For CUDA support
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

# For Metal (macOS)
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

Error: ImportError: libcudart.so.XX: cannot open shared object file

Fix:

# Add CUDA libraries to path
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# Or reinstall with correct CUDA version
pip uninstall llama-cpp-python
CUDACXX=/usr/local/cuda/bin/nvcc CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python

Conversion Issues

Model Not Supported

Error: KeyError: 'model.embed_tokens.weight'

Fix:

# Check model architecture
python -c "from transformers import AutoConfig; print(AutoConfig.from_pretrained('./model').architectures)"

# Use appropriate conversion script
# For most models:
python convert_hf_to_gguf.py ./model --outfile model.gguf

# For older models, check if legacy script needed

Vocabulary Mismatch

Error: RuntimeError: Vocabulary size mismatch

Fix:

# Ensure tokenizer matches model
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("./model")
model = AutoModelForCausalLM.from_pretrained("./model")

print(f"Tokenizer vocab size: {len(tokenizer)}")
print(f"Model vocab size: {model.config.vocab_size}")

# If mismatch, resize embeddings before conversion
model.resize_token_embeddings(len(tokenizer))
model.save_pretrained("./model-fixed")

Out of Memory During Conversion

Error: torch.cuda.OutOfMemoryError during conversion

Fix:

# Use CPU for conversion
CUDA_VISIBLE_DEVICES="" python convert_hf_to_gguf.py ./model --outfile model.gguf

# Or use low memory mode
python convert_hf_to_gguf.py ./model --outfile model.gguf --outtype f16

Quantization Issues

Wrong Output File Size

Problem: Quantized file is larger than expected

Check:

# Verify quantization type
./llama-cli -m model.gguf --verbose

# Expected sizes for 7B model:
# Q4_K_M: ~4.1 GB
# Q5_K_M: ~4.8 GB
# Q8_0: ~7.2 GB
# F16: ~13.5 GB
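The expected sizes above follow from a simple rule of thumb: file size ≈ parameter count × bits per weight / 8, plus a small metadata overhead. A minimal sketch, using approximate average bits-per-weight values for the mixed-precision K-quants (illustrative assumptions, not exact llama.cpp figures):

```python
# Rough size check: bytes ~ n_params * bits_per_weight / 8.
# The bpw values below are approximate averages (assumed for
# illustration); real GGUF files add some metadata overhead.
BPW = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q8_0": 8.5, "F16": 16.0}

def estimated_gib(n_params: float, quant: str) -> float:
    """Approximate GGUF file size in GiB for a given quantization."""
    return n_params * BPW[quant] / 8 / 1024**3

for q in BPW:
    print(f"{q}: ~{estimated_gib(7e9, q):.1f} GiB for a 7B model")
```

If your quantized file deviates from this estimate by more than roughly 20%, the wrong quantization type was probably applied.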

Quantization Crashes

Error: Segmentation fault during quantization

Fix:

# Increase stack size
ulimit -s unlimited

# Or use fewer threads (thread count is the last positional argument)
./llama-quantize model-f16.gguf model-q4.gguf Q4_K_M 4

Poor Quality After Quantization

Problem: Model outputs gibberish after quantization

Solutions:

  1. Use importance matrix:
# Generate imatrix with good calibration data
./llama-imatrix -m model-f16.gguf \
    -f wiki_sample.txt \
    --chunk 512 \
    -o model.imatrix

# Quantize with imatrix
./llama-quantize --imatrix model.imatrix \
    model-f16.gguf model-q4_k_m.gguf Q4_K_M
  2. Try higher precision:
# Use Q5_K_M or Q6_K instead of Q4
./llama-quantize model-f16.gguf model-q5_k_m.gguf Q5_K_M
  3. Check original model:
# Test FP16 version first
./llama-cli -m model-f16.gguf -p "Hello, how are you?" -n 50

Inference Issues

Slow Generation

Problem: Generation is slower than expected

Solutions:

  1. Enable GPU offload:
./llama-cli -m model.gguf -ngl 35 -p "Hello"
  2. Optimize batch size:
llm = Llama(
    model_path="model.gguf",
    n_batch=512,        # Increase for faster prompt processing
    n_gpu_layers=35
)
  3. Use appropriate threads:
# Match physical cores, not logical
./llama-cli -m model.gguf -t 8 -p "Hello"
  4. Enable Flash Attention (if supported):
./llama-cli -m model.gguf -ngl 35 --flash-attn -p "Hello"
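A quick way to pick a starting `-t` value is to halve the logical core count, which approximates physical cores on typical 2-way SMT (hyperthreaded) CPUs. This is a heuristic sketch, not a universal rule; adjust for CPUs without SMT or with heterogeneous cores:

```python
import os

def suggested_threads() -> int:
    """Heuristic -t value: logical cores / 2, assuming 2-way SMT.

    os.cpu_count() reports logical cores; halving it approximates
    physical cores on typical hyperthreaded CPUs. Adjust if your CPU
    has no SMT or a big.LITTLE layout.
    """
    logical = os.cpu_count() or 2
    return max(1, logical // 2)

print(f"Try: ./llama-cli -m model.gguf -t {suggested_threads()}")
```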

Out of Memory

Error: CUDA out of memory or system freeze

Solutions:

  1. Reduce GPU layers:
# Start low and increase
llm = Llama(model_path="model.gguf", n_gpu_layers=10)
  2. Use smaller quantization:
./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M
  3. Reduce context length:
llm = Llama(
    model_path="model.gguf",
    n_ctx=2048,  # Reduce from 4096
    n_gpu_layers=35
)
  4. Quantize KV cache:
llm = Llama(
    model_path="model.gguf",
    flash_attn=True,  # Required when quantizing the V cache
    type_k=2,    # GGML_TYPE_Q4_0 for K cache
    type_v=2,    # GGML_TYPE_Q4_0 for V cache
    n_gpu_layers=35
)
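To predict how much options 3 and 4 will save, you can estimate the KV cache size directly: one K and one V tensor per layer, each of shape context × KV heads × head dimension. A sketch, assuming a Llama-3-8B-like shape (32 layers, 8 KV heads with GQA, head dimension 128) and an approximate 4.5 bits per element for Q4_0:

```python
def kv_cache_bytes(n_layers: int, n_ctx: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: float = 2.0) -> float:
    """Approximate KV cache size: one K and one V tensor per layer."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

# Assumed Llama-3-8B-like shape: 32 layers, 8 KV heads, head_dim 128
f16 = kv_cache_bytes(32, 4096, 8, 128)          # F16: 2 bytes/elem
q4  = kv_cache_bytes(32, 4096, 8, 128, 0.5625)  # Q4_0: ~4.5 bits/elem
print(f"F16 KV cache @ 4096 ctx:  {f16 / 2**20:.0f} MiB")  # 512 MiB
print(f"Q4_0 KV cache @ 4096 ctx: {q4 / 2**20:.0f} MiB")   # 144 MiB
```

Halving `n_ctx` halves the cache; quantizing K and V to Q4_0 cuts it by roughly 3.5x.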

Garbage Output

Problem: Model outputs random characters or nonsense

Diagnose:

# Check model loading
llm = Llama(model_path="model.gguf", verbose=True)

# Test with simple prompt
output = llm("1+1=", max_tokens=5, temperature=0)
print(output)

Solutions:

  1. Check model integrity:
# Verify GGUF file
./llama-cli -m model.gguf --verbose 2>&1 | head -50
  2. Use correct chat format:
llm = Llama(
    model_path="model.gguf",
    chat_format="llama-3"  # Match your model: chatml, mistral, etc.
)
  3. Check temperature:
# Use lower temperature for deterministic output
output = llm("Hello", max_tokens=50, temperature=0.1)

Token Issues

Error: RuntimeError: unknown token or encoding errors

Fix:

# Round-trip the prompt through UTF-8, replacing invalid sequences
# (e.g. lone surrogates) instead of raising an error
prompt = "Hello, world!".encode('utf-8', errors='replace').decode('utf-8')
output = llm(prompt, max_tokens=50)

Server Issues

Connection Refused

Error: Connection refused when accessing server

Fix:

# Bind to all interfaces
./llama-server -m model.gguf --host 0.0.0.0 --port 8080

# Check if port is in use
lsof -i :8080
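If `lsof` is unavailable, the same check can be done portably from Python with a plain TCP connect. A small stdlib-only sketch:

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        # connect_ex returns 0 on a successful connection
        return s.connect_ex((host, port)) == 0

print(f"Port 8080 in use: {port_in_use(8080)}")
```

If the port is free but connections are still refused, the server is likely bound to `127.0.0.1` only; use `--host 0.0.0.0` as shown above.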

Server Crashes Under Load

Problem: Server crashes with multiple concurrent requests

Solutions:

  1. Limit parallelism:
./llama-server -m model.gguf \
    --parallel 2 \
    -c 4096 \
    --cont-batching
  2. Add request timeout:
./llama-server -m model.gguf --timeout 300
  3. Monitor memory:
watch -n 1 nvidia-smi  # For GPU
watch -n 1 free -h     # For RAM

API Compatibility Issues

Problem: OpenAI client not working with server

Fix:

from openai import OpenAI

# Use correct base URL format
client = OpenAI(
    base_url="http://localhost:8080/v1",  # Include /v1
    api_key="not-needed"
)

# Use correct model name
response = client.chat.completions.create(
    model="local",  # Or the actual model name
    messages=[{"role": "user", "content": "Hello"}]
)
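If the `openai` package is not installed, the same request can be built by hand with only the standard library; this also makes the required `/v1` path prefix explicit. A sketch (the running server at `localhost:8080` is assumed):

```python
import json
import urllib.request

# Build the chat request manually to confirm the endpoint path;
# llama-server serves OpenAI-compatible routes under the /v1 prefix.
def build_chat_request(base_url: str, prompt: str) -> urllib.request.Request:
    body = json.dumps({
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("http://localhost:8080/v1", "Hello")
print(req.full_url)  # http://localhost:8080/v1/chat/completions
# urllib.request.urlopen(req) would send it to a running server
```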

Apple Silicon Issues

Metal Not Working

Problem: Metal acceleration not enabled

Check:

# Verify Metal support
./llama-cli -m model.gguf --verbose 2>&1 | grep -i metal

Fix:

# Rebuild with Metal
make clean
make GGML_METAL=1

# Python bindings
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall

Incorrect Memory Usage on M1/M2

Problem: Model uses too much unified memory

Fix:

# Offload all layers for Metal
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=99,    # Offload everything
    n_threads=1         # Metal handles parallelism
)

Debugging

Enable Verbose Output

# CLI verbose mode
./llama-cli -m model.gguf --verbose -p "Hello" -n 50

# Python verbose
llm = Llama(model_path="model.gguf", verbose=True)

Check Model Metadata

# View GGUF metadata
./llama-cli -m model.gguf --verbose 2>&1 | head -100

Validate GGUF File

import struct

def validate_gguf(filepath):
    with open(filepath, 'rb') as f:
        magic = f.read(4)
        if magic != b'GGUF':
            print(f"Invalid magic: {magic}")
            return False

        version = struct.unpack('<I', f.read(4))[0]
        print(f"GGUF version: {version}")

        tensor_count = struct.unpack('<Q', f.read(8))[0]
        metadata_count = struct.unpack('<Q', f.read(8))[0]
        print(f"Tensors: {tensor_count}, Metadata: {metadata_count}")

        return True

validate_gguf("model.gguf")

Getting Help

  1. GitHub Issues: https://github.com/ggml-org/llama.cpp/issues
  2. Discussions: https://github.com/ggml-org/llama.cpp/discussions
  3. Reddit: r/LocalLLaMA

Reporting Issues

Include:

  • llama.cpp version/commit hash
  • Build command used
  • Model name and quantization
  • Full error message/stack trace
  • Hardware: CPU/GPU model, RAM, VRAM
  • OS version
  • Minimal reproduction steps