Sync all skills and memories 2026-04-14 07:27

skills/mlops/inference/gguf/references/advanced-usage.md (new file, 504 lines)

# GGUF Advanced Usage Guide

## Speculative Decoding

### Draft Model Approach

```bash
# Use a smaller model as the draft for faster generation
./llama-speculative \
    -m large-model-q4_k_m.gguf \
    -md draft-model-q4_k_m.gguf \
    -p "Write a story about AI" \
    -n 500 \
    --draft 8    # Draft tokens per verification step
```

### Self-Speculative Decoding

```bash
# Prompt-lookup decoding: the model speculates from n-grams already in its own context
./llama-lookup -m model-q4_k_m.gguf \
    --lookup-cache-static lookup.bin \
    --lookup-cache-dynamic lookup-dynamic.bin \
    -p "Hello world"
```
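
Recent versions of llama-cpp-python expose a similar idea through the `draft_model` parameter. The sketch below uses prompt-lookup decoding via `LlamaPromptLookupDecoding`; check that your installed version ships the `llama_cpp.llama_speculative` module before relying on it.

```python
# Sketch: speculative (prompt-lookup) decoding with llama-cpp-python.
# Assumes a recent llama-cpp-python that provides llama_cpp.llama_speculative.
from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

llm = Llama(
    model_path="model-q4_k_m.gguf",
    n_gpu_layers=35,
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=10),  # draft from n-grams in the prompt
)

output = llm("Write a story about AI", max_tokens=200)
print(output["choices"][0]["text"])
```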

## Batched Inference

### Process Multiple Prompts

```python
from llama_cpp import Llama

llm = Llama(
    model_path="model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,
    n_batch=512  # Larger batch for faster prompt processing
)

prompts = [
    "What is Python?",
    "Explain machine learning.",
    "Describe neural networks."
]

# Prompts are processed one at a time; n_batch only speeds up prompt ingestion.
# For true parallel batching across requests, use the server (below).
for prompt in prompts:
    output = llm(prompt, max_tokens=100)
    print(f"Q: {prompt}")
    print(f"A: {output['choices'][0]['text']}\n")
```

### Server Batching

```bash
# Start server with batching:
#   --parallel 4     = number of concurrent request slots
#   --cont-batching  = enable continuous batching
./llama-server -m model-q4_k_m.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 35 \
    -c 4096 \
    --parallel 4 \
    --cont-batching
```
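
To exercise the server's parallel slots from Python, something like the following sketch sends several prompts concurrently. It assumes the server above is running locally, that the `requests` package is installed, and that the OpenAI-compatible `/v1/completions` endpoint is available.

```python
# Sketch: concurrent requests so continuous batching can interleave them.
import concurrent.futures

import requests

URL = "http://localhost:8080/v1/completions"  # OpenAI-compatible endpoint

def complete(prompt: str) -> str:
    resp = requests.post(URL, json={"prompt": prompt, "max_tokens": 100})
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

prompts = ["What is Python?", "Explain machine learning.", "Describe neural networks."]

# Keep up to 4 requests in flight, matching --parallel 4
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    for prompt, answer in zip(prompts, pool.map(complete, prompts)):
        print(f"Q: {prompt}\nA: {answer}\n")
```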

## Custom Model Conversion

### Convert with Vocabulary Modifications

```python
# custom_convert.py
import subprocess
import sys

from transformers import AutoTokenizer

# Custom conversion with a modified vocabulary
def convert_with_custom_vocab(model_path, output_path):
    # Load and modify the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # Add special tokens if needed
    special_tokens = {"additional_special_tokens": ["<|custom|>"]}
    tokenizer.add_special_tokens(special_tokens)
    tokenizer.save_pretrained(model_path)

    # Then run the standard conversion script
    subprocess.run(
        [sys.executable, "./llama.cpp/convert_hf_to_gguf.py",
         model_path, "--outfile", output_path],
        check=True,
    )
```

### Convert Specific Architecture

```bash
# For Mistral-style models
python convert_hf_to_gguf.py ./mistral-model \
    --outfile mistral-f16.gguf \
    --outtype f16

# For Qwen models
python convert_hf_to_gguf.py ./qwen-model \
    --outfile qwen-f16.gguf \
    --outtype f16

# For Phi models
python convert_hf_to_gguf.py ./phi-model \
    --outfile phi-f16.gguf \
    --outtype f16
```

## Advanced Quantization

### Mixed Quantization

```bash
# Quantize with options that treat some tensors differently
# (options come before the positional arguments)
./llama-quantize \
    --allow-requantize \
    --leave-output-tensor \
    model-f16.gguf model-mixed.gguf Q4_K_M
```

### Quantization with Token Embeddings

```bash
# Keep token embeddings at higher precision
./llama-quantize --token-embedding-type f16 \
    model-f16.gguf model-q4.gguf Q4_K_M
```

### IQ Quantization (Importance-aware)

```bash
# Ultra-low-bit quantization guided by an importance matrix
./llama-quantize --imatrix model.imatrix \
    model-f16.gguf model-iq2_xxs.gguf IQ2_XXS

# Common IQ types: IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_XS, IQ3_S, IQ4_XS
```

## Memory Optimization

### Memory Mapping

```python
from llama_cpp import Llama

# Use memory mapping for large models
llm = Llama(
    model_path="model-q4_k_m.gguf",
    use_mmap=True,    # Memory-map the model file
    use_mlock=False,  # Don't lock it into RAM
    n_gpu_layers=35
)
```

### Partial GPU Offload

```python
# Calculate how many layers to offload based on free VRAM
import subprocess

from llama_cpp import Llama

def get_free_vram_gb():
    result = subprocess.run(
        ['nvidia-smi', '--query-gpu=memory.free', '--format=csv,nounits,noheader'],
        capture_output=True, text=True
    )
    # nvidia-smi prints one line per GPU; use the first one
    return int(result.stdout.strip().splitlines()[0]) / 1024

# Estimate layers from VRAM (rough: ~0.5 GB per layer for a 7B Q4 model)
free_vram = get_free_vram_gb()
layers_to_offload = int(free_vram / 0.5)

llm = Llama(
    model_path="model-q4_k_m.gguf",
    n_gpu_layers=min(layers_to_offload, 35)  # Cap at the model's layer count
)
```

### KV Cache Optimization

```python
from llama_cpp import Llama

# Optimize the KV cache for long contexts
llm = Llama(
    model_path="model-q4_k_m.gguf",
    n_ctx=8192,        # Large context
    n_gpu_layers=35,
    flash_attn=True,   # quantized V cache generally requires flash attention
    type_k=8,          # GGML_TYPE_Q8_0 for the K cache
    type_v=8,          # GGML_TYPE_Q8_0 for the V cache
    # Or use 2 (GGML_TYPE_Q4_0) for more compression
)
```
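
To see why this matters, a rough KV-cache size estimate is 2 tensors (K and V) x layers x context length x hidden size x bytes per element. The sketch below assumes 7B-class dimensions (32 layers, 4096 hidden size); actual sizes depend on the model's attention layout (e.g. GQA shrinks the cache).

```python
# Rough KV cache size estimate for a 7B-class model (assumed dimensions).
def kv_cache_gb(n_layers=32, n_ctx=8192, n_embd=4096, bytes_per_elem=2.0):
    # K and V each store n_layers * n_ctx * n_embd elements
    return 2 * n_layers * n_ctx * n_embd * bytes_per_elem / 1024**3

print(f"f16 KV cache:  {kv_cache_gb(bytes_per_elem=2.0):.2f} GB")
print(f"q8_0 KV cache: {kv_cache_gb(bytes_per_elem=1.0625):.2f} GB")  # ~8.5 bits/element
print(f"q4_0 KV cache: {kv_cache_gb(bytes_per_elem=0.5625):.2f} GB")  # ~4.5 bits/element
```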

## Context Management

### Context Shifting

```python
from llama_cpp import Llama

llm = Llama(
    model_path="model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35
)

# Handle long conversations by trimming old history
conversation = []
max_history = 10

def chat(user_message):
    conversation.append({"role": "user", "content": user_message})

    # Keep only recent history (slice-assign so the module-level list is mutated in place)
    if len(conversation) > max_history * 2:
        conversation[:] = conversation[-max_history * 2:]

    response = llm.create_chat_completion(
        messages=conversation,
        max_tokens=256
    )

    assistant_message = response["choices"][0]["message"]["content"]
    conversation.append({"role": "assistant", "content": assistant_message})
    return assistant_message
```
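
Trimming by message count is a rough heuristic. A sketch of token-budget trimming, assuming the `llm` and `conversation` objects from the example above and llama-cpp-python's `tokenize` method (which takes bytes):

```python
# Sketch: trim history by token budget instead of message count.
def trim_to_token_budget(messages, budget=3000):
    kept, total = [], 0
    # Walk backwards so the most recent messages are kept first
    for msg in reversed(messages):
        n_tokens = len(llm.tokenize(msg["content"].encode("utf-8")))
        if total + n_tokens > budget:
            break
        kept.append(msg)
        total += n_tokens
    return list(reversed(kept))

conversation[:] = trim_to_token_budget(conversation)
```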

### Save and Load State

```bash
# Save the evaluated prompt state to a cache file
./llama-cli -m model.gguf \
    -p "Once upon a time" \
    --prompt-cache session.bin \
    --prompt-cache-all \
    -n 100

# Reuse the cached state and continue (the new prompt must start with the cached text)
./llama-cli -m model.gguf \
    --prompt-cache session.bin \
    -p "Once upon a time and they lived" \
    -n 100
```

## Grammar Constrained Generation

### JSON Output

```python
from llama_cpp import Llama, LlamaGrammar

# Define a minimal JSON grammar (GBNF)
json_grammar = LlamaGrammar.from_string('''
root ::= object
object ::= "{" ws pair ("," ws pair)* "}" ws
pair ::= string ":" ws value
value ::= string | number | object | array | "true" | "false" | "null"
array ::= "[" ws value ("," ws value)* "]" ws
string ::= "\\"" [^"\\\\]* "\\""
number ::= [0-9]+
ws ::= [ \\t\\n]*
''')

llm = Llama(model_path="model-q4_k_m.gguf", n_gpu_layers=35)

output = llm(
    "Output a JSON object with name and age:",
    grammar=json_grammar,
    max_tokens=100
)
print(output["choices"][0]["text"])
```
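
The chat API can also constrain output to JSON without writing a grammar by hand. A sketch, assuming a reasonably recent llama-cpp-python that supports `response_format` and a `chat_format` appropriate for the model:

```python
# Sketch: JSON-constrained chat output via response_format.
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give me a JSON object with name and age."}],
    response_format={"type": "json_object"},
    max_tokens=100,
)
print(response["choices"][0]["message"]["content"])
```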

### Custom Grammar

```python
# Grammar for a specific answer format
answer_grammar = LlamaGrammar.from_string('''
root ::= "Answer: " letter "\\n" "Explanation: " explanation
letter ::= [A-D]
explanation ::= [a-zA-Z0-9 .,!?]+
''')

output = llm(
    "Q: What is 2+2? A) 3 B) 4 C) 5 D) 6",
    grammar=answer_grammar,
    max_tokens=100
)
```

## LoRA Integration

### Load LoRA Adapter

```bash
# Apply a LoRA adapter at runtime (default scale 1.0)
./llama-cli -m base-model-q4_k_m.gguf \
    --lora lora-adapter.gguf \
    -p "Hello!"
```

### Multiple LoRA Adapters

```bash
# Stack multiple adapters with per-adapter scales
./llama-cli -m base-model.gguf \
    --lora-scaled adapter1.gguf 0.5 \
    --lora-scaled adapter2.gguf 0.5 \
    -p "Hello!"
```

### Python LoRA Usage

```python
from llama_cpp import Llama

llm = Llama(
    model_path="base-model-q4_k_m.gguf",
    lora_path="lora-adapter.gguf",
    lora_scale=1.0,
    n_gpu_layers=35
)
```

## Embedding Generation

### Extract Embeddings

```python
from llama_cpp import Llama

llm = Llama(
    model_path="model-q4_k_m.gguf",
    embedding=True,  # Enable embedding mode
    n_gpu_layers=35
)

# Get embeddings
embeddings = llm.embed("This is a test sentence.")
print(f"Embedding dimension: {len(embeddings)}")
```

### Batch Embeddings

```python
import numpy as np

texts = [
    "Machine learning is fascinating.",
    "Deep learning uses neural networks.",
    "Python is a programming language."
]

embeddings = [llm.embed(text) for text in texts]

# Calculate similarity
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

sim = cosine_similarity(embeddings[0], embeddings[1])
print(f"Similarity: {sim:.4f}")
```
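
The OpenAI-style `create_embedding` call also accepts a list of inputs. A sketch, assuming the embedding-enabled `llm` and `texts` from the examples above and a llama-cpp-python version that returns the usual `data`/`embedding` response shape:

```python
# Sketch: batch embeddings via the OpenAI-style wrapper.
result = llm.create_embedding(texts)
vectors = [item["embedding"] for item in result["data"]]
print(f"Got {len(vectors)} embeddings of dimension {len(vectors[0])}")
```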

## Performance Tuning

### Benchmark Script

```python
import time

from llama_cpp import Llama

def benchmark(model_path, prompt, n_tokens=100, n_runs=5):
    llm = Llama(
        model_path=model_path,
        n_gpu_layers=35,
        n_ctx=2048,
        verbose=False
    )

    # Warmup
    llm(prompt, max_tokens=10)

    # Benchmark
    times = []
    for _ in range(n_runs):
        start = time.time()
        llm(prompt, max_tokens=n_tokens)
        elapsed = time.time() - start
        times.append(elapsed)

    avg_time = sum(times) / len(times)
    tokens_per_sec = n_tokens / avg_time

    print(f"Model: {model_path}")
    print(f"Avg time: {avg_time:.2f}s")
    print(f"Tokens/sec: {tokens_per_sec:.1f}")

    return tokens_per_sec

# Compare quantizations
for quant in ["q4_k_m", "q5_k_m", "q8_0"]:
    benchmark(f"model-{quant}.gguf", "Explain quantum computing:", 100)
```

### Optimal Configuration Finder

```python
import gc
import time

from llama_cpp import Llama

def find_optimal_config(model_path):
    """Search n_gpu_layers and n_batch for the fastest configuration that still loads."""
    best_config = None
    best_speed = 0

    for n_gpu_layers in range(0, 50, 5):
        for n_batch in [128, 256, 512, 1024]:
            try:
                gc.collect()
                llm = Llama(
                    model_path=model_path,
                    n_gpu_layers=n_gpu_layers,
                    n_batch=n_batch,
                    n_ctx=2048,
                    verbose=False
                )

                # Quick benchmark
                start = time.time()
                llm("Hello", max_tokens=50)
                speed = 50 / (time.time() - start)

                if speed > best_speed:
                    best_speed = speed
                    best_config = {
                        "n_gpu_layers": n_gpu_layers,
                        "n_batch": n_batch,
                        "speed": speed
                    }

                del llm
                gc.collect()

            except Exception:
                print(f"OOM at layers={n_gpu_layers}, batch={n_batch}")
                break

    return best_config
```

## Multi-GPU Setup

### Distribute Across GPUs

```bash
# Split the model across multiple GPUs
./llama-cli -m large-model.gguf \
    --tensor-split 0.5,0.5 \
    -ngl 60 \
    -p "Hello!"
```

### Python Multi-GPU

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

from llama_cpp import Llama

llm = Llama(
    model_path="large-model-q4_k_m.gguf",
    n_gpu_layers=60,
    tensor_split=[0.5, 0.5]  # Split evenly across 2 GPUs
)
```

## Custom Builds

### Build with All Optimizations

```bash
# Clean build with OpenBLAS CPU acceleration
make clean
GGML_OPENBLAS=1 make -j    # older releases used LLAMA_OPENBLAS=1

# With CUDA (cuBLAS)
make clean
GGML_CUDA=1 make -j        # older releases used LLAMA_CUBLAS=1

# With a specific CUDA architecture
GGML_CUDA=1 CUDA_DOCKER_ARCH=sm_86 make -j
```

### CMake Build

```bash
mkdir build && cd build
cmake .. -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release -j
```

skills/mlops/inference/gguf/references/troubleshooting.md (new file, 442 lines)

# GGUF Troubleshooting Guide

## Installation Issues

### Build Fails

**Error**: `make: *** No targets specified and no makefile found`

**Fix**:
```bash
# Ensure you're in the llama.cpp directory
cd llama.cpp
make
```

**Error**: `fatal error: cuda_runtime.h: No such file or directory`

**Fix**:
```bash
# Install the CUDA toolkit (Ubuntu)
sudo apt install nvidia-cuda-toolkit

# Or set the CUDA path
export CUDA_PATH=/usr/local/cuda
export PATH=$CUDA_PATH/bin:$PATH
make GGML_CUDA=1
```

### Python Bindings Issues

**Error**: `ERROR: Failed building wheel for llama-cpp-python`

**Fix**:
```bash
# Install build dependencies
pip install cmake scikit-build-core

# For CUDA support
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

# For Metal (macOS)
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
```

**Error**: `ImportError: libcudart.so.XX: cannot open shared object file`

**Fix**:
```bash
# Add CUDA libraries to the library path
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# Or reinstall against the correct CUDA version
pip uninstall llama-cpp-python
CUDACXX=/usr/local/cuda/bin/nvcc CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
```

## Conversion Issues

### Model Not Supported

**Error**: `KeyError: 'model.embed_tokens.weight'`

**Fix**:
```bash
# Check the model architecture
python -c "from transformers import AutoConfig; print(AutoConfig.from_pretrained('./model').architectures)"

# Use the appropriate conversion script; for most models:
python convert_hf_to_gguf.py ./model --outfile model.gguf

# For older models, check whether a legacy script is needed
```

### Vocabulary Mismatch

**Error**: `RuntimeError: Vocabulary size mismatch`

**Fix**:
```python
# Ensure the tokenizer matches the model
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("./model")
model = AutoModelForCausalLM.from_pretrained("./model")

print(f"Tokenizer vocab size: {len(tokenizer)}")
print(f"Model vocab size: {model.config.vocab_size}")

# If they differ, resize the embeddings before conversion
model.resize_token_embeddings(len(tokenizer))
model.save_pretrained("./model-fixed")
tokenizer.save_pretrained("./model-fixed")
```

### Out of Memory During Conversion

**Error**: `torch.cuda.OutOfMemoryError` during conversion

**Fix**:
```bash
# Use the CPU for conversion
CUDA_VISIBLE_DEVICES="" python convert_hf_to_gguf.py ./model --outfile model.gguf

# Or convert straight to f16 output
python convert_hf_to_gguf.py ./model --outfile model.gguf --outtype f16
```

## Quantization Issues

### Wrong Output File Size

**Problem**: Quantized file is larger than expected

**Check**:
```bash
# Verify the quantization type
./llama-cli -m model.gguf --verbose

# Expected sizes for a 7B model:
# Q4_K_M: ~4.1 GB
# Q5_K_M: ~4.8 GB
# Q8_0:   ~7.2 GB
# F16:    ~13.5 GB
```

### Quantization Crashes

**Error**: `Segmentation fault` during quantization

**Fix**:
```bash
# Increase the stack size
ulimit -s unlimited

# Or use fewer threads (thread count is the last positional argument)
./llama-quantize model-f16.gguf model-q4.gguf Q4_K_M 4
```

### Poor Quality After Quantization

**Problem**: Model outputs gibberish after quantization

**Solutions**:

1. **Use an importance matrix**:
```bash
# Generate an imatrix from good calibration data
./llama-imatrix -m model-f16.gguf \
    -f wiki_sample.txt \
    --chunks 512 \
    -o model.imatrix

# Quantize with the imatrix
./llama-quantize --imatrix model.imatrix \
    model-f16.gguf model-q4_k_m.gguf Q4_K_M
```

2. **Try higher precision**:
```bash
# Use Q5_K_M or Q6_K instead of Q4
./llama-quantize model-f16.gguf model-q5_k_m.gguf Q5_K_M
```

3. **Check the original model**:
```bash
# Test the FP16 version first
./llama-cli -m model-f16.gguf -p "Hello, how are you?" -n 50
```

## Inference Issues

### Slow Generation

**Problem**: Generation is slower than expected

**Solutions**:

1. **Enable GPU offload**:
```bash
./llama-cli -m model.gguf -ngl 35 -p "Hello"
```

2. **Optimize batch size**:
```python
llm = Llama(
    model_path="model.gguf",
    n_batch=512,  # Increase for faster prompt processing
    n_gpu_layers=35
)
```

3. **Use an appropriate thread count**:
```bash
# Match physical cores, not logical cores
./llama-cli -m model.gguf -t 8 -p "Hello"
```

4. **Enable Flash Attention** (if supported):
```bash
./llama-cli -m model.gguf -ngl 35 --flash-attn -p "Hello"
```

### Out of Memory

**Error**: `CUDA out of memory` or system freeze

**Solutions**:

1. **Reduce GPU layers**:
```python
# Start low and increase
llm = Llama(model_path="model.gguf", n_gpu_layers=10)
```

2. **Use a smaller quantization**:
```bash
./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M
```

3. **Reduce context length**:
```python
llm = Llama(
    model_path="model.gguf",
    n_ctx=2048,  # Reduce from 4096
    n_gpu_layers=35
)
```

4. **Quantize the KV cache**:
```python
llm = Llama(
    model_path="model.gguf",
    flash_attn=True,  # quantized V cache generally requires flash attention
    type_k=2,  # GGML_TYPE_Q4_0 for the K cache
    type_v=2,  # GGML_TYPE_Q4_0 for the V cache
    n_gpu_layers=35
)
```

### Garbage Output

**Problem**: Model outputs random characters or nonsense

**Diagnose**:
```python
# Check model loading
llm = Llama(model_path="model.gguf", verbose=True)

# Test with a simple prompt
output = llm("1+1=", max_tokens=5, temperature=0)
print(output)
```

**Solutions**:

1. **Check model integrity**:
```bash
# Verify the GGUF file loads cleanly
./llama-cli -m model.gguf --verbose 2>&1 | head -50
```

2. **Use the correct chat format**:
```python
llm = Llama(
    model_path="model.gguf",
    chat_format="llama-3"  # Match your model: chatml, mistral, etc.
)
```

3. **Check temperature**:
```python
# Use a lower temperature for more deterministic output
output = llm("Hello", max_tokens=50, temperature=0.1)
```

### Token Issues

**Error**: `RuntimeError: unknown token` or encoding errors

**Fix**:
```python
# Make sure prompts are valid UTF-8, e.g. when reading them from a file
with open("prompt.txt", "r", encoding="utf-8", errors="replace") as f:
    prompt = f.read()
output = llm(prompt, max_tokens=50)
```

## Server Issues

### Connection Refused

**Error**: `Connection refused` when accessing the server

**Fix**:
```bash
# Bind to all interfaces
./llama-server -m model.gguf --host 0.0.0.0 --port 8080

# Check whether the port is already in use
lsof -i :8080
```

### Server Crashes Under Load

**Problem**: Server crashes with multiple concurrent requests

**Solutions**:

1. **Limit parallelism**:
```bash
./llama-server -m model.gguf \
    --parallel 2 \
    -c 4096 \
    --cont-batching
```

2. **Add a request timeout**:
```bash
./llama-server -m model.gguf --timeout 300
```

3. **Monitor memory**:
```bash
watch -n 1 nvidia-smi   # For GPU
watch -n 1 free -h      # For RAM
```

### API Compatibility Issues

**Problem**: OpenAI client not working with the server

**Fix**:
```python
from openai import OpenAI

# Use the correct base URL format
client = OpenAI(
    base_url="http://localhost:8080/v1",  # Include /v1
    api_key="not-needed"
)

# Use the correct model name
response = client.chat.completions.create(
    model="local",  # Or the actual model name
    messages=[{"role": "user", "content": "Hello"}]
)
```

## Apple Silicon Issues

### Metal Not Working

**Problem**: Metal acceleration not enabled

**Check**:
```bash
# Verify Metal support
./llama-cli -m model.gguf --verbose 2>&1 | grep -i metal
```

**Fix**:
```bash
# Rebuild with Metal
make clean
make GGML_METAL=1

# Python bindings
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall
```

### Incorrect Memory Usage on M1/M2

**Problem**: Model uses too much unified memory

**Fix**:
```python
# Offload all layers to Metal
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=99,  # Offload everything
    n_threads=1       # Metal handles parallelism
)
```

## Debugging

### Enable Verbose Output

```bash
# CLI verbose mode
./llama-cli -m model.gguf --verbose -p "Hello" -n 50
```

```python
# Python verbose mode
llm = Llama(model_path="model.gguf", verbose=True)
```

### Check Model Metadata

```bash
# View GGUF metadata
./llama-cli -m model.gguf --verbose 2>&1 | head -100
```
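
Metadata can also be inspected from Python with the `gguf` package (the same library the conversion scripts use). A sketch, assuming `pip install gguf` and its `GGUFReader` API:

```python
# Sketch: inspect GGUF metadata with gguf-py.
from gguf import GGUFReader

reader = GGUFReader("model.gguf")
print(f"Metadata keys: {len(reader.fields)}, tensors: {len(reader.tensors)}")
for key in list(reader.fields)[:20]:  # print the first 20 metadata keys
    print(key)
```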

### Validate GGUF File

```python
import struct

def validate_gguf(filepath):
    with open(filepath, 'rb') as f:
        magic = f.read(4)
        if magic != b'GGUF':
            print(f"Invalid magic: {magic}")
            return False

        version = struct.unpack('<I', f.read(4))[0]
        print(f"GGUF version: {version}")

        tensor_count = struct.unpack('<Q', f.read(8))[0]
        metadata_count = struct.unpack('<Q', f.read(8))[0]
        print(f"Tensors: {tensor_count}, Metadata: {metadata_count}")

    return True

validate_gguf("model.gguf")
```

## Getting Help

1. **GitHub Issues**: https://github.com/ggml-org/llama.cpp/issues
2. **Discussions**: https://github.com/ggml-org/llama.cpp/discussions
3. **Reddit**: r/LocalLLaMA

### Reporting Issues

Include:

- llama.cpp version/commit hash
- Build command used
- Model name and quantization
- Full error message/stack trace
- Hardware: CPU/GPU model, RAM, VRAM
- OS version
- Minimal reproduction steps

Much of this can be collected automatically; see the sketch below.
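
A minimal sketch for gathering the environment details, using only the Python standard library (the `llama_cpp` import is optional and guarded):

```python
# Sketch: collect environment details for a bug report.
import platform
import sys

def report_environment() -> str:
    lines = [
        f"OS: {platform.platform()}",
        f"Python: {sys.version.split()[0]}",
        f"CPU: {platform.processor() or platform.machine()}",
    ]
    try:
        import llama_cpp
        lines.append(f"llama-cpp-python: {llama_cpp.__version__}")
    except ImportError:
        lines.append("llama-cpp-python: not installed")
    return "\n".join(lines)

print(report_environment())
```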