Sync all skills and memories 2026-04-14 07:27

2026-04-14 07:27:20 +09:00
parent 516bb44fe6
commit 1eba2bca95
386 changed files with 167655 additions and 0 deletions
--- a/skills/mlops/evaluation/DESCRIPTION.md
+++ b/skills/mlops/evaluation/DESCRIPTION.md
@@ -0,0 +1,3 @@
+---
+description: Model evaluation benchmarks, experiment tracking, data curation, tokenizers, and interpretability tools.
+---
--- a/skills/mlops/evaluation/lm-evaluation-harness/SKILL.md
+++ b/skills/mlops/evaluation/lm-evaluation-harness/SKILL.md
@@ -0,0 +1,493 @@
+---
+name: evaluating-llms-harness
+description: Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.
+version: 1.0.0
+author: Orchestra Research
+license: MIT
+dependencies: [lm-eval, transformers, vllm]
+metadata:
+  hermes:
+    tags: [Evaluation, LM Evaluation Harness, Benchmarking, MMLU, HumanEval, GSM8K, EleutherAI, Model Quality, Academic Benchmarks, Industry Standard]
+
+---
+
+# lm-evaluation-harness - LLM Benchmarking
+
+## Quick start
+
+lm-evaluation-harness evaluates LLMs across 60+ academic benchmarks using standardized prompts and metrics.
+
+**Installation**:
+```bash
+pip install lm-eval
+```
+
+**Evaluate any HuggingFace model**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf \
+  --tasks mmlu,gsm8k,hellaswag \
+  --device cuda:0 \
+  --batch_size 8
+```
+
+**View available tasks**:
+```bash
+lm_eval --tasks list
+```
+
+## Common workflows
+
+### Workflow 1: Standard benchmark evaluation
+
+Evaluate a model on core benchmarks (MMLU, GSM8K, HumanEval).
+
+Copy this checklist:
+
+```
+Benchmark Evaluation:
+- [ ] Step 1: Choose benchmark suite
+- [ ] Step 2: Configure model
+- [ ] Step 3: Run evaluation
+- [ ] Step 4: Analyze results
+```
+
+**Step 1: Choose benchmark suite**
+
+**Core reasoning benchmarks**:
+- **MMLU** (Massive Multitask Language Understanding) - 57 subjects, multiple choice
+- **GSM8K** - Grade school math word problems
+- **HellaSwag** - Common sense reasoning
+- **TruthfulQA** - Truthfulness and factuality
+- **ARC** (AI2 Reasoning Challenge) - Science questions
+
+**Code benchmarks**:
+- **HumanEval** - Python code generation (164 problems)
+- **MBPP** (Mostly Basic Python Problems) - Python coding
+
+**Standard suite** (recommended for model releases):
+```bash
+--tasks mmlu,gsm8k,hellaswag,truthfulqa,arc_challenge
+```
+
+**Step 2: Configure model**
+
+**HuggingFace model**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf,dtype=bfloat16 \
+  --tasks mmlu \
+  --device cuda:0 \
+  --batch_size auto  # Auto-detect optimal batch size
+```
+
+**Quantized model (4-bit/8-bit)**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf,load_in_4bit=True \
+  --tasks mmlu \
+  --device cuda:0
+```
+
+**Custom checkpoint**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=/path/to/my-model,tokenizer=/path/to/tokenizer \
+  --tasks mmlu \
+  --device cuda:0
+```
+
+**Step 3: Run evaluation**
+
+```bash
+# Full MMLU evaluation (57 subjects)
+# --num_fewshot 5: 5-shot evaluation (standard)
+# --log_samples: save individual predictions
+# (comments cannot follow a trailing backslash, so they are listed here)
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf \
+  --tasks mmlu \
+  --num_fewshot 5 \
+  --batch_size 8 \
+  --output_path results/ \
+  --log_samples
+
+# Multiple benchmarks at once
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf \
+  --tasks mmlu,gsm8k,hellaswag,truthfulqa,arc_challenge \
+  --num_fewshot 5 \
+  --batch_size 8 \
+  --output_path results/llama2-7b-eval.json
+```
+
+**Step 4: Analyze results**
+
+Results are saved to `results/llama2-7b-eval.json`:
+
+```json
+{
+  "results": {
+    "mmlu": {
+      "acc": 0.459,
+      "acc_stderr": 0.004
+    },
+    "gsm8k": {
+      "exact_match": 0.142,
+      "exact_match_stderr": 0.006
+    },
+    "hellaswag": {
+      "acc_norm": 0.765,
+      "acc_norm_stderr": 0.004
+    }
+  },
+  "config": {
+    "model": "hf",
+    "model_args": "pretrained=meta-llama/Llama-2-7b-hf",
+    "num_fewshot": 5
+  }
+}
+```
+
+### Workflow 2: Track training progress
+
+Evaluate checkpoints during training.
+
+```
+Training Progress Tracking:
+- [ ] Step 1: Set up periodic evaluation
+- [ ] Step 2: Choose quick benchmarks
+- [ ] Step 3: Automate evaluation
+- [ ] Step 4: Plot learning curves
+```
+
+**Step 1: Set up periodic evaluation**
+
+Evaluate every N training steps:
+
+```bash
+#!/bin/bash
+# eval_checkpoint.sh
+
+CHECKPOINT_DIR=$1
+STEP=$2
+
+# --num_fewshot 0: 0-shot for speed
+lm_eval --model hf \
+  --model_args "pretrained=$CHECKPOINT_DIR/checkpoint-$STEP" \
+  --tasks gsm8k,hellaswag \
+  --num_fewshot 0 \
+  --batch_size 16 \
+  --output_path "results/step-$STEP.json"
+```
+
+**Step 2: Choose quick benchmarks**
+
+Fast benchmarks for frequent evaluation:
+- **HellaSwag**: ~10 minutes on 1 GPU
+- **GSM8K**: ~5 minutes
+- **PIQA**: ~2 minutes
+
+Avoid for frequent eval (too slow):
+- **MMLU**: ~2 hours (57 subjects)
+- **HumanEval**: Requires code execution
+
+**Step 3: Automate evaluation**
+
+Integrate with training script:
+
+```python
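+import subprocess
+
+# In training loop
+if step % eval_interval == 0:
+    # eval_checkpoint.sh loads <dir>/checkpoint-<step>, so save under that name
+    model.save_pretrained(f"checkpoints/checkpoint-{step}")
+
+    # Run evaluation (blocks until lm_eval finishes; raises on failure)
+    subprocess.run(["./eval_checkpoint.sh", "checkpoints", str(step)], check=True)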
+```
+
+Or use PyTorch Lightning callbacks:
+
+```python
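+import os
+
+from pytorch_lightning import Callback
+
+class EvalHarnessCallback(Callback):
+    def on_validation_epoch_end(self, trainer, pl_module):
+        step = trainer.global_step
+        checkpoint_path = f"checkpoints/step-{step}"
+
+        # Save in HuggingFace format, which `pretrained=` expects.
+        # Assumes the LightningModule keeps its HF model in `pl_module.model`;
+        # trainer.save_checkpoint() would write a Lightning .ckpt file instead.
+        pl_module.model.save_pretrained(checkpoint_path)
+
+        # Run lm-eval
+        os.system(f"lm_eval --model hf --model_args pretrained={checkpoint_path} ...")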
+```
+
+**Step 4: Plot learning curves**
+
+```python
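+import glob
+import json
+import matplotlib.pyplot as plt
+
+def step_of(path):
+    # "results/step-100.json" -> 100
+    return int(path.split("-")[1].split(".")[0])
+
+# Load all results, ordered numerically by step (a plain sort is lexicographic)
+steps = []
+hellaswag_scores = []
+
+for file in sorted(glob.glob("results/step-*.json"), key=step_of):
+    with open(file) as f:
+        data = json.load(f)
+    steps.append(step_of(file))
+    # eval_checkpoint.sh runs gsm8k and hellaswag, so plot one of those
+    hellaswag_scores.append(data["results"]["hellaswag"]["acc_norm"])
+
+# Plot
+plt.plot(steps, hellaswag_scores)
+plt.xlabel("Training Step")
+plt.ylabel("HellaSwag Accuracy (normalized)")
+plt.title("Training Progress")
+plt.savefig("training_curve.png")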
+```
+
+### Workflow 3: Compare multiple models
+
+Benchmark suite for model comparison.
+
+```
+Model Comparison:
+- [ ] Step 1: Define model list
+- [ ] Step 2: Run evaluations
+- [ ] Step 3: Generate comparison table
+```
+
+**Step 1: Define model list**
+
+```bash
+# models.txt
+meta-llama/Llama-2-7b-hf
+meta-llama/Llama-2-13b-hf
+mistralai/Mistral-7B-v0.1
+microsoft/phi-2
+```
+
+**Step 2: Run evaluations**
+
+```bash
+#!/bin/bash
+# eval_all_models.sh
+
+TASKS="mmlu,gsm8k,hellaswag,truthfulqa"
+
+while read -r model; do
+    echo "Evaluating $model"
+
+    # Extract model name for output file ("/" is not valid in a file name)
+    model_name=$(echo "$model" | sed 's/\//-/g')
+
+    lm_eval --model hf \
+      --model_args "pretrained=$model,dtype=bfloat16" \
+      --tasks "$TASKS" \
+      --num_fewshot 5 \
+      --batch_size auto \
+      --output_path "results/$model_name.json"
+
+done < models.txt
+```
+
+**Step 3: Generate comparison table**
+
+```python
+import json
+import pandas as pd
+
+# Hub IDs, as listed in models.txt
+models = [
+    "meta-llama/Llama-2-7b-hf",
+    "meta-llama/Llama-2-13b-hf",
+    "mistralai/Mistral-7B-v0.1",
+    "microsoft/phi-2"
+]
+
+tasks = ["mmlu", "gsm8k", "hellaswag", "truthfulqa"]
+
+results = []
+for model in models:
+    # eval_all_models.sh replaces "/" with "-" in the output file name.
+    # Reversing that on the file name is ambiguous (IDs contain "-" too),
+    # so derive the file name from the ID instead.
+    with open(f"results/{model.replace('/', '-')}.json") as f:
+        data = json.load(f)
+    row = {"Model": model}
+    for task in tasks:
+        # Get primary metric for each task
+        metrics = data["results"][task]
+        if "acc" in metrics:
+            row[task.upper()] = f"{metrics['acc']:.3f}"
+        elif "exact_match" in metrics:
+            row[task.upper()] = f"{metrics['exact_match']:.3f}"
+    results.append(row)
+
+df = pd.DataFrame(results)
+print(df.to_markdown(index=False))  # to_markdown requires the `tabulate` package
+```
+
+Output:
+```
+| Model                  | MMLU  | GSM8K | HELLASWAG | TRUTHFULQA |
+|------------------------|-------|-------|-----------|------------|
+| meta-llama/Llama-2-7b  | 0.459 | 0.142 | 0.765     | 0.391      |
+| meta-llama/Llama-2-13b | 0.549 | 0.287 | 0.801     | 0.430      |
+| mistralai/Mistral-7B   | 0.626 | 0.395 | 0.812     | 0.428      |
+| microsoft/phi-2        | 0.560 | 0.613 | 0.682     | 0.447      |
+```
+
+### Workflow 4: Evaluate with vLLM (faster inference)
+
+Use vLLM backend for 5-10x faster evaluation.
+
+```
+vLLM Evaluation:
+- [ ] Step 1: Install vLLM
+- [ ] Step 2: Configure vLLM backend
+- [ ] Step 3: Run evaluation
+```
+
+**Step 1: Install vLLM**
+
+```bash
+pip install vllm
+```
+
+**Step 2: Configure vLLM backend**
+
+```bash
+lm_eval --model vllm \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8 \
+  --tasks mmlu \
+  --batch_size auto
+```
+
+**Step 3: Run evaluation**
+
+vLLM is 5-10× faster than standard HuggingFace:
+
+```bash
+# Standard HF: ~2 hours for MMLU on 7B model
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf \
+  --tasks mmlu \
+  --batch_size 8
+
+# vLLM: ~15-20 minutes for MMLU on 7B model
+lm_eval --model vllm \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf,tensor_parallel_size=2 \
+  --tasks mmlu \
+  --batch_size auto
+```
+
+## When to use vs alternatives
+
+**Use lm-evaluation-harness when:**
+- Benchmarking models for academic papers
+- Comparing model quality across standard tasks
+- Tracking training progress
+- Reporting standardized metrics (everyone uses same prompts)
+- Need reproducible evaluation
+
+**Use alternatives instead:**
+- **HELM** (Stanford): Broader evaluation (fairness, efficiency, calibration)
+- **AlpacaEval**: Instruction-following evaluation with LLM judges
+- **MT-Bench**: Conversational multi-turn evaluation
+- **Custom scripts**: Domain-specific evaluation
+
+## Common issues
+
+**Issue: Evaluation too slow**
+
+Use vLLM backend:
+```bash
+lm_eval --model vllm \
+  --model_args pretrained=model-name,tensor_parallel_size=2
+```
+
+Or reduce fewshot examples:
+```bash
+--num_fewshot 0  # Instead of 5
+```
+
+Or evaluate subset of MMLU:
+```bash
+--tasks mmlu_stem  # Only STEM subjects
+```
+
+**Issue: Out of memory**
+
+Reduce batch size:
+```bash
+--batch_size 1  # Or --batch_size auto
+```
+
+Use quantization:
+```bash
+--model_args pretrained=model-name,load_in_8bit=True
+```
+
+Enable CPU offloading:
+```bash
+--model_args pretrained=model-name,device_map=auto,offload_folder=offload
+```
+
+**Issue: Different results than reported**
+
+Check fewshot count:
+```bash
+--num_fewshot 5  # Most papers use 5-shot
+```
+
+Check exact task name:
+```bash
+--tasks mmlu  # Not mmlu_direct or mmlu_fewshot
+```
+
+Verify model and tokenizer match:
+```bash
+--model_args pretrained=model-name,tokenizer=same-model-name
+```
+
+**Issue: HumanEval not executing code**
+
+Install execution dependencies:
+```bash
+pip install human-eval
+```
+
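+Enable code execution:
+```bash
+export HF_ALLOW_CODE_EVAL=1  # Required by the HuggingFace code_eval metric
+lm_eval --model hf \
+  --model_args pretrained=model-name \
+  --tasks humaneval \
+  --confirm_run_unsafe_code  # Acknowledges that generated code will be executed
+```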
+
+## Advanced topics
+
+**Benchmark descriptions**: See [references/benchmark-guide.md](references/benchmark-guide.md) for detailed description of all 60+ tasks, what they measure, and interpretation.
+
+**Custom tasks**: See [references/custom-tasks.md](references/custom-tasks.md) for creating domain-specific evaluation tasks.
+
+**API evaluation**: See [references/api-evaluation.md](references/api-evaluation.md) for evaluating OpenAI, Anthropic, and other API models.
+
+**Multi-GPU strategies**: See [references/distributed-eval.md](references/distributed-eval.md) for data parallel and tensor parallel evaluation.
+
+## Hardware requirements
+
+- **GPU**: NVIDIA (CUDA 11.8+); also runs on CPU, but very slowly
+- **VRAM**:
+  - 7B model: 16GB (bf16) or 8GB (8-bit)
+  - 13B model: 28GB (bf16) or 14GB (8-bit)
+  - 70B model: Requires multi-GPU or quantization
+- **Time** (7B model, single A100):
+  - HellaSwag: 10 minutes
+  - GSM8K: 5 minutes
+  - MMLU (full): 2 hours
+  - HumanEval: 20 minutes
+
+## Resources
+
+- GitHub: https://github.com/EleutherAI/lm-evaluation-harness
+- Docs: https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs
+- Task library: 60+ tasks including MMLU, GSM8K, HumanEval, TruthfulQA, HellaSwag, ARC, WinoGrande, etc.
+- Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard (uses this harness)
+
+
+
--- a/skills/mlops/evaluation/lm-evaluation-harness/references/api-evaluation.md
+++ b/skills/mlops/evaluation/lm-evaluation-harness/references/api-evaluation.md
@@ -0,0 +1,490 @@
+# API Evaluation
+
+Guide to evaluating OpenAI, Anthropic, and other API-based language models.
+
+## Overview
+
+The lm-evaluation-harness supports evaluating API-based models through a unified `TemplateAPI` interface. This allows benchmarking of:
+- OpenAI models (GPT-4, GPT-3.5, etc.)
+- Anthropic models (Claude 3, Claude 2, etc.)
+- Local OpenAI-compatible APIs
+- Custom API endpoints
+
+**Why evaluate API models**:
+- Benchmark closed-source models
+- Compare API models to open models
+- Validate API performance
+- Track model updates over time
+
+## Supported API Models
+
+| Provider | Model Type | Request Types | Logprobs |
+|----------|------------|---------------|----------|
+| OpenAI (completions) | `openai-completions` | All | ✅ Yes |
+| OpenAI (chat) | `openai-chat-completions` | `generate_until` only | ❌ No |
+| Anthropic (completions) | `anthropic-completions` | `generate_until` only | ❌ No |
+| Anthropic (chat) | `anthropic-chat` | `generate_until` only | ❌ No |
+| Local (OpenAI-compatible) | `local-completions` | Depends on server | Varies |
+
+**Note**: Models without logprobs can only be evaluated on generation tasks, not perplexity or loglikelihood tasks.
+
+## OpenAI Models
+
+### Setup
+
+```bash
+export OPENAI_API_KEY=sk-...
+```
+
+### Completion Models (Legacy)
+
+**Available models**: `davinci-002`, `babbage-002`
+
+```bash
+lm_eval --model openai-completions \
+  --model_args model=davinci-002 \
+  --tasks lambada_openai,hellaswag \
+  --batch_size auto
+```
+
+**Supports**:
+- `generate_until`: ✅
+- `loglikelihood`: ✅
+- `loglikelihood_rolling`: ✅
+
+### Chat Models
+
+**Available models**: `gpt-4`, `gpt-4-turbo`, `gpt-3.5-turbo`
+
+```bash
+lm_eval --model openai-chat-completions \
+  --model_args model=gpt-4-turbo \
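+  --tasks gsm8k \
+  --num_fewshot 5 \
+  --batch_size auto
+```
+
+**Supports**:
+- `generate_until`: ✅
+- `loglikelihood`: ❌ (no logprobs)
+- `loglikelihood_rolling`: ❌
+
+**Important**: Chat models don't provide the logprobs the harness needs, so they can only run `generate_until` tasks such as GSM8K or HumanEval. Loglikelihood-scored tasks are not supported; this includes perplexity tasks and multiple-choice tasks in their default form (e.g. `mmlu`, `hellaswag`), which need a generative variant.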
+
+### Configuration Options
+
+```bash
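+# Keep --model_args on one line: indentation after a backslash-newline would split it into separate arguments
+lm_eval --model openai-chat-completions \
+  --model_args model=gpt-4-turbo,base_url=https://api.openai.com/v1/chat/completions,num_concurrent=5,max_retries=3,timeout=60,batch_size=auto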
+```
+
+**Parameters**:
+- `model`: Model identifier (required)
+- `base_url`: API endpoint (default: OpenAI)
+- `num_concurrent`: Concurrent requests (default: 1)
+- `max_retries`: Retry failed requests (default: 3)
+- `timeout`: Request timeout in seconds (default: 60)
+- `tokenizer`: Tokenizer to use (default: matches model)
+- `tokenizer_backend`: `"tiktoken"` or `"huggingface"`
+
+### Cost Management
+
+OpenAI charges per token. Estimate costs before running:
+
+```python
+# Rough estimate
+num_samples = 1000
+avg_tokens_per_sample = 500  # input + output
+cost_per_1k_tokens = 0.01  # Example rate in USD; check your model's current pricing
+
+total_cost = (num_samples * avg_tokens_per_sample / 1000) * cost_per_1k_tokens
+print(f"Estimated cost: ${total_cost:.2f}")
+```
+
+**Cost-saving tips**:
+- Use `--limit N` for testing
+- Start with `gpt-3.5-turbo` before `gpt-4`
+- Set `max_gen_toks` to minimum needed
+- Use `--num_fewshot 0` for zero-shot when possible
+
+## Anthropic Models
+
+### Setup
+
+```bash
+export ANTHROPIC_API_KEY=sk-ant-...
+```
+
+### Completion Models (Legacy)
+
+```bash
+lm_eval --model anthropic-completions \
+  --model_args model=claude-2.1 \
+  --tasks gsm8k \
+  --batch_size auto
+```
+
+### Chat Models (Recommended)
+
+**Available models**: `claude-3-5-sonnet-20241022`, `claude-3-opus-20240229`, `claude-3-sonnet-20240229`, `claude-3-haiku-20240307`
+
+```bash
+lm_eval --model anthropic-chat \
+  --model_args model=claude-3-5-sonnet-20241022 \
+  --tasks gsm8k \
+  --num_fewshot 5 \
+  --batch_size auto
+```
+
+**Aliases**: `anthropic-chat-completions` (same as `anthropic-chat`)
+
+### Configuration Options
+
+```bash
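+# Keep --model_args on one line: indentation after a backslash-newline would split it into separate arguments
+lm_eval --model anthropic-chat \
+  --model_args model=claude-3-5-sonnet-20241022,base_url=https://api.anthropic.com/v1/messages,num_concurrent=5,max_retries=3,timeout=60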
+```
+
+### Cost Management
+
+Anthropic pricing (as of 2024):
+- Claude 3.5 Sonnet: $3.00 / 1M input, $15.00 / 1M output
+- Claude 3 Opus: $15.00 / 1M input, $75.00 / 1M output
+- Claude 3 Haiku: $0.25 / 1M input, $1.25 / 1M output
+
+**Budget-friendly strategy**:
+```bash
+# Test on small sample first
+lm_eval --model anthropic-chat \
+  --model_args model=claude-3-haiku-20240307 \
+  --tasks gsm8k \
+  --limit 100
+
+# Then run full eval on best model
+lm_eval --model anthropic-chat \
+  --model_args model=claude-3-5-sonnet-20241022 \
+  --tasks gsm8k \
+  --num_fewshot 5
+```
+
+## Local OpenAI-Compatible APIs
+
+Many local inference servers expose OpenAI-compatible APIs (vLLM, Text Generation Inference, llama.cpp, Ollama).
+
+### vLLM Local Server
+
+**Start server**:
+```bash
+vllm serve meta-llama/Llama-2-7b-hf \
+  --host 0.0.0.0 \
+  --port 8000
+```
+
+**Evaluate**:
+```bash
+lm_eval --model local-completions \
+  --model_args model=meta-llama/Llama-2-7b-hf,base_url=http://localhost:8000/v1/completions,num_concurrent=1 \
+  --tasks mmlu,gsm8k \
+  --batch_size auto
+```
+
+### Text Generation Inference (TGI)
+
+**Start server**:
+```bash
+docker run --gpus all --shm-size 1g -p 8080:80 \
+  ghcr.io/huggingface/text-generation-inference:latest \
+  --model-id meta-llama/Llama-2-7b-hf
+```
+
+**Evaluate**:
+```bash
+lm_eval --model local-completions \
+  --model_args model=meta-llama/Llama-2-7b-hf,base_url=http://localhost:8080/v1/completions \
+  --tasks hellaswag,arc_challenge
+```
+
+### Ollama
+
+**Start server**:
+```bash
+ollama serve
+ollama pull llama2:7b
+```
+
+**Evaluate**:
+```bash
+lm_eval --model local-completions \
+  --model_args model=llama2:7b,base_url=http://localhost:11434/v1/completions \
+  --tasks mmlu
+```
+
+### llama.cpp Server
+
+**Start server**:
+```bash
+./server -m models/llama-2-7b.gguf --host 0.0.0.0 --port 8080
+```
+
+**Evaluate**:
+```bash
+lm_eval --model local-completions \
+  --model_args model=llama2,base_url=http://localhost:8080/v1/completions \
+  --tasks gsm8k
+```
+
+## Custom API Implementation
+
+For custom API endpoints, subclass `TemplateAPI`:
+
+### Create `my_api.py`
+
+```python
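+# Sketch only: method names follow TemplateAPI, but the signatures here are
+# simplified. Check lm_eval/models/api_models.py in your installed version.
+from lm_eval.models.api_models import TemplateAPI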
+
+class MyCustomAPI(TemplateAPI):
+    """Custom API model."""
+
+    def __init__(self, base_url, api_key, **kwargs):
+        super().__init__(base_url=base_url, **kwargs)
+        self.api_key = api_key
+
+    def _create_payload(self, messages, gen_kwargs):
+        """Create API request payload."""
+        return {
+            "messages": messages,
+            "api_key": self.api_key,
+            **gen_kwargs
+        }
+
+    def parse_generations(self, response):
+        """Parse generation response."""
+        return response.json()["choices"][0]["text"]
+
+    def parse_logprobs(self, response):
+        """Parse logprobs (if available)."""
+        # Return None if API doesn't provide logprobs
+        logprobs = response.json().get("logprobs")
+        if logprobs:
+            return logprobs["token_logprobs"]
+        return None
+```
+
+### Register and Use
+
+```python
+from lm_eval import evaluator
+from my_api import MyCustomAPI
+
+model = MyCustomAPI(
+    base_url="https://api.example.com/v1",
+    api_key="your-key"
+)
+
+results = evaluator.simple_evaluate(
+    model=model,
+    tasks=["mmlu", "gsm8k"],
+    num_fewshot=5,
+    batch_size="auto"
+)
+```
+
+## Comparing API and Open Models
+
+### Side-by-Side Evaluation
+
+```bash
+# Evaluate OpenAI GPT-4
+lm_eval --model openai-chat-completions \
+  --model_args model=gpt-4-turbo \
+  --tasks gsm8k \
+  --num_fewshot 5 \
+  --output_path results/gpt4.json
+
+# Evaluate open Llama 2 70B on the same generative task
+# (chat endpoints cannot run loglikelihood tasks)
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-70b-hf,dtype=bfloat16 \
+  --tasks gsm8k \
+  --num_fewshot 5 \
+  --output_path results/llama2-70b.json
+
+# Compare results
+python scripts/compare_results.py \
+  results/gpt4.json \
+  results/llama2-70b.json
+```
+
+### Typical Comparisons
+
+| Model | MMLU | GSM8K | HumanEval | Cost |
+|-------|------|-------|-----------|------|
+| GPT-4 Turbo | 86.4% | 92.0% | 67.0% | $$$$ |
+| Claude 3 Opus | 86.8% | 95.0% | 84.9% | $$$$ |
+| GPT-3.5 Turbo | 70.0% | 57.1% | 48.1% | $$ |
+| Llama 2 70B | 68.9% | 56.8% | 29.9% | Free (self-host) |
+| Mixtral 8x7B | 70.6% | 58.4% | 40.2% | Free (self-host) |
+
+## Best Practices
+
+### Rate Limiting
+
+Respect API rate limits:
+```bash
+# Lower concurrency (num_concurrent=3) and a longer timeout (timeout=120)
+lm_eval --model openai-chat-completions \
+  --model_args model=gpt-4-turbo,num_concurrent=3,timeout=120 \
+  --tasks gsm8k
+```
+
+### Reproducibility
+
+Set temperature to 0 for deterministic results:
+```bash
+lm_eval --model openai-chat-completions \
+  --model_args model=gpt-4-turbo \
+  --tasks gsm8k \
+  --gen_kwargs temperature=0.0
+```
+
+Or, when sampling, set `seed` on providers that support it (OpenAI treats it as best-effort; the Anthropic API has no seed parameter):
+```bash
+lm_eval --model openai-chat-completions \
+  --model_args model=gpt-4-turbo \
+  --tasks gsm8k \
+  --gen_kwargs temperature=0.7,seed=42
+```
+
+### Caching
+
+Pass `--use_cache <path>` to cache model responses in a local SQLite database and avoid redundant calls:
+```bash
+# First run: makes API calls
+lm_eval --model openai-chat-completions \
+  --model_args model=gpt-4-turbo \
+  --tasks gsm8k \
+  --limit 100 \
+  --use_cache cache/gpt4
+
+# Second run: reuses cached responses (no new API calls)
+lm_eval --model openai-chat-completions \
+  --model_args model=gpt-4-turbo \
+  --tasks gsm8k \
+  --limit 100 \
+  --use_cache cache/gpt4
+```
+
+Cache location: the path passed to `--use_cache`.
+
+### Error Handling
+
+APIs can fail. Use retries:
+```bash
+lm_eval --model openai-chat-completions \
+  --model_args model=gpt-4-turbo,max_retries=5,timeout=120 \
+  --tasks gsm8k
+```
+
+## Troubleshooting
+
+### "Authentication failed"
+
+Check API key:
+```bash
+echo $OPENAI_API_KEY  # Should print sk-...
+echo $ANTHROPIC_API_KEY  # Should print sk-ant-...
+```
+
+### "Rate limit exceeded"
+
+Reduce concurrency:
+```bash
+--model_args num_concurrent=1
+```
+
+Or add delays between requests.
+
+### "Timeout error"
+
+Increase timeout:
+```bash
+--model_args timeout=180
+```
+
+### "Model not found"
+
+For local APIs, verify server is running:
+```bash
+curl http://localhost:8000/v1/models
+```
+
+### Cost Runaway
+
+Use `--limit` for testing:
+```bash
+lm_eval --model openai-chat-completions \
+  --model_args model=gpt-4-turbo \
+  --tasks gsm8k \
+  --limit 50  # Only 50 samples
+```
+
+## Advanced Features
+
+### Custom Headers
+
+```bash
+lm_eval --model local-completions \
+  --model_args base_url=http://api.example.com/v1,header="Authorization: Bearer token,X-Custom: value"
+```
+
+### Disable SSL Verification (Development Only)
+
+```bash
+lm_eval --model local-completions \
+  --model_args base_url=https://localhost:8000/v1/completions,verify_certificate=false
+```
+
+### Custom Tokenizer
+
+```bash
+lm_eval --model openai-chat-completions \
+  --model_args model=gpt-4-turbo,tokenizer=gpt2,tokenizer_backend=huggingface
+```
+
+## References
+
+- OpenAI API: https://platform.openai.com/docs/api-reference
+- Anthropic API: https://docs.anthropic.com/claude/reference
+- TemplateAPI: `lm_eval/models/api_models.py`
+- OpenAI models: `lm_eval/models/openai_completions.py`
+- Anthropic models: `lm_eval/models/anthropic_llms.py`
--- a/skills/mlops/evaluation/lm-evaluation-harness/references/benchmark-guide.md
+++ b/skills/mlops/evaluation/lm-evaluation-harness/references/benchmark-guide.md
@@ -0,0 +1,488 @@
+# Benchmark Guide
+
+Complete guide to all 60+ evaluation tasks in lm-evaluation-harness, what they measure, and how to interpret results.
+
+## Overview
+
+The lm-evaluation-harness includes 60+ benchmarks spanning:
+- Language understanding (MMLU, GLUE)
+- Mathematical reasoning (GSM8K, MATH)
+- Code generation (HumanEval, MBPP)
+- Instruction following (IFEval, AlpacaEval)
+- Long-context understanding (LongBench)
+- Multilingual capabilities (AfroBench, NorEval)
+- Reasoning (BBH, ARC)
+- Truthfulness (TruthfulQA)
+
+**List all tasks**:
+```bash
+lm_eval --tasks list
+```
+
+## Major Benchmarks
+
+### MMLU (Massive Multitask Language Understanding)
+
+**What it measures**: Broad knowledge across 57 subjects (STEM, humanities, social sciences, law).
+
+**Task variants**:
+- `mmlu`: Original 57-subject benchmark
+- `mmlu_pro`: More challenging version with reasoning-focused questions
+- `mmlu_prox`: Multilingual extension
+
+**Format**: Multiple choice (4 options)
+
+**Example**:
+```
+Question: What is the capital of France?
+A. Berlin
+B. Paris
+C. London
+D. Madrid
+Answer: B
+```
+
+**Command**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf \
+  --tasks mmlu \
+  --num_fewshot 5
+```
+
+**Interpretation**:
+- Random: 25% (chance)
+- GPT-3 (175B): 43.9%
+- GPT-4: 86.4%
+- Human expert: ~90%
+
+**Good for**: Assessing general knowledge and domain expertise.
+
+### GSM8K (Grade School Math 8K)
+
+**What it measures**: Mathematical reasoning on grade-school level word problems.
+
+**Task variants**:
+- `gsm8k`: Base task
+- `gsm8k_cot`: With chain-of-thought prompting
+- `gsm_plus`: Adversarial variant with perturbations
+
+**Format**: Free-form generation, extract numerical answer
+
+**Example**:
+```
+Question: A baker made 200 cookies. He sold 3/5 of them in the morning and 1/4 of the remaining in the afternoon. How many cookies does he have left?
+Answer: 60
+```
+
+**Command**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf \
+  --tasks gsm8k \
+  --num_fewshot 5
+```
+
+**Interpretation**:
+- Random: ~0%
+- GPT-3 (175B): 17.0%
+- GPT-4: 92.0%
+- Llama 2 70B: 56.8%
+
+**Good for**: Testing multi-step reasoning and arithmetic.
+
+### HumanEval
+
+**What it measures**: Python code generation from docstrings (functional correctness).
+
+**Task variants**:
+- `humaneval`: Standard benchmark
+- `humaneval_instruct`: For instruction-tuned models
+
+**Format**: Code generation, execution-based evaluation
+
+**Example**:
+```python
+def has_close_elements(numbers: List[float], threshold: float) -> bool:
+    """ Check if in given list of numbers, are any two numbers closer to each other than
+    given threshold.
+    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
+    False
+    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
+    True
+    """
+```
+
+**Command**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=codellama/CodeLlama-7b-hf \
+  --tasks humaneval \
+  --batch_size 1
+```
+
+**Interpretation**:
+- Random: 0%
+- GPT-3 (175B): 0%
+- Codex: 28.8%
+- GPT-4: 67.0%
+- Code Llama - Python 34B: 53.7%
+
+**Good for**: Evaluating code generation capabilities.
+
+### BBH (BIG-Bench Hard)
+
+**What it measures**: 23 challenging BIG-Bench reasoning tasks on which earlier language models did not outperform the average human rater.
+
+**Categories**:
+- Logical reasoning
+- Math word problems
+- Social understanding
+- Algorithmic reasoning
+
+**Format**: Multiple choice and free-form
+
+**Command**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf \
+  --tasks bbh \
+  --num_fewshot 3
+```
+
+**Interpretation**:
+- Random: ~25%
+- GPT-3 (175B): 33.9%
+- PaLM 540B: 58.3%
+- GPT-4: 86.7%
+
+**Good for**: Testing advanced reasoning capabilities.
+
+### IFEval (Instruction-Following Evaluation)
+
+**What it measures**: Ability to follow specific, verifiable instructions.
+
+**Instruction types**:
+- Format constraints (e.g., "answer in 3 sentences")
+- Length constraints (e.g., "use at least 100 words")
+- Content constraints (e.g., "include the word 'banana'")
+- Structural constraints (e.g., "use bullet points")
+
+**Format**: Free-form generation with rule-based verification
+
+**Command**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-chat-hf \
+  --tasks ifeval \
+  --batch_size auto
+```
+
+**Interpretation**:
+- Measures: Instruction adherence (not quality)
+- GPT-4: 86% instruction following
+- Claude 2: 84%
+
+**Good for**: Evaluating chat/instruct models.
+
+### GLUE (General Language Understanding Evaluation)
+
+**What it measures**: Natural language understanding across 9 tasks.
+
+**Tasks**:
+- `cola`: Grammatical acceptability
+- `sst2`: Sentiment analysis
+- `mrpc`: Paraphrase detection
+- `qqp`: Question pairs
+- `stsb`: Semantic similarity
+- `mnli`: Natural language inference
+- `qnli`: Question answering NLI
+- `rte`: Recognizing textual entailment
+- `wnli`: Winograd schemas
+
+**Command**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=bert-base-uncased \
+  --tasks glue \
+  --num_fewshot 0
+```
+
+**Interpretation**:
+- BERT Base: 78.3 (GLUE score)
+- RoBERTa Large: 88.5
+- Human baseline: 87.1
+
+**Good for**: Encoder-only models, fine-tuning baselines.
+
+### LongBench
+
+**What it measures**: Long-context understanding (4K-32K tokens).
+
+**21 tasks covering**:
+- Single-document QA
+- Multi-document QA
+- Summarization
+- Few-shot learning
+- Code completion
+- Synthetic tasks
+
+**Command**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf \
+  --tasks longbench \
+  --batch_size 1
+```
+
+**Interpretation**:
+- Tests context utilization
+- Many models struggle beyond 4K tokens
+- GPT-4 Turbo: 54.3%
+
+**Good for**: Evaluating long-context models.
+
+## Additional Benchmarks
+
+### TruthfulQA
+
+**What it measures**: Model's propensity to be truthful vs. generate plausible-sounding falsehoods.
+
+**Format**: Multiple choice with 4-5 options
+
+**Command**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf \
+  --tasks truthfulqa_mc2 \
+  --batch_size auto
+```
+
+**Interpretation**:
+- Larger models can score worse, because they more readily reproduce common human misconceptions
+- GPT-3: 58.8%
+- GPT-4: 59.0%
+- Human: ~94%
+
+### ARC (AI2 Reasoning Challenge)
+
+**What it measures**: Grade-school science questions.
+
+**Variants**:
+- `arc_easy`: Easier questions
+- `arc_challenge`: Harder questions requiring reasoning
+
+**Command**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf \
+  --tasks arc_challenge \
+  --num_fewshot 25
+```
+
+**Interpretation**:
+- ARC-Easy: Most models >80%
+- ARC-Challenge random: 25%
+- GPT-4: 96.3%
+
+### HellaSwag
+
+**What it measures**: Commonsense reasoning about everyday situations.
+
+**Format**: Choose most plausible continuation
+
+**Command**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf \
+  --tasks hellaswag \
+  --num_fewshot 10
+```
+
+**Interpretation**:
+- Random: 25%
+- GPT-3: 78.9%
+- Llama 2 70B: 85.3%
+
+### WinoGrande
+
+**What it measures**: Commonsense reasoning via pronoun resolution.
+
+**Example**:
+```
+The trophy doesn't fit in the brown suitcase because _ is too large.
+A. the trophy
+B. the suitcase
+```
+
+**Command**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf \
+  --tasks winogrande \
+  --num_fewshot 5
+```
+
+### PIQA
+
+**What it measures**: Physical commonsense reasoning.
+
+**Example**: "To clean a keyboard, use compressed air or..."
+
+**Command**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf \
+  --tasks piqa
+```
+
+## Multilingual Benchmarks
+
+### AfroBench
+
+**What it measures**: Performance across 64 African languages.
+
+**15 tasks**: NLU, text generation, knowledge, QA, math reasoning
+
+**Command**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf \
+  --tasks afrobench
+```
+
+### NorEval
+
+**What it measures**: Norwegian language understanding (9 task categories).
+
+**Command**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=NbAiLab/nb-gpt-j-6B \
+  --tasks noreval
+```
+
+## Domain-Specific Benchmarks
+
+### MATH
+
+**What it measures**: High-school competition math problems.
+
+**Command**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf \
+  --tasks minerva_math \
+  --num_fewshot 4
+```
+
+**Interpretation**:
+- Very challenging
+- GPT-4: 42.5%
+- Minerva 540B: 33.6%
+
+### MBPP (Mostly Basic Python Problems)
+
+**What it measures**: Python programming from natural language descriptions.
+
+**Command**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=codellama/CodeLlama-7b-hf \
+  --tasks mbpp \
+  --batch_size 1
+```
+
+### DROP
+
+**What it measures**: Reading comprehension requiring discrete reasoning.
+
+**Command**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf \
+  --tasks drop
+```
+
+## Benchmark Selection Guide
+
+### For General Purpose Models
+
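+Run this suite (a single `--num_fewshot` value applies to every listed task; run tasks separately to use benchmark-specific shot counts):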
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf \
+  --tasks mmlu,gsm8k,hellaswag,arc_challenge,truthfulqa_mc2 \
+  --num_fewshot 5
+```
+
+### For Code Models
+
+```bash
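+# These tasks execute model-generated code; lm-eval may require HF_ALLOW_CODE_EVAL=1
+# and, in recent versions, the --confirm_run_unsafe_code flag
+lm_eval --model hf \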
+  --model_args pretrained=codellama/CodeLlama-7b-hf \
+  --tasks humaneval,mbpp \
+  --batch_size 1
+```
+
+### For Chat/Instruct Models
+
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-chat-hf \
+  --tasks ifeval,mmlu,gsm8k_cot \
+  --batch_size auto
+```
+
+### For Long Context Models
+
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-3.1-8B \
+  --tasks longbench \
+  --batch_size 1
+```
+
+## Interpreting Results
+
+### Understanding Metrics
+
+**Accuracy**: Percentage of correct answers (most common)
+
+**Exact Match (EM)**: Requires exact string match (strict)
+
+**F1 Score**: Harmonic mean of precision and recall (gives partial credit for overlapping answers)
+
+**BLEU/ROUGE**: Text generation similarity
+
+**Pass@k**: Fraction of problems for which at least one of k generated samples passes the tests
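+
+The F1 and pass@k definitions above can be made concrete with a short sketch. It assumes whitespace tokenization for F1 and the combinatorial pass@k estimator `1 - C(n-c, k) / C(n, k)` popularized with HumanEval; both function names are illustrative and are not part of lm-evaluation-harness.
+
+```python
+from collections import Counter
+from math import comb
+
+def token_f1(prediction: str, reference: str) -> float:
+    """Token-level F1 between two strings (lowercased, whitespace-split)."""
+    pred = prediction.lower().split()
+    ref = reference.lower().split()
+    if not pred or not ref:
+        # Both empty counts as a match; exactly one empty counts as a miss
+        return float(pred == ref)
+    overlap = sum((Counter(pred) & Counter(ref)).values())
+    if overlap == 0:
+        return 0.0
+    precision = overlap / len(pred)
+    recall = overlap / len(ref)
+    return 2 * precision * recall / (precision + recall)
+
+def pass_at_k(n: int, c: int, k: int) -> float:
+    """Estimate pass@k from n samples per problem, of which c passed."""
+    if not 0 <= c <= n or not 1 <= k <= n:
+        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
+    if n - c < k:
+        # Fewer than k failures: every size-k subset contains a passing sample
+        return 1.0
+    return 1.0 - comb(n - c, k) / comb(n, k)
+
+print(token_f1("the cat sat", "the cat"))
+print(pass_at_k(n=10, c=3, k=1))
+```
+
+Averaging `pass_at_k` over all problems gives the benchmark-level score.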
+
+### Typical Score Ranges
+
+| Model Size | MMLU | GSM8K | HumanEval | HellaSwag |
+|------------|------|-------|-----------|-----------|
+| 7B | 40-50% | 10-20% | 5-15% | 70-80% |
+| 13B | 45-55% | 20-35% | 15-25% | 75-82% |
+| 70B | 60-70% | 50-65% | 35-50% | 82-87% |
+| GPT-4 | 86% | 92% | 67% | 95% |
+
+### Red Flags
+
+- **All tasks at random chance**: Model not trained properly, or the prompt format/tokenizer does not match the checkpoint
+- **Exact 0% on generation tasks**: Likely format/parsing issue
+- **Huge variance across runs**: Check seed/sampling settings
+- **Better than GPT-4 on everything**: Likely contamination
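+
+To judge whether a score is distinguishable from random chance, compare the gap to the binomial standard error. A minimal sketch, assuming independent questions with a fixed number of answer options; the helper name and the two-standard-error threshold are illustrative choices, not harness behavior.
+
+```python
+import math
+
+def near_chance(accuracy: float, n_questions: int, n_choices: int, z: float = 2.0) -> bool:
+    """Return True if accuracy lies within z standard errors of random guessing."""
+    chance = 1.0 / n_choices
+    # Standard error of a proportion under the random-guessing hypothesis
+    stderr = math.sqrt(chance * (1.0 - chance) / n_questions)
+    return abs(accuracy - chance) <= z * stderr
+
+print(near_chance(0.26, n_questions=1000, n_choices=4))
+print(near_chance(0.40, n_questions=1000, n_choices=4))
+```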
+
+## Best Practices
+
+1. **Always report few-shot setting**: 0-shot, 5-shot, etc.
+2. **Run multiple seeds**: Report mean ± std
+3. **Check for data contamination**: Search training data for benchmark examples
+4. **Compare to published baselines**: Validate your setup
+5. **Report all hyperparameters**: Model, batch size, max tokens, temperature
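+
+Reporting mean ± std over seeds needs only the standard library. A minimal sketch; the scores below are hypothetical placeholders, not measured results.
+
+```python
+from statistics import mean, stdev
+
+def summarize(scores):
+    """Return (mean, sample standard deviation) for per-seed scores."""
+    if len(scores) < 2:
+        raise ValueError("need at least two runs to estimate a standard deviation")
+    return mean(scores), stdev(scores)
+
+# Hypothetical accuracies from three seeds
+m, s = summarize([0.452, 0.447, 0.461])
+print(f"{m:.3f} ± {s:.3f}")
+```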
+
+## References
+
+- Task list: `lm_eval --tasks list`
+- Task README: `lm_eval/tasks/README.md`
+- Papers: See individual benchmark papers
--- a/skills/mlops/evaluation/lm-evaluation-harness/references/custom-tasks.md
+++ b/skills/mlops/evaluation/lm-evaluation-harness/references/custom-tasks.md
@@ -0,0 +1,602 @@
+# Custom Tasks
+
+Complete guide to creating domain-specific evaluation tasks in lm-evaluation-harness.
+
+## Overview
+
+Custom tasks allow you to evaluate models on your own datasets and metrics. Tasks are defined using YAML configuration files with optional Python utilities for complex logic.
+
+**Why create custom tasks**:
+- Evaluate on proprietary/domain-specific data
+- Test specific capabilities not covered by existing benchmarks
+- Create evaluation pipelines for internal models
+- Reproduce research experiments
+
+## Quick Start
+
+### Minimal Custom Task
+
+Create `my_tasks/simple_qa.yaml`:
+
+```yaml
+task: simple_qa
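+dataset_path: json               # Local files are read through the HF `json` loader
+dataset_kwargs:
+  data_files:
+    test: data/simple_qa.jsonl
+test_split: test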
+output_type: generate_until
+doc_to_text: "Question: {{question}}\nAnswer:"
+doc_to_target: "{{answer}}"
+metric_list:
+  - metric: exact_match
+    aggregation: mean
+    higher_is_better: true
+```
+
+**Run it**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf \
+  --tasks simple_qa \
+  --include_path my_tasks/
+```
+
+## Task Configuration Reference
+
+### Essential Fields
+
+```yaml
+# Task identification
+task: my_custom_task           # Unique task name (required)
+task_alias: "My Task"          # Display name
+tag:                           # Tags for grouping
+  - custom
+  - domain_specific
+
+# Dataset configuration
+dataset_path: my_org/my_dataset   # Hub dataset ID (placeholder); for local files use `json`/`csv` with dataset_kwargs.data_files
+dataset_name: default             # Subset name (if applicable)
+training_split: train
+validation_split: validation
+test_split: test
+
+# Evaluation configuration
+output_type: generate_until    # or loglikelihood, multiple_choice
+num_fewshot: 5                 # Number of few-shot examples
+# Note: batch size is set on the command line (--batch_size), not in the task config
+
+# Prompt templates (Jinja2)
+doc_to_text: "Question: {{question}}"
+doc_to_target: "{{answer}}"
+
+# Metrics
+metric_list:
+  - metric: exact_match
+    aggregation: mean
+    higher_is_better: true
+
+# Metadata
+metadata:
+  version: 1.0
+```
+
+### Output Types
+
+**`generate_until`**: Free-form generation
+```yaml
+output_type: generate_until
+generation_kwargs:
+  max_gen_toks: 256
+  until:
+    - "\n"
+    - "."
+  temperature: 0.0
+```
+
+**`loglikelihood`**: Compute log probability of targets
+```yaml
+output_type: loglikelihood
+# Used for perplexity, classification
+```
+
+**`multiple_choice`**: Choose from options
+```yaml
+output_type: multiple_choice
+doc_to_choice: "{{choices}}"  # List of choices
+```
+
+## Data Formats
+
+### Local JSONL File
+
+`data/my_data.jsonl`:
+```json
+{"question": "What is 2+2?", "answer": "4"}
+{"question": "Capital of France?", "answer": "Paris"}
+```
+
+**Task config**:
+```yaml
+dataset_path: json
+dataset_kwargs:
+  data_files:
+    test: data/my_data.jsonl
+```
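+
+A JSONL file holds one JSON object per line. A minimal sketch for writing and re-reading such a file with the standard library; the helper names are illustrative, and the path in the usage comment mirrors the example above.
+
+```python
+import json
+from pathlib import Path
+
+def write_jsonl(records, path):
+    """Write one JSON object per line, creating parent directories if needed."""
+    path = Path(path)
+    path.parent.mkdir(parents=True, exist_ok=True)
+    with path.open("w", encoding="utf-8") as f:
+        for record in records:
+            f.write(json.dumps(record, ensure_ascii=False) + "\n")
+
+def read_jsonl(path):
+    """Read a JSONL file back into a list of dicts, skipping blank lines."""
+    with Path(path).open(encoding="utf-8") as f:
+        return [json.loads(line) for line in f if line.strip()]
+
+# Usage: write_jsonl([{"question": "What is 2+2?", "answer": "4"}], "data/my_data.jsonl")
+```
+
+Reading the file back before running an evaluation is a cheap way to catch malformed lines.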
+
+### HuggingFace Dataset
+
+```yaml
+dataset_path: squad
+dataset_name: plain_text
+test_split: validation
+```
+
+### CSV File
+
+`data/my_data.csv`:
+```csv
+question,answer,category
+What is 2+2?,4,math
+Capital of France?,Paris,geography
+```
+
+**Task config**:
+```yaml
+dataset_path: csv
+dataset_kwargs:
+  data_files:
+    test: data/my_data.csv
+```
+
+## Prompt Engineering
+
+### Simple Template
+
+```yaml
+doc_to_text: "Question: {{question}}\nAnswer:"
+doc_to_target: "{{answer}}"
+```
+
+### Conditional Logic
+
+```yaml
+doc_to_text: |
+  {% if context %}
+  Context: {{context}}
+  {% endif %}
+  Question: {{question}}
+  Answer:
+```
+
+### Multiple Choice
+
+```yaml
+doc_to_text: |
+  Question: {{question}}
+  A. {{choices[0]}}
+  B. {{choices[1]}}
+  C. {{choices[2]}}
+  D. {{choices[3]}}
+  Answer:
+
+doc_to_target: "{{ 'ABCD'[answer_idx] }}"
+doc_to_choice: ["A", "B", "C", "D"]
+```
+
+### Few-Shot Formatting
+
+```yaml
+fewshot_delimiter: "\n\n"        # Between examples
+target_delimiter: " "            # Between question and answer
+doc_to_text: "Q: {{question}}\nA:"
+doc_to_target: "{{answer}}"      # Keep the answer prefix in the prompt so the target is only the answer
+```
+
+## Custom Python Functions
+
+For complex logic, use Python functions in `utils.py`.
+
+### Create `my_tasks/utils.py`
+
+```python
+def process_docs(dataset):
+    """Preprocess documents."""
+    def _process(doc):
+        # Custom preprocessing
+        doc["question"] = doc["question"].strip().lower()
+        return doc
+
+    return dataset.map(_process)
+
+def doc_to_text(doc):
+    """Custom prompt formatting."""
+    context = doc.get("context", "")
+    question = doc["question"]
+
+    if context:
+        return f"Context: {context}\nQuestion: {question}\nAnswer:"
+    return f"Question: {question}\nAnswer:"
+
+def doc_to_target(doc):
+    """Custom target extraction."""
+    return doc["answer"].strip().lower()
+
+def aggregate_scores(items):
+    """Custom metric aggregation."""
+    correct = sum(1 for item in items if item == 1.0)
+    total = len(items)
+    return correct / total if total > 0 else 0.0
+```
+
+### Use in Task Config
+
+```yaml
+task: my_custom_task
+dataset_path: data/my_data.jsonl
+
+# Use Python functions
+process_docs: !function utils.process_docs
+doc_to_text: !function utils.doc_to_text
+doc_to_target: !function utils.doc_to_target
+
+metric_list:
+  - metric: exact_match
+    aggregation: !function utils.aggregate_scores
+    higher_is_better: true
+```
+
+## Real-World Examples
+
+### Example 1: Domain QA Task
+
+**Goal**: Evaluate medical question answering.
+
+`medical_qa/medical_qa.yaml`:
+```yaml
+task: medical_qa
+dataset_path: data/medical_qa.jsonl
+output_type: generate_until
+num_fewshot: 3
+
+doc_to_text: |
+  Medical Question: {{question}}
+  Context: {{context}}
+  Answer (be concise):
+
+doc_to_target: "{{answer}}"
+
+generation_kwargs:
+  max_gen_toks: 100
+  until:
+    - "\n\n"
+  temperature: 0.0
+
+metric_list:
+  - metric: exact_match
+    aggregation: mean
+    higher_is_better: true
+  - metric: !function utils.medical_f1
+    aggregation: mean
+    higher_is_better: true
+
+filter_list:
+  - name: lowercase
+    filter:
+      - function: lowercase
+      - function: remove_whitespace
+      - function: take_first     # Reduce the response list to a single string for scoring
+
+metadata:
+  version: 1.0
+  domain: medical
+```
+
+`medical_qa/utils.py`:
+```python
+import re
+
+def medical_f1(predictions, references):
+    """Custom F1 for medical terms."""
+    pred_terms = set(extract_medical_terms(predictions[0]))
+    ref_terms = set(extract_medical_terms(references[0]))
+
+    if not pred_terms and not ref_terms:
+        return 1.0
+    if not pred_terms or not ref_terms:
+        return 0.0
+
+    tp = len(pred_terms & ref_terms)
+    fp = len(pred_terms - ref_terms)
+    fn = len(ref_terms - pred_terms)
+
+    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
+    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
+
+    return 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
+
+def extract_medical_terms(text):
+    """Extract medical terminology."""
+    # Custom logic
+    return re.findall(r'\b[A-Z][a-z]+(?:[A-Z][a-z]+)*\b', text)
+```
+
+### Example 2: Code Evaluation
+
+`code_eval/python_challenges.yaml`:
+```yaml
+task: python_challenges
+dataset_path: data/python_problems.jsonl
+output_type: generate_until
+num_fewshot: 0
+
+doc_to_text: |
+  Write a Python function to solve:
+  {{problem_statement}}
+
+  Function signature:
+  {{function_signature}}
+
+doc_to_target: "{{canonical_solution}}"
+
+generation_kwargs:
+  max_gen_toks: 512
+  until:
+    - "\n\nclass"
+    - "\n\ndef"
+  temperature: 0.2
+
+metric_list:
+  - metric: !function utils.execute_code
+    aggregation: mean
+    higher_is_better: true
+
+process_results: !function utils.process_code_results
+
+metadata:
+  version: 1.0
+```
+
+`code_eval/utils.py`:
+```python
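+import json
+import subprocess
+import sys
+
+def execute_code(predictions, references):
+    """Execute generated code against test cases."""
+    generated_code = predictions[0]
+    test_cases = json.loads(references[0])
+
+    try:
+        # Every test case must pass for the sample to count as correct
+        for test_input, expected_output in test_cases:
+            result = execute_with_timeout(generated_code, test_input, timeout=5)
+            if result != str(expected_output).strip():
+                return 0.0
+        return 1.0
+    except Exception:
+        # Includes subprocess.TimeoutExpired and malformed test cases
+        return 0.0
+
+def execute_with_timeout(code, input_data, timeout=5):
+    """Run code in a child interpreter with a timeout.
+
+    Sketch only: assumes each test feeds input_data on stdin and checks stdout.
+    A child process is not a sandbox; run untrusted code in an isolated environment.
+    """
+    completed = subprocess.run(
+        [sys.executable, "-c", code],
+        input=str(input_data),
+        capture_output=True,
+        text=True,
+        timeout=timeout,
+    )
+    return completed.stdout.strip()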
+
+def process_code_results(doc, results):
+    """Process code execution results."""
+    return {
+        "passed": results[0] == 1.0,
+        "generated_code": results[1]
+    }
+```
+
+### Example 3: Instruction Following
+
+`instruction_eval/instruction_eval.yaml`:
+```yaml
+task: instruction_following
+dataset_path: data/instructions.jsonl
+output_type: generate_until
+num_fewshot: 0
+
+doc_to_text: |
+  Instruction: {{instruction}}
+  {% if constraints %}
+  Constraints: {{constraints}}
+  {% endif %}
+  Response:
+
+doc_to_target: "{{expected_response}}"
+
+generation_kwargs:
+  max_gen_toks: 256
+  do_sample: true      # Sampling must be enabled for temperature to take effect with the hf backend
+  temperature: 0.7
+
+metric_list:
+  - metric: !function utils.check_constraints
+    aggregation: mean
+    higher_is_better: true
+  - metric: !function utils.semantic_similarity
+    aggregation: mean
+    higher_is_better: true
+
+process_docs: !function utils.add_constraint_checkers
+```
+
+`instruction_eval/utils.py`:
+```python
+import json
+
+from sentence_transformers import SentenceTransformer, util
+
+model = SentenceTransformer('all-MiniLM-L6-v2')
+
+def check_constraints(predictions, references):
+    """Check if response satisfies constraints."""
+    response = predictions[0]
+    constraints = json.loads(references[0])
+
+    satisfied = 0
+    total = len(constraints)
+
+    for constraint in constraints:
+        if verify_constraint(response, constraint):
+            satisfied += 1
+
+    return satisfied / total if total > 0 else 1.0
+
+def verify_constraint(response, constraint):
+    """Verify single constraint."""
+    if constraint["type"] == "length":
+        return len(response.split()) >= constraint["min_words"]
+    elif constraint["type"] == "contains":
+        return constraint["keyword"] in response.lower()
+    # Add more constraint types
+    return True
+
+def semantic_similarity(predictions, references):
+    """Compute semantic similarity."""
+    pred_embedding = model.encode(predictions[0])
+    ref_embedding = model.encode(references[0])
+    return float(util.cos_sim(pred_embedding, ref_embedding))
+
+def add_constraint_checkers(dataset):
+    """Parse constraints into verifiable format."""
+    def _parse(doc):
+        # Parse constraint string into structured format
+        # (parse_constraints is a user-supplied helper, not shown here)
+        doc["parsed_constraints"] = parse_constraints(doc.get("constraints", ""))
+        return doc
+    return dataset.map(_parse)
+```
+
+## Advanced Features
+
+### Output Filtering
+
+```yaml
+filter_list:
+  - name: extract_answer
+    filter:
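+      - function: regex
+        regex_pattern: "Answer: (.*)"
+        group_select: 0          # Index of the regex match to keep
+      - function: lowercase
+      - function: remove_whitespace
+      - function: take_first     # Reduce the response list to a single string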
+```
+
+### Multiple Metrics
+
+```yaml
+metric_list:
+  - metric: exact_match
+    aggregation: mean
+    higher_is_better: true
+  - metric: f1
+    aggregation: mean
+    higher_is_better: true
+  - metric: bleu
+    aggregation: mean
+    higher_is_better: true
+```
+
+### Task Groups
+
+Create `my_tasks/_default.yaml`:
+```yaml
+group: my_eval_suite
+task:
+  - simple_qa
+  - medical_qa
+  - python_challenges
+```
+
+**Run entire suite**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf \
+  --tasks my_eval_suite \
+  --include_path my_tasks/
+```
+
+## Testing Your Task
+
+### Validate Configuration
+
+```bash
+# Test task loading
+lm_eval --tasks my_custom_task --include_path my_tasks/ --limit 0
+
+# Run on 5 samples
+lm_eval --model hf \
+  --model_args pretrained=gpt2 \
+  --tasks my_custom_task \
+  --include_path my_tasks/ \
+  --limit 5
+```
+
+### Debug Mode
+
+```bash
+lm_eval --model hf \
+  --model_args pretrained=gpt2 \
+  --tasks my_custom_task \
+  --include_path my_tasks/ \
+  --limit 1 \
+  --output_path results/ \
+  --log_samples  # Save input/output samples (requires --output_path)
+```
+
+## Best Practices
+
+1. **Start simple**: Test with minimal config first
+2. **Version your tasks**: Use `metadata.version`
+3. **Document your metrics**: Explain custom metrics in comments
+4. **Test with multiple models**: Ensure robustness
+5. **Validate on known examples**: Include sanity checks
+6. **Use filters carefully**: Can hide errors
+7. **Handle edge cases**: Empty strings, missing fields
+
+## Common Patterns
+
+### Classification Task
+
+```yaml
+output_type: loglikelihood
+doc_to_text: "Text: {{text}}\nLabel:"
+doc_to_target: " {{label}}"  # Space prefix important!
+metric_list:
+  - metric: acc
+    aggregation: mean
+```
+
+### Perplexity Evaluation
+
+```yaml
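+output_type: loglikelihood_rolling
+doc_to_text: ""
+doc_to_target: "{{text}}"  # Rolling log-likelihood scores the target text
+metric_list:
+  - metric: word_perplexity
+    aggregation: weighted_perplexity
+    higher_is_better: false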
+
+### Ranking Task
+
+```yaml
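+output_type: multiple_choice
+doc_to_text: "Query: {{query}}\nPassage: {{passage}}\nRelevant:"
+doc_to_choice: ["Yes", "No"]
+doc_to_target: "{{label}}"  # Index into doc_to_choice; assumes an integer `label` field (0 = Yes, 1 = No)
+metric_list:
+  - metric: acc
+    aggregation: mean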
+
+## Troubleshooting
+
+**"Task not found"**: Check `--include_path` and task name
+
+**Empty results**: Verify `doc_to_text` and `doc_to_target` templates
+
+**Metric errors**: Ensure metric names are correct (exact_match, not exact-match)
+
+**Filter issues**: Test filters with `--log_samples`
+
+**Python function not found**: Check `!function module.function_name` syntax
+
+## References
+
+- Task system: EleutherAI/lm-evaluation-harness docs
+- Example tasks: `lm_eval/tasks/` directory
+- TaskConfig: `lm_eval/api/task.py`
--- a/skills/mlops/evaluation/lm-evaluation-harness/references/distributed-eval.md
+++ b/skills/mlops/evaluation/lm-evaluation-harness/references/distributed-eval.md
@@ -0,0 +1,519 @@
+# Distributed Evaluation
+
+Guide to running evaluation across multiple GPUs using data parallelism and tensor/pipeline parallelism.
+
+## Overview
+
+Distributed evaluation speeds up benchmarking by:
+- **Data Parallelism**: Split evaluation samples across GPUs (each GPU has full model copy)
+- **Tensor Parallelism**: Split model weights across GPUs (for large models)
+- **Pipeline Parallelism**: Split model layers across GPUs (for very large models)
+
+**When to use**:
+- Data Parallel: Model fits on single GPU, want faster evaluation
+- Tensor/Pipeline Parallel: Model too large for single GPU
+
+## HuggingFace Models (`hf`)
+
+### Data Parallelism (Recommended)
+
+Each GPU loads a full copy of the model and processes a subset of evaluation data.
+
+**Single Node (8 GPUs)**:
+```bash
+accelerate launch --multi_gpu --num_processes 8 \
+  -m lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf,dtype=bfloat16 \
+  --tasks mmlu,gsm8k,hellaswag \
+  --batch_size 16
+```
+
+**Speedup**: Near-linear (8 GPUs = ~8× faster)
+
+**Memory**: Each GPU needs full model (7B model ≈ 14GB × 8 = 112GB total)
+
+### Tensor Parallelism (Model Sharding)
+
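+Split model weights across GPUs for models too large for a single GPU. With the `hf` backend, `parallelize=True` uses Accelerate's `device_map` to place whole layers on different GPUs, so the GPUs run one after another rather than simultaneously.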
+
+**Without accelerate launcher**:
+```bash
+lm_eval --model hf \
+  --model_args \
+    pretrained=meta-llama/Llama-2-70b-hf,\
+    parallelize=True,\
+    dtype=bfloat16 \
+  --tasks mmlu,gsm8k \
+  --batch_size 8
+```
+
+**With 8 GPUs**: 70B model (≈140GB of bf16 weights) / 8 ≈ 17.5GB of weights per GPU, before activation and KV-cache overhead ✅
+
+**Advanced sharding**:
+```bash
+lm_eval --model hf \
+  --model_args \
+    pretrained=meta-llama/Llama-2-70b-hf,\
+    parallelize=True,\
+    device_map_option=auto,\
+    max_memory_per_gpu=40GB,\
+    max_cpu_memory=100GB,\
+    dtype=bfloat16 \
+  --tasks mmlu
+```
+
+**Options**:
+- `device_map_option`: `"auto"` (default), `"balanced"`, `"balanced_low_0"`
+- `max_memory_per_gpu`: Max memory per GPU (e.g., `"40GB"`)
+- `max_cpu_memory`: Max CPU memory for offloading
+- `offload_folder`: Disk offloading directory
+
+### Combined Data + Tensor Parallelism
+
+Use both for very large models.
+
+**Example: 70B model on 16 GPUs (2 copies, 8 GPUs each)**:
+```bash
+accelerate launch --multi_gpu --num_processes 2 \
+  -m lm_eval --model hf \
+  --model_args \
+    pretrained=meta-llama/Llama-2-70b-hf,\
+    parallelize=True,\
+    dtype=bfloat16 \
+  --tasks mmlu \
+  --batch_size 8
+```
+
+**Result**: 2× speedup from data parallelism, 70B model fits via tensor parallelism
+
+### Configuration with `accelerate config`
+
+Create `~/.cache/huggingface/accelerate/default_config.yaml`:
+```yaml
+compute_environment: LOCAL_MACHINE
+distributed_type: MULTI_GPU
+num_machines: 1
+num_processes: 8
+gpu_ids: all
+mixed_precision: bf16
+```
+
+**Then run**:
+```bash
+accelerate launch -m lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf \
+  --tasks mmlu
+```
+
+## vLLM Models (`vllm`)
+
+vLLM provides highly optimized distributed inference.
+
+### Tensor Parallelism
+
+**Single Node (4 GPUs)**:
+```bash
+lm_eval --model vllm \
+  --model_args \
+    pretrained=meta-llama/Llama-2-70b-hf,\
+    tensor_parallel_size=4,\
+    dtype=auto,\
+    gpu_memory_utilization=0.9 \
+  --tasks mmlu,gsm8k \
+  --batch_size auto
+```
+
+**Memory**: 70B model split across 4 GPUs = ~35GB per GPU
+
+### Data Parallelism
+
+**Multiple model replicas**:
+```bash
+lm_eval --model vllm \
+  --model_args \
+    pretrained=meta-llama/Llama-2-7b-hf,\
+    data_parallel_size=4,\
+    dtype=auto,\
+    gpu_memory_utilization=0.8 \
+  --tasks hellaswag,arc_challenge \
+  --batch_size auto
+```
+
+**Result**: 4 model replicas = 4× throughput
+
+### Combined Tensor + Data Parallelism
+
+**Example: 8 GPUs = 4 TP × 2 DP**:
+```bash
+lm_eval --model vllm \
+  --model_args \
+    pretrained=meta-llama/Llama-2-70b-hf,\
+    tensor_parallel_size=4,\
+    data_parallel_size=2,\
+    dtype=auto,\
+    gpu_memory_utilization=0.85 \
+  --tasks mmlu \
+  --batch_size auto
+```
+
+**Result**: 70B model fits (TP=4), 2× speedup (DP=2)
+
+### Multi-Node vLLM
+
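+Multi-node vLLM runs on top of a Ray cluster. Start Ray on the head node (and join each worker with `ray start --address=<head-ip>:6379`) before launching the evaluation: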
+
+```bash
+# Start Ray cluster
+ray start --head --port=6379
+
+# Run evaluation
+lm_eval --model vllm \
+  --model_args \
+    pretrained=meta-llama/Llama-2-70b-hf,\
+    tensor_parallel_size=8,\
+    dtype=auto \
+  --tasks mmlu
+```
+
+## NVIDIA NeMo Models (`nemo_lm`)
+
+### Data Replication
+
+**8 replicas on 8 GPUs**:
+```bash
+torchrun --nproc-per-node=8 --no-python \
+  lm_eval --model nemo_lm \
+  --model_args \
+    path=/path/to/model.nemo,\
+    devices=8 \
+  --tasks hellaswag,arc_challenge \
+  --batch_size 32
+```
+
+**Speedup**: Near-linear (8× faster)
+
+### Tensor Parallelism
+
+**4-way tensor parallelism**:
+```bash
+torchrun --nproc-per-node=4 --no-python \
+  lm_eval --model nemo_lm \
+  --model_args \
+    path=/path/to/70b_model.nemo,\
+    devices=4,\
+    tensor_model_parallel_size=4 \
+  --tasks mmlu,gsm8k \
+  --batch_size 16
+```
+
+### Pipeline Parallelism
+
+**2 TP × 2 PP on 4 GPUs**:
+```bash
+torchrun --nproc-per-node=4 --no-python \
+  lm_eval --model nemo_lm \
+  --model_args \
+    path=/path/to/model.nemo,\
+    devices=4,\
+    tensor_model_parallel_size=2,\
+    pipeline_model_parallel_size=2 \
+  --tasks mmlu \
+  --batch_size 8
+```
+
+**Constraint**: `devices = TP × PP`
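+
+The constraint can be checked before submitting a job. A minimal sketch; `check_layout` is a hypothetical helper, not part of NeMo or lm-evaluation-harness.
+
+```python
+def check_layout(devices: int, tp: int, pp: int) -> None:
+    """Raise ValueError unless the TP x PP layout uses exactly `devices` GPUs."""
+    if tp < 1 or pp < 1:
+        raise ValueError("tp and pp must be positive")
+    if devices != tp * pp:
+        raise ValueError(f"devices={devices} but TP x PP = {tp * pp}")
+
+check_layout(devices=4, tp=2, pp=2)  # Matches the example above: 2 x 2 = 4
+```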
+
+### Multi-Node NeMo
+
+Currently not supported by lm-evaluation-harness.
+
+## SGLang Models (`sglang`)
+
+### Tensor Parallelism
+
+```bash
+lm_eval --model sglang \
+  --model_args \
+    pretrained=meta-llama/Llama-2-70b-hf,\
+    tp_size=4,\
+    dtype=auto \
+  --tasks gsm8k \
+  --batch_size auto
+```
+
+### Data Parallelism (Deprecated)
+
+**Note**: SGLang is deprecating data parallelism. Use tensor parallelism instead.
+
+```bash
+lm_eval --model sglang \
+  --model_args \
+    pretrained=meta-llama/Llama-2-7b-hf,\
+    dp_size=4,\
+    dtype=auto \
+  --tasks mmlu
+```
+
+## Performance Comparison
+
+### 70B Model Evaluation (MMLU, 5-shot)
+
+| Method | GPUs | Time | Memory/GPU | Notes |
+|--------|------|------|------------|-------|
+| HF (no parallel) | 1 | N/A | 140GB (OOM) | Won't fit |
+| HF (TP=8) | 8 | 2 hours | 17.5GB | Slower, fits |
+| HF (DP=8) | 8 | N/A | 140GB (OOM) | Won't fit |
+| vLLM (TP=4) | 4 | 30 min | 35GB | Fast! |
+| vLLM (TP=4, DP=2) | 8 | 15 min | 35GB | Fastest |
+
+### 7B Model Evaluation (Multiple Tasks)
+
+| Method | GPUs | Time | Speedup |
+|--------|------|------|---------|
+| HF (single) | 1 | 4 hours | 1× |
+| HF (DP=4) | 4 | 1 hour | 4× |
+| HF (DP=8) | 8 | 30 min | 8× |
+| vLLM (DP=8) | 8 | 15 min | 16× |
+
+**Takeaway**: vLLM is significantly faster than HuggingFace for inference.
+
+## Choosing Parallelism Strategy
+
+### Decision Tree
+
+```
+Model fits on single GPU?
+├─ YES: Use data parallelism
+│   ├─ HF: accelerate launch --multi_gpu --num_processes N
+│   └─ vLLM: data_parallel_size=N (fastest)
+│
+└─ NO: Use tensor/pipeline parallelism
+    ├─ Model < 70B:
+    │   └─ vLLM: tensor_parallel_size=4
+    ├─ Model 70-175B:
+    │   ├─ vLLM: tensor_parallel_size=8
+    │   └─ Or HF: parallelize=True
+    └─ Model > 175B:
+        └─ Contact framework authors
+```
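+
+The decision tree above can be written as a small function for use in launch scripts. A sketch that only mirrors the thresholds stated in the tree; the function name and return strings are illustrative.
+
+```python
+def choose_strategy(fits_on_one_gpu: bool, params_b: float, n_gpus: int) -> str:
+    """Return a short recommendation following the decision tree."""
+    if fits_on_one_gpu:
+        return (
+            f"data parallelism (vLLM data_parallel_size={n_gpus} "
+            f"or accelerate launch --num_processes {n_gpus})"
+        )
+    if params_b < 70:
+        return "tensor parallelism (vLLM tensor_parallel_size=4)"
+    if params_b <= 175:
+        return "tensor parallelism (vLLM tensor_parallel_size=8, or HF parallelize=True)"
+    return "specialized setup"
+
+print(choose_strategy(fits_on_one_gpu=False, params_b=70, n_gpus=8))
+```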
+
+### Memory Estimation
+
+**Rule of thumb**:
+```
+Memory (GB) = Parameters (B) × Precision (bytes) × 1.2 (overhead)
+```
+
+**Examples**:
+- 7B FP16: 7 × 2 × 1.2 = 16.8GB ✅ Fits A100 40GB
+- 13B FP16: 13 × 2 × 1.2 = 31.2GB ✅ Fits A100 40GB
+- 70B FP16: 70 × 2 × 1.2 = 168GB ❌ Need TP=4 or TP=8
+- 70B BF16: 70 × 2 × 1.2 = 168GB (same as FP16)
+
+**With tensor parallelism**:
+```
+Memory per GPU = Total Memory / TP
+```
+
+- 70B on 4 GPUs: 168GB / 4 = 42GB per GPU ✅ on 80GB GPUs (❌ exceeds an A100 40GB)
+- 70B on 8 GPUs: 168GB / 8 = 21GB per GPU ✅
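+
+The rule of thumb and the per-GPU division can be combined into a quick estimator. A minimal sketch, assuming the 1.2× overhead factor from the rule above and power-of-two tensor-parallel degrees; the names are illustrative, and real usage also depends on batch size and sequence length.
+
+```python
+BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}
+
+def estimate_memory_gb(params_b: float, dtype: str = "bf16", overhead: float = 1.2) -> float:
+    """Rule-of-thumb memory in GB: parameters (billions) x bytes per parameter x overhead."""
+    return params_b * BYTES_PER_PARAM[dtype] * overhead
+
+def min_tensor_parallel(params_b: float, gpu_gb: float, dtype: str = "bf16", overhead: float = 1.2) -> int:
+    """Smallest power-of-two TP degree whose per-GPU share fits in gpu_gb."""
+    total = estimate_memory_gb(params_b, dtype, overhead)
+    tp = 1
+    while total / tp > gpu_gb:
+        tp *= 2
+    return tp
+
+print(estimate_memory_gb(7))                 # 7 x 2 x 1.2
+print(min_tensor_parallel(70, gpu_gb=40))    # 40GB GPUs
+print(min_tensor_parallel(70, gpu_gb=80))    # 80GB GPUs
+```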
+
+## Multi-Node Evaluation
+
+### HuggingFace with SLURM
+
+**Submit job**:
+```bash
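+#!/bin/bash
+#SBATCH --nodes=4
+#SBATCH --gpus-per-node=8
+#SBATCH --ntasks-per-node=1
+
+# First node in the allocation hosts the rendezvous endpoint
+MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
+
+# One launcher per node; \$SLURM_NODEID is expanded inside each srun task
+srun bash -c "accelerate launch --multi_gpu \
+  --num_machines $SLURM_NNODES \
+  --num_processes $((SLURM_NNODES * 8)) \
+  --machine_rank \$SLURM_NODEID \
+  --main_process_ip $MASTER_ADDR \
+  --main_process_port 29500 \
+  -m lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf \
+  --tasks mmlu,gsm8k,hellaswag \
+  --batch_size 16"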
+```
+
+**Submit**:
+```bash
+sbatch eval_job.sh
+```
+
+### Manual Multi-Node Setup
+
+**On each node, run**:
+```bash
+accelerate launch \
+  --multi_gpu \
+  --num_machines 4 \
+  --num_processes 32 \
+  --main_process_ip $MASTER_IP \
+  --main_process_port 29500 \
+  --machine_rank $NODE_RANK \
+  -m lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf \
+  --tasks mmlu
+```
+
+**Environment variables**:
+- `MASTER_IP`: IP of rank 0 node
+- `NODE_RANK`: 0, 1, 2, 3 for each node
+
+## Best Practices
+
+### 1. Start Small
+
+Test on small sample first:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-70b-hf,parallelize=True \
+  --tasks mmlu \
+  --limit 100  # Just 100 samples
+```
+
+### 2. Monitor GPU Usage
+
+```bash
+# Terminal 1: Run evaluation
+lm_eval --model hf ...
+
+# Terminal 2: Monitor
+watch -n 1 nvidia-smi
+```
+
+Look for:
+- GPU utilization > 90%
+- Memory usage stable
+- All GPUs active
+
+### 3. Optimize Batch Size
+
+```bash
+# Auto batch size (recommended)
+--batch_size auto
+
+# Or tune manually
+--batch_size 16  # Start here
+--batch_size 32  # Increase if memory allows
+```
+
+### 4. Use Mixed Precision
+
+```bash
+--model_args dtype=bfloat16  # Faster, less memory
+```
+
+### 5. Check Communication
+
+For tensor or pipeline parallelism, check the GPU interconnect (data-parallel evaluation exchanges little data between ranks):
+```bash
+# NV# entries indicate NVLink; PIX/PXB/PHB/SYS indicate PCIe or cross-socket paths
+nvidia-smi topo -m
+```
+
+## Troubleshooting
+
+### "CUDA out of memory"
+
+**Solutions**:
+1. Increase tensor parallelism:
+   ```bash
+   --model_args tensor_parallel_size=8  # Was 4
+   ```
+
+2. Reduce batch size:
+   ```bash
+   --batch_size 4  # Was 16
+   ```
+
+3. Lower precision:
+   ```bash
+   --model_args load_in_8bit=True  # 8-bit quantization via bitsandbytes (hf backend)
+   ```
+
+### "NCCL error" or Hanging
+
+**Check**:
+1. All GPUs visible: `nvidia-smi`
+2. NCCL installed: `python -c "import torch; print(torch.cuda.nccl.version())"`
+3. Network connectivity between nodes
+
+**Fix**:
+```bash
+export NCCL_DEBUG=INFO  # Enable debug logging
+export NCCL_IB_DISABLE=0  # Use InfiniBand if available
+```
+
+### Slow Evaluation
+
+**Possible causes**:
+1. **Data loading bottleneck**: Preprocess dataset
+2. **Low GPU utilization**: Increase batch size
+3. **Communication overhead**: Reduce parallelism degree
+
+**Profile**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf \
+  --tasks mmlu \
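+  --limit 100 \
+  --output_path results/ \
+  --log_samples  # Inspect per-sample outputs; prefix the command with `time` to measure wall-clock time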
+```
+
+### GPUs Imbalanced
+
+**Symptom**: GPU 0 at 100%, others at 50%
+
+**Solution**: Use `device_map_option=balanced`:
+```bash
+--model_args parallelize=True,device_map_option=balanced
+```
+
+## Example Configurations
+
+### Small Model (7B) - Fast Evaluation
+
+```bash
+# 8 A100s, data parallel
+accelerate launch --multi_gpu --num_processes 8 \
+  -m lm_eval --model hf \
+  --model_args \
+    pretrained=meta-llama/Llama-2-7b-hf,\
+    dtype=bfloat16 \
+  --tasks mmlu,gsm8k,hellaswag,arc_challenge \
+  --num_fewshot 5 \
+  --batch_size 32
+
+# Time: ~30 minutes
+```
+
+### Large Model (70B) - vLLM
+
+```bash
+# 8 H100s, tensor parallel
+lm_eval --model vllm \
+  --model_args \
+    pretrained=meta-llama/Llama-2-70b-hf,\
+    tensor_parallel_size=8,\
+    dtype=auto,\
+    gpu_memory_utilization=0.9 \
+  --tasks mmlu,gsm8k,humaneval \
+  --num_fewshot 5 \
+  --batch_size auto
+
+# Time: ~1 hour
+```
+
+### Very Large Model (175B+)
+
+**Requires specialized setup - contact framework maintainers**
+
+## References
+
+- HuggingFace Accelerate: https://huggingface.co/docs/accelerate/
+- vLLM docs: https://docs.vllm.ai/
+- NeMo docs: https://docs.nvidia.com/nemo-framework/
+- lm-eval distributed guide: `docs/model_guide.md`
--- a/skills/mlops/evaluation/weights-and-biases/SKILL.md
+++ b/skills/mlops/evaluation/weights-and-biases/SKILL.md
@@ -0,0 +1,593 @@
+---
+name: weights-and-biases
+description: Track ML experiments with automatic logging, visualize training in real-time, optimize hyperparameters with sweeps, and manage model registry with W&B - collaborative MLOps platform
+version: 1.0.0
+author: Orchestra Research
+license: MIT
+dependencies: [wandb]
+metadata:
+  hermes:
+    tags: [MLOps, Weights And Biases, WandB, Experiment Tracking, Hyperparameter Tuning, Model Registry, Collaboration, Real-Time Visualization, PyTorch, TensorFlow, HuggingFace]
+
+---
+
+# Weights & Biases: ML Experiment Tracking & MLOps
+
+## When to Use This Skill
+
+Use Weights & Biases (W&B) when you need to:
+- **Track ML experiments** with automatic metric logging
+- **Visualize training** in real-time dashboards
+- **Compare runs** across hyperparameters and configurations
+- **Optimize hyperparameters** with automated sweeps
+- **Manage model registry** with versioning and lineage
+- **Collaborate on ML projects** with team workspaces
+- **Track artifacts** (datasets, models, code) with lineage
+
+**Users**: 200,000+ ML practitioners | **GitHub Stars**: 10.5k+ | **Integrations**: 100+
+
+## Installation
+
+```bash
+# Install W&B
+pip install wandb
+
+# Log in (prompts for your API key)
+wandb login
+
+# Or set the API key via an environment variable
+export WANDB_API_KEY=your_api_key_here
+```
+
+## Quick Start
+
+### Basic Experiment Tracking
+
+```python
+import wandb
+
+# Initialize a run
+run = wandb.init(
+    project="my-project",
+    config={
+        "learning_rate": 0.001,
+        "epochs": 10,
+        "batch_size": 32,
+        "architecture": "ResNet50"
+    }
+)
+
+# Training loop
+for epoch in range(run.config.epochs):
+    # Your training code (train_epoch/validate are placeholders returning loss and accuracy)
+    train_loss, train_acc = train_epoch()
+    val_loss, val_acc = validate()
+
+    # Log metrics
+    wandb.log({
+        "epoch": epoch,
+        "train/loss": train_loss,
+        "val/loss": val_loss,
+        "train/accuracy": train_acc,
+        "val/accuracy": val_acc
+    })
+
+# Finish the run
+wandb.finish()
+```
+
+### With PyTorch
+
+```python
+import torch
+import wandb
+
+# Initialize
+wandb.init(project="pytorch-demo", config={
+    "lr": 0.001,
+    "epochs": 10
+})
+
+# Access config
+config = wandb.config
+
+# Training loop
+for epoch in range(config.epochs):
+    for batch_idx, (data, target) in enumerate(train_loader):
+        # Forward pass
+        output = model(data)
+        loss = criterion(output, target)
+
+        # Backward pass
+        optimizer.zero_grad()
+        loss.backward()
+        optimizer.step()
+
+        # Log every 100 batches
+        if batch_idx % 100 == 0:
+            wandb.log({
+                "loss": loss.item(),
+                "epoch": epoch,
+                "batch": batch_idx
+            })
+
+# Save model
+torch.save(model.state_dict(), "model.pth")
+wandb.save("model.pth")  # Upload to W&B
+
+wandb.finish()
+```
+
+## Core Concepts
+
+### 1. Projects and Runs
+
+**Project**: Collection of related experiments
+**Run**: Single execution of your training script
+
+```python
+# Create/use project
+run = wandb.init(
+    project="image-classification",
+    name="resnet50-experiment-1",  # Optional run name
+    tags=["baseline", "resnet"],    # Organize with tags
+    notes="First baseline run"      # Add notes
+)
+
+# Each run has unique ID
+print(f"Run ID: {run.id}")
+print(f"Run URL: {run.url}")
+```
+
+### 2. Configuration Tracking
+
+Track hyperparameters automatically:
+
+```python
+config = {
+    # Model architecture
+    "model": "ResNet50",
+    "pretrained": True,
+
+    # Training params
+    "learning_rate": 0.001,
+    "batch_size": 32,
+    "epochs": 50,
+    "optimizer": "Adam",
+
+    # Data params
+    "dataset": "ImageNet",
+    "augmentation": "standard"
+}
+
+wandb.init(project="my-project", config=config)
+
+# Access config during training
+lr = wandb.config.learning_rate
+batch_size = wandb.config.batch_size
+```
+
+### 3. Metric Logging
+
+```python
+# Log scalars
+wandb.log({"loss": 0.5, "accuracy": 0.92})
+
+# Log multiple metrics
+wandb.log({
+    "train/loss": train_loss,
+    "train/accuracy": train_acc,
+    "val/loss": val_loss,
+    "val/accuracy": val_acc,
+    "learning_rate": current_lr,
+    "epoch": epoch
+})
+
+# Log with custom x-axis
+wandb.log({"loss": loss}, step=global_step)
+
+# Log media (images, audio, video)
+wandb.log({"examples": [wandb.Image(img) for img in images]})
+
+# Log histograms
+wandb.log({"gradients": wandb.Histogram(gradients)})
+
+# Log tables
+table = wandb.Table(columns=["id", "prediction", "ground_truth"])
+wandb.log({"predictions": table})
+```
+
+### 4. Model Checkpointing
+
+```python
+import torch
+import wandb
+
+# Save model checkpoint
+checkpoint = {
+    'epoch': epoch,
+    'model_state_dict': model.state_dict(),
+    'optimizer_state_dict': optimizer.state_dict(),
+    'loss': loss,
+}
+
+torch.save(checkpoint, 'checkpoint.pth')
+
+# Upload to W&B
+wandb.save('checkpoint.pth')
+
+# Or use Artifacts (recommended)
+artifact = wandb.Artifact('model', type='model')
+artifact.add_file('checkpoint.pth')
+wandb.log_artifact(artifact)
+```
+
+## Hyperparameter Sweeps
+
+Automatically search for optimal hyperparameters.
+
+### Define Sweep Configuration
+
+```python
+sweep_config = {
+    'method': 'bayes',  # or 'grid', 'random'
+    'metric': {
+        'name': 'val/accuracy',
+        'goal': 'maximize'
+    },
+    'parameters': {
+        'learning_rate': {
+            'distribution': 'log_uniform_values',  # min/max are actual values; 'log_uniform' expects exponents
+            'min': 1e-5,
+            'max': 1e-1
+        },
+        'batch_size': {
+            'values': [16, 32, 64, 128]
+        },
+        'optimizer': {
+            'values': ['adam', 'sgd', 'rmsprop']
+        },
+        'dropout': {
+            'distribution': 'uniform',
+            'min': 0.1,
+            'max': 0.5
+        }
+    }
+}
+
+# Initialize sweep
+sweep_id = wandb.sweep(sweep_config, project="my-project")
+```
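+
+The two log-uniform options differ in how `min` and `max` are read: `log_uniform` treats them as natural-log exponents and returns a value between `exp(min)` and `exp(max)`, while `log_uniform_values` treats them as the actual bounds. The sketch below reproduces both rules in plain Python to show what bounds of `1e-5` and `1e-1` produce under each one. It makes no `wandb` calls, and the function names are hypothetical.
+
+```python
+import math
+import random
+
+
+def sample_log_uniform_values(lo, hi, rng):
+    """Value between lo and hi whose logarithm is uniformly distributed."""
+    return math.exp(rng.uniform(math.log(lo), math.log(hi)))
+
+
+def sample_log_uniform(lo_exp, hi_exp, rng):
+    """Value between exp(lo_exp) and exp(hi_exp); the bounds are exponents."""
+    return math.exp(rng.uniform(lo_exp, hi_exp))
+
+
+rng = random.Random(0)  # fixed seed so the run is repeatable
+values = [sample_log_uniform_values(1e-5, 1e-1, rng) for _ in range(1000)]
+exponents = [sample_log_uniform(1e-5, 1e-1, rng) for _ in range(1000)]
+
+print("bounds as values:   ", min(values), max(values))
+print("bounds as exponents:", min(exponents), max(exponents))
+```
+
+With these bounds the first rule spans four orders of magnitude, while the second stays between 1.0 and roughly 1.105, which is rarely what a learning-rate search intends.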
+
+### Define Training Function
+
+```python
+def train():
+    # Initialize run
+    run = wandb.init()
+
+    # Access sweep parameters
+    lr = wandb.config.learning_rate
+    batch_size = wandb.config.batch_size
+    optimizer_name = wandb.config.optimizer
+
+    # Build model with sweep config
+    model = build_model(wandb.config)
+    optimizer = get_optimizer(optimizer_name, lr)
+
+    # Training loop
+    for epoch in range(NUM_EPOCHS):
+        train_loss = train_epoch(model, optimizer, batch_size)
+        val_acc = validate(model)
+
+        # Log metrics
+        wandb.log({
+            "train/loss": train_loss,
+            "val/accuracy": val_acc
+        })
+
+# Run sweep
+wandb.agent(sweep_id, function=train, count=50)  # Run 50 trials
+```
+
+### Sweep Strategies
+
+```python
+# Grid search - exhaustive
+sweep_config = {
+    'method': 'grid',
+    'parameters': {
+        'lr': {'values': [0.001, 0.01, 0.1]},
+        'batch_size': {'values': [16, 32, 64]}
+    }
+}
+
+# Random search
+sweep_config = {
+    'method': 'random',
+    'parameters': {
+        'lr': {'distribution': 'uniform', 'min': 0.0001, 'max': 0.1},
+        'dropout': {'distribution': 'uniform', 'min': 0.1, 'max': 0.5}
+    }
+}
+
+# Bayesian optimization (recommended)
+sweep_config = {
+    'method': 'bayes',
+    'metric': {'name': 'val/loss', 'goal': 'minimize'},
+    'parameters': {
+        'lr': {'distribution': 'log_uniform_values', 'min': 1e-5, 'max': 1e-1}
+    }
+}
+```
+
+## Artifacts
+
+Track datasets, models, and other files with lineage.
+
+### Log Artifacts
+
+```python
+# Create artifact
+artifact = wandb.Artifact(
+    name='training-dataset',
+    type='dataset',
+    description='ImageNet training split',
+    metadata={'size': '1.2M images', 'split': 'train'}
+)
+
+# Add files
+artifact.add_file('data/train.csv')
+artifact.add_dir('data/images/')
+
+# Log artifact
+wandb.log_artifact(artifact)
+```
+
+### Use Artifacts
+
+```python
+# Download and use artifact
+run = wandb.init(project="my-project")
+
+# Download artifact
+artifact = run.use_artifact('training-dataset:latest')
+artifact_dir = artifact.download()
+
+# Use the data
+data = load_data(f"{artifact_dir}/train.csv")
+```
+
+### Model Registry
+
+```python
+run = wandb.init(project="my-project")
+
+# Log model as artifact
+model_artifact = wandb.Artifact(
+    name='resnet50-model',
+    type='model',
+    metadata={'architecture': 'ResNet50', 'accuracy': 0.95}
+)
+
+model_artifact.add_file('model.pth')
+wandb.log_artifact(model_artifact, aliases=['best', 'production'])
+
+# Link to model registry
+run.link_artifact(model_artifact, 'model-registry/production-models')
+```
+
+## Integration Examples
+
+### HuggingFace Transformers
+
+```python
+from transformers import Trainer, TrainingArguments
+import wandb
+
+# Initialize W&B
+wandb.init(project="hf-transformers")
+
+# Training arguments with W&B
+training_args = TrainingArguments(
+    output_dir="./results",
+    report_to="wandb",  # Enable W&B logging
+    run_name="bert-finetuning",
+    logging_steps=100,
+    save_steps=500
+)
+
+# Trainer automatically logs to W&B
+trainer = Trainer(
+    model=model,
+    args=training_args,
+    train_dataset=train_dataset,
+    eval_dataset=eval_dataset
+)
+
+trainer.train()
+```
+
+### PyTorch Lightning
+
+```python
+from pytorch_lightning import Trainer
+from pytorch_lightning.loggers import WandbLogger
+import wandb
+
+# Create W&B logger
+wandb_logger = WandbLogger(
+    project="lightning-demo",
+    log_model=True  # Log model checkpoints
+)
+
+# Use with Trainer
+trainer = Trainer(
+    logger=wandb_logger,
+    max_epochs=10
+)
+
+trainer.fit(model, datamodule=dm)
+```
+
+### Keras/TensorFlow
+
+```python
+import wandb
+# wandb.keras.WandbCallback is the legacy API; WandbMetricsLogger is its replacement
+from wandb.integration.keras import WandbMetricsLogger
+
+# Initialize
+wandb.init(project="keras-demo")
+
+# Add callback
+model.fit(
+    x_train, y_train,
+    validation_data=(x_val, y_val),
+    epochs=10,
+    callbacks=[WandbMetricsLogger()]  # Auto-logs metrics
+)
+```
+
+## Visualization & Analysis
+
+### Custom Charts
+
+```python
+# Log custom visualizations
+import matplotlib.pyplot as plt
+
+fig, ax = plt.subplots()
+ax.plot(x, y)
+wandb.log({"custom_plot": wandb.Image(fig)})
+
+# Log confusion matrix
+wandb.log({"conf_mat": wandb.plot.confusion_matrix(
+    probs=None,
+    y_true=ground_truth,
+    preds=predictions,
+    class_names=class_names
+)})
+```
+
+### Reports
+
+Create shareable reports in W&B UI:
+- Combine runs, charts, and text
+- Markdown support
+- Embeddable visualizations
+- Team collaboration
+
+## Best Practices
+
+### 1. Organize with Tags and Groups
+
+```python
+wandb.init(
+    project="my-project",
+    tags=["baseline", "resnet50", "imagenet"],
+    group="resnet-experiments",  # Group related runs
+    job_type="train"             # Type of job
+)
+```
+
+### 2. Log Everything Relevant
+
+```python
+# Log extra system metrics (W&B already records basic GPU/CPU stats automatically)
+wandb.log({
+    "gpu/util": gpu_utilization,
+    "gpu/memory": gpu_memory_used,
+    "cpu/util": cpu_utilization
+})
+
+# Record static, per-run facts in the config rather than as time-series metrics
+wandb.config.update({
+    "git_commit": git_commit_hash,
+    "data/train_size": len(train_dataset),
+    "data/val_size": len(val_dataset)
+})
+```
+
+### 3. Use Descriptive Names
+
+```python
+# ✅ Good: Descriptive run names
+wandb.init(
+    project="nlp-classification",
+    name="bert-base-lr0.001-bs32-epoch10"
+)
+
+# ❌ Bad: Generic names
+wandb.init(project="nlp", name="run1")
+```
+
+### 4. Save Important Artifacts
+
+```python
+# Save final model
+artifact = wandb.Artifact('final-model', type='model')
+artifact.add_file('model.pth')
+wandb.log_artifact(artifact)
+
+# Save predictions for analysis
+predictions_table = wandb.Table(
+    columns=["id", "input", "prediction", "ground_truth"],
+    data=predictions_data
+)
+wandb.log({"predictions": predictions_table})
+```
+
+### 5. Use Offline Mode for Unstable Connections
+
+```python
+import os
+
+# Enable offline mode
+os.environ["WANDB_MODE"] = "offline"
+
+wandb.init(project="my-project")
+# ... your code ...
+
+# Sync later
+# wandb sync <run_directory>
+```
+
+## Team Collaboration
+
+### Share Runs
+
+```python
+# Runs are automatically shareable via URL
+run = wandb.init(project="team-project")
+print(f"Share this URL: {run.url}")
+```
+
+### Team Projects
+
+- Create team account at wandb.ai
+- Add team members
+- Set project visibility (private/public)
+- Use team-level artifacts and model registry
+
+## Pricing
+
+- **Free**: Unlimited public projects, 100GB storage
+- **Academic**: Free for students/researchers
+- **Teams**: $50/seat/month, private projects, unlimited storage
+- **Enterprise**: Custom pricing, on-prem options
+
+## Resources
+
+- **Documentation**: https://docs.wandb.ai
+- **GitHub**: https://github.com/wandb/wandb (10.5k+ stars)
+- **Examples**: https://github.com/wandb/examples
+- **Community**: https://wandb.ai/community
+- **Discord**: https://wandb.me/discord
+
+## See Also
+
+- `references/sweeps.md` - Comprehensive hyperparameter optimization guide
+- `references/artifacts.md` - Data and model versioning patterns
+- `references/integrations.md` - Framework-specific examples
+
+
--- a/skills/mlops/evaluation/weights-and-biases/references/artifacts.md
+++ b/skills/mlops/evaluation/weights-and-biases/references/artifacts.md
@@ -0,0 +1,584 @@
+# Artifacts & Model Registry Guide
+
+Complete guide to data versioning and model management with W&B Artifacts.
+
+## Table of Contents
+- What are Artifacts
+- Creating Artifacts
+- Using Artifacts
+- Model Registry
+- Versioning & Lineage
+- Best Practices
+
+## What are Artifacts
+
+Artifacts are versioned datasets, models, or files tracked with lineage.
+
+**Key Features:**
+- Automatic versioning (v0, v1, v2...)
+- Lineage tracking (which runs produced/used artifacts)
+- Efficient storage (deduplication)
+- Collaboration (team-wide access)
+- Aliases (latest, best, production)
+
+**Common Use Cases:**
+- Dataset versioning
+- Model checkpoints
+- Preprocessed data
+- Evaluation results
+- Configuration files
+
+## Creating Artifacts
+
+### Basic Dataset Artifact
+
+```python
+import wandb
+
+run = wandb.init(project="my-project")
+
+# Create artifact
+dataset = wandb.Artifact(
+    name='training-data',
+    type='dataset',
+    description='ImageNet training split with augmentations',
+    metadata={
+        'size': '1.2M images',
+        'format': 'JPEG',
+        'resolution': '224x224'
+    }
+)
+
+# Add files
+dataset.add_file('data/train.csv')        # Single file
+dataset.add_dir('data/images')            # Entire directory
+dataset.add_reference('s3://bucket/data') # Cloud reference
+
+# Log artifact
+run.log_artifact(dataset)
+wandb.finish()
+```
+
+### Model Artifact
+
+```python
+import torch
+import wandb
+
+run = wandb.init(project="my-project")
+
+# Train model
+model = train_model()
+
+# Save model
+torch.save(model.state_dict(), 'model.pth')
+
+# Create model artifact
+model_artifact = wandb.Artifact(
+    name='resnet50-classifier',
+    type='model',
+    description='ResNet50 trained on ImageNet',
+    metadata={
+        'architecture': 'ResNet50',
+        'accuracy': 0.95,
+        'loss': 0.15,
+        'epochs': 50,
+        'framework': 'PyTorch'
+    }
+)
+
+# Add model file
+model_artifact.add_file('model.pth')
+
+# Add config
+model_artifact.add_file('config.yaml')
+
+# Log with aliases
+run.log_artifact(model_artifact, aliases=['latest', 'best'])
+
+wandb.finish()
+```
+
+### Preprocessed Data Artifact
+
+```python
+import pandas as pd
+import wandb
+
+run = wandb.init(project="nlp-project")
+
+# Preprocess data
+df = pd.read_csv('raw_data.csv')
+df_processed = preprocess(df)
+df_processed.to_csv('processed_data.csv', index=False)
+
+# Create artifact
+processed_data = wandb.Artifact(
+    name='processed-text-data',
+    type='dataset',
+    metadata={
+        'rows': len(df_processed),
+        'columns': list(df_processed.columns),
+        'preprocessing_steps': ['lowercase', 'remove_stopwords', 'tokenize']
+    }
+)
+
+processed_data.add_file('processed_data.csv')
+
+# Log artifact
+run.log_artifact(processed_data)
+```
+
+## Using Artifacts
+
+### Download and Use
+
+```python
+import wandb
+
+run = wandb.init(project="my-project")
+
+# Download artifact
+artifact = run.use_artifact('training-data:latest')
+artifact_dir = artifact.download()
+
+# Use files
+import pandas as pd
+df = pd.read_csv(f'{artifact_dir}/train.csv')
+
+# Train with artifact data
+model = train_model(df)
+```
+
+### Use Specific Version
+
+```python
+# Use specific version
+artifact_v2 = run.use_artifact('training-data:v2')
+
+# Use alias
+artifact_best = run.use_artifact('model:best')
+artifact_prod = run.use_artifact('model:production')
+
+# Use from another project
+artifact = run.use_artifact('team/other-project/model:latest')
+```
+
+### Check Artifact Metadata
+
+```python
+artifact = run.use_artifact('training-data:latest')
+
+# Access metadata
+print(artifact.metadata)
+print(f"Size: {artifact.metadata['size']}")
+
+# Access version info
+print(f"Version: {artifact.version}")
+print(f"Created at: {artifact.created_at}")
+print(f"Digest: {artifact.digest}")
+```
+
+## Model Registry
+
+Link models to a central registry for governance and deployment.
+
+### Create Model Registry
+
+```python
+# In W&B UI:
+# 1. Go to "Registry" tab
+# 2. Create new registry: "production-models"
+# 3. Define stages: development, staging, production
+```
+
+### Link Model to Registry
+
+```python
+import wandb
+
+run = wandb.init(project="training")
+
+# Create model artifact
+model_artifact = wandb.Artifact(
+    name='sentiment-classifier',
+    type='model',
+    metadata={'accuracy': 0.94, 'f1': 0.92}
+)
+
+model_artifact.add_file('model.pth')
+
+# Log artifact
+run.log_artifact(model_artifact)
+
+# Link to registry
+run.link_artifact(
+    model_artifact,
+    'model-registry/production-models',
+    aliases=['staging']  # Deploy to staging
+)
+
+wandb.finish()
+```
+
+### Promote Model in Registry
+
+```python
+import wandb
+
+# Retrieve model from registry
+api = wandb.Api()
+artifact = api.artifact('model-registry/production-models/sentiment-classifier:staging')
+
+# Promote to production
+artifact.link('model-registry/production-models', aliases=['production'])
+
+# Demote: overwrite this version's aliases, which drops 'production'
+artifact.aliases = ['archived']
+artifact.save()
+```
+
+### Use Model from Registry
+
+```python
+import wandb
+
+run = wandb.init()
+
+# Download production model
+model_artifact = run.use_artifact(
+    'model-registry/production-models/sentiment-classifier:production'
+)
+
+model_dir = model_artifact.download()
+
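+# Load and use (model.pth holds a state_dict, so rebuild the architecture first;
+# build_model is a placeholder for your own constructor)
+import torch
+model = build_model()
+model.load_state_dict(torch.load(f'{model_dir}/model.pth', map_location='cpu'))
+model.eval()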
+```
+
+## Versioning & Lineage
+
+### Automatic Versioning
+
+```python
+# First log: creates v0
+run1 = wandb.init(project="my-project")
+dataset_v0 = wandb.Artifact('my-dataset', type='dataset')
+dataset_v0.add_file('data_v1.csv')
+run1.log_artifact(dataset_v0)
+run1.finish()  # Close each run before starting the next
+
+# Second log with same name: creates v1
+run2 = wandb.init(project="my-project")
+dataset_v1 = wandb.Artifact('my-dataset', type='dataset')
+dataset_v1.add_file('data_v2.csv')  # Different content
+run2.log_artifact(dataset_v1)
+run2.finish()
+
+# Third log with SAME content as v1: references v1 (no new version)
+run3 = wandb.init(project="my-project")
+dataset_v1_again = wandb.Artifact('my-dataset', type='dataset')
+dataset_v1_again.add_file('data_v2.csv')  # Same content as v1
+run3.log_artifact(dataset_v1_again)  # Still v1, no v2 created
+run3.finish()
+```
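+
+The versioning rule above can be pictured as a lookup on a content digest. The following is a toy model in plain Python, not the W&B implementation: `TinyArtifactStore` is a hypothetical class, and comparing against every earlier version (rather than only the latest) is an assumption made for simplicity.
+
+```python
+import hashlib
+
+
+class TinyArtifactStore:
+    """Toy model of digest-based versioning; not the W&B implementation."""
+
+    def __init__(self):
+        # artifact name -> list of digests; list index is the version number
+        self._digests = {}
+
+    @staticmethod
+    def _digest(files):
+        # Hash file names and contents in a stable (sorted) order
+        h = hashlib.sha256()
+        for path in sorted(files):
+            h.update(path.encode())
+            h.update(hashlib.sha256(files[path]).digest())
+        return h.hexdigest()
+
+    def log(self, name, files):
+        """Return the version label, creating a new version only for new content."""
+        digest = self._digest(files)
+        versions = self._digests.setdefault(name, [])
+        if digest not in versions:
+            versions.append(digest)
+        return f"v{versions.index(digest)}"
+
+
+store = TinyArtifactStore()
+first = store.log("my-dataset", {"data.csv": b"a,b\n1,2\n"})
+second = store.log("my-dataset", {"data.csv": b"a,b\n3,4\n"})
+third = store.log("my-dataset", {"data.csv": b"a,b\n3,4\n"})  # same bytes as second
+print(first, second, third)  # prints: v0 v1 v1
+```
+
+The third call returns `v1` because its digest matches content that was already stored, mirroring the "no v2 created" case.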
+
+### Track Lineage
+
+```python
+# Training run
+run = wandb.init(project="my-project")
+
+# Use dataset (input)
+dataset = run.use_artifact('training-data:v3')
+data = load_data(dataset.download())
+
+# Train model
+model = train(data)
+
+# Save model (output)
+model_artifact = wandb.Artifact('trained-model', type='model')
+torch.save(model.state_dict(), 'model.pth')
+model_artifact.add_file('model.pth')
+run.log_artifact(model_artifact)
+
+# Lineage automatically tracked:
+# training-data:v3 --> [run] --> trained-model:v0
+```
+
+### View Lineage Graph
+
+```python
+# In W&B UI:
+# Artifacts → Select artifact → Lineage tab
+# Shows:
+# - Which runs produced this artifact
+# - Which runs used this artifact
+# - Parent/child artifacts
+```
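+
+Conceptually, the lineage view is a graph that alternates between artifacts and runs. The sketch below models that structure with plain Python containers to show how upstream artifacts can be traced. `LineageGraph` and the run names are hypothetical, and nothing here calls the W&B API.
+
+```python
+from collections import defaultdict
+
+
+class LineageGraph:
+    """Toy lineage record: which artifacts each run consumed and produced."""
+
+    def __init__(self):
+        self.inputs = defaultdict(set)  # run -> artifacts it used
+        self.producer = {}              # artifact -> run that logged it
+
+    def use(self, run, artifact):
+        self.inputs[run].add(artifact)
+
+    def log(self, run, artifact):
+        self.producer[artifact] = run
+
+    def ancestors(self, artifact):
+        """Return every upstream artifact reachable through producing runs."""
+        seen = set()
+        stack = [artifact]
+        while stack:
+            run = self.producer.get(stack.pop())
+            for parent in self.inputs.get(run, ()):
+                if parent not in seen:
+                    seen.add(parent)
+                    stack.append(parent)
+        return seen
+
+
+graph = LineageGraph()
+graph.use("prep-run", "raw-data:v0")
+graph.log("prep-run", "training-data:v3")
+graph.use("train-run", "training-data:v3")
+graph.log("train-run", "trained-model:v0")
+print(sorted(graph.ancestors("trained-model:v0")))
+# prints: ['raw-data:v0', 'training-data:v3']
+```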
+
+## Artifact Types
+
+### Dataset Artifacts
+
+```python
+# Raw data
+raw_data = wandb.Artifact('raw-data', type='dataset')
+raw_data.add_dir('raw/')
+
+# Processed data
+processed_data = wandb.Artifact('processed-data', type='dataset')
+processed_data.add_dir('processed/')
+
+# Train/val/test splits
+train_split = wandb.Artifact('train-split', type='dataset')
+train_split.add_file('train.csv')
+
+val_split = wandb.Artifact('val-split', type='dataset')
+val_split.add_file('val.csv')
+```
+
+### Model Artifacts
+
+```python
+# Checkpoint during training
+checkpoint = wandb.Artifact('checkpoint-epoch-10', type='model')
+checkpoint.add_file('checkpoint_epoch_10.pth')
+
+# Final model
+final_model = wandb.Artifact('final-model', type='model')
+final_model.add_file('model.pth')
+final_model.add_file('tokenizer.json')
+
+# Quantized model
+quantized = wandb.Artifact('quantized-model', type='model')
+quantized.add_file('model_int8.onnx')
+```
+
+### Result Artifacts
+
+```python
+# Predictions
+predictions = wandb.Artifact('test-predictions', type='predictions')
+predictions.add_file('predictions.csv')
+
+# Evaluation metrics
+eval_results = wandb.Artifact('evaluation', type='evaluation')
+eval_results.add_file('metrics.json')
+eval_results.add_file('confusion_matrix.png')
+```
+
+## Advanced Patterns
+
+### Incremental Artifacts
+
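+Add files to an artifact one batch at a time, then log it once at the end. Files are uploaded when the artifact is logged, and unchanged files are deduplicated across versions.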
+
+```python
+run = wandb.init(project="my-project")
+
+# Create artifact
+dataset = wandb.Artifact('incremental-dataset', type='dataset')
+
+# Add files incrementally
+for i in range(100):
+    filename = f'batch_{i}.csv'
+    process_batch(i, filename)
+    dataset.add_file(filename)
+
+    # Log progress
+    if (i + 1) % 10 == 0:
+        print(f"Added {i + 1}/100 batches")
+
+# Log complete artifact
+run.log_artifact(dataset)
+```
+
+### Artifact Tables
+
+Track structured data with W&B Tables.
+
+```python
+import wandb
+
+run = wandb.init(project="my-project")
+
+# Create table
+table = wandb.Table(columns=["id", "image", "label", "prediction"])
+
+for idx, (img, label, pred) in enumerate(zip(images, labels, predictions)):
+    table.add_data(
+        idx,
+        wandb.Image(img),
+        label,
+        pred
+    )
+
+# Log as artifact
+artifact = wandb.Artifact('predictions-table', type='predictions')
+artifact.add(table, "predictions")
+run.log_artifact(artifact)
+```
+
+### Artifact References
+
+Reference external data without copying.
+
+```python
+# S3 reference
+dataset = wandb.Artifact('s3-dataset', type='dataset')
+dataset.add_reference('s3://my-bucket/data/', name='train')
+dataset.add_reference('s3://my-bucket/labels/', name='labels')
+
+# GCS reference
+dataset.add_reference('gs://my-bucket/data/')
+
+# HTTP reference
+dataset.add_reference('https://example.com/data.zip')
+
+# Local filesystem reference (for shared storage)
+dataset.add_reference('file:///mnt/shared/data')
+```
+
+## Collaboration Patterns
+
+### Team Dataset Sharing
+
+```python
+# Data engineer creates dataset
+run = wandb.init(project="data-eng", entity="my-team")
+dataset = wandb.Artifact('shared-dataset', type='dataset')
+dataset.add_dir('data/')
+run.log_artifact(dataset, aliases=['latest', 'production'])
+
+# ML engineer uses dataset
+run = wandb.init(project="ml-training", entity="my-team")
+dataset = run.use_artifact('my-team/data-eng/shared-dataset:production')
+data = load_data(dataset.download())
+```
+
+### Model Handoff
+
+```python
+# Training team
+train_run = wandb.init(project="model-training", entity="ml-team")
+model = train_model()
+model_artifact = wandb.Artifact('nlp-model', type='model')
+model_artifact.add_file('model.pth')
+train_run.log_artifact(model_artifact)
+train_run.link_artifact(model_artifact, 'model-registry/nlp-models', aliases=['candidate'])
+
+# Evaluation team
+eval_run = wandb.init(project="model-eval", entity="ml-team")
+model_artifact = eval_run.use_artifact('model-registry/nlp-models/nlp-model:candidate')
+metrics = evaluate_model(model_artifact)
+
+if metrics['f1'] > 0.9:
+    # Promote to production
+    model_artifact.link('model-registry/nlp-models', aliases=['production'])
+```
+
+## Best Practices
+
+### 1. Use Descriptive Names
+
+```python
+# ✅ Good: Descriptive names
+wandb.Artifact('imagenet-train-augmented-v2', type='dataset')
+wandb.Artifact('bert-base-sentiment-finetuned', type='model')
+
+# ❌ Bad: Generic names
+wandb.Artifact('dataset1', type='dataset')
+wandb.Artifact('model', type='model')
+```
+
+### 2. Add Comprehensive Metadata
+
+```python
+model_artifact = wandb.Artifact(
+    'production-model',
+    type='model',
+    description='ResNet50 classifier for product categorization',
+    metadata={
+        # Model info
+        'architecture': 'ResNet50',
+        'framework': 'PyTorch 2.0',
+        'pretrained': True,
+
+        # Performance
+        'accuracy': 0.95,
+        'f1_score': 0.93,
+        'inference_time_ms': 15,
+
+        # Training
+        'epochs': 50,
+        'dataset': 'imagenet',
+        'num_samples': 1200000,
+
+        # Business context
+        'use_case': 'e-commerce product classification',
+        'owner': 'ml-team@company.com',
+        'approved_by': 'data-science-lead'
+    }
+)
+```
+
+### 3. Use Aliases for Deployment Stages
+
+```python
+# Development
+run.log_artifact(model, aliases=['dev', 'latest'])
+
+# Staging
+run.log_artifact(model, aliases=['staging'])
+
+# Production
+run.log_artifact(model, aliases=['production', 'v1.2.0'])
+
+# Archive old versions (pass the full 'entity/project/name:alias' path if needed)
+api = wandb.Api()
+old_artifact = api.artifact('model:production')
+old_artifact.aliases = ['archived-v1.1.0']
+old_artifact.save()
+```
+
+### 4. Track Data Lineage
+
+```python
+def create_training_pipeline():
+    run = wandb.init(project="pipeline")
+
+    # 1. Load raw data
+    raw_data = run.use_artifact('raw-data:latest')
+    raw_dir = raw_data.download()
+
+    # 2. Preprocess (preprocess is a placeholder that writes processed.csv)
+    processed = preprocess(raw_dir)
+    processed_artifact = wandb.Artifact('processed-data', type='dataset')
+    processed_artifact.add_file('processed.csv')
+    run.log_artifact(processed_artifact)
+
+    # 3. Train model
+    model = train(processed)
+    model_artifact = wandb.Artifact('trained-model', type='model')
+    model_artifact.add_file('model.pth')
+    run.log_artifact(model_artifact)
+
+    # Lineage: raw-data → processed-data → trained-model
+```
+
+### 5. Efficient Storage
+
+```python
+# ✅ Good: Reference large files
+large_dataset = wandb.Artifact('large-dataset', type='dataset')
+large_dataset.add_reference('s3://bucket/huge-file.tar.gz')
+
+# ❌ Bad: Upload giant files
+# large_dataset.add_file('huge-file.tar.gz')  # Don't do this
+
+# ✅ Good: Upload only metadata
+metadata_artifact = wandb.Artifact('dataset-metadata', type='dataset')
+metadata_artifact.add_file('metadata.json')  # Small file
+```
+
+## Resources
+
+- **Artifacts Documentation**: https://docs.wandb.ai/guides/artifacts
+- **Model Registry**: https://docs.wandb.ai/guides/model-registry
+- **Best Practices**: https://wandb.ai/site/articles/versioning-data-and-models-in-ml
--- a/skills/mlops/evaluation/weights-and-biases/references/integrations.md
+++ b/skills/mlops/evaluation/weights-and-biases/references/integrations.md
@@ -0,0 +1,700 @@
+# Framework Integrations Guide
+
+Complete guide to integrating W&B with popular ML frameworks.
+
+## Table of Contents
+- HuggingFace Transformers
+- PyTorch Lightning
+- Keras/TensorFlow
+- Fast.ai
+- XGBoost/LightGBM
+- PyTorch Native
+- Custom Integrations
+
+## HuggingFace Transformers
+
+### Automatic Integration
+
+```python
+from transformers import Trainer, TrainingArguments
+import wandb
+
+# Initialize W&B
+wandb.init(project="hf-transformers", name="bert-finetuning")
+
+# Training arguments with W&B
+training_args = TrainingArguments(
+    output_dir="./results",
+    report_to="wandb",  # Enable W&B logging
+    run_name="bert-base-finetuning",
+
+    # Training params
+    num_train_epochs=3,
+    per_device_train_batch_size=16,
+    per_device_eval_batch_size=64,
+    learning_rate=2e-5,
+
+    # Logging
+    logging_dir="./logs",
+    logging_steps=100,
+    logging_first_step=True,
+
+    # Evaluation (recent transformers releases rename this argument to eval_strategy)
+    evaluation_strategy="steps",
+    eval_steps=500,
+    save_steps=500,
+
+    # Other
+    load_best_model_at_end=True,
+    metric_for_best_model="eval_accuracy"
+)
+
+# Trainer automatically logs to W&B
+trainer = Trainer(
+    model=model,
+    args=training_args,
+    train_dataset=train_dataset,
+    eval_dataset=eval_dataset,
+    compute_metrics=compute_metrics
+)
+
+# Train (metrics logged automatically)
+trainer.train()
+
+# Finish W&B run
+wandb.finish()
+```
+
+### Custom Logging
+
+```python
+from transformers import Trainer, TrainingArguments
+from transformers.integrations import WandbCallback
+import wandb
+
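+class CustomWandbCallback(WandbCallback):
+    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
+        # The base hook accepts metrics only as a keyword argument
+        super().on_evaluate(args, state, control, metrics=metrics, **kwargs)
+
+        # Log custom metrics (skipped when accuracy was not computed)
+        if metrics and "eval_accuracy" in metrics:
+            wandb.log({
+                "custom/eval_score": metrics["eval_accuracy"] * 100,
+                "custom/epoch": state.epoch
+            })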
+
+# Use custom callback
+trainer = Trainer(
+    model=model,
+    args=training_args,
+    train_dataset=train_dataset,
+    eval_dataset=eval_dataset,
+    callbacks=[CustomWandbCallback()]
+)
+```
+
+### Log Model to Registry
+
+```python
+from transformers import Trainer, TrainingArguments
+import wandb
+
+training_args = TrainingArguments(
+    output_dir="./results",
+    report_to="wandb",
+    load_best_model_at_end=True
+)
+
+trainer = Trainer(
+    model=model,
+    args=training_args,
+    train_dataset=train_dataset,
+    eval_dataset=eval_dataset
+)
+
+trainer.train()
+
+# Save final model as artifact
+model_artifact = wandb.Artifact(
+    'hf-bert-model',
+    type='model',
+    description='BERT finetuned on sentiment analysis'
+)
+
+# Save model files
+trainer.save_model("./final_model")
+model_artifact.add_dir("./final_model")
+
+# Log artifact
+wandb.log_artifact(model_artifact, aliases=['best', 'production'])
+wandb.finish()
+```
+
+## PyTorch Lightning
+
+### Basic Integration
+
+```python
+import torch
+import torch.nn.functional as F
+import pytorch_lightning as pl
+from pytorch_lightning.loggers import WandbLogger
+import wandb
+
+# Create W&B logger
+wandb_logger = WandbLogger(
+    project="lightning-demo",
+    name="resnet50-training",
+    log_model=True,  # Log model checkpoints as artifacts
+    save_code=True   # Save code as artifact
+)
+
+# Lightning module
+class LitModel(pl.LightningModule):
+    def __init__(self, learning_rate=0.001):
+        super().__init__()
+        self.save_hyperparameters()
+        self.model = create_model()
+
+    def training_step(self, batch, batch_idx):
+        x, y = batch
+        y_hat = self.model(x)
+        loss = F.cross_entropy(y_hat, y)
+
+        # Log metrics (automatically sent to W&B)
+        self.log('train/loss', loss, on_step=True, on_epoch=True)
+        self.log('train/accuracy', accuracy(y_hat, y), on_epoch=True)
+
+        return loss
+
+    def validation_step(self, batch, batch_idx):
+        x, y = batch
+        y_hat = self.model(x)
+        loss = F.cross_entropy(y_hat, y)
+
+        self.log('val/loss', loss, on_step=False, on_epoch=True)
+        self.log('val/accuracy', accuracy(y_hat, y), on_epoch=True)
+
+        return loss
+
+    def configure_optimizers(self):
+        return torch.optim.Adam(self.parameters(), lr=self.hparams.learning_rate)
+
+# Trainer with W&B logger
+trainer = pl.Trainer(
+    logger=wandb_logger,
+    max_epochs=10,
+    accelerator="gpu",
+    devices=1
+)
+
+# Train (metrics logged automatically); dm is your own LightningDataModule
+model = LitModel()
+trainer.fit(model, datamodule=dm)
+
+# Finish W&B run
+wandb.finish()
+```
+
+### Log Media
+
+```python
+class LitModel(pl.LightningModule):
+    def validation_step(self, batch, batch_idx):
+        x, y = batch
+        y_hat = self.model(x)
+        loss = F.cross_entropy(y_hat, y)
+
+        # Log images (first batch only)
+        if batch_idx == 0:
+            self.logger.experiment.log({
+                "examples": [wandb.Image(img) for img in x[:8]]
+            })
+
+        return loss
+
+    def on_validation_epoch_end(self):
+        # Log confusion matrix (all_preds and all_targets are assumed to be
+        # collected during validation_step)
+
+        self.logger.experiment.log({
+            "confusion_matrix": wandb.plot.confusion_matrix(
+                probs=None,
+                y_true=self.all_targets,
+                preds=self.all_preds,
+                class_names=self.class_names
+            )
+        })
+```
+
+### Hyperparameter Sweeps
+
+```python
+import pytorch_lightning as pl
+from pytorch_lightning.loggers import WandbLogger
+import wandb
+
+# Define sweep
+sweep_config = {
+    'method': 'bayes',
+    'metric': {'name': 'val/accuracy', 'goal': 'maximize'},
+    'parameters': {
+        'learning_rate': {'min': 1e-5, 'max': 1e-2, 'distribution': 'log_uniform_values'},
+        'batch_size': {'values': [16, 32, 64]},
+        'hidden_size': {'values': [128, 256, 512]}
+    }
+}
+
+sweep_id = wandb.sweep(sweep_config, project="lightning-sweeps")
+
+def train():
+    # Initialize W&B
+    run = wandb.init()
+
+    # Get hyperparameters
+    config = wandb.config
+
+    # Create logger
+    wandb_logger = WandbLogger()
+
+    # Create model with sweep params
+    model = LitModel(
+        learning_rate=config.learning_rate,
+        hidden_size=config.hidden_size
+    )
+
+    # Create datamodule with sweep batch size
+    dm = DataModule(batch_size=config.batch_size)
+
+    # Train
+    trainer = pl.Trainer(logger=wandb_logger, max_epochs=10)
+    trainer.fit(model, dm)
+
+# Run sweep
+wandb.agent(sweep_id, function=train, count=30)
+```
+
+## Keras/TensorFlow
+
+### With Callback
+
+```python
+import tensorflow as tf
+from wandb.integration.keras import WandbCallback
+import wandb
+
+# Initialize W&B
+wandb.init(
+    project="keras-demo",
+    config={
+        "learning_rate": 0.001,
+        "epochs": 10,
+        "batch_size": 32
+    }
+)
+
+config = wandb.config
+
+# Build model
+model = tf.keras.Sequential([
+    tf.keras.layers.Dense(128, activation='relu'),
+    tf.keras.layers.Dropout(0.2),
+    tf.keras.layers.Dense(10, activation='softmax')
+])
+
+model.compile(
+    optimizer=tf.keras.optimizers.Adam(config.learning_rate),
+    loss='sparse_categorical_crossentropy',
+    metrics=['accuracy']
+)
+
+# Train with W&B callback
+history = model.fit(
+    x_train, y_train,
+    validation_data=(x_val, y_val),
+    epochs=config.epochs,
+    batch_size=config.batch_size,
+    callbacks=[
+        WandbCallback(
+            log_weights=True,      # Log model weights
+            log_gradients=True,    # Log gradients
+            training_data=(x_train, y_train),
+            validation_data=(x_val, y_val),
+            labels=class_names
+        )
+    ]
+)
+
+# Save model as artifact
+model.save('model.h5')
+artifact = wandb.Artifact('keras-model', type='model')
+artifact.add_file('model.h5')
+wandb.log_artifact(artifact)
+
+wandb.finish()
+```
+
+### Custom Training Loop
+
+```python
+import tensorflow as tf
+import wandb
+
+wandb.init(project="tf-custom-loop")
+
+# Model, optimizer, loss
+model = create_model()
+optimizer = tf.keras.optimizers.Adam(1e-3)
+loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
+
+# Metrics
+train_loss = tf.keras.metrics.Mean(name='train_loss')
+train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')
+
+@tf.function
+def train_step(x, y):
+    with tf.GradientTape() as tape:
+        predictions = model(x, training=True)
+        loss = loss_fn(y, predictions)
+
+    gradients = tape.gradient(loss, model.trainable_variables)
+    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
+
+    train_loss(loss)
+    train_accuracy(y, predictions)
+
+# Training loop
+for epoch in range(EPOCHS):
+    train_loss.reset_state()
+    train_accuracy.reset_state()
+
+    for step, (x, y) in enumerate(train_dataset):
+        train_step(x, y)
+
+        # Log every 100 steps
+        if step % 100 == 0:
+            wandb.log({
+                'train/loss': train_loss.result().numpy(),
+                'train/accuracy': train_accuracy.result().numpy(),
+                'epoch': epoch,
+                'step': step
+            })
+
+    # Log epoch metrics
+    wandb.log({
+        'epoch/train_loss': train_loss.result().numpy(),
+        'epoch/train_accuracy': train_accuracy.result().numpy(),
+        'epoch': epoch
+    })
+
+wandb.finish()
+```
+
+## Fast.ai
+
+### With Callback
+
+```python
+from fastai.vision.all import *
+from fastai.callback.wandb import *
+import wandb
+
+# Initialize W&B
+wandb.init(project="fastai-demo")
+
+# Create data loaders
+dls = ImageDataLoaders.from_folder(
+    path,
+    train='train',
+    valid='valid',
+    bs=64
+)
+
+# Create learner with W&B callback
+learn = vision_learner(
+    dls,
+    resnet34,
+    metrics=accuracy,
+    cbs=WandbCallback(
+        log_preds=True,     # Log predictions
+        log_model=True,     # Log model as artifact
+        log_dataset=True    # Log dataset as artifact
+    )
+)
+
+# Train (metrics logged automatically)
+learn.fine_tune(5)
+
+wandb.finish()
+```
+
+## XGBoost/LightGBM
+
+### XGBoost
+
+```python
+import xgboost as xgb
+import wandb
+
+# Initialize W&B
+run = wandb.init(project="xgboost-demo", config={
+    "max_depth": 6,
+    "learning_rate": 0.1,
+    "n_estimators": 100
+})
+
+config = wandb.config
+
+# Create DMatrix
+dtrain = xgb.DMatrix(X_train, label=y_train)
+dval = xgb.DMatrix(X_val, label=y_val)
+
+# XGBoost params
+params = {
+    'max_depth': config.max_depth,
+    'learning_rate': config.learning_rate,
+    'objective': 'binary:logistic',
+    'eval_metric': ['logloss', 'auc']
+}
+
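+# Custom callback for W&B (XGBoost >= 1.3 expects a TrainingCallback subclass)
+class WandbLoggingCallback(xgb.callback.TrainingCallback):
+    """Log XGBoost metrics to W&B."""
+    def after_iteration(self, model, epoch, evals_log):
+        # evals_log maps dataset name -> metric name -> list of values per round
+        metrics = {
+            f"{data_name}-{metric_name}": values[-1]
+            for data_name, data_metrics in evals_log.items()
+            for metric_name, values in data_metrics.items()
+        }
+        wandb.log({**metrics, "iteration": epoch})
+        return False  # Returning True would stop training
+
+# Train with callback
+model = xgb.train(
+    params,
+    dtrain,
+    num_boost_round=config.n_estimators,
+    evals=[(dtrain, 'train'), (dval, 'val')],
+    callbacks=[WandbLoggingCallback()],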
+    verbose_eval=10
+)
+
+# Save model
+model.save_model('xgboost_model.json')
+artifact = wandb.Artifact('xgboost-model', type='model')
+artifact.add_file('xgboost_model.json')
+wandb.log_artifact(artifact)
+
+wandb.finish()
+```
+
+### LightGBM
+
+```python
+import lightgbm as lgb
+import wandb
+
+run = wandb.init(project="lgbm-demo")
+
+# Create datasets
+train_data = lgb.Dataset(X_train, label=y_train)
+val_data = lgb.Dataset(X_val, label=y_val, reference=train_data)
+
+# Parameters
+params = {
+    'objective': 'binary',
+    'metric': ['binary_logloss', 'auc'],
+    'learning_rate': 0.1,
+    'num_leaves': 31
+}
+
+# Custom callback
+def log_to_wandb(env):
+    """Log LightGBM metrics to W&B (one wandb.log call per boosting round)."""
+    metrics = {"iteration": env.iteration}
+    for entry in env.evaluation_result_list:
+        dataset_name, metric_name, metric_value, _ = entry
+        metrics[f"{dataset_name}/{metric_name}"] = metric_value
+    wandb.log(metrics)
+
+# Train
+model = lgb.train(
+    params,
+    train_data,
+    num_boost_round=100,
+    valid_sets=[train_data, val_data],
+    valid_names=['train', 'val'],
+    callbacks=[log_to_wandb]
+)
+
+# Save model
+model.save_model('lgbm_model.txt')
+artifact = wandb.Artifact('lgbm-model', type='model')
+artifact.add_file('lgbm_model.txt')
+wandb.log_artifact(artifact)
+
+wandb.finish()
+```
+
+## PyTorch Native
+
+### Training Loop Integration
+
+```python
+import torch
+import torch.nn as nn
+import torch.optim as optim
+import wandb
+
+# Initialize W&B
+wandb.init(project="pytorch-native", config={
+    "learning_rate": 0.001,
+    "epochs": 10,
+    "batch_size": 32
+})
+
+config = wandb.config
+
+# Model, loss, optimizer
+model = create_model().to(device)  # Keep the model on the same device as the batches
+criterion = nn.CrossEntropyLoss()
+optimizer = optim.Adam(model.parameters(), lr=config.learning_rate)
+
+# Watch model (logs gradients and parameters)
+wandb.watch(model, criterion, log="all", log_freq=100)
+
+# Training loop
+for epoch in range(config.epochs):
+    model.train()
+    train_loss = 0.0
+    correct = 0
+    total = 0
+
+    for batch_idx, (data, target) in enumerate(train_loader):
+        data, target = data.to(device), target.to(device)
+
+        # Forward pass
+        optimizer.zero_grad()
+        output = model(data)
+        loss = criterion(output, target)
+
+        # Backward pass
+        loss.backward()
+        optimizer.step()
+
+        # Track metrics
+        train_loss += loss.item()
+        _, predicted = output.max(1)
+        total += target.size(0)
+        correct += predicted.eq(target).sum().item()
+
+        # Log every 100 batches
+        if batch_idx % 100 == 0:
+            wandb.log({
+                'train/loss': loss.item(),
+                'train/batch_accuracy': 100. * correct / total,
+                'epoch': epoch,
+                'batch': batch_idx
+            })
+
+    # Validation
+    model.eval()
+    val_loss = 0.0
+    val_correct = 0
+    val_total = 0
+
+    with torch.no_grad():
+        for data, target in val_loader:
+            data, target = data.to(device), target.to(device)
+            output = model(data)
+            loss = criterion(output, target)
+
+            val_loss += loss.item()
+            _, predicted = output.max(1)
+            val_total += target.size(0)
+            val_correct += predicted.eq(target).sum().item()
+
+    # Log epoch metrics
+    wandb.log({
+        'epoch/train_loss': train_loss / len(train_loader),
+        'epoch/train_accuracy': 100. * correct / total,
+        'epoch/val_loss': val_loss / len(val_loader),
+        'epoch/val_accuracy': 100. * val_correct / val_total,
+        'epoch': epoch
+    })
+
+# Save final model
+torch.save(model.state_dict(), 'model.pth')
+artifact = wandb.Artifact('final-model', type='model')
+artifact.add_file('model.pth')
+wandb.log_artifact(artifact)
+
+wandb.finish()
+```
+
+## Custom Integrations
+
+### Generic Framework Integration
+
+```python
+import wandb
+
+class WandbIntegration:
+    """Generic W&B integration wrapper."""
+
+    def __init__(self, project, config):
+        self.run = wandb.init(project=project, config=config)
+        self.config = wandb.config
+        self.step = 0
+
+    def log_metrics(self, metrics, step=None):
+        """Log training metrics."""
+        if step is None:
+            step = self.step
+            self.step += 1
+
+        wandb.log(metrics, step=step)
+
+    def log_images(self, images, caption=""):
+        """Log images."""
+        wandb.log({
+            caption: [wandb.Image(img) for img in images]
+        })
+
+    def log_table(self, data, columns):
+        """Log tabular data."""
+        table = wandb.Table(columns=columns, data=data)
+        wandb.log({"table": table})
+
+    def save_model(self, model_path, metadata=None):
+        """Save model as artifact."""
+        artifact = wandb.Artifact(
+            'model',
+            type='model',
+            metadata=metadata or {}
+        )
+        artifact.add_file(model_path)
+        self.run.log_artifact(artifact)
+
+    def finish(self):
+        """Finish W&B run."""
+        wandb.finish()
+
+# Usage
+wb = WandbIntegration(project="my-project", config={"lr": 0.001})
+
+# Training loop
+for epoch in range(10):
+    # Your training code
+    loss, accuracy = train_epoch()
+
+    # Log metrics
+    wb.log_metrics({
+        'train/loss': loss,
+        'train/accuracy': accuracy
+    })
+
+# Save model
+wb.save_model('model.pth', metadata={'accuracy': 0.95})
+wb.finish()
+```
+
+## Resources
+
+- **Integrations Guide**: https://docs.wandb.ai/guides/integrations
+- **HuggingFace**: https://docs.wandb.ai/guides/integrations/huggingface
+- **PyTorch Lightning**: https://docs.wandb.ai/guides/integrations/lightning
+- **Keras**: https://docs.wandb.ai/guides/integrations/keras
+- **Examples**: https://github.com/wandb/examples
--- a/skills/mlops/evaluation/weights-and-biases/references/sweeps.md
+++ b/skills/mlops/evaluation/weights-and-biases/references/sweeps.md
@@ -0,0 +1,847 @@
+# Comprehensive Hyperparameter Sweeps Guide
+
+Complete guide to hyperparameter optimization with W&B Sweeps.
+
+## Table of Contents
+- Sweep Configuration
+- Search Strategies
+- Parameter Distributions
+- Early Termination
+- Training Function
+- Parallel Execution
+- Advanced Patterns
+- Real-World Examples
+- Best Practices
+
+## Sweep Configuration
+
+### Basic Sweep Config
+
+```python
+sweep_config = {
+    'method': 'bayes',  # Search strategy
+    'metric': {
+        'name': 'val/accuracy',
+        'goal': 'maximize'  # or 'minimize'
+    },
+    'parameters': {
+        'learning_rate': {
+            'distribution': 'log_uniform_values',
+            'min': 1e-5,
+            'max': 1e-1
+        },
+        'batch_size': {
+            'values': [16, 32, 64, 128]
+        }
+    }
+}
+
+# Initialize sweep
+sweep_id = wandb.sweep(sweep_config, project="my-project")
+```
+
+### Complete Config Example
+
+```python
+sweep_config = {
+    # Required: Search method
+    'method': 'bayes',
+
+    # Required: Optimization metric
+    'metric': {
+        'name': 'val/f1_score',
+        'goal': 'maximize'
+    },
+
+    # Required: Parameters to search
+    'parameters': {
+        # Continuous parameter
+        'learning_rate': {
+            'distribution': 'log_uniform_values',
+            'min': 1e-5,
+            'max': 1e-1
+        },
+
+        # Discrete values
+        'batch_size': {
+            'values': [16, 32, 64, 128]
+        },
+
+        # Categorical
+        'optimizer': {
+            'values': ['adam', 'sgd', 'rmsprop', 'adamw']
+        },
+
+        # Uniform distribution
+        'dropout': {
+            'distribution': 'uniform',
+            'min': 0.1,
+            'max': 0.5
+        },
+
+        # Integer range
+        'num_layers': {
+            'distribution': 'int_uniform',
+            'min': 2,
+            'max': 10
+        },
+
+        # Fixed value (constant across runs)
+        'epochs': {
+            'value': 50
+        }
+    },
+
+    # Optional: Early termination
+    # (set either min_iter, or max_iter together with s)
+    'early_terminate': {
+        'type': 'hyperband',
+        'min_iter': 5,
+        'eta': 3
+    }
+}
+```
+
+## Search Strategies
+
+### 1. Grid Search
+
+Exhaustively search all combinations.
+
+```python
+sweep_config = {
+    'method': 'grid',
+    'parameters': {
+        'learning_rate': {
+            'values': [0.001, 0.01, 0.1]
+        },
+        'batch_size': {
+            'values': [16, 32, 64]
+        },
+        'optimizer': {
+            'values': ['adam', 'sgd']
+        }
+    }
+}
+
+# Total runs: 3 × 3 × 2 = 18 runs
+```
+
+**Pros:**
+- Comprehensive search
+- Reproducible results
+- No randomness
+
+**Cons:**
+- Exponential growth with parameters
+- Inefficient for continuous parameters
+- Not scalable beyond 3-4 parameters
+
+**When to use:**
+- Few parameters (< 4)
+- All discrete values
+- Need complete coverage
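+
+To make the run count concrete, here is a minimal sketch in plain Python (no W&B calls) that enumerates the same grid with `itertools.product`. It only illustrates how the combinations multiply; it does not describe how W&B schedules grid runs internally.
+
+```python
+import itertools
+
+# Same value lists as the grid sweep config above
+grid = {
+    "learning_rate": [0.001, 0.01, 0.1],
+    "batch_size": [16, 32, 64],
+    "optimizer": ["adam", "sgd"],
+}
+
+# Cartesian product of all value lists, one dict per run
+names = list(grid)
+combinations = [
+    dict(zip(names, values))
+    for values in itertools.product(*grid.values())
+]
+
+print(len(combinations))  # 18
+print(combinations[0])    # {'learning_rate': 0.001, 'batch_size': 16, 'optimizer': 'adam'}
+```
+
+Adding a fourth parameter with five values would raise the count to 90, which is why grid search stops being practical after a handful of parameters.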
+
+### 2. Random Search
+
+Randomly sample parameter combinations.
+
+```python
+sweep_config = {
+    'method': 'random',
+    'parameters': {
+        'learning_rate': {
+            'distribution': 'log_uniform_values',
+            'min': 1e-5,
+            'max': 1e-1
+        },
+        'batch_size': {
+            'values': [16, 32, 64, 128, 256]
+        },
+        'dropout': {
+            'distribution': 'uniform',
+            'min': 0.0,
+            'max': 0.5
+        },
+        'num_layers': {
+            'distribution': 'int_uniform',
+            'min': 2,
+            'max': 8
+        }
+    }
+}
+
+# Create the sweep, then run 100 random trials
+sweep_id = wandb.sweep(sweep_config, project="my-project")
+wandb.agent(sweep_id, function=train, count=100)
+```
+
+**Pros:**
+- Scales to many parameters
+- Can run indefinitely
+- Often finds good solutions quickly
+
+**Cons:**
+- No learning from previous runs
+- May miss optimal region
+- Results vary with random seed
+
+**When to use:**
+- Many parameters (> 4)
+- Quick exploration
+- Limited budget
+
+### 3. Bayesian Optimization (Recommended)
+
+Learn from previous trials to sample promising regions.
+
+```python
+sweep_config = {
+    'method': 'bayes',
+    'metric': {
+        'name': 'val/loss',
+        'goal': 'minimize'
+    },
+    'parameters': {
+        'learning_rate': {
+            'distribution': 'log_uniform_values',
+            'min': 1e-5,
+            'max': 1e-1
+        },
+        'weight_decay': {
+            'distribution': 'log_uniform_values',
+            'min': 1e-6,
+            'max': 1e-2
+        },
+        'dropout': {
+            'distribution': 'uniform',
+            'min': 0.1,
+            'max': 0.5
+        },
+        'num_layers': {
+            'values': [2, 3, 4, 5, 6]
+        }
+    }
+}
+```
+
+**Pros:**
+- Most sample-efficient
+- Learns from past trials
+- Focuses on promising regions
+
+**Cons:**
+- Initial random exploration phase
+- May get stuck in local optima
+- Slower per iteration
+
+**When to use:**
+- Expensive training runs
+- Need best performance
+- Limited compute budget
+
+## Parameter Distributions
+
+### Continuous Distributions
+
+```python
+# Log-uniform over actual values: good for learning rates, regularization
+# (plain 'log_uniform' instead treats min/max as natural-log exponents)
+'learning_rate': {
+    'distribution': 'log_uniform_values',
+    'min': 1e-6,
+    'max': 1e-1
+}
+
+# Uniform: Good for dropout, momentum
+'dropout': {
+    'distribution': 'uniform',
+    'min': 0.0,
+    'max': 0.5
+}
+
+# Normal distribution
+'parameter': {
+    'distribution': 'normal',
+    'mu': 0.5,
+    'sigma': 0.1
+}
+
+# Log-normal distribution
+'parameter': {
+    'distribution': 'log_normal',
+    'mu': 0.0,
+    'sigma': 1.0
+}
+```
+
+### Discrete Distributions
+
+```python
+# Fixed values
+'batch_size': {
+    'values': [16, 32, 64, 128, 256]
+}
+
+# Integer uniform
+'num_layers': {
+    'distribution': 'int_uniform',
+    'min': 2,
+    'max': 10
+}
+
+# Quantized uniform (step size)
+'layer_size': {
+    'distribution': 'q_uniform',
+    'min': 32,
+    'max': 512,
+    'q': 32  # Step by 32: 32, 64, 96, 128...
+}
+
+# Quantized log-uniform
+'hidden_size': {
+    'distribution': 'q_log_uniform_values',
+    'min': 32,
+    'max': 1024,
+    'q': 32
+}
+```
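+
+The quantized distributions can be pictured as "sample, then snap to a step". The sketch below assumes the rule given in the W&B configuration reference for `q_uniform`, namely `round(X / q) * q` with X drawn uniformly. `sample_q_uniform` is an illustrative helper written for this example, not a W&B function.
+
+```python
+import random
+
+def sample_q_uniform(low, high, q, rng):
+    """Draw uniformly from [low, high], then snap to the nearest multiple of q."""
+    return round(rng.uniform(low, high) / q) * q
+
+rng = random.Random(0)  # Fixed seed so the sketch is repeatable
+samples = [sample_q_uniform(32, 512, 32, rng) for _ in range(1000)]
+
+# Every sample lands on a multiple of 32 between 32 and 512
+print(sorted(set(samples)))
+```
+
+Because of the rounding, the two endpoints are each reached only about half as often as the interior steps.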
+
+### Categorical Parameters
+
+```python
+# Optimizers
+'optimizer': {
+    'values': ['adam', 'sgd', 'rmsprop', 'adamw']
+}
+
+# Model architectures
+'model': {
+    'values': ['resnet18', 'resnet34', 'resnet50', 'efficientnet_b0']
+}
+
+# Activation functions
+'activation': {
+    'values': ['relu', 'gelu', 'silu', 'leaky_relu']
+}
+```
+
+## Early Termination
+
+Stop underperforming runs early to save compute.
+
+### Hyperband
+
+```python
+sweep_config = {
+    'method': 'bayes',
+    'metric': {'name': 'val/accuracy', 'goal': 'maximize'},
+    'parameters': {...},
+
+    # Hyperband early termination
+    # Set either min_iter, or max_iter together with s
+    'early_terminate': {
+        'type': 'hyperband',
+        'min_iter': 3,      # Iteration of the first bracket
+        'eta': 3            # Bracket multiplier: checks at 3, 9, 27, ...
+        # Alternative: 'max_iter': 27, 's': 2  (checks at 27/eta = 9 and 27/eta^2 = 3)
+    }
+}
+```
+
+**How it works:**
+- Runs trials in brackets
+- Keeps top 1/eta performers each round
+- Eliminates bottom performers early
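+
+The bullets above can be sketched in a few lines of plain Python. This is a simplified illustration of the "keep the top 1/eta" idea, not W&B's actual scheduler; the helper names and the accuracy values are made up for the example.
+
+```python
+def bracket_iterations(min_iter, eta, num_brackets):
+    """Iterations at which runs are compared: min_iter * eta**k."""
+    return [min_iter * eta ** k for k in range(num_brackets)]
+
+def keep_top_fraction(scores_by_run, eta=3):
+    """Keep the top 1/eta of runs by score (higher is better)."""
+    keep = max(1, len(scores_by_run) // eta)
+    ranked = sorted(scores_by_run, key=scores_by_run.get, reverse=True)
+    return ranked[:keep]
+
+# Hypothetical validation accuracies for nine runs at the first checkpoint
+accuracies = [0.61, 0.48, 0.72, 0.55, 0.80, 0.43, 0.67, 0.59, 0.75]
+scores = {f"run-{i}": acc for i, acc in enumerate(accuracies)}
+
+print(bracket_iterations(min_iter=3, eta=3, num_brackets=3))  # [3, 9, 27]
+survivors = keep_top_fraction(scores, eta=3)
+print(survivors)  # ['run-4', 'run-8', 'run-2']
+```
+
+With `eta=3`, nine runs shrink to three after the first check and to one after the second, so most of the compute goes to the runs that looked best early on.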
+
+### Custom Termination
+
+```python
+def train():
+    run = wandb.init()
+    best_acc = 0.0
+
+    for epoch in range(MAX_EPOCHS):
+        loss = train_epoch()
+        val_acc = validate()
+
+        wandb.log({'val/accuracy': val_acc, 'epoch': epoch})
+
+        # Custom early stopping: accuracy is still poor after a warm-up period
+        if epoch > 5 and val_acc < 0.5:
+            print("Early stop: Poor performance")
+            break
+
+        # Stop when accuracy no longer beats the best so far by at least 0.01
+        if epoch > 10 and val_acc < best_acc + 0.01:
+            print("Early stop: No improvement")
+            break
+
+        best_acc = max(best_acc, val_acc)
+```
+
+## Training Function
+
+### Basic Template
+
+```python
+def train():
+    # Initialize W&B run
+    run = wandb.init()
+
+    # Get hyperparameters
+    config = wandb.config
+
+    # Build model with config
+    model = build_model(
+        hidden_size=config.hidden_size,
+        num_layers=config.num_layers,
+        dropout=config.dropout
+    )
+
+    # Create optimizer
+    optimizer = create_optimizer(
+        model.parameters(),
+        name=config.optimizer,
+        lr=config.learning_rate,
+        weight_decay=config.weight_decay
+    )
+
+    # Training loop
+    for epoch in range(config.epochs):
+        # Train
+        train_loss, train_acc = train_epoch(
+            model, optimizer, train_loader, config.batch_size
+        )
+
+        # Validate
+        val_loss, val_acc = validate(model, val_loader)
+
+        # Log metrics
+        wandb.log({
+            'train/loss': train_loss,
+            'train/accuracy': train_acc,
+            'val/loss': val_loss,
+            'val/accuracy': val_acc,
+            'epoch': epoch
+        })
+
+    # Log final model
+    torch.save(model.state_dict(), 'model.pth')
+    wandb.save('model.pth')
+
+    # Finish run
+    wandb.finish()
+```
+
+### With PyTorch
+
+```python
+import torch
+import torch.nn as nn
+from torch.utils.data import DataLoader
+import wandb
+
+def train():
+    run = wandb.init()
+    config = wandb.config
+
+    # Data
+    train_loader = DataLoader(
+        train_dataset,
+        batch_size=config.batch_size,
+        shuffle=True
+    )
+
+    # Model
+    model = ResNet(
+        num_classes=config.num_classes,
+        dropout=config.dropout
+    ).to(device)
+
+    # Optimizer
+    if config.optimizer == 'adam':
+        optimizer = torch.optim.Adam(
+            model.parameters(),
+            lr=config.learning_rate,
+            weight_decay=config.weight_decay
+        )
+    elif config.optimizer == 'sgd':
+        optimizer = torch.optim.SGD(
+            model.parameters(),
+            lr=config.learning_rate,
+            momentum=config.momentum,
+            weight_decay=config.weight_decay
+        )
+
+    # Scheduler
+    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
+        optimizer, T_max=config.epochs
+    )
+
+    # Training
+    for epoch in range(config.epochs):
+        model.train()
+        train_loss = 0.0
+
+        for data, target in train_loader:
+            data, target = data.to(device), target.to(device)
+
+            optimizer.zero_grad()
+            output = model(data)
+            loss = nn.CrossEntropyLoss()(output, target)
+            loss.backward()
+            optimizer.step()
+
+            train_loss += loss.item()
+
+        # Validation
+        model.eval()
+        val_loss, val_acc = validate(model, val_loader)
+
+        # Step scheduler
+        scheduler.step()
+
+        # Log
+        wandb.log({
+            'train/loss': train_loss / len(train_loader),
+            'val/loss': val_loss,
+            'val/accuracy': val_acc,
+            'learning_rate': scheduler.get_last_lr()[0],
+            'epoch': epoch
+        })
+```
+
+## Parallel Execution
+
+### Multiple Agents
+
+Run sweep agents in parallel to speed up search.
+
+```python
+# Initialize sweep once
+sweep_id = wandb.sweep(sweep_config, project="my-project")
+
+# Run multiple agents in parallel
+# Agent 1 (Terminal 1)
+wandb.agent(sweep_id, function=train, count=20)
+
+# Agent 2 (Terminal 2)
+wandb.agent(sweep_id, function=train, count=20)
+
+# Agent 3 (Terminal 3)
+wandb.agent(sweep_id, function=train, count=20)
+
+# Total: 60 runs across 3 agents
+```
+
+### Multi-GPU Execution
+
+```python
+import torch
+import wandb
+
+def train():
+    run = wandb.init()
+    config = wandb.config
+
+    # CUDA_VISIBLE_DEVICES restricts this process to one GPU, which PyTorch
+    # always exposes as cuda:0 regardless of the physical index
+    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
+    model = create_model().to(device)
+
+    # ... rest of training ...
+
+# Run agents on different GPUs
+# Terminal 1
+# CUDA_VISIBLE_DEVICES=0 wandb agent sweep_id
+
+# Terminal 2
+# CUDA_VISIBLE_DEVICES=1 wandb agent sweep_id
+
+# Terminal 3
+# CUDA_VISIBLE_DEVICES=2 wandb agent sweep_id
+```
+
+## Advanced Patterns
+
+### Nested Parameters
+
+```python
+sweep_config = {
+    'method': 'bayes',
+    'metric': {'name': 'val/accuracy', 'goal': 'maximize'},
+    'parameters': {
+        'model': {
+            'parameters': {
+                'type': {
+                    'values': ['resnet', 'efficientnet']
+                },
+                'size': {
+                    'values': ['small', 'medium', 'large']
+                }
+            }
+        },
+        'optimizer': {
+            'parameters': {
+                'type': {
+                    'values': ['adam', 'sgd']
+                },
+                'lr': {
+                    'distribution': 'log_uniform_values',
+                    'min': 1e-5,
+                    'max': 1e-1
+                }
+            }
+        }
+    }
+}
+
+# Access nested config (nested values are plain dictionaries)
+def train():
+    run = wandb.init()
+    model_type = wandb.config['model']['type']
+    model_size = wandb.config['model']['size']
+    opt_type = wandb.config['optimizer']['type']
+    lr = wandb.config['optimizer']['lr']
+```
+
+### Conditional Parameters
+
+```python
+sweep_config = {
+    'method': 'bayes',
+    'metric': {'name': 'val/loss', 'goal': 'minimize'},  # Bayesian search needs a metric
+    'parameters': {
+        'optimizer': {
+            'values': ['adam', 'sgd']
+        },
+        'learning_rate': {
+            'distribution': 'log_uniform_values',
+            'min': 1e-5,
+            'max': 1e-1
+        },
+        # Only used if optimizer == 'sgd'
+        'momentum': {
+            'distribution': 'uniform',
+            'min': 0.5,
+            'max': 0.99
+        }
+    }
+}
+
+def train():
+    run = wandb.init()
+    config = wandb.config
+
+    if config.optimizer == 'adam':
+        optimizer = torch.optim.Adam(
+            model.parameters(),
+            lr=config.learning_rate
+        )
+    elif config.optimizer == 'sgd':
+        optimizer = torch.optim.SGD(
+            model.parameters(),
+            lr=config.learning_rate,
+            momentum=config.momentum  # Conditional parameter
+        )
+```
+
+## Real-World Examples
+
+### Image Classification
+
+```python
+sweep_config = {
+    'method': 'bayes',
+    'metric': {
+        'name': 'val/top1_accuracy',
+        'goal': 'maximize'
+    },
+    'parameters': {
+        # Model
+        'architecture': {
+            'values': ['resnet50', 'resnet101', 'efficientnet_b0', 'efficientnet_b3']
+        },
+        'pretrained': {
+            'values': [True, False]
+        },
+
+        # Training
+        'learning_rate': {
+            'distribution': 'log_uniform_values',
+            'min': 1e-5,
+            'max': 1e-2
+        },
+        'batch_size': {
+            'values': [16, 32, 64, 128]
+        },
+        'optimizer': {
+            'values': ['adam', 'sgd', 'adamw']
+        },
+        'weight_decay': {
+            'distribution': 'log_uniform_values',
+            'min': 1e-6,
+            'max': 1e-2
+        },
+
+        # Regularization
+        'dropout': {
+            'distribution': 'uniform',
+            'min': 0.0,
+            'max': 0.5
+        },
+        'label_smoothing': {
+            'distribution': 'uniform',
+            'min': 0.0,
+            'max': 0.2
+        },
+
+        # Data augmentation
+        'mixup_alpha': {
+            'distribution': 'uniform',
+            'min': 0.0,
+            'max': 1.0
+        },
+        'cutmix_alpha': {
+            'distribution': 'uniform',
+            'min': 0.0,
+            'max': 1.0
+        }
+    },
+    'early_terminate': {
+        'type': 'hyperband',
+        'min_iter': 5
+    }
+}
+```
+
+### NLP Fine-Tuning
+
+```python
+sweep_config = {
+    'method': 'bayes',
+    'metric': {'name': 'eval/f1', 'goal': 'maximize'},
+    'parameters': {
+        # Model
+        'model_name': {
+            'values': ['bert-base-uncased', 'roberta-base', 'distilbert-base-uncased']
+        },
+
+        # Training
+        'learning_rate': {
+            'distribution': 'log_uniform_values',
+            'min': 1e-6,
+            'max': 1e-4
+        },
+        'per_device_train_batch_size': {
+            'values': [8, 16, 32]
+        },
+        'num_train_epochs': {
+            'values': [3, 4, 5]
+        },
+        'warmup_ratio': {
+            'distribution': 'uniform',
+            'min': 0.0,
+            'max': 0.1
+        },
+        'weight_decay': {
+            'distribution': 'log_uniform_values',
+            'min': 1e-4,
+            'max': 1e-1
+        },
+
+        # Optimizer
+        'adam_beta1': {
+            'distribution': 'uniform',
+            'min': 0.8,
+            'max': 0.95
+        },
+        'adam_beta2': {
+            'distribution': 'uniform',
+            'min': 0.95,
+            'max': 0.999
+        }
+    }
+}
+```
+
+## Best Practices
+
+### 1. Start Small
+
+```python
+# Initial exploration: Random search, 20 runs
+sweep_config_v1 = {
+    'method': 'random',
+    'parameters': {...}
+}
+sweep_id_v1 = wandb.sweep(sweep_config_v1, project="my-project")
+wandb.agent(sweep_id_v1, train, count=20)
+
+# Refined search: Bayes, narrow ranges
+sweep_config_v2 = {
+    'method': 'bayes',
+    'metric': {'name': 'val/loss', 'goal': 'minimize'},
+    'parameters': {
+        'learning_rate': {
+            'min': 5e-5,  # Narrowed from the initial 1e-6 to 1e-4 range
+            'max': 1e-4
+        }
+    }
+}
+```
+
+### 2. Use Log Scales
+
+```python
+# ✅ Good: Log scale for learning rate
+'learning_rate': {
+    'distribution': 'log_uniform_values',
+    'min': 1e-6,
+    'max': 1e-2
+}
+
+# ❌ Bad: Linear scale
+'learning_rate': {
+    'distribution': 'uniform',
+    'min': 0.000001,
+    'max': 0.01
+}
+```
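+
+A quick way to see why the linear scale is a poor fit is to sample both ways and count how often small learning rates appear. The sketch below uses only the standard library; `sample_log_uniform` is an illustrative helper, not a W&B function, and the 1e-4 cut-off is simply the midpoint of the range in log space.
+
+```python
+import math
+import random
+
+def sample_log_uniform(low, high, rng):
+    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
+    return math.exp(rng.uniform(math.log(low), math.log(high)))
+
+rng = random.Random(0)  # Fixed seed so the sketch is repeatable
+low, high = 1e-6, 1e-2
+
+linear = [rng.uniform(low, high) for _ in range(10_000)]
+log_scaled = [sample_log_uniform(low, high, rng) for _ in range(10_000)]
+
+# Fraction of samples below 1e-4, i.e. in the lower two of the four decades
+frac_linear = sum(x < 1e-4 for x in linear) / len(linear)
+frac_log = sum(x < 1e-4 for x in log_scaled) / len(log_scaled)
+print(f"linear: {frac_linear:.3f}, log: {frac_log:.3f}")
+```
+
+Linear sampling puts roughly 1% of trials below 1e-4, while log-scale sampling puts about half of them there, so each order of magnitude gets a comparable share of the budget.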
+
+### 3. Set Reasonable Ranges
+
+```python
+# Base ranges on prior knowledge
+'learning_rate': {'min': 1e-5, 'max': 1e-3},  # Typical for Adam
+'batch_size': {'values': [16, 32, 64]},       # GPU memory limits
+'dropout': {'min': 0.1, 'max': 0.5}           # Too high hurts training
+```
+
+### 4. Monitor Resource Usage
+
+```python
+def train():
+    run = wandb.init()
+
+    # Log system metrics
+    wandb.log({
+        'system/gpu_memory_allocated': torch.cuda.memory_allocated(),
+        'system/gpu_memory_reserved': torch.cuda.memory_reserved()
+    })
+```
+
+### 5. Save Best Models
+
+```python
+def train():
+    run = wandb.init()
+    config = wandb.config
+    best_acc = 0.0
+
+    for epoch in range(config.epochs):
+        val_acc = validate(model)
+
+        if val_acc > best_acc:
+            best_acc = val_acc
+            # Save best checkpoint
+            torch.save(model.state_dict(), 'best_model.pth')
+            wandb.save('best_model.pth')
+```
+
+## Resources
+
+- **Sweeps Documentation**: https://docs.wandb.ai/guides/sweeps
+- **Configuration Reference**: https://docs.wandb.ai/guides/sweeps/configuration
+- **Examples**: https://github.com/wandb/examples/tree/master/examples/wandb-sweeps