skills/research/research-paper-writing/references/experiment-patterns.md

# Experiment Design Patterns

Patterns and best practices distilled from running research experiments at scale with the Hermes agent. These cover experiment infrastructure, evaluation protocols, monitoring, and failure recovery.

---

## Experiment Infrastructure

### Directory Structure

Organize experiments with a consistent structure:

```
workspace/
  experiments/
    run_main.py                # Core experiment runner
    run_baselines.py           # Baseline comparison
    run_ablation.py            # Ablation studies
    strategies.py              # Method implementations
    config.yaml                # Shared configuration
  results/
    <experiment_name>/
      <task_or_problem>/
        <strategy>/
          result.json          # Final metrics
          final_output.md      # Final output artifact
          history.json         # Full trajectory/log
          pass_01/             # Per-iteration artifacts (if iterative)
            intermediate.md
  analysis/
    analyze_results.py         # Statistical analysis
    compute_stats.py           # Significance tests
    make_charts.py             # Visualization
  paper/
    paper.tex                  # LaTeX source
    fig_*.pdf                  # Generated figures
```

### Script Design Principles

**1. Incremental Saving (Crash Recovery)**

Every experiment script should save results after each unit of work, and skip already-completed work on restart:

```python
import json, os
from pathlib import Path

def run_experiment(problems, strategies, output_dir):
    for problem in problems:
        for strategy in strategies:
            result_path = Path(output_dir) / problem["id"] / strategy / "result.json"
            if result_path.exists():
                print(f"Skipping {problem['id']}/{strategy} (already done)")
                continue
            
            # Run the experiment
            result = execute_strategy(problem, strategy)
            
            # Save immediately
            result_path.parent.mkdir(parents=True, exist_ok=True)
            with open(result_path, 'w') as f:
                json.dump(result, f, indent=2)
```

This pattern makes re-runs safe and efficient. If a process crashes at problem 47/150, restarting skips the first 46.

**2. Artifact Preservation**

Save all intermediate outputs, not just final results. This enables post-hoc analysis without re-running:

```python
def save_pass_artifacts(output_dir, pass_num, artifacts):
    """Save all artifacts from a single pass of an iterative method."""
    pass_dir = Path(output_dir) / f"pass_{pass_num:02d}"
    pass_dir.mkdir(parents=True, exist_ok=True)
    
    for name, content in artifacts.items():
        with open(pass_dir / f"{name}.md", 'w') as f:
            f.write(content)
```

**3. Configuration Management**

Use YAML configs for reproducibility:

```yaml
# config.yaml
model: anthropic/claude-sonnet-4-20250514
author_temperature: 0.8
judge_temperature: 0.3
max_tokens: 4096
num_judges: 3
max_passes: 15
convergence_k: 2
```

```python
import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)
```

**4. Separation of Concerns**

Keep generation, evaluation, and visualization in separate scripts:

| Script | Purpose |
|--------|---------|
| `run_experiment.py` | Core method execution |
| `run_baselines.py` | Baseline comparisons at same compute |
| `run_eval.py` | Blind evaluation / judge panels |
| `analyze_results.py` | Statistical analysis |
| `make_charts.py` | Figure generation |

This lets you re-run evaluation without re-running expensive generation, and regenerate figures without re-running analysis.

---

## Evaluation Protocols

### Blind Judge Panels (for Subjective Tasks)

When evaluating subjective outputs (writing, analysis, recommendations), use a blind judge panel:

```python
import random

def run_blind_evaluation(outputs: dict, task_prompt: str, num_judges: int = 7):
    """
    Run blind evaluation of multiple method outputs.
    
    Args:
        outputs: {"method_name": "output_text", ...}
        task_prompt: The original task description
        num_judges: Number of independent judge evaluations
    """
    rankings = []
    
    for judge_i in range(num_judges):
        # Randomize labels and presentation order per judge
        methods = list(outputs.keys())
        random.shuffle(methods)
        labels = {m: chr(65 + i) for i, m in enumerate(methods)}  # A, B, C...
        
        # Present to judge with randomized labels
        prompt = f"Task: {task_prompt}\n\n"
        for method in methods:
            prompt += f"--- Proposal {labels[method]} ---\n{outputs[method]}\n\n"
        prompt += "Rank all proposals from best to worst. Format: RANKING: [best], [second], [worst]"
        
        ranking = call_judge(prompt)
        rankings.append({"labels": labels, "ranking": ranking})
    
    # Aggregate via Borda count
    return compute_borda(rankings)

def compute_borda(rankings, n_methods=3):
    """Borda count: 3/2/1 points for 1st/2nd/3rd."""
    scores = {}
    points = {0: n_methods, 1: n_methods - 1, 2: n_methods - 2}  # Adjust for n_methods
    
    for r in rankings:
        for position, method in enumerate(r["ranking"]):
            scores[method] = scores.get(method, 0) + points.get(position, 0)
    
    return scores
```

Key design decisions:
- **Randomize both labels AND order** per judge to prevent position bias
- **Use odd number of judges** (3, 5, 7) to break ties
- **Conservative tiebreak**: Incumbent/baseline wins ties (prevents false positives)
- **CoT judges** match non-CoT quality at ~40% cost (1 CoT judge ≈ 3 standard judges)

### Code/Objective Evaluation

For tasks with ground-truth evaluation (code, math, factual):

```python
import subprocess

def evaluate_code(solution: str, test_cases: list, timeout: int = 30):
    """Run code solution against test cases with sandboxed execution."""
    results = {"public": [], "private": []}
    
    for test in test_cases:
        try:
            proc = subprocess.run(
                ["python3", "-c", solution],
                input=test["input"],
                capture_output=True,
                timeout=timeout,
                text=True
            )
            actual = proc.stdout.strip()
            expected = test["expected"].strip()
            passed = actual == expected
        except subprocess.TimeoutExpired:
            passed = False
        
        category = "public" if test.get("public") else "private"
        results[category].append(passed)
    
    return {
        "public_pass_rate": sum(results["public"]) / max(len(results["public"]), 1),
        "private_pass_rate": sum(results["private"]) / max(len(results["private"]), 1),
    }
```

### Compute-Matched Comparison

Always compare methods at equal compute budget. If your method uses N API calls, baselines get N calls too:

| Method | Call Budget | Allocation |
|--------|-----------|------------|
| Single pass | 6 calls | 6 independent generations |
| Critique & revise | 6 calls | 1 generate + 5 revise rounds |
| Autoreason | 6 calls | 1 generate + 1 analysis + 4 revisions |
| Best-of-N | 6 calls | 6 independent, pick best on public test |

### Human Evaluation Design

Many ML/NLP papers require human evaluation, especially for subjective tasks (text generation, summarization, dialogue, creative writing). Poorly designed human evals are a common rejection reason.

#### When Human Evaluation Is Required

| Task Type | Required? | Notes |
|-----------|-----------|-------|
| Text generation (open-ended) | Yes | LLM-as-judge alone is insufficient for acceptance at ACL/EMNLP |
| Summarization | Usually | At minimum for a subset of outputs |
| Dialogue systems | Yes | User studies or annotation |
| Code generation | No | Test suites are objective ground truth |
| Classification | No | Standard metrics suffice |
| Any task with subjective quality | Strongly recommended | Strengthens the paper significantly |

#### Annotation Protocol Design

```
Human Evaluation Protocol:
1. Define the evaluation dimensions (fluency, relevance, factual accuracy, etc.)
2. Create annotation guidelines with examples of each score level
3. Run a pilot with 2-3 annotators on 20-30 examples
4. Compute pilot inter-annotator agreement — if low, revise guidelines
5. Run full evaluation
6. Report: annotator count, agreement metrics, compensation, time per item
```

**Evaluation dimensions** (pick relevant subset):

| Dimension | Definition | Scale |
|-----------|-----------|-------|
| Fluency | Grammaticality and naturalness | 1-5 Likert |
| Relevance | Does it address the task? | 1-5 Likert |
| Factual accuracy | Are stated facts correct? | Binary or 1-5 |
| Coherence | Logical flow and consistency | 1-5 Likert |
| Informativeness | Does it provide useful information? | 1-5 Likert |
| Overall preference | Which output is better? | A/B/Tie (pairwise) |

**Pairwise comparison** (preferred over absolute scoring — more reliable):
- Present two outputs side-by-side (randomize left/right position)
- Ask: "Which is better? A / B / Tie"
- More discriminative and less susceptible to annotator calibration drift

#### Inter-Annotator Agreement

Always report agreement metrics. Without them, reviewers assume your annotations are unreliable.

```python
# Krippendorff's alpha (preferred — handles missing data, any scale)
# pip install krippendorffs-alpha
import krippendorff

# Ratings: rows = annotators, columns = items, values = scores
ratings = [
    [3, 4, 1, 2, 5, None, 3],  # Annotator 1
    [3, 5, 1, 3, 5, 2, 3],     # Annotator 2
    [4, 4, 2, 2, 4, 2, None],  # Annotator 3
]
alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.3f}")
# Interpretation: >0.80 good, 0.67-0.80 acceptable, <0.67 questionable
```

```python
# Cohen's kappa (for exactly 2 annotators, categorical data)
from sklearn.metrics import cohen_kappa_score

annotator_1 = [1, 2, 3, 1, 2, 3, 2]
annotator_2 = [1, 2, 2, 1, 3, 3, 2]
kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.3f}")
# Interpretation: >0.80 excellent, 0.60-0.80 substantial, 0.40-0.60 moderate
```

| Metric | When to Use | Annotators | Scale |
|--------|------------|-----------|-------|
| Krippendorff's alpha | Default choice | Any number | Any (ordinal, nominal, ratio) |
| Cohen's kappa | 2 annotators, categorical | Exactly 2 | Nominal/ordinal |
| Fleiss' kappa | 3+ annotators, categorical | 3+ | Nominal |
| Pearson/Spearman | Continuous scores | 2 | Interval/ratio |

#### Crowdsourcing Platforms

| Platform | Best For | Cost | Quality |
|----------|----------|------|---------|
| **Prolific** | Academic research, higher quality | $8-15/hr | High — academic participant pool |
| **MTurk** | Large-scale, fast turnaround | $2-10/hr | Variable — use qualifications |
| **Surge AI** | NLP-specific annotations | Premium | High — trained annotators |
| **Expert annotators** | Domain-specific (medical, legal) | Highest | Highest — but slow |

**Ethics requirements**:
- Report compensation rate (must be at minimum local minimum wage)
- Describe annotator demographics if relevant
- Obtain IRB/ethics approval if required by your institution
- ACL venues explicitly require compensation documentation

#### What to Report in the Paper

```
Human Evaluation Section Checklist:
- [ ] Number of annotators
- [ ] Annotator qualifications / recruitment method
- [ ] Number of items evaluated
- [ ] Evaluation dimensions with definitions
- [ ] Scale used (Likert, pairwise, binary)
- [ ] Inter-annotator agreement (Krippendorff's alpha or Cohen's kappa)
- [ ] Compensation rate
- [ ] Time per annotation item
- [ ] Whether annotators saw model identities (should be blind)
- [ ] Randomization of presentation order
```

---

## Statistical Analysis

### Required Tests

| Test | When to Use | Python |
|------|------------|--------|
| McNemar's test | Comparing two methods on same problems | `scipy.stats.binomtest` for small n |
| Two-proportion z-test | Comparing success rates | Custom or `statsmodels` |
| Fisher's exact test | Small sample pairwise comparison | `scipy.stats.fisher_exact` |
| Bootstrapped CI | Confidence intervals for any metric | Custom bootstrap |
| Cohen's h | Effect size for proportions | Manual calculation |

### Standard Analysis Script

```python
import numpy as np
from scipy import stats
from pathlib import Path
import json

def load_all_results(results_dir):
    """Load all results into a structured format."""
    results = {}
    for result_file in Path(results_dir).rglob("result.json"):
        parts = result_file.relative_to(results_dir).parts
        if len(parts) >= 3:
            experiment, task, strategy = parts[0], parts[1], parts[2]
            data = json.loads(result_file.read_text())
            results.setdefault(experiment, {}).setdefault(strategy, {})[task] = data
    return results

def pairwise_mcnemar(method_a_results, method_b_results):
    """McNemar's test for paired binary outcomes."""
    a_win_b_lose = sum(1 for a, b in zip(method_a_results, method_b_results) if a and not b)
    b_win_a_lose = sum(1 for a, b in zip(method_a_results, method_b_results) if b and not a)
    
    n = a_win_b_lose + b_win_a_lose
    if n < 25:
        # Use exact binomial for small samples
        result = stats.binomtest(a_win_b_lose, n, 0.5)
        p_value = result.pvalue
    else:
        # Chi-squared approximation
        chi2 = (abs(a_win_b_lose - b_win_a_lose) - 1)**2 / (a_win_b_lose + b_win_a_lose)
        p_value = 1 - stats.chi2.cdf(chi2, df=1)
    
    return {
        "a_wins": a_win_b_lose,
        "b_wins": b_win_a_lose,
        "n_discordant": n,
        "p_value": p_value,
        "significant": p_value < 0.05
    }

def bootstrap_ci(data, n_bootstrap=10000, ci=0.95):
    """Bootstrap confidence interval for mean."""
    means = []
    for _ in range(n_bootstrap):
        sample = np.random.choice(data, size=len(data), replace=True)
        means.append(np.mean(sample))
    lower = np.percentile(means, (1 - ci) / 2 * 100)
    upper = np.percentile(means, (1 + ci) / 2 * 100)
    return {"mean": np.mean(data), "ci_lower": lower, "ci_upper": upper}

def cohens_h(p1, p2):
    """Cohen's h effect size for two proportions."""
    return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))
```

### Reporting Standards

Always include in the paper:
- **Sample sizes**: n=X problems/tasks
- **Number of runs**: K independent runs if applicable
- **Error bars**: Specify standard deviation or standard error
- **Confidence intervals**: 95% CI for key results
- **Significance tests**: p-values for key comparisons
- **Effect sizes**: Cohen's d or h for practical significance

---

## Monitoring (Cron Pattern)

### Cron Prompt Template

For each experiment batch, create a monitoring prompt:

```
Check the status of the [EXPERIMENT_NAME] experiment:

1. Process check: ps aux | grep [PROCESS_PATTERN]
2. Log check: tail -30 [LOG_FILE]
3. Results check: ls [RESULT_DIR]/eval/ (or appropriate result location)
4. If results are available:
   - Read the result JSON files
   - Report metrics in a table (Borda scores, accuracy, etc.)
   - Compute key comparisons between methods
5. If all experiments in this batch are complete:
   - git add -A && git commit -m "[COMMIT_MESSAGE]" && git push
   - Report final summary
6. Key question: [SPECIFIC ANALYTICAL QUESTION]

If nothing has changed since the last check, respond with [SILENT].
```

### Monitoring Best Practices

1. **Check processes first** — don't read results if the experiment is still running and results are incomplete
2. **Read the log tail** — look for errors, progress indicators, completion messages
3. **Count completed vs expected** — "45/150 problems done" is more useful than "some results exist"
4. **Report in structured tables** — always include key metrics in a table
5. **Answer the key question** — each experiment should have a specific analytical question to answer when done
6. **[SILENT] for no-news** — suppress notifications when nothing has changed
7. **Commit on completion** — every completed batch gets committed with a descriptive message

### Example Monitoring Report

```
## Code Experiments (Haiku 3.5) - COMPLETE

| Strategy | Pass Rate (150 problems) | vs Single |
|----------|------------------------|-----------|
| single_pass | 38.0% | — |
| critique_revise | 35.2% | -2.8pp |
| **autoreason** | **40.0%** | **+2.0pp** |
| best_of_6 | 31.0% | -7.0pp |

Key finding: Autoreason shows +2pp improvement over single pass, while 
best-of-6 collapses due to single-public-test selection issue.

Committed: `git commit -m "Add Haiku code results (150 problems, 4 strategies)"`
Next: Run significance tests on these results.
```

---

## Failure Recovery

### Common Failures and Recovery

| Failure | Detection | Recovery |
|---------|-----------|----------|
| **API credit exhaustion** | 402 errors in logs, incomplete results | Top up credits, re-run (skips completed work automatically) |
| **Rate limiting** | 429 errors, slow progress | Add retry logic with exponential backoff |
| **Process crash** | PID gone, log stops mid-problem | Re-run script (resumes from last checkpoint) |
| **Wrong model ID** | Model not found errors | Fix ID (e.g., `claude-opus-4-6` not `claude-opus-4.6`) |
| **Parallel slowdown** | Each experiment taking 2x longer | Reduce parallel experiments to 2-3 max |
| **Security scan blocks** | Commands blocked by security | Use `execute_code` instead of piped `terminal` commands |
| **Delegation failures** | `delegate_task` returns errors | Fall back to doing work directly |
| **Timeout on hard problems** | Process stuck, no log progress | Kill, skip problem, note in results |
| **Dataset path mismatch** | File not found errors | Verify paths before launching |

### Retry Naming Convention

When re-running failed experiments, use a suffix to track rounds:

```
logs/experiment_haiku_0_50.log       # Round 1
logs/experiment_haiku_0_50_r2.log    # Round 2 (after credit exhaustion)
logs/experiment_haiku_0_50_r3.log    # Round 3 (after bug fix)
```

### Pre-Flight Checklist

Before launching any experiment batch:

```
Pre-Flight:
- [ ] API credits sufficient for estimated calls
- [ ] Model IDs correct (test with 1 problem first)
- [ ] Output directory exists and is writable
- [ ] Resume logic works (re-run won't overwrite existing results)
- [ ] Log file path is unique (won't overwrite previous logs)
- [ ] Dataset/task files are accessible
- [ ] Config matches intended experiment
```

---

## Task/Benchmark Design

### Open-Ended Tasks (Subjective Evaluation)

Design tasks that have clear objectives but subjective quality:

```markdown
# Task: [Title]

## Context
[Specific scenario with concrete details: company size, constraints, timeline]

## Deliverable
[Exact format and structure required]

## Requirements
- [Specific, measurable requirements]
- [Not vague — "be comprehensive" is bad, "include exactly 6 sections" is good]
```

### Constrained Tasks (for Testing Scope Effects)

Constrained tasks test whether methods respect scope boundaries. Design with:

- **Fixed facts**: "Use only these N data points, add nothing else"
- **Fixed deliverable**: Specific format (pitch, postmortem, memo — not "improve this")
- **Fixed structure**: "These sections in this order, do not add/remove"
- **Fixed change items**: "Address exactly these N points, nothing else"

**Do NOT use word count as a scope constraint.** Word limits cause false convergence — outputs get rejected for length, not quality. Constrain scope (what to include) not length.

### Example: Good vs Bad Constraints

| Bad Constraint | Why | Good Constraint |
|---------------|-----|-----------------|
| "Max 500 words" | Judges reject for length | "Exactly 4 sections, each with 3 numbered items" |
| "Be concise" | Too vague | "Each prohibition must reference a specific base fact" |
| "Improve this" | Unbounded scope | "Write a 600-word incident postmortem with this exact structure" |
| "Make it better" | No clear criterion | "Address exactly these 3 reviewer concerns" |

---

## Visualization Best Practices

### Setup: SciencePlots + matplotlib

Install SciencePlots for publication-ready defaults:

```bash
pip install SciencePlots matplotlib numpy
```

**Option A: SciencePlots styles** (recommended — handles most defaults automatically):

```python
import matplotlib.pyplot as plt
import scienceplots  # registers the styles

# Pick a style:
# 'science'        — clean, serif fonts, suitable for most venues
# 'science+ieee'   — IEEE-style (good for two-column papers)
# 'science+nature' — Nature-style
# Add 'no-latex' if LaTeX is not installed on the machine generating plots

with plt.style.context(['science', 'no-latex']):
    fig, ax = plt.subplots(figsize=(3.5, 2.5))  # single-column width
    # ... plot ...
    fig.savefig('paper/fig_results.pdf', bbox_inches='tight')
```

**Option B: Manual rcParams** (when you need full control):

```python
import matplotlib.pyplot as plt

plt.rcParams.update({
    'font.size': 10,
    'font.family': 'serif',
    'axes.labelsize': 11,
    'axes.titlesize': 11,
    'xtick.labelsize': 9,
    'ytick.labelsize': 9,
    'legend.fontsize': 9,
    'figure.figsize': (3.5, 2.5),    # single-column default
    'figure.dpi': 300,
    'savefig.dpi': 300,
    'savefig.bbox': 'tight',
    'savefig.pad_inches': 0.05,
    'axes.linewidth': 0.8,
    'lines.linewidth': 1.5,
    'lines.markersize': 5,
    'axes.grid': True,
    'grid.alpha': 0.3,
    'grid.linewidth': 0.5,
})
```

### Standard Figure Sizes (Two-Column Format)

| Use Case | figsize | Notes |
|----------|---------|-------|
| Single column | `(3.5, 2.5)` | Fits in one column of two-column layout |
| Double column | `(7.0, 3.0)` | Spans full page width |
| Square (heatmap, confusion matrix) | `(3.5, 3.5)` | Single column |
| Tall single (many rows) | `(3.5, 5.0)` | Use sparingly |

### Colorblind-Safe Palette (Okabe-Ito)

Use this palette for all paper figures. It is distinguishable by people with all common forms of color vision deficiency:

```python
COLORS = {
    'blue':    '#0072B2',
    'orange':  '#E69F00',
    'green':   '#009E73',
    'red':     '#D55E00',
    'purple':  '#CC79A7',
    'cyan':    '#56B4E9',
    'yellow':  '#F0E442',
    'black':   '#000000',
}

# As a list for cycling:
COLOR_CYCLE = ['#0072B2', '#D55E00', '#009E73', '#E69F00', '#CC79A7', '#56B4E9']
```

Also differentiate lines by **marker and linestyle**, not just color:
```python
STYLES = [
    {'color': '#0072B2', 'marker': 'o', 'linestyle': '-'},
    {'color': '#D55E00', 'marker': 's', 'linestyle': '--'},
    {'color': '#009E73', 'marker': '^', 'linestyle': '-.'},
    {'color': '#E69F00', 'marker': 'D', 'linestyle': ':'},
]
```

### Complete Example: Method Comparison Bar Chart

```python
import matplotlib.pyplot as plt
import numpy as np

try:
    import scienceplots
    style = ['science', 'no-latex']
except ImportError:
    style = 'default'

with plt.style.context(style):
    methods = ['Single Pass', 'Critique+Revise', 'Best-of-N', 'Ours']
    scores = [73.2, 74.1, 68.5, 77.0]
    errors = [2.1, 1.8, 3.2, 1.5]
    colors = ['#56B4E9', '#E69F00', '#CC79A7', '#0072B2']
    
    fig, ax = plt.subplots(figsize=(3.5, 2.5))
    bars = ax.bar(methods, scores, yerr=errors, capsize=3,
                  color=colors, edgecolor='black', linewidth=0.5)
    
    # Highlight "Ours"
    bars[-1].set_edgecolor('#0072B2')
    bars[-1].set_linewidth(1.5)
    
    ax.set_ylabel('Pass Rate (%)')
    ax.set_ylim(60, 85)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    
    fig.savefig('paper/fig_comparison.pdf', bbox_inches='tight')
```

### Complete Example: Convergence/Trajectory Line Chart

```python
with plt.style.context(style):
    fig, ax = plt.subplots(figsize=(3.5, 2.5))
    
    passes = np.arange(1, 16)
    ours = [65, 72, 78, 82, 85, 87, 88, 89, 89.5, 90, 90, 90, 90, 90, 90]
    baseline = [65, 68, 70, 71, 69, 67, 66, 65, 64, 63, 62, 61, 60, 59, 58]
    
    ax.plot(passes, ours, **STYLES[0], label='Ours', markersize=4)
    ax.plot(passes, baseline, **STYLES[1], label='Critique+Revise', markersize=4)
    
    # Mark convergence point
    ax.axvline(x=10, color='gray', linestyle=':', alpha=0.5, linewidth=0.8)
    ax.annotate('Converged', xy=(10, 90), fontsize=8, ha='center',
                xytext=(10, 93), arrowprops=dict(arrowstyle='->', color='gray'))
    
    ax.set_xlabel('Iteration')
    ax.set_ylabel('Quality Score')
    ax.legend(loc='lower right')
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    
    fig.savefig('paper/fig_trajectory.pdf', bbox_inches='tight')
```

### Output Rules

- **Always save as PDF**: `fig.savefig('fig.pdf')` — vector graphics, sharp at any zoom
- **Never save as PNG** for paper figures — raster PNGs look blurry when printed/zoomed
- **Exception**: Screenshots, photographs, or pixel-art visualizations → PNG at 600 DPI
- **Verify grayscale**: Print to grayscale PDF and check all information is still visible

### Chart Types for Common Comparisons

| Comparison Type | Chart | Notes |
|----------------|-------|-------|
| Method vs method | Grouped bar chart | Include error bars |
| Across model sizes | Line chart with CI bands | Log scale for model size axis |
| Ablation study | Stacked/grouped bar | Highlight removed component |
| Trajectory/convergence | Line chart over iterations | Show winner per iteration |
| Per-task breakdown | Heatmap or grouped bar | Show variance across tasks |