# Experiment Design Patterns

Patterns and best practices distilled from running research experiments at scale with the Hermes agent. These cover experiment infrastructure, evaluation protocols, monitoring, and failure recovery.

---

## Experiment Infrastructure

### Directory Structure

Organize experiments with a consistent structure:

```
workspace/
  experiments/
    run_main.py               # Core experiment runner
    run_baselines.py          # Baseline comparison
    run_ablation.py           # Ablation studies
    strategies.py             # Method implementations
    config.yaml               # Shared configuration
  results/
    <experiment_name>/
      <task_or_problem>/
        <strategy>/
          result.json         # Final metrics
          final_output.md     # Final output artifact
          history.json        # Full trajectory/log
          pass_01/            # Per-iteration artifacts (if iterative)
            intermediate.md
  analysis/
    analyze_results.py        # Statistical analysis
    compute_stats.py          # Significance tests
    make_charts.py            # Visualization
  paper/
    paper.tex                 # LaTeX source
    fig_*.pdf                 # Generated figures
```

### Script Design Principles

**1. Incremental Saving (Crash Recovery)**

Every experiment script should save results after each unit of work and skip already-completed work on restart:

```python
import json
from pathlib import Path

def run_experiment(problems, strategies, output_dir):
    for problem in problems:
        for strategy in strategies:
            result_path = Path(output_dir) / problem["id"] / strategy / "result.json"
            if result_path.exists():
                print(f"Skipping {problem['id']}/{strategy} (already done)")
                continue

            # Run the experiment (execute_strategy is your method-specific entry point)
            result = execute_strategy(problem, strategy)

            # Save immediately
            result_path.parent.mkdir(parents=True, exist_ok=True)
            with open(result_path, "w") as f:
                json.dump(result, f, indent=2)
```

This pattern makes re-runs safe and efficient. If a process crashes at problem 47/150, restarting skips the 46 problems that already have a `result.json`.
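
The skip check treats any existing `result.json` as finished work, so a crash in the middle of a write could leave a truncated file that is never retried. One way to guard against that is an atomic write, sketched below. The helper name `save_result_atomic` is hypothetical and not part of the original scripts; it assumes the temp file and the target live on the same filesystem.

```python
import json
import os
import tempfile
from pathlib import Path

def save_result_atomic(result: dict, result_path) -> None:
    """Write JSON to a temp file in the target directory, then rename it into place."""
    result_path = Path(result_path)
    result_path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=result_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(result, f, indent=2)
        # os.replace is atomic when source and target are on the same filesystem
        os.replace(tmp_name, result_path)
    except BaseException:
        # Remove the partial temp file; result_path itself is never half-written
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
```

With this in place, `result_path.exists()` only becomes true once the full result has been written.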

**2. Artifact Preservation**

Save all intermediate outputs, not just final results. This enables post-hoc analysis without re-running:

```python
from pathlib import Path

def save_pass_artifacts(output_dir, pass_num, artifacts):
    """Save all artifacts from a single pass of an iterative method."""
    pass_dir = Path(output_dir) / f"pass_{pass_num:02d}"
    pass_dir.mkdir(parents=True, exist_ok=True)

    # artifacts maps a name (e.g. "intermediate") to its text content
    for name, content in artifacts.items():
        with open(pass_dir / f"{name}.md", "w") as f:
            f.write(content)
```

**3. Configuration Management**

Use YAML configs for reproducibility:

```yaml
# config.yaml
model: anthropic/claude-sonnet-4-20250514
author_temperature: 0.8
judge_temperature: 0.3
max_tokens: 4096
num_judges: 3
max_passes: 15
convergence_k: 2
```

```python
import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)
```
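
A config only helps reproducibility if you can tell later which config produced which results. A minimal sketch, assuming you are willing to store a copy next to the results: the function name `snapshot_config` and the file name `config_snapshot.json` are illustrative choices, not something the original scripts define.

```python
import hashlib
import json
from pathlib import Path

def snapshot_config(config: dict, output_dir) -> str:
    """Write the config next to the results and return a short content hash."""
    # sort_keys makes the serialization (and therefore the hash) order-independent
    canonical = json.dumps(config, sort_keys=True, indent=2)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "config_snapshot.json").write_text(canonical)
    return digest
```

Logging the returned hash at the start of each run makes it easy to spot results that were produced under different settings.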

**4. Separation of Concerns**

Keep generation, evaluation, and visualization in separate scripts:

| Script | Purpose |
|--------|---------|
| `run_experiment.py` | Core method execution |
| `run_baselines.py` | Baseline comparisons at same compute |
| `run_eval.py` | Blind evaluation / judge panels |
| `analyze_results.py` | Statistical analysis |
| `make_charts.py` | Figure generation |

This lets you re-run evaluation without re-running expensive generation, and regenerate figures without re-running analysis.

---

## Evaluation Protocols

### Blind Judge Panels (for Subjective Tasks)

When evaluating subjective outputs (writing, analysis, recommendations), use a blind judge panel:

```python
import random

def run_blind_evaluation(outputs: dict, task_prompt: str, num_judges: int = 7):
    """
    Run blind evaluation of multiple method outputs.

    Args:
        outputs: {"method_name": "output_text", ...}
        task_prompt: The original task description
        num_judges: Number of independent judge evaluations
    """
    rankings = []

    for judge_i in range(num_judges):
        # Randomize labels and presentation order per judge
        methods = list(outputs.keys())
        random.shuffle(methods)
        labels = {m: chr(65 + i) for i, m in enumerate(methods)}  # A, B, C...
        label_to_method = {label: m for m, label in labels.items()}

        # Present to judge with randomized labels
        prompt = f"Task: {task_prompt}\n\n"
        for method in methods:
            prompt += f"--- Proposal {labels[method]} ---\n{outputs[method]}\n\n"
        prompt += "Rank all proposals from best to worst. Format: RANKING: [best], [second], ..., [worst]"

        # call_judge is your LLM call; assumed to return labels best-first, e.g. ["B", "A", "C"]
        ranked_labels = call_judge(prompt)
        # Map labels back to method names so scores aggregate correctly across judges
        ranking = [label_to_method[label] for label in ranked_labels]
        rankings.append({"labels": labels, "ranking": ranking})

    # Aggregate via Borda count
    return compute_borda(rankings, n_methods=len(outputs))

def compute_borda(rankings, n_methods=3):
    """Borda count: n_methods points for 1st place down to 1 for last (3/2/1 for three methods)."""
    scores = {}

    for r in rankings:
        for position, method in enumerate(r["ranking"]):
            scores[method] = scores.get(method, 0) + (n_methods - position)

    return scores
```

Key design decisions:
- **Randomize both labels AND order** per judge to prevent position bias
- **Use an odd number of judges** (3, 5, 7) to break ties
- **Conservative tiebreak**: Incumbent/baseline wins ties (prevents false positives)
- **CoT judges** match non-CoT panel quality at ~40% of the cost (1 CoT judge ≈ 3 standard judges)
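
Judge responses are free text, so the `RANKING:` line has to be parsed and validated before it is scored. The sketch below is one hedged way to do that. The function name `parse_ranking` is hypothetical, and it assumes single-letter labels and a judge that follows the requested format; anything malformed is rejected rather than guessed at.

```python
import re

def parse_ranking(response: str, valid_labels: set) -> list:
    """Extract labels from a line like 'RANKING: B, A, C'; return [] if malformed."""
    match = re.search(r"RANKING:\s*(.+)", response)
    if not match:
        return []
    # Single capital letters on the ranking line, in the order they appear
    labels = re.findall(r"\b[A-Z]\b", match.group(1))
    # Reject rankings that repeat, omit, or invent labels
    if len(labels) != len(valid_labels) or set(labels) != set(valid_labels):
        return []
    return labels
```

An empty list signals that the judge call should be retried or discarded, which keeps malformed rankings out of the Borda totals.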

### Code/Objective Evaluation

For tasks with ground-truth evaluation (code, math, factual):

```python
import subprocess

def evaluate_code(solution: str, test_cases: list, timeout: int = 30):
    """Run a code solution against test cases in a subprocess with a timeout.

    Note: a subprocess alone is not a security sandbox; run untrusted code
    inside a container or VM.
    """
    results = {"public": [], "private": []}

    for test in test_cases:
        try:
            proc = subprocess.run(
                ["python3", "-c", solution],
                input=test["input"],
                capture_output=True,
                timeout=timeout,
                text=True,
            )
            actual = proc.stdout.strip()
            expected = test["expected"].strip()
            passed = actual == expected
        except subprocess.TimeoutExpired:
            passed = False

        category = "public" if test.get("public") else "private"
        results[category].append(passed)

    return {
        "public_pass_rate": sum(results["public"]) / max(len(results["public"]), 1),
        "private_pass_rate": sum(results["private"]) / max(len(results["private"]), 1),
    }
```

### Compute-Matched Comparison

Always compare methods at equal compute budget. If your method uses N API calls, baselines get N calls too:

| Method | Call Budget | Allocation |
|--------|-----------|------------|
| Single pass | 6 calls | 6 independent generations |
| Critique & revise | 6 calls | 1 generate + 5 revise rounds |
| Autoreason | 6 calls | 1 generate + 1 analysis + 4 revisions |
| Best-of-N | 6 calls | 6 independent, pick best on public test |
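
Budgets like these are easier to trust when they are enforced in code instead of by convention. The following is a small sketch of one way to do that. `CallBudget` and `critique_and_revise` are hypothetical names, and `generate` / `revise` stand in for whatever functions wrap your API calls.

```python
class CallBudget:
    """Counts API calls for one method on one problem and refuses to exceed the limit."""

    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0

    def spend(self, n: int = 1) -> None:
        if self.used + n > self.limit:
            raise RuntimeError(f"Call budget exceeded: {self.used + n} > {self.limit}")
        self.used += n

    @property
    def remaining(self) -> int:
        return self.limit - self.used


def critique_and_revise(generate, revise, budget: CallBudget) -> str:
    """One generation, then spend whatever budget is left on revision rounds."""
    budget.spend()
    draft = generate()
    while budget.remaining > 0:
        budget.spend()
        draft = revise(draft)
    return draft
```

With `CallBudget(6)` this reproduces the "1 generate + 5 revise rounds" allocation from the table, and recording `budget.used` in `result.json` lets you verify after the fact that every method stayed within its budget.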

### Human Evaluation Design

Many ML/NLP papers require human evaluation, especially for subjective tasks (text generation, summarization, dialogue, creative writing). Poorly designed human evals are a common rejection reason.

#### When Human Evaluation Is Required

| Task Type | Required? | Notes |
|-----------|-----------|-------|
| Text generation (open-ended) | Yes | LLM-as-judge alone is often considered insufficient by reviewers at venues such as ACL/EMNLP |
| Summarization | Usually | At minimum for a subset of outputs |
| Dialogue systems | Yes | User studies or annotation |
| Code generation | No | Test suites are objective ground truth |
| Classification | No | Standard metrics suffice |
| Any task with subjective quality | Strongly recommended | Strengthens the paper significantly |

#### Annotation Protocol Design

```
Human Evaluation Protocol:
1. Define the evaluation dimensions (fluency, relevance, factual accuracy, etc.)
2. Create annotation guidelines with examples of each score level
3. Run a pilot with 2-3 annotators on 20-30 examples
4. Compute pilot inter-annotator agreement; if low, revise guidelines
5. Run full evaluation
6. Report: annotator count, agreement metrics, compensation, time per item
```

**Evaluation dimensions** (pick relevant subset):

| Dimension | Definition | Scale |
|-----------|-----------|-------|
| Fluency | Grammaticality and naturalness | 1-5 Likert |
| Relevance | Does it address the task? | 1-5 Likert |
| Factual accuracy | Are stated facts correct? | Binary or 1-5 |
| Coherence | Logical flow and consistency | 1-5 Likert |
| Informativeness | Does it provide useful information? | 1-5 Likert |
| Overall preference | Which output is better? | A/B/Tie (pairwise) |

**Pairwise comparison** (preferred over absolute scoring because it is more reliable):
- Present two outputs side-by-side (randomize left/right position)
- Ask: "Which is better? A / B / Tie"
- More discriminative and less susceptible to annotator calibration drift
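
The pairwise setup above can be sketched as follows. This is an illustration under stated assumptions, not a prescribed tool: the function names, the `"ours"` / `"baseline"` system names, and the item dictionary layout are all hypothetical. The point it shows is that the slot-to-system key is stored separately so votes can be un-blinded after annotation.

```python
import random
from collections import Counter

def make_pairwise_item(item_id, output_ours, output_baseline, rng):
    """Randomly assign the two systems to slots A and B; keep the key for un-blinding."""
    systems = [("ours", output_ours), ("baseline", output_baseline)]
    rng.shuffle(systems)
    return {
        "item_id": item_id,
        "A": systems[0][1],
        "B": systems[1][1],
        # The key is stored for analysis and must never be shown to annotators
        "key": {"A": systems[0][0], "B": systems[1][0]},
    }

def tally_preferences(items, votes):
    """votes maps item_id -> 'A', 'B', or 'Tie'; returns counts per system plus ties."""
    counts = Counter()
    for item in items:
        vote = votes[item["item_id"]]
        counts["tie" if vote == "Tie" else item["key"][vote]] += 1
    return dict(counts)
```

Passing a seeded `random.Random(seed)` as `rng` keeps the assignment reproducible, which helps when the annotation batch has to be regenerated.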

#### Inter-Annotator Agreement

Always report agreement metrics. Without them, reviewers assume your annotations are unreliable.

```python
# Krippendorff's alpha (preferred: handles missing data, any scale)
# pip install krippendorff
import krippendorff
import numpy as np

# Ratings: rows = annotators, columns = items, values = scores
# Missing ratings are marked with np.nan
ratings = [
    [3, 4, 1, 2, 5, np.nan, 3],  # Annotator 1
    [3, 5, 1, 3, 5, 2, 3],       # Annotator 2
    [4, 4, 2, 2, 4, 2, np.nan],  # Annotator 3
]
alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.3f}")
# Interpretation: >0.80 good, 0.67-0.80 acceptable, <0.67 questionable
```

```python
# Cohen's kappa (for exactly 2 annotators, categorical data)
from sklearn.metrics import cohen_kappa_score

annotator_1 = [1, 2, 3, 1, 2, 3, 2]
annotator_2 = [1, 2, 2, 1, 3, 3, 2]
kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.3f}")
# Interpretation: >0.80 excellent, 0.60-0.80 substantial, 0.40-0.60 moderate
```

| Metric | When to Use | Annotators | Scale |
|--------|------------|-----------|-------|
| Krippendorff's alpha | Default choice | Any number | Any (nominal, ordinal, interval, ratio) |
| Cohen's kappa | 2 annotators, categorical | Exactly 2 | Nominal (ordinal with weighted kappa) |
| Fleiss' kappa | 3+ annotators, categorical | 3+ | Nominal |
| Pearson/Spearman | Continuous scores | 2 | Interval/ratio |
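
Fleiss' kappa appears in the table without an example, so here is a dependency-free sketch of the standard formula. The function name and the input layout (one list of category labels per item) are assumptions made for illustration; it requires the same number of ratings on every item.

```python
from collections import Counter

def fleiss_kappa(ratings_per_item):
    """ratings_per_item: one list of category labels per item, all of equal length (>= 2)."""
    n_items = len(ratings_per_item)
    n_raters = len(ratings_per_item[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings_per_item):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for ratings in ratings_per_item:
        counts = Counter(ratings)
        category_totals.update(counts)
        # Fraction of rater pairs that agree on this item
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (n_raters * (n_raters - 1))

    p_observed = agreement_sum / n_items
    p_expected = sum((t / (n_items * n_raters)) ** 2 for t in category_totals.values())
    if p_expected == 1.0:
        return float("nan")  # only one category was ever used: kappa is undefined
    return (p_observed - p_expected) / (1 - p_expected)
```

For real analyses a maintained implementation is usually preferable; this version is mainly useful for checking that you understand what the number measures.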

#### Crowdsourcing Platforms

| Platform | Best For | Cost | Quality |
|----------|----------|------|---------|
| **Prolific** | Academic research, higher quality | $8-15/hr | High (academic participant pool) |
| **MTurk** | Large-scale, fast turnaround | $2-10/hr | Variable (use qualifications) |
| **Surge AI** | NLP-specific annotations | Premium | High (trained annotators) |
| **Expert annotators** | Domain-specific (medical, legal) | Highest | Highest, but slow |

**Ethics requirements**:
- Report the compensation rate (must be at least the local minimum wage)
- Describe annotator demographics if relevant
- Obtain IRB/ethics approval if required by your institution
- ACL venues explicitly require compensation documentation

#### What to Report in the Paper

```
Human Evaluation Section Checklist:
- [ ] Number of annotators
- [ ] Annotator qualifications / recruitment method
- [ ] Number of items evaluated
- [ ] Evaluation dimensions with definitions
- [ ] Scale used (Likert, pairwise, binary)
- [ ] Inter-annotator agreement (Krippendorff's alpha or Cohen's kappa)
- [ ] Compensation rate
- [ ] Time per annotation item
- [ ] Whether annotators saw model identities (should be blind)
- [ ] Randomization of presentation order
```

---

## Statistical Analysis

### Required Tests

| Test | When to Use | Python |
|------|------------|--------|
| McNemar's test | Comparing two methods on same problems | `scipy.stats.binomtest` for small n |
| Two-proportion z-test | Comparing success rates | Custom or `statsmodels.stats.proportion.proportions_ztest` |
| Fisher's exact test | Small sample pairwise comparison | `scipy.stats.fisher_exact` |
| Bootstrapped CI | Confidence intervals for any metric | Custom bootstrap or `scipy.stats.bootstrap` |
| Cohen's h | Effect size for proportions | Manual calculation |

### Standard Analysis Script

```python
import json
from pathlib import Path

import numpy as np
from scipy import stats

def load_all_results(results_dir):
    """Load all results into {experiment: {strategy: {task: data}}}."""
    results = {}
    for result_file in Path(results_dir).rglob("result.json"):
        # Expected layout: <experiment>/<task>/<strategy>/result.json
        parts = result_file.relative_to(results_dir).parts
        if len(parts) >= 4:
            experiment, task, strategy = parts[0], parts[1], parts[2]
            data = json.loads(result_file.read_text())
            results.setdefault(experiment, {}).setdefault(strategy, {})[task] = data
    return results

def pairwise_mcnemar(method_a_results, method_b_results):
    """McNemar's test for paired binary outcomes."""
    a_win_b_lose = sum(1 for a, b in zip(method_a_results, method_b_results) if a and not b)
    b_win_a_lose = sum(1 for a, b in zip(method_a_results, method_b_results) if b and not a)

    n = a_win_b_lose + b_win_a_lose
    if n == 0:
        # No discordant pairs: the two methods are indistinguishable on this data
        p_value = 1.0
    elif n < 25:
        # Use exact binomial for small samples
        p_value = stats.binomtest(a_win_b_lose, n, 0.5).pvalue
    else:
        # Chi-squared approximation with continuity correction
        chi2 = (abs(a_win_b_lose - b_win_a_lose) - 1) ** 2 / n
        p_value = stats.chi2.sf(chi2, df=1)

    return {
        "a_wins": a_win_b_lose,
        "b_wins": b_win_a_lose,
        "n_discordant": n,
        "p_value": p_value,
        "significant": p_value < 0.05,
    }

def bootstrap_ci(data, n_bootstrap=10000, ci=0.95):
    """Percentile bootstrap confidence interval for the mean."""
    means = []
    for _ in range(n_bootstrap):
        sample = np.random.choice(data, size=len(data), replace=True)
        means.append(np.mean(sample))
    lower = np.percentile(means, (1 - ci) / 2 * 100)
    upper = np.percentile(means, (1 + ci) / 2 * 100)
    return {"mean": np.mean(data), "ci_lower": lower, "ci_upper": upper}

def cohens_h(p1, p2):
    """Cohen's h effect size for two proportions."""
    return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))
```
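
The table lists the two-proportion z-test as "custom", so a minimal standard-library sketch follows. The function name is hypothetical. It assumes two independent samples and uses the pooled standard error; when both methods ran on the same problems, the paired McNemar test above is the more appropriate choice.

```python
import math

def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for independent proportions using the pooled standard error."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both rates are 0% or both are 100%: no evidence of a difference
        return {"z": 0.0, "p_value": 1.0}
    z = (p_a - p_b) / se
    # Two-sided tail probability of the standard normal: 2 * (1 - Phi(|z|))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return {"z": z, "p_value": p_value}
```

For example, `two_proportion_ztest(60, 150, 57, 150)` compares a 40.0% and a 38.0% pass rate on 150 problems each.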

### Reporting Standards

Always include in the paper:
- **Sample sizes**: n=X problems/tasks
- **Number of runs**: K independent runs if applicable
- **Error bars**: Specify standard deviation or standard error
- **Confidence intervals**: 95% CI for key results
- **Significance tests**: p-values for key comparisons
- **Effect sizes**: Cohen's d or h for practical significance

---

## Monitoring (Cron Pattern)

### Cron Prompt Template

For each experiment batch, create a monitoring prompt:

```
Check the status of the [EXPERIMENT_NAME] experiment:

1. Process check: ps aux | grep [PROCESS_PATTERN]
2. Log check: tail -30 [LOG_FILE]
3. Results check: ls [RESULT_DIR]/eval/ (or appropriate result location)
4. If results are available:
   - Read the result JSON files
   - Report metrics in a table (Borda scores, accuracy, etc.)
   - Compute key comparisons between methods
5. If all experiments in this batch are complete:
   - git add -A && git commit -m "[COMMIT_MESSAGE]" && git push
   - Report final summary
6. Key question: [SPECIFIC ANALYTICAL QUESTION]

If nothing has changed since the last check, respond with [SILENT].
```

### Monitoring Best Practices

1. **Check processes first**: don't read results if the experiment is still running and results are incomplete
2. **Read the log tail**: look for errors, progress indicators, completion messages
3. **Count completed vs expected**: "45/150 problems done" is more useful than "some results exist"
4. **Report in structured tables**: always include key metrics in a table
5. **Answer the key question**: each experiment should have a specific analytical question to answer when done
6. **[SILENT] for no-news**: suppress notifications when nothing has changed
7. **Commit on completion**: every completed batch gets committed with a descriptive message
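
"Count completed vs expected" is easy to automate when results follow the `<problem>/<strategy>/result.json` layout used earlier. The sketch below assumes that layout under a single experiment directory; the function name `count_progress` is hypothetical.

```python
from pathlib import Path

def count_progress(results_dir, expected_problems: int, strategies: list) -> dict:
    """Count result.json files per strategy under <results_dir>/<problem>/<strategy>/."""
    root = Path(results_dir)
    progress = {}
    for strategy in strategies:
        done = sum(1 for _ in root.glob(f"*/{strategy}/result.json"))
        progress[strategy] = f"{done}/{expected_problems}"
    return progress
```

The returned strings drop straight into a monitoring report, and comparing them between two checks is one way to decide whether to respond with `[SILENT]`.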

### Example Monitoring Report

```
## Code Experiments (Haiku 3.5) - COMPLETE

| Strategy | Pass Rate (150 problems) | vs Single |
|----------|------------------------|-----------|
| single_pass | 38.0% | n/a |
| critique_revise | 35.2% | -2.8pp |
| **autoreason** | **40.0%** | **+2.0pp** |
| best_of_6 | 31.0% | -7.0pp |

Key finding: Autoreason shows +2pp improvement over single pass, while
best-of-6 collapses due to single-public-test selection issue.

Committed: `git commit -m "Add Haiku code results (150 problems, 4 strategies)"`
Next: Run significance tests on these results.
```

---

## Failure Recovery

### Common Failures and Recovery

| Failure | Detection | Recovery |
|---------|-----------|----------|
| **API credit exhaustion** | 402 errors in logs, incomplete results | Top up credits, re-run (skips completed work automatically) |
| **Rate limiting** | 429 errors, slow progress | Add retry logic with exponential backoff |
| **Process crash** | PID gone, log stops mid-problem | Re-run script (resumes from last checkpoint) |
| **Wrong model ID** | Model not found errors | Fix ID (e.g., `claude-opus-4-6` not `claude-opus-4.6`) |
| **Parallel slowdown** | Each experiment taking 2x longer | Reduce parallel experiments to 2-3 max |
| **Security scan blocks** | Commands blocked by security | Use `execute_code` instead of piped `terminal` commands |
| **Delegation failures** | `delegate_task` returns errors | Fall back to doing work directly |
| **Timeout on hard problems** | Process stuck, no log progress | Kill, skip problem, note in results |
| **Dataset path mismatch** | File not found errors | Verify paths before launching |
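
For the rate-limiting row, a retry wrapper with exponential backoff might look like the sketch below. Everything here is illustrative: `call_with_backoff` is a hypothetical helper, and `is_retryable` is a predicate you supply, since how a 429 surfaces depends on the client library you use.

```python
import random
import time

def call_with_backoff(fn, is_retryable, max_retries=5, base_delay=1.0,
                      max_delay=60.0, sleep=time.sleep):
    """Call fn(); on a retryable exception wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as exc:
            # Give up on the last attempt or on errors that retrying cannot fix
            if attempt == max_retries or not is_retryable(exc):
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter keeps parallel workers from retrying in lockstep
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` parameter exists so the wrapper can be tested without actually waiting. Credit exhaustion (402) should not be marked retryable, because waiting does not resolve it.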

### Retry Naming Convention

When re-running failed experiments, use a suffix to track rounds:

```
logs/experiment_haiku_0_50.log       # Round 1
logs/experiment_haiku_0_50_r2.log    # Round 2 (after credit exhaustion)
logs/experiment_haiku_0_50_r3.log    # Round 3 (after bug fix)
```
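
The naming convention above can be sketched as a small helper that picks the next unused log path. The function name `next_log_path` is hypothetical, and it assumes logs for one experiment share a directory and a base name.

```python
from pathlib import Path

def next_log_path(log_dir, name: str) -> Path:
    """Return <name>.log for the first round, then <name>_r2.log, <name>_r3.log, ..."""
    log_dir = Path(log_dir)
    candidate = log_dir / f"{name}.log"
    round_num = 1
    # Advance the round suffix until we find a path that does not exist yet
    while candidate.exists():
        round_num += 1
        candidate = log_dir / f"{name}_r{round_num}.log"
    return candidate
```

Using the returned path for every launch means a re-run can never overwrite the log of an earlier round.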

### Pre-Flight Checklist

Before launching any experiment batch:

```
Pre-Flight:
- [ ] API credits sufficient for estimated calls
- [ ] Model IDs correct (test with 1 problem first)
- [ ] Output directory exists and is writable
- [ ] Resume logic works (re-run won't overwrite existing results)
- [ ] Log file path is unique (won't overwrite previous logs)
- [ ] Dataset/task files are accessible
- [ ] Config matches intended experiment
```

---

## Task/Benchmark Design

### Open-Ended Tasks (Subjective Evaluation)

Design tasks that have clear objectives but subjective quality:

```markdown
# Task: [Title]

## Context
[Specific scenario with concrete details: company size, constraints, timeline]

## Deliverable
[Exact format and structure required]

## Requirements
- [Specific, measurable requirements]
- [Not vague: "be comprehensive" is bad, "include exactly 6 sections" is good]
```

### Constrained Tasks (for Testing Scope Effects)

Constrained tasks test whether methods respect scope boundaries. Design with:

- **Fixed facts**: "Use only these N data points, add nothing else"
- **Fixed deliverable**: Specific format (pitch, postmortem, memo, not "improve this")
- **Fixed structure**: "These sections in this order, do not add/remove"
- **Fixed change items**: "Address exactly these N points, nothing else"

**Do NOT use word count as a scope constraint.** Word limits cause false convergence: outputs get rejected for length, not quality. Constrain scope (what to include), not length.

### Example: Good vs Bad Constraints

| Bad Constraint | Why | Good Constraint |
|---------------|-----|-----------------|
| "Max 500 words" | Judges reject for length | "Exactly 4 sections, each with 3 numbered items" |
| "Be concise" | Too vague | "Each prohibition must reference a specific base fact" |
| "Improve this" | Unbounded scope | "Write a 600-word incident postmortem with this exact structure" |
| "Make it better" | No clear criterion | "Address exactly these 3 reviewer concerns" |
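
One benefit of structural constraints is that they can be checked mechanically before any judge sees the output. As a rough sketch, a checker for "Exactly 4 sections, each with 3 numbered items" could look like this. It assumes outputs are Markdown with `## ` section headings and `1.`-style numbered items; the function name `check_structure` is hypothetical.

```python
import re

def check_structure(markdown_text: str, n_sections: int = 4, items_per_section: int = 3) -> list:
    """Return a list of violations; an empty list means the output respects the structure."""
    # Split on level-2 headings; drop any preamble before the first heading
    chunks = re.split(r"^## +", markdown_text, flags=re.MULTILINE)[1:]
    problems = []
    if len(chunks) != n_sections:
        problems.append(f"expected {n_sections} sections, found {len(chunks)}")
    for chunk in chunks:
        lines = chunk.splitlines()
        title = lines[0].strip() if lines else ""
        n_items = len(re.findall(r"^\d+\. ", chunk, flags=re.MULTILINE))
        if n_items != items_per_section:
            problems.append(
                f"section '{title}': expected {items_per_section} numbered items, found {n_items}"
            )
    return problems
```

Outputs that fail the check can be flagged as scope violations separately from quality judgments, which keeps the two effects from being confounded.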

---

## Visualization Best Practices

### Setup: SciencePlots + matplotlib

Install SciencePlots for publication-ready defaults:

```bash
pip install SciencePlots matplotlib numpy
```

**Option A: SciencePlots styles** (recommended; handles most defaults automatically):

```python
import matplotlib.pyplot as plt
import scienceplots  # registers the styles

# Pick a style list:
#   ['science']            clean, serif fonts, suitable for most venues
#   ['science', 'ieee']    IEEE-style (good for two-column papers)
#   ['science', 'nature']  Nature-style
# Append 'no-latex' if LaTeX is not installed on the machine generating plots

with plt.style.context(['science', 'no-latex']):
    fig, ax = plt.subplots(figsize=(3.5, 2.5))  # single-column width
    # ... plot ...
    fig.savefig('paper/fig_results.pdf', bbox_inches='tight')
```

**Option B: Manual rcParams** (when you need full control):

```python
import matplotlib.pyplot as plt

plt.rcParams.update({
    'font.size': 10,
    'font.family': 'serif',
    'axes.labelsize': 11,
    'axes.titlesize': 11,
    'xtick.labelsize': 9,
    'ytick.labelsize': 9,
    'legend.fontsize': 9,
    'figure.figsize': (3.5, 2.5),  # single-column default
    'figure.dpi': 300,
    'savefig.dpi': 300,
    'savefig.bbox': 'tight',
    'savefig.pad_inches': 0.05,
    'axes.linewidth': 0.8,
    'lines.linewidth': 1.5,
    'lines.markersize': 5,
    'axes.grid': True,
    'grid.alpha': 0.3,
    'grid.linewidth': 0.5,
})
```

### Standard Figure Sizes (Two-Column Format)

| Use Case | figsize | Notes |
|----------|---------|-------|
| Single column | `(3.5, 2.5)` | Fits in one column of two-column layout |
| Double column | `(7.0, 3.0)` | Spans full page width |
| Square (heatmap, confusion matrix) | `(3.5, 3.5)` | Single column |
| Tall single (many rows) | `(3.5, 5.0)` | Use sparingly |

### Colorblind-Safe Palette (Okabe-Ito)

Use this palette for all paper figures. It is distinguishable by people with all common forms of color vision deficiency:

```python
COLORS = {
    'blue': '#0072B2',
    'orange': '#E69F00',
    'green': '#009E73',
    'red': '#D55E00',
    'purple': '#CC79A7',
    'cyan': '#56B4E9',
    'yellow': '#F0E442',
    'black': '#000000',
}

# As a list for cycling:
COLOR_CYCLE = ['#0072B2', '#D55E00', '#009E73', '#E69F00', '#CC79A7', '#56B4E9']
```

Also differentiate lines by **marker and linestyle**, not just color:

```python
STYLES = [
    {'color': '#0072B2', 'marker': 'o', 'linestyle': '-'},
    {'color': '#D55E00', 'marker': 's', 'linestyle': '--'},
    {'color': '#009E73', 'marker': '^', 'linestyle': '-.'},
    {'color': '#E69F00', 'marker': 'D', 'linestyle': ':'},
]
```

### Complete Example: Method Comparison Bar Chart

```python
import matplotlib.pyplot as plt
import numpy as np

try:
    import scienceplots  # registers the 'science' styles
    style = ['science', 'no-latex']
except ImportError:
    style = 'default'

with plt.style.context(style):
    methods = ['Single Pass', 'Critique+Revise', 'Best-of-N', 'Ours']
    scores = [73.2, 74.1, 68.5, 77.0]
    errors = [2.1, 1.8, 3.2, 1.5]
    colors = ['#56B4E9', '#E69F00', '#CC79A7', '#0072B2']

    fig, ax = plt.subplots(figsize=(3.5, 2.5))
    bars = ax.bar(methods, scores, yerr=errors, capsize=3,
                  color=colors, edgecolor='black', linewidth=0.5)

    # Highlight "Ours"
    bars[-1].set_edgecolor('#0072B2')
    bars[-1].set_linewidth(1.5)

    ax.set_ylabel('Pass Rate (%)')
    ax.set_ylim(60, 85)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

    fig.savefig('paper/fig_comparison.pdf', bbox_inches='tight')
```

### Complete Example: Convergence/Trajectory Line Chart

```python
# Reuses plt, np, and style from the bar chart example and STYLES from the palette section
with plt.style.context(style):
    fig, ax = plt.subplots(figsize=(3.5, 2.5))

    passes = np.arange(1, 16)
    ours = [65, 72, 78, 82, 85, 87, 88, 89, 89.5, 90, 90, 90, 90, 90, 90]
    baseline = [65, 68, 70, 71, 69, 67, 66, 65, 64, 63, 62, 61, 60, 59, 58]

    ax.plot(passes, ours, **STYLES[0], label='Ours', markersize=4)
    ax.plot(passes, baseline, **STYLES[1], label='Critique+Revise', markersize=4)

    # Mark convergence point
    ax.axvline(x=10, color='gray', linestyle=':', alpha=0.5, linewidth=0.8)
    ax.annotate('Converged', xy=(10, 90), fontsize=8, ha='center',
                xytext=(10, 93), arrowprops=dict(arrowstyle='->', color='gray'))

    ax.set_xlabel('Iteration')
    ax.set_ylabel('Quality Score')
    ax.legend(loc='lower right')
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

    fig.savefig('paper/fig_trajectory.pdf', bbox_inches='tight')
```

### Output Rules

- **Always save as PDF**: `fig.savefig('fig.pdf')` produces vector graphics that stay sharp at any zoom
- **Never save as PNG** for paper figures: raster PNGs look blurry when printed or zoomed
- **Exception**: Screenshots, photographs, or pixel-art visualizations → PNG at 600 DPI
- **Verify grayscale**: Print to grayscale PDF and check all information is still visible

### Chart Types for Common Comparisons

| Comparison Type | Chart | Notes |
|----------------|-------|-------|
| Method vs method | Grouped bar chart | Include error bars |
| Across model sizes | Line chart with CI bands | Log scale for model size axis |
| Ablation study | Stacked/grouped bar | Highlight removed component |
| Trajectory/convergence | Line chart over iterations | Show winner per iteration |
| Per-task breakdown | Heatmap or grouped bar | Show variance across tasks |