# Autoreason: Iterative Refinement Methodology

Complete reference for the autoreason iterative refinement method, derived from experimental results across subjective writing tasks, competitive programming, and four model tiers. Use this when any output (paper draft, experiment script, analysis, task definition) needs iterative improvement.

**Source**: [NousResearch/autoreason](https://github.com/NousResearch/autoreason), "Autoreason: When Iterative LLM Refinement Works and Why It Fails"

---

## Strategy Selection Guide

### Decision Tree

```
Is the task objectively verifiable (code, math, factual)?
├── YES → Does the model solve it on the first attempt?
│   ├── YES → Use single pass (no refinement needed)
│   └── NO → Use autoreason (structured analysis → reason-informed revision)
│
└── NO (subjective) → What model tier are you using?
    ├── Weak (Llama 8B, small models)
    │   → Single pass. Model too weak for refinement to help.
    │     Invest in generation quality, not iteration.
    │
    ├── Mid-tier (Haiku 3.5, Gemini Flash)
    │   → Autoreason with stronger judges. This is the sweet spot.
    │     Self-refinement DESTROYS weak model outputs; autoreason prevents this.
    │
    ├── Strong (Sonnet 4)
    │   → Autoreason for open-ended tasks. Wins 3/5.
    │     Critique-and-revise for concrete technical tasks (2/5).
    │
    └── Frontier (Sonnet 4.6, Opus)
        ├── Constrained scope? → Autoreason. Wins 2/3 constrained tasks.
        └── Unconstrained? → Critique-and-revise or single pass.
            Autoreason FAILS on unconstrained frontier tasks (finishes last).
```
### Strategy Comparison Table

| Strategy | Best For | Avoid When | Compute (per iteration) |
|----------|----------|------------|-------------------------|
| **Single pass** | Frontier models, template tasks, tight budgets | Mid-tier models, where the quality ceiling is low | 1 call |
| **Critique-and-revise** | Concrete technical requirements (system design, specifications) | Weak models (degrades output), unconstrained subjective tasks | 2 calls |
| **Autoreason** | Mid-tier models, constrained scope, tasks with genuine tradeoffs | Weak models (Llama 8B), frontier + unconstrained | ~6 calls |
| **Best-of-N** | Almost never recommended | Weak models especially; worse than single pass | N calls |

### Why Each Strategy Fails

| Strategy | Failure Mode | Mechanism |
|----------|--------------|-----------|
| **Single pass** | Quality ceiling | No mechanism to improve beyond the first attempt |
| **Critique-and-revise** | Progressive degradation | Model hallucinates problems (sycophancy), scope creeps each pass, never declines to change |
| **Best-of-N** | Random selection | Without a good ranking signal, more samples = more mediocre options |
| **Autoreason (unconstrained)** | Synthesis drift | Stronger models produce syntheses so consistently preferred that the incumbent never stabilizes |

---
## The Autoreason Loop

### Architecture

```
┌──────────────────────────────────────────────────────────┐
│                      ITERATION LOOP                      │
│                                                          │
│  Incumbent A ──► Critic ──► Author B ──► Synthesizer     │
│       │                        │              │          │
│       │                        │              ▼          │
│       │                        │            [AB]         │
│       ▼                        ▼              │          │
│      [A]                      [B]             │          │
│       │                        │              │          │
│       └────────────┬───────────┴──────────────┘          │
│                    ▼                                     │
│           Judge Panel (blind)                            │
│                    │                                     │
│                    ▼                                     │
│                 Winner                                   │
│                    │                                     │
│          ┌─────────┴─────────┐                           │
│          ▼                   ▼                           │
│     A wins k=2          B or AB wins                     │
│     consecutive?        → new incumbent                  │
│          │                                               │
│          ▼                                               │
│      CONVERGED                                           │
└──────────────────────────────────────────────────────────┘
```
### Roles

Every role is a **fresh, isolated agent** with no shared context:

| Role | Input | Output | Key Rule |
|------|-------|--------|----------|
| **Critic** | Task + Incumbent A | List of problems | Find problems ONLY. No fixes. No suggestions. |
| **Author B** | Task + A + Critique | Revised version B | Address each criticism. State which problem each change fixes. |
| **Synthesizer** | Task + X + Y (randomized labels) | Synthesis AB | Take the strongest elements of each. Not a compromise. |
| **Judge Panel** | Task + A, AB, B (randomized labels + order) | Ranking | Rank best to worst. No authorship stake. |
### Configuration

| Parameter | Value | Rationale |
|-----------|-------|-----------|
| **Convergence k** | 2 | k=1 is premature (94% displaced later). k=2 converges 100%, quality plateaus. k=3 fails 24%, 2x cost, no quality gain. |
| **Author temperature** | 0.7-0.8 | Encourages diverse revisions |
| **Judge temperature** | 0.3 | Encourages consistent evaluation |
| **In-loop judges** | 3 | Balances per-pass cost against evaluation stability |
| **Final evaluation judges** | 7 | Higher statistical power for the final comparison |
| **Max tokens** | 4096 | Standard; 8192 for long-form (papers) |
| **Judge type** | Chain-of-thought | 3x faster convergence on some tasks. Always use. |
| **Tiebreak** | Conservative (incumbent wins) | Prevents false positives; A must be genuinely beaten |
| **Max passes** | 25 (constrained), 50 (remedy) | Safety cap; most runs converge by pass 10-15 |
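The convergence rule (k=2 consecutive incumbent wins, conservative tiebreak, safety cap) can be sketched as a driver loop. This is an illustrative sketch, not code from the autoreason repository; `run_pass` stands in for one full critique → revise → synthesize → judge pass:

```python
def refine(incumbent, run_pass, k=2, max_passes=25):
    """Iterate until the incumbent survives k consecutive passes.

    run_pass(incumbent) executes one autoreason pass and returns the
    winning candidate text; ties are resolved in the incumbent's favor
    inside the pass (conservative tiebreak).
    """
    consecutive_wins = 0
    for _ in range(max_passes):
        winner = run_pass(incumbent)
        if winner == incumbent:
            consecutive_wins += 1        # A defended its position
            if consecutive_wins >= k:    # k=2: converged
                return incumbent, True
        else:
            incumbent = winner           # B or AB becomes the new incumbent
            consecutive_wins = 0         # displacement resets the streak
    return incumbent, False              # safety cap hit without convergence
```

With k=2 an incumbent must win two passes in a row, which is why k=1 stops too early and k=3 mostly burns compute.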
### Prompts

#### Critic

```
System: You are a critical reviewer. Your only job is to find real problems.
Be specific and concrete. Do not suggest fixes.

User: Find real problems with this proposal. Focus on:
- Things that won't work as described
- Complexity that doesn't pay for itself
- Assumptions that are wrong
- Missing pieces
Do NOT propose fixes. Just the problems.
```

#### Author B

```
System: You are a senior consultant revising a proposal based on specific
criticisms. Address each valid criticism directly. Do not make changes not
motivated by an identified problem.

User: [TASK] + [VERSION A] + [CRITIC OUTPUT]
Revise to address these problems. For each change, state which problem it fixes.
```

#### Synthesizer

```
System: You are given two versions as equal inputs. Take the strongest elements
from each and produce a coherent synthesis. This is not a compromise.

User: [TASK] + [VERSION X] + [VERSION Y]
(labels randomized — synthesizer doesn't know which is incumbent)
```

#### Judge (Chain-of-Thought) — ALWAYS USE THIS VERSION

```
System: You are an independent evaluator. Think carefully before deciding.

User: [TASK] + Three proposals. For each, think step by step:
1. What does it get right?
2. What does it get wrong or miss?
3. Are numbers and claims defensible?
4. Is detail appropriate or bloated?
After reasoning, rank all three.
RANKING: [best], [second], [worst]
```

#### Baseline Prompts (for comparison experiments)

| Baseline | Prompt |
|----------|--------|
| **Conservative** | "Make minimal improvements while preserving what works. Do not add new sections or significantly expand scope." |
| **Improve this** | "Improve this document." (no further guidance) |
| **Harsh critic** | "Critically evaluate and rewrite, fixing all weaknesses you identify." |
| **Critique & revise** | Step 1: "Produce a structured critique. List specific weaknesses." Step 2: "Revise to address each criticism." |

---
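Wired together, the prompts above make up one pass. The sketch below is illustrative only: `llm(system, user, temperature)` is a stand-in for any chat-completion wrapper (not an autoreason API), `rank` is an assumed judge helper, and every call is a fresh context:

```python
import random

def autoreason_pass(task, incumbent, llm, rank, n_judges=3):
    """One pass: Critic -> Author B -> Synthesizer -> blind Judge Panel.

    llm(system, user, temperature) -> str        # fresh, isolated agent per call
    rank(task, labeled_candidates) -> list[str]  # judge: label names best-to-worst
    """
    critique = llm("You are a critical reviewer. Find real problems only. "
                   "Do not suggest fixes.", f"{task}\n\n{incumbent}", 0.7)
    b = llm("Revise to address each criticism. State which problem each change "
            "fixes.", f"{task}\n\n{incumbent}\n\nCriticisms:\n{critique}", 0.8)
    pair = [incumbent, b]
    random.shuffle(pair)  # synthesizer must not know which version is the incumbent
    ab = llm("Take the strongest elements from each version and produce a "
             "coherent synthesis.", f"{task}\n\nX:\n{pair[0]}\n\nY:\n{pair[1]}", 0.7)
    points = {"A": 0, "B": 0, "AB": 0}
    for _ in range(n_judges):
        candidates = [("A", incumbent), ("B", b), ("AB", ab)]
        random.shuffle(candidates)                   # randomize order per judge
        for pts, name in zip((3, 2, 1), rank(task, candidates)):
            points[name] += pts                      # Borda: 3/2/1
    best = max(points, key=lambda n: (points[n], n == "A"))  # ties favor incumbent
    return {"A": incumbent, "B": b, "AB": ab}[best]
```

Note that label and order randomization happen per judge, and the tiebreak key prefers A whenever point totals tie.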
## Scoring: Borda Count

Judges rank candidates. Points are awarded by rank position:

| Rank | Points (3 candidates) |
|------|-----------------------|
| 1st | 3 |
| 2nd | 2 |
| 3rd | 1 |

**Aggregation**: Sum across all judges. Winner = highest total.
**Tiebreak**: Incumbent (A) wins any tie.

**Example** (3 judges):
- Judge 1: AB > A > B → AB gets 3, A gets 2, B gets 1
- Judge 2: A > AB > B → A gets 3, AB gets 2, B gets 1
- Judge 3: AB > B > A → AB gets 3, B gets 2, A gets 1
- Totals: AB=8, A=6, B=4 → AB wins, becomes the new incumbent

**Randomization per judge**:
- Candidate labels are randomized (A might be called "Proposal X" for one judge, "Proposal Z" for another)
- Presentation order is randomized (AB might appear first or last)
- This prevents position bias and label bias
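The worked example above can be reproduced in a few lines (an illustrative sketch; the function name is mine, not autoreason's):

```python
def borda_totals(rankings, points=(3, 2, 1)):
    """Sum Borda points across judges; each ranking lists candidates best-to-worst."""
    totals = {}
    for ranking in rankings:
        for pts, cand in zip(points, ranking):
            totals[cand] = totals.get(cand, 0) + pts
    return totals

rankings = [
    ["AB", "A", "B"],  # Judge 1: AB > A > B
    ["A", "AB", "B"],  # Judge 2: A > AB > B
    ["AB", "B", "A"],  # Judge 3: AB > B > A
]
totals = borda_totals(rankings)                            # {'AB': 8, 'A': 6, 'B': 4}
winner = max(totals, key=lambda c: (totals[c], c == "A"))  # ties favor incumbent A
```

The max key implements the conservative tiebreak: on equal totals, candidate A sorts above the others.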
---

## Model Selection Guide

### Empirical Results by Model Tier

| Model | Autoreason Wins | Autoreason Avg Borda | Best Baseline | Margin | Recommendation |
|-------|-----------------|----------------------|---------------|--------|----------------|
| **Llama 3.1 8B** | 1/3 | 23.7 | 25.0 (single) | -1.3 | Skip autoreason. Model too weak to generate diverse candidates. |
| **Gemini 2.0 Flash** | 2/3 | 25.0 | 20.0 (single) | +5.0 | Good candidate. Moderate gains. |
| **Haiku 3.5** | 3/3 | **42.0** | 33.7 (single) | **+8.3** | **Best candidate.** Perfect scores. Baselines actively destroy quality. |
| **Sonnet 4** | 3/5 | 27.8 | 22.4 (C&R) | +5.4 | Good candidate for open tasks. C&R better for technical tasks. |
| **Sonnet 4.6 (unconstrained)** | 0/1 | 7.0 | 31.0 (C&R) | -24.0 | Do NOT use autoreason without constraints. |
| **Sonnet 4.6 (constrained)** | 2/3 | 29.0 | 27.0 (improve) | +2.0 | Use only with scope constraints. |
### The Generation-Evaluation Gap

The core insight: **autoreason's value depends on the gap between a model's generation capability and its self-evaluation capability.**

```
Weak models (Llama 8B):
  Generation: Poor | Self-evaluation: Poor
  Gap: Small (both bad) → Autoreason can't help; no diverse candidates

Mid-tier models (Haiku, Flash):
  Generation: Decent | Self-evaluation: Poor
  Gap: LARGE → Autoreason's sweet spot. External eval bridges the gap.

Strong models (Sonnet 4):
  Generation: Good | Self-evaluation: Decent
  Gap: Moderate → Autoreason helps on 3/5 tasks

Frontier models (Sonnet 4.6):
  Generation: Excellent | Self-evaluation: Good
  Gap: Small → Simple methods suffice. Autoreason hurts on unconstrained tasks.
```

**Practical rule**: As model costs drop and capabilities improve, today's frontier becomes tomorrow's mid-tier. The generation-evaluation gap is structural, not temporary. Match the refinement architecture to the model's position on the capability curve.
### Judge Selection

| Author Model | Recommended Judge | Rationale |
|--------------|-------------------|-----------|
| Llama 8B | Don't use autoreason | Model too weak |
| Gemini Flash | Sonnet 4 | Cross-model evaluation works |
| Haiku 3.5 | Sonnet 4 | Strong external eval is the mechanism |
| Haiku 3.5 | Haiku 3.5 (same) | Still works — tournament structure provides value even without strong judges (20.7 vs 18.3 avg Borda) |
| Sonnet 4 | Sonnet 4 (same) | Same-model judges work at this tier |
| Sonnet 4.6 | Sonnet 4.6 (same) | Only with scope constraints |

---
## Scope Constraint Design

### What Makes Autoreason Work on Constrained Tasks

The same model (Sonnet 4.6) goes from **last place** (unconstrained) to **first place** (constrained) when scope constraints are added. The constraints bound the improvement space so synthesis drift can't accumulate.

### Effective Constraints

| Constraint Type | Example | Why It Works |
|-----------------|---------|--------------|
| **Fixed facts** | "Use only these 8 data points, add nothing else" | Bounds the information space |
| **Fixed deliverable** | "500-word startup pitch" (not "improve this") | Defines the done condition |
| **Fixed structure** | "Exactly 4 sections, each with 3 numbered items" | Prevents structural drift |
| **Fixed change items** | "Address exactly these 3 reviewer concerns" | Bounds the modification scope |

### Ineffective Constraints

| Constraint | Why It Fails | What Happens |
|-----------|--------------|--------------|
| Word count alone | Not a scope constraint | False convergence: candidates rejected for length, not quality |
| "Be concise" | Too vague | Ignored after 2-3 passes |
| "Be comprehensive" | Anti-constraint | Invites scope creep |
| No constraints at all | Unbounded improvement space | Synthesis dominates, no convergence |

### Task Categories

| Task Type | Autoreason Works? | Why |
|-----------|-------------------|-----|
| Tasks with genuine tradeoffs (strategy, policy) | Yes | Multiple valid approaches for the tournament to select between |
| Constrained writing (pitch, memo, postmortem) | Mostly (2/3) | Bounded scope, clear evaluation criteria |
| Template-filling (incident postmortem) | No | One correct structure, minimal decision space |
| Competitive programming | Yes | Naturally scoped; the test suite provides external verification |
| Open-ended unconstrained + frontier model | No | Synthesis drift, no convergence |

---
## Failure Taxonomy

| Failure Mode | Condition | Detection | Evidence |
|--------------|-----------|-----------|----------|
| **Self-correction unreliable** | No external evaluation signal | Baselines degrade below single pass | Haiku baselines: 16.3 avg vs 33.7 single pass |
| **Drift / synthesis dominance** | Unconstrained scope | A wins <15%, AB dominates | Sonnet 4.6 unconstrained: A wins 12%, AB wins 60%+ |
| **Overfitting to visible feedback** | Shallow revision loop (C&R) | High public/private divergence | C&R overfits 32% on hard code problems |
| **No convergence** | Broken judge pipeline | Parsing failures, <3 valid judges | Mixed panel parser failure: 11+ passes without convergence |
| **Model too weak** | Insufficient generation diversity | All candidates look similar | Llama 8B wins only 1/3 tasks |

### Recovery Patterns

| Failure | Recovery |
|---------|----------|
| No convergence (drift) | Add scope constraints to the task |
| No convergence (broken judges) | Fix the parser; ensure 3 valid judges before continuing |
| Quality degrades with iteration | Switch to single pass or add constraints |
| Model too weak | Use a stronger model for generation; keep the weak model for cheap roles |
| Overfitting (code) | Use a structured analysis step, not just test feedback |

---
## Code Domain Adaptation

The autoreason method adapts differently for code than for writing:

### Writing Domain
```
Call 1: Critic (find problems in incumbent)
Call 2: Author B (revise based on critique)
Call 3: Synthesizer (merge A and B)
Calls 4-6: Judge Panel (3 blind judges rank A, B, AB)
```

### Code Domain (6-call budget)
```
Call 1: Initial generation
Call 2: Structured analysis (5 points — NO CODE):
  - Problem analysis: what does the problem actually require?
  - Approach analysis: what approach did we use, and is it correct?
  - Failure analysis: why did the tests fail?
  - Alternative approaches: what else could work?
  - Edge cases: what inputs might break the solution?
Calls 3-6: Reason-informed revisions
  - Each revision must explain WHY it fixes the issue
  - Sees test results from public (visible) test cases only
```

**Key difference**: The code strategy replaces the judge panel with test-suite evaluation (objective ground truth). The structured analysis step (Call 2) is what drives recovery; it forces reasoning about *why* the approach failed before attempting fixes.

**Results**: Recovery is the mechanism. Among problems where both autoreason and single-pass failed initially, autoreason recovered on 62% vs single-pass's 43% (McNemar p=0.041, Cohen's h=0.32).
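The 6-call code loop above can be sketched as follows. This is a hedged illustration, not the repository's implementation: `llm` and `run_public_tests` are placeholder callables I have assumed:

```python
def code_autoreason(problem, llm, run_public_tests, budget=6):
    """Generate, analyze (no code), then make reason-informed revisions.

    llm(prompt) -> str                     # placeholder completion call
    run_public_tests(code) -> (bool, str)  # visible tests only; private stay hidden
    """
    solution = llm(f"Solve this problem:\n{problem}")                # call 1
    passed, failures = run_public_tests(solution)
    if passed:
        return solution
    analysis = llm(                                                  # call 2: 5 points, NO CODE
        f"Problem:\n{problem}\n\nAttempt:\n{solution}\n\nFailures:\n{failures}\n\n"
        "Without writing any code, analyze: (1) what the problem actually requires, "
        "(2) the approach used and whether it is correct, (3) why the tests failed, "
        "(4) alternative approaches, (5) edge cases that might break the solution.")
    for _ in range(budget - 2):                                      # calls 3-6
        solution = llm(
            f"Problem:\n{problem}\n\nAnalysis:\n{analysis}\n\nFailures:\n{failures}\n\n"
            "Revise the solution. For each change, explain WHY it fixes the issue.")
        passed, failures = run_public_tests(solution)
        if passed:
            return solution                                          # public tests pass
    return solution  # best effort once the call budget is spent
```

Running tests only on public cases inside the loop is what the overfitting warnings above are about; the analysis step is what keeps revisions reason-driven rather than feedback-chasing.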
---

## Applying Autoreason to Paper Writing

The paper itself was refined using autoreason (Section 8 of the paper):

### Setup
- Model: claude-opus-4
- Judges: 3 Opus judges
- Enhancement: ground-truth critic (access to the actual experimental data)
- Result: converged in 9 passes

### Key Findings for Paper Refinement

1. **Ground-truth critic is essential**: Without ground-truth access, Opus hallucinated a fabricated ablation study, fake confidence intervals, wrong model names, and incorrect role descriptions. With ground-truth access, the critic caught all four on pass 1.

2. **Judge panel integrity matters**: A broken parser in one judge (Gemini output format mismatch) reduced the panel from 3 to 2 judges. This prevented convergence for 11+ passes. After the parser was fixed to restore 3 working judges, the same incumbent converged in 2 passes. A broken judge doesn't add noise; it prevents equilibrium.

### Recommended Setup for Paper Refinement

```
Critic prompt: "You are reviewing a research paper draft. You have access to the
actual experimental results [GROUND TRUTH DATA]. Find factual errors, unsupported
claims, hallucinated results, and structural problems. Do not suggest fixes."

Author B prompt: "Revise this paper draft to fix the identified problems. For each
change, cite the specific problem it addresses. Do not add claims not supported by
the provided experimental data."

Judge prompt (CoT): "Compare three versions of this paper. For each, evaluate:
1. Factual accuracy against the provided results
2. Clarity of the narrative and contribution
3. Whether claims are properly hedged and supported
4. Writing quality (concision, precision, no filler)
After reasoning, rank all three. RANKING: [best], [second], [worst]"
```

### What to Provide as Ground Truth
- All experimental result JSON files
- Statistical test outputs
- Raw numbers for every table and figure
- Configuration files showing exact hyperparameters
- The code that generated the results (for method-description accuracy)
---

## Compute Budget Reference

| Method | Calls per Pass | Typical Passes | Total Calls | Relative Cost |
|--------|----------------|----------------|-------------|---------------|
| Single pass | 1 | 1 | 1 | 1x |
| Best-of-N | N | 1 | N | Nx |
| Critique & revise | 2 | 15 | 30 | 30x |
| Autoreason (in-loop) | ~6 | 10-15 | 60-90 | 60-90x |
| Autoreason (with final eval) | ~6 + 7 | 10-15 + 1 | 67-97 | ~80x |

**Cost-quality tradeoff**: Autoreason uses ~6x more compute per pass and typically runs more passes. This is a real tradeoff: the method trades compute for evaluation quality. On constrained tasks with mid-tier models, the tradeoff is strongly positive. On unconstrained tasks with frontier models, it's negative.

**CoT judges reduce cost**: One CoT judge provides evaluation quality comparable to 3 standard judges, at ~40% cost savings. Always use CoT judges.
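The table's totals follow from simple arithmetic; a quick sanity check (the function name is illustrative, not from the repository):

```python
def total_calls(calls_per_pass, passes, final_eval_judges=0):
    """Total LLM calls: per-pass cost times passes, plus any final judge panel."""
    return calls_per_pass * passes + final_eval_judges

assert total_calls(2, 15) == 30        # critique & revise at 15 passes
assert total_calls(6, 10) == 60        # autoreason in-loop, lower bound
assert total_calls(6, 15) == 90        # autoreason in-loop, upper bound
assert total_calls(6, 15, 7) == 97     # plus the 7-judge final evaluation
```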
# Conference Paper Checklists

This reference documents the mandatory checklist requirements for major ML/AI conferences. All major venues now require paper checklists; missing them results in desk rejection.

---

## Contents

- [NeurIPS Paper Checklist](#neurips-paper-checklist)
- [ICML Paper Checklist](#icml-paper-checklist)
- [ICLR Requirements](#iclr-requirements)
- [ACL Requirements](#acl-requirements)
- [AAAI Requirements](#aaai-requirements)
- [COLM Requirements](#colm-requirements)
- [Universal Pre-Submission Checklist](#universal-pre-submission-checklist)

---
## NeurIPS Paper Checklist

### Mandatory Components

All NeurIPS submissions must include a completed paper checklist. Papers lacking this element face **automatic desk rejection**. The checklist appears after references and supplemental material, outside the page limit.

### 16 Required Checklist Items

#### 1. Claims Alignment
Authors must verify that abstract and introduction claims match theoretical and experimental results, with clearly stated contributions, assumptions, and limitations.

**What to check:**
- [ ] Abstract claims match actual results
- [ ] Introduction doesn't overclaim
- [ ] Contributions are specific and falsifiable

#### 2. Limitations Discussion
Papers should include a dedicated "Limitations" section addressing strong assumptions, robustness to violations, scope constraints, and performance-influencing factors.

**What to include:**
- [ ] Dedicated Limitations section
- [ ] Honest assessment of scope
- [ ] Conditions where the method may fail

#### 3. Theory & Proofs
Theoretical contributions require full assumption statements and complete proofs (main paper or appendix, with proof sketches for intuition).

**What to check:**
- [ ] All assumptions stated formally
- [ ] Complete proofs provided (main text or appendix)
- [ ] Proof sketches for intuition in the main text

#### 4. Reproducibility
Authors must describe steps ensuring results verification through code release, detailed instructions, model access, or checkpoints appropriate to their contribution type.

**What to provide:**
- [ ] Clear reproducibility statement
- [ ] Code availability information
- [ ] Model checkpoints if applicable

#### 5. Data & Code Access
Instructions for reproducing the main experimental results should be provided (supplemental material or URLs), including exact commands and environment specifications.

**What to include:**
- [ ] Exact commands to run experiments
- [ ] Environment specifications (requirements.txt, conda env)
- [ ] Data access instructions

#### 6. Experimental Details
Papers must specify training details: data splits, hyperparameters, and selection methods, in the main paper or supplementary materials.

**What to document:**
- [ ] Train/val/test split details
- [ ] All hyperparameters used
- [ ] Hyperparameter selection method
#### 7. Statistical Significance
Results require error bars, confidence intervals, or statistical tests with clearly stated calculation methods and underlying assumptions.

**What to include:**
- [ ] Error bars or confidence intervals
- [ ] Number of runs/seeds
- [ ] Calculation method (std dev vs std error)

#### 8. Compute Resources
Specifications needed: compute worker types (CPU/GPU), memory, storage, execution time per run, and total project compute requirements.

**What to document:**
- [ ] GPU type and count
- [ ] Training time per run
- [ ] Total compute used

#### 9. Ethics Code Compliance
Authors confirm adherence to the NeurIPS Code of Ethics, noting any necessary deviations.

**What to verify:**
- [ ] Read NeurIPS Code of Ethics
- [ ] Confirm compliance
- [ ] Note any deviations with justification

#### 10. Broader Impacts
Discussion of potential negative societal applications, fairness concerns, privacy risks, and possible mitigation strategies when applicable.

**What to address:**
- [ ] Potential negative applications
- [ ] Fairness considerations
- [ ] Privacy implications
- [ ] Mitigation strategies

#### 11. Safeguards
High-risk models (language models, internet-scraped datasets) require controlled release mechanisms and usage guidelines.

**What to consider:**
- [ ] Release strategy for sensitive models
- [ ] Usage guidelines if needed
- [ ] Access controls if appropriate

#### 12. License Respect
All existing assets require creator citations, license names, URLs, version numbers, and terms-of-service acknowledgment.

**What to document:**
- [ ] Dataset licenses cited
- [ ] Code licenses respected
- [ ] Version numbers included

#### 13. Asset Documentation
New releases need structured templates documenting training details, limitations, consent procedures, and licensing information.

**For new datasets/models:**
- [ ] Datasheet or model card
- [ ] Training data documentation
- [ ] Known limitations
#### 14. Human Subjects
Crowdsourcing studies must include participant instructions, screenshots, compensation details, and comply with minimum wage requirements.

**What to include:**
- [ ] Task instructions
- [ ] Compensation details
- [ ] Time estimates

#### 15. IRB Approvals
Human subjects research requires documented institutional review board approval or equivalent, with risk descriptions disclosed (maintaining anonymity at submission).

**What to verify:**
- [ ] IRB approval obtained
- [ ] Risk assessment completed
- [ ] Anonymized at submission

#### 16. LLM Declaration
Usage of large language models as core methodology components requires disclosure; writing/editing use doesn't require declaration.

**What to disclose:**
- [ ] LLM used as core methodology component
- [ ] How the LLM was used
- [ ] (Writing assistance doesn't require disclosure)

### Response Format

Authors select "yes," "no," or "N/A" per question, with optional 1-2 sentence justifications.

**Important:** Reviewers are explicitly instructed not to penalize honest limitation acknowledgment.

---
## ICML Paper Checklist

### Broader Impact Statement

ICML requires a Broader Impact Statement at the end of the paper, before references. This does NOT count toward the page limit.

**Required elements:**
- Potential positive impacts
- Potential negative impacts
- Mitigation strategies
- Who may be affected

### ICML Specific Requirements

#### Reproducibility Checklist

- [ ] Data splits clearly specified
- [ ] Hyperparameters listed
- [ ] Search ranges documented
- [ ] Selection method explained
- [ ] Compute resources specified
- [ ] Code availability stated

#### Statistical Reporting

- [ ] Error bars on all figures
- [ ] Standard deviation vs standard error specified
- [ ] Number of runs stated
- [ ] Significance tests if comparing methods

#### Anonymization

- [ ] No author names in paper
- [ ] No acknowledgments
- [ ] No grant numbers
- [ ] Prior work cited in third person
- [ ] No identifiable repository URLs

---
## ICLR Requirements

### LLM Disclosure Policy (New for 2026)

ICLR has a specific LLM disclosure requirement:

> "If LLMs played a significant role in research ideation and/or writing to the extent that they could be regarded as a contributor, authors must describe their precise role in a separate appendix section."

**When disclosure is required:**
- LLM used for significant research ideation
- LLM used for substantial writing
- LLM could be considered a contributor

**When disclosure is NOT required:**
- Grammar checking
- Minor editing assistance
- Code completion tools

**Consequences of non-disclosure:**
- Desk rejection
- Potential post-publication issues

### ICLR Specific Requirements

#### Reproducibility Statement (Optional but Recommended)

Add a statement referencing:
- Supporting materials
- Code availability
- Data availability
- Model checkpoints

#### Ethics Statement (Optional)

Address potential concerns in ≤1 page. Does not count toward the page limit.

#### Reciprocal Reviewing

- Authors on 3+ papers must serve as reviewers for ≥6 papers
- Each submission needs ≥1 author registered to review ≥3 papers

---
## ACL Requirements

### Limitations Section (Mandatory)

ACL specifically requires a Limitations section:

**What to include:**
- Strong assumptions made
- Scope limitations
- When the method may fail
- Generalization concerns

**Important:** The Limitations section does NOT count toward the page limit.

### ACL Specific Checklist

#### Responsible NLP

- [ ] Bias considerations addressed
- [ ] Fairness evaluated if applicable
- [ ] Dual-use concerns discussed

#### Multilingual Considerations

If applicable:
- [ ] Language diversity addressed
- [ ] Non-English languages included
- [ ] Translation quality verified

#### Human Evaluation

If applicable:
- [ ] Annotator details provided
- [ ] Agreement metrics reported
- [ ] Compensation documented

---
## AAAI Requirements

### Formatting (Strictest of All Venues)

AAAI enforces formatting rules more strictly than any other major venue. Papers that deviate from the template are desk-rejected.

- [ ] Use the **exact** AAAI style file without modification — no `\setlength`, no `\vspace` hacks, no font overrides
- [ ] 7 pages main content (8 for camera-ready with author info)
- [ ] Two-column format, Times font (set by template)
- [ ] References and appendices do not count toward page limit
- [ ] Abstract must be a single paragraph
- [ ] Do not modify margins, column widths, or font sizes

### Required Sections

- [ ] Abstract (single paragraph, no math or citations)
- [ ] Introduction with clear contribution statement
- [ ] References in AAAI format (uses `aaai2026.bst`)
- [ ] Appendix (optional, unlimited)

### Ethics and Reproducibility

- [ ] Broader impact statement (encouraged but not always mandatory — check current year's CFP)
- [ ] Reproducibility details (datasets, code availability)
- [ ] Acknowledge use of AI writing tools if applicable

### Key Differences from Other Venues

- **No separate limitations section required** (unlike ACL), but discussing limitations is recommended
- **Strictest formatting enforcement** — the style checker will reject non-compliant PDFs
- **No paper checklist** like NeurIPS has, but the universal checklist below still applies
- **Unified template** covers main paper and supplementary in the same file

---

## COLM Requirements

### Overview

COLM (Conference on Language Modeling) focuses specifically on language model research. Framing must target this community.

### Formatting

- [ ] 9 pages main content (10 for camera-ready)
- [ ] Use COLM template (based on ICLR template with modifications)
- [ ] Double-blind review
- [ ] References and appendices unlimited

### Required Sections

- [ ] Abstract
- [ ] Introduction framed for language modeling community
- [ ] Conclusion
- [ ] References

### Content Expectations

- [ ] Contribution must be relevant to language models (broadly interpreted: training, evaluation, applications, theory, alignment, safety)
- [ ] If the method is general, frame with language model examples
- [ ] Baselines should include recent LM-specific methods where applicable

### Key Differences from Other Venues

- **Narrower scope** than NeurIPS/ICML — must frame for LM community
- **Template derived from ICLR** — similar formatting rules
- **Newer venue** — reviewer norms are still establishing; err on the side of thorough evaluation
- **No mandatory checklist** like NeurIPS, but broader impact discussion is expected
- **LLM disclosure**: If LLMs were used in research (code generation, data annotation, writing assistance), disclose this

---
## Universal Pre-Submission Checklist

### Before Every Submission

#### Paper Content

- [ ] Abstract ≤ word limit (usually 250-300 words)
- [ ] Main content within page limit
- [ ] References complete and verified
- [ ] Limitations section included
- [ ] All figures/tables have captions
- [ ] Captions are self-contained

#### Formatting

- [ ] Correct template used (venue + year specific)
- [ ] Margins not modified
- [ ] Font sizes not modified
- [ ] Double-blind requirements met
- [ ] Page numbers included (for review) or removed (camera-ready)

#### Technical

- [ ] All claims supported by evidence
- [ ] Error bars included
- [ ] Baselines appropriate
- [ ] Hyperparameters documented
- [ ] Compute resources stated

#### Reproducibility

- [ ] Code will be available (or justification given)
- [ ] Data will be available (or justification given)
- [ ] Environment documented
- [ ] Commands to reproduce provided

#### Ethics

- [ ] Broader impacts considered
- [ ] Limitations honestly stated
- [ ] Licenses respected
- [ ] IRB approval obtained if needed

#### Final Checks

- [ ] PDF compiles without errors
- [ ] All figures render correctly
- [ ] All citations resolve
- [ ] Supplementary material organized
- [ ] Conference checklist completed

---

## Quick Reference: Page Limits

| Conference | Main Content | References | Appendix |
|------------|-------------|------------|----------|
| NeurIPS 2025 | 9 pages | Unlimited | Unlimited (checklist separate) |
| ICML 2026 | 8 pages (+1 camera) | Unlimited | Unlimited |
| ICLR 2026 | 9 pages (+1 camera) | Unlimited | Unlimited |
| ACL 2025 | 8 pages (long) | Unlimited | Unlimited |
| AAAI 2026 | 7 pages (+1 camera) | Unlimited | Unlimited |
| COLM 2025 | 9 pages (+1 camera) | Unlimited | Unlimited |

---
## Template Locations

All conference templates are in the `templates/` directory:

```
templates/
├── icml2026/      # ICML 2026 official
├── iclr2026/      # ICLR 2026 official
├── neurips2025/   # NeurIPS 2025
├── acl/           # ACL style files
├── aaai2026/      # AAAI 2026
└── colm2025/      # COLM 2025
```
# Citation Management & Hallucination Prevention

This reference provides a complete workflow for managing citations programmatically, preventing AI-generated citation hallucinations, and maintaining clean bibliographies.

---

## Contents

- [Why Citation Verification Matters](#why-citation-verification-matters)
- [Citation APIs Overview](#citation-apis-overview)
- [Verified Citation Workflow](#verified-citation-workflow)
- [Python Implementation](#python-implementation)
- [BibTeX Management](#bibtex-management)
- [Common Citation Formats](#common-citation-formats)
- [Troubleshooting](#troubleshooting)

---
## Why Citation Verification Matters

### The Hallucination Problem

Research has documented significant issues with AI-generated citations:

- **~40% error rate** in AI-generated citations (Enago Academy research)
- NeurIPS 2025 found **100+ hallucinated citations** slipped through review
- Common errors include:
  - Fabricated paper titles with real author names
  - Wrong publication venues or years
  - Non-existent papers with plausible metadata
  - Incorrect DOIs or arXiv IDs

### Consequences

- Desk rejection at some venues
- Loss of credibility with reviewers
- Potential retraction if published
- Wasted time chasing non-existent sources

### Solution

**Never generate citations from memory—always verify programmatically.**

---
## Citation APIs Overview

### Primary APIs

| API | Coverage | Rate Limits | Best For |
|-----|----------|-------------|----------|
| **Semantic Scholar** | 214M papers | 1 RPS (free key) | ML/AI papers, citation graphs |
| **CrossRef** | 140M+ DOIs | Polite pool with mailto | DOI lookup, BibTeX retrieval |
| **arXiv** | Preprints | 3-second delays | ML preprints, PDF access |
| **OpenAlex** | 240M+ works | 100K/day, 10 RPS | Open alternative to MAG |

### API Selection Guide

```
Need ML paper search?        → Semantic Scholar
Have DOI, need BibTeX?       → CrossRef content negotiation
Looking for preprint?        → arXiv API
Need open data, bulk access? → OpenAlex
```
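
The "polite pool" note for CrossRef in the table above just means adding a `mailto` parameter to your queries (a sketch; the email address is a placeholder you must replace with your own):

```python
import requests

# Preparing the request shows the final URL without sending anything.
req = requests.Request(
    "GET",
    "https://api.crossref.org/works",
    params={
        "query.title": "Attention Is All You Need",
        "rows": 3,
        "mailto": "you@example.com",  # placeholder: identifies you to CrossRef
    },
).prepare()
print(req.url)

# To actually send it: requests.Session().send(req, timeout=10)
```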

### No Official Google Scholar API

Google Scholar has no official API. Scraping violates ToS. Use SerpApi ($75-275/month) only if Semantic Scholar coverage is insufficient.

---
## Verified Citation Workflow

### 5-Step Process

```
1. SEARCH   → Query Semantic Scholar with specific keywords
      ↓
2. VERIFY   → Confirm paper exists in 2+ sources
      ↓
3. RETRIEVE → Get BibTeX via DOI content negotiation
      ↓
4. VALIDATE → Confirm the claim appears in the source
      ↓
5. ADD      → Add verified entry to .bib file
```

### Step 1: Search

Use Semantic Scholar for ML/AI papers:

```python
from semanticscholar import SemanticScholar

sch = SemanticScholar()
results = sch.search_paper("transformer attention mechanism", limit=10)

for paper in results:
    ids = paper.externalIds or {}  # externalIds can be None for some records
    print(f"Title: {paper.title}")
    print(f"Year: {paper.year}")
    print(f"DOI: {ids.get('DOI', 'N/A')}")
    print(f"arXiv: {ids.get('ArXiv', 'N/A')}")
    print(f"Citation count: {paper.citationCount}")
    print("---")
```
### Step 2: Verify Existence

Confirm the paper exists in at least two sources:

```python
import requests
from semanticscholar import SemanticScholar

def verify_paper(doi=None, arxiv_id=None):
    """Verify a paper exists in multiple independent sources."""
    sources_found = []

    # Check Semantic Scholar (get_paper raises on unknown IDs)
    if doi:
        try:
            paper = SemanticScholar().get_paper(f"DOI:{doi}")
            if paper:
                sources_found.append("Semantic Scholar")
        except Exception:
            pass

    # Check CrossRef (via DOI)
    if doi:
        resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
        if resp.status_code == 200:
            sources_found.append("CrossRef")

    # Check arXiv
    if arxiv_id:
        resp = requests.get(
            f"http://export.arxiv.org/api/query?id_list={arxiv_id}",
            timeout=10,
        )
        if "<entry>" in resp.text:
            sources_found.append("arXiv")

    return len(sources_found) >= 2, sources_found
```

### Step 3: Retrieve BibTeX

Use DOI content negotiation for guaranteed accuracy:

```python
import requests

def doi_to_bibtex(doi: str) -> str:
    """Get verified BibTeX from a DOI via content negotiation."""
    response = requests.get(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/x-bibtex"},
        allow_redirects=True,
        timeout=10,
    )
    response.raise_for_status()
    return response.text

# Example: "Attention Is All You Need"
bibtex = doi_to_bibtex("10.48550/arXiv.1706.03762")
print(bibtex)
```

### Step 4: Validate Claims

Before citing a paper for a specific claim, verify that the claim actually appears in the source:

```python
def get_paper_abstract(doi):
    """Get the abstract to spot-check claims against."""
    sch = SemanticScholar()
    paper = sch.get_paper(f"DOI:{doi}")
    return paper.abstract if paper else None

# Verify the claim appears in the abstract (a fast but weak check;
# read the paper itself for anything load-bearing)
abstract = get_paper_abstract("10.48550/arXiv.1706.03762")
claim = "attention mechanism"
if abstract and claim.lower() in abstract.lower():
    print("Claim appears in paper")
```

### Step 5: Add to Bibliography

Add the verified entry to your .bib file with a consistent key format:

```python
import re

def generate_citation_key(bibtex: str) -> str:
    """Generate a consistent citation key: author_year_firstword."""
    # Extract first author's surname
    author_match = re.search(r'author\s*=\s*\{([^}]+)\}', bibtex, re.I)
    if author_match:
        first_author = author_match.group(1).split(',')[0].split()[-1]
    else:
        first_author = "unknown"

    # Extract year
    year_match = re.search(r'year\s*=\s*\{?(\d{4})\}?', bibtex, re.I)
    year = year_match.group(1) if year_match else "0000"

    # Extract first word of title
    title_match = re.search(r'title\s*=\s*\{([^}]+)\}', bibtex, re.I)
    if title_match:
        first_word = title_match.group(1).split()[0].lower()
        first_word = re.sub(r'[^a-z]', '', first_word)
    else:
        first_word = "paper"

    return f"{first_author.lower()}_{year}_{first_word}"
```
---

## Python Implementation

### Complete Citation Manager Class

{% raw %}
```python
"""
Citation Manager - Verified citation workflow for ML papers.
"""

import requests
import time
from typing import Optional, List, Dict, Tuple
from dataclasses import dataclass

try:
    from semanticscholar import SemanticScholar
except ImportError:
    print("Install: pip install semanticscholar")
    SemanticScholar = None


@dataclass
class Paper:
    title: str
    authors: List[str]
    year: int
    doi: Optional[str]
    arxiv_id: Optional[str]
    venue: Optional[str]
    citation_count: int
    abstract: Optional[str]


class CitationManager:
    """Manage citations with verification."""

    def __init__(self, api_key: Optional[str] = None):
        self.sch = SemanticScholar(api_key=api_key) if SemanticScholar else None
        self.verified_papers: Dict[str, Paper] = {}

    def search(self, query: str, limit: int = 10) -> List[Paper]:
        """Search for papers using Semantic Scholar."""
        if not self.sch:
            raise RuntimeError("Semantic Scholar not available")

        results = self.sch.search_paper(query, limit=limit)
        papers = []

        for r in results:
            paper = Paper(
                title=r.title,
                authors=[a.name for a in (r.authors or [])],
                year=r.year or 0,
                doi=r.externalIds.get('DOI') if r.externalIds else None,
                arxiv_id=r.externalIds.get('ArXiv') if r.externalIds else None,
                venue=r.venue,
                citation_count=r.citationCount or 0,
                abstract=r.abstract,
            )
            papers.append(paper)

        return papers

    def verify(self, paper: Paper) -> Tuple[bool, List[str]]:
        """Verify paper exists in multiple sources."""
        sources = []

        # Already found in Semantic Scholar via search
        sources.append("Semantic Scholar")

        # Check CrossRef if DOI available
        if paper.doi:
            try:
                resp = requests.get(
                    f"https://api.crossref.org/works/{paper.doi}",
                    timeout=10,
                )
                if resp.status_code == 200:
                    sources.append("CrossRef")
            except Exception:
                pass

        # Check arXiv if ID available
        if paper.arxiv_id:
            try:
                resp = requests.get(
                    f"http://export.arxiv.org/api/query?id_list={paper.arxiv_id}",
                    timeout=10,
                )
                if "<entry>" in resp.text and "<title>" in resp.text:
                    sources.append("arXiv")
            except Exception:
                pass

        return len(sources) >= 2, sources

    def get_bibtex(self, paper: Paper) -> Optional[str]:
        """Get BibTeX for a verified paper."""
        if paper.doi:
            try:
                resp = requests.get(
                    f"https://doi.org/{paper.doi}",
                    headers={"Accept": "application/x-bibtex"},
                    timeout=10,
                    allow_redirects=True,
                )
                if resp.status_code == 200:
                    return resp.text
            except Exception:
                pass

        # Fallback: generate from paper metadata
        return self._generate_bibtex(paper)

    def _generate_bibtex(self, paper: Paper) -> str:
        """Generate BibTeX from paper metadata."""
        # Generate citation key
        first_author = paper.authors[0].split()[-1] if paper.authors else "unknown"
        first_word = paper.title.split()[0].lower().replace(',', '').replace(':', '')
        key = f"{first_author.lower()}_{paper.year}_{first_word}"

        # Format authors
        authors = " and ".join(paper.authors) if paper.authors else "Unknown"

        bibtex = f"""@article{{{key},
  title = {{{paper.title}}},
  author = {{{authors}}},
  year = {{{paper.year}}},
  {'doi = {' + paper.doi + '},' if paper.doi else ''}
  {'eprint = {' + paper.arxiv_id + '},' if paper.arxiv_id else ''}
  {'journal = {' + paper.venue + '},' if paper.venue else ''}
}}"""
        return bibtex

    def cite(self, query: str) -> Optional[str]:
        """Full workflow: search, verify, return BibTeX."""
        # Search
        papers = self.search(query, limit=5)
        if not papers:
            return None

        # Take the top result
        paper = papers[0]

        # Verify
        verified, sources = self.verify(paper)
        if not verified:
            print(f"Warning: could only verify in {sources}")

        # Get BibTeX
        bibtex = self.get_bibtex(paper)

        # Cache
        if bibtex:
            self.verified_papers[paper.title] = paper

        return bibtex


# Usage example
if __name__ == "__main__":
    cm = CitationManager()

    # Search and cite
    bibtex = cm.cite("attention is all you need transformer")
    if bibtex:
        print(bibtex)
```
{% endraw %}
### Quick Functions

```python
def quick_cite(query: str) -> Optional[str]:
    """One-liner citation (uses the CitationManager defined above)."""
    cm = CitationManager()
    return cm.cite(query)

def batch_cite(queries: List[str], output_file: str = "references.bib"):
    """Cite multiple papers and save to a file."""
    cm = CitationManager()
    bibtex_entries = []

    for query in queries:
        print(f"Processing: {query}")
        bibtex = cm.cite(query)
        if bibtex:
            bibtex_entries.append(bibtex)
        time.sleep(1)  # Rate limiting

    with open(output_file, 'w') as f:
        f.write("\n\n".join(bibtex_entries))

    print(f"Saved {len(bibtex_entries)} citations to {output_file}")
```

---

## BibTeX Management

### BibTeX vs BibLaTeX

| Feature | BibTeX | BibLaTeX |
|---------|--------|----------|
| Unicode support | Limited | Full |
| Entry types | Standard | Extended (@online, @dataset) |
| Customization | Limited | Highly flexible |
| Backend | bibtex | Biber (recommended) |

**Recommendation**: Use natbib with BibTeX for conference submissions — all major venue templates (NeurIPS, ICML, ICLR, ACL, AAAI, COLM) ship with natbib and `.bst` files. BibLaTeX with Biber is an option for journals or personal projects where you control the template.
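
For a conference submission following that recommendation, the preamble is usually minimal (a sketch: most venue style files already load natbib and prescribe their own `.bst`, so check before adding either line):

```latex
% Only needed if the venue style file does not already load it
\usepackage{natbib}

% Venues usually ship their own style file; plainnat is a generic fallback
\bibliographystyle{plainnat}

% In document: \citep{vaswani_2017_attention}, \citet{devlin_2019_bert}

% At end
\bibliography{references}
```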

### LaTeX Setup

This setup is for BibLaTeX with Biber, i.e. documents where you control the template:

```latex
% In preamble
\usepackage[
    backend=biber,
    style=numeric,
    sorting=none
]{biblatex}
\addbibresource{references.bib}

% In document
\cite{vaswani_2017_attention}

% At end
\printbibliography
```

### Citation Commands

The `\citep`/`\citet` family comes from natbib (BibLaTeX provides the same commands via its `natbib=true` option):

```latex
\cite{key}        % Numeric: [1]
\citep{key}       % Parenthetical: (Author, 2020)
\citet{key}       % Textual: Author (2020)
\citeauthor{key}  % Just the author name
\citeyear{key}    % Just the year
```

### Consistent Citation Keys

Use the format `author_year_firstword`:

```
vaswani_2017_attention
devlin_2019_bert
brown_2020_language
```
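
A quick sanity check that keys stay unique as the `.bib` file grows (a minimal regex sketch; a real parser such as `bibtexparser` is more robust):

```python
import re
from collections import Counter

def find_duplicate_keys(bib_text: str) -> list:
    """Return citation keys that appear more than once in a .bib string."""
    keys = re.findall(r'@\w+\{([^,\s]+),', bib_text)
    return [key for key, count in Counter(keys).items() if count > 1]

sample = """
@article{vaswani_2017_attention, title={A}}
@misc{vaswani_2017_attention, title={B}}
@article{devlin_2019_bert, title={C}}
"""
print(find_duplicate_keys(sample))  # ['vaswani_2017_attention']
```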

---
## Common Citation Formats

### Conference Paper

```bibtex
@inproceedings{vaswani_2017_attention,
  title     = {Attention Is All You Need},
  author    = {Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and
               Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and
               Kaiser, Lukasz and Polosukhin, Illia},
  booktitle = {Advances in Neural Information Processing Systems},
  volume    = {30},
  year      = {2017},
  publisher = {Curran Associates, Inc.}
}
```

### Journal Article

```bibtex
@article{hochreiter_1997_long,
  title     = {Long Short-Term Memory},
  author    = {Hochreiter, Sepp and Schmidhuber, J{\"u}rgen},
  journal   = {Neural Computation},
  volume    = {9},
  number    = {8},
  pages     = {1735--1780},
  year      = {1997},
  publisher = {MIT Press}
}
```

### arXiv Preprint

```bibtex
@misc{brown_2020_language,
  title         = {Language Models are Few-Shot Learners},
  author        = {Brown, Tom and Mann, Benjamin and Ryder, Nick and others},
  year          = {2020},
  eprint        = {2005.14165},
  archiveprefix = {arXiv},
  primaryclass  = {cs.CL}
}
```

---
## Troubleshooting

### Common Issues

**Issue: Semantic Scholar returns no results**
- Try more specific keywords
- Check spelling of author names
- Use quotation marks for exact phrases

**Issue: DOI doesn't resolve to BibTeX**
- The DOI may be registered but not linked to CrossRef
- Try the arXiv ID instead if available
- Generate BibTeX from metadata manually

**Issue: Rate limiting errors**
- Add delays between requests (1-3 seconds)
- Use an API key if available
- Cache results to avoid repeat queries
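
The delay-and-retry advice can be wrapped once and reused across all the APIs above (a generic sketch, not tied to any particular client library):

```python
import time

def with_backoff(fn, retries=3, base_delay=1.0):
    """Wrap fn so transient failures are retried with exponential backoff."""
    def wrapped(*args, **kwargs):
        for attempt in range(retries):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if attempt == retries - 1:
                    raise  # out of retries: surface the error
                time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return wrapped

# Usage: safe_get = with_backoff(requests.get); safe_get(url, timeout=10)
```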

**Issue: Encoding problems in BibTeX**
- Use proper LaTeX escaping: `{\"u}` for ü
- Ensure the file is UTF-8 encoded
- Use BibLaTeX with Biber for better Unicode support

### Verification Checklist

Before adding a citation:

- [ ] Paper found in at least 2 sources
- [ ] DOI or arXiv ID verified
- [ ] BibTeX retrieved (not generated from memory)
- [ ] Entry type correct (@inproceedings vs @article)
- [ ] Author names complete and correctly formatted
- [ ] Year and venue verified
- [ ] Citation key follows a consistent format

---
## Additional Resources

**APIs:**
- Semantic Scholar: https://api.semanticscholar.org/api-docs/
- CrossRef: https://www.crossref.org/documentation/retrieve-metadata/rest-api/
- arXiv: https://info.arxiv.org/help/api/basics.html
- OpenAlex: https://docs.openalex.org/

**Python Libraries:**
- `semanticscholar`: https://pypi.org/project/semanticscholar/
- `arxiv`: https://pypi.org/project/arxiv/
- `habanero` (CrossRef): https://github.com/sckott/habanero

**Verification Tools:**
- Citely: https://citely.ai/citation-checker
- ReciteWorks: https://reciteworks.com/
# Experiment Design Patterns

Patterns and best practices distilled from running research experiments at scale with the Hermes agent. These cover experiment infrastructure, evaluation protocols, monitoring, and failure recovery.

---

## Experiment Infrastructure

### Directory Structure

Organize experiments with a consistent structure:

```
workspace/
  experiments/
    run_main.py          # Core experiment runner
    run_baselines.py     # Baseline comparison
    run_ablation.py      # Ablation studies
    strategies.py        # Method implementations
    config.yaml          # Shared configuration
  results/
    <experiment_name>/
      <task_or_problem>/
        <strategy>/
          result.json      # Final metrics
          final_output.md  # Final output artifact
          history.json     # Full trajectory/log
          pass_01/         # Per-iteration artifacts (if iterative)
            intermediate.md
  analysis/
    analyze_results.py   # Statistical analysis
    compute_stats.py     # Significance tests
    make_charts.py       # Visualization
  paper/
    paper.tex            # LaTeX source
    fig_*.pdf            # Generated figures
```

### Script Design Principles

**1. Incremental Saving (Crash Recovery)**

Every experiment script should save results after each unit of work, and skip already-completed work on restart:

```python
import json
from pathlib import Path

def run_experiment(problems, strategies, output_dir):
    for problem in problems:
        for strategy in strategies:
            result_path = Path(output_dir) / problem["id"] / strategy / "result.json"
            if result_path.exists():
                print(f"Skipping {problem['id']}/{strategy} (already done)")
                continue

            # Run the experiment
            result = execute_strategy(problem, strategy)

            # Save immediately
            result_path.parent.mkdir(parents=True, exist_ok=True)
            with open(result_path, 'w') as f:
                json.dump(result, f, indent=2)
```

This pattern makes re-runs safe and efficient. If a process crashes at problem 47/150, restarting skips the first 46.
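
A self-contained toy version of the skip-on-restart behavior (directory and strategy names are illustrative; the layout mirrors the structure above):

```python
import json
import tempfile
from pathlib import Path

def run_once(problem_id: str, strategy: str, out_root: Path) -> bool:
    """Return True if the work ran, False if a saved result let us skip it."""
    result_path = out_root / problem_id / strategy / "result.json"
    if result_path.exists():
        return False  # crash recovery: result already on disk
    result_path.parent.mkdir(parents=True, exist_ok=True)
    result_path.write_text(json.dumps({"id": problem_id, "strategy": strategy}))
    return True

root = Path(tempfile.mkdtemp())
print(run_once("p01", "autoreason", root))  # True  (first run does the work)
print(run_once("p01", "autoreason", root))  # False (restart skips it)
```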

**2. Artifact Preservation**

Save all intermediate outputs, not just final results. This enables post-hoc analysis without re-running:

```python
from pathlib import Path

def save_pass_artifacts(output_dir, pass_num, artifacts):
    """Save all artifacts from a single pass of an iterative method."""
    pass_dir = Path(output_dir) / f"pass_{pass_num:02d}"
    pass_dir.mkdir(parents=True, exist_ok=True)

    for name, content in artifacts.items():
        with open(pass_dir / f"{name}.md", 'w') as f:
            f.write(content)
```
**3. Configuration Management**

Use YAML configs for reproducibility:

```yaml
# config.yaml
model: anthropic/claude-sonnet-4-20250514
author_temperature: 0.8
judge_temperature: 0.3
max_tokens: 4096
num_judges: 3
max_passes: 15
convergence_k: 2
```

```python
import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)
```
**4. Separation of Concerns**

Keep generation, evaluation, and visualization in separate scripts:

| Script | Purpose |
|--------|---------|
| `run_experiment.py` | Core method execution |
| `run_baselines.py` | Baseline comparisons at the same compute |
| `run_eval.py` | Blind evaluation / judge panels |
| `analyze_results.py` | Statistical analysis |
| `make_charts.py` | Figure generation |

This lets you re-run evaluation without re-running expensive generation, and regenerate figures without re-running analysis.

---
## Evaluation Protocols

### Blind Judge Panels (for Subjective Tasks)

When evaluating subjective outputs (writing, analysis, recommendations), use a blind judge panel:

```python
import random

def run_blind_evaluation(outputs: dict, task_prompt: str, num_judges: int = 7):
    """
    Run blind evaluation of multiple method outputs.

    Args:
        outputs: {"method_name": "output_text", ...}
        task_prompt: The original task description
        num_judges: Number of independent judge evaluations
    """
    rankings = []

    for judge_i in range(num_judges):
        # Randomize labels and presentation order per judge
        methods = list(outputs.keys())
        random.shuffle(methods)
        labels = {m: chr(65 + i) for i, m in enumerate(methods)}  # A, B, C...

        # Present to the judge with randomized labels
        prompt = f"Task: {task_prompt}\n\n"
        for method in methods:
            prompt += f"--- Proposal {labels[method]} ---\n{outputs[method]}\n\n"
        prompt += "Rank all proposals from best to worst. Format: RANKING: [best], [second], [worst]"

        ranking = call_judge(prompt)  # call_judge: your LLM judge wrapper
        rankings.append({"labels": labels, "ranking": ranking})

    # Aggregate via Borda count
    return compute_borda(rankings)

def compute_borda(rankings, n_methods=3):
    """Borda count: first place gets n_methods points, last place gets 1."""
    scores = {}
    points = {i: n_methods - i for i in range(n_methods)}  # works for any n_methods

    for r in rankings:
        for position, method in enumerate(r["ranking"]):
            scores[method] = scores.get(method, 0) + points.get(position, 0)

    return scores
```

Key design decisions:
- **Randomize both labels AND order** per judge to prevent position bias
- **Use an odd number of judges** (3, 5, 7) to break ties
- **Conservative tiebreak**: the incumbent/baseline wins ties (prevents false positives)
- **CoT judges** match non-CoT quality at ~40% cost (1 CoT judge ≈ 3 standard judges)
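
To sanity-check the aggregation step, here is the Borda count exercised on a toy panel (restated so the snippet runs standalone; the method names are illustrative):

```python
def compute_borda(rankings, n_methods=3):
    """Borda count: first place gets n_methods points, last place gets 1."""
    scores = {}
    points = {i: n_methods - i for i in range(n_methods)}
    for r in rankings:
        for position, method in enumerate(r["ranking"]):
            scores[method] = scores.get(method, 0) + points.get(position, 0)
    return scores

# Three judges rank three methods; "autoreason" tops two of three ballots.
rankings = [
    {"ranking": ["autoreason", "critique", "single_pass"]},
    {"ranking": ["autoreason", "single_pass", "critique"]},
    {"ranking": ["critique", "autoreason", "single_pass"]},
]
print(compute_borda(rankings))  # {'autoreason': 8, 'critique': 6, 'single_pass': 4}
```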

### Code/Objective Evaluation

For tasks with ground-truth evaluation (code, math, factual):

```python
import subprocess

def evaluate_code(solution: str, test_cases: list, timeout: int = 30):
    """Run a code solution against test cases with sandboxed execution."""
    results = {"public": [], "private": []}

    for test in test_cases:
        try:
            proc = subprocess.run(
                ["python3", "-c", solution],
                input=test["input"],
                capture_output=True,
                timeout=timeout,
                text=True
            )
            actual = proc.stdout.strip()
            expected = test["expected"].strip()
            passed = actual == expected
        except subprocess.TimeoutExpired:
            passed = False

        category = "public" if test.get("public") else "private"
        results[category].append(passed)

    return {
        "public_pass_rate": sum(results["public"]) / max(len(results["public"]), 1),
        "private_pass_rate": sum(results["private"]) / max(len(results["private"]), 1),
    }
```
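
A standalone check of the subprocess harness on one toy "submission" (using `sys.executable` instead of a hard-coded `python3`, which also works where no `python3` alias exists):

```python
import subprocess
import sys

solution = "print(int(input()) * 2)"  # toy solution under test

proc = subprocess.run(
    [sys.executable, "-c", solution],
    input="21",
    capture_output=True,
    timeout=30,
    text=True,
)
print(proc.stdout.strip())  # 42
```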

### Compute-Matched Comparison

Always compare methods at an equal compute budget. If your method uses N API calls, baselines get N calls too:

| Method | Call Budget | Allocation |
|--------|-------------|------------|
| Single pass | 6 calls | 6 independent generations |
| Critique & revise | 6 calls | 1 generate + 5 revise rounds |
| Autoreason | 6 calls | 1 generate + 1 analysis + 4 revisions |
| Best-of-N | 6 calls | 6 independent, pick best on public tests |

### Human Evaluation Design

Many ML/NLP papers require human evaluation, especially for subjective tasks (text generation, summarization, dialogue, creative writing). Poorly designed human evals are a common rejection reason.

#### When Human Evaluation Is Required

| Task Type | Required? | Notes |
|-----------|-----------|-------|
| Text generation (open-ended) | Yes | LLM-as-judge alone is insufficient for acceptance at ACL/EMNLP |
| Summarization | Usually | At minimum for a subset of outputs |
| Dialogue systems | Yes | User studies or annotation |
| Code generation | No | Test suites are objective ground truth |
| Classification | No | Standard metrics suffice |
| Any task with subjective quality | Strongly recommended | Strengthens the paper significantly |

#### Annotation Protocol Design

```
Human Evaluation Protocol:
1. Define the evaluation dimensions (fluency, relevance, factual accuracy, etc.)
2. Create annotation guidelines with examples of each score level
3. Run a pilot with 2-3 annotators on 20-30 examples
4. Compute pilot inter-annotator agreement — if low, revise guidelines
5. Run the full evaluation
6. Report: annotator count, agreement metrics, compensation, time per item
```
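
Step 4's go/no-go decision can be made mechanical using the common Krippendorff's alpha thresholds (>0.80 good, 0.67-0.80 acceptable); an illustrative sketch:

```python
def pilot_gate(alpha: float) -> str:
    """Decide the next step after the pilot round, from inter-annotator alpha."""
    if alpha >= 0.80:
        return "proceed: agreement good"
    if alpha >= 0.67:
        return "proceed: agreement acceptable, note in paper"
    return "revise guidelines and re-pilot"

print(pilot_gate(0.71))  # proceed: agreement acceptable, note in paper
```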
|
||||
|
||||
**Evaluation dimensions** (pick relevant subset):
|
||||
|
||||
| Dimension | Definition | Scale |
|
||||
|-----------|-----------|-------|
|
||||
| Fluency | Grammaticality and naturalness | 1-5 Likert |
|
||||
| Relevance | Does it address the task? | 1-5 Likert |
|
||||
| Factual accuracy | Are stated facts correct? | Binary or 1-5 |
|
||||
| Coherence | Logical flow and consistency | 1-5 Likert |
|
||||
| Informativeness | Does it provide useful information? | 1-5 Likert |
|
||||
| Overall preference | Which output is better? | A/B/Tie (pairwise) |
|
||||
|
||||
**Pairwise comparison** (preferred over absolute scoring — more reliable):
|
||||
- Present two outputs side-by-side (randomize left/right position)
|
||||
- Ask: "Which is better? A / B / Tie"
|
||||
- More discriminative and less susceptible to annotator calibration drift
|
||||
|
||||
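
Aggregating the A/B/Tie judgments above into a win rate plus a significance test can be sketched with the stdlib alone (dropping ties before the sign test is a common convention, an assumption to adjust per study):

```python
from math import comb

def summarize_pairwise(judgments):
    """Summarize 'A' / 'B' / 'Tie' judgments from a pairwise study.

    Ties are dropped before the significance test (a common, conservative
    convention); the test is an exact two-sided sign test against 0.5.
    """
    a, b = judgments.count("A"), judgments.count("B")
    n = a + b
    tail = min(a, b)
    p = 1.0
    if n:
        # Sum both binomial tails at least as extreme as the observed split
        p = min(1.0, (sum(comb(n, k) for k in range(tail + 1))
                      + sum(comb(n, k) for k in range(n - tail, n + 1))) / 2**n)
    return {"A_wins": a, "B_wins": b, "ties": judgments.count("Tie"),
            "A_win_rate": a / n if n else None, "p_value": p}

judgments = ["A", "A", "B", "Tie", "A", "A", "B", "A"]
print(summarize_pairwise(judgments))  # 5 A-wins, 2 B-wins, p = 0.453125
```
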
#### Inter-Annotator Agreement

Always report agreement metrics. Without them, reviewers assume your annotations are unreliable.

```python
# Krippendorff's alpha (preferred — handles missing data, any scale)
# pip install krippendorff
import numpy as np
import krippendorff

# Ratings: rows = annotators, columns = items, values = scores (np.nan = missing)
ratings = [
    [3, 4, 1, 2, 5, np.nan, 3],  # Annotator 1
    [3, 5, 1, 3, 5, 2, 3],       # Annotator 2
    [4, 4, 2, 2, 4, 2, np.nan],  # Annotator 3
]
alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.3f}")
# Interpretation: >0.80 good, 0.67-0.80 acceptable, <0.67 questionable
```

```python
# Cohen's kappa (for exactly 2 annotators, categorical data)
from sklearn.metrics import cohen_kappa_score

annotator_1 = [1, 2, 3, 1, 2, 3, 2]
annotator_2 = [1, 2, 2, 1, 3, 3, 2]
kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.3f}")
# Interpretation: >0.80 excellent, 0.60-0.80 substantial, 0.40-0.60 moderate
```

| Metric | When to Use | Annotators | Scale |
|--------|------------|-----------|-------|
| Krippendorff's alpha | Default choice | Any number | Any (ordinal, nominal, ratio) |
| Cohen's kappa | 2 annotators, categorical | Exactly 2 | Nominal/ordinal |
| Fleiss' kappa | 3+ annotators, categorical | 3+ | Nominal |
| Pearson/Spearman | Continuous scores | 2 | Interval/ratio |

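
The Fleiss' kappa row above has no example yet; a minimal numpy sketch from the count-matrix definition (the 4×3 count matrix is invented):

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa from an (n_items, n_categories) count matrix.

    counts[i, j] = number of raters who assigned item i to category j.
    Every row must sum to the same number of raters m.
    """
    counts = np.asarray(counts, dtype=float)
    n, _ = counts.shape
    m = counts[0].sum()                     # raters per item
    p_j = counts.sum(axis=0) / (n * m)      # category marginals
    p_i = (np.sum(counts**2, axis=1) - m) / (m * (m - 1))  # per-item agreement
    p_bar, p_e = p_i.mean(), np.sum(p_j**2)
    return (p_bar - p_e) / (1 - p_e)

# 4 items, 3 categories, 3 raters each
counts = np.array([
    [3, 0, 0],   # all raters chose category 0
    [0, 3, 0],
    [1, 2, 0],   # split decision
    [0, 0, 3],
])
print(f"Fleiss' kappa: {fleiss_kappa(counts):.3f}")  # → 0.745
```
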
#### Crowdsourcing Platforms

| Platform | Best For | Cost | Quality |
|----------|----------|------|---------|
| **Prolific** | Academic research, higher quality | $8-15/hr | High — academic participant pool |
| **MTurk** | Large-scale, fast turnaround | $2-10/hr | Variable — use qualifications |
| **Surge AI** | NLP-specific annotations | Premium | High — trained annotators |
| **Expert annotators** | Domain-specific (medical, legal) | Highest | Highest — but slow |

**Ethics requirements**:
- Report compensation rate (must be at least the local minimum wage)
- Describe annotator demographics if relevant
- Obtain IRB/ethics approval if required by your institution
- ACL venues explicitly require compensation documentation

#### What to Report in the Paper

```
Human Evaluation Section Checklist:
- [ ] Number of annotators
- [ ] Annotator qualifications / recruitment method
- [ ] Number of items evaluated
- [ ] Evaluation dimensions with definitions
- [ ] Scale used (Likert, pairwise, binary)
- [ ] Inter-annotator agreement (Krippendorff's alpha or Cohen's kappa)
- [ ] Compensation rate
- [ ] Time per annotation item
- [ ] Whether annotators saw model identities (should be blind)
- [ ] Randomization of presentation order
```

---

## Statistical Analysis

### Required Tests

| Test | When to Use | Python |
|------|------------|--------|
| McNemar's test | Comparing two methods on same problems | `scipy.stats.binomtest` for small n |
| Two-proportion z-test | Comparing success rates | Custom or `statsmodels` |
| Fisher's exact test | Small sample pairwise comparison | `scipy.stats.fisher_exact` |
| Bootstrapped CI | Confidence intervals for any metric | Custom bootstrap |
| Cohen's h | Effect size for proportions | Manual calculation |

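
The two-proportion z-test row above can be computed without extra dependencies; a stdlib-only sketch using the pooled-variance form (the pass counts are hypothetical):

```python
from math import sqrt, erf

def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Two-sided two-proportion z-test with pooled variance."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))  # standard normal CDF
    p_value = 2 * (1 - phi(abs(z)))
    return z, p_value

# 60/150 vs 45/150 pass rates
z, p = two_proportion_ztest(60, 150, 45, 150)
print(f"z = {z:.3f}, p = {p:.4f}")
```
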
### Standard Analysis Script

```python
import numpy as np
from scipy import stats
from pathlib import Path
import json

def load_all_results(results_dir):
    """Load all results into a structured format."""
    results = {}
    for result_file in Path(results_dir).rglob("result.json"):
        parts = result_file.relative_to(results_dir).parts
        if len(parts) >= 3:
            experiment, task, strategy = parts[0], parts[1], parts[2]
            data = json.loads(result_file.read_text())
            results.setdefault(experiment, {}).setdefault(strategy, {})[task] = data
    return results

def pairwise_mcnemar(method_a_results, method_b_results):
    """McNemar's test for paired binary outcomes."""
    a_win_b_lose = sum(1 for a, b in zip(method_a_results, method_b_results) if a and not b)
    b_win_a_lose = sum(1 for a, b in zip(method_a_results, method_b_results) if b and not a)

    n = a_win_b_lose + b_win_a_lose
    if n == 0:
        # No discordant pairs — the methods are indistinguishable on this data
        p_value = 1.0
    elif n < 25:
        # Use exact binomial for small samples
        result = stats.binomtest(a_win_b_lose, n, 0.5)
        p_value = result.pvalue
    else:
        # Chi-squared approximation with continuity correction
        chi2 = (abs(a_win_b_lose - b_win_a_lose) - 1)**2 / n
        p_value = 1 - stats.chi2.cdf(chi2, df=1)

    return {
        "a_wins": a_win_b_lose,
        "b_wins": b_win_a_lose,
        "n_discordant": n,
        "p_value": p_value,
        "significant": p_value < 0.05
    }

def bootstrap_ci(data, n_bootstrap=10000, ci=0.95):
    """Bootstrap confidence interval for the mean."""
    means = []
    for _ in range(n_bootstrap):
        sample = np.random.choice(data, size=len(data), replace=True)
        means.append(np.mean(sample))
    lower = np.percentile(means, (1 - ci) / 2 * 100)
    upper = np.percentile(means, (1 + ci) / 2 * 100)
    return {"mean": np.mean(data), "ci_lower": lower, "ci_upper": upper}

def cohens_h(p1, p2):
    """Cohen's h effect size for two proportions."""
    return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))
```

### Reporting Standards

Always include in the paper:
- **Sample sizes**: n=X problems/tasks
- **Number of runs**: K independent runs if applicable
- **Error bars**: Specify standard deviation or standard error
- **Confidence intervals**: 95% CI for key results
- **Significance tests**: p-values for key comparisons
- **Effect sizes**: Cohen's d or h for practical significance

---

## Monitoring (Cron Pattern)

### Cron Prompt Template

For each experiment batch, create a monitoring prompt:

```
Check the status of the [EXPERIMENT_NAME] experiment:

1. Process check: ps aux | grep [PROCESS_PATTERN]
2. Log check: tail -30 [LOG_FILE]
3. Results check: ls [RESULT_DIR]/eval/ (or appropriate result location)
4. If results are available:
   - Read the result JSON files
   - Report metrics in a table (Borda scores, accuracy, etc.)
   - Compute key comparisons between methods
5. If all experiments in this batch are complete:
   - git add -A && git commit -m "[COMMIT_MESSAGE]" && git push
   - Report final summary
6. Key question: [SPECIFIC ANALYTICAL QUESTION]

If nothing has changed since the last check, respond with [SILENT].
```

### Monitoring Best Practices

1. **Check processes first** — don't read results if the experiment is still running and results are incomplete
2. **Read the log tail** — look for errors, progress indicators, completion messages
3. **Count completed vs expected** — "45/150 problems done" is more useful than "some results exist"
4. **Report in structured tables** — always include key metrics in a table
5. **Answer the key question** — each experiment should have a specific analytical question to answer when done
6. **[SILENT] for no-news** — suppress notifications when nothing has changed
7. **Commit on completion** — every completed batch gets committed with a descriptive message

### Example Monitoring Report

```
## Code Experiments (Haiku 3.5) - COMPLETE

| Strategy | Pass Rate (150 problems) | vs Single |
|----------|------------------------|-----------|
| single_pass | 38.0% | — |
| critique_revise | 35.2% | -2.8pp |
| **autoreason** | **40.0%** | **+2.0pp** |
| best_of_6 | 31.0% | -7.0pp |

Key finding: Autoreason shows +2pp improvement over single pass, while
best-of-6 collapses due to single-public-test selection issue.

Committed: `git commit -m "Add Haiku code results (150 problems, 4 strategies)"`
Next: Run significance tests on these results.
```

---

## Failure Recovery

### Common Failures and Recovery

| Failure | Detection | Recovery |
|---------|-----------|----------|
| **API credit exhaustion** | 402 errors in logs, incomplete results | Top up credits, re-run (skips completed work automatically) |
| **Rate limiting** | 429 errors, slow progress | Add retry logic with exponential backoff |
| **Process crash** | PID gone, log stops mid-problem | Re-run script (resumes from last checkpoint) |
| **Wrong model ID** | Model not found errors | Fix ID (e.g., `claude-opus-4-6` not `claude-opus-4.6`) |
| **Parallel slowdown** | Each experiment taking 2x longer | Reduce parallel experiments to 2-3 max |
| **Security scan blocks** | Commands blocked by security | Use `execute_code` instead of piped `terminal` commands |
| **Delegation failures** | `delegate_task` returns errors | Fall back to doing work directly |
| **Timeout on hard problems** | Process stuck, no log progress | Kill, skip problem, note in results |
| **Dataset path mismatch** | File not found errors | Verify paths before launching |

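
The rate-limiting row suggests exponential backoff; a generic stdlib sketch (the `status_code` attribute is an assumed convention — adapt the check to your API client's exception type):

```python
import random
import time

def with_retries(call, max_retries=5, base_delay=1.0, retry_statuses=(429, 500, 502, 503)):
    """Retry a zero-argument `call` with exponential backoff and jitter.

    Assumes retryable errors expose a `status_code` attribute (hypothetical
    convention); anything else, or the final attempt, re-raises immediately.
    """
    for attempt in range(max_retries):
        try:
            return call()
        except Exception as exc:
            status = getattr(exc, "status_code", None)
            if status not in retry_statuses or attempt == max_retries - 1:
                raise
            # 1x, 2x, 4x, ... the base delay, with jitter to spread retries out
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```
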
### Retry Naming Convention

When re-running failed experiments, use a suffix to track rounds:

```
logs/experiment_haiku_0_50.log      # Round 1
logs/experiment_haiku_0_50_r2.log   # Round 2 (after credit exhaustion)
logs/experiment_haiku_0_50_r3.log   # Round 3 (after bug fix)
```

### Pre-Flight Checklist

Before launching any experiment batch:

```
Pre-Flight:
- [ ] API credits sufficient for estimated calls
- [ ] Model IDs correct (test with 1 problem first)
- [ ] Output directory exists and is writable
- [ ] Resume logic works (re-run won't overwrite existing results)
- [ ] Log file path is unique (won't overwrite previous logs)
- [ ] Dataset/task files are accessible
- [ ] Config matches intended experiment
```

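
The resume item in the checklist ("re-run won't overwrite existing results") can be implemented by keying each problem to its own result file; a minimal sketch (`solve` and the directory layout are illustrative, not a fixed convention):

```python
from pathlib import Path
import json

def run_resumable(problems, out_dir, solve):
    """Run `solve` over (problem_id, problem) pairs, skipping any problem
    whose result file already exists — so a crashed or credit-exhausted run
    can simply be re-launched with the same command.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    done = skipped = 0
    for problem_id, problem in problems:
        result_file = out / problem_id / "result.json"
        if result_file.exists():
            skipped += 1
            continue
        result = solve(problem)
        result_file.parent.mkdir(parents=True, exist_ok=True)
        result_file.write_text(json.dumps(result))
        done += 1
    return done, skipped
```
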
---

## Task/Benchmark Design

### Open-Ended Tasks (Subjective Evaluation)

Design tasks that have clear objectives but subjective quality:

```markdown
# Task: [Title]

## Context
[Specific scenario with concrete details: company size, constraints, timeline]

## Deliverable
[Exact format and structure required]

## Requirements
- [Specific, measurable requirements]
- [Not vague — "be comprehensive" is bad, "include exactly 6 sections" is good]
```

### Constrained Tasks (for Testing Scope Effects)

Constrained tasks test whether methods respect scope boundaries. Design with:

- **Fixed facts**: "Use only these N data points, add nothing else"
- **Fixed deliverable**: Specific format (pitch, postmortem, memo — not "improve this")
- **Fixed structure**: "These sections in this order, do not add/remove"
- **Fixed change items**: "Address exactly these N points, nothing else"

**Do NOT use word count as a scope constraint.** Word limits cause false convergence — outputs get rejected for length, not quality. Constrain scope (what to include), not length.

### Example: Good vs Bad Constraints

| Bad Constraint | Why | Good Constraint |
|---------------|-----|-----------------|
| "Max 500 words" | Judges reject for length | "Exactly 4 sections, each with 3 numbered items" |
| "Be concise" | Too vague | "Each prohibition must reference a specific base fact" |
| "Improve this" | Unbounded scope | "Write a 600-word incident postmortem with this exact structure" |
| "Make it better" | No clear criterion | "Address exactly these 3 reviewer concerns" |

---

## Visualization Best Practices

### Setup: SciencePlots + matplotlib

Install SciencePlots for publication-ready defaults:

```bash
pip install SciencePlots matplotlib numpy
```

**Option A: SciencePlots styles** (recommended — handles most defaults automatically):

```python
import matplotlib.pyplot as plt
import scienceplots  # registers the styles

# Pick a style (styles combine as a list):
# ['science'] — clean, serif fonts, suitable for most venues
# ['science', 'ieee'] — IEEE-style (good for two-column papers)
# ['science', 'nature'] — Nature-style
# Add 'no-latex' if LaTeX is not installed on the machine generating plots

with plt.style.context(['science', 'no-latex']):
    fig, ax = plt.subplots(figsize=(3.5, 2.5))  # single-column width
    # ... plot ...
    fig.savefig('paper/fig_results.pdf', bbox_inches='tight')
```

**Option B: Manual rcParams** (when you need full control):

```python
import matplotlib.pyplot as plt

plt.rcParams.update({
    'font.size': 10,
    'font.family': 'serif',
    'axes.labelsize': 11,
    'axes.titlesize': 11,
    'xtick.labelsize': 9,
    'ytick.labelsize': 9,
    'legend.fontsize': 9,
    'figure.figsize': (3.5, 2.5),  # single-column default
    'figure.dpi': 300,
    'savefig.dpi': 300,
    'savefig.bbox': 'tight',
    'savefig.pad_inches': 0.05,
    'axes.linewidth': 0.8,
    'lines.linewidth': 1.5,
    'lines.markersize': 5,
    'axes.grid': True,
    'grid.alpha': 0.3,
    'grid.linewidth': 0.5,
})
```

### Standard Figure Sizes (Two-Column Format)

| Use Case | figsize | Notes |
|----------|---------|-------|
| Single column | `(3.5, 2.5)` | Fits in one column of two-column layout |
| Double column | `(7.0, 3.0)` | Spans full page width |
| Square (heatmap, confusion matrix) | `(3.5, 3.5)` | Single column |
| Tall single (many rows) | `(3.5, 5.0)` | Use sparingly |

### Colorblind-Safe Palette (Okabe-Ito)

Use this palette for all paper figures. It is distinguishable by people with all common forms of color vision deficiency:

```python
COLORS = {
    'blue': '#0072B2',
    'orange': '#E69F00',
    'green': '#009E73',
    'red': '#D55E00',
    'purple': '#CC79A7',
    'cyan': '#56B4E9',
    'yellow': '#F0E442',
    'black': '#000000',
}

# As a list for cycling:
COLOR_CYCLE = ['#0072B2', '#D55E00', '#009E73', '#E69F00', '#CC79A7', '#56B4E9']
```

Also differentiate lines by **marker and linestyle**, not just color:
```python
STYLES = [
    {'color': '#0072B2', 'marker': 'o', 'linestyle': '-'},
    {'color': '#D55E00', 'marker': 's', 'linestyle': '--'},
    {'color': '#009E73', 'marker': '^', 'linestyle': '-.'},
    {'color': '#E69F00', 'marker': 'D', 'linestyle': ':'},
]
```

### Complete Example: Method Comparison Bar Chart

```python
import matplotlib.pyplot as plt
import numpy as np

try:
    import scienceplots
    style = ['science', 'no-latex']
except ImportError:
    style = 'default'

with plt.style.context(style):
    methods = ['Single Pass', 'Critique+Revise', 'Best-of-N', 'Ours']
    scores = [73.2, 74.1, 68.5, 77.0]
    errors = [2.1, 1.8, 3.2, 1.5]
    colors = ['#56B4E9', '#E69F00', '#CC79A7', '#0072B2']

    fig, ax = plt.subplots(figsize=(3.5, 2.5))
    bars = ax.bar(methods, scores, yerr=errors, capsize=3,
                  color=colors, edgecolor='black', linewidth=0.5)

    # Highlight "Ours"
    bars[-1].set_edgecolor('#0072B2')
    bars[-1].set_linewidth(1.5)

    ax.set_ylabel('Pass Rate (%)')
    ax.set_ylim(60, 85)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

    fig.savefig('paper/fig_comparison.pdf', bbox_inches='tight')
```

### Complete Example: Convergence/Trajectory Line Chart

```python
# Reuses `style` and STYLES from the blocks above
with plt.style.context(style):
    fig, ax = plt.subplots(figsize=(3.5, 2.5))

    passes = np.arange(1, 16)
    ours = [65, 72, 78, 82, 85, 87, 88, 89, 89.5, 90, 90, 90, 90, 90, 90]
    baseline = [65, 68, 70, 71, 69, 67, 66, 65, 64, 63, 62, 61, 60, 59, 58]

    ax.plot(passes, ours, **STYLES[0], label='Ours', markersize=4)
    ax.plot(passes, baseline, **STYLES[1], label='Critique+Revise', markersize=4)

    # Mark convergence point
    ax.axvline(x=10, color='gray', linestyle=':', alpha=0.5, linewidth=0.8)
    ax.annotate('Converged', xy=(10, 90), fontsize=8, ha='center',
                xytext=(10, 93), arrowprops=dict(arrowstyle='->', color='gray'))

    ax.set_xlabel('Iteration')
    ax.set_ylabel('Quality Score')
    ax.legend(loc='lower right')
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

    fig.savefig('paper/fig_trajectory.pdf', bbox_inches='tight')
```

### Output Rules

- **Always save as PDF**: `fig.savefig('fig.pdf')` — vector graphics, sharp at any zoom
- **Never save as PNG** for paper figures — raster PNGs look blurry when printed/zoomed
- **Exception**: Screenshots, photographs, or pixel-art visualizations → PNG at 600 DPI
- **Verify grayscale**: Print to grayscale PDF and check all information is still visible

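
A quick numeric proxy for the grayscale rule above: convert each palette color to its approximate gray level and check that the values are well separated. A stdlib-only sketch using Rec. 601 luma weights:

```python
def luminance(hex_color):
    """Approximate perceived gray level (0-255) of a '#RRGGBB' color (Rec. 601 luma)."""
    r, g, b = (int(hex_color[i:i + 2], 16) for i in (1, 3, 5))
    return 0.299 * r + 0.587 * g + 0.114 * b

# The Okabe-Ito cycle from this section
COLOR_CYCLE = ['#0072B2', '#D55E00', '#009E73', '#E69F00', '#CC79A7', '#56B4E9']
grays = sorted(round(luminance(c)) for c in COLOR_CYCLE)
print(grays)
# Close adjacent values signal series that will be hard to tell apart in
# grayscale — one more reason to pair colors with distinct markers/linestyles.
```
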
### Chart Types for Common Comparisons

| Comparison Type | Chart | Notes |
|----------------|-------|-------|
| Method vs method | Grouped bar chart | Include error bars |
| Across model sizes | Line chart with CI bands | Log scale for model size axis |
| Ablation study | Stacked/grouped bar | Highlight removed component |
| Trajectory/convergence | Line chart over iterations | Show winner per iteration |
| Per-task breakdown | Heatmap or grouped bar | Show variance across tasks |

@@ -0,0 +1,476 @@
# Human Evaluation Guide for ML/AI Research

Comprehensive guide for designing, running, and reporting human evaluations in ML/AI papers. Human evaluation is the primary evidence for many NLP, HCI, and alignment papers, and is increasingly expected as complementary evidence at all ML venues.

---

## Contents

- [When Human Evaluation Is Needed](#when-human-evaluation-is-needed)
- [Study Design](#study-design)
- [Annotation Guidelines](#annotation-guidelines)
- [Platforms and Recruitment](#platforms-and-recruitment)
- [Quality Control](#quality-control)
- [Agreement Metrics](#agreement-metrics)
- [Statistical Analysis for Human Eval](#statistical-analysis-for-human-eval)
- [Reporting Requirements](#reporting-requirements)
- [IRB and Ethics](#irb-and-ethics)
- [Common Pitfalls](#common-pitfalls)

---

## When Human Evaluation Is Needed

| Scenario | Human Eval Required? | Notes |
|----------|---------------------|-------|
| Text generation quality (fluency, coherence) | **Yes** | Automated metrics (BLEU, ROUGE) correlate poorly with human judgment |
| Factual accuracy of generated text | **Strongly recommended** | Automated fact-checking is unreliable |
| Safety/toxicity evaluation | **Yes for nuanced cases** | Classifiers miss context-dependent harm |
| Preference between two systems | **Yes** | Most reliable method for comparing LLM outputs |
| Summarization quality | **Yes** | ROUGE doesn't capture faithfulness or relevance well |
| Task completion (UI, agents) | **Yes** | User studies are the gold standard |
| Classification accuracy | **Usually no** | Ground truth labels suffice; human eval adds cost without insight |
| Perplexity or loss comparisons | **No** | Automated metrics are the correct evaluation |

---

## Study Design

### Evaluation Types

| Type | When to Use | Pros | Cons |
|------|-------------|------|------|
| **Pairwise comparison** | Comparing two systems | Most reliable, minimizes scale bias | Only compares pairs, quadratic in systems |
| **Likert scale** (1-5 or 1-7) | Rating individual outputs | Easy to aggregate | Subjective anchoring, scale compression |
| **Ranking** | Ordering 3+ systems | Captures full preference order | Cognitive load increases with items |
| **Best-worst scaling** | Comparing many systems efficiently | More reliable than Likert, linear in items | Requires careful item selection |
| **Binary judgment** | Yes/no decisions (grammatical? factual?) | Simple, high agreement | Loses nuance |
| **Error annotation** | Identifying specific error types | Rich diagnostic information | Expensive, requires trained annotators |

**Recommendation for most ML papers**: Pairwise comparison is the most defensible. Reviewers rarely question its validity. For Likert scales, always report both mean and distribution.

### Sample Size Planning

**Minimum viable sample sizes:**

| Study Type | Minimum Items | Minimum Annotators | Notes |
|------------|--------------|-------------------|-------|
| Pairwise comparison | 100 pairs | 3 per pair | Detects ~10% win rate difference at p<0.05 |
| Likert rating | 100 items | 3 per item | Enough for meaningful averages |
| Ranking | 50 sets | 3 per set | Each set contains all systems being compared |
| Error annotation | 200 items | 2 per item | Higher agreement expected for structured schemes |

**Power analysis** (for planning more precisely):

```python
from scipy import stats
import numpy as np

def sample_size_pairwise(effect_size=0.10, alpha=0.05, power=0.80):
    """
    Estimate sample size for pairwise comparison (sign test).
    effect_size: expected win rate difference from 0.50
    """
    p_expected = 0.50 + effect_size
    # Normal approximation to binomial
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    n = ((z_alpha * np.sqrt(0.25) + z_beta * np.sqrt(p_expected * (1 - p_expected))) ** 2) / (effect_size ** 2)
    return int(np.ceil(n))

print(f"Sample size for 10% effect: {sample_size_pairwise(0.10)}")  # ~200
print(f"Sample size for 15% effect: {sample_size_pairwise(0.15)}")  # ~90
print(f"Sample size for 20% effect: {sample_size_pairwise(0.20)}")  # ~50
```

### Controlling for Bias

| Bias | Mitigation |
|------|-----------|
| **Order bias** (first item preferred) | Randomize presentation order for each annotator |
| **Length bias** (longer = better) | Control for length or analyze separately |
| **Anchoring** (first annotation sets scale) | Include warm-up items (not counted) |
| **Fatigue** (quality drops over time) | Limit session length (30-45 min max), randomize item order |
| **Annotator expertise** | Report annotator background; use qualification tasks |

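
The order-bias and fatigue mitigations above can be baked into item construction; a stdlib sketch (the field names are illustrative, not a fixed schema):

```python
import random

def build_annotation_batch(pairs, seed=0):
    """Turn (output_a, output_b) pairs into blinded annotation items.

    Randomizes which system appears on the left (order bias) and shuffles
    item order (fatigue/anchoring), recording the mapping so preferences can
    be de-anonymized after collection.
    """
    rng = random.Random(seed)
    items = []
    for pair_id, (out_a, out_b) in enumerate(pairs):
        a_on_left = rng.random() < 0.5
        left, right = (out_a, out_b) if a_on_left else (out_b, out_a)
        items.append({
            "pair_id": pair_id,
            "left": left,
            "right": right,
            "left_is_system_a": a_on_left,  # kept server-side, hidden from annotators
        })
    rng.shuffle(items)  # a fresh seed per annotator gives each a different order
    return items
```
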
---

## Annotation Guidelines

Well-written annotation guidelines are the single biggest factor in evaluation quality. Invest significant time here.

### Structure of Good Guidelines

```markdown
# [Task Name] Annotation Guidelines

## Overview
[1-2 sentences describing the task]

## Definitions
[Define every term annotators will use in their judgments]
- Quality: [specific definition for this study]
- Fluency: [specific definition]
- Factuality: [specific definition]

## Rating Scale
[For each scale point, provide:]
- Numeric value
- Label (e.g., "Excellent", "Good", "Acceptable", "Poor", "Unacceptable")
- Definition of what qualifies for this rating
- 1-2 concrete examples at this level

## Examples

### Example 1: [Rating = 5]
Input: [exact input]
Output: [exact output]
Rating: 5
Explanation: [why this is a 5]

### Example 2: [Rating = 2]
Input: [exact input]
Output: [exact output]
Rating: 2
Explanation: [why this is a 2]

[Include at least 2 examples per rating level, covering edge cases]

## Edge Cases
- If the output is [ambiguous case]: [instruction]
- If the input is [unusual case]: [instruction]

## Common Mistakes
- Don't [common annotator error]
- Don't let [bias] influence your rating
```

### Pilot Testing

**Always run a pilot** before the full study:
1. 3-5 annotators, 20-30 items
2. Compute agreement metrics
3. Discuss disagreements in group session
4. Revise guidelines based on confusion points
5. Run second pilot if agreement was poor (<0.40 kappa)

---

## Platforms and Recruitment

| Platform | Best For | Cost | Quality |
|----------|----------|------|---------|
| **Prolific** | General annotation, surveys | $8-15/hr | High (academic-focused pool) |
| **Amazon MTurk** | Large-scale simple tasks | $5-12/hr | Variable (needs strong QC) |
| **Surge AI** | NLP-specific annotation | $15-25/hr | Very high (trained annotators) |
| **Scale AI** | Production-quality labeling | Varies | High (managed workforce) |
| **Internal team** | Domain expertise required | Varies | Highest for specialized tasks |
| **Upwork/contractors** | Long-term annotation projects | $10-30/hr | Depends on hiring |

**Fair compensation**: Always pay at least the equivalent of the local minimum wage for the annotator's location. Many conferences (ACL in particular) now ask about annotator compensation. Paying below minimum wage is an ethics risk.

**Prolific setup (recommended for most ML papers):**
1. Create a study on prolific.co
2. Set prescreening filters (language, country, approval rate >95%)
3. Estimate time per task from the pilot → set fair payment
4. Use Prolific's built-in attention checks or add your own
5. Collect Prolific IDs for quality tracking (but don't share them in the paper)

---

## Quality Control

### Attention Checks

Include items where the correct answer is unambiguous:

```python
# Types of attention checks
attention_checks = {
    "instructed_response": "For this item, please select 'Strongly Agree' regardless of content.",
    "obvious_quality": "Rate this clearly ungrammatical text: 'The cat dog house green yesterday.'",  # Should get the lowest score
    "gold_standard": "Items where expert consensus exists (pre-annotated by authors)",
    "trap_question": "What color is the sky on a clear day? (embedded in annotation interface)"
}

# Recommended: 10-15% of total items should be checks
# Exclusion criterion: fail 2+ attention checks → exclude annotator
```

### Annotator Qualification

For tasks requiring expertise:

```
Qualification Task Design:
1. Create a set of 20-30 items with known-correct labels
2. Require annotators to complete this before the main task
3. Set threshold: ≥80% agreement with gold labels to qualify
4. Record qualification scores for reporting
```

### Monitoring During Collection

```python
# Real-time quality monitoring
def monitor_quality(annotations):
    """Check for annotation quality issues during collection.

    `annotations` is a pandas DataFrame with columns: annotator, rating,
    time_seconds, is_attention_check, gold_rating.
    """
    issues = []

    # 1. Check for straight-lining (same answer for everything)
    for annotator_id, items in annotations.groupby('annotator'):
        if items['rating'].nunique() <= 1:
            issues.append(f"Annotator {annotator_id}: straight-lining detected")

    # 2. Check time per item (too fast = not reading)
    median_time = annotations['time_seconds'].median()
    fast_annotators = annotations.groupby('annotator')['time_seconds'].median()
    for ann_id, time in fast_annotators.items():
        if time < median_time * 0.3:
            issues.append(f"Annotator {ann_id}: suspiciously fast ({time:.0f}s vs median {median_time:.0f}s)")

    # 3. Check attention check performance
    checks = annotations[annotations['is_attention_check']]
    for ann_id, items in checks.groupby('annotator'):
        accuracy = (items['rating'] == items['gold_rating']).mean()
        if accuracy < 0.80:
            issues.append(f"Annotator {ann_id}: failing attention checks ({accuracy:.0%})")

    return issues
```

---
|
||||
|
||||
## Agreement Metrics
|
||||
|
||||
### Which Metric to Use
|
||||
|
||||
| Metric | When to Use | Interpretation |
|
||||
|--------|-------------|---------------|
|
||||
| **Cohen's kappa (κ)** | Exactly 2 annotators, categorical | Chance-corrected agreement |
|
||||
| **Fleiss' kappa** | 3+ annotators, all rate same items, categorical | Multi-annotator extension of Cohen's |
|
||||
| **Krippendorff's alpha (α)** | Any number of annotators, handles missing data | Most general; recommended default |
|
||||
| **ICC (Intraclass Correlation)** | Continuous ratings (Likert) | Consistency among raters |
|
||||
| **Percent agreement** | Reporting alongside kappa/alpha | Raw agreement (not chance-corrected) |
|
||||
| **Kendall's W** | Rankings | Concordance among rankers |
|
||||
|
||||
**Always report at least two**: one chance-corrected metric (kappa or alpha) AND raw percent agreement.
|
||||
|
||||
### Interpretation Guide
|
||||
|
||||
| Value | Krippendorff's α / Cohen's κ | Quality |
|
||||
|-------|-------------------------------|---------|
|
||||
| > 0.80 | Excellent agreement | Reliable for most purposes |
|
||||
| 0.67 - 0.80 | Good agreement | Acceptable for most ML papers |
|
||||
| 0.40 - 0.67 | Moderate agreement | Borderline; discuss in paper |
|
||||
| < 0.40 | Poor agreement | Revise guidelines and redo annotation |
|
||||
|
||||
**Note**: Krippendorff recommends α > 0.667 as minimum for tentative conclusions. NLP tasks with subjective judgments (fluency, helpfulness) typically achieve 0.40-0.70.
|
||||
|
||||
### Implementation

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
import krippendorff  # pip install krippendorff


def compute_agreement(annotations_matrix):
    """
    annotations_matrix: shape (n_items, n_annotators)
    Values: ratings (int or float). Use np.nan for missing.
    """
    results = {}

    # Krippendorff's alpha (handles missing data, any number of annotators)
    results['krippendorff_alpha'] = krippendorff.alpha(
        annotations_matrix.T,  # krippendorff expects (annotators, items)
        level_of_measurement='ordinal'  # or 'nominal', 'interval', 'ratio'
    )

    # Pairwise Cohen's kappa (for 2 annotators at a time)
    n_annotators = annotations_matrix.shape[1]
    kappas = []
    for i in range(n_annotators):
        for j in range(i + 1, n_annotators):
            mask = ~np.isnan(annotations_matrix[:, i]) & ~np.isnan(annotations_matrix[:, j])
            if mask.sum() > 0:
                k = cohen_kappa_score(
                    annotations_matrix[mask, i].astype(int),
                    annotations_matrix[mask, j].astype(int)
                )
                kappas.append(k)
    results['mean_pairwise_kappa'] = np.mean(kappas) if kappas else None

    # Raw percent agreement: fraction of items where all annotators agree
    agree_count = 0
    total_count = 0
    for item in range(annotations_matrix.shape[0]):
        ratings = annotations_matrix[item, ~np.isnan(annotations_matrix[item, :])]
        if len(ratings) >= 2:
            total_count += 1
            if len(set(ratings.astype(int))) == 1:
                agree_count += 1
    results['percent_agreement'] = agree_count / total_count if total_count > 0 else None

    return results
```
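To sanity-check the pairwise kappa values, a from-scratch two-annotator Cohen's kappa can be computed directly from observed and chance agreement. This uses only the standard library; the ratings below are hypothetical:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators rating the same items."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    counts_a, counts_b = Counter(a), Counter(b)
    # chance agreement from the two annotators' marginal distributions
    p_e = sum((counts_a[c] / n) * (counts_b[c] / n) for c in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical 3-point ratings from two annotators over five items
print(round(cohens_kappa([1, 1, 2, 2, 3], [1, 1, 2, 3, 3]), 3))  # → 0.706
```

Raw agreement here is 4/5 = 0.8, but only 0.706 after chance correction; the value should match `cohen_kappa_score` on the same inputs.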

---

## Statistical Analysis for Human Eval

### Pairwise Comparisons

```python
import numpy as np
from scipy import stats


def analyze_pairwise(wins_a, wins_b, ties=0):
    """
    Analyze pairwise comparison results.
    wins_a: number of times system A won
    wins_b: number of times system B won
    ties: number of ties (excluded from the sign test)
    """
    n = wins_a + wins_b  # exclude ties
    if n == 0:
        raise ValueError("No non-tied comparisons to analyze")

    # Sign test (exact binomial); binom_test was removed in SciPy 1.12,
    # use binomtest instead
    p_value = stats.binomtest(wins_a, n, 0.5, alternative='two-sided').pvalue

    # Win rate with 95% CI (Wilson score interval)
    win_rate = wins_a / n
    z = 1.96
    denominator = 1 + z**2 / n
    center = (win_rate + z**2 / (2 * n)) / denominator
    margin = z * np.sqrt((win_rate * (1 - win_rate) + z**2 / (4 * n)) / n) / denominator
    ci_lower = center - margin
    ci_upper = center + margin

    return {
        'win_rate_a': win_rate,
        'win_rate_b': 1 - win_rate,
        'p_value': p_value,
        'ci_95': (ci_lower, ci_upper),
        'significant': p_value < 0.05,
        'n_comparisons': n,
        'ties': ties,
    }
```
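The sign test itself needs no SciPy at all. As a cross-check, an exact two-sided version using only `math.comb` (the win counts below are hypothetical):

```python
import math

def sign_test_p(wins_a, wins_b):
    """Exact two-sided sign test: P(a split at least this lopsided | fair coin)."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # one-sided tail P(X >= k) for X ~ Binomial(n, 0.5), then doubled and capped
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2**n
    return min(1.0, 2 * tail)

print(sign_test_p(70, 30) < 0.01)   # 70/100 wins is clearly significant → True
print(sign_test_p(50, 50))          # perfectly balanced → 1.0
```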

### Likert Scale Analysis

```python
def analyze_likert(ratings_a, ratings_b):
    """Compare Likert ratings between two systems (paired)."""
    # Wilcoxon signed-rank test (non-parametric, paired)
    stat, p_value = stats.wilcoxon(ratings_a, ratings_b, alternative='two-sided')

    # Effect size: matched-pairs rank-biserial correlation (Kerby, 2014):
    # r = 1 - 2T/S, where T is the smaller rank sum and S = n(n+1)/2
    n = len(ratings_a)
    r = 1 - (4 * stat) / (n * (n + 1))

    return {
        'mean_a': np.mean(ratings_a),
        'mean_b': np.mean(ratings_b),
        'std_a': np.std(ratings_a),
        'std_b': np.std(ratings_b),
        'wilcoxon_stat': stat,
        'p_value': p_value,
        'effect_size_r': r,
        'significant': p_value < 0.05,
    }
```

### Multiple Comparisons Correction

When comparing more than two systems:

```python
from statsmodels.stats.multitest import multipletests

# After computing p-values for all pairs
p_values = [0.03, 0.001, 0.08, 0.04, 0.15, 0.002]
rejected, corrected_p, _, _ = multipletests(p_values, method='holm')
# Use the corrected p-values in your paper
```
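To make the correction concrete, here is a minimal pure-Python Holm step-down (no statsmodels dependency) applied to the same p-values:

```python
def holm_correct(p_values):
    """Holm step-down: multiply the k-th smallest p-value by (m - k + 1),
    then enforce monotonicity so corrected p-values never decrease."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    corrected = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * p_values[idx]))
        corrected[idx] = running_max
    return corrected

p_values = [0.03, 0.001, 0.08, 0.04, 0.15, 0.002]
print([round(p, 3) for p in holm_correct(p_values)])  # → [0.12, 0.006, 0.16, 0.12, 0.16, 0.01]
```

Only the two smallest p-values survive at α = 0.05; the uncorrected list would have (wrongly) rejected four.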

---

## Reporting Requirements

Reviewers at NLP venues (ACL, EMNLP, NAACL) check for all of these. ML venues (NeurIPS, ICML) increasingly expect them too.

### Mandatory Reporting

```latex
% In your paper's human evaluation section:
\paragraph{Annotators.} We recruited [N] annotators via [platform].
[Describe qualifications or screening.] Annotators were paid
\$[X]/hour, above the [country] minimum wage.

\paragraph{Agreement.} Inter-annotator agreement was [metric] = [value]
(Krippendorff's $\alpha$ = [value]; raw agreement = [value]\%).
[If low: explain why the task is subjective and how you handle disagreements.]

\paragraph{Evaluation Protocol.} Each [item type] was rated by [N]
annotators on a [scale description]. We collected [total] annotations
across [N items]. [Describe randomization and blinding.]
```

### What Goes in the Appendix

```
Appendix: Human Evaluation Details
- Full annotation guidelines (verbatim)
- Screenshot of annotation interface
- Qualification task details and threshold
- Attention check items and failure rates
- Per-annotator agreement breakdown
- Full results table (not just averages)
- Compensation calculation
- IRB approval number (if applicable)
```

---

## IRB and Ethics

### When IRB Approval Is Needed

| Situation | IRB Required? |
|-----------|---------------|
| Crowdworkers rating text quality | **Usually no** (not "human subjects research" at most institutions) |
| User study with real users | **Yes** at most US/EU institutions |
| Collecting personal information | **Yes** |
| Studying annotator behavior/cognition | **Yes** (they become the subject) |
| Using existing annotated data | **Usually no** (secondary data analysis) |

**Check your institution's policy.** The definition of "human subjects research" varies. When in doubt, submit an IRB protocol; the review is often fast for minimal-risk studies.
### Ethics Checklist for Human Evaluation

```
- [ ] Annotators informed about task purpose (not deceptive)
- [ ] Annotators can withdraw at any time without penalty
- [ ] No personally identifiable information collected beyond platform ID
- [ ] Content being evaluated does not expose annotators to harm
      (if it does: content warnings + opt-out + higher compensation)
- [ ] Fair compensation (>= equivalent local minimum wage)
- [ ] Data stored securely, access limited to research team
- [ ] IRB approval obtained if required by institution
```

---

## Common Pitfalls

| Pitfall | Problem | Fix |
|---------|---------|-----|
| Too few annotators (1-2) | No agreement metric possible | Minimum 3 annotators per item |
| No attention checks | Can't detect low-quality annotations | Include 10-15% attention checks |
| Not reporting compensation | Reviewers flag it as an ethics concern | Always report the hourly rate |
| Using only automated metrics for generation | Reviewers will ask for human eval | Add at least a pairwise comparison |
| Not piloting guidelines | Low agreement, wasted budget | Always pilot with 3-5 people first |
| Reporting only averages | Hides annotator disagreement | Report the distribution and agreement |
| Not controlling for order/position | Position bias inflates results | Randomize presentation order |
| Conflating annotator agreement with ground truth | High agreement doesn't mean correct | Validate against expert judgments |
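For the order/position pitfall, randomization should be deterministic per item so the presentation is reproducible across annotation sessions. A minimal sketch (the `item_id` scheme and seed are hypothetical):

```python
import random

def presentation_order(item_id, seed=1234):
    """Deterministically randomize which system is shown first for each item."""
    rng = random.Random(f"{seed}-{item_id}")  # string seeds are stable across runs
    return ('A', 'B') if rng.random() < 0.5 else ('B', 'A')

orders = [presentation_order(i) for i in range(100)]
# Roughly half the items show system A first; the assignment is stable across runs.
```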
481
skills/research/research-paper-writing/references/paper-types.md
Normal file
@@ -0,0 +1,481 @@
# Paper Types Beyond Empirical ML

Guide for writing non-standard paper types: theory papers, survey/tutorial papers, benchmark/dataset papers, and position papers. Each type has distinct structure, evidence standards, and venue expectations.

---

## Contents

- [Theory Papers](#theory-papers)
- [Survey and Tutorial Papers](#survey-and-tutorial-papers)
- [Benchmark and Dataset Papers](#benchmark-and-dataset-papers)
- [Position Papers](#position-papers)
- [Reproducibility and Replication Papers](#reproducibility-and-replication-papers)

---
## Theory Papers

### When to Write a Theory Paper

Your paper should be a theory paper if:

- The main contribution is a theorem, bound, impossibility result, or formal characterization
- Experiments are supplementary validation, not the core evidence
- The contribution advances understanding rather than achieving state-of-the-art numbers

### Structure

```
1. Introduction (1-1.5 pages)
   - Problem statement and motivation
   - Informal statement of main results
   - Comparison to prior theoretical work
   - Contribution bullets (state theorems informally)

2. Preliminaries (0.5-1 page)
   - Notation table
   - Formal definitions
   - Assumptions (numbered, referenced later)
   - Known results you build on

3. Main Results (2-3 pages)
   - Theorem statements (formal)
   - Proof sketches (intuition + key steps)
   - Corollaries and special cases
   - Discussion of tightness / optimality

4. Experimental Validation (1-2 pages, optional but recommended)
   - Do theoretical predictions match empirical behavior?
   - Synthetic experiments that isolate the phenomenon
   - Comparison to bounds from prior work

5. Related Work (1 page)
   - Theoretical predecessors
   - Empirical work your theory explains

6. Discussion & Open Problems (0.5 page)
   - Limitations of your results
   - Conjectures suggested by your analysis
   - Concrete open problems

Appendix:
- Full proofs
- Technical lemmas
- Extended experimental details
```
### Writing Theorems

**Template for a well-stated theorem:**

```latex
\begin{assumption}[Bounded Gradients]\label{assum:bounded-grad}
There exists $G > 0$ such that $\|\nabla f(x)\| \leq G$ for all $x \in \mathcal{X}$.
\end{assumption}

\begin{theorem}[Convergence Rate]\label{thm:convergence}
Under Assumptions~\ref{assum:bounded-grad} and~\ref{assum:smoothness},
Algorithm~\ref{alg:method} with step size $\eta = \frac{1}{\sqrt{T}}$ satisfies
\[
\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\left[\|\nabla f(x_t)\|^2\right]
\leq \frac{2(f(x_1) - f^*)}{\sqrt{T}} + \frac{G^2}{\sqrt{T}}.
\]
In particular, after $T = O(1/\epsilon^2)$ iterations, we obtain an
$\epsilon$-stationary point.
\end{theorem}
```

**Rules for theorem statements:**

- State all assumptions explicitly (numbered, with names)
- Include the formal bound, not just "converges at rate O(·)"
- Add a plain-language corollary: "In particular, this means..."
- Compare to known bounds: "This improves over [prior work]'s bound of O(·) by a factor of..."
### Proof Sketches

The proof sketch is the most important part of the main text for a theory paper. Reviewers use it to judge whether you have genuine insight or just a mechanical derivation.

**Good proof sketch pattern:**

```latex
\begin{proof}[Proof Sketch of Theorem~\ref{thm:convergence}]
The key insight is that [one sentence describing the main idea].

The proof proceeds in three steps:
\begin{enumerate}
\item \textbf{Decomposition.} We decompose the error into [term A]
      and [term B] using [technique]. This reduces the problem to
      bounding each term separately.

\item \textbf{Bounding [term A].} By [assumption/lemma], [term A]
      is bounded by $O(\cdot)$. The critical observation is that
      [specific insight that makes this non-trivial].

\item \textbf{Combining.} Choosing $\eta = 1/\sqrt{T}$ balances
      the two terms, yielding the stated bound.
\end{enumerate}

The full proof, including the technical lemma for Step 2,
appears in Appendix~\ref{app:proofs}.
\end{proof}
```

**Bad proof sketch**: restating the theorem with slightly different notation, or just saying "the proof follows standard techniques."
### Full Proofs in Appendix

```latex
\appendix
\section{Proofs}\label{app:proofs}

\subsection{Proof of Theorem~\ref{thm:convergence}}

We first establish two technical lemmas.

\begin{lemma}[Descent Lemma]\label{lem:descent}
Under Assumption~\ref{assum:smoothness}, for any step size $\eta \leq 1/L$:
\[
f(x_{t+1}) \leq f(x_t) - \frac{\eta}{2}\|\nabla f(x_t)\|^2 + \frac{\eta^2 L}{2}\|\nabla f(x_t)\|^2.
\]
\end{lemma}

\begin{proof}
[Complete proof with all steps]
\end{proof}

% Continue with remaining lemmas and main theorem proof
```
### Common Theory Paper Pitfalls

| Pitfall | Problem | Fix |
|---------|---------|-----|
| Assumptions too strong | Trivializes the result | Discuss which assumptions are necessary; prove lower bounds |
| No comparison to existing bounds | Reviewers can't assess the contribution | Add a comparison table of bounds |
| Proof sketch is just the full proof shortened | Doesn't convey insight | Focus on the 1-2 key ideas; defer mechanics to the appendix |
| No experimental validation | Reviewers question practical relevance | Add synthetic experiments testing predictions |
| Notation inconsistency | Confuses reviewers | Create a notation table in Preliminaries |
| Overly complex proofs where simple ones exist | Reviewers suspect error | Prefer clarity over generality |
### Venues for Theory Papers

| Venue | Theory Acceptance Rate | Notes |
|-------|------------------------|-------|
| **NeurIPS** | Moderate | Values theory with practical implications |
| **ICML** | High | Strong theory track |
| **ICLR** | Moderate | Prefers theory with empirical validation |
| **COLT** | High | Theory-focused venue |
| **ALT** | High | Algorithmic learning theory |
| **STOC/FOCS** | For TCS-flavored results | If the contribution is primarily combinatorial/algorithmic |
| **JMLR** | High | No page limit; good for long proofs |

---
## Survey and Tutorial Papers

### When to Write a Survey

- A subfield has matured enough that synthesis is valuable
- You've identified connections between works that individual papers don't make
- Newcomers to the area have no good entry point
- The landscape has changed significantly since the last survey

**Warning**: Surveys require genuine expertise. A survey by someone outside the field, however comprehensive, will miss nuances and mischaracterize work.

### Structure
```
1. Introduction (1-2 pages)
   - Scope definition (what's included and excluded, and why)
   - Motivation for the survey now
   - Overview of organization (often with a figure)

2. Background / Problem Formulation (1-2 pages)
   - Formal problem definition
   - Notation (used consistently throughout)
   - Historical context

3. Taxonomy (the core contribution)
   - Organize methods along meaningful axes
   - Present the taxonomy as a figure or table
   - Each category gets a subsection

4. Detailed Coverage (bulk of paper)
   - For each category: representative methods, key ideas, strengths/weaknesses
   - Comparison tables within and across categories
   - Don't just describe: analyze and compare

5. Experimental Comparison (if applicable)
   - Standardized benchmark comparison
   - Fair hyperparameter tuning for all methods
   - Not always feasible, but significantly strengthens the survey

6. Open Problems & Future Directions (1-2 pages)
   - Unsolved problems the field should tackle
   - Promising but underexplored directions
   - This section is what makes a survey a genuine contribution

7. Conclusion
```
### Taxonomy Design

The taxonomy is the core intellectual contribution of a survey. It should:

- **Be meaningful**: categories should correspond to real methodological differences, not arbitrary groupings
- **Be exhaustive**: every relevant paper should fit somewhere
- **Be mutually exclusive** (ideally): each paper belongs to one primary category
- **Have informative names**: "Attention-based methods" > "Category 3"
- **Be visualized**: a figure showing the taxonomy is almost always helpful

**Example taxonomy axes for an "LLM Reasoning" survey:**

- By technique: chain-of-thought, tree-of-thought, self-consistency, tool use
- By training requirement: prompting-only, fine-tuned, RLHF
- By reasoning type: mathematical, commonsense, logical, causal
### Writing Standards

- **Cite every relevant paper**: authors will check whether their work is included
- **Be fair**: don't dismiss methods you don't prefer
- **Synthesize, don't just list**: identify patterns, trade-offs, and open questions
- **Include a comparison table**: even if qualitative (a features/properties checklist)
- **Update before submission**: check arXiv for papers published since you started writing
### Venues for Surveys

| Venue | Notes |
|-------|-------|
| **TMLR** (Survey track) | Dedicated survey submissions; no page limit |
| **JMLR** | Long format, well-respected |
| **Foundations and Trends in ML** | Invited, but can be proposed |
| **ACM Computing Surveys** | Broad CS audience |
| **arXiv** (standalone) | No peer review, but high visibility if well done |
| **Conference tutorials** | Present as a tutorial at NeurIPS/ICML/ACL; write it up as a paper |

---
## Benchmark and Dataset Papers

### When to Write a Benchmark Paper

- Existing benchmarks don't measure what you think matters
- A new capability has emerged with no standard evaluation
- Existing benchmarks are saturated (all methods score >95%)
- You want to standardize evaluation in a fragmented subfield

### Structure
```
1. Introduction
   - What evaluation gap does this benchmark fill?
   - Why existing benchmarks are insufficient

2. Task Definition
   - Formal task specification
   - Input/output format
   - Evaluation criteria (what makes a good answer?)

3. Dataset Construction
   - Data source and collection methodology
   - Annotation process (if human-annotated)
   - Quality control measures
   - Dataset statistics (size, distribution, splits)

4. Baseline Evaluation
   - Run strong baselines (don't just report random/majority)
   - Show the benchmark is challenging but not impossible
   - Human performance baseline (if feasible)

5. Analysis
   - Error analysis on baselines
   - What makes items hard/easy?
   - Construct validity: does the benchmark measure what you claim?

6. Intended Use & Limitations
   - What should this benchmark be used for?
   - What should it NOT be used for?
   - Known biases or limitations

7. Datasheet (Appendix)
   - Full datasheet for datasets (Gebru et al.)
```
### Evidence Standards

Reviewers evaluate benchmarks on different criteria than methods papers:

| Criterion | What Reviewers Check |
|-----------|----------------------|
| **Novelty of evaluation** | Does this measure something existing benchmarks don't? |
| **Construct validity** | Does the benchmark actually measure the stated capability? |
| **Difficulty calibration** | Not too easy (saturated) or too hard (random performance) |
| **Annotation quality** | Agreement metrics, annotator qualifications, guidelines |
| **Documentation** | Datasheet, license, maintenance plan |
| **Reproducibility** | Can others use this benchmark easily? |
| **Ethical considerations** | Bias analysis, consent, sensitive content handling |
### Dataset Documentation (Required)

Follow the Datasheets for Datasets framework (Gebru et al., 2021):

```
Datasheet Questions:
1. Motivation
   - Why was this dataset created?
   - Who created it and on behalf of whom?
   - Who funded the creation?

2. Composition
   - What do the instances represent?
   - How many instances are there?
   - Does it contain all possible instances or a sample?
   - Is there a label? If so, how was it determined?
   - Are there recommended data splits?

3. Collection Process
   - How was the data collected?
   - Who was involved in collection?
   - Over what timeframe?
   - Was ethical review conducted?

4. Preprocessing
   - What preprocessing was done?
   - Was the "raw" data saved?

5. Uses
   - What tasks has this been used for?
   - What should it NOT be used for?
   - Are there other tasks it could be used for?

6. Distribution
   - How is it distributed?
   - Under what license?
   - Are there any restrictions?

7. Maintenance
   - Who maintains it?
   - How can users contact the maintainer?
   - Will it be updated? How?
   - Is there an erratum?
```
### Venues for Benchmark Papers

| Venue | Notes |
|-------|-------|
| **NeurIPS Datasets & Benchmarks** | Dedicated track; the best venue for this |
| **ACL** (Resource papers) | NLP-focused datasets |
| **LREC-COLING** | Language resources |
| **TMLR** | Good for benchmarks with analysis |

---
## Position Papers

### When to Write a Position Paper

- You have an argument about how the field should develop
- You want to challenge a widely held assumption
- You want to propose a research agenda based on analysis
- You've identified a systematic problem in current methodology

### Structure
```
1. Introduction
   - State your thesis clearly in the first paragraph
   - Why this matters now

2. Background
   - Current state of the field
   - Prevailing assumptions you're challenging

3. Argument
   - Present your thesis with supporting evidence
   - Evidence can be: empirical data, theoretical analysis, logical argument,
     case studies, historical precedent
   - Be rigorous: this isn't an opinion piece

4. Counterarguments
   - Engage seriously with the strongest objections
   - Explain why they don't undermine your thesis
   - Concede where appropriate; it strengthens credibility

5. Implications
   - What should the field do differently?
   - Concrete research directions your thesis suggests
   - How should evaluation/methodology change?

6. Conclusion
   - Restate thesis
   - Call to action
```
### Writing Standards

- **Lead with the strongest version of your argument**: don't hedge in the first paragraph
- **Engage with counterarguments honestly**: the best position papers address the strongest objections, not the weakest
- **Provide evidence**: a position paper without evidence is an editorial
- **Be concrete**: "the field should do X" is better than "more work is needed"
- **Don't straw-man existing work**: characterize opposing positions fairly
### Venues for Position Papers

| Venue | Notes |
|-------|-------|
| **ICML** (Position track) | Dedicated track for position papers |
| **NeurIPS** (Workshop papers) | Workshops often welcome position pieces |
| **ACL** (Theme papers) | When your position aligns with the conference theme |
| **TMLR** | Accepts well-argued position papers |
| **CACM** | For a broader CS audience |

---
## Reproducibility and Replication Papers

### When to Write a Reproducibility Paper

- You attempted to reproduce a published result and succeeded or failed
- You want to verify claims under different conditions
- You've identified that a popular method's performance depends on unreported details

### Structure
```
1. Introduction
   - What paper/result are you reproducing?
   - Why is this reproduction valuable?

2. Original Claims
   - State the exact claims from the original paper
   - What evidence was provided?

3. Methodology
   - Your reproduction approach
   - Differences from the original (if any) and why
   - What information was missing from the original paper?

4. Results
   - Side-by-side comparison with the original results
   - Statistical comparison (do confidence intervals overlap?)
   - What reproduced and what didn't?

5. Analysis
   - If results differ: why? What's sensitive?
   - Hidden hyperparameters or implementation details?
   - Robustness to seed, hardware, library versions?

6. Recommendations
   - For original authors: what should be clarified?
   - For practitioners: what to watch out for?
   - For the field: what reproducibility lessons emerge?
```
### Venues

| Venue | Notes |
|-------|-------|
| **ML Reproducibility Challenge** | Annual challenge at NeurIPS |
| **ReScience** | Journal dedicated to replications |
| **TMLR** | Accepts reproductions with analysis |
| **Workshops** | Reproducibility workshops at major conferences |
@@ -0,0 +1,433 @@
# Reviewer Guidelines & Evaluation Criteria

This reference documents how reviewers evaluate papers at major ML/AI conferences, helping authors anticipate and address reviewer concerns.

---

## Contents

- [Universal Evaluation Dimensions](#universal-evaluation-dimensions)
- [NeurIPS Reviewer Guidelines](#neurips-reviewer-guidelines)
- [ICML Reviewer Guidelines](#icml-reviewer-guidelines)
- [ICLR Reviewer Guidelines](#iclr-reviewer-guidelines)
- [ACL Reviewer Guidelines](#acl-reviewer-guidelines)
- [What Makes Reviews Strong](#what-makes-reviews-strong)
- [Common Reviewer Concerns](#common-reviewer-concerns)
- [How to Address Reviewer Feedback](#how-to-address-reviewer-feedback)

---
## Universal Evaluation Dimensions

All major ML conferences assess papers across four core dimensions:

### 1. Quality (Technical Soundness)

**What reviewers ask:**

- Are claims well supported by theoretical analysis or experimental results?
- Are the proofs correct? Are the experiments properly controlled?
- Are baselines appropriate and fairly compared?
- Is the methodology sound?

**How to ensure high quality:**

- Include complete proofs (in the main paper, or in the appendix with sketches)
- Use appropriate baselines (not strawmen)
- Report variance/error bars along with the methodology
- Document the hyperparameter selection process

### 2. Clarity (Writing & Organization)

**What reviewers ask:**

- Is the paper clearly written and well organized?
- Can an expert in the field reproduce the results?
- Is notation consistent? Are terms defined?
- Is the paper self-contained?

**How to ensure clarity:**

- Use consistent terminology throughout
- Define all notation at first use
- Include reproducibility details (an appendix is acceptable)
- Have non-authors read the paper before submission

### 3. Significance (Impact & Importance)

**What reviewers ask:**

- Are the results impactful for the community?
- Will others build upon this work?
- Does it address an important problem?
- What is the potential for real-world impact?

**How to demonstrate significance:**

- Clearly articulate the problem's importance
- Connect to broader research themes
- Discuss potential applications
- Compare to existing approaches meaningfully

### 4. Originality (Novelty & Contribution)

**What reviewers ask:**

- Does this provide new insights?
- How does it differ from prior work?
- Is the contribution non-trivial?

**Key insight from the NeurIPS guidelines:**

> "Originality does not necessarily require introducing an entirely new method. Papers that provide novel insights from evaluating existing approaches or shed light on why methods succeed can also be highly original."

---
## NeurIPS Reviewer Guidelines

### Scoring System (1-6 Scale)

| Score | Label | Description |
|-------|-------|-------------|
| **6** | Strong Accept | Groundbreaking, flawless work; top 2-3% of submissions |
| **5** | Accept | Technically solid, high impact; would benefit the community |
| **4** | Borderline Accept | Solid work with limited evaluation; leans accept |
| **3** | Borderline Reject | Solid, but weaknesses outweigh strengths; leans reject |
| **2** | Reject | Technical flaws or weak evaluation |
| **1** | Strong Reject | Well-known results or unaddressed ethics concerns |
### Reviewer Instructions

Reviewers are explicitly instructed to:

1. **Evaluate the paper as written** - not what it could be with revisions
2. **Provide constructive feedback** - 3-5 actionable points
3. **Not penalize honest limitations** - acknowledging weaknesses is encouraged
4. **Assess reproducibility** - can the work be verified?
5. **Consider ethical implications** - potential misuse or harm
### What Reviewers Should Avoid

- Superficial, uninformed reviews
- Demanding unreasonable additional experiments
- Penalizing authors for honestly acknowledging limitations
- Rejecting for missing citations to the reviewer's own work
### Timeline (NeurIPS 2025 — verify dates for current year)

- Bidding: May 17-21
- Reviewing period: May 29 - July 2
- Author rebuttals: July 24-30
- Discussion period: July 31 - August 13
- Final notifications: September 18

> **Note**: These dates are from the 2025 cycle. Always check the current year's call for papers on the venue website.

---
## ICML Reviewer Guidelines

### Review Structure

ICML reviewers provide:

1. **Summary** - Brief description of contributions
2. **Strengths** - Positive aspects
3. **Weaknesses** - Areas for improvement
4. **Questions** - Clarifications for authors
5. **Limitations** - Assessment of stated limitations
6. **Ethics** - Any concerns
7. **Overall Score** - Recommendation

### Scoring Guidelines

ICML uses a similar 1-6 scale with calibration:

- Top 25% of accepted papers: Score 5-6
- Typical accepted paper: Score 4-5
- Borderline: Score 3-4
- Clear reject: Score 1-2

### Key Evaluation Points

1. **Reproducibility** - Are there enough details?
2. **Experimental rigor** - Multiple seeds, proper baselines?
3. **Writing quality** - Clear, organized, well-structured?
4. **Novelty** - Non-trivial contribution?

---
## ICLR Reviewer Guidelines

### OpenReview Process

ICLR uses OpenReview with:

- Public reviews (after acceptance decisions)
- Author responses visible to reviewers
- Discussion between reviewers and ACs

### Scoring

ICLR reviews include:

- **Soundness**: 1-4 scale
- **Presentation**: 1-4 scale
- **Contribution**: 1-4 scale
- **Overall**: 1-10 scale
- **Confidence**: 1-5 scale

### Unique ICLR Considerations

1. **LLM Disclosure** - Reviewers assess whether LLM use is properly disclosed
2. **Reproducibility** - Emphasis on code availability
3. **Reciprocal Reviewing** - Authors must also serve as reviewers

---
## ACL Reviewer Guidelines

### ACL-Specific Criteria

ACL adds NLP-specific evaluation:

1. **Linguistic soundness** - Are linguistic claims accurate?
2. **Resource documentation** - Are datasets/models properly documented?
3. **Multilingual consideration** - If applicable, is language diversity addressed?

### Limitations Section

ACL specifically requires a Limitations section. Reviewers check:

- Are limitations honest and comprehensive?
- Do limitations undermine core claims?
- Are potential negative impacts addressed?

### Ethics Review

ACL has a dedicated ethics review process for:

- Dual-use concerns
- Data privacy issues
- Bias and fairness implications

---
## AAAI Reviewer Guidelines

### Evaluation Criteria

AAAI reviewers evaluate along similar axes to NeurIPS/ICML but with some differences:

| Criterion | Weight | Notes |
|-----------|--------|-------|
| **Technical quality** | High | Soundness of approach, correctness of results |
| **Significance** | High | Importance of the problem and contribution |
| **Novelty** | Medium-High | New ideas, methods, or insights |
| **Clarity** | Medium | Clear writing, well-organized presentation |
| **Reproducibility** | Medium | Sufficient detail to reproduce results |

### AAAI-Specific Considerations

- **Broader AI scope**: AAAI covers all of AI, not just ML. Papers on planning, reasoning, knowledge representation, NLP, vision, robotics, and multi-agent systems are all in scope. Reviewers may not be deep ML specialists.
- **Formatting strictness**: AAAI reviewers are instructed to flag formatting violations. Non-compliant papers may be desk-rejected before review.
- **Application papers**: AAAI is more receptive to application-focused work than NeurIPS/ICML. Framing a strong application contribution is viable.
- **Senior Program Committee**: AAAI uses SPCs (Senior Program Committee members) who mediate between reviewers and make accept/reject recommendations.

### Scoring (AAAI Scale)

- **Strong Accept**: Clearly above threshold, excellent contribution
- **Accept**: Above threshold, good contribution with minor issues
- **Weak Accept**: Borderline, merits outweigh concerns
- **Weak Reject**: Borderline, concerns outweigh merits
- **Reject**: Below threshold, significant issues
- **Strong Reject**: Well below threshold

---
## COLM Reviewer Guidelines

### Evaluation Criteria

COLM reviews focus on relevance to language modeling in addition to standard criteria:

| Criterion | Weight | Notes |
|-----------|--------|-------|
| **Relevance** | High | Must be relevant to language modeling community |
| **Technical quality** | High | Sound methodology, well-supported claims |
| **Novelty** | Medium-High | New insights about language models |
| **Clarity** | Medium | Clear presentation, reproducible |
| **Significance** | Medium-High | Impact on LM research and practice |

### COLM-Specific Considerations

- **Language model focus**: Reviewers will assess whether the contribution advances understanding of language models. General ML contributions need explicit LM framing.
- **Newer venue norms**: COLM is newer than NeurIPS/ICML, so reviewer calibration varies more. Write more defensively — anticipate a wider range of reviewer expertise.
- **ICLR-derived process**: Review process is modeled on ICLR (open reviews, author response period, discussion among reviewers).
- **Broad interpretation of "language modeling"**: Includes training, evaluation, alignment, safety, efficiency, applications, theory, multimodality (if language is central), and social impact of LMs.

### Scoring

COLM uses an ICLR-style scoring system:

- **8-10**: Strong accept (top papers)
- **6-7**: Weak accept (solid contribution)
- **5**: Borderline
- **3-4**: Weak reject (below threshold)
- **1-2**: Strong reject

---
## What Makes Reviews Strong

### Following Daniel Dennett's Rules

Good reviewers follow these principles:

1. **Re-express the position fairly** - Show you understand the paper
2. **List agreements** - Acknowledge what works well
3. **List what you learned** - Credit the contribution
4. **Only then critique** - After establishing understanding

### Review Structure Best Practices

**Strong Review Structure:**

```
Summary (1 paragraph):
- What the paper does
- Main contribution claimed

Strengths (3-5 bullets):
- Specific positive aspects
- Why these matter

Weaknesses (3-5 bullets):
- Specific concerns
- Why these matter
- Suggestions for addressing

Questions (2-4 items):
- Clarifications needed
- Things that would change assessment

Minor Issues (optional):
- Typos, unclear sentences
- Formatting issues

Overall Assessment:
- Clear recommendation with reasoning
```

---
## Common Reviewer Concerns

### Technical Concerns

| Concern | How to Pre-empt |
|---------|-----------------|
| "Baselines too weak" | Use state-of-the-art baselines, cite recent work |
| "Missing ablations" | Include systematic ablation study |
| "No error bars" | Report std dev/error, multiple runs |
| "Hyperparameters not tuned" | Document tuning process, search ranges |
| "Claims not supported" | Ensure every claim has evidence |
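The "no error bars" concern is the easiest to pre-empt mechanically once per-seed scores are logged. A minimal sketch of the aggregation step (the function name and seed counts are illustrative, not from any specific library):

```python
from statistics import mean, stdev

def summarize_runs(scores):
    """Format per-seed results as 'mean ± std' for a results table."""
    if len(scores) < 2:
        raise ValueError("report at least two runs; 3-5 seeds is a common minimum")
    return f"{mean(scores):.2f} ± {stdev(scores):.2f}"

# Accuracy from the same configuration run with four random seeds:
print(summarize_runs([71.2, 72.8, 70.9, 72.3]))  # 71.80 ± 0.90
```

Note that `stdev` is the sample standard deviation; state in the paper whether you report standard deviation or standard error, and over how many seeds.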
### Novelty Concerns

| Concern | How to Pre-empt |
|---------|-----------------|
| "Incremental contribution" | Clearly articulate what's new vs prior work |
| "Similar to [paper X]" | Explicitly compare to X in Related Work |
| "Straightforward extension" | Highlight non-obvious aspects |

### Clarity Concerns

| Concern | How to Pre-empt |
|---------|-----------------|
| "Hard to follow" | Use clear structure, signposting |
| "Notation inconsistent" | Review all notation, create notation table |
| "Missing details" | Include reproducibility appendix |
| "Figures unclear" | Self-contained captions, proper sizing |

### Significance Concerns

| Concern | How to Pre-empt |
|---------|-----------------|
| "Limited impact" | Discuss broader implications |
| "Narrow evaluation" | Evaluate on multiple benchmarks |
| "Only works in restricted setting" | Acknowledge scope, explain why still valuable |

---
## How to Address Reviewer Feedback

### Rebuttal Best Practices

**Do:**

- Thank reviewers for their time
- Address each concern specifically
- Provide evidence (new experiments if possible)
- Be concise—reviewers are busy
- Acknowledge valid criticisms

**Don't:**

- Be defensive or dismissive
- Make promises you can't keep
- Ignore difficult criticisms
- Write excessively long rebuttals
- Argue about subjective assessments

### Rebuttal Template

```markdown
We thank the reviewers for their thoughtful feedback.

## Reviewer 1

**R1-Q1: [Quoted concern]**
[Direct response with evidence]

**R1-Q2: [Quoted concern]**
[Direct response with evidence]

## Reviewer 2

...

## Summary of Changes
If accepted, we will:
1. [Specific change]
2. [Specific change]
3. [Specific change]
```

### When to Accept Criticism

Some reviewer feedback should simply be accepted:

- Valid technical errors
- Missing important related work
- Unclear explanations
- Missing experimental details

Acknowledge these gracefully: "The reviewer is correct that... We will revise to..."

### When to Push Back

You can respectfully disagree when:

- Reviewer misunderstood the paper
- Requested experiments are out of scope
- Criticism is factually incorrect

Frame disagreements constructively: "We appreciate this perspective. However, [explanation]..."

---
## Pre-Submission Reviewer Simulation

Before submitting, ask yourself:

**Quality:**

- [ ] Would I trust these results if I saw them?
- [ ] Are all claims supported by evidence?
- [ ] Are baselines fair and recent?

**Clarity:**

- [ ] Can someone reproduce this from the paper?
- [ ] Is the writing clear to non-experts in this subfield?
- [ ] Are all terms and notation defined?

**Significance:**

- [ ] Why should the community care about this?
- [ ] What can people do with this work?
- [ ] Is the problem important?

**Originality:**

- [ ] What specifically is new here?
- [ ] How does this differ from closest related work?
- [ ] Is the contribution non-trivial?
---

skills/research/research-paper-writing/references/sources.md
# Source Bibliography

This document lists all authoritative sources used to build this skill, organized by topic.

---

## Origin & Attribution

The writing philosophy, citation verification workflow, and conference reference materials in this skill were originally compiled by **[Orchestra Research](https://github.com/orchestra-research)** as the `ml-paper-writing` skill (January 2026), drawing on Neel Nanda's blog post and other researcher guides listed below. The skill was integrated into hermes-agent by teknium (January 2026), then expanded into the current `research-paper-writing` pipeline by SHL0MS (April 2026, PR #4654), which added experiment design, execution monitoring, iterative refinement, and submission phases while preserving the original writing philosophy and reference files.

---

## Writing Philosophy & Guides
### Primary Sources (Must-Read)

| Source | Author | URL | Key Contribution |
|--------|--------|-----|------------------|
| **Highly Opinionated Advice on How to Write ML Papers** | Neel Nanda | [Alignment Forum](https://www.alignmentforum.org/posts/eJGptPbbFPZGLpjsp/highly-opinionated-advice-on-how-to-write-ml-papers) | Narrative framework, "What/Why/So What", time allocation |
| **How to Write ML Papers** | Sebastian Farquhar (DeepMind) | [Blog](https://sebastianfarquhar.com/on-research/2024/11/04/how_to_write_ml_papers/) | 5-sentence abstract formula, structure templates |
| **A Survival Guide to a PhD** | Andrej Karpathy | [Blog](http://karpathy.github.io/2016/09/07/phd/) | Paper structure recipe, contribution framing |
| **Heuristics for Scientific Writing** | Zachary Lipton (CMU) | [Blog](https://www.approximatelycorrect.com/2018/01/29/heuristics-technical-scientific-writing-machine-learning-perspective/) | Word choice, section balance, intensifier warnings |
| **Advice for Authors** | Jacob Steinhardt (UC Berkeley) | [Blog](https://jsteinhardt.stat.berkeley.edu/blog/advice-for-authors) | Precision over brevity, consistent terminology |
| **Easy Paper Writing Tips** | Ethan Perez (Anthropic) | [Blog](https://ethanperez.net/easy-paper-writing-tips/) | Micro-level tips, apostrophe unfolding, clarity tricks |

### Foundational Scientific Writing

| Source | Author | URL | Key Contribution |
|--------|--------|-----|------------------|
| **The Science of Scientific Writing** | Gopen & Swan | [PDF](https://cseweb.ucsd.edu/~swanson/papers/science-of-writing.pdf) | Topic/stress positions, old-before-new, 7 principles |
| **Summary of Science of Scientific Writing** | Lawrence Crowl | [Summary](https://www.crowl.org/Lawrence/writing/GopenSwan90.html) | Condensed version of Gopen & Swan |

### Additional Resources

| Source | URL | Key Contribution |
|--------|-----|------------------|
| How To Write A Research Paper In ML | [Blog](https://grigorisg9gr.github.io/machine%20learning/research%20paper/how-to-write-a-research-paper-in-machine-learning/) | Practical walkthrough, LaTeX tips |
| A Recipe for Training Neural Networks | [Karpathy Blog](http://karpathy.github.io/2019/04/25/recipe/) | Debugging methodology that translates to paper structure |
| ICML Paper Writing Best Practices | [ICML](https://icml.cc/Conferences/2022/BestPractices) | Official venue guidance |
| Bill Freeman's Writing Slides | [MIT](https://billf.mit.edu/sites/default/files/documents/cvprPapers.pdf) | Visual guide to paper structure |

---
## Official Conference Guidelines

### NeurIPS

| Document | URL | Purpose |
|----------|-----|---------|
| Paper Checklist Guidelines | [NeurIPS](https://neurips.cc/public/guides/PaperChecklist) | 16-item mandatory checklist |
| Reviewer Guidelines 2025 | [NeurIPS](https://neurips.cc/Conferences/2025/ReviewerGuidelines) | Evaluation criteria, scoring |
| Style Files | [NeurIPS](https://neurips.cc/Conferences/2025/PaperInformation/StyleFiles) | LaTeX templates |

### ICML

| Document | URL | Purpose |
|----------|-----|---------|
| Paper Guidelines | [ICML](https://icml.cc/Conferences/2024/PaperGuidelines) | Submission requirements |
| Reviewer Instructions 2025 | [ICML](https://icml.cc/Conferences/2025/ReviewerInstructions) | Review form, evaluation |
| Style & Author Instructions | [ICML](https://icml.cc/Conferences/2022/StyleAuthorInstructions) | Formatting specifications |

### ICLR

| Document | URL | Purpose |
|----------|-----|---------|
| Author Guide 2026 | [ICLR](https://iclr.cc/Conferences/2026/AuthorGuide) | Submission requirements, LLM disclosure |
| Reviewer Guide 2025 | [ICLR](https://iclr.cc/Conferences/2025/ReviewerGuide) | Review process, evaluation |

### ACL/EMNLP

| Document | URL | Purpose |
|----------|-----|---------|
| ACL Style Files | [GitHub](https://github.com/acl-org/acl-style-files) | LaTeX templates |
| ACL Rolling Review | [ARR](https://aclrollingreview.org/) | Submission process |

### AAAI

| Document | URL | Purpose |
|----------|-----|---------|
| Author Kit 2026 | [AAAI](https://aaai.org/authorkit26/) | Templates and guidelines |

### COLM

| Document | URL | Purpose |
|----------|-----|---------|
| Template | [GitHub](https://github.com/COLM-org/Template) | LaTeX templates |

---
## Citation APIs & Tools

### APIs

| API | Documentation | Best For |
|-----|---------------|----------|
| **Semantic Scholar** | [Docs](https://api.semanticscholar.org/api-docs/) | ML/AI papers, citation graphs |
| **CrossRef** | [Docs](https://www.crossref.org/documentation/retrieve-metadata/rest-api/) | DOI lookup, BibTeX retrieval |
| **arXiv** | [Docs](https://info.arxiv.org/help/api/basics.html) | Preprints, PDF access |
| **OpenAlex** | [Docs](https://docs.openalex.org/) | Open alternative, bulk access |
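As a sketch of how these APIs are typically used to verify that a cited paper exists: the snippet below builds a Semantic Scholar paper-search request and extracts the fields needed to check a reference. The endpoint path and field names follow the Graph API docs as we understand them — confirm against the current documentation. Fetch the URL with any HTTP client; a canned response is parsed here so the example is self-contained:

```python
import json
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.semanticscholar.org/graph/v1/paper/search"

def build_search_url(title, fields=("title", "year", "externalIds")):
    """Build a paper-search URL for checking that a reference is real."""
    query = urlencode({"query": title, "fields": ",".join(fields), "limit": 1})
    return f"{SEARCH_ENDPOINT}?{query}"

def extract_match(response_text):
    """Return (title, year) of the first hit, or None if nothing matched."""
    hits = json.loads(response_text).get("data", [])
    if not hits:
        return None
    return hits[0].get("title"), hits[0].get("year")

print(build_search_url("Attention Is All You Need"))
# Sample payload in the documented response shape:
sample = '{"total": 1, "data": [{"title": "Attention Is All You Need", "year": 2017}]}'
print(extract_match(sample))  # ('Attention Is All You Need', 2017)
```

Comparing the returned title and year against your BibTeX entry catches the most common hallucinated-citation failure: a plausible title attached to a wrong or nonexistent year and venue.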
### Python Libraries

| Library | Install | Purpose |
|---------|---------|---------|
| `semanticscholar` | `pip install semanticscholar` | Semantic Scholar wrapper |
| `arxiv` | `pip install arxiv` | arXiv search and download |
| `habanero` | `pip install habanero` | CrossRef client |

### Citation Verification

| Tool | URL | Purpose |
|------|-----|---------|
| Citely | [citely.ai](https://citely.ai/citation-checker) | Batch verification |
| ReciteWorks | [reciteworks.com](https://reciteworks.com/) | In-text citation checking |

---
## Visualization & Formatting

### Figure Creation

| Tool | URL | Purpose |
|------|-----|---------|
| PlotNeuralNet | [GitHub](https://github.com/HarisIqbal88/PlotNeuralNet) | TikZ neural network diagrams |
| SciencePlots | [GitHub](https://github.com/garrettj403/SciencePlots) | Publication-ready matplotlib |
| Okabe-Ito Palette | [Reference](https://jfly.uni-koeln.de/color/) | Colorblind-safe colors |
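For convenience, the Okabe-Ito values can be dropped straight into a plotting setup. A minimal sketch — the helper function is ours, and the hex codes should be double-checked against the linked reference:

```python
# Okabe-Ito colorblind-safe palette: black, orange, sky blue, bluish green,
# yellow, blue, vermillion, reddish purple.
OKABE_ITO = ["#000000", "#E69F00", "#56B4E9", "#009E73",
             "#F0E442", "#0072B2", "#D55E00", "#CC79A7"]

def series_color(i):
    """Color for the i-th data series, cycling past eight series."""
    return OKABE_ITO[i % len(OKABE_ITO)]

# With matplotlib, the palette can instead be installed once per script:
#   import matplotlib.pyplot as plt
#   plt.rcParams["axes.prop_cycle"] = plt.cycler(color=OKABE_ITO)
print(series_color(1), series_color(9))  # #E69F00 #E69F00
```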
### LaTeX Resources

| Resource | URL | Purpose |
|----------|-----|---------|
| Overleaf Templates | [Overleaf](https://www.overleaf.com/latex/templates) | Online LaTeX editor |
| BibLaTeX Guide | [CTAN](https://ctan.org/pkg/biblatex) | Modern citation management |

---

## Research on AI Writing & Hallucination

| Source | URL | Key Finding |
|--------|-----|-------------|
| AI Hallucinations in Citations | [Enago](https://www.enago.com/academy/ai-hallucinations-research-citations/) | ~40% error rate |
| Hallucination in AI Writing | [PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC10726751/) | Types of citation errors |
| NeurIPS 2025 AI Report | [ByteIota](https://byteiota.com/neurips-2025-100-ai-hallucinations-slip-through-review/) | 100+ hallucinated citations |

---
## Quick Reference by Topic

### For Narrative & Structure

→ Start with: Neel Nanda, Sebastian Farquhar, Andrej Karpathy

### For Sentence-Level Clarity

→ Start with: Gopen & Swan, Ethan Perez, Zachary Lipton

### For Word Choice & Style

→ Start with: Zachary Lipton, Jacob Steinhardt

### For Conference-Specific Requirements

→ Start with: Official venue guidelines (NeurIPS, ICML, ICLR, ACL)

### For Citation Management

→ Start with: Semantic Scholar API, CrossRef, citation-workflow.md

### For Reviewer Expectations

→ Start with: Venue reviewer guidelines, reviewer-guidelines.md

### For Human Evaluation

→ Start with: human-evaluation.md, Prolific/MTurk documentation

### For Non-Empirical Papers (Theory, Survey, Benchmark, Position)

→ Start with: paper-types.md

---
## Human Evaluation & Annotation

| Source | URL | Key Contribution |
|--------|-----|------------------|
| **Datasheets for Datasets** | Gebru et al., 2021 ([arXiv](https://arxiv.org/abs/1803.09010)) | Structured dataset documentation framework |
| **Model Cards for Model Reporting** | Mitchell et al., 2019 ([arXiv](https://arxiv.org/abs/1810.03993)) | Structured model documentation framework |
| **Crowdsourcing and Human Computation** | [Survey](https://arxiv.org/abs/2202.06516) | Best practices for crowdsourced annotation |
| **Krippendorff's Alpha** | [Wikipedia](https://en.wikipedia.org/wiki/Krippendorff%27s_alpha) | Inter-annotator agreement metric reference |
| **Prolific** | [prolific.co](https://www.prolific.co/) | Recommended crowdsourcing platform for research |
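Krippendorff's alpha for the common nominal-data case is compact enough to sketch here, following the coincidence-matrix formulation in the linked reference. Treat this as an illustration; prefer a vetted implementation (e.g. the `krippendorff` package on PyPI) for reported numbers:

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """units: one list of ratings per item; items with < 2 ratings are skipped."""
    coincidences = Counter()  # o_ck: weighted ordered value pairs within an item
    for values in units:
        m = len(values)
        if m < 2:
            continue
        for c, k in permutations(values, 2):
            coincidences[(c, k)] += 1.0 / (m - 1)
    totals = Counter()  # n_c: marginal total per category
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())
    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = sum(totals[c] * totals[k] for c in totals for k in totals if c != k)
    if expected == 0:
        return 1.0  # only one category ever used: no possible disagreement
    return 1.0 - (n - 1) * observed / expected

# Two annotators in perfect agreement on four items:
print(krippendorff_alpha_nominal([["a", "a"], ["a", "a"], ["b", "b"], ["b", "b"]]))  # 1.0
```

Values near or below 0 indicate agreement no better than chance; a commonly cited rule of thumb asks for alpha ≥ 0.8 before drawing reliability conclusions.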
## Ethics & Broader Impact

| Source | URL | Key Contribution |
|--------|-----|------------------|
| **ML CO2 Impact** | [mlco2.github.io](https://mlco2.github.io/impact/) | Compute carbon footprint calculator |
| **NeurIPS Broader Impact Guide** | [NeurIPS](https://neurips.cc/public/guides/PaperChecklist) | Official guidance on impact statements |
| **ACL Ethics Policy** | [ACL](https://www.aclweb.org/portal/content/acl-code-ethics) | Ethics requirements for NLP research |
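For a draft impact statement, the calculator above reduces to a simple product of energy drawn, datacenter overhead, and grid intensity. A rough sketch — the default PUE and grid-intensity values are placeholders, not authoritative; substitute measured values for your hardware and region:

```python
def estimate_co2_kg(gpu_power_w, num_gpus, hours, pue=1.5, grid_kg_per_kwh=0.4):
    """Rough training emissions: energy drawn x datacenter overhead x grid intensity."""
    energy_kwh = gpu_power_w * num_gpus * hours / 1000.0
    return energy_kwh * pue * grid_kg_per_kwh

# e.g. eight 300 W GPUs for 72 hours:
print(round(estimate_co2_kg(300, 8, 72), 1))  # 103.7 (kg CO2e)
```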
---
# ML Paper Writing Philosophy & Best Practices

This reference compiles writing advice from prominent ML researchers including Neel Nanda, Andrej Karpathy, Sebastian Farquhar, Zachary Lipton, and Jacob Steinhardt.

---

## Contents

- [The Narrative Principle](#the-narrative-principle)
- [Time Allocation](#time-allocation)
- [Abstract Writing Formula](#abstract-writing-formula)
- [Introduction Structure](#introduction-structure)
- [Sentence-Level Clarity](#sentence-level-clarity)
- [Word Choice and Precision](#word-choice-and-precision)
- [Mathematical Writing](#mathematical-writing)
- [Figure Design](#figure-design)
- [Common Mistakes to Avoid](#common-mistakes-to-avoid)

---
## The Narrative Principle

### From Neel Nanda

"A paper is a short, rigorous, evidence-based technical story with a takeaway readers care about."

The narrative rests on three pillars that must be crystal clear by the end of your introduction:

**The "What"**: One to three specific novel claims fitting within a cohesive theme. Vague contributions like "we study X" fail immediately—reviewers need precise, falsifiable claims.

**The "Why"**: Rigorous empirical evidence that convincingly supports those claims, including strong baselines honestly tuned and experiments that distinguish between competing hypotheses rather than merely showing "decent results."

**The "So What"**: Why readers should care, connecting your contribution to problems the community recognizes as important.

### From Andrej Karpathy

"A paper is not a random collection of experiments you report on. The paper sells a single thing that was not obvious or present before. The entire paper is organized around this core contribution with surgical precision."

This applies whether you're presenting a new architecture, a theoretical result, or improved understanding of existing methods—NeurIPS explicitly notes that "originality does not necessarily require an entirely new method."

**Practical Implication**: If you cannot state your contribution in one sentence, you don't yet have a paper. Everything else—experiments, related work, discussion—exists only to support that core claim.

---
## Time Allocation

### From Neel Nanda

Spend approximately **the same amount of time** on each of:

1. The abstract
2. The introduction
3. The figures
4. Everything else combined

This isn't hyperbole—most reviewers form preliminary judgments before reaching your methods section. Readers encounter your paper in a predictable pattern: **title → abstract → introduction → figures → maybe the rest.**

### Reviewer Reading Patterns

Studies of reviewer behavior show:

- Abstract is read 100% of the time
- Introduction is skimmed by 90%+ of reviewers
- Figures are examined before methods by most reviewers
- Full methods are read only if interest is established

**Implication**: Front-load your paper's value. Don't bury the contribution.

---
## Abstract Writing Formula

### Sebastian Farquhar's 5-Sentence Formula

1. **What you achieved**: "We introduce...", "We prove...", "We demonstrate..."
2. **Why this is hard and important**
3. **How you do it** (with specialist keywords for discoverability)
4. **What evidence you have**
5. **Your most remarkable number/result**

### Example (Good Abstract)

```
We prove that gradient descent on overparameterized neural networks
converges to global minima at a linear rate. [What]
This resolves a fundamental question about why deep learning works
despite non-convex optimization landscapes. [Why hard/important]
Our proof relies on showing that the Neural Tangent Kernel remains
approximately constant during training, reducing the problem to
kernel regression. [How with keywords]
We validate our theory on CIFAR-10 and ImageNet, showing that
predicted convergence rates match experiments within 5%. [Evidence]
This is the first polynomial-time convergence guarantee for
networks with practical depth and width. [Remarkable result]
```

### What to Avoid

From Zachary Lipton: "If the first sentence can be pre-pended to any ML paper, delete it."

**Delete these openings**:

- "Large language models have achieved remarkable success..."
- "Deep learning has revolutionized..."
- "In recent years, neural networks have..."

**Start with your specific contribution instead.**

---
## Introduction Structure

### Requirements

- **1-1.5 pages maximum** (in two-column format)
- **Methods should start by page 2-3**
- Must include **2-4 bullet contribution list** (max 1-2 lines each)

### Structure Template

```markdown
1. Opening Hook (2-3 sentences)
- State the problem your paper addresses
- Why it matters RIGHT NOW

2. Background/Challenge (1 paragraph)
- What makes this problem hard?
- What have others tried? Why is it insufficient?

3. Your Approach (1 paragraph)
- What do you do differently?
- Key insight that enables your contribution

4. Contribution Bullets (2-4 items)
- Be specific and falsifiable
- Each bullet: 1-2 lines maximum

5. Results Preview (2-3 sentences)
- Most impressive numbers
- Scope of evaluation

6. Paper Organization (optional, 1-2 sentences)
- "Section 2 presents... Section 3 describes..."
```

### Contribution Bullets: Good vs Bad

**Good:**

- We prove that X converges in O(n log n) time under assumption Y
- We introduce Z, a 3-layer architecture that reduces memory by 40%
- We demonstrate that A outperforms B by 15% on benchmark C

**Bad:**

- We study the problem of X (not a contribution)
- We provide extensive experiments (too vague)
- We make several contributions to the field (says nothing)

---
## Sentence-Level Clarity
|
||||
|
||||
### From Gopen & Swan: "The Science of Scientific Writing"
|
||||
|
||||
The seminal 1990 paper by George Gopen and Judith Swan establishes that **readers have structural expectations** about where information appears in prose. Violating these expectations forces readers to spend energy on structure rather than content.
|
||||
|
||||
> "If the reader is to grasp what the writer means, the writer must understand what the reader needs."
|
||||
|
||||
#### The 7 Principles of Reader Expectations
|
||||
|
||||
**Principle 1: Subject-Verb Proximity**
|
||||
|
||||
Keep the grammatical subject and verb close together. Anything that intervenes reads as an interruption of lesser importance.

**Weak**: "The model, which was trained on 100M tokens and fine-tuned on domain-specific data using LoRA with rank 16, achieves state-of-the-art results"

**Strong**: "The model achieves state-of-the-art results after training on 100M tokens and fine-tuning with LoRA (rank 16)"

**Principle 2: Stress Position (Save the Best for Last)**

Readers naturally emphasize the **last words of a sentence**. Place your most important information there.

**Weak**: "Accuracy improves by 15% when using attention"

**Strong**: "When using attention, accuracy improves by **15%**"

**Principle 3: Topic Position (First Things First)**

The beginning of a sentence establishes perspective. Put the "whose story" element first: readers expect the sentence to be about whoever shows up first.

**Weak**: "A novel attention mechanism that computes alignment scores is introduced"

**Strong**: "To address the alignment problem, we introduce a novel attention mechanism"

**Principle 4: Old Information Before New**

Put familiar (old) information in the topic position for backward linkage; put new information in the stress position for emphasis.

**Weak**: "Sparse attention was introduced by Child et al. The quadratic complexity of standard attention motivates this work."

**Strong**: "Standard attention has quadratic complexity. To address this, Child et al. introduced sparse attention."

**Principle 5: One Unit, One Function**

Each unit of discourse (sentence, paragraph, section) should serve a single function. If you have two points, use two units.

**Principle 6: Articulate Action in the Verb**

Express the action of each sentence in its verb, not in nominalized nouns.

**Weak**: "We performed an analysis of the results" (nominalization)

**Strong**: "We analyzed the results" (action in verb)

**Principle 7: Context Before New Information**

Provide context before asking the reader to consider anything new. This applies at every level: sentence, paragraph, and section.

**Weak**: "Equation 3 shows that convergence is guaranteed when the learning rate satisfies..."

**Strong**: "For convergence to be guaranteed, the learning rate must satisfy the condition in Equation 3..."

#### Summary Table

| Principle | Rule | Mnemonic |
|-----------|------|----------|
| Subject-Verb Proximity | Keep subject and verb close | "Don't interrupt yourself" |
| Stress Position | Emphasis at sentence end | "Save the best for last" |
| Topic Position | Context at sentence start | "First things first" |
| Old Before New | Familiar → unfamiliar | "Build on known ground" |
| One Unit, One Function | Each paragraph = one point | "One idea per container" |
| Action in Verb | Use verbs, not nominalizations | "Verbs do, nouns sit" |
| Context Before New | Explain before presenting | "Set the stage first" |

---
## Micro-Level Writing Tips

### From Ethan Perez (Anthropic)

These practical micro-level tips improve clarity at the sentence and word level.

#### Pronoun Management

**Minimize pronouns** ("this," "it," "these," "that"). When pronouns are necessary, use them as adjectives with a noun:

**Weak**: "This shows that the model converges."

**Strong**: "This result shows that the model converges."

**Weak**: "It improves performance."

**Strong**: "This modification improves performance."

#### Verb Placement

**Position verbs early** in sentences for better parsing:

**Weak**: "The gradient, after being computed and normalized, updates the weights."

**Strong**: "The gradient updates the weights after being computed and normalized."

#### Apostrophe Unfolding

Transform possessive constructions for clarity:

**Original**: "X's Y" → **Unfolded**: "The Y of X"

**Before**: "The model's accuracy on the test set"

**After**: "The accuracy of the model on the test set"

This isn't always better, but when sentences feel awkward, try unfolding.

#### Words to Eliminate

Delete these filler words in almost all cases:

- "actually"
- "a bit"
- "fortunately" / "unfortunately"
- "very" / "really"
- "quite"
- "basically"
- "essentially"
- Excessive connectives ("however," "moreover," "furthermore" when not needed)
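
The deletions above are mechanical enough to script as a rough lint pass over a draft. This is an illustrative sketch, not a reference to any existing linting tool; the word list and the `flag_fillers` helper are our own:

```python
import re

# Filler words the guide recommends deleting in almost all cases.
FILLERS = [
    "actually", "a bit", "fortunately", "unfortunately",
    "very", "really", "quite", "basically", "essentially",
]

def flag_fillers(text: str) -> list[tuple[str, int]]:
    """Return (word, count) pairs for each filler word found in text."""
    hits = []
    for word in FILLERS:
        # Word-boundary match so "every" does not trigger on "very".
        matches = re.findall(rf"\b{re.escape(word)}\b", text, flags=re.IGNORECASE)
        if matches:
            hits.append((word, len(matches)))
    return hits

draft = "This is actually a very simple method that basically works."
print(flag_fillers(draft))  # → [('actually', 1), ('very', 1), ('basically', 1)]
```

A pass like this only flags candidates; whether a given "unfortunately" carries real meaning is still an editorial call.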
#### Sentence Construction Rules

1. **One idea per sentence** - If you are struggling to express an idea in one sentence, it needs two
2. **No repeated sounds** - Avoid similar-sounding words in the same sentence
3. **Every sentence adds information** - Delete sentences that merely restate
4. **Active voice always** - Specify the actor ("We find..." not "It is found...")
5. **Expand contractions** - "don't" → "do not" for formality

#### Paragraph Architecture

- **First sentence**: State the point clearly
- **Middle sentences**: Support with evidence
- **Last sentence**: Reinforce or transition

Don't bury key information in the middle of paragraphs.

---
## Word Choice and Precision

### From Zachary Lipton

**Eliminate hedging** unless genuine uncertainty exists:

- Delete "may" and "can" unless necessary
- "provides *very* tight approximation" drips with insecurity
- "provides tight approximation" is confident

**Avoid vacuous intensifiers**:

- Delete: very, extremely, highly, significantly (unless statistical)
- These words signal insecurity, not strength

### From Jacob Steinhardt

**Precision over brevity**: Replace vague terms with specific ones.

| Vague | Specific |
|-------|----------|
| performance | accuracy, latency, throughput |
| improves | increases accuracy by X%, reduces latency by Y |
| large | 1B parameters, 100M tokens |
| fast | 3x faster, 50ms latency |
| good results | 92% accuracy, 0.85 F1 |

**Consistent terminology**: Referring to the same concept with different terms creates confusion.

**Choose one and stick with it**:

- "model" vs "network" vs "architecture"
- "training" vs "learning" vs "optimization"
- "sample" vs "example" vs "instance"

### Vocabulary Signaling

**Avoid words signaling incremental work**:

- Never: "combine," "modify," "expand," "extend"
- Instead: "develop," "propose," "introduce"

**Why**: "We combine X and Y" sounds like you stapled two existing ideas together. "We develop a method that leverages X for Y" sounds like a genuine contribution.

---
## Mathematical Writing

### From Ethan Perez

**Unfold apostrophes** for clarity:

- Weak: "X's Y"
- Strong: "The Y of X"

Example: "the model's accuracy" → "the accuracy of the model"

### General Principles

1. **State all assumptions formally** before theorems
2. **Provide intuitive explanations** alongside proofs
3. **Use consistent notation** throughout the paper
4. **Define symbols at first use**

### Notation Conventions

```latex
% Scalars: lowercase italic
$x$, $y$, $\alpha$, $\beta$

% Vectors: lowercase bold
$\mathbf{x}$, $\mathbf{v}$

% Matrices: uppercase bold
$\mathbf{W}$, $\mathbf{X}$

% Sets: uppercase calligraphic
$\mathcal{X}$, $\mathcal{D}$

% Functions: roman for named functions
$\mathrm{softmax}$, $\mathrm{ReLU}$
```

---
## Figure Design

### From Neel Nanda

Figures should tell a coherent story even if the reader skips the text. Many readers DO skip the text initially.

### Design Principles

1. **Figure 1 is crucial**: Often the first thing readers examine after the abstract
2. **Self-contained captions**: The reader should understand the figure without the main text
3. **No title inside figure**: The caption serves this function (ICML/NeurIPS rule)
4. **Vector graphics**: PDF/EPS for plots, PNG (600 DPI) only for photographs

### Accessibility Requirements

8% of men have a color vision deficiency. Your figures must work for them.

**Solutions**:

- Use colorblind-safe palettes: Okabe-Ito or Paul Tol
- Avoid red-green combinations
- Verify figures work in grayscale
- Use different line styles (solid, dashed, dotted) in addition to colors
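
The palette and line-style advice above can be applied globally in matplotlib. A minimal sketch, assuming the standard eight Okabe-Ito hex values (the `OKABE_ITO` name is our own); pairing colors with a `cycler` of line styles keeps curves distinguishable in grayscale:

```python
import matplotlib.pyplot as plt
from cycler import cycler  # ships with matplotlib

# Okabe-Ito colorblind-safe palette (8 colors).
OKABE_ITO = [
    "#E69F00", "#56B4E9", "#009E73", "#F0E442",
    "#0072B2", "#D55E00", "#CC79A7", "#000000",
]

# Pair each color with a distinct line style; cyclers of equal
# length add element-wise, so line 0 is solid orange, line 1 is
# dashed sky blue, and so on.
styles = ["-", "--", "-.", ":"] * 2
plt.rcParams["axes.prop_cycle"] = cycler(color=OKABE_ITO) + cycler(linestyle=styles)
```

Setting `axes.prop_cycle` once at the top of a plotting script applies the palette to every subsequent figure.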
### Tools

```python
# SciencePlots: publication-ready matplotlib styles
import matplotlib.pyplot as plt
import scienceplots  # registers the 'science', 'ieee', 'nature' styles

plt.style.use(['science', 'ieee'])

# Or for Nature-style figures
plt.style.use(['science', 'nature'])
```

---
## Common Mistakes to Avoid

### Structure Mistakes

| Mistake | Solution |
|---------|----------|
| Introduction too long (>1.5 pages) | Move background to Related Work |
| Methods buried (after page 3) | Front-load the contribution, cut the intro |
| Missing contribution bullets | Add 2-4 specific, falsifiable claims |
| Experiments without explicit claims | State what each experiment tests |

### Writing Mistakes

| Mistake | Solution |
|---------|----------|
| Generic abstract opening | Start with your specific contribution |
| Inconsistent terminology | Choose one term per concept |
| Passive voice overuse | Use active voice: "We show" not "It is shown" |
| Hedging everywhere | Be confident unless genuinely uncertain |

### Figure Mistakes

| Mistake | Solution |
|---------|----------|
| Raster graphics for plots | Use vector formats (PDF/EPS) |
| Red-green color scheme | Use a colorblind-safe palette |
| Title inside figure | Put the title in the caption |
| Captions that require the main text | Make captions self-contained |

### Citation Mistakes

| Mistake | Solution |
|---------|----------|
| Paper-by-paper Related Work | Organize methodologically |
| Missing relevant citations | Reviewers may have authored relevant papers; cite generously |
| AI-generated citations | Always verify against a bibliographic API |
| Inconsistent citation format | Use BibLaTeX with consistent keys |
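
Citation verification can be scripted against a public bibliographic API such as Crossref. A minimal sketch of the query side (the helper names are ours; in real use you would call `verify_title` and compare the returned title and authors against your BibTeX entry):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def crossref_query_url(title: str, rows: int = 1) -> str:
    """Build a Crossref /works query URL for a paper title."""
    params = urlencode({"query.bibliographic": title, "rows": rows})
    return f"https://api.crossref.org/works?{params}"

def verify_title(title: str) -> dict:
    """Fetch the top Crossref match for a title (makes a network call)."""
    with urlopen(crossref_query_url(title)) as resp:
        return json.load(resp)["message"]["items"][0]

print(crossref_query_url("Attention Is All You Need"))
# → https://api.crossref.org/works?query.bibliographic=Attention+Is+All+You+Need&rows=1
```

A mismatch between the top hit and the entry's title, venue, or author list is a strong signal that the citation is hallucinated or garbled.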
---

## Pre-Submission Checklist

Before submitting, verify:

**Narrative**:

- [ ] Can state contribution in one sentence
- [ ] Three pillars (What/Why/So What) clear in intro
- [ ] Every experiment supports a specific claim

**Structure**:

- [ ] Abstract follows 5-sentence formula
- [ ] Introduction ≤1.5 pages
- [ ] Methods start by page 2-3
- [ ] 2-4 contribution bullets included
- [ ] Limitations section present

**Writing**:

- [ ] Consistent terminology throughout
- [ ] No generic opening sentences
- [ ] Hedging removed unless necessary
- [ ] All figures have self-contained captions

**Technical**:

- [ ] All citations verified via API
- [ ] Error bars included with methodology
- [ ] Compute resources documented
- [ ] Code/data availability stated