# ML Paper Writing Philosophy & Best Practices

This reference compiles writing advice from prominent ML researchers including Neel Nanda, Andrej Karpathy, Sebastian Farquhar, Zachary Lipton, and Jacob Steinhardt.

---

## Contents

- [The Narrative Principle](#the-narrative-principle)
- [Time Allocation](#time-allocation)
- [Abstract Writing Formula](#abstract-writing-formula)
- [Introduction Structure](#introduction-structure)
- [Sentence-Level Clarity](#sentence-level-clarity)
- [Micro-Level Writing Tips](#micro-level-writing-tips)
- [Word Choice and Precision](#word-choice-and-precision)
- [Mathematical Writing](#mathematical-writing)
- [Figure Design](#figure-design)
- [Common Mistakes to Avoid](#common-mistakes-to-avoid)
- [Pre-Submission Checklist](#pre-submission-checklist)

---

## The Narrative Principle

### From Neel Nanda

"A paper is a short, rigorous, evidence-based technical story with a takeaway readers care about."

The narrative rests on three pillars that must be crystal clear by the end of your introduction:

**The "What"**: One to three specific novel claims fitting within a cohesive theme. Vague contributions like "we study X" fail immediately—reviewers need precise, falsifiable claims.

**The "Why"**: Rigorous empirical evidence that convincingly supports those claims, including strong baselines honestly tuned and experiments that distinguish between competing hypotheses rather than merely showing "decent results."

**The "So What"**: Why readers should care, connecting your contribution to problems the community recognizes as important.

### From Andrej Karpathy

"A paper is not a random collection of experiments you report on. The paper sells a single thing that was not obvious or present before. The entire paper is organized around this core contribution with surgical precision."

This applies whether you're presenting a new architecture, a theoretical result, or improved understanding of existing methods—NeurIPS explicitly notes that "originality does not necessarily require an entirely new method."

**Practical Implication**: If you cannot state your contribution in one sentence, you don't yet have a paper. Everything else—experiments, related work, discussion—exists only to support that core claim.

---

## Time Allocation

### From Neel Nanda

Spend approximately **the same amount of time** on each of:

1. The abstract
2. The introduction
3. The figures
4. Everything else combined

This isn't hyperbole—most reviewers form preliminary judgments before reaching your methods section. Readers encounter your paper in a predictable pattern: **title → abstract → introduction → figures → maybe the rest.**

### Reviewer Reading Patterns

In practice, reviewer reading tends to follow this pattern:

- The abstract is read essentially 100% of the time
- The introduction is skimmed by 90%+ of reviewers
- Figures are examined before methods by most reviewers
- Full methods are read only if interest is established

**Implication**: Front-load your paper's value. Don't bury the contribution.

---

## Abstract Writing Formula

### Sebastian Farquhar's 5-Sentence Formula

1. **What you achieved**: "We introduce...", "We prove...", "We demonstrate..."
2. **Why this is hard and important**
3. **How you do it** (with specialist keywords for discoverability)
4. **What evidence you have**
5. **Your most remarkable number/result**

### Example (Good Abstract)

```
We prove that gradient descent on overparameterized neural networks
converges to global minima at a linear rate. [What]
This resolves a fundamental question about why deep learning works
despite non-convex optimization landscapes. [Why hard/important]
Our proof relies on showing that the Neural Tangent Kernel remains
approximately constant during training, reducing the problem to
kernel regression. [How with keywords]
We validate our theory on CIFAR-10 and ImageNet, showing that
predicted convergence rates match experiments within 5%. [Evidence]
This is the first polynomial-time convergence guarantee for
networks with practical depth and width. [Remarkable result]
```

### What to Avoid

From Zachary Lipton: "If the first sentence can be pre-pended to any ML paper, delete it."

**Delete these openings**:

- "Large language models have achieved remarkable success..."
- "Deep learning has revolutionized..."
- "In recent years, neural networks have..."

**Start with your specific contribution instead.**

---

## Introduction Structure

### Requirements

- **1-1.5 pages maximum** (in two-column format)
- **Methods should start by page 2-3**
- Must include a **2-4 bullet contribution list** (max 1-2 lines each)

### Structure Template

```markdown
1. Opening Hook (2-3 sentences)
   - State the problem your paper addresses
   - Why it matters RIGHT NOW

2. Background/Challenge (1 paragraph)
   - What makes this problem hard?
   - What have others tried? Why is it insufficient?

3. Your Approach (1 paragraph)
   - What do you do differently?
   - Key insight that enables your contribution

4. Contribution Bullets (2-4 items)
   - Be specific and falsifiable
   - Each bullet: 1-2 lines maximum

5. Results Preview (2-3 sentences)
   - Most impressive numbers
   - Scope of evaluation

6. Paper Organization (optional, 1-2 sentences)
   - "Section 2 presents... Section 3 describes..."
```

### Contribution Bullets: Good vs Bad

**Good:**

- We prove that X converges in O(n log n) time under assumption Y
- We introduce Z, a 3-layer architecture that reduces memory by 40%
- We demonstrate that A outperforms B by 15% on benchmark C

**Bad:**

- We study the problem of X (not a contribution)
- We provide extensive experiments (too vague)
- We make several contributions to the field (says nothing)

---

## Sentence-Level Clarity

### From Gopen & Swan: "The Science of Scientific Writing"

The seminal 1990 paper by George Gopen and Judith Swan establishes that **readers have structural expectations** about where information appears in prose. Violating these expectations forces readers to spend energy on structure rather than content.

> "If the reader is to grasp what the writer means, the writer must understand what the reader needs."

#### The 7 Principles of Reader Expectations

**Principle 1: Subject-Verb Proximity**

Keep the grammatical subject and verb close together. Anything intervening reads as an interruption of lesser importance.

**Weak**: "The model, which was trained on 100M tokens and fine-tuned on domain-specific data using LoRA with rank 16, achieves state-of-the-art results"

**Strong**: "The model achieves state-of-the-art results after training on 100M tokens and fine-tuning with LoRA (rank 16)"

**Principle 2: Stress Position (Save the Best for Last)**

Readers naturally emphasize the **last words of a sentence**. Place your most important information there.

**Weak**: "Accuracy improves by 15% when using attention"
**Strong**: "When using attention, accuracy improves by **15%**"

**Principle 3: Topic Position (First Things First)**

The beginning of a sentence establishes perspective. Put the "whose story" element first—readers expect the sentence to be about whoever shows up first.

**Weak**: "A novel attention mechanism that computes alignment scores is introduced"
**Strong**: "To address the alignment problem, we introduce a novel attention mechanism"

**Principle 4: Old Information Before New**

Put familiar (old) information in the topic position for backward linkage; put new information in the stress position for emphasis.

**Weak**: "Sparse attention was introduced by Child et al. The quadratic complexity of standard attention motivates this work."
**Strong**: "Standard attention has quadratic complexity. To address this, Child et al. introduced sparse attention."

**Principle 5: One Unit, One Function**

Each unit of discourse (sentence, paragraph, section) should serve a single function. If you have two points, use two units.

**Principle 6: Articulate Action in the Verb**

Express the action of each sentence in its verb, not in nominalized nouns.

**Weak**: "We performed an analysis of the results" (nominalization)
**Strong**: "We analyzed the results" (action in verb)

**Principle 7: Context Before New Information**

Provide context before asking the reader to consider anything new. This applies at all levels—sentence, paragraph, section.

**Weak**: "Equation 3 shows that convergence is guaranteed when the learning rate satisfies..."
**Strong**: "For convergence to be guaranteed, the learning rate must satisfy the condition in Equation 3..."

#### Summary Table

| Principle | Rule | Mnemonic |
|-----------|------|----------|
| Subject-Verb Proximity | Keep subject and verb close | "Don't interrupt yourself" |
| Stress Position | Emphasis at sentence end | "Save the best for last" |
| Topic Position | Context at sentence start | "First things first" |
| Old Before New | Familiar → unfamiliar | "Build on known ground" |
| One Unit, One Function | Each paragraph = one point | "One idea per container" |
| Action in Verb | Use verbs, not nominalizations | "Verbs do, nouns sit" |
| Context Before New | Explain before presenting | "Set the stage first" |

---

## Micro-Level Writing Tips

### From Ethan Perez (Anthropic)

These practical micro-level tips improve clarity at the sentence and word level.

#### Pronoun Management

**Minimize pronouns** ("this," "it," "these," "that"). When pronouns are necessary, use them as adjectives with a noun:

**Weak**: "This shows that the model converges."
**Strong**: "This result shows that the model converges."

**Weak**: "It improves performance."
**Strong**: "This modification improves performance."

#### Verb Placement

**Position verbs early** in sentences for better parsing:

**Weak**: "The gradient, after being computed and normalized, updates the weights."
**Strong**: "The gradient updates the weights after being computed and normalized."

#### Apostrophe Unfolding

Transform possessive constructions for clarity:

**Original**: "X's Y" → **Unfolded**: "The Y of X"

**Before**: "The model's accuracy on the test set"
**After**: "The accuracy of the model on the test set"

This isn't always better, but when sentences feel awkward, try unfolding.

#### Words to Eliminate

Delete these filler words in almost all cases:

- "actually"
- "a bit"
- "fortunately" / "unfortunately"
- "very" / "really"
- "quite"
- "basically"
- "essentially"
- Excessive connectives ("however," "moreover," "furthermore" when not needed)
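As a rough self-check, these fillers can be flagged automatically before a final read. The sketch below is a hypothetical helper (the `find_fillers` name is ours, and the word list simply mirrors the bullets above), not a substitute for editing:

```python
import re

# Filler words from the list above. "a bit" is a two-word phrase,
# so we match against the full text rather than word-by-word.
FILLERS = ["actually", "a bit", "fortunately", "unfortunately",
           "very", "really", "quite", "basically", "essentially"]

def find_fillers(text):
    """Return (filler, character offset) pairs found in `text`."""
    lowered = text.lower()
    hits = []
    for word in FILLERS:
        for match in re.finditer(r"\b" + re.escape(word) + r"\b", lowered):
            hits.append((word, match.start()))
    return sorted(hits, key=lambda hit: hit[1])

print(find_fillers("This is actually a very interesting result."))
# → [('actually', 8), ('very', 19)]
```

Word boundaries (`\b`) keep "fortunately" from firing inside "unfortunately"; extend the list with your own verbal tics.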

#### Sentence Construction Rules

1. **One idea per sentence** - If struggling to express an idea in one sentence, it needs two
2. **No repeated sounds** - Avoid similar-sounding words in the same sentence
3. **Every sentence adds information** - Delete sentences that merely restate
4. **Active voice always** - Specify the actor ("We find..." not "It is found...")
5. **Expand contractions** - "don't" → "do not" for formality

#### Paragraph Architecture

- **First sentence**: State the point clearly
- **Middle sentences**: Support with evidence
- **Last sentence**: Reinforce or transition

Don't bury key information in the middle of paragraphs.

---

## Word Choice and Precision

### From Zachary Lipton

**Eliminate hedging** unless genuine uncertainty exists:

- Delete "may" and "can" unless necessary
- "provides *very* tight approximation" drips with insecurity
- "provides tight approximation" is confident

**Avoid vacuous intensifiers**:

- Delete: very, extremely, highly, significantly (unless statistical)
- These words signal insecurity, not strength

### From Jacob Steinhardt

**Precision over brevity**: Replace vague terms with specific ones.

| Vague | Specific |
|-------|----------|
| performance | accuracy, latency, throughput |
| improves | increases accuracy by X%, reduces latency by Y |
| large | 1B parameters, 100M tokens |
| fast | 3x faster, 50ms latency |
| good results | 92% accuracy, 0.85 F1 |

**Consistent terminology**: Referring to the same concept with different terms creates confusion.

**Choose one and stick with it**:

- "model" vs "network" vs "architecture"
- "training" vs "learning" vs "optimization"
- "sample" vs "example" vs "instance"
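A quick way to audit a draft for mixed terminology is to count how often each competing term appears, then standardize on the winner. The snippet below is an illustrative sketch (the `term_usage` helper and group labels are invented; the synonym groups mirror the examples above):

```python
import re
from collections import Counter

# Synonym groups from the examples above; extend for your own draft.
SYNONYM_GROUPS = {
    "model/network/architecture": ["model", "network", "architecture"],
    "training/learning/optimization": ["training", "learning", "optimization"],
    "sample/example/instance": ["sample", "example", "instance"],
}

def term_usage(text):
    """Count how often each competing term appears in `text`."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    return {group: {term: words[term] for term in terms}
            for group, terms in SYNONYM_GROUPS.items()}

draft = "We train the model. The network overfits; the model recovers."
print(term_usage(draft)["model/network/architecture"])
# → {'model': 2, 'network': 1, 'architecture': 0}
```

A nonzero count in more than one column of a group is the signal to pick one term and replace the rest.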

### Vocabulary Signaling

**Avoid words signaling incremental work**:

- Never: "combine," "modify," "expand," "extend"
- Instead: "develop," "propose," "introduce"

**Why**: "We combine X and Y" sounds like you stapled two existing ideas together. "We develop a method that leverages X for Y" sounds like a genuine contribution.

---

## Mathematical Writing

### General Principles

1. **State all assumptions formally** before theorems
2. **Provide intuitive explanations** alongside proofs
3. **Use consistent notation** throughout the paper
4. **Define symbols at first use**

### Notation Conventions

```latex
% Scalars: lowercase italic
$x$, $y$, $\alpha$, $\beta$

% Vectors: lowercase bold
$\mathbf{x}$, $\mathbf{v}$

% Matrices: uppercase bold
$\mathbf{W}$, $\mathbf{X}$

% Sets: uppercase calligraphic
$\mathcal{X}$, $\mathcal{D}$

% Functions: roman for named functions
$\mathrm{softmax}$, $\mathrm{ReLU}$
```

---

## Figure Design

### From Neel Nanda

Figures should tell a coherent story even if the reader skips the text. Many readers DO skip the text initially.

### Design Principles

1. **Figure 1 is crucial**: Often the first thing readers examine after the abstract
2. **Self-contained captions**: The reader should understand the figure without the main text
3. **No title inside figure**: The caption serves this function (ICML/NeurIPS convention)
4. **Vector graphics**: PDF/EPS for plots, PNG (600 DPI) only for photographs

### Accessibility Requirements

Roughly 8% of men have a color vision deficiency. Your figures must work for them.

**Solutions**:

- Use colorblind-safe palettes: Okabe-Ito or Paul Tol
- Avoid red-green combinations
- Verify figures work in grayscale
- Use different line styles (solid, dashed, dotted) in addition to colors
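The palette and line-style advice combine naturally in matplotlib. A minimal sketch (the Okabe-Ito hex values are the standard ones; the file name, labels, and toy data are illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere
import matplotlib.pyplot as plt

# Okabe-Ito colorblind-safe palette (standard hex values)
OKABE_ITO = ["#E69F00", "#56B4E9", "#009E73", "#F0E442",
             "#0072B2", "#D55E00", "#CC79A7", "#000000"]

# Vary line styles as well as colors so the plot survives grayscale.
LINE_STYLES = ["-", "--", ":", "-."]

fig, ax = plt.subplots()
for i, (color, style) in enumerate(zip(OKABE_ITO, LINE_STYLES)):
    ax.plot([0, 1], [i, i + 1], color=color, linestyle=style,
            label=f"method {i}")
ax.legend()
fig.savefig("comparison.pdf")  # vector output, per the plot guidelines
```

Setting the color cycle once via `ax.set_prop_cycle` is an alternative when many axes share the palette.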

### Tools

```python
# SciencePlots: publication-ready matplotlib styles
import matplotlib.pyplot as plt
import scienceplots  # needed since SciencePlots 2.0 to register the styles

plt.style.use(['science', 'ieee'])

# Or for Nature-style figures
plt.style.use(['science', 'nature'])
```

---

## Common Mistakes to Avoid

### Structure Mistakes

| Mistake | Solution |
|---------|----------|
| Introduction too long (>1.5 pages) | Move background to Related Work |
| Methods buried (after page 3) | Front-load contribution, cut intro |
| Missing contribution bullets | Add 2-4 specific, falsifiable claims |
| Experiments without explicit claims | State what each experiment tests |

### Writing Mistakes

| Mistake | Solution |
|---------|----------|
| Generic abstract opening | Start with your specific contribution |
| Inconsistent terminology | Choose one term per concept |
| Passive voice overuse | Use active voice: "We show" not "It is shown" |
| Hedging everywhere | Be confident unless genuinely uncertain |

### Figure Mistakes

| Mistake | Solution |
|---------|----------|
| Raster graphics for plots | Use vector (PDF/EPS) |
| Red-green color scheme | Use colorblind-safe palette |
| Title inside figure | Put title in caption |
| Captions require main text | Make captions self-contained |

### Citation Mistakes

| Mistake | Solution |
|---------|----------|
| Paper-by-paper Related Work | Organize methodologically |
| Missing relevant citations | Reviewers may have authored related papers; cite generously |
| AI-generated citations | Verify every reference against a bibliographic database |
| Inconsistent citation format | Use BibLaTeX with consistent keys |

---

## Pre-Submission Checklist

Before submitting, verify:

**Narrative**:

- [ ] Can state contribution in one sentence
- [ ] Three pillars (What/Why/So What) clear in intro
- [ ] Every experiment supports a specific claim

**Structure**:

- [ ] Abstract follows 5-sentence formula
- [ ] Introduction ≤1.5 pages
- [ ] Methods start by page 2-3
- [ ] 2-4 contribution bullets included
- [ ] Limitations section present

**Writing**:

- [ ] Consistent terminology throughout
- [ ] No generic opening sentences
- [ ] Hedging removed unless necessary
- [ ] All figures have self-contained captions

**Technical**:

- [ ] All citations verified via API
- [ ] Error bars included with methodology
- [ ] Compute resources documented
- [ ] Code/data availability stated