ML Paper Writing Philosophy & Best Practices

This reference compiles writing advice from prominent ML researchers including Neel Nanda, Andrej Karpathy, Sebastian Farquhar, Zachary Lipton, and Jacob Steinhardt.



The Narrative Principle

From Neel Nanda

"A paper is a short, rigorous, evidence-based technical story with a takeaway readers care about."

The narrative rests on three pillars that must be crystal clear by the end of your introduction:

The "What": One to three specific novel claims fitting within a cohesive theme. Vague contributions like "we study X" fail immediately—reviewers need precise, falsifiable claims.

The "Why": Rigorous empirical evidence that convincingly supports those claims, including strong baselines honestly tuned and experiments that distinguish between competing hypotheses rather than merely showing "decent results."

The "So What": Why readers should care, connecting your contribution to problems the community recognizes as important.

From Andrej Karpathy

"A paper is not a random collection of experiments you report on. The paper sells a single thing that was not obvious or present before. The entire paper is organized around this core contribution with surgical precision."

This applies whether you're presenting a new architecture, a theoretical result, or improved understanding of existing methods—NeurIPS explicitly notes that "originality does not necessarily require an entirely new method."

Practical Implication: If you cannot state your contribution in one sentence, you don't yet have a paper. Everything else—experiments, related work, discussion—exists only to support that core claim.


Time Allocation

From Neel Nanda

Spend approximately the same amount of time on each of:

  1. The abstract
  2. The introduction
  3. The figures
  4. Everything else combined

This isn't hyperbole—most reviewers form preliminary judgments before reaching your methods section. Readers encounter your paper in a predictable pattern: title → abstract → introduction → figures → maybe the rest.

Reviewer Reading Patterns

Studies of reviewer behavior show:

  • Abstract is read 100% of the time
  • Introduction is skimmed by 90%+ of reviewers
  • Figures are examined before methods by most reviewers
  • Full methods are read only if interest is established

Implication: Front-load your paper's value. Don't bury the contribution.


Abstract Writing Formula

Sebastian Farquhar's 5-Sentence Formula

  1. What you achieved: "We introduce...", "We prove...", "We demonstrate..."
  2. Why this is hard and important
  3. How you do it (with specialist keywords for discoverability)
  4. What evidence you have
  5. Your most remarkable number/result

Example (Good Abstract)

We prove that gradient descent on overparameterized neural networks
converges to global minima at a linear rate. [What]
This resolves a fundamental question about why deep learning works
despite non-convex optimization landscapes. [Why hard/important]
Our proof relies on showing that the Neural Tangent Kernel remains
approximately constant during training, reducing the problem to
kernel regression. [How with keywords]
We validate our theory on CIFAR-10 and ImageNet, showing that
predicted convergence rates match experiments within 5%. [Evidence]
This is the first polynomial-time convergence guarantee for
networks with practical depth and width. [Remarkable result]

What to Avoid

From Zachary Lipton: "If the first sentence can be pre-pended to any ML paper, delete it."

Delete these openings:

  • "Large language models have achieved remarkable success..."
  • "Deep learning has revolutionized..."
  • "In recent years, neural networks have..."

Start with your specific contribution instead.


Introduction Structure

Requirements

  • 1-1.5 pages maximum (in two-column format)
  • Methods should start by page 2-3
  • Must include 2-4 bullet contribution list (max 1-2 lines each)

Structure Template

1. Opening Hook (2-3 sentences)
   - State the problem your paper addresses
   - Why it matters RIGHT NOW

2. Background/Challenge (1 paragraph)
   - What makes this problem hard?
   - What have others tried? Why is it insufficient?

3. Your Approach (1 paragraph)
   - What do you do differently?
   - Key insight that enables your contribution

4. Contribution Bullets (2-4 items)
   - Be specific and falsifiable
   - Each bullet: 1-2 lines maximum

5. Results Preview (2-3 sentences)
   - Most impressive numbers
   - Scope of evaluation

6. Paper Organization (optional, 1-2 sentences)
   - "Section 2 presents... Section 3 describes..."

Contribution Bullets: Good vs Bad

Good:

  • We prove that X converges in O(n log n) time under assumption Y
  • We introduce Z, a 3-layer architecture that reduces memory by 40%
  • We demonstrate that A outperforms B by 15% on benchmark C

Bad:

  • We study the problem of X (not a contribution)
  • We provide extensive experiments (too vague)
  • We make several contributions to the field (says nothing)

Sentence-Level Clarity

From Gopen & Swan: "The Science of Scientific Writing"

The seminal 1990 paper by George Gopen and Judith Swan establishes that readers have structural expectations about where information appears in prose. Violating these expectations forces readers to spend energy on structure rather than content.

"If the reader is to grasp what the writer means, the writer must understand what the reader needs."

The 7 Principles of Reader Expectations

Principle 1: Subject-Verb Proximity

Keep grammatical subject and verb close together. Anything intervening reads as interruption of lesser importance.

Weak: "The model, which was trained on 100M tokens and fine-tuned on domain-specific data using LoRA with rank 16, achieves state-of-the-art results"

Strong: "The model achieves state-of-the-art results after training on 100M tokens and fine-tuning with LoRA (rank 16)"

Principle 2: Stress Position (Save the Best for Last)

Readers naturally emphasize the last words of a sentence. Place your most important information there.

Weak: "Accuracy improves by 15% when using attention"

Strong: "When using attention, accuracy improves by 15%"

Principle 3: Topic Position (First Things First)

The beginning of a sentence establishes perspective. Put the "whose story" element first—readers expect the sentence to be about whoever shows up first.

Weak: "A novel attention mechanism that computes alignment scores is introduced"

Strong: "To address the alignment problem, we introduce a novel attention mechanism"

Principle 4: Old Information Before New

Put familiar information (old) in the topic position for backward linkage; put new information in the stress position for emphasis.

Weak: "Sparse attention was introduced by Child et al. The quadratic complexity of standard attention motivates this work."

Strong: "Standard attention has quadratic complexity. To address this, Child et al. introduced sparse attention."

Principle 5: One Unit, One Function

Each unit of discourse (sentence, paragraph, section) should serve a single function. If you have two points, use two units.

Principle 6: Articulate Action in the Verb

Express the action of each sentence in its verb, not in nominalized nouns.

Weak: "We performed an analysis of the results" (nominalization)

Strong: "We analyzed the results" (action in verb)

Principle 7: Context Before New Information

Provide context before asking the reader to consider anything new. This applies at all levels—sentence, paragraph, section.

Weak: "Equation 3 shows that convergence is guaranteed when the learning rate satisfies..."

Strong: "For convergence to be guaranteed, the learning rate must satisfy the condition in Equation 3..."

Summary Table

| Principle | Rule | Mnemonic |
|---|---|---|
| Subject-Verb Proximity | Keep subject and verb close | "Don't interrupt yourself" |
| Stress Position | Emphasis at sentence end | "Save the best for last" |
| Topic Position | Context at sentence start | "First things first" |
| Old Before New | Familiar → unfamiliar | "Build on known ground" |
| One Unit, One Function | Each paragraph = one point | "One idea per container" |
| Action in Verb | Use verbs, not nominalizations | "Verbs do, nouns sit" |
| Context Before New | Explain before presenting | "Set the stage first" |

Micro-Level Writing Tips

From Ethan Perez (Anthropic)

These practical micro-level tips improve clarity at the sentence and word level.

Pronoun Management

Minimize pronouns ("this," "it," "these," "that"). When pronouns are necessary, use them as adjectives with a noun:

Weak: "This shows that the model converges."

Strong: "This result shows that the model converges."

Weak: "It improves performance."

Strong: "This modification improves performance."

Verb Placement

Position verbs early in sentences for better parsing:

Weak: "The gradient, after being computed and normalized, updates the weights."

Strong: "The gradient updates the weights after being computed and normalized."

Apostrophe Unfolding

Transform possessive constructions for clarity:

Original: "X's Y" → Unfolded: "The Y of X"

Before: "The model's accuracy on the test set"

After: "The accuracy of the model on the test set"

This isn't always better, but when sentences feel awkward, try unfolding.

Words to Eliminate

Delete these filler words in almost all cases:

  • "actually"
  • "a bit"
  • "fortunately" / "unfortunately"
  • "very" / "really"
  • "quite"
  • "basically"
  • "essentially"
  • Excessive connectives ("however," "moreover," "furthermore" when not needed)
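
The filler-word list above is mechanical enough to check with a script. The following is a minimal sketch, assuming the draft is available as plain text; the names `FILLERS` and `find_fillers` are illustrative, and the connectives are left out because they are only sometimes unnecessary.

```python
import re

# Filler words taken from the list above; extend as needed.
FILLERS = [
    "actually", "a bit", "fortunately", "unfortunately",
    "very", "really", "quite", "basically", "essentially",
]

# One case-insensitive pattern with word boundaries,
# so "very" does not match inside "every".
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(w) for w in FILLERS) + r")\b",
    re.IGNORECASE,
)


def find_fillers(text):
    """Return (line_number, word) pairs for each filler word found."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            hits.append((lineno, match.group(1).lower()))
    return hits


# Hypothetical draft text, for illustration only.
draft = "Our method is actually very fast.\nEvery run basically converges."
for lineno, word in find_fillers(draft):
    print(f"line {lineno}: consider deleting '{word}'")
```

A hit is a prompt to reread the sentence, not an automatic deletion: the guidance says "in almost all cases", so some matches will be legitimate.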

Sentence Construction Rules

  1. One idea per sentence - If an idea will not fit in one sentence, split it into two
  2. No repeated sounds - Avoid similar-sounding words in the same sentence
  3. Every sentence adds information - Delete sentences that merely restate
  4. Active voice always - Specify the actor ("We find..." not "It is found...")
  5. Expand contractions - "don't" → "do not" for formality

Paragraph Architecture

  • First sentence: State the point clearly
  • Middle sentences: Support with evidence
  • Last sentence: Reinforce or transition

Don't bury key information in the middle of paragraphs.


Word Choice and Precision

From Zachary Lipton

Eliminate hedging unless genuine uncertainty exists:

  • Delete "may" and "can" unless necessary
  • "provides very tight approximation" drips with insecurity
  • "provides tight approximation" is confident

Avoid vacuous intensifiers:

  • Delete: very, extremely, highly, significantly (unless statistical)
  • These words signal insecurity, not strength

From Jacob Steinhardt

Precision over brevity: Replace vague terms with specific ones.

| Vague | Specific |
|---|---|
| performance | accuracy, latency, throughput |
| improves | increases accuracy by X%, reduces latency by Y |
| large | 1B parameters, 100M tokens |
| fast | 3x faster, 50ms latency |
| good results | 92% accuracy, 0.85 F1 |

Consistent terminology: Referring to the same concept with different terms creates confusion.

Choose one and stick with it:

  • "model" vs "network" vs "architecture"
  • "training" vs "learning" vs "optimization"
  • "sample" vs "example" vs "instance"
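
One way to spot mixed terminology is to count how often each variant appears in a draft. The sketch below assumes plain-text input and uses the three synonym groups listed above; `TERM_GROUPS`, `term_usage`, and `inconsistent_groups` are illustrative names, and the plural handling (a trailing "s") is deliberately crude.

```python
import re
from collections import Counter

# Groups of near-synonyms from the list above; ideally one term per group survives.
TERM_GROUPS = [
    ("model", "network", "architecture"),
    ("training", "learning", "optimization"),
    ("sample", "example", "instance"),
]


def term_usage(text):
    """Count whole-word uses of each term (simple plural 's' included), per group."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    report = []
    for group in TERM_GROUPS:
        counts = {term: words[term] + words[term + "s"] for term in group}
        report.append({term: n for term, n in counts.items() if n > 0})
    return report


def inconsistent_groups(text):
    """Return only the groups in which more than one variant appears."""
    return [used for used in term_usage(text) if len(used) > 1]


# Hypothetical draft text, for illustration only.
draft = "We train the model on 10k samples. The network sees each example once."
for used in inconsistent_groups(draft):
    print("mixed terms:", used)
```

Some mixed groups are legitimate (for instance, "architecture" may name something distinct from "model"), so treat the output as a list of places to check by hand.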

Vocabulary Signaling

Avoid words signaling incremental work:

  • Never: "combine," "modify," "expand," "extend"
  • Instead: "develop," "propose," "introduce"

Why: "We combine X and Y" sounds like you stapled two existing ideas together. "We develop a method that leverages X for Y" sounds like genuine contribution.


Mathematical Writing

From Ethan Perez

Unfold apostrophes for clarity:

  • Weak: "X's Y"
  • Strong: "The Y of X"

Example: "the model's accuracy" → "the accuracy of the model"

General Principles

  1. State all assumptions formally before theorems
  2. Provide intuitive explanations alongside proofs
  3. Use consistent notation throughout the paper
  4. Define symbols at first use

Notation Conventions

% Scalars: lowercase italic
$x$, $y$, $\alpha$, $\beta$

% Vectors: lowercase bold
$\mathbf{x}$, $\mathbf{v}$

% Matrices: uppercase bold
$\mathbf{W}$, $\mathbf{X}$

% Sets: uppercase calligraphic
$\mathcal{X}$, $\mathcal{D}$

% Functions: roman for named functions
$\mathrm{softmax}$, $\mathrm{ReLU}$
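
A common way to keep these conventions consistent is to route every symbol through a small set of preamble macros, so a notation change is a one-line edit. The macro names below (`\vect`, `\mat`, `\set`) are hypothetical choices and may clash with packages you load; `\DeclareMathOperator` assumes `amsmath` is available.

```latex
% Preamble (requires \usepackage{amsmath} for \DeclareMathOperator)
\newcommand{\vect}[1]{\mathbf{#1}}       % vectors: lowercase bold
\newcommand{\mat}[1]{\mathbf{#1}}        % matrices: uppercase bold
\newcommand{\set}[1]{\mathcal{#1}}       % sets: uppercase calligraphic
\DeclareMathOperator{\softmax}{softmax}  % named functions: roman

% In the body
$\vect{y} = \softmax(\mat{W}\vect{x})$, with $\vect{x} \in \set{X}$
```

Keeping `\vect` and `\mat` as separate macros, even though both expand to bold here, leaves room to change one convention later without touching the other.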

Figure Design

From Neel Nanda

Figures should tell a coherent story even if the reader skips the text. Many readers DO skip the text initially.

Design Principles

  1. Figure 1 is crucial: Often the first thing readers examine after abstract
  2. Self-contained captions: Reader should understand figure without main text
  3. No title inside figure: The caption serves this function (ICML/NeurIPS rule)
  4. Vector graphics: PDF/EPS for plots, PNG (600 DPI) only for photographs

Accessibility Requirements

Roughly 8% of men have some form of color vision deficiency. Your figures must work for them.

Solutions:

  • Use colorblind-safe palettes: Okabe-Ito or Paul Tol
  • Avoid red-green combinations
  • Verify figures work in grayscale
  • Use different line styles (solid, dashed, dotted) in addition to colors
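
The grayscale check can be approximated numerically before rendering any figure. The sketch below lists the Okabe-Ito palette by its commonly published hex values and estimates each color's gray level with Rec. 601 luma weights; this is a rough proxy for grayscale printing, not a simulation of color vision deficiency, and the `min_gap` threshold of 0.1 is an arbitrary assumption.

```python
# Okabe-Ito palette (hex values as commonly published).
OKABE_ITO = {
    "black": "#000000",
    "orange": "#E69F00",
    "sky blue": "#56B4E9",
    "bluish green": "#009E73",
    "yellow": "#F0E442",
    "blue": "#0072B2",
    "vermillion": "#D55E00",
    "reddish purple": "#CC79A7",
}


def gray_level(hex_color):
    """Approximate grayscale value in [0, 1] using Rec. 601 luma weights."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) / 255 for i in (0, 2, 4))
    return 0.299 * r + 0.587 * g + 0.114 * b


def too_close_in_gray(colors, min_gap=0.1):
    """Return adjacent name pairs (sorted by gray level) closer than min_gap."""
    names = sorted(colors, key=lambda name: gray_level(colors[name]))
    return [
        (a, b)
        for a, b in zip(names, names[1:])
        if gray_level(colors[b]) - gray_level(colors[a]) < min_gap
    ]


for a, b in too_close_in_gray(OKABE_ITO):
    print(f"'{a}' and '{b}' look similar in grayscale; vary the line style too")
```

Even a colorblind-safe palette contains pairs that are nearly indistinguishable in gray, which is why the list above pairs color with distinct line styles rather than relying on color alone.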

Tools

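# SciencePlots: publication-ready styles (pip install SciencePlots)
import matplotlib.pyplot as plt
import scienceplots  # registers the 'science' styles; required since SciencePlots 2.0

# IEEE-style figures (the default styles render text with LaTeX; add 'no-latex' if it is not installed)
plt.style.use(['science', 'ieee'])

# Or for Nature-style
plt.style.use(['science', 'nature'])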

Common Mistakes to Avoid

Structure Mistakes

| Mistake | Solution |
|---|---|
| Introduction too long (>1.5 pages) | Move background to Related Work |
| Methods buried (after page 3) | Front-load contribution, cut intro |
| Missing contribution bullets | Add 2-4 specific, falsifiable claims |
| Experiments without explicit claims | State what each experiment tests |

Writing Mistakes

| Mistake | Solution |
|---|---|
| Generic abstract opening | Start with your specific contribution |
| Inconsistent terminology | Choose one term per concept |
| Passive voice overuse | Use active voice: "We show" not "It is shown" |
| Hedging everywhere | Be confident unless genuinely uncertain |

Figure Mistakes

| Mistake | Solution |
|---|---|
| Raster graphics for plots | Use vector (PDF/EPS) |
| Red-green color scheme | Use colorblind-safe palette |
| Title inside figure | Put title in caption |
| Captions require main text | Make captions self-contained |

Citation Mistakes

| Mistake | Solution |
|---|---|
| Paper-by-paper Related Work | Organize methodologically |
| Missing relevant citations | Reviewers may have authored the relevant papers; cite generously |
| AI-generated citations | Always verify via APIs |
| Inconsistent citation format | Use BibLaTeX with consistent keys |
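
Verifying a citation through an API can be as simple as searching a bibliographic index and comparing titles. The sketch below assumes the public Crossref REST API (`/works` with the `query.bibliographic` parameter); the helper names are illustrative, and arXiv-only papers may not be indexed there, so a miss means "check by hand", not "fabricated".

```python
import json
import re
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works"


def normalize_title(title):
    """Lowercase and drop punctuation so cosmetic differences do not matter."""
    return " ".join(re.findall(r"[a-z0-9]+", title.lower()))


def titles_match(cited_title, found_title):
    """Compare two titles after normalization."""
    return normalize_title(cited_title) == normalize_title(found_title)


def build_query_url(citation_text, rows=1):
    """Build a Crossref bibliographic search URL for a free-text citation."""
    params = urllib.parse.urlencode(
        {"query.bibliographic": citation_text, "rows": rows}
    )
    return f"{CROSSREF_WORKS}?{params}"


def lookup_title(citation_text, timeout=10):
    """Return the title of the top Crossref hit, or None (makes a network request)."""
    url = build_query_url(citation_text)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        items = json.load(resp)["message"]["items"]
    if not items or not items[0].get("title"):
        return None
    return items[0]["title"][0]
```

A typical use is to call `lookup_title` on each bibliography entry and flag entries where `titles_match` fails or nothing is returned. Exact title matching is strict by design; a fuzzy comparison would give fewer false alarms at the cost of letting near-miss hallucinations through.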

Pre-Submission Checklist

Before submitting, verify:

Narrative:

  • Can state contribution in one sentence
  • Three pillars (What/Why/So What) clear in intro
  • Every experiment supports a specific claim

Structure:

  • Abstract follows 5-sentence formula
  • Introduction ≤1.5 pages
  • Methods start by page 2-3
  • 2-4 contribution bullets included
  • Limitations section present

Writing:

  • Consistent terminology throughout
  • No generic opening sentences
  • Hedging removed unless necessary
  • All figures have self-contained captions

Technical:

  • All citations verified via API
  • Error bars included with methodology
  • Compute resources documented
  • Code/data availability stated
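
For the error-bar item, the "methodology" is the part readers most often cannot reconstruct: what the bar measures and over how many runs. The sketch below reports mean and standard error across seeds using only the standard library; the accuracy values are made-up illustrative numbers, and standard error is one assumed choice among several (standard deviation or a confidence interval are equally valid if stated).

```python
import math
import statistics


def summarize(values):
    """Return (mean, standard error of the mean) for per-seed results."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(values)
    # statistics.stdev is the sample standard deviation (n - 1 denominator).
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, sem


# Hypothetical accuracies from five seeds (illustrative, not real results).
accuracies = [0.91, 0.93, 0.92, 0.90, 0.94]
mean, sem = summarize(accuracies)
print(
    f"accuracy: {mean:.3f} ± {sem:.3f} "
    f"(mean ± standard error, n={len(accuracies)} seeds)"
)
```

Putting the definition and the run count in the caption or table header keeps the figure self-contained, in line with the caption advice earlier in this guide.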