# Benchmark Guide

A complete guide to the 60+ evaluation tasks in lm-evaluation-harness: what they measure and how to interpret the results.

## Overview

The lm-evaluation-harness includes 60+ benchmarks spanning:

- Language understanding (MMLU, GLUE)
- Mathematical reasoning (GSM8K, MATH)
- Code generation (HumanEval, MBPP)
- Instruction following (IFEval, AlpacaEval)
- Long-context understanding (LongBench)
- Multilingual capabilities (AfroBench, NorEval)
- Reasoning (BBH, ARC)
- Truthfulness (TruthfulQA)

**List all tasks**:

```bash
lm_eval --tasks list
```

## Major Benchmarks

### MMLU (Massive Multitask Language Understanding)

**What it measures**: Broad knowledge across 57 subjects (STEM, humanities, social sciences, law).

**Task variants**:

- `mmlu`: Original 57-subject benchmark
- `mmlu_pro`: More challenging version with reasoning-focused questions
- `mmlu_prox`: Multilingual extension

**Format**: Multiple choice (4 options)

**Example**:

```
Question: What is the capital of France?
A. Berlin
B. Paris
C. London
D. Madrid
Answer: B
```
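
Multiple-choice tasks such as MMLU are typically scored by comparing the log-likelihood the model assigns to each answer option and picking the highest-scoring one. Below is a minimal sketch of that idea using a small Hugging Face model for illustration; it is a simplified stand-in, not the harness's actual implementation.

```python
# Sketch: score multiple-choice options by continuation log-likelihood.
# Assumes `torch` and `transformers` are installed; "gpt2" is only a small
# illustrative model, not a recommendation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def option_loglikelihood(context: str, option: str) -> float:
    """Sum of log-probabilities the model assigns to `option` given `context`."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    opt_ids = tokenizer(option, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, opt_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Each option token is predicted from the position just before it.
    log_probs = torch.log_softmax(logits[0, ctx_ids.shape[1] - 1 : -1], dim=-1)
    return log_probs.gather(1, opt_ids[0].unsqueeze(1)).sum().item()

question = "Question: What is the capital of France?\nAnswer:"
options = [" Berlin", " Paris", " London", " Madrid"]
scores = [option_loglikelihood(question, o) for o in options]
print(options[scores.index(max(scores))])  # a reasonable model should pick " Paris"
```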

**Command**:

```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks mmlu \
    --num_fewshot 5
```

**Interpretation**:

- Random: 25% (chance)
- GPT-3 (175B): 43.9%
- GPT-4: 86.4%
- Human expert: ~90%

**Good for**: Assessing general knowledge and domain expertise.

### GSM8K (Grade School Math 8K)

**What it measures**: Mathematical reasoning on grade-school-level word problems.

**Task variants**:

- `gsm8k`: Base task
- `gsm8k_cot`: With chain-of-thought prompting
- `gsm_plus`: Adversarial variant with perturbations

**Format**: Free-form generation; the numerical answer is extracted from the model's output

**Example**:

```
Question: A baker made 200 cookies. He sold 3/5 of them in the morning and 1/4 of the remaining in the afternoon. How many cookies does he have left?
Answer: 60
```
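
Scoring hinges on pulling the final number out of the model's free-form answer. A minimal sketch of such an extraction step (a simplified stand-in for the harness's regex-based answer filters, not the exact implementation):

```python
import re

def extract_final_number(text: str) -> str | None:
    """Return the last number in the model's output, with thousands separators removed."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return matches[-1].replace(",", "") if matches else None

prediction = "He sold 120 in the morning and 20 in the afternoon, so 60 cookies are left."
print(extract_final_number(prediction))  # "60"
```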

**Command**:

```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks gsm8k \
    --num_fewshot 5
```

**Interpretation**:

- Random: ~0%
- GPT-3 (175B): 17.0%
- GPT-4: 92.0%
- Llama 2 70B: 56.8%

**Good for**: Testing multi-step reasoning and arithmetic.

### HumanEval

**What it measures**: Python code generation from docstrings (functional correctness).

**Task variants**:

- `humaneval`: Standard benchmark
- `humaneval_instruct`: For instruction-tuned models

**Format**: Code generation, execution-based evaluation

**Example**:

```python
from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
```
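
Evaluation is execution-based: the model's completion is appended to the prompt and run against the problem's unit tests, and the sample counts as correct only if they pass. A minimal sketch of that check (HumanEval problems ship a test snippet defining `check(candidate)` plus an `entry_point` name; the real evaluation sandboxes execution and enforces timeouts):

```python
def passes_tests(prompt: str, completion: str, test_code: str, entry_point: str) -> bool:
    """Run the candidate program plus its unit tests in a scratch namespace.

    WARNING: this executes model-generated code directly; never do this outside
    an isolated sandbox.
    """
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)        # define the candidate function
        exec(test_code, namespace)                  # define check(candidate)
        namespace["check"](namespace[entry_point])  # raises AssertionError on failure
        return True
    except Exception:
        return False

# Illustrative toy problem (not an actual HumanEval task):
prompt = "def add(a, b):\n"
completion = "    return a + b\n"
test_code = "def check(candidate):\n    assert candidate(2, 3) == 5\n"
print(passes_tests(prompt, completion, test_code, "add"))  # True
```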

**Command**:

```bash
lm_eval --model hf \
    --model_args pretrained=codellama/CodeLlama-7b-hf \
    --tasks humaneval \
    --batch_size 1
```

**Interpretation**:

- Random: 0%
- GPT-3 (175B): 0%
- Codex: 28.8%
- GPT-4: 67.0%
- Code Llama 34B: 53.7%

**Good for**: Evaluating code generation capabilities.

### BBH (BIG-Bench Hard)

**What it measures**: 23 challenging BIG-Bench tasks on which earlier language models failed to outperform the average human rater.

**Categories**:

- Logical reasoning
- Math word problems
- Social understanding
- Algorithmic reasoning

**Format**: Multiple choice and free-form

**Command**:

```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks bbh \
    --num_fewshot 3
```

**Interpretation**:

- Random: ~25%
- GPT-3 (175B): 33.9%
- PaLM 540B: 58.3%
- GPT-4: 86.7%

**Good for**: Testing advanced reasoning capabilities.

### IFEval (Instruction-Following Evaluation)

**What it measures**: Ability to follow specific, verifiable instructions.

**Instruction types**:

- Format constraints (e.g., "answer in 3 sentences")
- Length constraints (e.g., "use at least 100 words")
- Content constraints (e.g., "include the word 'banana'")
- Structural constraints (e.g., "use bullet points")

**Format**: Free-form generation with rule-based verification
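
Each response is checked programmatically against the instruction it was given, so no judge model is needed. A minimal sketch of two such verifiers (hypothetical helper names, much simpler than IFEval's actual checks):

```python
import re

def meets_min_word_count(response: str, min_words: int) -> bool:
    """Length constraint, e.g. "use at least 100 words"."""
    return len(response.split()) >= min_words

def contains_keyword(response: str, keyword: str) -> bool:
    """Content constraint, e.g. "include the word 'banana'"."""
    return re.search(rf"\b{re.escape(keyword)}\b", response, re.IGNORECASE) is not None

response = "I eat a banana every single morning with my breakfast cereal. " * 3
print(meets_min_word_count(response, 20), contains_keyword(response, "banana"))  # True True
```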

**Command**:

```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-chat-hf \
    --tasks ifeval \
    --batch_size auto
```

**Interpretation**:

- Measures instruction adherence, not response quality
- GPT-4: 86% instruction following
- Claude 2: 84%

**Good for**: Evaluating chat/instruct models.

### GLUE (General Language Understanding Evaluation)

**What it measures**: Natural language understanding across 9 tasks.

**Tasks**:

- `cola`: Grammatical acceptability
- `sst2`: Sentiment analysis
- `mrpc`: Paraphrase detection
- `qqp`: Question pairs
- `stsb`: Semantic similarity
- `mnli`: Natural language inference
- `qnli`: Question answering NLI
- `rte`: Recognizing textual entailment
- `wnli`: Winograd schemas

**Command**:

```bash
lm_eval --model hf \
    --model_args pretrained=bert-base-uncased \
    --tasks glue \
    --num_fewshot 0
```

**Interpretation**:

- BERT Base: 78.3 (GLUE score)
- RoBERTa Large: 88.5
- Human baseline: 87.1

**Good for**: Encoder-only models, fine-tuning baselines.

### LongBench

**What it measures**: Long-context understanding (4K-32K tokens).

**21 tasks covering**:

- Single-document QA
- Multi-document QA
- Summarization
- Few-shot learning
- Code completion
- Synthetic tasks

**Command**:

```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks longbench \
    --batch_size 1
```

**Interpretation**:

- Tests context utilization
- Many models struggle beyond 4K tokens
- GPT-4 Turbo: 54.3%

**Good for**: Evaluating long-context models.

## Additional Benchmarks

### TruthfulQA

**What it measures**: A model's propensity to be truthful rather than generate plausible-sounding falsehoods.

**Format**: Multiple choice with 4-5 options

**Command**:

```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks truthfulqa_mc2 \
    --batch_size auto
```

**Interpretation**:

- Larger models often score worse (they produce more convincing falsehoods)
- GPT-3: 58.8%
- GPT-4: 59.0%
- Human: ~94%

### ARC (AI2 Reasoning Challenge)

**What it measures**: Grade-school science questions.

**Variants**:

- `arc_easy`: Easier questions
- `arc_challenge`: Harder questions requiring reasoning

**Command**:

```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks arc_challenge \
    --num_fewshot 25
```

**Interpretation**:

- ARC-Easy: most models score >80%
- ARC-Challenge random baseline: 25%
- GPT-4: 96.3%

### HellaSwag

**What it measures**: Commonsense reasoning about everyday situations.

**Format**: Choose the most plausible continuation

**Command**:

```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks hellaswag \
    --num_fewshot 10
```

**Interpretation**:

- Random: 25%
- GPT-3: 78.9%
- Llama 2 70B: 85.3%

### WinoGrande

**What it measures**: Commonsense reasoning via pronoun resolution.

**Example**:

```
The trophy doesn't fit in the brown suitcase because _ is too large.
A. the trophy
B. the suitcase
```

**Command**:

```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks winogrande \
    --num_fewshot 5
```

### PIQA

**What it measures**: Physical commonsense reasoning.

**Example**: "To clean a keyboard, use compressed air or..."

**Command**:

```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks piqa
```

## Multilingual Benchmarks

### AfroBench

**What it measures**: Performance across 64 African languages.

**15 tasks**: NLU, text generation, knowledge, QA, math reasoning

**Command**:

```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks afrobench
```

### NorEval

**What it measures**: Norwegian language understanding (9 task categories).

**Command**:

```bash
lm_eval --model hf \
    --model_args pretrained=NbAiLab/nb-gpt-j-6B \
    --tasks noreval
```

## Domain-Specific Benchmarks

### MATH

**What it measures**: High-school competition math problems.

**Command**:

```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks math \
    --num_fewshot 4
```

**Interpretation**:

- Very challenging
- GPT-4: 42.5%
- Minerva 540B: 33.6%

### MBPP (Mostly Basic Python Problems)

**What it measures**: Python programming from natural language descriptions.

**Command**:

```bash
lm_eval --model hf \
    --model_args pretrained=codellama/CodeLlama-7b-hf \
    --tasks mbpp \
    --batch_size 1
```

### DROP

**What it measures**: Reading comprehension requiring discrete reasoning.

**Command**:

```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks drop
```

## Benchmark Selection Guide

### For General Purpose Models

Run this suite:

```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks mmlu,gsm8k,hellaswag,arc_challenge,truthfulqa_mc2 \
    --num_fewshot 5
```

### For Code Models

```bash
lm_eval --model hf \
    --model_args pretrained=codellama/CodeLlama-7b-hf \
    --tasks humaneval,mbpp \
    --batch_size 1
```

### For Chat/Instruct Models

```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-chat-hf \
    --tasks ifeval,mmlu,gsm8k_cot \
    --batch_size auto
```

### For Long Context Models

```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3.1-8B \
    --tasks longbench \
    --batch_size 1
```

## Interpreting Results

### Understanding Metrics

**Accuracy**: Percentage of correct answers (most common)

**Exact Match (EM)**: Requires an exact string match with the reference answer (strict)

**F1 Score**: Balances precision and recall, e.g. over answer tokens

**BLEU/ROUGE**: N-gram overlap with a reference text, used for generation and summarization tasks

**Pass@k**: Probability that at least one of k generated samples passes the unit tests (used for code)
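
For code benchmarks, pass@k is usually reported with the unbiased estimator from the HumanEval paper: draw n >= k samples per problem, count the c that pass, and compute 1 - C(n-c, k)/C(n, k). A small sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples drawn per problem, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples for one problem, 30 of them passed the tests.
print(round(pass_at_k(200, 30, 1), 3))   # 0.15
print(round(pass_at_k(200, 30, 10), 3))  # ~0.81
```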

### Typical Score Ranges

| Model Size | MMLU | GSM8K | HumanEval | HellaSwag |
|------------|------|-------|-----------|-----------|
| 7B | 40-50% | 10-20% | 5-15% | 70-80% |
| 13B | 45-55% | 20-35% | 15-25% | 75-82% |
| 70B | 60-70% | 50-65% | 35-50% | 82-87% |
| GPT-4 | 86% | 92% | 67% | 95% |

### Red Flags

- **All tasks at random chance**: Model not trained or loaded properly
- **Exactly 0% on generation tasks**: Likely a prompt-format or answer-parsing issue
- **Huge variance across runs**: Check seed and sampling settings
- **Better than GPT-4 on everything**: Likely benchmark contamination

## Best Practices

1. **Always report the few-shot setting**: 0-shot, 5-shot, etc.
2. **Run multiple seeds**: Report mean ± std (see the sketch after this list)
3. **Check for data contamination**: Search the training data for benchmark examples
4. **Compare to published baselines**: Validate your setup
5. **Report all hyperparameters**: Model, batch size, max tokens, temperature
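
Scores on sampled generations can shift noticeably between seeds, so reporting the spread matters. A tiny sketch, assuming you have already collected the same metric from several runs (the numbers below are placeholders):

```python
from statistics import mean, stdev

# Hypothetical GSM8K accuracies from three runs with different seeds.
scores = [0.562, 0.571, 0.555]
print(f"gsm8k acc: {mean(scores):.3f} +/- {stdev(scores):.3f}")
```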

## References

- Task list: `lm_eval --tasks list`
- Task README: `lm_eval/tasks/README.md`
- Papers: See the individual benchmark papers