# API Evaluation

Guide to evaluating OpenAI, Anthropic, and other API-based language models.

## Overview

The lm-evaluation-harness supports evaluating API-based models through a unified `TemplateAPI` interface. This allows benchmarking of:

- OpenAI models (GPT-4, GPT-3.5, etc.)
- Anthropic models (Claude 3, Claude 2, etc.)
- Local OpenAI-compatible APIs
- Custom API endpoints

**Why evaluate API models**:

- Benchmark closed-source models
- Compare API models to open models
- Validate API performance
- Track model updates over time

## Supported API Models

| Provider | Model Type | Request Types | Logprobs |
|----------|------------|---------------|----------|
| OpenAI (completions) | `openai-completions` | All | ✅ Yes |
| OpenAI (chat) | `openai-chat-completions` | `generate_until` only | ❌ No |
| Anthropic (completions) | `anthropic-completions` | `generate_until` only | ❌ No |
| Anthropic (chat) | `anthropic-chat` | `generate_until` only | ❌ No |
| Local (OpenAI-compatible) | `local-completions` | Depends on server | Varies |

**Note**: Models without logprobs can only be evaluated on generation tasks, not perplexity or loglikelihood tasks.
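For intuition on why loglikelihood tasks need logprobs: multiple-choice scoring sums the token logprobs of each candidate continuation and picks the highest-scoring one. The sketch below is illustrative only (the harness's real request handling lives in `TemplateAPI`), and the per-token numbers are made up:

```python
def pick_answer(logprobs_per_choice):
    """Score each candidate continuation by its summed token logprobs
    and return the highest-scoring choice."""
    totals = {choice: sum(lps) for choice, lps in logprobs_per_choice.items()}
    return max(totals, key=totals.get)

# Hypothetical per-token logprobs returned by a completions-style API:
choices = {
    "A": [-0.2, -0.5],  # total -0.7 (more likely)
    "B": [-1.1, -2.3],  # total -3.4
}
print(pick_answer(choices))  # -> A
```

A chat API that returns only generated text cannot supply these per-token scores, which is why it is limited to `generate_until` tasks.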
## OpenAI Models

### Setup

```bash
export OPENAI_API_KEY=sk-...
```

### Completion Models (Legacy)

**Available models**: `davinci-002`, `babbage-002`

```bash
lm_eval --model openai-completions \
  --model_args model=davinci-002 \
  --tasks lambada_openai,hellaswag \
  --batch_size auto
```

**Supports**:

- `generate_until`: ✅
- `loglikelihood`: ✅
- `loglikelihood_rolling`: ✅

### Chat Models

**Available models**: `gpt-4`, `gpt-4-turbo`, `gpt-3.5-turbo`

```bash
lm_eval --model openai-chat-completions \
  --model_args model=gpt-4-turbo \
  --tasks mmlu,gsm8k,humaneval \
  --num_fewshot 5 \
  --batch_size auto
```

**Supports**:

- `generate_until`: ✅
- `loglikelihood`: ❌ (no logprobs)
- `loglikelihood_rolling`: ❌

**Important**: Chat models don't provide logprobs, so they can only be used with generation tasks (MMLU, GSM8K, HumanEval), not perplexity tasks.

### Configuration Options

```bash
lm_eval --model openai-chat-completions \
  --model_args \
model=gpt-4-turbo,\
base_url=https://api.openai.com/v1,\
num_concurrent=5,\
max_retries=3,\
timeout=60,\
batch_size=auto
```

**Parameters**:

- `model`: Model identifier (required)
- `base_url`: API endpoint (default: OpenAI)
- `num_concurrent`: Concurrent requests (default: 5)
- `max_retries`: Retry failed requests (default: 3)
- `timeout`: Request timeout in seconds (default: 60)
- `tokenizer`: Tokenizer to use (default: matches model)
- `tokenizer_backend`: `"tiktoken"` or `"huggingface"`

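The comma-separated `--model_args` string maps to keyword arguments roughly like the simplified parser below (a stand-in sketch, not the harness's actual implementation):

```python
def parse_model_args(arg_string):
    """Split 'k1=v1,k2=v2' into a kwargs dict, coercing common scalar types."""
    kwargs = {}
    for pair in arg_string.split(","):
        key, _, value = pair.partition("=")
        if value.isdigit():                       # e.g. num_concurrent=5
            value = int(value)
        elif value.lower() in ("true", "false"):  # e.g. verify_certificate=false
            value = value.lower() == "true"
        kwargs[key.strip()] = value
    return kwargs

print(parse_model_args("model=gpt-4-turbo,num_concurrent=5,timeout=60"))
# -> {'model': 'gpt-4-turbo', 'num_concurrent': 5, 'timeout': 60}
```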
### Cost Management

OpenAI charges per token. Estimate costs before running:

```python
# Rough estimate
num_samples = 1000
avg_tokens_per_sample = 500  # input + output
cost_per_1k_tokens = 0.01  # GPT-3.5 Turbo

total_cost = (num_samples * avg_tokens_per_sample / 1000) * cost_per_1k_tokens
print(f"Estimated cost: ${total_cost:.2f}")
```

**Cost-saving tips**:

- Use `--limit N` for testing
- Start with `gpt-3.5-turbo` before `gpt-4`
- Set `max_gen_toks` to the minimum needed
- Use `num_fewshot=0` for zero-shot when possible

## Anthropic Models

### Setup

```bash
export ANTHROPIC_API_KEY=sk-ant-...
```

### Completion Models (Legacy)

Since `anthropic-completions` does not return logprobs, use it with generation tasks:

```bash
lm_eval --model anthropic-completions \
  --model_args model=claude-2.1 \
  --tasks gsm8k \
  --batch_size auto
```

### Chat Models (Recommended)

**Available models**: `claude-3-5-sonnet-20241022`, `claude-3-opus-20240229`, `claude-3-sonnet-20240229`, `claude-3-haiku-20240307`

```bash
lm_eval --model anthropic-chat \
  --model_args model=claude-3-5-sonnet-20241022 \
  --tasks mmlu,gsm8k,humaneval \
  --num_fewshot 5 \
  --batch_size auto
```

**Aliases**: `anthropic-chat-completions` (same as `anthropic-chat`)

### Configuration Options

```bash
lm_eval --model anthropic-chat \
  --model_args \
model=claude-3-5-sonnet-20241022,\
base_url=https://api.anthropic.com,\
num_concurrent=5,\
max_retries=3,\
timeout=60
```

### Cost Management

Anthropic pricing (as of 2024):

- Claude 3.5 Sonnet: $3.00 / 1M input, $15.00 / 1M output
- Claude 3 Opus: $15.00 / 1M input, $75.00 / 1M output
- Claude 3 Haiku: $0.25 / 1M input, $1.25 / 1M output

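Because input and output tokens are priced separately, a cost estimate should track both. A back-of-envelope sketch (the token counts per sample are illustrative assumptions):

```python
def estimate_cost(n_samples, in_toks, out_toks, in_rate, out_rate):
    """Estimate total cost given per-million-token input/output rates."""
    input_cost = n_samples * in_toks / 1_000_000 * in_rate
    output_cost = n_samples * out_toks / 1_000_000 * out_rate
    return input_cost + output_cost

# 1,000 prompts on Claude 3.5 Sonnet ($3.00 / 1M input, $15.00 / 1M output):
cost = estimate_cost(1_000, in_toks=400, out_toks=100, in_rate=3.00, out_rate=15.00)
print(f"Estimated cost: ${cost:.2f}")  # -> Estimated cost: $2.70
```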
**Budget-friendly strategy**:

```bash
# Test on a small sample first
lm_eval --model anthropic-chat \
  --model_args model=claude-3-haiku-20240307 \
  --tasks mmlu \
  --limit 100

# Then run the full eval on the best model
lm_eval --model anthropic-chat \
  --model_args model=claude-3-5-sonnet-20241022 \
  --tasks mmlu \
  --num_fewshot 5
```

## Local OpenAI-Compatible APIs

Many local inference servers expose OpenAI-compatible APIs (vLLM, Text Generation Inference, llama.cpp, Ollama).

### vLLM Local Server

**Start server**:

```bash
vllm serve meta-llama/Llama-2-7b-hf \
  --host 0.0.0.0 \
  --port 8000
```

**Evaluate**:

```bash
lm_eval --model local-completions \
  --model_args \
model=meta-llama/Llama-2-7b-hf,\
base_url=http://localhost:8000/v1,\
num_concurrent=1 \
  --tasks mmlu,gsm8k \
  --batch_size auto
```

### Text Generation Inference (TGI)

**Start server**:

```bash
docker run --gpus all --shm-size 1g -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-2-7b-hf
```

**Evaluate**:

```bash
lm_eval --model local-completions \
  --model_args \
model=meta-llama/Llama-2-7b-hf,\
base_url=http://localhost:8080/v1 \
  --tasks hellaswag,arc_challenge
```

### Ollama

**Start server** (pull the model first, since `ollama serve` runs in the foreground):

```bash
ollama pull llama2:7b
ollama serve
```

**Evaluate**:

```bash
lm_eval --model local-completions \
  --model_args \
model=llama2:7b,\
base_url=http://localhost:11434/v1 \
  --tasks mmlu
```

### llama.cpp Server

**Start server**:

```bash
./server -m models/llama-2-7b.gguf --host 0.0.0.0 --port 8080
```

**Evaluate**:

```bash
lm_eval --model local-completions \
  --model_args \
model=llama2,\
base_url=http://localhost:8080/v1 \
  --tasks gsm8k
```

## Custom API Implementation

For custom API endpoints, subclass `TemplateAPI`:

### Create `my_api.py`

```python
from lm_eval.models.api_models import TemplateAPI


class MyCustomAPI(TemplateAPI):
    """Custom API model."""

    def __init__(self, base_url, api_key, **kwargs):
        super().__init__(base_url=base_url, **kwargs)
        self.api_key = api_key

    def _create_payload(self, messages, gen_kwargs):
        """Create the API request payload."""
        return {
            "messages": messages,
            "api_key": self.api_key,
            **gen_kwargs,
        }

    def parse_generations(self, response):
        """Parse a generation response."""
        return response.json()["choices"][0]["text"]

    def parse_logprobs(self, response):
        """Parse logprobs, if the API provides them."""
        # Return None if the API doesn't provide logprobs
        logprobs = response.json().get("logprobs")
        if logprobs:
            return logprobs["token_logprobs"]
        return None
```

### Register and Use

```python
from lm_eval import evaluator
from my_api import MyCustomAPI

model = MyCustomAPI(
    base_url="https://api.example.com/v1",
    api_key="your-key"
)

results = evaluator.simple_evaluate(
    model=model,
    tasks=["mmlu", "gsm8k"],
    num_fewshot=5,
    batch_size="auto"
)
```

## Comparing API and Open Models

### Side-by-Side Evaluation

```bash
# Evaluate OpenAI GPT-4
lm_eval --model openai-chat-completions \
  --model_args model=gpt-4-turbo \
  --tasks mmlu,gsm8k,hellaswag \
  --num_fewshot 5 \
  --output_path results/gpt4.json

# Evaluate open Llama 2 70B
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-70b-hf,dtype=bfloat16 \
  --tasks mmlu,gsm8k,hellaswag \
  --num_fewshot 5 \
  --output_path results/llama2-70b.json

# Compare results
python scripts/compare_results.py \
  results/gpt4.json \
  results/llama2-70b.json
```

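A minimal comparison can also be written against the output JSON directly. This sketch assumes the standard lm-eval result layout, where the top-level `"results"` key maps each task name to a metric dict keyed like `"acc,none"` (metric name plus filter):

```python
import json


def load_results(path):
    """Load the per-task results mapping from an lm-eval output file."""
    with open(path) as f:
        return json.load(f)["results"]


def compare(results_a, results_b, metric="acc,none"):
    """Return {task: (score_a, score_b)} for tasks with the metric in both runs."""
    shared = sorted(set(results_a) & set(results_b))
    return {
        task: (results_a[task][metric], results_b[task][metric])
        for task in shared
        if metric in results_a[task] and metric in results_b[task]
    }


# Hypothetical usage:
# diff = compare(load_results("results/gpt4.json"),
#                load_results("results/llama2-70b.json"))
```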
### Typical Comparisons

| Model | MMLU | GSM8K | HumanEval | Cost |
|-------|------|-------|-----------|------|
| GPT-4 Turbo | 86.4% | 92.0% | 67.0% | $$$$ |
| Claude 3 Opus | 86.8% | 95.0% | 84.9% | $$$$ |
| GPT-3.5 Turbo | 70.0% | 57.1% | 48.1% | $$ |
| Llama 2 70B | 68.9% | 56.8% | 29.9% | Free (self-host) |
| Mixtral 8x7B | 70.6% | 58.4% | 40.2% | Free (self-host) |

## Best Practices

### Rate Limiting

Respect API rate limits by lowering concurrency and extending the timeout:

```bash
lm_eval --model openai-chat-completions \
  --model_args \
model=gpt-4-turbo,\
num_concurrent=3,\
timeout=120 \
  --tasks mmlu
```

### Reproducibility

Set temperature to 0 for deterministic results:

```bash
lm_eval --model openai-chat-completions \
  --model_args model=gpt-4-turbo \
  --tasks mmlu \
  --gen_kwargs temperature=0.0
```

Or fix a `seed` when sampling (if the provider supports one):

```bash
lm_eval --model anthropic-chat \
  --model_args model=claude-3-5-sonnet-20241022 \
  --tasks gsm8k \
  --gen_kwargs temperature=0.7,seed=42
```

### Caching

API models can cache responses to avoid redundant (and billable) calls:

```bash
# First run: makes API calls
lm_eval --model openai-chat-completions \
  --model_args model=gpt-4-turbo \
  --tasks mmlu \
  --limit 100

# Second run: served from the cache, no new API calls
lm_eval --model openai-chat-completions \
  --model_args model=gpt-4-turbo \
  --tasks mmlu \
  --limit 100
```

Cache location: `~/.cache/lm_eval/`

### Error Handling

APIs can fail transiently. Use retries:

```bash
lm_eval --model openai-chat-completions \
  --model_args \
model=gpt-4-turbo,\
max_retries=5,\
timeout=120 \
  --tasks mmlu
```

## Troubleshooting

### "Authentication failed"

Check your API key:

```bash
echo $OPENAI_API_KEY     # Should print sk-...
echo $ANTHROPIC_API_KEY  # Should print sk-ant-...
```

### "Rate limit exceeded"

Reduce concurrency:

```bash
--model_args num_concurrent=1
```

Or add delays between requests.
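"Add delays" in practice usually means exponential backoff. A sketch of the retry behavior that `max_retries` configures (the function name and delays here are illustrative, not the harness API):

```python
import time


def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry a callable with exponentially growing delays between attempts."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```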
### "Timeout error"

Increase the timeout:

```bash
--model_args timeout=180
```

### "Model not found"

For local APIs, verify the server is running:

```bash
curl http://localhost:8000/v1/models
```

### Cost Runaway

Use `--limit` for testing:

```bash
lm_eval --model openai-chat-completions \
  --model_args model=gpt-4-turbo \
  --tasks mmlu \
  --limit 50  # Only 50 samples
```

## Advanced Features

### Custom Headers

```bash
lm_eval --model local-completions \
  --model_args \
base_url=http://api.example.com/v1,\
header="Authorization: Bearer token,X-Custom: value"
```

### Disable SSL Verification (Development Only)

```bash
lm_eval --model local-completions \
  --model_args \
base_url=https://localhost:8000/v1,\
verify_certificate=false
```

### Custom Tokenizer

```bash
lm_eval --model openai-chat-completions \
  --model_args \
model=gpt-4-turbo,\
tokenizer=gpt2,\
tokenizer_backend=huggingface
```

## References

- OpenAI API: https://platform.openai.com/docs/api-reference
- Anthropic API: https://docs.anthropic.com/claude/reference
- `TemplateAPI`: `lm_eval/models/api_models.py`
- OpenAI models: `lm_eval/models/openai_completions.py`
- Anthropic models: `lm_eval/models/anthropic_llms.py`