# Custom Tasks

Complete guide to creating domain-specific evaluation tasks in lm-evaluation-harness.

## Overview

Custom tasks allow you to evaluate models on your own datasets and metrics. Tasks are defined using YAML configuration files, with optional Python utilities for complex logic.

**Why create custom tasks**:

- Evaluate on proprietary/domain-specific data
- Test specific capabilities not covered by existing benchmarks
- Create evaluation pipelines for internal models
- Reproduce research experiments

## Quick Start

### Minimal Custom Task

Create `my_tasks/simple_qa.yaml`:

```yaml
task: simple_qa
dataset_path: json
dataset_kwargs:
  data_files:
    test: data/simple_qa.jsonl
test_split: test
output_type: generate_until
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{answer}}"
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
```

**Run it**:

```bash
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks simple_qa \
  --include_path my_tasks/
```
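
`data/simple_qa.jsonl` is plain JSON Lines: one JSON object per line, carrying whatever fields the templates reference (here `question` and `answer`). A minimal sketch for producing it; the two records are illustrative:

```python
import json

records = [
    {"question": "What is 2+2?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
]

with open("data/simple_qa.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")  # one JSON object per line
```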

## Task Configuration Reference

### Essential Fields

```yaml
# Task identification
task: my_custom_task     # Unique task name (required)
task_alias: "My Task"    # Display name
tag:                     # Tags for grouping
  - custom
  - domain_specific

# Dataset configuration
dataset_path: data/my_data.jsonl  # HuggingFace dataset name or local data path (see Data Formats)
dataset_name: default             # Subset name (if applicable)
training_split: train
validation_split: validation
test_split: test

# Evaluation configuration
output_type: generate_until  # or loglikelihood, loglikelihood_rolling, multiple_choice
num_fewshot: 5               # Number of few-shot examples
# Note: batch size is a CLI flag (--batch_size), not a task-config field

# Prompt templates (Jinja2)
doc_to_text: "Question: {{question}}"
doc_to_target: "{{answer}}"

# Metrics
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true

# Metadata
metadata:
  version: 1.0
```

### Output Types

**`generate_until`**: Free-form generation

```yaml
output_type: generate_until
generation_kwargs:
  max_gen_toks: 256
  until:
    - "\n"
    - "."
  temperature: 0.0
```

**`loglikelihood`**: Compute log probability of targets

```yaml
output_type: loglikelihood
# Used for classification and other fixed-target scoring
```

**`multiple_choice`**: Choose from options

```yaml
output_type: multiple_choice
doc_to_choice: "{{choices}}"  # List of choices
```

## Data Formats

### Local JSONL File

`data/my_data.jsonl`:

```json
{"question": "What is 2+2?", "answer": "4"}
{"question": "Capital of France?", "answer": "Paris"}
```

**Task config** (local files go through the HuggingFace `json` loader):

```yaml
dataset_path: json
dataset_kwargs:
  data_files:
    test: data/my_data.jsonl
test_split: test
```

### HuggingFace Dataset

```yaml
dataset_path: squad
dataset_name: plain_text
test_split: validation
```

### CSV File

`data/my_data.csv`:

```csv
question,answer,category
What is 2+2?,4,math
Capital of France?,Paris,geography
```

**Task config** (CSV files use the `csv` loader):

```yaml
dataset_path: csv
dataset_kwargs:
  data_files:
    test: data/my_data.csv
test_split: test
```

## Prompt Engineering

### Simple Template

```yaml
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{answer}}"
```

### Conditional Logic

```yaml
doc_to_text: |
  {% if context %}
  Context: {{context}}
  {% endif %}
  Question: {{question}}
  Answer:
```

### Multiple Choice

```yaml
doc_to_text: |
  Question: {{question}}
  A. {{choices[0]}}
  B. {{choices[1]}}
  C. {{choices[2]}}
  D. {{choices[3]}}
  Answer:

doc_to_target: "{{ 'ABCD'[answer_idx] }}"
doc_to_choice: ["A", "B", "C", "D"]
```

### Few-Shot Formatting

```yaml
fewshot_delimiter: "\n\n"   # Between examples
target_delimiter: " "       # Between question and answer
doc_to_text: "Q: {{question}}"
doc_to_target: "A: {{answer}}"
```
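
Templates are standard Jinja2, so you can sanity-check them outside the harness before running an evaluation. A minimal sketch (assumes the `jinja2` package is installed; the sample doc is illustrative):

```python
from jinja2 import Template

doc = {"question": "Capital of France?", "answer": "Paris"}

doc_to_text = Template("Q: {{question}}")
doc_to_target = Template("A: {{answer}}")

# Approximates how the harness assembles one few-shot example:
# doc_to_text + target_delimiter + doc_to_target
target_delimiter = " "
example = doc_to_text.render(**doc) + target_delimiter + doc_to_target.render(**doc)
print(example)  # Q: Capital of France? A: Paris
```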

## Custom Python Functions

For complex logic, use Python functions in `utils.py`.

### Create `my_tasks/utils.py`

```python
def process_docs(dataset):
    """Preprocess documents."""
    def _process(doc):
        # Custom preprocessing
        doc["question"] = doc["question"].strip().lower()
        return doc

    return dataset.map(_process)


def doc_to_text(doc):
    """Custom prompt formatting."""
    context = doc.get("context", "")
    question = doc["question"]

    if context:
        return f"Context: {context}\nQuestion: {question}\nAnswer:"
    return f"Question: {question}\nAnswer:"


def doc_to_target(doc):
    """Custom target extraction."""
    return doc["answer"].strip().lower()


def aggregate_scores(items):
    """Custom metric aggregation."""
    correct = sum(1 for item in items if item == 1.0)
    total = len(items)
    return correct / total if total > 0 else 0.0
```

### Use in Task Config

```yaml
task: my_custom_task
dataset_path: json
dataset_kwargs:
  data_files:
    test: data/my_data.jsonl
test_split: test

# Use Python functions
process_docs: !function utils.process_docs
doc_to_text: !function utils.doc_to_text
doc_to_target: !function utils.doc_to_target

metric_list:
  - metric: exact_match
    aggregation: !function utils.aggregate_scores
    higher_is_better: true
```
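
These helpers are plain Python, so it is worth exercising them directly (e.g. from inside `my_tasks/`) before wiring them into the harness. A quick check like the following catches key errors and formatting bugs early; the sample doc is illustrative:

```python
import utils

doc = {"question": "  Capital of France?  ", "answer": " Paris ", "context": ""}

# Mimic what process_docs does to a single record
doc["question"] = doc["question"].strip().lower()

print(utils.doc_to_text(doc))    # -> "Question: capital of france?\nAnswer:"
print(utils.doc_to_target(doc))  # -> "paris"
```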

## Real-World Examples

### Example 1: Domain QA Task

**Goal**: Evaluate medical question answering.

`medical_qa/medical_qa.yaml`:

```yaml
task: medical_qa
dataset_path: json
dataset_kwargs:
  data_files:
    test: data/medical_qa.jsonl
test_split: test
output_type: generate_until
num_fewshot: 3

doc_to_text: |
  Medical Question: {{question}}
  Context: {{context}}
  Answer (be concise):

doc_to_target: "{{answer}}"

generation_kwargs:
  max_gen_toks: 100
  until:
    - "\n\n"
  temperature: 0.0

metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
  - metric: !function utils.medical_f1
    aggregation: mean
    higher_is_better: true

filter_list:
  - name: lowercase
    filter:
      - function: lowercase
      - function: remove_whitespace

metadata:
  version: 1.0
  domain: medical
```

`medical_qa/utils.py`:

```python
import re


def medical_f1(predictions, references):
    """Custom F1 over extracted medical terms."""
    pred_terms = set(extract_medical_terms(predictions[0]))
    ref_terms = set(extract_medical_terms(references[0]))

    if not pred_terms and not ref_terms:
        return 1.0
    if not pred_terms or not ref_terms:
        return 0.0

    tp = len(pred_terms & ref_terms)
    fp = len(pred_terms - ref_terms)
    fn = len(ref_terms - pred_terms)

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0

    return 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0


def extract_medical_terms(text):
    """Extract medical terminology (capitalized terms as a simple proxy)."""
    return re.findall(r'\b[A-Z][a-z]+(?:[A-Z][a-z]+)*\b', text)
```
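
A couple of known-answer checks on `medical_f1` help catch regressions (run alongside `utils.py`; the strings are illustrative):

```python
from utils import medical_f1

# Identical extracted term sets -> perfect F1
assert medical_f1(["Aspirin treats Migraine"], ["Aspirin treats Migraine"]) == 1.0

# Disjoint term sets -> zero
assert medical_f1(["Ibuprofen"], ["Paracetamol"]) == 0.0
```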

### Example 2: Code Evaluation

`code_eval/python_challenges.yaml`:

```yaml
task: python_challenges
dataset_path: json
dataset_kwargs:
  data_files:
    test: data/python_problems.jsonl
test_split: test
output_type: generate_until
num_fewshot: 0

doc_to_text: |
  Write a Python function to solve:
  {{problem_statement}}

  Function signature:
  {{function_signature}}

doc_to_target: "{{canonical_solution}}"

generation_kwargs:
  max_gen_toks: 512
  until:
    - "\n\nclass"
    - "\n\ndef"
  temperature: 0.2

metric_list:
  - metric: !function utils.execute_code
    aggregation: mean
    higher_is_better: true

process_results: !function utils.process_code_results

metadata:
  version: 1.0
```

`code_eval/utils.py`:

```python
import json
import subprocess
import sys


def execute_code(predictions, references):
    """Execute generated code against test cases."""
    generated_code = predictions[0]
    test_cases = json.loads(references[0])

    try:
        # Any failing test case (or any execution error) fails the sample
        for test_input, expected_output in test_cases:
            result = execute_with_timeout(generated_code, test_input, timeout=5)
            if result != expected_output:
                return 0.0
        return 1.0
    except Exception:
        return 0.0


def execute_with_timeout(code, input_data, timeout=5):
    """Safely execute code with a timeout.

    Minimal sketch: runs the code in a fresh interpreter, feeding
    `input_data` on stdin and reading the result from stdout.
    Real deployments should add proper sandboxing.
    """
    proc = subprocess.run(
        [sys.executable, "-c", code],
        input=str(input_data),
        capture_output=True,
        text=True,
        timeout=timeout,  # raises subprocess.TimeoutExpired, caught by the caller
    )
    return proc.stdout.strip()


def process_code_results(doc, results):
    """Turn the raw generation into per-sample metric values.

    Assumes each doc carries a JSON-encoded `test_cases` field of
    [input, expected_output] pairs.
    """
    generated_code = results[0]
    return {
        "execute_code": execute_code([generated_code], [doc["test_cases"]]),
    }
```

### Example 3: Instruction Following

`instruction_eval/instruction_eval.yaml`:

```yaml
task: instruction_following
dataset_path: json
dataset_kwargs:
  data_files:
    test: data/instructions.jsonl
test_split: test
output_type: generate_until
num_fewshot: 0

doc_to_text: |
  Instruction: {{instruction}}
  {% if constraints %}
  Constraints: {{constraints}}
  {% endif %}
  Response:

doc_to_target: "{{expected_response}}"

generation_kwargs:
  max_gen_toks: 256
  temperature: 0.7

metric_list:
  - metric: !function utils.check_constraints
    aggregation: mean
    higher_is_better: true
  - metric: !function utils.semantic_similarity
    aggregation: mean
    higher_is_better: true

process_docs: !function utils.add_constraint_checkers
```

`instruction_eval/utils.py`:

```python
import json

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')


def check_constraints(predictions, references):
    """Check whether the response satisfies its constraints."""
    response = predictions[0]
    constraints = json.loads(references[0])

    satisfied = 0
    total = len(constraints)

    for constraint in constraints:
        if verify_constraint(response, constraint):
            satisfied += 1

    return satisfied / total if total > 0 else 1.0


def verify_constraint(response, constraint):
    """Verify a single constraint."""
    if constraint["type"] == "length":
        return len(response.split()) >= constraint["min_words"]
    elif constraint["type"] == "contains":
        return constraint["keyword"] in response.lower()
    # Add more constraint types as needed
    return True


def semantic_similarity(predictions, references):
    """Compute semantic similarity between prediction and reference."""
    pred_embedding = model.encode(predictions[0])
    ref_embedding = model.encode(references[0])
    return float(util.cos_sim(pred_embedding, ref_embedding))


def parse_constraints(raw):
    """Parse a raw constraint string into structured constraints.

    Minimal sketch: assumes the field holds a JSON list such as
    '[{"type": "length", "min_words": 10}]'; empty input yields [].
    """
    return json.loads(raw) if raw else []


def add_constraint_checkers(dataset):
    """Parse constraints into a verifiable format."""
    def _parse(doc):
        doc["parsed_constraints"] = parse_constraints(doc.get("constraints", ""))
        return doc

    return dataset.map(_parse)
```

## Advanced Features

### Output Filtering

```yaml
filter_list:
  - name: extract_answer
    filter:
      - function: regex
        regex_pattern: "Answer: (.*)"
        group: 1
      - function: lowercase
      - function: remove_whitespace
```
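
Conceptually, each filter runs in sequence over the raw generation before metrics are computed. A plain-Python approximation of the chain above (not the harness's actual filter classes):

```python
import re

def apply_filter_chain(raw_generation):
    # regex: pull out the capture group after "Answer: "
    match = re.search(r"Answer: (.*)", raw_generation)
    text = match.group(1) if match else raw_generation
    # lowercase
    text = text.lower()
    # remove_whitespace: trim surrounding whitespace
    return text.strip()

print(apply_filter_chain("Let me think...\nAnswer: PARIS "))  # -> "paris"
```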

### Multiple Metrics

```yaml
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
  - metric: f1
    aggregation: mean
    higher_is_better: true
  - metric: bleu
    aggregation: mean
    higher_is_better: true
```

### Task Groups

Create `my_tasks/_default.yaml`:

```yaml
group: my_eval_suite
task:
  - simple_qa
  - medical_qa
  - python_challenges
```

**Run the entire suite**:

```bash
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks my_eval_suite \
  --include_path my_tasks/
```

## Testing Your Task

### Validate Configuration

```bash
# Test task loading
lm_eval --tasks my_custom_task --include_path my_tasks/ --limit 0

# Run on 5 samples
lm_eval --model hf \
  --model_args pretrained=gpt2 \
  --tasks my_custom_task \
  --include_path my_tasks/ \
  --limit 5
```
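
The same smoke test can be driven from Python, which is convenient in CI. A sketch using the harness's programmatic API (interface as of v0.4.x; names may shift between versions):

```python
import lm_eval
from lm_eval.tasks import TaskManager

# Point the task manager at the directory holding your task YAML files
task_manager = TaskManager(include_path="my_tasks/")

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",
    tasks=["my_custom_task"],
    task_manager=task_manager,
    limit=5,  # small sample for a quick sanity run
)
print(results["results"]["my_custom_task"])
```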

### Debug Mode

```bash
lm_eval --model hf \
  --model_args pretrained=gpt2 \
  --tasks my_custom_task \
  --include_path my_tasks/ \
  --limit 1 \
  --output_path results/ \
  --log_samples  # Save per-sample inputs/outputs under --output_path
```

## Best Practices

1. **Start simple**: Test with a minimal config first
2. **Version your tasks**: Use `metadata.version`
3. **Document your metrics**: Explain custom metrics in comments
4. **Test with multiple models**: Ensure robustness
5. **Validate on known examples**: Include sanity checks
6. **Use filters carefully**: They can hide errors
7. **Handle edge cases**: Empty strings, missing fields

## Common Patterns

### Classification Task

```yaml
output_type: loglikelihood
doc_to_text: "Text: {{text}}\nLabel:"
doc_to_target: " {{label}}"  # Space prefix important!
metric_list:
  - metric: acc
    aggregation: mean
```
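
The leading space matters because most BPE tokenizers encode `" positive"` and `"positive"` as different token sequences, and loglikelihood is scored over the exact target tokens. A quick way to see the effect (`gpt2` is used purely for illustration):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

# The two encodings differ, so " positive" and "positive" are different targets
print(tok.encode("positive"))
print(tok.encode(" positive"))
```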

### Perplexity Evaluation

```yaml
output_type: loglikelihood_rolling
doc_to_text: ""
doc_to_target: "{{text}}"
metric_list:
  - metric: perplexity
    aggregation: perplexity
```
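
Perplexity here is the exponentiated average negative log-likelihood over the scored tokens. A minimal reference computation (the log-likelihood values are illustrative):

```python
import math

# Per-token log-likelihoods as the harness would score them
token_logliks = [-2.1, -0.3, -1.7, -0.9]

perplexity = math.exp(-sum(token_logliks) / len(token_logliks))
print(perplexity)  # ~3.49
```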

### Ranking Task

```yaml
output_type: multiple_choice
doc_to_text: "Query: {{query}}\nPassage: {{passage}}\nRelevant:"
doc_to_choice: [" Yes", " No"]
doc_to_target: "{{label}}"  # index of the gold choice
metric_list:
  - metric: acc
    aggregation: mean
```

## Troubleshooting

**"Task not found"**: Check `--include_path` and the task name.

**Empty results**: Verify the `doc_to_text` and `doc_to_target` templates.

**Metric errors**: Ensure metric names are spelled correctly (`exact_match`, not `exact-match`).

**Filter issues**: Inspect filter output with `--log_samples`.

**Python function not found**: Check the `!function module.function_name` syntax.

## References

- Task system: EleutherAI/lm-evaluation-harness docs
- Example tasks: `lm_eval/tasks/` directory
- TaskConfig: `lm_eval/api/task.py`