Custom Tasks

Complete guide to creating domain-specific evaluation tasks in lm-evaluation-harness.

Overview

Custom tasks allow you to evaluate models on your own datasets and metrics. Tasks are defined using YAML configuration files with optional Python utilities for complex logic.

Why create custom tasks:

  • Evaluate on proprietary/domain-specific data
  • Test specific capabilities not covered by existing benchmarks
  • Create evaluation pipelines for internal models
  • Reproduce research experiments

Quick Start

Minimal Custom Task

Create my_tasks/simple_qa.yaml:

task: simple_qa
dataset_path: json
dataset_kwargs:
  data_files:
    test: data/simple_qa.jsonl
test_split: test
output_type: generate_until
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{answer}}"
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true

Run it:

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks simple_qa \
  --include_path my_tasks/
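
The task above reads data/simple_qa.jsonl, which has to exist before the run. A minimal sketch for producing it, assuming the two-field question/answer schema used in this guide (the helper name write_jsonl is illustrative and not part of lm-evaluation-harness):

```python
import json
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line, creating parent directories as needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


if __name__ == "__main__":
    examples = [
        {"question": "What is 2+2?", "answer": "4"},
        {"question": "Capital of France?", "answer": "Paris"},
    ]
    write_jsonl(examples, "data/simple_qa.jsonl")
```

Each line becomes one evaluation document, and its keys are the variables available in the doc_to_text and doc_to_target templates.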

Task Configuration Reference

Essential Fields

# Task identification
task: my_custom_task           # Unique task name (required)
task_alias: "My Task"          # Display name
tag:                           # Tags for grouping
  - custom
  - domain_specific

# Dataset configuration
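dataset_path: json                # Hugging Face dataset name, or a loader (json, csv) for local files
dataset_name: null                # Subset name (if applicable)
dataset_kwargs:                   # Forwarded to datasets.load_dataset
  data_files:
    train: data/train.jsonl
    validation: data/validation.jsonl
    test: data/test.jsonl
training_split: train
validation_split: validation
test_split: test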

# Evaluation configuration
output_type: generate_until    # or loglikelihood, loglikelihood_rolling, multiple_choice
num_fewshot: 5                 # Number of few-shot examples
# Batch size is not a task field; set it on the command line with --batch_size

# Prompt templates (Jinja2)
doc_to_text: "Question: {{question}}"
doc_to_target: "{{answer}}"

# Metrics
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true

# Metadata
metadata:
  version: 1.0

Output Types

generate_until: Free-form generation

output_type: generate_until
generation_kwargs:
  max_gen_toks: 256
  until:
    - "\n"
    - "."
  temperature: 0.0

loglikelihood: Compute log probability of targets

output_type: loglikelihood
# Used for perplexity, classification

multiple_choice: Choose from options

output_type: multiple_choice
doc_to_choice: "{{choices}}"  # List of choices
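
For multiple_choice, the harness scores each choice by the log-likelihood of that choice given the prompt and picks the best one. The sketch below illustrates the idea for a single document; it is a simplified approximation rather than the harness's own code, and the function name score_multiple_choice is hypothetical. It assumes acc uses raw log-likelihoods and acc_norm divides by the choice length in characters.

```python
def score_multiple_choice(loglikelihoods, choices, gold_index):
    """Return acc and acc_norm for one document.

    loglikelihoods[i] is the model's log-probability of choices[i]
    given the prompt. Choices are assumed to be non-empty strings.
    """
    indices = range(len(choices))
    # acc: the choice with the highest raw log-likelihood wins
    pred = max(indices, key=lambda i: loglikelihoods[i])
    # acc_norm: divide by character length so long answers are not penalized
    pred_norm = max(indices, key=lambda i: loglikelihoods[i] / len(choices[i]))
    return {
        "acc": float(pred == gold_index),
        "acc_norm": float(pred_norm == gold_index),
    }


if __name__ == "__main__":
    scores = score_multiple_choice(
        loglikelihoods=[-4.0, -6.0],
        choices=["Paris", "The city of Lyon"],
        gold_index=1,
    )
    print(scores)
```

In this example the longer correct answer loses on raw log-likelihood but wins after length normalization, which is why reporting both metrics can be informative.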

Data Formats

Local JSONL File

data/my_data.jsonl:

{"question": "What is 2+2?", "answer": "4"}
{"question": "Capital of France?", "answer": "Paris"}

Task config:

dataset_path: json
dataset_kwargs:
  data_files:
    test: data/my_data.jsonl
test_split: test

HuggingFace Dataset

dataset_path: squad
dataset_name: plain_text
test_split: validation

CSV File

data/my_data.csv:

question,answer,category
What is 2+2?,4,math
Capital of France?,Paris,geography

Task config:

dataset_path: csv
dataset_kwargs:
  data_files:
    test: data/my_data.csv
test_split: test
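
If you would rather keep all evaluation data in one format, a CSV file can be converted to JSONL with the standard library. This is a minimal sketch; the helper name csv_to_jsonl is illustrative, and every value is kept as a string, exactly as csv.DictReader returns it.

```python
import csv
import json


def csv_to_jsonl(csv_path, jsonl_path):
    """Convert a CSV file with a header row into JSONL; return the row count."""
    count = 0
    with open(csv_path, newline="", encoding="utf-8") as src, \
            open(jsonl_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            # Each CSV row becomes one JSON object keyed by the header names
            dst.write(json.dumps(row, ensure_ascii=False) + "\n")
            count += 1
    return count


if __name__ == "__main__":
    n = csv_to_jsonl("data/my_data.csv", "data/my_data.jsonl")
    print(f"Converted {n} rows")
```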

Prompt Engineering

Simple Template

doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{answer}}"

Conditional Logic

doc_to_text: |
  {% if context %}
  Context: {{context}}
  {% endif %}
  Question: {{question}}
  Answer:

Multiple Choice

doc_to_text: |
  Question: {{question}}
  A. {{choices[0]}}
  B. {{choices[1]}}
  C. {{choices[2]}}
  D. {{choices[3]}}
  Answer:

doc_to_target: "{{ 'ABCD'[answer_idx] }}"
doc_to_choice: ["A", "B", "C", "D"]

Few-Shot Formatting

fewshot_delimiter: "\n\n"        # Between examples
target_delimiter: " "            # Between question and answer
doc_to_text: "Q: {{question}}"
doc_to_target: "A: {{answer}}"
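
To see how these settings combine, the sketch below assembles a few-shot prompt by hand. It approximates the harness's behavior under the assumption that each example is rendered as text, delimiter, target, and that the final question is left open for the model to complete; build_fewshot_prompt is a hypothetical helper, not a harness function.

```python
def build_fewshot_prompt(fewshot_docs, query_doc,
                         fewshot_delimiter="\n\n", target_delimiter=" "):
    """Join solved examples and the open question into one prompt string."""

    def doc_to_text(doc):
        return f"Q: {doc['question']}"

    def doc_to_target(doc):
        return f"A: {doc['answer']}"

    # Each solved example is question + target_delimiter + answer
    shots = [doc_to_text(d) + target_delimiter + doc_to_target(d)
             for d in fewshot_docs]
    # The query comes last, without its answer
    return fewshot_delimiter.join(shots + [doc_to_text(query_doc)])


if __name__ == "__main__":
    prompt = build_fewshot_prompt(
        [{"question": "2+2?", "answer": "4"}],
        {"question": "3+3?"},
    )
    print(prompt)
```

Printing the prompt for one or two documents is a quick way to catch doubled spaces or missing newlines before running a full evaluation.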

Custom Python Functions

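For complex logic, define Python functions in a utils.py file placed in the same directory as the task YAML. The !function tag resolves module names relative to that directory.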

Create my_tasks/utils.py

def process_docs(dataset):
    """Preprocess documents."""
    def _process(doc):
        # Custom preprocessing
        doc["question"] = doc["question"].strip().lower()
        return doc

    return dataset.map(_process)

def doc_to_text(doc):
    """Custom prompt formatting."""
    context = doc.get("context", "")
    question = doc["question"]

    if context:
        return f"Context: {context}\nQuestion: {question}\nAnswer:"
    return f"Question: {question}\nAnswer:"

def doc_to_target(doc):
    """Custom target extraction."""
    return doc["answer"].strip().lower()

def aggregate_scores(items):
    """Custom metric aggregation."""
    correct = sum(1 for item in items if item == 1.0)
    total = len(items)
    return correct / total if total > 0 else 0.0

Use in Task Config

task: my_custom_task
dataset_path: json
dataset_kwargs:
  data_files:
    test: data/my_data.jsonl
test_split: test

# Use Python functions
process_docs: !function utils.process_docs
doc_to_text: !function utils.doc_to_text
doc_to_target: !function utils.doc_to_target

metric_list:
  - metric: exact_match
    aggregation: !function utils.aggregate_scores
    higher_is_better: true

Real-World Examples

Example 1: Domain QA Task

Goal: Evaluate medical question answering.

medical_qa/medical_qa.yaml:

task: medical_qa
dataset_path: json
dataset_kwargs:
  data_files:
    test: data/medical_qa.jsonl
test_split: test
output_type: generate_until
num_fewshot: 3

doc_to_text: |
  Medical Question: {{question}}
  Context: {{context}}
  Answer (be concise):

doc_to_target: "{{answer}}"

generation_kwargs:
  max_gen_toks: 100
  until:
    - "\n\n"
  temperature: 0.0

metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
  - metric: !function utils.medical_f1
    aggregation: mean
    higher_is_better: true

filter_list:
  - name: lowercase
    filter:
      - function: lowercase
      - function: remove_whitespace
      - function: take_first     # Reduce the response list to a single string

metadata:
  version: 1.0
  domain: medical

medical_qa/utils.py:

import re

def medical_f1(predictions, references):
    """Custom F1 for medical terms."""
    pred_terms = set(extract_medical_terms(predictions[0]))
    ref_terms = set(extract_medical_terms(references[0]))

    if not pred_terms and not ref_terms:
        return 1.0
    if not pred_terms or not ref_terms:
        return 0.0

    tp = len(pred_terms & ref_terms)
    fp = len(pred_terms - ref_terms)
    fn = len(ref_terms - pred_terms)

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0

    return 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

def extract_medical_terms(text):
    """Extract medical terminology."""
    # Custom logic
    return re.findall(r'\b[A-Z][a-z]+(?:[A-Z][a-z]+)*\b', text)

Example 2: Code Evaluation

code_eval/python_challenges.yaml:

task: python_challenges
dataset_path: json
dataset_kwargs:
  data_files:
    test: data/python_problems.jsonl
test_split: test
output_type: generate_until
num_fewshot: 0

doc_to_text: |
  Write a Python function to solve:
  {{problem_statement}}

  Function signature:
  {{function_signature}}

doc_to_target: "{{canonical_solution}}"

generation_kwargs:
  max_gen_toks: 512
  until:
    - "\n\nclass"
    - "\n\ndef"
  do_sample: true                # Required for temperature to take effect
  temperature: 0.2

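metric_list:
  - metric: passed               # Key returned by process_code_results
    aggregation: mean
    higher_is_better: true

# process_results computes the metric values itself, so the metric above is a plain name
process_results: !function utils.process_code_results

metadata:
  version: 1.0

code_eval/utils.py:

import json
import subprocess
import sys

def execute_with_timeout(code, test_input, timeout=5):
    """Run generated code plus one test expression in a separate Python process.

    Assumption: test_input is a Python expression that calls the generated
    function, e.g. "add(1, 2)". A subprocess is not a sandbox; run untrusted
    code in an isolated environment.
    """
    script = f"{code}\nprint({test_input})"
    completed = subprocess.run(
        [sys.executable, "-c", script],
        capture_output=True,
        text=True,
        timeout=timeout,  # Raises subprocess.TimeoutExpired when exceeded
    )
    return completed.stdout.strip()

def execute_code(generated_code, test_cases):
    """Return 1.0 if the code passes every (input, expected_output) pair, else 0.0."""
    try:
        for test_input, expected_output in test_cases:
            result = execute_with_timeout(generated_code, test_input, timeout=5)
            if result != str(expected_output).strip():
                return 0.0
        return 1.0
    except Exception:
        # Timeouts and malformed test cases count as failures
        return 0.0

def process_code_results(doc, results):
    """Score one document; results holds the model's generations for it."""
    # Assumption: each record has a "test_cases" field containing a JSON list
    # of [input, expected_output] pairs.
    test_cases = json.loads(doc["test_cases"])
    return {"passed": execute_code(results[0], test_cases)}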

Example 3: Instruction Following

instruction_eval/instruction_eval.yaml:

task: instruction_following
dataset_path: json
dataset_kwargs:
  data_files:
    test: data/instructions.jsonl
test_split: test
output_type: generate_until
num_fewshot: 0

doc_to_text: |
  Instruction: {{instruction}}
  {% if constraints %}
  Constraints: {{constraints}}
  {% endif %}
  Response:

doc_to_target: "{{expected_response}}"

generation_kwargs:
  max_gen_toks: 256
  do_sample: true                # Required for temperature to take effect
  temperature: 0.7

metric_list:
  - metric: constraint_score     # Keys returned by utils.process_results
    aggregation: mean
    higher_is_better: true
  - metric: semantic_similarity
    aggregation: mean
    higher_is_better: true

process_docs: !function utils.add_constraint_checkers
process_results: !function utils.process_results

instruction_eval/utils.py:

import json

from sentence_transformers import SentenceTransformer, util

# Loaded once at import time and reused for every document
model = SentenceTransformer('all-MiniLM-L6-v2')

def parse_constraints(raw):
    """Decode constraints, assumed here to be a JSON-encoded list of dicts."""
    return json.loads(raw) if raw else []

def add_constraint_checkers(dataset):
    """Parse constraints into verifiable format."""
    def _parse(doc):
        doc["parsed_constraints"] = parse_constraints(doc.get("constraints", ""))
        return doc
    return dataset.map(_parse)

def verify_constraint(response, constraint):
    """Verify single constraint."""
    if constraint["type"] == "length":
        return len(response.split()) >= constraint["min_words"]
    elif constraint["type"] == "contains":
        # Compare case-insensitively on both sides
        return constraint["keyword"].lower() in response.lower()
    # Add more constraint types; unknown types pass by default
    return True

def check_constraints(response, constraints):
    """Return the fraction of constraints the response satisfies."""
    if not constraints:
        return 1.0
    satisfied = sum(1 for c in constraints if verify_constraint(response, c))
    return satisfied / len(constraints)

def semantic_similarity(response, reference):
    """Cosine similarity between sentence embeddings."""
    pred_embedding = model.encode(response)
    ref_embedding = model.encode(reference)
    return float(util.cos_sim(pred_embedding, ref_embedding))

def process_results(doc, results):
    """Score one document; results holds the model's generations for it."""
    response = results[0]
    return {
        "constraint_score": check_constraints(response, doc["parsed_constraints"]),
        "semantic_similarity": semantic_similarity(response, doc["expected_response"]),
    }

Advanced Features

Output Filtering

filter_list:
  - name: extract_answer
    filter:
      - function: regex
        regex_pattern: "Answer: (.*)"
        group_select: 0          # Use the first match
      - function: lowercase
      - function: remove_whitespace
      - function: take_first     # Reduce the response list to a single string
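
Before relying on a filter chain, it can help to reproduce it on a few raw responses. The sketch below imitates an extract, lowercase, trim pipeline with the standard library; it is an approximation for experimenting with patterns, not the harness's filter implementation, and the "[invalid]" fallback for non-matching responses is an assumption.

```python
import re


def apply_filters(response, pattern=r"Answer: (.*)", fallback="[invalid]"):
    """Extract the first regex capture, then lowercase and trim whitespace."""
    match = re.search(pattern, response)
    # A response that does not match the pattern gets the fallback marker
    extracted = match.group(1) if match else fallback
    return extracted.lower().strip()


if __name__ == "__main__":
    for raw in ["Let me think. Answer: Paris ", "I am not sure."]:
        print(repr(raw), "->", repr(apply_filters(raw)))
```

Responses that fall through to the fallback are worth counting separately, since a high fallback rate usually points at a prompt or pattern problem rather than a weak model.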

Multiple Metrics

metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
  - metric: bleu               # Corpus-level metric with its own aggregation
    aggregation: bleu
    higher_is_better: true
  - metric: chrf               # Corpus-level metric with its own aggregation
    aggregation: chrf
    higher_is_better: true

Task Groups

Create my_tasks/_default.yaml:

group: my_eval_suite
task:
  - simple_qa
  - medical_qa
  - python_challenges

Run entire suite:

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks my_eval_suite \
  --include_path my_tasks/

Testing Your Task

Validate Configuration

# Test task loading (the custom task should appear in the printed list)
lm_eval --tasks list --include_path my_tasks/

# Run on 5 samples
lm_eval --model hf \
  --model_args pretrained=gpt2 \
  --tasks my_custom_task \
  --include_path my_tasks/ \
  --limit 5

Debug Mode

lm_eval --model hf \
  --model_args pretrained=gpt2 \
  --tasks my_custom_task \
  --include_path my_tasks/ \
  --limit 1 \
  --output_path results/ \
  --log_samples  # Save input/output samples (requires --output_path)

Best Practices

  1. Start simple: Test with minimal config first
  2. Version your tasks: Use metadata.version
  3. Document your metrics: Explain custom metrics in comments
  4. Test with multiple models: Ensure robustness
  5. Validate on known examples: Include sanity checks
  6. Use filters carefully: They can mask formatting errors in the raw model output
  7. Handle edge cases: Empty strings, missing fields
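
For the last point, a small pre-flight check on the dataset can catch empty strings and missing fields before they surface as confusing metric values. This is a minimal sketch that assumes a JSONL file with question and answer fields; the function name find_bad_records is illustrative.

```python
import json


def find_bad_records(jsonl_path, required_fields=("question", "answer")):
    """Return (line_number, problem) tuples for unusable records."""
    problems = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # Ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_no, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((line_no, "record is not a JSON object"))
                continue
            for field in required_fields:
                value = record.get(field)
                # Treat missing keys, nulls, and whitespace-only strings alike
                if value is None or (isinstance(value, str) and not value.strip()):
                    problems.append((line_no, f"missing or empty field: {field}"))
    return problems


if __name__ == "__main__":
    for line_no, problem in find_bad_records("data/my_data.jsonl"):
        print(f"line {line_no}: {problem}")
```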

Common Patterns

Classification Task

output_type: loglikelihood
doc_to_text: "Text: {{text}}\nLabel:"
doc_to_target: " {{label}}"  # Space prefix important!
metric_list:
  - metric: acc
    aggregation: mean

Perplexity Evaluation

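output_type: loglikelihood_rolling
doc_to_text: ""
doc_to_target: "{{text}}"        # Rolling log-likelihood is computed over the target
metric_list:
  - metric: word_perplexity
    aggregation: weighted_perplexity
    higher_is_better: false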

Ranking Task

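output_type: multiple_choice
doc_to_text: "Query: {{query}}\nPassage: {{passage}}\nRelevant:"
doc_to_choice: ["Yes", "No"]     # target_delimiter supplies the leading space
doc_to_target: "{{label}}"       # Index into doc_to_choice; assumes a label field with 0 = relevant, 1 = not relevant
metric_list:
  - metric: acc
    aggregation: mean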

Troubleshooting

"Task not found": Check --include_path and task name

Empty results: Verify doc_to_text and doc_to_target templates

Metric errors: Ensure metric names are correct (exact_match, not exact-match)

Filter issues: Test filters with --log_samples

Python function not found: Check !function module.function_name syntax

References

  • Task system: EleutherAI/lm-evaluation-harness docs
  • Example tasks: lm_eval/tasks/ directory
  • TaskConfig: lm_eval/api/task.py