DSPy Optimizers (Teleprompters)

Complete guide to DSPy's optimization algorithms for improving prompts and model weights.

What are Optimizers?

DSPy optimizers (called "teleprompters") automatically improve your modules by:

  • Synthesizing few-shot examples from training data
  • Proposing better instructions through search
  • Fine-tuning model weights (optional)

Key idea: Instead of manually tuning prompts, define a metric and let DSPy optimize.
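In sketch form (a minimal, hypothetical pattern; runnable examples for each optimizer follow below, and trainset is assumed to be a list of dspy.Example objects):

import dspy
from dspy.teleprompt import BootstrapFewShot

# 1. Define a metric that scores a prediction against a gold example
def metric(example, pred, trace=None):
    return example.answer == pred.answer

# 2. Hand the metric to an optimizer
optimizer = BootstrapFewShot(metric=metric)

# 3. Compile: DSPy searches for demonstrations/instructions that maximize the metric
program = dspy.ChainOfThought("question -> answer")
optimized_program = optimizer.compile(program, trainset=trainset)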

Optimizer Selection Guide

Optimizer          Best For             Speed      Quality    Data Needed
BootstrapFewShot   General purpose      Fast       Good       10-50 examples
MIPRO              Instruction tuning   Medium     Excellent  50-200 examples
BootstrapFinetune  Fine-tuning          Slow       Excellent  100+ examples
COPRO              Prompt optimization  Medium     Good       20-100 examples
KNNFewShot         Quick baseline       Very fast  Fair       10+ examples

Core Optimizers

BootstrapFewShot

Most popular optimizer - Generates few-shot demonstrations from training data.

How it works:

  1. Takes your training examples
  2. Uses your module to generate predictions
  3. Selects high-quality predictions (based on metric)
  4. Uses these as few-shot examples in future prompts

Parameters:

  • metric: Function that scores predictions (required)
  • max_bootstrapped_demos: Max demonstrations to generate (default: 4)
  • max_labeled_demos: Max labeled examples to use (default: 16)
  • max_rounds: Optimization iterations (default: 1)
  • metric_threshold: Minimum score to accept (optional)
import dspy
from dspy.teleprompt import BootstrapFewShot

# Define metric
def validate_answer(example, pred, trace=None):
    """Return True if prediction matches gold answer."""
    return example.answer.lower() == pred.answer.lower()

# Training data
trainset = [
    dspy.Example(question="What is 2+2?", answer="4").with_inputs("question"),
    dspy.Example(question="What is 3+5?", answer="8").with_inputs("question"),
    dspy.Example(question="What is 10-3?", answer="7").with_inputs("question"),
]

# Create module
qa = dspy.ChainOfThought("question -> answer")

# Optimize
optimizer = BootstrapFewShot(
    metric=validate_answer,
    max_bootstrapped_demos=3,
    max_rounds=2
)

optimized_qa = optimizer.compile(qa, trainset=trainset)

# Now optimized_qa has learned few-shot examples!
result = optimized_qa(question="What is 5+7?")

Best practices:

  • Start with 10-50 training examples
  • Use diverse examples covering edge cases
  • Set max_bootstrapped_demos=3-5 for most tasks
  • Increase max_rounds=2-3 for better quality

When to use:

  • First optimizer to try
  • You have 10+ labeled examples
  • Want quick improvements
  • General-purpose tasks

MIPRO (Multi-prompt Instruction Proposal Optimizer)

State-of-the-art optimizer - Iteratively searches for better instructions.

How it works:

  1. Generates candidate instructions
  2. Tests each on validation set
  3. Selects best-performing instructions
  4. Iterates to refine further

Parameters:

  • metric: Evaluation metric (required)
  • num_candidates: Instructions to try per iteration (default: 10)
  • init_temperature: Sampling temperature (default: 1.0)
  • verbose: Show progress (default: False)
from dspy.teleprompt import MIPRO

# Define metric with more nuance
def answer_quality(example, pred, trace=None):
    """Score answer quality 0-1."""
    if example.answer.lower() in pred.answer.lower():
        return 1.0
    # Partial credit for similar answers
    return 0.5 if len(set(example.answer.split()) & set(pred.answer.split())) > 0 else 0.0

# Larger training set (MIPRO benefits from more data)
trainset = [...]  # 50-200 examples
valset = [...]    # 20-50 examples

# Create module
qa = dspy.ChainOfThought("question -> answer")

# Optimize with MIPRO
optimizer = MIPRO(
    metric=answer_quality,
    num_candidates=10,
    init_temperature=1.0,
    verbose=True
)

optimized_qa = optimizer.compile(
    student=qa,
    trainset=trainset,
    valset=valset,  # MIPRO uses separate validation set
    num_trials=100   # More trials = better quality
)

Best practices:

  • Use 50-200 training examples
  • Separate validation set (20-50 examples)
  • Run 100-200 trials for best results
  • Takes 10-30 minutes typically

When to use:

  • You have 50+ labeled examples
  • Want state-of-the-art performance
  • Willing to wait for optimization
  • Complex reasoning tasks

BootstrapFinetune

Fine-tune model weights - Creates training dataset for fine-tuning.

How it works:

  1. Generates synthetic training data
  2. Exports data in fine-tuning format
  3. You fine-tune model separately
  4. Load fine-tuned model back

Parameters:

  • metric: Evaluation metric (required)
  • max_bootstrapped_demos: Demonstrations to generate (default: 4)
  • max_rounds: Data generation rounds (default: 1)
from dspy.teleprompt import BootstrapFinetune

# Training data
trainset = [...]  # 100+ examples recommended

# Define metric
def validate(example, pred, trace=None):
    return example.answer == pred.answer

# Create module
qa = dspy.ChainOfThought("question -> answer")

# Generate fine-tuning data
optimizer = BootstrapFinetune(metric=validate)
optimized_qa = optimizer.compile(qa, trainset=trainset)

# Exports training data to file
# You then fine-tune using your LM provider's API

# After fine-tuning, load your model:
finetuned_lm = dspy.OpenAI(model="ft:gpt-3.5-turbo:your-model-id")
dspy.settings.configure(lm=finetuned_lm)

Best practices:

  • Use 100+ training examples
  • Validate on held-out test set
  • Monitor for overfitting
  • Compare with prompt-based methods first

When to use:

  • You have 100+ examples
  • Latency is critical (fine-tuned models faster)
  • Task is narrow and well-defined
  • Prompt optimization isn't enough

COPRO (Coordinate Prompt Optimization)

Optimize prompts via gradient-free search.

How it works:

  1. Generates prompt variants
  2. Evaluates each variant
  3. Selects best prompts
  4. Iterates to refine
from dspy.teleprompt import COPRO

# Training data
trainset = [...]

# Define metric
def metric(example, pred, trace=None):
    return example.answer == pred.answer

# Create module
qa = dspy.ChainOfThought("question -> answer")

# Optimize with COPRO
optimizer = COPRO(
    metric=metric,
    breadth=10,  # Candidates per iteration
    depth=3      # Optimization rounds
)

optimized_qa = optimizer.compile(qa, trainset=trainset)

When to use:

  • Want prompt optimization
  • Have 20-100 examples
  • MIPRO too slow

KNNFewShot

Simple k-nearest neighbors - Selects similar examples for each query.

How it works:

  1. Embeds all training examples
  2. For each query, finds k most similar examples
  3. Uses these as few-shot demonstrations
from dspy.teleprompt import KNNFewShot

trainset = [...]

# No metric needed - just selects similar examples
# KNNFewShot takes the trainset up front to build its nearest-neighbor index
optimizer = KNNFewShot(k=3, trainset=trainset)
optimized_qa = optimizer.compile(qa, trainset=trainset)

# For each query, uses 3 most similar examples from trainset

When to use:

  • Quick baseline
  • Have diverse training examples
  • Similarity is good proxy for helpfulness

Writing Metrics

Metrics are functions that score predictions. They're critical for optimization.

Binary Metrics

def exact_match(example, pred, trace=None):
    """Return True if prediction exactly matches gold."""
    return example.answer == pred.answer

def contains_answer(example, pred, trace=None):
    """Return True if prediction contains gold answer."""
    return example.answer.lower() in pred.answer.lower()

Continuous Metrics

def f1_score(example, pred, trace=None):
    """F1 score between prediction and gold."""
    pred_tokens = set(pred.answer.lower().split())
    gold_tokens = set(example.answer.lower().split())

    if not pred_tokens:
        return 0.0

    precision = len(pred_tokens & gold_tokens) / len(pred_tokens)
    recall = len(pred_tokens & gold_tokens) / len(gold_tokens)

    if precision + recall == 0:
        return 0.0

    return 2 * (precision * recall) / (precision + recall)

def semantic_similarity(example, pred, trace=None):
    """Embedding similarity between prediction and gold."""
    from sentence_transformers import SentenceTransformer, util
    model = SentenceTransformer('all-MiniLM-L6-v2')  # loaded per call here; cache it in practice

    emb1 = model.encode(example.answer)
    emb2 = model.encode(pred.answer)

    # Cosine similarity (a 1x1 tensor), converted to a plain float score
    return float(util.cos_sim(emb1, emb2))

Multi-Factor Metrics

def comprehensive_metric(example, pred, trace=None):
    """Combine multiple factors."""
    score = 0.0

    # Correctness (50%)
    if example.answer.lower() in pred.answer.lower():
        score += 0.5

    # Conciseness (25%)
    if len(pred.answer.split()) <= 20:
        score += 0.25

    # Citation (25%)
    if "source:" in pred.answer.lower():
        score += 0.25

    return score

Using Trace for Debugging

def metric_with_trace(example, pred, trace=None):
    """Metric that uses trace for debugging."""
    is_correct = example.answer == pred.answer

    if trace is not None and not is_correct:
        # Log failures for analysis
        print(f"Failed on: {example.question}")
        print(f"Expected: {example.answer}")
        print(f"Got: {pred.answer}")

    return is_correct

Evaluation Best Practices

Train/Val/Test Split

# Split data
trainset = data[:100]   # 70%
valset = data[100:120]  # 15%
testset = data[120:]    # 15%

# Optimize on train
optimized = optimizer.compile(module, trainset=trainset)

# Validate during optimization (for MIPRO)
optimized = optimizer.compile(module, trainset=trainset, valset=valset)

# Evaluate on test
from dspy.evaluate import Evaluate
evaluator = Evaluate(devset=testset, metric=metric)
score = evaluator(optimized)

Cross-Validation

from sklearn.model_selection import KFold

kfold = KFold(n_splits=5)
scores = []

for train_idx, val_idx in kfold.split(data):
    trainset = [data[i] for i in train_idx]
    valset = [data[i] for i in val_idx]

    optimized = optimizer.compile(module, trainset=trainset)
    score = evaluator(optimized, devset=valset)
    scores.append(score)

print(f"Average score: {sum(scores) / len(scores):.2f}")

Comparing Optimizers

results = {}

for opt_name, optimizer in [
    ("baseline", None),
    ("fewshot", BootstrapFewShot(metric=metric)),
    ("mipro", MIPRO(metric=metric)),
]:
    if optimizer is None:
        module_opt = module
    else:
        module_opt = optimizer.compile(module, trainset=trainset)

    score = evaluator(module_opt, devset=testset)
    results[opt_name] = score

print(results)
# {'baseline': 0.65, 'fewshot': 0.78, 'mipro': 0.85}

Advanced Patterns

Custom Optimizer

from dspy.teleprompt import Teleprompter

class CustomOptimizer(Teleprompter):
    def __init__(self, metric):
        super().__init__()
        self.metric = metric

    def compile(self, student, trainset, **kwargs):
        # Your optimization logic here
        # Return optimized student module
        return student

Multi-Stage Optimization

# Stage 1: Bootstrap few-shot
stage1 = BootstrapFewShot(metric=metric, max_bootstrapped_demos=3)
optimized1 = stage1.compile(module, trainset=trainset)

# Stage 2: Instruction tuning
stage2 = MIPRO(metric=metric, num_candidates=10)
optimized2 = stage2.compile(optimized1, trainset=trainset, valset=valset)

# Final optimized module
final_module = optimized2

Ensemble Optimization

class EnsembleModule(dspy.Module):
    def __init__(self, modules):
        super().__init__()
        self.modules = modules

    def forward(self, question):
        predictions = [m(question=question).answer for m in self.modules]
        # Vote or average
        return dspy.Prediction(answer=max(set(predictions), key=predictions.count))

# Optimize multiple modules
opt1 = BootstrapFewShot(metric=metric).compile(module, trainset=trainset)
opt2 = MIPRO(metric=metric).compile(module, trainset=trainset)
opt3 = COPRO(metric=metric).compile(module, trainset=trainset)

# Ensemble
ensemble = EnsembleModule([opt1, opt2, opt3])

Optimization Workflow

1. Start with Baseline

# No optimization
baseline = dspy.ChainOfThought("question -> answer")
baseline_score = evaluator(baseline, devset=testset)
print(f"Baseline: {baseline_score}")

2. Try BootstrapFewShot

# Quick optimization
fewshot = BootstrapFewShot(metric=metric, max_bootstrapped_demos=3)
optimized = fewshot.compile(baseline, trainset=trainset)
fewshot_score = evaluator(optimized, devset=testset)
print(f"Few-shot: {fewshot_score} (+{fewshot_score - baseline_score:.2f})")

3. If More Data Available, Try MIPRO

# State-of-the-art optimization
mipro = MIPRO(metric=metric, num_candidates=10)
optimized_mipro = mipro.compile(baseline, trainset=trainset, valset=valset)
mipro_score = evaluator(optimized_mipro, devset=testset)
print(f"MIPRO: {mipro_score} (+{mipro_score - baseline_score:.2f})")

4. Save Best Model

if mipro_score > fewshot_score:
    optimized_mipro.save("models/best_model.json")
else:
    optimized.save("models/best_model.json")
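To reuse the saved program later, the usual pattern is to rebuild the same module and load the saved state (assuming the path used above):

# Rebuild the same program structure, then restore the optimized state
loaded = dspy.ChainOfThought("question -> answer")
loaded.load("models/best_model.json")

result = loaded(question="What is 5+7?")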

Common Pitfalls

1. Overfitting to Training Data

# ❌ Bad: Too many demos
optimizer = BootstrapFewShot(metric=metric, max_bootstrapped_demos=20)  # Overfits!

# ✅ Good: Moderate demos
optimizer = BootstrapFewShot(metric=metric, max_bootstrapped_demos=4)  # 3-5 is enough for most tasks

2. Metric Doesn't Match Task

# ❌ Bad: Binary metric for nuanced task
def bad_metric(example, pred, trace=None):
    return example.answer == pred.answer  # Too strict!

# ✅ Good: Graded metric
def good_metric(example, pred, trace=None):
    return f1_score(example, pred)  # Token-level F1 (defined above) gives partial credit

3. Insufficient Training Data

# ❌ Bad: Too little data
trainset = data[:5]  # Not enough!

# ✅ Good: Sufficient data
trainset = data[:50]  # Better

4. No Validation Set

# ❌ Bad: Optimizing on test set
optimizer.compile(module, trainset=testset)  # Cheating!

# ✅ Good: Proper splits
optimizer.compile(module, trainset=trainset, valset=valset)
evaluator(optimized, devset=testset)

Performance Tips

  1. Start simple: BootstrapFewShot first
  2. Use representative data: Cover edge cases
  3. Monitor overfitting: Validate on held-out set
  4. Iterate metrics: Refine based on failures
  5. Save checkpoints: Don't lose progress
  6. Compare to baseline: Measure improvement
  7. Test multiple optimizers: Find best fit

Resources