BEP-005: Prompt Optimization Design
Summary
- Add prompt optimization to BAML via a new baml-cli optimize command
- Use DSPy's GEPA algorithm
- Optimize the prompt text, @description, and @alias data
- Users define accuracy criteria using existing BAML tests with assertions
- Also allow optimizing other outcomes: fewer input tokens, fewer response tokens, and lower latency. @@check establishes new outcomes
- Store optimization state to enable resumable runs, and provide detailed artifacts including all candidate prompts and their performance metrics
Motivation
Copy the great things about DSPy. The people demand prompt optimization!
BAML already has the key components needed for automated optimization:
- Functions with prompts as the optimization target
- Tests with assertions as the success metric
Scope: What Gets Optimized
In Scope (Phase 1)
- Function prompts: The prompt #"..."# text
- Class field descriptions: @description("...") annotations
- Class field aliases: @alias(...) annotations
- Enum value descriptions: Comments on enum values
- Class descriptions: Class-level @description
All optimized together as a cohesive system. Changes to schema descriptions/aliases can improve parsing without requiring test updates.
Out of Scope
- Structural changes: Adding/removing fields, changing types (would break tests)
- Multi-function workflows: Optimizing multiple functions together (Phase 4+)
- Few-shot examples: Optimizing example selection (Phase 2+)
Proposed Design
High-Level Architecture
The optimization system follows GEPA's evolutionary approach (sketched in code after this list):
1. Initialization: Start with the current prompt in each BAML function
2. Evaluation: Run tests to measure prompt performance (test pass rate + optional metrics)
3. Reflection: Analyze test failures to understand what went wrong
4. Proposal: Generate new prompt variations using LLM-based reflection on failures
5. Selection: Maintain a Pareto frontier of candidates balancing multiple objectives
6. Iteration: Repeat steps 2-5 until budget exhausted or convergence
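A minimal, runnable Python sketch of this loop; the helpers are illustrative stubs, not baml-cli internals, and single-objective selection stands in for the Pareto sampling described in step 5:

import random

# Stubs standing in for running BAML tests and LLM-based reflection.
def evaluate(prompt):
    # Step 2: would run the function's tests and return the pass rate.
    return random.random()

def propose(prompt, failures):
    # Steps 3-4: would reflect on failures and emit an improved variant.
    return prompt + " (revised)"

def optimize(initial_prompt, max_evals=10):
    candidates = [(initial_prompt, evaluate(initial_prompt))]  # step 1
    while len(candidates) < max_evals:                         # step 6
        # Step 5: the real system samples from a Pareto frontier
        # rather than always taking the single best candidate.
        parent, _ = max(candidates, key=lambda c: c[1])
        child = propose(parent, failures=[])
        candidates.append((child, evaluate(child)))
    return max(candidates, key=lambda c: c[1])

print(optimize("Extract receipt information."))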
Key differences from DSPy's GEPA:
- Uses BAML tests (with @@assert and @@check) instead of custom metrics
- Optimizes BAML prompt templates (with Jinja2) instead of Python strings
- Stores state in BAML-native formats alongside the codebase
Command-Line Interface
# Basic usage - optimize all functions with tests
baml-cli optimize
# Optimize specific function(s)
baml-cli optimize --function ExtractReceipt --function ClassifyEmail
# Optimize with test filtering
baml-cli optimize --test "ExtractReceipt::*"
# Control optimization budget
baml-cli optimize --max-evals 50 # Total function evaluations
baml-cli optimize --trials 20 # Optimization iterations
# Auto-sized optimization budgets
baml-cli optimize --auto light # Quick exploration (6 candidates)
baml-cli optimize --auto medium # Balanced (12 candidates)
baml-cli optimize --auto heavy # Thorough (18 candidates)
# Multi-objective optimization
baml-cli optimize --weight accuracy=0.8,tokens=0.2
baml-cli optimize --weight accuracy=0.7,latency=0.2,prompt_tokens=0.1
baml-cli optimize --weight accuracy=0.9,completion_tokens=0.1
# Resume previous optimization run
baml-cli optimize --resume .baml_optimize/run_20250106_143022
# Reset GEPA reflection prompts to defaults
baml-cli optimize --reset-gepa-prompts
# Control parallelism
baml-cli optimize --parallel 8
# Output and logging
baml-cli optimize --output-dir .baml_optimize/custom_run
baml-cli optimize --verbose
Syntax
The optimization system reuses existing BAML syntax with no new language features required:
// Existing BAML function - the prompt will be optimized
function ExtractReceipt(image: image) -> Receipt {
client GPT4o
prompt #"
Extract structured receipt information from this image.
Return the merchant name, date, items, and total.
"#
}
// Existing BAML tests - these define the optimization objective
test ReceiptTest1 {
functions [ExtractReceipt]
args {
image { file "test_receipts/starbucks.jpg" }
}
// Assertions are the success criteria
@@assert({{ this.merchant == "Starbucks" }})
@@assert({{ this.total > 0 }})
@@check(correct_items, {{ this.items|length == 2 }})
}
test ReceiptTest2 {
functions [ExtractReceipt]
args {
image { file "test_receipts/target.jpg" }
}
@@assert({{ this.merchant == "Target" }})
@@assert({{ this.total == 45.67 }})
}
// Example: Custom checks for multi-objective optimization (Phase 3)
test ReceiptWithGroundedness {
functions [ExtractReceipt]
args {
image { file "test_receipts/complex.jpg" }
}
@@assert({{ this.merchant != "" }})
// Custom checks can be weighted in optimization
@@check(groundedness, {{ this.confidence > 0.8 }})
@@check(safety, {{ this.contains_no_pii }})
}
BAML-Driven GEPA Reflection
A key design principle: GEPA's reflection logic is implemented in BAML itself. This makes the optimization process transparent and customizable, and it dogfoods BAML to optimize BAML.
gepa.baml Location and Versioning
GEPA reflection functions live in .baml_optimize/gepa/baml_src/:
.baml_optimize/
└── gepa/
└── baml_src/
├── gepa.baml # Reflection functions
├── clients.baml # Client configs
└── .gepa_version # Tracks baml-cli version
First run behavior:
$ baml-cli optimize
Creating .baml_optimize/gepa/baml_src/ with defaults from baml-cli 0.73.0...
Using reflection model: gpt-4o (default)
Customization workflow:
# User modifies reflection logic
$ vim .baml_optimize/gepa/baml_src/gepa.baml
# Or changes the reflection model
$ vim .baml_optimize/gepa/baml_src/clients.baml
# Next run uses custom GEPA implementation
$ baml-cli optimize
Version tracking:
The .gepa_version file contains:
{
"baml_cli_version": "0.73.0",
"created_at": "2025-01-06T14:30:22Z",
"gepa_baml_hash": "a3f5c9d..."
}
Modifications are detected by comparing the gepa.baml file hash against the default embedded in the CLI. On version mismatch:
$ baml-cli --version
baml-cli 0.74.0
$ baml-cli optimize
Warning: Your GEPA implementation is from baml-cli 0.73.0
Run 'baml-cli optimize --reset-gepa-prompts' to upgrade
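A Python sketch of the modification and version check; SHA-256 is an assumption (the proposal only specifies that a hash is compared), and the paths follow the layout shown earlier:

import hashlib
import json
from pathlib import Path

CLI_VERSION = "0.74.0"  # hypothetical: embedded in the baml-cli binary

gepa_dir = Path(".baml_optimize/gepa/baml_src")
meta = json.loads((gepa_dir / ".gepa_version").read_text())

# Warn when the GEPA files were created by an older CLI.
if meta["baml_cli_version"] != CLI_VERSION:
    print(f"Warning: Your GEPA implementation is from baml-cli {meta['baml_cli_version']}")
    print("Run 'baml-cli optimize --reset-gepa-prompts' to upgrade")

# Separately, detect user customization by hashing the current gepa.baml.
actual = hashlib.sha256((gepa_dir / "gepa.baml").read_bytes()).hexdigest()
if actual != meta["gepa_baml_hash"]:
    print("gepa.baml has local modifications; --reset-gepa-prompts would overwrite them")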
Default gepa.baml Implementation
The default gepa.baml embedded in baml-cli includes:
Data Models:
class SchemaFieldDefinition {
field_name string
field_type string
description string?
aliases string[]
is_optional bool
}
class ClassDefinition {
class_name string
description string?
fields SchemaFieldDefinition[]
}
class EnumDefinition {
enum_name string
values string[]
value_descriptions map<string, string>
}
class OptimizableFunction {
function_name string
prompt_text string
classes ClassDefinition[] // All reachable classes
enums EnumDefinition[] // All reachable enums
}
class ReflectiveExample {
inputs map<string, string>
generated_outputs map<string, string>
feedback string
failure_location string? // "prompt" | "parsing" | "schema"
}
class ImprovedFunction {
prompt_text string
classes ClassDefinition[] // Only modified classes
enums EnumDefinition[] // Only modified enums
rationale string
}
Core Reflection Function:
function ProposeImprovements(
current_function: OptimizableFunction,
failed_examples: ReflectiveExample[],
successful_examples: ReflectiveExample[]?
) -> ImprovedFunction {
client ReflectionModel
prompt #"
You are optimizing a BAML function. Improve both the prompt and schema.
## Current Implementation
Prompt:
```
{{ current_function.prompt_text }}
```
Schema:
{% for class in current_function.classes %}
class {{ class.class_name }} {
{% for field in class.fields %}
/// @description("{{ field.description or 'none' }}")
{{ field.field_name }} {{ field.field_type }}{% if field.aliases %} @alias({{ field.aliases | join(", ") }}){% endif %}
{% endfor %}
}
{% endfor %}
## Failures
{% for ex in failed_examples %}
Inputs: {{ ex.inputs }}
Generated: {{ ex.generated_outputs }}
Issue: {{ ex.feedback }}
{% endfor %}
## Your Task
Analyze failures and propose improvements to:
1. Prompt text - clarity, instructions, examples
2. Field descriptions - guide LLM parsing
3. Field aliases - catch output variations
Consider: Do the prompt and schema work well together?
Return improvements as ImprovedFunction JSON.
"#
}
function MergeVariants(
variant_a: OptimizableFunction,
variant_b: OptimizableFunction,
variant_a_strengths: string[],
variant_b_strengths: string[]
) -> ImprovedFunction {
client ReflectionModel
prompt #"
Merge two successful BAML function variants.
Combine their strengths into a single improved version.
[Details omitted for brevity]
"#
}
Default clients.baml:
client<llm> ReflectionModel {
provider openai
options {
model "gpt-4o"
temperature 1.0
max_tokens 8000
}
}
Schema Optimization Examples
Example 1: Adding aliases based on failures
Before:
class Receipt {
  /// @description("Merchant name exactly as shown on receipt")
  merchant string
}
After reflection on failures where LLM outputs "store_name":
class Receipt {
/// @description("Merchant name exactly as shown on receipt")
merchant string @alias("store_name", "shop_name", "vendor")
}
Example 2: Improving descriptions
Before:
class Receipt {
  /// @description("Total amount")
  total float @alias("amount", "total_amount")
}
After reflection on failures with parsing errors:
class Receipt {
/// @description("Total amount in decimal format (e.g., 12.99, not '$12.99')")
total float @alias("amount", "total_amount")
}
Example 3: Coordinated prompt and schema improvements
GEPA optimizes prompt and schema together, ensuring they work cohesively:
// Before
function ExtractReceipt(image: image) -> Receipt {
prompt #"Extract receipt information"#
}
class Receipt {
merchant string
total float
}
// After GEPA optimization
function ExtractReceipt(image: image) -> Receipt {
prompt #"
Extract structured receipt data:
- merchant: exact name as printed
- total: decimal amount (e.g., 45.67)
"#
}
class Receipt {
/// @description("Merchant name preserving exact capitalization")
merchant string @alias("store_name", "vendor")
/// @description("Total in decimal format, no currency symbols")
total float @alias("amount", "sum")
}
Semantics
Test-Based Objective Function
The optimization objective is computed from BAML test results:
- Primary metric: Test pass rate
  - Each test case yields a binary pass/fail based on @@assert statements
  - The first failing @@assert stops evaluation of remaining assertions
  - Pass rate = (passed tests) / (total tests)
  - This is always the primary component of the objective
- Secondary metrics (optional weights)
Like DSPy GEPA, BAML supports multi-objective optimization with the following metrics (a scoring sketch follows this list):
- tokens: Minimize total tokens (prompt + completion). Useful for reducing API costs.
- latency: Minimize inference latency (milliseconds). Useful for real-time applications.
- prompt_tokens: Minimize prompt tokens specifically. Useful when optimizing prompt length.
- completion_tokens: Minimize completion tokens. Useful for controlling output verbosity.
- Custom metrics via @@check: User-defined checks can be weighted (Phase 3 feature):
  - groundedness: For RAG applications, measure citation quality
  - safety: Domain-specific safety constraints
  - compliance: Regulatory or policy compliance checks
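As a sketch, the weighted score might be computed as below in Python. The z-score normalization reuses the same statistics that state.json records as normalization_stats (see Optimization State Storage); the exact combination rule is an assumption, not specified by this proposal:

# Secondary metrics are minimized, so their normalized values are subtracted.
def objective(metrics, weights, stats):
    score = weights.get("accuracy", 1.0) * metrics["pass_rate"]
    for name in ("tokens", "latency", "prompt_tokens", "completion_tokens"):
        if name in weights:
            z = (metrics[name] - stats[name]["mean"]) / stats[name]["std"]
            score -= weights[name] * z
    return score

# Example: baml-cli optimize --weight accuracy=0.8,tokens=0.2
print(objective(
    metrics={"pass_rate": 0.9, "tokens": 1200},
    weights={"accuracy": 0.8, "tokens": 0.2},
    stats={"tokens": {"mean": 1500, "std": 300}},
))  # -> 0.92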
Optimization State Storage
The optimizer stores artifacts in baml_src/../.baml_optimize/run_<timestamp>/:
.baml_optimize/
└── run_20250106_143022/
├── config.json # Optimization parameters
├── candidates/
│ ├── 00_initial.baml # Initial prompts
│ ├── 01_candidate.baml # Generated variations
│ ├── 02_candidate.baml
│ └── ...
├── evaluations/
│ ├── 00_initial.json # Test results per candidate
│ ├── 01_candidate.json
│ └── ...
├── reflections/
│ ├── iteration_01.json # Failure analysis
│ └── ...
├── state.json # Resumable optimization state
├── pareto_frontier.json # Current best candidates
└── final_results.json # Summary statistics
State Format (JSON, not pickle):
All optimization state is stored in human-readable JSON format for language-agnostic resumability:
{
"version": "1.0",
"baml_cli_version": "0.73.0",
"iteration": 15,
"total_evals": 450,
"budget_remaining": 550,
"rng_seed": 42,
"pareto_frontier_indices": [3, 7, 12, 15],
"candidate_lineage": {
"0": {"parents": null, "method": "initial"},
"1": {"parents": [0], "method": "reflection"},
"2": {"parents": [0, 1], "method": "merge"}
},
"normalization_stats": {
"tokens": {"mean": 1500, "std": 300, "min": 800, "max": 2500},
"latency": {"mean": 1200, "std": 200, "min": 800, "max": 1800}
}
}
This replaces pickle files, ensuring:
- Language-agnostic: works across Python, Rust, and TypeScript implementations
- Human-readable: state can be inspected and debugged manually
- Git-friendly: checkpoints can be diffed
- Secure: no code-execution risk
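A sketch of what --resume might do with this file; the field names match the state.json example above, but the loader itself is hypothetical:

import json
import random
from pathlib import Path

def resume(run_dir):
    state = json.loads((Path(run_dir) / "state.json").read_text())
    assert state["version"] == "1.0", "unknown state format"
    random.seed(state["rng_seed"])  # re-seed (a real resume would also replay prior draws)
    print(f"Resuming at iteration {state['iteration']}: "
          f"{state['budget_remaining']} evals left, "
          f"frontier = {state['pareto_frontier_indices']}")
    return state

state = resume(".baml_optimize/run_20250106_143022")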
Candidate BAML File Format
Each candidate file contains only the optimized functions:
// Generated candidate #5
// Iteration: 3
// Parent candidates: [2, 4]
// Score: 0.85 (accuracy=0.90, tokens=-0.05)
function ExtractReceipt(image: image) -> Receipt {
client GPT4o
prompt #"
Carefully analyze the receipt image and extract:
1. Merchant name (exactly as shown)
2. Purchase date (in ISO format)
3. Line items with prices
4. Total amount
Pay special attention to currency formatting.
"#
}
Reflection and Proposal
The reflection phase analyzes test failures to guide prompt evolution (a sketch of steps 1-2 follows this list):
1. Collect failure data:
   - For each failed test, capture the inputs, outputs, and the assertions that failed
   - Sample a minibatch of failures (default: 3) to avoid overwhelming the reflection model
2. Generate a reflective dataset:
   {
     "function": "ExtractReceipt",
     "examples": [
       {
         "inputs": {"image": "test_receipts/starbucks.jpg"},
         "outputs": {"merchant": "STARBUCKS", "total": 8.50},
         "feedback": "Assertion failed: this.merchant == 'Starbucks'. The merchant name should match the expected casing exactly."
       },
       {
         "inputs": {"image": "test_receipts/target.jpg"},
         "outputs": {"merchant": "Target", "total": 45.0},
         "feedback": "Assertion failed: this.total == 45.67. The total is incorrect, possibly due to missing cents."
       }
     ]
   }
3. Propose a new prompt:
   - Use a reflection LLM (e.g., GPT-4o, Claude Sonnet) to analyze failure patterns
   - Prompt template (simplified from GEPA's InstructionProposalSignature):

     You are optimizing a BAML prompt. Here is the current prompt:
     <current_prompt>
     {current_prompt_text}
     </current_prompt>

     Here are examples where the prompt failed:
     <failures>
     {reflective_dataset}
     </failures>

     Based on these failures, propose an improved version of the prompt that:
     1. Addresses the specific issues shown in the failures
     2. Maintains the overall structure and intent
     3. Is clear and concise

     New prompt:
4. Merge successful variants (optional):
   - When multiple candidates perform well on different test subsets
   - Use the reflection LLM to synthesize a combined prompt
   - Helps escape local optima by combining diverse successful strategies
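A Python sketch of steps 1-2, sampling a minibatch of failures and shaping them into the reflective dataset; the record layout is an assumption based on the JSON above:

import random

# Illustrative failure records; a real run collects these from the test harness.
failures = [
    {"inputs": {"image": "test_receipts/starbucks.jpg"},
     "outputs": {"merchant": "STARBUCKS", "total": 8.50},
     "failed_assert": "this.merchant == 'Starbucks'"},
]

def build_reflective_dataset(function_name, failures, minibatch_size=3):
    # Sample at most minibatch_size failures to keep the reflection prompt small.
    sample = random.sample(failures, min(minibatch_size, len(failures)))
    return {
        "function": function_name,
        "examples": [
            {"inputs": f["inputs"],
             "outputs": f["outputs"],
             "feedback": f"Assertion failed: {f['failed_assert']}"}
            for f in sample
        ],
    }

print(build_reflective_dataset("ExtractReceipt", failures))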
Pareto Frontier Selection
When optimizing multiple objectives, maintain a Pareto frontier (a code sketch follows this list):
- A candidate A dominates B if A is better on at least one objective and no worse on all others
- The Pareto frontier is the set of non-dominated candidates
- When selecting candidates for reflection, sample from the frontier (rather than always using the single "best")
- Final output presents the entire frontier, letting users choose their preferred trade-off
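A Python sketch of the dominance test and frontier computation; here every objective is oriented so that higher is better, so minimized metrics (e.g., tokens) are negated first:

def dominates(a, b):
    # a dominates b: no worse on every objective, strictly better on at least one.
    return (all(a[k] >= b[k] for k in a)
            and any(a[k] > b[k] for k in a))

def pareto_frontier(candidates):
    # Keep every candidate that no other candidate dominates.
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates if other is not c)]

candidates = [
    {"accuracy": 0.90, "neg_tokens": -1500},
    {"accuracy": 0.85, "neg_tokens": -900},   # less accurate but cheaper: stays
    {"accuracy": 0.80, "neg_tokens": -1600},  # dominated by the first: dropped
]
print(pareto_frontier(candidates))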
Integration with Existing BAML Features
Function Variants and Clients
The optimizer respects BAML's existing client configuration:
function ExtractReceipt(image: image) -> Receipt {
client GPT4o
prompt #"..."#
}
// The optimizer will use GPT4o for all evaluations
// To optimize for a different model, create a variant:
function ExtractReceipt(image: image) -> Receipt {
client GPT4oMini // Changed client
prompt #"..."#
}
Users can optimize separately for different models by using function variants or by modifying the client between optimization runs.
Dynamic Types
If a function uses @@dynamic types, tests can override type definitions:
class Receipt {
merchant string
total float
@@dynamic
}
test ReceiptWithCustomFields {
functions [ExtractReceipt]
args {
image { file "test_receipts/custom.jpg" }
}
type_builder {
dynamic class Receipt {
merchant string
total float
loyalty_number string // Additional field for this test
}
}
@@assert({{ this.loyalty_number != "" }})
}
The optimizer will respect these test-specific type extensions when evaluating candidates.
Expression language
This proposal should be as orthogonal to the expression language as possible, so we can ship it ASAP.
When we ship the expression language, we get two benefits:
- More interesting checks and asserts: e.g., checks and asserts could load data from external sources and make their own LLM calls, so that we could test RAG groundedness directly.
- Cool Alert 😎. We can optimize expression functions. Write a test block for an expression function, and baml-cli optimize will build a Pareto frontier for the full workflow by optimizing every LLM call it makes and evaluating the workflow's own test checks and asserts. I think DSPy does something like this too.
Backwards Compatibility
Fully backward compatible.
Open Questions
- Reflection Model Selection: Should we default to a specific reflection model, or require users to specify one?
  - Proposal: Default to gpt-4o with opt-in to others via --reflection-model
- Automatic Prompt Updates: Should the optimizer automatically update BAML source files, or just recommend changes?
  - Proposal: Never auto-update source. Instead, generate a diff/patch that users can review and apply.
- Validation: How do we ensure optimized prompts don't break type safety or introduce security issues?