Auto Prompt Optimization is the part of Improve that proposes prompt changes. Adaline generates candidate prompt snapshots, scores them against available evidence, rejects unsafe or regressing candidates, and packages the selected candidate for review.

[Figure: Improve review page showing the selected prompt candidate, prompt diff, and real traffic comparison]

What can change

Prompt area | Example change | Review concern
Instructions | Add or clarify a constraint. | Avoid broad rules that affect unrelated traffic.
Examples | Add a demonstration of the desired behavior. | Avoid overfitting one customer or trace.
Variables | Clarify how runtime inputs should be used. | Confirm variable mapping still works.
Model settings | Adjust supported generation settings. | Check cost, latency, determinism, and output length.
Response schema | Tighten structured output requirements. | Confirm downstream consumers still accept the output.
Tool guidance | Clarify when to call a tool and what arguments matter. | Fix broken tools or backends outside the prompt.
For tool-using and coding agents, optimization may also affect routing rules, verification policy, tool descriptions, and few-shot examples. Always review the full diff.

Candidate exploration

The Prompts stage summarizes the search.
Signal | Meaning
Variants explored | Prompt candidates generated for the run.
Passed safety gate | Candidates that did not regress protected checks.
Failed safety gate | Candidates blocked by constraints or evaluator regressions.
Strong contenders | Candidates with positive evidence after scoring.
Selected candidate | The candidate packaged for review.
More variants are not automatically better. A strong run finds a narrow candidate that improves the target issue and preserves healthy behavior.
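The funnel in the table above can be pictured with a small sketch. This is not Adaline's implementation: the `Candidate` fields, the zero-tolerance gate, and the `min_gain` threshold are all hypothetical, chosen only to show how the five signals relate to one another.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    target_gain: float      # hypothetical score improvement on the target issue
    protected_deltas: dict  # evaluator name -> score change versus baseline


def passes_safety_gate(candidate, tolerance=0.0):
    """Pass only if no protected evaluator drops by more than `tolerance`."""
    return all(d >= -tolerance for d in candidate.protected_deltas.values())


def summarize(candidates, min_gain=0.05):
    """Count each stage of the funnel and pick the best surviving candidate."""
    passed = [c for c in candidates if passes_safety_gate(c)]
    failed = [c for c in candidates if not passes_safety_gate(c)]
    contenders = [c for c in passed if c.target_gain >= min_gain]
    selected = max(contenders, key=lambda c: c.target_gain, default=None)
    return {
        "variants_explored": len(candidates),
        "passed_safety_gate": len(passed),
        "failed_safety_gate": len(failed),
        "strong_contenders": len(contenders),
        "selected": selected.name if selected else None,
    }


candidates = [
    Candidate("narrow-constraint", 0.12, {"tone": 0.0, "format": 0.01}),
    Candidate("broad-rewrite", 0.20, {"tone": -0.08, "format": 0.0}),
    Candidate("extra-example", 0.02, {"tone": 0.0, "format": 0.0}),
]
summary = summarize(candidates)
print(summary)
```

In this made-up run, the broad rewrite has the largest gain on the target issue but is blocked because it regresses a protected check, so the narrower candidate is the one selected.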

Read the diff first

The diff is the source of truth for what will change. Use the diagnosis and scores to understand why the change exists, but use the diff to decide whether the change is acceptable.
Diff pattern | Usually good | Usually risky
Narrow constraint | Matches the failing behavior. | Changes all outputs broadly.
Tool-use clarification | Explains when and how to call a tool. | Hides a bad tool contract.
Added example | Covers the failure and a healthy path. | Encodes private or one-off context.
Output format change | Matches downstream requirements. | Breaks existing consumers.
Generation setting change | Improves reliability or consistency. | Moves cost, latency, or output quality without coverage.
Reject or edit candidates that try to solve retrieval, provider, data, or backend failures through prompt text.
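If you export the baseline and candidate prompt text, you can reproduce a line-level diff outside the review page. The sketch below uses Python's standard `difflib`. The two prompt strings are invented examples, and nothing here reflects how Adaline computes or renders its own diff.

```python
import difflib

# Hypothetical prompt texts, standing in for an exported baseline and candidate.
baseline = """You are a support assistant.
Answer the user's question.
"""
candidate = """You are a support assistant.
Answer the user's question.
If the user asks about refunds, cite the refund policy before answering.
"""

diff_lines = list(
    difflib.unified_diff(
        baseline.splitlines(),
        candidate.splitlines(),
        fromfile="baseline",
        tofile="candidate",
        lineterm="",
    )
)

# Skip the two file-header lines ("--- baseline", "+++ candidate").
body = diff_lines[2:]
added = [line[1:] for line in body if line.startswith("+")]
removed = [line[1:] for line in body if line.startswith("-")]

print("\n".join(diff_lines))
```

A single added line scoped to one topic, as here, matches the "narrow constraint" pattern in the table. A diff that removes or rewrites many unrelated lines deserves closer review.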

Check regression evidence

[Figure: Regression report showing evaluator movement, cost, token, and latency tradeoffs for the selected candidate]

The regression report compares baseline and candidate behavior across authored evaluators, auto-generated evaluators, and validation cases. Watch for:
  • Protected evaluators moving down.
  • Blank baseline or aggregate cells, which usually mean weak comparable scoring coverage.
  • New generated checks that need review before becoming hard gates.
  • Healthy dataset rows failing after the candidate improves the target issue.
An evaluator drop is not automatically fatal, but it needs a named owner and a reason.
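The first two checks in the list can be sketched as a small comparison over exported scores. The evaluator names, score values, and dictionary shape below are assumptions made for illustration. The real report may expose this data differently.

```python
def review_regressions(baseline, candidate, protected):
    """Return (drops, gaps).

    drops: protected evaluators whose score moved down, with the change.
    gaps:  evaluators missing a score on either side (a "blank cell"),
           which cannot be compared.
    """
    drops, gaps = {}, []
    for name in sorted(set(baseline) | set(candidate)):
        b, c = baseline.get(name), candidate.get(name)
        if b is None or c is None:
            gaps.append(name)
            continue
        if name in protected and c < b:
            drops[name] = round(c - b, 4)
    return drops, gaps


# Hypothetical scores; None marks a blank baseline cell.
baseline = {"groundedness": 0.91, "format": 0.98, "refund_policy": None}
candidate = {"groundedness": 0.88, "format": 0.99, "refund_policy": 0.95}

drops, gaps = review_regressions(
    baseline, candidate, protected={"groundedness", "format"}
)
print("protected drops:", drops)
print("no comparable score:", gaps)
```

Each entry in `drops` is a movement that needs an owner and a reason before approval. Each entry in `gaps` is a check whose result should not be read as evidence either way.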

Inspect traffic comparison

[Figure: Traffic comparison showing current prompt outputs beside improved candidate outputs for tested conversations]

Traffic comparison answers the question metrics cannot fully answer: would you want users to receive this new output? Use it to check tone, format, tool behavior, verbosity, structured output shape, and whether the original failure is actually fixed.

Runtime tradeoffs

Prompt improvements can increase runtime cost. Before review, check:
Signal | Watch for
Cost | More expensive model paths, longer generations, or extra calls.
Input tokens | Longer instructions, examples, or context packaging.
Output tokens | More verbose answers or larger structured objects.
Latency | Slower model paths, extra tool calls, retries, or longer outputs.
For high-volume prompts, cost or latency increases should be explicit release tradeoffs, not surprises.
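One way to make these tradeoffs explicit is to compute the percent change per signal and compare it with a budget agreed before review. The metric names, per-request averages, and budget percentages in this sketch are hypothetical and not taken from Adaline.

```python
def runtime_deltas(baseline, candidate):
    """Percent change per metric; positive means the candidate costs more or is slower."""
    return {
        key: round((candidate[key] - baseline[key]) / baseline[key] * 100, 1)
        for key in baseline
    }


def flag_tradeoffs(deltas, budgets):
    """Return the metrics whose increase exceeds the allowed percentage."""
    return sorted(k for k, pct in deltas.items() if pct > budgets.get(k, 0.0))


# Hypothetical per-request averages for the baseline and candidate prompts.
baseline = {"cost_usd": 0.0040, "input_tokens": 800, "output_tokens": 200, "latency_ms": 1200}
candidate = {"cost_usd": 0.0046, "input_tokens": 920, "output_tokens": 210, "latency_ms": 1260}

# Hypothetical budgets: the maximum acceptable increase, in percent.
budgets = {"cost_usd": 10, "input_tokens": 20, "output_tokens": 10, "latency_ms": 10}

deltas = runtime_deltas(baseline, candidate)
flagged = flag_tradeoffs(deltas, budgets)
print(deltas)
print("over budget:", flagged)
```

With these numbers, only cost exceeds its budget, so it is the one increase that would need to be called out as a deliberate release tradeoff.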

Review a Cycle

Make the approval, edit, or rejection decision.

Deploy your prompt

Move from reviewed prompt version to production rollout.