
What can change
| Prompt area | Example change | Review concern |
|---|---|---|
| Instructions | Add or clarify a constraint. | Avoid broad rules that affect unrelated traffic. |
| Examples | Add a demonstration of the desired behavior. | Avoid overfitting one customer or trace. |
| Variables | Clarify how runtime inputs should be used. | Confirm variable mapping still works. |
| Model settings | Adjust supported generation settings. | Check cost, latency, determinism, and output length. |
| Response schema | Tighten structured output requirements. | Confirm downstream consumers still accept the output. |
| Tool guidance | Clarify when to call a tool and what arguments matter. | Fix broken tools or backends outside the prompt. |
Candidate exploration
The Prompts stage summarizes the search.

| Signal | Meaning |
|---|---|
| Variants explored | Prompt candidates generated for the run. |
| Passed safety gate | Candidates that did not regress protected checks. |
| Failed safety gate | Candidates blocked by constraints or evaluator regressions. |
| Strong contenders | Candidates with positive evidence after scoring. |
| Selected candidate | The candidate packaged for review. |
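
One way to picture how these signals relate is the sketch below, which partitions candidates by a safety gate and then picks the top scorer among those that pass. The `Candidate` fields, the `min_gain` threshold, and the selection rule are all hypothetical illustrations, not the product's actual logic.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    # Hypothetical: score change on each protected check vs. the baseline.
    protected_deltas: dict[str, float]
    # Hypothetical: score change on the target issue vs. the baseline.
    target_gain: float


def summarize(candidates: list[Candidate], min_gain: float = 0.0) -> dict:
    """Group candidates under the signal names used in the table above."""
    # A candidate passes the gate only if no protected check moved down.
    passed = [
        c for c in candidates
        if all(delta >= 0 for delta in c.protected_deltas.values())
    ]
    failed = [c for c in candidates if c not in passed]
    # Assumed rule: a contender passed the gate and improved the target.
    contenders = [c for c in passed if c.target_gain > min_gain]
    selected = max(contenders, key=lambda c: c.target_gain, default=None)
    return {
        "variants_explored": len(candidates),
        "passed_safety_gate": [c.name for c in passed],
        "failed_safety_gate": [c.name for c in failed],
        "strong_contenders": [c.name for c in contenders],
        "selected_candidate": selected.name if selected else None,
    }
```

Note that in this sketch a candidate with a large target gain is still excluded if any protected check regresses, which mirrors the idea that the gate is applied before scoring decides the winner.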
Read the diff first
The diff is the source of truth for what will change. Use the diagnosis and scores to understand why the change exists, but use the diff to decide whether the change is acceptable.

| Diff pattern | Usually good | Usually risky |
|---|---|---|
| Narrow constraint | Matches the failing behavior. | Changes all outputs broadly. |
| Tool-use clarification | Explains when and how to call a tool. | Hides a bad tool contract. |
| Added example | Covers the failure and a healthy path. | Encodes private or one-off context. |
| Output format change | Matches downstream requirements. | Breaks existing consumers. |
| Generation setting change | Improves reliability or consistency. | Moves cost, latency, or output quality without coverage. |
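
If you want to inspect a prompt change outside the review UI, a unified diff between the two prompt versions is usually enough. The sketch below uses Python's standard `difflib`; the prompt text and the `prompt_diff` helper are made up for illustration.

```python
import difflib


def prompt_diff(old: str, new: str) -> list[str]:
    """Return unified-diff lines between two prompt versions."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="baseline",
            tofile="candidate",
            lineterm="",
        )
    )


# Hypothetical prompt versions: the candidate adds one narrow tool-use rule.
baseline = "You are a support assistant.\nAnswer briefly."
candidate = (
    "You are a support assistant.\n"
    "Answer briefly.\n"
    "Call the lookup tool only when an order ID is present."
)

for line in prompt_diff(baseline, candidate):
    print(line)
```

A diff with a single added line like this one is the "narrow constraint" pattern from the table; a diff that rewrites most of the prompt deserves a closer look.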
Check regression evidence
Watch for:

- Protected evaluators moving down.
- Blank baseline or aggregate cells, which usually mean weak comparable scoring coverage.
- New generated checks that need review before becoming hard gates.
- Healthy dataset rows failing after the candidate improves the target issue.
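
The checks in this list can be approximated with a small script when you have the scores exported. The sketch below assumes a hypothetical layout: evaluator scores as `(baseline, candidate)` pairs with `None` for a blank cell, and per-row pass/fail results keyed by row ID. None of these names come from the product.

```python
from typing import Optional


def regression_flags(
    scores: dict[str, tuple[Optional[float], Optional[float]]],
    protected: set[str],
) -> dict[str, list[str]]:
    """Flag protected evaluators that moved down and evaluators with blank cells.

    `scores` maps evaluator name -> (baseline, candidate); None means blank.
    """
    moved_down, blank = [], []
    for name, (base, cand) in sorted(scores.items()):
        if base is None or cand is None:
            # No comparable score, so there is no evidence either way.
            blank.append(name)
        elif name in protected and cand < base:
            moved_down.append(name)
    return {"protected_moved_down": moved_down, "blank_cells": blank}


def newly_failing_rows(
    baseline_pass: dict[str, bool], candidate_pass: dict[str, bool]
) -> list[str]:
    """Rows that passed on the baseline but fail with the candidate.

    Rows missing from `candidate_pass` are skipped rather than counted as failures.
    """
    return sorted(
        row for row, ok in baseline_pass.items()
        if ok and not candidate_pass.get(row, True)
    )
```

An empty result from both functions is not proof of safety; it only means these particular signals did not fire on the data you supplied.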
Inspect traffic comparison

Runtime tradeoffs
Prompt improvements can increase runtime cost. Before review, check:

| Signal | Watch for |
|---|---|
| Cost | More expensive model paths, longer generations, or extra calls. |
| Input tokens | Longer instructions, examples, or context packaging. |
| Output tokens | More verbose answers or larger structured objects. |
| Latency | Slower model paths, extra tool calls, retries, or longer outputs. |
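
To make these tradeoffs concrete, you can compute the relative change in each runtime signal between the baseline and the candidate. The sketch below is a minimal illustration; the metric names, the sample numbers, and the 10% budget are assumptions, not product defaults.

```python
def runtime_deltas(
    baseline: dict[str, float], candidate: dict[str, float]
) -> dict[str, float]:
    """Relative change per metric; 0.25 means 25% higher than the baseline."""
    deltas = {}
    for metric, base in baseline.items():
        # Skip metrics that are missing or have no meaningful baseline.
        if metric in candidate and base > 0:
            deltas[metric] = (candidate[metric] - base) / base
    return deltas


def over_budget(deltas: dict[str, float], max_increase: float = 0.10) -> list[str]:
    """Metrics whose relative increase exceeds the (hypothetical) budget."""
    return sorted(m for m, d in deltas.items() if d > max_increase)


# Hypothetical per-request averages for the two prompt versions.
baseline = {"cost_usd": 0.010, "input_tokens": 800, "output_tokens": 200, "latency_ms": 1000}
candidate = {"cost_usd": 0.010, "input_tokens": 1000, "output_tokens": 200, "latency_ms": 1050}

print(over_budget(runtime_deltas(baseline, candidate)))
```

Here only the input token count grows past the budget, which is the kind of change a longer instruction block or an added example would produce.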
Review a Cycle
Make the approval, edit, or rejection decision.
Deploy your prompt
Move from reviewed prompt version to production rollout.