Evaluations connect prompts, datasets, and evaluators. They answer a practical question: “Does this prompt version satisfy the criteria we care about across the cases we care about?”
When to run evaluations
Run evaluations:
- Before deploying a prompt.
- After changing model, provider, messages, variables, schema, or tools.
- After approving an Improve candidate through Edit & approve.
- After adding new regression rows.
- After changing evaluator criteria.
- Before promoting a version from staging to production.
For high-risk workflows, make evaluations part of release review rather than a one-off debugging step.
Run an evaluation
1. Open the prompt you want to test.
2. Open Evaluate to go to the prompt’s evaluation workflow.
3. Select datasets and evaluators: choose the dataset rows and active evaluators that define the test.
4. Run the evaluation: start the run and wait for row-level results.
5. Review failures: inspect failing rows, evaluator reasons, output, cost, latency, and token usage.
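Conceptually, a run is a loop over dataset rows and evaluators that yields one result per row and evaluator. The sketch below is illustrative only: `run_evaluation`, `RowResult`, the row fields (`id`, `input`, `expected`), and the evaluator signature are assumptions made for this example, not the product's API, and `fake_generate` stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical evaluator signature: (dataset row, model output) -> (passed, reason).
Evaluator = Callable[[dict, str], Tuple[bool, str]]


@dataclass
class RowResult:
    row_id: str
    evaluator: str
    passed: bool
    reason: str
    output: str


def run_evaluation(
    generate: Callable[[str], str],
    rows: List[dict],
    evaluators: Dict[str, Evaluator],
) -> List[RowResult]:
    """Apply every evaluator to the output produced for every row."""
    results = []
    for row in rows:
        output = generate(row["input"])
        for name, evaluate in evaluators.items():
            passed, reason = evaluate(row, output)
            results.append(RowResult(row["id"], name, passed, reason, output))
    return results


def fake_generate(text: str) -> str:
    # Stand-in for a real model call (assumption for this sketch).
    return text.strip().lower()


def matches_expected(row: dict, output: str) -> Tuple[bool, str]:
    # Simple exact-match evaluator that also explains its verdict.
    if output == row["expected"]:
        return True, "output matches expected value"
    return False, f"expected {row['expected']!r}, got {output!r}"


rows = [
    {"id": "r1", "input": "  Hello ", "expected": "hello"},
    {"id": "r2", "input": "Refund", "expected": "refund policy"},
]
results = run_evaluation(fake_generate, rows, {"matches_expected": matches_expected})
failures = [r for r in results if not r.passed]
```

Keeping the evaluator's reason and the raw output on each result is what makes the failure review step possible without rerunning the prompt.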
Read evaluation results
Review results at three levels:
| Level | What to inspect |
|---|---|
| Run summary | Overall pass rate, score, cost, latency, and whether the run is good enough for release. |
| Evaluator result | Which rule failed, why it failed, and whether the evaluator itself looks reliable. |
| Dataset row | The exact input, expected behavior, actual output, and row metadata. |
Use row-level failures to decide whether the prompt, dataset, evaluator, or product expectation needs to change.
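If you export results for your own analysis, the three levels map naturally onto three aggregations. This is a minimal sketch under assumed field names (`row_id`, `evaluator`, `passed`); the `summarize` helper is hypothetical and not part of the product.

```python
from collections import defaultdict
from typing import Dict, List


def summarize(results: List[dict]) -> Dict[str, object]:
    """Aggregate row-level results into run, evaluator, and row views."""
    total = len(results)
    passed = sum(1 for r in results if r["passed"])

    # Per evaluator: [passed count, total count].
    per_evaluator = defaultdict(lambda: [0, 0])
    for r in results:
        counts = per_evaluator[r["evaluator"]]
        counts[1] += 1
        if r["passed"]:
            counts[0] += 1

    return {
        # Run summary level.
        "pass_rate": passed / total if total else 0.0,
        # Evaluator result level.
        "evaluator_pass_rates": {
            name: p / n for name, (p, n) in per_evaluator.items()
        },
        # Dataset row level: rows with at least one failing evaluator.
        "failing_rows": sorted({r["row_id"] for r in results if not r["passed"]}),
    }


sample_results = [
    {"row_id": "r1", "evaluator": "tone", "passed": True},
    {"row_id": "r1", "evaluator": "format", "passed": True},
    {"row_id": "r2", "evaluator": "tone", "passed": True},
    {"row_id": "r2", "evaluator": "format", "passed": False},
]
summary = summarize(sample_results)
```

A per-evaluator pass rate that is far lower than the others is a hint to check the evaluator itself before changing the prompt.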
Distinguish prompt failures from test failures
Not every failing evaluation means the prompt is wrong.
| Failure type | What to do |
|---|---|
| Prompt output is wrong | Fix the prompt, tool instructions, schema, or model settings. |
| Evaluator is too broad | Rewrite the rubric or split it into narrower evaluators. |
| Dataset row is stale | Update or archive the row. |
| Expected value is wrong | Fix the expected output or label. |
| Tool/backend failure | Inspect traces or backend logs before changing prompt text. |
| Provider/model issue | Compare with another model or review provider status. |
Use continuous evaluation
When configured, continuous evaluation scores production traces. Results can appear in Traces and aggregate into Monitor.
Use continuous evaluation when:
- The requirement must keep holding after deployment, not only at release time.
- Production inputs differ from test datasets.
- You need quality signals next to latency, cost, and traffic.
- Improve cycles need strong regression checks.
Sampling can reduce cost for high-volume traffic. Choose a sample rate that gives useful signal without making every request expensive.
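One common way to implement sampling, if you do it in your own code, is to hash a stable trace identifier so the same trace always gets the same decision. The sketch below assumes a string `trace_id` and a hypothetical `should_evaluate` helper; it does not describe how the product samples internally.

```python
import hashlib


def should_evaluate(trace_id: str, sample_rate: float) -> bool:
    """Deterministically select roughly `sample_rate` of traces for scoring."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to an integer bucket in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < sample_rate * 2**64


# At a 10% sample rate, about one trace in ten is scored.
sampled = sum(should_evaluate(f"trace-{i}", 0.10) for i in range(10_000))
```

Hash-based sampling keeps decisions reproducible across retries and services, which random sampling per request does not. Evaluation cost scales roughly with the sample rate, so halving the rate roughly halves the scoring spend.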
Connect evaluation failures to other workflows
After a failed evaluation:
- Add the case to a regression dataset if it represents a real risk.
- Open related production traces if the issue appears live.
- Check Behaviors to see whether the pattern is recurring.
- Start an Improve cycle if the fix belongs in prompt behavior.
- Update deployment notes if the failure blocks release.
Evaluation review checklist
Before calling an evaluation run “passing”:
- The dataset represents current product behavior.
- The evaluator criteria match the release goal.
- Failing rows are understood or fixed.
- Cost and latency are acceptable.
- Results are not driven by a single noisy evaluator.
- The prompt version being evaluated is the version you intend to deploy.
Keep a small, trusted golden dataset for release gates and a broader exploratory dataset for discovery. Mixing the two makes releases harder to reason about.
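The mechanical parts of the checklist can be expressed as a release gate over a golden-dataset run. This is a sketch under stated assumptions: the `release_blockers` helper, the run fields, and the default thresholds are all hypothetical and should be replaced with values that match your release goal. Judgment calls, such as whether a single noisy evaluator is driving the result, still need a human reviewer.

```python
from typing import List


def release_blockers(
    run: dict,
    intended_version: str,
    min_pass_rate: float = 0.95,      # assumed threshold
    max_cost_usd: float = 1.00,       # assumed threshold
    max_p95_latency_s: float = 5.0,   # assumed threshold
) -> List[str]:
    """Return the reasons a run should not be called passing (empty if none)."""
    blockers = []
    if run["prompt_version"] != intended_version:
        blockers.append("evaluated version is not the version to deploy")
    if run["pass_rate"] < min_pass_rate:
        blockers.append("pass rate is below the release threshold")
    if run["unexplained_failures"] > 0:
        blockers.append("failing rows are not yet understood or fixed")
    if run["cost_usd"] > max_cost_usd:
        blockers.append("cost exceeds the accepted budget")
    if run["p95_latency_s"] > max_p95_latency_s:
        blockers.append("latency exceeds the accepted limit")
    return blockers


golden_run = {
    "prompt_version": "v12",
    "pass_rate": 0.97,
    "unexplained_failures": 1,
    "cost_usd": 0.40,
    "p95_latency_s": 3.2,
}
blockers = release_blockers(golden_run, intended_version="v12")
```

Returning a list of reasons rather than a single boolean makes the gate's decision easy to paste into deployment notes.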