Evaluations connect prompts, datasets, and evaluators. They answer a practical question: “Does this prompt version satisfy the criteria we care about across the cases we care about?”

When to run evaluations

Run evaluations:
  • Before deploying a prompt.
  • After changing model, provider, messages, variables, schema, or tools.
  • After approving an Improve candidate through Edit & approve.
  • After adding new regression rows.
  • After changing evaluator criteria.
  • Before promoting a version from staging to production.
For high-risk workflows, make evaluations part of release review rather than a one-off debugging step.

Run an evaluation

1. Open the prompt
   Open the prompt you want to test.

2. Open Evaluate
   Go to the prompt’s evaluation workflow.

3. Select datasets and evaluators
   Choose the dataset rows and active evaluators that define the test.

4. Run the evaluation
   Start the run and wait for row-level results.

5. Review failures
   Inspect failing rows, evaluator reasons, output, cost, latency, and token usage.
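
Conceptually, the steps above run every selected evaluator against every selected dataset row and keep one result per pair. The following is a minimal sketch of that loop, not the product's API: `RowResult`, `contains_expected`, `run_evaluation`, and the row fields `id`, `input`, and `expected` are hypothetical names chosen for illustration, and `fake_prompt` stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class RowResult:
    row_id: str
    evaluator: str
    passed: bool
    reason: str
    output: str


# Hypothetical evaluator signature: (actual_output, expected) -> (passed, reason).
Evaluator = Callable[[str, str], Tuple[bool, str]]


def contains_expected(output: str, expected: str) -> Tuple[bool, str]:
    """Toy evaluator: pass when the expected text appears in the output."""
    if expected.lower() in output.lower():
        return True, "expected text found"
    return False, f"output does not mention {expected!r}"


def run_evaluation(
    prompt_fn: Callable[[str], str],
    rows: List[dict],
    evaluators: Dict[str, Evaluator],
) -> List[RowResult]:
    """Run every evaluator against every dataset row and keep row-level results."""
    results = []
    for row in rows:
        output = prompt_fn(row["input"])  # stand-in for the real model call
        for name, evaluate in evaluators.items():
            passed, reason = evaluate(output, row["expected"])
            results.append(RowResult(row["id"], name, passed, reason, output))
    return results


# Illustrative data only.
rows = [
    {"id": "r1", "input": "capital of France", "expected": "Paris"},
    {"id": "r2", "input": "capital of Japan", "expected": "Tokyo"},
]


def fake_prompt(text: str) -> str:
    return "Paris is the capital." if "France" in text else "I am not sure."


results = run_evaluation(fake_prompt, rows, {"contains_expected": contains_expected})
failures = [r for r in results if not r.passed]
for failure in failures:
    print(failure.row_id, failure.evaluator, failure.reason)
```

Keeping the evaluator's reason and the raw output on each result is what makes the review step possible: a failing row can be read without rerunning anything.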

Read evaluation results

Review results at three levels:
Level | What to inspect
Run summary | Overall pass rate, score, cost, latency, and whether the run is good enough for release.
Evaluator result | Which rule failed, why it failed, and whether the evaluator itself looks reliable.
Dataset row | The exact input, expected behavior, actual output, and row metadata.
Use row-level failures to decide whether the prompt, dataset, evaluator, or product expectation needs to change.
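
The three levels can all be derived from the same row-level records. The sketch below assumes each result is a dictionary with hypothetical keys `row_id`, `evaluator`, and `passed`; an exported run may be shaped differently, so treat the field names as placeholders.

```python
from collections import defaultdict


def summarize(results):
    """Roll row-level results up into run, evaluator, and row views."""
    run = {"total": 0, "passed": 0}
    per_evaluator = defaultdict(lambda: {"total": 0, "passed": 0})
    failing_rows = defaultdict(list)

    for r in results:
        # Count the result once for the run and once for its evaluator.
        for bucket in (run, per_evaluator[r["evaluator"]]):
            bucket["total"] += 1
            bucket["passed"] += int(r["passed"])
        if not r["passed"]:
            failing_rows[r["row_id"]].append(r["evaluator"])

    return {
        "pass_rate": run["passed"] / run["total"] if run["total"] else 0.0,
        "evaluator_pass_rate": {
            name: c["passed"] / c["total"] for name, c in per_evaluator.items()
        },
        "failing_rows": dict(failing_rows),
    }


# Illustrative records only.
sample = [
    {"row_id": "r1", "evaluator": "tone", "passed": True},
    {"row_id": "r1", "evaluator": "accuracy", "passed": True},
    {"row_id": "r2", "evaluator": "tone", "passed": True},
    {"row_id": "r2", "evaluator": "accuracy", "passed": False},
]
summary = summarize(sample)
print(summary)
```

An evaluator whose pass rate is far below the others is worth reading first: it may point at a real prompt problem, or at an evaluator that is itself unreliable.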

Distinguish prompt failures from test failures

Not every failing evaluation means the prompt is wrong.
Failure type | What to do
Prompt output is wrong | Fix the prompt, tool instructions, schema, or model settings.
Evaluator is too broad | Rewrite the rubric or split it into narrower evaluators.
Dataset row is stale | Update or archive the row.
Expected value is wrong | Fix the expected output or label.
Tool/backend failure | Inspect traces or backend logs before changing prompt text.
Provider/model issue | Compare with another model or review provider status.

Use continuous evaluation

Continuous evaluation scores production traces when configured. Results can appear in Traces and aggregate into Monitor. Use continuous evaluation when:
  • The requirement is important after deployment.
  • Production inputs differ from test datasets.
  • You need quality signals next to latency, cost, and traffic.
  • You want Improve to have strong regression checks.
Sampling can reduce cost for high-volume traffic. Choose a sample rate that gives useful signal without making every request expensive.
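
One common way to implement a sample rate is to hash a stable trace identifier and evaluate only traces whose hash falls below the rate. This is a generic sketch of that technique, not a description of how the product samples; `should_evaluate` and the `trace-N` identifiers are hypothetical.

```python
import hashlib


def should_evaluate(trace_id: str, sample_rate: float) -> bool:
    """Deterministically select roughly `sample_rate` of all traces."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Compare in integer space so a rate of 1.0 always selects the trace.
    return bucket < int(sample_rate * 2**64)


# Illustrative check: about 10% of 10,000 made-up trace IDs are selected.
selected = sum(should_evaluate(f"trace-{i}", 0.10) for i in range(10_000))
print(f"evaluated {selected} of 10000 traces")
```

Hashing the identifier instead of drawing a random number means the same trace always gets the same decision, which keeps reruns and backfills consistent.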

Connect evaluation failures to other workflows

After a failed evaluation:
  • Add the case to a regression dataset if it represents a real risk.
  • Open related production traces if the issue appears live.
  • Check Behaviors to see whether the pattern is recurring.
  • Start an Improve cycle if the fix belongs in prompt behavior.
  • Update deployment notes if the failure blocks release.

Evaluation review checklist

Before calling an evaluation run “passing”, confirm that:
  • The dataset represents current product behavior.
  • The evaluator criteria match the release goal.
  • Failing rows are understood or fixed.
  • Cost and latency are acceptable.
  • Results are not driven by a single noisy evaluator.
  • The prompt version being evaluated is the version you intend to deploy.
Keep a small, trusted golden dataset for release gates and a broader exploratory dataset for discovery. Mixing the two makes releases harder to reason about.
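
The mechanical parts of this checklist can be expressed as a release gate over the golden dataset run. The sketch below is one possible shape under stated assumptions: the `run` dictionary, its keys, and the default thresholds are all hypothetical, and the judgment items (whether the dataset and evaluator criteria still match the release goal) remain a human review.

```python
def release_blockers(
    run,
    intended_version,
    min_pass_rate=1.0,     # hypothetical threshold for the golden dataset
    max_cost_usd=1.00,     # hypothetical budget for the whole run
    max_latency_ms=2000,   # hypothetical latency ceiling
):
    """Return the reasons a run should not count as passing; empty means clear."""
    blockers = []
    if run["prompt_version"] != intended_version:
        blockers.append("evaluated version is not the version to deploy")
    if run["pass_rate"] < min_pass_rate:
        blockers.append("pass rate is below the release threshold")
    if run["unexplained_failures"]:
        blockers.append("some failing rows are not yet understood")
    if run["cost_usd"] > max_cost_usd:
        blockers.append("cost exceeds budget")
    if run["latency_ms"] > max_latency_ms:
        blockers.append("latency exceeds ceiling")
    return blockers


# Illustrative run summary only.
golden_run = {
    "prompt_version": "v12",
    "pass_rate": 0.96,
    "unexplained_failures": ["r17"],
    "cost_usd": 0.40,
    "latency_ms": 1500,
}
for reason in release_blockers(golden_run, intended_version="v12"):
    print("BLOCKED:", reason)
```

Returning a list of reasons instead of a single boolean keeps the release review readable: each blocker maps back to one checklist item.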