Run and monitor evaluations

Evaluations connect prompts, datasets, and evaluators. They answer a practical question: “Does this prompt version satisfy the criteria we care about across the cases we care about?”

When to run evaluations

Run evaluations:

Before deploying a prompt.
After changing model, provider, messages, variables, schema, or tools.
After approving an Improve candidate through Edit & approve.
After adding new regression rows.
After changing evaluator criteria.
Before promoting a version from staging to production.

For high-risk workflows, make evaluations part of release review rather than a one-off debugging step.

Run an evaluation

Open the prompt

Open the prompt you want to test.

Open Evaluate

Go to the prompt’s evaluation workflow.

Select datasets and evaluators

Choose the dataset rows and active evaluators that define the test.

Run the evaluation

Start the run and wait for row-level results.

Review failures

Inspect failing rows, evaluator reasons, output, cost, latency, and token usage.

Read evaluation results

Review results at three levels:

Level	What to inspect
Run summary	Overall pass rate, score, cost, latency, and whether the run is good enough for release.
Evaluator result	Which rule failed, why it failed, and whether the evaluator itself looks reliable.
Dataset row	The exact input, expected behavior, actual output, and row metadata.

Use row-level failures to decide whether the prompt, dataset, evaluator, or product expectation needs to change.
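The three review levels can be pictured as successive roll-ups of the same row-level verdicts. The following is a minimal sketch, not the product's data model: the `RowResult` shape, the `summarize` helper, and the rule that a row fails when any evaluator fails on it are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RowResult:
    """One evaluator verdict for one dataset row (hypothetical shape)."""
    row_id: str
    evaluator: str
    passed: bool
    reason: str = ""


def summarize(results):
    """Roll row-level verdicts up to evaluator-level and run-level views."""
    by_evaluator = defaultdict(list)
    by_row = defaultdict(list)
    for r in results:
        by_evaluator[r.evaluator].append(r)
        by_row[r.row_id].append(r)

    # Evaluator level: pass rate per evaluator.
    evaluator_pass_rate = {
        name: sum(r.passed for r in rs) / len(rs)
        for name, rs in by_evaluator.items()
    }
    # Row level: keep only rows where at least one evaluator failed,
    # along with which evaluator failed and why.
    failing_rows = {
        row_id: [(r.evaluator, r.reason) for r in rs if not r.passed]
        for row_id, rs in by_row.items()
        if any(not r.passed for r in rs)
    }
    # Run level: share of rows on which every evaluator passed (assumed rule).
    total_rows = len(by_row)
    run_pass_rate = (total_rows - len(failing_rows)) / total_rows if total_rows else 0.0
    return {
        "run_pass_rate": run_pass_rate,
        "evaluator_pass_rate": evaluator_pass_rate,
        "failing_rows": failing_rows,
    }


results = [
    RowResult("row-1", "tone", True),
    RowResult("row-1", "schema", True),
    RowResult("row-2", "tone", False, "too informal"),
    RowResult("row-2", "schema", True),
]
summary = summarize(results)
print(summary["run_pass_rate"])        # 0.5
print(summary["evaluator_pass_rate"])  # {'tone': 0.5, 'schema': 1.0}
print(summary["failing_rows"])         # {'row-2': [('tone', 'too informal')]}
```

Reading the output from the top down mirrors the table: the run summary says half the rows fail, the evaluator view points at `tone`, and the row view gives the reason to inspect.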

Distinguish prompt failures from test failures

Not every failing evaluation means the prompt is wrong.

Failure type	What to do
Prompt output is wrong	Fix the prompt, tool instructions, schema, or model settings.
Evaluator is too broad	Rewrite the rubric or split it into narrower evaluators.
Dataset row is stale	Update or archive the row.
Expected value is wrong	Fix the expected output or label.
Tool/backend failure	Inspect traces or backend logs before changing prompt text.
Provider/model issue	Compare with another model or review provider status.

Use continuous evaluation

When configured, continuous evaluation scores production traces. Results can appear in Traces and aggregate into Monitor. Use continuous evaluation when:

The requirement is important after deployment.
Production inputs differ from test datasets.
You need quality signals next to latency, cost, and traffic.
You want Improve to have strong regression checks.

Sampling can reduce cost for high-volume traffic. Choose a sample rate that gives useful signal without paying to evaluate every request.
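To make the sampling trade-off concrete, here is one possible way to decide which traces get scored. It is a sketch under stated assumptions, not how the product samples: the `should_evaluate` function and the `trace-N` identifiers are hypothetical. Hashing the trace ID, rather than drawing a random number, keeps the decision stable if the same trace is seen twice.

```python
import hashlib


def should_evaluate(trace_id: str, sample_rate: float) -> bool:
    """Return True for roughly `sample_rate` of trace IDs, deterministically."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < sample_rate


# Hypothetical traffic: 10,000 traces sampled at 10%.
trace_ids = [f"trace-{i}" for i in range(10_000)]
sampled = sum(should_evaluate(t, 0.10) for t in trace_ids)
print(f"evaluated {sampled} of {len(trace_ids)} traces")
```

At a 10% rate, evaluator cost drops to about a tenth of the full-traffic cost, while the sample is still large enough on high-volume prompts to show a shift in pass rate.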

Connect evaluation failures to other workflows

After a failed evaluation:

Add the case to a regression dataset if it represents a real risk.
Open related production traces if the issue appears live.
Check Behaviors to see whether the pattern is recurring.
Start an Improve cycle if the fix belongs in prompt behavior.
Update deployment notes if the failure blocks release.

Evaluation review checklist

Before calling an evaluation run “passing”, confirm that:

The dataset represents current product behavior.
The evaluator criteria match the release goal.
Failing rows are understood or fixed.
Cost and latency are acceptable.
Results are not driven by a single noisy evaluator.
The prompt version being evaluated is the version you intend to deploy.

Keep a small, trusted golden dataset for release gates and a broader exploratory dataset for discovery. Mixing the two makes releases harder to reason about.
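The mechanical parts of the checklist can be expressed as a release gate over the golden dataset run. The sketch below is illustrative only: the `RunSummary` fields, the `release_blockers` helper, and every threshold are assumptions, and should be replaced with values that match your own release goal. Whether the dataset and evaluator criteria are still current remains a human judgment that code like this does not cover.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    """Hypothetical summary of one evaluation run on the golden dataset."""
    prompt_version: str
    pass_rate: float             # share of rows passing, 0..1
    evaluator_pass_rate: dict    # evaluator name -> pass rate
    unexplained_failures: int    # failing rows nobody has triaged yet
    avg_cost_usd: float
    p95_latency_ms: float


def release_blockers(run, intended_version, min_pass_rate=0.95,
                     max_cost_usd=0.05, max_latency_ms=3000):
    """Return the reasons this run should not count as passing (empty if none)."""
    blockers = []
    if run.prompt_version != intended_version:
        blockers.append("evaluated version is not the one being deployed")
    if run.pass_rate < min_pass_rate:
        blockers.append("pass rate below threshold")
    if run.unexplained_failures > 0:
        blockers.append("failing rows not yet understood")
    if run.avg_cost_usd > max_cost_usd:
        blockers.append("cost too high")
    if run.p95_latency_ms > max_latency_ms:
        blockers.append("latency too high")
    # Flag runs where exactly one evaluator is below threshold: it may be
    # noisy and worth reviewing before trusting the overall result.
    failing = [n for n, rate in run.evaluator_pass_rate.items() if rate < min_pass_rate]
    if len(failing) == 1 and len(run.evaluator_pass_rate) > 1:
        blockers.append(f"failures concentrated in one evaluator: {failing[0]}")
    return blockers


good = RunSummary("v7", 0.97, {"tone": 0.97, "schema": 1.0}, 0, 0.01, 1200)
print(release_blockers(good, intended_version="v7"))  # []

stale = RunSummary("v6", 0.97, {"tone": 0.97, "schema": 1.0}, 0, 0.01, 1200)
print(release_blockers(stale, intended_version="v7"))
# ['evaluated version is not the one being deployed']
```

Returning a list of reasons, rather than a single boolean, keeps the release review focused on what specifically blocks the version.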
