Create evaluators

Create an evaluator when you can describe what good or bad output means. The strongest evaluators start from product requirements or production evidence and turn them into repeatable checks. In the current app, the project Evaluators page shows active evaluators across prompts. To create or edit an evaluator definition, open the prompt’s evaluation workflow and add the evaluator there. It appears in the project Evaluators library once it is active.

Start from a concrete failure mode

Good sources for evaluator ideas:

A Behavior marked as an issue.
A trace where the assistant failed.
A customer complaint or support ticket.
A product policy or safety rule.
A deployment regression.
A cost, latency, or token budget.
A dataset row that should never fail again.

Avoid writing evaluators from vague goals such as “better answer quality”. Translate the goal into a testable rule.
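As an illustration of that translation, the vague goal "better answer quality" might become the rule "a refund answer must state the refund window". The sketch below is a hypothetical standalone Python check, not the platform's evaluator API. The function name and the 30-day window are assumptions made only for this example.

```python
import re


def mentions_refund_window(answer: str, window_days: int = 30) -> bool:
    """Pass only if the answer states the refund window explicitly.

    Hypothetical rule: the window must appear as "<N> days" or "<N>-day".
    """
    pattern = rf"\b{window_days}[\s-]?days?\b"
    return re.search(pattern, answer, flags=re.IGNORECASE) is not None
```

A rule of this shape can be run on every dataset row and produces the same verdict each time, which a goal like "better quality" cannot.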

Create the evaluator

Open the prompt

Open the prompt whose output should be evaluated.

Open the evaluation workflow

Go to the prompt’s evaluation area.

Choose the evaluator type

Pick the type that matches the rule: judge model, deterministic code, text matching, JSON, cost, latency, response length, or API call.

Write the criterion

Enter the rubric, code, matcher, threshold, or request configuration.

Attach a dataset when needed

Link the dataset that contains examples for the evaluator.

Run and inspect results

Validate the evaluator on known passing and failing examples before using it as a release gate.

Calibrate before trusting it

Before relying on a new evaluator, run it against:

A known good answer.
A known bad answer.
An ambiguous answer.
A normal happy-path dataset row.
A hard edge-case row.

If the evaluator fails these calibration cases, fix the evaluator before fixing the prompt.
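The calibration pass above can be sketched as a small harness that runs an evaluator over human-labeled cases and lists every disagreement. This is a minimal illustration under stated assumptions: `CalibrationCase`, `calibrate`, and the `cites_policy` evaluator are hypothetical names, and the evaluator is reduced to a function from output text to pass/fail.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CalibrationCase:
    name: str
    output: str
    expected_pass: bool  # label agreed on by the team, including for ambiguous cases


def calibrate(evaluator: Callable[[str], bool],
              cases: list[CalibrationCase]) -> list[str]:
    """Return the names of cases where the evaluator disagrees with the label."""
    return [c.name for c in cases if evaluator(c.output) != c.expected_pass]


# Hypothetical evaluator: the answer must cite the policy.
def cites_policy(output: str) -> bool:
    return "per the policy" in output.lower()


cases = [
    CalibrationCase("known good", "Per the policy, refunds take 5 days.", True),
    CalibrationCase("known bad", "Refunds are instant.", False),
    CalibrationCase("ambiguous", "Refunds usually take a few days.", False),
    CalibrationCase("edge case", "PER THE POLICY: no refunds on gift cards.", True),
]
disagreements = calibrate(cites_policy, cases)
```

An empty `disagreements` list means the evaluator matches every label. Any name in the list points at an evaluator problem to fix before touching the prompt.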

Write better rubrics

For LLM-as-a-Judge and Custom Prompt evaluators:

Define the role of the judge.
State the exact pass/fail criteria.
Include what evidence the judge should inspect.
Tell the judge how to handle missing context.
Ask for a concise failure reason.
Avoid criteria that require external facts the evaluator cannot see.

Example rubric shape:

Pass if the assistant answers the user's question using only the provided policy context,
states any eligibility limits clearly, and does not invent account-specific facts.
Fail if the assistant gives a confident answer without policy support, omits a required
limit, or recommends an action that violates the policy.
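One way to keep rubrics consistent across evaluators is to assemble them from the same parts each time. The sketch below builds a rubric string from the guidelines listed above. The function name, section wording, and missing-context rule are assumptions for illustration, not a format the product requires.

```python
def build_judge_prompt(role: str, pass_if: list[str], fail_if: list[str],
                       evidence: list[str], missing_context: str) -> str:
    """Assemble a judge rubric with one section per guideline (hypothetical layout)."""
    def bullets(items: list[str]) -> str:
        return "\n".join(f"- {item}" for item in items)

    return "\n\n".join([
        f"Role: {role}",
        f"Inspect only this evidence:\n{bullets(evidence)}",
        f"Pass if ALL of the following hold:\n{bullets(pass_if)}",
        f"Fail if ANY of the following hold:\n{bullets(fail_if)}",
        f"If context is missing: {missing_context}",
        "Respond with PASS or FAIL and a one-sentence reason.",
    ])


prompt = build_judge_prompt(
    role="You are a policy compliance reviewer.",
    pass_if=[
        "answers the user's question using only the provided policy context",
        "states any eligibility limits clearly",
        "does not invent account-specific facts",
    ],
    fail_if=[
        "gives a confident answer without policy support",
        "omits a required limit",
        "recommends an action that violates the policy",
    ],
    evidence=["the user's question", "the provided policy context",
              "the assistant's answer"],
    # Assumed handling rule; choose whatever your team agrees on.
    missing_context="fail and name the missing context",
)
```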

Write better deterministic checks

For JavaScript, JSON, Text Matcher, Cost, Latency, and Response Length evaluators:

Keep the rule exact.
Prefer simple checks over clever code.
Return or configure reasons that explain the failure.
Use dataset columns for expected values when the threshold changes by row.
Test the evaluator after changing prompt variables or response schema.
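The guidelines above can be illustrated with a small deterministic check that validates a JSON response, reads its threshold from the dataset row, and always returns a reason. This is a sketch in Python under stated assumptions: the `answer` key, the `max_chars` column, and the `CheckResult` type are hypothetical, and the platform's own code evaluators may expect a different signature.

```python
import json
from dataclasses import dataclass


@dataclass
class CheckResult:
    passed: bool
    reason: str


def check_response(raw_response: str, row: dict) -> CheckResult:
    """Validate JSON shape and a per-row length limit; explain every failure."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError as exc:
        return CheckResult(False, f"response is not valid JSON: {exc.msg}")
    if not isinstance(data, dict) or "answer" not in data:
        return CheckResult(False, "missing required key 'answer'")
    answer = data["answer"]
    if not isinstance(answer, str):
        return CheckResult(False, "'answer' must be a string")
    max_chars = row["max_chars"]  # threshold comes from a dataset column
    if len(answer) > max_chars:
        return CheckResult(False, f"answer has {len(answer)} chars, limit is {max_chars}")
    return CheckResult(True, "ok")
```

Each branch tests one exact condition and names it in the reason, so a reviewer does not need to rerun the check to learn why a row failed.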

Link evaluators to datasets

Link a dataset when the evaluator needs rows to score. Dataset columns can provide:

Prompt variables.
Expected output.
Labels.
Expected tool behavior.
Thresholds or allowed values.
Scenario metadata.

The evaluator should make failures reviewable. A failing row should tell the reviewer which requirement was missed and which input caused it.
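A reviewable failure can be sketched as a record that pairs the missed requirement with the input that triggered it. The row keys (`input`, `output`, `expected_label`) and the `failing_rows` helper below are hypothetical and stand in for whatever columns the linked dataset actually has.

```python
from typing import Callable

Row = dict[str, str]


def failing_rows(rows: list[Row],
                 requirements: dict[str, Callable[[Row], bool]]) -> list[dict[str, str]]:
    """Return one record per (row, requirement) pair that failed."""
    failures = []
    for row in rows:
        for name, check in requirements.items():
            if not check(row):
                failures.append({
                    "requirement": name,
                    "input": row["input"],
                    "output": row["output"],
                })
    return failures


rows = [
    {"input": "Cancel my plan", "output": "cancel", "expected_label": "cancel"},
    {"input": "Where is my order?", "output": "billing", "expected_label": "shipping"},
]
requirements = {
    "label matches expected_label": lambda r: r["output"] == r["expected_label"],
}
failures = failing_rows(rows, requirements)
```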

Continuous evaluation

Enable continuous evaluation when the evaluator should score production traffic. Use sampling when the evaluator is expensive or when traffic volume is high. Continuous evaluation is useful for:

Monitoring policy compliance.
Tracking quality after deployment.
Measuring cost and latency budgets.
Feeding Monitor and Traces with quality signals.
Giving Improve stronger guardrails.
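To make the sampling idea concrete, the sketch below selects a fixed fraction of traces by hashing the trace ID, so the same trace always gets the same decision. It illustrates one common sampling technique only. How the product samples internally is not described here, and `should_evaluate` and its parameters are hypothetical names.

```python
import hashlib


def should_evaluate(trace_id: str, sample_rate: float) -> bool:
    """Deterministically select roughly `sample_rate` of traces by hashing the ID."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```

A lower rate suits an expensive judge-model evaluator, while cheap deterministic checks can often run at a rate of 1.0.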

How evaluators support Improve

Improve uses evaluator coverage as safety rails. A candidate that fixes one Behavior should still preserve existing evaluator performance. Strong evaluators make the review page more trustworthy because candidate quality is measured against rules the team already accepts.

Do not use a newly written, uncalibrated evaluator as the only reason to ship a high-impact prompt change. First prove that the evaluator agrees with examples your team understands.
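The guardrail idea, that a candidate should preserve existing evaluator performance, can be sketched as a comparison of per-evaluator pass rates. This is an illustrative check, not how Improve is implemented. The `regressions` function, the pass-rate dictionaries, and the tolerance are assumptions.

```python
def regressions(baseline: dict[str, float], candidate: dict[str, float],
                tolerance: float = 0.0) -> dict[str, float]:
    """Return evaluators whose pass rate dropped by more than `tolerance`."""
    drops = {}
    for name, base_rate in baseline.items():
        # An evaluator missing from the candidate run counts as a full regression.
        cand_rate = candidate.get(name, 0.0)
        drop = base_rate - cand_rate
        if drop > tolerance:
            drops[name] = round(drop, 4)
    return drops


baseline = {"policy compliance": 0.95, "valid JSON": 1.0}
candidate = {"policy compliance": 0.97, "valid JSON": 0.9}
dropped = regressions(baseline, candidate)
```

Here the candidate improves policy compliance but breaks the JSON evaluator, so `dropped` flags only `valid JSON`.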

