Auto Generated Evaluators

Auto Generated Evaluators are production-derived checks created from logs, Behaviors, feedback, and representative examples. They help an Improve cycle score candidates against the issue that actually appeared in production. Use them to cover newly discovered failures quickly. Review them before treating them as durable release gates.

Human-authored vs auto generated

Source	What it is for	Treat it as
Human-authored evaluator	Known product policy, safety, quality, format, or regression requirements.	Stable release coverage when calibrated.
Auto Generated Evaluator	A newly observed production pattern that needs fast scoring coverage.	Draft or supporting evidence until reviewed.

Generated evaluators should still have readable criteria, representative examples, and pass/fail behavior that matches human judgment.

Where they appear

Auto Generated Evaluators appear in the Evals stage of an Improve cycle.

State	Meaning	Reviewer action
Covered	Generated evaluators are available for candidate scoring.	Review criteria and examples before trusting the score.
Started	Generation is running from selected evidence.	Wait for completion or continue with authored coverage if urgent.
Awaiting review	A generated evaluator needs human approval before it becomes trusted coverage.	Publish it only if it captures the failure correctly.
Unavailable	There is not enough suitable evidence.	Add logs, Behavior evidence, or a manual evaluator.
Failed	The pipeline could not produce usable coverage.	Improve evidence quality or write the evaluator manually.
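
The state table above can be read as a lookup from state to reviewer action. The following is a minimal sketch in Python under stated assumptions: the names `EvaluatorState`, `REVIEWER_ACTIONS`, `reviewer_action`, and `is_trusted_coverage` are hypothetical and are not part of any product API. Only the state labels and action text come from the table. The rule that a Covered evaluator becomes trusted only after human review is one interpretation of the table.

```python
from enum import Enum


class EvaluatorState(Enum):
    """Hypothetical enum mirroring the state labels in the table above."""
    COVERED = "Covered"
    STARTED = "Started"
    AWAITING_REVIEW = "Awaiting review"
    UNAVAILABLE = "Unavailable"
    FAILED = "Failed"


# Action text is copied from the table; the mapping structure is illustrative.
REVIEWER_ACTIONS = {
    EvaluatorState.COVERED: "Review criteria and examples before trusting the score.",
    EvaluatorState.STARTED: "Wait for completion or continue with authored coverage if urgent.",
    EvaluatorState.AWAITING_REVIEW: "Publish it only if it captures the failure correctly.",
    EvaluatorState.UNAVAILABLE: "Add logs, Behavior evidence, or a manual evaluator.",
    EvaluatorState.FAILED: "Improve evidence quality or write the evaluator manually.",
}


def reviewer_action(state_label: str) -> str:
    """Return the suggested reviewer action for a state label."""
    return REVIEWER_ACTIONS[EvaluatorState(state_label)]


def is_trusted_coverage(state_label: str, human_reviewed: bool) -> bool:
    """Assumption: only a Covered evaluator that a human has reviewed is trusted."""
    return EvaluatorState(state_label) is EvaluatorState.COVERED and human_reviewed
```

An unknown label raises `ValueError` from the enum lookup, which is usually preferable to silently treating an unrecognised state as trusted.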

Review checklist

Before relying on an auto generated evaluator, make sure it names the user-visible failure, passes healthy examples, catches the bad examples it was created for, and does not block unrelated behavior. It should be readable, privacy-safe, grounded in enough examples to explain its decisions, and close to how a human reviewer would judge representative cases. If the idea is useful but the criteria are too broad, tighten it manually before making it durable coverage.
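
Several items in the checklist above can be measured against a small labelled set. The sketch below is illustrative only: `Example`, `review_evaluator`, `toy_evaluator`, and the sample data are all hypothetical, and the evaluator is assumed to be a plain function that returns `True` for a pass. It reports how often the evaluator passes healthy examples, catches the bad examples, leaves unrelated behavior alone, and agrees with the human verdict.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Example:
    text: str
    human_pass: bool  # the human reviewer's verdict for this output
    kind: str         # "healthy", "bad", or "unrelated"


def review_evaluator(evaluate: Callable[[str], bool], examples: List[Example]) -> dict:
    """Summarise evaluator behavior on labelled examples (True means pass)."""

    def rate(kind: str, expected: bool) -> Optional[float]:
        subset = [e for e in examples if e.kind == kind]
        if not subset:
            return None  # no evidence for this checklist item
        return sum(evaluate(e.text) == expected for e in subset) / len(subset)

    agreement = (
        sum(evaluate(e.text) == e.human_pass for e in examples) / len(examples)
        if examples
        else None
    )
    return {
        "healthy_pass_rate": rate("healthy", True),
        "bad_catch_rate": rate("bad", False),
        "unrelated_pass_rate": rate("unrelated", True),
        "human_agreement": agreement,
    }


def toy_evaluator(output: str) -> bool:
    """Toy stand-in for a generated check: fail outputs that refuse to help."""
    return "i cannot help" not in output.lower()


EXAMPLES = [
    Example("Your order ships on Monday.", human_pass=True, kind="healthy"),
    Example("I cannot help with that request.", human_pass=False, kind="bad"),
    Example("Here is the weather forecast.", human_pass=True, kind="unrelated"),
]

report = review_evaluator(toy_evaluator, EXAMPLES)
```

A low `unrelated_pass_rate` suggests the criteria are too broad, which is the case the checklist recommends tightening manually.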

When to write one yourself

Use a human-authored evaluator when the requirement is explicit, high-risk, shared across prompts, or ambiguous enough that automation should not define “good” by itself. Examples: compliance policy, medical or financial safety, structured output format, brand tone, tool correctness, or a high-volume regression the team already understands.

After the cycle

Good generated evaluators should become part of the system’s memory:

Keep the checks that explain why the approved candidate worked.
Rewrite noisy criteria before making them a release gate.
Link important evaluators to the prompt and relevant datasets.
Ignore or delete generated checks that overfit a single cluster.
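
The list above can be sketched as a small triage function. This is a hypothetical illustration, not product behavior: `GeneratedCheck`, its fields, and `triage` are invented for the example, the flags are assumed to come from a human reviewer, and the order in which the rules are applied is an assumption.

```python
from dataclasses import dataclass


@dataclass
class GeneratedCheck:
    name: str
    explains_approved_candidate: bool  # helps explain why the approved candidate worked
    noisy_criteria: bool               # reviewer judged the criteria noisy
    overfits_single_cluster: bool      # reviewer judged it specific to one cluster


def triage(check: GeneratedCheck) -> str:
    """Map reviewer judgments to one of the follow-up actions in the list above."""
    if check.overfits_single_cluster:
        return "ignore or delete"
    if check.noisy_criteria:
        return "rewrite before making it a release gate"
    if check.explains_approved_candidate:
        return "keep and link to the prompt and relevant datasets"
    return "ignore or delete"
```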

The goal is compounding coverage: each repeated production issue should leave behind a better evaluator, dataset row, or review note. When a cycle is approved, or approved with edits, the Auto Generated Evaluators from that cycle are saved in your project so they can be reused in evaluations, Playground runs, and continuous evaluations.
