Auto Generated Evaluators are production-derived checks created from logs, Behaviors, feedback, and representative examples. They help an Improve cycle score candidates against the issue that actually appeared in production. Use them to cover newly discovered failures quickly. Review them before treating them as durable release gates.

[Figure: Improve regression report showing authored and auto generated evaluators scored against the baseline and candidate.]

Human-authored vs auto generated

| Source | What it is for | Treat it as |
| --- | --- | --- |
| Human-authored evaluator | Known product policy, safety, quality, format, or regression requirements. | Stable release coverage when calibrated. |
| Auto Generated Evaluator | A newly observed production pattern that needs fast scoring coverage. | Draft or supporting evidence until reviewed. |

Generated evaluators should still have readable criteria, representative examples, and pass/fail behavior that matches human judgment.

Where they appear

Auto Generated Evaluators appear in the Evals stage of an Improve cycle.

| State | Meaning | Reviewer action |
| --- | --- | --- |
| Covered | Generated evaluators are available for candidate scoring. | Review criteria and examples before trusting the score. |
| Started | Generation is running from selected evidence. | Wait for completion or continue with authored coverage if urgent. |
| Awaiting review | A generated evaluator needs human approval before it becomes trusted coverage. | Publish it only if it captures the failure correctly. |
| Unavailable | There is not enough suitable evidence. | Add logs, Behavior evidence, or a manual evaluator. |
| Failed | The pipeline could not produce usable coverage. | Improve evidence quality or write the evaluator manually. |
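
If you track these states in your own tooling, the state-to-action mapping can be sketched as a small lookup. This is an illustrative sketch only: `CoverageState`, `REVIEWER_ACTION`, `next_step`, and `needs_manual_fallback` are hypothetical names, not part of any product API, and the state labels are assumed to match the text shown in the Evals stage.

```python
from enum import Enum


class CoverageState(Enum):
    """Hypothetical enum; the labels mirror the states described above."""
    COVERED = "Covered"
    STARTED = "Started"
    AWAITING_REVIEW = "Awaiting review"
    UNAVAILABLE = "Unavailable"
    FAILED = "Failed"


# Reviewer actions copied from the state descriptions above.
REVIEWER_ACTION = {
    CoverageState.COVERED: "Review criteria and examples before trusting the score.",
    CoverageState.STARTED: "Wait for completion or continue with authored coverage if urgent.",
    CoverageState.AWAITING_REVIEW: "Publish it only if it captures the failure correctly.",
    CoverageState.UNAVAILABLE: "Add logs, Behavior evidence, or a manual evaluator.",
    CoverageState.FAILED: "Improve evidence quality or write the evaluator manually.",
}

# States in which no generated coverage exists, so a manual evaluator is an option.
_MANUAL_FALLBACK = {CoverageState.UNAVAILABLE, CoverageState.FAILED}


def next_step(state_label: str) -> str:
    """Return the reviewer action for a state label; raises ValueError if unknown."""
    return REVIEWER_ACTION[CoverageState(state_label)]


def needs_manual_fallback(state_label: str) -> bool:
    """True when the state means generated coverage could not be produced."""
    return CoverageState(state_label) in _MANUAL_FALLBACK
```

Raising on an unknown label is deliberate: a state this sketch does not recognize should be looked at by a person rather than silently mapped to a default action.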

Review checklist

Before relying on an auto generated evaluator, make sure it names the user-visible failure, passes healthy examples, catches the bad examples it was created for, and does not block unrelated behavior. It should be readable, privacy-safe, grounded in enough examples to explain its decisions, and close to how a human reviewer would judge representative cases. If the idea is useful but the criteria are too broad, tighten it manually before making it durable coverage.
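
Part of this checklist can be rehearsed offline before you publish a generated evaluator. The sketch below assumes you can express the evaluator as a plain function that returns `True` when an output passes, and that you have three small example sets on hand. The names `review`, `ReviewResult`, and `no_refund_promise` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Assumption: an evaluator is any function that returns True when an output passes.
Evaluator = Callable[[str], bool]


@dataclass
class ReviewResult:
    healthy_pass_rate: float    # share of healthy examples that pass
    bad_catch_rate: float       # share of known-bad examples that fail
    unrelated_pass_rate: float  # share of unrelated examples left unblocked

    def ready(self, threshold: float = 1.0) -> bool:
        """True when every rate meets the threshold."""
        rates = (self.healthy_pass_rate, self.bad_catch_rate, self.unrelated_pass_rate)
        return min(rates) >= threshold


def _rate(flags: Sequence[bool]) -> float:
    # An empty example set counts as unverified, so it scores 0.0.
    return sum(flags) / len(flags) if flags else 0.0


def review(
    evaluator: Evaluator,
    healthy: Sequence[str],
    bad: Sequence[str],
    unrelated: Sequence[str],
) -> ReviewResult:
    """Score an evaluator against the three example sets from the checklist."""
    return ReviewResult(
        healthy_pass_rate=_rate([evaluator(x) for x in healthy]),
        bad_catch_rate=_rate([not evaluator(x) for x in bad]),
        unrelated_pass_rate=_rate([evaluator(x) for x in unrelated]),
    )


# Toy evaluator for illustration: fail any output that promises a refund outright.
def no_refund_promise(output: str) -> bool:
    return "guaranteed refund" not in output.lower()


result = review(
    no_refund_promise,
    healthy=["Refunds are reviewed within 5 business days."],
    bad=["You have a guaranteed refund."],
    unrelated=["Your order ships tomorrow."],
)
print(result.ready())  # True
```

The rates only cover the mechanical items. Readability, privacy safety, and closeness to human judgment still need a person to read the criteria and examples.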

When to write one yourself

Use a human-authored evaluator when the requirement is explicit, high-risk, shared across prompts, or ambiguous enough that automation should not define “good” by itself. Examples include compliance policy, medical or financial safety, structured output format, brand tone, tool correctness, and high-volume regressions the team already understands.
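
As an illustration of an explicit requirement that is easy to author by hand, here is a minimal sketch of a structured output format check. The required keys (`answer`, `sources`) and the function name `evaluate_structured_output` are assumptions made for this example, not a documented schema or API.

```python
import json
from typing import Tuple

# Hypothetical schema: the keys every response object must contain.
REQUIRED_KEYS = {"answer", "sources"}


def evaluate_structured_output(output: str) -> Tuple[bool, str]:
    """Return (passed, reason) for a model output expected to be a JSON object."""
    try:
        payload = json.loads(output)
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON: {exc.msg}"
    if not isinstance(payload, dict):
        return False, "top-level value is not an object"
    missing = REQUIRED_KEYS - set(payload)
    if missing:
        return False, f"missing keys: {sorted(missing)}"
    return True, "ok"


print(evaluate_structured_output('{"answer": "42", "sources": []}'))  # (True, 'ok')
print(evaluate_structured_output('{"answer": "42"}'))  # (False, "missing keys: ['sources']")
```

Returning a reason alongside the verdict keeps the criteria readable, which makes the check easier to calibrate against human judgment.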

After the cycle

Good generated evaluators should become part of the system’s memory:
  1. Keep the checks that explain why the approved candidate worked.
  2. Rewrite noisy criteria before making them a release gate.
  3. Link important evaluators to the prompt and relevant datasets.
  4. Ignore or delete generated checks that overfit a single cluster.
The goal is compounding coverage: each repeated production issue should leave behind a better evaluator, dataset row, or review note. When a cycle is approved, with or without edits, the Auto Generated Evaluators from that cycle are saved in your project so they can be reused in evaluations, Playground runs, and continuous evaluations.
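
The triage steps above can be sketched as a small decision helper for reviewer notes. Everything here is hypothetical: the `ReviewNote` fields are judgments a reviewer records by hand, and the order of precedence (overfit first, then noisy, then explanatory) is an assumption of this sketch, not documented product behavior.

```python
from dataclasses import dataclass


@dataclass
class ReviewNote:
    """Hypothetical reviewer judgments about one generated evaluator."""
    explains_approved_candidate: bool
    criteria_noisy: bool
    overfits_single_cluster: bool


def disposition(note: ReviewNote) -> str:
    """Map reviewer judgments to one follow-up action."""
    if note.overfits_single_cluster:
        return "ignore or delete"
    if note.criteria_noisy:
        return "rewrite before using as a release gate"
    if note.explains_approved_candidate:
        return "keep and link to the prompt and relevant datasets"
    return "ignore or delete"


note = ReviewNote(
    explains_approved_candidate=True,
    criteria_noisy=False,
    overfits_single_cluster=False,
)
print(disposition(note))  # keep and link to the prompt and relevant datasets
```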
