
Human-authored vs auto generated
| Source | What it is for | Treat it as |
|---|---|---|
| Human-authored evaluator | Known product policy, safety, quality, format, or regression requirements. | Stable release coverage when calibrated. |
| Auto Generated Evaluator | A newly observed production pattern that needs fast scoring coverage. | Draft or supporting evidence until reviewed. |

Where they appear
Auto Generated Evaluators appear in the Evals stage of an Improve cycle.

| State | Meaning | Reviewer action |
|---|---|---|
| Covered | Generated evaluators are available for candidate scoring. | Review criteria and examples before trusting the score. |
| Started | Generation is running from selected evidence. | Wait for completion or continue with authored coverage if urgent. |
| Awaiting review | A generated evaluator needs human approval before it becomes trusted coverage. | Publish it only if it captures the failure correctly. |
| Unavailable | There is not enough suitable evidence. | Add logs, Behavior evidence, or a manual evaluator. |
| Failed | The pipeline could not produce usable coverage. | Improve evidence quality or write the evaluator manually. |
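
The state table above can be read as a small lookup: only one state makes generated evaluators available for scoring, and two states call for a manual fallback. The sketch below is a hypothetical illustration of that reading, not a product API; the names `EvaluatorState`, `REVIEWER_ACTION`, `can_score_candidates`, and `needs_manual_fallback` are assumptions introduced here.

```python
from enum import Enum


class EvaluatorState(Enum):
    """States from the table above (hypothetical enum, not a product API)."""
    COVERED = "Covered"
    STARTED = "Started"
    AWAITING_REVIEW = "Awaiting review"
    UNAVAILABLE = "Unavailable"
    FAILED = "Failed"


# Reviewer action per state, copied from the table.
REVIEWER_ACTION = {
    EvaluatorState.COVERED: "Review criteria and examples before trusting the score.",
    EvaluatorState.STARTED: "Wait for completion or continue with authored coverage if urgent.",
    EvaluatorState.AWAITING_REVIEW: "Publish it only if it captures the failure correctly.",
    EvaluatorState.UNAVAILABLE: "Add logs, Behavior evidence, or a manual evaluator.",
    EvaluatorState.FAILED: "Improve evidence quality or write the evaluator manually.",
}


def can_score_candidates(state: EvaluatorState) -> bool:
    # Only "Covered" means generated evaluators are available for candidate scoring.
    return state is EvaluatorState.COVERED


def needs_manual_fallback(state: EvaluatorState) -> bool:
    # "Unavailable" and "Failed" both list a manual evaluator as a way forward.
    return state in {EvaluatorState.UNAVAILABLE, EvaluatorState.FAILED}
```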

Review checklist
Before relying on an Auto Generated Evaluator, make sure it:
- Names the user-visible failure.
- Passes healthy examples.
- Catches the bad examples it was created for.
- Does not block unrelated behavior.

It should also be readable, privacy-safe, grounded in enough examples to explain its decisions, and close to how a human reviewer would judge representative cases. If the idea is useful but the criteria are too broad, tighten it manually before making it durable coverage.

When to write one yourself
Use a human-authored evaluator when the requirement is explicit, high-risk, shared across prompts, or ambiguous enough that automation should not define “good” by itself. Examples: compliance policy, medical or financial safety, structured output format, brand tone, tool correctness, or a high-volume regression the team already understands.

After the cycle
Good generated evaluators should become part of the system’s memory:
- Keep the checks that explain why the approved candidate worked.
- Rewrite noisy criteria before making them a release gate.
- Link important evaluators to the prompt and relevant datasets.
- Ignore or delete generated checks that overfit a single cluster.
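
One way to decide whether a generated check is ready to become a release gate is to replay it over examples a human has already labeled. The sketch below is a minimal illustration under stated assumptions: the evaluator is modeled as a plain function returning `True` for a pass, and `LabeledExample`, `review_report`, `ready_for_release_gate`, the toy refund data, and the 0.95 threshold are all hypothetical rather than part of the product.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class LabeledExample:
    text: str
    kind: str  # "healthy", "bad", or "unrelated" (labels assigned by a human reviewer)


def rate(evaluate: Callable[[str], bool], examples: List[LabeledExample], want_pass: bool) -> float:
    """Fraction of examples where the evaluator's verdict equals want_pass."""
    if not examples:
        # No evidence for this group counts as 0.0, so the check is not promoted blindly.
        return 0.0
    hits = sum(1 for e in examples if evaluate(e.text) == want_pass)
    return hits / len(examples)


def review_report(evaluate: Callable[[str], bool], examples: List[LabeledExample]) -> Dict[str, float]:
    def of_kind(kind: str) -> List[LabeledExample]:
        return [e for e in examples if e.kind == kind]

    return {
        "healthy_pass_rate": rate(evaluate, of_kind("healthy"), want_pass=True),
        "bad_catch_rate": rate(evaluate, of_kind("bad"), want_pass=False),
        "unrelated_pass_rate": rate(evaluate, of_kind("unrelated"), want_pass=True),
    }


def ready_for_release_gate(report: Dict[str, float], threshold: float = 0.95) -> bool:
    # Hypothetical rule: every rate must meet the threshold before the check gates a release.
    return all(value >= threshold for value in report.values())


examples = [
    LabeledExample("You can request a refund within 30 days.", "healthy"),
    LabeledExample("Our policy page explains refund eligibility.", "healthy"),
    LabeledExample("You get a guaranteed refund, no questions asked.", "bad"),
    LabeledExample("Your order ships on Monday.", "unrelated"),
]

# A focused check: fails only when the response promises a guaranteed refund.
focused = lambda text: "guaranteed refund" not in text.lower()
# A noisy check: fails on any mention of refunds, so it blocks healthy answers too.
noisy = lambda text: "refund" not in text.lower()

report = review_report(focused, examples)
noisy_report = review_report(noisy, examples)
keep_focused = ready_for_release_gate(report)      # True on this toy data
keep_noisy = ready_for_release_gate(noisy_report)  # False: healthy_pass_rate is 0.0
```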
Synthetic Datasets
Understand generated cases and production-derived validation data.
Evaluator overview
Maintain the evaluator library outside an Improve cycle.