
Case sources
| Case source | Best use | Risk |
|---|---|---|
| Production trace case | Proving the candidate helps real traffic. | Messy context or customer-specific wording. |
| Regression dataset row | Preventing a known issue from returning. | Can become stale if expected behavior changes. |
| Golden path row | Protecting normal healthy workflows. | Easy cases can hide edge-case failures. |
| Synthetic case | Testing nearby variants before many real examples exist. | Can overrepresent hypothetical failures. |

Where synthetic cases fit
Synthetic cases appear in the Datasets stage of Improve. Adaline uses the prompt variables, evaluator criteria, and recent Behavior evidence to build a small test space around the problem being fixed. That test space is organized into dimensions: user intent, topic, difficulty, request shape, expected output, policy boundary, tool context, or any other axis that helps compare the current prompt against candidates. Adaline then creates cases across a mix of strategies:

| Strategy | What it adds |
|---|---|
| Direct variants | Normal examples that cover the main dimensions of the selected Behavior and prompt variables. |
| Persona variants | Examples shaped around different user profiles, expertise levels, communication styles, or expectations. |
| Adversarial and edge cases | Harder examples around boundary conditions, prompt-injection-like wording, format pressure, semantic ambiguity, or known failure modes. |
| Evaluator-aligned cases | Extra columns or expected outputs that help generated and authored evaluators score the candidate consistently. |
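
To make the idea of a test space concrete, here is a minimal sketch in Python. It is not Adaline's implementation or API: the dimension names, their values, the `CaseSpec` type, and the round-robin assignment of strategies are all assumptions chosen for illustration. It only shows how a handful of dimensions can be expanded into points and spread across the strategies in the table above.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical dimensions for a support prompt (illustrative, not Adaline's schema).
DIMENSIONS = {
    "user_intent": ["refund request", "billing question"],
    "difficulty": ["simple", "ambiguous"],
    "policy_boundary": ["inside policy", "outside policy"],
}

# Strategy labels mirroring the table above (names are assumptions).
STRATEGIES = ["direct", "persona", "adversarial", "evaluator_aligned"]


@dataclass(frozen=True)
class CaseSpec:
    """A plan for one synthetic case: which strategy, at which point in the space."""
    strategy: str
    dimensions: tuple  # sorted (name, value) pairs


def build_test_space(dimensions):
    """Expand the dimensions into every combination of their values."""
    names = sorted(dimensions)
    combos = product(*(dimensions[name] for name in names))
    return [dict(zip(names, values)) for values in combos]


def assign_strategies(points, strategies):
    """Rotate through strategies so each one covers part of the space."""
    return [
        CaseSpec(
            strategy=strategies[i % len(strategies)],
            dimensions=tuple(sorted(point.items())),
        )
        for i, point in enumerate(points)
    ]


points = build_test_space(DIMENSIONS)
specs = assign_strategies(points, STRATEGIES)

for spec in specs:
    print(spec.strategy, dict(spec.dimensions))
```

With three two-valued dimensions this yields eight points, two per strategy. A full cross product grows quickly, so a real test space would likely sample or prune combinations rather than enumerate them all.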

Review generated cases
Before trusting a synthetic case, make sure it is plausible for your product, has clear expected behavior, includes enough context to score fairly, and does not duplicate existing coverage or include private customer details. Good synthetic cases should protect healthy behavior as well as the target failure. If a case feels unrealistic or overfit, do not let it drive approval; keep only the examples that make future releases safer.

Preserve useful cases
After the cycle, promote good examples into durable datasets:

| Outcome | Evidence to keep |
|---|---|
| Approve | Failing examples, generated variants, and criteria that proved the candidate worked. |
| Edit & approve | Examples that explain why the human edit was necessary. |
| Reject | Examples that show the candidate was unsafe, off-policy, too expensive, or under-tested. |
| Failed cycle | The missing-coverage lesson: no cases, no scores, noisy evaluator, or vague Behavior. |
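
The review checks and the promotion step can be combined into a simple filter. The sketch below is one possible shape, not an Adaline feature: `SyntheticCase`, `review`, and `promote` are hypothetical names, and the `plausible` and `contains_private_data` flags are assumed to be set by a human reviewer or a separate scrubbing step.

```python
from dataclasses import dataclass


@dataclass
class SyntheticCase:
    """Hypothetical record for one generated case (fields are assumptions)."""
    case_id: str
    input_text: str
    expected_behavior: str
    context: str
    plausible: bool              # reviewer judgment: realistic for the product
    contains_private_data: bool  # reviewer or scrubber judgment


def normalize(text):
    """Crude duplicate key: trimmed, lowercased input text."""
    return text.strip().lower()


def review(case, existing_inputs):
    """Return the reasons to reject a case; an empty list means keep it."""
    problems = []
    if not case.plausible:
        problems.append("implausible for the product")
    if not case.expected_behavior.strip():
        problems.append("no clear expected behavior")
    if not case.context.strip():
        problems.append("not enough context to score fairly")
    if normalize(case.input_text) in existing_inputs:
        problems.append("duplicates existing coverage")
    if case.contains_private_data:
        problems.append("includes private customer details")
    return problems


def promote(cases, durable_inputs):
    """Keep only cases that pass review and add new coverage."""
    seen = {normalize(text) for text in durable_inputs}
    kept = []
    for case in cases:
        if not review(case, seen):
            kept.append(case)
            seen.add(normalize(case.input_text))
    return kept


cases = [
    SyntheticCase("s1", "Can I get a refund after 45 days?",
                  "Explain the 30-day limit and offer alternatives.",
                  "Plan: Pro, purchased 45 days ago", True, False),
    SyntheticCase("s2", "can i get a refund after 45 days?",
                  "Explain the 30-day limit and offer alternatives.",
                  "Plan: Pro, purchased 45 days ago", True, False),
    SyntheticCase("s3", "Refund my order", "",
                  "Plan: Free", True, False),
]

kept = promote(cases, durable_inputs=["How do I update my card?"])
print([case.case_id for case in kept])  # ['s1']
```

Here `s2` is dropped as a duplicate of `s1` and `s3` is dropped because it has no expected behavior. Exact-match deduplication is deliberately crude; near-duplicate detection would need a similarity measure chosen for your data.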