Synthetic Datasets are generated cases that expand coverage around production evidence. They help Improve compare the current prompt against candidates when real examples are sparse or when a release needs edge-case pressure. Treat synthetic cases as evidence, not truth. They are release-grade only when the reviewer agrees they represent realistic user or agent behavior.

[Figure: Improve audit packet showing production and synthetic cases, selection process, stage provenance, and execution timeline]

Case sources

| Case source | Best use | Risk |
| --- | --- | --- |
| Production trace case | Proving the candidate helps real traffic. | Messy context or customer-specific wording. |
| Regression dataset row | Preventing a known issue from returning. | Can become stale if expected behavior changes. |
| Golden path row | Protecting normal healthy workflows. | Easy cases can hide edge-case failures. |
| Synthetic case | Testing nearby variants before many real examples exist. | Can overrepresent hypothetical failures. |

Where synthetic cases fit

Synthetic cases appear in the Datasets stage of Improve. Adaline uses the prompt variables, evaluator criteria, and recent Behavior evidence to build a small test space around the problem being fixed. That test space is organized into dimensions: user intent, topic, difficulty, request shape, expected output, policy boundary, tool context, or any other axis that helps compare the current prompt against candidates. Adaline then creates cases across a mix of strategies:

| Strategy | What it adds |
| --- | --- |
| Direct variants | Normal examples that cover the main dimensions of the selected Behavior and prompt variables. |
| Persona variants | Examples shaped around different user profiles, expertise levels, communication styles, or expectations. |
| Adversarial and edge cases | Harder examples around boundary conditions, prompt-injection-like wording, format pressure, semantic ambiguity, or known failure modes. |
| Evaluator-aligned cases | Extra columns or expected outputs that help generated and authored evaluators score the candidate consistently. |

When production Behavior data is available, Adaline biases generation toward real patterns: broad and granular Behaviors, newly observed patterns, high-error Behaviors, rare requests, and representative snippets. It may generate more rows than needed first, then prune near-duplicates so the final dataset is more diverse and useful for scoring.

[Figure: Improve stage provenance showing production cases, synthetic cases, derived evaluators, and candidate exploration]
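
To make the idea of a dimensioned test space and near-duplicate pruning concrete, here is a minimal Python sketch. It is not Adaline's implementation: the dimension names and values, the function names, the token-overlap (Jaccard) similarity measure, and the 0.8 threshold are all assumptions chosen for illustration.

```python
import re
from itertools import product

# Hypothetical dimensions and values, chosen only for illustration.
DIMENSIONS = {
    "intent": ["refund request", "order status"],
    "difficulty": ["simple", "ambiguous"],
    "persona": ["new user", "expert user"],
}


def build_test_space(dimensions):
    """Return one dict per combination of dimension values."""
    names = list(dimensions)
    return [dict(zip(names, combo)) for combo in product(*dimensions.values())]


def tokens(text):
    """Lowercased word tokens, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Token-set Jaccard similarity between two strings, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def prune_near_duplicates(texts, threshold=0.8):
    """Greedily keep a text only if it is dissimilar to every kept text."""
    kept = []
    for text in texts:
        if all(jaccard(text, k) < threshold for k in kept):
            kept.append(text)
    return kept


# Every combination of dimension values is one cell of the test space.
cells = build_test_space(DIMENSIONS)

# Over-generate candidate rows, then prune rows that say the same thing.
candidates = [
    "Where is my order? It was due yesterday.",
    "where is my order, it was due yesterday!",
    "I want a refund for a damaged item.",
]
unique = prune_near_duplicates(candidates)
print(len(cells), "cells;", len(unique), "of", len(candidates), "rows kept")
```

A full cross product grows quickly as dimensions are added, so in practice a sample of cells is usually enough. The greedy pruning step keeps the first row of each near-duplicate group, which means row order affects which wording survives.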

Review generated cases

Before trusting a synthetic case, make sure it is plausible for your product, has clear expected behavior, includes enough context to score fairly, and does not duplicate existing coverage or include private customer details. Good synthetic cases should protect healthy behavior as well as the target failure. If a case feels unrealistic or overfit, do not let it drive approval; keep only the examples that make future releases safer.
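
The review criteria above can be recorded as an explicit checklist so that rejected cases carry a reason. The following Python sketch assumes a reviewer fills in each flag by hand; the `SyntheticCase` class, its field names, and the `review_failures` helper are hypothetical and not part of Adaline.

```python
from dataclasses import dataclass


@dataclass
class SyntheticCase:
    """Reviewer judgments for one generated case (hypothetical fields)."""
    case_id: str
    plausible: bool
    has_expected_behavior: bool
    has_enough_context: bool
    duplicates_existing: bool
    contains_private_details: bool


def review_failures(case):
    """Return the checklist items the case fails; an empty list means keep."""
    failures = []
    if not case.plausible:
        failures.append("not plausible for the product")
    if not case.has_expected_behavior:
        failures.append("expected behavior is unclear")
    if not case.has_enough_context:
        failures.append("not enough context to score fairly")
    if case.duplicates_existing:
        failures.append("duplicates existing coverage")
    if case.contains_private_details:
        failures.append("contains private customer details")
    return failures


good = SyntheticCase("case-1", True, True, True, False, False)
leaky = SyntheticCase("case-2", True, True, True, False, True)

for case in (good, leaky):
    problems = review_failures(case)
    verdict = "keep" if not problems else "drop: " + "; ".join(problems)
    print(case.case_id, verdict)
```

Returning the reasons rather than a single boolean makes it easier to see whether generation keeps producing the same kind of weak case.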

Preserve useful cases

After the cycle, promote good examples into durable datasets:

| Outcome | Evidence to keep |
| --- | --- |
| Approve | Failing examples, generated variants, and criteria that proved the candidate worked. |
| Edit & approve | Examples that explain why the human edit was necessary. |
| Reject | Examples that show the candidate was unsafe, off-policy, too expensive, or under-tested. |
| Failed cycle | The missing-coverage lesson: no cases, no scores, noisy evaluator, or vague Behavior. |

Auto Generated Evaluators

Understand generated checks created from production evidence.

Build datasets from logs

Turn useful cases into durable test coverage.