Build regression coverage

Regression coverage makes sure solved problems stay solved. In Adaline, the strongest regression coverage comes from production evidence: a trace fails, a Behavior clusters the pattern, a dataset stores representative cases, and evaluators define what must pass next time.

Regression loop

Monitor shows a metric shift or falling eval score.
Traces reveal exact failing requests and spans.
Behaviors group repeated patterns and issue signals.
Dataset rows capture representative cases.
Evaluators encode the requirement.
Improve proposes a prompt candidate.
Deployment ships only after the regression cases pass.
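
The last step of the loop can be read as a deployment gate. The Python sketch below is one hedged way to express it. `RegressionCase`, `failing_cases`, and `can_deploy` are hypothetical names used for illustration, not Adaline APIs, and the candidate and evaluator are plain functions standing in for a real prompt run and a real evaluator.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class RegressionCase:
    case_id: str
    user_input: str


# Assumed shapes: a candidate maps input text to output text, and an
# evaluator returns True when the output meets the requirement.
Candidate = Callable[[str], str]
Evaluator = Callable[[RegressionCase, str], bool]


def failing_cases(cases: List[RegressionCase], candidate: Candidate,
                  evaluator: Evaluator) -> List[str]:
    """Return the ids of regression cases the candidate does not pass."""
    return [c.case_id for c in cases
            if not evaluator(c, candidate(c.user_input))]


def can_deploy(cases: List[RegressionCase], candidate: Candidate,
               evaluator: Evaluator) -> bool:
    """Open the gate only when there are cases and all of them pass."""
    # Assumption: an empty regression set proves nothing, so it blocks.
    return bool(cases) and not failing_cases(cases, candidate, evaluator)


# Illustrative data only: the requirement here is "output looks like JSON".
def is_json_like(case: RegressionCase, output: str) -> bool:
    return output.startswith("{") and output.endswith("}")


def old_candidate(text: str) -> str:
    return f"Status of {text}: shipped"


def new_candidate(text: str) -> str:
    return '{"status": "shipped"}'


cases = [RegressionCase("r1", "order 17"), RegressionCase("r2", "order 42")]
print(failing_cases(cases, old_candidate, is_json_like))
print(can_deploy(cases, new_candidate, is_json_like))
```

Returning the failing ids, rather than only a boolean, keeps the gate useful for review: a blocked deployment points straight at the cases to inspect.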

What deserves a regression row

Add a regression row when the issue is:

Customer-visible.
High-risk for safety, policy, finance, healthcare, legal, or trust.
Repeated across several traces.
Caused by ambiguous prompt instructions.
Related to tool failure or missing context.
A format or schema break that can disrupt downstream code.
A behavior the team explicitly fixed and wants to preserve.

Do not add every bad example and keep it indefinitely. Keep regression sets curated so that each failure remains meaningful.
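
One way to keep the set curated is to record which of the criteria above a failure matched, and to skip failures that match none. The sketch below assumes a hypothetical `Criterion` enum and `triage` helper; neither is an Adaline feature, and the criterion labels are simply the list above restated as identifiers.

```python
from enum import Enum
from typing import Dict, Iterable


class Criterion(Enum):
    CUSTOMER_VISIBLE = "customer_visible"
    HIGH_RISK = "high_risk"
    REPEATED = "repeated"
    AMBIGUOUS_PROMPT = "ambiguous_prompt"
    TOOL_OR_CONTEXT = "tool_failure_or_missing_context"
    SCHEMA_BREAK = "format_or_schema_break"
    EXPLICIT_FIX = "explicitly_fixed"


def triage(matched: Iterable[Criterion]) -> Dict[str, object]:
    """Decide whether to add a row and keep the reasons as row metadata."""
    reasons = sorted({c.value for c in matched})
    return {"add_row": bool(reasons), "reasons": reasons}


print(triage([Criterion.REPEATED, Criterion.SCHEMA_BREAK]))
print(triage([]))
```

Storing the reasons next to the row makes later pruning easier, because a reviewer can see why the case was considered worth keeping.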

Convert a trace into a test

When a production trace reveals a failure:

Open the trace

Inspect the full trace, not only the failing span.

Identify the minimum useful input

Keep enough context to reproduce the behavior, but remove irrelevant metadata.

Create or update a dataset row

Store the user input, required context, expected behavior, and source metadata.

Attach an evaluator

Add the rule that should fail the bad output and pass the desired output.

Run the evaluation

Confirm that the row fails against the current prompt and passes once the fix is applied.

For the trace-side workflow, see Add spans to datasets.
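
The steps above can be sketched in Python as a row, an evaluator rule, and a fail-before, pass-after check. This is a minimal illustration under stated assumptions: `RegressionRow`, `states_refund_window`, and `reproduces_and_fixes` are hypothetical names, the two prompt functions are stubs standing in for real prompt runs, and the refund policy and trace id are invented sample values.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass(frozen=True)
class RegressionRow:
    user_input: str
    required_context: str
    expected_behavior: str
    source: Dict[str, str] = field(default_factory=dict)


PromptRun = Callable[[RegressionRow], str]
Evaluator = Callable[[RegressionRow, str], bool]


def states_refund_window(row: RegressionRow, output: str) -> bool:
    """Evaluator rule: fail any output that omits the 30-day window."""
    return "30 days" in output


def reproduces_and_fixes(row: RegressionRow, evaluator: Evaluator,
                         before: PromptRun, after: PromptRun) -> bool:
    """True when the row fails on the old prompt and passes on the fix."""
    return (not evaluator(row, before(row))) and evaluator(row, after(row))


# Stubs standing in for real prompt runs; outputs are illustrative.
def current_prompt(row: RegressionRow) -> str:
    return "Refunds are always available."


def fixed_prompt(row: RegressionRow) -> str:
    return "Refunds are available within 30 days of purchase."


row = RegressionRow(
    user_input="Can I return this after two months?",
    required_context="Policy: refunds within 30 days of purchase.",
    expected_behavior="States the 30-day window and declines the refund.",
    source={"trace_id": "example-trace", "origin": "production"},
)

print(reproduces_and_fixes(row, states_refund_window,
                           current_prompt, fixed_prompt))
```

Checking both directions matters: a row that already passes on the current prompt does not reproduce the failure, so it would not protect the fix.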

Use Behavior evidence

Behavior rows and detail pages help decide whether a failure is worth regression coverage. Use:

Issue status.
Error rate.
Conversation count.
Representative snippets.
Role: user, assistant, or tool.
Lifecycle state such as new, drifted, reactivated, or vanished.
Related trace evidence.

A high-volume issue Behavior usually deserves regression coverage before prompt changes are approved.
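
As a rough illustration of weighing that evidence, the sketch below ranks issue Behaviors by an estimated number of failing conversations. The `BehaviorSummary` type, the product of error rate and conversation count as a volume estimate, and the `min_failures` threshold are all assumptions made for this example, not fields or rules defined by Adaline.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BehaviorSummary:
    name: str
    is_issue: bool
    error_rate: float        # assumed to be a fraction between 0.0 and 1.0
    conversation_count: int
    lifecycle: str           # "new", "drifted", "reactivated", or "vanished"


def estimated_failures(b: BehaviorSummary) -> float:
    """Assumed volume estimate: error rate times conversation count."""
    return b.error_rate * b.conversation_count


def coverage_candidates(behaviors: List[BehaviorSummary],
                        min_failures: float = 10.0) -> List[BehaviorSummary]:
    """Issue Behaviors above the threshold, highest estimated volume first."""
    hits = [b for b in behaviors
            if b.is_issue and estimated_failures(b) >= min_failures]
    return sorted(hits, key=estimated_failures, reverse=True)


# Invented sample values for illustration.
behaviors = [
    BehaviorSummary("wrong_refund_window", True, 0.25, 400, "drifted"),
    BehaviorSummary("rare_tool_timeout", True, 0.5, 10, "new"),
    BehaviorSummary("greeting_style", False, 0.0, 900, "new"),
    BehaviorSummary("missing_order_id", True, 0.125, 160, "reactivated"),
]

for b in coverage_candidates(behaviors):
    print(b.name, b.lifecycle, estimated_failures(b))
```

The ranking is only a starting point for review; lifecycle state and representative snippets still need a human read before a case becomes a regression row.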

Promote Improve evidence

Improve cycles can produce or prepare dataset evidence while generating candidates. Treat that evidence as a draft until it has been reviewed. After approval:

Promote the best cases into long-lived regression datasets.
Remove synthetic or duplicate rows that do not teach anything.
Keep the evaluator that represents the accepted fix.
Link the dataset and evaluator to the prompt that owns the behavior.
Watch the original Behavior after deployment.
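
The pruning step can be sketched as a small filter over draft rows. The example below assumes each draft row is a dictionary with hypothetical `user_input` and `synthetic` keys, and it treats two rows as duplicates when their inputs match after lowercasing and whitespace normalization. Both choices are illustrative.

```python
from typing import Dict, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical inputs match."""
    return " ".join(text.lower().split())


def promote(draft_rows: List[Dict]) -> List[Dict]:
    """Keep one row per normalized input, preferring production rows."""
    best: Dict[str, Dict] = {}
    for row in draft_rows:
        key = normalize(row["user_input"])
        kept = best.get(key)
        # Replace a kept synthetic row when a production row has the same input.
        if kept is None or (kept["synthetic"] and not row["synthetic"]):
            best[key] = row
    return list(best.values())


# Invented draft rows for illustration.
drafts = [
    {"user_input": "Cancel my order", "synthetic": True},
    {"user_input": "cancel  my ORDER", "synthetic": False},
    {"user_input": "Where is my refund?", "synthetic": True},
    {"user_input": "Where is my refund?", "synthetic": True},
]

for row in promote(drafts):
    print(row)
```

A synthetic row that has no production twin is kept here, on the assumption that it may still cover a case production has not produced yet; a reviewer decides whether it teaches anything.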

Organize regression datasets

Use names that explain what the dataset protects:

refund_policy_regressions
tool_lookup_failures
json_contract_regressions
safety_boundary_cases
high_value_customer_escalations

Avoid names such as bugs_june, random_failures, or new_tests. Regression datasets should outlive the incident that created them.
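
A naming guideline like this can be checked with a simple lint. The heuristic below is an assumption made for illustration: it flags a name as dated when any underscore-separated token is a month name or a number, and as generic when every token comes from a small list of vague words. The word lists are hypothetical and would need tuning for a real project.

```python
from typing import List

MONTHS = {
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
    "jan", "feb", "mar", "apr", "jun", "jul", "aug", "sep", "oct",
    "nov", "dec",
}
GENERIC = {"bugs", "random", "new", "misc", "tests", "failures", "cases"}


def name_problems(name: str) -> List[str]:
    """Return the reasons a dataset name is unlikely to outlive its incident."""
    tokens = name.lower().split("_")
    problems = []
    if any(t in MONTHS or t.isdigit() for t in tokens):
        problems.append("dated")
    if all(t in GENERIC for t in tokens):
        problems.append("generic")
    return problems


for name in ["refund_policy_regressions", "bugs_june", "random_failures"]:
    print(name, name_problems(name))
```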

Keep coverage healthy

Review regression datasets regularly:

Archive stale rows when product policy changes.
Split large mixed datasets by behavior or workflow.
Remove duplicates.
Add metadata for source, incident, trace, or behavior.
Keep expected outputs current.
Re-run after major model or provider changes.
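
Parts of this review can be automated. The sketch below audits rows for missing metadata and for a policy version that no longer matches the current one. The row layout, the `REQUIRED_METADATA` keys, and the `policy_version` field are hypothetical; they stand in for whatever metadata a team actually records.

```python
from typing import Dict, List

# Hypothetical metadata keys a team might require on every row.
REQUIRED_METADATA = ("source", "trace")


def audit(rows: List[Dict], current_policy_version: str) -> Dict[str, List[str]]:
    """Report row ids with missing metadata or an outdated policy version."""
    report: Dict[str, List[str]] = {"missing_metadata": [], "stale": []}
    for row in rows:
        meta = row.get("metadata", {})
        if any(not meta.get(key) for key in REQUIRED_METADATA):
            report["missing_metadata"].append(row["id"])
        # Rows with no recorded policy version are not treated as stale.
        if meta.get("policy_version") not in (None, current_policy_version):
            report["stale"].append(row["id"])
    return report


# Invented rows for illustration.
rows = [
    {"id": "row-1", "metadata": {"source": "production", "trace": "t-1",
                                 "policy_version": "v2"}},
    {"id": "row-2", "metadata": {"source": "production",
                                 "policy_version": "v1"}},
    {"id": "row-3"},
]

print(audit(rows, current_policy_version="v2"))
```

The audit only flags rows. Archiving a stale row or updating its expected output remains a review decision.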

A passing regression dataset does not prove the prompt is globally safe. It proves the prompt still passes the cases you captured. Keep adding production evidence as the product evolves.
