Regression loop
- Monitor shows a metric shift or falling eval score.
- Traces reveal exact failing requests and spans.
- Behaviors groups repeated patterns and issue signals.
- Dataset rows capture representative cases.
- Evaluators encode the requirement.
- Improve proposes a prompt candidate.
- Deployment ships only after the regression cases pass.
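The final step of the loop can be thought of as a gate. The sketch below is illustrative only: `RegressionResult`, `failing_rows`, and `can_deploy` are hypothetical names rather than part of any product API, and it assumes each evaluator run yields a pass/fail result per dataset row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegressionResult:
    """Outcome of one evaluator on one regression row (hypothetical shape)."""
    row_id: str
    evaluator: str
    passed: bool


def failing_rows(results):
    """Return the sorted, de-duplicated IDs of rows with at least one failed evaluator."""
    return sorted({r.row_id for r in results if not r.passed})


def can_deploy(results):
    """Allow deployment only when results exist and none of them failed."""
    return bool(results) and not failing_rows(results)


# Made-up sample results for illustration.
results = [
    RegressionResult("row-1", "json_contract", True),
    RegressionResult("row-2", "json_contract", False),
    RegressionResult("row-2", "refund_policy", False),
]
print(can_deploy(results))    # False
print(failing_rows(results))  # ['row-2']
```

Treating an empty result set as a failure is a deliberate choice in this sketch: a candidate that was never evaluated should not ship by default.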
What deserves a regression row
Add a regression row when the issue is:
- Customer-visible.
- High-risk for safety, policy, finance, healthcare, legal, or trust.
- Repeated across several traces.
- Caused by ambiguous prompt instructions.
- Related to tool failure or missing context.
- A format or schema break that can disrupt downstream code.
- A behavior the team explicitly fixed and wants to preserve.
Convert a trace into a test
When a production trace reveals a failure:
- Identify the minimum useful input. Keep enough context to reproduce the behavior, but remove irrelevant metadata.
- Create or update a dataset row. Store the user input, required context, expected behavior, and source metadata.
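As a rough illustration of these two steps, the sketch below turns a trace into a dataset row. It assumes a trace is available as a plain dictionary; the field names, the `NOISE_KEYS` set, and the `trace_to_row` helper are all hypothetical, and the sample trace is made up.

```python
# Keys treated as irrelevant for reproduction (an assumption for this sketch).
NOISE_KEYS = {"user_agent", "ip_address", "sdk_version", "latency_ms"}


def trace_to_row(trace, expected_behavior):
    """Build a dataset row from a trace dict, keeping only what reproduces the failure."""
    context = {
        key: value
        for key, value in trace.get("context", {}).items()
        if key not in NOISE_KEYS
    }
    return {
        "input": trace["input"],
        "context": context,
        "expected_behavior": expected_behavior,
        # Source metadata lets reviewers trace the row back to its origin.
        "source": {"trace_id": trace["trace_id"], "origin": "production"},
    }


# Made-up sample trace for illustration.
trace = {
    "trace_id": "tr_123",
    "input": "Can I get a refund after 45 days?",
    "context": {"policy_doc": "Refunds within 30 days.", "latency_ms": 812},
}
row = trace_to_row(trace, "Decline the refund and cite the 30-day window.")
print(row["context"])  # {'policy_doc': 'Refunds within 30 days.'}
```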
Use Behavior evidence
Behavior rows and detail pages help decide whether a failure is worth regression coverage. Use:
- Issue status.
- Error rate.
- Conversation count.
- Representative snippets.
- Role: user, assistant, or tool.
- Lifecycle state such as new, drifted, reactivated, or vanished.
- Related trace evidence.
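One way to make this decision repeatable is a small triage heuristic. The sketch below is an assumption-heavy example, not product behavior: the dictionary keys, the `"open"` status value, and both thresholds are placeholders that a team would tune for itself. Only the lifecycle state names come from the list above.

```python
def worth_regression_coverage(behavior, min_error_rate=0.05, min_conversations=3):
    """Heuristic: flag open issues that recur or have come back.

    Thresholds and field names are placeholders, not product defaults.
    """
    if behavior["issue_status"] != "open":
        return False
    if behavior["lifecycle_state"] == "vanished":
        return False
    # Recurring: frequent enough and seen across several conversations.
    recurring = (
        behavior["error_rate"] >= min_error_rate
        and behavior["conversation_count"] >= min_conversations
    )
    # Returned: a behavior that drifted or reappeared after being quiet.
    returned = behavior["lifecycle_state"] in {"drifted", "reactivated"}
    return recurring or returned


# Made-up sample Behavior for illustration.
behavior = {
    "issue_status": "open",
    "error_rate": 0.12,
    "conversation_count": 9,
    "lifecycle_state": "new",
}
print(worth_regression_coverage(behavior))  # True
```

A heuristic like this only shortlists candidates. Representative snippets and related trace evidence still need a human read before a row is added.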
Promote Improve evidence
Improve cycles can produce or prepare dataset evidence while generating candidates. Treat that evidence as draft until review. After approval:
- Promote the best cases into long-lived regression datasets.
- Remove synthetic or duplicate rows that do not teach anything.
- Keep the evaluator that represents the accepted fix.
- Link the dataset and evaluator to the prompt that owns the behavior.
- Watch the original Behavior after deployment.
Organize regression datasets
Use names that explain what the dataset protects:
- refund_policy_regressions
- tool_lookup_failures
- json_contract_regressions
- safety_boundary_cases
- high_value_customer_escalations
Avoid names like bugs_june, random_failures, or new_tests. Regression datasets should outlive the incident that created them.
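A naming convention like this can be checked mechanically. The sketch below is one possible lint, assuming that a name is acceptable when at least one of its tokens describes a behavior or workflow; the token lists and the `is_descriptive` helper are illustrative assumptions, not an official rule set.

```python
import re

# Tokens that say nothing about the protected behavior (illustrative list).
VAGUE_TOKENS = {"bugs", "random", "new", "misc", "temp", "tests", "failures"}
# Tokens that tie a name to the date of an incident (illustrative list).
MONTH_TOKENS = {
    "jan", "january", "feb", "february", "mar", "march", "apr", "april",
    "may", "jun", "june", "jul", "july", "aug", "august", "sep", "sept",
    "september", "oct", "october", "nov", "november", "dec", "december",
}


def is_descriptive(name):
    """Return True if at least one token names a behavior or workflow."""
    tokens = [t for t in re.split(r"[_\-\s]+", name.lower()) if t]
    meaningful = [
        t for t in tokens
        if t not in VAGUE_TOKENS and t not in MONTH_TOKENS and not t.isdigit()
    ]
    return bool(meaningful)


for name in ["refund_policy_regressions", "tool_lookup_failures", "bugs_june", "new_tests"]:
    print(name, is_descriptive(name))
```

In this sketch `tool_lookup_failures` passes because `tool` and `lookup` carry meaning, while `random_failures` does not because neither token does.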
Keep coverage healthy
Review regression datasets regularly:
- Archive stale rows when product policy changes.
- Split large mixed datasets by behavior or workflow.
- Remove duplicates.
- Add metadata for source, incident, trace, or behavior.
- Keep expected outputs current.
- Re-run after major model or provider changes.
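Two of these checks, removing duplicates and confirming provenance metadata, are easy to automate. The sketch below assumes rows are plain dictionaries with `input`, optional `context`, and optional `metadata` fields; that row shape, the `PROVENANCE_KEYS` tuple, and the `audit` helper are hypothetical, and the sample rows are made up.

```python
import json

# A row counts as traceable if it carries at least one of these metadata keys.
PROVENANCE_KEYS = ("source", "incident", "trace", "behavior")


def row_key(row):
    """Stable key for duplicate detection: same input and context means same case."""
    payload = {"input": row["input"], "context": row.get("context", {})}
    return json.dumps(payload, sort_keys=True)


def audit(rows):
    """Return (unique_rows, rows_missing_provenance), keeping the first copy of each duplicate."""
    seen, unique = set(), []
    for row in rows:
        key = row_key(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    missing = [
        row for row in unique
        if not any(row.get("metadata", {}).get(k) for k in PROVENANCE_KEYS)
    ]
    return unique, missing


# Made-up sample rows for illustration.
rows = [
    {"input": "Cancel my order", "metadata": {"trace": "tr_1"}},
    {"input": "Cancel my order", "metadata": {"trace": "tr_2"}},
    {"input": "Where is my refund?"},
]
unique, missing = audit(rows)
print(len(unique), len(missing))  # 2 1
```

Exact-match de-duplication is the simplest option. Near-duplicates with reworded inputs would need a fuzzier comparison, which this sketch does not attempt.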