The operating loop
| Step | Product area | What the team is deciding |
|---|---|---|
| Watch health | Monitor | Did traffic, latency, cost, token usage, or evaluation score change? |
| Inspect evidence | Traces | What actually happened in the request path? |
| Find patterns | Behaviors | Is this a repeated user, assistant, or tool behavior? |
| Preserve coverage | Datasets and Evaluators | What case and rule should prevent this from regressing? |
| Improve safely | Improve | Which candidate fixes the issue without breaking existing checks? |
| Release and watch | Deploy | Which environment should run the reviewed prompt version, and what should be watched after release? |
After integration
Once traces are arriving, confirm these basics before starting prompt improvement:
- Monitor shows trace volume for the expected project.
- Traces include useful names for major workflow steps.
- Tags and attributes identify environment, release, route, user segment, or tenant where appropriate.
- Model spans include enough variable and response content to debug failures.
- Sensitive data handling is understood before traces are shared, exported, or copied into datasets.
- At least one prompt has a small golden dataset and a few evaluators.
- Deployment environments match the release lanes your application uses.
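The tag and attribute check in this list can be automated against exported traces. The following is a minimal sketch, assuming traces are available as plain dictionaries with an `id` and an `attributes` mapping. The attribute keys (`environment`, `release`, `route`) and both function names are hypothetical and not part of any product API.

```python
# Hypothetical attribute keys; substitute the names your instrumentation emits.
REQUIRED_ATTRIBUTES = ("environment", "release", "route")


def missing_attributes(trace: dict, required=REQUIRED_ATTRIBUTES) -> list[str]:
    """Return the required attribute keys that are absent or empty on a trace."""
    attributes = trace.get("attributes", {})
    return [key for key in required if not attributes.get(key)]


def audit_traces(traces: list[dict]) -> dict[str, list[str]]:
    """Map each failing trace ID to its missing attributes; passing traces are skipped."""
    report = {}
    for trace in traces:
        gaps = missing_attributes(trace)
        if gaps:
            report[trace.get("id", "<unknown>")] = gaps
    return report


# Made-up sample traces for illustration only.
sample_traces = [
    {"id": "t1", "attributes": {"environment": "production", "release": "r1", "route": "/chat"}},
    {"id": "t2", "attributes": {"environment": "production"}},
]
audit = audit_traces(sample_traces)
print(audit)
```

An empty report suggests the sampled traces carry enough metadata to filter by release window and route later. A non-empty report points at instrumentation to fix before relying on filtered trace views.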
Choose the next surface
| Situation | Start here | Then go here |
|---|---|---|
| A chart changed after deploy | Monitor | Filter Traces by the release window, then compare deployments. |
| Users report bad answers | Traces | Deep Search for similar cases, then check Behaviors. |
| A failure repeats | Behaviors | Add representative spans to a dataset, then start Improve. |
| A prompt edit is ready | Evaluators | Run datasets, compare deployment diffs, then deploy. |
| Cost is rising | Monitor | Inspect high-cost traces for tokens, tools, model, and output length. |
| Latency is rising | Monitor | Use trace waterfall view before changing prompt text. |
| A release failed | Deploy | Roll back, preserve the failed traces, and add regression coverage. |
| A team asks for proof | Traces | Save a view, export filtered evidence, and include evaluator results. |
First week after launch
Use the first week to learn how the agent behaves under real traffic.
Day 0: launch window
- Watch Monitor for traffic, latency, cost, token usage, and evaluation score.
- Open representative traces from normal requests, not only failures.
- Confirm deployment webhooks and cache refresh behavior.
- Confirm the runtime is reading the intended prompt and environment.
- Record the rollback target before the next release.
Day 1: evidence review
- Review failed traces and low evaluator scores.
- Use Deep Search for known failure themes.
- Add the best production examples to regression datasets.
- Tighten evaluators that failed for the wrong reason.
- Avoid prompt edits until the team understands the failure shape.
Day 3: behavior review
- Review Behaviors once enough traces have accumulated.
- Identify high-volume issues and repeated assistant/tool patterns.
- Decide which Behaviors should become Improve targets.
- Add dataset rows for the cases that should never regress.
Week 1: release discipline
- Run the golden and regression datasets before prompt releases.
- Compare deployment snapshots before changing production.
- Watch Monitor after each release.
- Turn release learnings into saved trace views, evaluators, and dataset rows.
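One way to enforce the pre-release dataset run is a small gate in the release script. This sketch assumes you can obtain an evaluator pass rate per dataset for both the current production prompt and the candidate. `DatasetResult`, `release_blockers`, and the `tolerance` parameter are illustrative names rather than an existing API, and the numbers are made-up sample values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetResult:
    """Evaluator pass rates for one dataset (hypothetical shape)."""
    name: str
    baseline_pass_rate: float   # current production prompt
    candidate_pass_rate: float  # prompt version under review


def release_blockers(results: list[DatasetResult], tolerance: float = 0.0) -> list[str]:
    """Return names of datasets where the candidate regressed beyond the tolerance."""
    return [
        r.name
        for r in results
        if r.candidate_pass_rate < r.baseline_pass_rate - tolerance
    ]


# Made-up sample results for illustration only.
results = [
    DatasetResult("golden", baseline_pass_rate=0.95, candidate_pass_rate=0.97),
    DatasetResult("regression", baseline_pass_rate=1.0, candidate_pass_rate=0.9),
]
blockers = release_blockers(results)
if blockers:
    print("Hold the release; regressed datasets:", ", ".join(blockers))
```

Comparing against a baseline rather than a fixed threshold keeps the gate meaningful as datasets grow. A non-zero `tolerance` is a judgment call that trades noise resistance against the risk of letting small regressions through.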
Incident workflow
When production behavior looks wrong:
- Open Monitor and capture the affected time window.
- Open Traces with the same window and filter by environment, release, route, status, tags, or attributes.
- Inspect representative traces in tree and waterfall views.
- Use Deep Search if the issue is semantic rather than metadata-based.
- Check whether the traces belong to an existing or new Behavior.
- Add important model spans to a regression dataset.
- Fix the right layer: prompt, evaluator, dataset, tool, retrieval, backend, provider, or deployment.
- If a release caused the issue, roll back before running a longer Improve cycle.
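The window-and-filter step of this workflow can also be reproduced offline on exported traces. The sketch below assumes each exported trace is a dictionary with a timezone-aware `started_at` datetime and an `attributes` mapping. That shape, the `filter_traces` helper, and the sample data are all hypothetical.

```python
from datetime import datetime, timezone


def filter_traces(traces, start, end, **required):
    """Keep traces that started inside [start, end] and match every required attribute."""
    selected = []
    for trace in traces:
        if not (start <= trace["started_at"] <= end):
            continue
        attributes = trace.get("attributes", {})
        if all(attributes.get(key) == value for key, value in required.items()):
            selected.append(trace)
    return selected


# Made-up incident window and traces for illustration only.
window_start = datetime(2024, 1, 10, 14, 0, tzinfo=timezone.utc)
window_end = datetime(2024, 1, 10, 15, 0, tzinfo=timezone.utc)
traces = [
    {"id": "t1", "started_at": datetime(2024, 1, 10, 13, 30, tzinfo=timezone.utc),
     "attributes": {"environment": "production", "status": "error"}},
    {"id": "t2", "started_at": datetime(2024, 1, 10, 14, 20, tzinfo=timezone.utc),
     "attributes": {"environment": "production", "status": "error"}},
    {"id": "t3", "started_at": datetime(2024, 1, 10, 14, 40, tzinfo=timezone.utc),
     "attributes": {"environment": "staging", "status": "error"}},
]

affected = filter_traces(
    traces, window_start, window_end, environment="production", status="error"
)
print([t["id"] for t in affected])
```

Using the same time window for the Monitor capture and the trace filter keeps the evidence aligned, so the traces you inspect are the ones that produced the metric change.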
Weekly review
For a mature project, hold a short weekly review:
- Which Monitor metrics moved, and why?
- Which Behaviors are new, reactivated, or getting worse?
- Which evaluator failures were true product failures?
- Which dataset rows were added from production?
- Which Improve cycles were approved, edited, or rejected?
- Which deployments changed production behavior?
- Which saved trace views or exports should be kept for operations?
- Which old datasets, views, or evaluators should be cleaned up?
Evidence packet
For high-risk changes, keep a small packet before approval or deployment:
- Monitor time range and metric change.
- Trace view filters or export.
- Representative trace IDs and span IDs when allowed by policy.
- Behavior link and issue rationale.
- Dataset rows added or updated.
- Evaluator results before and after the fix.
- Deployment comparison and target environment.
- Rollback target.
- Post-release watch plan.
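If your team tracks packets in code or in a review template, the list above maps naturally onto a small record with a completeness check. This is a sketch only. The `EvidencePacket` class, its field names, and the choice to treat trace and span IDs as optional (because policy may not allow them) are assumptions, not a defined format.

```python
from dataclasses import dataclass, field, fields

# Fields that may legitimately be empty, for example when policy forbids sharing IDs.
OPTIONAL_FIELDS = {"trace_and_span_ids"}


@dataclass
class EvidencePacket:
    """Hypothetical record mirroring the evidence packet checklist."""
    monitor_time_range: str = ""
    metric_change: str = ""
    trace_view_or_export: str = ""
    trace_and_span_ids: list[str] = field(default_factory=list)
    behavior_link: str = ""
    issue_rationale: str = ""
    dataset_rows: list[str] = field(default_factory=list)
    evaluator_results_before: str = ""
    evaluator_results_after: str = ""
    deployment_comparison: str = ""
    target_environment: str = ""
    rollback_target: str = ""
    watch_plan: str = ""

    def missing(self) -> list[str]:
        """Return the names of required fields that are still empty."""
        return [
            f.name
            for f in fields(self)
            if f.name not in OPTIONAL_FIELDS and not getattr(self, f.name)
        ]


# Made-up values for illustration only.
packet = EvidencePacket(target_environment="production", rollback_target="previous version")
print("Still missing:", packet.missing())
```

A reviewer can block approval while `missing()` returns anything, which makes gaps in the packet visible before deployment.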
Investigate metric changes
Move from dashboard movement to trace evidence and next actions.
Inspect a trace
Read the request path before deciding what to change.
Build regression coverage
Preserve production failures as repeatable tests.
Release prompts safely
Deploy reviewed prompt versions with comparison, rollback, and monitoring.