Use this workflow when Monitor shows a spike, drop, or drift that deserves investigation.
Start with the shape of the change
Before opening individual traces, classify the change:
| Shape | What it suggests |
|---|---|
| Sudden spike | Deployment, traffic burst, retry loop, provider incident, or instrumentation change. |
| Slow drift | Prompt growth, changing user behavior, retrieval expansion, evaluator drift, or product mix shift. |
| One bad bucket | Temporary provider issue, batch job, customer event, or load test. |
| Repeated daily pattern | Scheduled traffic, business hours, regional traffic, or recurring automation. |
| Metric pair movement | A causal hint, such as input tokens rising before cost rises. |
Write down the suspected window and affected metric. That keeps the trace investigation focused.
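As a rough aid for the classification above, the following sketch labels a series of per-bucket metric values that you have copied or exported by hand. It is an illustration under stated assumptions, not a Monitor feature: `classify_shape`, its thresholds, and the plain list input are all hypothetical. It only looks at increases, and it does not try to detect daily patterns or metric pair movement.

```python
from statistics import median

def classify_shape(values, spike_ratio=2.0, drift_ratio=1.25):
    """Roughly label a list of per-bucket values, oldest first.

    Hypothetical helper. The ratios are illustrative, not product defaults.
    """
    if len(values) < 6:
        return "unclear"
    baseline = median(values)
    if baseline <= 0:
        return "unclear"

    # Buckets far above the typical level.
    high = [i for i, v in enumerate(values) if v > spike_ratio * baseline]
    if len(high) == 1:
        return "one bad bucket"
    if len(high) > 1 and high == list(range(high[0], high[-1] + 1)):
        # A contiguous run of high buckets: the level jumped and stayed up.
        return "sudden spike"

    # No outliers: compare the early half to the late half for gradual change.
    half = len(values) // 2
    early, late = median(values[:half]), median(values[half:])
    if early > 0 and late / early >= drift_ratio:
        return "slow drift"
    return "unclear"
```

Because the baseline is the median of the whole window, a spike that covers more than half of the buckets will shift the baseline and go undetected. Pick a window that includes enough normal buckets before the change.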
Investigate by metric
Traffic changed
If Logs changed:
- Confirm whether the change matches application traffic.
- Open Traces with the same time range.
- Filter by environment, route, release, user segment, or API key metadata if available.
- Check whether trace names or span names changed.
- If traffic unexpectedly dropped to zero, inspect instrumentation and API key usage.
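One way to confirm whether a change matches application traffic is to compare per-bucket request counts from your own application metrics with the counts shown in Monitor. The sketch below assumes you have both as simple dictionaries keyed by bucket label. The function name, the input shape, and the 0.9 threshold are assumptions for illustration.

```python
def coverage_gaps(app_requests, logged, min_coverage=0.9):
    """Return buckets where logged traffic is well below application traffic.

    app_requests and logged are hypothetical {bucket_label: count} mappings.
    Low coverage hints at an instrumentation or API key problem rather than
    a real traffic drop.
    """
    gaps = []
    for bucket, expected in app_requests.items():
        if expected == 0:
            continue  # nothing to compare against
        coverage = logged.get(bucket, 0) / expected
        if coverage < min_coverage:
            gaps.append((bucket, round(coverage, 2)))
    return gaps
```

If the application served requests in a bucket but nothing was logged, treat that as an instrumentation question first and a traffic question second.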
Latency increased
If Avg latency increased:
- Filter traces by high latency.
- Open slow traces and switch between tree and waterfall views.
- Identify whether time is spent in model spans, tool calls, retrieval, guardrails, or orchestration.
- Compare slow traces to normal traces from the same route.
- Add a latency evaluator if the workflow needs a release gate.
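The comparison between a slow trace and a normal trace can be sketched as a per-category time breakdown. The span fields used here (`kind`, `start_ms`, `end_ms`) are assumed names for hand-exported data, not a documented export schema. The sketch also assumes you pass leaf spans only, because summing nested or parallel spans would count the same time more than once.

```python
from collections import defaultdict

def time_by_kind(spans):
    """Sum span durations per category (hypothetical 'kind' field)."""
    totals = defaultdict(float)
    for span in spans:
        totals[span["kind"]] += span["end_ms"] - span["start_ms"]
    return dict(totals)

def biggest_shift(slow_spans, normal_spans):
    """Return the category whose total time grew the most, plus all deltas."""
    slow, normal = time_by_kind(slow_spans), time_by_kind(normal_spans)
    deltas = {
        kind: slow.get(kind, 0.0) - normal.get(kind, 0.0)
        for kind in set(slow) | set(normal)
    }
    return max(deltas, key=deltas.get), deltas
```

A single pair of traces is only a hint. Repeat the comparison across several slow and normal traces from the same route before concluding where the time goes.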
Cost increased
If Avg cost increased:
- Filter traces by high cost.
- Compare input tokens, output tokens, model, provider, and tool calls.
- Check whether a prompt deployment increased system instructions, examples, retrieval context, or output length.
- Add cost and response-length evaluators if the issue should block future releases.
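When comparing model and provider across high-cost traces, grouping helps show whether the increase is concentrated in one combination. This is a minimal sketch over hand-exported records. The `provider`, `model`, and `cost_usd` keys are assumed names chosen for the example.

```python
from collections import defaultdict

def cost_by_model(traces):
    """Group trace costs by (provider, model) and summarize each group.

    Each trace is a hypothetical dict with 'provider', 'model', 'cost_usd'.
    """
    groups = defaultdict(list)
    for trace in traces:
        groups[(trace["provider"], trace["model"])].append(trace["cost_usd"])
    return {
        key: {"count": len(costs), "avg_cost_usd": sum(costs) / len(costs)}
        for key, costs in groups.items()
    }
```

Run it once on traces from before the suspected window and once on traces from inside it. A new key or a jump in one group's average narrows the search.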
Tokens changed
If input tokens increased:
- Inspect retrieved context.
- Check conversation history size.
- Compare prompt messages before and after deployment.
- Look for new dynamic prompt or API dataset columns in evaluation workflows.
If output tokens increased:
- Inspect response examples.
- Check whether instructions now ask for longer reasoning, summaries, or explanations.
- Add a response-length evaluator when brevity matters.
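To compare token usage before and after a deployment, a small before/after summary is often enough to tell whether input or output tokens moved. The sketch below assumes two lists of trace records with `input_tokens` and `output_tokens` keys. Those names, and the function itself, are illustrative rather than part of any export format.

```python
from statistics import mean

def token_shift(before, after, fields=("input_tokens", "output_tokens")):
    """Compare average token counts between two groups of traces.

    before/after are non-empty lists of hypothetical dicts carrying the
    fields named in `fields`.
    """
    report = {}
    for name in fields:
        b = mean(t[name] for t in before)
        a = mean(t[name] for t in after)
        change = round(100 * (a - b) / b, 1) if b else None
        report[name] = {"before": b, "after": a, "change_pct": change}
    return report
```

A large change in `input_tokens` with flat `output_tokens` points toward retrieved context, conversation history, or prompt messages. The reverse points toward instructions that ask for longer responses.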
Eval score dropped
If Avg eval score dropped:
- Filter traces by evaluator failure or low score when available.
- Open failed spans and read evaluator reasons.
- Check whether failures cluster into a Behavior.
- Save representative spans to a regression dataset.
- Decide whether to fix the evaluator, dataset, prompt, tool behavior, or deployment.
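Before deciding what to fix, it can help to see which evaluators account for most failures and to pick a few spans per evaluator to read. The following sketch assumes evaluated spans are available as dictionaries with `span_id`, `evaluator`, and `score` keys, and that a score below 0.5 counts as a failure. Both the field names and the threshold are assumptions.

```python
from collections import defaultdict

def group_failures(spans, threshold=0.5, samples=3):
    """List failing evaluators, most frequent first, with sample span IDs.

    spans: hypothetical dicts with 'span_id', 'evaluator', 'score'.
    Returns [(evaluator, failure_count, sample_span_ids), ...].
    """
    failed = defaultdict(list)
    for span in spans:
        if span["score"] < threshold:
            failed[span["evaluator"]].append(span["span_id"])
    ranked = sorted(failed.items(), key=lambda item: len(item[1]), reverse=True)
    return [(name, len(ids), ids[:samples]) for name, ids in ranked]
```

The sample span IDs are candidates for the regression dataset. Read the evaluator reasons on each one first, because a cluster of failures for the wrong reason means the evaluator needs the fix.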
Move from trace evidence to action
Use this decision table after inspecting representative traces:
| Evidence | Next action |
|---|---|
| One isolated trace failed | Save or annotate only if the case is important. |
| Many traces share the same semantic pattern | Open Behaviors. |
| Prompt instructions caused the failure | Start or review an Improve cycle. |
| Dataset coverage is missing | Add rows in Datasets. |
| Evaluator failed for the wrong reason | Update the evaluator before changing the prompt. |
| Tool/backend failed | Debug the tool or service before changing prompt text. |
| Deployment introduced the issue | Compare deployments and roll back if needed. |
Build the incident packet
For serious changes, keep a small packet of evidence:
- Monitor time range and metric screenshot.
- Filtered trace view or export.
- Representative trace IDs and span IDs.
- Behavior link if a cluster exists.
- Dataset rows added for regression coverage.
- Evaluator failures or score changes.
- Deployment comparison or rollback decision.
This packet becomes the review context for prompt owners, release owners, and support teams.
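If your team keeps incident packets in a repository or ticket, a small structured record makes gaps easy to spot. The sketch below mirrors the checklist above as a dataclass. Every field name is an assumption chosen for this example, and the Behavior link is treated as optional because a cluster may not exist.

```python
from dataclasses import asdict, dataclass, field
from typing import List

# Fields that may legitimately stay empty (assumption for this sketch).
OPTIONAL_FIELDS = {"behavior_url"}

@dataclass
class IncidentPacket:
    """Hypothetical record of the evidence gathered for one incident."""
    metric: str
    window_start: str
    window_end: str
    screenshot_path: str = ""
    trace_view_url: str = ""
    trace_ids: List[str] = field(default_factory=list)
    span_ids: List[str] = field(default_factory=list)
    behavior_url: str = ""
    dataset_row_ids: List[str] = field(default_factory=list)
    evaluator_notes: str = ""
    deployment_decision: str = ""

    def missing(self):
        """Names of required fields that are still empty, in field order."""
        return [
            name
            for name, value in asdict(self).items()
            if not value and name not in OPTIONAL_FIELDS
        ]
```

Calling `missing()` before a review gives a quick list of evidence that still needs to be collected.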
Do not approve an Improve candidate from a dashboard metric alone. Move through traces, behaviors, evaluators, and datasets until the change is grounded in examples.