Skip to main content
Use this workflow when Monitor shows a spike, drop, or drift that deserves investigation.

Start with the shape of the change

Before opening individual traces, classify the change:
ShapeWhat it suggests
Sudden spikeDeployment, traffic burst, retry loop, provider incident, or instrumentation change.
Slow driftPrompt growth, changing user behavior, retrieval expansion, evaluator drift, or product mix shift.
One bad bucketTemporary provider issue, batch job, customer event, or load test.
Repeated daily patternScheduled traffic, business hours, regional traffic, or recurring automation.
Metric pair movementA causal hint, such as input tokens rising before cost rises.
Write down the suspected window and affected metric. That keeps the trace investigation focused.

Investigate by metric

Traffic changed

If Logs changed:
  1. Confirm whether the change matches application traffic.
  2. Open Traces with the same time range.
  3. Filter by environment, route, release, user segment, or API key metadata if available.
  4. Check whether trace names or span names changed.
  5. If traffic unexpectedly dropped to zero, inspect instrumentation and API key usage.

Latency increased

If Avg latency increased:
  1. Filter traces by high latency.
  2. Open slow traces and switch between tree and waterfall views.
  3. Identify whether time is spent in model spans, tool calls, retrieval, guardrails, or orchestration.
  4. Compare slow traces to normal traces from the same route.
  5. Add a latency evaluator if the workflow needs a release gate.

Cost increased

If Avg cost increased:
  1. Filter traces by high cost.
  2. Compare input tokens, output tokens, model, provider, and tool calls.
  3. Check whether a prompt deployment increased system instructions, examples, retrieval context, or output length.
  4. Add cost and response-length evaluators if the issue should block future releases.

Tokens changed

If input tokens increased:
  • Inspect retrieved context.
  • Check conversation history size.
  • Compare prompt messages before and after deployment.
  • Look for new dynamic prompt or API dataset columns in evaluation workflows.
If output tokens increased:
  • Inspect response examples.
  • Check whether instructions now ask for longer reasoning, summaries, or explanations.
  • Add a response-length evaluator when brevity matters.

Eval score dropped

If Avg eval score dropped:
  1. Filter traces by evaluator failure or low score when available.
  2. Open failed spans and read evaluator reasons.
  3. Check whether failures cluster into a Behavior.
  4. Save representative spans to a regression dataset.
  5. Decide whether to fix the evaluator, dataset, prompt, tool behavior, or deployment.

Move from trace evidence to action

Use this decision table after inspecting representative traces:
EvidenceNext action
One isolated trace failedSave or annotate only if the case is important.
Many traces share the same semantic patternOpen Behaviors.
Prompt instructions caused the failureStart or review an Improve cycle.
Dataset coverage is missingAdd rows in Datasets.
Evaluator failed for the wrong reasonUpdate the evaluator before changing the prompt.
Tool/backend failedDebug the tool or service before changing prompt text.
Deployment introduced the issueCompare deployments and roll back if needed.

Build the incident packet

For serious changes, keep a small packet of evidence:
  • Monitor time range and metric screenshot.
  • Filtered trace view or export.
  • Representative trace IDs and span IDs.
  • Behavior link if a cluster exists.
  • Dataset rows added for regression coverage.
  • Evaluator failures or score changes.
  • Deployment comparison or rollback decision.
This packet becomes the review context for prompt owners, release owners, and support teams.
Do not approve an Improve candidate from a dashboard metric alone. Move through traces, behaviors, evaluators, and datasets until the change is grounded in examples.