Investigate metric changes

Use this workflow when Monitor shows a spike, drop, or drift that deserves investigation.

Start with the shape of the change

Before opening individual traces, classify the change:

Shape	What it suggests
Sudden spike	Deployment, traffic burst, retry loop, provider incident, or instrumentation change.
Slow drift	Prompt growth, changing user behavior, retrieval expansion, evaluator drift, or product mix shift.
One bad bucket	Temporary provider issue, batch job, customer event, or load test.
Repeated daily pattern	Scheduled traffic, business hours, regional traffic, or recurring automation.
Metric pair movement	A causal hint, such as input tokens rising before cost rises.

Write down the suspected window and affected metric. That keeps the trace investigation focused.

Investigate by metric

Traffic changed

If Logs changed:

Confirm whether the change matches application traffic.
Open Traces with the same time range.
Filter by environment, route, release, user segment, or API key metadata if available.
Check whether trace names or span names changed.
If traffic unexpectedly dropped to zero, inspect instrumentation and API key usage.

Latency increased

If Avg latency increased:

Filter traces by high latency.
Open slow traces and switch between tree and waterfall views.
Identify whether time is spent in model spans, tool calls, retrieval, guardrails, or orchestration.
Compare slow traces to normal traces from the same route.
Add a latency evaluator if the workflow needs a release gate.

Cost increased

If Avg cost increased:

Filter traces by high cost.
Compare input tokens, output tokens, model, provider, and tool calls.
Check whether a prompt deployment increased system instructions, examples, retrieval context, or output length.
Add cost and response-length evaluators if the issue should block future releases.

Tokens changed

If input tokens increased:

Inspect retrieved context.
Check conversation history size.
Compare prompt messages before and after deployment.
Look for new dynamic prompt or API dataset columns in evaluation workflows.

If output tokens increased:

Inspect response examples.
Check whether instructions now ask for longer reasoning, summaries, or explanations.
Add a response-length evaluator when brevity matters.

Eval score dropped

If Avg eval score dropped:

Filter traces by evaluator failure or low score when available.
Open failed spans and read evaluator reasons.
Check whether failures cluster into a Behavior.
Save representative spans to a regression dataset.
Decide whether to fix the evaluator, dataset, prompt, tool behavior, or deployment.

Move from trace evidence to action

Use this decision table after inspecting representative traces:

Evidence	Next action
One isolated trace failed	Save or annotate only if the case is important.
Many traces share the same semantic pattern	Open Behaviors.
Prompt instructions caused the failure	Start or review an Improve cycle.
Dataset coverage is missing	Add rows in Datasets.
Evaluator failed for the wrong reason	Update the evaluator before changing the prompt.
Tool/backend failed	Debug the tool or service before changing prompt text.
Deployment introduced the issue	Compare deployments and roll back if needed.

Build the incident packet

For serious changes, keep a small packet of evidence:

Monitor time range and metric screenshot.
Filtered trace view or export.
Representative trace IDs and span IDs.
Behavior link if a cluster exists.
Dataset rows added for regression coverage.
Evaluator failures or score changes.
Deployment comparison or rollback decision.

This packet becomes the review context for prompt owners, release owners, and support teams.

Do not approve an Improve candidate from a dashboard metric alone. Move through traces, behaviors, evaluators, and datasets until the change is grounded in examples.

​Start with the shape of the change

​Investigate by metric

​Traffic changed

​Latency increased

​Cost increased

​Tokens changed

​Eval score dropped

​Move from trace evidence to action

​Build the incident packet

Start with the shape of the change

Investigate by metric

Traffic changed

Latency increased

Cost increased

Tokens changed

Eval score dropped

Move from trace evidence to action

Build the incident packet