Logs become useful when they change what your team ships. Use this workflow when Monitor or Traces reveals a prompt issue: a bad answer, missing instruction, weak tool-use decision, format error, safety issue, or repeated Behavior that should be fixed in the prompt.

Start with evidence

Use Monitor charts, filters, Traces, or Behaviors to find representative examples. Good evidence includes:
  • A trace name that explains the request.
  • The selected model span with input and output content.
  • Relevant tool calls or retrieval spans.
  • Cost, latency, tokens, and status.
  • Evaluator results when available.
  • Metadata such as prompt, model, environment, release, route, or customer-safe segment.
Avoid fixing from a chart alone. Open at least one representative trace before deciding the prompt is the right layer to change.
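
As a rough illustration of the evidence list above, the sketch below collects one representative trace into a plain record and reports which pieces are still missing. The `TraceEvidence` class and all of its field names are hypothetical, chosen for this example only; they are not part of any Adaline SDK or export format.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TraceEvidence:
    """Hypothetical record for one representative trace (not an Adaline type)."""

    trace_name: str
    model_input: str
    model_output: str
    tool_calls: list[str] = field(default_factory=list)  # tool or retrieval spans
    cost_usd: Optional[float] = None
    latency_ms: Optional[float] = None
    total_tokens: Optional[int] = None
    status: Optional[str] = None
    evaluator_results: dict[str, float] = field(default_factory=dict)
    metadata: dict[str, str] = field(default_factory=dict)  # prompt, model, environment, ...

    def missing_fields(self) -> list[str]:
        """Name the required evidence groups that are still empty."""
        missing = []
        if not self.trace_name.strip():
            missing.append("trace_name")
        if not self.model_input or not self.model_output:
            missing.append("model_span_content")
        runtime = (self.cost_usd, self.latency_ms, self.total_tokens, self.status)
        if any(value is None for value in runtime):
            missing.append("runtime_metrics")
        return missing


evidence = TraceEvidence(
    trace_name="refund-request",
    model_input="Customer asks for a refund after 45 days.",
    model_output="Refund approved.",
)
print(evidence.missing_fields())
```

Tool calls and evaluator results are treated as optional here because the list above describes them as relevant "when available"; a team that always requires them could add them to `missing_fields`.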

Reproduce or preserve the case

For model spans, use the trace side sheet actions:
[Screenshot: selected model span with Open in Playground and Add to Dataset actions]
  • Open in Playground when you want to reproduce the production call and inspect the prompt behavior.
  • Add to Dataset when the case should become regression coverage.
  • Review Evaluations when a score or reason explains the failure.
  • Inspect Raw when an external system or deeper debugging needs exact payload evidence.

Decide the fix layer

What you find | Better next step
Prompt output is wrong despite good context | Run or review an Improve cycle, or edit the prompt directly.
Tool arguments are wrong | Improve tool-use instructions, tool schema, or prompt examples.
Tool response is wrong or slow | Debug the tool/backend before changing the prompt.
Evaluator reason is wrong | Update evaluator criteria or dataset labels.
Production case should never regress | Add the span to a dataset and attach evaluators.
Pattern repeats across many traces | Review the related Behavior before choosing examples.
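
The mapping above can be kept as a small lookup so that triage notes use consistent wording. This is a minimal sketch: the `Finding` enum, the `FIX_LAYER` table, and the layer labels ("prompt", "tool", "evaluator", "dataset", "behavior") are assumptions made for this example, not Adaline terminology or API.

```python
from enum import Enum


class Finding(Enum):
    """Hypothetical labels for what a trace review turned up."""

    WRONG_OUTPUT_GOOD_CONTEXT = "prompt output is wrong despite good context"
    WRONG_TOOL_ARGUMENTS = "tool arguments are wrong"
    BAD_TOOL_RESPONSE = "tool response is wrong or slow"
    WRONG_EVALUATOR_REASON = "evaluator reason is wrong"
    MUST_NOT_REGRESS = "production case should never regress"
    REPEATED_PATTERN = "pattern repeats across many traces"


# Each finding maps to (layer to change, suggested next step).
FIX_LAYER = {
    Finding.WRONG_OUTPUT_GOOD_CONTEXT: (
        "prompt",
        "Run or review an Improve cycle, or edit the prompt directly.",
    ),
    Finding.WRONG_TOOL_ARGUMENTS: (
        "prompt",
        "Improve tool-use instructions, tool schema, or prompt examples.",
    ),
    Finding.BAD_TOOL_RESPONSE: (
        "tool",
        "Debug the tool/backend before changing the prompt.",
    ),
    Finding.WRONG_EVALUATOR_REASON: (
        "evaluator",
        "Update evaluator criteria or dataset labels.",
    ),
    Finding.MUST_NOT_REGRESS: (
        "dataset",
        "Add the span to a dataset and attach evaluators.",
    ),
    Finding.REPEATED_PATTERN: (
        "behavior",
        "Review the related Behavior before choosing examples.",
    ),
}


def is_prompt_fix(finding: Finding) -> bool:
    """True when the suggested change lives in the prompt layer."""
    return FIX_LAYER[finding][0] == "prompt"


layer, step = FIX_LAYER[Finding.BAD_TOOL_RESPONSE]
print(layer, "->", step)
```

Only two of the six findings point at the prompt in this sketch, which mirrors the guidance to confirm the right layer before editing.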

Close the loop

A strong prompt-improvement workflow leaves a durable artifact at each step:
  1. Filter or inspect logs to find representative evidence.
  2. Add useful spans to datasets.
  3. Attach or refine evaluators that measure the failure mode.
  4. Run Improve against the attached prompt when the issue belongs in prompt behavior.
  5. Review the candidate, audit packet, datasets, generated evaluators, and runtime impact.
  6. Deploy through Adaline or your external release process.
  7. Watch Monitor after release to confirm quality, cost, latency, and Behavior movement.
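
For the final step, one simple way to describe "movement" is the relative change in each metric's mean between a window before the release and a window after it. The sketch below assumes the values were exported or copied by hand into plain lists; the `metric_movement` function and the metric names are hypothetical and do not correspond to a Monitor API.

```python
from statistics import mean
from typing import Optional


def metric_movement(
    before: dict[str, list[float]],
    after: dict[str, list[float]],
) -> dict[str, Optional[float]]:
    """Relative change in the mean of each metric present in both windows.

    A value of 0.10 means the mean rose by 10 percent after the release.
    Metrics with an empty window are skipped; a zero baseline yields None.
    """
    movement: dict[str, Optional[float]] = {}
    for name in sorted(before.keys() & after.keys()):
        if not before[name] or not after[name]:
            continue
        baseline = mean(before[name])
        if baseline == 0:
            movement[name] = None
        else:
            movement[name] = (mean(after[name]) - baseline) / baseline
    return movement


# Illustrative numbers only, not real measurements.
before = {"latency_ms": [100, 200], "eval_score": [0.5, 0.5]}
after = {"latency_ms": [120, 180], "eval_score": [0.75, 0.75]}
print(metric_movement(before, after))
```

A comparison of means is deliberately crude; it ignores sample size and variance, so treat a small shift as a prompt to open traces rather than as proof of a regression or an improvement.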

Review a cycle

Inspect the evidence packet and candidate prompt changes.

Auto Generated Evaluators

Preserve production lessons as reusable checks.

Synthetic Datasets

Extend weak coverage from behavior evidence.

Auto Prompt optimization

Understand how prompt candidates are optimized from evidence.