When something goes wrong in production — a hallucination, a format error, an edge case your prompt doesn’t handle — the fastest path to a fix starts in your logs. The Monitor pillar connects directly back to the Iterate and Evaluate pillars, letting you go from spotting an issue to deploying a fix in a single workflow. Filter down to the problematic log, open it in the Playground with the exact production settings, fix the prompt, account for the case in your datasets, and ship the improvement.

Spot the issue

The first step is finding the logs that indicate a problem. Use filters and search to narrow down to the signals that matter:
| Signal | How to find it |
| --- | --- |
| Failed requests | Filter traces by failure status to find errors, timeouts, and rejected requests. |
| Low-quality responses | Filter spans with low continuous evaluation scores. |
| User complaints | Filter by tags like thumbs-down or attributes like `user_feedback: negative`. See Log User Feedback for setup. |
| Cost outliers | Filter by cost above a threshold to find expensive requests. |
| Latency spikes | Filter by duration to find requests that exceeded your SLA. |
| Anomalies in charts | Spot trend changes in analytics charts, then click a data point to drill into the underlying traces. |
Once you have filtered to the relevant logs, click into a trace or span to inspect the full details — input messages, model response, token usage, cost, and evaluation results.
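The signals above boil down to simple predicates over logged spans. A minimal sketch of that filtering logic, assuming a simplified span record (status, evaluation score, cost, duration) rather than Adaline's actual log schema:

```javascript
// Hypothetical span shape for illustration only; real logs carry richer data.
// Each predicate mirrors one filter from the table above.
const isFailed = (span) => span.status === "error" || span.status === "timeout";
const isLowQuality = (span, minScore) => span.evalScore < minScore;
const isCostOutlier = (span, maxCostUsd) => span.costUsd > maxCostUsd;
const isLatencySpike = (span, slaMs) => span.durationMs > slaMs;

// Example: narrow a batch of spans to the ones worth debugging.
const spans = [
  { id: "a", status: "ok", evalScore: 0.9, costUsd: 0.002, durationMs: 800 },
  { id: "b", status: "timeout", evalScore: 0.0, costUsd: 0.001, durationMs: 30000 },
  { id: "c", status: "ok", evalScore: 0.3, costUsd: 0.08, durationMs: 1200 },
];
const problematic = spans.filter(
  (s) =>
    isFailed(s) ||
    isLowQuality(s, 0.5) ||
    isCostOutlier(s, 0.05) ||
    isLatencySpike(s, 5000)
);
// problematic contains spans "b" and "c"
```

In practice you would express these thresholds in the Monitor's filter UI rather than in code; the point is that each signal is a cheap, mechanical condition you can apply to every span.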

Reproduce in the Playground

Every span in the Monitor has an Open in Playground button. Clicking it loads the exact request configuration into the Playground — the same prompt messages, model settings, variable values, and tools that were active when the request ran in production. This means you can reproduce the exact issue your user experienced:
  1. Click “Open in Playground” on the problematic span.
  2. Run the prompt — the Playground executes with the same inputs and settings, producing the same (or similar) problematic output.
  3. Confirm the issue — verify that you can see the bug, hallucination, format error, or whatever went wrong.
Reproducing the issue with production inputs is critical. Synthetic test cases often miss the specific combination of inputs, context length, or edge conditions that triggered the problem in the first place.
The Playground preserves the exact model, parameters, and variable values from the production request. This eliminates guesswork — you are debugging with real data, not approximations.

Iterate on a fix

With the issue reproduced, you can now iterate directly in the Editor. After each change, run the prompt again in the Playground with the same inputs, and keep iterating until the output is correct.

Common issues and fixes

| Issue | What you see in logs | How to fix |
| --- | --- | --- |
| Hallucinations | Response contains incorrect information not supported by context. | Add stronger grounding instructions, provide more context, or use retrieval to inject factual data. |
| Format errors | Output doesn't match expected structure (JSON, markdown, etc.). | Add explicit format instructions, use JSON schema response format, or add a JavaScript evaluator. |
| Tone issues | Response uses inappropriate or inconsistent voice. | Refine system message persona instructions, add tone examples using multi-shot prompting. |
| Edge cases | Inputs the prompt doesn't handle well (unusual queries, empty inputs, multi-language). | Add handling instructions, provide examples of tricky inputs, use text matcher evaluation to catch patterns. |
| Excessive cost | Span shows high token usage relative to output quality. | Shorten system prompts, reduce few-shot examples, prune context, or switch to a smaller model. |
| High latency | Span duration exceeds SLA requirements. | Reduce prompt complexity, switch to a faster model, or enable streaming. |
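As an illustration of the format-error row, a JavaScript evaluator for structured output can parse the response and check for required fields. The function shape below is a generic sketch, not Adaline's exact evaluator API:

```javascript
// Sketch of a format evaluator: checks that a model response is valid JSON
// and contains the fields downstream code expects.
function evaluateJsonFormat(response, requiredFields) {
  let parsed;
  try {
    parsed = JSON.parse(response);
  } catch (err) {
    return { pass: false, reason: `invalid JSON: ${err.message}` };
  }
  const missing = requiredFields.filter((f) => !(f in parsed));
  if (missing.length > 0) {
    return { pass: false, reason: `missing fields: ${missing.join(", ")}` };
  }
  return { pass: true, reason: "ok" };
}

// Example: a support-bot response that must include `answer` and `sources`.
const good = evaluateJsonFormat('{"answer": "42", "sources": []}', ["answer", "sources"]);
const bad = evaluateJsonFormat("Sure! Here is the answer...", ["answer", "sources"]);
// good.pass === true; bad.pass === false
```

A deterministic check like this is cheaper and more reliable than an LLM judge for structural requirements, so it is a good first evaluator to attach to format-sensitive prompts.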

Account for the case in your datasets

Once you have fixed the prompt, add the original failing case to a dataset so it becomes a permanent regression test. This ensures the issue never silently reappears after future prompt changes. From the span you debugged:
  1. Click “Add to Dataset” to add the original production inputs and the (previously broken) output as a new row.
  2. Add an expected output column — write the correct response by hand, so evaluators can compare against it.
  3. Add an annotation — note what went wrong and what you changed, so your team has context.
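Conceptually, the resulting dataset row bundles the failing input, the corrected expectation, and the annotation. Shown here as a plain object for clarity — the field names are assumptions, not Adaline's storage format:

```javascript
// Illustrative regression-row shape (field names are assumptions).
const regressionRow = {
  input: { userMessage: "What is your refund policy for digital goods?" },
  observedOutput: "We refund everything, always.", // the broken production output
  expectedOutput: "Digital goods are refundable within 14 days if unused.",
  annotation: "Hallucinated policy; added grounding instructions to the system prompt.",
};
```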
Over time, these cases accumulate into a comprehensive regression dataset that represents every real issue your prompt has encountered and recovered from.

Set up an evaluator

With the failing case in your dataset, configure an evaluator to catch this class of issue automatically:
  • LLM-as-a-Judge — Write a rubric that checks for the specific quality dimension that failed (e.g., factual accuracy, format compliance, tone).
  • JavaScript — Write code to validate structured outputs, enforce business rules, or check for specific patterns.
  • Text Matcher — Check for required keywords, banned phrases, or regex patterns.
  • Cost or Latency — Set thresholds to catch operational regressions.
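The Text Matcher option above can be approximated by a small check for required keywords and banned phrases. This is a generic sketch of the idea, not Adaline's built-in matcher:

```javascript
// Generic text matcher: passes only if every required pattern appears
// and no banned pattern does. Patterns are case-insensitive regexes.
function textMatch(output, { required = [], banned = [] }) {
  const hits = (p) => new RegExp(p, "i").test(output);
  const missingRequired = required.filter((p) => !hits(p));
  const bannedFound = banned.filter((p) => hits(p));
  return {
    pass: missingRequired.length === 0 && bannedFound.length === 0,
    missingRequired,
    bannedFound,
  };
}

// Example: a refund-policy answer must cite the policy and avoid boilerplate refusals.
const result = textMatch("Per our refund policy, you are eligible.", {
  required: ["refund policy"],
  banned: ["as an AI", "I cannot"],
});
// result.pass === true
```

Because the result reports which patterns matched or were missing, a failing check also tells you why the case failed, which speeds up triage when the evaluator runs across a whole dataset.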
Run an evaluation against your dataset to verify that the fix passes on the new case and does not break any existing cases.

Deploy the fix

Once your evaluation confirms the fix works:
  1. Deploy the improved prompt to production.
  2. Monitor the incoming logs and charts to confirm the fix holds under real traffic.
  3. Check continuous evaluations — if enabled, the eval score chart should reflect the improvement.
If the issue recurs or new problems surface, repeat the cycle. Each iteration strengthens both your prompt and your test suite — fewer issues reach users, and your evaluation datasets become increasingly comprehensive.

The full workflow

  1. Filter logs to find the problematic span.
  2. Open in Playground to reproduce with exact production inputs.
  3. Diagnose the root cause — instructions, model, missing context, or edge case.
  4. Iterate on the prompt in the Editor until the output is correct.
  5. Add to dataset so the case becomes a permanent regression test.
  6. Set up evaluators to catch this class of issue automatically.
  7. Evaluate to verify the fix works across all test cases.
  8. Deploy the improved prompt to production.
  9. Monitor to confirm the fix holds in production.

Next steps

Filter and Search Logs

Find the logs that indicate problems.

Build Datasets from Logs

Turn production cases into evaluation datasets.

Run Prompts in Playground

Test and iterate on prompts interactively.

Evaluate Prompts

Verify fixes across all test cases.