After an evaluation completes, Adaline generates a detailed report with per-row results, evaluator scores, and comparison tools. Use these reports to identify failures, spot patterns, and drive prompt improvements.

View evaluation results

Once your evaluation finishes, the results are displayed automatically.

Inspect individual test cases

Click on any row to inspect the details of a specific test case — including the model’s full response, each evaluator’s score, and the reason for pass or failure.

Filter results

Use the Filter option to narrow down the results and focus on what matters most. Common filters include:
  • Pass/Fail — Focus on failing test cases to prioritize fixes.
  • Evaluator type — Isolate results from a specific evaluator (e.g., only LLM-as-a-Judge scores).
  • Score range — Find test cases with borderline scores that may need attention.
  • Search — Find specific patterns or keywords in test case data.
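If you export run results for offline analysis, the same filters are easy to reproduce in code. The sketch below assumes a hypothetical list-of-dicts export (the field names `passed`, `evaluator`, `score`, and `input` are illustrative, not Adaline's actual schema):

```python
# Hypothetical exported result rows (illustrative schema, not Adaline's).
rows = [
    {"id": 1, "passed": True,  "evaluator": "llm_judge",   "score": 0.92, "input": "refund policy"},
    {"id": 2, "passed": False, "evaluator": "llm_judge",   "score": 0.48, "input": "refund edge case"},
    {"id": 3, "passed": False, "evaluator": "exact_match", "score": 0.00, "input": "greeting"},
    {"id": 4, "passed": True,  "evaluator": "exact_match", "score": 0.75, "input": "refund policy"},
]

def filter_rows(rows, passed=None, evaluator=None, score_range=None, search=None):
    """Mirror the UI filters: pass/fail, evaluator type, score range, keyword search."""
    out = rows
    if passed is not None:
        out = [r for r in out if r["passed"] == passed]
    if evaluator is not None:
        out = [r for r in out if r["evaluator"] == evaluator]
    if score_range is not None:
        lo, hi = score_range
        out = [r for r in out if lo <= r["score"] <= hi]
    if search is not None:
        out = [r for r in out if search in r["input"]]
    return out

failing = filter_rows(rows, passed=False)            # rows 2 and 3
borderline = filter_rows(rows, score_range=(0.4, 0.6))  # row 2 only
```

Filters compose, so `filter_rows(rows, evaluator="llm_judge", passed=False)` narrows to failing LLM-as-a-Judge rows, matching how you would stack filters in the UI.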

Compare across evaluation runs

When you run multiple evaluations, you can navigate between each run’s results using the graph view. Hover over a point to see that run’s summary, and click a point to view its detailed results. The Go to latest button jumps to the most recent run. You can view and compare any of the last 20 evaluation runs. This is valuable for:
  • Tracking progress — See how scores improve as you refine your prompt.
  • A/B testing — Compare results between different prompt versions or model configurations.
  • Regression detection — Spot cases where a prompt change caused previously passing test cases to fail.
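Regression detection in particular is easy to make mechanical once you have per-case pass/fail data for two runs. A minimal sketch, assuming a hypothetical mapping of test-case id to pass/fail per run (not Adaline's actual export format):

```python
def find_regressions(prev_run, new_run):
    """Return test-case ids that passed in the previous run but fail in the new one."""
    prev_pass = {case_id for case_id, passed in prev_run.items() if passed}
    # A case missing from the new run is treated as failing, so it surfaces too.
    return sorted(case_id for case_id in prev_pass if not new_run.get(case_id, False))

prev_run = {"tc-1": True, "tc-2": True,  "tc-3": False}
new_run  = {"tc-1": True, "tc-2": False, "tc-3": True}
regressed = find_regressions(prev_run, new_run)  # -> ["tc-2"]
```

Note that `tc-3` flipping from fail to pass is an improvement, not a regression, so only `tc-2` is reported.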

Act on insights

Use report data to drive systematic prompt improvements:
  1. Prioritize failures — Focus on test cases that fail consistently across multiple runs.
  2. Open in Playground — Click on any failing test case to open it in the Playground for interactive debugging.
  3. Identify patterns — Look for common themes in failures (certain input types, specific evaluator criteria, edge cases).
  4. Refine and re-evaluate — Update your prompt in the Editor, then run a new evaluation to measure the impact.
  5. Compare runs — Use the run comparison graph to verify that your changes improved overall scores without introducing regressions.
When you identify a common failure pattern, add more test cases to your dataset that target that specific pattern. This builds a robust test suite that catches regressions as your prompt evolves.
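Step 1 above — prioritizing cases that fail consistently — can be sketched as a simple failure count across recent runs. The per-run `{case_id: passed}` shape here is an assumption for illustration, not Adaline's real data model:

```python
from collections import Counter

def prioritize_failures(runs):
    """Rank test cases by how often they fail across runs,
    so the most persistent failures are fixed first."""
    counts = Counter()
    for run in runs:  # each run maps test-case id -> passed (bool)
        counts.update(cid for cid, passed in run.items() if not passed)
    return [cid for cid, _ in counts.most_common()]

runs = [
    {"tc-1": False, "tc-2": True,  "tc-3": False},
    {"tc-1": False, "tc-2": False, "tc-3": True},
    {"tc-1": False, "tc-2": True,  "tc-3": True},
]
ranked = prioritize_failures(runs)  # tc-1 fails in all three runs, so it ranks first
```

Working the ranked list top-down keeps effort focused on failures that are reproducible rather than flaky one-offs.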

Next steps

Evaluate Prompts

Run another evaluation with updated prompts.

Deploy Your Prompt

Deploy your validated prompt to production.