Take your configured evaluators from setup to insights. Run tests, view detailed results, and iterate based on real performance data.

Features

Execute Evaluations

Run your tests against real datasets automatically:

  • Run evaluations in the background for large datasets
  • Get instant insight into prompt performance
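
Since this section names no concrete SDK, the sketch below uses plain Python to illustrate the idea: evaluator runs are fanned out over dataset rows with a thread pool so a large dataset does not block the caller. The `evaluate_row` function and the dataset fields (`input`, `response`, `expected`) are hypothetical stand-ins for whatever your evaluator actually checks.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_row(row):
    # Hypothetical evaluator: pass if the response contains the expected answer.
    passed = row["expected"].lower() in row["response"].lower()
    return {"input": row["input"], "passed": passed}

dataset = [
    {"input": "capital of France?", "response": "Paris is the capital.", "expected": "Paris"},
    {"input": "2 + 2?", "response": "The answer is 5.", "expected": "4"},
]

# Run rows concurrently so evaluation of large datasets happens in the background.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(evaluate_row, dataset))

print(sum(r["passed"] for r in results), "of", len(results), "rows passed")
```

A real runner would add retries and persist each row's result, but the shape is the same: one evaluator call per dataset row, executed off the main thread.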

Analyze Results

View detailed reports to understand evaluation outcomes:

  • Inspect individual evaluation runs
  • View which dataset rows passed or failed the evaluation
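
As a minimal illustration of inspecting a run, the snippet below groups per-row results into passed and failed sets and prints the failure reasons. The result records and their fields (`row`, `passed`, `reason`) are assumptions, not a documented schema.

```python
# Hypothetical per-row results from a single evaluation run.
results = [
    {"row": 1, "passed": True,  "reason": "exact match"},
    {"row": 2, "passed": False, "reason": "missing citation"},
    {"row": 3, "passed": False, "reason": "missing citation"},
]

failed = [r for r in results if not r["passed"]]

# Surface which rows failed and why, the core of a per-run report.
for r in failed:
    print(f"row {r['row']}: {r['reason']}")
```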

Search and Filter
Find specific patterns in your evaluation data:

  • Filter by status, reason, response content, or variables
  • Search across tokens, cost, and latency metrics
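
Filtering like this reduces to simple predicates over run records. The sketch below assumes hypothetical field names (`status`, `reason`, `tokens`, `cost`, `latency_ms`) and shows one metric filter and one text search; it is an illustration of the pattern, not a documented query API.

```python
# Hypothetical evaluation-run records with status, reason, and metrics.
runs = [
    {"status": "failed", "reason": "timeout",    "tokens": 512, "cost": 0.004, "latency_ms": 3200},
    {"status": "passed", "reason": "",           "tokens": 128, "cost": 0.001, "latency_ms": 450},
    {"status": "failed", "reason": "bad format", "tokens": 96,  "cost": 0.001, "latency_ms": 300},
]

# Filter by status combined with a latency threshold.
slow_failures = [
    r for r in runs
    if r["status"] == "failed" and r["latency_ms"] > 1000
]

# Search by reason text.
timeouts = [r for r in runs if "timeout" in r["reason"]]
```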

Rollback and Iterate

Restore previous states to understand performance changes:

  • Return to the exact prompt configurations from any evaluation run
  • Compare different versions to optimize prompts
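
The mechanism behind rollback can be sketched as snapshots: each run keeps a copy of the prompt configuration it used, so any run can be restored and two runs can be diffed. The `history` structure, run IDs, and `restore`/`diff` helpers below are hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical store: each evaluation run snapshots the prompt config it used.
history = {
    "run-1": {"template": "Answer briefly: {question}", "temperature": 0.2},
    "run-2": {"template": "Answer with citations: {question}", "temperature": 0.7},
}

def restore(run_id):
    # Return a copy so the stored snapshot stays immutable.
    return dict(history[run_id])

def diff(a, b):
    # Keys whose values differ between two runs, with both values shown.
    return {k: (history[a].get(k), history[b].get(k))
            for k in history[a].keys() | history[b].keys()
            if history[a].get(k) != history[b].get(k)}

current = restore("run-1")          # roll back to run-1's exact configuration
changes = diff("run-1", "run-2")    # see what differed between the two runs
```

Keeping snapshots immutable is what makes the comparison trustworthy: the diff always reflects what actually ran, not what the prompt looks like now.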