Evaluations
Overview
Run and analyze evaluations at scale.
Take your configured evaluators from setup to insights. Run tests, view detailed results, and iterate based on real performance data.
Features
Execute Evaluations
Run your tests against real datasets automatically:
- Run evaluations in the background for large datasets
- Get instant insights on prompt performance
Analyze Results
View detailed reports to understand evaluation outcomes:
- Inspect individual evaluation runs
- View which dataset rows passed or failed the evaluation
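As a rough illustration of per-row pass/fail results, the sketch below summarizes a small set of rows. The row structure and the field names (`row`, `status`, `reason`) are assumptions made for this example, not the product's actual export schema.

```python
from collections import Counter

# Hypothetical result rows; the field names are illustrative assumptions.
rows = [
    {"row": 1, "status": "passed", "reason": ""},
    {"row": 2, "status": "failed", "reason": "missing keyword"},
    {"row": 3, "status": "passed", "reason": ""},
]

def summarize(rows):
    """Return the total row count, the pass rate, and the IDs of failed rows."""
    counts = Counter(r["status"] for r in rows)
    total = len(rows)
    pass_rate = counts["passed"] / total if total else 0.0
    failed_rows = [r["row"] for r in rows if r["status"] == "failed"]
    return {"total": total, "pass_rate": pass_rate, "failed_rows": failed_rows}

summary = summarize(rows)
```

Here `summary["failed_rows"]` lists the dataset rows to inspect first, which mirrors how a report view might surface failures.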
Filter and Search
Find specific patterns in your evaluation data:
- Filter by status, reason, response content, or variables
- Search across token, cost, and latency metrics
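The kind of filtering described above can be sketched in a few lines. This is a minimal example that assumes results are available as a list of dictionaries. The helper `filter_rows` and every field name (`status`, `reason`, `response`, `tokens`, `latency_ms`) are hypothetical and do not reflect the product's real API.

```python
# Hypothetical result rows; the field names are illustrative assumptions.
rows = [
    {"status": "passed", "reason": "", "response": "Paris is the capital.",
     "tokens": 42, "latency_ms": 310},
    {"status": "failed", "reason": "missing keyword", "response": "I am not sure.",
     "tokens": 18, "latency_ms": 950},
    {"status": "failed", "reason": "too long", "response": "Paris, of course...",
     "tokens": 400, "latency_ms": 1200},
]

def filter_rows(rows, status=None, reason_contains=None,
                response_contains=None, max_latency_ms=None):
    """Keep rows that satisfy every filter that was provided (None means 'skip')."""
    matches = []
    for r in rows:
        if status is not None and r["status"] != status:
            continue
        if reason_contains is not None and reason_contains.lower() not in r["reason"].lower():
            continue
        if response_contains is not None and response_contains.lower() not in r["response"].lower():
            continue
        if max_latency_ms is not None and r["latency_ms"] > max_latency_ms:
            continue
        matches.append(r)
    return matches

# Failed rows that still responded within one second.
fast_failures = filter_rows(rows, status="failed", max_latency_ms=1000)
```

Combining filters with AND semantics keeps the behavior predictable: each extra filter can only narrow the result set.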
Rollback and Iterate
Restore previous states to understand performance changes:
- Restore the exact prompt configuration used in any evaluation run
- Compare different versions to optimize prompts
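To make version comparison concrete, the sketch below diffs the prompt text from two runs using Python's standard `difflib` module. The prompt strings and the labels `run_a` and `run_b` are invented for illustration. How the product itself stores or compares versions is not specified here.

```python
import difflib

# Hypothetical prompt texts from two evaluation runs.
old_prompt = "You are a helpful assistant.\nAnswer briefly."
new_prompt = "You are a helpful assistant.\nAnswer briefly and cite sources."

def diff_prompts(old, new, old_label="run_a", new_label="run_b"):
    """Return a unified diff of two prompt texts as a list of lines."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile=old_label,
            tofile=new_label,
            lineterm="",
        )
    )

diff_lines = diff_prompts(old_prompt, new_prompt)
print("\n".join(diff_lines))
```

Reading the diff alongside each run's pass rate can help tie a change in results to a specific change in the prompt.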