Features
Execute Evaluations

- Run evaluations in the background for large datasets
- Get immediate insight into prompt performance
Analyze Results

- Inspect individual evaluation runs
- View which dataset rows passed or failed the evaluation
Filter and Search

- Filter by status, reason, response content, or variables
- Search across tokens, cost, and latency metrics
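
As a rough illustration of what these filters do, the sketch below applies the same kind of filtering to evaluation rows in Python. The `EvalRow` record, its field names, and the `filter_rows` helper are assumptions made for illustration only; they are not the product's actual schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class EvalRow:
    # Hypothetical shape of one evaluated dataset row.
    status: str
    reason: str
    response: str
    variables: dict = field(default_factory=dict)
    tokens: int = 0
    cost: float = 0.0
    latency_ms: float = 0.0


def filter_rows(rows, status=None, text=None, max_latency_ms=None):
    """Return the rows that match every filter that is not None."""
    matched = []
    for row in rows:
        if status is not None and row.status != status:
            continue
        # Case-insensitive substring match on the response content.
        if text is not None and text.lower() not in row.response.lower():
            continue
        if max_latency_ms is not None and row.latency_ms > max_latency_ms:
            continue
        matched.append(row)
    return matched


# Made-up sample rows, not real evaluation output.
rows = [
    EvalRow("passed", "exact match", "Paris", {"country": "France"}, 12, 0.0002, 310.0),
    EvalRow("failed", "wrong answer", "Lyon", {"country": "France"}, 11, 0.0002, 295.0),
    EvalRow("failed", "timeout", "", {"country": "Peru"}, 0, 0.0, 5000.0),
]

all_failures = filter_rows(rows, status="failed")
fast_failures = filter_rows(rows, status="failed", max_latency_ms=1000)
```

Combining filters narrows the result: here `all_failures` holds both failed rows, while `fast_failures` keeps only the failed row that responded within one second.
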
Rollback and Iterate

- Return to the exact prompt configuration used in any evaluation run
- Compare different versions to optimize prompts
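
One simple way to compare versions is by pass rate per evaluation run. The sketch below assumes each run can be reduced to a list of per-row statuses; the `runs` mapping, the version labels, and the `pass_rate` helper are hypothetical and only illustrate the comparison.

```python
def pass_rate(statuses):
    """Fraction of rows whose status is 'passed'; 0.0 for an empty run."""
    if not statuses:
        return 0.0
    return sum(1 for s in statuses if s == "passed") / len(statuses)


# Hypothetical runs: prompt version label -> per-row statuses.
runs = {
    "v1": ["passed", "failed", "failed", "passed"],
    "v2": ["passed", "passed", "failed", "passed"],
}

rates = {version: pass_rate(statuses) for version, statuses in runs.items()}
best = max(rates, key=rates.get)  # version with the highest pass rate
```

With this sample data, `best` is `"v2"`, which would be the configuration to return to or iterate on.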