What is Evaluate?

Evaluate lets you evaluate your prompts across thousands of rows. It’s your quality assurance center, where you test prompts against real-world scenarios, measure their effectiveness, and identify areas for improvement.

Here, you can run batch evaluations, compare different prompt versions, and ensure your AI solutions meet performance standards before going live.

Key Features

Datasets

Your evaluation foundation:

  • Create and manage test datasets with multiple data types.
  • Import existing data from CSV, JSON, or Excel files (see the sample after this list).
  • Perform row and column operations for data preparation.
  • Search and filter entries, and work with image data.
  • Build comprehensive test suites for thorough validation.
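
As a rough illustration of an import file (the column names here are hypothetical, not a required schema), a JSON dataset might look like this:

```json
[
  {
    "input": "Summarize the refund policy for annual plans.",
    "expected_output": "Annual plans can be refunded within 30 days of purchase."
  },
  {
    "input": "What payment methods are supported?",
    "expected_output": "Credit card, PayPal, and bank transfer."
  }
]
```

In this sketch, each object maps to a dataset row and each key to a column.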

Evaluation Runs

Execute and analyze your tests:

  • Run evaluations across entire datasets.
  • View detailed results and performance metrics.
  • Filter and search through evaluation history.
  • Review past runs and roll back to previous versions.
  • Open any evaluation directly in Playground for debugging.

Evaluators

Choose the right metrics for your use case:

  • Use LLM-as-a-Judge to assess quality.
  • Measure information retrieval accuracy.
  • Use JavaScript, JSON, and Text Matching evaluators for technical validation (see the sketch after this list).
  • Use metrics like Completion Length and Latency to measure performance.
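
As a sketch of what a custom JavaScript validator could look like (the function name, signature, and result shape are assumptions for illustration, not a documented contract), a simple keyword check might be written as:

```javascript
// Hypothetical custom validator: passes when the model output contains
// every comma-separated keyword listed in the expected output.
function evaluate(output, expectedOutput) {
  const required = expectedOutput
    .toLowerCase()
    .split(",")
    .map((keyword) => keyword.trim())
    .filter(Boolean);

  const text = output.toLowerCase();
  const missing = required.filter((keyword) => !text.includes(keyword));

  return {
    pass: missing.length === 0,
    score: required.length === 0 ? 1 : (required.length - missing.length) / required.length,
    reason:
      missing.length === 0
        ? "All required keywords found."
        : `Missing keywords: ${missing.join(", ")}`,
  };
}
```

A deterministic check like this complements LLM-as-a-Judge: it is cheap and repeatable, while the judge handles subjective qualities such as tone or helpfulness.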