Dataset structure
A dataset is a table:
- Rows are test cases.
- Columns map to prompt variables, expected outputs, labels, annotations, or evaluator inputs.
- Static columns store manually entered values.
- Dynamic API columns can resolve values from an API request.
- Dynamic prompt columns can resolve values from another prompt.
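The row and column model above can be pictured with a small sketch. Every name here (`StaticColumn`, `DynamicColumn`, `resolve_row`, the column names) is hypothetical and chosen for illustration only; it is not the product's API, and the `resolve` callable merely stands in for an API request or a call to another prompt.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union

Row = Dict[str, str]


@dataclass
class StaticColumn:
    """Column whose value is entered by hand and stored on the row."""
    name: str


@dataclass
class DynamicColumn:
    """Column whose value is computed when the row is resolved.

    `resolve` is a hypothetical stand-in for an API request or a call
    to another prompt; it receives the values resolved so far.
    """
    name: str
    resolve: Callable[[Row], str]


Column = Union[StaticColumn, DynamicColumn]


def resolve_row(raw: Row, columns: List[Column]) -> Row:
    """Return one test case with every column filled in, in column order."""
    resolved: Row = {}
    for column in columns:
        if isinstance(column, StaticColumn):
            resolved[column.name] = raw[column.name]
        else:
            resolved[column.name] = column.resolve(resolved)
    return resolved


columns: List[Column] = [
    StaticColumn("question"),
    StaticColumn("expected_output"),
    DynamicColumn("context", lambda row: f"lookup for: {row['question']}"),
]

raw_row = {"question": "What is the refund window?", "expected_output": "30 days"}
resolved_row = resolve_row(raw_row, columns)
```

In this sketch a dynamic column is resolved after the columns listed before it, so it can depend on static values from the same row.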
Common dataset types
| Dataset | Purpose |
|---|---|
| Golden set | Canonical examples that should keep passing. |
| Regression set | Cases that previously failed and must not fail again. |
| Production samples | Real traces selected from Monitor or Behaviors. |
| Adversarial set | Edge cases, unsafe inputs, ambiguity, or malformed requests. |
| Synthetic set | Generated examples used to broaden coverage. |
Dataset workflow
Create or import a dataset
Start from the Datasets library, import a CSV, or create rows from production evidence.
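Before importing a CSV, it can help to preview how it will turn into rows. The sketch below assumes the first line of the file holds the column names; the helper `load_rows` and the sample columns are illustrative and not part of the product.

```python
import csv
import io
from typing import Dict, List


def load_rows(csv_text: str) -> List[Dict[str, str]]:
    """Parse CSV text into one dict per row, keyed by the header names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


sample_csv = (
    "question,expected_output\n"
    "What is the refund window?,30 days\n"
    "Do you ship abroad?,Yes\n"
)
rows = load_rows(sample_csv)
```

Each resulting dict corresponds to one test case, with one key per column.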
Map columns to prompt variables
Make sure the prompt can resolve every required variable from the dataset row.
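One way to check this mapping offline is to compare the variables a prompt needs with the columns a row provides. The sketch below assumes prompt variables are written as `{name}` placeholders, which may differ from the actual template syntax; `required_variables` and `missing_variables` are hypothetical helpers.

```python
from string import Formatter
from typing import Dict, List, Optional, Set


def required_variables(template: str) -> Set[str]:
    """Collect placeholder names, assuming a `{name}` template syntax."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}


def missing_variables(template: str, row: Dict[str, Optional[str]]) -> List[str]:
    """List required variables that the row does not supply or leaves empty."""
    return sorted(
        name
        for name in required_variables(template)
        if row.get(name) in (None, "")
    )


template = "Answer the question: {question}\nUse this context: {context}"
complete_row = {"question": "What is the refund window?", "context": "Policy v2"}
incomplete_row = {"question": "Do you ship abroad?", "context": ""}

missing_in_complete = missing_variables(template, complete_row)
missing_in_incomplete = missing_variables(template, incomplete_row)
```

A row with a non-empty result from `missing_variables` would leave the prompt with an unresolved variable, so it is worth fixing before running evaluators against it.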
Dataset links
A dataset can be linked to prompts in two main ways:
- An evaluator attached to a prompt uses the dataset.
- A dynamic prompt column references a prompt.
Improve and dataset drafts
Improve can generate or prepare dataset evidence during a cycle. Treat that evidence as draft review material until the cycle is approved. After approval, review whether the resulting rows should become part of a long-lived regression set.
Create and import datasets
Add manual rows, import CSVs, copy traces, copy playground cases, and generate synthetic rows.
Manage columns and rows
Design static, API, and prompt-backed columns that prompts and evaluators can use.
Regression coverage
Turn production failures and Behaviors into durable tests.
Evaluators
Define the criteria datasets should score against.