Adding production logs to a dataset captures the raw data — inputs, outputs, and metadata. But raw data alone is not enough for high-quality evaluations. You need human judgment: why a response failed, what the correct answer should have been, and which cases are high priority. Annotation columns and dataset filters let you build a structured review queue so your team can systematically annotate every row, turning a log dump into a curated evaluation suite.

Add annotation columns

When you set up a dataset, you can add columns beyond the prompt’s input and output variables. These extra columns hold human annotations and review metadata that live alongside the production data. Recommended annotation columns:
Column | Type | Purpose
annotation | Free text | Reviewer notes explaining why the response is wrong and what needs to change.
annotation_status | Label | Tracks whether the row has been reviewed: empty for pending, filled for completed.
feedback_category | Label | Structured classification such as correct, incorrect, hallucination, off-topic, or too-long.
expected_output | Free text | The ideal response a reviewer writes by hand, used as a reference for LLM-as-a-Judge evaluations.
priority | Label | Urgency flag (high, medium, low) so reviewers can triage effectively.
The annotation_status column is the key to building a review queue. Every row added from Monitor starts with annotation_status = empty, signaling that it needs human review. Once a reviewer fills in their annotations, they set the status to filled.
See Setup Dataset for how to add and configure columns. Column names are flexible — use whatever naming convention fits your team, as long as you stay consistent.
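The column set above can be sketched as a plain-Python schema. This is only an illustration of the recommended defaults: the real columns are configured in the dataset UI, and the `DatasetRow` class and its field names here are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch of a dataset row with the recommended annotation
# columns; in the product these are configured when setting up the dataset.
@dataclass
class DatasetRow:
    input: str                               # the prompt's input variables
    output: str                              # the model's response from Monitor
    annotation: str = ""                     # free-text reviewer notes
    annotation_status: str = "empty"         # "empty" = pending, "filled" = reviewed
    feedback_category: Optional[str] = None  # e.g. "hallucination", "off-topic"
    expected_output: Optional[str] = None    # hand-written reference answer
    priority: Optional[str] = None           # "high" | "medium" | "low"

# Rows added from Monitor carry only input/output; everything else
# starts at its pending default, so the row lands in the review queue.
row = DatasetRow(input="What is our refund policy?", output="Refunds take 90 days.")
print(row.annotation_status)  # → empty
```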

Build a review queue

The review queue is simply your dataset filtered to show only rows where annotation columns are still empty. This gives reviewers a focused list of exactly the rows that need attention — no searching, no guesswork.

Filter for unannotated rows

Open your dataset and apply a filter on the annotation_status column:
  • annotation_status = empty — Shows all rows that have been added from Monitor but not yet reviewed by a human.
This filtered view is your annotation queue. Each row in the queue is a production case waiting for human judgment. As reviewers work through the queue and set annotation_status = filled, rows disappear from the filtered view automatically.

[Image: Dataset filtered to show unannotated rows]
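The filter's behavior amounts to selecting rows whose status column is still empty. A minimal sketch over plain-dict rows (the filter itself is applied in the dataset UI; `review_queue` is a hypothetical helper):

```python
# Rows as they might look after being added from Monitor and partially reviewed.
rows = [
    {"input": "q1", "output": "a1", "annotation_status": "empty"},
    {"input": "q2", "output": "a2", "annotation_status": "filled"},
    {"input": "q3", "output": "a3", "annotation_status": "empty"},
]

def review_queue(rows):
    """Return only the rows still awaiting human review."""
    return [r for r in rows if r["annotation_status"] == "empty"]

print(len(review_queue(rows)))  # → 2
```

Marking a row as filled removes it from the result on the next pass, which is exactly how the filtered view "shrinks" as reviewers work.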

Work through the queue

For each row in the queue:
  1. Read the input and output — Understand what the user asked and what the model produced.
  2. Classify the issue — Set feedback_category to a structured label (hallucination, incorrect, off-topic, etc.) so you can analyze failure patterns later.
  3. Write the annotation — Explain in annotation what went wrong and what the correct behavior should be.
  4. Write the expected output (optional) — If the row will be used for LLM-as-a-Judge evaluations, write the ideal response in expected_output.
  5. Set priority (optional) — Flag rows that need urgent prompt fixes.
  6. Mark as filled — Set annotation_status = filled to remove the row from the queue.
You can also filter by feedback_category or priority to focus on specific failure types or urgent cases. Combining filters lets different reviewers own different slices of the queue.
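The six steps above can be condensed into a single annotate-and-close action. This is a hypothetical sketch over a plain-dict row; in practice each field is edited directly in the dataset UI, and the `annotate` helper is an illustration only.

```python
def annotate(row, category, note, expected=None, priority=None):
    """Fill a row's annotation columns and remove it from the review queue."""
    # Steps 2-3: classify the failure and explain what went wrong.
    row["feedback_category"] = category
    row["annotation"] = note
    # Steps 4-5 (optional): reference answer for LLM-as-a-Judge, and triage flag.
    if expected is not None:
        row["expected_output"] = expected
    if priority is not None:
        row["priority"] = priority
    # Step 6: mark as filled so the row drops out of the filtered queue.
    row["annotation_status"] = "filled"
    return row

row = {"input": "q", "output": "a", "annotation_status": "empty"}
annotate(row, category="hallucination",
         note="Cites a policy clause that does not exist.",
         priority="high")
print(row["annotation_status"])  # → filled
```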

End-to-end workflow

Annotation works best as a recurring loop that connects monitoring, human review, and evaluation:
1. Filter logs in Monitor

Use filters to find the logs that matter — failures, low eval scores, negative user feedback, or edge cases.
2. Add to dataset

Add selected spans to a dataset. The span’s input variables and output are mapped to dataset columns automatically. New rows arrive with annotation_status = empty.
3. Review the queue

Open the dataset and filter for annotation_status = empty. This is your team’s review backlog. Work through it by filling annotation columns and marking rows as filled.
4. Run evaluations

Evaluate your prompt against the annotated dataset. Rows with expected_output filled in can be scored by LLM-as-a-Judge evaluators that compare the model’s response against your reference answer.
5. Fix and redeploy

Use evaluation results to iterate on your prompt. Once fixes pass evaluation, deploy the improved prompt. The annotated rows stay in the dataset as permanent regression tests.
Each cycle adds more annotated rows to your dataset, making your evaluations more comprehensive and your regression safety net stronger over time.
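One cycle of the loop can be sketched end to end. Everything here is a stand-in: the monitor logs and dataset are plain lists, and exact-match scoring substitutes for a real LLM-as-a-Judge evaluator; a real setup runs these steps through the platform's UI or SDK.

```python
# Step 1: logs from Monitor, filtered down to failures (low eval scores).
monitor_logs = [
    {"input": "q1", "output": "wrong answer", "eval_score": 0.2},
    {"input": "q2", "output": "fine answer", "eval_score": 0.9},
]

# Step 2: add failing spans to the dataset; new rows arrive pending review.
dataset = [
    {**log, "annotation_status": "empty", "expected_output": None}
    for log in monitor_logs if log["eval_score"] < 0.5
]

# Step 3: a reviewer works the queue, writing the reference answer.
for row in dataset:
    if row["annotation_status"] == "empty":
        row["expected_output"] = "right answer"
        row["annotation_status"] = "filled"

# Step 4: score annotated rows; exact match stands in for an LLM judge
# comparing the model's response against the reference answer.
scores = [
    row["output"] == row["expected_output"]
    for row in dataset if row["expected_output"] is not None
]
print(f"{sum(scores)}/{len(scores)} passing")

# Step 5: fix the prompt, re-run, and keep the rows as regression tests.
```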

Next steps

Build Datasets from Logs

Add production spans to datasets for evaluation.

Setup Dataset

Create and configure datasets with annotation columns.

Evaluate Prompts

Run evaluations against your annotated datasets.

Use Logs to Improve Prompts

Debug and fix issues found in production logs.