Continuous evaluations run evaluators on production model spans so quality appears beside runtime metrics. Use them when you want Monitor to show not only whether the agent answered, but whether the answer met your criteria.

How it works

  1. A prompt is deployed and receiving logged model spans.
  2. Evaluators are attached to the prompt or evaluation workflow.
  3. Adaline scores sampled production spans.
  4. Results appear in span details and Monitor charts.
  5. Failing or important spans can become dataset rows, Behavior evidence, or Improve context.
Figure: Monitor quality charts showing eval score, eval pass rate by evaluator, and errors by type.
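
The flow above can be sketched in a few lines of Python. This is an illustrative model only, not Adaline's API: `Span`, `EvalResult`, `non_empty`, `score_sampled_spans`, the `sample_rate` parameter, and the 0.5 pass threshold are all hypothetical names and values chosen for the example.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Span:
    """Hypothetical stand-in for a logged model span."""
    span_id: str
    output: str


@dataclass
class EvalResult:
    """Hypothetical record of one evaluator's verdict on one span."""
    span_id: str
    evaluator: str
    score: float
    passed: bool
    reason: str


# Assumed evaluator shape: takes a span, returns (score in [0, 1], reason).
Evaluator = Callable[[Span], Tuple[float, str]]


def non_empty(span: Span) -> Tuple[float, str]:
    """Toy evaluator: passes when the model produced any visible text."""
    if span.output.strip():
        return 1.0, "output present"
    return 0.0, "output is empty"


def score_sampled_spans(
    spans: List[Span],
    evaluators: Dict[str, Evaluator],
    sample_rate: float,
    pass_threshold: float = 0.5,
    rng: Optional[random.Random] = None,
) -> List[EvalResult]:
    """Score a random sample of spans with every attached evaluator."""
    if rng is None:
        rng = random.Random()
    results: List[EvalResult] = []
    for span in spans:
        # rng.random() is in [0, 1), so a rate of 1.0 keeps every span
        # and a rate of 0.0 keeps none.
        if rng.random() >= sample_rate:
            continue
        for name, evaluator in evaluators.items():
            score, reason = evaluator(span)
            results.append(
                EvalResult(span.span_id, name, score, score >= pass_threshold, reason)
            )
    return results
```

Each result keeps the span ID and a reason string, which mirrors how a score is read alongside its explanation on a single span.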

What to monitor

Signal                        What it means
Average eval score            Overall quality trend for evaluated production traffic.
Eval pass rate by evaluator   Which quality dimensions are passing or failing.
Errors by type                Where failures are concentrated.
Span evaluation tab           The exact score and reason for one model span.

If quality drops, open the traces behind the chart before changing the prompt. The failure could be a prompt issue, missing retrieval context, tool behavior, evaluator criteria, or bad instrumentation.
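
To make the first two signals concrete, here is a minimal sketch of how an average eval score and a per-evaluator pass rate could be computed from raw results. The record fields (`evaluator`, `score`, `passed`) and the sample values are assumptions for illustration; they are not the shape of any real export.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List

# Hypothetical evaluation records for four scored spans.
results: List[dict] = [
    {"evaluator": "format", "score": 1.0, "passed": True},
    {"evaluator": "format", "score": 0.0, "passed": False},
    {"evaluator": "safety", "score": 1.0, "passed": True},
    {"evaluator": "safety", "score": 0.5, "passed": True},
]


def average_score(records: List[dict]) -> float:
    """Mean score across all evaluated spans and evaluators."""
    return mean(r["score"] for r in records)


def pass_rate_by_evaluator(records: List[dict]) -> Dict[str, float]:
    """Fraction of passing results, grouped by evaluator name."""
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for r in records:
        total[r["evaluator"]] += 1
        if r["passed"]:
            passed[r["evaluator"]] += 1
    return {name: passed[name] / total[name] for name in total}


print(average_score(results))           # 0.625
print(pass_rate_by_evaluator(results))  # {'format': 0.5, 'safety': 1.0}
```

In this sample the average looks moderate, but the breakdown shows that only the `format` evaluator is failing, which is the kind of distinction the per-evaluator chart is meant to surface.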

Good evaluator coverage

Use continuous evaluations for checks that are meaningful on live traffic:
  • Safety or policy compliance.
  • Format or schema correctness.
  • Tool-use correctness.
  • Domain-specific answer quality.
  • Cost, latency, or response length thresholds.
  • Regression checks from important dataset rows.
For evaluator setup and types, see Evaluators overview.
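
As an illustration of two checks from the list above (format correctness and a response length threshold), the sketch below shows what such evaluators might compute. The function names, the returned `passed`/`score`/`reason` fields, and the required keys are hypothetical; real evaluators are configured in the product, not necessarily written this way.

```python
import json
from typing import Iterable


def evaluate_json_format(output: str, required_keys: Iterable[str]) -> dict:
    """Pass when the output is a JSON object containing every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError as exc:
        return {"passed": False, "score": 0.0, "reason": f"invalid JSON: {exc.msg}"}
    if not isinstance(parsed, dict):
        return {
            "passed": False,
            "score": 0.0,
            "reason": "top-level value is not a JSON object",
        }
    missing = sorted(k for k in required_keys if k not in parsed)
    if missing:
        return {
            "passed": False,
            "score": 0.0,
            "reason": f"missing keys: {', '.join(missing)}",
        }
    return {"passed": True, "score": 1.0, "reason": "valid JSON with all required keys"}


def evaluate_length(output: str, max_chars: int) -> dict:
    """Pass when the response stays within a character budget."""
    length = len(output)
    passed = length <= max_chars
    return {
        "passed": passed,
        "score": 1.0 if passed else 0.0,
        "reason": f"{length} characters (limit {max_chars})",
    }
```

Both checks are deterministic and cheap, which makes them reasonable candidates for live traffic; each returns a reason string so a failure can be understood without rerunning the check.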

Use results in the improvement loop

When a production span fails an evaluator:
  1. Open the trace and read the model span, tool context, and evaluator reason.
  2. Add the span to a dataset if it should become regression coverage.
  3. Update the evaluator if the score is wrong or too broad.
  4. Start or review an Improve cycle when the fix belongs in the prompt.
  5. Watch the eval score chart after release.
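
Step 2 can be pictured as a small transformation from failing spans to regression rows. The sketch below is a hypothetical data model, not an Adaline export or import format: the span fields (`span_id`, `input`, `output`, `evals`) and the row fields are assumptions made for the example.

```python
from typing import List


def failing_spans_to_rows(spans: List[dict]) -> List[dict]:
    """Turn spans with at least one failed evaluator into dataset rows.

    Rows are de-duplicated by input so that repeated production requests
    do not inflate the regression set.
    """
    rows: List[dict] = []
    seen_inputs = set()
    for span in spans:
        failed = [e for e in span["evals"] if not e["passed"]]
        if not failed:
            continue  # Span passed every evaluator; nothing to regress on.
        if span["input"] in seen_inputs:
            continue  # An earlier failing span already covers this input.
        seen_inputs.add(span["input"])
        rows.append(
            {
                "input": span["input"],
                "observed_output": span["output"],
                "failed_evaluators": [e["evaluator"] for e in failed],
                "reasons": [e["reason"] for e in failed],
                "source_span_id": span["span_id"],
            }
        )
    return rows
```

Keeping the evaluator reason and the source span ID on each row preserves the link back to the trace, which helps when deciding whether the fix belongs in the prompt or in the evaluator itself.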

Related pages

  • Evaluate prompts: Run evaluation checks across prompts and datasets.
  • Evaluators overview: Understand production scoring and pre-release checks.
  • Analyze log spans: Read evaluator results on a selected span.
  • Use logs to improve prompts: Turn failing production evidence into improvements.