When you run evaluations in the Evaluate pillar, you test against a fixed dataset that may not cover every production scenario. Continuous evaluations solve this by automatically assessing the quality of your deployed prompts on live traffic — catching regressions and new failure patterns as they happen.

How it works

Once configured, continuous evaluations automatically:
  1. Sample incoming spans — A percentage of LLM spans associated with your prompt are selected based on the sample rate.
  2. Run evaluators — The sampled spans’ output is evaluated against the evaluators you configured for the prompt.
  3. Store results — Evaluation scores are attached to the spans and visible in the logs and charts.
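The three steps above can be sketched in code. This is an illustrative model only, not the real SDK implementation; the `Span` and `Evaluator` shapes and the `processSpan` helper are assumptions for the sake of the example.

```typescript
// Illustrative sketch of the continuous-evaluation loop (not the actual SDK).
interface Span {
  output: string;
  scores: Record<string, number>;
}

type Evaluator = { name: string; score: (output: string) => number };

function processSpan(
  span: Span,
  sampleRate: number,
  evaluators: Evaluator[],
  rand: () => number = Math.random, // injectable for deterministic testing
): Span {
  // 1. Sample: only the fraction of spans given by sampleRate is selected.
  if (rand() >= sampleRate) return span;
  for (const ev of evaluators) {
    // 2. Run evaluators on the span's output.
    // 3. Store: scores are attached to the span itself.
    span.scores[ev.name] = ev.score(span.output);
  }
  return span;
}
```

With `sampleRate` at 1 every span passes the check; at 0 the check always fails and no evaluator runs, matching the table in the next section.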

Configure continuous evaluations

1. Select a prompt

Navigate to the Monitor section of a monitored project and select the prompt you want to continuously evaluate.
2. Set the sample rate

Define the continuous evaluation sample rate, a value between 0 and 1:

Sample rate | Behavior
0   | No logs are evaluated. Continuous evaluation is disabled.
0.5 | 50% of incoming spans are randomly sampled and evaluated.
1   | 100% of incoming spans are evaluated.
Start with a low sample rate (e.g., 0.1 or 0.2) and increase it as you gain confidence in your evaluator configuration. A rate of 1.0 evaluates every single request, which provides complete coverage but incurs additional cost.
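To reason about the cost trade-off, a quick back-of-the-envelope helper can help. This function is purely illustrative (it is not part of any SDK): it estimates how many spans a given sample rate will send to your evaluators per day.

```typescript
// Illustrative helper (not part of the SDK): estimate daily evaluation volume.
function estimatedEvaluationsPerDay(spansPerDay: number, sampleRate: number): number {
  if (sampleRate < 0 || sampleRate > 1) {
    throw new Error("sample rate must be between 0 and 1");
  }
  // Expected number of sampled spans is just traffic times the rate.
  return Math.round(spansPerDay * sampleRate);
}
```

For example, at 10,000 spans per day, a rate of 0.1 evaluates roughly 1,000 spans, while a rate of 1.0 evaluates all 10,000.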
3. Configure evaluators

Set up the evaluators that will run on sampled spans. See Setup Evaluators for the available evaluator types.
Once the evaluators and sample rate are configured, all incoming spans of type Model (LLM calls) associated with this prompt will have their outputs evaluated by the configured evaluators, subject to the sample rate.

Override the sample rate

You can force evaluation on a specific span regardless of the sample rate. This is useful for high-priority requests that you always want evaluated, or for testing your evaluator configuration on specific requests.
Set runEvaluation: true when creating a span or via the update method:
const span = trace.logSpan({
  name: "LLM Completion",
  promptId: "your-prompt-id",
  runEvaluation: true,
  tags: ["high-priority"],
});
You can also enable it after creation:
span.update({ runEvaluation: true });
See the Span class reference for the full API.
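A common pattern is to set the flag conditionally at span creation, so high-priority traffic is always evaluated while everything else follows the sample rate. The `trace.logSpan` call, `promptId`, and `runEvaluation` fields come from the snippet above; the tag-based condition and the helper function are illustrative.

```typescript
// Illustrative wrapper: force evaluation for high-priority requests only.
// `trace.logSpan` and `runEvaluation` are from the docs; the tag logic is ours.
function logCompletionSpan(trace: any, tags: string[]) {
  return trace.logSpan({
    name: "LLM Completion",
    promptId: "your-prompt-id",
    tags,
    // Bypass the sample rate whenever the request is tagged high-priority.
    runEvaluation: tags.includes("high-priority"),
  });
}
```

Spans created without the tag are still eligible for random sampling at the configured rate; `runEvaluation: true` only forces inclusion, it does not exclude others.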

View results

Continuous evaluation results appear in multiple places:
Location     | What you see
Span details | Evaluation score and reason attached to each evaluated span.
Charts       | The Avg eval score chart shows quality trends over time.
Trace view   | Evaluated spans are marked with their score in the trace tree.

Best practices

  • Start with a low sample rate — Begin at 0.1–0.2 to validate your evaluator configuration before scaling up.
  • Use representative evaluators — Choose evaluators that measure the dimensions most important to your use case.
  • Monitor the eval score chart — Watch the Avg eval score chart for trend changes that indicate quality regressions.
  • Combine with alerts — Set up alerts to get notified when eval scores drop below a threshold.
  • Iterate on evaluators — Refine your evaluator rubrics and thresholds based on what you observe in production.
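The alerting practice above amounts to a simple threshold check over a window of recent scores. The sketch below is an assumption about how you might implement it yourself (for example in a webhook consumer); it is not a built-in API.

```typescript
// Illustrative threshold check (assumed, not a built-in API): flag a window of
// recent eval scores whose average falls below an alert threshold.
function avgScore(scores: number[]): number {
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}

function shouldAlert(scores: number[], threshold: number): boolean {
  // Empty windows never alert; otherwise compare the window average.
  return scores.length > 0 && avgScore(scores) < threshold;
}
```

For example, a window of scores `[0.9, 0.8, 0.4]` averages 0.7 and would trip a 0.8 threshold, while `[0.9, 0.9]` would not.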

Next steps

Build Datasets from Logs

Capture production cases for offline evaluation.

Use Logs to Fix Prompts

Debug and improve prompts using production insights.