Evaluators define what good output means for a prompt. They score model responses during prompt evaluations, production monitoring, log review, and Improve cycles. Use Evaluators when a product rule, quality bar, safety requirement, format contract, cost budget, latency target, or production failure should become a repeatable check.

Figure: Evaluator setup in Adaline, showing evaluator configuration for prompt quality checks.

Evaluation anatomy

Every useful evaluation has five parts:
| Part | Meaning | Question it answers |
| --- | --- | --- |
| Data | Dataset rows, production spans, generated cases, or Improve examples. | Which customer situations are being tested? |
| Task or prompt | Prompt version, model settings, variables, tools, and response schema. | What behavior would the user experience? |
| Evaluator | Criterion used to score the response. | What does the team mean by good enough? |
| Result | Score, pass/fail, reason, cost, latency, or custom output. | Did it pass for the right reason? |
| Action | Approve, reject, edit, add coverage, Improve, deploy, or roll back. | What changes because of this result? |

If an evaluator result does not lead to a decision, it is still evidence, but it is not yet an operational quality gate.
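
The five parts can be pictured as one record per scored response. The sketch below is illustrative only: `EvaluationRecord` and `is_quality_gate` are hypothetical names, not part of any Adaline API, and the field values are made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvaluationRecord:
    """One scored response, with the five parts described above (hypothetical shape)."""
    data: str                     # which case was tested, e.g. a dataset row id
    task: str                     # prompt version and settings under test
    evaluator: str                # criterion used to score the response
    passed: bool                  # result
    reason: str                   # why it passed or failed
    action: Optional[str] = None  # decision taken because of the result, if any


def is_quality_gate(record: EvaluationRecord) -> bool:
    """A result works as an operational gate only once it is tied to a decision."""
    return record.action is not None


evidence_only = EvaluationRecord(
    data="row-17", task="support-prompt v3", evaluator="json-structure",
    passed=False, reason="missing key 'intent'",
)
gated = EvaluationRecord(
    data="row-17", task="support-prompt v3", evaluator="json-structure",
    passed=False, reason="missing key 'intent'", action="reject",
)
```

Here `evidence_only` is still useful evidence, while `gated` records the decision that the failure triggered.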

Evaluator types

| Evaluator | Use it for |
| --- | --- |
| LLM-as-a-Judge | Rubric-based quality, safety, policy, tone, and reasoning checks. |
| Custom Prompt | LLM-based evaluation with custom model configuration and prompt logic. |
| JavaScript | Deterministic checks, schema validation, custom scoring, and business rules. |
| JSON | Structured JSON checks and schema-like assertions. |
| API Call | External service checks through your own evaluator endpoint. |
| Text Matcher | Required or forbidden strings, regexes, and formatting markers. |
| Cost | Budget thresholds based on provider cost. |
| Latency | SLA thresholds based on runtime. |
| Response Length | Word, token, character, or brevity requirements. |

Prefer deterministic evaluators for exact rules. Use LLM-based evaluators when the criterion requires judgment, then calibrate them with known passing and failing examples.
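
To make "deterministic evaluators for exact rules" concrete, here is a minimal sketch of the kind of logic a Text Matcher or JSON check expresses. It is plain Python for illustration, not the platform's evaluator interface; the function names, argument shapes, and the `{"passed", "reason"}` return format are assumptions.

```python
import json
import re


def text_matcher(response: str, required: list[str], forbidden_patterns: list[str]) -> dict:
    """Pass only if every required string is present and no forbidden regex matches."""
    problems = [f"missing required text {s!r}" for s in required if s not in response]
    problems += [
        f"matched forbidden pattern {p!r}"
        for p in forbidden_patterns
        if re.search(p, response)
    ]
    return {"passed": not problems, "reason": "; ".join(problems) or "ok"}


def json_structure(response: str, required_keys: dict[str, type]) -> dict:
    """Pass only if the response is a JSON object with the expected keys and value types."""
    try:
        payload = json.loads(response)
    except json.JSONDecodeError as exc:
        return {"passed": False, "reason": f"invalid JSON: {exc.msg}"}
    if not isinstance(payload, dict):
        return {"passed": False, "reason": "top-level value is not an object"}
    problems = []
    for key, expected in required_keys.items():
        if key not in payload:
            problems.append(f"missing key {key!r}")
        elif not isinstance(payload[key], expected):
            problems.append(f"key {key!r} is not {expected.__name__}")
    return {"passed": not problems, "reason": "; ".join(problems) or "ok"}
```

Both checks return the same verdict for the same input every time, which is why they suit exact rules. Criteria such as tone or policy compliance do not reduce to string or type checks and are better left to an LLM-based evaluator.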

Where evaluators run

Figure: Evaluation report showing scored prompt outputs and detailed results.

| Workflow | What evaluators do |
| --- | --- |
| Prompt evaluations | Run against datasets before release or during prompt development. |
| Monitor and Logs | Score sampled production traffic for continuous quality signals. |
| Improve | Reject candidates that improve one behavior while regressing another. |

Draft evaluators created during an Improve cycle should be reviewed before they become trusted release gates.
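
The Improve row describes a regression check: a candidate should not be accepted just because one score went up. A minimal sketch of that rule follows, assuming each evaluator yields a pass rate between 0 and 1; `accept_candidate`, the score dictionaries, and the `tolerance` parameter are hypothetical and do not describe how Improve is implemented.

```python
def accept_candidate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.0,
) -> tuple[bool, list[str]]:
    """Accept only if some evaluator improved and none regressed beyond the tolerance.

    Scores are keyed by evaluator name. An evaluator missing from the
    candidate scores counts as 0.0, so it is treated as a regression.
    """
    regressions = sorted(
        name
        for name, base in baseline.items()
        if candidate.get(name, 0.0) < base - tolerance
    )
    improved = any(candidate.get(name, 0.0) > base for name, base in baseline.items())
    return (improved and not regressions, regressions)


baseline_scores = {"tone": 0.7, "schema": 1.0}
trades_schema_for_tone = {"tone": 0.9, "schema": 0.8}
improves_tone_only = {"tone": 0.9, "schema": 1.0}
```

With these inputs, `trades_schema_for_tone` is rejected and names `schema` as the regressed evaluator, while `improves_tone_only` is accepted.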

Online and offline evaluation

| Mode | Use it for | Source |
| --- | --- | --- |
| Offline evaluation | Pre-release checks, prompt comparison, regression testing, Improve candidate review, and CI/CD gates. | Curated datasets, golden examples, and production failures promoted into datasets. |
| Online evaluation | Continuous monitoring, silent failure detection, release watch, and drift investigation. | Production logs and spans with useful metadata. |

The strongest loop is: online failure -> log evidence -> dataset row -> evaluator -> offline release gate -> deployment -> online watch.
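
The middle of that loop, turning log evidence into dataset rows, can be sketched as follows. The `LogSpan` and `DatasetRow` shapes and the `promote_failures` helper are hypothetical stand-ins for whatever your logs and datasets actually contain.

```python
from dataclasses import dataclass


@dataclass
class LogSpan:
    """A scored production span (hypothetical shape)."""
    span_id: str
    variables: dict
    response: str
    evaluator_passed: bool
    reason: str


@dataclass
class DatasetRow:
    """A dataset row that remembers which production failure it came from."""
    variables: dict
    source_span_id: str
    failure_note: str


def promote_failures(spans: list[LogSpan], dataset: list[DatasetRow]) -> int:
    """Append one row per failing span, skipping spans that were already promoted."""
    already_promoted = {row.source_span_id for row in dataset}
    added = 0
    for span in spans:
        if span.evaluator_passed or span.span_id in already_promoted:
            continue
        dataset.append(DatasetRow(dict(span.variables), span.span_id, span.reason))
        already_promoted.add(span.span_id)
        added += 1
    return added
```

Keeping the source span id on each row makes the promotion safe to repeat and lets a reviewer trace an offline test case back to the production failure that motivated it.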

Create useful evaluators

1. Start from a failure mode
Use a Behavior, failing log, customer report, or product requirement to define what should pass or fail.

2. Choose the evaluator type
Use deterministic evaluators for exact rules and LLM-as-a-Judge for qualitative criteria.

3. Attach it to the prompt
Link the evaluator where it should run so evaluations and Improve cycles can use it.

4. Validate against examples
Run it against known passing and failing cases before relying on it for approval decisions.
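
Step 4 can be sketched as a small calibration check: run the evaluator over examples whose correct verdict is already known and list the disagreements. Everything here is an assumption for illustration, including the `calibrate` helper, the toy `mentions_refund` evaluator, and the example texts.

```python
from typing import Callable


def calibrate(evaluator: Callable[[str], bool], labeled: list[tuple[str, bool]]) -> dict:
    """Compare evaluator verdicts with known labels and report where they disagree."""
    if not labeled:
        raise ValueError("need at least one labeled example")
    false_passes, false_fails = [], []
    for text, should_pass in labeled:
        verdict = evaluator(text)
        if verdict and not should_pass:
            false_passes.append(text)   # evaluator is too lenient here
        elif not verdict and should_pass:
            false_fails.append(text)    # evaluator is too strict here
    agreed = len(labeled) - len(false_passes) - len(false_fails)
    return {
        "agreement": agreed / len(labeled),
        "false_passes": false_passes,
        "false_fails": false_fails,
    }


def mentions_refund(text: str) -> bool:
    """Toy evaluator: passes any response that mentions a refund."""
    return "refund" in text.lower()


known_cases = [
    ("Refund issued within 5 days.", True),
    ("We cannot help with that.", False),
    ("No refund is possible, sorry.", False),
]
report = calibrate(mentions_refund, known_cases)
```

The toy evaluator wrongly passes the third case, which shows why a keyword check is a poor proxy for "the customer was offered a refund". False passes matter most for release gates, because they let bad output through.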

Coverage checklist

For important prompts, cover the risks customers would notice:
| Risk | Coverage example |
| --- | --- |
| Correctness | LLM-as-a-Judge rubric, JavaScript business rule, or API evaluator. |
| Safety and policy | LLM-as-a-Judge rubric with explicit passing and failing examples. |
| Structure | JSON, JavaScript, or Text Matcher evaluator. |
| Tool behavior | Dataset rows requiring tool use plus output checks. |
| Latency | Latency evaluator for response-time budgets. |
| Cost and verbosity | Cost and Response Length evaluators. |
| Known regressions | Dataset rows created from production logs or Behaviors. |

Coverage does not need to be large to be useful. A small dataset with clear evaluators beats a large dataset with vague scoring.
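
One lightweight way to use the checklist is to record which evaluators or dataset rows cover each risk and list the gaps. The sketch below assumes you keep that mapping yourself; `uncovered_risks` and the coverage entries are hypothetical names.

```python
# Risks from the checklist above, in the same order.
CHECKLIST_RISKS = [
    "Correctness",
    "Safety and policy",
    "Structure",
    "Tool behavior",
    "Latency",
    "Cost and verbosity",
    "Known regressions",
]


def uncovered_risks(coverage: dict[str, list[str]]) -> list[str]:
    """Return checklist risks with no evaluator or dataset coverage recorded."""
    return [risk for risk in CHECKLIST_RISKS if not coverage.get(risk)]


# Hypothetical coverage for one prompt: an empty list means nothing is attached yet.
support_prompt_coverage = {
    "Correctness": ["refund-policy-rubric"],
    "Structure": ["reply-json-check"],
    "Latency": [],
}
gaps = uncovered_risks(support_prompt_coverage)
```

A gap list like this is a prompt for discussion rather than a score: some risks may not apply to a given prompt, and one clear evaluator per relevant risk is usually enough to start.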

Evaluator types: Choose the right evaluator for quality, schema, cost, latency, and custom rules.

Online and offline evaluation: Connect curated datasets, production scoring, release gates, and Improve.

Create useful evaluators: Turn product requirements and production failures into repeatable checks.

Evaluate prompts: Run prompt evaluations against datasets and review results.

Datasets: Store the cases evaluators should score.