Evaluator types

Evaluators turn product expectations into checks that can run on prompt outputs, datasets, production spans, and Improve candidates. Choose the evaluator type based on the kind of judgment you need. Adaline groups evaluator types into three families: AI-powered, performance, and code and structured.

AI-powered evaluators

Type	Use it for	Watch out for
LLM-as-a-Judge	Rubrics that require judgment, such as helpfulness, policy adherence, tone, reasoning quality, or whether an answer used context correctly.	Calibrate with passing and failing examples. Rubrics that are too broad become noisy.
Custom Prompt	Custom judge prompts with their own model configuration and prompt logic.	Treat it like another production prompt: it needs clear instructions, stable variables, and test cases.

AI-powered evaluators are best when a deterministic rule cannot capture quality. Use them for qualitative criteria, then validate the judge against examples your team agrees on.
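One way to validate a judge against agreed examples is to compare its verdicts with human labels and list the disagreements. The sketch below is illustrative only: the `calibrationSet` shape, the `humanLabel` and `judgeLabel` field names, and the `judgeAgreement` helper are all assumptions, not part of the Adaline API.

```javascript
// Hypothetical calibration set: each item pairs a label your team agreed on
// with the verdict the judge returned for the same output.
const calibrationSet = [
  { id: "ex-1", humanLabel: "pass", judgeLabel: "pass" },
  { id: "ex-2", humanLabel: "fail", judgeLabel: "fail" },
  { id: "ex-3", humanLabel: "fail", judgeLabel: "pass" },
  { id: "ex-4", humanLabel: "pass", judgeLabel: "pass" },
];

// Returns the share of examples where judge and humans agree,
// plus the ids of the examples where they do not.
function judgeAgreement(examples) {
  const disagreements = examples.filter(
    (ex) => ex.humanLabel !== ex.judgeLabel
  );
  const agreementRate =
    examples.length === 0
      ? 0
      : (examples.length - disagreements.length) / examples.length;
  return { agreementRate, disagreements: disagreements.map((ex) => ex.id) };
}

const report = judgeAgreement(calibrationSet);
console.log(report); // { agreementRate: 0.75, disagreements: [ 'ex-3' ] }
```

The disagreement list is usually the more useful output: each id points at an example where the rubric wording may be too broad.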

Performance evaluators

Type	Use it for	Configuration
Response Length	Enforce short, complete, or bounded answers.	Compare words, characters, or tokens with less-than, greater-than, or equals.
Latency	Enforce runtime SLAs.	Compare milliseconds or seconds with less-than, greater-than, or equals.
Cost	Enforce per-run budget constraints.	Compare USD cost with less-than, greater-than, or equals.

Performance evaluators can include cost and latency from linked dataset cells and prompt execution. Use them when product quality depends on speed or spend, not only answer correctness.
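Conceptually, each performance evaluator compares one measured value with a limit using one operator. The following is a minimal sketch of that comparison, not Adaline's implementation: the operator keys, the `run` record fields, and the whitespace-based word count are assumptions made for illustration.

```javascript
// Hypothetical operator names mirroring "less-than, greater-than, or equals".
const OPERATORS = {
  "less-than": (actual, limit) => actual < limit,
  "greater-than": (actual, limit) => actual > limit,
  equals: (actual, limit) => actual === limit,
};

// Assumption: a "word" is any run of non-whitespace characters.
function countWords(text) {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Compares a measured value with a limit and keeps the inputs for debugging.
function checkMetric(actual, operator, limit) {
  const compare = OPERATORS[operator];
  if (!compare) throw new Error(`Unknown operator: ${operator}`);
  return { passed: compare(actual, limit), actual, operator, limit };
}

// Hypothetical run record; field names are illustrative only.
const run = {
  response: "Refunds are processed within five business days.",
  latencyMs: 1840,
  costUsd: 0.013,
};

const results = {
  responseLength: checkMetric(countWords(run.response), "less-than", 200),
  latency: checkMetric(run.latencyMs, "less-than", 2000),
  cost: checkMetric(run.costUsd, "less-than", 0.02),
};
console.log(results);
```

Keeping `actual` and `limit` in the result makes a failure self-explanatory when a reviewer reads it later.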

Code and structured evaluators

Type	Use it for	Configuration
JavaScript	Deterministic business logic, schema validation, custom scoring, or checks that combine variables and response content.	Return `grade`, `score`, and `reason`.
JSON	Valid JSON, exact JSON match, or schema-style checks.	Choose valid-json, exact-match, or schema.
API Call	External evaluators owned by your system.	Configure an HTTP request that returns evaluation data.
Text Matcher	Required strings, forbidden strings, regex, starts-with, ends-with, contains-all, contains-any, or not-contains-any.	Use exact text checks for objective formatting rules.

Use deterministic evaluators whenever the rule is exact. They are easier to debug, cheaper to run, and more stable than a judge model.
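As an illustration of a deterministic check that combines variables with response content, here is a sketch of a JavaScript evaluator. Only the returned keys (`grade`, `score`, `reason`) come from the table above. The entry-point name `evaluate`, the `{ response, variables }` input shape, the `"pass"`/`"fail"` grade values, and the order fields are assumptions; check the product's evaluator editor for the actual contract.

```javascript
// ASSUMPTIONS: the function name, input shape, and grade values are hypothetical.
function evaluate({ response, variables }) {
  let parsed;
  try {
    parsed = JSON.parse(response);
  } catch (err) {
    return {
      grade: "fail",
      score: 0,
      reason: `Response is not valid JSON: ${err.message}`,
    };
  }
  if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) {
    return { grade: "fail", score: 0, reason: "Response must be a JSON object." };
  }

  // Each business rule adds one entry to `problems` when it is violated.
  const problems = [];
  if (parsed.orderId !== variables.orderId) {
    problems.push(`orderId should be "${variables.orderId}"`);
  }
  if (
    typeof parsed.refundAmount !== "number" ||
    parsed.refundAmount > variables.orderTotal
  ) {
    problems.push(
      `refundAmount must be a number no greater than ${variables.orderTotal}`
    );
  }

  const totalChecks = 2;
  return {
    grade: problems.length === 0 ? "pass" : "fail",
    score: (totalChecks - problems.length) / totalChecks,
    reason: problems.length === 0 ? "All checks passed." : problems.join("; "),
  };
}

const result = evaluate({
  response: '{"orderId": "A-100", "refundAmount": 250}',
  variables: { orderId: "A-100", orderTotal: 120 },
});
console.log(result);
// { grade: 'fail', score: 0.5,
//   reason: 'refundAmount must be a number no greater than 120' }
```

The `reason` string names the violated rule directly, so a reviewer can act on a failure without rerunning the check.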

Evaluator fields that matter

Every evaluator should have:

A clear title.
A description that explains the product rule.
A type-specific value or configuration.
A threshold.
A weight when multiple evaluators contribute to a score.
A dataset link when the evaluator should run against specific rows.
Continuous evaluation settings when it should score production traffic.
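To show how the threshold and weight fields listed above can interact, the sketch below computes a weighted average and flags evaluators that fall below their own threshold. The weighted-average formula is an assumption chosen for illustration; this page does not state how Adaline combines weights. The result objects and the `summarize` helper are hypothetical.

```javascript
// Hypothetical evaluator results; scores are assumed to lie in [0, 1].
const evaluatorResults = [
  { title: "Answer is valid JSON", score: 1.0, threshold: 1.0, weight: 2 },
  { title: "Tone matches support guidelines", score: 0.5, threshold: 0.75, weight: 1 },
  { title: "Response under 200 words", score: 1.0, threshold: 1.0, weight: 1 },
];

// Assumed aggregation: weighted average of scores, plus a list of
// evaluators whose score is below their individual threshold.
function summarize(results) {
  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0);
  if (totalWeight <= 0) throw new Error("Total weight must be positive");
  const weightedScore =
    results.reduce((sum, r) => sum + r.score * r.weight, 0) / totalWeight;
  const failing = results
    .filter((r) => r.score < r.threshold)
    .map((r) => r.title);
  return { weightedScore, failing };
}

const summary = summarize(evaluatorResults);
console.log(summary);
// { weightedScore: 0.875, failing: [ 'Tone matches support guidelines' ] }
```

Reporting the failing titles alongside the aggregate keeps a high overall score from hiding a single evaluator that missed its threshold.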

Draft evaluators created inside an Improve cycle can stay isolated until the cycle is approved, so in-flight experiments do not silently change production scoring.

Choosing an evaluator

Product need	Recommended evaluator
“The answer must be valid JSON.”	JSON evaluator with valid-json.
“The answer must match this schema.”	JSON or JavaScript evaluator.
“The response must cite retrieved policy.”	LLM-as-a-Judge, Text Matcher, or JavaScript depending on how strict the citation format is.
“The answer must not mention internal policy names.”	Text Matcher or LLM-as-a-Judge.
“The response must be under 200 words.”	Response Length evaluator.
“The workflow must finish under 2 seconds.”	Latency evaluator.
“The run must cost less than $0.02.”	Cost evaluator.
“The answer must satisfy our private business rule.”	JavaScript or API Call evaluator.
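For a need such as “The answer must not mention internal policy names,” an exact text check behaves roughly like the sketch below. It approximates not-contains-any semantics and is not the Text Matcher's actual implementation: the `notContainsAny` helper, the case-insensitive comparison, and the policy names are all assumptions.

```javascript
// Hypothetical forbidden terms; replace with your own internal names.
const forbiddenNames = ["Policy Falcon", "Refund Matrix v2"];

// Passes only when none of the forbidden terms appear in the response.
// Assumption: matching is case-insensitive substring matching.
function notContainsAny(response, forbidden) {
  const haystack = response.toLowerCase();
  const found = forbidden.filter((term) =>
    haystack.includes(term.toLowerCase())
  );
  return { passed: found.length === 0, found };
}

const check = notContainsAny(
  "Per policy falcon, refunds take five business days.",
  forbiddenNames
);
console.log(check); // { passed: false, found: [ 'Policy Falcon' ] }
```

A check like this catches literal mentions only. Paraphrased references to an internal policy are the case where LLM-as-a-Judge is the better fit.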

Good evaluator design

Good evaluators are narrow enough to debug and important enough to block bad releases.

One evaluator should measure one expectation.
Titles should name the requirement, not the implementation.
Failure reasons should help a reviewer fix the prompt.
Thresholds should match the cost of failure.
Use datasets to keep examples attached to the rule.
Revisit evaluators after major product or policy changes.

If a rubric has several unrelated requirements, split it into several evaluators. Separate failures are easier to diagnose and easier for Improve to preserve.
