Evaluators turn product expectations into checks that can run on prompt outputs, datasets, production spans, and Improve candidates. Choose the evaluator type based on the kind of judgment you need. Adaline evaluator types are grouped into three families: AI-powered, performance, and code and structured.

AI-powered evaluators

Type | Use it for | Watch out for
LLM-as-a-Judge | Rubrics that require judgment, such as helpfulness, policy adherence, tone, reasoning quality, or whether an answer used context correctly. | Calibrate with passing and failing examples. Rubrics that are too broad become noisy.
Custom Prompt | Custom judge prompts with their own model configuration and prompt logic. | Treat it like another production prompt: it needs clear instructions, stable variables, and test cases.
AI-powered evaluators are best when a deterministic rule cannot capture quality. Use them for qualitative criteria, then validate the judge against examples your team agrees on.
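One way to validate a judge against agreed examples is to compare its verdicts with human labels on a small calibration set. The sketch below is illustrative only: `judgeAgreement` and the `{ id, human, judge }` record shape are hypothetical names, not part of Adaline, and it assumes both verdicts are simple "pass" or "fail" strings.

```javascript
// Hypothetical helper: compares judge verdicts with human labels.
// Each example is assumed to look like { id, human: "pass" | "fail", judge: "pass" | "fail" }.
function judgeAgreement(examples) {
  if (examples.length === 0) {
    throw new Error("calibration set is empty");
  }
  const disagreements = examples.filter((ex) => ex.human !== ex.judge);
  return {
    // Fraction of examples where the judge matched the human label.
    agreement: (examples.length - disagreements.length) / examples.length,
    // The judge approved something humans rejected.
    falsePasses: disagreements.filter((ex) => ex.judge === "pass").map((ex) => ex.id),
    // The judge rejected something humans approved.
    falseFails: disagreements.filter((ex) => ex.judge === "fail").map((ex) => ex.id),
  };
}

const calibration = [
  { id: "a", human: "pass", judge: "pass" },
  { id: "b", human: "fail", judge: "fail" },
  { id: "c", human: "fail", judge: "pass" },
  { id: "d", human: "pass", judge: "pass" },
];
const report = judgeAgreement(calibration);
console.log(report.agreement, report.falsePasses); // 0.75 [ 'c' ]
```

Listing the disagreeing example IDs, rather than reporting only a rate, makes it easier to see which part of the rubric is too broad.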

Performance evaluators

Type | Use it for | Configuration
Response Length | Enforce short, complete, or bounded answers. | Compare words, characters, or tokens with less-than, greater-than, or equals.
Latency | Enforce runtime SLAs. | Compare milliseconds or seconds with less-than, greater-than, or equals.
Cost | Enforce per-run budget constraints. | Compare USD cost with less-than, greater-than, or equals.
Performance evaluators can include cost and latency from linked dataset cells and prompt execution. Use them when product quality depends on speed or spend, not only answer correctness.
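Conceptually, each performance evaluator compares one measured value against a limit using one of the three operators listed in the table. The following is a minimal sketch of that comparison, not Adaline's implementation: `COMPARATORS`, `measureLength`, and `checkThreshold` are hypothetical names, and token counting is left out because it depends on the model's tokenizer.

```javascript
// Hypothetical comparison table; the operator names mirror the configuration options above.
const COMPARATORS = {
  "less-than": (actual, limit) => actual < limit,
  "greater-than": (actual, limit) => actual > limit,
  equals: (actual, limit) => actual === limit,
};

// Measures a response in words or characters.
// Note: String.length counts UTF-16 code units, which can differ from visible characters.
function measureLength(text, unit) {
  if (unit === "characters") return text.length;
  if (unit === "words") return text.trim().split(/\s+/).filter(Boolean).length;
  throw new Error(`unsupported unit: ${unit}`);
}

// Compares a measured value (length, milliseconds, USD) against a limit.
function checkThreshold(actual, operator, limit) {
  const compare = COMPARATORS[operator];
  if (!compare) throw new Error(`unknown operator: ${operator}`);
  return { pass: compare(actual, limit), actual, limit };
}

const lengthCheck = checkThreshold(
  measureLength("Refunds take five business days.", "words"),
  "less-than",
  200
);
const latencyCheck = checkThreshold(2300, "less-than", 2000); // milliseconds
console.log(lengthCheck.pass, latencyCheck.pass); // true false
```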

Code and structured evaluators

Type | Use it for | Configuration
JavaScript | Deterministic business logic, schema validation, custom scoring, or checks that combine variables and response content. | Return grade, score, and reason.
JSON | Valid JSON, exact JSON match, or schema-style checks. | Choose valid-json, exact-match, or schema.
API Call | External evaluators owned by your system. | Configure an HTTP request that returns evaluation data.
Text Matcher | Required strings, forbidden strings, regex, starts-with, ends-with, contains-all, contains-any, or not-contains-any. | Use exact text checks for objective formatting rules.
Use deterministic evaluators whenever the rule is exact. They are easier to debug, cheaper to run, and more stable than a judge model.
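As an illustration of a JavaScript evaluator that combines variables with response content and returns a grade, score, and reason, consider the sketch below. The function name, its `(response, variables)` signature, the `"pass"`/`"fail"` grade values, and the `orderId`, `orderTotal`, and `refundAmount` fields are all assumptions made for this example; check the evaluator editor for the exact contract the runtime expects.

```javascript
// Hypothetical signature: the real evaluator runtime may pass inputs differently.
// response: the raw model output as a string.
// variables: assumed to be { orderId: string, orderTotal: number }.
function evaluateRefundReply(response, variables) {
  let parsed;
  try {
    parsed = JSON.parse(response);
  } catch (err) {
    return { grade: "fail", score: 0, reason: "Response is not valid JSON." };
  }
  if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) {
    return { grade: "fail", score: 0, reason: "Response JSON is not an object." };
  }

  // Two independent checks; each failure adds one reviewer-facing message.
  const totalChecks = 2;
  const failures = [];
  if (parsed.orderId !== variables.orderId) {
    failures.push(`orderId should be ${variables.orderId}`);
  }
  if (typeof parsed.refundAmount !== "number") {
    failures.push("refundAmount must be a number");
  } else if (parsed.refundAmount > variables.orderTotal) {
    failures.push(`refundAmount exceeds order total of ${variables.orderTotal}`);
  }

  return {
    grade: failures.length === 0 ? "pass" : "fail",
    score: (totalChecks - failures.length) / totalChecks,
    reason: failures.length === 0 ? "All checks passed." : failures.join("; "),
  };
}

const sampleResult = evaluateRefundReply(
  '{"orderId": "A-1", "refundAmount": 80}',
  { orderId: "A-1", orderTotal: 50 }
);
console.log(sampleResult);
// { grade: 'fail', score: 0.5, reason: 'refundAmount exceeds order total of 50' }
```

Writing the reason as a list of specific failures keeps the output useful to a reviewer who has to fix the prompt.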

Evaluator fields that matter

Every evaluator should have:
  • A clear title.
  • A description that explains the product rule.
  • A type-specific value or configuration.
  • A threshold.
  • A weight when multiple evaluators contribute to a score.
  • A dataset link when the evaluator should run against specific rows.
  • Continuous evaluation settings when it should score production traffic.
Draft evaluators created inside an Improve cycle can stay isolated until the cycle is approved, so in-flight experiments do not silently change production scoring.
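To make the threshold and weight fields concrete, the sketch below combines several evaluator results into one score. It assumes a weighted average of scores in the range 0 to 1 and a per-evaluator pass rule of score at or above threshold; the source does not state how Adaline combines weights, so treat `summarize` and the result shape as hypothetical.

```javascript
// Hypothetical result shape: { title, score (0..1), threshold (0..1), weight (> 0) }.
function summarize(results) {
  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0);
  if (totalWeight <= 0) {
    throw new Error("total weight must be positive");
  }
  const weightedSum = results.reduce((sum, r) => sum + r.score * r.weight, 0);
  return {
    // Assumed aggregation: weighted average of individual scores.
    overall: weightedSum / totalWeight,
    // Titles of evaluators whose score fell below their own threshold.
    failed: results.filter((r) => r.score < r.threshold).map((r) => r.title),
  };
}

const summary = summarize([
  { title: "Answer is valid JSON", score: 1, threshold: 1, weight: 2 },
  { title: "Answer is under 200 words", score: 0, threshold: 1, weight: 1 },
  { title: "Answer cites retrieved policy", score: 0.5, threshold: 0.5, weight: 1 },
]);
console.log(summary.overall, summary.failed);
// 0.625 [ 'Answer is under 200 words' ]
```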

Choosing an evaluator

Product need | Recommended evaluator
“The answer must be valid JSON.” | JSON evaluator with valid-json.
“The answer must match this schema.” | JSON or JavaScript evaluator.
“The response must cite retrieved policy.” | LLM-as-a-Judge, Text Matcher, or JavaScript depending on how strict the citation format is.
“The answer must not mention internal policy names.” | Text Matcher or LLM-as-a-Judge.
“The response must be under 200 words.” | Response Length evaluator.
“The workflow must finish under 2 seconds.” | Latency evaluator.
“The run must cost less than $0.02.” | Cost evaluator.
“The answer must satisfy our private business rule.” | JavaScript or API Call evaluator.
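For a rule such as forbidding internal policy names, a not-contains-any text check is often enough when the names are a fixed list. The sketch below shows the logic such a check performs; `notContainsAny` is a hypothetical helper, the policy names are invented, and case-insensitive matching is an assumption rather than documented Text Matcher behavior.

```javascript
// Hypothetical helper illustrating a not-contains-any check.
// Matching is case-insensitive here by assumption.
function notContainsAny(response, forbiddenTerms) {
  const haystack = response.toLowerCase();
  const found = forbiddenTerms.filter((term) => haystack.includes(term.toLowerCase()));
  return { pass: found.length === 0, found };
}

// Invented policy names, for illustration only.
const forbidden = ["Project Falcon", "Tier-Z escalation"];
const leak = notContainsAny("Per project falcon guidelines, refunds take 5 days.", forbidden);
const clean = notContainsAny("Refunds take 5 business days.", forbidden);
console.log(leak.pass, leak.found, clean.pass);
// false [ 'Project Falcon' ] true
```

If the names can be paraphrased or abbreviated, a fixed list will miss them, which is the case where LLM-as-a-Judge is the better fit.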

Good evaluator design

Good evaluators are narrow enough to debug and important enough to block bad releases.
  • One evaluator should measure one expectation.
  • Titles should name the requirement, not the implementation.
  • Failure reasons should help a reviewer fix the prompt.
  • Thresholds should match the cost of failure.
  • Use datasets to keep examples attached to the rule.
  • Revisit evaluators after major product or policy changes.
If a rubric has several unrelated requirements, split it into several evaluators. Separate failures are easier to diagnose and easier for Improve to preserve.