Coming soon — Multi-turn chat evaluation is currently under development. This page will be updated when the feature is available.
## What to expect
- Evaluate how well models maintain context across turns, handle references to previous messages, and stay consistent with persona and instructions throughout a conversation.
- Structure datasets with full conversation histories so each test case represents a complete conversation state.
- Use evaluators like LLM-as-a-Judge with conversation-focused rubrics to assess context retention, coherence, and task completion.
- Track how cost and latency scale as conversations grow longer.
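To make the dataset idea above concrete, here is a minimal sketch of how a multi-turn test case might be structured as a full conversation history, with a transcript renderer a judge prompt could consume. All field names (`messages`, `role`, `content`, `expected_behaviors`) and the helper function are illustrative assumptions, not a published schema.

```python
# Hypothetical sketch: one test case = one complete conversation state.
# The schema below is illustrative only; the real feature may differ.

test_case = {
    # Full history up to the turn being evaluated.
    "messages": [
        {"role": "system", "content": "You are a concise travel assistant."},
        {"role": "user", "content": "Find me a hotel in Lisbon for next weekend."},
        {"role": "assistant", "content": "Sure. Any budget or neighborhood preference?"},
        {"role": "user", "content": "Under $150, near the old town."},
    ],
    # Rubric items a conversation-focused judge could check
    # in the next assistant turn (context retention, constraints).
    "expected_behaviors": [
        "remembers the city (Lisbon) from earlier turns",
        "applies the $150 budget constraint",
    ],
}

def render_transcript(messages):
    """Flatten the history into a plain-text transcript for a judge prompt."""
    return "\n".join(f"{m['role'].upper()}: {m['content']}" for m in messages)

print(render_transcript(test_case["messages"]))
```

Storing the whole history per test case (rather than isolated question/answer pairs) is what lets an evaluator check references to earlier turns, and it also makes it easy to measure how token cost grows with conversation length.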
## Next steps

- **Evaluate Prompts**: Run evaluations on single and chained prompts today.
- **LLM-as-a-Judge**: Set up qualitative evaluation rubrics.