Coming soon — Multi-turn chat evaluation is currently under development. This page will be updated when the feature is available.
Multi-turn chat evaluation will let you assess conversational AI systems where the quality of responses depends on context accumulated across multiple user-assistant exchanges. This is essential for chatbots, customer support agents, and any prompt that maintains state across turns.

What to expect

  • Evaluate how well models maintain context across turns, handle references to previous messages, and stay consistent with persona and instructions throughout a conversation.
  • Structure datasets with full conversation histories so each test case represents a complete conversation state.
  • Use evaluators like LLM-as-a-Judge with conversation-focused rubrics to assess context retention, coherence, and task completion.
  • Track how cost and latency scale as conversations grow longer.
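The dataset shape described above can be sketched as a plain conversation-state record. This is a hypothetical illustration only: the field names (`messages`, `expected`) and the helper function are assumptions, not the final schema, since the feature is still under development.

```python
# Hypothetical multi-turn test case: each case carries the full conversation
# history so evaluators see the complete state, not just the last message.
# The schema below is an assumption for illustration.
test_case = {
    "messages": [
        {"role": "user", "content": "I'd like to return my order."},
        {"role": "assistant", "content": "Sure, can you share the order number?"},
        {"role": "user", "content": "It's #12345."},
    ],
    # Properties a conversation-focused rubric might check, e.g. whether the
    # next response references the order number given two turns earlier.
    "expected": {
        "references_order_number": True,
        "stays_in_support_persona": True,
    },
}

def last_user_turn(case: dict) -> str:
    """Return the most recent user message in the conversation state."""
    return next(
        m["content"] for m in reversed(case["messages"]) if m["role"] == "user"
    )
```

A judge prompt built from a case like this would receive the whole `messages` list as context, which is also why cost and latency grow with conversation length: every turn re-sends the accumulated history.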

Next steps

Evaluate Prompts

Run evaluations on single and chained prompts today.

LLM-as-a-Judge

Set up qualitative evaluation rubrics.