Coming soon — Multi-turn chat evaluation is currently under development. This page will be updated when the feature is available.
## What to expect
- Evaluate how well models maintain context across turns, handle references to previous messages, and stay consistent with persona and instructions throughout a conversation.
- Structure datasets with full conversation histories so each test case represents a complete conversation state.
- Use evaluators like LLM-as-a-Judge with conversation-focused rubrics to assess context retention, coherence, and task completion.
- Track how cost and latency scale as conversations grow longer.
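To make the dataset idea above concrete, here is a minimal sketch of how a multi-turn test case might be structured as a full conversation history, with a transcript renderer a judge prompt could consume. All field names (`messages`, `role`, `content`, `expected_behaviors`) and the helper function are illustrative assumptions, not a published schema.

```python
# Hypothetical sketch: one test case = one complete conversation state.
# The schema below is illustrative only; the real feature may differ.

test_case = {
    # Full history up to the turn being evaluated.
    "messages": [
        {"role": "system", "content": "You are a concise travel assistant."},
        {"role": "user", "content": "Find me a hotel in Lisbon for next weekend."},
        {"role": "assistant", "content": "Sure. Any budget or neighborhood preference?"},
        {"role": "user", "content": "Under $150, near the old town."},
    ],
    # Rubric items a conversation-focused judge could check
    # in the next assistant turn (context retention, constraints).
    "expected_behaviors": [
        "remembers the city (Lisbon) from earlier turns",
        "applies the $150 budget constraint",
    ],
}

def render_transcript(messages):
    """Flatten the history into a plain-text transcript for a judge prompt."""
    return "\n".join(f"{m['role'].upper()}: {m['content']}" for m in messages)

print(render_transcript(test_case["messages"]))
```

Storing the whole history per test case (rather than isolated question/answer pairs) is what lets an evaluator check references to earlier turns, and it also makes it easy to measure how token cost grows with conversation length.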
## Next steps

- **Evaluate Prompts**: Run evaluations on single and chained prompts today.
- **LLM-as-a-Judge**: Set up qualitative evaluation rubrics.