Tool problems usually show up in one of four ways: the model does not call the tool, calls the wrong tool, calls the right tool with bad arguments, or handles the tool response incorrectly. Use playground runs for quick iteration and Traces for production truth.

Test in the prompt playground

Start with realistic inputs. A tool that works for a happy-path example can still fail when the user gives partial information, conflicting constraints, or a request outside the tool’s scope. For each important tool, test:
  • The normal call path.
  • Missing required user information.
  • Ambiguous user requests.
  • Tool returns no result.
  • Tool returns an error.
  • Tool returns slow or partial data.
  • User asks for something the tool should not be used for.
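One way to keep this checklist from drifting is to write it down as data before running anything in the playground. The sketch below assumes a hypothetical weather-lookup tool; the `ToolScenario` type, the `tool_behavior` labels, and the sample messages are illustrative names, not part of Adaline. The `expect_tool_call` values are also assumptions about the desired behavior (for example, that the assistant should ask a clarifying question rather than call the tool on an ambiguous request).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolScenario:
    name: str
    user_message: str
    tool_behavior: str      # how a stubbed tool should respond in this run
    expect_tool_call: bool  # assumed desired behavior, adjust per product


# Hypothetical scenarios for a weather-lookup tool, one per checklist item.
SCENARIOS = [
    ToolScenario("normal call path", "What's the weather in Paris tomorrow?", "ok", True),
    ToolScenario("missing required info", "What's the weather tomorrow?", "not_called", False),
    ToolScenario("ambiguous request", "Weather in Springfield?", "not_called", False),
    ToolScenario("no result", "Weather in Atlantis tomorrow?", "empty", True),
    ToolScenario("tool error", "Weather in Paris tomorrow?", "error", True),
    ToolScenario("slow or partial data", "Weather in Paris for the next 30 days?", "slow", True),
    ToolScenario("out of scope", "Book me a flight to Paris.", "not_called", False),
]

REQUIRED_BEHAVIORS = ("ok", "empty", "error", "slow", "not_called")


def uncovered(scenarios, required=REQUIRED_BEHAVIORS):
    """Return the tool behaviors that no scenario exercises yet."""
    seen = {s.tool_behavior for s in scenarios}
    return [b for b in required if b not in seen]


missing = uncovered(SCENARIOS)
```

An empty `missing` list means every tool behavior has at least one scenario; running each `user_message` in the playground is still a manual step here.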

Inspect tool call quality

When a tool is called, review:
| Signal | What to ask |
| --- | --- |
| Selection | Should the model have called this tool for this request? |
| Arguments | Are required fields present and correctly typed? |
| Source data | Did the prompt provide enough context for the model to fill the arguments? |
| Response handling | Did the assistant use the returned data correctly? |
| Fallback | Did the assistant behave safely when the tool failed or returned nothing? |
| Latency and cost | Is the call acceptable for the product path? |
If the arguments are wrong, fix the schema and prompt instructions before changing backend behavior.
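The "Arguments" check can be partly automated when reviewing many calls. The following is a minimal sketch, assuming tool arguments arrive as an already-parsed dictionary and the tool is described by a JSON-Schema-style object. It handles only `required`, a few primitive `type` names, and `enum`; a full JSON Schema validator covers far more. The schema and function names are hypothetical.

```python
# Map JSON Schema type names to Python types (a small subset, for illustration).
TYPE_MAP = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def matches_type(value, type_name):
    """True when value fits the named JSON Schema type; unknown names pass."""
    expected = TYPE_MAP.get(type_name)
    if expected is None:
        return True
    # bool is a subclass of int in Python, so reject it for non-boolean fields.
    if isinstance(value, bool) and type_name != "boolean":
        return False
    return isinstance(value, expected)


def check_arguments(args, schema):
    """Return a list of problems; an empty list means the call looks well-formed."""
    problems = []
    properties = schema.get("properties", {})
    for field in schema.get("required", []):
        if field not in args:
            problems.append(f"missing required field: {field}")
    for field, value in args.items():
        spec = properties.get(field)
        if spec is None:
            problems.append(f"unexpected field: {field}")
            continue
        if not matches_type(value, spec.get("type")):
            problems.append(f"wrong type for {field}: expected {spec['type']}")
        if "enum" in spec and value not in spec["enum"]:
            problems.append(f"value not allowed for {field}: {value!r}")
    return problems


# Hypothetical schema for a weather-lookup tool.
WEATHER_SCHEMA = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        "days": {"type": "integer"},
    },
    "required": ["city"],
}

problems = check_arguments({"units": "kelvin", "days": "3"}, WEATHER_SCHEMA)
```

For the sample call, `problems` reports the missing `city`, the disallowed `units` value, and the string passed for `days`. Each of these points back at the schema or prompt rather than the backend.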

Debug in Traces

Production traces show the exact request path. Open a trace and inspect:
  • The span hierarchy around the model call.
  • Tool call arguments.
  • Tool response content.
  • Tool status and latency.
  • Tags and attributes for route, release, user segment, or environment.
  • The assistant response after the tool returned.
If a tool issue appears in many traces, open Behaviors to see whether Adaline has clustered the pattern. Tool behaviors can reveal repeated failures that are hard to see from isolated traces.
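If trace data is also available outside the UI, a quick aggregation can confirm whether a failure is isolated or widespread. The sketch below assumes tool spans have been flattened into dictionaries with `trace_id`, `tool`, `status`, and `latency_ms` keys. That shape and the sample values are hypothetical and do not reflect a documented Adaline export format.

```python
from collections import Counter

# Hypothetical flattened tool spans; field names and values are assumptions.
SPANS = [
    {"trace_id": "t1", "tool": "lookup_order", "status": "ok", "latency_ms": 120},
    {"trace_id": "t2", "tool": "lookup_order", "status": "error", "latency_ms": 95},
    {"trace_id": "t3", "tool": "lookup_order", "status": "timeout", "latency_ms": 5000},
    {"trace_id": "t4", "tool": "lookup_order", "status": "ok", "latency_ms": 140},
    {"trace_id": "t5", "tool": "search_docs", "status": "ok", "latency_ms": 300},
    {"trace_id": "t6", "tool": "search_docs", "status": "ok", "latency_ms": 280},
]


def failure_rates(spans):
    """Fraction of calls per tool whose status is anything other than 'ok'."""
    calls = Counter(s["tool"] for s in spans)
    failures = Counter(s["tool"] for s in spans if s["status"] != "ok")
    return {tool: failures[tool] / calls[tool] for tool in calls}


def slow_traces(spans, threshold_ms):
    """Trace ids whose tool span took longer than threshold_ms."""
    return [s["trace_id"] for s in spans if s["latency_ms"] > threshold_ms]


rates = failure_rates(SPANS)
slow = slow_traces(SPANS, threshold_ms=1000)
```

A tool with a high failure rate across many traces is a candidate for the clustered view; a single slow trace is usually better inspected on its own.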

Common fixes

| Symptom | Likely fix |
| --- | --- |
| Tool is never called | Make the trigger condition explicit in the prompt and confirm the model supports tool calls. |
| Tool is called too often | Narrow the tool description and add prompt instructions for when not to call it. |
| Arguments are malformed | Tighten the JSON schema, required fields, enum values, and field descriptions. |
| Tool result is ignored | Add instructions for how to incorporate returned data into the answer. |
| Assistant over-trusts tool data | Add fallback and uncertainty instructions for stale, empty, or conflicting responses. |
| Tool latency is high | Inspect trace spans and decide whether the backend, tool chain, or model workflow needs optimization. |
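To make the "called too often" and "malformed arguments" fixes concrete, here is a sketch of the same hypothetical tool defined loosely and then tightened. The `name` / `description` / `parameters` envelope is an assumption modeled on common tool-calling formats, and the exact wrapper varies by provider. The keywords inside `parameters` (`required`, `enum`, `additionalProperties`) are standard JSON Schema.

```python
import json

# Loose definition: vague description, no required fields, free-form units.
LOOSE_TOOL = {
    "name": "get_weather",
    "description": "Gets weather.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "units": {"type": "string"},
        },
    },
}

# Tightened definition: states when NOT to call, constrains every field.
TIGHT_TOOL = {
    "name": "get_weather",
    "description": (
        "Look up the forecast for one named city. "
        "Do not call for historical weather or for travel booking."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "city": {
                "type": "string",
                "description": "City name, for example 'Paris'. Ask the user if it is missing.",
            },
            "units": {
                "type": "string",
                "enum": ["celsius", "fahrenheit"],
                "description": "Temperature unit for the forecast.",
            },
        },
        "required": ["city", "units"],
        "additionalProperties": False,
    },
}

# Serialize to the JSON that would be pasted into a tool configuration.
tool_json = json.dumps(TIGHT_TOOL, indent=2)
```

The tightened version gives the model fewer ways to be wrong: the description narrows selection, and `enum` plus `required` narrow the arguments.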

Add evaluator coverage

Important tool behavior should have evaluator coverage. Consider:
  • JavaScript evaluators for exact argument or JSON-shape requirements.
  • Text matchers for required citations or refusal phrases.
  • LLM-as-a-Judge evaluators for answer quality after tool use.
  • Latency evaluators for workflows with tool-call SLAs.
  • Cost evaluators when tool calls trigger expensive model or API work.
Add dataset rows for each tool edge case so future prompt versions keep passing.
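The logic of an exact-argument check is small enough to sketch. The example below is written in Python purely to illustrate the comparison and is not Adaline's evaluator interface; a JavaScript evaluator would need the same logic ported. The dataset row shape and the `tool_call` shape (a name plus a JSON string of arguments) are assumptions.

```python
import json

# Hypothetical dataset rows: one per tool edge case, with the expected call.
DATASET = [
    {
        "case": "normal call path",
        "expected_tool": "get_weather",
        "expected_args": {"city": "Paris", "units": "celsius"},
    },
    {
        "case": "out of scope",
        "expected_tool": None,  # the assistant should not call any tool
        "expected_args": None,
    },
]


def evaluate_row(row, tool_call):
    """Pass when the observed tool call matches the row's expectation.

    tool_call is None when no tool was called, otherwise a dict of the
    assumed form {"name": str, "arguments": "<JSON string>"}.
    """
    if row["expected_tool"] is None:
        return tool_call is None
    if tool_call is None or tool_call["name"] != row["expected_tool"]:
        return False
    try:
        args = json.loads(tool_call["arguments"])
    except (json.JSONDecodeError, TypeError):
        return False  # malformed arguments count as a failure
    return args == row["expected_args"]


good_call = {"name": "get_weather", "arguments": '{"units": "celsius", "city": "Paris"}'}
bad_json = {"name": "get_weather", "arguments": "{city: Paris"}

results = [
    evaluate_row(DATASET[0], good_call),
    evaluate_row(DATASET[0], bad_json),
    evaluate_row(DATASET[1], None),
    evaluate_row(DATASET[1], good_call),
]
```

Comparing parsed dictionaries rather than raw strings keeps the check insensitive to key order and whitespace while still requiring exact values.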

Production review loop

After deploying a prompt that uses tools:
  1. Watch Monitor for latency, cost, token, and eval-score changes.
  2. Filter Traces by route, environment, tool-related tags, or high latency.
  3. Use Deep search to find semantic tool failures such as “assistant answered without lookup”.
  4. Open Behaviors for recurring user, assistant, or tool patterns.
  5. Start Improve if the fix belongs in prompt instructions.
Do not debug production tool issues only from playground output. Playground runs prove a draft can work; production traces prove what actually happened for users.