Test and debug tools

Tool problems usually show up in one of four ways: the model does not call the tool, calls the wrong tool, calls the right tool with bad arguments, or handles the tool response incorrectly. Use playground runs for quick iteration and Traces to see what actually happened in production.

Test in the prompt playground

Start with realistic inputs. A tool that works for a happy-path example can still fail when the user gives partial information, conflicting constraints, or a request outside the tool’s scope. For each important tool, test:

The normal call path.
Missing required user information.
Ambiguous user requests.
Tool returns no result.
Tool returns an error.
Tool responds slowly or returns partial data.
User asks for something the tool should not be used for.
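
One way to keep these cases from being skipped is to write them down as data before you start running them. The following is a minimal sketch, not an Adaline feature: the order-status tool, the `ToolScenario` class, the sample inputs, and the simulated tool results are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolScenario:
    name: str
    user_input: str
    simulated_tool_result: Optional[str]  # None when no tool call is expected
    expect_tool_call: bool


# Hypothetical scenarios for an order-status lookup tool, one per case listed above.
SCENARIOS = [
    ToolScenario("normal call path", "Where is order 1042?", '{"status": "shipped"}', True),
    ToolScenario("missing required information", "Where is my order?", None, False),
    ToolScenario("ambiguous request", "Check on the thing I bought recently.", None, False),
    ToolScenario("tool returns no result", "Where is order 9999?", '{"orders": []}', True),
    ToolScenario("tool returns an error", "Where is order 1042?", '{"error": "upstream timeout"}', True),
    ToolScenario("slow or partial data", "Where is order 1043?", '{"status": null, "partial": true}', True),
    ToolScenario("out of scope", "Cancel my subscription.", None, False),
]


def selection_matches(scenario: ToolScenario, tool_was_called: bool) -> bool:
    """Return True when the observed run made the tool-call decision the scenario expects."""
    return scenario.expect_tool_call == tool_was_called
```

After each playground run, record whether the tool was called and compare it with `selection_matches`. Whether a case such as "missing required information" should produce a clarifying question instead of a call is a product decision, so adjust `expect_tool_call` to match your own tool.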

Inspect tool call quality

When a tool is called, review:

Signal	What to ask
Selection	Should the model have called this tool for this request?
Arguments	Are required fields present and correctly typed?
Source data	Did the prompt provide enough context for the model to fill the arguments?
Response handling	Did the assistant use the returned data correctly?
Fallback	Did the assistant behave safely when the tool failed or returned nothing?
Latency and cost	Is the call acceptable for the product path?

If the arguments are wrong, fix the schema and prompt instructions before changing backend behavior.
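
To see what "wrong arguments" means in a concrete case, it can help to check a captured tool call against the schema by hand. The sketch below assumes the arguments arrive as a JSON string and that the tool uses a JSON-Schema-style definition. It only understands `required`, `type`, and `enum`, so it is not a full JSON Schema validator, and the `order_id` and `region` fields are hypothetical.

```python
import json

# Hypothetical schema for an order-lookup tool.
ORDER_LOOKUP_SCHEMA = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "region": {"type": "string", "enum": ["us", "eu"]},
    },
    "required": ["order_id"],
}

_PY_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def argument_problems(raw_arguments: str, schema: dict) -> list:
    """Return a list of human-readable problems; an empty list means the call looks valid."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        return [f"arguments are not valid JSON: {exc.msg}"]
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]

    problems = []
    properties = schema.get("properties", {})
    for field in schema.get("required", []):
        if field not in args:
            problems.append(f"missing required field: {field}")

    for field, value in args.items():
        spec = properties.get(field)
        if spec is None:
            problems.append(f"unexpected field: {field}")
            continue
        type_name = spec.get("type")
        expected = _PY_TYPES.get(type_name)
        # bool is a subclass of int in Python, so reject it for numeric fields explicitly.
        bool_as_number = isinstance(value, bool) and type_name in ("integer", "number")
        if expected is not None and (bool_as_number or not isinstance(value, expected)):
            problems.append(f"{field}: expected {type_name}, got {type(value).__name__}")
        if "enum" in spec and value not in spec["enum"]:
            problems.append(f"{field}: {value!r} is not an allowed value")
    return problems
```

If the same problem repeats across runs (for example, `order_id` arriving as a number), that points at the schema or field descriptions, not the backend.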

Debug in Traces

Production traces show the exact request path. Open a trace and inspect:

The span hierarchy around the model call.
Tool call arguments.
Tool response content.
Tool status and latency.
Tags and attributes for route, release, user segment, or environment.
The assistant response after the tool returned.

If a tool issue appears in many traces, open Behaviors to see whether Adaline has clustered the pattern. Tool behaviors can reveal repeated failures that are hard to see from isolated traces.
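
If you have trace data outside the product, a small aggregate can confirm that an issue is a pattern and not a one-off. The sketch below is illustrative only: the span dictionaries and their keys (`kind`, `tool`, `status`, `latency_ms`) are an assumed shape, not Adaline's export format, and it does not replace Behaviors.

```python
from collections import defaultdict


def summarize_tool_spans(spans):
    """Group tool spans by tool name and report call count, error rate, and max latency."""
    buckets = defaultdict(lambda: {"calls": 0, "errors": 0, "latencies_ms": []})
    for span in spans:
        if span.get("kind") != "tool":
            continue  # skip model calls and other span types
        bucket = buckets[span["tool"]]
        bucket["calls"] += 1
        if span.get("status") != "ok":
            bucket["errors"] += 1
        bucket["latencies_ms"].append(span["latency_ms"])

    return {
        tool: {
            "calls": b["calls"],
            "error_rate": b["errors"] / b["calls"],
            "max_latency_ms": max(b["latencies_ms"]),
        }
        for tool, b in buckets.items()
    }
```

A tool with a high error rate across many traces is a candidate for the fixes in the next section; one with a low error rate but a high maximum latency is more likely a backend or workflow issue.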

Common fixes

Symptom	Likely fix
Tool is never called	Make the trigger condition explicit in the prompt and confirm the model supports tool calls.
Tool is called too often	Narrow the tool description and add prompt instructions for when not to call it.
Arguments are malformed	Tighten the JSON schema, required fields, enum values, and field descriptions.
Tool result is ignored	Add instructions for how to incorporate returned data into the answer.
Assistant over-trusts tool data	Add fallback and uncertainty instructions for stale, empty, or conflicting responses.
Tool latency is high	Inspect trace spans and decide whether the backend, tool chain, or model workflow needs optimization.
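
As an illustration of the "narrow the description" and "tighten the schema" fixes, here is a loose tool definition next to a tighter one. The tool name, fields, and the `name` / `description` / `parameters` layout are assumptions in the common JSON-Schema style; check the format your model provider expects.

```python
# Loose: vague description, untyped region, nothing required.
LOOSE_TOOL = {
    "name": "order_lookup",
    "description": "Looks up orders.",
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string"},
            "region": {"type": "string"},
        },
    },
}

# Tighter: the description says when to call and when not to, the required
# field is declared, the region is an enum, and extra fields are rejected.
TIGHT_TOOL = {
    "name": "order_lookup",
    "description": (
        "Look up the shipping status of one existing order by its order ID. "
        "Do not call this for refunds, cancellations, or when the user has not "
        "given an order ID; ask for the ID instead."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                "description": "Order ID exactly as the user provided it, for example '1042'.",
            },
            "region": {
                "type": "string",
                "enum": ["us", "eu"],
                "description": "Fulfilment region. Omit if the user did not state one.",
            },
        },
        "required": ["order_id"],
        "additionalProperties": False,
    },
}
```

The tighter version addresses three rows of the table at once: the description limits over-calling, `required` and `enum` reduce malformed arguments, and the field descriptions tell the model where each value should come from.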

Add evaluator coverage

Important tool behavior should have evaluator coverage. Consider:

JavaScript evaluators for exact argument or JSON-shape requirements.
Text matchers for required citations or refusal phrases.
LLM-as-a-Judge evaluators for answer quality after tool use.
Latency evaluators for workflows with tool-call SLAs.
Cost evaluators when tool calls trigger expensive model or API work.

Add dataset rows for each tool edge case so future prompt versions keep passing.
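
The logic of an exact-argument check can be sketched as follows. This is written in Python purely to illustrate the comparison; an actual JavaScript evaluator would be written in JavaScript against whatever inputs the product provides. The row layout, the `expected_call` key, and the tool name are hypothetical.

```python
def evaluate_tool_call(expected, actual):
    """Compare an expected tool call with the observed one.

    Both are either None (no tool call) or {"name": str, "arguments": dict}.
    Returns {"pass": bool, "reason": str}.
    """
    if expected is None:
        if actual is None:
            return {"pass": True, "reason": "no tool call expected, none made"}
        return {"pass": False, "reason": f"unexpected call to {actual['name']}"}
    if actual is None:
        return {"pass": False, "reason": f"expected call to {expected['name']}, none made"}
    if actual["name"] != expected["name"]:
        return {"pass": False, "reason": f"wrong tool: {actual['name']}"}
    if actual["arguments"] != expected["arguments"]:
        return {"pass": False, "reason": "arguments do not match"}
    return {"pass": True, "reason": "tool and arguments match"}


# Hypothetical dataset rows: one per edge case, each with the call we expect.
DATASET_ROWS = [
    {"input": "Where is order 1042?",
     "expected_call": {"name": "order_lookup", "arguments": {"order_id": "1042"}}},
    {"input": "Where is my order?", "expected_call": None},       # missing information
    {"input": "Cancel my subscription.", "expected_call": None},  # out of scope
]
```

Rows where `expected_call` is `None` matter as much as the positive ones, because they catch a prompt change that makes the model call the tool too eagerly.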

Production review loop

After deploying a prompt that uses tools:

Watch Monitor for latency, cost, token, and eval-score changes.
Filter Traces by route, environment, tool-related tags, or high latency.
Use Deep search to find semantic tool failures such as “assistant answered without lookup”.
Open Behaviors for recurring user, assistant, or tool patterns.
Start Improve if the fix belongs in prompt instructions.

Do not debug production tool issues only from playground output. Playground runs prove a draft can work; production traces prove what actually happened for users.
