Operate your AI agent

After your application sends traces to Adaline, the goal is not to stare at logs. The goal is to build an operating loop: notice what changed, understand why, turn important evidence into tests, improve the prompt, and release only when the change is reviewable. Use this page as the day-to-day map for a project that is already connected or close to launch.

The operating loop

Step	Product area	What the team is deciding
Watch health	Monitor	Did traffic, latency, cost, token usage, or evaluation score change?
Inspect evidence	Traces	What actually happened in the request path?
Find patterns	Behaviors	Is this a repeated user, assistant, or tool behavior?
Preserve coverage	Datasets and Evaluators	What case and rule should prevent this from regressing?
Improve safely	Improve	Which candidate fixes the issue without breaking existing checks?
Release and watch	Deploy	Which environment should run the reviewed prompt version, and what should be watched after release?

The loop should feel boring when the system is healthy and useful when it is not. Every serious issue should leave behind better traces, better dataset rows, better evaluators, or a clearer deployment practice.
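The "Watch health" step amounts to comparing a metric across two time windows. The sketch below illustrates that comparison; it is not Adaline's API. The metric names, the `flag_changes` helper, and the 20% threshold are assumptions made for the example.

```python
from typing import Dict


def relative_change(before: float, after: float) -> float:
    """Return the fractional change from `before` to `after`."""
    if before == 0:
        return float("inf") if after != 0 else 0.0
    return (after - before) / before


def flag_changes(
    before: Dict[str, float],
    after: Dict[str, float],
    threshold: float = 0.2,  # assumed threshold; tune per metric in practice
) -> Dict[str, float]:
    """Return metrics whose absolute relative change exceeds `threshold`."""
    flagged = {}
    for name, old_value in before.items():
        if name not in after:
            continue
        change = relative_change(old_value, after[name])
        if abs(change) > threshold:
            flagged[name] = round(change, 3)
    return flagged


# Hypothetical values read off a dashboard for two comparable windows.
baseline = {"p95_latency_ms": 1200.0, "cost_usd": 40.0, "eval_score": 0.90}
current = {"p95_latency_ms": 1800.0, "cost_usd": 42.0, "eval_score": 0.88}

print(flag_changes(baseline, current))
```

A flagged metric is only a prompt to open traces for the same window; it does not say which layer changed.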

After integration

Once traces are arriving, confirm these basics before starting prompt improvement:

Monitor shows trace volume for the expected project.
Traces include useful names for major workflow steps.
Tags and attributes identify environment, release, route, user segment, or tenant where appropriate.
Model spans include enough input-variable and response content to debug failures.
Sensitive data handling is understood before traces are shared, exported, or copied into datasets.
At least one prompt has a small golden dataset and a few evaluators.
Deployment environments match the release lanes your application uses.

If these are missing, fix instrumentation and project setup first. Improve and Behaviors become more useful when the underlying traces carry stable metadata.
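One way to check for stable metadata is to audit a sample of exported traces for the attributes the team relies on. The sketch below assumes a hypothetical trace shape (a dict with `id` and `attributes`) and a hypothetical list of required keys; neither reflects Adaline's export format.

```python
# Assumed set of attributes the team wants on every trace.
REQUIRED_ATTRIBUTES = ("environment", "release", "route")


def missing_attributes(trace: dict, required=REQUIRED_ATTRIBUTES) -> list:
    """Return the required attribute keys that are absent or empty."""
    attrs = trace.get("attributes", {})
    return [key for key in required if not attrs.get(key)]


def audit(traces: list) -> dict:
    """Map trace id to its missing keys, only for traces with gaps."""
    report = {}
    for trace in traces:
        missing = missing_attributes(trace)
        if missing:
            report[trace["id"]] = missing
    return report


# Hypothetical sample of two traces.
sample = [
    {"id": "t1", "attributes": {"environment": "prod", "release": "r42", "route": "/chat"}},
    {"id": "t2", "attributes": {"environment": "prod"}},
]

print(audit(sample))  # {'t2': ['release', 'route']}
```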

Choose the next surface

Situation	Start here	Then go here
A chart changed after deploy	Monitor	Filter Traces by the release window, then compare deployments.
Users report bad answers	Traces	Deep Search for similar cases, then check Behaviors.
A failure repeats	Behaviors	Add representative spans to a dataset, then start Improve.
A prompt edit is ready	Evaluators	Run datasets, compare deployment diffs, then deploy.
Cost is rising	Monitor	Inspect high-cost traces for tokens, tools, model, and output length.
Latency is rising	Monitor	Use the trace waterfall view before changing prompt text.
A release failed	Deploy	Roll back, preserve the failed traces, and add regression coverage.
A team asks for proof	Traces	Save a view, export filtered evidence, and include evaluator results.

First week after launch

Use the first week to learn how the agent behaves under real traffic.

Day 0: launch window

Watch Monitor for traffic, latency, cost, token usage, and evaluation score.
Open representative traces from normal requests, not only failures.
Confirm that deployment webhooks fire and that cache refresh behaves as expected.
Confirm the runtime is reading the intended prompt and environment.
Record the rollback target before the next release.

Day 1: evidence review

Review failed traces and low evaluator scores.
Use Deep Search for known failure themes.
Add the best production examples to regression datasets.
Tighten evaluators that failed for the wrong reason.
Avoid prompt edits until the team understands the failure shape.
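Copying a production example into a regression dataset usually involves a redaction step, since traces can carry sensitive data. The sketch below shows one possible shape for that step. The span fields, the row fields, and the email-only redaction rule are all assumptions for illustration; a real policy will likely cover more than email addresses.

```python
import re

# Simplified email pattern, used only for this illustration.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text: str) -> str:
    """Replace email addresses with a placeholder."""
    return EMAIL.sub("[email]", text)


def span_to_row(span: dict, expectation: str) -> dict:
    """Build a dataset row from a hypothetical model span."""
    return {
        "input": redact(span["input"]),
        "observed_output": redact(span["output"]),
        "expectation": expectation,    # what a correct answer should do
        "source_span_id": span["id"],  # keeps the link back to the evidence
    }


# Hypothetical span taken from a failed trace.
span = {
    "id": "span-123",
    "input": "Contact me at jo@example.com about my refund",
    "output": "I cannot help with that.",
}

row = span_to_row(span, "Explains the refund policy and next steps")
print(row["input"])  # Contact me at [email] about my refund
```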

Day 3: behavior review

Review Behaviors once enough traces have accumulated.
Identify high-volume issues and repeated assistant/tool patterns.
Decide which Behaviors should become Improve targets.
Add dataset rows for the cases that should never regress.

Week 1: release discipline

Run the golden and regression datasets before prompt releases.
Compare deployment snapshots before changing production.
Watch Monitor after each release.
Turn release learnings into saved trace views, evaluators, and dataset rows.
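The habit of running datasets before each release can be expressed as a simple gate. The sketch below assumes pass rates have already been collected for the current and candidate prompt versions. `DatasetResult`, `release_blockers`, and the zero-drop tolerance are hypothetical names and values, not part of Adaline.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetResult:
    name: str
    baseline_pass_rate: float   # current production prompt
    candidate_pass_rate: float  # prompt version under review


def release_blockers(results: List[DatasetResult], max_drop: float = 0.0) -> List[str]:
    """List datasets whose pass rate dropped by more than `max_drop`."""
    blockers = []
    for r in results:
        drop = r.baseline_pass_rate - r.candidate_pass_rate
        if drop > max_drop:
            blockers.append(
                f"{r.name}: pass rate fell from "
                f"{r.baseline_pass_rate:.2f} to {r.candidate_pass_rate:.2f}"
            )
    return blockers


# Hypothetical results for the golden and regression datasets.
results = [
    DatasetResult("golden", 0.95, 0.97),
    DatasetResult("regression", 1.00, 0.90),
]

blockers = release_blockers(results)
print(blockers)  # ['regression: pass rate fell from 1.00 to 0.90']
```

An empty list means the gate found no drop. The deployment comparison still needs human review.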

Incident workflow

When production behavior looks wrong:

Open Monitor and capture the affected time window.
Open Traces with the same window and filter by environment, release, route, status, tags, or attributes.
Inspect representative traces in tree and waterfall views.
Use Deep Search if the issue is semantic rather than metadata-based.
Check whether the traces belong to an existing or new Behavior.
Add important model spans to a regression dataset.
Fix the right layer: prompt, evaluator, dataset, tool, retrieval, backend, provider, or deployment.
If a release caused the issue, roll back before running a longer Improve cycle.

Do not start with prompt editing when the root cause might be a tool, retrieval service, provider issue, schema mismatch, or missing runtime context. Use traces to locate the failing layer first.
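Locating the failing layer can start with a count of errored spans per layer inside the affected window. The sketch below assumes spans have been exported as dicts with hypothetical `layer`, `status`, and `started_at` fields; the layer labels are illustrative and do not reflect a schema defined by Adaline.

```python
from collections import Counter
from datetime import datetime


def failing_layers(spans, start, end):
    """Count errored spans per layer inside [start, end), most frequent first."""
    counts = Counter()
    for span in spans:
        if span["status"] != "error":
            continue
        if start <= span["started_at"] < end:
            counts[span["layer"]] += 1
    return counts.most_common()


# Hypothetical spans; the last one falls outside the incident window.
spans = [
    {"layer": "retrieval", "status": "error", "started_at": datetime(2024, 1, 1, 12, 5)},
    {"layer": "retrieval", "status": "error", "started_at": datetime(2024, 1, 1, 12, 7)},
    {"layer": "model", "status": "ok", "started_at": datetime(2024, 1, 1, 12, 8)},
    {"layer": "tool", "status": "error", "started_at": datetime(2024, 1, 1, 12, 9)},
    {"layer": "model", "status": "error", "started_at": datetime(2024, 1, 1, 13, 30)},
]

window_start = datetime(2024, 1, 1, 12, 0)
window_end = datetime(2024, 1, 1, 13, 0)

print(failing_layers(spans, window_start, window_end))
# [('retrieval', 2), ('tool', 1)]
```

In this made-up window the errors cluster in retrieval, which would argue against editing the prompt first.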

Weekly review

For a mature project, hold a short weekly review:

Which Monitor metrics moved, and why?
Which Behaviors are new, reactivated, or getting worse?
Which evaluator failures were true product failures?
Which dataset rows were added from production?
Which Improve cycles were approved, edited, or rejected?
Which deployments changed production behavior?
Which saved trace views or exports should be kept for operations?
Which old datasets, views, or evaluators should be cleaned up?

This review keeps Adaline from becoming an incident-only tool. It turns production learnings into datasets, evaluators, and release practices that make the system harder to regress.

Evidence packet

For high-risk changes, keep a small packet before approval or deployment:

Monitor time range and metric change.
Trace view filters or export.
Representative trace IDs and span IDs when allowed by policy.
Behavior link and issue rationale.
Dataset rows added or updated.
Evaluator results before and after the fix.
Deployment comparison and target environment.
Rollback target.
Post-release watch plan.

Good operations create durable evidence. If an example matters enough to drive a release decision, it should probably become a dataset row, evaluator case, saved view, or Behavior investigation.
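The packet can be kept as a small structured record so that gaps are visible before approval. The sketch below mirrors the checklist above as a hypothetical `EvidencePacket` dataclass. The field names are assumptions, and trace IDs are treated as optional because policy may not allow sharing them.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional

# Fields that may legitimately stay empty (for example, when policy forbids sharing IDs).
OPTIONAL_FIELDS = {"trace_ids"}


@dataclass
class EvidencePacket:
    monitor_time_range: Optional[str] = None
    metric_change: Optional[str] = None
    trace_view: Optional[str] = None
    trace_ids: List[str] = field(default_factory=list)
    behavior_link: Optional[str] = None
    dataset_rows: List[str] = field(default_factory=list)
    evaluator_results_before: Optional[str] = None
    evaluator_results_after: Optional[str] = None
    deployment_comparison: Optional[str] = None
    target_environment: Optional[str] = None
    rollback_target: Optional[str] = None
    watch_plan: Optional[str] = None

    def missing(self) -> List[str]:
        """Return required fields that are still empty."""
        return [
            f.name
            for f in fields(self)
            if f.name not in OPTIONAL_FIELDS and not getattr(self, f.name)
        ]


# Hypothetical, partially filled packet.
packet = EvidencePacket(
    monitor_time_range="launch day, 12:00-13:00 UTC",
    metric_change="p95 latency up 50%",
    rollback_target="previous reviewed prompt version",
)

print(packet.missing())
```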

Investigate metric changes

Move from dashboard movement to trace evidence and next actions.

Inspect a trace

Read the request path before deciding what to change.

Build regression coverage

Preserve production failures as repeatable tests.

Release prompts safely

Deploy reviewed prompt versions with comparison, rollback, and monitoring.

