After your application sends traces to Adaline, the goal is not to stare at logs. The goal is to build an operating loop: notice what changed, understand why, turn important evidence into tests, improve the prompt, and release only when the change is reviewable. Use this page as the day-to-day map for a project that is already connected or close to launch.

The operating loop

| Step | Product area | What the team is deciding |
| --- | --- | --- |
| Watch health | Monitor | Did traffic, latency, cost, token usage, or evaluation score change? |
| Inspect evidence | Traces | What actually happened in the request path? |
| Find patterns | Behaviors | Is this a repeated user, assistant, or tool behavior? |
| Preserve coverage | Datasets and Evaluators | What case and rule should prevent this from regressing? |
| Improve safely | Improve | Which candidate fixes the issue without breaking existing checks? |
| Release and watch | Deploy | Which environment should run the reviewed prompt version, and what should be watched after release? |

The loop should feel boring when the system is healthy and useful when it is not. Every serious issue should leave behind better traces, better dataset rows, better evaluators, or a clearer deployment practice.

After integration

Once traces are arriving, confirm these basics before starting prompt improvement:
  • Monitor shows trace volume for the expected project.
  • Traces include useful names for major workflow steps.
  • Tags and attributes identify environment, release, route, user segment, or tenant where appropriate.
  • Model spans include enough variable and response content to debug failures.
  • Sensitive data handling is understood before traces are shared, exported, or copied into datasets.
  • At least one prompt has a small golden dataset and a few evaluators.
  • Deployment environments match the release lanes your application uses.
If these are missing, fix instrumentation and project setup first. Improve and Behaviors become more useful when the underlying traces carry stable metadata.
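
One way to check the metadata basics above is a small script over exported traces. This is a minimal sketch, not Adaline's export schema: the trace shape (a dict with `name` and `attributes`) and the required keys `environment`, `release`, and `route` are assumptions for illustration, so adapt them to whatever your project actually records.

```python
# Hypothetical trace shape: a dict with "name" and "attributes".
# The keys below are illustrative assumptions, not an Adaline schema.
REQUIRED_ATTRIBUTES = ("environment", "release", "route")


def missing_metadata(trace: dict) -> list[str]:
    """Return the required fields that are absent or empty on a trace."""
    problems = []
    if not trace.get("name"):
        problems.append("name")
    attributes = trace.get("attributes") or {}
    for key in REQUIRED_ATTRIBUTES:
        if not attributes.get(key):
            problems.append(f"attributes.{key}")
    return problems


traces = [
    {
        "name": "answer_question",
        "attributes": {"environment": "production", "release": "r42", "route": "/chat"},
    },
    {"name": "", "attributes": {"environment": "production"}},
]

for trace in traces:
    # An empty list means the trace carries all the metadata this check expects.
    print(trace.get("name") or "<unnamed>", missing_metadata(trace))
```

Running a check like this on a sample of traces before starting Improve or Behaviors work shows quickly whether instrumentation needs attention first.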

Choose the next surface

| Situation | Start here | Then go here |
| --- | --- | --- |
| A chart changed after deploy | Monitor | Filter Traces by the release window, then compare deployments. |
| Users report bad answers | Traces | Deep Search for similar cases, then check Behaviors. |
| A failure repeats | Behaviors | Add representative spans to a dataset, then start Improve. |
| A prompt edit is ready | Evaluators | Run datasets, compare deployment diffs, then deploy. |
| Cost is rising | Monitor | Inspect high-cost traces for tokens, tools, model, and output length. |
| Latency is rising | Monitor | Use trace waterfall view before changing prompt text. |
| A release failed | Deploy | Roll back, preserve the failed traces, and add regression coverage. |
| A team asks for proof | Traces | Save a view, export filtered evidence, and include evaluator results. |

First week after launch

Use the first week to learn how the agent behaves under real traffic.

Day 0: launch window

  • Watch Monitor for traffic, latency, cost, token usage, and evaluation score.
  • Open representative traces from normal requests, not only failures.
  • Confirm that deployment webhooks and cache refresh behave as expected.
  • Confirm the runtime is reading the intended prompt and environment.
  • Record the rollback target before the next release.

Day 1: evidence review

  • Review failed traces and low evaluator scores.
  • Use Deep Search for known failure themes.
  • Add the best production examples to regression datasets.
  • Tighten evaluators that failed for the wrong reason.
  • Avoid prompt edits until the team understands what is failing and why.
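
Turning reviewed production examples into regression rows can be sketched as follows. The span shape (`span_id`, `variables`, `response`, `score`), the row shape, and the `max_score` cutoff are hypothetical, chosen only to illustrate the idea of keeping one row per distinct input and recording where it came from.

```python
import json


def to_regression_rows(spans, max_score=0.5):
    """Keep low-scoring spans, one row per distinct set of input variables."""
    rows, seen = [], set()
    for span in spans:
        if span["score"] > max_score:
            continue  # the evaluator was satisfied, so this is not a regression candidate
        key = json.dumps(span["variables"], sort_keys=True)
        if key in seen:
            continue  # this input is already covered by an earlier row
        seen.add(key)
        rows.append(
            {
                "inputs": span["variables"],
                "observed_output": span["response"],
                "source_span_id": span["span_id"],
            }
        )
    return rows


spans = [
    {"span_id": "s1", "variables": {"question": "refund policy"}, "response": "I cannot help.", "score": 0.2},
    {"span_id": "s2", "variables": {"question": "refund policy"}, "response": "Unknown.", "score": 0.1},
    {"span_id": "s3", "variables": {"question": "shipping time"}, "response": "3 to 5 days.", "score": 0.9},
]

for row in to_regression_rows(spans):
    print(row["source_span_id"], row["inputs"])
```

Keeping the source span ID on each row makes it possible to return to the original trace when an evaluator later fails on that case.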

Day 3: behavior review

  • Review Behaviors once enough traces have accumulated.
  • Identify high-volume issues and repeated assistant/tool patterns.
  • Decide which Behaviors should become Improve targets.
  • Add dataset rows for the cases that should never regress.

Week 1: release discipline

  • Run the golden and regression datasets before prompt releases.
  • Compare deployment snapshots before changing production.
  • Watch Monitor after each release.
  • Turn release learnings into saved trace views, evaluators, and dataset rows.
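
The "run datasets before release" habit can be expressed as a simple gate. The sketch below assumes you already have evaluator pass rates for the current production prompt and for the candidate, as plain dictionaries. The evaluator names, the `tolerance` parameter, and the function itself are illustrative and not part of Adaline.

```python
def release_blockers(baseline, candidate, tolerance=0.0):
    """List evaluators whose pass rate dropped by more than `tolerance`."""
    blockers = []
    for name, base_rate in baseline.items():
        cand_rate = candidate.get(name)
        if cand_rate is None:
            # An evaluator that did not run on the candidate is itself a blocker.
            blockers.append(f"{name}: missing from candidate run")
        elif cand_rate < base_rate - tolerance:
            blockers.append(f"{name}: {base_rate:.2f} -> {cand_rate:.2f}")
    return blockers


baseline = {"format": 1.0, "grounded": 0.9}
candidate = {"format": 1.0, "grounded": 0.8}

blockers = release_blockers(baseline, candidate)
if blockers:
    print("Hold the release:", blockers)
else:
    print("No evaluator regressed; continue to deployment comparison.")
```

A gate like this does not replace human review of the deployment comparison; it only makes regressions hard to miss.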

Incident workflow

When production behavior looks wrong:
  1. Open Monitor and capture the affected time window.
  2. Open Traces with the same window and filter by environment, release, route, status, tags, or attributes.
  3. Inspect representative traces in tree and waterfall views.
  4. Use Deep Search if the issue is semantic rather than metadata-based.
  5. Check whether the traces belong to an existing or new Behavior.
  6. Add important model spans to a regression dataset.
  7. Fix the right layer: prompt, evaluator, dataset, tool, retrieval, backend, provider, or deployment.
  8. If a release caused the issue, roll back before running a longer Improve cycle.
Do not start with prompt editing when the root cause might be a tool, retrieval service, provider issue, schema mismatch, or missing runtime context. Use traces to locate the failing layer first.
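
If you work from exported traces, steps 1 and 2 amount to a time-window filter plus attribute matching. The sketch below is a minimal illustration over a hypothetical export format (`id`, an ISO 8601 `started_at`, and an `attributes` dict). The field names are assumptions, and in the product the same narrowing is done with the Traces filters.

```python
from datetime import datetime, timezone


def in_incident_scope(trace, start, end, **expected_attributes):
    """True when the trace started inside [start, end) and matches every attribute."""
    started_at = datetime.fromisoformat(trace["started_at"])
    if not (start <= started_at < end):
        return False
    attributes = trace.get("attributes", {})
    return all(attributes.get(key) == value for key, value in expected_attributes.items())


# The affected window captured from the dashboard (hypothetical values).
window_start = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
window_end = datetime(2024, 6, 1, 13, 0, tzinfo=timezone.utc)

traces = [
    {"id": "t1", "started_at": "2024-06-01T12:05:00+00:00",
     "attributes": {"environment": "production", "release": "r42"}},
    {"id": "t2", "started_at": "2024-06-01T12:30:00+00:00",
     "attributes": {"environment": "staging", "release": "r42"}},
    {"id": "t3", "started_at": "2024-06-01T14:00:00+00:00",
     "attributes": {"environment": "production", "release": "r42"}},
]

affected = [
    t["id"]
    for t in traces
    if in_incident_scope(t, window_start, window_end, environment="production", release="r42")
]
print(affected)  # ['t1']
```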

Weekly review

For a mature project, hold a short weekly review:
  • Which Monitor metrics moved, and why?
  • Which Behaviors are new, reactivated, or getting worse?
  • Which evaluator failures were true product failures?
  • Which dataset rows were added from production?
  • Which Improve cycles were approved, edited, or rejected?
  • Which deployments changed production behavior?
  • Which saved trace views or exports should be kept for operations?
  • Which old datasets, views, or evaluators should be cleaned up?
This review keeps Adaline from becoming only an incident tool. It turns what the team learns in production into tests and practices that make regressions harder to ship.

Evidence packet

For high-risk changes, keep a small packet before approval or deployment:
  • Monitor time range and metric change.
  • Trace view filters or export.
  • Representative trace IDs and span IDs when allowed by policy.
  • Behavior link and issue rationale.
  • Dataset rows added or updated.
  • Evaluator results before and after the fix.
  • Deployment comparison and target environment.
  • Rollback target.
  • Post-release watch plan.
Good operations create durable evidence. If an example matters enough to drive a release decision, it should probably become a dataset row, evaluator case, saved view, or Behavior investigation.
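
The packet can be kept as a small structured record so that gaps are visible before approval. The dataclass below is only a sketch: the field names mirror the list above but are otherwise invented, and treating trace IDs as optional reflects the "when allowed by policy" caveat.

```python
from dataclasses import dataclass, field, fields

# Fields that may legitimately stay empty (for example, when policy forbids sharing IDs).
OPTIONAL_FIELDS = {"trace_ids"}


@dataclass
class EvidencePacket:
    """Hypothetical record of the evidence behind a high-risk change."""

    monitor_time_range: str = ""
    metric_change: str = ""
    trace_view_or_export: str = ""
    trace_ids: list[str] = field(default_factory=list)
    behavior_link: str = ""
    dataset_rows: list[str] = field(default_factory=list)
    evaluator_results_before: dict[str, float] = field(default_factory=dict)
    evaluator_results_after: dict[str, float] = field(default_factory=dict)
    deployment_comparison: str = ""
    target_environment: str = ""
    rollback_target: str = ""
    watch_plan: str = ""

    def missing(self) -> list[str]:
        """Names of required fields that are still empty."""
        return [
            f.name
            for f in fields(self)
            if f.name not in OPTIONAL_FIELDS and not getattr(self, f.name)
        ]


packet = EvidencePacket(
    monitor_time_range="2024-06-01 12:00 to 13:00 UTC",
    metric_change="evaluation score dropped",
    rollback_target="previous reviewed prompt version",
)
print(packet.missing())
```

An empty result from `missing()` is a reasonable precondition for approval; anything it lists is evidence the reviewer would otherwise have to ask for.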

  • Investigate metric changes: Move from dashboard movement to trace evidence and next actions.
  • Inspect a trace: Read the request path before deciding what to change.
  • Build regression coverage: Preserve production failures as repeatable tests.
  • Release prompts safely: Deploy reviewed prompt versions with comparison, rollback, and monitoring.