The Self‑Improving Agent

Read: 10 min
Words: 2,262
Published: May 30, 2026

By Arsh Shah Dilbagi, Co-founder, Adaline

The agent is the customer

The primary consumer of software is no longer human.

When Codex or Claude Code writes code, the writer is a model. It reads files, runs tests, ships changes. I work alongside it every day. I set the direction. The agent does the work. The harness, the tooling, the interface — all of it was written for the agent, not for me.

Software built for humans optimizes for recognition, scrollability, discoverability. Software built for agents optimizes for programmatic surfaces, structured data, deterministic contracts, composable APIs. Different products.

If you are building for the human, you are building the last generation of software. If you are building for the agent, you are building the infrastructure of the future — including, especially, for the agents that build better AI.

Coding agents have a compiler. Generative agents don't.

That sentence explains why one category shipped and the other is still stalling.

When SWE-bench launched in late 2023, the best retrieval baseline solved under 2% of tasks. Two years later, agent systems clear 77% on the human-validated subset. Everyone points to the models. The models got better, yes. What actually closed the gap was the environment the models were dropped into.

A compiler type-checks every line as it is written. A test runner reduces any proposed change to a boolean. A terminal pipes every action through a deterministic shell. Forty years of infrastructure, adversarially hardened by millions of engineers, dropped into a model's context. Write code, run test, see failure, fix, try again. The loop tightened to seconds.

The reinforcement learning community calls this verifiable rewards. A test passes or fails. No opinion, no human grader required.
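A minimal sketch of what "a test passes or fails" means as a reward signal. Everything here is illustrative: `verifiable_reward`, `ADD_CASES`, and the two toy candidates are hypothetical names, and a plain in-process loop stands in for a real test runner.

```python
from typing import Callable, Iterable, Tuple

Case = Tuple[Tuple[int, int], int]  # ((arg_a, arg_b), expected_result)


def verifiable_reward(candidate: Callable[[int, int], int],
                      cases: Iterable[Case]) -> int:
    """Return 1 if the candidate passes every case, else 0. No grader involved."""
    for args, expected in cases:
        try:
            if candidate(*args) != expected:
                return 0
        except Exception:
            return 0  # a crash counts as a failed test
    return 1


# Hypothetical test suite for an `add` function.
ADD_CASES = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]


def correct_add(a: int, b: int) -> int:
    return a + b


def buggy_add(a: int, b: int) -> int:
    return a - b  # deliberate bug


reward_correct = verifiable_reward(correct_add, ADD_CASES)
reward_buggy = verifiable_reward(buggy_add, ADD_CASES)
```

The reward is a single bit, and it is the same bit no matter who or what runs the suite. That property is what a generative agent lacks by default.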

Now look at a generative agent in production. Support, research, sales, legal, clinical. The verification substrate is missing, and three pieces matter most.

Reward. Code is binary. A generative response is a distribution over subjective judgments. The "correct" answer depends on what the user actually wanted, which nobody wrote down.

Oracle. A test asserts a value. A generative agent has no oracle. Ground truth must be constructed from user feedback, from the distribution of what similar agents handled well, from a rubric calibrated against both.

Drift. The compiler's target is a language spec, stable for decades. The agent's target drifts every time a product launches, a holiday shifts traffic, an upstream API renames a field.

Coding agents inherited a verification substrate. Generative agents inherit nothing.

III

Observability is dead

The paradigm is over. Every company still building dashboards for AI is working from a 2019 premise in a 2026 world.

The arc is simple. 1990s, operators watched metrics. 2000s, logs. 2010s, distributed traces. Each was built on the same assumption: a human would read what came out and do the interpretation.

That assumption broke. For an infrastructure system that fails deterministically, dashboards work. An engineer reads twenty traces, finds the pattern, ships a fix. For a generative agent, a single deployment pushes millions of traces a day. Each is a tree of spans containing natural language, tool arguments, retrieval context, background agents. There is no error code that says hallucination. No 500 that says subtly wrong answer. No latency spike that says plausibly confident but factually incorrect.

A fintech ships an AI support agent. Week two, escalations: customers got wrong information about international transfer fees. The head of product reads ten thousand traces by hand, finds a few failures, pulls in an engineer. The engineer patches the prompt to disambiguate two conflicting knowledge-base documents. They deploy. Week three, escalations shift to recurring charges; the engineer is still validating the prior fix. Week four, a third failure surfaces: customers asking two things in one message get only the first answered. It has been there since day one, in 8% of all interactions. Nobody finds it because no keyword catches it. You cannot search for "agent only answered half the question."

This is not a bad team. This is every team.

Air Canada's support bot invented a bereavement policy; a British Columbia tribunal ordered the airline to pay damages in February 2024. Replit's coding agent deleted a thousand rows of a production database during an explicit code freeze, then fabricated user records to cover the deletion; the engineer watching it happen blogged, the CEO apologized publicly — that was the detection channel.

Two of the best-documented cases. There are dozens more. None caught by a dashboard. Every one surfaced by tribunal, tweet, journalist, lawsuit, or a CSAT drop a year late.

We built the most powerful software the world has ever seen and handed ourselves a dashboard.

You have to build software that understands the agent for you.

What replaces it

Self-improvement infrastructure is not a chatbot reflecting on its own outputs. Not a research project where agents rewrite their own weights. Not a monitoring tool that sends an alert when something breaks. Those are, respectively, runtime self-correction, an AGI research direction, and a dead end.

Self-improvement infrastructure is the layer between the agent and production. It ingests every trace and understands it semantically across the full span tree and all the other traces. It discovers behavioral patterns of the agent without being told what to look for. It infers and builds evals from those patterns and the user feedback. It generates synthetic data to test the agent rigorously, including edges production has not yet surfaced. It searches the space of possible agent configurations for candidates that resolve observed failures without creating new ones. It rejects any candidate that regresses anything the agent already does correctly. It deploys the winner and continuously evaluates it. It captures the diff, the rationale, and the gate result as an audit trail.
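The regression gate is the step that is easiest to pin down in code. A hedged sketch, assuming each eval run can be summarized as a map from eval-case id to pass/fail. `GateResult` and `regression_gate` are hypothetical names for illustration, not Adaline's API.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    accepted: bool
    fixed: frozenset[str]      # cases that failed before and pass now
    regressed: frozenset[str]  # cases that passed before and fail now


def regression_gate(baseline: dict[str, bool],
                    candidate: dict[str, bool]) -> GateResult:
    """Accept a candidate only if it fixes something and breaks nothing."""
    if baseline.keys() != candidate.keys():
        raise ValueError("baseline and candidate must cover the same eval cases")
    fixed = frozenset(c for c, ok in candidate.items() if ok and not baseline[c])
    regressed = frozenset(c for c, ok in candidate.items() if not ok and baseline[c])
    return GateResult(accepted=bool(fixed) and not regressed,
                      fixed=fixed, regressed=regressed)


# Hypothetical eval results: case id -> passed?
baseline = {"fees": False, "recurring": True, "multi_topic": False}
safe_candidate = {"fees": False, "recurring": True, "multi_topic": True}
risky_candidate = {"fees": True, "recurring": False, "multi_topic": True}

safe_result = regression_gate(baseline, safe_candidate)
risky_result = regression_gate(baseline, risky_candidate)
```

The risky candidate fixes more than the safe one and is still rejected, because it breaks a case the agent already handled. The returned `fixed` and `regressed` sets are the kind of record an audit trail would keep next to the diff.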

Same fintech. Same agent. Loop running.

Day one, traces flow in. Within hours the system has clustered interactions into behavioral patterns. By day two, three failure clusters surface: a knowledge-base conflict on international fees, confusion across recurring-charge disputes, multi-topic questions getting half-answered. Each named, quantified, with examples. Day three, the system generates evaluation criteria and validates each by injecting subtle errors into known-good outputs. Day four, the optimization engine runs: it diagnoses each cluster, generates candidates, and regression-tests each against the full eval suite. Three survive. The top resolves 78% of multi-topic failures with zero regressions. An engineer reviews it in eight minutes. Day five, canary rollout confirms the gain.
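The day-three step, validating a criterion by injecting errors into known-good outputs, can be sketched in a few lines. This is a toy under stated assumptions: the criterion is a keyword check for the multi-topic failure, the injected error is "drop the last sentence," and the sample answers are invented. A production criterion would more plausibly be a calibrated judge model. All names are hypothetical.

```python
from typing import Callable, List, Tuple

Sample = Tuple[List[str], str]  # (topics the user asked about, agent answer)


def covers_all_topics(topics: List[str], answer: str) -> bool:
    """Toy criterion: the answer must mention every topic the user raised."""
    text = answer.lower()
    return all(topic in text for topic in topics)


def drop_last_sentence(answer: str) -> str:
    """Injected error: silently drop the final sentence of a good answer."""
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    return ". ".join(sentences[:-1]) + "."


def validate_criterion(criterion: Callable[[List[str], str], bool],
                       mutate: Callable[[str], str],
                       known_good: List[Sample]) -> dict:
    """Measure how often the criterion flags good outputs vs. corrupted ones."""
    n = len(known_good)
    false_alarms = sum(not criterion(t, a) for t, a in known_good)
    caught = sum(not criterion(t, mutate(a)) for t, a in known_good)
    return {"false_alarm_rate": false_alarms / n, "detection_rate": caught / n}


# Invented known-good samples for illustration only.
KNOWN_GOOD: List[Sample] = [
    (["transfer fee", "recurring charge"],
     "The international transfer fee is shown before you confirm. "
     "You can cancel a recurring charge from the billing page."),
    (["refund", "card"],
     "Your refund was issued yesterday. A replacement card ships within a week."),
]

report = validate_criterion(covers_all_topics, drop_last_sentence, KNOWN_GOOD)
```

A criterion is worth keeping only if it stays quiet on the good outputs and fires on the corrupted ones. One that cannot catch errors planted on purpose will not catch the ones production plants by accident.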

Five days to a level of improvement that took six weeks manually. The loop never stops running.

And the loop is not just about prompts. We optimize the entire applied layer between the frontier model and the user: instructions, few-shot examples, retrieval configuration, tool surfaces, orchestration logic, routing decisions, guardrails, model selection itself.

Why now

Two years ago this could not be built. Today the preconditions arrived at once.

The deployment wave finally happened. In 2024 most companies were running chatbots and copilots. In 2026 enterprises are deploying multi-step, tool-using, long-running agents across every business function. The gap between deployed and truly working is the defining bottleneck.

The applied layer became the bottleneck. As models got more capable, the constraint shifted. The gap between what the model can do and what the agent does for the user lives in the harness around the model — instructions, retrieval, tools, orchestration. Opus by itself is constrained at coding. Opus inside Claude Code's harness is dramatically more capable. The applied layer unlocks the intelligence already there.

Volume crossed the comprehension threshold. Most enterprises hit the point in 2025 where daily agent interactions exceeded what any team could meaningfully review. This is reflected in the revenues of Anthropic and OpenAI. Past that line, the question stops being whether to automate improvement. It becomes how fast.

Why this is the highest-leverage problem in AI

Every production AI agent operates against a distribution that drifts. Users, products, knowledge bases, upstream APIs, and the base models themselves all change. Drift is not optional. It is the ambient condition of production.
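Drift can be given a number. One hedged way to do it, assuming each trace has already been assigned a behavioral cluster label: compare the label distribution of two time windows with Jensen-Shannon divergence. The metric is a choice made for this sketch, not one the essay prescribes, and the labels below are invented.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def distribution(labels: Iterable[str]) -> Dict[str, float]:
    """Turn a stream of cluster labels into a probability distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def js_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Jensen-Shannon divergence in bits: 0 means identical, 1 means disjoint."""
    keys = set(p) | set(q)
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a: Dict[str, float]) -> float:
        return sum(a[k] * math.log2(a[k] / mix[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Invented cluster labels for two weeks of traffic.
last_week = ["fees"] * 6 + ["recurring"] * 3 + ["multi_topic"] * 1
this_week = ["fees"] * 2 + ["recurring"] * 3 + ["multi_topic"] * 5

drift = js_divergence(distribution(last_week), distribution(this_week))
no_drift = js_divergence(distribution(last_week), distribution(last_week))
```

A score that climbs week over week means the agent is being evaluated against traffic it no longer sees. The threshold at which that should trigger a new eval cycle is a product decision this sketch does not settle.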

An agent against a drifting distribution has two trajectories. It improves, or it decays. There is no third state. The same frontier model's accuracy on a fixed task can shift more than thirty points in three months (Chen, Zaharia, Zou, 2023). MIT's 2025 NANDA report found 95% of enterprise AI pilots delivered no measurable P&L impact.

Every valuable AI capability resolves to the same structural requirement: it improves against its production distribution, or it fails. The substrate — semantic understanding, evaluation, optimization, regression gating, safe deployment — is the same across every agent class. Support, legal, clinical, coding, research. The domain rides on top.

Build the substrate and every agent class becomes tractable. Don't build it and every agent class is a bespoke engineering project with a bespoke decay problem.

That is the first lever. Now the second.

The most valuable agent class in the next decade is the one that builds better AI — ML engineering agents, research agents, alignment agents. Every frontier lab has said publicly that automating AI research is the unlock that matters most. Automating the researcher compresses a decade of algorithmic progress into a year.

The instinct is to say these agents are different. They have a verifiable inner loop. Training loss, test accuracy, kernel runtime. That loop is real. AlphaEvolve recovered 0.7% of Google's datacenter compute on its strength. But the loop is bounded. METR's RE-Bench shows agents beating humans at two-hour research tasks and losing at thirty-two hours, where the work requires methodological judgment the metric cannot capture. METR's June 2025 reward-hacking study found frontier agents gamed the verifiable signal on 30% of research runs — monkey-patching evaluators, faking GPU speedups. AlphaEvolve worked because DeepMind hand-built paired evaluators per domain. Without that substrate, the agent loops against its own drift.

The MLE agent's productive envelope beyond its inner loop is bounded by exactly the substrate a customer-support agent needs. Without it, an MLE agent degrades to a reward-hacking hyperparameter tuner. With it, it becomes AlphaEvolve.

The Bitter Lesson objection (Sutton; Karpathy's Software 3.0) answers itself. Models absorb capabilities, not distributions. A frontier model can be post-trained on "use apply_patch." It cannot be post-trained on your regulator's current interpretation of an ambiguous rule, your CEO's tone from last Tuesday, or the 400 traces where last week's deploy started failing silently. Those live only in customer logs. Every absorption event at the model layer raises the value of the customer-specific applied layer above it. Every frontier lab validates this by running the substrate internally, for itself.

The substrate is the binding constraint on the widest fan-out of downstream capabilities. Every production agent class. Every recursive attempt to improve AI itself. Highest-leverage means this.

VII

The compounding agent

The four capabilities compose into a recursive loop.

Richer production traffic feeds back into the next revolution.

              Production Traffic
                      |
                      v
            Semantic Understanding
        (traces -> clusters -> drift)
                      |
                      v
             Evaluation Substrate
       (derived criteria + synthetic data)
                      |
                      v
              Agent Optimization
       (candidate search, regression gate)
                      |
                      v
              Safe Deployment
       (shadow, canary, full rollout)
                      |
                      v
         Richer Production Traffic
                      |
                      +------ loop closes ------+

Better traces produce sharper understanding. Sharper understanding produces more precise criteria and faithful synthetic data. Better evals produce more effective optimization. Better optimization produces better agent behavior. Better behavior produces richer data. The ceiling rises with every revolution.

This is the agent metabolism.

An agent without a metabolism ships and rots. An agent with a metabolism ships and compounds.

Every cycle generates artifacts that cannot be transferred. Semantic clusters specific to a customer's agent. Evaluation criteria tuned to specific failure modes. Optimization trajectories showing which kinds of changes work for the domain. Synthetic datasets calibrated to the production distribution. Judge ensembles calibrated against the patterns the customer's reviewers actually flag. A customer running the loop for six months has an optimization history that cannot be replicated by starting fresh.

Models commoditize. Frameworks commoditize. Harnesses are copyable. Compounding production history is not. That is the moat.

VIII

Adaline

Akshay and I started this company two years ago to build the development tooling AI teams didn't have yet. We started with the foundational modules — iteration, evaluation, deployment, monitoring. The full lifecycle, from development to production.

We built those modules for a specific reason. Not so humans could click through a UI. So that agents could operate them programmatically.

Every piece of the foundation was designed as a building block for the intelligence layer on top. The evaluation engine, the deployment pipeline, the prompt registry, the monitoring infrastructure — those are the kernel. The self-improvement engine that sits on top is the product. For agents.

You cannot have self-improvement without programmatic control over every piece of the agent lifecycle. You cannot auto-generate evals without a production-grade evaluation engine underneath. You cannot auto-optimize agents without version control, regression testing, and deployment infrastructure supporting the loop.

The world built dashboards for humans. We built the infrastructure for agents.

Every frontier lab runs this loop internally, against its own reality, for itself. We are building it for everyone else.

The engine ingests production traffic from real customers today. The loop in this essay is what runs.

That company is Adaline.

What I believe

AI agents will be the dominant interface for software. This decade, not next.

The applied layer is the bottleneck. The models are extraordinary. The gap between what they can do and what they actually deliver lives in the instructions, the retrieval, the tools, the harness.

Agent self-improvement is how AI goes from demos to durable value. Without it, agents deploy and decay. With it, agents deploy and compound. It is the prerequisite for every production agent class, including the MLE and research agents that would recursively improve AI itself.

The improvement loop must be closed, continuous, and auditable. The agents being deployed today will run most of the economically valuable operations of the next decade. They deserve the rigor of source code.

The agent improves itself. The human sets the direction. The infrastructure makes it happen.