What is the harness in AI agent architecture?

A harness is the set of environmental controls around an AI agent that the agent cannot override: API spend ceilings, hard limits on self-modification, permission gates on outbound calls, circuit breakers, and stop conditions. The harness sits in the path of the energy, not beside it watching. For production AI agents, the harness is the architecture — the model is the engine.

June 7, 2026 · 9 min read

Tokenmaxxing: Why AI Agents Burn Millions of Tokens — and Why GitHub Trained Them To

Tokenmaxxing isn't just a metrics fad. AI agents burn tokens because they were trained on GitHub's verbose code — and run without a harness. From a CTO running production AI in front of 290M+ users, here's what's actually going on, and what to build instead of writing longer rule files.

By Vikas Goel

In early 2026, the tech industry discovered a strange new status symbol: how many AI tokens you burn. The Wall Street Journal called it "tokenmaxxing," and within weeks it was everywhere. Meta employees consumed roughly 60 trillion tokens in a single month, with an internal leaderboard celebrating the top consumers. Amazon pulled its own AI-usage rankings after employees started spinning up agents purely to climb them. Uber reportedly exhausted its entire 2026 AI coding budget by April.

By late May, the trend was being declared dead. Leaderboards were shut down. CFOs got sticker shock. "Inference yield" — value per token — replaced raw consumption as the metric that matters.

The shift in framing matters because the two metrics measure almost opposite things:

Dimension	Tokenmaxxing	Inference Yield
What's measured	Raw token volume consumed	Value or outcome produced per token
Implicit incentive	Generate more, consume more	Generate less, ship more
Failure mode	Runaway cost, agent freelancing, leaderboard gaming	Hard to measure "value" objectively, gameable through narrow KPIs
What it favours	Verbose, eager agents with no harness	Restrained agents with strong evaluation discipline
Who likes it	Vendors selling per-token, employees building adoption metrics	CFOs, engineering leaders, customers
Production reality	Looks good on dashboards, breaks budgets	Hard to compute monthly, but maps to business outcomes
Status as of mid-2026	Officially declared dead	Officially the new orthodoxy, operationally underdefined

The transition isn't a clean upgrade. Inference yield is harder to measure honestly than tokenmaxxing was. But the structural intuition behind the shift is correct: a metric that rewards consumption rewards the system's worst tendencies. A metric that rewards outcome at least points at the right question, even when it can't fully answer it.
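A rough sketch of how the two metrics can rank the same teams in opposite order. The team names, token counts, and the "outcomes shipped" unit are invented for illustration. Choosing an honest outcome measure is exactly the hard part noted above, and this sketch does not solve it.

```python
from dataclasses import dataclass


@dataclass
class UsageRecord:
    """One team's AI usage over a period. All values are illustrative."""
    team: str
    tokens_consumed: int
    outcomes_shipped: int  # hypothetical unit, e.g. changes that survived review


def inference_yield(record: UsageRecord, per_tokens: int = 1_000_000) -> float:
    """Outcomes produced per `per_tokens` tokens consumed."""
    if record.tokens_consumed <= 0:
        return 0.0
    return record.outcomes_shipped * per_tokens / record.tokens_consumed


# Made-up numbers: one eager team, one restrained team.
records = [
    UsageRecord("eager", tokens_consumed=900_000_000, outcomes_shipped=45),
    UsageRecord("restrained", tokens_consumed=120_000_000, outcomes_shipped=30),
]

# A volume leaderboard and a yield ranking pick different winners.
top_by_volume = max(records, key=lambda r: r.tokens_consumed)
top_by_yield = max(records, key=inference_yield)
```

Here the volume leaderboard crowns the "eager" team, while the yield ranking favours the "restrained" one. The arithmetic is trivial. The open problem is agreeing on what counts as an outcome.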

But the post-mortems missed the real question. Tokenmaxxing wasn't only a story about bad incentives and vanity leaderboards. It was a story about machines that genuinely don't know when to stop — and the reason traces all the way back to the data they learned from.

The GitHub Inheritance: AI Learned to Code From Average Code

Every major coding model was trained substantially on public repositories — millions of projects from GitHub and similar sources. That corpus is enormous, but it is not curated for restraint. It is full of:

Boilerplate and scaffolding — generated project templates, copied configuration, repeated patterns across thousands of repos.
Tutorial-grade verbosity — code written to demonstrate, not to ship. Every concept spelled out, every option handled, every file padded with comments.
Over-engineering as a default — abstraction layers, class hierarchies, and "future-proofing" for futures that never arrived.
Abandoned and duplicated work — half-finished projects, forks of forks, the same solution rewritten ten thousand slightly different ways.

A model trained on this corpus doesn't just learn syntax. It learns a statistical worldview: more code is the normal answer. When the average response to a problem in the training data is a full module rather than a three-line fix, the model's instinct becomes generation, not restraint.

Then reinforcement learning from human feedback amplified it. Human raters consistently reward answers that look thorough, complete, and impressive. The model internalizes the lesson: when in doubt, do more. Ask for a function, receive a class hierarchy. Ask for a bug fix, receive a refactor. Mention an idea, receive four hundred lines of unrequested implementation.

This is the engine of tokenmaxxing that no leaderboard created and no leaderboard shutdown will fix: trained eagerness. The model's capability and its compulsion to demonstrate that capability come from the same place.

Why Instructions Don't Stop It: The Soft Contract Problem

The obvious fix — "just tell the agent to do less" — fails for a measurable reason. Research on instruction-following published in 2025 found that compliance degrades as instruction count grows, and that even the strongest models follow fewer than a third of their instructions perfectly in agentic scenarios. Worse, models don't drop only the newest rules; adherence to all rules erodes together.

Agent tooling makes this worse. A typical coding agent's system prompt already contains dozens of instructions before a developer adds a single project rule. Teams respond to disobedience by writing longer rule files, and every added rule dilutes the rest. Developers report rule files hundreds of lines long that the agent can read, quote, and still ignore.

Time degrades the contract further. As long sessions get compressed, explicit rules are summarized into background narrative. A model treats "something that was said" very differently from "something that applies." That's when agents start what the community calls freelancing: refactoring files nobody mentioned, renaming variables across a project, adding dependencies on their own initiative — every action burning tokens that look like productivity on a usage dashboard.

A contract written in prose is a request. The agent's trained eagerness wins the negotiation almost every time.
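By contrast, a contract enforced in code is not open to negotiation. The sketch below is one minimal way to express "do not touch these files" as a gate in the write path rather than as a sentence in a rule file. The policy values (`PROTECTED_DIRS`, `PROTECTED_SUFFIXES`, `TASK_SCOPE`) and the `guarded_write` tool are hypothetical names chosen for illustration, not any particular agent platform's API.

```python
from pathlib import PurePosixPath

# Hypothetical policy values, chosen only for illustration.
PROTECTED_DIRS = {"migrations", ".github"}
PROTECTED_SUFFIXES = {".lock"}
TASK_SCOPE = PurePosixPath("src/billing")


class PermissionDenied(Exception):
    """Raised by the gate when a write is refused."""


def check_write(path: str) -> None:
    """Raise PermissionDenied unless `path` is a permitted write target."""
    p = PurePosixPath(path)
    # Refuse absolute paths and ".." so the scope check cannot be sidestepped.
    if p.is_absolute() or ".." in p.parts:
        raise PermissionDenied(f"{path}: only plain relative paths are accepted")
    if PROTECTED_DIRS & set(p.parts[:-1]) or p.suffix in PROTECTED_SUFFIXES:
        raise PermissionDenied(f"{path}: protected file")
    if TASK_SCOPE not in p.parents:
        raise PermissionDenied(f"{path}: outside task scope {TASK_SCOPE}")


def guarded_write(path: str, content: str, write_file) -> None:
    """The only write tool exposed to the agent; `write_file` does the real I/O."""
    check_write(path)
    write_file(path, content)
```

The design point is that the agent is only ever handed `guarded_write`. It can read, quote, or ignore the policy as it likes, and the refactor of a file nobody mentioned still fails.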

The Tooling Waste Nobody Measured

A third layer of tokenmaxxing had nothing to do with any model's decisions. The tooling itself leaked tokens at industrial scale:

A compaction bug in one popular coding agent retried a failing operation thousands of times per session before a three-line circuit breaker fixed it — waste estimated at hundreds of thousands of API calls per day globally.
Idle plugin and connector definitions can consume 55,000–134,000 tokens of overhead on every message — up to two-thirds of the working context spent before the user types a word.
Naive retry loops resend the entire conversation history on every failure, so the cost of a transient error multiplies with each attempt: the full history is paid for again on every retry.

So when an enterprise dashboard showed a developer "consuming" billions of tokens, it was conflating three completely different things: genuine work, trained over-generation, and pure mechanical leakage. Tokenmaxxing leaderboards rewarded all three equally — because all three look identical when the only metric is volume.
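The mechanical leak from unbounded retries is the easiest of the three to close. As a minimal sketch (not the actual fix shipped by any tool), a retry wrapper with a hard attempt limit looks like this. `call_with_breaker`, `CircuitOpen`, and `retry_token_cost` are illustrative names, and the cost model assumes every attempt replays the same full history.

```python
class CircuitOpen(Exception):
    """Raised when the retry limit is reached."""


def call_with_breaker(operation, max_attempts: int = 3):
    """Run `operation`, giving up after `max_attempts` failed attempts."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return operation()
        except Exception as exc:  # a real system would catch narrower error types
            last_error = exc
    raise CircuitOpen(f"gave up after {max_attempts} attempts") from last_error


def retry_token_cost(history_tokens: int, attempts: int) -> int:
    """Tokens resent when every attempt replays the full conversation history."""
    return history_tokens * attempts
```

With an illustrative 80,000-token history, `retry_token_cost(80_000, 3)` is 240,000 tokens, while the same failure retried 3,000 times would replay 240 million. The breaker does not make the operation succeed. It only bounds what a failure can cost.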

The Real Fix: A Harness, Not a Smarter Horse

The instinct across the industry is to wait for a smarter model. But you don't control a powerful horse by breeding a second horse, and you don't govern a reactor with another reactor. In every domain where humanity has tamed raw power, the control mechanism is simpler than the thing it controls — and it sits in the path of the energy, not beside it watching: reins, fuel valves, control rods.

This is the load-bearing decision in AI agent architecture that almost no agent-platform vendor talks about: the harness is not optional, it is not a future feature, and it cannot be built with prose. For enterprise AI agents in production, the harness is the architecture. The model is the engine; everything that determines whether the engine ships value rather than burning fuel lives in the harness around it.

For AI agents, that means:

A gate on the input side. Meter and curate what flows into the context — lean rule files, disabled unused tools, scoped tasks — instead of measuring exhaust after the burn.
Contracts compiled into the environment. Permission hooks, protected files, and budget ceilings that the agent cannot violate. "Cannot" never decays, never gets compacted away, and never loses a negotiation. One developer's verdict after switching from rules to hooks: a single enforcement hook is worth a hundred lines of instructions.
Circuit breakers and stop conditions. Maximum retries, maximum iterations, an explicit definition of done. Agents have no natural fatigue or deadline — termination must be designed.
Outcome metrics, not volume metrics. Industry data from thousands of teams shows the danger of measuring activity: in high-AI-adoption environments, throughput rose sharply while bugs per developer climbed over 50% and code churn exploded. Measure what survived, not what shipped.
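Two of the items above, the budget ceiling and the designed stop conditions, can be sketched together as the loop that drives the agent. This is an illustration under stated assumptions rather than a reference design. `step` and `is_done` are hypothetical callables standing in for "one model or tool call" and "the explicit definition of done", and the step is assumed to honour the per-step token cap it is given.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised if a step reports more tokens than the budget has left."""


@dataclass
class TokenBudget:
    ceiling: int
    spent: int = 0

    @property
    def remaining(self) -> int:
        return self.ceiling - self.spent

    def charge(self, tokens: int) -> None:
        if tokens > self.remaining:
            raise BudgetExceeded(f"{tokens} tokens requested, {self.remaining} left")
        self.spent += tokens


def run_agent(step, is_done, budget, per_step_cap=4_000, max_iterations=10):
    """Drive the agent until done, out of budget, or out of iterations.

    step(state, max_tokens) -> (new_state, tokens_used)   # hypothetical
    is_done(state) -> bool                                # hypothetical
    Returns (reason, final_state).
    """
    state = None
    for _ in range(max_iterations):
        # Check the ceiling before spending, not after the burn.
        if budget.remaining < per_step_cap:
            return "budget_exhausted", state
        state, tokens_used = step(state, per_step_cap)
        budget.charge(tokens_used)
        if is_done(state):
            return "done", state
    return "max_iterations", state
```

Every exit from the loop has a named reason, and none of them depends on the agent deciding to stop. That is the sense in which termination is designed rather than requested.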

We run Nexiva voice agents in production at enterprise scale across India, MENA, and LATAM, so this is not a hypothetical preference. It is the difference between an agent that finishes a customer call cleanly within a 700-millisecond latency budget and one that spirals into a 30-second hallucination because nothing in the substrate told it to stop. The latency budget is a harness. The eval discipline is a harness. The permission model is a harness. Together they are what makes the system shippable, and the model itself is, in production, almost the smallest part of the engineering problem.
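To make "the latency budget is a harness" concrete, here is a minimal sketch using Python's standard `asyncio.wait_for`. It assumes the 700 ms budget applies to each response turn, and it is not a description of the Nexiva implementation. `generate` and `FALLBACK_REPLY` are hypothetical stand-ins.

```python
import asyncio

LATENCY_BUDGET_S = 0.7  # the 700 ms figure from the text, assumed per turn
FALLBACK_REPLY = "Sorry, could you say that again?"  # hypothetical fallback line


async def reply_within_budget(generate, budget_s: float = LATENCY_BUDGET_S) -> str:
    """Await `generate()` but never wait longer than `budget_s` seconds.

    `generate` is a zero-argument callable returning a coroutine that
    produces the reply text. On timeout, asyncio cancels the in-flight
    generation and a fixed fallback reply is returned instead.
    """
    try:
        return await asyncio.wait_for(generate(), timeout=budget_s)
    except asyncio.TimeoutError:
        return FALLBACK_REPLY
```

The model is never asked to be quick. The substrate simply stops waiting, which is the same "cannot" property as a permission hook or a budget ceiling.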

Tokenmaxxing was never really about leaderboards. It was the first visible symptom of deploying enormous trained eagerness — eagerness learned from the most verbose code corpus ever assembled — without a harness. The leaderboards are gone. The eagerness isn't. The companies that win the next phase won't be the ones with the biggest engines. They'll be the ones who built the reins.

This is also, incidentally, why the ThinkerWave research direction sits where it sits: the load-bearing question for the next generation of agent systems is not how to make the agent more capable but how to let the agent's evaluation criteria evolve alongside its capability, so the harness doesn't ossify around yesterday's definition of success. Patent application 202611044024 covers that mechanism. The point of the work, though, is the same point as this whole essay: in any system this powerful, the control surface needs at least as much engineering attention as the engine.

FAQ

What is tokenmaxxing? Tokenmaxxing is the practice of maximizing AI token consumption — often to hit internal adoption metrics or climb usage leaderboards — treating token volume as a proxy for productivity.

Why do AI coding agents generate so much unnecessary code? Largely because they were trained on public repositories full of boilerplate, tutorial code, and over-engineered projects, then fine-tuned with feedback that rewards thorough-looking answers. The result is a default bias toward generating more rather than less.

Why don't written instructions stop AI agents from over-generating? Instruction-following research shows compliance drops sharply as rules accumulate, and long-session context compression downgrades rules into background text. Prose instructions are soft contracts; only environment-level enforcement (hooks, budgets, permissions) reliably holds.

What replaced tokenmaxxing as the metric that matters? Outcome-based measures — often called inference yield or value per token — which track what AI usage actually produced rather than how much was consumed.

Part of The AI Transformation series. Related reading: AI Agent Architecture for Enterprise Production · Sovereign AI from India · Research.
