The Infrastructure Underneath AI Automation
Twenty-five years of building enterprise systems has taught me that the model is the easiest part. The substrate underneath — observability, recovery, queueing, idempotency, identity, budget control — is where AI automation actually succeeds or fails at scale.
By Vikas Goel
The most common question I get when people learn what we're building at Nexiva is some variant of: "How did you do this so fast?" The honest answer is that we didn't do it fast. We did the unglamorous half of the work for the previous decade, and the AI half rode on top of that.
This post is about the unglamorous half — the infrastructure underneath AI automation, which most teams underbuild and then quietly suffer for. I'm Vikas Goel, CTO at blackNgreen; I've been building enterprise systems for a quarter century. The patterns below are not theoretical. They are what we lean on every day to keep AI voice agents working at carrier scale.
The architectural shift, in one paragraph
Classical automation is deterministic. You write code, the code does what you wrote, observability is mostly about whether the system is up. AI automation is non-deterministic. The system might do the right thing, the wrong thing, a half-right thing, or something completely unexpected, and observability is mostly about whether what it did made sense given what it was trying to do. This shift makes a different set of infrastructure layers load-bearing. Things that were nice-to-haves under classical automation are now the difference between a working system and a smoking crater.
What actually has to be solid
Observability with semantic depth. Logging that captures what the system did (a tool call to the billing API) is necessary but insufficient. You also need to capture why it thought it should do that (the model's reasoning trace, the tool selection logic), what alternatives were considered (the rejected branches), and what evidence it had (the inputs and the retrieved context). Without all of that, you can't debug a misbehaviour even when you can see it. Most teams underinvest here because the data volumes get scary; the alternative is debugging blind.
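As a concrete illustration, here is a minimal sketch of what a semantically rich decision event might look like. The field names, action string, and sink are all hypothetical; a real system would ship these events to a log pipeline rather than a list.

```python
import json
import time
from dataclasses import dataclass, field, asdict

@dataclass
class DecisionEvent:
    """One agent decision, logged with enough context to debug it later."""
    action: str                                        # what the system did
    reasoning: str                                     # why it thought it should
    alternatives: list = field(default_factory=list)   # rejected branches
    evidence: dict = field(default_factory=dict)       # inputs and retrieved context
    ts: float = field(default_factory=time.time)

def log_decision(sink: list, event: DecisionEvent) -> None:
    # In production the sink would be a log pipeline; a list stands in here.
    sink.append(json.dumps(asdict(event)))

events = []
log_decision(events, DecisionEvent(
    action="call:billing.refund",
    reasoning="customer reported double charge; invoice shows duplicate line",
    alternatives=["escalate_to_human", "request_more_info"],
    evidence={"invoice_id": "hypothetical-123", "duplicate_lines": 2},
))
```

The point is the shape, not the schema: every event carries the action, the reasoning, the rejected branches, and the evidence together, so a single log line is enough to reconstruct why the agent did what it did.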
Idempotency, everywhere. A non-deterministic system will retry. It will retry because of a network blip, because of a model timeout, because of a logic loop, because the user said something ambiguous. If your downstream actions are not idempotent, retries become double-charged customers, double-issued credits, double-sent SMSes. Every action the agent can take should be designed so that taking it twice produces the same result as taking it once. This is a discipline before it is a technical pattern.
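A minimal sketch of the pattern, with a toy gateway standing in for a real downstream service. The key idea: derive the idempotency key from the business intent, not the attempt, so that retries of the same intent collide and become no-ops.

```python
import hashlib

class PaymentGateway:
    """Toy downstream service: replaying an idempotency key returns the prior result."""
    def __init__(self):
        self._seen = {}    # idempotency key -> recorded result
        self.charges = 0   # how many times we actually executed

    def charge(self, key: str, amount: int) -> dict:
        if key in self._seen:           # retry: return the recorded result, don't re-execute
            return self._seen[key]
        self.charges += 1
        result = {"status": "charged", "amount": amount}
        self._seen[key] = result
        return result

def idempotency_key(customer_id: str, intent: str) -> str:
    # Keyed on intent, not attempt: a retry produces the same key.
    return hashlib.sha256(f"{customer_id}:{intent}".encode()).hexdigest()

gw = PaymentGateway()
key = idempotency_key("cust-42", "refund-invoice-7")
first = gw.charge(key, 500)
second = gw.charge(key, 500)   # agent retried after a timeout
```

Taking the action twice produces the same result as taking it once, which is exactly the property retries need.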
Budget control as a system, not a setting. AI calls cost money. Agents can recurse. Tool calls can fan out. Without explicit budget control at multiple levels — per turn, per session, per customer per day, per workflow — a single misbehaving prompt can produce a five-figure cloud bill before anyone notices. We treat budget control as a first-class architectural concern, with hard limits, soft limits, alarms, and circuit breakers. It is one of those things that feels overengineered until the day you wish you had it.
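A simplified sketch of layered hard limits, assuming costs are tracked in cents. Real budget control would also include soft limits, alarms, and per-customer daily scopes; this shows only the check-before-spend shape.

```python
class BudgetExceeded(Exception):
    """Raised before the spend happens, never after."""

class BudgetGuard:
    """Two illustrative layers: a per-turn hard cap and a cumulative session cap."""
    def __init__(self, per_turn_cents: int, per_session_cents: int):
        self.per_turn = per_turn_cents
        self.per_session = per_session_cents
        self.session_spent = 0

    def record(self, cost_cents: int) -> None:
        if cost_cents > self.per_turn:
            raise BudgetExceeded("per-turn hard limit")
        if self.session_spent + cost_cents > self.per_session:
            raise BudgetExceeded("per-session hard limit")
        self.session_spent += cost_cents

guard = BudgetGuard(per_turn_cents=50, per_session_cents=120)
guard.record(40)
guard.record(40)               # session now at 80 of 120
try:
    guard.record(50)           # would push the session past its cap
    blocked = False
except BudgetExceeded:
    blocked = True
```

The check happens before the spend is committed, which is what turns a runaway prompt into a rejected request instead of a five-figure bill.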
Identity and authorisation, propagated end to end. When an agent acts on behalf of a customer, the action carries that customer's identity. If the agent calls a tool that calls a backend service that calls another service, the original customer identity has to make it all the way down. Otherwise you end up with an agent that can do things on behalf of customer A using credentials that are also valid for customer B's account, and that's a security incident waiting to happen.
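A sketch of the propagation, with hypothetical service and scope names. The discipline is that every layer forwards the original caller's context and the deepest service authorises against it; no layer mints its own credentials.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CallerContext:
    """The original customer identity, carried through every hop unchanged."""
    customer_id: str
    scopes: tuple

def backend_update_plan(ctx: CallerContext, plan: str) -> str:
    # The deepest service authorises against the ORIGINAL caller, not the agent.
    if "billing:write" not in ctx.scopes:
        raise PermissionError(f"{ctx.customer_id} lacks billing:write")
    return f"plan for {ctx.customer_id} set to {plan}"

def tool_change_plan(ctx: CallerContext, plan: str) -> str:
    # The tool layer forwards the context; it never substitutes its own identity.
    return backend_update_plan(ctx, plan)

def agent_act(ctx: CallerContext) -> str:
    return tool_change_plan(ctx, "premium")

ok = agent_act(CallerContext("cust-A", ("billing:write",)))
try:
    agent_act(CallerContext("cust-B", ("billing:read",)))
    denied = False
except PermissionError:
    denied = True
```

If any layer in the chain swapped in a shared service account instead of forwarding `ctx`, customer B's request would have succeeded, which is precisely the incident the pattern prevents.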
Queueing and back-pressure. Agents can produce bursty load — a sudden spike in calls, a dependency slowing down, a model provider rate-limiting. The system needs queues that absorb the burst, back-pressure that protects downstream services, and graceful degradation that maintains partial functionality when things slow down. The patterns here are 30 years old and still load-bearing.
A versioned eval harness. The thing that catches regressions before customers do. I wrote about this in my voice AI piece — the eval system is part of the product, not separate from it. Versioned evals, regression tests, A/B harness, drift detection. None of this is exciting. All of it is necessary.
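The regression-catching half of such a harness can be sketched very simply. The grader, cases, and toy agent here are all invented for illustration; real graders are task-specific checks or model judges, and the cases are versioned alongside the prompt and model they test.

```python
def grade(transcript: str) -> float:
    # Stand-in grader: real systems use task-specific checks or model judges.
    return 1.0 if "refund issued" in transcript else 0.0

EVAL_CASES = [   # versioned with the prompt/model version they test
    {"input": "I was charged twice"},
    {"input": "What's my balance?"},
]

def run_evals(agent, cases, baseline: dict) -> list:
    """Return the inputs whose score regressed against the recorded baseline."""
    regressions = []
    for case in cases:
        score = grade(agent(case["input"]))
        if score < baseline.get(case["input"], 0.0):
            regressions.append(case["input"])
    return regressions

# A toy "agent" that still handles double-charge reports correctly.
agent_v2 = lambda text: "refund issued" if "charged twice" in text else "let me check"
baseline = {"I was charged twice": 1.0, "What's my balance?": 0.0}
regs = run_evals(agent_v2, EVAL_CASES, baseline)
```

Gating deploys on `regressions == []` is what turns the harness from a dashboard into a safety net.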
Three patterns that have aged surprisingly well
The patterns I rely on most for AI infrastructure are not new. They are largely the same patterns I've been using since I was building telecom systems at Aricent and VNL.
Event sourcing. Every action is an immutable event in an append-only log. State is a function of replaying events. This is invaluable for debugging non-deterministic systems: when something weird happens, you can replay exactly the inputs the system saw and reconstruct exactly what it did. It also makes audit trails free, which matters in regulated markets.
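A minimal sketch with an invented balance domain: events are appended and never mutated, and state is nothing more than a fold over the log. Replaying a prefix of the log reconstructs state at any point in history.

```python
def apply(state: dict, event: dict) -> dict:
    """Pure reducer: state is a function of the events replayed into it."""
    new = dict(state)
    if event["type"] == "credit_issued":
        new["balance"] = new.get("balance", 0) + event["amount"]
    elif event["type"] == "charge_applied":
        new["balance"] = new.get("balance", 0) - event["amount"]
    return new

log = []   # append-only; events are never mutated or deleted

log.append({"type": "credit_issued", "amount": 100})
log.append({"type": "charge_applied", "amount": 30})

def replay(events) -> dict:
    state = {}
    for e in events:
        state = apply(state, e)
    return state

state = replay(log)             # full replay reconstructs current state
as_of_first = replay(log[:1])   # ...or state as of any earlier point
```

Because the reducer is pure, replaying the same events always yields the same state, which is what makes "what exactly did the system see and do" answerable after the fact.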
Circuit breakers. When a downstream service starts misbehaving — slow, flaky, returning weird data — the system stops calling it for a while, falls back to a degraded path, and tries again later. Circuit breakers are 20 years old and they still save us roughly once a week. The same pattern works just as well for AI components: when the model starts producing low-quality outputs (high hallucination rate, weird tool calls, unusually long latencies), break the circuit and fall back to a more constrained behaviour.
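The classic shape, sketched minimally: count consecutive failures, open the circuit at a threshold, serve the fallback while open, and probe the real path again after a cooldown. Thresholds and cooldowns here are arbitrary.

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; retry after `cooldown` seconds."""
    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold, self.cooldown, self.clock = threshold, cooldown, clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                return fallback()           # circuit open: degraded path
            self.opened_at = None           # half-open: probe the real call again
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result

def flaky():
    raise TimeoutError("downstream is slow")

breaker = CircuitBreaker(threshold=2, cooldown=30.0)
results = [breaker.call(flaky, lambda: "degraded") for _ in range(3)]
```

For the AI variant, `fn` would wrap the model call and "failure" would be a quality signal (hallucination rate, malformed tool call, latency) rather than an exception, but the state machine is the same.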
Backpressure with bounded queues. When load exceeds capacity, the system pushes back rather than buffering indefinitely. This is uncomfortable — operators would rather queue than reject — but unbounded queues are how you turn a 10-minute outage into a 6-hour one. We use bounded queues with explicit overflow behaviours everywhere.
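The pattern in miniature: a queue with a hard capacity that rejects at the boundary, so the caller feels the back-pressure instead of the buffer silently growing. The overflow behaviour here is "reject"; dropping oldest or shedding by priority are equally valid explicit choices.

```python
from collections import deque

class BoundedQueue:
    """Rejects at capacity instead of buffering indefinitely."""
    def __init__(self, maxsize: int):
        self.q = deque()
        self.maxsize = maxsize
        self.rejected = 0

    def offer(self, item) -> bool:
        if len(self.q) >= self.maxsize:
            self.rejected += 1    # explicit overflow: caller sees the back-pressure
            return False
        self.q.append(item)
        return True

q = BoundedQueue(maxsize=2)
accepted = [q.offer(i) for i in range(4)]
```

The uncomfortable part is that `offer` can return `False` during an incident; the alternative is an unbounded queue whose drain time, not the outage itself, determines how long recovery takes.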
What's genuinely new
Three things in AI automation that don't have great precedents:
Semantic monitoring. Detecting that "something has changed" in a non-deterministic system requires looking at distributions of behaviours, not single values. Did the agent's average reasoning length change? Are tool selection patterns drifting? Is the rate of escalations climbing? These are statistical questions on event streams, and the tooling for answering them in real time is still maturing.
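A deliberately crude sketch of the distributional idea, using reasoning-trace length as the monitored behaviour. This is just a z-score on the mean; production drift detection would use proper statistical tests (Kolmogorov-Smirnov, population stability index) over sliding windows, and the numbers below are invented.

```python
import statistics

def drifted(baseline: list, recent: list, z_threshold: float = 3.0) -> bool:
    """Flag when the recent mean sits more than z_threshold baseline
    standard deviations away from the baseline mean."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return abs(statistics.mean(recent) - mu) > z_threshold * sigma

# Reasoning-trace lengths (tokens) per call: a distribution, not a single value.
baseline_lengths = [200, 210, 195, 205, 198, 202, 207, 199]
stable = [201, 204, 197]
shifted = [320, 310, 335]   # agent suddenly reasoning much longer
```

The crucial difference from classical monitoring is the input: you are comparing distributions of behaviours over an event stream, not checking a single gauge against a threshold.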
Cost as a runtime concern. Classical systems' costs were dominated by fixed infrastructure. AI systems have a per-call cost component that varies with what the user said, how complex the agent's reasoning was, and how many tools got called. Cost is now something you measure at the request level and budget at the session level. The discipline this requires didn't exist five years ago.
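The request-level accounting can be sketched in a few lines. The rates below are purely illustrative, not any provider's real pricing; the point is that cost becomes a per-request function of tokens and tool calls, rolled up to a session total you can budget against.

```python
def request_cost_cents(input_tokens: int, output_tokens: int, tool_calls: int,
                       in_rate=0.0003, out_rate=0.0015, tool_rate=0.1) -> float:
    """Cost of one request; rates are illustrative cents-per-token / per-call."""
    return (input_tokens * in_rate
            + output_tokens * out_rate
            + tool_calls * tool_rate)

session = [
    {"input_tokens": 1200, "output_tokens": 300, "tool_calls": 2},
    {"input_tokens": 2500, "output_tokens": 800, "tool_calls": 5},
]
per_request = [request_cost_cents(**r) for r in session]
session_total = sum(per_request)
```

Feeding `session_total` into the kind of budget guard described earlier is what closes the loop between measuring cost and enforcing it.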
Behavioural rollback. When you deploy a new prompt or model version and quality degrades, you need to roll back fast. But the system has memory now — conversations in flight, accumulated context, in-progress workflows. Rolling back without breaking those is its own engineering problem, and one I haven't seen well-solved by anyone, including us.
What I would tell a team starting from scratch
If I were starting an AI automation project today with no existing infrastructure, here is the order I'd build things:
- Observability first. Before the model, before the agent, before the prompt. If you can't see what your system is doing, nothing else matters.
- Idempotency next. Make every external action safely retryable. This will save you from a hundred small disasters.
- Budget controls third. Hard limits, alerts, circuit breakers. Now.
- Then build the agent. The capability work is where the fun is, but it is the easiest part to get right and the hardest part to retrofit foundations underneath.
- Eval harness in parallel with the agent. Not after. Not when you have time. In parallel, treated as part of the same workstream.
The teams that follow roughly this order ship slow and steady. The teams that build the agent first and the foundations later ship fast initially and then either rebuild everything or quietly hit a quality plateau they can't escape from. I have watched this play out enough times now to recommend the slow path with full confidence.
The unromantic conclusion, again
The model is the easiest thing in your AI stack. It will be commoditised on a 12-month cycle. The infrastructure underneath — observability, idempotency, budget control, identity, queueing, eval — is where the durable engineering work lives. That work is unsexy, doesn't fit into a launch announcement, and almost no one outside your team will appreciate it. Do it anyway. It is the difference between an AI system that gets better over time and one that quietly degrades while everyone tells you the model is fine.
If you want more on this — see my pieces on AI agents reshaping business and building voice AI systems, or reach out.
- AI
- infrastructure
- architecture
- automation