
How AI Agents Will Reshape Businesses (And What Most People Get Wrong)

AI agents are not faster chatbots. They are autonomous systems that can hold a goal across hundreds of steps, recover from their own mistakes, and act in the world. The implications for business are bigger than most leaders realise — and the failure modes are different from the ones they're preparing for.

By Vikas Goel

Most conversations I have about AI agents in 2026 still start in the wrong place. Someone tells me they're "implementing agents" and what they mean is they've replaced a multi-step form with a chatbot, or wired an LLM to a CRM. That is not what is interesting about agents. That is barely automation.

The interesting thing about agents — the thing that genuinely reshapes how a business works — is what happens when you give a system a goal, a set of tools, and the ability to plan, act, observe, and revise without a human in the loop. Suddenly the unit of work is not a single LLM call. It is a multi-step trajectory through a problem, with branching, recovery, and self-evaluation.
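
To make that loop concrete, here is a minimal sketch of the plan-act-observe-revise cycle. The `decide` and `act` callables are hypothetical stand-ins for the LLM and the tool layer, not any particular framework's API:

```python
from typing import Callable

def run_agent(goal: str,
              decide: Callable[[list], dict],  # stand-in for the LLM planning call
              act: Callable[[dict], str],      # stand-in for the tool-execution layer
              max_steps: int = 20) -> str:
    """One goal, many steps: plan, act, observe, revise, until done or out of budget."""
    history = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        decision = decide(history)                   # plan: pick the next action
        if decision.get("type") == "finish":
            return decision["answer"]                # goal met
        observation = act(decision)                  # act: run the chosen tool
        history.append({"role": "tool",              # observe: feed the result back
                        "content": observation})
        # "revise" is implicit: the next decide() call sees the updated history
    return "escalate_to_human"                       # budget exhausted: fail safely
```

The loop itself is trivial; everything interesting lives in what `decide` and `act` are allowed to do, and in what happens when the step budget runs out.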

I'm the CTO of Nexiva, the AI voice agent platform we spun up at blackNgreen. We are now live across India, the Middle East, and Latin America, handling inbound service queries, outbound sales, and collections. The transition from "voice bot" thinking to "voice agent" thinking changed almost everything about how we build, test, and operate the product.

Three things that change once you take agents seriously

The first is that the boundary of the system moves outward. A chatbot's job ends when it returns a response. An agent's job ends when the customer's problem is solved — which might be six tool calls, two API failures, one retry, and a successful billing adjustment later. This means the failure surface is no longer "did the model say the right thing?" It is "did the system, as a whole, take the right sequence of actions?" That is a much harder evaluation problem, and it is where most production AI systems still fall over.
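
One way to see the shift: a trajectory-level check grades the sequence of actions the system took, not any single response. A rough sketch, with illustrative field names rather than anything from a real schema:

```python
def evaluate_trajectory(steps: list[dict],
                        required_actions: list[str],
                        forbidden_actions: set[str]) -> dict:
    """Grade a whole run: did the required actions happen, did forbidden ones slip in?"""
    taken = [step["action"] for step in steps]
    missing = [a for a in required_actions if a not in taken]
    violations = [a for a in taken if a in forbidden_actions]
    return {
        "passed": not missing and not violations,
        "missing_actions": missing,   # e.g. identity was never verified
        "violations": violations,     # e.g. a billing adjustment on the wrong account type
        "num_steps": len(taken),      # unusually long trajectories are a smell worth tracking
    }
```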

The second is that latency becomes a system property, not a model property. In a chatbot, you can hide a slow LLM behind a typing indicator. In a voice agent talking to a real customer over the phone, you can't. The latency budget has to absorb speech-to-text, model inference, tool calls, possibly a knowledge-base lookup, possibly a clarifying turn, and text-to-speech — all inside conversational silence thresholds. We obsess over P99s here, not because we're perfectionists, but because a customer who waits four seconds for a response is already wondering whether the line dropped.
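
A back-of-envelope version of that budget, using placeholder numbers rather than real measurements, looks something like this:

```python
TURN_BUDGET_MS = 1500        # illustrative conversational-silence threshold, not a standard

p99_ms = {                   # all component numbers below are placeholders
    "speech_to_text":  250,
    "model_inference": 600,
    "tool_call":       300,
    "kb_lookup":       150,
    "text_to_speech":  200,
}

total = sum(p99_ms.values())
print(f"P99 turn latency: {total} ms, headroom: {TURN_BUDGET_MS - total} ms")
# If headroom goes negative, something has to move: stream the TTS, parallelise
# the lookup, or take a tool call out of the hot path.
```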

The third is that the cost structure inverts. Old automation: low marginal cost per interaction, but expensive to extend to new use cases (you have to build a new flow). Agentic automation: higher marginal cost per interaction (tokens are not free), but extending to new use cases is dramatically cheaper, sometimes a prompt change away. This changes which problems are worth solving. Things that would never have justified a custom-built workflow now become viable.
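
A toy comparison shows the shape of the inversion. Every number below is made up purely for illustration:

```python
def total_cost(build_cost: float, per_interaction: float, volume: int) -> float:
    """Fixed build cost plus marginal cost per interaction."""
    return build_cost + per_interaction * volume

for volume in (1_000, 10_000, 100_000, 1_000_000):
    classic = total_cost(build_cost=50_000, per_interaction=0.01, volume=volume)
    agentic = total_cost(build_cost=2_000, per_interaction=0.15, volume=volume)
    print(f"{volume:>9,} interactions  classic: {classic:>10,.0f}  agentic: {agentic:>10,.0f}")

# At low volumes the agent wins on build cost; at very high volumes the hand-built
# flow's cheap marginal cost catches up. The interesting shift is all the low-volume
# use cases that were never worth a custom build in the first place.
```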

What most leaders are preparing for vs. what actually happens

I keep seeing the same risk register in enterprise AI rollouts: hallucination, bias, data leakage, model availability. These are real concerns, and you should absolutely have controls for them. But they are not where production AI agents fail in 2026. The failure modes I see most often are quieter and structurally harder.

Goal drift. The agent starts on the customer's stated problem, then gets pulled into a sub-problem that looks adjacent but isn't, and twenty turns later you are arguing about a feature flag instead of issuing a refund. This is not hallucination — every individual step is fine. The system just lost track of what it was supposed to be doing.
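
One mitigation worth sketching is a periodic goal-adherence check: every few turns, score whether the current action still serves the original goal, and re-anchor if it doesn't. The `judge_relevance` scorer here is a hypothetical callable (for example, a cheap LLM-as-judge call), not a real API:

```python
DRIFT_THRESHOLD = 0.5        # illustrative values
CHECK_EVERY_N_TURNS = 5

def maybe_reanchor(original_goal: str, current_action: str,
                   turn: int, history: list[dict],
                   judge_relevance) -> list[dict]:
    """Every few turns, check the current action against the original goal.
    `judge_relevance` is a hypothetical scorer returning 0.0-1.0."""
    if turn % CHECK_EVERY_N_TURNS != 0:
        return history
    if judge_relevance(goal=original_goal, action=current_action) < DRIFT_THRESHOLD:
        # Re-inject the original goal verbatim, so the next planning step sees it
        # rather than a twenty-turn summary of the detour.
        history.append({
            "role": "system",
            "content": f"Reminder: the customer's original request was: {original_goal}. "
                       "Return to it or escalate.",
        })
    return history
```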

Tool misuse under uncertainty. When the agent is unsure what to do, it tends to do something, because the prompt has trained it to be helpful. Often that something is a tool call with plausible-but-wrong arguments. In a customer support context, this can mean correctly formatted refunds for the wrong amounts, or transfers to the wrong queue. The system looks confident; the action is wrong.
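
A blunt countermeasure is to gate risky tool calls in plain code: validate arguments against hard business rules, and route low-confidence irreversible actions through confirmation instead of letting the model be helpful. A sketch with illustrative rules and an assumed `confidence` field:

```python
IRREVERSIBLE = {"issue_refund", "transfer_call", "close_account"}   # illustrative set

def gate_tool_call(call: dict, max_refund: float = 500.0) -> str:
    """Return 'allow', 'confirm_with_customer', or 'reject' for a proposed tool call.
    The point is that the check lives outside the model, in deterministic code."""
    name, args = call["name"], call["args"]
    if name == "issue_refund":
        if not (0 < args.get("amount", -1) <= max_refund):
            return "reject"                    # plausible-but-wrong amount
        if args.get("order_id") is None:
            return "reject"                    # no order to refund against
    if name in IRREVERSIBLE and call.get("confidence", 0.0) < 0.8:
        return "confirm_with_customer"         # unsure? ask before acting
    return "allow"
```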

Evaluation rot. You set up a benchmark, the agent does well, you ship. Then production drift sets in — new product SKUs, new customer phrasings, new edge cases — and your benchmark stops measuring what matters. The agent's score stays high while its actual quality silently degrades. This is what got me interested in the evaluation gap in the first place.
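
The cheapest guard is to keep grading a sample of live conversations with the same rubric, and to keep measuring how much of production traffic the benchmark still resembles. A sketch, with the scorers passed in as hypothetical callables and illustrative thresholds:

```python
from statistics import mean

def eval_health_check(live_sample: list[dict],
                      score,        # same rubric used on the frozen benchmark (hypothetical)
                      similarity,   # closeness of a live conversation to its nearest benchmark case
                      alert) -> None:
    """Weekly check: is live quality holding, and does the benchmark still cover live traffic?"""
    live_quality = mean(score(c) for c in live_sample)
    coverage = mean(similarity(c) for c in live_sample)

    if live_quality < 0.85:    # thresholds are illustrative
        alert("live conversation quality below target")
    if coverage < 0.60:
        alert("production traffic has drifted away from the benchmark; refresh the set")
```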

What this means for businesses, concretely

Three implications I'd bet on:

Customer-facing roles get stratified faster than people expect. The bottom of the pyramid — first-line support, basic sales qualification, collections reminders — is increasingly handled by agents not because they are cheaper but because they are available: 24/7, multilingual, no queue. The middle, where agents handle structured-but-complex flows, is the next wave. The top — judgment-heavy work, complex negotiation, anything genuinely novel — stays human, and probably pays better.

The skill that compounds is system design, not prompt engineering. Prompts are commodities. The teams that win are the ones who design for graceful failure, instrumented evaluation, and the recovery paths that customers never see. That skill set looks a lot more like distributed systems engineering than like the "prompt engineering" job titles of 2023.

Companies that built deep operational instrumentation in the last decade have a structural advantage. This is the unglamorous half of why blackNgreen has been able to build Nexiva at the pace we have. Decade-old logging and observability become the substrate that makes agent behaviour analysable. Without that, you are flying blind, and agent rollouts feel like they're going well right up until they aren't.

The unromantic conclusion

AI agents will reshape businesses. But not the way most decks suggest. They will not "10x productivity overnight." They will eat well-defined, well-instrumented, frequently repeated workflows where the system can learn from its own outcomes. They will struggle in places where the goals are ambiguous, the data is thin, or the failure modes are political rather than technical. And the companies that benefit most will be the ones that took the boring engineering work seriously a decade ago.

If you're building in this space, two thoughts I keep coming back to. First, optimise for observability over capability in the early days — a slightly worse agent that you can debug is worth more than a slightly better agent that you can't. Second, the agent's evaluation criteria are part of the design, not an afterthought. If you're shipping something that scores 9/10 on a benchmark you wrote yourself last quarter, you are quietly already in trouble.
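
In practice, optimising for observability starts somewhere very boring: every tool call gets logged with its arguments, outcome, and timing, so any bad trajectory can be reconstructed later. A minimal standard-library sketch of that idea:

```python
import functools
import json
import logging
import time
import uuid

log = logging.getLogger("agent.tools")

def traced(tool):
    """Wrap a tool function so every call is logged with arguments, outcome, and timing."""
    @functools.wraps(tool)
    def wrapper(*args, **kwargs):
        call_id = uuid.uuid4().hex[:8]
        start = time.perf_counter()
        status = "error"
        try:
            result = tool(*args, **kwargs)
            status = "ok"
            return result
        finally:
            log.info(json.dumps({
                "call_id": call_id,
                "tool": tool.__name__,
                "args": repr((args, kwargs)),   # enough to reconstruct a bad trajectory
                "status": status,
                "ms": round((time.perf_counter() - start) * 1000),
            }))
    return wrapper
```

Applied as a decorator on each tool, it costs a few lines and pays for itself the first time an agent does something you cannot explain.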

I write more about this kind of thing on the blog. If you're working on production AI agents and want to compare notes, get in touch.