Nic Maquet
15th September 2025
Why AI agents stumble in large, complex codebases

AI coding agents are still very new. Tools like Claude Code and Codex have only been available for a few months, yet they’ve already become remarkably popular. What the industry lacks, however, is experience. We don’t have years of lessons to draw on. Instead, we are collectively figuring out how these tools fit into our workflows, and learning quickly that they are both powerful and difficult to use well.

A recurring piece of advice today is to keep coding agents on a very short leash. Developers are told to give them small, tightly scoped tasks, provide exhaustive context, and stay in the loop at every step. Used this way, they can be helpful. But this is a narrow interpretation of what agentic coding could be. The real promise is delegation: being able to hand off a task and let the agent make progress independently while you focus elsewhere. That may not be possible for tasks of every size and complexity, but it should at least be possible to delegate some real unit of work reliably.

That promise collides directly with the reality of large, complex, messy codebases. The reason developers are told to be so explicit and to hold the agent's hand is that agents stumble badly as scope grows: both the scope of the task and the size of the code it touches. This problem is magnified in real systems, where codebases are sprawling, inconsistent, and weighed down with technical debt.

At Devramp, our stance is that the industry cannot settle for short-leash usage if we want the productivity gains agentic coding has promised. Agents must be able to "get the gist" more often than not, and operate asynchronously on real codebases without constant babysitting. Achieving that requires new approaches and new tools.

Babysitting vs. Over-Specification Is a False Choice

When used on a short leash, agents can deliver strong results. But there’s a catch: this only works reliably if the developer provides extremely precise instructions, restricts the task to a narrow scope, and stays in the loop at every step. That is the opposite of delegation. It feels like babysitting, and it ties up the developer’s time instead of freeing it. It is of course possible to be productive with agents this way, but there is a natural cap on that productivity boost.

On the other end of the spectrum is long-leash usage, where the agent is expected to make progress more independently. This is where things unravel. The agent starts every session with no prior knowledge of the codebase, and its orientation relies on basic primitives: listing directories, opening files, and searching via keyword grep or embedding-based search. These can surface relevant fragments, but none conveys the big-picture structure of the system. Without that context, the agent is forced to extrapolate. In large, complex codebases with inconsistent patterns and hidden domain knowledge, those guesses quickly compound into serious errors.

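To make that concrete, here is a rough TypeScript sketch, with hypothetical helper names, of the basic primitives a fresh agent session has to orient itself with. None of them says anything about architecture, abstractions, or invariants.

```ts
// A rough sketch of the orientation primitives available to a fresh agent
// session: list a directory, open a file, grep for a keyword. Helper names
// are hypothetical; the point is that none of these conveys system structure.
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Directory listing: names only, with no sense of which parts matter.
function listDir(dir: string): string[] {
  return readdirSync(dir);
}

// Opening a file returns raw text, which eats into the context window.
function openFile(path: string): string {
  return readFileSync(path, "utf8");
}

// Naive keyword grep: surfaces fragments, not the big picture.
function grep(root: string, keyword: string): string[] {
  const hits: string[] = [];
  for (const entry of readdirSync(root, { withFileTypes: true })) {
    const full = join(root, entry.name);
    if (entry.isDirectory()) {
      hits.push(...grep(full, keyword));
    } else if (entry.isFile()) {
      readFileSync(full, "utf8")
        .split("\n")
        .forEach((line, i) => {
          if (line.includes(keyword)) hits.push(`${full}:${i + 1}: ${line.trim()}`);
        });
    }
  }
  return hits;
}
```
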
Developers experience this directly. Sometimes it’s the obnoxiously cheery "you’re absolutely right!" right after you’ve told it that it missed something important. Other times it’s a desperate Search(<12 keywords>) tool invocation as it fumbles for bearings. In either case, the outcome is the same: work that looks plausible but misses key patterns, abstractions, or constraints in the codebase.

Yes, you can try to compensate by writing exhaustive specs: hundreds of words of Markdown spelling out every requirement. But that’s not real delegation either; it’s tedious and impractical at scale. Both extremes, short-leash babysitting and long-leash over-specification, fall short of what agentic coding is supposed to deliver.

Agents Are Stuck in Unfamiliar Territory

Why does this happen? The issue isn’t a lack of power. Agents (or rather, the LLMs that power them) have extraordinary raw ability: they can generate correct code, refactor with precision, and explain complex logic when the scope is narrow and context is complete. The problem is that they are not given the opportunity to invest in understanding the system the way humans do.

When a developer encounters a new codebase, the first few days (or weeks if the codebase is large enough!) feel clumsy: jumping between files, following call-chains, poking at unfamiliar patterns. But over time, those fragments coalesce into a clearer and clearer mental model: the architecture, the abstractions, the invariants, the "shape" of the system. That model gets richer with every task, and its cost is amortized over weeks and months of work. To a human, this increasing familiarity with the code becomes compounding leverage.

Agents get no such lasting benefit from exposure to the code. Every session is a cold start, with no accumulated context and no persistence of what they’ve already discovered. They cannot afford to spend hours orienting themselves, and even if they did, there is clear evidence that the performance of large language models degrades as the context window fills. Today’s prevailing wisdom treats the context window as a precious resource: it’s consumed not only by the user’s conversation, but also by the system prompt, tool descriptions, and instruction files like AGENTS.md. Asking the agent to load vast portions of a codebase into that same window goes directly against the grain of current best practice. The result is that orientation never really happens; agents are forced to operate from fragments, extrapolating outward. That brittle strategy collapses under the weight of large, messy systems.

This is the root cause: not a lack of raw capability, but the absence of familiarity. Humans gain effectiveness by steadily building a model of the system and reusing it across countless tasks. Agents never get that chance. Each time they begin, they’re starting from zero, forced to cut corners and miss patterns that matter. In complex codebases, experience beats horsepower every time. Until agents can develop or be given a persistent model of the systems they work in, they will remain unreliable, and their promise of long-leash delegation will stay out of reach.

The Path to Reliable Agentic Delegation

Given how little context agents actually start with, it’s remarkable they can accomplish as much as they do: delivering anything useful while starting from zero familiarity is a testament to how far the technology has come. If we can close the familiarity gap, the upside could be enormous. Long-leash delegation could go from a frustrating no-go to a reliable experience.

The first step is accepting that familiarity comes at a cost. Just as humans invest time upfront to read, explore, and internalize a system, agentic workflows will need a phase of learning and synthesis. That effort is unavoidable. The good news is that the investment pays back. Large, complex codebases evolve slowly; their structure takes years to form and tends to remain stable. Once learned, that structure can be reused across countless tasks, just as it is for humans.

But familiarity is not just a dump of information. A developer doesn’t hold every detail of a system in their head at once. They carry a mental model, stored in long-term memory, and can recall just the right fragment when needed. This effortless, selective recall is what makes them effective. To give agents the same leverage, we need two things: first, systems that can synthesize a codebase-level model; second, mechanisms to inject just the relevant parts of that model into the workflow without overwhelming the context window.

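As a rough illustration of the second piece, here is a minimal TypeScript sketch, with entirely hypothetical types and names, of selective recall: given a pre-built set of synthesized notes about the codebase, pick only the fragments relevant to the task, within an explicit budget, so the context window is never flooded with raw source.

```ts
// Hypothetical shape of a synthesized codebase model: one short note per
// directory, file, or symbol, produced ahead of time.
interface ModelEntry {
  path: string;                            // e.g. "src/billing/invoice.ts"
  kind: "directory" | "file" | "symbol";
  summary: string;                         // purpose, patterns, constraints
  keywords: string[];                      // cheap relevance signal for this sketch
}

// Selective recall: score entries against the task description and keep the
// best ones that fit within a character budget. A real system might rank with
// embeddings; keyword overlap is enough to show the idea.
function recallRelevant(
  model: ModelEntry[],
  task: string,
  charBudget = 4000,
): ModelEntry[] {
  const terms = task.toLowerCase().split(/\W+/).filter(Boolean);
  const scored = model
    .map((entry) => ({
      entry,
      score: terms.filter(
        (t) => entry.keywords.includes(t) || entry.summary.toLowerCase().includes(t),
      ).length,
    }))
    .filter((s) => s.score > 0)
    .sort((a, b) => b.score - a.score);

  const picked: ModelEntry[] = [];
  let used = 0;
  for (const { entry } of scored) {
    if (used + entry.summary.length > charBudget) break; // stop at the budget
    picked.push(entry);
    used += entry.summary.length;
  }
  return picked; // a handful of summaries to inject, instead of raw source
}
```
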
This is the real opportunity: to move beyond optimistic extrapolation and give agents a foundation of familiarity. With that foundation, their raw ability could be multiplied, and proper delegation workflows could become not only possible, but dependable.

Our Approach — Closing the Familiarity Gap

At Devramp, we see the familiarity gap as the central challenge to making coding agents dependable. Closing it requires more than bigger models or longer context windows; it calls for infrastructure that lets agents build and reuse a working mental model of a codebase. Our approach rests on three pillars.

  • First, pre-analysis: We synthesize structured information about every symbol, file, and directory in a codebase ahead of time, and keep that knowledge up to date with source control. Each change triggers updates so that the model of the codebase remains fresh. This upfront effort is the equivalent of doing the hard reading once and taking detailed study notes. These notes can then be reused across countless tasks.
  • Second, structured access: Instead of forcing agents to rely on grep or embeddings alone, we expose this synthesized knowledge through standard protocols like MCP. An agent can query by symbol, file, or directory and immediately retrieve high-level summaries of what lives there: purpose, semantics, patterns, constraints. This allows it to get oriented without exhaustively re-reading raw code (a minimal sketch of such a tool follows this list).
  • Third, prescriptive guidance: We instruct agents to begin each task with a deliberate exploration phase. Rather than let them stumble through haphazard searching, we direct them to traverse the codebase model top-down: from directories, to files, to symbols, zooming in and out as needed until they’ve assembled a coherent picture of what matters for the task at hand. This simulates the effortless recall that experienced developers rely on. It isn’t free (some time and tokens are spent on exploration) but it leverages a body of prior work, so orientation is fast, systematic, and reliable.

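To make the structured-access pillar concrete, here is a minimal sketch of what such a tool could look like, assuming the @modelcontextprotocol/sdk TypeScript server API; the tool name, input shape, and note store are hypothetical.

```ts
// Minimal sketch of a structured-access MCP server, assuming the
// @modelcontextprotocol/sdk TypeScript API. Tool name, schema, and the note
// store are hypothetical; a real server would be backed by the pre-analysis
// pipeline and kept in sync with source control.
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

// Synthesized notes produced by pre-analysis: one per directory, file, or symbol.
const notes = new Map<string, string>([
  ["src/billing", "Billing domain: invoicing and proration; all writes go through BillingService."],
  ["src/billing/invoice.ts", "Invoice aggregate. Invariant: totals are always recomputed, never stored."],
]);

const server = new McpServer({ name: "codebase-model", version: "0.1.0" });

// Agents call this instead of grepping: given a directory, file, or symbol,
// return its high-level summary so orientation can proceed top-down.
server.tool(
  "describe",
  { target: z.string().describe("Directory, file, or symbol to summarize") },
  async ({ target }) => ({
    content: [
      {
        type: "text" as const,
        text: notes.get(target) ?? `No synthesized notes for ${target}.`,
      },
    ],
  }),
);

await server.connect(new StdioServerTransport());
```
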
Taken together, these pillars turn orientation from an improvised scramble into a structured, repeatable process. By relying on pre-analysis and top-down exploration, agents can approach each task with a consistent frame of reference. The result isn’t perfection, but it is a workflow that is far more deterministic and reliable, and makes genuine delegation possible, even in large and complex codebases.

An Exciting Road Ahead

Every generation of developer tooling has hit this same inflection point: early excitement, obvious limitations, and then the slow, deliberate work of building the infrastructure that makes the tools reliable. Agentic coding is at that point today. The failures are real, but so is the opportunity. And it won’t be unlocked by brute force alone, but by giving agents the same kind of structural leverage that human developers rely on.

At Devramp, we’re excited to be part of that journey. Closing the familiarity gap is not just a technical challenge; it’s the path to making agents dependable contributors in real-world software development. The road ahead will take time, but the destination is worth it: agents that don’t just generate code, but work alongside us with context, orientation, and confidence.
