
Why Your AI Agent's Biggest Problem Isn't the Model — It's the Context

A researcher ran 283 sessions with AI agents over 70 days and found that 24% of his entire system was documentation — not code. The context is the moat.

A developer named Aristidis Vasilopoulos spent 70 days building a complex distributed system with AI agents. He wrote 108,000 lines of code. By the end, he'd also written 26,200 lines of documentation — not for humans, but for the agents themselves.

That's a ratio of roughly one line of context for every four lines of code. Nearly a quarter of the entire system was instructions telling the AI what it needed to know.

He published the data as an academic paper. The numbers are striking, but the core finding is something anyone who's tried to use AI for real work already suspects: the model is not the bottleneck. The context is.

The Amnesia Problem

Every AI agent starts every conversation from zero. It doesn't remember what you told it yesterday. It doesn't know your project's architecture, your naming conventions, your business rules, or the bug you spent three hours fixing last week.

This is the fundamental challenge. You can have the most capable model on the planet, and it will still generate wrong code, give bad advice, and repeat your mistakes — because it doesn't know anything about your specific situation.

We've written about this before in the context of business AI. A mid-tier model with good context beats a top-tier model with none. That principle applies to code, customer service, operations — everywhere.

But Vasilopoulos's paper goes further. It doesn't just say context matters. It shows exactly how much context, organized how, maintained how, and at what cost.

What 283 Sessions Taught One Developer

The paper tracks 283 sessions building a 108,000-line C# distributed simulation system. During those sessions, a human wrote 2,801 prompts. The AI agents ran 16,522 autonomous turns — about 6 turns for every single human prompt. The agents also made 1,478 calls to a knowledge retrieval system, pulling documentation they needed to do their work.

The most important number is the knowledge-to-code ratio: 24.2%. For every 100 lines of application code, there were 24 lines of context documentation. That documentation wasn't optional. It was infrastructure — as critical to the system working as the code itself.

And maintaining all of it? About 1-2 hours per week. Roughly 4.3% of total prompts were spent on upkeep — updating docs, fixing stale references, keeping the context accurate.

Three Tiers of Memory

The documentation wasn't dumped into one giant file. It was organized into three tiers, each loaded differently depending on what the agent needed.

Tier 1 — The Constitution. A single ~660-line file loaded into every session. Think of it as the employee handbook that every new hire reads on day one. It contains coding standards, naming conventions, build commands, and routing rules that tell agents which specialist to call for which task. Always present. Always loaded.

Tier 2 — Domain Experts. Nineteen specialized agent specifications totaling 9,300 lines, averaging about 490 lines each. These aren't lightweight routing instructions — more than half of each specification is deep domain knowledge. The networking agent, for example, runs 915 lines, with 65% of that being actual networking theory. These get loaded only when a task matches their domain.

Tier 3 — The Knowledge Base. Thirty-four subsystem specifications totaling 16,250 lines, served on demand through a retrieval system. The agents query this when they need details about a specific part of the codebase — file paths, known failure modes, code patterns. It's reference material, pulled as needed rather than loaded upfront.

If that sounds like how a well-run company organizes knowledge — an onboarding manual everyone reads, department-specific training for specialists, and a shared wiki for looking things up — that's exactly the point. The principles aren't new. Applying them to AI agents is.
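The three tiers above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the file layout (a `constitution.md`, an `agents/` directory, a `kb/` directory) and the keyword-match retrieval are assumptions standing in for whatever the real system uses.

```python
from pathlib import Path

def build_context(task_domain, query=None):
    """Assemble one session's context from three tiers.
    Paths and the keyword retrieval here are hypothetical."""
    parts = []

    # Tier 1: the "constitution" is loaded into every session.
    parts.append(Path("context/constitution.md").read_text())

    # Tier 2: a domain-expert spec, loaded only when the task matches.
    expert = Path(f"context/agents/{task_domain}.md")
    if expert.exists():
        parts.append(expert.read_text())

    # Tier 3: knowledge-base docs, retrieved on demand. A real system
    # would likely use embeddings; keyword matching stands in here.
    if query:
        for doc in sorted(Path("context/kb").glob("*.md")):
            text = doc.read_text()
            if query.lower() in text.lower():
                parts.append(text)

    return "\n\n---\n\n".join(parts)
```

The structure is the point, not the retrieval method: a fixed cost paid every session, a conditional cost paid per domain, and a lookup cost paid per question.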

The Part Nobody Wants to Hear

Here's the finding that should make anyone selling "plug-and-play AI" uncomfortable: the infrastructure grew reactively from failures, not from upfront planning.

Every new specification, every new agent, every new documentation file emerged from what the paper calls "a real failure, a recurring bug, an architectural mistake." The developer didn't sit down on day one and design a three-tier memory system. He started with a 100-line file. When agents kept making the same mistake, he documented the fix. When he found himself explaining the same concept across multiple sessions, he codified it.

This is the pattern the paper formalizes as a guideline: repeated explanation equals a documentation need. If you tell your AI agent the same thing twice, that's a signal. Write it down once, in a place the agent can find it, and never explain it again.

The other guideline that deserves attention: stale specs mislead silently. AI agents trust their documentation completely. If the docs say the API endpoint is /v1/users but someone changed it to /v2/users last week, the agent will confidently generate code that calls the old endpoint. It will look correct. It will compile. It will fail at runtime, and you'll spend an hour figuring out why.

The paper documents two incidents where outdated specifications caused agents to generate code using deprecated patterns. The fix wasn't better models — it was a context-drift detector that compares git commits against documentation and flags when code changes don't have corresponding doc updates.
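The core of such a drift check can be small. A hedged sketch: the `src/`-versus-`docs/` layout and the per-window heuristic (code changed, no docs changed) are assumptions for illustration, not the paper's actual detector.

```python
import subprocess

def flag_drift(changed_paths):
    """Given file paths changed in a window of commits, flag code
    changes that had no accompanying doc update. The src/ and docs/
    prefixes are assumptions about the repository layout."""
    code = [p for p in changed_paths if p.startswith("src/")]
    docs = [p for p in changed_paths if p.startswith("docs/")]
    # Code moved but documentation didn't: candidates for stale specs.
    return code if code and not docs else []

def drift_since(ref="HEAD~20"):
    """List files changed since `ref` via git, then flag drift."""
    out = subprocess.run(
        ["git", "diff", "--name-only", ref, "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return flag_drift(out.stdout.splitlines())
```

A check this crude produces false positives (not every code change needs a doc change), but as with the paper's detector, the goal is a flag for human review, not automatic enforcement.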

Staleness is the number one failure mode. Not model capability. Not prompt engineering. Stale documentation.

Why This Matters Beyond Code

Say you run a property management company and you've set up an AI agent to handle maintenance requests. It knows your vendor list, your approval thresholds, your tenant communication templates.

Six months later, you've switched plumbers, raised the approval threshold from $500 to $750, and updated your email templates. But nobody updated the agent's context. It's still dispatching the old plumber, approving jobs under the old limit, and sending emails that reference a portal you decommissioned in January.

The agent isn't broken. The context is stale. And unlike a human employee who might notice something feels off, the agent will follow outdated instructions with complete confidence.

This is the same failure mode the paper describes in code, playing out in business operations. The fix is the same too: treat context maintenance as real work, not an afterthought. Schedule it. Budget time for it.

What We've Learned Running Our Own System

We run a similar — though smaller — three-tier system at Blue Octopus Technology. We've been using it daily for over 140 sessions.

Our hot memory is a CLAUDE.md file plus a memory file that loads every session. Our warm layer is a set of topic-specific files that get pulled in based on what we're working on. Our cold layer is a knowledge base of 71 documents served through a retrieval system — strategy breakdowns, tool evaluations, research analyses.
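In rough shape, that looks like this. Everything beyond CLAUDE.md is an illustrative name, not our actual file tree:

```
CLAUDE.md          # hot: loaded at the start of every session
memory.md          # hot: running notes, also loaded every session
topics/            # warm: pulled in when a task touches that area
  outreach.md
  tooling.md
kb/                # cold: 71 docs behind a retrieval system
  strategy-breakdowns/
  tool-evaluations/
  research-analyses/
```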

We arrived at this structure the same way Vasilopoulos did — reactively. We didn't design it upfront. We built it one painful lesson at a time, adding documentation every time the agent forgot something important or repeated a mistake we'd already solved.

The paper's finding about maintenance cost — 1-2 hours per week — matches our experience. It's real work. But the alternative is repeating yourself constantly, or worse, letting your agent operate on stale information and cleaning up the mess afterward.

The Uncomfortable Truth About AI Memory

There's a popular post circulating right now: "No one has solved AI memory yet." It has over 1,600 likes. And it's right — there is no magical, automatic solution that gives AI agents perfect memory.

But the paper — and our own experience — suggests the answer isn't a better plugin or a fancier memory tool. As one practitioner put it, barebones setups with context discipline beat bloated plugin stacks every time.

The answer is documentation. Boring, unglamorous, structured documentation that you maintain like you'd maintain any other critical system. Written for the machine, organized in tiers, and kept ruthlessly up to date.

That's the moat. Not which model you're running. Not which framework you picked. Not how clever your prompts are. The moat is whether your AI agent knows enough about your specific situation to do useful work — and whether that knowledge is still accurate.

Where This Is Heading

The paper's data suggests a pattern that anyone building with AI agents should watch.

As models get more capable, they can do more autonomous work per human prompt — Vasilopoulos saw about 6 agent turns for every human instruction. But more autonomy means more dependence on accurate context. A model that takes 6 actions on its own needs to know more about your situation than one that takes a single action and asks for guidance.

Better models don't reduce the need for context. They increase it.

That's why context engineering — the practice of building and maintaining the documentation layer that feeds your AI agents — is becoming a real discipline. Not a nice-to-have. Not something you do when you have time. A core part of the system that deserves the same rigor as the code or the business processes it supports.

If you're running AI agents for your business, or thinking about starting, the question isn't which model to pick. The question is: what does your agent need to know, how will it access that knowledge, and who's keeping it current?

Get that right, and the model almost doesn't matter. Get it wrong, and the best model in the world can't save you.

Want to see what a real context system looks like? Read our breakdown of the file that runs our business, or start with the basics in CLAUDE.md vs SOUL.md vs SKILL.md.
