Articles

How We Squeezed a 1998 VBA Codebase Into One AI Agent

By Claus Villumsen

06 April, 2026

Share this article

Legacy Modernization AI Engineering 8 min read · April 2026

Not by force. Not by throwing money at the problem. By going back to the basics of real engineering — washing code until it was clean enough for a machine to understand.

Let me start with a confession. We hired so-called AI developers. People who could quote GPT-4 context windows from memory, who had opinions about embedding models, who could spin up a LangChain pipeline before their morning coffee. Smart people. Enthusiastic people. People who believed, completely and without doubt, that AI could solve anything if you just wrote the right prompt.

They were wrong. And it took us an embarrassingly long time to admit it.

The project was a VBA codebase from 1998. Still in production. Still doing real work for a real client, every single day. Massive doesn't quite cover it. We are talking about a system so old that some of the people who originally wrote it are no longer at other companies — they are no longer on this earth. That is how old the code was. Or rather, is. Because as of right now, we have not finished it. We have transformed it.

This blog is not a victory lap. This blog is about what we learned when AI hit its wall — and what happened when two engineers who do not know how to quit decided to go around it.

Why does throwing AI at old code fail?

AI cannot process legacy code effectively because old codebases contain decades of technical debt, inconsistent naming conventions, undocumented logic, and structural complexity that exceeds AI context window limits. Simply uploading legacy code to AI tools without proper preparation results in incomplete analysis and unreliable outputs.

The naive approach — the approach our AI developers kept proposing — was context stuffing. Load the whole codebase into the model's context window and ask it to understand. Just give it everything and let it figure it out.

Here is what nobody tells you in the Medium articles about AI-powered code migration: even a 200,000-token context window is not the same as 200,000 tokens of useful context. Researchers studying what they now call the Maximum Effective Context Window (MECW) have found that effective performance often falls far below the advertised limit — by up to 99% on complex tasks. There is a phenomenon called "context rot": the model's attention degrades the further into a sequence you go. Important logic buried in the middle of a 50,000-line file is practically invisible to the model. It sees the beginning. It sees the end. The middle is where your business logic lives.

Add to this the nature of VBA code from the late 1990s. VBA from that era does not look like code. It looks like sediment. It is layers of decisions made by a dozen different programmers across a decade. Naming conventions shift mid-file. Comments are in three languages — two of them inconsistent. Framework artifacts, library imports, deprecated method calls, dead code that was never cleaned up because nobody was certain it was actually dead — all of it packed together like geological strata, impossible for a human to read quickly, and equally impossible for a model to parse cleanly.

We ran the experiments. We threw chunks of this code at every approach available to us. The AI developers had their turn. And one by one, the approaches failed or underdelivered.

💭 Pause & reflect

Did we do the right thing by hiring those AI developers first? Looking back honestly — no. They were skilled at wrapping existing AI capabilities, not at engineering through their limits. There is a difference between a developer who uses AI and an engineer who understands what AI cannot do. We confused the two.

What should we have tried earlier? We should have started with the code, not the model. Every AI project that fails starts by overestimating the model and underestimating the data.

What methods work for communicating legacy code to AI agents?

Effective methods include code washing (cleaning and standardizing syntax), semantic categorization (grouping by function rather than arbitrary chunks), documentation extraction, dependency mapping, and structured abstraction. Testing various prompting strategies and context organization techniques determines which approach works best for specific legacy codebases.

Before we wrote a single line of transformation code, Max and Ezekiel — two engineers who share the trait of simply refusing to accept "it cannot be done" — spent months experimenting with how to communicate with the model effectively. Not what to ask. How to ask.

There are four primary strategies for getting knowledge in and out of a large language model. Each has a different cost profile, a different maintenance burden, and a different ceiling.

Method	Best For	Pros	Cons	Verdict
Prompt Engineering	Formatting, tone, structured output	Zero training cost; instant iteration; no infrastructure	Brittle on complex reasoning; inconsistent results; limited by context window	Essential baseline
RAG	Large, frequently updated knowledge bases	Fresh information; citations; scales to millions of docs	Complex infrastructure; poor chunking breaks meaning; retrieval quality depends on embeddings	Situational
Fine-Tuning	When behavior must change, not just knowledge	Consistent output style; smaller model can match larger baseline	Expensive; time-consuming; risks catastrophic forgetting; overkill for most tasks	Last resort
Context Stuffing	Single-document, complete-coverage analysis	No extra infrastructure; works with any model	Context rot; attention degrades in the middle; punishing cost at scale	Dangerous default

We discovered something in our experiments that is not talked about enough. Vector databases are not always the right answer for getting large amounts of data in and out of AI. Everyone defaults to RAG with a vector store as if it is the only option, but vector search optimizes for semantic similarity. What we needed in many parts of this codebase was not "find me something similar" — it was "give me exactly this, in this order, with this structure." For that, a well-indexed relational store or even a graph database communicates better with the model than a vector database does. The problem is not always retrieval. Sometimes the problem is representation.

📦 The experiment

We tried asking the model to return only changes and a delta index — not repeat the full input back to us — to save tokens and reduce noise. We gave up. AI cannot count reliably. It cannot maintain a consistent index across a long generation. The output would drift, misalign, or simply hallucinate index values. A simple thing that any junior developer could do with a diff tool.

That discovery told us something important: we had to stop trying to make the AI do engineering work. We had to do the engineering work ourselves, and give the AI a clean, constrained, well-defined job.

What are AI context window limits and why do they matter?

AI context window limits define the maximum amount of text an AI model can process in one interaction, typically measured in tokens. Legacy codebases often exceed these limits by orders of magnitude, making it impossible to feed entire systems to AI agents without strategic reduction, categorization, and cleaning of the code.

Context windows exist because of mathematics, not corporate restriction. The computational complexity of the attention mechanism that powers transformer models is quadratic — double the context length, quadruple the computational cost. This is not a policy choice. It is physics.

As of 2026, the largest context windows sit between 200,000 tokens (Claude 3.5 Sonnet) and 1 million tokens (Gemini 1.5 Pro). One million tokens sounds enormous — roughly 750,000 words. But a large enterprise codebase can contain tens of millions of lines of code across thousands of files. Even at one million tokens, you are seeing a fraction of a complex legacy system. And the tokens you do stuff in? The model attends to them unevenly. Recent tokens and first tokens get priority. Middle tokens get lost.

Here is a quick map of where each major model actually sits today:

Model	Context Window	Effective Ceiling (real-world)
GPT-4o	128K tokens	Degrades past ~32K on complex tasks
Claude 3.5 Sonnet	200K tokens	Strong but attention thins past ~60K
Gemini 1.5 Pro	1M tokens	Best at full-context; still subject to rot
Llama 3.1	128K tokens	Often deployed at 2K–4K by default

Where will this go? The honest answer is: windows will keep expanding, but the effective ceiling will always lag the advertised one. Better architectures will reduce it. Sparse attention, memory compression, hierarchical context management — all of these help. But the fundamental tension between what is available in a codebase and what a model can genuinely attend to simultaneously will remain a real constraint for years.

The right question is never "how do I fit more into the model?" The right question is: how do I make what goes into the model maximally clean and maximally useful?

"We stopped trying to make the AI do engineering work. We did the engineering work ourselves — and gave the AI a clean, constrained, well-defined job."

— The lesson it took us months to learn

How do you wash legacy VBA code for AI processing?

Code washing involves removing obsolete comments, standardizing variable names, eliminating dead code, normalizing formatting, extracting reusable functions, documenting implicit logic, and reducing redundancy. This process transforms messy legacy code into clean, machine-readable input that AI agents can effectively analyze and work with within context limits.

This is where Max and Ezekiel did what most people are not willing to do. They went back to basics. Not AI basics. Engineering basics.

Before any model saw a single line of this VBA codebase, the code had to be washed. Washed clean of everything that was not logic. That means:

Stripping framework artifacts — the VBA runtime boilerplate that appears in every file regardless of what the file actually does
Removing library import noise — the residue of third-party dependencies that have nothing to do with the business behavior
Normalizing naming conventions — so that a variable called strCustomerName in 1998 style maps consistently to its semantic equivalent
Isolating dead code — branches that can never be reached, conditions that can never be true, so the model reasons about live logic only
Removing human style, keeping human intent — the single most important step, and the hardest to automate

This took months. Real, deep, careful engineering work. Not prompt writing. Not infrastructure configuration. Reading code, understanding it, and making decisions about what it actually means versus what it literally says. That distinction — between what code means and what it says — is the entire problem in legacy system modernization, and no model can resolve it on your behalf.

💭 Pause & reflect

Here is the question that kept me up at nights during this project: how do you validate the output when all the people who understood the original system are gone? Some had moved to other companies. Some, genuinely, had passed away. When we generated new microservices from this VBA logic, was the output correct?

We could test behavior. We could run outputs against historical records where they existed. But there were entire sections of business logic where we had no ground truth. It raises a question the industry has not answered: as AI-generated code grows, who holds the responsibility for correctness when the original authors no longer exist?

Why categorize code instead of chunking it randomly?

Categorical organization groups code by functional purpose, business logic, or domain relationships, preserving semantic meaning and dependencies. Random chunking breaks logical connections and context, causing AI agents to miss critical relationships between code segments. Categories enable targeted AI analysis while maintaining the integrity of business logic.

The instinct, when you have a large codebase and a context window limitation, is to break the code into chunks. Slice it into segments that fit the window. Process each chunk. Reassemble. This is the approach the AI developers wanted to take. It is also, for complex business systems, deeply wrong.

You cannot chunk a business system by lines of code. You have to chunk it by logic type.

We divided the washed codebase into categories before any AI touched it:

VBA Codebase (washed)
│
├── Features (known)        → Agent A: cross-reference with documentation
├── Features (undocumented) → Agent B: extract and name from behavior
├── Business Logic          → Agent C: pure rule extraction, no UI concern
├── Data Transformations    → Agent D: structural mapping and formatting rules
└── Infrastructure Concerns → Agent E: config, connection, environment

Each agent was given clean input, a narrow task, and a constrained output format. It was not asked to understand everything. It was asked to understand one thing well.

This is real engineering. This is what software architecture has always been about, even before AI entered the conversation. Separation of concerns. Single responsibility. Narrow interfaces. The principles that make code maintainable for humans make it processable by models.

🔧 The architecture

Looking at our own Kodey Server API, you can see this principle at work. Separate agents for testing, syntax checking, spell checking, translation, commenting, documentation, optimization, and format conversion. Each does one thing. Each has a clean contract. That is not a coincidence — it is the design philosophy that made the VBA project possible.

You build tools that do less, so that the model you point them at can do more.

What can you do with AI agents after preparing legacy code properly?

Properly prepared legacy code enables AI agents to perform automated documentation generation, impact analysis for changes, code modernization suggestions, bug pattern detection, business logic extraction, dependency mapping, test case generation, and accurate migration planning. These capabilities were previously impossible with unprepared legacy codebases.

We know this codebase. Not in the way the original developers knew it — through years of proximity and tribal memory. We know it in a different way, perhaps a clearer way. We know the database structure. We know every piece of business logic. We know every feature, including the ones that were never documented.

That knowledge, extracted and structured through months of careful engineering work, now powers something the client never thought possible. We are not recoding the system. We are building it new. Not a migration — a rebirth. The same business logic, expressed in microservices that scale. The same features, now testable, maintainable, and understandable by any developer who joins the team.

The VBA system from 1998 will not be fixed. It will be made unnecessary.

And the reason we can do that is not AI. The reason is that we understand the system completely. AI is the tool that transforms that understanding into new code faster than any team could write it manually. But the understanding itself? That came from engineers who rolled up their sleeves, read the sediment, and refused to stop until they could explain every line.

💭 Pause & reflect

Is there a better solution we did not try? Yes, probably. Given more time, I would have pushed harder on graph-based code representations — turning the codebase into a dependency graph before any model touches it. I believe that is where the real next generation of application modernization lives: not in making context windows larger, but in changing how we represent code before it enters the context.

What will AI look like in five years for this kind of work? The constraints will shift but not disappear. The engineering work of understanding always comes first.

What does real engineering with AI look like in 2026?

Modern engineering combines human expertise in system understanding and architecture with AI capabilities in pattern recognition and processing. It requires disciplined preparation, structured methodology, code hygiene, and strategic orchestration rather than simply feeding raw code to AI tools. Success comes from engineering fundamentals, not brute-force AI application.

I have watched the industry develop a strange relationship with AI capability. There is a kind of magical thinking — the belief that a sufficiently large model, given a sufficiently well-written prompt, can bypass the foundational work of software engineering. I have hired people who carry that belief. Some of them were technically impressive. None of them delivered what Max and Ezekiel delivered.

Real engineering means making something difficult possible by reducing its complexity — not by hoping the tool is powerful enough to ignore the complexity.

The AI developers wanted to pour a complex system into a model and pull a solution out the other side. Max and Ezekiel spent months making the system simple enough that the model could handle it. The first approach is a bet on the tool. The second approach is engineering.

As you think about your own legacy systems — and you have them, every organization of more than ten years has them — the question worth asking is not "which AI tool should we use?" The question is: do we understand the system we are trying to modernize? Not at a high level. At the level of every rule, every edge case, every undocumented behavior that only shows up in production once every eighteen months.

If you do not understand it, no model will understand it for you. It will generate plausible-looking output that misses the edge cases, and you will not know until something breaks in production. If you do understand it — truly, deeply, in writing — then AI becomes what it should be: an extraordinary force multiplier for engineers who have already done the hard intellectual work.

The VBA project taught us that. It cost us months and a few good nights of sleep. But standing where we stand now, with complete knowledge of a system that nobody had fully understood in two decades, the work was worth every hour of it.

Go back to basics. Wash the code. Understand before you generate. The AI will be ready when you are.

Further reading on application modernization services and legacy code:

martinfowler.com — Architecture patterns and modernization thinking
InfoQ.com — Engineering culture and practice
Thoughtworks.com — Technology radar and transformation
DZone.com — Practitioner-level technical depth
stackoverflow.blog — Where engineers write honestly

Kodebaze Engineering Leadership · Written for CTOs building serious technology in serious conditions · April 2026

Frequently Asked Questions

Why does throwing AI at old code fail?

What methods work for communicating legacy code to AI agents?

What are AI context window limits and why do they matter?

How do you wash legacy VBA code for AI processing?

Why categorize code instead of chunking it randomly?

What can you do with AI agents after preparing legacy code properly?

What does real engineering with AI look like in 2026?

How long does it take to prepare a 1998 VBA codebase for AI?

The timeline depends on codebase size and complexity, but the washing and categorization process for a substantial legacy VBA system typically requires weeks of focused engineering work. This includes analysis, cleaning, testing, categorization, and validation. The upfront investment enables ongoing AI-assisted maintenance and modernization that was previously impossible.

Book a discovery call here

Claus Villumsen

Software development

Work

Productivity

11 Coding Habits That Make Engineers Effective at Legacy Modernization

Legacy modernization requires different instincts than greenfield development. These are the eleven habits that separate engineers who succeed at it from those who struggle.

By Claus Villumsen

02 October, 2023

Legacy Modernization

CAST, vFunction, GitHub, and Kodebaze: Choosing the Right Legacy Modernization Platform

CAST, vFunction, GitHub Copilot, OpenRewrite, Kodebaze — they keep appearing in the same conversations but they are not competing for the same job. An honest map of what each platform does well, where it runs out of road, and how to build the modernization stack that matches your actual problem.

By Claus Villumsen

10 April, 2026

AI vs. Consulting for Legacy Modernization: An Honest CTO's Guide

You have a legacy system holding your business hostage. A consulting firm costs a fortune. AI tooling sounds risky. An honest CTO’s guide to what each approach actually delivers — and how to combine them without getting burned.

By Claus Villumsen

17 April, 2026

AI + Human software Solution

Legal