How We Squeezed a 1998 VBA Codebase Into One AI Agent

By Claus Villumsen

06 April, 2026

Legacy Modernization  ·  AI Engineering  ·  8 min read  ·  April 2026

Not by force. Not by throwing money at the problem. By going back to the basics of real engineering — washing code until it was clean enough for a machine to understand.

Let me start with a confession. We hired so-called AI developers. People who could quote GPT-4 context windows from memory, who had opinions about embedding models, who could spin up a LangChain pipeline before their morning coffee. Smart people. Enthusiastic people. People who believed, completely and without doubt, that AI could solve anything if you just wrote the right prompt.

They were wrong. And it took us an embarrassingly long time to admit it.

The project was a VBA codebase from 1998. Still in production. Still doing real work for a real client, every single day. Massive doesn't quite cover it. We are talking about a system so old that some of the people who originally wrote it have not simply moved on to other companies; they are no longer on this earth. That is how old the code was. Or rather, is. Because as of right now, we have not finished it. We have transformed it.

This blog is not a victory lap. This blog is about what we learned when AI hit its wall — and what happened when two engineers who do not know how to quit decided to go around it.

I. The Problem With Throwing AI at Old Code

The naive approach — the approach our AI developers kept proposing — was context stuffing. Load the whole codebase into the model's context window and ask it to understand. Just give it everything and let it figure it out.

Here is what nobody tells you in the Medium articles about AI-powered code migration: even a 200,000-token context window is not the same as 200,000 tokens of useful context. Researchers studying what they now call the Maximum Effective Context Window (MECW) report that effective performance often falls far below the advertised limit, by up to 99% on complex tasks. Two effects are at work. The first is "context rot": output quality degrades as the input grows longer. The second is positional, often described as "lost in the middle": the model attends best to the start and the end of a long input. Important logic buried in the middle of a 50,000-line file is practically invisible to the model. It sees the beginning. It sees the end. The middle is where your business logic lives.

Add to this the nature of VBA code from the late 1990s. VBA from that era does not look like code. It looks like sediment. It is layers of decisions made by a dozen different programmers across a decade. Naming conventions shift mid-file. Comments are in three languages — two of them inconsistent. Framework artifacts, library imports, deprecated method calls, dead code that was never cleaned up because nobody was certain it was actually dead — all of it packed together like geological strata, impossible for a human to read quickly, and equally impossible for a model to parse cleanly.

We ran the experiments. We threw chunks of this code at every approach available to us. The AI developers had their turn. And one by one, the approaches failed or underdelivered.

💭 Pause & reflect

Did we do the right thing by hiring those AI developers first? Looking back honestly — no. They were skilled at wrapping existing AI capabilities, not at engineering through their limits. There is a difference between a developer who uses AI and an engineer who understands what AI cannot do. We confused the two.

What should we have tried earlier? We should have started with the code, not the model. Every AI project that fails starts by overestimating the model and underestimating the data.

II. How We Actually Communicate With AI: The Methods We Tested

Before we wrote a single line of transformation code, Max and Ezekiel — two engineers who share the trait of simply refusing to accept "it cannot be done" — spent months experimenting with how to communicate with the model effectively. Not what to ask. How to ask.

There are four primary strategies for getting knowledge in and out of a large language model. Each has a different cost profile, a different maintenance burden, and a different ceiling.

| Method | Best For | Pros | Cons | Verdict |
|---|---|---|---|---|
| Prompt Engineering | Formatting, tone, structured output | Zero training cost; instant iteration; no infrastructure | Brittle on complex reasoning; inconsistent results; limited by context window | Essential baseline |
| RAG | Large, frequently updated knowledge bases | Fresh information; citations; scales to millions of docs | Complex infrastructure; poor chunking breaks meaning; retrieval quality depends on embeddings | Situational |
| Fine-Tuning | When behavior must change, not just knowledge | Consistent output style; smaller model can match larger baseline | Expensive; time-consuming; risks catastrophic forgetting; overkill for most tasks | Last resort |
| Context Stuffing | Single-document, complete-coverage analysis | No extra infrastructure; works with any model | Context rot; attention degrades in the middle; punishing cost at scale | Dangerous default |

We discovered something in our experiments that is not talked about enough. Vector databases are not always the right answer for getting large amounts of data in and out of AI. Everyone defaults to RAG with a vector store as if it is the only option, but vector search optimizes for semantic similarity. What we needed in many parts of this codebase was not "find me something similar" — it was "give me exactly this, in this order, with this structure." For that, a well-indexed relational store or even a graph database communicates better with the model than a vector database does. The problem is not always retrieval. Sometimes the problem is representation.
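A minimal sketch of what "give me exactly this, in this order" can look like with a relational store, using Python's built-in `sqlite3`. The `procedures` table, its columns, and the sample rows are hypothetical illustrations, not the schema used on the project.

```python
import sqlite3

def build_store(rows):
    """Create an in-memory table of code units. The schema is illustrative only."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE procedures ("
        " module TEXT NOT NULL,"
        " position INTEGER NOT NULL,"
        " name TEXT NOT NULL,"
        " body TEXT NOT NULL,"
        " PRIMARY KEY (module, position))"
    )
    conn.executemany("INSERT INTO procedures VALUES (?, ?, ?, ?)", rows)
    return conn

def fetch_module(conn, module):
    """Return every procedure of one module, in original source order."""
    cur = conn.execute(
        "SELECT name, body FROM procedures WHERE module = ? ORDER BY position",
        (module,),
    )
    return cur.fetchall()

# Hypothetical sample data, inserted out of order on purpose.
SAMPLE_ROWS = [
    ("Invoicing", 2, "ApplyDiscount", "(body omitted)"),
    ("Invoicing", 1, "LoadInvoice", "(body omitted)"),
    ("Shipping", 1, "CalcFreight", "(body omitted)"),
]

if __name__ == "__main__":
    store = build_store(SAMPLE_ROWS)
    for name, _body in fetch_module(store, "Invoicing"):
        print(name)
```

The query is exact and ordered by construction: nothing is ranked by similarity, so the model receives the same units in the same sequence every time.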

📦 The experiment

We tried asking the model to return only the changes plus a delta index, rather than repeating the full input back to us, to save tokens and reduce noise. We gave up. AI cannot count reliably. It cannot maintain a consistent index across a long generation. The output would drift, misalign, or simply hallucinate index values. And this is a simple thing that any junior developer could do with a diff tool.
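For comparison, here is a sketch of the deterministic side of that job: computing a line-level delta with Python's standard `difflib`, given the original lines and a full revised version. The function names `line_delta` and `apply_delta` are hypothetical, and this illustrates the "diff tool" point rather than describing the pipeline used on the project.

```python
import difflib

def line_delta(original, revised):
    """Return (op, start, end, new_lines) tuples that turn `original` into `revised`.

    Indices are 0-based, end-exclusive, and refer to positions in `original`.
    """
    matcher = difflib.SequenceMatcher(a=original, b=revised, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            delta.append((tag, i1, i2, revised[j1:j2]))
    return delta

def apply_delta(original, delta):
    """Rebuild the revised lines from the original lines plus the delta."""
    result = list(original)
    # Apply from the end so that earlier indices stay valid.
    for _tag, start, end, new_lines in reversed(delta):
        result[start:end] = new_lines
    return result

if __name__ == "__main__":
    old = ["Dim total", "total = a + b", "MsgBox total"]
    new = ["Dim total", "total = a + b + c", "MsgBox total", "Exit Sub"]
    changes = line_delta(old, new)
    print(changes)
    print(apply_delta(old, changes) == new)
```

The indices come from the matcher, not from the model, so they cannot drift however long the input is.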

That discovery told us something important: we had to stop trying to make the AI do engineering work. We had to do the engineering work ourselves, and give the AI a clean, constrained, well-defined job.

III. The Limits Are Real. Here Is Why They Exist.

Context windows exist because of mathematics, not corporate restriction. In a standard transformer, the cost of the attention mechanism grows quadratically with sequence length: double the context length and you quadruple the computational cost. This is not a policy choice. It is arithmetic.
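The quadratic relationship is easy to check with a few lines of arithmetic. This sketch assumes the simplest cost model, attention compute proportional to the square of the token count, and ignores every other cost in a real model; `relative_attention_cost` is a hypothetical helper name.

```python
def relative_attention_cost(base_tokens, new_tokens):
    """Ratio of attention compute, assuming cost is proportional to n squared."""
    if base_tokens <= 0 or new_tokens <= 0:
        raise ValueError("token counts must be positive")
    return (new_tokens / base_tokens) ** 2

if __name__ == "__main__":
    # Doubling the context quadruples the attention cost under this model.
    print(relative_attention_cost(1_000, 2_000))
    # Going from 8K to 128K tokens is 16x the length.
    print(relative_attention_cost(8_000, 128_000))
```

Under this model a 16x longer context costs 256x the attention compute, which is why window size is an engineering trade-off and not a switch someone forgot to flip.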

Advertised context windows for the major models range from 128,000 tokens (GPT-4o, Llama 3.1) through 200,000 tokens (Claude 3.5 Sonnet) to 1 million tokens (Gemini 1.5 Pro). One million tokens sounds enormous: roughly 750,000 words. But a large enterprise codebase can contain tens of millions of lines of code across thousands of files. Even at one million tokens, you are seeing a fraction of a complex legacy system. And the tokens you do stuff in? The model attends to them unevenly. Recent tokens and first tokens get priority. Middle tokens get lost.

Here is a quick map of where each major model actually sits today:

| Model | Context Window | Effective Ceiling (real-world) |
|---|---|---|
| GPT-4o | 128K tokens | Degrades past ~32K on complex tasks |
| Claude 3.5 Sonnet | 200K tokens | Strong but attention thins past ~60K |
| Gemini 1.5 Pro | 1M tokens | Best at full-context; still subject to rot |
| Llama 3.1 | 128K tokens | Often deployed at 2K–4K by default |

Where will this go? The honest answer is: windows will keep expanding, but the effective ceiling will always lag the advertised one. Better architectures will narrow that gap. Sparse attention, memory compression, and hierarchical context management all help. But the fundamental tension between what is available in a codebase and what a model can genuinely attend to simultaneously will remain a real constraint for years.

The right question is never "how do I fit more into the model?" The right question is: how do I make what goes into the model maximally clean and maximally useful?

"We stopped trying to make the AI do engineering work. We did the engineering work ourselves — and gave the AI a clean, constrained, well-defined job."

— The lesson it took us months to learn

IV. Washing the Code

This is where Max and Ezekiel did what most people are not willing to do. They went back to basics. Not AI basics. Engineering basics.

Before any model saw a single line of this VBA codebase, the code had to be washed. Washed clean of everything that was not logic. That means:

  • Stripping framework artifacts — the VBA runtime boilerplate that appears in every file regardless of what the file actually does
  • Removing library import noise — the residue of third-party dependencies that have nothing to do with the business behavior
  • Normalizing naming conventions — so that a variable called strCustomerName in 1998 style maps consistently to its semantic equivalent
  • Isolating dead code — branches that can never be reached, conditions that can never be true, so the model reasons about live logic only
  • Removing human style, keeping human intent — the single most important step, and the hardest to automate
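The first bullet is the most mechanical, so it makes a reasonable illustration. The sketch below assumes modules exported from the VBA editor as `.bas` or `.cls` text files, which carry a `VERSION` line, a `BEGIN ... END` header block (class modules), and `Attribute` lines. The function name `strip_export_header` is hypothetical, form files (`.frm`) are not handled, and the later bullets need human judgment that a filter like this cannot supply.

```python
import re

# "Attribute VB_Name = ..." and "Attribute Proc.VB_Description = ..." lines.
_ATTRIBUTE = re.compile(r"^\s*Attribute\s+[\w.]+\s*=", re.IGNORECASE)
# The "VERSION 1.0 CLASS" line at the top of an exported class module.
_VERSION = re.compile(r"^\s*VERSION\s+\d", re.IGNORECASE)

def strip_export_header(source):
    """Drop VBA export boilerplate; keep every other line unchanged."""
    kept = []
    in_header_block = False  # inside the BEGIN ... END block of a class export
    seen_code = False        # becomes True at the first real code line
    for line in source.splitlines():
        token = line.strip().upper()
        if _ATTRIBUTE.match(line):
            continue
        if not seen_code:
            if in_header_block:
                in_header_block = token != "END"
                continue
            if token == "BEGIN":
                in_header_block = True
                continue
            if _VERSION.match(line) or not token:
                continue
            seen_code = True
        kept.append(line)
    return "\n".join(kept)

# Hypothetical exported class module used as sample input.
SAMPLE = (
    "VERSION 1.0 CLASS\n"
    "BEGIN\n"
    "  MultiUse = -1  'True\n"
    "END\n"
    'Attribute VB_Name = "Customer"\n'
    "Option Explicit\n"
    "\n"
    "Public Function FullName() As String\n"
    'Attribute FullName.VB_Description = "Display name"\n'
    '    FullName = "placeholder"\n'
    "End Function\n"
)

if __name__ == "__main__":
    print(strip_export_header(SAMPLE))
```

Everything the filter removes is emitted by the editor, not written by a programmer, which is what makes this step safe to automate.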

This took months. Real, deep, careful engineering work. Not prompt writing. Not infrastructure configuration. Reading code, understanding it, and making decisions about what it actually means versus what it literally says. That distinction — between what code means and what it says — is the entire problem in legacy system modernization, and no model can resolve it on your behalf.

💭 Pause & reflect

Here is the question that kept me up at night during this project: how do you validate the output when all the people who understood the original system are gone? Some had moved to other companies. Some, genuinely, had passed away. When we generated new microservices from this VBA logic, was the output correct?

We could test behavior. We could run outputs against historical records where they existed. But there were entire sections of business logic where we had no ground truth. It raises a question the industry has not answered: as AI-generated code grows, who holds the responsibility for correctness when the original authors no longer exist?
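Where historical records do exist, the check described above can be framed as a characterization test: replay recorded inputs through the new implementation and list every disagreement. This is a minimal sketch. The `HistoricalCase` record, the `late_fee` rule, and its numbers are all invented for illustration and are not taken from the client system.

```python
from dataclasses import dataclass
from decimal import Decimal

@dataclass
class HistoricalCase:
    """One recorded input/output pair from the legacy system (hypothetical shape)."""
    inputs: dict
    expected: Decimal

def compare_against_history(new_impl, cases):
    """Run `new_impl` on every recorded input and return the mismatches."""
    mismatches = []
    for case in cases:
        actual = new_impl(**case.inputs)
        if actual != case.expected:
            mismatches.append((case.inputs, case.expected, actual))
    return mismatches

def late_fee(amount, days_late):
    """Invented example rule: 2% fee once an invoice is more than 30 days late."""
    if days_late > 30:
        return amount * Decimal("0.02")
    return Decimal("0")

# Invented records standing in for exported historical data.
CASES = [
    HistoricalCase({"amount": Decimal("100"), "days_late": 45}, Decimal("2.00")),
    HistoricalCase({"amount": Decimal("100"), "days_late": 10}, Decimal("0")),
]

if __name__ == "__main__":
    for inputs, expected, actual in compare_against_history(late_fee, CASES):
        print("MISMATCH", inputs, "expected", expected, "got", actual)
```

A harness like this only covers behavior that was recorded. It says nothing about the sections with no ground truth, which is exactly the gap the question above points at.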

V. Breaking It Into Categories, Not Chunks

The instinct, when you have a large codebase and a context window limitation, is to break the code into chunks. Slice it into segments that fit the window. Process each chunk. Reassemble. This is the approach the AI developers wanted to take. It is also, for complex business systems, deeply wrong.

You cannot chunk a business system by lines of code. You have to chunk it by logic type.

We divided the washed codebase into categories before any AI touched it:

VBA Codebase (washed)
│
├── Features (known)        → Agent A: cross-reference with documentation
├── Features (undocumented) → Agent B: extract and name from behavior
├── Business Logic          → Agent C: pure rule extraction, no UI concern
├── Data Transformations    → Agent D: structural mapping and formatting rules
└── Infrastructure Concerns → Agent E: config, connection, environment

Each agent was given clean input, a narrow task, and a constrained output format. It was not asked to understand everything. It was asked to understand one thing well.
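One way to picture "clean input, narrow task, constrained output" is as a small routing table. In the sketch below, the agent letters and task descriptions come from the diagram above; the `Category` enum, the `AgentSpec` type, and especially the output field names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

class Category(Enum):
    KNOWN_FEATURE = "features_known"
    UNDOCUMENTED_FEATURE = "features_undocumented"
    BUSINESS_LOGIC = "business_logic"
    DATA_TRANSFORMATION = "data_transformations"
    INFRASTRUCTURE = "infrastructure"

@dataclass(frozen=True)
class AgentSpec:
    name: str
    task: str
    output_fields: tuple  # hypothetical field names for the constrained output

AGENTS = {
    Category.KNOWN_FEATURE: AgentSpec(
        "Agent A", "cross-reference with documentation", ("feature", "doc_reference")),
    Category.UNDOCUMENTED_FEATURE: AgentSpec(
        "Agent B", "extract and name from behavior", ("feature", "observed_behavior")),
    Category.BUSINESS_LOGIC: AgentSpec(
        "Agent C", "pure rule extraction, no UI concern", ("rule", "conditions")),
    Category.DATA_TRANSFORMATION: AgentSpec(
        "Agent D", "structural mapping and formatting rules", ("source", "target", "mapping")),
    Category.INFRASTRUCTURE: AgentSpec(
        "Agent E", "config, connection, environment", ("setting", "value")),
}

def build_job(category, washed_code):
    """Package one washed code unit for the single agent that owns its category."""
    spec = AGENTS[category]
    return {
        "agent": spec.name,
        "task": spec.task,
        "input": washed_code,
        "required_output_fields": list(spec.output_fields),
    }

def validate_output(category, output):
    """Accept an agent's output only if it has exactly the requested fields."""
    return set(output) == set(AGENTS[category].output_fields)

if __name__ == "__main__":
    job = build_job(Category.BUSINESS_LOGIC, "If days_late > 30 Then fee = amount * 0.02")
    print(job["agent"], "->", job["task"])
    print(validate_output(Category.BUSINESS_LOGIC, {"rule": "late fee", "conditions": "days_late > 30"}))
```

Checking the shape of the output in ordinary code means a malformed response is rejected mechanically instead of being discovered downstream.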

This is real engineering. This is what software architecture has always been about, even before AI entered the conversation. Separation of concerns. Single responsibility. Narrow interfaces. The principles that make code maintainable for humans make it processable by models.

🔧 The architecture

Looking at our own Kodey Server API, you can see this principle at work. Separate agents for testing, syntax checking, spell checking, translation, commenting, documentation, optimization, and format conversion. Each does one thing. Each has a clean contract. That is not a coincidence — it is the design philosophy that made the VBA project possible.

You build tools that do less, so that the model you point them at can do more.

VI. What We Can Do Now That We Could Not Before

We know this codebase. Not in the way the original developers knew it — through years of proximity and tribal memory. We know it in a different way, perhaps a clearer way. We know the database structure. We know every piece of business logic. We know every feature, including the ones that were never documented.

That knowledge, extracted and structured through months of careful engineering work, now powers something the client never thought possible. We are not recoding the system. We are building it new. Not a migration — a rebirth. The same business logic, expressed in microservices that scale. The same features, now testable, maintainable, and understandable by any developer who joins the team.

The VBA system from 1998 will not be fixed. It will be made unnecessary.

And the reason we can do that is not AI. The reason is that we understand the system completely. AI is the tool that transforms that understanding into new code faster than any team could write it manually. But the understanding itself? That came from engineers who rolled up their sleeves, read the sediment, and refused to stop until they could explain every line.

💭 Pause & reflect

Is there a better solution we did not try? Yes, probably. Given more time, I would have pushed harder on graph-based code representations — turning the codebase into a dependency graph before any model touches it. I believe that is where the real next generation of application modernization lives: not in making context windows larger, but in changing how we represent code before it enters the context.

What will AI look like in five years for this kind of work? The constraints will shift but not disappear. The engineering work of understanding always comes first.
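To give the dependency-graph idea some shape, here is a deliberately crude sketch that extracts a call graph from VBA source with regular expressions and orders procedures so that callees come before callers, using the standard-library `graphlib` (Python 3.9+). It is an untested direction, not something built on the project. The function names are hypothetical, and a real tool would need a proper parser: this version matches names case-sensitively although VBA does not, ignores `Property` procedures, and strips comments naively.

```python
import re
from graphlib import TopologicalSorter

_PROC_DEF = re.compile(
    r"^\s*(?:Public\s+|Private\s+)?(?:Sub|Function)\s+(\w+)",
    re.IGNORECASE | re.MULTILINE,
)
_PROC_END = re.compile(r"^\s*End\s+(?:Sub|Function)\b", re.IGNORECASE)
_WORD = re.compile(r"\b\w+\b")

def build_call_graph(source):
    """Map each procedure name to the set of known procedures its body mentions."""
    names = {m.group(1) for m in _PROC_DEF.finditer(source)}
    graph = {name: set() for name in names}
    current = None
    for line in source.splitlines():
        header = _PROC_DEF.match(line)
        if header:
            current = header.group(1)
            continue
        if _PROC_END.match(line):
            current = None
            continue
        if current is not None:
            code = line.split("'", 1)[0]  # naive comment strip
            for word in _WORD.findall(code):
                if word in names and word != current:
                    graph[current].add(word)
    return graph

def processing_order(graph):
    """Callees first, callers last. Raises graphlib.CycleError on mutual recursion."""
    return list(TopologicalSorter(graph).static_order())

# Hypothetical VBA module used as sample input.
SAMPLE = """\
Sub Main()
    LoadInvoice
    ApplyDiscount
End Sub

Sub LoadInvoice()
End Sub

Sub ApplyDiscount()
    LoadInvoice  ' reuse the loader
End Sub
"""

if __name__ == "__main__":
    print(processing_order(build_call_graph(SAMPLE)))
```

With an ordering like this, each procedure can be presented to a model together with summaries of the procedures it depends on, instead of alongside whatever happens to be its neighbor in the file.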

VII. What Real Engineering Looks Like in 2026

I have watched the industry develop a strange relationship with AI capability. There is a kind of magical thinking — the belief that a sufficiently large model, given a sufficiently well-written prompt, can bypass the foundational work of software engineering. I have hired people who carry that belief. Some of them were technically impressive. None of them delivered what Max and Ezekiel delivered.

Real engineering means making something difficult possible by reducing its complexity — not by hoping the tool is powerful enough to ignore the complexity.

The AI developers wanted to pour a complex system into a model and pull a solution out the other side. Max and Ezekiel spent months making the system simple enough that the model could handle it. The first approach is a bet on the tool. The second approach is engineering.

As you think about your own legacy systems (and you have them; every organization more than ten years old does), the question worth asking is not "which AI tool should we use?" The question is: do we understand the system we are trying to modernize? Not at a high level. At the level of every rule, every edge case, every undocumented behavior that only shows up in production once every eighteen months.

If you do not understand it, no model will understand it for you. It will generate plausible-looking output that misses the edge cases, and you will not know until something breaks in production. If you do understand it — truly, deeply, in writing — then AI becomes what it should be: an extraordinary force multiplier for engineers who have already done the hard intellectual work.

The VBA project taught us that. It cost us months and a few good nights of sleep. But standing where we stand now, with complete knowledge of a system that nobody had fully understood in two decades, the work was worth every hour of it.

Go back to basics. Wash the code. Understand before you generate. The AI will be ready when you are.

Kodebaze Engineering Leadership · Written for CTOs building serious technology in serious conditions · April 2026
