Onboarding to a 'legacy' codebase with the help of AI
This article is part of “Exploring Gen AI”, a series capturing our explorations of using GenAI technology for software development.
15 Aug 2024
One of the promising applications of GenAI in software delivery is how it can help us understand code better, in particular large codebases, or multiple codebases that form one big application. This is especially interesting for older codebases that have become hard to maintain (“legacy”), or for improving onboarding in teams with a lot of turnover.
To get an idea of how well this works and what the potential is, I picked an issue from the open source project Bahmni and tried to understand, with the help of AI, what the issue is about and what needs to be done. Bahmni is built on top of OpenMRS, which has been around for a very long time. OpenMRS and Bahmni are good examples of very large codebases with a lot of tech debt, representing many different styles and technologies adopted over time.
Spoiler alert: I did not actually figure out how to solve the ticket! But the journey gave me a bunch of observations about what AI could and could not help with in such a use case.
The ticket
Organisation name is not fetched from parent location for few Hi-Types
“When the visit location has a target location tagged as Organization then the parent location’s name should be used for Organisation resource in FHIR bundle. This works only for Immunization, Health Document Record and Wellness Record Hi-types. For others the visit location is only used.”
The codebase(s)
OpenMRS and Bahmni have many, many repositories. As I did not have access to a tool that would let me ask questions across all of them, I cheated and looked at the pull request already attached to the ticket to identify the relevant codebase, openmrs-module-hip.
The tools
I used a bunch of different AI tools in this journey:
- Simple Retrieval-Augmented Generation (RAG) over a vectorised version of the full Bahmni Wiki. I’ll refer to this as Wiki-RAG-Bot; a minimal sketch of how such a bot works follows this list.
- An AI-powered code understanding product called Bloop. It’s one of many products on the market that focus on using AI to understand and ask questions about large codebases.
- GitHub Copilot’s chat in VS Code, where one can ask questions about the currently open codebase in chat queries via @workspace.
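To make the Wiki-RAG-Bot setup more concrete, here is a minimal sketch of the retrieve-then-generate flow it uses. The EmbeddingModel, VectorStore and ChatModel interfaces are hypothetical stand-ins for whichever embedding service, vector database and LLM are actually wired together; this illustrates the pattern, it is not the real implementation.

```java
import java.util.List;

// Minimal RAG sketch. All three interfaces are hypothetical stand-ins,
// not a real library's API.
public class WikiRagBot {

    interface EmbeddingModel { float[] embed(String text); }
    interface VectorStore { List<String> nearestChunks(float[] queryVector, int k); }
    interface ChatModel { String complete(String prompt); }

    private final EmbeddingModel embedder;
    private final VectorStore wikiIndex; // pre-populated with vectorised wiki pages
    private final ChatModel llm;

    WikiRagBot(EmbeddingModel embedder, VectorStore wikiIndex, ChatModel llm) {
        this.embedder = embedder;
        this.wikiIndex = wikiIndex;
        this.llm = llm;
    }

    String ask(String question) {
        // 1. Embed the question and retrieve the most similar wiki chunks
        List<String> chunks = wikiIndex.nearestChunks(embedder.embed(question), 5);
        // 2. Put the retrieved chunks into the prompt as grounding context
        String prompt = "Answer using only this Bahmni wiki content:\n"
                + String.join("\n---\n", chunks)
                + "\n\nQuestion: " + question;
        return llm.complete(prompt);
    }

    public static void main(String[] args) {
        // Toy stand-ins, just to show the wiring end to end
        EmbeddingModel embedder = text -> new float[] { text.length() };
        VectorStore store = (vector, k) ->
                List.of("FHIR stands for Fast Healthcare Interoperability Resources.");
        ChatModel llm = prompt -> "(answer grounded in: " + prompt + ")";
        System.out.println(new WikiRagBot(embedder, store, llm).ask("What is FHIR?"));
    }
}
```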
Understanding the domain
First, I wanted to understand the domain terms used in the ticket that I was unfamiliar with.
- Wiki-RAG-Bot: Both for “What is a Hi-type?” (Health Information Type) and “What is FHIR?” (Fast Healthcare Interoperability Resources) I got relevant definitions from the AI.
- The wiki search directly, to see if I could have found it just as well there: I did find just as quickly that “HI type” means “Health Information Type”. Finding a relevant definition for FHIR was much trickier in the wiki, however, because the term is referenced all over the place, so the search gave me lots of results that only mentioned the acronym but did not actually define it.
- Wiki-RAG-Bot with the full ticket: In this attempt I asked more broadly, “Explain to me the Bahmni and healthcare terminology in the following ticket: …”. The answer was a bit verbose and repetitive, but helpful overall. It put the ticket in context and explained it once more. It also mentioned that the relevant functionality is “done through the Bahmni HIP plugin module”, a clue to where the relevant code is.
- ChatGPT: Just to see which of these explanations could also have come from a model’s training data, I asked ChatGPT about the two terms as well. It does know what FHIR is, but failed on “HI type”, which is contextual to Bahmni.
Understanding more of the ticket from the code
The ticket says that the functionality currently “only works for Immunization, Health Document Record and Wellness Record Hi-types”, and the ticket is about improving the location tagging for other Hi-types as well. So I wanted to know: What are those “other Hi-types”?
- Bloop: Pointed me to a few of the other Hi-types (“these classes seem to handle…”), but wasn’t definitive about whether those are really all the possible types.
- GH Copilot: Pointed me to an enum called HiTypeDocumentKind, which seems to be exactly the right place, listing all possible values for Hi-Type. (A hypothetical sketch of such an enum follows this list.)
- Ctrl+F in the IDE: I searched for the string “hitype”, which wasn’t actually used that broadly in the code, and the vast majority of the results also pointed me to HiTypeDocumentKind. So I could have found this with a simple code search as well.
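For illustration, here is a hedged sketch of what an enum like HiTypeDocumentKind could look like. Only the first three values are confirmed by the ticket; the rest are my assumptions about other health information types such a module might handle, and the real enum in openmrs-module-hip may well differ.

```java
// Hypothetical sketch only: I have not verified the actual contents of
// HiTypeDocumentKind in openmrs-module-hip. The first three values are the
// Hi-Types named in the ticket; the rest are assumptions.
public enum HiTypeDocumentKind {
    IMMUNIZATION_RECORD,
    HEALTH_DOCUMENT_RECORD,
    WELLNESS_RECORD,
    PRESCRIPTION,      // assumed
    DIAGNOSTIC_REPORT, // assumed
    OP_CONSULTATION,   // assumed
    DISCHARGE_SUMMARY  // assumed
}
```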
Finding the relevant implementation in the code
Next, I wanted to find out where the functionality that needs to be changed for this ticket is implemented. I fed the JIRA ticket text into the tools and asked them: “Help me find the code relevant to this feature - where are we setting the organization resource in the FHIR bundle?”
Both GH Copilot and Bloop gave me similar lists of files. When I compared them with the files changed by the pull request, only one file appeared in all three lists, FhirBundledPrescriptionBuilder, which turned out to be one of the core classes to look at for this. While the other classes listed by the AI were not changed by the pull request, they were all dependencies of the FhirBundledPrescriptionBuilder class, so the tools generally pointed me to the right cluster of code.
Understanding how to reproduce the status quo
Now that I had what seemed to be the right place in the code, I wanted to reproduce the behaviour that needed to be enhanced as part of the ticket.
My biggest problem at this point was that most options to reproduce the behaviour of course involve some form of “running the application”. However, in a legacy setup like this one, that is easier said than done. Often these applications run on outdated stacks (here: Java 8) and tools (here: Vagrant). I also needed to understand the wider ecosystem of Bahmni, and how all the different components work together. I did ask all three of my tools, “How do I run this application?”. But the lists of steps they suggested were extensive, so I had a long feedback loop ahead of me, combined with very low confidence that the AI suggestions were correct or even useful. For GH Copilot and Bloop, which only had access to the codebase, I suspected that they made up quite a bit of their suggestions, and the lists of actions looked very generic. The Wiki-RAG-Bot was at least based on the official Bahmni documentation, but even there I couldn’t be sure whether the bot was basing its answer only on the most current run book, or whether it might also indiscriminately reproduce information from outdated wiki pages.
I briefly started following some of the steps, but then decided to not go further down this rabbit hole.
Writing a test
I did manage to compile the application and run the tests though! (After about 1 hour of fiddling with Maven, which the AI tools could not help me with.)
Unfortunately, there was no existing test class for FhirBundledPrescriptionBuilder. This was a bad sign, because it often means that the implementation is not easy to unit-test. However, it’s quite a common situation in “legacy” codebases. So I asked my tools to help me generate a test.
Both GH Copilot and Bloop gave me test code suggestions that were not viable. They made extensive use of mocking, and were mocking parts of the code that should not be mocked for the test, e.g. the input data object of the function under test. So I asked the AI tools not to mock the input object, and instead to set up reasonable test data for it. The challenge with that was that the input argument, OpenMrsPrescription, is the root of quite a deep object hierarchy that includes object types from OpenMRS libraries the AI did not even have access to. E.g., OpenMrsPrescription contains org.openmrs.Encounter, which contains org.openmrs.Patient, etc. The test data setup suggested by the AI only went one level deep, so when I tried to use it, I kept running into NullPointerExceptions because of missing values. The sketch below shows the kind of deeper setup that would have been needed.
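To make the depth problem concrete, here is a sketch of multi-level test data setup, going down the object hierarchy instead of stopping one level deep. The setters on the org.openmrs classes exist in the OpenMRS API; the construction of OpenMrsPrescription is an assumption, since I don’t know that class’s actual API.

```java
import java.util.Date;
import org.junit.Test;
import org.openmrs.Encounter;
import org.openmrs.Location;
import org.openmrs.Patient;
import org.openmrs.PersonName;

public class FhirBundledPrescriptionBuilderTest {

    @Test
    public void usesParentLocationNameForOrganisation() {
        // Build the input object graph several levels deep instead of mocking it
        Patient patient = new Patient();
        patient.addName(new PersonName("Jane", null, "Doe"));

        // The ticket is about the visit location's parent, so the test data
        // needs a location hierarchy as well
        Location parentLocation = new Location();
        parentLocation.setName("General Hospital");
        Location visitLocation = new Location();
        visitLocation.setParentLocation(parentLocation);

        Encounter encounter = new Encounter();
        encounter.setPatient(patient);
        encounter.setLocation(visitLocation);
        encounter.setEncounterDatetime(new Date());

        // Assumption: OpenMrsPrescription's real construction API is unknown
        // to me; the actual class may need more collaborators wired in
        OpenMrsPrescription prescription = new OpenMrsPrescription(encounter);

        // ...exercise FhirBundledPrescriptionBuilder with this prescription and
        // assert on the Organisation resource in the resulting FHIR bundle
    }
}
```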
This is where I stopped my experiment.
Learnings
- For the use case of onboarding onto an unknown application, it is crucial that the tools can automatically determine the relevant repositories and files. Extra bonus points for AI awareness of dependency code and/or documentation. In Bloop, I often had to put files into the context myself first to get helpful answers, which somewhat defeats the purpose of understanding an unknown codebase. And in a setup like Bahmni that has a LOT of repositories, it’s important for a newbie to have a tool that can answer questions across all of them and point to the right repo. So this automatic context orchestration is a feature to watch for in these tools.
- While the results of the “where is this in the code?” questions were usually not 100% accurate, they always pointed me in a generally useful direction. So it remains to be seen in real-life usage of these tools: is this significantly better than Ctrl+F text search? In this case I think it was; I wouldn’t have known where to start with a generic string like “organization”.
- For older applications and stacks, development environment setup is usually a big challenge during onboarding. AI cannot magically replace a well-documented and well-automated setup. Outdated or non-existent documentation, as well as obscure combinations of outdated runtimes and tools, will stump AI as much as any human.
- The ability of AI to generate unit tests for existing code that doesn’t have any yet depends entirely on the quality and design of that code. And in my experience, a lack of unit tests often correlates with low modularity and cohesion, i.e. sprawling and entangled code like I encountered in this case. So I suspect that in most cases, the hope of using AI to add unit tests to a codebase that doesn’t have them will remain a pipe dream.