Onboarding to a 'legacy' codebase with the help of AI

This article is part of “Exploring Gen AI”, a series capturing our explorations of using GenAI technology for software development.

15 Aug 2024

One of the promising applications of GenAI in software delivery is how it can help us understand code better, in particular large codebases, or multiple codebases that form one big application. This is especially interesting for older codebases that are getting hard to maintain (“legacy”), or for improving onboarding in teams with high turnover.

To get an idea of how well this works and what the potential is, I picked an issue from the open source project Bahmni and tried to understand the issue and what needs to be done with the help of AI. Bahmni is built on top of OpenMRS, which has been around for a very long time. OpenMRS and Bahmni are good examples of very large codebases with a lot of tech debt, representing many different styles and technologies accumulated over time.

Spoiler alert: I did not actually figure out how to solve the ticket! But the journey gave me a bunch of observations about what AI could and could not help with in such a use case.

The ticket

Organisation name is not fetched from parent location for few Hi-Types

“When the visit location has a target location tagged as Organization then the parent location’s name should be used for Organisation resource in FHIR bundle. This works only for Immunization, Health Document Record and Wellness Record Hi-types. For others the visit location is only used.”
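The ticket wording is a bit hard to parse, so here is one plausible reading of the rule it describes, sketched as code: use the name of the closest parent location tagged as “Organization” for the Organisation resource, falling back to the visit location itself. This is only my interpretation, written against OpenMRS-style Location objects; the class and method names are illustrative, not the actual Bahmni implementation.

```java
// One plausible reading of the ticket, sketched with OpenMRS-style Location
// objects. The class and method names are illustrative only; this is not
// the actual Bahmni implementation.
import org.openmrs.Location;

class OrganizationNameResolver {

    // Use the name of the closest ancestor location tagged as
    // "Organization"; otherwise fall back to the visit location itself.
    String resolveOrganizationName(Location visitLocation) {
        Location current = visitLocation.getParentLocation();
        while (current != null) {
            if (hasOrganizationTag(current)) {
                return current.getName();
            }
            current = current.getParentLocation();
        }
        return visitLocation.getName();
    }

    private boolean hasOrganizationTag(Location location) {
        return location.getTags() != null && location.getTags().stream()
                .anyMatch(tag -> "Organization".equalsIgnoreCase(tag.getName()));
    }
}
```

According to the ticket, something like this already happens for three Hi-types, and needs to be extended to the others.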

The codebase(s)

OpenMRS and Bahmni have many, many repositories. As I did not have access to a tool that would let me ask questions across all the repositories I would have needed, I cheated: I looked at the pull request already attached to the ticket to identify the relevant codebase, openmrs-module-hip.

The tools

I used a bunch of different AI tools in this journey:

  1. Simple Retrieval-Augmented Generation (RAG) over a vectorised version of the full Bahmni Wiki. I’ll refer to this as Wiki-RAG-Bot (a sketch of how such a bot works follows after this list).

  2. An AI-powered code understanding product called Bloop. It’s one of many products on the market that focus on using AI to understand and ask questions about large codebases.

  3. GitHub Copilot’s chat in VS Code, where one can ask questions about the currently open codebase via @workspace.
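For readers unfamiliar with the pattern behind the Wiki-RAG-Bot: it boils down to “retrieve the most relevant wiki chunks, then prepend them to the prompt”. Here is a minimal sketch; the EmbeddingModel, VectorStore and ChatModel interfaces are hypothetical stand-ins for whatever embedding and LLM services are actually used.

```java
// Minimal sketch of a retrieve-then-prompt (RAG) loop. EmbeddingModel,
// VectorStore and ChatModel are hypothetical stand-ins for the actual
// embedding and LLM services behind the Wiki-RAG-Bot.
import java.util.List;

interface EmbeddingModel { float[] embed(String text); }
interface VectorStore { List<String> findSimilar(float[] queryVector, int topK); }
interface ChatModel { String complete(String prompt); }

class WikiRagBot {
    private final EmbeddingModel embeddings;
    private final VectorStore wikiChunks; // pre-populated with vectorised wiki pages
    private final ChatModel llm;

    WikiRagBot(EmbeddingModel embeddings, VectorStore wikiChunks, ChatModel llm) {
        this.embeddings = embeddings;
        this.wikiChunks = wikiChunks;
        this.llm = llm;
    }

    String ask(String question) {
        // Retrieve the wiki chunks most similar to the question...
        List<String> context = wikiChunks.findSimilar(embeddings.embed(question), 5);
        // ...and prepend them to the prompt, so the answer is grounded in
        // the documentation rather than in the model's own guesses
        String prompt = "Answer based only on these Bahmni wiki excerpts:\n"
                + String.join("\n---\n", context)
                + "\n\nQuestion: " + question;
        return llm.complete(prompt);
    }
}
```

The important consequence of this design, which comes up again later: the bot’s answers are only as good as the wiki pages it retrieves.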

Understanding the domain

First, I wanted to understand the domain terms used in the ticket that I was unfamiliar with, like “Hi-Type” and the “Organization” location tag.

Understanding more of the ticket from the code

The ticket says that the functionality currently “only works for Immunization, Health Document Record and Wellness Record Hi-types”, and the ticket is about improving the location tagging for other Hi-types as well. So I wanted to know: What are those “other Hi-types”?

Finding the relevant implementation in the code

Next, I wanted to find out where the functionality that needs to be changed for this ticket is implemented. I fed the JIRA ticket text into the tools and asked them: “Help me find the code relevant to this feature - where are we setting the organization resource in the FHIR bundle?”

Both GH Copilot and Bloop gave me similar lists of files. When I compared them with the files changed by the pull request, I found only one file that appeared on all three lists, FhirBundledPrescriptionBuilder, which turned out to be one of the core classes to look at for this ticket. While the other classes listed by the AI tools were not changed by the pull request, they were all dependencies of this FhirBundledPrescriptionBuilder class, so the tools generally pointed me to the right cluster of code.

Understanding how to reproduce the status quo

Now that I had what seemed to be the right place in the code, I wanted to reproduce the behaviour that needed to be enhanced as part of the ticket.

My biggest problem at this point was that most options to reproduce the behaviour of course include some form of “running the application”. However, in a legacy setup like this one, that is easier said than done. Often these applications are run with outdated stacks (here: Java 8) and tools (here: Vagrant). I also needed to understand the wider ecosystem of Bahmni, and how all the different components work together.

I did ask all three of my tools, “How do I run this application?”. But the suggested lists of steps were extensive, so I had a long feedback loop in front of me, combined with very low confidence that the AI suggestions were correct or even useful. For GH Copilot and Bloop, which only had access to the codebase, I suspected that they made up quite a bit of their suggestions, and the list of actions looked very generic. The Wiki-RAG-Bot was at least based on the official Bahmni documentation, but even here I couldn’t be sure if the bot was basing its answer only on the most current run book, or if it was also indiscriminately reproducing information from outdated wiki pages.

I briefly started following some of the steps, but then decided to not go further down this rabbit hole.

Writing a test

I did manage to compile the application and run the tests, though! (After about an hour of fiddling with Maven, which the AI tools could not help me with.)

Unfortunately, there was no existing test class for FhirBundledPrescriptionBuilder. This was a bad sign because it often means that the implementation is not easy to unit-test. However, it’s quite a common situation in “legacy” codebases. So I asked my tools to help me generate a test.

Both GH Copilot and Bloop gave me test code suggestions that were not viable. They made extensive use of mocking, and were mocking parts of the code that should not be mocked for this test, e.g. the input data object for the function under test. So I asked the AI tools not to mock the input object, and instead to set up reasonable test data for it. The challenge with that was that the input argument, OpenMrsPrescription, is the root of quite a deep object hierarchy that includes object types from OpenMRS libraries the AI did not even have access to. E.g., OpenMrsPrescription contains org.openmrs.Encounter, which contains org.openmrs.Patient, etc. The test data setup suggested by the AI went only one level deep, so when I tried to use it, I kept running into NullPointerExceptions because of missing values.
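To illustrate what “setting up the test data instead of mocking it” starts to look like: the fixture has to be populated several levels deep before the builder stops throwing NullPointerExceptions. A rough sketch follows; the org.openmrs types and setters are real, but the fixture is simplified, and the OpenMrsPrescription factory method is hypothetical.

```java
// Rough sketch of test data set up several levels deep, instead of mocking
// the input object. The org.openmrs types and setters are real, but this is
// a simplified fixture; OpenMrsPrescription.from() is a hypothetical factory,
// the real class (from openmrs-module-hip) may be constructed differently.
import java.util.Date;
import org.openmrs.Encounter;
import org.openmrs.Location;
import org.openmrs.Patient;
import org.openmrs.PersonName;

class FhirBundledPrescriptionBuilderTestFixture {

    OpenMrsPrescription prescriptionFixture() {
        // Level 3: the patient, which the Encounter needs
        Patient patient = new Patient();
        patient.addName(new PersonName("Test", null, "Patient"));

        // Level 3: the visit location, also hanging off the Encounter
        Location visitLocation = new Location();
        visitLocation.setName("Test Clinic");

        // Level 2: the Encounter; leaving any of these unset is a
        // reliable source of NullPointerExceptions further down
        Encounter encounter = new Encounter();
        encounter.setPatient(patient);
        encounter.setLocation(visitLocation);
        encounter.setEncounterDatetime(new Date());

        // Level 1: the input object itself (hypothetical factory method)
        return OpenMrsPrescription.from(encounter);
    }
}
```

Getting the AI tools to generate a setup like this, with all the required fields populated all the way down, is exactly what did not work for me.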

This is where I stopped my experiment.

Learnings