Exploring Generative AI

Generative AI and particularly LLMs (Large Language Models) have exploded into the public consciousness. Like many software developers I am intrigued by the possibilities, but unsure what exactly it will mean for our profession in the long run. I have now taken on a role in Thoughtworks to coordinate our work on how this technology will affect software delivery practices. I'll be posting various memos here to describe what my colleagues and I are learning and thinking.

Earlier Memos

The toolchain (26 July 2023)

Let’s start with the toolchain. Whenever there is a new area with still evolving patterns and technology, I try to develop a mental model of how things fit together. It helps deal with the wave of information coming at me. What types of problems are being solved in the space? What are the common types of puzzle pieces needed to solve those problems? How are things fitting together?

How to categorise the tools

The following are the dimensions of my current mental model of tools that use LLMs (Large Language Models) to assist with coding.

Assisted tasks

  • Finding information faster, and in context
  • Generating code
  • “Reasoning” about code (Explaining code, or problems in the code)
  • Transforming code into something else (e.g. documentation text or diagram)

These are the types of tasks I see most commonly tackled when it comes to coding assistance, although there would be a lot more if I expanded the scope to other tasks in the software delivery lifecycle.

Interaction modes

I’ve seen three main types of interaction modes:

  • Chat interfaces
  • In-line assistance, i.e. typing in a code editor
  • CLI

Prompt composition

The quality of the prompt obviously has a big impact on the usefulness of the tools, in combination with the suitability of the LLM used in the backend. Prompt engineering does not have to be left purely to the user though; many tools apply prompting techniques for you in the backend.

  • User creates the prompt from scratch
  • Tool composes prompt from user input and additional context (e.g. open files, a set of reusable context snippets, or additional questions to the user)

Properties of the model

  • What the model was trained with
    • Was it trained specifically with code, and coding tasks? Which languages?
    • When was it trained, i.e. how current is the information
  • Size of the model (it is still much debated how this matters, though, and what a “good” size is for a specific task like coding)
  • Size of the context window supported by the model, which is basically the number of tokens it can take as the prompt
  • What filters have been added to the model, or the backend where it is hosted

Origin and hosting

  • Commercial products, with LLM APIs hosted by the product company
  • Open source tools, connecting to LLM API services
  • Self-built tools, connecting to LLM API services
  • Self-built tools, connecting to a fine-tuned, self-hosted LLM API

Examples

Here are some common examples of tools in the space, and how they fit into this model. (The list is not an endorsement of these tools, or a dismissal of other tools; it is just meant to help illustrate the dimensions.)

Tool | Tasks | Interaction | Prompt composition | Model | Origin / Hosting
GitHub Copilot | Code generation | In-line assistance | Composed by IDE extension | Trained with code, vulnerability filters | Commercial
GitHub Copilot Chat | All of them | Chat | Composed of user chat + open files | Trained with code | Commercial
ChatGPT | All of them | Chat | All done by user | Trained with code | Commercial
GPT Engineer | Code generation | CLI | Prompt composed based on user input | Choice of OpenAI models | Open source, connecting to OpenAI API
“Team AIs” | All of them | Web UI | Prompt composed based on user input and use case | Most commonly OpenAI’s GPT models | Maintained by a team for their use cases, connecting to OpenAI APIs
Meta’s CodeCompose | Code generation | In-line assistance | Composed by editor extension | Model fine-tuned on internal use cases and codebases | Self-hosted

What are people using today, and what’s next?

Today, people are most commonly using combinations of direct chat interaction (e.g. via ChatGPT or Copilot Chat) with coding assistance in the code editor (e.g. via GitHub Copilot or Tabnine). In-line assistance in the context of an editor is probably the most mature and effective way to use LLMs for coding assistance today, compared to other approaches. It supports the developer in their natural workflow with small steps. Smaller steps make it easier to follow along and review the quality more diligently, and it’s easy to just move on in the cases where it does not work.

There is a lot of experimentation going on in the open source world with tooling that provides prompt composition to generate larger pieces of code (e.g. GPT Engineer, Aider). I’ve seen similar usage of small prompt composition applications tuned by teams for their specific use cases, e.g. by combining a reusable architecture and tech stack definition with user stories to generate task plans or test code, similar to what my colleague Xu Hao is describing here. Prompt composition applications like this are most commonly used with OpenAI’s models today, as they are most easily available and relatively powerful. Experiments are moving more and more towards open source models and models hosted by the big hyperscalers though, as people are looking for more control over their data.

As a next step forward, beyond advanced prompt composition, people are putting lots of hopes for future improvements into the model component. Do larger models, or smaller but more specifically trained models work better for coding assistance? Will models with larger context windows enable us to feed them with more code to reason about the quality and architecture of larger parts of our codebases? At what scale does it pay off to fine-tune a model with your organization’s code? What will happen in the space of open source models? Questions for a future memo.

Thanks to Kiran Prakash for his input


Median - A tale in three functions (27 July 2023)

This is a little story of generating a median function, and how it illustrates the usefulness and limitations of LLM-assisted coding.

I needed to calculate the median of a list of numbers in a TypeScript codebase. Instead of what I would usually do, which is start an internet search for “median function javascript”, I tried to get GitHub Copilot to assist me. I started by typing a comment:

// calculate the median of a list of numbers

Copilot first generated a proper method signature for me:

public static median(values:number[]): number {

And then suggested three different implementations of that function.

Suggestion 1

public static median(values:number[]): number {
    if(values.length === 0) return 0;

    values.sort((a,b) => a-b);

    var half = Math.floor(values.length / 2);

    if (values.length % 2)
    return values[half];

    return (values[half - 1] + values[half]) / 2.0;
}

This implementation works fine. However, the input parameter values is being changed with the sort function. When showing some colleagues this example, I got mixed responses - one said “that’s pretty bad, something we would usually forget to write a test about, and then it can become an error that’s really hard to debug”. Another said, “it’s not necessarily always a bad thing, it depends”. I was wondering if this would be something that a static code analysis tool in the IDE would mark as a problem, but neither ESLint nor Sonar flagged it.

Suggestion 2

function median2(values: number[]) {
    const sorted = values.slice().sort((a, b) => a - b);
    const middle = Math.floor(sorted.length / 2);
    if (sorted.length % 2 === 0) {
        return (sorted[middle - 1] + sorted[middle]) / 2;
    } else {
        return sorted[middle];
    }
}

This one works, and does not have the problem of changing the input parameter, as slice() returns a copy of the input array.

Suggestion 3

function median3(values: number[]) {
    const sorted = values.slice().sort((a, b) => a - b);
    const middle = Math.floor(sorted.length / 2);
    return sorted[middle];
}

At first sight, this looks better than the others - it’s shorter and has no branching at all, which keeps the cyclomatic complexity low; we tend to think of that as “more elegant”. The test I wrote for it also worked, great!

But I later realised that it is important to test it with both an even and an odd number of values - the test I had written used an odd-length input, but this implementation fails for inputs of even length. This example is great to illustrate that it is still important to understand what we are doing, to be able to write reasonable tests for the code that got generated. We need to know how the median works to write tests that cover the full functionality.

Isn’t this the same though as copying from the internet?

A use case like this is a perfect example of one of the benefits people see in this “autocomplete on steroids”: you don’t have to leave your flow and toolchain to get answers to questions like this. And otherwise we would copy & paste the code from somewhere anyway, would then have to review it thoroughly as well, and write the tests. So it’s the same risk - right?

The only difference is that with Copilot, we don’t know the source of the code. On Stack Overflow, for example, we have an additional data point about the quality of a snippet: the number of upvotes.

Incidentally, “Suggestion 1” is almost exactly the code suggested by the highest-voted answer to a StackOverflow question on the topic, in spite of the little flaw. The mutation of the input parameter is called out by a user in the comments though.

Generate the tests, or the code? Or both?

What about the other way around then, what if I had asked Copilot to generate the tests for me first? I tried that with Copilot Chat, and it gave me a very nice set of tests, including one that fails for “Suggestion 3” with an even length.

it("should return the median of an array of odd length", () => { ... }

it("should return the median of an array of even length", () => { ... }

it("should return the median of an array with negative numbers", () => { ... }

it("should return the median of an array with duplicate values", () => { ... }

In this particular case of a very common and small function like median, I would even consider using generated code for both the tests and the function. The tests were quite readable and it was easy for me to reason about their coverage, plus they would have helped me remember that I need to look at both even and odd lengths of input. However, for other more complex functions with more custom code I would consider writing the tests myself, as a means of quality control. Especially with larger functions, I would want to think through my test cases in a structured way from scratch, instead of getting partial scenarios from a tool, and then having to fill in the missing ones.

Could the tool itself help me fix the flaws with the generated code?

I asked Copilot Chat to refactor “Suggestion 1” so that it does not change the input parameter, and it gave me a reasonable fix. The question implies though that I already know what I want to improve in the code.
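
The fix boils down to sorting a copy of the array instead of the original, along these lines (this is my reconstruction of the idea, not Copilot Chat’s literal output):

public static median(values: number[]): number {
    if (values.length === 0) return 0;

    // sort a copy, so the input array is left untouched
    const sorted = [...values].sort((a, b) => a - b);

    const half = Math.floor(sorted.length / 2);

    if (sorted.length % 2) return sorted[half];

    return (sorted[half - 1] + sorted[half]) / 2.0;
}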

I also asked ChatGPT what is wrong or could be improved with “Suggestion 3”, more broadly. It did tell me that it does not work for an even length of input.

Conclusions

  • You have to know what you’re doing, to judge the generated suggestions. In this case, I needed an understanding of how median calculation works, to be able to write reasonable tests for the generated code.
  • The tool itself might have the answer to what’s wrong or could be improved in the generated code - is that a path to making it better in the future, or are we doomed to have circular conversations with our AI tools?
  • I’ve been skeptical about generating tests as well as implementations, for quality control reasons. But, generating tests could give me ideas for test scenarios I missed, even if I discard the code afterwards. And depending on the complexity of the function, I might consider using generated tests as well, if it’s easy to reason about the scenarios.

Thanks to Aleksei Bekh-Ivanov and Erik Doernenburg for their insights


In-line assistance - when is it more useful? (01 August 2023)

The most widely used form of coding assistance in Thoughtworks at the moment is in-line code generation in the IDE, where an IDE extension generates suggestions for the developer as they are typing in the IDE.

The short answer to the question, “Is this useful?” is: “Sometimes it is, sometimes it isn’t.” ¯\_(ツ)_/¯ You will find a wide range of developer opinions on the internet, from “this made me so much faster” all the way to “I switched it off, it was useless”. That is because the usefulness of these tools depends on the circumstances. And the judgment of usefulness depends on how high your expectations are.

What do I mean by “useful”?

For the purposes of this memo, I’m defining “useful” as “the generated suggestions are helping me solve problems faster, and at a quality comparable to what I would have achieved without the tool”. That includes not only the writing of the code, but also the review and tweaking of the generated suggestions, and dealing with rework later, should there be quality issues.

Factors that impact usefulness of suggestions

Note: This is mostly based on experiences with GitHub Copilot.

More prevalent tech stacks

Safer waters: The more prevalent the tech stack, the more discussions and code examples will have been part of the training data for the model. This means that the generated suggestions are more likely to be useful for languages like Java or JavaScript than for a less widely used and discussed language like Lua.

However: My colleague Erik Doernenburg wrote about his experience of “Taking Copilot to difficult terrain” with Rust. His conclusion: “Overall, though, even for a not-so-common programming language like Rust, with a codebase that uses more complicated data structures I found Copilot helpful.”

Simpler and more commonplace problems

Safer waters: This one is a bit hard to define. What does “simpler” mean, what does “commonplace” mean? I’ll use some examples to illustrate.

  • Common problems: In a previous memo, I discussed an example of generating a median function. I would consider that a very commonplace problem and therefore a good use case for generation.
  • Common solution patterns applied to our context: For example, I have used it successfully to implement problems that needed list processing, like a chain of mapping, grouping, and sorting of lists (see the sketch after this list).
  • Boilerplate: Create boilerplate setups like an ExpressJS server, or a React component, or a database connection and query execution.
  • Repetitive patterns: It helps speed up typing of things that have very common and repetitive patterns, like creating a new constructor or a data structure, or a repetition of a test setup in a test suite. I traditionally use a lot of copy and paste for these things, and Copilot can speed that up.
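
To make the list-processing case concrete, here is the kind of chain I mean - the example itself is my own illustration, not a Copilot suggestion, and the domain (expenses and categories) is made up:

type Expense = { category: string; amount: number };

// group expenses by category, sum the amounts per category, and sort the totals in descending order
function topCategories(expenses: Expense[]): { category: string; total: number }[] {
    const totals = new Map<string, number>();
    for (const expense of expenses) {
        totals.set(expense.category, (totals.get(expense.category) ?? 0) + expense.amount);
    }
    return [...totals.entries()]
        .map(([category, total]) => ({ category, total }))
        .sort((a, b) => b.total - a.total);
}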

When a colleague who had been working with Copilot for over 2 months was pairing with somebody who did not have a license yet, he “found having to write repetitive code by hand excruciating”. This autocomplete-on-steroids effect can be less useful though for developers who are already very good at using IDE features, shortcuts, and things like multiple cursor mode. And beware that when coding assistants reduce the pain of repetitive code, we might be less motivated to refactor.

However: You can use a coding assistant to explore some ideas when you are getting started with more complex problems, even if you discard the suggestion afterwards.

Smaller size of the suggestions

Safer waters: The smaller the generated suggestion, the less review effort is needed, the easier the developer can follow along with what is being suggested.

The larger the suggestion, the more time you will have to spend to understand it, and the more likely it is that you will have to change it to fit your context. Larger snippets also tempt us to go in larger steps, which increases the risk of missing test coverage, or introducing things that are unnecessary.

However: I suspect a lot of interplay of this factor with the others. Small steps particularly help when you already have an idea of how to solve the problem. So when you do not have a plan yet because you are less experienced, or the problem is more complex, then a larger snippet might help you get started with that plan.

More experienced developer(s)

Safer waters: Experience still matters. The more experienced the developer, the more likely they are to be able to judge the quality of the suggestions, and to be able to use them effectively. As GitHub themselves put it: “It’s good at stuff you forgot.” This study even found that “in some cases, tasks took junior developers 7 to 10 percent longer with the tools than without them”.

However: Most of the observations I have collected so far have been made by more experienced developers. So this is one where I am currently least sure about the trade-offs at play. My hypothesis is that the safer the waters are from the other factors mentioned above, the less likely it is that the tools would lead less experienced developers down the wrong path, and the higher the chance that it will give them a leg up. Pair programming and other forms of code review further mitigate the risks.

Higher margin for errors

I already touched on the importance of being able to judge the quality and correctness of suggestions. As has been widely reported, Large Language Models can “hallucinate” information, or in this case, code. When you are working on a problem or a use case that has a higher impact when you get it wrong, you need to be particularly vigilant about reviewing the suggestions. For example, when I was recently working on securing cookies in a web application, Copilot suggested a value for the Content-Security-Policy HTTP header. As I have little experience in this area, and this was a security-related use case, I did not just want to accept Copilot’s suggestion, but went to a trusted online source for research instead.

In conclusion

There are safer waters for coding assistance, but as you can see from this discussion, there are multiple factors at play, and in interplay with each other, that determine the usefulness. Using coding assistance tools effectively is a skill that is not simply learned from a training course or a blog post. It’s important to use them for a period of time, experiment in and outside of the safer waters, and build up a feeling for when this tooling is useful for you, and when to just move on and do it yourself.

Thanks to James Emmott, Joern Dinkla, Marco Pierobon, Paolo Carrasco, Paul Sobocinski and Serj Krasnov for their insights and feedback


In-line assistance - how can it get in the way? (03 August 2023)

In the previous memo, I talked about the circumstances under which coding assistance can be useful. This memo is two in one: here are two ways in which we’ve noticed the tools can get in the way.

Amplification of bad or outdated practices

One of the strengths of coding assistants right in the IDE is that they can use snippets of the surrounding codebase to enhance the prompt with additional context. We have found that having the right files open in the editor to enhance the prompt is quite a big factor in improving the usefulness of suggestions.

However, the tools cannot distinguish good code from bad code. They will inject anything into the context that seems relevant. (According to this reverse engineering effort, GitHub Copilot will look for open files with the same programming language, and use some heuristic to find similar snippets to add to the prompt.) As a result, the coding assistant can become that developer on the team who keeps copying code from the bad examples in the codebase.

We also found that after refactoring an interface, or introducing new patterns into the codebase, the assistant can get stuck in the old ways. For example, the team might want to introduce a new pattern like “start using the Factory pattern for dependency injection”, but the tool keeps suggesting the current way of dependency injection because that is still prevalent all over the codebase and in the open files. We call this a poisoned context, and we don’t really have a good way to mitigate this yet.

In conclusion

The AI’s eagerness to improve the prompting context with our codebase can be a blessing and a curse. That is one of many reasons why it is so important for developers to not start trusting the generated code too much, but still review and think for themselves.

Review fatigue and complacency

Using a coding assistant means having to do small code reviews over and over again. Usually when we code, our flow is much more about actively writing code, and implementing the solution plan in our head. This is now sprinkled with reading and reviewing code, which is cognitively different, and also something most of us enjoy less than actively producing code. This can lead to review fatigue, and a feeling that the flow is more disrupted than enhanced by the assistant. Some developers might switch off the tool for a while to take a break from that. Or, if we don’t deal with the fatigue, we might get sloppy and complacent with the review of the code.

Review complacency can also be the result of a bunch of cognitive biases:

  • Automation Bias is our tendency “to favor suggestions from automated systems and to ignore contradictory information made without automation, even if it is correct.” Once we have had good experience and success with GenAI assistants, we might start trusting them too much.
  • I’ve also often felt a twisted version of the Sunk Cost Fallacy at work when I’m working with an AI coding assistant. Sunk cost fallacy is defined as “a greater tendency to continue an endeavor once an investment in money, effort, or time has been made”. In this case, we are not really investing time ourselves, on the contrary, we’re saving time. But once we have that multi-line code suggestion from the tool, it can feel more rational to spend 20 minutes on making that suggestion work than to spend 5 minutes on writing the code ourselves, even once we see that the suggestion is not quite right.
  • Once we have seen a code suggestion, it’s hard to unsee it, and we have a harder time thinking about other solutions. That is because of the Anchoring Effect, which happens when “an individual’s decisions are influenced by a particular reference point or ‘anchor’”. So while coding assistants’ suggestions can be great for brainstorming when we don’t know how to solve something yet, awareness of the Anchoring Effect is important when the brainstorm is not fruitful, and we need to reset our brain for a fresh start.

In conclusion

Sometimes it’s ok to take a break from the assistant. And we have to be careful not to become that person who drives their car into a lake just because the navigation system tells them to.

Thanks to the “Ensembling with Copilot” group around Paul Sobocinski in Thoughtworks Canada, who described the “context poisoning” effect and the review fatigue to me: Eren, Geet, Nenad, Om, Rishi, Janice, Vivian, Yada and Zack

Thanks to Bruno, Chris, Gabriel, Javier and Roselma for their review comments on this memo


Coding assistants do not replace pair programming (10 August 2023)

As previous memos have hopefully shown, I find GenAI-powered coding assistants a very useful addition to the developer toolchain. They can clearly speed up writing of code under certain circumstances, they can help us get unstuck, and remember and look things up faster. So far, all memos have mainly been about in-line assistance in the IDE, but if we add chatbot interfaces to that, there’s even more potential for useful assistance. Especially powerful are chat interfaces integrated into the IDE, enhanced with additional context of the codebase that we don’t have to spell out in our prompts.

However, while I see the potential, I honestly get quite frustrated when people talk about coding assistants as a replacement for pair programming (GitHub even calls their Copilot product “your AI pair programmer”). At Thoughtworks, we have long been strong proponents of pair programming and pairing in general to make teams more effective. It is part of our “Sensible Default Practices” that we use as a starting point for our projects.

The framing of coding assistants as pair programmers is a disservice to the practice, and reinforces the widespread simplified understanding and misconception of what the benefits of pairing are. I went back to a set of slides I use to talk about pairing, and the comprehensive article published right here on this site, and I crammed all the benefits I mention there into one slide:

Title: Benefits and purposes of pair programming, in 5 categories: 1. “one plus one is greater than two”, for things like knowledge exchange or onboarding; 2. Flow, for things like keeping focus and limiting work in process; 3. Avoid waste, referencing the 7 wastes of software development; 4. Continuous Integration, as in integrating multiple times a day, mentioning shorter code review loops; and 5., Practice skills needed on highly effective teams, like task organisation, empathy, communication, feedback

The area where coding assistants can have the most obvious impact here is the first one, “1 plus 1 is greater than 2”. They can help us get unstuck, they can make onboarding better, and they can help us get the tactical work done faster, so we can focus more on the strategic work, i.e. the design of the overall solution. They also help with knowledge sharing in the sense of “How does this technology work?”.

Pair programming however is also about the type of knowledge sharing that creates collective code ownership, and a shared knowledge of the history of the codebase. It’s about sharing the tacit knowledge that is not written down anywhere, and therefore also not available to a Large Language Model. Pairing is also about improving team flow, avoiding waste, and making Continuous Integration easier. It helps us practice collaboration skills like communication, empathy, and giving and receiving feedback. And it provides precious opportunities to bond with one another in remote-first teams.

Conclusion

Coding assistants can cover only a small part of the goals and benefits of pair programming. That is because pairing is a practice that helps improve the team as a whole, not just an individual coder. When done well, the increased level of communication and collaboration improves flow and collective code ownership. I would even argue that the risks of LLM-assisted coding are best mitigated by using those tools in a pair (see “How it can get in the way” in a previous memo).

Use coding assistants to make pairs better, not to replace pairing.


TDD with GitHub Copilot (17 August 2023)

by Paul Sobocinski

Will the advent of AI coding assistants such as GitHub Copilot mean that we won’t need tests? Will TDD become obsolete? To answer this, let’s examine two ways TDD helps software development: providing good feedback, and a means to “divide and conquer” when solving problems.

TDD for good feedback

Good feedback is fast and accurate. In both regards, nothing beats starting with a well-written unit test. Not manual testing, not documentation, not code review, and yes, not even Generative AI. In fact, LLMs can provide irrelevant information and even hallucinate. TDD is especially needed when using AI coding assistants: for the same reasons we need fast and accurate feedback on the code we write, we need fast and accurate feedback on the code our AI coding assistant writes.

TDD to divide-and-conquer problems

Problem-solving via divide-and-conquer means that smaller problems can be solved sooner than larger ones. This enables Continuous Integration, Trunk-Based Development, and ultimately Continuous Delivery. But do we really need all this if AI assistants do the coding for us?

Yes. LLMs rarely provide the exact functionality we need after a single prompt. So iterative development is not going away yet. Also, LLMs appear to “elicit reasoning” (see linked study) when they solve problems incrementally via chain-of-thought prompting. LLM-based AI coding assistants perform best when they divide-and-conquer problems, and TDD is how we do that for software development.

TDD tips for GitHub Copilot

At Thoughtworks, we have been using GitHub Copilot with TDD since the start of the year. Our goal has been to experiment with, evaluate, and evolve a series of effective practices around use of the tool.

0. Getting started

TDD represented as a three-part wheel with 'Getting Started' highlighted in the center

Starting with a blank test file doesn’t mean starting with a blank context. We often start from a user story with some rough notes. We also talk through a starting point with our pairing partner.

This is all context that Copilot doesn’t “see” until we put it in an open file (e.g. the top of our test file). Copilot can work with typos, point-form, poor grammar — you name it. But it can’t work with a blank file.

Some examples of starting context that have worked for us:

  • ASCII art mockup
  • Acceptance Criteria
  • Guiding Assumptions such as:
    • “No GUI needed”
    • “Use Object Oriented Programming” (vs. Functional Programming)

Copilot uses open files for context, so keeping both the test and the implementation file open (e.g. side-by-side) greatly improves Copilot’s code completion ability.

1. Red

TDD represented as a three-part wheel with the 'Red' portion highlighted on the top left third

We begin by writing a descriptive test example name. The more descriptive the name, the better the performance of Copilot’s code completion.

We find that a Given-When-Then structure helps in three ways. First, it reminds us to provide business context. Second, it allows for Copilot to provide rich and expressive naming recommendations for test examples. Third, it reveals Copilot’s “understanding” of the problem from the top-of-file context (described in the prior section).

For example, if we are working on backend code, and Copilot is code-completing our test example name to be “given the user… clicks the buy button”, this tells us that we should update the top-of-file context to specify “assume no GUI” or “this test suite interfaces with the API endpoints of a Python Flask app”.
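
As a sketch of what this can look like in practice (the feature, the notes, and the test name here are hypothetical, not from an actual project):

// Top-of-file context for Copilot: rough notes from the user story
// Feature: checkout total calculation
// Assumptions: no GUI, this suite tests the service layer only

describe("checkout total", () => {
    it("given a cart with two items, when the total is calculated, then it includes sales tax", () => {
        // a descriptive Given-When-Then name like this gives Copilot more to work with
        // when code-completing the arrange / act / assert steps
    });
});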

More “gotchas” to watch out for:

  • Copilot may code-complete multiple tests at a time. These tests are often useless (we delete them).
  • As we add more tests, Copilot will code-complete multiple lines instead of one line at a time. It will often infer the correct “arrange” and “act” steps from the test names.
    • Here’s the gotcha: it infers the correct “assert” step less often, so we’re especially careful here that the new test is correctly failing before moving onto the “green” step.

2. Green

TDD represented as a three-part wheel with the 'Green' portion highlighted on the top right third

Now we’re ready for Copilot to help with the implementation. An already existing, expressive and readable test suite maximizes Copilot’s potential at this step.

Having said that, Copilot often fails to take “baby steps”. For example, when adding a new method, the “baby step” means returning a hard-coded value that passes the test. To date, we haven’t been able to coax Copilot to take this approach.
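
To illustrate the kind of “baby step” we mean, reusing the median example from an earlier memo (this is my sketch, not Copilot output):

// the TDD "baby step": make the first test pass with a hard-coded value,
// and only generalise the implementation in later steps
function median(values: number[]): number {
    return 3; // just enough to pass a first test like expect(median([1, 3, 5])).toEqual(3)
}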

Backfilling tests

Instead of taking “baby steps”, Copilot jumps ahead and provides functionality that, while often relevant, is not yet tested. As a workaround, we “backfill” the missing tests. While this diverges from the standard TDD flow, we have yet to see any serious issues with our workaround.

Delete and regenerate

For implementation code that needs updating, the most effective way to involve Copilot is to delete the implementation and have it regenerate the code from scratch. If this fails, deleting the method contents and writing out the step-by-step approach using code comments may help. Failing that, the best way forward may be to simply turn off Copilot momentarily and code out the solution manually.

3. Refactor

TDD represented as a three-part wheel with the 'Refactor' portion highlighted on the bottom third

Refactoring in TDD means making incremental changes that improve the maintainability and extensibility of the codebase, all performed while preserving behavior (and a working codebase).

For this, we’ve found Copilot’s ability limited. Consider two scenarios:

  1. “I know the refactor move I want to try”: IDE refactor shortcuts and features such as multi-cursor select get us where we want to go faster than Copilot.
  2. “I don’t know which refactor move to take”: Copilot code completion cannot guide us through a refactor. However, Copilot Chat can make code improvement suggestions right in the IDE. We have started exploring that feature, and see the promise for making useful suggestions in a small, localized scope. But we have not had much success yet for larger-scale refactoring suggestions (i.e. beyond a single method/function).

Sometimes we know the refactor move but we don’t know the syntax needed to carry it out. For example, creating a test mock that would allow us to inject a dependency. For these situations, Copilot can help provide an in-line answer when prompted via a code comment. This saves us from context-switching to documentation or web search.
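
For example, a comment-driven prompt along these lines (the module name and return shape are hypothetical) is usually enough to get the mocking syntax we were missing:

// mock the payment gateway module so that charge() resolves to a successful result
jest.mock("./paymentGateway", () => ({
    charge: jest.fn().mockResolvedValue({ status: "ok" }),
}));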

Conclusion

The common saying “garbage in, garbage out” applies to Data Engineering as well as to Generative AI and LLMs. Stated differently: higher-quality inputs allow the capabilities of LLMs to be better leveraged. In our case, TDD maintains a high level of code quality. This high-quality input leads to better Copilot performance than is otherwise possible.

We therefore recommend using Copilot with TDD, and we hope that you find the above tips helpful for doing so.

Thanks to the “Ensembling with Copilot” team started at Thoughtworks Canada; they are the primary source of the findings covered in this memo: Om, Vivian, Nenad, Rishi, Zack, Eren, Janice, Yada, Geet, and Matthew.


How is GenAI different from other code generators? (19 September 2023)

At the beginning of my career, I worked a lot in the space of Model-Driven Development (MDD). We would come up with a modeling language to represent our domain or application, and then describe our requirements with that language, either graphically or textually (customized UML, or DSLs). Then we would build code generators to translate those models into code, and leave designated areas in the code that would be implemented and customized by developers.

That style of code generation never quite took off though, except for some areas of embedded development. I think that’s because it sits at an awkward level of abstraction that in most cases doesn’t deliver a better cost-benefit ratio than other levels of abstraction, like frameworks or platforms.

What’s different about code generation with GenAI?

One of the key decisions we continuously take in our software engineering work is choosing the right abstraction levels to strike a good balance between implementation effort and the level of customizability and control we need for our use case. As an industry, we keep trying to raise the abstraction level to reduce implementation efforts and become more efficient. But there is a kind of invisible force field for that, limited by the level of control we need. Take the example of Low Code platforms: They raise the abstraction level and reduce development efforts, but as a result are most suitable for certain types of simple and straightforward applications. As soon as we need to do something more custom and complex, we hit the force field and have to take the abstraction level down again.

An illustration of typical abstraction levels in software, starting from 0s and 1s at the bottom, going up via programming languages, frameworks, platforms, and low code applications. A squiggly line at the top marks the "Invisible force field for customizability and extensibility".

GenAI unlocks a whole new area of potential because it is not another attempt at smashing that force field. Instead, it can make us humans more effective on all the abstraction levels, without having to formally define structured languages and translators like compilers or code generators.

An illustration of typical abstraction levels in software, starting from 0s and 1s at the bottom, going up via programming languages, frameworks, platforms, and low code applications. A vertical box on the right marked "Natural Language" has arrows going into all of the abstraction levels, indicating that GenAI can help create all of those levels.

The higher up the abstraction level we go to apply GenAI, the lower the overall effort becomes to build a piece of software. To go back to the Low Code example, there are some impressive examples in that space which show how you can build full applications with just a few prompts. This comes with the same limitations of the Low Code abstraction level though, in terms of the use cases you can cover. If your use case hits that force field, and you need more control - you’ll have to go back to a lower abstraction level, and also back to smaller promptable units.

Do we need to rethink our abstraction levels?

One approach I take when I speculate about the potential of GenAI for software engineering is to think about the distance in abstraction between our natural language prompts, and our target abstraction levels. Google’s AppSheet demo that I linked above uses a very high level prompt (“I need to create an app that will help my team track travel requests […] fill a form […] requests should be sent to managers […]”) to create a functioning Low Code application. How many target levels down could we push with a prompt like that to get the same results, e.g. with Spring and React framework code? Or, how much more detailed (and less abstract) would the prompt have to be to achieve the same result in Spring and React?

If we want to better leverage GenAI’s potential for software engineering, maybe we need to rethink our conventional abstraction levels altogether, to build more “promptable” distances for GenAI to bridge.

Thanks to John Hearn, John King, Kevin Bralten, Mike Mason and Paul Sobocinski for their insightful review comments on this memo


How to tackle unreliability of coding assistants (29 November 2023)

One of the trade-offs to the usefulness of coding assistants is their unreliability. The underlying models are quite generic and based on a huge amount of training data, relevant and irrelevant to the task at hand. Also, Large Language Models make things up, they “hallucinate” as it’s commonly called. (Side note: There is a lot of discourse about the term “hallucination”, about how it is not actually the right psychology metaphor to describe this, but also about using psychology terms in the first place, as it anthropomorphizes the models.)

That unreliability creates two main risks: It can affect the quality of my code negatively, and it can waste my time. Given these risks, quickly and effectively assessing my confidence in the coding assistant’s input is crucial.

How I determine my confidence in the assistant’s input

The following are some of the questions that typically go through my head when I try to gauge the reliability and risk of using a suggestion. This applies to “auto complete” suggestions while typing code as well as to answers from the chat.

Do I have a quick feedback loop?

The quicker I can find out if the answer or the generated information works, the lower the risk that the assistant is wasting my time.

  • Can my IDE help me with the feedback loop? Do I have syntax highlighting, compiler or transpiler integration, linting plugins?
  • Do I have a test, or a quick way to run the suggested code manually? In one case, I was using the coding assistant chat to help me research how to best display a collapsible JSON data structure in an HTML page. The chat told me about an HTML element I had never heard about, so I was not sure if it existed. But it was easy enough to put it into an HTML file and load that in the browser, to confirm. To give a counterexample, the feedback loop for verifying a piece of infrastructure code I have never heard about is usually a lot longer.

Do I have a reliable feedback loop?

As well as the speed of the feedback loop for the AI input, I also reflect on the reliability of that feedback loop.

  • If I have a test, how confident am I in that test?
  • Did I write the test myself, or did I also generate it with the AI assistant?
  • If the AI generated the test(s), how confident am I in my ability to review the efficacy of those tests? If the functionality I’m writing is relatively simple and routine, and in a language I’m familiar with, then I’m of course a lot more confident than with a more complex or larger piece of functionality.
  • Am I pairing with somebody while using the assistant? They will give additional input and review for the AI input, and increase my confidence.
  • If I’m unsure of my test coverage, I can even use the assistant itself to raise my confidence, and ask it for more edge cases to test. This is how I could have found the crucial missing test scenario for the median function I described in a previous memo.

What is the margin of error?

I also reflect on what my margin of error is for what I’m doing. The lower the margin for error, the more critical I will be of the AI input.

  • When I’m introducing a new pattern, I consider that a larger blast radius for the overall design of the codebase. Other developers on the team will pick up that pattern, and the coding assistant will reproduce that pattern across the team as well, once it is in the code. For example, I have noticed that in CSS, GitHub Copilot suggests flexbox layout to me a lot. Choosing a layout approach is a big decision though, so I would want to consult with a frontend expert and other members of my team before I use this.
  • Anything related to security has of course a low margin of error. For example, I was working on a web application and needed to set a “Content-Security-Policy” header. I didn’t know anything about this particular header, and I first asked Copilot chat. But because of the subject matter, I did not want to rely on its answer, and instead went to a trusted source of security information on the internet.
  • How long-lived will this code be? If I’m working on a prototype, or a throwaway piece of code, I’m more likely to use the AI input without much questioning than if I’m working on a production system.

Do I need very recent information?

The more recent and the more specific (e.g. to a version of a framework) I need the answer to be, the higher the risk that it is wrong, because the probability is higher that the information I’m looking for is not available or not distinguishable to the AI. For this assessment it’s also good to know if the AI tool at hand has access to more information than just the training data. If I’m using a chat, I want to be aware if it has the ability to take online searches into account, or if it is limited to the training data.

Give the assistant a timebox

To mitigate the risk of wasting my time, one approach I take is to give the assistant a kind of ultimatum: if the suggestion doesn’t bring me value with little additional effort, I move on. If an input is not helping me quickly enough, I assume the worst about the assistant, rather than giving it the benefit of the doubt and spending 20 more minutes on making it work.

The example that comes to mind is when I was using an AI chat to help me generate a mermaid.js class diagram. I’m not very familiar with the mermaid.js syntax, and I kept trying to make the suggestion work, and thought I had maybe included it in my markdown file in the wrong way. Turns out, the syntax was totally wrong, which I found out when I finally went to the online documentation after 10 minutes or so.

Make up a persona for the assistant

When preparing this memo, I started wondering if making up a persona for the assistant could help with how to use it responsibly, and with as little waste of time as possible. Maybe anthropomorphizing the AI could actually help in this case?

Thinking about the types of unreliabilities, I’d imagine the AI persona with these traits:

  • eager to help
  • stubborn
  • very well-read, but inexperienced (for Dungeons and Dragons fans: high intelligence, low wisdom)
  • won’t admit when it doesn’t “know” something

I tried a few prompts with an image generator, asking it for variations of eager beavers and stubborn donkeys. Here’s the one I liked the best (“eager stubborn donkey happy books computer; cartoon, vector based, flat areas of color” in Midjourney):

a cartoonish picture of a very excited donkey among a stack of books

You could even come up with a fun name for your persona, and talk about it on the team. “Dusty was an annoying know-it-all during that session, we had to turn them off for a bit”, or “I’m glad Dusty was there, I got that task done before lunch”. But the one thing you should never say is “Dusty caused that incident!”, because Dusty is basically underage, they don’t have a license to commit. We are kind of the parents who are ultimately responsible for the commits, and “parents are liable for their children”.

Conclusion

The list of situation assessments might seem like a lot to apply every single time you’re using a coding assistant. But I believe we’re all going to get better at it the more we use these tools. We make quick assessments with multiple dimensions like this all the time when we are coding, based on our experience. I’ve found that I’ve gotten better at deciding when to use and trust the assistant the more often I have run into the situations mentioned above - the more I touch the hot stove, so to say.

You might also think, “If the AI assistants are unreliable, then why would I use them in the first place?”. There is a mindset shift we have to make when using Generative AI tools in general. We cannot use them with the same expectations we have for “regular” software. GitHub Copilot is not a traditional code generator that gives you 100% of what you need. But in 40-60% of situations, it can get you 40-80% of the way there, which is still useful. When you adjust these expectations, and give yourself some time to understand the behaviours and quirks of the eager donkey, you’ll get more out of AI coding assistants.

Thanks to Brandon Cook, Jörn Dinkla, Paul Sobocinski and Ryder Dain for their feedback and input.

This memo was written with GitHub Copilot active, in markdown files. It helps with ideas and turns of phrase, and sometimes when I’m stuck, but suggestions very rarely end up as they were in the final text. I use ChatGPT as a thesaurus, and to find a good name for a donkey.


Onboarding to a 'legacy' codebase with the help of AI (15 August 2024)

One of the promising applications of GenAI in software delivery is about how it can help us understand code better, in particular large codebases, or multiple codebases that form one big application. This is especially interesting for older codebases that are getting hard to maintain (“legacy”), or to improve onboarding in teams that have a lot of fluctuation.

To get an idea for how well this works and what the potential is, I picked an issue of the open source project Bahmni and tried to understand the issue and what needs to be done with the help of AI. Bahmni is built on top of OpenMRS, which has been around for a very long time. OpenMRS and Bahmni are good examples of very large codebases with a lot of tech debt, representing a lot of different styles and technologies over time.

Spoiler alert: I did not actually figure out how to solve the ticket! But the journey gave me a bunch of observations about what AI could and could not help with in such a use case.

The ticket

Organisation name is not fetched from parent location for few Hi-Types

“When the visit location has a target location tagged as Organization then the parent location’s name should be used for Organisation resource in FHIR bundle. This works only for Immunization, Health Document Record and Wellness Record Hi-types. For others the visit location is only used.”

The codebase(s)

OpenMRS and Bahmni have many, many repositories. As I did not have access to a tool that lets me ask questions across all the repositories I would have needed, I cheated and looked at the pull request already attached to the ticket to identify the relevant codebase, openmrs-module-hip.

The tools

I used a bunch of different AI tools in this journey:

  1. Simple Retrieval-Augmented Generation (RAG) over a vectorised version of the full Bahmni Wiki. I’ll refer to this as Wiki-RAG-Bot.

  2. An AI-powered code understanding product called Bloop. It’s one of many products in the market that focus on using AI to understand and ask questions about large codebases.

  3. GitHub Copilot’s chat in VS Code, where one can ask questions about the currently open codebase in chat queries via @workspace.

Understanding the domain

First, I wanted to understand the domain terms used in the ticket that I was unfamiliar with.

  • Wiki-RAG-Bot: Both for “What is a Hi-type?” (Health Information Type) and “What is FHIR?” (Fast Healthcare Interoperability Resource) I got relevant definitions from the AI.

  • The wiki search directly, to see if I could have found it just as well there: I did just as quickly find that “HI type” is “Health Information Type”. However, finding a relevant definition for FHIR was much trickier in the Wiki, because the term is referenced all over the place, so it gave me lots of results that only referenced the acronym, but did not have the actual definition.

  • Wiki-RAG-Bot with the full ticket: In this attempt I asked more broadly, “Explain to me the Bahmni and healthcare terminology in the following ticket: …”. It gave me an answer that was a bit verbose and repetitive, but overall helpful. It put the ticket in context, and explained it once more. It also mentioned that the relevant functionality is “done through the Bahmni HIP plugin module”, a clue to where the relevant code is.

  • ChatGPT: Just to see which of these explanations could also have come from a model’s training data, I also asked ChatGPT about these 2 terms. It does know what FHIR is, but failed on “HI type”, which is contextual to Bahmni.

Understanding more of the ticket from the code

The ticket says that the functionality currently “only works for Immunization, Health Document Record and Wellness Record Hi-types”, and the ticket is about improving the location tagging for other Hi-types as well. So I wanted to know: What are those “other Hi-types”?

  • Bloop: Pointed me to a few of the other Hi-types (“these classes seem to handle…”), but wasn’t definitive if those are really all the possible types

  • GH Copilot: Pointed me to an enum called HiTypeDocumentKind which seems to be exactly the right place, listing all possible values for Hi-Type

  • Ctrl+F in the IDE: I searched for the string “hitype”, which wasn’t actually that broadly used in the code, and the vast majority of the results also pointed me to HiTypeDocumentKind. So I could have also found this with a simple code search.

Finding the relevant implementation in the code

Next, I wanted to find out where the functionality that needs to be changed for this ticket is implemented. I fed the JIRA ticket text into the tools and asked them: “Help me find the code relevant to this feature - where are we setting the organization resource in the FHIR bundle?”

Both GH Copilot and Bloop gave me similar lists of files. When I compared them with the files changed by the pull request, I only found one file in all 3 lists, FhirBundledPrescriptionBuilder, which turned out to be one of the core classes to look at for this. While the other classes listed by AI were not changed by the pull request, they were all dependencies of this FhirBundledPrescriptionBuilder class, so the tools generally pointed me to the right cluster of code.

Understanding how to reproduce the status quo

Now that I had what seemed to be the right place in the code, I wanted to reproduce the behaviour that needed to be enhanced as part of the ticket.

My biggest problem at this point was that most options to reproduce the behaviour of course include some form of “running the application”. However, in a legacy setup like this, that is easier said than done. Often these applications are run with outdated stacks (here: Java 8) and tools (here: Vagrant). Also, I needed to understand the wider ecosystem of Bahmni, and how all the different components work together. I did ask all three of my tools, “How do I run this application?”. But the lists of steps they suggested were extensive, so I had a long feedback loop in front of me, combined with very low confidence that the AI suggestions were correct or at least useful. For GH Copilot and Bloop, which only had access to the codebase, I suspected that they made up quite a bit of their suggestions, and the list of actions looked very generic. The Wiki-RAG-Bot was at least based on the official Bahmni documentation, but even here I couldn’t be sure if the bot was only basing its answer on the most current run book, or if there was also information from outdated wiki pages that it might indiscriminately reproduce.

I briefly started following some of the steps, but then decided to not go further down this rabbit hole.

Writing a test

I did manage to compile the application and run the tests though! (After about 1 hour of fiddling with Maven, which the AI tools could not help me with.)

Unfortunately, there was no existing test class for FhirBundledPrescriptionBuilder. This was a bad sign because it often means that the implementation is not easy to unit-test. However, it’s quite a common situation in “legacy” codebases. So I asked my tools to help me generate a test.

Both GH Copilot and Bloop gave me test code suggestions that were not viable. They made extensive use of mocking, and were mocking parts of the code that should not be mocked for the test, e.g. the input data object for the function under test. So I asked the AI tools to not mock the input object, and instead set up reasonable test data for it. The challenge with that was that the input argument, OpenMrsPrescription, is the root of quite a deep object hierarchy, that includes object types from OpenMRS libraries that the AI did not even have access to. E.g., OpenMrsPrescription contains org.openmrs.Encounter, which contains org.openmrs.Patient, etc. The test data setup suggested by the AI only went one level deep, so when I tried to use it, I kept running into NullPointerExceptions because of missing values.

This is where I stopped my experiment.

Learnings

  • For the use case of onboarding to an unknown application, it is crucial that the tools have the ability to automatically determine the relevant repositories and files. Extra bonus points for AI awareness of dependency code and/or documentation. In Bloop, I often had to first put files into the context myself to get helpful answers, which kind of defeats the purpose of understanding an unknown codebase. And in a setup like Bahmni that has a LOT of repositories, for a newbie it’s important to have a tool that can answer questions across all of them, and point me to the right repo. So this automatic context orchestration is a feature to watch out for in these tools.

  • While the results of the “where is this in the code?” questions were usually not 100% accurate, they did always point me in a generally useful direction. So it remains to be seen in real life usage of these tools: Is this significantly better than Ctrl+F text search? In this case I think it was, I wouldn’t have known where to start with a generic string like “organization”.

  • For older applications and stacks, development environment setup is usually a big challenge in onboarding. AI cannot magically replace a well-documented and well-automated setup. Outdated or non-existing documentation, as well as obscure combinations of outdated runtimes and tools will stump AI as much as any human.

  • The ability of AI to generate unit tests for existing code that doesn’t have unit tests yet all depends on the quality and design of that code. And in my experience, a lack of unit tests often correlates with low modularity and cohesion, i.e. sprawling and entangled code like I encountered in this case. So I suspect that in most cases, the hope to use AI to add unit tests to a codebase that doesn’t have unit tests yet will remain a pipe dream.


Latest Memo: Building an AI agent application to migrate a tech stack

20 August 2024

Tech stack migrations are a promising use case for AI support. To understand the potential better, and to learn more about agent implementations, I used Microsoft’s open source tool autogen to try and migrate an Enzyme test to the React Testing Library.

What does “agent” mean?

Agents in this context are applications that use a Large Language Model and do not just display the model’s responses to the user, but also take actions autonomously, based on what the LLM tells them. There is also the hype term “multi agent”, which I think can mean anything from “I have multiple actions available in my application, and I call each of those actions an agent”, to “I have multiple applications with access to LLMs and they all interact with each other”. In this example, I have my application acting autonomously on behalf of an LLM, and I do have multiple actions - I guess that means I can say that this is “multi agent”?!
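
To make that a bit more concrete, here is a minimal, conceptual sketch of such an agent loop (not Autogen code; ask_llm and execute are hypothetical helpers standing in for the model call and the action execution):

# Conceptual agent loop: the application repeatedly asks the LLM what to do next,
# executes the suggested action, and feeds the result back into the conversation.
# ask_llm and execute are hypothetical helpers, not a real library API.
messages = [{"role": "user", "content": "Migrate this Enzyme test to RTL: <path>"}]
while True:
    action = ask_llm(messages)        # model picks the next action, e.g. see_file, modify_code, run_tests
    if action.name == "TERMINATE":    # model decides it is done
        break
    result = execute(action)          # the application runs the action autonomously
    messages.append({"role": "tool", "content": result})  # the result goes back into the context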

The goal

It’s quite common at the moment for teams to migrate their React component tests from Enzyme to the React Testing Library (RTL), as Enzyme is not actively maintained anymore, and RTL has been deemed the superior testing framework. I know of at least one team in Thoughtworks who has tried to use AI to help with that migration (unsuccessfully), and the Slack team have also published an interesting article about this problem. So I thought this was a nice, relevant use case for my experiment: I looked up some documentation about how to migrate and picked a repository with Enzyme tests.

I first migrated a small and simple test for the EncounterHistory component myself, to understand what success looks like. The following were my manual steps; they are roughly what I want the AI to do. You don’t have to understand exactly what each step means, but they should give you an idea of the types of changes needed, and you’ll recognise them again later when I describe what the AI did:

  • I added RTL imports to the test file
  • I replaced Enzyme’s mount() with RTL’s render(), and used RTL’s global screen object in the assertion, instead of Enzyme’s mount() return value
  • The test failed -> I realized I needed to change the selector function, as selecting by a “div” selector doesn’t exist in RTL
  • I added a data-testid to the component code, and changed the test code to use that with screen.getByTestId()
  • Tests passed!

How does Autogen work?

In Autogen, you can define a bunch of Agents, and then put them in a GroupChat together. Autogen’s GroupChatManager will manage the conversation between the agents, e.g. decide who should get to “speak” next. One of the members of the group chat is usually the UserProxyAgent, which basically represents me, the developer who wants assistance. I can implement a bunch of functions in my code that can be registered with the agents as available tools: I can say that I want to allow my user proxy to execute a function, and that I want to make the AssistantAgents aware of these functions, so they can tell the UserProxyAgent to execute them as needed. Each function needs annotations that describe in natural language what it does, so that the LLM can decide if it is relevant.

For example, here is the function I wrote to run the tests (where engineer is the name of my AssistantAgent):

import subprocess
from typing import Annotated, Tuple

# user_proxy (UserProxyAgent), engineer (AssistantAgent) and work_dir are defined earlier in the script
@user_proxy.register_for_execution()
@engineer.register_for_llm(description="Run the tests")
def run_tests(test_file_path: Annotated[str, "Relative path to the test file to run"]) -> Tuple[int, str]:
    # Run Jest and have it write the results to a JSON file in the working directory
    output_file = "jest-output-for-agent.json"
    subprocess.run(
        ["./node_modules/.bin/jest", "--outputFile=" + output_file, "--json", test_file_path],
        cwd=work_dir
    )

    # Return the raw Jest JSON output, so the LLM can reason about the test results
    with open(work_dir + "/" + output_file, "r") as file:
        test_results = file.read()

    return 0, test_results

Implementation

Using this Autogen documentation example as a starting point, I created a GroupChat with 2 agents, one AssistantAgent called engineer, and a UserProxyAgent. I implemented and registered 3 tool functions: see_file, modify_code, run_tests. Then I started the group chat with a prompt that describes some of the basics of how to migrate an Enzyme test, based on my experience doing it manually. (You can find a link to the full code example at the end of the article.)
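
For orientation, this is roughly what that wiring looks like in Autogen (a simplified sketch, not the exact code from my experiment; the model configuration and the prompt are abbreviated placeholders):

import autogen

# Placeholder model configuration - the real one depends on your API access
llm_config = {"config_list": [{"model": "gpt-4"}]}

engineer = autogen.AssistantAgent(name="engineer", llm_config=llm_config)
user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",       # run autonomously, no human input during the chat
    code_execution_config=False,    # actions happen via the registered tool functions instead
)

# see_file, modify_code and run_tests get registered here, with the
# @user_proxy.register_for_execution() / @engineer.register_for_llm(...) decorators shown above

groupchat = autogen.GroupChat(agents=[user_proxy, engineer], messages=[], max_round=30)
manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)

user_proxy.initiate_chat(
    manager,
    message="You will be getting paths to a JavaScript Enzyme test file, [...], here are instructions how to migrate: [...]",
)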

Did it work?!

It worked - at least once… But it also failed a bunch of times, more often than it worked. In one of the first runs that worked, the model basically went through the same steps that I went through when doing this manually - maybe not surprisingly, because those steps were the basis of my prompt instructions.

How does it work?

This experiment helped me understand a lot better how “function calling” works, which is a key LLM capability to make agents work. It’s basically an LLM’s ability to accept function (aka tool) descriptions in a request, pick relevant functions based on the user’s prompt, and ask the application to give it the output of those function calls. For this to work, function calling needs to be implemented in the model’s API, but the model itself also needs to be good at “reasoning” about picking the relevant tools - apparently some models are better at doing this than others.
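
Outside of Autogen, this is roughly what function calling looks like against an OpenAI-style chat completions API; the tool description below is a simplified, hypothetical version of my run_tests function:

from openai import OpenAI
import json

client = OpenAI()

# Tool descriptions are sent along with the prompt, as a JSON schema
tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the tests",
        "parameters": {
            "type": "object",
            "properties": {
                "test_file_path": {
                    "type": "string",
                    "description": "Relative path to the test file to run",
                },
            },
            "required": ["test_file_path"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "<the migration prompt>"}],
    tools=tools,
)

# If the model considers a tool relevant, it responds with a tool call instead of
# a normal text answer; the application then executes the function and sends the
# result back to the model in a follow-up request.
tool_call = response.choices[0].message.tool_calls[0]
print(tool_call.function.name)                   # e.g. "run_tests"
print(json.loads(tool_call.function.arguments))  # e.g. {"test_file_path": "..."}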

I traced the requests and responses to get a better idea of what’s happening. Here is a visualisation of that:

[Figure: Overview of the requests the agent was sending]

A few observations:

  • Note how the request gets bigger and bigger with each iteration. This is how LLMs work - with every new request, you need to give them the full context of what happened so far. The application that sends the requests to the LLM can of course truncate the history once it gets larger, but it then needs to find good ways to summarise it without dropping important context, which is not trivial. For example, when you are having a conversation with ChatGPT, the requests to the model also get larger and larger as the conversation goes on. But the ChatGPT application starts truncating the context when it gets too large, before sending it to the GPT model itself. In my case here, I don’t have an optimisation like that, so my request just stacks up the information with each iteration. In fact, I ran into token limits a few times with this. (A naive version of such truncation is sketched after this list.)

  • There is a LOT of back and forth. The overview above shows the first 10 requests of a failed run; the test was still failing at the end of that 10th request. Note how every time after a code modification, the application sends another request to confirm that this was done, to then get the next instruction from the LLM. I could have probably saved some back and forth by including the code contents directly in the initial prompt, but it was interesting to see this see_file action play out. The successful run that I had took 10 requests in total:

Iteration | Request (user_proxy: UserProxyAgent) | Response (engineer: AssistantAgent)
1 | Start with prompt (“You will be getting paths to a JavaScript Enzyme test file, […], here are instructions how to migrate: […]”) | see_file for component code
2 | Add content of component file | see_file for test code
3 | Add content of test file | modify_code: add imports, use RTL’s render function and screen object
4 | Confirm modification | run_tests
5 | Add test results (“Unable to find an accessible element with the role “div””) | modify_code: “We must add a test id to the component”
6 | Confirm modification | modify_code: Use getTestById in test
7 | Confirm modification | run_tests
8 | Add test results (“expect(…).toBeInTheDocument is not a function”) | modify_code: “Let’s try to solve this by importing @testing-library/jest-dom”
9 | Confirm modification | run_tests
10 | Add test results (numFailedTests = 0) | TERMINATE (“The tests ran successfully […]. The change is now completed.”)

  • In a lot of the runs that failed, the code changes created basic syntax problems with brackets and such, because they were e.g. deleting too much code. This usually totally stumped the AI, understandably. It could either be the AI’s fault, for giving unsuitable change diff instructions, or the fault of the modify_code function, which is maybe too simple. I wonder about the potential of functions that don’t just take textual code diffs from the model, but instead expose actual refactoring functionality of the IDE.
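
To illustrate the truncation point from further above: here is a naive sketch of how an application could keep the request size in check, by keeping the initial instructions and only the most recent messages (a simplistic illustration with an assumed message-count limit; a real implementation would count tokens and summarise instead of just dropping messages):

# Naive history truncation: keep the initial prompt with the migration instructions
# and the most recent messages, drop everything in between. MAX_RECENT is an
# assumed limit; a real implementation would measure tokens and summarise.
MAX_RECENT = 10

def truncate_history(messages):
    if len(messages) <= MAX_RECENT + 1:
        return messages
    # messages[0] is the initial prompt
    return [messages[0]] + messages[-MAX_RECENT:]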

Conclusion

There is a lot of hype around developer agents right now, the most prominent examples being GitHub Copilot Workspace and Amazon Q’s developer agent, and you can find a bunch more on the SWE Bench website. But even hand-picked product demos often show examples where the AI’s solution doesn’t hold up to scrutiny. These agents still have quite a way to go until they can fulfill the promise of solving any kind of coding problem we throw at them. However, I do think it’s worth considering what the specific problem spaces are where agents can help us, instead of dismissing them altogether for not being the generic problem solvers they are misleadingly advertised to be. Tech stack migrations like the one described above seem like a great use case: Upgrades that are not quite straightforward enough to be solved by a mass refactoring tool, but that are also not a complete redesign that definitely requires a human. With the right optimisations and tool integrations, I can imagine useful agents in this space a lot sooner than the “solve any coding problem” moonshot.

Code gist on GitHub

Thanks to Vignesh Radhakrishnan, whose autogen experiments I used as a starting point.