Maintainability sensors for coding agents

In a recent article about harness engineering for coding agent users, I laid out a mental model for expanding a coding agent harness: a system of guides and sensors that increase the probability of good agent outputs and enable self-correction before issues reach human eyes. This article is a more practical follow-up where I walk through my experience with using sensors that help keep the codebase maintainable.

19 May 2026

Birgitta Böckeler

Birgitta is a Distinguished Engineer and AI-assisted delivery expert at Thoughtworks. She has over 20 years of experience as a software developer, architect and technical leader.

The application
- Overview of all sensors used
- Base harnesses and models
Static code analysis: Basic linting

There are multiple dimensions we usually want to achieve and monitor in our codebases: Functional correctness (works as intended), architectural fitness (is fast/secure/usable enough), and maintainability. I define maintainability here as making it easy and low risk to change the codebase over time - also known as “internal quality”. So I don't only want to be able to make changes quickly today, but also in the future. And I don't want to worry about introducing bugs or degradation of fitness every time I make a change - or have AI make a change. I usually see the first signs of cracks in the maintainability of an AI-generated codebase when the number of files changed for a small adjustment increases. Or when changes start breaking things that used to work.

Internal quality problems affect AI agents in much the same way as they affect human developers. An agent working in a tangled codebase might look in the wrong place for an existing implementation, create inconsistencies because it has not noticed a duplicate, or be forced to load more context than a task should require.

In this article, I describe my experimentation with various sensors that help us and AI reflect on the maintainability of a codebase, and what I learned from that.

The application

I'm working on an internal analytics dashboard for community managers that reads chat space activity, engagement, and demographic data from a combination of APIs and presents the data in a web frontend.

Figure 1: The example app: web UI, service layer, and external APIs.

The tech stack is TypeScript, Next.js, and React. The backend reads and joins data from the APIs. The application has been around for a while, but for the sake of these experiments I rebuilt it with AI from scratch.

There are hardly any guides (e.g. markdown files) for AI about code quality and maintainability in the repository, because I wanted to see how well the agent can do just by relying on sensor feedback.

Overview of all sensors used


Figure 2: Where sensors can run: during the initial coding session, in the pipeline, on a schedule, and in production.

This is an overview of the sensors I set up across the path to production.

During coding session

Sensors that run continuously alongside the agent to provide fast feedback.

Type checker (computational)
ESLint (computational)
Semgrep, SAST tool prescribed by our internal AppSec team (computational)
dependency-cruiser, runs structural rules to check internal module dependencies (computational)
Test suite results including test coverage (computational - though the test suite is generated by AI, therefore created in an inferential way)
Incremental mutation testing (computational)
GitLeaks, which runs as part of the pre-commit hook. I consider it a sensor as well, because it gives the agent feedback when it tries to commit (computational)

After integration - pipeline

The same computational sensors run again in CI. The in-session sensors give the agent early feedback during development. The CI pipeline confirms the result on clean infrastructure and after integration.

Repeatedly

Sensors that run on a slower cadence to detect drift that accumulates over time, rather than errors that occur in the moment.

A security review, prompt derived from our AppSec checklist for internal applications (inferential)
A data handling review, prompt describes things like “no user names should ever be sent to the web frontend” (inferential)
Dependency freshness report, which runs a script first to get the age and activity of the library dependencies, and then has AI create a report with recommendations about potential upgrades, deprecations, etc (computational and inferential)
Modularity and coupling review (computational and inferential)
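The computational half of a dependency freshness report can be quite small. The following is a minimal sketch, not the script used for this application: it assumes the release timestamps have already been fetched in the shape of the `time` field of an npm registry package document, and the function name `summarizeFreshness` and the three reported metrics are illustrative choices.

```typescript
// Subset of an npm registry package document: the `time` field maps
// "created", "modified", and each published version to an ISO timestamp.
type ReleaseTimes = Record<string, string>;

// Hypothetical metrics that an AI-written report could build on.
interface Freshness {
  installedAgeDays: number;
  daysSinceLatestRelease: number;
  releasesSinceInstalled: number;
}

const DAY_MS = 24 * 60 * 60 * 1000;

function summarizeFreshness(
  time: ReleaseTimes,
  installedVersion: string,
  now: Date,
): Freshness {
  // Drop the two bookkeeping keys and sort releases by publish time.
  const releases = Object.entries(time)
    .filter(([key]) => key !== "created" && key !== "modified")
    .map(([version, iso]) => ({ version, at: new Date(iso).getTime() }))
    .sort((a, b) => a.at - b.at);

  const current = releases.find((r) => r.version === installedVersion);
  if (!current) {
    throw new Error(`version ${installedVersion} not found in release times`);
  }
  const latest = releases[releases.length - 1];

  return {
    // Age of the version we depend on
    installedAgeDays: Math.floor((now.getTime() - current.at) / DAY_MS),
    // Rough activity signal: how long since anything was published
    daysSinceLatestRelease: Math.floor((now.getTime() - latest.at) / DAY_MS),
    // How far behind we are, counted in releases
    releasesSinceInstalled: releases.filter((r) => r.at > current.at).length,
  };
}
```

The numbers themselves are cheap and deterministic to compute; the inferential step then only has to interpret them and turn them into recommendations.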

With this context out of the way, let's dive into the first category of sensors.

Base harnesses and models

Throughout building the application, I used a mix of Cursor, Claude Code, and OpenCode (in that order of frequency). My default model was usually Claude Sonnet. For some of the planning and analysis tasks I used Claude Opus, and for implementation tasks I frequently used Cursor's composer-2 model.

Static code analysis: Basic linting

I'll start with my learnings from using ESLint in this application. Basic linting tools like ESLint mostly target maintainability risk at the level of individual files and functions.

Rules for typical AI shortcomings

In my experience, the AI failure modes that are the lowest-hanging fruit for static code analysis are:

Max number of arguments for functions
File length
Function length
Cyclomatic complexity

However, these weren't even active in ESLint's default preset, so I had to configure maximums for them first. Hopefully, static analysis tools will evolve to provide better presets for usage with AI. A bit of research shows that people are also starting to publish ESLint plugins with rule sets that specifically target known agent failure modes, like this one by Factory, with rules about things like requiring test files or structured logging.
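For illustration, the four limits above map onto ESLint's core rules `max-params`, `max-lines`, `max-lines-per-function`, and `complexity`. The fragment below is a sketch of a flat config entry that switches them on. All numeric limits are assumptions chosen for the example rather than the values used in this project, and the parser setup for TypeScript is omitted.

```typescript
// Sketch of an ESLint flat config entry. The limits are illustrative
// assumptions, not the thresholds used in the application described here.
const maintainabilityRules = {
  // Max number of arguments for functions
  "max-params": ["error", { max: 4 }],
  // File length
  "max-lines": ["error", { max: 300, skipBlankLines: true, skipComments: true }],
  // Function length
  "max-lines-per-function": ["error", { max: 60, skipBlankLines: true, skipComments: true }],
  // Cyclomatic complexity
  complexity: ["error", { max: 10 }],
} as const;

const config = [
  {
    files: ["**/*.ts", "**/*.tsx"],
    rules: maintainabilityRules,
  },
];

// In a real ESLint config file, this array would be the default export:
// export default config;
```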

Guidance for self-correction

A sensor is meant to give the agent feedback so that it can self-correct. Ideally, we want to give the agent extra context for that self-correction - a good kind of prompt injection. To do that, I built a custom ESLint formatter to override some of the default messages - with the help of AI, of course.

Here is an example of my guidance for the no-explicit-any warning.

We want things to be typed to make it easier to avoid errors, especially for key concepts.
But we also want to avoid cluttering our codebase with unnecessary types. Make a judgment
call about this. If you choose to not introduce a type, suppress it with:
// eslint-disable-next-line @typescript-eslint/no-explicit-any -- (give reason why)
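A formatter of this kind can be sketched in a few lines. ESLint calls a formatter with an array of results, each carrying a `filePath` and a list of `messages`, and expects a string back. The version below is a minimal sketch and not the formatter used in this project: the local types only mirror the fields it reads, and the `GUIDANCE` table, the output layout, and the function name `formatWithGuidance` are hypothetical.

```typescript
// Minimal local types mirroring the fields of the results ESLint passes
// to a formatter, declared here so the sketch stays self-contained.
interface LintMessage {
  ruleId: string | null;
  severity: number; // 1 = warning, 2 = error
  message: string;
  line: number;
  column: number;
}

interface LintResult {
  filePath: string;
  messages: LintMessage[];
}

// Hypothetical guidance table: rule id -> extra context for the agent.
const GUIDANCE: Record<string, string> = {
  "@typescript-eslint/no-explicit-any":
    "Type key concepts, but avoid cluttering the codebase with unnecessary types. " +
    "Make a judgment call; if you keep `any`, suppress the warning and give a reason.",
};

function formatWithGuidance(results: LintResult[]): string {
  const lines: string[] = [];
  for (const result of results) {
    for (const m of result.messages) {
      const level = m.severity === 2 ? "error" : "warning";
      const rule = m.ruleId ?? "unknown";
      lines.push(`${result.filePath}:${m.line}:${m.column} ${level} ${rule}: ${m.message}`);
      // Append guidance only for rules that have an entry in the table.
      const guidance = m.ruleId ? GUIDANCE[m.ruleId] : undefined;
      if (guidance) {
        lines.push(`  guidance: ${guidance}`);
      }
    }
  }
  return lines.join("\n");
}
```

ESLint's `--format` option accepts a path to a custom formatter module that exports such a function, so the agent sees the guidance directly in the lint output it already reads.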

Managing warnings - now more feasible?

Static code analysis has been around for a long time, and yet, teams often didn't use it consistently, even when they had it set up. One of the reasons for that is the management overhead that comes with it. Effective use of this analysis requires a team to keep a “clean house”, otherwise the metrics just become noise. In particular warnings like the no-explicit-any example above are tricky, because you don't always want to fix them - it depends. And suppressing them one by one has always felt tedious, and like noise in the code.

With coding agents, we might now have a chance at that clean baseline. In the guidance text above, the agent is told to make a judgment call, and allowed to suppress a warning in the code. This keeps the suppressions manageable, visible and reviewable.

For thresholds, like the maximum number of lines, or the maximum allowed cyclomatic complexity, I told the agent in the lint message that it may slightly increase the thresholds if it thinks that a refactoring is unnecessary or impossible in a particular case. This doesn't suppress the threshold forever, just increases it, so that the rule fires again if it gets even worse in the future. Constraints are preserved without forcing a binary suppress-or-comply choice.
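How a raised threshold is recorded is a design choice. One option is an inline configuration comment in the affected file, such as `/* eslint complexity: ["error", 12] */`. Another is a per-file override in the flat config, where later entries take precedence over earlier ones for matching files. The sketch below takes the second route. The base limit, the file name, and the helper names are all hypothetical, and it enforces one extra assumption of mine: an exception may only raise the limit, and it must carry a reason.

```typescript
// Hypothetical record of a raised threshold, kept visible for review.
interface ThresholdException {
  file: string;
  max: number;
  reason: string;
}

interface ComplexityEntry {
  files: string[];
  rules: { complexity: ["error", { max: number }] };
}

function entry(files: string[], max: number): ComplexityEntry {
  return { files, rules: { complexity: ["error", { max }] } };
}

// Builds flat config entries: the base limit first, then one override per
// exception. Later entries win for the files they match.
function complexityConfig(
  baseMax: number,
  exceptions: ThresholdException[],
): ComplexityEntry[] {
  for (const e of exceptions) {
    if (e.max <= baseMax) {
      throw new Error(`${e.file}: exception must raise the limit above ${baseMax}`);
    }
    if (e.reason.trim() === "") {
      throw new Error(`${e.file}: exception needs a reason`);
    }
  }
  return [
    entry(["**/*.ts", "**/*.tsx"], baseMax),
    ...exceptions.map((e) => entry([e.file], e.max)),
  ];
}

// Example with an assumed base limit of 10 and a made-up file name.
const complexityEntries = complexityConfig(10, [
  {
    file: "src/services/joinActivity.ts",
    max: 12,
    reason: "Join logic reads better as one function",
  },
]);
```

Because the override still sets a limit, the rule fires again as soon as the file grows past 12, and the list of exceptions doubles as a review checklist.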

Observations

Looking at the exceptions AI created (suppressed warnings, increased thresholds) was a good point to start my code review.
AI frequently decided to increase the cyclomatic complexity threshold, but suggested good refactorings when I nudged it further. It was the only category where it did that, and I later discovered that I didn't have self-correction guidance in place for this rule, so there was no explicit instruction saying that a threshold increase should be the absolute exception. This is an indicator that the custom lint messages can indeed make quite a difference.
Sometimes I want to treat rules differently in different parts of the code. Let's take no-console, telling AI off when it uses console.log. In the backend, I want it to use a logger component instead. In the frontend, I might want to not use direct logging at all, or at the very least I need to use a different logging component. This is another example of the power of the self-correction guidance, and where AI can help with semantic judgment and management of analysis warnings.
I was watching out for examples of trade-offs between rules. The only one I've seen so far was created by the max-lines and max-lines-per-function rules. I've seen AI do quite a bit of useful refactoring and breakdown into smaller functions and components as a result of this sensor feedback. However, in the React frontend, I'm seeing a worrying trend of components with lots and lots of properties as a result of passing values through a growing chain of smaller and smaller components. I haven't got useful observations yet about how good AI might be at making consistent decisions between tradeoffs like that.
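The no-console observation above, where the same rule needs different advice in the backend and the frontend, can be handled by selecting the guidance text based on the file path. This is a minimal sketch: the directory names `src/server/` and `src/components/` and the wording of the guidance are assumptions, not the layout of the actual application.

```typescript
// Hypothetical mapping from a path fragment to guidance for no-console.
interface ScopedGuidance {
  pathFragment: string;
  text: string;
}

const NO_CONSOLE_GUIDANCE: ScopedGuidance[] = [
  {
    pathFragment: "src/server/",
    text: "Use the backend logger component instead of console.log.",
  },
  {
    pathFragment: "src/components/",
    text: "Avoid direct logging in frontend code. If logging is really needed, use the frontend logging component.",
  },
];

// Returns the guidance for the first matching path fragment, or undefined
// when the file falls outside the known areas of the codebase.
function noConsoleGuidance(filePath: string): string | undefined {
  // Normalize Windows separators so the fragments match on every platform.
  const normalized = filePath.replace(/\\/g, "/");
  return NO_CONSOLE_GUIDANCE.find((g) => normalized.includes(g.pathFragment))?.text;
}
```

A custom formatter could call a function like this for each no-console message and append the result to the default lint message.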

Main takeaways

Overall, I was positively surprised by how many things I can cover with static analysis. I had to remind myself multiple times why it has been somewhat underused in the past, and what has changed: the cost-benefit balance. Cost is reduced because it's much cheaper to create custom scripts and rules with AI. And the benefit has increased: the analysis results give me a first sense of many hygiene issues that would rarely come up when I write code myself, so I can get common AI mistakes out of the way.

However, I can't help but wonder if this can also lead to a false sense of security and an illusion of quality. After all, another reason why linters like this have been less used in the past is that they have limits, and we have been wary of using them as a simplified indicator of quality. There are lots of more semantic aspects of quality that static analysis cannot catch, and it remains to be seen if AI can adequately fill that gap in partnership with those tools. I also discovered new supposed issues in the code every time I activated a new set of rules. It was always a mix of irrelevant things and things that actually matter. So I worry about feedback overload for the agent, sending it into a spiral of over-engineered refactorings.

In the next update to this article, I will share my experience of how static code analysis can help us and AI maintain good modularity inside a codebase.


Significant Revisions

19 May 2026: Published basic linting