RefinementCodeReview

28 January 2021

When people think of code reviews, they usually think in terms of an explicit step in a development team's workflow. These days the Pre-Integration Review, carried out on a Pull Request, is the most common mechanism for a code review, to the point that many people unwittingly assume that not using pull requests removes all opportunities for doing code review. Such a narrow view of code reviews doesn't just ignore a host of explicit mechanisms for review, it more importantly neglects probably the most powerful code review technique - that of perpetual refinement done by the entire team.

One of the most pervasive perspectives in software is the notion that it's something we build and complete - hence the endless metaphor of building construction and architecture. Yet the key property of software is that it is soft, and can be as easily modified after it's released as it was when initially composed in the programmer's editor. That's why Erik Dörnenburg wisely argues that architecture is a poor metaphor and would be better replaced by town planning. Valuable software is usually in a constant state of change, as we add features from a better understanding of the value it can bring. But the opportunity is not just to add new features, but also to refine that software - incorporating the lessons the team steadily learns about how best that software can enable these changes.

With the right environment, I can look at a bit of code written six months ago, see some problems with how it's written, and quickly fix them. This may be because the code was flawed when it was written, or because changes in the code base since then mean the code is no longer quite right. Whichever the cause, the important thing is to fix problems as soon as they start getting in our way. As soon as I have an understanding about the code that wasn't immediately apparent from reading it, I have the responsibility to (as Ward Cunningham so wonderfully said) take that understanding out of my head and put it into the code. That way the next reader won't have to work so hard.
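To make this concrete, here's a minimal sketch in Python (all names and rules are hypothetical, not taken from any real codebase): having dug out what an obscure condition actually means, I put that understanding into the code as a well-named function so the next reader doesn't have to repeat the digging.

from dataclasses import dataclass

@dataclass
class Order:
    status: int
    flagged: bool

def apply_charge(order: Order) -> None:
    print(f"charging order with status {order.status}")

# Before: the meaning of "status == 3" lives only in the reader's head.
def charge_order_before(order: Order) -> None:
    if order.status == 3 and not order.flagged:
        apply_charge(order)

# After: the understanding is captured in a well-named function.
def is_ready_for_billing(order: Order) -> bool:
    # (hypothetical rule) status 3 means the order has passed fraud screening
    return order.status == 3 and not order.flagged

def charge_order(order: Order) -> None:
    if is_ready_for_billing(order):
        apply_charge(order)

charge_order(Order(status=3, flagged=False))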

This process of refinement is exactly the same as what happens in a code review, but it's triggered each time the code is looked at rather than when the code is added to the codebase. This was, for me, a crucial insight. After all, many problems that code reviews seek to remedy are problems that only become problems when the code is read in the future. There's a strong argument for not worrying about them until then. After all, just as adding a large apartment complex changes traffic patterns, we may have altered the context of the code six months later, changing the kind of fix that code needs. It also involves more people: in this scheme every developer who reads the code is a reviewer, and one that's able to review based on their actual use of the code rather than on some general, but often hazily-justified, guidelines.

One way to assess the validity of a practice is to consider what happens if it's a monopoly. What if the only code review mechanism we have is the refinement done by later programmers? One consequence is that review attention gets concentrated on the areas of code that are read most often - which is mostly the code that ought to get the attention. One concern is that code that's never read will never get reviewed - but mostly that's fine. A team with good testing practices can be confident that the code works, and performance tests can identify performance issues. Given that, if the code never needs to be looked at again, we don't need to spend effort on making it comprehensible. I'd expect such cases to be vanishingly rare, but it's an informative thought experiment.

But most ≠ all. One obvious exception here is security issues. Code can work just fine for years until an attacker finds an exploit; at that point we'll lament its lack of review. This is an example of high-impact but rare safety concerns which deserve special scrutiny. However that doesn't mean we shouldn't make conscious use of refinement as a code review mechanism. Instead it means we should be aware of rare but high-impact concerns and adjust our workflow to watch for that kind of specific problem to the degree that it's needed in our circumstances. Threat analysis should alert us to the modules that need additional attention and the kinds of risks they face. Targeted code reviews might be scheduled for security concerns; these can run more effectively because they are focused on a specific kind of problem.

In order to do this perpetual code refinement we require other practices. If I'm going to change code I need to have confidence that it won't break existing functionality, so I need something like Self Testing Code. I need to know that it won't cause big merge conflicts for others, so I need Continuous Integration. We all need to be good at refactoring so we can change code effectively. Since this relies on many developers being expected to modify any part of the code base, we are best off with collective (or at least weak) code ownership. But given a team that has these skills, they can rely on using their regular refinement as a substantial part of their code review strategy.

If nothing else, I think it's important that we put more thought into the role of refinement as code review. One of the dangers of focusing solely on Pre-Integration Reviews is that it can lead teams to neglect how change works in a code base. If I have a pristine mainline, and ensure that every commit merged into that mainline is pristine - can I be sure that the codebase is still pristine after six months? I'd argue that I can't, because the changes mean a good decision about some code six months ago is no longer a good decision now. Refining the code allows us to evaluate old code against this changing usage, allowing us to sustain its health.

Acknowledgements

Ben Noble, Chris Ford, Evan Bottcher, Ian Cartwright, Jeremy Huiskamp, Ken Mugrage, Mario Giampietri, Martha Rohte, Omar Bashir, Peter Gillard-Moss, and Simon Brunning commented on drafts of this post on our internal mailing list.

PullRequest

28 January 2021

Pull Requests are a mechanism popularized by GitHub, used to help facilitate merging of work, particularly in the context of open-source projects. A contributor works on their contribution in a fork (clone) of the central repository. Once their contribution is finished they create a pull request to notify the owner of the central repository that their work is ready to be merged into the mainline. Tooling supports and encourages code review of the contribution before accepting the request. Pull requests have become widely used in software development, but critics are concerned by the addition of integration friction, which can prevent continuous integration.

Pull requests essentially provide convenient tooling for a development workflow that existed in many open-source projects, particularly those using a distributed source-control system (such as git). This workflow begins with a contributor creating a new logical branch, either by starting a new branch in the central repository, cloning into a personal repository, or both. The contributor then works on that branch, typically in the style of a Feature Branch, pulling any updates from Mainline into their branch. When the work is done, the contributor communicates with the maintainer of the central repository, indicating that it's ready to be merged, together with a reference to their commits. This reference could be the URL of a branch that needs to be integrated, or a set of patches in an email.

Once the maintainer gets the message, she can examine the commits to decide if they are ready to go into mainline. If not, she can suggest changes to the contributor, who then has the opportunity to adjust their submission. Once all is well, the maintainer can merge, either with a regular merge/rebase or by applying the patches from the final email.

GitHub's pull request mechanism makes this flow much easier. It keeps track of the clones through its fork mechanism, and automatically creates a message thread to discuss the pull request, together with behavior to handle the various steps in the review workflow. These conveniences were a major part of what made GitHub successful and led to "pull request" becoming a fundamental part of the developer's lexicon.

So that's how pull requests work, but should we use them, and if so how? To answer that question, I like to step back from the mechanism and think about how it works in the context of a source code management workflow. To help me think about that, I wrote down a series of patterns for managing source code branching. I find understanding these (specifically the Base and Integration patterns) clarifies the role of pull requests.

In terms of these patterns, pull requests are a mechanism designed to implement a combination of Feature Branching and Pre-Integration Reviews. Thus to assess the usefulness of pull requests we first need to consider how applicable those patterns are to our situation. Like most patterns, they are sometimes valuable, and sometimes a pain in the neck - we have to examine them based on our specific context. Feature Branching is a good way of packaging together a logical contribution so that it can be assessed, accepted, or deferred as a single unit. This makes a lot of sense when contributors are not trusted to commit directly to mainline. But Feature Branching comes at a cost, which is that it usually limits the frequency of integration, leading to complicated merges and deterring refactoring. Pre-Integration Reviews provide a clear place to do code review at the cost of a significant increase in integration friction. [1]

That's a drastic summary of the situation (I need a lot more words to explain this further in the feature branching article), but it boils down to the fact that the value of these patterns, and thus the value of pull requests, rests mostly on the social structure of the team. Some teams work better with pull requests, some teams would find pull requests a severe drag on their effectiveness. I suspect that since pull requests are so popular, a lot of teams are using them by default when they would do better without them.

While pull requests are built for Feature Branches, teams can use them within a Continuous Integration environment. To do this they need to ensure that pull requests are small enough, and the team responsive enough, to follow the CI rule of thumb that everybody does Mainline Integration at least daily. (And I should remind everyone that Mainline Integration is more than just merging the current mainline into the feature branch).

The wide usage of pull requests has encouraged a wider use of code review, since pull requests provide a clear point for Pre-Integration Review, together with tooling that encourages it. Code review is a Good Thing, but we must remember that a pull request isn't the only mechanism we can use for it. Many teams find great value in the continuous review afforded by Pair Programming. To avoid reducing integration frequency we can carry out post-integration code review in several ways. A formal process can record a review for each commit, or a tech lead can examine risky commits every couple of days. Perhaps the most powerful form of code review is one that's frequently ignored. A team that takes the attitude that the codebase is a fluid system, one that can be steadily refined with repeated iteration, carries out Refinement Code Review every time a developer looks at existing code. I often hear people say that pull requests are necessary because without them you can't do code reviews - that's rubbish. Pre-integration code review is just one way to do code reviews, and for many teams it isn't the best choice.

Acknowledgements

Chris Ford, Dan Mutton, Jeremy Huiskamp, Kief Morris, Pramod Sadalage, and Ryan Boucher commented on drafts of this post on our internal mailing list.

Notes

1: A colleague of mine recently calculated the time a client spent waiting for pull requests that had no comments (true of 91% of them). Total time waiting in 2020 for 7000 PRs was 130,000 hours. This figure included time elapsed over nights and weekends.
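(For scale, 130,000 hours across 7,000 pull requests works out to an average of a bit over 18 hours of waiting per request.)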


ComputationalNotebook

18 November 2020

A computational notebook is an environment for writing a prose document that allows the author to embed code which can be easily executed with the results also incorporated into the document. It's a platform particularly well-suited for data science work. Such environments include Jupyter Notebook, R Markdown, Mathematica, and Emacs's org-mode.

When I'm exploring some data, it's useful to keep my notes close together with the code that performs the exploration. I like to try some code, look at the results, and note down any observations I have from that execution. A computational notebook allows me to combine these together easily in a single document.

Here's an example of this, looking at some analysis of my Google Analytics data for martinfowler.com. I'm doing this in R Studio, which uses the R Markdown format.

The example output here is a graph, as notebooks are well suited for plotting various charts. But it's just as useful to embed various data manipulations in the code and display the data in the document as a table.
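As a rough illustration of the kind of embedded code block I'm describing, here's a hypothetical Python analogue (my actual example uses R Markdown and my Google Analytics data; the numbers below are invented purely so the cell runs on its own).

import pandas as pd
import matplotlib.pyplot as plt

# Invented page-view numbers, standing in for a real analytics export.
views = pd.DataFrame({
    "date": pd.date_range("2020-10-01", periods=30, freq="D"),
    "pageviews": [1200 + (i % 7) * 150 for i in range(30)],
})

# In a notebook, running this cell embeds the chart directly beneath it,
# next to whatever observations I note down about the traffic.
views.plot(x="date", y="pageviews", title="Daily page views (synthetic)")
plt.show()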

I first encountered a computational notebook in the late 1980s with Mathematica. I remember wishing I'd had access to such a tool during my university degree, but I didn't use a computational notebook again until recent years, with the rise of their use in data science circles. The notebook software I hear most about is Jupyter Notebook, which is popular in the Python community, but as I do my data munging with R I tend to use R Markdown, usually within R Studio. I also use a rather more niche notebook, org-mode, which is part of Emacs.

The code embedded in Mathematica is its own programming language, one designed for expressing mathematics. Although Jupyter began in the Python world, it supports a wide range of programming languages, as does R Markdown. Mathematica is a commercial tool, but Jupyter and R Markdown are open source. Jupyter stores its files in JSON, while R Markdown uses markdown files with some special markup for the code blocks. Using a text format for the documents allows them to be stored in regular version control tools, and using a markup language makes diffing easier. Using a markup language also allows the possibility of editing the documents in other editors, but they need to have a suitable environment for executing the code blocks.

Computational notebooks are useful when exploring a problem, such as trying various forms of analysis on a dataset. The document acts as a record of what's been tried and all the observations the researcher makes as they try things. By keeping the code and results together the writer can see exactly what they did and what results that generated. This coupling of code and results is a form of IllustrativeProgramming, making the environment appealing to lay programmers. One thing to be wary of, however, is if any external environmental factors change the result - such as the contents of a database. If the dataset isn't too large it can be exported and kept in the version control system, but often its size is prohibitive.

Notebooks are also useful for preparing reports, usually by generating a document in PDF, HTML, or other formats. If I want to report to an author on the traffic for their article, I take the last such report, change the subject URL, rerun all the code, and tweak any prose commentary I think is appropriate. If I were sufficiently motivated I could auto-generate such reports every few months. I like that such reports can easily include the code used to generate the results, so readers can accurately understand the logic behind the figures they see.

Notebooks shouldn't be used, however, as a component of a production system. The notebook structure - with its casual mix of IO, calculation, and UI - is there to encourage interactivity, but works against the modularity needed for code that is used as part of a broader code base. It's best to think of notebooks as a way of exploring logic; once you've found a path, that logic should be replicated into a library designed for production use.
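As a small sketch of what that replication might look like (hypothetical names, and Python rather than my usual R): the calculation gets pulled out of the notebook into a plain function, free of plotting and I/O, where it can be imported and tested like any other production code.

# A calculation that started life in a notebook cell, now promoted to a
# library function so production code can import it.
def weekly_totals(daily_counts):
    """Sum a list of daily counts into consecutive 7-day totals."""
    return [sum(daily_counts[i:i + 7]) for i in range(0, len(daily_counts), 7)]

# A test pins down the behavior that the exploration suggested was right.
def test_weekly_totals():
    assert weekly_totals([1] * 14) == [7, 7]
    assert weekly_totals([]) == []

if __name__ == "__main__":
    test_weekly_totals()
    print(weekly_totals(list(range(14))))  # [21, 70]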


KeystoneInterface

29 April 2020

Software development teams find life can be much easier if they integrate their work as often as they can. They also find it valuable to release frequently into production. But teams don't want to expose half-developed features to their users. A useful technique to deal with this tension is to build all the back-end code, integrate, but don't build the user interface. The feature can be integrated and tested, but the UI is held back until the end when, like a keystone, it's added to complete the feature, revealing it to the users.

A simple example of this technique might be to give a customer the option of a rush order. Such an order needs to be priced, depending on where the customer lives and what delivery companies operate there. The nature of the goods involved affects the picking approach used in the warehouse. Certain customers may qualify to have rush orders available to them, which may also depend on the delivery location, the time of year, and the kind of goods ordered.

All in all that's a fair bit of business logic, particularly since it will involve gnarly integration with various warehousing, catalog, and customer service systems. Doing this could take several weeks, while other features need to be released every few days. But as far as the user is concerned, a rush order is just a check-box on the order form.

To build this using the check-box as the keystone, the team does development work on the underlying business logic and interfaces to internal systems over the course of several production releases. The user is unaware of all this latent code. Only with the last step does the keystone check-box need to be made visible, which can be done in a relatively short time. This way all latent code can be integrated and be part of the system going into production, reducing the problems that come with a long-lived feature branch.

The latent code does need to be tested to the same degree of confidence that it would be if it were active. This can be done providing the architecture of the system is set up so that most testing isn't done through the user interface. Unit Tests and other lower layers of the Test Pyramid should be easy to run this way. Even Broad Stack Tests can be run providing there is a mechanism to make them Subcutaneous Tests. In some cases there will be a significant amount of behavior within the UI itself, but this can also be tested if the design allows the visible UI to be a Humble Object.
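Here's a minimal sketch, in Python with entirely made-up rules, of what testing latent keystone code subcutaneously might look like: the rush-order logic is exercised through its programmatic interface even though no check-box exposes it yet.

from dataclasses import dataclass

@dataclass
class Order:
    destination: str
    goods_type: str

# Latent business logic: integrated and released, but not yet reachable
# from the order form. (Real rules would consult delivery and warehouse systems.)
def rush_order_available(order: Order) -> bool:
    supported_regions = {"metro", "suburban"}
    return order.destination in supported_regions and order.goods_type != "fragile"

# Subcutaneous tests go under the (absent) UI and hit the logic directly.
def test_rush_order_rules():
    assert rush_order_available(Order("metro", "books"))
    assert not rush_order_available(Order("rural", "books"))
    assert not rush_order_available(Order("metro", "fragile"))

if __name__ == "__main__":
    test_rush_order_rules()
    print("latent rush-order logic tested without any UI")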

Not all applications are built in such a way that they can be extensively tested in a subcutaneous manner - but the effort required to do this is worthwhile even without the capability to use a keystone. Tests running through the UI are always more trouble to set up, even with the best tools to automate the process. Moving more tests to subcutaneous and lower level tests, especially unit tests, can dramatically speed up Deployment Pipelines and enable Continuous Delivery.

Of course, most UIs will be more than a check-box, although often they aren't that much more work to keystone. In a web app, a complex feature will often be an independent web page that can be built and tested in full, where the keystone is merely a link. A desktop application may have several screens where the keystone is the menu item that makes them visible.

That said, there are cases when the UI can't be packaged into a simple keystone. When that's the case then it's time to use Feature Toggles. Even in this case, however, thinking of a keystone can be useful by ensuring that the feature toggle only applies to the UI. This avoids scattering lots of toggle points through the back end code, reduces the complexity of applying the toggle, allows the use of simple toggle mechanisms, and makes it easier to remove when the time comes.
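A small sketch (hypothetical names) of what keeping the toggle at the UI looks like: the back-end rush-order code contains no toggle checks at all; the single toggle point merely decides whether the check-box is rendered, and removing it later is trivial.

# The one and only toggle point sits in the UI layer.
FEATURE_TOGGLES = {"rush_order_checkbox": False}

def render_order_form() -> str:
    fields = ["<input name='address'/>"]
    if FEATURE_TOGGLES["rush_order_checkbox"]:
        fields.append("<input type='checkbox' name='rush_order'/>")
    return "\n".join(fields)

print(render_order_form())  # feature hidden: no check-box rendered
FEATURE_TOGGLES["rush_order_checkbox"] = True
print(render_order_form())  # keystone placed: check-box (and feature) revealed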

There is a general danger with developing a UI last, in that the back-end code may be designed in a way that doesn't work with the UI once it's built, or the UI isn't given the attention it needs until late, leading to a lack of iteration and a poor user experience. For those reasons a keystone approach works best within an overall approach that encourages building a product through thin vertical slices that lead to releasing small but fully working features rapidly.

I've used the example of a user-interface here, but of course the same approach can be used with any other interface, such as an API. By building the consumer's interface last, and keeping it simple, we can build and integrate even large features in small chunks.

Dark Launching is a variation where the new feature is called once it's built, but no results are shown to the user. This is done to measure the impact on the back-end systems, which is useful for some changes. Once all is good, we can add the keystone.

Acknowledgements

I first came across the metaphor of a keystone for this technique in the second edition of Kent Beck's Extreme Programming Explained. Pete Hodgson, Brandon Duff, and Stefan Smith reminded me that I'd forgotten this.

Dave Farley, Paul Hammant, and Pete Hodgson commented on drafts of this post.


OutcomeOverOutput

11 February 2020

Imagine a team writing software for a shopping website. If we look at the team's output, we might consider how many new features they produced in the last quarter, or a cross-functional measure such as a reduction in page load time. An outcome measure, however, would consider increased sales revenue, or a reduced number of support calls for the product. Focusing on outcomes, rather than output, favors building features that do more to improve the effectiveness of the software's users and customers.

As with any professional activity, those of us involved in software development want to learn what makes us more effective. This is true of an individual developer trying to improve her own performance, for managers looking to improve teams within an organization, or a maven like me trying to raise the game of the entire industry. One of the things that makes this difficult is that there's no clear way to measure the productivity of a software team. And this measurement question gets further complicated by whether we base effectiveness on output or outcome.

I've always been of the opinion that outcome is what we should concentrate on. If a team delivers lots of functionality - whether we measure it in lines of code, function points, or stories - that functionality doesn't matter if it doesn't help the user improve their activity. Lots of unused features are wasted effort; indeed worse than that, they bloat the code base, making it harder to add new features in the future. Such a software development team needs to care about the usefulness of the new functionality: they improve as they deliver fewer features, but features of greater utility.

One argument I've heard against using outcome-based observations is that it's harder to come up with repeatable measures for outcomes than it is for output. I find this point difficult to fathom. Measuring pure output for software is famously difficult. Lines of code would be a useless measure even if they weren't so easily gamed. There's poor replicability with Function Points or Story Points - different people will give the same things different point scores. Compared to this, we are very good at measuring financial outcomes. Of course, many outcome observations are more tricky to make - consider customer satisfaction - but I don't see any of them as more difficult than measuring software functionality.

Just calling something an “outcome”, of course, doesn't make it the right thing to focus on, and there is certainly a skill to picking the right outcomes to observe. One handy notion is that of Seiden, who says that an outcome should be a change in behavior of a user, employee, or customer that drives a good thing for the organization. He makes a distinction between “outcomes”, which are behavioral changes that are easier to observe, and “impacts”, which are broader effects upon the organization. In developing EDGE, Highsmith, Luu, and Robinson advise that outcomes about customer value (the reliability of a dishwasher) should be given more weight than outcomes about business value (warranty repair costs).

A consequential concern about using outcome observations is that it's harder to apportion them to a software development team. Consider a customer team that uses software to help them track the quality of goods in their supply chain. If we assess them by how many rejects there are by the final consumer, how much of that is due to the software, how much due to the quality control procedures developed by quality analysts, and how much due to a separate initiative to improve the quality of raw materials? This difficulty of apportionment is a huge hurdle if we want to compare different software teams, perhaps in order to judge whether using Clojure has helped teams be more effective. Similarly there is the case where the developers work well and deliver excellent and valuable software to track quality, but the quality control procedures are no good. Consequently rejects don't go down and the initiative is seen as a failure, despite the developers doing a great job on their part.

But the problems of apportionment shouldn't be taken as a reason to observe the wrong thing. The common phrase says "you get what you measure"; in this case it's more like "you get what you try to measure". If you focus appraisal of success on output, then everyone is thinking about how to increase the output. So even if it's tricky to determine how a team's work affects outcomes, the fact that people are instead thinking about outcomes and how to improve them is worth more than any effort to compare teams' proficiency in producing the wrong things.

Further Reading

Seiden provides a nice framework for thinking of outcomes, one that's informed by experiences with non-profits who have a similarly tricky job of evaluating the impact of their work.

My colleagues developed EDGE as an operating model for transforming businesses to work in the digital world. Focusing on outcomes is a core part of their philosophy.

Focusing on outcomes naturally leads to favoring Outcome Oriented teams.

Acknowledgements

My fellow pioneers in the early days of Extreme Programming were very aware of the faults of assessing software development in terms of output. I remember Ron Jeffries and I arguing at an early agile conference workshop that any measures of a team's effectiveness should focus on outcome rather than output - although we did not use those words at the time. That thinking is also reflected in my post Cannot Measure Productivity.

I recall starting to hear my colleagues at ThoughtWorks talk about a distinction between outcome and output in the 2000s, leading Daniel Terhorst-North to suggest that outcome over features should be a fifth agile value. This favoring of outcomes is a regular theme in ThoughtWorks-birthed books such as Lean Enterprise, EDGE, and the Digital Transformation Game Plan.

Alexander Steinhart, Alexandra Mogul, Andy Birds, Dale Peakall, Dean Eyre, Gabriel Sixel, Jeff Mangan, Job Rwebembera, Kief Morris, Linus Karsai, Mariela Barzallo, Peter Gillard-Moss, Steven Wilhelm, Vanessa Towers, Vikrant Kardam, and Xiao Ran discussed drafts of this post on our internal mailing list. Peter Gillard-Moss led me to the Seiden book and other work from the non-profit world.


ExploratoryTesting

18 November 2019

Exploratory testing is a style of testing that emphasizes a rapid cycle of learning, test design, and test execution. Rather than trying to verify that the software conforms to a pre-written test script, exploratory testing explores the characteristics of the software, raising discoveries that will then be classified as reasonable behavior or failures.

The exploratory testing mindset is a contrast to that of scripted testing. In scripted testing, test designers create a script of tests, where each manipulation of the software is written down, together with the expected behavior of the software. These scripts are executed separately, usually many times, and usually by different actors than those who wrote them. If any test demonstrates behavior that doesn't match the expected behavior recorded in the script, then we consider this a failure.

For a long time scripted tests were usually executed by testers, and you'd see lots of relatively junior folks in cubicles clicking through screens following the script and checking the result. In large part due to the influence of communities like Extreme Programming, there's been a shift to automating scripted testing. This allows the tests to be executed faster, and eliminates the human error involved in evaluating the expected behavior. I've long been a firm advocate of automated testing like this, and have seen great success with its use drastically reducing bugs.
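For contrast, here's a hypothetical example of what an automated scripted test looks like in Python: both the manipulation and the expected behavior are written down ahead of time, which is exactly why such a test can only catch the conditions its author anticipated.

# The "script": known inputs with their expected results, checked automatically.
def apply_discount(price, code):
    return round(price * 0.9, 2) if code == "SAVE10" else price

def test_known_discount_code():
    assert apply_discount(100.0, "SAVE10") == 90.0

def test_unknown_code_leaves_price_unchanged():
    assert apply_discount(100.0, "BOGUS") == 100.0

if __name__ == "__main__":
    test_known_discount_code()
    test_unknown_code_leaves_price_unchanged()
    print("scripted checks passed")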

But even the most determined automated testers realize that there are fundamental limitations with the technique, which are limitations of any form of scripted testing. Scripted testing can only verify what is in the script, catching only conditions that are known about. Such tests can be a fine net that catches any bugs that try to get through it, but how do we know that the net covers all it ought to?

Exploratory testing seeks to test the boundaries of the net, finding new behaviors that aren't in any of the scripts. Often it will find new failures that can be added to the scripts, sometimes it exposes behaviors that are benign, even welcome, but not thought of before.

Exploratory testing is a much more fluid and informal process than scripted testing, but it still requires discipline to be done well. A good way to do this is to carry out exploratory testing in time-boxed sessions. These sessions focus on a particular aspect of the software. A charter that identifies the target of the session and what information you hope to find is a fine mechanism to provide this focus.

Elisabeth Hendrickson is one of the most articulate exponents of exploratory testing, and her book is the first choice to dig for more information on how to do this well.

Such a charter can act as focus, but shouldn't attempt to define details of what will happen in the session. Exploratory testing involves trying things, learning more about what the software does, applying that learning to generate questions and hypotheses, and generating new tests in the moment to gather more information. Often this will spur questions outside the bounds of the charter, that can be explored in later sessions.

Exploratory testing requires skilled and curious testers, who are comfortable with learning about the software and coming up with new test designs during a session. They also need to be observant, on the lookout for any behavior that might seem odd, and worth further investigation. Often, however, they don't have to be full-time testers. Some teams like to have the whole team carry out exploratory testing, perhaps in pairs or in a single mob.

Exploratory testing should be a regular activity occurring throughout the software development process. Sadly it's hard to find any guidelines on how much should be done within a project. I'd suggest starting with a one hour session every couple of weeks and see what kinds of information the sessions unearth. Some teams like to arrange half-an-hour or so of exploratory testing whenever they complete a story.

If you find bugs are getting through to production, that's a sign that there are gaps in the testing regimen. It's worth looking at any bug that escapes to production and thinking about what measures could be taken to either prevent the bug from getting there, or detecting it rapidly when in production. This analysis will help you decide whether you need more exploratory testing. Bear in mind that it will take time to build up the skill to do exploratory testing well, if you haven't done much exploratory testing before.

I would consider it a red flag if a team isn't doing exploratory testing at all - even if their automated testing is excellent. Even the best automated testing is inherently scripted testing - and that alone is not good enough.

Acknowledgements

Almost all I know about Exploratory Testing comes from Elisabeth Hendrickson's fine book, which is also where I pinched the net metaphor from.

Aida Manna, Alex Fraser, Bharath Kumar Hemachandran, Chris Ford, Claire Sudbery, Daniel Mondria, David Corrales, David Cullen, David Salazar Villegas, Lina Zubyte, and Philip Peter discussed drafts of this article on our internal mailing list.


WaterfallProcess

13 November 2019

In the software world, “waterfall” is commonly used to describe a style of software process, one that contrasts with the ideas of iterative, or agile, styles. Like many well-known terms in software its meaning is ill-defined and its origins are obscure - but I find its essential theme is breaking down a large effort into phases based on activity.

It's not clear how the word “waterfall” became so prevalent, but most people base its origin on a paper by Winston Royce, in particular this figure:

Although this paper seems to be universally acknowledged as the source of the notion of waterfall (based on the shape of the downward cascade of tasks), the term “waterfall” never appears in the paper. It's not clear how the name appeared later.

Royce’s paper describes his observations on the software development process of the time (late 60s) and how the usual implementation steps could be improved. [1] But “waterfall” has gone much further, to be used as a general description of a style of software development. For people like me, who speak at software conferences, it almost always appears in a derogatory manner - I can’t recall hearing any conference speaker saying anything good about waterfall for many years. However when talking to practitioners in enterprises I do hear it spoken of as a viable, even preferred, development style. Certainly less so now than in the 90s, but more frequently than one might assume by listening to process mavens.

But what exactly is “waterfall”? That’s not an easy question to answer as, like so many things in software, there is no clear definition. In my judgment, there is one common characteristic that dominates any definition folks use for waterfall, and that’s the idea of decomposing effort into phases based on activity.

Let me unpack that phrase. Let’s say I have some software to build, and I think it’s going to take about a year to build it. Few people are going to happily say “go away for a year and tell me when it’s done”. Instead, most people will want to break down that year into smaller chunks, so they can monitor progress and have confidence that things are on track. The question then is how do we perform this breakdown?

The waterfall style, as suggested by the Royce sketch, does it by the activity we are doing. So our 1 year project might be broken down into 2 months of analysis, followed by 4 months of design, 3 months of coding, and 3 months of testing. The contrast here is to an iterative style, where we would take some high level requirements (build a library management system), and divide them into subsets (search catalog, reserve a book, check-out and return, assess fines). We'd then take one of these subsets and spend a couple of months to build working software to implement that functionality, delivering either into a staging environment or preferably into a live production setting. Having done that with one subset, we'd continue with further subsets.

In this thinking waterfall means “do one activity at a time for all the features” while iterative means “do all activities for one feature at a time”.

If the origin of the word “waterfall” is murky, so is the notion of how this phase-based breakdown originated. My guess is that it’s natural to break down a large task into different activities, especially if you look to fields such as building construction as an inspiration. Each activity requires different skills, so getting all the analysts to complete analysis before you bring in all the coders makes intuitive sense. It seems logical that a misunderstanding of requirements is cheaper to fix before people begin coding - especially considering the state of computers in the late 60s. Finally the same activity-based breakdown can be used as a standard for many projects, while a feature-based breakdown is harder to teach. [2]

Although it isn’t hard to find people explaining why this waterfall thinking isn’t a good idea for software development, I should summarize my primary objections to the waterfall style here. The waterfall style usually has testing and integration as two of the final phases in the cycle, but these are the most difficult elements of a development project to predict. Problems at these stages lead to rework of many steps in earlier phases and to significant project delays. It's too easy to declare all but the late phases as "done", with much work missing, and thus it's hard to tell if the project is going well. There is no opportunity for early releases before all the features are done. All this introduces a great deal of risk to the development effort.

Furthermore, a waterfall approach forces us into a predictive style of planning; it assumes that once you are done with a phase, such as requirements analysis, the resulting deliverable is a stable platform for later phases to base their work on. [3] In practice the vast majority of software projects find they need to change their requirements significantly within a few months, due to everyone learning more about the domain, the characteristics of the software environment, and changes in the business environment. Indeed we've found that delivering a subset of features does more than anything to help clarify what needs to be done next, so an iterative approach allows us to shift to an adaptive planning approach where we update our plans as we learn what the real software needs are. [4]

These are the major reasons why I've glibly said that "you should use iterative development only in projects that you want to succeed".

Waterfalls and iterations may nest inside each other. A six year project might consist of two 3 year projects, where each of the two projects is structured in a waterfall style, but the second project adds additional features. You can think of this as a two-iteration project at the top level, with each iteration run as a waterfall. Due to the large size and small number of iterations, I'd regard that as primarily a waterfall project. In contrast you might see a project with 16 iterations of one month each, where each iteration is planned in a waterfall style. That I'd see as primarily iterative. While in theory there's potential for a middle ground of projects that are hard to classify, in practice it's usually easy to tell that one style predominates.

It is also possible to mix waterfall and iterative styles, where early phases (requirements analysis, high level design) are done in a waterfall style while later phases (detailed design, code, test) are done in an iterative manner. This reduces the risks inherent in late testing and integration phases, but does not enable adaptive planning.

Waterfall is often cast as the alternative to agile software development, but I don't see that as strictly true. Certainly agile processes require an iterative approach and cannot work in a waterfall style. But it is also easy to follow an iterative approach (i.e. non-waterfall) and not be agile. [5] I might do this by taking 100 features and dividing them up into ten iterations over the next year, and then expecting that each iteration should complete on time with its planned set of features. If I do this, my initial plan is a predictive plan; if all goes well I should expect the work to closely follow the plan. But adaptive planning is an essential element of agile thinking. I expect features to move between iterations, new features to appear, and many features to be discarded as no longer valuable enough.

My rule of thumb is that anyone who says “we were successful because we were on-time and on-budget” is thinking in terms of predictive planning, even if they are following an iterative process, and thus is not thinking with an agile mindset. In the agile world, success is all about business value - regardless of what was written in a plan months ago. Plans are made, but updated regularly. They guide decisions on what to do next, but are not used as a success measure.

Notes

1: There have been quite a few people seeking to interpret the Royce paper. Some argue that his paper opposes waterfall, pointing out that the paper discusses flaws in the kind of process suggested by figure 2, the one I've quoted here. Certainly he does discuss flaws, but he also says the illustrated approach is "fundamentally sound". Either way, this activity-based decomposition of projects became the accepted model in the decades that followed.

2: This leads to another common characteristic that goes with the term “waterfall” - rigid processes that tell everyone in detail what they should do. Certainly the software process folks in the 90s were keen on coming up with prescriptive methods, but such prescriptive thinking also affected many who advocated iterative techniques. Indeed although agile methods explicitly disavow this kind of Taylorist thinking, I often hear of faux-agile initiatives following this route.

3: The notion that a phase should be finished before the next one is started is a convenient fiction. Even the most eager waterfall proponent would agree that some rework on prior stages is necessary in practice, although I think most would say that if executed perfectly, each activity wouldn't need rework. Royce's paper explicitly discussed how iteration was expected between adjacent steps (eg Analysis and Program Design in his figure). However Royce argued that longer backtracks (eg between Program Design and Testing) were a serious problem.

4: This does raise the question of whether there are contexts where the waterfall style is actually better than the iterative one. In theory, waterfall might well work better in situations where there was a deep understanding of the requirements, and the technologies being used - and neither of those things would significantly change during the life of the product. I say "in theory" because I've not come across such a circumstance, so I can't judge if waterfall would be appropriate in practice. And even then I'd be reluctant to follow the waterfall style for the later phases (code-test-integrate) as I've found so much value in interleaving testing with coding while doing continuous integration.

5: In the 90s it was generally accepted in the object-oriented world that waterfall was a bad idea and should be replaced with an iterative style. However I don't think there was the degree of embracing changing requirements that appeared with the agile community.

Acknowledgements

My thanks to Ben Noble, Clare Sudbury, David Johnston, Karl Brown, Kyle Hodgson, Pramod Sadalage, Prasanna Pendse, Rebecca Parsons, Sriram Narayan, Sriram Narayanan, Tiago Griffo, Unmesh Joshi, and Vidhyalakshmi Narayanaswamy who discussed drafts of this post on our internal mailing list.