delivery · agile adoption · team organization · collaboration


Agile software development has broken down some of the silos between requirements analysis, testing and development. Deployment, operations and maintenance are other activities which have suffered a similar separation from the rest of the software development process. The DevOps movement is aimed at removing these silos and encouraging collaboration between development and operations.

DevOps has become possible largely due to a combination of new operations tools and established agile engineering practices [1], but these are not enough to realize the benefits of DevOps. Even with the best tools, DevOps is just another buzzword if you don't have the right culture.

The primary characteristic of DevOps culture is increased collaboration between the roles of development and operations. There are some important cultural shifts, within teams and at an organizational level, that support this collaboration.

An attitude of shared responsibility is an aspect of DevOps culture that encourages closer collaboration. It’s easy for a development team to become uninterested in the operation and maintenance of a system if it is handed over to another team to look after. If a development team shares the responsibility of looking after a system over the course of its lifetime, they are able to share the operations staff’s pain and so identify ways to simplify deployment and maintenance (e.g. by automating deployments and improving logging). They may also gain additional ObservedRequirements from monitoring the system in production. When operations staff share responsibility for a system’s business goals, they are able to work more closely with developers to better understand the operational needs of a system and help meet these. In practice, collaboration often begins with an increased awareness from developers of operational concerns (such as deployment and monitoring) and the adoption of new automation tools and practices by operations staff.

Some organizational shifts are required to support a culture of shared responsibilities. There should be no silos between development and operations. Handover periods and documentation are a poor substitute for working together on a solution from the start. It is helpful to adjust resourcing structures to allow operations staff to get involved with teams early. Having the developers and operations staff co-located will help them to work together. Handovers and sign-offs discourage people from sharing responsibility and contribute to a culture of blame. Instead, developers and operations staff should both be responsible for the successes and failures of a system. DevOps culture blurs the line between the roles of developer and operations staff and may eventually eliminate the distinction. One common anti-pattern when introducing DevOps to an organization is to assign someone the role of 'DevOps' or to call a team a 'DevOps team'. Doing so perpetuates the kinds of silos that DevOps aims to break down and prevents DevOps culture and practices from spreading and being adopted by the wider organization.

Another valuable organizational shift is to support autonomous teams. In order to collaborate effectively, developers and operations staff need to be able to make decisions and apply changes without convoluted decision-making processes. This involves trusting teams, changing the way risk is managed and creating an environment that is free of a fear of failure. For example, a team that has to produce a list of changes for sign-off in order to deploy to a testing environment is likely to be delayed frequently. Instead of requiring such a manual check, it is possible to rely on version control, which is fully auditable. Changes in version control can even be linked to tickets in the team's project management tool. Without the manual sign-off, the team can automate their deployments and speed up their testing cycle.
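
As a rough illustration of replacing a hand-written change list with one derived from version control, the sketch below pulls ticket references out of commit messages. The ticket format (`PAY-42`), the sample commits and the function name `changes_by_ticket` are all assumptions made for this example; a real team would feed it the history from its own version control system.

```python
import re
from collections import defaultdict

# Assumed ticket convention: an upper-case project key, a dash, a number (e.g. "PAY-42").
TICKET_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def changes_by_ticket(commits):
    """Group commit ids by the ticket ids mentioned in their messages.

    `commits` is an iterable of (commit_id, message) pairs. Commits that
    mention no ticket are collected under the key "UNLINKED".
    """
    grouped = defaultdict(list)
    for commit_id, message in commits:
        tickets = TICKET_PATTERN.findall(message)
        for ticket in tickets or ["UNLINKED"]:
            grouped[ticket].append(commit_id)
    return dict(grouped)


# Hypothetical commit history, in the shape `git log --format="%h %s"` might produce.
sample_commits = [
    ("a1b2c3d", "PAY-42 Add retry to payment gateway client"),
    ("d4e5f6a", "PAY-42 PAY-57 Log gateway timeouts"),
    ("0f9e8d7", "Fix typo in README"),
]

if __name__ == "__main__":
    for ticket, commit_ids in sorted(changes_by_ticket(sample_commits).items()):
        print(ticket, ", ".join(commit_ids))
```

Commits that end up under "UNLINKED" are the ones a reviewer might want to look at, which is a much smaller job than signing off every deployment by hand.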

One effect of a shift towards DevOps culture is that it becomes easier to put new code in production. This necessitates some further cultural changes. In order to ensure that changes in production are sound, the team needs to value building quality into the development process. This includes cross-functional concerns such as performance and security. The techniques of ContinuousDelivery, including SelfTestingCode, form a basis which allows regular, low-risk deployments.

It is also important for the team to value feedback, in order to continuously improve the way in which developers and operations staff work together as well as the system itself. Production monitoring is a helpful feedback loop for diagnosing issues and spotting potential improvements.

Automation is a cornerstone of the DevOps movement and facilitates collaboration. Automating tasks such as testing, configuration and deployment frees people up to focus on other valuable activities and reduces the chance of human error. A helpful side effect of automation is that automated scripts and tests serve as useful, always up-to-date documentation of the system. Automating server configuration, for example, removes the guesswork associated with a SnowflakeServer and means that developers and operations staff are equally able to know and change how a server is configured.
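
To give a feel for how automated configuration doubles as documentation, here is a toy sketch in which the desired state of a server is plain data and a function works out what would need to change. The package names, settings and the `plan_changes` function are hypothetical; in practice a team would more likely use an established configuration management tool than write this by hand.

```python
# Desired configuration, kept in version control alongside the application code.
# All names and values here are made up for illustration.
DESIRED = {
    "packages": {"nginx", "openjdk-8-jre"},
    "settings": {"max_connections": 200, "log_level": "info"},
}


def plan_changes(actual, desired=DESIRED):
    """Compare a server's actual state with the desired one and list what to change."""
    return {
        "install": sorted(desired["packages"] - actual["packages"]),
        "remove": sorted(actual["packages"] - desired["packages"]),
        "update_settings": {
            key: value
            for key, value in desired["settings"].items()
            if actual["settings"].get(key) != value
        },
    }


# A hand-tweaked server whose state has drifted from what anyone remembers.
snowflake = {
    "packages": {"nginx", "telnetd"},
    "settings": {"max_connections": 50, "log_level": "info"},
}

if __name__ == "__main__":
    print(plan_changes(snowflake))
```

Because `DESIRED` is ordinary text under version control, developers and operations staff can both read it and propose changes to it in the same way they would change code.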


1: Operations tools include virtualization, cloud computing and automated configuration management. These are often supported by engineering practices such as Continuous Integration, evolutionary design and clean code.

if you found this article useful, please share it. I appreciate the feedback and encouragement


evolutionary design · microservices


As I hear stories about teams using a microservices architecture, I've noticed a common pattern.

  1. Almost all the successful microservice stories have started with a monolith that got too big and was broken up
  2. Almost all the cases where I've heard of a system that was built as a microservice system from scratch, it has ended up in serious trouble.

This pattern has led many of my colleagues to argue that you shouldn't start a new project with microservices, even if you're sure your application will be big enough to make it worthwhile.

Microservices are a useful architecture, but even their advocates say that using them incurs a significant MicroservicePremium, which means they are only useful with more complex systems. This premium, essentially the cost of managing a suite of services, will slow down a team, favoring a monolith for simpler applications. This leads to a powerful argument for a monolith-first strategy, where you should build a new application as a monolith initially, even if you think it's likely that it will benefit from a microservices architecture later on.

The first reason for this is classic Yagni. When you begin a new application, how sure are you that it will be useful to your users? It may be hard to scale a poorly designed but successful software system, but that's still a better place to be than its inverse. As we're now recognizing, often the best way to find out if a software idea is useful is to build a simplistic version of it and see how well it works out. During this first phase you need to prioritize speed (and thus cycle time for feedback), so the premium of microservices is a drag you should do without.

The second issue with starting with microservices is that they only work well if you come up with good, stable boundaries between the services - which is essentially the task of drawing up the right set of BoundedContexts. Any refactoring of functionality between services is much harder than it is in a monolith. But even experienced architects working in familiar domains have great difficulty getting boundaries right at the beginning. By building a monolith first, you can figure out what the right boundaries are, before a microservices design brushes a layer of treacle over them. It also gives you time to develop the MicroservicePrerequisites you need for finer-grained services.

I've heard different ways to execute a monolith-first strategy. The logical way is to design a monolith carefully, paying attention to modularity within the software, both at the API boundaries and how the data is stored. Do this well, and it's a relatively simple matter to make the shift to microservices. However I'd feel much more comfortable with this approach if I'd heard a decent number of stories where it worked out that way. [1]

A more common approach is to start with a monolith and gradually peel off microservices at the edges. Such an approach can leave a substantial monolith at the heart of the microservices architecture, but with most new development occurring in the microservices while the monolith is relatively quiescent.

Another common approach is to just replace the monolith entirely. Few people look at this as an approach to be proud of, yet there are advantages to building a monolith as a SacrificialArchitecture. Don't be afraid of building a monolith that you will discard, particularly if a monolith can get you to market quickly.

Another route I've run into is to start with just a couple of coarse-grained services, larger than those you expect to end up with. Use these coarse-grained services to get used to working with multiple services, while enjoying the fact that such coarse granularity reduces the amount of inter-service refactoring you have to do. Then as boundaries stabilize, break down into finer-grained services. [2]

While the bulk of my contacts lean toward the monolith-first approach, it is by no means unanimous. The counter argument says that starting with microservices allows you to get used to the rhythm of developing in a microservice environment. It takes a lot, perhaps too much, discipline to build a monolith in a sufficiently modular way that it can be broken down into microservices easily. By starting with microservices you get everyone used to developing in separate small teams from the beginning, and having teams separated by service boundaries makes it much easier to scale up the development effort when you need to. This is especially viable for system replacements where you have a better chance of coming up with stable-enough boundaries early. Although the evidence is sparse, I feel that you shouldn't start with microservices unless you have reasonable experience of building a microservices system in the team.

I don't feel I have enough anecdotes yet to get a firm handle on how to decide whether to use a monolith-first strategy. These are early days in microservices, and there are relatively few anecdotes to learn from. So anybody's advice on these topics must be seen as tentative, however confidently they argue.

Further Reading

Sam Newman describes a case study of a team considering using microservices on a greenfield project.


1: You cannot assume that you can take an arbitrary system and break it into microservices. Most systems acquire too many dependencies between their modules, and thus can't be sensibly broken apart. I've heard of plenty of cases where an attempt to decompose a monolith has quickly ended up in a mess. I've also heard of a few cases where a gradual route to microservices has been successful - but these cases required a relatively good modular design to start with.

2: I suppose that strictly you should call this a "duolith", but I think the approach follows the essence of monolith-first strategy: start with coarse-granularity to gain knowledge and split later.


I stole much of this thinking from my colleagues: James Lewis, Sam Newman, Thiyagu Palanisamy, and Evan Bottcher. Stefan Tilkov's comments on an earlier draft played a pivotal role in clarifying my thoughts. Chad Currie created the lovely glyphy dragons. Steven Lowe, Patrick Kua, Jean Robert D'amore, Chelsea Komlo, Ashok Subramanian, Dan Siwiec, Prasanna Pendse, Kief Morris, Chris Ford, and Florian Sellmayr discussed drafts on our internal mailing list.


process theory · project planning · evolutionary design · clean code


Yagni originally is an acronym that stands for "You Aren't Gonna Need It". It is a mantra from ExtremeProgramming that's often used generally in agile software teams. It's a statement that some capability we presume our software needs in the future should not be built now because "you aren't gonna need it".

Yagni is a way to refer to the XP practice of Simple Design (from the first edition of The White Book, the second edition refers to the related notion of "incremental design"). [1] Like many elements of XP, it's a sharp contrast to elements of the widely held principles of software engineering in the late 90s. At that time there was a big push for careful up-front planning of software development.

Let's imagine I'm working with a startup in Minas Tirith selling insurance for the shipping business. Their software system is broken into two main components: one for pricing, and one for sales. The dependencies are such that they can't usefully build sales software until the relevant pricing software is completed.

At the moment, the team is working on updating the pricing component to add support for risks from storms. They know that in six months' time, they will need to also support pricing for piracy risks. Since they are currently working on the pricing engine they consider building the presumptive feature [2] for piracy pricing now, since that way the pricing service will be complete before they start working on the sales software.

Yagni argues against this. It says that since you won't need piracy pricing for six months, you shouldn't build it until it's necessary. So if you think it will take two months to build this software, then you shouldn't start for another four months (neglecting any buffer time for schedule risk and updating the sales component).

The first argument for yagni is that while we may now think we need this presumptive feature, it's likely that we will be wrong. After all, the context of agile methods is an acceptance that we welcome changing requirements. A plan-driven requirements guru might counter-argue that this is because we didn't do a good-enough job of our requirements analysis and should have put more time and effort into it. I counter that by pointing out how difficult and costly it is to figure out your needs in advance, but even if you can, you can still be blind-sided when the Gondor Navy wipes out the pirates, thus undermining the entire business model.

In this case, there's an obvious cost of the presumptive feature - the cost of build: all the effort spent on analyzing, programming, and testing this now useless feature.

But let's consider that we were completely correct with our understanding of our needs, and the Gondor Navy didn't wipe out the pirates. Even in this happy case, building the presumptive feature incurs two serious costs. The first cost is the cost of delayed value. By expending our effort on the piracy pricing software we didn't build some other feature. If we'd instead put our energy into building the sales software for weather risks, we could have put a full storm risks feature into production and be generating revenue two months earlier. This cost of delay due to the presumptive feature is two months' revenue from storm insurance.

The common reason why people build presumptive features is because they think it will be cheaper to build it now rather than build it later. But that cost comparison has to be made at least against the cost of delay, preferably factoring in the probability that you're building an unnecessary feature, for which your odds are at least ⅔. [3]
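
To make the comparison concrete, here is a back-of-the-envelope sketch. Only the two-month build time and the ⅔ odds come from the example above; the monthly team cost, the monthly storm-insurance revenue and the assumed premium for building later are invented figures, so the result illustrates the arithmetic and is not an estimate.

```python
from fractions import Fraction

# Figures taken from the example in the text.
BUILD_MONTHS = 2                      # time to build piracy pricing
P_UNNEEDED = Fraction(2, 3)           # odds the feature turns out unnecessary

# Invented figures, purely for illustration.
TEAM_COST_PER_MONTH = 50_000          # cost of the team building the feature
STORM_REVENUE_PER_MONTH = 30_000      # storm revenue delayed while building piracy pricing
LATER_BUILD_PREMIUM = Fraction(1, 4)  # assume building later costs 25% more


def expected_cost_build_now():
    """Build cost is always paid, and storm revenue is always delayed."""
    cost_of_build = BUILD_MONTHS * TEAM_COST_PER_MONTH
    cost_of_delay = BUILD_MONTHS * STORM_REVENUE_PER_MONTH
    return cost_of_build + cost_of_delay


def expected_cost_build_later():
    """The (more expensive) build is only paid for if the feature is really needed."""
    p_needed = 1 - P_UNNEEDED
    later_build = BUILD_MONTHS * TEAM_COST_PER_MONTH * (1 + LATER_BUILD_PREMIUM)
    return p_needed * later_build


if __name__ == "__main__":
    print("build now:  ", expected_cost_build_now())
    print("build later:", float(expected_cost_build_later()))
```

Under these made-up numbers, building later stays cheaper in expectation even with a 25% premium, because the build cost is weighted by the ⅓ chance that the feature is needed at all. Different inputs can of course tip the balance, which is why the comparison is worth doing explicitly.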

Often people don't think through the comparative cost of building now versus building later. One approach I use when mentoring developers in this situation is to ask them to imagine any refactoring they would have to do later to introduce the capability when it's needed. Often that thought experiment is enough to convince them that it won't be significantly more expensive to add it later. Another result from such an imagining is to add something that's easy to do now, adds minimal complexity, yet significantly reduces the later cost. Using lookup tables for error messages rather than inline literals is an example that is simple yet makes later translations easier to support.
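
A minimal sketch of the lookup-table idea follows. The message keys, wording and the `error_message` helper are hypothetical. Note that there is deliberately no locale handling yet: the table only keeps the text in one place, so that translations could be added later by swapping or extending the table.

```python
# Error messages kept in one lookup table, keyed by a stable identifier,
# instead of being written inline at each call site.
ERROR_MESSAGES = {
    "policy_not_found": "No policy found with id {policy_id}.",
    "risk_unsupported": "Risk type '{risk}' is not supported.",
}


def error_message(key, **params):
    """Look up a message template by key and fill in its parameters."""
    return ERROR_MESSAGES[key].format(**params)


if __name__ == "__main__":
    print(error_message("risk_unsupported", risk="piracy"))
```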

The cost of delay is one cost that a successful presumptive feature imposes, but another is the cost of carry. The code for the presumptive feature adds some complexity to the software, and this complexity makes it harder to modify and debug that software, thus increasing the cost of other features. The extra complexity from having the piracy-pricing feature in the software might add a couple of weeks to how long it takes to build the storm insurance sales component. That two weeks hits two ways: the additional cost to build the feature, plus the additional cost of delay since it took longer to put it into production. We'll incur a cost of carry on every feature built between now and the time the piracy insurance software starts being useful. Should we never need the piracy-pricing software, we'll incur a cost of carry on every feature built until we remove the piracy-pricing feature (assuming we do), together with the cost of removing it.

So far I've divided presumptive features in two categories: successful and unsuccessful. Naturally there's really a spectrum there, with one point on that spectrum that's worth highlighting: the right feature built wrong. Development teams are always learning, both about their users and about their code base. They learn about the tools they're using and these tools go through regular upgrades. They also learn about how their code works together. All this means that you often realize that a feature coded six months ago wasn't done the way you now realize it should be done. In that case you have accumulated TechnicalDebt and have to consider the cost of repair for that feature or the on-going costs of working around its difficulties.

So we end up with three classes of presumptive features, and four kinds of costs that occur when you neglect yagni for them.

My insurance example talks about relatively user-visible functionality, but the same argument applies for abstractions to support future flexibility. When building the storm risk calculator, you may consider putting in abstractions and parameterizations now to support piracy and other risks later. Yagni says not to do this, because you may not need the other pricing functions, or if you do, your current ideas of what abstractions you'll need will not match what you learn when you do actually need them. This doesn't mean to forgo all abstractions, but it does mean any abstraction that makes it harder to understand the code for current requirements is presumed guilty.

Yagni is at its most visible with larger features, but you see it more frequently with small things. Recently I wrote some code that allows me to highlight part of a line of code. For this, I allow the highlighted code to be specified using a regular expression. One problem I see with this is that since the whole regular expression is highlighted, I'm unable to deal with the case where I need the regex to match a larger section than what I'd like to highlight. I expect I can solve that by using a group within the regex and letting my code only highlight the group if a group is present. But I haven't needed to use a regex that matches more than what I'm highlighting yet, so I haven't extended my highlighting code to handle this case - and won't until I actually need it. For similar reasons I don't add fields or methods until I'm actually ready to use them.
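
For a sense of how small that deferred change might be, here is a sketch of one way it could look. This is not the code described above: the function names and the `[[`/`]]` markers are invented for illustration. The rule it implements is the one mentioned in the paragraph: highlight the first group if the regex has one, otherwise highlight the whole match.

```python
import re


def highlight_span(line, pattern):
    """Return the (start, end) indexes of the text to highlight in `line`.

    If `pattern` contains a capturing group, only the first group is
    highlighted; otherwise the whole match is. Returns None when the
    pattern does not match, or when the group took no part in the match.
    """
    match = re.search(pattern, line)
    if match is None:
        return None
    group = 1 if match.re.groups >= 1 else 0
    span = match.span(group)
    return None if span == (-1, -1) else span


def highlight(line, pattern, start_mark="[[", end_mark="]]"):
    """Wrap the highlighted part of `line` in marker strings."""
    span = highlight_span(line, pattern)
    if span is None:
        return line
    start, end = span
    return line[:start] + start_mark + line[start:end] + end_mark + line[end:]


if __name__ == "__main__":
    line = "total = price * quantity"
    print(highlight(line, r"price"))         # whole match highlighted
    print(highlight(line, r"= (price) \*"))  # only the group highlighted
```

The group handling is a couple of lines inside `highlight_span`, which supports the point: leaving it out until it is needed costs very little.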

Small yagni decisions like this fly under the radar of project planning. As a developer it's easy to spend an hour adding an abstraction that we're sure will soon be needed. Yet all the arguments above still apply, and a lot of small yagni decisions add up to significant reductions in complexity to a code base, while speeding up delivery of features that are needed more urgently.

Now we understand why yagni is important we can dig into a common confusion about yagni. Yagni only applies to capabilities built into the software to support a presumptive feature; it does not apply to effort to make the software easier to modify. Yagni is only a viable strategy if the code is easy to change, so expending effort on refactoring isn't a violation of yagni because refactoring makes the code more malleable. Similar reasoning applies for practices like SelfTestingCode and ContinuousDelivery. These are enabling practices for evolutionary design; without them yagni turns from a beneficial practice into a curse. But if you do have a malleable code base, then yagni reinforces that flexibility. Yagni has the curious property that it is both enabled by and enables evolutionary design.

Yagni is not a justification for neglecting the health of your code base. Yagni requires (and enables) malleable code.

I also argue that yagni only applies when you introduce extra complexity now that you won't take advantage of until later. If you do something for a future need that doesn't actually increase the complexity of the software, then there's no reason to invoke yagni.

Having said all this, there are times when applying yagni does cause a problem, and you are faced with an expensive change when an earlier change would have been much cheaper. The tricky thing here is that these cases are hard to spot in advance, and much easier to remember than the cases where yagni saved effort [4]. My sense is that yagni-failures are relatively rare and their costs are easily outweighed by when yagni succeeds.

Further Reading

My essay Is Design Dead talks in more detail about the role of design and architecture in agile projects, and thus the role yagni plays as an enabling practice.

This principle was first discussed and fleshed out on Ward's Wiki.


1: The origin of the phrase is an early conversation between Kent Beck and Chet Hendrickson on the C3 project. Chet came up to Kent with a series of capabilities that the system would soon need, and to each one Kent replied "you aren't going to need it". Chet's a fast learner, and quickly became renowned for his ability to spot opportunities to apply yagni. Although "yagni" began life as an acronym, I feel it's now entered our lexicon as a regular word, and thus forgo the capital letters.

2: In this post I use "presumptive feature" to refer to any code that supports a feature that isn't yet being made available for use.

3: The ⅔ number is suggested by Kohavi et al, who analyzed the value of features built and deployed on products at Microsoft and found that, even with careful up-front analysis, only ⅓ of them improved the metrics they were designed to improve.

4: This is a consequence of availability bias.


Rachel Laycock talked through this post with me and played a critical role in its final organization. Chet Hendrickson and Steven Lowe reminded me to discuss small-scale yagni decisions. Rebecca Parsons, Alvaro Cavalcanti, Mark Taylor, Aman King, Rouan Wilsenach, Peter Gillard-Moss, Kief Morris, Ian Cartwright, James Lewis, Kornelis Sietsma, and Brian Mason participated in an insightful discussion about drafts of this article on our internal mailing list.




The microservices architectural style has been the hot topic over the last year. At the recent O'Reilly software architecture conference, it seemed like every session talked about microservices. Enough to get everyone's over-hyped-bullshit detector up and flashing. One of the consequences of this is that we've seen teams be too eager to embrace microservices, [1] not realizing that microservices introduce complexity on their own account. This adds a premium to a project's cost and risk - one that often gets projects into serious trouble.

While this hype around microservices is annoying, I do think it's a useful bit of terminology for a style of architecture which has been around for a while, but needed a name to make it easier to talk about. The important thing here is not how annoyed you feel about the hype, but the architectural question it raises: is a microservice architecture a good choice for the system you're working on?

"It depends" must start my answer, but then I must shift the focus to what factors it depends on. The fulcrum of whether or not to use microservices is the complexity of the system you're contemplating. The microservices approach is all about handling a complex system, but in order to do so the approach introduces its own set of complexities. When you use microservices you have to work on automated deployment, monitoring, dealing with failure, eventual consistency, and other factors that a distributed system introduces. There are well-known ways to cope with all this, but it's extra effort, and nobody I know in software development seems to have acres of free time.

So my primary guideline would be don't even consider microservices unless you have a system that's too complex to manage as a monolith. The majority of software systems should be built as a single monolithic application. Do pay attention to good modularity within that monolith, but don't try to separate it into separate services.

The complexity that drives us to microservices can come from many sources including dealing with large teams [2], multi-tenancy, supporting many user interaction models, allowing different business functions to evolve independently, and scaling. But the biggest factor is that of sheer size - people finding they have a monolith that's too big to modify and deploy.

At this point I feel a certain frustration. Many of the problems ascribed to monoliths aren't essential to that style. I've heard people say that you need to use microservices because it's impossible to do ContinuousDelivery with monoliths - yet there are plenty of organizations that succeed with a cookie-cutter deployment approach: Facebook and Etsy are two well-known examples.

I've also heard arguments that say that as a system increases in size, you have to use microservices in order to have parts that are easy to modify and replace. Yet there's no reason why you can't make a single monolith with well defined module boundaries. At least there's no reason in theory; in practice it seems too easy for module boundaries to be breached and monoliths to get tangled as well as large.

We should also remember that there's a substantial variation in service-size between different microservice systems. I've seen microservice systems vary from a team of 60 with 20 services to a team of 4 with 200 services. It's not clear to what degree service size affects the premium.

As size and other complexity boosters kick into a project I've seen many teams find that microservices are a better place to be. But unless you're faced with that complexity, remember that the microservices approach brings a high premium, one that can slow down your development considerably. So if you can keep your system simple enough to avoid the need for microservices: do.


1: It's a common enough problem that our recent radar called it out as Microservice Envy.

2: Conway's Law says that the structure of a system follows the organization of the people that built it. Some examples of microservice usage had organizations deliberately split themselves into small, loosely coupled groups in order to push the software into a similar modular structure - a notion that's called the Inverse Conway Maneuver.


I stole much of this thinking from my colleagues: James Lewis, Sam Newman, Thiyagu Palanisamy, and Evan Bottcher. Stefan Tilkov's comments on an earlier draft were instrumental in sharpening this post. Rob Miles, David Nelson, Brian Mason, and Scott Robinson discussed drafts of this article on our internal mailing list.


extreme programming · clean code · refactoring


Kent Beck came up with his four rules of simple design while he was developing ExtremeProgramming in the late 1990s. I express them like this. [1]

  • Passes the tests
  • Reveals intention
  • No duplication
  • Fewest elements

The rules are in priority order, so "passes the tests" takes priority over "reveals intention"

Kent Beck developed Extreme Programming, Test Driven Development, and can always be relied on for good Victorian facial hair for his local ballet.

The most important of the rules is "passes the tests". XP was revolutionary in how it raised testing to a first-class activity in software development, so it's natural that testing should play a prominent role in these rules. The point is that whatever else you do with the software, the primary aim is that it works as intended and tests are there to ensure that happens.

"Reveals intention" is Kent's way of saying the code should be easy to understand. Communication is a core value of Extreme Programming, and many programmers like to stress that programs are there to be read by people. Kent's form of expressing this rule implies that the key to enabling understanding is to express your intention in the code, so that your readers can understand what your purpose was when writing it.

The "no duplication" is perhaps the most powerfully subtle of these rules. It's a notion expressed elsewhere as DRY or SPOT [2], Kent expressed it as saying everything should be said "Once and only Once." Many programmers have observed that the exercise of eliminating duplication is a powerful way to drive out good designs. [3]

The last rule tells us that anything that doesn't serve the three prior rules should be removed. At the time these rules were formulated there was a lot of design advice around adding elements to an architecture in order to increase flexibility for future requirements. Ironically the extra complexity of all of these elements usually made the system harder to modify and thus less flexible in practice.

People often find there is some tension between "no duplication" and "reveals intention", leading to arguments about the order in which those rules should appear. I've always seen their order as unimportant, since they feed off each other in refining the code. Adding duplication to increase clarity is often papering over a problem, when it would be better to solve it. [4]

What I like about these rules is that they are very simple to remember, yet following them improves code in any language or programming paradigm that I've worked with. They are an example of Kent's skill in finding principles that are generally applicable and yet concrete enough to shape my actions.

At the time there was a lot of “design is subjective”, “design is a matter of taste” bullshit going around. I disagreed. There are better and worse designs. These criteria aren’t perfect, but they serve to sort out some of the obvious crap and (importantly) you can evaluate them right now. The real criteria for quality of design, “minimizes cost (including the cost of delay) and maximizes benefit over the lifetime of the software,” can only be evaluated post hoc, and even then any evaluation will be subject to a large bag full of cognitive biases. The four rules are generally predictive.

-- Kent Beck

Further Reading

There are many expressions of these rules out there, here are a few that I think are worth exploring:


Kent reviewed this post and sent me some very helpful feedback, much of which I appropriated into the text.


1: Authoritative Formulation

There are many expressions of the four rules out there. Kent stated them in lots of media, and plenty of other people have liked them and phrased them their own way. So you'll see plenty of descriptions of the rules, but each author has their own twist - as do I.

If you want an authoritative formulation from the man himself, probably your best bet is from the first edition of The White Book (p 57) in the section that outlines the XP practice of Simple Design.

  • Runs all the tests
  • Has no duplicated logic. Be wary of hidden duplication like parallel class hierarchies
  • States every intention important to the programmer
  • Has the fewest possible classes and methods

(Just to be confusing, there's another formulation on page 109 that omits "runs all the tests" and splits "fewest classes" and "fewest methods" over the last two rules. I recall this was an earlier formulation that Kent improved on while writing the White Book.)

2: DRY stands for Don't Repeat Yourself, and comes from The Pragmatic Programmer. SPOT stands for Single Point Of Truth.

3: This principle was the basis of my first design column for IEEE Software.

4: When reviewing this post, Kent said "In the rare case they are in conflict (in tests are the only examples I can recall), empathy wins over some strictly technical metric." I like his point about empathy - it reminds us that when writing code we should always be thinking of the reader.

if you found this article useful, please share it. I appreciate the feedback and encouragement


database · big data


Data Lake is a term that's appeared in this decade to describe an important component of the data analytics pipeline in the world of Big Data. The idea is to have a single store for all of the raw data that anyone in an organization might need to analyze. Commonly people use Hadoop to work on the data in the lake, but the concept is broader than just Hadoop.

When I hear about a single point to pull together all the data an organization wants to analyze, I immediately think of the notion of the data warehouse (and data mart [1]). But there is a vital distinction between the data lake and the data warehouse. The data lake stores raw data, in whatever form the data source provides. There are no assumptions about the schema of the data; each data source can use whatever schema it likes. It's up to the consumers of that data to make sense of it for their own purposes.

This is an important step: many data warehouse initiatives didn't get very far because of schema problems. Data warehouses tend to go with the notion of a single schema for all analytics needs, but I've taken the view that a single unified data model is impractical for anything but the smallest organizations. To model even a slightly complex domain you need multiple BoundedContexts, each with its own data model. In analytics terms, you need each analytics user to use a model that makes sense for the analysis they are doing. By storing raw data only, the data lake puts the responsibility for choosing that model firmly on the data analyst.

Another source of problems for data warehouse initiatives is ensuring data quality. Trying to get an authoritative single source for data requires lots of analysis of how the data is acquired and used by different systems. System A may be good for some data, and system B for another. You run into rules where system A is better for more recent orders but system B is better for orders of a month or more ago, unless returns are involved. On top of this, data quality is often a subjective issue: different analyses have different tolerances for data quality issues, or even different notions of what is good quality.

This leads to a common criticism of the data lake - that it's just a dumping ground for data of widely varying quality, better named a data swamp. The criticism is both valid and irrelevant. The hot title of the New Analytics is "Data Scientist". Although it's a much-abused title, many of these folks do have a solid background in science. And any serious scientist knows all about data quality problems. Consider what you might think is the simple matter of analyzing temperature readings over time. You have to take into account weather stations that are relocated in ways that may subtly affect the readings, anomalies due to problems in equipment, and missing periods when the sensors aren't working. Many of the sophisticated statistical techniques out there were created to sort out data quality problems. Scientists are always skeptical about data quality and are used to dealing with questionable data. So for them the lake is important because they get to work with raw data and can be deliberate about applying techniques to make sense of it, rather than relying on some opaque data cleansing mechanism that probably does more harm than good.

Data warehouses usually would not just cleanse but also aggregate the data into a form that made it easier to analyze. But scientists tend to object to this too, because aggregation implies throwing away data. The data lake should contain all the data because you don't know what people will find valuable, either today or in a couple of years' time.

One of my colleagues illustrated this thinking with a recent example: "We were trying to compare our automated predictive models versus manual forecasts made by the company's contract managers. To do this we decided to train our models on year-old data and compare our predictions to the ones made by managers at that time. Since we now know the correct results, this should be a fair test of accuracy. When we started to do this, it appeared that the managers' predictions were horrible and that even our simple models, made in just two weeks, were crushing them. We suspected that this out-performance was too good to be true. After a lot of testing and digging we discovered that the time stamps associated with those manager predictions were incorrect. They were being modified by some end-of-month processing report. So in short, these values in the data warehouse were useless; we feared that we would have no way of performing this comparison. After more digging we found that these reports had been stored and so we could extract the real forecasts made at that time. (We're crushing them again but it's taken many months to get there)."

The complexity of this raw data means that there is room for something that curates the data into a more manageable structure (as well as reducing the considerable volume of data). The data lake shouldn't be accessed directly very much. Because the data is raw, you need a lot of skill to make any sense of it. You have relatively few people who work in the data lake. As they uncover generally useful views of data in the lake, they can create a number of data marts, each of which has a specific model for a single bounded context. A larger number of downstream users can then treat these lakeshore marts as an authoritative source for that context.
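As a rough illustration of a lakeshore mart, here is a minimal Python sketch that boils raw order records down to a small sales summary. The record fields (`source`, `day`, `region`, `amount_cents`) and the function name `build_sales_mart` are invented for this example; a real lake would hold far messier data and the curation would take far more skill.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw records as they might sit in the lake, one per order.
raw_orders = [
    {"source": "web_shop", "day": date(2015, 2, 1), "region": "EU", "amount_cents": 12000},
    {"source": "web_shop", "day": date(2015, 2, 1), "region": "EU", "amount_cents": 3000},
    {"source": "call_center", "day": date(2015, 2, 1), "region": "US", "amount_cents": 5000},
    {"source": "web_shop", "day": date(2015, 2, 2), "region": "EU", "amount_cents": 7000},
]


def build_sales_mart(records):
    """Aggregate raw orders into total sales per (day, region).

    The result is much smaller than the input and has one fixed model,
    which is the point of a mart. The raw records stay in the lake.
    """
    totals = defaultdict(int)
    for record in records:
        totals[(record["day"], record["region"])] += record["amount_cents"]
    return dict(totals)


sales_mart = build_sales_mart(raw_orders)
```

Downstream users would query `sales_mart` rather than the raw records, while anyone who needs detail the summary threw away can still go back to the lake.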

So far I've described the data lake as a singular point for integrating data across an enterprise, but I should mention that isn't how it was originally intended. The term was coined by James Dixon in 2010. When he did that, he intended a data lake to be used for a single data source; multiple data sources would instead form a "water garden". Despite its original formulation, the prevalent usage now is to treat a data lake as combining many sources. [2]

You should use a data lake for analytic purposes, not for collaboration between operational systems. When operational systems collaborate they should do this through services designed for the purpose, such as RESTful HTTP calls, or asynchronous messaging. The lake is too complex to trawl for operational communication. It may be that analysis of the lake can lead to new operational communication routes, but these should be built directly rather than through the lake.

It is important that all data put in the lake should have a clear provenance in place and time. Every data item should have a clear trace to what system it came from and when the data was produced. The data lake thus contains a historical record. This might come from feeding Domain Events into the lake, a natural fit with Event Sourced systems. But it could also come from systems doing a regular dump of current state into the lake - an approach that's valuable when the source system doesn't have any temporal capabilities but you want a temporal analysis of its data. A consequence of this is that data put into the lake is immutable: an observation once stated cannot be removed (although it may be refuted later). You should also expect ContradictoryObservations.
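To make the provenance and immutability points concrete, here is a small Python sketch of an append-only store. The `Observation` and `Lake` classes and their fields are my own illustrative names, not part of any data lake product, and an in-memory list stands in for real storage.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Mapping


@dataclass(frozen=True)
class Observation:
    """One raw item in the lake, tagged with where and when it came from."""
    source_system: str           # which system produced the data
    produced_at: datetime        # when the source produced it
    payload: Mapping[str, Any]   # raw data, in whatever shape the source uses


class Lake:
    """Append-only store: observations are added, never updated or deleted."""

    def __init__(self) -> None:
        self._items = []

    def append(self, observation: Observation) -> None:
        self._items.append(observation)

    def about(self, key: str, value: Any) -> list:
        """All observations whose payload has key == value, oldest first."""
        hits = [o for o in self._items if o.payload.get(key) == value]
        return sorted(hits, key=lambda o: o.produced_at)


lake = Lake()
lake.append(Observation("billing", datetime(2015, 1, 5, tzinfo=timezone.utc),
                        {"order_id": 42, "status": "paid"}))
# A later observation from another system contradicts the first.
# Both stay in the lake; nothing is overwritten.
lake.append(Observation("warehouse", datetime(2015, 1, 6, tzinfo=timezone.utc),
                        {"order_id": 42, "status": "returned"}))

history = lake.about("order_id", 42)
```

`frozen=True` only stops fields being reassigned; the payload dictionary itself could still be mutated, so a real implementation would need a stronger guarantee. The consumer sees both observations in `history` and has to decide how to reconcile them.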

The data lake is schemaless: it's up to the source systems to decide what schema to use and for consumers to work out how to deal with the resulting chaos. Furthermore the source systems are free to change their inflow data schemas at will, and again the consumers have to cope. Obviously we prefer such changes to be as minimally disruptive as possible, but scientists prefer messy data to losing data.
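Coping with shifting schemas usually means the consumer reads defensively. The Python sketch below assumes a source that first sent a flat `email` field and later moved it under `contact`; both shapes, and the function name `customer_email`, are invented for illustration.

```python
from collections.abc import Mapping
from typing import Any, Optional


def customer_email(record: Mapping) -> Optional[str]:
    """Pull an email out of a raw record, whichever shape the source used."""
    # Hypothetical older shape: a flat "email" field.
    if "email" in record:
        return record["email"]
    # Hypothetical newer shape: nested under "contact".
    contact: Any = record.get("contact")
    if isinstance(contact, Mapping):
        return contact.get("email")
    # Unknown shape: report the gap, but the raw record is still in the lake.
    return None


old_style = {"id": 1, "email": "ann@example.com"}
new_style = {"id": 2, "contact": {"email": "bob@example.com", "phone": None}}
unknown = {"id": 3, "mail_address": "cy@example.com"}

emails = [customer_email(r) for r in (old_style, new_style, unknown)]
```

The third record yields `None` rather than an error. Since nothing was thrown away on the way into the lake, the consumer can extend the function later and re-read the same data.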

Data lakes are going to be very large, and much of the storage is oriented around the notion of a large schemaless structure - which is why Hadoop and HDFS are usually the technologies people use for data lakes. One of the vital tasks of the lakeshore marts is to reduce the amount of data you need to deal with, so that big data analytics doesn't have to deal with large amounts of data.

The Data Lake's appetite for a deluge of raw data raises awkward questions about privacy and security. The principle of Datensparsamkeit is very much in tension with the data scientists' desire to capture all data now. A data lake makes a tempting target for crackers, who might love to siphon choice bits into the public oceans. Restricting direct lake access to a small data science group may reduce this threat, but doesn't avoid the question of how that group is kept accountable for the privacy of the data they sail on.


1: The usual distinction is that a data mart is for a single department in an organization, while a data warehouse integrates across all departments. Opinions differ on whether a data warehouse should be the union of all data marts or whether a data mart is a logical subset (view) of data in the data warehouse.

2: In a later blog post, Dixon emphasizes the lake versus water garden distinction, but (in the comments) says that it is a minor change. For me the key point is that the lake stores a large body of data in its natural state, the number of feeder streams isn't a big deal.


My thanks to Anand Krishnaswamy, Danilo Sato, David Johnston, Derek Hammer, Duncan Cragg, Jonny Leroy, Ken Collier, Shripad Agashe, and Steven Lowe for discussing drafts of this post on our internal mailing lists




I've often been involved in discussions about deliberately increasing the diversity of a group of people. The most common case in software is increasing the proportion of women. Two examples are in hiring and conference speaker rosters where we discuss trying to get the proportion of women to some level that's higher than usual. A common argument against pushing for greater diversity is that it will lower standards, raising the spectre of a diverse but mediocre group.

To understand why this is an illusionary concern, I like to consider a little thought experiment. Imagine a giant bucket that contains a hundred thousand marbles. You know that 10% of these marbles have a special sparkle that you can see when you carefully examine them. You also know that 80% of these marbles are blue and 20% pink, and that sparkles exist evenly across both colors [1]. If you were asked to pick out ten sparkly marbles, you know you could confidently go through some and pick them out. So now imagine you're told to pick out ten marbles such that five were blue and five were pink.

I don't think you would react by saying “that's impossible”. After all, there are two thousand pink sparkly marbles in there; getting five of them is not beyond the wit of even a man. Similarly in software, there may be fewer women in the software business, but there are still enough good women to fit the roles a company or a conference needs.

The point of the marbles analogy, however, is to focus on the real consequence of the demand for 50:50 split. Yes it's possible to find the appropriate marbles, but the downside is that it takes longer. [2]
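The arithmetic behind "it takes longer" can be worked through in a few lines of Python. This is only a back-of-envelope sketch: it assumes marbles are examined at random and treats each one as an independent draw, which is a fair approximation for a bucket this large. The function name `expected_examined` is mine.

```python
TOTAL = 100_000
SPARKLY_PCT = 10   # 10% of marbles sparkle, evenly across colors
PINK_PCT = 20      # 20% of marbles are pink, the rest blue

sparkly = TOTAL * SPARKLY_PCT // 100                          # all sparkly marbles
pink_sparkly = TOTAL * PINK_PCT // 100 * SPARKLY_PCT // 100   # pink and sparkly
blue_sparkly = sparkly - pink_sparkly                         # blue and sparkly


def expected_examined(wanted, matching):
    """Average number of marbles examined to find `wanted` marbles of a
    kind that has `matching` members in the bucket (independent draws)."""
    return wanted * TOTAL / matching


any_ten = expected_examined(10, sparkly)          # ten sparkly, any color
five_pink = expected_examined(5, pink_sparkly)    # the five pink ones alone
```

Under these assumptions, ten sparkly marbles of any color take about 100 examinations on average, while the five pink sparkly ones alone take about 250. The 50:50 search therefore needs at least two and a half times as much digging, but it never becomes impossible.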

That notion applies to finding the right people too. Getting a better than base proportion of women isn't impossible, but it does require more work, often much more work. This extra effort reinforces the rarity: if people have difficulty finding good people as it is, it takes determined effort to spend the extra time to get a higher proportion of the minority group, even if you are only trying to raise the proportion of women up to 30%, rather than a full 50%.

In recent years we've made increasing our diversity a high priority at ThoughtWorks. This has led to a lot of effort trying to go to where we are more likely to run into the talented women we are seeking: women's colleges, women-in-IT groups and conferences. We encourage our women to speak at conferences, which helps let other women know we value a diverse workforce.

When interviewing, we make a point of ensuring there are women involved. This gives women candidates someone to relate to, and someone to ask questions which are often difficult to ask men. It's also vital to have women interview men, since we've found that women often spot problematic behaviors that men miss, because we just don't have the experience of subtle discrimination. Getting a diverse group of people inside the company isn't just a matter of recruiting. It also means paying a lot of attention to the environment we have, to try to ensure we don't have the same AlienatingAtmosphere that much of the industry exhibits. [3]

One argument I've heard against this approach is that if everyone did this, then we would run out of pink, sparkly marbles. We'll know this is something to be worried about when women are paid significantly more than men for the same work.

One anecdote that stuck in my memory was from a large, traditional company who wanted to improve the number of women in senior management positions. They didn't impose a quota on appointing women to those positions, but they did impose a quota for women on the list of candidates. (Something like: "there must be at least three credible women candidates for each post".) This candidate quota forced the company to actively seek out women candidates. The interesting point was that just doing this, with no mandate to actually appoint these women, correlated with an increased proportion of women in those positions.

For conference planning it's a similar strategy: just putting out a call for papers and saying you'd like a diverse speaker lineup isn't enough. Neither are such things as blind review of proposals (and I'm not sure that's a good idea anyway). The important thing is to seek out women and encourage them to submit ideas. Organizing conferences is hard enough work as it is, so I can sympathize with those that don't want to add to the workload, but those that do can get there. FlowCon is a good example of a conference that made this an explicit goal and did far better than the industry average (and in case you were wondering, there was no difference between men's and women's evaluation scores).

So now that we recognize that getting greater diversity is a matter of application and effort, we can ask ourselves whether the benefit is worth the cost. In a broad professional sense, I've argued that it is, because our DiversityImbalance is reducing our ability to bring the talent we need into our profession, and reducing the influence our profession needs to have on society. In addition I believe there is a moral argument to push back against long-standing wrongs faced by HistoricallyDiscriminatedAgainst groups.

Conferences have an important role to play in correcting this imbalance. The roster of speakers is, at least subconsciously, a statement of what the profession should look like. If it's all white guys like me, then that adds to the AlienatingAtmosphere that pushes women out of the profession. Therefore I believe that conferences need to strive to get an increased proportion of historically-discriminated-against speakers. We, as a profession, need to push them to do this. It also means that women have an extra burden to become visible and act as part of that better direction for us. [4]

For companies, the choice is more personal. For me, ThoughtWorks's efforts to improve its diversity are a major factor in why I've been an employee here for over a decade. I don't think it's a coincidence that ThoughtWorks is also a company that has a greater open-mindedness, and a lack of political maneuvering, than most of the companies I've consulted with over the years. I consider those attributes to be a considerable competitive advantage in attracting talented people, and providing an environment where we can collaborate effectively to do our work.

But I'm not holding ThoughtWorks up as an example of perfection. We've made a lot of progress over the decade I've been here, but we still have a long way to go. In particular we are very short of senior technical women. We've introduced a number of programs around networks, and leadership development, to help grow women to fill those gaps. But these things take time - all you have to do is look at our Technical Advisory Board to see that we are a long way from the ratio we seek.

Despite my knowledge of how far we still have to climb, I can glimpse the summit ahead. At a recent AwayDay in Atlanta I was delighted to see how many younger technical women we've managed to bring into the company. While struggling to keep my head above water as the sole male during a late night game of Dominion, I enjoyed a great feeling of hope for our future.


1: That is 10% of blue marbles are sparkly as are 10% of pink.

2: Actually, if I dig around for a while in that bucket, I find that some marbles are neither blue nor pink, but some engaging mixture of the two.

3: This is especially tricky for a company like us, where so much of our work is done in client environments, where we aren't able to exert as much of an influence as we'd like. Some of our offices have put together special training to educate both sexes on how to deal with sexist situations with clients. As a man, I feel it's important for me to know how I can be supportive. It's not something I do well, but it is something I want to learn to improve.

4: Many people find the pressure of public speaking intimidating (I've come to hate it, even with all my practice). Feeling that you're representing your entire gender or race only makes it worse.


Camila Tartari, Carol Cintra, Dani Schufeldt, Derek Hammer, Isabella Degen, Korny Sietsma, Lindy Stephens, Mridula Jayaraman, Nikki Appleby, Rebecca Parsons, Sarah Taraporewalla, Stefanie Tinder, and Suzi Edwards-Alexander commented on drafts of this article.