Bottleneck #05: Resilience and Observability

Service disruptions; production incidents reduce reputation and revenue

22 August 2023

Punit Lad is a Technical Lead and Infrastructure Engineer at Thoughtworks. He has led teams and advised clients at organizations spanning a range of industries, scales, and maturity levels. As a thought partner, he has collaborated with clients to shape vision and strategy, and as a developer at heart, he enjoys realizing their strategic goals by implementing solutions and building products and platforms. Most recently, Punit has developed a keen interest in the Developer Experience and applying product thinking to enhance the effectiveness and impact of development teams.

Carl Nygard

Carl Nygard is a Technical Principal at Thoughtworks. Carl has over 20 years of experience leading teams from start-ups to large enterprises building solutions for GIS/remote sensing, supply chain, real-time controls, online education, retail, and government. He works with organizations to develop technology strategy to achieve business outcomes through optimized software delivery practices.

This article is part of the series: Bottlenecks of Scaleups

How did you get into the bottleneck?
Signs you are approaching a scaling bottleneck
How do you get out of the bottleneck?
Summary

Availability is the most important feature

-- Mike Fisher, former CTO of Etsy

“I get knocked down, but I get up again…”

-- Tubthumping, Chumbawamba

Every organization pays attention to resilience. The big question is when.

Startups tend to only address resilience when their systems are already down, often taking a very reactive approach. For a scaleup, excessive system downtime represents a significant bottleneck to the organization, both from the effort expended on restoring function and also from the impact of customer dissatisfaction.

To move past this, resilience needs to be built into the business objectives, which will influence the architecture, design, product management, and even governance of business systems. In this article, we’ll explore the Resilience and Observability Bottleneck: how you can recognize it coming, how you might realize it has already arrived, and what you can do to survive the bottleneck.

How did you get into the bottleneck?

One of the first goals of a startup is getting an initial product out to market. Getting it in front of as many users as possible and receiving feedback from them is typically the highest priority. If customers use your product and see the unique value it delivers, your startup will carve out market share and have a dependable revenue stream. However, getting there often comes at a cost to the resilience of your product.

A startup may decide to skip automating recovery processes, because at a small scale, the organization believes it can provide resilience through the developers who know the system well. Incidents are handled reactively, and resolutions come by hand. Possible solutions might be spinning up another instance to handle increased load, or restarting a service when it’s failing. Your first customers might even be aware of your lack of true resilience as they experience system outages.

At one of our scaleup engagements, to get the system out to production quickly, the client deprioritized health check mechanisms in the cluster. The developers managed the startup process successfully for the few times when it was necessary. For an important demo, it was decided to spin up a new cluster so that there would be no externalities impacting the system performance. Unfortunately, actively managing the status of all the services running in the cluster was overlooked. The demo started before the system was fully operational and an important component of the system failed in front of prospective customers.

Fundamentally, your organization has made an explicit trade-off prioritizing user-facing functionality over automating resilience, gambling that the organization can recover from downtime through manual intervention. The trade-off is likely acceptable as a startup while it’s at a manageable scale. However, as you experience high growth rates and transform from a startup to a scaleup, the lack of resilience proves to be a scaling bottleneck. Service interruptions become more frequent, which means more work on the Ops side of the DevOps team’s responsibilities and lower productivity across teams. The impact seems to appear suddenly, because the effect tends to be non-linear relative to the growth of the customer base. What was recently manageable is suddenly extremely impactful. Eventually, the scale of the system creates manual work beyond the capacity of your team, which bubbles up to affect the customer experience. The combination of reduced productivity and customer dissatisfaction leads to a bottleneck that is hard to survive.

The question then is, how do I know if my product is about to hit a scaling bottleneck? And further, if I know about those signs, how can I avoid or keep pace with my scale? That is what we’ll look to answer as we describe common challenges we’ve experienced with our clients and the solutions we have seen to be most effective.

Signs you are approaching a scaling bottleneck

It's always difficult to operate in an environment in which the scale of the business is changing rapidly. Investing in handling high traffic volumes too early is a waste of resources. Investing too late means your customers are already feeling the effects of the scaling bottleneck.

To shift your operating model from reactive to proactive, you have to be able to predict future behavior with a confidence level sufficient to support important business decisions. Making data-driven decisions is always the goal. The key is to find the leading indicators which will guide you to prepare for, and hopefully avoid, the bottleneck, rather than react to a bottleneck that has already occurred. Based on our experience, we have found a set of indicators related to the common preconditions as you approach this bottleneck.

Resilience is not a first class consideration

This may be the least obvious sign, but is arguably the most important. Resilience is thought of as purely a technical problem and not a feature of the product. It’s deprioritized for new features and enhancements. In some cases, it’s not even a concern to be prioritized.

Here’s a quick test. Listen in on the different discussions that occur within your teams, and note the context in which resilience is discussed. You may find that it isn’t included as part of a standup, but it does make its way into a developer meeting. When the development team isn’t responsible for operations, resilience is effectively siloed away. In those cases, pay close attention to how resilience is discussed.

Evidence of inadequate focus on resilience is often indirect. At one client, we’ve seen it come in the form of technical debt cards that not only aren’t prioritized, but become a constantly growing list. At another client, the operations team had their backlog filled purely with customer incidents, the majority of which dealt with the system either not being up or being unable to process requests. When resilience concerns are not part of a team’s backlog and roadmap, you’ll have evidence that it is not core to the product.

Solving resilience by hand (reactive manual resilience)

How your organization resolves service outages can be a key indicator of whether your product can scale up effectively or not. The characteristics we describe here are fundamentally caused by a lack of automation, resulting in excessive manual effort. Are service outages resolved via restarts by developers? Under high load, is there coordination required to scale compute instances?

In general, we find these approaches don’t follow sustainable operational practices and are brittle solutions for the next system outage. They include bandaid solutions which alleviate a symptom, but never truly solve the underlying problem in a way that allows for future resilience.

Ownership of systems is not well defined

When your organization is moving quickly, developing new services and capabilities, quite often key pieces of the service ecosystem, or even the infrastructure, can become “orphaned”, with no clear responsibility for operations. As a result, production issues may remain unnoticed until customers react. When issues do occur, troubleshooting takes longer, and resolution is delayed further as the issue ping-pongs from team to team in search of the responsible party, wasting everyone’s time.

This problem is not unique to microservice environments. At one engagement, we witnessed similar situations with a monolith architecture lacking clear ownership for parts of the system. In this case, clarity of ownership issues stemmed from a lack of clear system boundaries in a “ball of mud” monolith.

Ignoring the reality of distributed systems

Part of developing effective systems is being able to define and use abstractions that enable us to simplify a complex system to the point that it actually fits in the developer’s head. This allows developers to make decisions about the future changes necessary to deliver new value and functionality to the business. However, as in all things, one can go too far, not realizing that these simplifications are actually assumptions hiding critical constraints which impact the system. Riffing off the fallacies of distributed computing:

The network is not reliable.
Your system is affected by the speed of light. Latency is never zero.
Bandwidth is finite.
The network is not inherently secure.
Topology always changes, by design.
The network and your systems are heterogeneous. Different systems behave differently under load.
Your virtual machine will disappear when you least expect it, at exactly the wrong time.
Because people have access to a keyboard and mouse, mistakes will happen.
Your customers can (and will) take their next action in < 500ms.

Very often, testing environments provide perfect world conditions, which avoids violating these assumptions. Systems which don’t account for (and test for) these real-world properties are designed for a world in which nothing bad ever happens. As a result, your system will exhibit unanticipated and seemingly non-deterministic behavior as the system starts to violate the hidden assumptions. This translates into poor performance for customers, and incredibly difficult troubleshooting processes.
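One way to stop a test environment from hiding these properties is to inject failures on purpose. The following is a minimal Python sketch of that idea, not a production tool: the names `flaky`, `call_with_retries` and `fetch_price`, the 30% failure rate and the retry count are all hypothetical choices made for illustration.

```python
import random


class InjectedNetworkError(Exception):
    """Raised by the wrapper to simulate an unreliable network."""


def flaky(func, failure_rate, rng):
    """Wrap func so that a fraction of calls fail, as a real network would."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedNetworkError("simulated network failure")
        return func(*args, **kwargs)
    return wrapper


def call_with_retries(func, attempts):
    """Call func up to `attempts` times, re-raising the last injected error."""
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except InjectedNetworkError as err:
            last_error = err
    raise last_error


def fetch_price():
    # Hypothetical stand-in for a call to a remote dependency.
    return 42


# A seeded random generator keeps the test run reproducible.
rng = random.Random(7)
unreliable_fetch = flaky(fetch_price, failure_rate=0.3, rng=rng)

results = []
for _ in range(100):
    try:
        results.append(call_with_retries(unreliable_fetch, attempts=3))
    except InjectedNetworkError:
        results.append(None)

failed = results.count(None)
print(f"{failed} of 100 requests failed even after 3 attempts")
```

Running the same scenario with different failure rates gives a rough feel for how much of the unreliability reaches the customer, before production traffic reveals it.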

Not planning for potential traffic

Estimating future traffic volume is difficult, and we find that we are wrong more often than we are right. Over-estimating traffic means the organization is wasting effort designing for a reality that doesn’t exist. Under-estimating traffic could be even more catastrophic. Unexpected high traffic loads could happen for a variety of reasons, and a social media marketing campaign which unexpectedly goes viral is a good example. Suddenly your system can’t manage the incoming traffic, components start to fall over, and everything grinds to a halt.

As a startup, you’re always looking to attract new customers and gain additional market share. How and when that manifests can be incredibly difficult to predict. At the scale of the internet, anything could happen, and you should assume that it will.

Alerted via customer notifications

When customers are invested in your product and believe the issue is resolvable, they might try to contact your support staff for help. That may be through email, calling in, or opening a support ticket. Service failures cause spikes in call volume or email traffic. Your sales people may even be relaying these messages because (potential) customers are telling them as well. And if service outages affect strategic customers, your CEO might tell you directly (this may be okay early on, but it’s certainly not a state you want to be in long term).

Customer communications will not always be clear and straightforward, but rather will be based on a customer's unique experience. If customer success staff do not realize that these are indications of resilience problems, they will proceed with business as usual and your engineering staff will not receive the feedback. When these signals aren’t identified and managed correctly, notifications may then turn non-verbal. For example, you may suddenly find the rate at which customers are canceling subscriptions increases.

When working with a small customer base, knowing about a problem through your customers is “mostly” manageable, as they are fairly forgiving (they are on this journey with you after all). However, as your customer base grows, notifications will begin to pile up towards an unmanageable state.

Figure 1: Communication patterns as seen in an organization where customer notifications are not managed well.

How do you get out of the bottleneck?

Once you have an outage, you want to recover as quickly as possible and understand in detail why it happened, so you can improve your system and ensure it never happens again.

Tackling the resilience of your products and services while in the bottleneck can be difficult. Tactical solutions often mean you end up stuck in fire after fire. However, if it’s managed strategically, even while in the bottleneck, not only can you relieve the pressure on your teams, but you can learn from past recovery efforts to help manage through the hypergrowth stage and beyond.

The following five sections are effectively strategies your organization can implement. We believe they flow in order and should be taken as a whole. However, depending on your organization's maturity, you may decide to leverage a subset of strategies. Within each, we lay out several solutions that work towards its respective strategy.

Ensure you have implemented basic resilience techniques

There are some basic techniques, ranging from architecture to organization, that can improve your resiliency. They keep your product in the right place, enabling your organization to scale effectively.

Use multiple zones within a region

For highly critical services (and their data), configure and enable them to run across multiple zones. This should give a bump to your system availability, and increase your resiliency in the case of disruption (within a zone).

Specify appropriate computing instance types and specifications

Business critical services should have computing capacity appropriately assigned to them. If services are required to run 24/7, your infrastructure should reflect those requirements.

Match investment to critical service tiers

Many organizations manage investment by identifying critical service tiers, with the understanding that not all business systems share the same importance in terms of delivering customer experience and supporting revenue. Identifying service tiers and associated resilience outcomes informed by service level agreements (SLAs), paired with architecture and design patterns that support the outcomes, provides helpful guardrails and governance for your product development teams.

Clearly define owners across your entire system

Each service that exists within your system should have well-defined owners. This information can be used to help direct issues to the right place, and to people who can effectively resolve them. Implementing a developer portal which provides a software services catalog with clearly defined team ownership helps with internal communication patterns.

Automate manual resilience processes (within a timebox)

Certain resilience problems that have been solved by hand can be automated: actions like restarting a service, adding new instances or restoring database backups. Many actions are easily automated or simply require a configuration change within your cloud service provider. While in the bottleneck, implementing these capabilities can give the team the relief it needs, providing much needed breathing room and time to solve the root cause(s).

Make sure to keep these implementations at their simplest and timeboxed (couple of days at max). Bear in mind these started out as bandaids, and automating them is just another (albeit better) type of bandaid. Integrate these into your monitoring solution, allowing you to remain aware of how frequently your system is automatically recovering and how long it takes. At the same time, these metrics allow you to prioritize moving away from reliance on these bandaid solutions and make your whole system more robust.
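As a rough illustration of an automated restart that also reports how often and how long recovery takes, consider the Python sketch below. Everything in it is an assumption made for the example: `check_and_recover`, `RecoveryMetrics` and `FakeService` are hypothetical names, and a real implementation would call your own health check and restart mechanism and publish the numbers to your monitoring solution.

```python
import time
from dataclasses import dataclass, field


@dataclass
class RecoveryMetrics:
    """Records how long each automatic recovery took."""
    durations_seconds: list = field(default_factory=list)

    @property
    def recovery_count(self):
        return len(self.durations_seconds)


def check_and_recover(is_healthy, restart, metrics, clock=time.monotonic):
    """Restart the service if its health check fails and record the recovery.

    Returns True if a restart was triggered, False if the service was healthy.
    """
    if is_healthy():
        return False
    started = clock()
    restart()
    metrics.durations_seconds.append(clock() - started)
    return True


class FakeService:
    """Hypothetical in-memory service used only to exercise the sketch."""

    def __init__(self):
        self.up = False

    def is_healthy(self):
        return self.up

    def restart(self):
        self.up = True


service = FakeService()
recovery_metrics = RecoveryMetrics()
check_and_recover(service.is_healthy, service.restart, recovery_metrics)  # restarts
check_and_recover(service.is_healthy, service.restart, recovery_metrics)  # no-op
print(f"automatic recoveries so far: {recovery_metrics.recovery_count}")
```

A rising recovery count is the signal that the bandaid is being relied on too heavily and the root cause deserves priority.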

Improve mean time to restore with observability and monitoring

To work your way out of a bottleneck, you need to understand your current state so you can make effective decisions about where to invest. If you want to be 5 nines, but have no sense of how many nines are actually currently provided, then it’s hard to even know what path you should be taking.
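To make the "nines" concrete, the arithmetic can be sketched in a few lines of Python. The function names and the 12 hours of recorded downtime are hypothetical. The only facts used are that an availability of N nines leaves an unavailability of 10^-N, and that a non-leap year has 525,600 minutes.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def allowed_downtime_minutes(nines):
    """Yearly downtime budget for an availability target of `nines` nines."""
    unavailability = 10 ** -nines  # for example, 3 nines -> 0.001
    return MINUTES_PER_YEAR * unavailability


def measured_availability(downtime_minutes, period_minutes=MINUTES_PER_YEAR):
    """Fraction of the period during which the service was up."""
    return 1 - downtime_minutes / period_minutes


for nines in (2, 3, 4, 5):
    print(f"{nines} nines: {allowed_downtime_minutes(nines):.1f} minutes/year")

# Hypothetical figure: 12 hours of recorded downtime over the past year.
print(f"measured: {measured_availability(12 * 60):.4%}")
```

Five nines leaves a budget of roughly five minutes of downtime per year, so a team that cannot yet measure its downtime has no way of knowing how far it is from that target.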

To know where you are, you need to invest in observability. Observability allows you to be more proactive in timing investment in resilience before it becomes unmanageable.

Centralize your logs to be viewable through a single interface

Aggregate logs from core services and systems so they are available through a central interface. This makes them easily accessible to more people and reduces troubleshooting effort (potentially improving mean time to recovery).

Define a clear structured format for log messages

Anyone who’s had to parse through aggregated log messages can tell you that when multiple services follow differing log structures it’s an incredible mess to find anything. Every service just ends up speaking its own language, and only the original authors understand the logs. Ideally, once those logs are aggregated, anyone from developers to support teams should be able to understand the logs, no matter their origin.

Structure the log messages using an organization-wide standardized format. Most logging tools support a JSON format as a standard, which enables the log message structure to contain metadata like timestamp, severity, service and/or correlation-id. And with log management services (through an observability platform), one can filter and search across these properties to help debug bottleneck issues. To help make search more efficient, prefer fewer log messages with more fields containing pertinent information over many messages with a small number of fields. The actual messages themselves may still be unique to a specific service, but the attributes associated with the log message are helpful to everyone.
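As an illustration, a shared JSON log format could be sketched with Python's standard `logging` module as follows. The field names mirror the metadata mentioned above. The `JsonFormatter` and `build_logger` helpers, the "checkout" service name and the correlation id are hypothetical, and your logging library or observability platform may already provide an equivalent.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with shared fields."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "severity": record.levelname,
            "service": self.service,
            # Supplied via `extra=`; None when the caller did not pass one.
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)


def build_logger(service, stream=None):
    """Create a logger for `service` that writes JSON lines to `stream`."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)  # defaults to stderr
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    return logger


# "checkout" and the correlation id are hypothetical example values.
log = build_logger("checkout")
log.info("payment approved", extra={"correlation_id": "req-1234"})
```

Because every service emits the same keys, a search for one correlation id across the aggregated logs returns the whole path of a request, whichever team wrote the services involved.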

Treat your log messages as a key piece of information that is visible to more than just the developers that wrote them. Your support team can become more effective when debugging initial customer queries, because they can understand the structure they are viewing. If every service can speak the same language, the barrier to provide support and debugging assistance is removed.

Add observability that’s close to your customer experience

What gets measured gets managed.

-- Peter Drucker

Though infrastructure metrics and service message logs are useful, they are fairly low level and don’t provide any context of the actual customer experience. On the other hand, customer notifications are a direct indication of an issue, but they are usually anecdotal and don’t provide much in terms of pattern (unless you put in the work to find one).

Should I also implement tracing?

Tracing can help you dig into the “why” of certain metrics. When implemented effectively, tracing can provide a lot more detail than logging, and can even replace it entirely in some cases. It can be an effective additional observability avenue for teams to use. However, it can be fairly expensive to implement and is more valuable in a microservice implementation. Every service will need to inject tracing into its flows, and can only relay information about a particular request or flow, providing mostly depth, not breadth. For an immediate impact at a lower cost, place a higher investment into centralized logging, metrics and monitoring before leveraging tracing.

Monitoring core business metrics enables teams to observe a customer’s experience. Typically defined through the product’s requirements and features, they provide high level context around many customer experiences. These are metrics like completed transactions, start and stop rate of a video, API usage or response time metrics. Implicit metrics are also useful in measuring a customer’s experiences, like frontend load time or search response time. It's crucial to match what is being observed directly to how a customer is experiencing your product. Also important to note, metrics aligned to the customer experience become even more important in a B2B environment, where you might not have the volume of data points necessary to be aware of customer issues when only measuring individual components of a system.

At one client, services started to publish domain events that were related to the product experience: events like added to cart, failed to add to cart, transaction completed, payment approved, etc. These events could then be picked up by an observability platform (like Splunk, ELK or Datadog) and displayed on a dashboard, categorized and analyzed even further. Errors could be captured and categorized, allowing better problem solving on errors related to unexpected customer experience.
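To show how such domain events turn into a customer-facing metric, here is a minimal in-memory sketch in Python. It is not the client's implementation: the `BusinessMetrics` class, the event names and the counts are hypothetical, and in practice the counting and charting would be done by the observability platform.

```python
from collections import Counter


class BusinessMetrics:
    """Counts domain events so that customer-facing rates can be derived."""

    def __init__(self):
        self.counts = Counter()

    def record(self, event_name):
        self.counts[event_name] += 1

    def failure_rate(self, success_event, failure_event):
        """Share of attempts that failed, or None if there were no attempts."""
        attempts = self.counts[success_event] + self.counts[failure_event]
        if attempts == 0:
            return None
        return self.counts[failure_event] / attempts


business_metrics = BusinessMetrics()
# Hypothetical stream of events published by services.
for event in ["added_to_cart"] * 47 + ["failed_to_add_to_cart"] * 3:
    business_metrics.record(event)

rate = business_metrics.failure_rate("added_to_cart", "failed_to_add_to_cart")
print(f"add-to-cart failure rate: {rate:.1%}")
```

A rate like this describes what customers experienced, which is something CPU or memory graphs cannot tell you on their own.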

Figure 2: Example of what a dashboard focusing on the user experience could look like

Data gathered through core business metrics can help you understand not only what might be failing, but where your system thresholds are and how the system behaves when it’s pushed beyond them. This gives further insight into how you might get through the bottleneck.

Provide product status insight to customers using status indicators

It can be difficult to manage incoming customer inquiries about the different issues they are facing, with support services quickly finding they are fighting fire after fire. Managing issue volume can be crucial to a startup's success, but within the bottleneck, you need to look for systemic ways of reducing that traffic. The ability to divert call traffic away from support will give some breathing room and a better chance to solve the right problem.

Service status indicators can provide customers the information they are seeking without having to reach out to support. This could come in the form of public dashboards, email messages, or even tweets. These can leverage backend service health and readiness checks, or a combination of metrics to determine service availability, degradation, and outages. During times of incidents, status indicators can provide a way of updating many customers at once about your product’s status.
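One possible way to derive a public status from backend health checks is sketched below in Python. The three status levels follow the availability, degradation and outage distinction above. The rule that a failing critical component means an outage, along with the component names, is an assumption made for the example.

```python
from enum import Enum


class Status(Enum):
    OPERATIONAL = "operational"
    DEGRADED = "degraded"
    OUTAGE = "outage"


def overall_status(checks, critical):
    """Combine per-component health results into one public status.

    checks:   mapping of component name -> True if its health check passed
    critical: names of components the product cannot function without
    """
    failing = {name for name, healthy in checks.items() if not healthy}
    if failing & set(critical):
        return Status.OUTAGE
    if failing:
        return Status.DEGRADED
    return Status.OPERATIONAL


# Hypothetical components and health check results.
checks = {"api": True, "payments": True, "search": False}
print(overall_status(checks, critical={"api", "payments"}).value)
```

Here the failing search component is not critical, so customers would see a degraded status rather than an outage.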

Building trust with your customers is just as important as creating a reliable and resilient service. Providing methods for customers to understand the services' status and expected resolution timeframe helps build confidence through transparency, while also giving the support staff the space to problem-solve.

Figure 3: Communication patterns within an organization that proactively manages how customers are notified.

Shift to explicit resilience business requirements

As a startup, new features are often considered more valuable than technical debt, including any work related to resilience. And as stated before, this certainly made sense initially. New features and enhancements help keep customers and bring in new ones. The work to provide new capabilities should, in theory, lead to an increase in revenue.

This doesn’t necessarily hold true as your organization grows and discovers new challenges to increasing revenue. Failures of resilience are one source of such challenges. To move beyond this, there needs to be a shift in how you value the resilience of your product.

Understand the costs of service failure

For a startup, the consequences of not hitting a revenue target this quarter might be different than for a scaleup or a mature product. But as often happens, the initial “new features are more valuable than technical debt” decision becomes a permanent fixture in the organizational culture, whether or not the actual revenue impact is provable, or even calculated. An aspect of the maturity needed when moving from startup to scaleup is in the data-driven element of decision-making. Is the organization tracking the value of every new feature shipped? Is the organization analyzing operational investments as contributing to new revenue rather than just as a cost center? And are the costs of an outage, or of recurring outages, known both in terms of wasted internal labor hours and lost revenue? As a startup, in most of these regards, you've got nothing to lose. But this is not true as you grow.

Therefore, it’s important to start analyzing the costs of service failures as part of your overall product management and revenue recognition value stream. Understanding your revenue “velocity” will provide an easy way to quantify the direct cost-per-minute of downtime. Tracking the costs to the team for everyone involved in an outage incident, from customer support calls to developers to management to public relations/marketing and even to sales, can be an eye-opening experience. Add on the opportunity costs of dealing with an outage rather than expanding customer outreach or delivering new features and the true scope and impact of failures in resilience become apparent.
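A back-of-the-envelope version of this calculation might look like the Python sketch below. All inputs are hypothetical, and the model is deliberately crude: it assumes revenue accrues evenly over the year and it leaves out opportunity costs and churn entirely.

```python
def outage_cost(annual_revenue, outage_minutes, people_involved, hourly_rate):
    """Rough direct cost of one outage: lost revenue plus internal labor.

    Assumes revenue accrues evenly through the year, which will be wrong
    for businesses with strong daily or seasonal peaks.
    """
    minutes_per_year = 365 * 24 * 60
    revenue_per_minute = annual_revenue / minutes_per_year
    lost_revenue = revenue_per_minute * outage_minutes
    labor_cost = people_involved * hourly_rate * (outage_minutes / 60)
    return lost_revenue + labor_cost


# Hypothetical inputs: $10M annual revenue, a 90-minute outage,
# 8 people pulled in at a blended rate of $100 per hour.
cost = outage_cost(10_000_000, 90, people_involved=8, hourly_rate=100)
print(f"estimated direct cost: ${cost:,.0f}")
```

Even a crude number like this gives product and engineering a shared unit for comparing resilience work against new features.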

Manage resilience as a feature

Start treating resilience as more than just a technical expectation. It’s a core feature that customers will come to expect. And because they expect it, it should become a first class consideration among other features. Part of this evolution is about shifting where the responsibility lies. Instead of it being purely a responsibility for tech, it’s one for product and the business. Multiple layers within the organization will need to consider resilience a priority. This demonstrates that resilience gets the same amount of attention that any other feature would get.

Close collaboration between product and technology is vital to make sure you're able to set the correct expectations across story definition, implementation and communication to other parts of the organization. Resilience, though a core feature, is still invisible to the customer (unlike new features like additions to a UI or API). These two groups need to collaborate to ensure resilience is prioritized appropriately and implemented effectively.

The objective here is shifting resilience from being a reactionary concern to a proactive one. And if your teams are able to be proactive, you can also react more appropriately when something significant is happening to your business.

Requirements should reflect realistic expectations

Knowing realistic expectations for resilience relative to requirements and customer expectations is key to keeping your engineering efforts cost effective. Different levels of resilience, as measured by uptime and availability, have vastly different costs. The cost difference between “three nines” and “four nines” of availability (99.9% vs 99.99%) may be a factor of 10x.

It’s important to understand your customer requirements for each business capability. Do you and your customers expect a 24x7x365 experience? Where are your customers based? Are they local to a specific region or are they global? Are they primarily consuming your service via mobile devices, or are your customers integrated via your public API? For example, it is an ineffective use of capital to provide 99.999% uptime on a service delivered via mobile devices which only enjoy 99.9% uptime due to cell phone reliability limits.
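The mobile example can be checked with a short calculation. The Python sketch below assumes the service and the delivery channel fail independently and that both must be up for the customer to be served, so their availabilities multiply. The function name is hypothetical and the percentages are the ones quoted above.

```python
from math import prod


def serial_availability(*availabilities):
    """Availability of a chain in which every link must be up."""
    return prod(availabilities)


mobile_channel = 0.999  # the 99.9% device and network limit quoted above

# A 99.999% backend reached over the 99.9% channel.
five_nines = serial_availability(0.99999, mobile_channel)
print(f"99.999% backend: {five_nines:.4%} end to end")

# A cheaper 99.99% backend over the same channel.
four_nines = serial_availability(0.9999, mobile_channel)
print(f"99.99% backend:  {four_nines:.4%} end to end")
```

Under these assumptions the extra nine on the backend moves the end-to-end figure by less than a hundredth of a percentage point, because the weakest link dominates what the customer perceives.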

These are important questions to ask when thinking about resilience, because you don’t want to pay for the implementation of a level of resiliency that has no perceived customer value. They also help to set and manage expectations for the product being built, the team building and maintaining it, the folks in your organization selling it and the customers using it.

Feel out your problems first and avoid overengineering

If you’re solving resiliency problems by hand, your first instinct might be to just automate it. Why not, right? Though it can help, it's most effective when the implementation is time-boxed to a very short period (a couple of days at max). Spending more time will likely lead to overengineering in an area that was actually just a symptom. A large amount of time, energy and money will be invested into something that is just another bandaid and most likely is not sustainable, or even worse, causes its own set of second-order challenges.

Instead of going straight to a tactical solution, this is an opportunity to really feel out your problem: Where do the fault lines exist, what is your observability trying to tell you, and what design choices correlate to these failures? You may be able to discover those fault lines through stress, chaos or exploratory testing. Use this opportunity to your advantage to discover other system stress points and determine where you can get the largest value for your investment.

As your business grows and scales, it’s critical to re-evaluate past decisions. What made sense during the startup phase may not get you through the hypergrowth stages.

Leverage multiple techniques when gathering requirements

Gathering requirements for technically oriented features can be difficult. Product managers or business analysts who are not versed in the nomenclature of resilience can find it hard to understand. This often translates into vague requirements like “Make x service more resilient” or “100% uptime is our goal”. The requirements you define are as important as the resulting implementations. There are many techniques that can help us gather those requirements.

Try running a pre-mortem before writing requirements. In this lightweight activity, individuals in different roles give their perspectives about what they think could fail, or what is failing. A pre-mortem provides valuable insights into how folks perceive potential causes of failure, and the related costs. The ensuing discussion helps prioritize things that need to be made resilient, before any failure occurs. At a minimum, you can create new test scenarios to further validate system resilience.

Another option is to write requirements alongside tech leads and architecture SMEs. The responsibility to create an effective resilient system is now shared amongst leaders on the team, and each can speak to different aspects of the design.

These two techniques show that requirements gathering for resilience features isn’t a single responsibility. It should be shared across different roles within a team. Throughout every technique you try, keep in mind who should be involved and the perspectives they bring.

Evolve your architecture and infrastructure to meet resiliency needs

For a startup, the design of the architecture is dictated by the speed at which you can get to market. That often means the design that worked at first can become a bottleneck in your transition to scaleup. Your product’s resilience will ultimately come down to the technology choices you make. It may mean examining your overall design and architecture of the system and evolving it to meet the product resilience needs. Much of what we spoke to earlier can help give you data points and slack within the bottleneck. Within that space, you can evolve the architecture and incorporate patterns that enable a truly resilient product.

Broadly look at your architecture and determine appropriate trade-offs

Either implicitly or explicitly, when the initial architecture was created, trade-offs were made. During the experimentation and gaining traction phases of a startup, there is a high degree of focus on getting something to market quickly, keeping development costs low, and being able to easily modify or pivot product direction. The trade-off is sacrificing the benefits of resilience that would come from your ideal architecture.

Take an API backed by Functions as a Service (FaaS). This approach is a great way to create something with little to no management of the infrastructure it runs on, potentially ticking all three boxes above: speed to market, low development cost, and ease of changing direction. On the other hand, it is constrained by the infrastructure it’s allowed to run on, by the timing limits of the service, and by the potential communication complexity between many different functions. Though not unachievable, the constraints of the architecture may make it difficult or complex to achieve the resilience your product needs.

As the product and organization grow and mature, their constraints also evolve. It’s important to acknowledge that early design decisions may no longer be appropriate to the current operating environment, and consequently new architectures and technologies need to be introduced. If not addressed, the trade-offs made early on will only amplify the bottleneck within the hypergrowth phase.

Use modern cloud services to your advantage

Services offered by cloud providers can bring two-fold advantages. The main impact comes from recognizing that what was once specialized knowledge has become commodity services offered by your cloud service provider. Taking advantage of these services allows your engineers to move “up the stack”, closer to the customer and product.

For instance, leveraging managed container services instead of managing the servers yourself can offload most of that work outside your organization. Some services offer out-of-the-box solutions for managing your resilience, like automated rollouts or horizontal scaling.

Cloud service providers offer many different services. Understanding the limitations and advantages of each tool helps you see how resilience can be implemented with it. Knowing what modern tools provide also helps you judge whether you are truly leveraging them in the most effective ways.

Enhance resilience with effective error recovery strategies

Data gathered from monitors can show where high failure rates are coming from, be it third-party integrations, backed-up queues, backoffs or others. This data can drive decisions on which recovery strategies are appropriate to implement.

Use caching where appropriate

When retrieving information, caching strategies can help in two ways. Primarily, they can be used to reduce the load on the service by providing cached results for the same queries. Caching can also be used as the fallback response when a backend service fails to return successfully.

The trade-off is potentially serving stale data to customers, so ensure that your use case is not sensitive to stale data. For example, you wouldn’t want to use cached results for real-time stock price queries.
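
As a rough illustration of both uses, here is a minimal Python sketch. The class name `StaleFallbackCache`, the `fetch` callable standing in for the backend call, and the two time limits are assumptions made for this example, not part of any particular library.

```python
import time
from typing import Callable, Dict, Tuple


class StaleFallbackCache:
    """Serve fresh cached values to cut load, and stale ones when the backend fails."""

    def __init__(self, fresh_seconds: float, max_stale_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._fresh = fresh_seconds          # hypothetical: age at which we re-query
        self._max_stale = max_stale_seconds  # hypothetical: oldest value we will serve
        self._clock = clock
        self._entries: Dict[str, Tuple[float, object]] = {}

    def get(self, key: str, fetch: Callable[[], object]) -> object:
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and now - cached[0] <= self._fresh:
            return cached[1]  # fresh hit: the backend is not called at all
        try:
            value = fetch()
        except Exception:
            if cached is not None and now - cached[0] <= self._max_stale:
                return cached[1]  # backend failed: fall back to the last known value
            raise  # nothing usable in the cache, so surface the failure
        self._entries[key] = (now, value)
        return value
```

Choosing `max_stale_seconds` is the business decision described above: it bounds how stale a response you are willing to serve.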

Use default responses where appropriate

As an alternative to caching, which provides the last known response for a query, it is possible to provide a static default value when the backend service fails to return successfully. For example, providing retail pricing as the fallback response for a pricing discount service will do no harm if it is better to risk losing a sale rather than risk losing money on a transaction.
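
The pricing example can be sketched in Python as follows. The function name and the `fetch_discounted_price` callable are hypothetical stand-ins for a call to the discount service.

```python
from decimal import Decimal
from typing import Callable


def price_with_fallback(
    retail_price: Decimal,
    fetch_discounted_price: Callable[[], Decimal],  # hypothetical discount service call
) -> Decimal:
    """Return the discounted price, or the retail price if the lookup fails."""
    try:
        return fetch_discounted_price()
    except Exception:
        # Static default: risk losing the sale rather than losing money on it.
        return retail_price
```

Note that the default is chosen by the business rule, not by the code: if losing a sale were worse than a small loss on the transaction, a different default would be appropriate.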

Use retry strategies for mutation requests

Where a client is calling a service to effect a change in the data, the use case may require a successful request before proceeding. In this case, retrying the call may be appropriate in order to minimize how often error management processes need to be employed.

There are some important trade-offs to consider. Retries without delays risk causing a storm of requests which bring the whole system down under the load. Using an exponential backoff delay mitigates the risk of traffic load, but instead ties up connection sockets waiting for a long-running request, which causes a different set of failures.
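
One way to balance these trade-offs is to cap both the number of attempts and the longest delay. The Python sketch below is an illustration under assumed defaults; the function name and parameter values are hypothetical. The random jitter is an addition not discussed above: it is a common way to stop many clients from retrying in lockstep.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 4,      # assumed default; tune per use case
    base_delay: float = 0.1,    # seconds before the first retry, at most
    max_delay: float = 2.0,     # cap so callers are not tied up indefinitely
    sleep: Callable[[float], None] = time.sleep,
    jitter: Callable[[float, float], float] = random.uniform,
) -> T:
    """Call `call`, retrying on any exception with capped exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the last error
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Sleep a random time up to the ceiling to spread retries out.
            sleep(jitter(0.0, ceiling))
    raise AssertionError("unreachable")
```

In practice you would retry only on errors known to be transient, and only for requests the service can safely receive more than once.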

Use idempotency to simplify error recovery

Clients implementing any type of retry strategy will potentially generate multiple identical requests. Ensure the service can handle multiple identical mutation requests, and can also handle resuming a multi-step workflow from the point of failure.
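
A minimal sketch of handling duplicate mutation requests, assuming clients attach a unique idempotency key to each logical request (a common convention, but an assumption here). The class and parameter names are hypothetical, and the in-memory dictionary stands in for a durable store with an atomic insert, which a real service would need.

```python
from typing import Callable, Dict


class IdempotentHandler:
    """Apply each logical request once, replaying the stored result for repeats."""

    def __init__(self, apply_change: Callable[[dict], dict]):
        self._apply_change = apply_change      # hypothetical function performing the mutation
        self._results: Dict[str, dict] = {}    # stand-in for a durable store

    def handle(self, idempotency_key: str, payload: dict) -> dict:
        if idempotency_key in self._results:
            # A retry of a request we already processed: do not mutate again.
            return self._results[idempotency_key]
        result = self._apply_change(payload)
        self._results[idempotency_key] = result
        return result
```

This sketch is single-threaded and does not cover resuming a multi-step workflow; that would require recording the last completed step alongside the key.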

Design business appropriate failure modes

In a system, failure is a given and your goal is to protect the end user experience as much as possible. Specifically in cases that are supported by downstream services, you may be able to anticipate failures (through observability) and provide an alternative flow. Your underlying services that leverage these integrations can be designed with business appropriate failure modes.

Consider an ecommerce system supported by a microservice architecture. Should downstream services supporting the ordering function become overwhelmed, it would be more appropriate to temporarily disable the order button and present a limited error message to a customer. While this provides clear feedback to the user, Product Managers concerned with sales conversions might instead allow for orders to be captured and alert the customer to a delay in order confirmation.

Failure modes should be embedded into upstream systems, so as to ensure business continuity and customer satisfaction. Depending on your architecture, this might involve your CDN or API gateway returning cached responses if requests are overloading your subsystems. Or as described above, your system might provide for an alternative path to eventual consistency for specific failure modes. This is a far more effective and customer focused approach than the presentation of a generic error page that conveys ‘something has gone wrong’.
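
The order capture alternative from the ecommerce example might look like the following Python sketch. All names here are hypothetical, and the plain list stands in for a durable queue that would be drained once the downstream service recovers.

```python
from dataclasses import dataclass
from typing import Callable, List


class DownstreamOverloaded(Exception):
    """Hypothetical error raised when the ordering service cannot take more load."""


@dataclass
class OrderResponse:
    status: str   # "confirmed" or "pending"
    message: str  # customer-facing text


def place_order(order: dict, submit: Callable[[dict], None],
                pending: List[dict]) -> OrderResponse:
    """Submit an order, capturing it for later processing if downstream is overloaded."""
    try:
        submit(order)
    except DownstreamOverloaded:
        # Business-appropriate failure mode: keep the sale, delay the confirmation.
        pending.append(order)
        return OrderResponse("pending", "Order received. Confirmation may be delayed.")
    return OrderResponse("confirmed", "Order confirmed.")
```

The customer sees a specific, honest message instead of a generic error page, and the order is not lost.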

Resolve single points of failure

A single service can easily go from managing a single responsibility of the product to multiple. For a startup, appending to an existing service is often the simplest approach, as the infrastructure and deployment path are already solved. However, services can easily bloat and become a monolith, creating a point of failure that can bring down many or all parts of the product. In cases like this, you'll need to understand ways to split up the architecture, while also keeping the product as a whole functional.

At a fintech client, during a hyper-growth period, load on their monolithic system would spike wildly. Due to the monolithic nature, all of the functions were brought down simultaneously, resulting in lost revenue and unhappy customers. The long-term solution was to start splitting the monolith into several separate services that could be scaled horizontally. In addition, they introduced event queues, so transactions were never lost.

Implementing a microservice approach is not a simple and straightforward task, and does take time and effort. Start by defining a domain that requires a resiliency boost, and extract its capabilities piece by piece. Roll out the new service, adjust infrastructure configuration as needed (increase provisioned capacity, implement auto scaling, etc.) and monitor it. Ensure that the user journey hasn’t been affected, and resilience as a whole has improved. Once stability is achieved, continue to iterate over each capability in the domain. As noted in the client example, this is also an opportunity to introduce architectural elements that help increase the general resilience of your system. Event queues, circuit breakers, bulkheads and anti-corruption layers are all useful architectural components that increase the overall reliability of the system.
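
To make one of these components concrete, here is a minimal circuit breaker sketch in Python. It is a simplified, single-threaded illustration; the class name, thresholds, and timeout are assumptions for this example, and a production system would more likely use an established library or a service mesh feature.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._threshold = failure_threshold   # consecutive failures before opening
        self._reset_timeout = reset_timeout   # seconds to wait before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                # Open: fail fast instead of adding load to a struggling dependency.
                raise CircuitOpenError("dependency marked unavailable")
            # Timeout elapsed: half-open, let this one trial call through.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Failing fast while the circuit is open gives the caller a chance to use a cached or default response, as described earlier.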

Continually optimize your resilience

It's one thing to get through the bottleneck; it's another to stay out of it. As you grow, your system resiliency will be continually tested. New features result in new pathways for increased system load. Architectural changes introduce unknown system stability. Your organization will need to stay ahead of what will eventually come. As it matures and grows, so should your investment into resilience.

Regularly chaos test to validate system resilience

Chaos engineering is the bedrock of truly resilient products. The core value is the ability to generate failure in ways that you might never think of. And while that chaos is creating failures, running through user scenarios at the same time helps to understand the user experience. This can provide confidence that your system can withstand unexpected chaos. At the same time, it identifies which user experiences are impacted by system failures, giving context on what to improve next.

Though you may feel more comfortable testing against a dev or QA environment, the value of chaos testing comes from production or production-like environments. The goal is to understand how resilient the system is in the face of chaos. Early environments are (usually) not provisioned with the same configurations found in production, and thus will not provide the confidence needed. Running a test like this in production can be daunting, so make sure you have confidence in your ability to restore service. This means the entire system can be spun back up and data can be restored if needed, all through automation.

Start with small understandable scenarios that can give useful data. As you gain experience and confidence, consider using your load/performance tests to simulate users while you execute your chaos testing. Ensure teams and stakeholders are aware that an experiment is about to be run, so they are prepared to monitor (in case things go wrong). Frameworks like Litmus or Gremlin can provide structure to chaos engineering. As confidence and maturity in your resilience grows, you can start to run experiments where teams are not alerted beforehand.
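
A small, understandable scenario could be as simple as injecting failures into one dependency while a user journey runs against it. The Python sketch below is a toy illustration of that idea and does not use the Litmus or Gremlin APIs; every name in it is hypothetical.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Marks a failure that was deliberately introduced by the experiment."""


def with_injected_faults(func: Callable[..., T], failure_rate: float,
                         rng: random.Random) -> Callable[..., T]:
    """Wrap `func` so that a fraction of calls fail, simulating a flaky dependency."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")

    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("chaos experiment: simulated dependency failure")
        return func(*args, **kwargs)

    return wrapped


def run_user_scenario(step: Callable[[], object], runs: int) -> float:
    """Run a user journey repeatedly and return the fraction that completed."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    completed = 0
    for _ in range(runs):
        try:
            step()
            completed += 1
        except InjectedFault:
            pass  # this run of the journey was broken by the injected failure
    return completed / runs
```

Comparing the completion rate with and without your recovery strategies in place gives a first, rough measure of how much of the user experience they protect.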

Recruit specialists with knowledge of resilience at scale

Hiring generalists when building and delivering an initial product makes sense. Time and money are incredibly valuable, so having generalists provides the flexibility to ensure you can get out to market quickly and not eat away at the initial investment. However, as your product scales, teams can end up taking on more than they can handle, and what was once good enough no longer is. A slightly unstable system that made it to market will continue to get more unstable as you scale, because the skills required to manage it have overtaken the skills of the existing team. In the same vein as technical debt, this can be a slippery slope and if not addressed, the problem will continue to compound.

To sustain the resilience of your product, you’ll need to recruit people with that expertise who can focus on that capability. Experts bring in a fresh view on the system in place, along with their ability to identify gaps and areas for improvement. Their past experiences can have a two-fold effect on the team, providing much needed guidance in areas that sorely need it, and a further investment in the growth of your employees.

Always maintain or improve your reliability

In 2021, the State of DevOps report expanded the fifth key metric from availability to reliability. Under operational performance, it reflects how well a product keeps its promises. Resilience ties directly into this, as it’s a key business capability that can ensure your reliability. With many organizations pushing more frequently to production, there need to be assurances that reliability remains the same or gets better.

With your observability and monitoring in place, ensure that what they tell you matches what your service level objectives (SLOs) state. With every deployment to production, the monitors should not deviate from what your SLAs guarantee. Certain deployment structures, like blue/green or canary (to some extent), can help to validate the changes before being released to a wide audience. Running tests effectively in production can increase confidence that your agreements haven’t slipped and resilience has remained the same or better.
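
To make the comparison concrete, the arithmetic behind an availability SLO can be sketched in Python. The function names and figures are illustrative assumptions: a 99.9% availability target over a 30-day window leaves roughly 43 minutes of error budget, and a canary's observed success ratio can be checked against the same target before a wider release.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of unavailability the SLO tolerates over the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


def canary_meets_slo(total_requests: int, failed_requests: int,
                     slo_target: float) -> bool:
    """True if the canary's observed success ratio is at or above the SLO target."""
    if total_requests <= 0:
        return False  # no traffic means no evidence; do not promote
    success_ratio = (total_requests - failed_requests) / total_requests
    return success_ratio >= slo_target
```

A check like this is only as good as the traffic behind it: a canary that has served a handful of requests tells you very little, so in practice you would also require a minimum sample size.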

Resilience and observability as your organization grows

Phase 1: Experimenting

Prototype solutions, with hyper focus on getting a product to market quickly

Phase 2: Getting Traction

Resilience and observability are manually implemented via developer intervention
Prioritization for solving resilience mainly comes from technical debt
Dashboards reflect low level services statistics like CPU and RAM
Majority of support issues come in via calls or text messages from customers

Phase 3: (Hyper) Growth

Resilience is a core feature delivered to customers, prioritized in the same vein as features
Observability is able to reflect the overall customer experience, reflected through dashboards and monitoring
Re-architect or recreate problematic services, improving the resilience in the process

Phase 4: Optimizing

Platforms evolve from internal facing services, productizing observability and compute environments
Run periodic chaos engineering exercises, with little to no notice
Augment teams with engineers that are versed in resilience at scale

Summary

As a scaleup, your ability to effectively navigate the (hyper)growth phase is in part tied to the resilience of your product. The high growth rate starts to put pressure on a system that was developed during the startup phase, and failure to address the resilience of that system often results in a bottleneck.

To minimize risk, resilience needs to be treated as a first-class citizen. The details may vary according to your context, but at a high level the following considerations can be effective:

Resilience is a key feature of your product. It is no longer just a technical detail, but a key component that your customers will come to expect, shifting the company towards a proactive approach.
Build customer status indicators to help divert some support requests, allowing breathing room for your team to solve the important problems.
The customer experience should be reflected within your observability stack. Monitor core business metrics that reflect experiences your customers have.
Understand what your dashboards and monitors are telling you, to get a sense of which areas are most critical to solve.
Evolve your architecture to meet your resiliency goals as you identify specific challenges. Initial designs may work at small scale but become increasingly limiting as you transition to a scaleup.
When architecting failure modes, find ways to fail that are friendly to the consumer, helping to ensure continuity and customer satisfaction.
Define realistic resilience expectations for your product, and understand the limitations with which it’s being served. Use this knowledge to provide your customers with effective SLAs and reasonable SLOs.
Optimize your resilience when you’re through the bottleneck. Make chaos engineering a regular practice, and recruit specialists with knowledge of resilience at scale.

Successfully incorporating these practices results in a future organization where resilience is built into business objectives, across all dimensions of people, process, and technology.

Acknowledgements

Much of this article was improved through the thoughts, ideas and feedback from many of our colleagues. Thank you to Martin Fowler, Rickey Zachary, Vanessa Towers, Brandon Byars, Nic Cheneweth, Matthew Foster, Dale Peakall, Melissa Newman, Premanand Chandrasekaran, Karthik Krishnan, Kamil Markow, Stefania Stefansdottir and Matt Newman.

Special thanks to Tim Cochran for his guidance throughout our entire time in writing this article.

Special acknowledgement to Ranbir Chawla for his contributions to early drafts.

Significant Revisions

22 August 2023: Published (in full)

This page is part of:

Bottlenecks of Scaleups

by Tim Cochran, Carl Nygard, Kennedy Collins, Keyur Govande, Premanand Chandrasekaran, Punit Lad, Rick Kick, Roni Smith, Sofia Tania, and Stefania Stefansdottir
