Using the cloud to scale Etsy

22 November 2022



Tim Cochran is a Technical Director for the US East Market at Thoughtworks. Tim has over 19 years of experience leading work across startups and large enterprises in various domains such as retail, financial services, and government. He advises organizations on technology strategy and making the right technology investments to enable digital transformation goals. He is a vocal advocate for the developer experience and passionate about using data-driven approaches to improve it.


Keyur is the Chief Architect and VP of Infrastructure Engineering at Etsy. He has led multiple large architectural changes during his tenure, most recently the move to Google Cloud. Prior to this role, he was a key member of the Systems Engineering team helping scale the site and keeping PHP, MySQL, Memcached, Redis, and the Linux kernel running smoothly.


Etsy, an online marketplace for unique, handmade, and vintage items, has seen high growth over the last five years. Then the pandemic dramatically changed shoppers’ habits, leading to more consumers shopping online. As a result, the Etsy marketplace grew from 45.7 million buyers at the end of 2019 to 90.1 million buyers (97%) at the end of 2021 and from 2.5 to 5.3 million (112%) sellers in the same period.

The growth massively increased demand on the technical platform, scaling traffic almost 3X overnight. Etsy also had significantly more customers for whom it needed to continue delivering great experiences. To keep up with that demand, they had to scale up infrastructure, product delivery, and talent drastically. While the growth challenged teams, the business was never bottlenecked. Etsy’s teams were able to deliver new and improved functionality, and the marketplace continued to provide an excellent customer experience. This article and the next form the story of Etsy’s scaling strategy.

Etsy's foundational scaling work had started long before the pandemic. In 2017, Mike Fisher joined as CTO. Josh Silverman had recently joined as Etsy’s CEO, and was establishing institutional discipline to usher in a period of growth. Mike has a background in scaling high-growth companies, and along with Martin Abbott wrote several books on the topic, including The Art of Scalability and Scalability Rules.

Etsy relied on physical hardware in two data centers, presenting several scaling challenges. With their expected growth, it was apparent that the costs would ramp up quickly. It affected product teams’ agility as they had to plan far in advance for capacity. In addition, the data centers were based in one state, which represented an availability risk. It was clear they needed to move onto the cloud quickly. After an assessment, Mike and his team chose the Google Cloud Platform (GCP) as the cloud partner and started to plan a program to move their many systems onto the cloud.

While the cloud migration was happening, Etsy was growing its business and its team. Mike identified the product delivery process as being another potential scaling bottleneck. The autonomy afforded to product teams had caused an issue: each team was delivering in different ways. Joining a team meant learning a new set of practices, which was problematic as Etsy was hiring many new people. In addition, they had noticed several product initiatives that did not pay off as expected. These indicators led leadership to re-evaluate the effectiveness of their product planning and delivery processes.

Strategic Principles

Mike Fisher (CTO) and Keyur Govande (Chief Architect) created the initial cloud migration strategy with these principles:

Minimum viable product - A typical anti-pattern Etsy wanted to avoid was rebuilding too much and prolonging the migration. Instead, they used the lean concept of an MVP to validate as quickly and cheaply as possible that Etsy’s systems would work in the cloud, and removed the dependency on the data center.

Local decision making - Each team can make its own decisions for what it owns, with oversight from a program team. Etsy’s platform was split into a number of capabilities, such as compute, observability and ML infra, along with domain-oriented application stacks such as search, bid engine, and notifications. Each team ran proofs of concept to develop a migration plan. The main marketplace application is a famously large monolith, so it required a cross-team initiative to focus on it.

No changes to the developer experience - Etsy views a high-quality developer experience as core to productivity and employee happiness. It was important that the cloud-based systems continued to provide capabilities that developers relied upon, such as fast feedback and sophisticated observability.

There also was a deadline associated with existing contracts for the data center that they were very keen to hit.

Using a partner

To accelerate their cloud migration, Etsy wanted to bring on outside expertise to help in the adoption of new tooling and technology, such as Terraform, Kubernetes, and Prometheus. Unlike a lot of Thoughtworks’ typical clients, Etsy didn’t have a burning platform driving their fundamental need for the engagement. They are a digital native company and had been using a thoroughly modern approach to software development. Even without a single problem to focus on though, Etsy knew there was room for improvement. So the engagement approach was to embed across the platform organization. Thoughtworks infrastructure engineers and technical product managers joined search infrastructure, continuous deployment services, compute, observability and machine learning infrastructure teams.

An incremental federated approach

The initial “lift & shift” to the cloud for the marketplace monolith was the most difficult. The team wanted to keep the monolith intact with minimal changes. However, it used a LAMP stack and so would be difficult to re-platform. They did a number of dry runs testing performance and capacity. Though the first cut-over was unsuccessful, they were able to quickly roll back. In typical Etsy style, the failure was celebrated and used as a learning opportunity. The migration was eventually completed in 9 months, less time than the full year originally planned. After the initial migration, the monolith was tweaked and tuned to fit better in the cloud, adding features like autoscaling and auto-fixing bad nodes.

Meanwhile, other stacks were also being migrated. While each team created its own journey, the teams were not completely on their own. Etsy used a cross-team architecture advisory group to share broader context, and to help pattern match across the company. For example, the search stack moved onto GKE as part of the cloud migration, which took longer than the lift and shift operation for the monolith. Another example is the data lake migration. Etsy had an on-prem Vertica cluster, which they moved to BigQuery, changing everything about it in the process.

Not surprisingly for Etsy, optimization for the cloud didn’t stop once the migration was complete. Each team continued to look for opportunities to utilize the cloud to its full extent. With the help of the architecture advisory group, they looked at things such as how to reduce the amount of custom code by moving to industry-standard tools, how to improve cost efficiency, and how to improve feedback loops.

Figure 1: Federated cloud migration

As an example, let’s look at the journey of two teams, observability and ML infra:

The challenges of observing everything

Etsy is famous for measuring everything: “If it moves, we track it.” Operational data - traces, metrics and logs - is used by the full company to create value. Product managers and data analysts leverage the data for planning and proving the predicted value of an idea. Product teams use it to support the uptime and performance of their individual areas of responsibility.

With Etsy’s commitment to hyper-observability, the amount of data being analyzed isn’t small. Observability is self-service; each team gets to decide what it wants to measure. Together, the teams maintain 80M metric series covering the site and supporting infrastructure, and the platform produces 20 TB of logs a day.

When Etsy originally developed this strategy there weren’t a lot of tools and services on the market that could handle their demanding requirements. In many cases, they ended up having to build their own tools. An example is StatsD, a stats aggregation tool, now open-sourced and used throughout the industry. Over time the DevOps movement had exploded, and the industry had caught up. A lot of innovative observability tools such as Prometheus appeared. With the cloud migration, Etsy could assess the market and leverage third-party tools to reduce operational cost.
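For readers who have not used StatsD, the sketch below shows the plain-text line protocol that clients send to a StatsD daemon over UDP. It is a minimal illustration, not Etsy's client code: the function names and the metric names (`checkout.completed`, `search.latency`) are hypothetical, and it assumes a daemon listening on the default port 8125.

```python
import socket

# Metric type codes from the StatsD line protocol.
COUNTER, TIMER, GAUGE = "c", "ms", "g"


def format_metric(name, value, metric_type, sample_rate=None):
    """Build one StatsD line: <name>:<value>|<type>[|@<sample_rate>]."""
    line = f"{name}:{value}|{metric_type}"
    if sample_rate is not None:
        line += f"|@{sample_rate}"
    return line


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget one metric line over UDP (8125 is StatsD's default port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


if __name__ == "__main__":
    # Hypothetical metric names, for illustration only.
    print(format_metric("checkout.completed", 1, COUNTER))
    print(format_metric("search.latency", 42, TIMER, sample_rate=0.1))
```

Because the transport is UDP, emitting a metric never blocks the application on the metrics backend, which is part of what makes it cheap for every team to measure whatever it wants.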

The observability stack was the last to move over due to its complex nature. It required a rebuild, rather than a lift and shift. The stack had relied on large servers, whereas efficient use of the cloud calls for many smaller servers that scale horizontally. They moved large parts of the stack onto managed services and third-party SaaS products. An example of this was introducing Lightstep, which they could use to outsource the tracing processing. It was still necessary to do some amount of processing in-house to handle the unique scenarios that Etsy relies on.

Migration to the cloud enabled a better ML platform

A big source of innovation at Etsy is the way they utilize machine learning.

Etsy leverages machine learning (ML) to create personalized experiences for our millions of buyers around the world with state-of-the-art search, ads, and recommendations. The ML Platform team at Etsy supports our machine learning experiments by developing and maintaining the technical infrastructure that Etsy’s ML practitioners rely on to prototype, train, and deploy ML models at scale.

-- Kyle Gallatin and Rob Miles

The move to the cloud enabled Etsy to build a new ML platform based on managed services that both reduces operational costs and improves the time from idea generation to production deployment.

Because their resources were in the cloud, they could now rely on cloud capabilities. They used Dataflow for ETL and Vertex AI for training their models. As they saw success with these tools, they made sure to design the platform so that it was extensible to other tools. To make it widely accessible they adopted industry-standard tools such as TensorFlow and Kubernetes. Etsy’s productivity in developing and testing ML leapfrogged their prior performance. As Rob and Kyle put it, “We’re estimating a ~50% reduction in the time it takes to go from idea to live ML experiment.”

This performance growth wasn’t without its challenges, however. As the scale of data grew, so too did the importance of high-performing code. With low-performing code, the customer experience could be impacted, and so the team had to produce a system which was highly optimized. “Seemingly small inefficiencies such as non-vectorized code can result in a massive performance degradation, and in some cases we’ve seen that optimizing a single tensor flow transform function can reduce the model runtime from 200ms to 4ms.” In numeric terms, that’s a 50-fold improvement, but in business terms, this is a change in performance easily perceived by the customer.
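To make the vectorization point concrete, here is a small sketch comparing an element-by-element loop with a single array expression. It uses NumPy rather than TensorFlow Transform, and the feature-scaling function is a hypothetical stand-in, not Etsy's code; absolute timings will vary by machine, so none are quoted here.

```python
import timeit

import numpy as np


def scale_loop(values, mean, std):
    """Non-vectorized: one Python-level operation per element."""
    return [(v - mean) / std for v in values]


def scale_vectorized(values, mean, std):
    """Vectorized: a single array expression evaluated in compiled code."""
    return (np.asarray(values, dtype=np.float64) - mean) / std


if __name__ == "__main__":
    data = np.random.default_rng(0).normal(size=200_000)
    mean, std = data.mean(), data.std()
    # Both versions compute the same result; only the execution strategy differs.
    assert np.allclose(scale_loop(data, mean, std), scale_vectorized(data, mean, std))
    print("loop:      ", timeit.timeit(lambda: scale_loop(data, mean, std), number=3))
    print("vectorized:", timeit.timeit(lambda: scale_vectorized(data, mean, std), number=3))
```

The two functions are interchangeable from the caller's point of view, which is why this kind of inefficiency is easy to miss in review and only shows up once data volumes grow.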

We're releasing this article in installments. The last installment will include how Etsy handled the stresses of the pandemic, and its work on measuring cost and carbon consumption.

To find out when we publish the next installment, subscribe to the site's RSS feed, Martin's Twitter stream, or Mastodon feed.


Acknowledgements

Thanks to Martin Fowler, Christopher Hastings, and Melissa Newman for their writing help, and to Dale Peakskill, Kyle Gallatin, Emily Sommer and Rob Miles, for sharing the stories of their scaling work.

Special thanks to Mike Fisher for being so open about Etsy’s scaling journey.

Significant Revisions

22 November 2022: Published sections on observability and ML platform

17 November 2022: Published first installment