Scaling productivity on microservices at Lyft (Part 1)

Garrett Heel
Lyft Engineering
Nov 10, 2021 · 9 min read


Late in 2018, Lyft engineering completed decomposing our original PHP monolith into a collection of Python and Go microservices. A few years down the road, microservices had been largely successful in allowing teams to operate and ship services independently of one another. The separation of concerns that microservices brought about enabled us to experiment and deliver features faster–deploying hundreds of times each day–and gave us the flexibility to use different programming languages where they work best, apply stricter or looser requirements based on service criticality, and much more. However, as the number of engineers, services, and tests all increased, our development tooling struggled to keep up with an explosion of microservices, eroding many of the productivity gains we had strived for.

This four-part series will walk through the development environments that served Lyft’s engineering team as it grew from 100 engineers and a handful of services to 1000+ engineers and hundreds of services. We’ll discuss the scaling challenges that caused us to pivot away from most of those environments–and from a testing approach that relied predominantly on heavy integration tests (often approaching end-to-end)–in favor of a local-first approach centered on testing components in isolation.

History of development and test environments

Our first major investment in a comprehensive development environment began in 2015, around the time we reached 100 engineers. Almost all development still occurred on a PHP monolith, with a few microservices beginning to emerge for distinct use cases such as driver onboarding.

Anticipating growth in the number of engineers and services we’d need to serve, we decided that moving to containers made a lot of sense. Our plan was to build a container-orchestration environment based on Docker–still in its infancy at the time–to first serve developers for testing, then expand to production, where we’d benefit from multi-tenant workloads being cheaper and faster to scale.

Local development with Devbox

Devbox–Lyft’s development environment in a box–was shipped in early 2016 and was quickly adopted by most engineers. Devbox worked by managing a local virtual machine on behalf of the user–removing the need for them to install or update packages, configure runit to start services, add shared folders, and so on. Once the VM was running, it took only a single command and a few minutes to pull the latest version’s image, create and seed databases, start an Envoy proxy sidecar, and do everything else necessary to begin sending requests.

This was a wonderful upgrade from before, where we’d manually provision an EC2 instance for every developer and service combination, making it tedious to set up and keep up-to-date. For the first time we had a consistent, repeatable, and easy way to develop across multiple services.

Remote development with Onebox

The need for longer-lived environments that could be shared with other engineers or functions (like design) quickly became apparent, and Onebox was born. Onebox was essentially Devbox on an EC2 instance, with a number of benefits that drew users away from Devbox. We hosted these on r3.4xlarge instances with 16 vCPUs and 122 GiB of memory, which were much more capable than the MacBook Pros carried around by engineers. Onebox could run more services and download container images much faster (on account of being in AWS), not to mention avoiding VirtualBox causing laptop fans to imitate jet engines.

We had two different flavors of development environments, each capable of running multiple services

Integration tests

In addition to unit tests, Onebox’s cloud infrastructure lent itself well to running integration tests on CI. A service would simply define the group of dependencies it required in a manifest.yaml file, and a temporary Onebox would spin up those services and execute tests on every pull request. Many services, particularly compositional ones closer to mobile clients, built up large suites of integration tests in response to outages. Postmortems often ended with an action item to add new integration tests. With such a flexible and powerful testing capability available, unit tests gradually took a back seat.

Example of a service defining integration tests to run in CI
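To make that concrete, here’s a rough sketch of what such a definition could look like–the actual manifest.yaml schema used at Lyft isn’t shown in this post, so the service names and fields below are purely illustrative:

```yaml
# Hypothetical sketch only: the real manifest.yaml schema isn't documented
# here, so these fields and service names are illustrative.
name: rides
dependencies:        # services a temporary Onebox should spin up for CI
  - users
  - payments
  - locations
ci:
  integration_tests:
    command: pytest tests/integration   # run against the running dependencies
```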

Staging environment

Lyft’s staging environment was nearly identical to production–aside from a smaller footprint and the absence of production data–and all services were deployed there on the way to production. Although not a developer environment, staging is worth discussing due to the increasingly important role it played in end-to-end testing.

Shortly after Devbox and Onebox were released in early 2017, we were also solving a different kind of growth problem: load testing. Events resulting in a surge in rideshare traffic, such as New Year’s Eve and Halloween, would expose bottlenecks in our system, which often led to outages. To get ahead of these problems, we built a framework for simulating rides at scale. The framework targeted our production environment, coordinating tens of thousands of simulated users with different configurations (e.g., a driver in Los Angeles that would frequently cancel) and treating Lyft as a black box.
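To give a flavor of the idea (this isn’t the actual framework–the endpoints and configuration fields below are hypothetical), each simulated user amounts to a small configuration plus a loop that drives public endpoints, treating the system as a black box:

```python
# Hypothetical sketch of a black-box ride simulation. The endpoint paths and
# configuration fields are illustrative, not Lyft's actual API or framework.
import random
import time
from dataclasses import dataclass

import requests


@dataclass
class SimulatedDriver:
    driver_id: str
    region: str = "Los Angeles"
    cancel_rate: float = 0.5  # e.g., a driver that frequently cancels


def run_driver(driver: SimulatedDriver, base_url: str) -> None:
    """Exercise public endpoints in a loop, treating the system as a black box."""
    while True:
        ride = requests.post(
            f"{base_url}/v1/rides/accept", json={"driver_id": driver.driver_id}
        ).json()
        if random.random() < driver.cancel_rate:
            requests.post(f"{base_url}/v1/rides/{ride['id']}/cancel")
        else:
            requests.post(f"{base_url}/v1/rides/{ride['id']}/dropoff")
        time.sleep(1)
```

Thousands of these loops, each with a different configuration, could then be coordinated to generate production-scale load.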

As a byproduct of testing the simulation framework itself in staging, we realized that the traffic generated was also valuable for general end-to-end testing. Constantly exercising public endpoints in staging provided great signal for deployments. For example, if a deploy broke the endpoint that drops off a passenger, the deploy’s author would see error logs and alarms almost immediately. Simulations also continuously generated up-to-date data for users, rides, payments, and so on, which removed much of the setup time for manual testing that would be necessary in development. With load testing efforts leading staging to become more realistic and useful than ever before, it became common for teams to deploy PR branches there as a consistent place to get feedback with realistic data.

Breaking point

Fast-forward to 2020–four years after introducing Devbox and Onebox as containerized development environments–and the “Lyft-in-a-box” style of environments was struggling to keep up despite our best efforts. The number of engineers using these environments had increased tenfold, and there were now hundreds of microservices powering a much more complex business. While development on services with small dependency trees was still fairly efficient, most development occurred on services that had built up enormous dependency trees–making it painfully slow to start an environment or run tests on CI.

While these environments and testing capabilities were powerful and convenient, they had reached a point where they caused more harm than good. We built a system optimized for testing a handful of services and hadn’t re-evaluated our strategy as the number of services–accelerated by the decomposition of our PHP monolith–grew from 5 to 50, 50 to 100, and beyond. Not only did supporting this number of services in development require an enormous amount of effort to maintain and scale, but it impaired the productivity of developers at large by forcing them to constantly think in terms of the entire system rather than one component at a time.

Let’s examine some of the aspects of this problem in more detail:

Scalability

Onebox environments became impractical to scale beyond a certain point due to the sheer number of resources involved and their divergence from our production-like environments. For example, it wasn’t feasible to run the same observability tools across hundreds of environments. When something went wrong, it was difficult to pinpoint the exact cause (which of the 70 services running might be misbehaving?) and people tended to hit the “reset” button a few times before giving up and testing on staging.

Staging, on the other hand, was both easy to scale and a much more faithful representation of production. It provided the same logging, tracing, and metrics capabilities to aid debugging. The major drawbacks of deploying to the shared staging environment to test were that (1) experimental changes could break others using the environment, (2) only one change per service could effectively be tested at a time, and (3) building and deploying took minutes, compared to the seconds it would take to sync code and hot-reload.

Maintenance

Due to the scaling challenges mentioned above, maintaining and optimizing these environments took so much time that the technology fell behind. Production and staging environments had moved to Kubernetes for container orchestration, at the same time switching to slimmer single-process container images. Development environments, meanwhile, still used heavier multi-process images bundling sidecars and other infra components (metrics, logs, etc.), making images slower to build and download.

Fires arose weekly from changes that impacted development environments in ways that did not affect staging or production. With most developers running most services, issues with one service had a large blast radius. This was exacerbated by the fact that some teams had moved all of their end-to-end testing to staging, leaving their services to languish in development.

Ownership

Ownership of issues in developer environments was unclear. Who should be responsible for fixing a given service causing a problem? The person who spun up the Onebox, the owner of the service, or the Developer Infra team? In practice this often fell to the Developer Infra team, which was ill-equipped to diagnose and resolve issues that were application-specific (e.g., a configuration value changing, causing an application to crash at startup).

Bloated tests

Unwieldy integration test suites had become a significant drain on productivity. Hour-long test suites were commonplace, powered by complex sharding infrastructure with automated retries in an attempt to paper over an unstable environment. The two main factors driving this were bloat in dependencies and bloat in the tests themselves. Transitive dependencies would gradually increase without service owners noticing, eating away at test times in consistent 30-second chunks. Test suites themselves also steadily grew: we’d jump to add tests when something went wrong, but tests were rarely deleted, on the assumption that any existing test must be serving a purpose.

So why would we incur the tax of waiting hours to merge a PR? Because the tests would catch bugs before they hit production, of course! But under closer examination, this theory didn’t hold up well in practice. Analyzing the integration tests for some of our most actively developed services uncovered that 80% or more of tests were either unnecessary (e.g., outdated or duplicates of existing unit tests) or could be rewritten to run without external dependencies in a fraction of the time. When tests did fail, the majority were false positives–consuming hours of debugging time–and the rest would usually be caught before causing production impact via the staging or canary environments.
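As a minimal sketch of the kind of rewrite we’re describing–the class names here are hypothetical–an integration test that previously exercised a real downstream service can often be replaced by a unit test that stubs the dependency at its client boundary:

```python
# Minimal sketch: stub the downstream client instead of spinning up the real
# service. RidePricer and PaymentsClient are hypothetical names.
from unittest import mock


class PaymentsClient:
    def charge(self, rider_id: str, amount_cents: int) -> bool:
        raise NotImplementedError  # the real implementation calls another service


class RidePricer:
    def __init__(self, payments: PaymentsClient):
        self.payments = payments

    def complete_ride(self, rider_id: str, amount_cents: int) -> str:
        return "paid" if self.payments.charge(rider_id, amount_cents) else "failed"


def test_complete_ride_charges_rider():
    payments = mock.Mock(spec=PaymentsClient)
    payments.charge.return_value = True

    assert RidePricer(payments).complete_ride("rider-1", 1500) == "paid"
    payments.charge.assert_called_once_with("rider-1", 1500)
```

A test like this runs in milliseconds, needs no temporary Onebox, and fails only when the behavior it describes actually changes.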

Integration tests grew unwieldy as we continued to split out new microservices

Changing course

About a year ago, shortly after we began migrating our development environments to Kubernetes, a change in engineering resourcing was the catalyst for us to zoom out and re-examine our larger direction. Maintaining the infrastructure to support these on-demand environments had simply become too expensive and would only worsen with time. Solving for our situation would require a more fundamental change to the way we develop and test microservices. It was time to replace Devbox, Onebox, and integration tests on CI with alternatives that were sustainable for a system composed of hundreds of microservices.

Looking closely at how developers were using the existing environments, we identified three key workflows (denoted in purple in the diagram below) that were critical to maintain and would require investments:

  1. Local development: For any given service, it should be easy and super fast to run unit tests or start a web server and send requests (see the sketch after this list).
  2. Manual end-to-end testing: Testing how a given change performs in the larger system is a crucial workflow that many engineers rely on. We’d look to extend staging to make it easier and safer for developers to test in an isolated manner.
  3. Automated end-to-end testing: Despite an over-reliance on this kind of testing, we couldn’t continue to ship changes hundreds of times each day without the confidence that automated E2E tests provide. We would keep a small subset of the valuable ones as acceptance tests–tests which run during deployment to production.
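As a small sketch of the first workflow–Flask is used here purely for illustration and the handler is hypothetical–running one service locally means starting its web server (or test client) in isolation, with downstream calls faked:

```python
# Hypothetical sketch of local, isolated development: one service, one process,
# downstream calls faked. Flask is used only for illustration.
from flask import Flask, jsonify

app = Flask(__name__)


def fetch_driver_location(driver_id: str) -> dict:
    # In production this would call another service; locally it is faked.
    return {"driver_id": driver_id, "lat": 34.05, "lng": -118.24}


@app.route("/v1/drivers/<driver_id>/location")
def driver_location(driver_id: str):
    return jsonify(fetch_driver_location(driver_id))


def test_driver_location():
    # Exercise the endpoint in-process; no other services are required.
    resp = app.test_client().get("/v1/drivers/abc/location")
    assert resp.status_code == 200
    assert resp.get_json()["driver_id"] == "abc"
```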

The following posts in this series will dive deeper into each of these three workflows to discuss the problems we faced, how we tackled them, and what we learned. Check out the next post on local development, which will share more about the tools we use to inspect, mock, and mutate network requests when developing locally.

If you’re interested in working on developer productivity problems like these then take a look at our careers page.

Special thanks to the following people who helped to create this post: Brady Law, Glen Robertson, Ryan Park, Daniel Metz, Scott Wilson, Jake Kaufman, Susan Chan, Michael Rebello
