Lyft Engineering - Medium

FacetController: How we made infrastructure changes at Lyft simple

Miguel Molina — Mon, 24 Feb 2025 15:48:48 GMT

Written by Miguel Molina and Arvind Subramanian

FacetController

If you are curious about Lyft’s automatic deployment process on a higher level, please read our blog post on Continuous Deployment.

In this post, we will go a little deeper into the deployment stack and how we leverage Kubernetes Custom Resource Definitions (CRDs) to create an abstraction on top of Kubernetes native resources, known at Lyft as facets. Additionally, we will discuss the new controller we developed to manage these facets and to streamline infrastructure rollouts across the company.

What are facets?

When deploying code, each Lyft microservice is composed of smaller deployable Kubernetes components called facets. There are several facet types representing different deployable Kubernetes objects, and they are defined in a generic manifest.yaml file within each project’s repository.

The following are some of the facet types we have at Lyft:

Service facets

These facets receive and send traffic; they are typically web servers exposing a microservice’s APIs. In this example, the service facet has different autoscaling minimum and maximum sizes per environment, and the HPA (Horizontal Pod Autoscaler) scales up when CPU utilization reaches 70%.

- name: webservice
  container_command: go run main.go
  type: service
  autoscaling:
    criteria:
      cpu_target: 70
    environment:
      staging:
        min_size: 5
        max_size: 20
      production:
        min_size: 5
        max_size: 200

This metadata ensures that the Kubernetes resources for a ReplicaSet, Service, Deployment, ConfigMap, and an HPA are created.

Worker facets

These facets only send traffic and typically do some offline processing of work, like taking items from a queue and performing some action.

- name: offlineworker
  container_command: somestartupcommand.sh
  type: worker
  autoscaling:
    min_size: 1
    max_size: 1

This metadata ensures that the Kubernetes resources for a ReplicaSet, Service, Deployment, ConfigMap, and an HPA are created.

Cron facets

These facets run workloads on a schedule. For example, once a week on Sunday.

- name: mycron
  container_command: somestartupcommand.sh
  type: cron
  schedule: 0 0 * * SUN

This metadata ensures that the Kubernetes resources for a CronJob are created.

Job facets

These facets run workflows once at deployment time and are then terminated and deleted.

- name: s3uploadjob
  container_command: upload_data_to_s3.py
  type: job

This metadata ensures that the Kubernetes resources for a Job are created.

Batch facets

These facets contain code for a workflow that can be invoked by the user whenever an action is needed. For example, running a DB migration.

- name: dbmigrationbatch
  container_command: migration.py
  type: batch

This metadata ensures that the Kubernetes resources for a Job are created.
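The facet types above can be summarized as a lookup from facet type to the Kubernetes resource kinds the text says are created for it. The following is only an illustrative sketch: the dictionary layout and the `child_kinds` helper are hypothetical names, not part of Lyft's tooling.

```python
# Resource kinds per facet type, taken from the descriptions above.
# The structure and names here are illustrative assumptions.
FACET_CHILD_KINDS = {
    "service": ["ReplicaSet", "Service", "Deployment", "ConfigMap", "HorizontalPodAutoscaler"],
    "worker": ["ReplicaSet", "Service", "Deployment", "ConfigMap", "HorizontalPodAutoscaler"],
    "cron": ["CronJob"],
    "job": ["Job"],
    "batch": ["Job"],
}


def child_kinds(facet: dict) -> list:
    """Return the resource kinds expected for one manifest facet entry."""
    facet_type = facet["type"]
    if facet_type not in FACET_CHILD_KINDS:
        raise ValueError(f"unknown facet type: {facet_type!r}")
    # Return a copy so callers cannot mutate the shared table.
    return list(FACET_CHILD_KINDS[facet_type])


mycron = {"name": "mycron", "container_command": "somestartupcommand.sh",
          "type": "cron", "schedule": "0 0 * * SUN"}
print(child_kinds(mycron))
```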

Deploying Facets

Lyft developers can reference and target facets in their microservice’s deploy pipeline with a controlled rollout. This allows a deploy step to target an environment or a specific percentage of that environment as well as target individual facets of the service using the target_facets field. For more details on pipeline structures, refer to the Continuous Deployment blog.

For example, a deploy pipeline might look like this:

deploy:
 - name: staging
   automatic: true
   environment: staging
   target_facets: [webservice, offlineworker, dbmigrationbatch]
 - name: canary
   environment: production
   bake_time_minutes: 10
   automatic: true
   target_facets: [webservicecanary]
 - name: one-third-of-production
   environment: production
   bake_time_minutes: 10
   automatic: true
   target_facets: [webservice]
 - name: production
   environment: production
   automatic: true
   target_facets: [webservice, offlineworker, dbmigrationbatch, s3uploadjob, mycron]

Problems

Early in Lyft’s migration to Kubernetes (2019–2020), the way Kubernetes deployments and manifests were configured was evolving rapidly (the templates defining deployments changed a lot!). At the time, any update to these templates could only be propagated when a new deployment was triggered.

At deployment time, the system would read the project manifest, translate it into the relevant Kubernetes objects (Deployments, ConfigMaps, HPAs, RoleBindings, ServiceAccounts, etc.), and apply these objects to the relevant Kubernetes clusters. We templated Kubernetes files at deploy time, combining user-defined configuration with some static logic, similar to Helm-style deployments.

With over one thousand microservices, each containing a number of facets, thousands of deployments or redeployments were needed to update facet objects after any template change. As a result, major changes to environment variables, scaling, or other configuration were difficult to track and to fully converge across all services, and equally difficult to roll back in an emergency.

Each time we needed to add a field to a facet type or run an infrastructure-wide migration, we faced heavy manual tracking (for example, a spreadsheet of what still needed deployment) and lots of coordination with every service team. Even a simple template change, like adding or renaming a field, would typically take many weeks or months.

These problems all highlighted that we lacked a high-level Custom Resource Definition (CRD) for deployable objects and a way to manage them. So we introduced FacetController.

Solution: FacetController

FacetController manages the lifecycle of facets. Instead of applying all the Kubernetes objects mentioned above, the deploy process now creates or updates a single facet resource (e.g., ServiceFacet, WorkerFacet), defined through a Custom Resource Definition, on a Kubernetes cluster. The facet resource closely resembles the metadata that is exposed to our developers in deployment manifests. When a facet spec is updated, or during a regular deployment on the cluster, FacetController picks up the change, creates or updates the associated child resources (e.g., Deployments, ConfigMaps), and deletes resources that are no longer required. This allows changes to how these child resources are defined to be propagated quickly and easily to all services at Lyft.
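The create/update/delete behavior described above follows the usual controller reconcile pattern: compare the child resources a facet should have with what exists on the cluster, then act on the difference. A minimal sketch of that comparison follows. The `plan_reconcile` function and the `(kind, name)` keys are hypothetical, and the real controller's internals are not described in this post.

```python
def plan_reconcile(desired: dict, existing: dict) -> dict:
    """Diff desired child resources against what is on the cluster.

    Both arguments map a (kind, name) tuple to a resource spec.
    Returns the keys to create, update, and delete, each sorted.
    """
    create = sorted(k for k in desired if k not in existing)
    update = sorted(k for k in desired if k in existing and desired[k] != existing[k])
    delete = sorted(k for k in existing if k not in desired)
    return {"create": create, "update": update, "delete": delete}


# Hypothetical state for a facet named "webservice".
desired = {
    ("Deployment", "webservice"): {"replicas": 5},
    ("ConfigMap", "webservice"): {"data": "v2"},
    ("HorizontalPodAutoscaler", "webservice"): {"cpu_target": 70},
}
existing = {
    ("Deployment", "webservice"): {"replicas": 5},
    ("ConfigMap", "webservice"): {"data": "v1"},
    ("Service", "legacy"): {},
}
plan = plan_reconcile(desired, existing)
print(plan)
```

Because the plan is derived only from the desired and existing state, changing the template that produces `desired` and redeploying the controller is enough to converge every facet, which is the property the post relies on.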

Now all that is required when changing the templates is a deployment of FacetController instead of individually deploying each service at Lyft. FacetController effectively saved every infrastructure team from spending multiple quarters on migrations that now take only a few weeks to fully roll out and test safely.

Design of FacetController

Infrastructure Management is way easier

The biggest benefit of FacetController is that it gives us a way to drive sweeping changes to user services safely and to ensure those changes happen in an automated fashion. Some examples:

Changes to Underlying Infrastructure

Autoscaling Changes (Kubernetes/autoscaler to Karpenter)

FacetController enabled our migration to Karpenter from Cluster Autoscaler to manage how our nodes get packed with pods and balanced over time. It allowed us to slowly and safely select projects for deployment to Karpenter-managed nodes by using labels added through FacetController.

Kubernetes Upgrades

As Lyft’s infrastructure has evolved, some Kubernetes clusters are pending deprecation and are running older versions of Kubernetes while other clusters are running newer versions. Even though older clusters may rely on deprecated APIs, FacetController allows for managing different cluster versions by generating the appropriate resources based on each cluster’s specific API version.
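One way to picture version-aware generation is a small function that picks the `apiVersion` for a child resource from the cluster's Kubernetes minor version. The sketch below is an assumption about how this could look, not Lyft's implementation; `api_version` is a hypothetical helper. The version boundaries used are the upstream ones: `batch/v1` CronJob is available from Kubernetes 1.21, and `autoscaling/v2` from 1.23.

```python
def api_version(kind: str, minor: int) -> str:
    """Pick an apiVersion for `kind` on a Kubernetes 1.<minor> cluster.

    Only two kinds are covered; very old clusters that predate the beta
    APIs are out of scope for this sketch.
    """
    if kind == "CronJob":
        return "batch/v1" if minor >= 21 else "batch/v1beta1"
    if kind == "HorizontalPodAutoscaler":
        return "autoscaling/v2" if minor >= 23 else "autoscaling/v2beta2"
    raise ValueError(f"no version rule for kind: {kind!r}")


for minor in (20, 22, 25):
    print(minor, api_version("CronJob", minor), api_version("HorizontalPodAutoscaler", minor))
```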

Changes to Developer Experience

CPU limits removal

Removing CPU limits allowed Lyft services to eliminate CPU throttling and let our most critical services burst when needed. The benefits of this have been discussed extensively by others, so here are some articles that explore the topic in more detail: Making Sense of Kubernetes CPU Requests And Limits | by JettyCloud | Medium, Remove your CPU Limits | by Shon Lev-Ran | Directeam, and The container throttling problem | Dan Luu.

Stay tuned for a future blog post on how removing CPU limits unblocked many cost savings initiatives.

Scaling on service container CPU

At Lyft we run many sidecar containers (envoyproxy, stats, logging, etc.) in each pod. Pod-level CPU can be deceiving: the sidecars can skew the pod’s average CPU utilization while the application container is actually running hotter. This made us realize the importance of also scaling on application container CPU, and we now use the max of the application container CPU and the pod’s overall CPU for more accurate scaling.
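To see why taking the max matters, here is a small sketch built on the replica formula from the Kubernetes HPA documentation, desired = ceil(current × currentMetric / targetMetric), evaluated once for pod-level CPU and once for the application container. The function names and numbers are illustrative assumptions, and the HPA's tolerance band and stabilization windows are ignored.

```python
import math


def desired_replicas(current_replicas: int, utilization: float, target: float) -> int:
    """Core HPA formula: scale replicas by the ratio of observed to target."""
    return math.ceil(current_replicas * utilization / target)


def scale_decision(current_replicas: int, pod_cpu: float, app_cpu: float,
                   target: float = 70) -> int:
    """Take the larger recommendation of pod-level and app-container CPU."""
    return max(
        desired_replicas(current_replicas, pod_cpu, target),
        desired_replicas(current_replicas, app_cpu, target),
    )


# Idle sidecars pull the pod average down to 50% while the app runs at 90%.
print(desired_replicas(10, 50, 70))  # pod-level signal alone
print(scale_decision(10, pod_cpu=50, app_cpu=90))
```

With these made-up numbers, the pod-level signal alone would shrink the deployment even though the application container is above its target, while the max of the two signals scales it up.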

FacetController’s Net Benefits

Proper abstraction for facets and their templates

With FacetController, we now have one unified codebase to manage the lifecycle of facets instead of disparate systems that require individual updates. This consolidation means that tools that modify facets (e.g., our internal developer platform and command line tools) now interact with one resource instead of multiple resources that could diverge between the old tools.

Automatic Garbage Collection (GC) of resources

Before, when deprecating a facet for a service, we had to manually delete all of that facet’s objects, such as the ConfigMap, K8s Service, and K8s Deployment. Now, because each facet has its own standard interface, template, and management, all of these are automatically GC’d when a facet is removed from a project’s manifest.

No need for en-masse redeploys of services for an infrastructure-level change

A change to the facet spec used to require coordinating with service owners and redeploying thousands of services, which would take multiple months. Now most infrastructure-level changes take effect in minutes, yet can still be done in a controlled manner with rollout flags when a percentage-based rollout is required. This has saved infrastructure engineers many months of work.
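A percentage-based rollout flag is commonly implemented by hashing each service into a stable bucket and enabling the change for buckets below the current percentage. The post does not say how Lyft's rollout flags work, so the sketch below is only one plausible approach; `in_rollout` and the flag name are hypothetical.

```python
import hashlib


def in_rollout(service: str, flag: str, percent: int) -> bool:
    """Return True if `service` falls inside the first `percent` buckets.

    The bucket is derived from a hash of the flag and service name, so it
    is stable across controller restarts, and raising `percent` only ever
    adds services to the rollout.
    """
    digest = hashlib.sha256(f"{flag}:{service}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


services = ["webservice", "offlineworker", "mycron", "s3uploadjob"]
enabled = [s for s in services if in_rollout(s, "example-flag", 50)]
print(enabled)
```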

Safe rollout of infrastructure-level changes

Although changes are applied outside of a service’s deployment pipeline, we kept safety as a top priority in the design of how FacetController is deployed. Because we run multiple Kubernetes clusters at Lyft for availability, changes are rolled out on a per-cluster basis, and they can even be limited to select services within a cluster.

As another safeguard, we limit concurrent updates to facets, which reduces the impact of problematic changes and lets us throttle updates.

Future work

We have fully adopted the controller pattern in different areas of our Kubernetes platform, using FacetController as an example when designing other controllers that manage and automate other parts of our infrastructure.

Some services at Lyft require additional resources and configurations outside of the provided templates, often for reasons such as using open source configuration. We refer to these as Direct Facets because they directly apply template files to Kubernetes. These exempt services do not use FacetController and therefore do not get the benefits mentioned above. However, we are actively working on adding generic support for these services so that they can leverage the platform.

….

Special thanks to all the people that contributed to the blog post and FacetController over the last few years: Mike Cutalo, Tuong La, Daniel Metz, Frank Porco, Tom Wanielista, and Yann Ramin.

Lyft is hiring! If you’re passionate about Kubernetes and building new controllers or using FacetController, visit Lyft Careers to see our openings.

FacetController: How we made infrastructure changes at Lyft simple was originally published in Lyft Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Using Marketplace Marginal Values to Address Interference Bias

Shima Nassiri — Mon, 27 Jan 2025 18:49:50 GMT

Written by Shima Nassiri and Ido Bright

Network Effect

At Lyft, we run various randomized experiments to tackle different measurement needs. User-split experiments account for 90% of the randomized studies due to their higher power and fit for most use cases. However, they are prone to interference, or network bias. In a multi-sided marketplace, supply and demand are never perfectly balanced, so one side of the market is congested: if we have oversupply, we can run rider-split experiments without interference concerns. If we are undersupplied, however, interference in a rider-split experiment can severely bias the results. The same goes for under- or over-demand situations and driver-split experiments. For example, in a supply-constrained situation, not enough drivers are available to meet the demand. As illustrated in Figure 1, in such an environment, if the treatment in an A/B experiment incentivizes more riders to complete their intended rides, fewer resources will be available for the riders in the control group. Hence, the outcomes of the control group will be negatively impacted by the treatment through the congested resource (i.e., the drivers), and the impact of the treatment can be overestimated. This is known as interference bias or network effect. This situation violates the Stable Unit Treatment Value Assumption (SUTVA), which requires that the control group not be affected by the treatment for the results to remain unbiased.

Figure 1: Network effect in an undersupply situation

It’s important to recognize that interference doesn’t always lead to an overestimation of the treatment effect. For instance, in social networks, treating units in the treatment group can positively influence the outcomes for control units who are friends with those treated, boosting control outcomes and thus reducing the perceived treatment effect. Similarly, in a retail setting, with complementary products, treating units in the treatment group might positively impact complementary products often purchased together, inflating control outcomes and underestimating the treatment effect. Conversely, for substitutable products, the opposite occurs, where the treatment effect may be overestimated.

Possible Solutions to Interference

Much of the literature on interference focuses on modifying classical experimental designs to mitigate its effects. Cluster randomization is a popular method for addressing interference. For instance, at Amazon, cluster randomization is explored to tackle interference issues among substitutable products. In Section 4 of Cooperider and Nassiri (2023), the authors also address the challenge of low power resulting from such clustering and discuss how power can be improved through better cluster balancing.

Other alternative designs like time-split or region-split experiments can also be used to address interference. In a time-split experiment, all units are exposed to a single treatment at any given time or time-location combination, which helps prevent the interference effect. (This type of experiment is also known as a switchback experiment.) However, this approach can affect the user experience for user-facing changes. For example, if we frequently toggle a UI feature that provides the driver with more rider information, it might disrupt the user experience. Additionally, time-split experiments are inherently suited for scenarios where the focus is on the overall marketplace impact. They are designed to capture short-term marketplace behavior, as users experience different treatments throughout the experiment. However, it’s not possible to include a holdout group in a time-split experiment, making them unsuitable for assessing long-term impacts. Therefore, time-split experiments are suitable only for a limited range of use cases. Experimenters might opt to run a time-split experiment followed by a user-split experiment to leverage the strengths of both approaches. This strategy allows them to accurately gauge marketplace-level effects without interference concerns through the time-split, while also assessing user-level, long-term impacts via the user-split. However, this approach is costly to implement and can delay decision making by several weeks.

On the other hand, region-split or geo experiments apply a treatment across an entire region or region-time bucket, effectively eliminating interference bias since significant interference across different regions is unlikely. Additionally, they don’t impact user experience. However, region-split experiments often suffer from low statistical power due to smaller effective sample sizes, which limits their large-scale adoption.

Another method to obtain unbiased treatment effect estimates despite interference is to model the interference. Interference can be a challenge in two types of marketplaces: choice-based (e.g., Airbnb and Amazon) and match-based (e.g., Lyft and DoorDash). In choice-based marketplaces, customers select from multiple options, making it more complex to model the interference. In contrast, match-based marketplaces assign customers to a single option, which simplifies the modeling of interference. At Lyft, we use a Marketplace Marginal Values (MMV) approach for modeling interference. You can find the theoretical details of this approach in Bright et al. (2024). Essentially, MMV represents the change in the gain (which can be whatever you are optimizing for, e.g., more profit or rides) as a result of changing the resource (additional supply/demand) by one unit. This concept is commonly known as the shadow price in the operations research literature.

Why MMV?

In the paper, the authors present technical proofs demonstrating how marginal values can help significantly reduce the bias of the treatment effect estimator. As previously mentioned, the primary source of interference bias is the competition for limited resources. Marginal values effectively capture this resource contention. Consider the following situations:

Figure 2: marginal vs. face value of a rider

As illustrated in Figure 2, when supply is abundant, the marginal value of having rider R1 matches its face value, which is $6. However, in a low supply scenario where resources are limited, the resource is allocated to rider R2. Consequently, both the marginal and face values for R1 become zero. For rider R2, the face value is $10, but its marginal value is only the additional $4 gained by having rider R2. This demonstrates how the marginal value inherently accounts for resource contention. By aggregating the marginal values across both the treatment and control groups and calculating the difference, one can derive an unbiased estimator of the average treatment effect.
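The numbers in this example can be reproduced with a few lines of code by defining a rider's marginal value as the drop in the best achievable total when that rider is removed. This is a toy sketch under a simplifying assumption (every rider is worth the same to any driver), and the helper names are hypothetical.

```python
def best_total(rider_values: dict, n_drivers: int) -> int:
    """Best achievable total when each driver can serve at most one rider.

    Every rider here is worth the same to any driver, so the optimum
    simply serves the highest-value riders first.
    """
    return sum(sorted(rider_values.values(), reverse=True)[:n_drivers])


def marginal_value(rider: str, rider_values: dict, n_drivers: int) -> int:
    """Gain in the optimal total from having `rider` in the market."""
    without = {r: v for r, v in rider_values.items() if r != rider}
    return best_total(rider_values, n_drivers) - best_total(without, n_drivers)


riders = {"R1": 6, "R2": 10}  # face values from the Figure 2 discussion

print(marginal_value("R1", riders, n_drivers=2))  # abundant supply
print(marginal_value("R1", riders, n_drivers=1))  # low supply
print(marginal_value("R2", riders, n_drivers=1))  # low supply
```

With two drivers, R1's marginal value equals its face value of 6. With one driver, R1 contributes 0 and R2 contributes only the extra 4 over what R1 would have produced, matching the description above.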

How to compute MMVs?

As previously mentioned, shadow prices in the dispatch optimization problem can be used to obtain the MMVs. The primal dispatch problem can be described as follows:
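Written out from the variable and constraint definitions given in the next paragraph, the program has the standard bipartite-matching form. This is a reconstruction from those definitions and may differ from the authors' exact notation:

```latex
\begin{aligned}
\max_{x} \quad & \sum_{i} \sum_{j} \pi_{ij}\, x_{ij} \\
\text{s.t.} \quad & \sum_{i} x_{ij} \le 1 \quad \forall j
  && \text{(each driver is matched to at most one rider)} \\
& \sum_{j} x_{ij} \le 1 \quad \forall i
  && \text{(each rider is matched to at most one driver)} \\
& x_{ij} \in \{0, 1\} \quad \forall i, j
\end{aligned}
```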

Here xᵢⱼ is a binary variable that takes the value 1 if driver j is matched to rider i, and 0 otherwise, and πᵢⱼ represents the score (e.g., profit) of matching driver j to rider i. The first constraint ensures that a driver is matched with at most one rider per matching cycle (more on this later), and the second constraint ensures that a rider has at most one driver. Solving this optimization gives the optimal matching of drivers to riders. We can relax the last constraint, the integrality requirement on xᵢⱼ, into xᵢⱼ ≥ 0 and obtain a linear relaxation of the above problem, for which we can compute the dual as:
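Applying the usual linear programming duality rules to that relaxation, with μⱼ as the multiplier on the driver constraint and λᵢ as the multiplier on the rider constraint, gives a dual of the following form. This is a sketch derived from the stated primal, not a copy of the authors' equation:

```latex
\begin{aligned}
\min_{\mu,\, \lambda} \quad & \sum_{j} \mu_{j} + \sum_{i} \lambda_{i} \\
\text{s.t.} \quad & \mu_{j} + \lambda_{i} \ge \pi_{ij} \quad \forall i, j \\
& \mu_{j} \ge 0 \quad \forall j, \qquad \lambda_{i} \ge 0 \quad \forall i
\end{aligned}
```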

The dual variable μⱼ is associated with the driver constraint (first primal constraint), and λᵢ is associated with the rider constraint. This means that each driver j has an associated dual variable μⱼ, and each rider i has an associated dual variable λᵢ. More on duality can be found here. To find the MMVs, we generate a matching-cycle dispatch graph, solve it, and then efficiently compute the incremental values via the duals. Consider the objective function, denoted as Π(d,s), where d and s represent the demand and supply, respectively. Assume that the treatment increases the demand by e. Then the global effect of such a treatment can be presented as follows:
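Read literally, the global effect is the change in the optimal objective when demand grows by e while supply is held fixed. The expression below is a reconstruction under that reading, with a first-order expansion added for intuition. Both are assumptions about the notation rather than the paper's exact statement:

```latex
\Delta \;=\; \Pi(d + e,\, s) \;-\; \Pi(d,\, s)
\;\approx\; \frac{\partial \Pi}{\partial d}(d,\, s)\; e
```

In linear programming sensitivity analysis, the derivative of the optimal objective with respect to a resource is that resource's shadow price, which is why the rider duals appear in the estimator.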

Now, to estimate Δ, we can run a 50/50 rider-split A/B test where each group serves half the demand. Let the demand in each group be denoted by dₑ. We then have:

The global average treatment effect can then be estimated as:

where λ* is the optimal rider dual, or shadow price. Here, the analysis provides first-order Taylor approximation results; for more details, see Proposition 5 of Bright et al. (2024). We can observe that the difference in the objective function outcomes across treatment and control groups can be expressed through the shadow price. In the paper, the authors also ran a simulation and showed that this shadow price estimator dampens the overestimation of the default estimates from standard A/B tests while lowering the noise level (refer to Figure 8 in Bright et al. (2024)).

Matching Cycle

Next, we need to decide how often to solve these optimization problems, essentially determining the length of the matching cycle. If the matching cycle is too short, contention can occur between cycles. For instance, a driver who isn’t matched in Cycle 1 might be available in the next cycle, or a rider choosing the wait-and-save option might wait several cycles before being matched. At Lyft, we use a 1-hour mega cycle to solve the dispatch optimization problem for all eligible riders and drivers within that period. This cycle length helps significantly reduce concerns about contention between cycles.

Secondary Metrics

Finally, if we want to assess the MMV-corrected impact of a treatment on metrics beyond those defined by the dispatch objective function, we can compute the edge or ride cost for each completed ride (e.g., νᵢⱼ). Considering a linear relaxation of the primal problem and applying complementary slackness, we have:
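For reference, the generic complementary slackness conditions for a matching relaxation with edge values νᵢⱼ in place of πᵢⱼ are sketched below, where x* denotes the optimal matching. Substituting νᵢⱼ into the dispatch problem's primal and dual pair is an assumed reading of the text, so this may not match the authors' exact system:

```latex
\begin{aligned}
x^{*}_{ij}\,\bigl(\mu_{j} + \lambda_{i} - \nu_{ij}\bigr) &= 0 \quad \forall i, j \\
\mu_{j}\,\Bigl(1 - \sum_{i} x^{*}_{ij}\Bigr) &= 0 \quad \forall j \\
\lambda_{i}\,\Bigl(1 - \sum_{j} x^{*}_{ij}\Bigr) &= 0 \quad \forall i
\end{aligned}
```

Under these conditions, every completed ride (x*ᵢⱼ = 1) satisfies μⱼ + λᵢ = νᵢⱼ, and any unmatched rider or driver has a zero dual.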

Assuming non-degeneracy, we can then solve the above system of equations to find the optimal dual values and use them to estimate the average treatment effect in the same way as before.

Productionalizing MMV in an Experimentation Platform

To implement MMV in an experimentation platform, we solve the matching optimization problem for passenger and driver duals on an hourly basis, as previously described, and store these values in a table. This data is then used to calculate MMV-corrected values for drivers and riders. These metrics are included in experiment reports alongside other metrics, with standard computations like CUPED applied to them. Below is an example of how an MMV-corrected metric might appear in a driver-randomized experiment. Here, riders are the congested resource contributing to the interference bias and to the overestimation of the results discussed earlier. The MMV correction dampens the estimated effect by accounting for the contention over the limited resource (in this case, the pool of riders).

It’s important to note that there are limitations to the use cases for MMV. For instance, MMV cannot be applied in situations where the target population for randomization is not drivers or passengers. An example of this would be mapping experiments where the route is associated with the ride itself, rather than being specific to drivers or passengers, making MMV-corrected metrics unsuitable.

MMVs in Practice

Other experimental designs like time-split, region-split, or a combination of time- and user-splits often fall short in addressing the majority of experimentation needs. They tend to lack sufficient power, are costly to implement, and can take several weeks to execute. In contrast, the MMV approach can adjust the effect sizes in user-split experiments where interference is a concern. This is particularly important in cases with significant network effects, as the change in effect magnitude can be substantial, potentially altering launch decisions under MMV correction.

At Lyft, we’ve had instances where both time- and user-split experiments were conducted for the same initiative to capture both market-level and long-term effects. We compared the MMV-corrected user-split outcomes with the time-split outcomes in three historical cases where a time-split counterpart was available. After applying MMV correction, we observed greater alignment with the time-split results.

Additionally, a comprehensive backtest across various user-split experiments was conducted, comparing MMV-corrected completed rides with traditional metrics. In 10% of these comparisons, the launch decision could change when using MMV-corrected values. These cases were evenly split between false positives (launching based on traditional values when MMV-corrected values didn’t meet launch criteria) and false negatives.

Moreover, when MMV results show a lower magnitude compared to naive user-split results, particularly in resource-constrained experiments, we anticipate an average 45% reduction in outcome magnitude based on this numerical study. This decrease occurs because the contribution of each ride is divided between the rider and the driver when calculating marginal values, thus correcting for the overestimation of effects due to interference bias.

Acknowledgements

We would like to thank Anahita Hassanzadeh and Thu Le for helpful discussions and suggestions.

Lyft is hiring! If you’re passionate about experimentation and measurement, visit Lyft Careers to see our openings.


Cartography joins the CNCF

Alex Chantavy — Wed, 18 Dec 2024 17:20:44 GMT

Written by Alex Chantavy

Today we’re thrilled to announce that Lyft has donated Cartography to the Cloud Native Computing Foundation (CNCF). Since Lyft open sourced it in 2019, it’s been rewarding to see the project grow from an experimental tool to a solution that’s been battle-tested in production by multiple companies. In this post, we’ll reflect on our learnings from Cartography’s open source journey and discuss where it’s going next.

Origins and growth

Cartography was first created to find attack paths that a malicious actor could take to compromise data in cloud environments. We first used it to understand complicated IAM permissions so that we could think like an attacker and identify the shortest path to administrator privileges.

We soon realized that this graph capability was equally valuable for defenders. It also allowed us to quickly answer questions like, “Which of my services are internet-facing and running a vulnerable version of a software library?” or “Which directors at the company own the most security risk?”

In 2020, we chose to use Cartography as the backbone for our vulnerability management program because it helped us contextualize risks across our infrastructure in a way that no other tools did. As I talked about at BSidesSF, this was not an easy or smooth journey but it forced us to quickly mature the tool and improve our correctness, stability, and performance.

Lessons learned from open source

Through all this, we built a community and got to meet many of you. Matt Klein’s advice on managing an open source project (1, 2) was extremely helpful, and I’ll summarize some of my own recommendations for those considering open sourcing a project:

Open source is a big commitment. It’s like building a global engineering team. Define clear goals, set expectations for support, and remember that open source is usually a net negative for companies unless there’s large adoption.
Communicate and make decisions openly. Let the community have buy-in and understand your project’s direction. Start a public chat channel and hold regular community video calls.
Appoint external maintainers. To avoid burnout, find regular contributors who align with your vision. Allow them to review and merge PRs (we actually made one of our key hires for Lyft’s security team this way!).
Documentation is everything. Let others self serve and unblock themselves to avoid friction when trying out and learning your project.
Set a clear standard: passing tests = mergeable PR. Encourage contributions by setting clear expectations. Include checklist templates, use linters, and implement robust automated tests. Do what you can to avoid PR authors becoming frustrated and leaving the project behind.
Understand your project’s niche. Focus on its comparative advantages and avoid trying to be everything for everyone.

The impact of open sourcing Cartography

Cartography has grown to over 300 Slack members, 90 committers to the main source branch, and over a dozen companies adopting it (that we know of). It’s been incredibly rewarding to see this growth, and it was cool to see how community members built alerting around it or tried to experiment with other backend tech. Having a good open source presence helped Lyft source candidates, and as mentioned above, we were even able to make a key hire from the community. Cartography’s open source status enabled our vuln management and auditing programs to run smoother internally at Lyft as the community often encountered and fixed bugs before we did. We also benefited from dozens of community-contributed modules, many of which were from former Lyft employees and it was nice being able to collaborate with them even after they had changed companies.

The path to the CNCF

After nearly six years of working on Cartography, I believe more and more that having a self-maintaining map of your infra is a superpower, and I can’t imagine working anywhere without it. I’d like to see a world where having a “Cartography-like” graph representation of infra assets becomes something of an open standard, especially since modern companies must maintain visibility over an ever-growing plethora of providers and tools.

However, one of the realities of running an open source project is that contributors (understandably) come and go with time. A successful project needs a steady stream of people discovering it, making contributions, and becoming maintainers. It became clear that growing the project wasn’t going to be possible over the long run if its steward was just one company.

In August 2023, we applied to donate Cartography to the CNCF. We pursued the CNCF in particular because Cartography was built to solve security problems that are uniquely complicated in cloud-native environments.

By donating the project we hope to:

Demonstrate Cartography’s commitment to being fully open source and supported over the long term.
Improve its reach by showing it has achieved a high level of maturity and can be trusted in production.
Receive logistics help in hosting the project. The foundation provides resources such as web hosting, video conferencing, Slack, GitHub continuous integration services, and others.

After a long, thorough review, Cartography was finally accepted by the foundation in August 2024 — big thanks to the Technical Oversight Committee and CNCF staff for shepherding the project through the vote and onboarding it!

The future

Now that Cartography is a CNCF project, what does this mean? The only practical differences are that our Slack channel is now hosted by CNCF instead of Lyft, and our GitHub URL is now slightly different: https://github.com/cartography-cncf/cartography. Cartography will still be developed and led by those who are interested, i.e. its community members. If this project seems useful to you, please try it out and say hi in the #cartography channel on the CNCF Slack — we’d love to hear your feedback. If you think someone else would find Cartography useful, please also share it with them. There are lots of new technical directions I’d like to explore in the future but we can only do this if the community continues to grow and be supported — future blog post to come!

Working on open source has been a career highlight for me, and I like to think that we’ve done at least a little to help the information security industry think in graphs and not lists.

Thank you

Special thanks to the leadership of Lyft’s security team who have been instrumental in their support of Cartography in this multi-year open source journey: Sacha Faust (Cartography’s original creator), Chaim Sanders, Nico Waisman, Matthew Webber, Ben Stewart, Martin Conte Mac Donnell, Samantha Davison, and Jason Vogrinec.

Thanks to Andrew Johnson, Taya Steere, and Evan Davis for taking Cartography from 0 to 1.

Thanks to those who helped take the project to the next level in production through vuln management and infra scenarios: Eryx Paredes, Zoe Longo, Jason Foote, Sergio Franco, Khanh Le Do, Aneesh Agrawal, Leif Raptis-Firth, Hans Wernetti, Fernando Zarate, Kunaal Sikka, Gaston Kleiman, and Lynx Lean.

Thanks to the maintainers and friends of Cartography: Ramon Petgrave, Chandan Chowdhury, Jeremy Chapeau, Marco Lancini, Ryan Lane, Kedar Ghule, Purusottam Mupunu, Daniel D’Agostino, Ashley Lowde, and Daniel Brauer.

Additional thanks to Matt Klein for mentorship in managing an open source project.

Finally, thank you to everyone who has tried out Cartography, raised an issue, shared code in a pull request, provided feedback, or otherwise interacted with the community or project in any way. There have been so many people involved in this journey — thank you for your contributions.

If you think in graphs and not lists, you should apply to work on Lyft’s security team. We’re a small team that absolutely punches above our weight in solving big engineering problems. Visit Lyft Careers to see our openings.

Cartography joins the CNCF was originally published in Lyft Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Integrating Extensions into Large-Scale iOS apps

Max Husar — Wed, 11 Dec 2024 19:46:26 GMT

Written by Artur Stepaniuk and Max Husar

Today, when you open Apple Maps and choose a destination, you are able to see a list of available Lyft offers, seamlessly routing you to the Lyft app to book your next ride. To create this fluid and user-friendly experience across the iOS ecosystem, however, engineers must tackle a range of technical challenges, from managing dependencies in a highly modular application to optimizing performance while maintaining a high quality user experience.

Disclaimer: For the purpose of this deep dive, we assume that you have a general understanding of what a build system is, including concepts like modules, static/dynamic frameworks, dependency graphs and build settings customization via flags.

Demo of Lyft’s integration to Apple Maps

From an implementation perspective, this involves creating a separate application extension. Detailed descriptions of the API for booking rides can be found in the relevant Apple documentation, which includes multiple code examples.

Let’s explore the architectural nuances, challenges faced during development, and solutions implemented to overcome these obstacles:

The Dependency Jungle and how to manage constraints in a highly modular application, complying with RAM and binary size limitations;
Development process caveats & SiriKit integration tips to maintain a consistent development user experience.

Most of the complications we faced in this use case can be generalized to integrations with other parts of the iOS system or with other applications.

Dependency Jungle

Lyft applications are built using the Bazel build system (check out our Lyft Mobile podcast with Keith Smiley). Our codebase is highly modular, with each business feature consisting of several separate blocks/modules. This modularity forces the use of static linking for dependencies to avoid long app start times, among other benefits.

However, the downside of static linking is that each linked dependency is copied to the extension, which is a separate target. With many small modules, this leads to numerous connections between them, potentially causing issues and inevitably increasing the dependency graph complexity.

The advantage of a highly modular codebase is that it simplifies adding corresponding modules and avoids code duplication or significant refactoring. However, after the initial setup, the dependency tree of the extension looks as follows:

Initial dependency tree of the extension module. Powered by Gephi, a visualization tool.

There are dozens of modules linked to the extension module and hundreds of dependencies between them, making the extension’s dependency tree immensely complex.

While a large dependency graph isn’t inherently problematic, it does contribute significantly to the Extension’s memory footprint.

As mentioned above, with static linking each dependency is copied. This means that every module from the image above will be added to the application package twice, increasing its binary size.

At the same time, loading these modules while the extension runs increases its overall runtime memory consumption.

Tooling tea break #1

At Lyft, we utilize Bazel to analyze module dependency graphs together with open-source Graphviz visualization software. It enables us to determine which application modules depend on others, such as checking if module A depends on module B.

Additionally, Gephi visualization software is used for demonstration purposes & as a more enhanced graph analysis tool.

An example query might look like this and can be executed against any module or the root of the app:

bazel query 'kind(swift_library, deps(MODULE_PATH:MODULE_NAME))'

The output is a list of MODULE_NAME dependencies. Different parameters allow you to create simple files with dependency lists or to build a graph representation of them. This tool is essential for our instrumentation, particularly in addressing the extension memory footprint issue.

The next section describes the limitations related to the extension’s available memory and our way of handling them.

Memory footprint

App extensions are designed to extend existing applications’ functionality, so their resource needs are limited to avoid impacting the overall user experience within the main application or the iOS system.

There is no fixed limit on an extension’s runtime memory: the available RAM depends on the iOS version, device model, OS environment, and other factors. Our explorations indicate that this limit can vary roughly between 20 and 50 MB.

Regarding binary size, it’s generally understood that smaller is better. A larger binary size can lead to longer download and install times, potentially reducing the number of installs. The worst-case scenario is hitting the 200 MB download size limit, which triggers an additional confirmation dialog during app download when using cellular data.

In our case, the initially created extension’s RAM footprint is around 21 MB which is considered safe within the explored boundaries. However, the initial binary size increase of 45 MB is a critical issue, as the extension itself would take almost ¼ of the 200MB size limit.

To address this, the following steps can be taken:

Analyze Business Logic Blocks: Break down the three main business logic blocks — authorization, available offers, and additional offers’ data. Identify the components that contribute most to the binary and runtime memory footprint.
Eliminate Major Contributors: Investigate and implement strategies to reduce or eliminate these major contributors.

To facilitate this analysis, a graph visualization tool is used. For example:

bazel query --output=graph [other omitted parameters] module=ExploredModule

The command above generates a detailed dependency graph file, which can be then visualized using Graphviz, Gephi or other software:

ExploredModule’s graph tree with most significant dependencies highlighted

The next step is to analyze the graph to identify any suspicious dependencies that might include unnecessary resources beyond the source code.
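
One way to spot suspicious dependencies in such a graph file is to count how many modules each direct dependency pulls in transitively. The following is a minimal sketch in Python, not Lyft's actual tooling. It assumes the DOT file contains one label per node (Bazel can merge nodes with identical dependencies unless factoring is disabled, for example with --nograph:factored), and every target label in the sample is made up for illustration.

```python
import re

# Matches DOT edges such as:  "//a:A" -> "//b:B"
EDGE_RE = re.compile(r'"([^"]+)"\s*->\s*"([^"]+)"')


def parse_edges(dot_text):
    """Return {module: set of direct dependencies} parsed from DOT text."""
    graph = {}
    for src, dst in EDGE_RE.findall(dot_text):
        graph.setdefault(src, set()).add(dst)
    return graph


def transitive_deps(graph, module):
    """Collect every module reachable from `module` (graph assumed acyclic)."""
    seen, stack = set(), list(graph.get(module, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return seen


def rank_direct_deps(graph, root):
    """Sort the direct deps of `root` by how many modules each one drags in."""
    sizes = {dep: len(transitive_deps(graph, dep)) for dep in graph.get(root, ())}
    return sorted(sizes.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical graph; real labels would come from the bazel query output.
SAMPLE_DOT = '''
digraph mygraph {
  "//ext:MapsExtension" -> "//features:Offers"
  "//ext:MapsExtension" -> "//core:Logging"
  "//features:Offers" -> "//ui:CoreUI"
  "//features:Offers" -> "//net:Networking"
  "//ui:CoreUI" -> "//ui:Assets"
}
'''

ranking = rank_direct_deps(parse_edges(SAMPLE_DOT), "//ext:MapsExtension")
print(ranking)
```

Module count is only a proxy for binary size, so a ranking like this is a starting point for deciding which modules to measure first rather than a measurement in itself.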

In-depth modules analysis

To measure the binary size impact in detail, each module can be added as the only dependency to the Apple Maps extension and analyzed using the `binary-size-diff` tool (explained in Tooling tea break #2 below).

By repeating this process for each necessary business logic feature and its dependencies, we can identify the main contributors to the binary size:

UI module: ~15MB
Networking layer: ~7MB
Interface Description Language (IDL)* imports: ~6MB
Maps/ML stack: ~8MB

*Note: IDL modules in the context of the Lyft iOS applications are code-generated modules with DTO models and simple API clients to communicate with backend based on the contracts predefined using protocol buffers. The usage of this concept is explained in detail in our other article.

To understand how to remove these elements or limit their impact, we can use Bazel again to show the transitive dependencies (the path) between two modules. For example:

bazel query 'allpaths(INITIAL_MODULE_PATH:INITIAL_MODULE_NAME, TARGET_MODULE_PATH:TARGET_MODULE_NAME)' --output=graph | grep -v '  node \[shape=box\];' > relations.dot

As we can see in the graph shown below, the result is much more readable than the full dependency graph.

Graph showcasing the dependencies that create Initial-Target modules connection

It allows us to identify what modules are linking our Initial module to the Target one and understand how to remove it from our dependency graph to unlink the unnecessary source code.

In our case there are 6 dependencies of the Initial module that need to be addressed to remove the biggest binary size contributor: the Target (CoreUI)* dependency.

*Note: CoreUI is our main module containing all UI components and related resources. While it’s widely used in the main app, it is not needed for the extension and only adds a significant increase in binary size.

The next step for this Initial module would be to either remove all 6 dependencies that lead to the Target module or to eliminate their connection to the Target module.
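
The list of direct dependencies to address can also be derived from the exported DOT file instead of being read off the picture. The sketch below is a hypothetical Python helper under the same assumption of one label per node; the module labels, including the stand-ins for the Initial and Target modules, are invented for the example.

```python
import re

# Matches DOT edges such as:  "//a:A" -> "//b:B"
EDGE_RE = re.compile(r'"([^"]+)"\s*->\s*"([^"]+)"')


def parse_edges(dot_text):
    """Return {module: set of direct dependencies} parsed from DOT text."""
    graph = {}
    for src, dst in EDGE_RE.findall(dot_text):
        graph.setdefault(src, set()).add(dst)
    return graph


def reaches(graph, start, target):
    """Depth-first search: is `target` reachable from `start`?"""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return False


def direct_deps_leading_to(graph, initial, target):
    """Direct dependencies of `initial` that (transitively) link `target`."""
    return sorted(dep for dep in graph.get(initial, ()) if reaches(graph, dep, target))


# Hypothetical contents of a relations.dot file.
RELATIONS_DOT = '''
digraph mygraph {
  "//features:Initial" -> "//features:OffersUI"
  "//features:Initial" -> "//features:Promo"
  "//features:Initial" -> "//core:Logging"
  "//features:OffersUI" -> "//ui:CoreUI"
  "//features:Promo" -> "//ui:Banner"
  "//ui:Banner" -> "//ui:CoreUI"
}
'''

culprits = direct_deps_leading_to(
    parse_edges(RELATIONS_DOT), "//features:Initial", "//ui:CoreUI"
)
print(culprits)  # ['//features:OffersUI', '//features:Promo']
```

Each returned module is a candidate for either removal from the Initial module or for cutting its own link to the Target.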

Separately, while some adjustments can be made to create a lighter version of the networking layer and to strip some of the IDL imports, the main issue is a singular module which fetches available ride offers (part of its dependencies is shown on the graph above) and its broad dependency list.

Two approaches are considered:

1. Extracting a core submodule to be used in both the main app’s Offers service and directly in the Apple Maps extension.

2. Creating a small, separate module containing only the functionality required for the extension.

In this case we are choosing the second approach as it allows us to have full control over added dependencies, thereby minimizing any unnecessary imports. The downside is the need to keep both the original and the new services in sync if any relevant parts in the extension are changed.

Tooling tea break #2

One of the essential tools for managing the dependency tree is the binary-size-diff script. This CI bash script allows you to compare the binary size differences between the base branch and the created Pull Request. Essentially, it compares the .ipa file sizes in both compressed and uncompressed states, enabling you to see changes in the app’s install and download sizes.
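
The actual script is a CI bash script that is not shown here, but the core comparison can be approximated in a few lines. The following Python sketch is an assumption-laden stand-in: it relies only on the fact that an .ipa file is a zip archive, treats the archive size as a rough proxy for download size and the summed entry sizes as a rough proxy for install size, and uses tiny fake archives so that it runs anywhere. The function names and file paths are hypothetical.

```python
import os
import tempfile
import zipfile


def ipa_sizes(path):
    """Return (compressed, uncompressed) sizes in bytes for an .ipa archive."""
    compressed = os.path.getsize(path)
    with zipfile.ZipFile(path) as archive:
        uncompressed = sum(info.file_size for info in archive.infolist())
    return compressed, uncompressed


def size_diff(base_path, pr_path):
    """Compare a Pull Request build against the base branch build."""
    base_c, base_u = ipa_sizes(base_path)
    pr_c, pr_u = ipa_sizes(pr_path)
    return {"compressed_delta": pr_c - base_c, "uncompressed_delta": pr_u - base_u}


def write_fake_ipa(path, files):
    """Create a small zip archive standing in for a real .ipa (demo only)."""
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, payload in files.items():
            archive.writestr(name, payload)


with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, "base.ipa")
    pr = os.path.join(tmp, "pr.ipa")
    write_fake_ipa(base, {"Payload/App.app/App": b"\0" * 1000})
    write_fake_ipa(pr, {
        "Payload/App.app/App": b"\0" * 1000,
        "Payload/App.app/PlugIns/Maps.appex/Maps": b"\0" * 5000,
    })
    diff = size_diff(base, pr)

print(diff)
```

Sizes reported by the App Store can differ from these numbers, so the deltas are best read as relative signals between two builds.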

The workflow looks like:

Create a draft Pull Request (PR).
Modify the extension’s BUILD file* to include only the dependencies you want to measure.
Commit and push the changes.
Invoke the CI with the command `/test diff-app-sizes` to run the bash script described above.
Iterate this process for each individual dependency you want to measure.

*Note: BUILD file is the main configuration file that tells Bazel what software outputs to build, what their dependencies are, and how to build them.

Optimization results

Our efforts resulted in a significantly streamlined dependency tree for the extension, leading to a substantial reduction in its binary size — from 45MB to 15MB.

Below is a visualization of the extension’s resulting dependency graph. Although it may still appear chaotic, most of the remaining modules are IDL imports. These imports are highly atomic and are all transitively linked to a networking base layer, creating some “dependency noise.”

Extension module dependency tree after performed optimizations

Development process caveats & SiriKit integration tips

While the optimization of the dependency tree marks a significant milestone in enhancing the extension’s efficiency, the journey towards a successful release doesn’t end here. Attention must now shift to the finer details of the development process. Successfully releasing the Apple Maps extension requires addressing the following small but crucial details.

Region Availability

Ensure your extension is available in the regions where your platform provides rides. This is done by creating a GeoJSON file listing the supported regions and uploading it as your app’s Routing App Coverage File (documentation). The extension card will only be displayed to users trying to book a ride within the specified regions.
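
As an illustration, a coverage file can be generated from a list of service areas rather than written by hand. The sketch below is hypothetical: it assumes the coverage file is expected to hold a single MultiPolygon shape (check Apple's documentation for the exact requirements), approximates each region with a bounding box, and uses made-up coordinates that do not represent any real service area.

```python
import json


def bbox_ring(min_lon, min_lat, max_lon, max_lat):
    """Closed, counterclockwise ring of [longitude, latitude] positions."""
    return [
        [min_lon, min_lat],
        [max_lon, min_lat],
        [max_lon, max_lat],
        [min_lon, max_lat],
        [min_lon, min_lat],  # GeoJSON rings repeat the first position at the end
    ]


def coverage_multipolygon(bboxes):
    """Build a GeoJSON MultiPolygon with one rectangular polygon per region."""
    # MultiPolygon -> list of polygons -> list of rings -> list of positions
    return {
        "type": "MultiPolygon",
        "coordinates": [[bbox_ring(*bbox)] for bbox in bboxes],
    }


# Illustrative bounding boxes: (min_lon, min_lat, max_lon, max_lat).
REGIONS = [
    (-122.52, 37.70, -122.35, 37.83),
    (-74.05, 40.68, -73.90, 40.88),
]

geo = coverage_multipolygon(REGIONS)
geojson_text = json.dumps(geo, indent=2)
print(geojson_text)
```

Note that GeoJSON positions are ordered longitude first, then latitude, which is an easy detail to get wrong when converting from other map APIs.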

Caveats with GeoJSON:

- While you can debug the GeoJSON file itself to check its correctness (developer docs), there is no direct way to test its integration with the Maps extension. Therefore, testing must be done in production after release.

- The GeoJSON file adds maintenance overhead. Whenever your service area changes, the configuration must be manually updated and uploaded.

Third-Party dependencies and the APPLICATION_EXTENSION_API_ONLY

According to the documentation, the build setting flag APPLICATION_EXTENSION_API_ONLY “when enabled, causes the compiler and linker to disallow use of APIs that are not available to app extensions and to disallow linking to frameworks that have not been built with this setting enabled.”

Implications:

If any of your application’s modules or any of their dependencies (direct or indirect) are built without this flag enabled, they cannot be linked to the extension.
While you can control this flag in your own modules, dealing with third-party dependencies may require stripping them out (in cases when it is impossible to rebuild them from the source code). This issue is closely related to efforts to reduce the extension’s memory footprint (discussed in the previous section). Fortunately in our case no critical functionality depended on these third-party frameworks.

This highlights the risks associated with relying on third-party dependencies — adding one can lead to unexpected limitations in the future.

For more on this topic, check out our article about the risks and evaluation of adding third-party dependencies.

Encountered Developer experience issues

In the process of developing an extension for Apple Maps, you may face several challenges that can impact workflow and efficiency.

The first one arises when installing your extension on a device or simulator for the first time: SiriKit may not immediately recognize it. As a result, the Apple Maps application does not display the added extension. You may need to wait several minutes before issuing any relevant commands to Apple Maps (or simply try to reinstall and rerun the extension).

Similarly, when updating your extension’s Info.plist file or making any code updates to the extension’s business logic, it may take several minutes for SiriKit to recognize the changes. This is especially important during the develop-run-debug cycle, as some unexpected behaviors might go unnoticed.

Conclusions

Integrating Lyft’s ride booking functionality with Apple Maps is a rewarding effort — this journey has underscored the critical importance of precise dependency management and efficient memory usage, even with the substantial computational resources available on modern mobile devices.

Key Takeaways:

Dependency Management: Effective management of a large dependency tree is essential to minimize the memory footprint and binary size of the extension. A highly modular codebase is key for success.
Tooling: Leveraging tools such as dependency graph visualization and binary size comparison can significantly aid in identifying and resolving issues related to memory and binary size.
Third-Party Dependencies: Relying on third-party dependencies can introduce unexpected limitations, highlighting the need for careful consideration and potential alternatives.

As we continue to refine our integration with Apple Maps, the lessons learned from this experience will guide us in overcoming new challenges and can be extrapolated to other initiatives that use Apple’s various app extensions.

And one more thing: Lyft is hiring! If you’re passionate about developing complex systems using state-of-the-art technologies or building the infrastructure that powers them, consider joining our team.

Integrating Extensions into Large-Scale iOS apps was originally published in Lyft Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Protocol Buffer Design: Principles and Practices for Collaborative Development

Roman Kotenko — Mon, 19 Aug 2024 16:40:51 GMT

At Lyft Media, we’re obsessed with building flexible and highly reliable native ad products. Since our technical stack encompasses mobile clients on both iOS and Android, as well as multiple backend services, it is crucial to ensure robust and efficient communication between all involved entities. For this task we are leveraging Protocol Buffers, and we would like to share the best practices that are helping us achieve this goal. This article focuses on our experience addressing the challenges that come with collaborating on shared protocols in teams where people with different levels of familiarity and historical context, or even people outside the team, get to contribute. The problem of development process quality is prioritized over raw efficiency optimizations.

Note: This article focuses on the proto3 specification of the Protocol Buffers (protobuf) language. Code snippets illustrating the handling of generated protobuf messages are provided in Python.

Why Protocol Buffers?

In comparison to text-based data serialization formats like JSON, protobufs offer both higher serialization efficiency & performance, and better backwards compatibility.
Our team works with Python, Swift, and Kotlin codebases extensively, and all of these languages have extensive protobuf support.
Protobufs are extensible with rich validation capabilities that help us reach our reliability goals while avoiding writing boilerplate code across platforms.
Protobufs boast rich internal tooling at Lyft, and widespread use in both mobile-to-server and server-to-server domains. For additional context on the decision to adopt protobufs for mobile development at Lyft and the process of achieving this change, refer to this 2020 article by Michael Rebello in the Lyft Engineering Blog.

Key Principles

Protocol design is different from typical coding & implementation work in a couple of crucial aspects. To illustrate the rather abstract arguments presented here, let’s imagine we’re designing a message conveying a certain event, containing typical fields like event kind, id, timestamp, etc. With no prior protocol design experience, one might be tempted to approach it like writing code and use the familiar concept of enum to distinguish between different kinds of events:

message Event {
    enum Kind {
        EVENT_KIND_A = 0;
        EVENT_KIND_B = 1;
    }

    uint64 id = 1;
    uint64 timestamp = 2;
    Kind kind = 3;
}

Indeed, this setup will work well — until new events are added that are required to carry additional data. Let’s say, now a new rich event type appears, and we add a new enum value and the new associated field like so:

message Event {
    enum Kind {
        EVENT_KIND_A = 0;
        EVENT_KIND_B = 1;
        EVENT_KIND_C = 2;
    }

    uint64 id = 1;
    uint64 timestamp = 2;
    Kind kind = 3;
    uint32 payload_size = 4;  // Specific to EVENT_KIND_C.
}

The problem with this setup is that it is implicit about the correctness of various combinations of its fields being set: if kind equals EVENT_KIND_C, should payload_size’s presence be enforced? What about if kind is EVENT_KIND_A? Of course this can be enforced in the logic handling these values, but with each new implicit relationship like this one, the code becomes more convoluted and can quickly grow unmaintainable. Avoiding such a pitfall brings us to our first principle: clarity. It’s nice to work with a protocol that’s structured in a way where it explains itself, both as perceived immediately and as proven by iterating on it long term.

Clarity relates not only to semantic relationships between fields like we just illustrated; it also applies to individual fields in their own context. Take for example the payload_size field that was just added: what unit is this value expressed in? One may assume bytes, but one shouldn’t have to, and this makes the difference between a good protocol and a lacking one. Likewise, for timestamp, what is the time unit, and which time zone is assumed? It proves much more practical to name these fields appropriately with these considerations in mind, e.g. payload_size_bytes and timestamp_ms_utc.

One other common point of ambiguity is fields being required versus being optional. The proto3 standard dropped the ability to mark fields as required for backward compatibility reasons, effectively allowing any field to be left unset and interpreting such cases as carrying the default value for its given type. Some practices will be covered later in this article to help make your protocols more explicit and more appreciated by the engineers who will be using them.

To zoom out again, a more correct way to go about structuring this message is by using oneof, the protobuf version of the familiar concept of a union. A protobuf oneof unites a set of fields to ensure that no more than one of them is set at the same time, and also to improve data transfer efficiency by saving on the payload size (since unset oneof fields do not get serialized):

message Event {
    uint64 id = 1;
    uint64 timestamp_ms_utc = 2;
    oneof data_kind {
        EventDataA data_a = 3;
        EventDataB data_b = 4;
        EventDataC data_c = 5;
    }
}

message EventDataA {}
message EventDataB {}
message EventDataC {
    uint32 payload_size_bytes = 1;
}

Now the EventData messages serve both to distinguish between event kinds, and to contain the respective fields specific to a certain kind but not others. Even if some of them are going to share a common subset of fields, the duplication of declaring them individually is worth it compared to the ambiguity in the other approach. Unlike with enums, the oneof approach is self-documenting, and it doesn’t leave much room for error interpreting how the message should be formed.

Note, however, that the structure of our message had to change in a major way to allow this upgrade, which is another key thing to keep in mind. Let’s call this the principle of extensibility. This doesn’t only apply to avoiding convoluted field semantics. Let’s say at some point the system is migrated from using integer IDs to UUIDs. It becomes a problem since the id field is already locked in as uint64, and we are forced to deprecate it and declare a new one, whereas having it as string from the beginning would allow a smooth transition to virtually any ID scheme. While it’s impossible to predict and stay safe from all potential breaking changes, there are a few common pitfalls in protobuf, which often revolve around changing the type of a field and rearranging oneof groupings.

To recap, the key principles of protocol design that we just outlined are:

Clarity: A well-designed protocol should define its messages in a way that is explicit about which fields must be set and what each field means. This prevents messages from being set incorrectly during implementation. In other words, good protocols leave no ambiguity for their implementers.
Extensibility: It is crucial that protocol structure is built with future vision and potential roadmap in mind. This way, some foreseeable additions and breaking changes can be accounted for in advance.

These ideas are quite applicable to classic software development. However, protocol design features greater constraints in comparison and therefore puts greater emphasis on the above principles.

Best Practices

Besides the broad principles, let’s go over some practices that help avoid typical pitfalls in protocol design.

Unknown enum values

It’s always a good idea to declare the 0-th element of an enum as “unknown” to ensure backward compatibility. When an enum without one is added to a message definition, implementations that predate the addition will produce messages in which fields of that enum type are interpreted by newer implementations as 0, that is, as the first declared value. To use the Event.Kind example from earlier:

enum Kind {
    EVENT_KIND_A = 0;
    EVENT_KIND_B = 1;
}

The above definition should become:

enum Kind {
    EVENT_KIND_UNKNOWN = 0;
    EVENT_KIND_A = 1;
    EVENT_KIND_B = 2;
}

This way, coming back to the clarity principle, it’s unambiguous and implementation-agnostic when the enum value is set.
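
The effect can be mimicked without any generated code. The sketch below uses plain Python enums as a stand-in for the generated protobuf types, and a dictionary as a stand-in for a decoded message; `read_kind` is a hypothetical helper that reproduces the rule that a field missing from the payload reads as 0.

```python
from enum import IntEnum


class KindWithoutUnknown(IntEnum):
    EVENT_KIND_A = 0
    EVENT_KIND_B = 1


class KindWithUnknown(IntEnum):
    EVENT_KIND_UNKNOWN = 0
    EVENT_KIND_A = 1
    EVENT_KIND_B = 2


def read_kind(decoded_fields, enum_cls):
    """Mimic proto3 decoding: a field missing from the payload reads as 0."""
    return enum_cls(decoded_fields.get("kind", 0))


# An older sender that predates the `kind` field never populates it.
old_sender_payload = {}

without_unknown = read_kind(old_sender_payload, KindWithoutUnknown)
with_unknown = read_kind(old_sender_payload, KindWithUnknown)

print(without_unknown.name)  # EVENT_KIND_A
print(with_unknown.name)     # EVENT_KIND_UNKNOWN
```

Without the unknown value, the receiver silently treats the old message as a genuine EVENT_KIND_A event; with it, the missing information stays visible and can be handled deliberately.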

Well-known types

Once your team has used protobufs for some time, you might notice that some field types commonly pop up across your protocol surface. There are some that are commonplace enough that the Protocol Buffers development team made them part of the language itself, such as Duration and Timestamp among other, more specific ones. Indeed, going back to the event message example, the uint64 timestamp field can — and should — be replaced with a google.protobuf.Timestamp, fitting right in line with our clarity principle. Some might not be available out of the box, and it’ll be at your team’s discretion to add and standardize your usage of them, for example a reusable LatLng type for geospatial coordinates.

The full list of default well-known protobuf types is available here in the official documentation.

Explicit optional fields

A bit of historical context: in the proto2 protobuf standard, both required and optional fields could be marked with a namesake label. The required label was enforced strictly by the compiler which proved hugely problematic in the long run, because it was nearly impossible to safely change a required field to be optional. In the proto3 standard, the required label was dropped entirely and starting with protobuf 3.15 (2021), the optional label was added. The distinction shifted to fields being explicitly optional vs. implicitly required (having no explicit label). The value of marking optional fields is in the ability to check them for presence in a serialized message. Let’s say that, in the event message example above, we need to distinguish between the payload size value being 0, and being absent. With its current state:

message EventDataC {
    uint32 payload_size_bytes = 1;
}

Querying from a formed message like this:

if event_pb.WhichOneof('data_kind') == 'data_c':
    # ...
    if not event_pb.data_c.payload_size_bytes:  # Error-prone!
        handle_payload_size_absent()

Is error-prone, and can be quite misleading — since primitive types get initialized to a default value, it’s impossible to tell whether a field was absent or equal to default value. However, with an optional label like so:

message EventDataC {
    optional uint32 payload_size_bytes = 1;
}

The .HasField method can then be used on the EventDataC instance:

if event_pb.WhichOneof('data_kind') == 'data_c':
    # ...
    if not event_pb.data_c.HasField('payload_size_bytes'):
        handle_payload_size_absent()
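
To see why presence cannot be recovered without the optional label, it helps to look at what reaches the wire. The following is a simplified, hand-rolled model of the encoding in plain Python, not the protobuf library itself; `encode_uint32_field` and its `explicit_presence` flag are hypothetical names used only for illustration.

```python
def encode_varint(value):
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_uint32_field(field_number, value, explicit_presence):
    """Simplified model of how a uint32 field is serialized.

    value=None means the field was never set. Without explicit presence,
    a field holding the default (0) is skipped, exactly like an unset one.
    """
    if value is None:
        return b""
    if not explicit_presence and value == 0:
        return b""
    tag = (field_number << 3) | 0  # wire type 0 = varint
    return encode_varint(tag) + encode_varint(value)


# Field without a label: zero and "never set" produce the same bytes.
implicit_zero = encode_uint32_field(1, 0, explicit_presence=False)
implicit_unset = encode_uint32_field(1, None, explicit_presence=False)

# Field marked optional: an explicit zero is written out.
optional_zero = encode_uint32_field(1, 0, explicit_presence=True)
optional_unset = encode_uint32_field(1, None, explicit_presence=True)

print(implicit_zero, implicit_unset)  # b'' b''
print(optional_zero, optional_unset)  # b'\x08\x00' b''
```

In the first case the receiver gets identical payloads and has nothing to distinguish them by, while in the second case the explicit zero is present in the payload, which is what makes a presence check possible.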

Note: Since protobufs were adopted at Lyft prior to the introduction of optionals to the language specification, our convention for optional primitive types is to use wrappers from the google.protobuf package.

Validation rules

When considering principles of protocol design, we are big fans of explicitly stating a field’s constraints within its message’s broader context. For this we’re taking advantage of the protoc-gen-validate plugin (PGV).

Note: PGV has recently reached a stable state and has been succeeded by protovalidate. While the general idea remains the same, consider using the modernized solution when getting started with validation.

A list of useful validation rules for common types is provided in the bulleted section below. Please note that some level of familiarity with protobufs is assumed.

oneof validation: By default, and somewhat counterintuitively, none of the fields declared under a oneof have to be set. A neat validation rule exists to require that one of the fields is present in a formed message: option (validate.required) = true; which needs to be declared alongside the oneof members.
Generic message validation: (validate.rules).message = { … }
· required with a boolean value is self-explanatory and extremely useful.
enum validation: (validate.rules).enum = { … }
· Prior to proto3, enums were treated as “closed” — meaning that fields of their type could only store the defined values. This produced undefined behaviors, and “open” enum behavior was introduced with proto3, making it valid for fields to be set to values other than the ones listed in the enum definition. defined_only is useful for enforcing that an enum field is effectively “closed” and will only carry expected values.
· in allows you to specify the collection of acceptable values for the given field.
· not_in is also extremely handy. The obvious example is to set it to [0] — enforcing cases when the unknown value is not acceptable.
string validation: (validate.rules).string = { … }
· min_len with value 1 is great for enforcing a non-empty value to be set for the field.
· Well-known string formats are handily available for validation, including email, ip (and ipv4 and ipv6), uri, uuid, among other ones.
· pattern allows you to define a bespoke regex to fit your validation needs.
repeated validation: (validate.rules).repeated = { … }
· min_len with value 1 is great for enforcing that the collection is not empty.
· items allows individual values to be validated against their given type, e.g. items: {enum: {not_in: [0]}}
· unique set to true is useful for validating set-like collections.
map validation: (validate.rules).map = { … }
· min_pairs and keys & values work exactly like min_len and items for repeated fields, respectively.
· no_sparse is good for validating that, for maps with non-primitive value type, values must be set.

Pro tip: Validation also works on the wrapper types with the same rules as for their respective wrapped types, e.g. a google.protobuf.StringValue field can be validated with (validate.rules).string = { … }.

An exhaustive definition of all validation rules (declared in protobuf syntax themselves!) is available in the validate.proto source.

Note: It is important to understand that the generated validation methods still need to be called manually — if a message is formed in violation of the stated rules, nothing will fail until its validator is invoked! A snippet for validating a formed protobuf message will look like this:

import protoc_gen_validate.validator
# It's handy to explicitly distinguish between protobuf entities
# and natively defined models, by e.g. appending PB as applicable
# or importing the whole package.
from your_protobuf_namespace_path.event_pb2 import Event as EventPB

event_pb = EventPB(...)

try:
    protoc_gen_validate.validator.validate(event_pb)  
except protoc_gen_validate.validator.ValidationFailed as ex:
    raise ValueError(f'Protobuf validation error: {ex}')

Cross-entity constants

In some cases, various code points across different services or even domains (e.g. client app and server) may need to refer to the same constants. Protobuf definitions can lend great help in aligning these constants across all entities. Although it’s not an explicit feature of the language, this effect can be achieved using custom options:

import "google/protobuf/descriptor.proto";

extend google.protobuf.EnumValueOptions {
    // Use a distant number to avoid accidental collisions.
    // For a small project, picking an arbitrary large prime number
    // should be safe enough.
    // For larger projects, tooling can be built to manage field numbers
    // with safety guarantees.
    string const_value = 11117;
}

enum EventTag {
    // The unknown value might not be necessary depending on whether
    // you intend to pass values of this type in actual proto messages,
    // or just reference their const values statically.
    EVENT_TAG_UNKNOWN = 0 [(const_value) = ""];
    EVENT_TAG_1 = 1 [(const_value) = "#tag1"];
    EVENT_TAG_2 = 2 [(const_value) = "#tag2"];
}

Then the values can be accessed through the enum descriptor:

from your_protobuf_namespace_path import event_pb2

tag_name = event_pb2.EventTag.Name(event_pb2.EVENT_TAG_1)
tag_descriptor = event_pb2.EventTag.DESCRIPTOR.values_by_name[tag_name]
tag_options = tag_descriptor.GetOptions()
tag_value = tag_options.Extensions[event_pb2.const_value]

Or, compacted:

tag_value = event_pb2.EventTag.DESCRIPTOR \
    .values_by_name[event_pb2.EventTag.Name(event_pb2.EVENT_TAG_1)] \
    .GetOptions() \
    .Extensions[event_pb2.const_value]

Note: It’s recommended to exercise caution when using this technique. It is most suitable for cases where the constant values are never expected to change, or where you have complete control over deployment of entities that will be consuming the protocol.

Language-dependent behaviors

The “Getting started” section in the official documentation is a good entry point to language-specific protobuf work. It covers the basic setup as well as more nuanced details such as exact type mapping, ways of parsing messages, and properties of the entities generated from the protocol definition. This matters because certain behaviors differ across languages, from namespace structuring and naming to implementation details (e.g. when a key in a map field has no value, some languages serialize it with the default value while others omit it). Knowing your target language stack, you can always find the right steps to ensure correct behavior.

Conclusion

In this article, we’ve explored the intricacies of working with Protocol Buffers from a collaboration standpoint. In the end, our protocol might end up looking like this:

syntax = "proto3";

import "google/protobuf/descriptor.proto";
import "google/protobuf/timestamp.proto";
import "validate/validate.proto";

extend google.protobuf.EnumValueOptions {
    string const_value = 11117;
}

enum EventTag {
    EVENT_TAG_UNKNOWN = 0 [(const_value) = ""];
    EVENT_TAG_1 = 1 [(const_value) = "#tag1"];
    EVENT_TAG_2 = 2 [(const_value) = "#tag2"];
}

message Event {
    string id = 1 [(validate.rules).string = {min_len: 1}];
    google.protobuf.Timestamp timestamp_utc = 2 [(validate.rules).timestamp = {required: true}];
    oneof data_kind {
        option (validate.required) = true;
        EventDataA data_a = 3;
        EventDataB data_b = 4;
        EventDataC data_c = 5;
    }
}

message EventDataA {}
message EventDataB {}
message EventDataC {
    optional uint32 payload_size_bytes = 1;
}

To recap the key takeaways:

Clarity and Extensibility: We’ve emphasized the importance of designing protocols that are self-explanatory and flexible enough to accommodate future changes. This approach minimizes ambiguity for implementers and reduces the likelihood of breaking changes.
Best Practices: We’ve covered several useful practices, including:
· Using unknown default enum values
· Leveraging standard well-known types
· Setting optional fields intentionally and explicitly
· Implementing validation rules
· Declaring cross-entity constants when appropriate

Many other useful practices are not covered in this article, and they may or may not apply to your team depending on the use case and language stack. For an extensive list, refer to Proto Best Practices and API Best Practices from the official documentation.

And one more thing: Lyft is hiring! If you’re passionate about developing complex systems using state-of-the-art technologies or building the infrastructure that powers them, consider joining our team.

Protocol Buffer Design: Principles and Practices for Collaborative Development was originally published in Lyft Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Building Lyft’s Next Emblem — Glow

Avneet Oberoi — Mon, 29 Jul 2024 22:43:03 GMT

By: Avneet Oberoi, Michael Vernier, Phoenix Li, Masroor Ahmed

Introduction

Longtime riders might remember the original fuzzy, pink Carstache emblem that made Lyft universally recognizable. Over the years, the emblem dropped the fuzz for pink lights in the Glowstache and later evolved with more colors as the beloved Amp, which has been in active use for over seven years! Recently, Lyft introduced its brighter, bolder next-generation emblem: Glow. Glow provides a daytime-visible, auto-dimmable display that shows rider-customizable colors and new animations to help riders find their ride faster. Glow also has enhanced GPS and IMU sensors for improved driver location accuracy.

Figure 1: In app rider experience for Glow ride

Before Glow, Lyft had already gained IoT development experience not only with the Amp but also with Bikes and Scooters, Halo ad displays, and even Autonomous research vehicles. Each of these was built with a bespoke solution for its use case, which made it hard to retrofit for new device types. Observing similar functionality across these siloed systems motivated us to collaborate with these teams, where feasible, to build new IoT middleware services that could provide a unified framework for managing a variety of devices.

High Level Overview

Figure 2: Simplified high level overview

Just like for the Amp, we continue to use Bluetooth Low Energy (BLE) as the communication mechanism between the Glow device and the driver’s smartphone, instead of opting for a cellular chip in the device. We leverage the driver’s phone as an “IoT gateway” which serves as a communication link between the Glow device and the Lyft backend.

The Lyft backend consists of services which act as the brain of the whole operation, determining everything from whether the driver is eligible to use a Glow, to controlling every aspect of the driver’s Glow.

The Glow device itself is controllable via a well-defined, request-response-based command framework that was developed in-house based on our requirements.

This post will detail a few simplified foundational components that make up the IoT system for the Glow including:

Provisioning and Authentication
Control and Communication
State Management including the firmware update process

Lastly, we’ll briefly discuss what’s next for the Glow program.

Provisioning and Authentication

For any IoT device, provisioning refers to the process of creating a unique identifier for each device and registering the device in a central Device Registry, so that the state of each one in the fleet can be tracked and managed.

Common provisioning processes typically involve:

Flashing a bootstrap URL onto the device during manufacturing and then letting the device self-identify by connecting to the URL when activated.
Having the end user manually register their device through an app or website when they start using the device.

After a device has been provisioned into the Device Registry, future communication with the device requires authentication to verify the device is genuine before it engages with the IoT system. Authentication mechanisms such as verification of on-device certificates, symmetric or asymmetric key cryptography, or more sophisticated hardware solutions can be employed depending on the sensitivity of the data being transferred to and from the device.

For the Amp, we simply treated the MAC address of the Amp device as its unique identifier, which would be read by the driver’s phone and relayed to the backend. Phone manufacturers, though, can choose to anonymize MAC addresses, providing an obfuscated, random hash of the MAC address every time it is read. Additionally, we had no concept of a dedicated Device Registry then and were simply associating each device’s MAC address with a driver record. Since each device could report multiple identities, we faced obvious challenges with device state management and accurate user-to-device association and tracking.

In order to circumvent the problems faced with the Amp, we designed the following on-the-fly provisioning process for the Glow:

Each Glow device is assigned a unique serial number during the manufacturing process. This number is stored within the Glow’s internal flash storage and on a barcode which is affixed to the device packaging.
As part of shipping a device to a driver, the above barcode is scanned and mapped to a specific Lyft driver in the Device Registry service. Once this is complete, the Glow is shipped out to its new owner.
When the driver receives their Glow device and tries connecting it to their phone, the Lyft Driver app will leverage a pre-shared key and the serial number of the Glow to authenticate the device before any communication can begin.
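The post does not say how the pre-shared key and serial number are combined during authentication. One common pattern is an HMAC challenge-response, and the sketch below assumes that approach. The per-device key derivation and all function names here are hypothetical, not Lyft's actual scheme.

```python
import hashlib
import hmac
import secrets


def new_challenge(num_bytes: int = 16) -> bytes:
    """Random nonce sent to the device so responses cannot be replayed."""
    return secrets.token_bytes(num_bytes)


def derive_device_key(pre_shared_key: bytes, serial_number: str) -> bytes:
    """Assumption: derive a per-device key from the shared key and the serial number."""
    return hmac.new(pre_shared_key, serial_number.encode("utf-8"), hashlib.sha256).digest()


def sign_challenge(device_key: bytes, challenge: bytes) -> bytes:
    """What the device would compute over the challenge it received."""
    return hmac.new(device_key, challenge, hashlib.sha256).digest()


def verify_device(pre_shared_key: bytes, serial_number: str,
                  challenge: bytes, response: bytes) -> bool:
    """Recompute the expected response and compare it in constant time."""
    expected = sign_challenge(derive_device_key(pre_shared_key, serial_number), challenge)
    return hmac.compare_digest(expected, response)
```

With this shape, a response signed for one serial number fails verification when presented under a different serial number, which matches the goal of tying each device to a single registered identity.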

The diagram below summarizes the different states that a Glow device transitions through, as described above.

Figure 3: Glow device state transitions
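As a rough illustration of the provisioning steps described above, the lifecycle could be modeled as a small state machine. The state names below are assumptions inferred from the three steps in the text, not the labels used in the original figure.

```python
from enum import Enum


class DeviceState(Enum):
    MANUFACTURED = "manufactured"    # serial number stored in flash and on the barcode
    ASSIGNED = "assigned"            # barcode scanned and mapped to a driver in the registry
    AUTHENTICATED = "authenticated"  # verified by the app using serial number and pre-shared key


# Hypothetical transition table: each state lists the states it may move to.
ALLOWED_TRANSITIONS = {
    DeviceState.MANUFACTURED: {DeviceState.ASSIGNED},
    DeviceState.ASSIGNED: {DeviceState.AUTHENTICATED},
    DeviceState.AUTHENTICATED: set(),
}


def transition(current: DeviceState, target: DeviceState) -> DeviceState:
    """Return the new state, or raise if the move skips a provisioning step."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"Illegal transition: {current.value} -> {target.value}")
    return target
```

Encoding the allowed moves in one table makes it explicit that a device cannot authenticate before it has been assigned to a driver.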

The above flow not only helps provide detailed device metadata and accountability for devices issued to drivers, but is also meant as a security measure. Unlike the Amp, a Glow device is made to only illuminate when it has successfully authenticated with the Lyft backend services and when the driver is actively driving for Lyft. This prevents malicious parties from impersonating Lyft drivers with a Glow device obtained from someone or somewhere other than Lyft.

Device Control and Communication

Device Command Framework

Each IoT device needs to be controllable by means of pre-determined operating instructions. These can be as trivial as turning the device on or off, or more involved, like playing a certain animation file on the LED display, as is the case for the Glow. A device is made remotely controllable through its firmware, i.e. the on-device software responsible for deciphering and executing instructions to make the device operate as required.

The instruction set, also known as the “command set,” built for the Glow is simple, yet powerful. It supports instructions to individually control various functions such as animation file playback, adjusting device brightness, uploading files to device storage, and even monitoring on-device sensors. These fine-grained device commands are also highly configurable, which enables robust, predictable, and easily extensible behavior from the Glow device.

As an example, the instruction for file uploads to the device unlocks the ability to add more animation options for riders to choose from in the future. The alternative would have been to bundle the additional files as part of a new firmware update, which would be much slower to build and roll out. Additionally, firmware resourcing is often more limited, delaying launch of even simple functionality like the above.

Device Control Flow

The Lyft backend is responsible for controlling the device using the command set described above. We built a backend service, “Device Controller”, which listens for different types of relevant event triggers to determine appropriate instructions for the driver’s Glow device. For example, when the service receives a trigger indicating that the driver is in close proximity to the rider’s pickup location, it issues an instruction for the Glow to play the rider-selected pickup animation.

The mobile client acts as the communication gateway, helping with essential flows such as command relay and BLE protocol translation as well as buffering for data transfers when needed.

Each server-generated command is delivered to the mobile client by means of a server streaming component that exists in Lyft’s infrastructure. To receive real-time data from the server, the client simply needs to subscribe to the commands once a Glow is successfully connected to it.

The entire command set has been divided into two categories, based on how the command is to be handled by the client:

Passthrough commands: These are simple commands which do not require any pre-processing or state management by the mobile client and can be forwarded directly to the Glow device via BLE. Examples include updating the device’s brightness or updating the animation being played on the device.
Complex commands: These typically require the client to pre-process payload data, download files from the Lyft backend, buffer data, or maintain some sort of state. Transferring a file to the Glow is an action that requires a complex command. A single command is sent from the backend to the client. The client is responsible for first downloading the new file to internal storage, splitting it into smaller chunks, and sending each chunk to the Glow in a separate command.
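The chunking step of a complex file-transfer command can be sketched as follows. The `FileChunkCommand` shape and the 128-byte default chunk size are assumptions for illustration only. The actual payload limit depends on the BLE configuration, which the post does not specify.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileChunkCommand:
    """Hypothetical per-chunk command forwarded to the device over BLE."""
    file_name: str
    index: int      # zero-based position of this chunk
    total: int      # total number of chunks, so the device knows when it is done
    payload: bytes


def split_into_chunk_commands(file_name: str, data: bytes,
                              chunk_size: int = 128) -> List[FileChunkCommand]:
    """Split a downloaded file into ordered chunk commands."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Ceiling division; an empty file still produces one (empty) chunk.
    total = max(1, -(-len(data) // chunk_size))
    return [
        FileChunkCommand(file_name, index, total,
                         data[index * chunk_size:(index + 1) * chunk_size])
        for index in range(total)
    ]
```

Carrying both `index` and `total` in every chunk lets the receiver detect missing or out-of-order chunks without any extra bookkeeping command.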

Some of the more nuanced details about command ordering/prioritization, data encoding, retry mechanism and encryption mechanism between the client and Glow have been omitted in order to avoid complicating the overall flow we’re trying to depict here.

All commands include a unique identifier to enable tracking of the control flow at any given time. The Glow will send a response for every command message it receives. This response will include whether the action succeeded or failed, an error message describing the failure, or any data that might have been requested by the command. All responses are transmitted to the Lyft backend so that they can be used to diagnose or debug issues.
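A minimal sketch of how a unique identifier lets responses be matched to in-flight commands is shown below. The class and field names are hypothetical. The post only states that commands carry a unique ID and that responses report success or failure, an error message, or requested data.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class Command:
    name: str
    params: dict
    # Unique identifier generated per command (UUID is an assumption).
    command_id: str = field(default_factory=lambda: uuid.uuid4().hex)


@dataclass(frozen=True)
class Response:
    command_id: str
    success: bool
    error_message: Optional[str] = None
    data: Optional[bytes] = None


class CommandTracker:
    """Tracks commands that have been sent but not yet answered."""

    def __init__(self) -> None:
        self._pending: Dict[str, Command] = {}

    def sent(self, command: Command) -> None:
        self._pending[command.command_id] = command

    def resolve(self, response: Response) -> Command:
        """Match a response to its command and stop tracking it."""
        try:
            return self._pending.pop(response.command_id)
        except KeyError:
            raise LookupError(f"No pending command with id {response.command_id}") from None

    def pending_count(self) -> int:
        return len(self._pending)
```

Anything still pending after a timeout is a natural candidate for the retry mechanism that the post mentions but does not detail.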

The end to end flow has been summarized in the image below.

Figure 4: End to end device control flow

Device Data Streams

While the above section details communication directed to the Glow, we talk briefly here about data originating from the device itself.

The Glow streams data to the client so that neither the server nor the client has to continuously request new data from it. There are 4 types of messages that the Glow generates:

Sensor readings: Data generated from on-device sensors, which includes internal device temperature, display brightness, inertial measurements from gyroscopes and accelerometers, and the position and speed of the vehicle.
Diagnostic information: The Glow provides information regarding its performance and any errors it encounters during its normal operation. This information is necessary for software maintenance and to keep the Glow device running in optimal condition.
Warning messages: These are triggered by the Glow when it is operating beyond the bounds of its normal operation. For example, a temperature sensor on the Glow warns the client that the Glow device is running abnormally hot. In such cases, the device will self-trigger a sleep command, which will render it inoperable for a duration of time before it can be restarted.
Heartbeat messages: This periodic stream of data reflects the current device state such as the installed firmware version, device assets, etc. Such state is explicitly managed as is detailed in the section below.

State Management

An IoT device’s state is a set of properties that represent its current configuration. This state should be updatable for device management and readable by the backend for monitoring and debugging issues with the device. A device can either report its state asynchronously by pushing periodic updates to the backend or synchronously by responding to a request for it.

A “Device Shadow” service maintains a mirror of each device’s current state using the information received from the device. The service also manages the desired state of the device, which can be updated by systems integrated with the service. The Device Shadow service detects differences between the current and desired state and issues the commands required to transition the device to the desired state. This is done in an eventually consistent manner, so the device eventually synchronizes to its desired state irrespective of its current connectivity status.

The state information for the Glow includes the current firmware version, a list of files stored on the device, the display brightness and configuration parameters for various device functions. The Glow broadcasts its state information asynchronously to the client to forward to the backend. A simplified flow depicting this process is shown below.

Figure 5: Device Shadow determining commands
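The reconciliation idea behind a device shadow can be sketched in a few lines. This is a simplified illustration with hypothetical names and property keys, not the actual Device Shadow service. It only shows how comparing reported and desired state yields the list of properties that still need a command.

```python
from typing import Any, Dict, List, Tuple


def diff_state(reported: Dict[str, Any], desired: Dict[str, Any]) -> List[Tuple[str, Any]]:
    """Return (property, desired_value) pairs for every property that differs."""
    return [(key, value) for key, value in sorted(desired.items())
            if reported.get(key) != value]


class DeviceShadow:
    """Keeps the last reported state and the desired state for one device."""

    def __init__(self) -> None:
        self.reported: Dict[str, Any] = {}
        self.desired: Dict[str, Any] = {}

    def set_desired(self, **properties: Any) -> None:
        # Desired state persists even while the device is offline.
        self.desired.update(properties)

    def on_heartbeat(self, reported: Dict[str, Any]) -> List[Tuple[str, Any]]:
        """Record the device's report and return the updates still outstanding."""
        self.reported = dict(reported)
        return diff_state(self.reported, self.desired)
```

Because the diff is recomputed on every heartbeat, a device that was offline when the desired state changed is brought up to date the next time it reports, which is what makes the approach eventually consistent.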

Lastly, we discuss how device firmware is updated via the Device Shadow service.

Firmware Update Process

During the lifetime of an IoT device, it’s almost guaranteed that the firmware will need to be upgraded to add new functionality, address bugs, or patch security vulnerabilities. Some devices can be manually updated by means of a human operator, but most need to support remote over-the-air (OTA) updates. Without a human operator, the firmware update strategy needs to be robust to internal and external problems such as firmware not initializing properly or an update that was only partially installed due to a power interruption. If the system is not designed properly, a device can become inoperable, often referred to as being “bricked.”

A new “OTA Manager” service was created to manage the firmware update lifecycle of devices, including Glow as well as other IoT devices developed by teams within Lyft. The OTA Manager service is wired to an internal developer UI that allows configuring the desired firmware version, the S3 location of the firmware file, the rollout percentage, and a set of attribute filters that allow fine-grained control over which devices the firmware update should be applied to.
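One common way to implement a rollout percentage together with attribute filters is deterministic hash bucketing, sketched below. The hashing scheme, function names, and attribute keys are assumptions. The post does not describe how OTA Manager selects devices.

```python
import hashlib
from typing import Mapping


def rollout_bucket(serial_number: str, firmware_version: str) -> int:
    """Deterministically map a device to a bucket in the range 0..99."""
    digest = hashlib.sha256(f"{firmware_version}:{serial_number}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_eligible(serial_number: str, attributes: Mapping[str, str], firmware_version: str,
                rollout_percentage: int, attribute_filters: Mapping[str, str]) -> bool:
    """A device qualifies if it matches every filter and falls inside the rollout slice."""
    if any(attributes.get(key) != value for key, value in attribute_filters.items()):
        return False
    return rollout_bucket(serial_number, firmware_version) < rollout_percentage
```

Since the bucket depends only on the serial number and firmware version, raising the percentage from 10 to 50 keeps every device that was already eligible and only adds new ones.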

The Device Shadow service interacts with OTA Manager to compare the firmware version installed on the device, as read from the most recent heartbeat message, with the desired firmware version. If a difference is detected, the server issues a new complex command for file upload, which the client manages as described earlier in the post, to send the new firmware file to the Glow device.

As the Glow receives the new firmware image, the contents are saved to a different memory location than the currently running version. Once the full image is received, the Device Shadow service commands the device to reboot.

On power-on, the Glow runs a small piece of software called a Boot Manager to detect a new pending firmware update. When one is detected, the following happens:

The previous firmware image is copied from internal to external storage, and the new firmware image is copied from external to internal storage.
The new firmware image is booted and several checks are performed before confirming that the new firmware image is functioning properly. For example, the device needs to connect to the mobile client via Bluetooth and successfully authenticate with Lyft backend services.
If any fault occurs, the device will reboot and revert to the previously working firmware image. The Glow will also attempt to transmit the reason for failure in its next heartbeat message.
If anything goes wrong during this revert process, a recovery firmware image will be used, which was flashed onto the device as part of the manufacturing process and cannot be changed.
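The decision logic in the steps above can be approximated with the following sketch. It models storage as simple string fields and treats a failed health check on the reverted image as the trigger for the recovery image. Both are simplifying assumptions, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Slots:
    active: str    # image in internal storage, the one that gets booted
    staged: str    # image in external storage (pending update, or previous image after a swap)
    recovery: str  # immutable image flashed at manufacturing time


def select_firmware(slots: Slots, update_pending: bool,
                    health_check: Callable[[str], bool]) -> str:
    """Return the image the device ends up running after the boot sequence."""
    if not update_pending:
        return slots.active
    # Swap: the new image moves to internal storage, the previous one to external.
    slots.active, slots.staged = slots.staged, slots.active
    if health_check(slots.active):
        return slots.active
    # Fault detected: swap back to the previously working image.
    slots.active, slots.staged = slots.staged, slots.active
    if health_check(slots.active):
        return slots.active
    # The revert itself failed, so fall back to the factory recovery image.
    return slots.recovery
```

In practice `health_check` would stand in for checks such as connecting to the mobile client over Bluetooth and authenticating with the backend, as described in the second step.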

All of these mechanisms combine to help mitigate the risk of bricking a Glow device.

What’s Next?

The Glow is actively rolling out in markets across the US, with over 30,000 devices already live!

The systems described above are standing strong, but we’re keeping an eye out for room to improve them and are already sharing the knowledge gained with other teams at Lyft.

Lyft is hiring! If you’re passionate about building software, visit Lyft Careers to see our openings.

Building Lyft’s Next Emblem — Glow was originally published in Lyft Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

FAQ: Common Questions from Candidates During Lyft Data Science Interviews

Kelly Haberl — Tue, 25 Jun 2024 20:29:24 GMT

Written by Nada Sarsour, Maria Rice, and Kelly Haberl

Interested in applying to Lyft Data Science or currently in the interview process? This article helps answer questions commonly asked by Data Science candidates looking to learn more about the Lyft application process, our Data Science teams, and overall life at Lyft!

The Application & Interview Process

Which Data Science role should you apply for?

Lyft Data Science is split between Decisions and Algorithms Scientists, which can be defined by the typical output of their work (although there is overlap between the 2 roles). Decision Scientists tend to focus more on helping humans make decisions by utilizing a deep understanding of the business to develop decision frameworks that drive alignment on the most impactful solutions. Algorithm Scientists tend to focus more on helping machines make decisions by developing models that power internal and external production systems. Both types of scientists leverage a suite of data science skills, from A/B experimentation to machine learning and operations research.

When you are deciding what Data Science role to apply for at Lyft, we recommend reviewing this article by Simran and Thibault to learn more about different focus areas and responsibilities. Your recruiter can also help determine what role makes the most sense with your experience after you submit your application.

How is the interview process structured?

The Data Science interview process at Lyft is similar for both Decisions and Algorithms Scientists. Even though this previous article is focused on the process for Lyft Science Interns, it provides a great overview of the interview process that’s also relevant to full time hires (note: the interview process differs slightly if you are interviewing for a staff or manager role).

Overall, there are 3 stages to interviewing:

Recruiter Screen (30 minutes): A recruiter reviews your resume and will reach out if they determine you are a good match for an opening. It is important for your resume to align with the role you are interested in and to include experience that matches the requirements of the role. The recruiter call is a chance for you to talk through your experience and learn more about the different tracks/interview processes.

Technical Phone Screen (60 minutes): This interview step is the same for all Data Science tracks. It focuses on statistics, probability, experimentation, and business acumen, and does not include live coding or modeling. After the phone screen, and based on interviewer feedback, candidates may be redirected to another track (i.e. Decisions or Algorithms) if they would perform better in that space.

Virtual Onsite Interviews: The final round consists of 4 or 5 virtual interviews where candidates speak with a Data Scientist or Data Science Manager. These rounds can be completed in 1 day or split across 2 days. These interviews are broken down into the following areas:

Business Case Interview (45 minutes): work through a technical business problem that’s an example of the problems you would solve in this DS role at Lyft; this interview requires an evaluation of a business case and Lyft metrics but won’t include live coding.
Experience Interview (45 minutes): in this behavioral interview, candidates could be asked to give examples of previous projects, describe how they reacted in specific work-related situations, and discuss their background and interests.
Coding Interview (45 minutes): in this technical interview, candidates complete a live coding challenge in the language of their choice; the interview varies depending on the Data Science path:
- Algorithms: a technical problem related to programming fundamentals used in everyday practical algorithm development, machine learning implementation, or data processing; this interview is meant to test the candidate’s technical coding abilities and communication skills, through discussing potential approaches and pitfalls, and implementing a working end-to-end solution to a practical problem (no esoteric pointer techniques or brain-teasers!)
- Decisions: a series of analytical coding questions that require manipulation of an example data set containing rideshare data, to assess coding and communication skills
Technical Interview (45 minutes): work through a case study, which varies based on the Data Science path:
- Algorithms: a technical discussion on how you would approach and solve a problem end-to-end that you would encounter as a data scientist at Lyft, using tools from your track expertise related to either statistics, optimization, or machine learning
- Decisions: a diagnosis of a product problem end-to-end, including sections on experimentation, probability, and product intuition

How long does the interview process take from start to finish?

The time it takes to complete our interview process depends on your availability and how quickly you are willing and able to schedule interviews. On average, it takes about 3–4 weeks to complete the entire science interview process. If a candidate does not pass the interviews, they are eligible to reapply after 6 months.

Are any interviews conducted in person in a Lyft office?

All science interviews are being conducted virtually so you will not be required to come into an office for any interviews at this time!

Lyft’s Data Science Teams

What type of work will I do?

At Lyft, Data Scientists are embedded into a specific business function or focus area. This allows our scientists to become experts in an area of the Lyft business and drive long-term impact. The team working on one business function will consist of not only Data Scientists, but cross-functional teammates from Product Management, Software Engineering, Design, Marketing, Operations, etc.

There are Scientists working on every aspect of the Lyft business — here are some examples of focus areas our scientists are aligned to (more information on focus areas can be found in this article!):

How do I get matched to a team?

Your interview process will be focused on interviewing for a specific team, which will be outlined in the job description that you are applying for. If an interviewer sees you may be a better fit on another team, they will let the recruiter know so you can discuss which is the best team match. Given this set up, it is preferable that the candidate apply to only one type of role, Decisions or Algorithms, during the initial application to not delay the interview process (selecting a type of role can also be discussed with a recruiter during the initial recruiter screen).

Are specific teams based in one location? Is traveling common?

Teams are not based in one specific office location, but rather one team can be located across a few of our offices. Our corporate offices are in San Francisco, New York, Seattle, Montreal, and Toronto. You will report to the nearest location per your residence for our hybrid work schedule.

Life at Lyft

Tell me more about Lyft’s in-office policy!

Lyft’s workplace strategy categorizes most of our roles as hybrid roles. Team members in hybrid roles will work from their nearest Lyft office and report to the office 3 times a week. Additionally, hybrid roles have the flexibility to work from anywhere for up to 4 weeks per year.

In the office, Lyft provides free breakfast, lunch, and snack options daily. At the Lyft headquarters in San Francisco, in addition to the salad, soup, and pizza options, there is a rotational buffet lunch each day including menu items like pastas, empanadas, tostadas, and poke!

How would you describe Lyft’s culture?

Lyft’s culture revolves around a sense of collaboration and belonging. Team members at Lyft proactively support one another, whether it’s discussing ideas on Slack, reviewing code, or whiteboarding solutions together. We frequently celebrate each other, whether it be nominating others for ‘Employee of the Month’ or discussing our hobbies & passions! Teams typically have events or offsites at least once a quarter and cross-team socials occur in or around the office even more often than that.

It’s also a culture of fun! The Diversity & Inclusion group at Lyft has a Culture workstream dedicated to bringing people together. We host various events such as social meet ups (from a gingerbread house competition to a Texas Hold ’em poker tournament with prizes to an in-office Boba social — pictured below!) and lean-in circles where we can support and learn from each other on what it’s like day-to-day at Lyft.

How would you describe work-life balance at Lyft?

Lyft promotes work-life balance in many different ways. In addition to generous personal time off and company-wide holidays, Lyft sets recharge time around the November/December holidays for all employees to be offline. Scientists are encouraged to take this time to do whatever helps them recharge, whether that be spending time with family, traveling, or just relaxing!

Another favorite benefit of Lyft employees is the sabbatical! After hitting your 5 year mark at Lyft, we encourage team members to take a multi-week sabbatical to enjoy some extended time off after their hard work.

Many employees also participate in extracurricular activities together after work. One great example is the Lyft soccer team, which is a community of Lyft team members who enjoy playing soccer on weekday nights. The team participates in a recurring quarterly corporate league in San Francisco with weekly games, featuring other Bay Area tech companies like LinkedIn, Cruise, and Plaid.

Similarly, the Lyft Chess Team competes in an online corporate league against 40+ companies in sectors ranging from tech and finance to consulting. The team also likes to play in the office whenever they get a chance!

What is diversity at Lyft like?

Diversity is a large focus for Lyft Science with many ongoing efforts to maintain and improve the diversity and inclusion among our scientists. A key example of these efforts is DIG, which is Lyft Science’s ‘Diversity & Inclusion Group’. Founded in 2018, this internal council of over 50 scientists focuses on cross-organizational efforts to ensure diversity and inclusivity throughout a scientist’s entire lifecycle at Lyft: from attracting diverse talent, to ensuring our work reaches a broad audience, to building a welcoming community and culture within. DIG volunteers work on a variety of projects such as leading internal discussions on diversity topics (e.g. being a parent at Lyft), creating blog posts to share recent Science work or increase transparency on our interview process, and planning/hosting meet-ups with external DS/ML organizations. Read more information on DIG’s efforts & mission here!

Additionally, there are several specific focus groups within Lyft that foster communities for support and discussion. For example, members of the Women in Science group have periodic meet ups and discussion groups on topics such as imposter syndrome and how to approach promotion / salary discussions.

What does growth look like at Lyft? How would my job change in 1 year? 5 years?

At Lyft, there are ample resources to support career development. There are multiple levels of Data Scientists at Lyft, ranging from junior to senior roles. Each level has clearly laid-out performance expectations in a rubric format, which are shared with employees from their first day at Lyft to make promotions fair and predictable. Managers conduct frequent check-ins with employees on their performance vs these rubrics and can provide feedback and advice for future projects to develop new skills or strengthen current weaknesses. Formal reviews are conducted bi-annually, which allow scientists to reflect on their own work, compare their performance to the expectations rubric, and receive written feedback from their manager and peers.

Lyft provides the option to advance to higher Data Science roles either as a manager or an individual contributor in order to support the varying career goals across our scientists. The individual contributor track means the scientist will not take on any people-management responsibilities and will continue to focus on technical work as they progress to higher Data Science levels, while they take on a larger technical / product scope.

To help Data Scientists maximize their career growth, especially early in their careers, Lyft offers a mentorship program that many employees have elected to participate in. Mentorship at Lyft is a formalized program where employees are matched to mentors, typically more senior scientists, who can provide informal guidance on a specific issue such as how to prioritize workloads, or on a more general topic like working towards a promotion. Additional offerings at Lyft for career development include internal technical education courses (recent courses focused on optimization, causal inference, and reinforcement learning), cutting-edge research seminars by academic guests, and external conferences where scientists can learn about industry trends. Read more about these initiatives in this blog post!

If you are interested in joining our incredible team of Data Scientists and improving people’s lives with the world’s best transportation, check out our careers page!

FAQ: Common Questions from Candidates During Lyft Data Science Interviews was originally published in Lyft Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

ETA (Estimated Time of Arrival) Reliability at Lyft

Rachita Naik — Thu, 20 Jun 2024 15:41:12 GMT

The ETA Conundrum: Speed vs Accuracy

Imagine this: You’ve got a crucial early morning flight to catch. You open the Lyft app and see an estimated pickup time on the screen. But here’s the million-dollar question: Will the driver you are paired with arrive at the estimated time? One of Lyft’s simplest yet most profound goals is to provide riders with the most accurate ETAs we can.

Before you even hit ‘request’ and summon a ride, there are complex algorithms that sift through historical and real-time data, leveraging machine learning alongside traffic and weather insights to predict the ETA (or pickup time) to display on the rider’s screen based on the destination they input. This got us thinking — how do we determine ETA estimates as accurately as possible, before a rider requests a ride?

Enter: Reliability

What does ‘reliable’ really mean in this context?

In the realm of ridesharing, ‘reliability’ takes on different dimensions depending on the ride’s phase: before a rider requests a ride, after the ride has been requested, or after the details of the driver who accepted the rider’s offer become visible. For instance, once a driver’s information is visible, reliability translates to the accuracy of the predicted driver arrival time (displayed on the screen) compared to the driver’s actual arrival time.

So, what encompasses reliability before a ride is even requested?

Simply put: given an ETA before a ride request, reliability is the likelihood that a driver will arrive within a reasonable timeframe around that ETA, should the ride be booked.

Rider Screen pre-request

Understanding and estimating this aspect of reliability is crucial for setting accurate ETAs, as it has a direct impact on the likelihood of riders canceling their bookings. Illustrated below is a simulated graph depicting the relationship between reliability and rider cancellation rates: higher reliability corresponds to a lower cancellation rate.

Reliability vs Cancels graph

Our objective, for each ride option presented to our users, is to showcase an accurate ETA with a strong likelihood of a swift, reliable pickup.

Unpacking ETA Uncertainties

Before we set out to tackle this problem, it is important to understand the reasons for unreliability, i.e., why estimated times of arrival (ETAs) might differ from the actual arrival times.

Unpredictability of Driver Availability: At the heart of the ETA challenge is the inherent uncertainty around driver availability at the time a rider requests a ride. Our system endeavors to predict the closest and most suitable drivers who may choose to accept a ride request. Yet, there are still many variables at play:
– Driver Preference: Drivers have the autonomy to reject or cancel a ride based on personal preferences, impacting the ETA estimate.
– Driver Contention: The occurrence of several requests simultaneously vying for the same driver complicates the process of matching each ride request with a driver.
– Changes in Driver Status: Drivers may choose to log off unexpectedly which could alter ETAs.
Organic ETA uncertainty/mapping volatility: Beyond the unpredictability of driver availability, there are other factors at play that can skew ETA accuracy:
– Traffic Conditions: Traffic can unpredictably affect travel times.
– Navigation Challenges: Unexpected detours such as missed turns or road closures can add time to the journey.
– GPS Volatility: GPS inaccuracies can affect the exact location of a driver or rider, impacting ETA predictions prior to request.
Marketplace Dynamics: Another layer of complexity is the supply and demand dynamics within specific neighborhoods. There are instances where requests are made from areas with a lower density of available drivers. Additionally, marketplace conditions are in constant flux, with the balance of supply and demand shifting within minutes, further impacting ETA reliability.

Harnessing Machine Learning (ML) for Reliability Prediction

The rest of the article delves into some technical aspects of ML, so familiarity with fundamental concepts is recommended!

To predict the effect of selecting a certain ETA on ride reliability, we could potentially use Causal Inference methods leveraging historical data to predict the causal effect of different ETA settings on reliability, given actual arrival times, rider cancellations and other relevant metrics.

However, in order to automatically detect complex interactions between multiple variables (like driver behavior, ETA patterns, demand and supply conditions) without explicit specifications, we decided to harness ML. This approach enhances our ability to accurately predict ride fulfillment reliability by analyzing rich datasets, while also ensuring scalability and efficiency in our processes.

We started with the objective of developing a classification model, capable of predicting the reliability probability of ETA estimates. The goal? To arm downstream services with reliability scores for all possible ETA brackets, enabling the selection of the most accurate ETAs for our riders.

Fig 1. Example classification model prototype

But how are these reliability estimates used for ETA selection?

Using product requirements, data insights and UX research, we set a stringent reliability Service Level Agreement (SLA) for every possible ETA estimate that we can present to our users (possible ETA brackets are pre-determined per ride type). This ensures we hit reliability targets by only selecting ETAs that meet the desired SLAs. Fig 1 illustrates a model prototype: sample inputs include possible ETAs (ranging from 1 to 10 minutes) along with ride-level and marketplace features. The outputs are ETAs enhanced with model-based reliability estimates. Finally, the ETA with reliability greater than the SLA is selected. In this article, we will focus on our approach towards reliability estimation.
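The selection step described above can be sketched roughly as follows. This is a minimal illustration, not Lyft's implementation: the function name `select_eta`, the SLA value of 0.8, and the tie-breaking rule (pick the smallest ETA that meets the SLA) are all assumptions, since the post does not say which ETA wins when several qualify.

```python
from typing import Optional

# Hypothetical SLA: the minimum reliability probability an ETA must reach.
RELIABILITY_SLA = 0.8


def select_eta(reliability_by_eta: dict[int, float],
               sla: float = RELIABILITY_SLA) -> Optional[int]:
    """Return the smallest candidate ETA (in minutes) whose predicted
    reliability meets the SLA, or None if no candidate qualifies.

    Choosing the smallest qualifying ETA is an assumption made for
    illustration only.
    """
    eligible = [eta for eta, prob in reliability_by_eta.items() if prob >= sla]
    return min(eligible) if eligible else None


# Made-up model outputs: candidate ETA (minutes) -> reliability estimate.
scores = {1: 0.12, 2: 0.35, 3: 0.61, 4: 0.83, 5: 0.92}
selected = select_eta(scores)
```

With these made-up scores, the 4-minute bracket is the first to clear the 0.8 threshold, so it is the one selected.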

The Model

At the core of our solution lies a tree-based classification model. While deep learning models have their advantages and are increasingly used in ridesharing for tasks like demand forecasting, route optimization, and image recognition (e.g. identifying road conditions), they are often overkill for everyday lightweight business classification tasks.

Gradient boosting tree-based models have historically been the choice at Lyft for these purposes due to their clear interpretability, efficiency with smaller datasets, and robustness to outliers and missing values. These models excel in handling the structured tabular data common in ridesharing, capturing complex, non-linear relationships and feature interactions without extensive feature preprocessing or scaling. They require fewer computational resources and are straightforward to implement and maintain, facilitating rapid deployment in production environments.

Features and Training

Along with the ETA estimate we want to predict reliability for, we need features that help us capture as much of the marketplace uncertainty as possible at prediction time:

Nearby Available Drivers: We identify a list of the closest drivers to a ride request and use their characteristics, such as estimated driving time, distance, and driver status (online, offline, or completing a trip) as model features. This data helps the model gauge the likelihood of each driver matching with the ride, should the ride be requested in the future.
Harnessing Historical Insights: Our model integrates historical data at the regional and granular geohash level to offer a broader perspective on performance trends. Recent driver ETA estimates and match times, and number of completed and canceled rides establish historical benchmarks that help adjust predictions based on recent performance.
Marketplace Features: Capturing the Pulse of Demand and Supply: Realtime neighborhood-level demand and supply indicators, such as the number of app opens, unassigned rides and driver pool counts offer a granular view of the market conditions.
To further refine model predictions, we incorporate features such as pickup and dropoff location, temporal elements, and categorical data like the region in which the ride was requested.

Innovative Training Approach: Our training label is generated by comparing the actual request-to-driver arrival time against the ETA to produce a binary label for reliability. A unique aspect of our approach is the decision to train the model on all possible ETA estimates for each ride, rather than just the factual ETA estimates shown to riders prior to request. That is, each ride in the training data is duplicated n times, where n is the number of possible ETA estimates (e.g., 1, 2, 3, … 10 minutes). This strategy helps us:

Avoid negative feedback loops during training, since a model trained only on factuals could progressively degrade over time.
Ensure equal representation of all possible ETA estimates that could be seen during inference.
Let the model learn variances in driver ETA estimation (the driver ETAs used as model features are generated upstream by another service and may not always be accurate).
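The duplication scheme can be sketched as below. The post does not give the exact labeling rule, so this sketch assumes a ride counts as reliable for a candidate ETA when the actual arrival time is at most that ETA plus a tolerance. The names `Ride`, `expand_ride`, and `TOLERANCE_MIN`, and the tolerance value itself, are hypothetical.

```python
from dataclasses import dataclass, field

# Candidate ETA brackets in minutes (1..10, matching the example in the text).
CANDIDATE_ETAS_MIN = range(1, 11)

# Hypothetical slack around the ETA; the real "reasonable timeframe" is not given.
TOLERANCE_MIN = 0.0


@dataclass
class Ride:
    ride_id: str
    actual_arrival_min: float          # observed request-to-driver-arrival time
    features: dict = field(default_factory=dict)


def expand_ride(ride: Ride) -> list[dict]:
    """Duplicate one ride into one training row per candidate ETA.

    Each row carries the same ride features plus the candidate ETA and a
    binary label (assumed rule: arrived within ETA + tolerance).
    """
    rows = []
    for eta in CANDIDATE_ETAS_MIN:
        reliable = int(ride.actual_arrival_min <= eta + TOLERANCE_MIN)
        rows.append({
            **ride.features,
            "ride_id": ride.ride_id,
            "candidate_eta_min": eta,
            "reliable": reliable,
        })
    return rows


# A made-up ride whose driver arrived 4.5 minutes after the request.
rows = expand_ride(Ride("r1", 4.5, {"region": "SFO"}))
labels = [row["reliable"] for row in rows]
```

Here the single ride yields ten rows: the 1 to 4 minute candidates are labeled 0 and the 5 to 10 minute candidates are labeled 1, so the model sees every ETA bracket regardless of which one was actually shown to the rider.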

Evaluating Model Performance: We use the Area Under the Curve (AUC) metric to evaluate model performance, since it measures performance across all thresholds rather than at a single one (which is useful because we consume raw probabilities in our use case).
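To make the threshold-free property concrete, here is a small sketch that computes AUC directly from labels and scores. It assumes the metric in question is the area under the ROC curve, which equals the probability that a randomly chosen reliable ride gets a higher score than a randomly chosen unreliable one. The function name `roc_auc` is illustrative, and the pairwise loop favors clarity over speed.

```python
def roc_auc(labels: list[int], scores: list[float]) -> float:
    """Compute ROC AUC as the fraction of (positive, negative) pairs in
    which the positive example receives the higher score. Ties count as
    half a win. No classification threshold is involved."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = reliable) and model probabilities.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In this toy example three of the four positive/negative pairs are ranked correctly, giving an AUC of 0.75.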

AUC Curve for the Reliability Model

We also look at performance per ETA bracket: the model bias is generally small but increases for larger ETAs (owing to the smaller share of rides with, say, an ETA > 15 minutes).

Beyond Prediction: Ensuring Sustained Performance

Ensuring that a model meets our use case upon training and deployment is crucial, but maintaining its performance over time in the dynamic rideshare environment presents a unique challenge. Consider how a model trained before significant societal shifts, such as the pandemic, would struggle as commuting patterns evolve dramatically. Similarly, updates to the Lyft app itself can impact the functionality of its components, including predictive models. This necessitates a robust system for continuous monitoring of features and performance to identify and address any degradation promptly.

It’s often hard to pinpoint the root cause of model degradation, but most of the time a simple retrain on fresh data can mitigate performance declines. Thankfully, LyftLearn, Lyft’s advanced ML platform composed of model serving, training, CI/CD, feature serving, and model monitoring functionality, equips us with the necessary tools to establish drift detection alarms and automated retraining pipelines.

What’s Next?

In addition to focusing on our primary regions, we are broadening our analysis to include unique markets (e.g. complex marketplaces like airports) to incorporate more nuanced signals into the model. We are also integrating more real-time signals to better capture dynamic marketplace conditions. As we continue to refine our predictive models and strategies, our goal remains clear: to enhance the reliability of our service and uphold our commitment to providing riders with accurate and trustworthy information.

Acknowledgements

A huge thank you to all the members of the Offer Generation and Offer Selection teams at Lyft for making this happen!

If you are excited to be part of the Lyft team, explore opportunities on our careers page. Join us on our journey to improve people’s lives with the world’s best transportation!

ETA (Estimated Time of Arrival) Reliability at Lyft was originally published in Lyft Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Keeping OSM fresh, accurate, and navigation-worthy at Lyft

Rostom Zain El-Din — Wed, 05 Jun 2024 18:42:08 GMT

Written by Brian Spencer, Rostom Zain El-Din, Yuliya Shustava, Anastasiya Prakopava, and Kiryl Yakimavets

Lyft’s mission to improve people’s lives with the world’s best transportation requires investing in the world’s best map. Nothing could be accomplished without first having a reliable groundwork of highways, buildings, and natural features through which to guide drivers and riders. We rely heavily on accurate and up-to-date map data — this is why Lyft Mapping is built on OpenStreetMap (OSM).

OSM is a global map database used by millions of people around the world for tracking agricultural land use, disaster recovery, refugee response, academic research, and much more. After 19 years of growth, OSM is now commonly used by many companies to power applications like logistics platforms, social media, and gaming. OSM is now the biggest crowdsourced repository of human geospatial knowledge, and Lyft is proud to have been a contributor since 2020.

Why Lyft Chooses OSM

When we first evaluated OSM, we found that it aligned with Lyft’s goals and needs thanks to three key findings:

High-quality road network

OSM already had a robust core road network across our major markets in North America. This meant that for basic needs like turn-by-turn navigation, the road geometry and connectivity were sufficient to get us started.

Large and active community

With more than 10 million registered users, OSM effectively harnesses the power of crowdsourced mapping. This community-driven approach ensures several benefits:

Fresh Maps — With peers and partners working on the same map, it stays fresh and accurate
Incremental Maintenance — Partially completed map layers mean that only incremental updates are needed, making it easier to maintain
Engaged Community — The OSM community is actively involved in defining tags and schemas, ensuring that the map evolves to meet the needs of its users
Collaborative Efforts — Many companies are working on similar map layers for their applications (e.g., gated communities), which means we can leverage shared knowledge and resources

Thriving OSM ecosystem

The open source nature of OSM, combined with its large community, makes for a thriving and innovative ecosystem that offers two distinct advantages:

Tools and Resources — A plethora of tools and resources are available for map editing, analysis, and geospatial processing
Rapid Response — The ability to improve the map ourselves allows us to respond quickly to customer feedback and make necessary updates

Initial Challenges & Growth

Of course, working on a project with ten million of your closest mapping friends isn’t all peaches and cream. Lyft faced some hurdles during the initial phase of contributions to OSM:

Policy Interpretation — We were tasked with studying all possible OSM sources and guidelines related to our projects. However, not all elements of our projects are specifically addressed by guidelines or OSM best practices, requiring us to make interpretations based on local mapping peculiarities. The Lyft Team continues to mature its mapping skills through practice and communication with the community.
The complexity of changing OSM mapping rules — Lyft diligently adapts its projects to adhere to OSM guidelines. In some cases, however, we develop best practice updates that we believe are useful for the entire OSM community. But the process of clarifying or changing OSM guidelines requires active participation from all interested parties. It can be very time-consuming and does not always yield the expected results.
Lyft-Owned Evidence — OSM mappers initially questioned changesets where we used our own data as evidence. Now, all Lyft OSM profiles include source descriptions, and we provide necessary data upon request. This has reduced confusion and increased trust in our edits.

Over time, we learned how to best leverage OSM and its community for success in Lyft Mapping and quickly expanded our efforts. We began in 2020 with five Lyft mappers focused on identifying missing roads using our internal automated pipeline and adding or editing ‘lanes’ and ‘turn:lanes’ tags. These two projects laid the foundation for our maps, providing us with initial experience in detecting and resolving map issues, which subsequently increased the number of edits we made.

After more than four years of working with OSM, we now have 38 Lyft mappers contributing over 3K edits per week across six mapping features (turn restrictions, barriers & access, highway signs, road lanes, missing roads, and complex investigations; more details are on the Lyft OSM wiki page).

Making a Difference

Lyft is proud to be contributing to an ecosystem used by so many. Editing the map is inspirational for us, especially when our expertise and unique data help resolve complex situations and contribute to critical map updates.

Unique sources for edits lead to a better, fresher map

In addition to all available open sources, Lyft is collecting ground truth and driver telemetry data to update maps with real-time road construction and environment changes.

The left screenshot is the freshest available Nearmap background (March 2024) of the intersection without a roundabout. However, Lyft driver telemetry data reveals that the traffic is organized with a roundabout, allowing us to apply the needed map update.

Lyft drivers provide an additional source of needed map updates as they report issues that occur during rides. Some reports lead to huge improvements at major intersections.

In the example above, the driver initially reported two missed roundabouts, but upon further investigation, we discovered that the area around I-80 and County Road F44 required extensive editing.

Ultimately, Lyft’s driver feedback and image collection programs enable rapid map updates for road changes, ensuring accurate navigation, improved customer satisfaction, and optimized routes for reduced fuel consumption and environmental impact.

OSM Community Engagement & Contributions

Lyft aims to be an engaged and collaborative member of the OSM community. We have taken on several initiatives and practices to stay connected with other mappers and contribute to the continued improvement of OSM mapping practices.

Community communication

Our team monitors several communication channels to maintain two-way discussions with OSM community members, the main one being public changeset discussions. Since 2020, we have engaged in over 265 changeset discussions. Most were related to clarifying the evidence that we used for specific map edits, but some discussions contained constructive feedback that helped us improve our curation approach.

Still other discussions contain positive feedback that highlights our efforts and motivates us to maintain high mapping quality levels. Engaging with the OSM community and receiving feedback improves our mapping expertise and fosters learning.

We also maintain communication through email (dct-osm(at)lyft.com), OSM forums and Slack channels, which are used mainly for clarifying information to launch new projects or updates. For example, we initiated two forum discussions (first, second) to resolve a disagreement about adding destination tags on motorways in the USA and found an appropriate way to add destinations in Canada. The discussions were aimed at finding a consensus on the best approach to tagging destinations, ensuring that the data would be useful and accurate for all users.

Community contributions

The Mapping Curation Team at Lyft created and maintains OSM training guides on GitHub. These guides help improve OSM editing accuracy for our internal mapping teams and are accessible to any new mappers outside of Lyft. OSM community members appreciate and share these guides on social media, promoting best mapping practices.

Impact

In the span of four years of mapping, the Lyft team has published over half a million changesets with 5.5 million edits. Our contributions have included, but are not limited to:

135k changesets for adding new ways and correcting the geometry of existing ones, checking way directions, validating tags that define road drivability
142k lane-editing changesets to help drivers better navigate through interchanges and multi-lane intersections
67k edits based on destination signs to make guidance on motorways as useful as possible
37k turn restriction adjustments, including newly added restrictions, to ensure drivers are offered the best possible route without any illegal turns

All this impact is thanks to the OSM ecosystem, its incredible community, and a dedicated team of mappers from the Lyft Curation Team.

As we continue to grow our mapping efforts at Lyft, we will focus on improving road network tagging and geometry in OSM, in particular by adding new roads, turn restrictions, lane and destination data, and road barriers, and by maintaining construction updates using both public and private data. We remain committed to open communication with OSM community members. Together we will build the world’s best map to support the world’s best transportation.

Interested in learning more? Lyft’s own Jason Laska will be speaking in detail on this topic at OSM State of the Map US 2024. We’d love to see you there!

Keeping OSM fresh, accurate, and navigation-worthy at Lyft was originally published in Lyft Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Technical Learning at Lyft: Build a Strong Data Science Team

Shumpei Goke — Wed, 24 Apr 2024 15:11:29 GMT

Written by Shumpei Goke and Jinshu Niu

Why Technical Learning?

At Lyft, data scientists tackle challenging technical problems every day. To support and empower our data scientists, Lyft’s Technical Learning Council (TLC) provides diverse and high-quality continuous learning opportunities to hone their technical skills. TLC’s mission is “to equip Data Science team members with the technical knowledge and skills that are applicable to their work and helpful to their career advancement.” Investing in technical learning not only aids data scientists in solving complex problems but also contributes to their professional growth, benefiting both Lyft’s business and the individuals involved. We want to foster a culture of continuous learning by providing resources and forums that are easily accessible to everyone. See our previous blog post for more on the motivations behind TLC.

TLC has four main workstreams: Technical Training, CS4DS (“Computer Science for Data Scientists”), Rideshare Seminar, and Science Brown Bag. Let’s take a closer look at each of these workstreams. These are open to new and existing Lyft data scientists, who are encouraged to choose from the opportunities depending on their current skill levels and schedules.

Technical Training

Technical Training offers a rich array of lecture series on data science methodologies and applications, taught by our fellow data scientists and typically running for 6–10 weeks. Data science is a multidisciplinary field that combines knowledge in statistics, computer science, machine learning, causal inference and more. Business acumen and domain knowledge of specific product areas are critical as well. It is rare for a data scientist to be skilled in everything, especially in the first couple of years of their career. The Technical Training workstream offers Lyft scientists the opportunity to build or brush up their core data science skills across multiple areas.

Over the years, we have successfully launched lecture series on topics such as experimentation, observational causal inference, structural causal modeling, and reinforcement learning. These lectures typically start with the foundational theory, followed by applications within Lyft. Some recent examples can be found in our tech blog: reinforcement learning and structural causal modeling. This style of lectures enables attendees to understand the topic deeply, as the theory and applications complement each other and strike a fine balance. In 2024, the Technical Training workstream is offering a course on Large Language Models with a focus on both the theoretical fundamentals and their day-to-day applications to Lyft’s business.

An outlier among Tech Learning’s past offerings is the series titled “How to Build a Rideshare Company.” This series delves into essential aspects of Lyft’s business, such as pricing, incentive campaigns, driver assignment, and mapping, and explains how data science is applied to balance the two-sided marketplace and deliver reliable service to both our riders and drivers. The lecture series attracted huge attention among scientists as a great opportunity to develop a holistic view of how Lyft operates its business and where data science plays a role in optimizing our marketplace and driving growth.

Snapshot Photo from a Technical Training Session

CS4DS

Computer science and software engineering are fundamental data science skills. This is especially true at Lyft, where data scientists work very closely with software engineers and read (and write!) production-level code. It is, however, not uncommon for talented data scientists to start their careers without a formal education in computer science.

The CS4DS course addresses this gap. The goal is to teach data scientists foundational knowledge in computer science, in order to elevate their programming abilities and enhance collaboration with engineers. The course covers both theoretical foundations, such as big-O notation and Object Oriented Programming (OOP), and practical software engineering skills like Git, containers, and unit testing. It is an intensive self-serve course with lectures, homework assignments and mentorship with regular check-ins. The students get feedback on their coding style, which is very hard to get when studying on their own but is essential for improving their coding practices. Office hours are provided for students to get hands-on assistance from volunteers.

The course has attracted great enthusiasm from the participants, and we typically see higher completion rates than for other technical training. It has contributed greatly to uplifting the software engineering skills of data scientists. Past participants have found that the course has helped them read and understand production-level code faster and write code that respects the principles of OOP and is easier to maintain. We even have a data analyst who transitioned to an ML software engineer role after completing the course!

Landing page of the internal website for the CS4DS course

Rideshare Seminar

Our biweekly Rideshare Seminar invites external guests from academia and industry to deliver talks on their research. It is a great opportunity for data scientists to catch up with the latest research trends that are relevant to Lyft’s business.

The topics are broad and inspiring, ranging from the analysis of the future world with autonomous vehicles (Freund, Lobel, and Zhao, 2022) to the analysis of the gig economy (Lian, Martin, and Ryzin, 2022) to experiment analysis that mitigates bias from marketplace interference (Bright, Delarue, and Lobel, 2023). These seminars provide great opportunities for knowledge sharing and networking, and have strongly inspired and promoted collaboration across teams to build products with advanced technologies in the industry.

We have successfully run about 30 Rideshare seminars in the past 2 years with speakers from 16 different universities around the world, with total attendance of over 400.

Image of Rideshare Seminar Sessions

Science Brown Bag

Science Brown Bag provides a forum for Lyft data scientists to showcase their project achievements and promote knowledge sharing. By having open and informal discussions on their work, data scientists can promote their tools and applications for wider adoption and gather valuable feedback from their peers. The Brown Bag also serves as a forum to foster collaboration amongst groups with similar ideas.

The topics at Science Brown Bag vary, and include important advances in our technology and infrastructure, such as a new metric on measuring the supply-demand imbalance, graph-based embeddings, recommendation systems, and our machine learning platform and data platform.

Snapshot Photo from a Science Brown Bag Session

Final Words

The Technical Learning Council is proud to offer Lyft’s data scientists a wide spectrum of learning opportunities. At Lyft, data scientists play a pivotal role in driving innovation, and TLC is committed to building a robust data science team, facilitating innovation, and supporting the continuous growth and success of our scientists.

If you are excited to be part of Lyft’s data science team, explore opportunities on our careers page. Join us and let’s work together to improve people’s lives with the world’s best transportation!

Special thanks to the current and previous workstream leads for the Tech Learning Council, including Amber Wang, Baichuan Mo, Frances Huang, Hao Yi Ong, Li Wang, Miriam Leon, Nick Ung, Paul Havard Duclos, Ramon Iglesias, Vicky Liu, and Zhe Hu!

Technical Learning at Lyft: Build a Strong Data Science Team was originally published in Lyft Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.