<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Lyft Engineering - Medium]]></title>
        <description><![CDATA[Stories from Lyft Engineering. - Medium]]></description>
        <link>https://eng.lyft.com?source=rss----25cd379abb8---4</link>
        <image>
            <url>https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png</url>
            <title>Lyft Engineering - Medium</title>
            <link>https://eng.lyft.com?source=rss----25cd379abb8---4</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 17 May 2026 03:15:16 GMT</lastBuildDate>
        <atom:link href="https://eng.lyft.com/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[How We Built a Smarter Pickup Experience for Gated Communities]]></title>
            <link>https://eng.lyft.com/how-we-built-a-smarter-pickup-experience-for-gated-communities-47416e9df029?source=rss----25cd379abb8---4</link>
            <guid isPermaLink="false">https://medium.com/p/47416e9df029</guid>
            <category><![CDATA[mapping]]></category>
            <category><![CDATA[lyft]]></category>
            <category><![CDATA[rideshare]]></category>
            <dc:creator><![CDATA[winnieyan]]></dc:creator>
            <pubDate>Thu, 23 Apr 2026 19:16:44 GMT</pubDate>
            <atom:updated>2026-04-23T19:17:14.179Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cxVOjt0GnB5tRXbNktweOA.jpeg" /></figure><p>If you live in a gated community, you’ve been there: You request a ride from your apartment complex, expect your driver to come to you as usual, and then — your driver’s car icon just stops right at the front gate. You watch helplessly as the ETA ticks up. A chat message comes in: <em>“Hey, how do I get in?”</em> You scramble to remember the gate code. They try it. It doesn’t work. You end up meeting them awkwardly on the sidewalk outside while your coffee gets cold — <strong>a pickup journey frustrating for both you and your driver.</strong></p><figure><img alt="An example gated community in real life, Photo by Bingqian Li on Pexels" src="https://cdn-images-1.medium.com/max/1024/1*Pp8iEDacju32V13VKgsOHg.jpeg" /><figcaption>An example gated community in real life, Photo by Bingqian Li on Pexels</figcaption></figure><p>It turns out you’re not alone: Gated community pickups can make up 25–30% of Lyft rides in selected markets. For a long time, our app offered no special guidance in these situations. Riders would drop their pin inside the gates (fair enough — that’s where they <em>are</em>), while drivers would pull up to a locked entrance with no way in, leaving both parties to sort things out over chat. The result was predictable: <strong>more cancellations, longer waits, and a lot of unnecessary stress for our customers</strong>.</p><p>The Lyft Mapping team decided it was time to fix this properly — not with a band-aid, but a new end-to-end experience. Here’s how we did it.</p><h3>What Was Actually Going Wrong?</h3><p>We looked through gated ride examples, zoomed into our metrics data, and found two root causes behind most of the friction.</p><p>The first was an <strong>inflexible selection of pickup spots</strong>. 
Our app would suggest pickup spots near a rider’s location — which, for riders inside a gated community, often means <em>inside the gate</em>. But our data told a different story: many riders actually preferred meeting their driver right outside the gate, knowing their driver couldn’t access the property. The app wasn’t giving them that option clearly.</p><p>The second was a <strong>communication black hole</strong>. Even riders who <em>knew</em> how to get their driver through the gate had no good way to pass along access instructions in advance. Instead, they’d wait until the driver was already idling at the entrance before firing off a text with the gate code, setting off a frantic last-minute back-and-forth.</p><p>Two problems, two different moments in the ride flow. This signaled that we needed to fix things both <em>before</em> and <em>after</em> a ride was requested.</p><h3>Only on Lyft Maps</h3><p><strong>Lyft’s Mapping team sits in a unique position to solve this problem</strong>, with workstreams across map data, pickup spot recommendations, routing, and the rider and driver app experience.</p><p>This one needed all of the pieces to complete the map (pun intended).</p><h3>Piece 1: Drawing Gated Communities on the Map</h3><p>Before anything else, we needed a map that actually <em>knew</em> about gated communities — where they are, where their entrances are, and how many gates they have.</p><p><strong>Our Map Data team built a gate area shape generation algorithm</strong> to do exactly this. Gated communities come in all shapes and sizes: some are small apartment complexes with one entrance, and some are large developments with multiple gates and even their own internal road networks. 
The algorithm had to handle all of them reliably and provide a foundation accurate enough to support our new pickup experience for riders within these gated communities.</p><p><em>How does the app even know I’m inside a gated community?</em> When you open the app, we cross-reference your GPS location against the gate area shapes we generated. If you’re within a known gated community boundary, the app quietly switches into “gates mode” and adjusts the pickup spot options you see. No extra steps on your end.</p><p><em>Is my community in the system?</em> We’ve built our gate coverage from sources like OpenStreetMap and driver feedback, and we’re continuously monitoring for gaps and adding new communities during every map update — so coverage grows over time without you having to do a thing. That said, no map is perfect, and local knowledge is invaluable. If you don’t see the new experience or encounter issues at your gate, reaching out to Lyft support helps us prioritize the communities that need attention most.</p><h3>Piece 2: Giving Riders a Smarter Selection of Choices</h3><p>Now, when you open the Lyft app from inside a gated community, you’ll see pickup spots at your complex entrances (labeled “I’ll walk outside the gate”), right alongside the usual spots near your current location (labeled “Pick me up inside gate”). 
You can decide freely whether you prefer to wait locally (maybe it’s raining outside) or want to meet your driver outside the gate (get some steps in, right?).</p><figure><img alt="The new pickup spot selection process for riders in gated community" src="https://cdn-images-1.medium.com/max/800/1*ijs537jdaX1t6EJSAVybtA.gif" /><figcaption>The new pickup spot selection process for riders in gated community</figcaption></figure><p><strong>The team looked at historical ride patterns to surface spots that real riders have actually used near gates</strong> — so we’re not just pointing you toward the gate in theory, we’re pointing you to where riders in your community <em>actually</em> go.</p><figure><img alt="Pickup location heatmap of an example gated community" src="https://cdn-images-1.medium.com/max/467/1*bu616vMnvIaTPlze3Vs8pg.png" /><figcaption>Pickup location heatmap of an example gated community</figcaption></figure><p>We also designed the new UI to feel familiar, borrowing patterns from our existing Venues pickup flow so our riders wouldn’t have to learn something from scratch.</p><p>Before shipping, we ran a controlled experiment to make sure we weren’t accidentally making things worse — specifically, that the new flow wouldn’t add enough friction to make riders bail on requesting a ride altogether. Good news: no meaningful drop-off. Even better news: we saw actual pickups happening at the spots that riders chose initially, meaning that they no longer have to walk around to find their drivers.</p><h3>Piece 3: Routing Drivers to the Right Gate</h3><p>Recommending gate-aware pickup spots to a rider is only half the job. The driver also needs to be <em>routed there</em> correctly.</p><p>Normally, a pickup route is simple — get the driver from point A (their location) to point B (the rider’s pickup spot). 
<strong>For gated communities, our Routing team introduced a new “detour”: the gate itself becomes an intermediate stop</strong>, an arbitrary, “invisible” stop that the driver passes through on the way to the rider. We use our map data to direct drivers to the <em>right</em> gate, accounting for road direction and the most logical approach path.</p><p>This seemingly small routing tweak creates something valuable: a precise moment in the driver’s journey where we can surface gate instructions at exactly the right time.</p><h3>Piece 4: Actually Getting the Driver Through the Gate</h3><p>Now that we get drivers <em>to</em> the gate, we still need to get them <em>through</em> the gate to meet those who asked to be picked up inside, and we really can’t let our customers resort to back-and-forth communications.</p><p><strong>Our Map Experience team brought this to life</strong>: After a rider is matched with their driver, they’re prompted to share access information. Rather than presenting a blank text field — which puts all the pressure on the rider to know exactly what to say, right after they’ve just requested a ride — <strong>our Design team drew inspiration from the intercom panels riders use every day</strong>: a familiar numpad for gate codes, and a short list of plain-language options for everything else. The goal was to make instruction sharing effortless.</p><p>We leveraged historical data from gate instructions that riders had previously shared to identify the most common access scenarios. <strong>Our Content team worked to make each option feel conversational and specific enough to be useful</strong>, without being so wordy that riders would skip past them. 
The free-text fallback is still there for the edge cases, but in practice, most riders find what they need in the list.</p><figure><img alt="The new gate instructions sharing feature" src="https://cdn-images-1.medium.com/max/236/1*1l-RXL-PfkkYaPsoNeNctw.png" /><figcaption>The new gate instructions sharing feature</figcaption></figure><figure><img alt="The new gate instructions sharing feature — Sharing gate codes" src="https://cdn-images-1.medium.com/max/236/1*tjUIMB7QaP0er7qP91XwlA.png" /><figcaption>Sharing gate codes</figcaption></figure><figure><img alt="The new gate instructions sharing feature — Sharing special gate instructions" src="https://cdn-images-1.medium.com/max/236/1*ZSxPFZ7mLxoOT-vbgmCD5w.png" /><figcaption>Sharing special gate instructions</figcaption></figure><p>You might be thinking: <em>I usually text my driver the gate code — what’s actually different here?</em> <strong>The difference is timing and reliability.</strong> When you text, you’re reacting: the driver might be busy driving, or is already at the gate, already stopped, already waiting.</p><p>We also paid special attention to how we should show the shared instructions to drivers. They are likely navigating with eyes on the road — so a paragraph of gate instructions on their screen at the wrong moment is unhelpful and more importantly, unsafe. <strong>We designed the gate instructions to surface as a small, but obvious banner in the navigation screen at precisely the moment the driver approaches the gates</strong>, timed to the routing waypoint we added in Piece 3. The information is brief and scannable — just enough to get through the gate. 
No extra steps or unnecessary calls or texts.</p><figure><img alt="Gate instructions shared to drivers during navigation" src="https://cdn-images-1.medium.com/max/233/1*DqImu0uPvhvpAA4yjRxgAg.png" /><figcaption>Gate instructions shared to drivers during navigation</figcaption></figure><p>We also know that sharing a gate code feels more sensitive than sharing a pickup pin — it’s access to where you live, <strong>and your privacy is important to us. </strong>So we built in privacy controls that let you view or delete your instructions at any time after sharing. Gate codes are never stored between trips, and only your matched driver ever sees them on their navigation screen on their way to pick you up. To further ensure your gate codes stay protected, we implemented safeguards to prevent screenshots.</p><figure><img alt="Privacy protection for riders to remove shared gate details" src="https://cdn-images-1.medium.com/max/223/1*WAyZHz_ED1a-kXJEeeYZTw.png" /><figcaption>Privacy protection for riders to remove shared gate details</figcaption></figure><h3>Does It Actually Work?</h3><p>We asked riders directly through our optional feedback survey. <strong>About 95% responded positively</strong> — and their comments said it better than any metric could:</p><p><em>“Driver successfully picked me up OUTSIDE the gate, as requested.”<br></em> <em>“Picked me up where I asked.”</em></p><p>We also see data reflecting our goals: riders and drivers are not giving up on a ride just because there’s a gate in the way. Drivers are not looping around a community entrance trying to figure out how to get in. Riders are not wandering the sidewalk looking for a car that’s stuck on the other side of a barrier. Phones are not fired off with chats and calls because our riders and drivers can’t find each other. 
<strong>Those are the moments we were trying to eliminate — and the numbers reflect it.</strong></p><p>We observe lower cancellation rates from both riders and drivers, when riders share gate instructions.</p><figure><img alt="Rider cancellation rate (after driver match) for gated community rides" src="https://cdn-images-1.medium.com/max/1024/1*5doZWavnvAujBNb0ht8H1A.png" /><figcaption>Rider cancellation rate (after driver match) for gated community rides</figcaption></figure><figure><img alt="Driver cancellation rate (after driver match) for gated community rides" src="https://cdn-images-1.medium.com/max/1024/1*_Z4XJy-qi_YtrvZfIZu6aA.png" /><figcaption>Driver cancellation rate (after driver match) for gated community rides</figcaption></figure><p>In addition, riders are walking less to their final pickup locations from their requested pins, waiting less, and don’t need to reach out to their drivers as much. We also see drivers needing to change course in the final stretch less. Each of these represents a ride that went smoothly instead of sideways.</p><figure><img alt="Graph on distance between pin and actual pickup location for gated community rides" src="https://cdn-images-1.medium.com/max/1024/1*Us4IyuqjdztCFvoply0sDw.png" /><figcaption>Distance between pin and actual pickup location for gated community rides</figcaption></figure><figure><img alt="% of gated community rides with wait longer than 5 minutes" src="https://cdn-images-1.medium.com/max/1024/1*XMH8nvbqX9c7Rtwm_rJYaw.png" /><figcaption>% of gated community rides with wait longer than 5 minutes</figcaption></figure><p>Now, that same ride from your gated community looks a little different: your driver’s icon moves smoothly past the gate with the instruction you share with a simple tap, and you get into your ride stress-free, with your coffee warm. 
<strong>The results gave us confidence that the approach works — and got us thinking about where else we could take it.</strong></p><h3>More Gates, and The Bigger Picture: Bringing the Real World to the Map</h3><p>For gated communities specifically, we’re just getting started. We want to <strong>provide riders with the option to save non-sensitive gate instructions for future trips</strong>, so riders who regularly get picked up from the same community don’t have to re-enter the same information every time. Your gate intercom instruction isn’t going to change — we shouldn’t make you tell us every time. We will also work on <strong>smarter gate selection</strong> for communities with multiple entrances, choosing the best gate based on where both you and your driver are at that moment, not just the nearest one on the map.</p><p>But the more we worked on this project, the more we realized gates are just one example of a much bigger problem. <strong>The real world is full of physical constraints that affect where a driver can go and where a rider can safely and conveniently be picked up — and we see a big opportunity to leverage our Mapping stack to improve customers’ pickup experiences across many of these situations.</strong></p><p>Consider road closures. A street gets blocked for construction, a parade, or a marathon, and suddenly the route our app plotted doesn’t work anymore. The driver gets rerouted at the last second, the pickup pin is now on the wrong side of a barrier, and rider and driver end up circling each other trying to figure out where to actually meet.</p><p>Or consider unsafe road segments — a pickup spot that looks perfectly fine on a map but sits on a high-speed tunnel exit with an accident-prone history. The rider waits on the curb, and the driver pulls over wherever they can, trying to stay clear of moving traffic. 
Neither of them had any idea there was a safer option just half a block away.</p><p>What the gated community project gave us is a repeatable playbook for all of these situations: <strong>encode the real-world constraint into the map, surface that context in pickup spot recommendations, thread it through routing, and deliver the right information at the right moment and place in the app</strong>. Whether the obstacle is a gate, a closed road, or an unsafe stretch of curb, the same approach applies.</p><p>Gated communities were a specific problem. But the approach behind solving them isn’t specific at all. <strong>Every time the real world gets in the way of a safe and convenient pickup, there’s an opportunity to make the map smarter — and the ride a little less stressful for everyone involved.</strong></p><p><em>Lyft is hiring! If you’re passionate about maps, visit </em><a href="https://www.lyft.com/careers"><em>Lyft Careers</em></a><em> to see our openings.</em></p><hr><p><a href="https://eng.lyft.com/how-we-built-a-smarter-pickup-experience-for-gated-communities-47416e9df029">How We Built a Smarter Pickup Experience for Gated Communities</a> was originally published in <a href="https://eng.lyft.com">Lyft Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Predicting Rider Conversion in Sparse Data Environments with Bayesian Trees]]></title>
            <link>https://eng.lyft.com/predicting-rider-conversion-in-sparse-data-environments-with-bayesian-trees-07227ff92789?source=rss----25cd379abb8---4</link>
            <guid isPermaLink="false">https://medium.com/p/07227ff92789</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[ridesharing]]></category>
            <category><![CDATA[transportation]]></category>
            <category><![CDATA[statistics]]></category>
            <dc:creator><![CDATA[Zammit Alban]]></dc:creator>
            <pubDate>Mon, 30 Mar 2026 14:43:41 GMT</pubDate>
            <atom:updated>2026-03-30T14:43:39.644Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*OMZl9ZxHl02QWvOMcaUaCQ.jpeg" /></figure><p>At Lyft, understanding how riders go through our user experience is fundamental to operating a healthy marketplace. Specifically, it is important to have a robust model that determines whether a rider will actually request a ride after entering a destination and viewing a price and ETA. Accurately predicting this decision, which we call <em>conversion</em>, informs countless decisions across our platform. Whether the goal is to balance supply and demand, improve user experiences, optimize recommendations and advertising, understand long-term engagement, or decide how to distribute coupons, rider conversion prediction is a central challenge for the Lyft business.</p><p>However, predicting human behavior at scale is incredibly complex: the same person might open the app just to check current availability on one occasion, and actually request a ride after viewing our prices on another. The contexts under which riders make their conversion decisions are extremely diverse and almost unique to each session. A user’s intent changes based on where they are and where they want to go, what time it is, their previous interactions with the platform, and current supply-demand market conditions, to name a few.</p><p>When we try to model this using standard machine learning approaches, we run into a significant challenge: <strong>data sparsity</strong>.</p><h3>The Challenge of High Cardinality and Sparsity</h3><p>To accurately predict conversion, we need to slice our data very thinly across many categorical features. Imagine trying to predict the conversion probability for a business traveler leaving the suburbs of Detroit at 4:00 AM on a Tuesday to catch their flight at the airport 30 minutes away. While Lyft has vast amounts of data overall, the amount of data available for that specific intersection of contexts often turns out to be tiny. 
Maybe we only have ten examples in history.</p><p>If we use standard techniques like <a href="https://en.wikipedia.org/wiki/Gradient_boosting">Gradient Boosted Trees</a> (e.g., LGBM, XGBoost), we encounter severe overfitting. A standard model looking at 10 examples in the training data where 8 converted might confidently predict an 80% conversion probability. But that’s likely statistical noise. The next 10 examples might show only 2 conversions.</p><p>One could argue that much more complex models, such as Deep Neural Networks or even Large Language Models in a few-shot fashion, could handle this sparsity if properly tuned and calibrated, but this comes at the cost of greatly increased inference time. Session conversion predictions are very often needed live, while the rider interacts with our systems, and we want to be able to confidently predict their conversion as soon as we know their destination. No one wants to stare at a loading spinner, so to protect the user experience we need a model that can handle an extremely high-cardinality set of different situations, provide robust and accurate predictions even when data is scarce, and serve those predictions with ultra-low latency.</p><h3>Overcoming Sparsity with Bayesian Trees</h3><p>To solve this, our team developed a modeling framework to predict session conversion in real time. At its core, it is a highly optimized, hierarchical lookup structure designed to handle sparse categorical data by leveraging Bayesian theory. It incorporates priors (knowledge gained from broader contexts) to smooth predictions in specific, data-rare contexts.</p><h4>The Hierarchical Tree Structure</h4><p>We organize the training data into a tree structure based on a hierarchy of partition keys. These keys define the context. 
While the exact keys we use depend on the specific application, imagine a hierarchy like this:</p><ul><li>Root: All the sessions</li><li>Level 1 Split (e.g., Spatial Context): City Region</li><li>Level 2 Split (e.g., Temporal Context): Time of Day / Day of Week</li><li>Level 3 Split (e.g., Congestion Context): Current local supply-demand balance</li><li>…</li></ul><p>As we move down the tree, the context becomes more and more specific, and the data available at each node becomes sparser and sparser.</p><figure><img alt="Figure 1 — Hierarchical decomposition of rider sessions. The tree structure illustrates how the model splits global data into progressively granular segments, from City Region down to specific Temporal and Congestion contexts, to address data sparsity. (Image generated with Gemini’s Nano Banana model)" src="https://cdn-images-1.medium.com/max/1024/1*wx3nbQJF2UJDE3roAyTAKA.png" /><figcaption><em>Figure 1 — Hierarchical decomposition of rider sessions. The tree structure illustrates how the model splits global data into progressively granular segments, from City Region down to specific Temporal and Congestion contexts, to address data sparsity. (Image generated with Gemini’s Nano Banana model)</em></figcaption></figure><h4>The Secret: Bayesian Smoothing with Gaussian Priors</h4><p>How do we deal with a leaf node that only has five data points? We cannot train an independent model on those five data points alone, nor can we rely on a single global model trained on all the data, as it would miss the specifics of this leaf node that guide rider conversion decisions.</p><p>Our conversion model uses the statistical properties of the parent node to inform the prediction at the child node. This is achieved through <strong>Bayesian smoothing</strong>, specifically utilizing Gaussian priors on model parameters. It allows data-sparse segments (e.g. 
a leaf node in the tree) to default to the robust average of their parent group.</p><p>Here is the intuition:</p><ol><li><strong>Model architecture: </strong>Each node in the tree is a copy of the same parametric model. It can be a logistic regression, a support vector machine, or even a shallow neural network. To make a prediction y from an input x, you rely on a parametric model f that depends on trainable parameters Θ such that y = f(x, Θ). Each node has its own version of Θ that we want to be as specific as possible to the segment the node represents.</li><li><strong>Training</strong>: The tree is trained top-down, level by level from the root to the leaves, with less and less data available as we go deeper. For each individual node, its parent node has been trained on more data. Hence Θ_parent provides a strong prior belief about what Θ_child should look like when training the child node. As such, when fitting the child’s model on the child’s specific data, we add to the training loss an L2 penalty for diverging too far from the parent’s parameters, that is ||Θ_parent - Θ_child||². This penalty is equivalent to placing a Gaussian prior on Θ_child centered at Θ_parent.</li></ol><figure><img alt="Equation 1 — Example of the training loss used to train a child node’s parameters Θ_child using its parent parameter Θ_parent via an L2 regularization of strength λ" src="https://cdn-images-1.medium.com/max/1024/1*lpezD3yNpON6G6FIUKb4AQ.png" /><figcaption><em>Equation 1 — Example of the training loss used to train a child node’s parameters Θ_child using its parent parameter Θ_parent via an L2 regularization of strength </em>λ</figcaption></figure><p>The choice of the regularization strength λ is critical and should depend on how much data is available at the parent node relative to its children, so that the model automatically balances trust in the parent against specialization to the child’s data. 
If a leaf node has very little data, its prediction will be heavily pulled towards the parent’s stable mean. As the leaf node acquires more data, the model gradually trusts the local data more, moving the prediction away from the parent’s prior and towards the local raw average.</p><figure><img alt="Figure 2 — How Bayesian smoothing prevents overfitting. If a parent model shows conversions peaking at 5-mile trips, but a sparse child node indicates a 2–3-mile peak, relying solely on the child’s data causes irrational overfitting. Bayesian smoothing creates a compromise by blending the local signal with the parent’s prior, ensuring robust predictions even when local data is scarce." src="https://cdn-images-1.medium.com/max/990/1*DrMVDdCFSpCXH6ZnXwGz2A.png" /><figcaption><em>Figure 2 — Bayesian Smoothing in Action. Imagine we are at a specific node in the tree where the parent model (black line) tells us that, for the parent’s segment of sessions, rider conversion peaks around 5-mile trips. Then, for a sparser segment one level below in the tree, i.e. for one of the parent’s child nodes, we observe in the child’s specific training data that shorter 2–3-mile trips actually convert more. Standard Overfitting (purple dashed line): If we relied solely on this child’s data, the child model would overfit, irrationally predicting near-zero conversion for 5-mile trips. Bayesian Smoothing (pink line): The smoothed model finds a compromise. It respects the local signal by shifting the peak a bit to the right, but also relies on the parent’s prior to maintain a good probability mass around 5 miles, ensuring robust predictions even where local data is missing.</em></figcaption></figure><h3>Ensuring Behavioral Consistency</h3><p>Beyond handling sparsity, we often have domain knowledge about how conversion should behave relative to certain continuous variables. 
For instance, consider the rider’s historical conversion rate: intuitively, as this increases, the predicted conversion probability for the current session should also increase. All else being equal, a rider with a 90% historical rate should receive a higher predicted conversion than a rider with a 40% historical rate. The model should enforce a monotonic relationship between historical conversion and the current session’s conversion.</p><p>However, standard ML models can sometimes learn erratic shapes that violate this intuitive logic due to noise in the training data, producing non-monotonic relationships between an input and the output where none should exist. With Bayesian trees, we can keep very simple parametric models at each node because the use case is already very specific (that is the whole point of Bayesian trees, right?). Simpler models, e.g. a logistic regression cvr = σ(Θ · cvr_hist), give much more control over monotonicity (e.g. by enforcing Θ &gt; 0) and ensure appropriate model behavior, and with it better explainability. This guarantees that the model’s outputs are not only statistically robust but also directionally intuitive, reliable and interpretable.</p><figure><img alt="Figure 3 — Bayesian priors to enforce logical constraints, like monotonicity. For instance, a rider with a higher historical conversion rate should logically have a higher current conversion probability. While standard models overfit to noisy data, creating jagged, illogical curves, the Bayesian model enforces a monotonic constraint. This ignores noise and produces a smooth, logically consistent prediction that aligns with business logic." src="https://cdn-images-1.medium.com/max/989/1*r5apu5JDsMmvtttHoUAMkg.png" /><figcaption><em>Figure 3 — Enforcing Logical Constraints with Bayesian Priors. 
Imagine we are analyzing a specific feature used to predict a session’s conversion, such as the “Rider’s Historical Conversion Rate.” Business Logic dictates that this relationship should be monotonic: a rider who has converted frequently in the past should generally be more likely to convert now, not less. Standard Overfitting (purple dashed line): A standard model blindly chases the sparse noisy data (blue dots), creating a jagged curve that irrationally becomes non-monotonic with the Rider’s Historical Conversion Rate. Monotonically-constrained Model (pink line): By enforcing a monotonic constraint on the relationship between input and output, the Bayesian model produces a smooth, logically consistent prediction. It ignores the noise that suggests a dip, ensuring that a higher historical rate always translates to a higher predicted probability, aligning with our domain knowledge.</em></figcaption></figure><h3>Conclusion</h3><p>Bayesian Conversion models represent a significant step forward in our ability to model rider conversion decision-making in highly dynamic, sparse environments. By combining a highly structured hierarchical approach with the robust statistical grounding of Bayesian smoothing, we can generate accurate, stable predictions where traditional models fail. This architecture allows us to be hyper-local and specific when the data supports it, while gracefully falling back to broader, stable trends when faced with the unknown. It’s a key piece of infrastructure that helps Lyft make smarter decisions in real-time.</p><p><em>Lyft is hiring! 
If you’re passionate about developing state of the art machine learning/optimization models or building the infrastructure that powers them, read more about them on our blog and </em><a href="https://www.lyft.com/careers"><em>join our team</em></a><em>.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=07227ff92789" width="1" height="1" alt=""><hr><p><a href="https://eng.lyft.com/predicting-rider-conversion-in-sparse-data-environments-with-bayesian-trees-07227ff92789">Predicting Rider Conversion in Sparse Data Environments with Bayesian Trees</a> was originally published in <a href="https://eng.lyft.com">Lyft Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Beyond A/B Testing: Using Surrogacy and Region-Splits to Measure Long-Term Effects in Marketplaces]]></title>
            <link>https://eng.lyft.com/beyond-a-b-testing-using-surrogacy-and-region-splits-to-measure-long-term-effects-in-marketplaces-9cb06d628f2d?source=rss----25cd379abb8---4</link>
            <guid isPermaLink="false">https://medium.com/p/9cb06d628f2d</guid>
            <dc:creator><![CDATA[Iraklikhorguani]]></dc:creator>
            <pubDate>Wed, 25 Mar 2026 13:56:39 GMT</pubDate>
            <atom:updated>2026-03-25T13:56:38.643Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*iVC-FQekAWZXMelyqruN5g.png" /><figcaption>Image generated with Gemini 3 Pro (Google), 2026.</figcaption></figure><p><em>Written by </em><a href="https://www.linkedin.com/in/amber-yuehui-wang-b283b0aa/"><em>Amber Wang</em></a> <em>and </em><a href="https://www.linkedin.com/in/yoonji-kim-48731b239/"><em>Yoonji Kim</em></a> <em>at Lyft.</em></p><h3>Background</h3><p>Whenever you use the Lyft app, there is a complex balancing act happening behind the scenes. Various levers keep the marketplace running smoothly: base prices and coupons for riders affect demand, while driver pay and bonuses affect the level of available supply. Since every change to prices and payments impacts Lyft’s costs and revenue, these levers lead to key optimization problems, such as:</p><ul><li><em>How should we allocate budget between driver incentives and rider incentives?</em></li><li><em>How do we invest resources to achieve x% rides growth, and how much does it cost in terms of short-term profit?</em></li></ul><p>These are the questions the Foundational Models team at Lyft tries to answer in a systematic way. A key ingredient is understanding the effects of different types of investments. For instance, <em>what will happen if we increase the total budget for driver incentives by x%? What will happen if we increase the rider price of all rides by y%?</em> It’s worth noting that the long-term effects of such decisions tend to dominate the short-term effects: we may earn more short-term profit from a ride if we charge riders more and pay drivers less, but lose riders and drivers in the long run.</p><p>Estimating the long-term effects of resource allocation decisions is challenging in a multi-sided marketplace such as Lyft. Because these decisions tend to be consequential, their effects go beyond first-order effects on directly affected users. 
For example, if we increase driver incentive spending by x% in week 1, drivers will drive more in week 1 (short-term effect), and may return to drive a bit more in the following weeks (direct long-term effects). But this is not the full picture: in week 1, when driver hours increase as a result of more incentives, riders will enjoy better experiences (e.g., less surge pricing, shorter wait times) and may want to return to Lyft in the future. However, more driver hours in the market also mean the average driver is less busy, and that idleness may discourage them from driving in the future. An analogous effect holds for any decision that boosts short-term demand: riders are more likely to encounter bad experiences such as higher surge prices and longer wait times, whereas drivers may become busier and earn a bit more. At Lyft, we refer to such indirect effects as “market-mediated effects.” Compared to the direct long-term effects, these market-mediated long-term effects are much harder to estimate.</p><p>Below is an illustration of the causal relationships we consider for the policy change “increase driver incentive spend.”</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PREa8hWRI4ABqpCCxg8aig.png" /></figure><h3>Summary of our solution</h3><p>This article introduces the methodological framework we use to estimate the “market-mediated long-term effects,” the more challenging component of the long-term effects. In short, we use a two-step approach that can be thought of as a surrogacy approach in a broad sense. Under the assumption that the “market-mediated” outcomes are fully mediated through negative user experiences, we (1) first estimate how our policy changes affect the distribution of core negative user experiences, then (2) estimate how negative user experiences affect users’ future behavior.</p><p>Both steps are done with observational causal inference, allowing fast and inexpensive updating. 
Verification is another major challenge. No single form of experiment can provide a perfect verification; therefore, we have to combine multiple imperfect signals. Specifically, the first-step methodology can be verified using a switch-back experiment, and the second step of estimation can in principle be verified using user-split experiments.</p><p>As the last step, we combine the direct long-term effects (estimated separately) with the market-mediated long-term effects and use region-split experiments to verify the overall long-term effects. Region-split experiments generally suffer from poor pre-intervention fit and low power, so we developed a forward selection algorithm to optimize experiment design by picking the treated and control regions.</p><p>Below, we describe our methodology in detail.</p><h3>Step 1: from decisions to negative user experiences</h3><p>We start with a question: <em>when we move a policy decision today (e.g., raise rider price water-level or change driver incentive budgets), how does the short-term user experience shift?</em></p><p><strong>What we model</strong>: <em>How does a policy change affect short-term user experiences?</em> We focus on a set of negative user experiences that matter most downstream (such as long waits, high surge, and cancellations for riders, or lower hourly earnings, idleness and incentive earnings for drivers) and maintain the critical assumption that these are the only channels through which today’s decisions affect the future (by affecting today’s market). The goal of this step is to estimate the effect of our policy decisions on the distribution of these negative user experiences.</p><p>The challenge is that negative user experiences are highly cyclical and seasonal, varying with time‑of‑week, holidays, weather, and shifting supply/demand. A naive regression would blend these rhythms with the true effect of our decisions. 
Therefore, our approach:</p><ul><li>Residualizes predictable time-of-week patterns so we compare like-with-like, which mimics a “random shock” by capturing the deviation from this period’s normal status.</li><li>Controls for remaining market information such as supply and demand so the coefficient of our policy decision reflects its incremental impact on negative user experience.</li></ul><p><strong>How we estimate </strong>(observational, residualized regressions)<strong>:</strong> Conceptually, for any negative user experience metric (e.g., wait time), we fit a residualized model on deviations from the market’s own baseline:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dDlbSBg2EK8v-cbJ6meK0w.png" /></figure><p>Below is a visualization of a simulated example where we have one contextual variable. With residualized terms, we can capture how policy change would marginally change negative experience (green solid line).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Rmmh0X8przGlbV0um4wZjA.png" /></figure><p>The coefficient beta policy tells us how much negative user experience moves when a policy decision (like rider price water‑level or incentive spend) deviates from its usual level, after accounting for typical time‑of‑week rhythms and market conditions. Because we measure everything as “deviation from normal,” the effects read like elasticities around everyday operating conditions. For example, a higher‑than‑usual price suppresses demand and hence is associated with a predictable decrease in negative rider experience (e.g. high surge, long wait time), holding everything else constant. 
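</p><p>As a rough illustration of the residualized regression described above, here is a minimal Python sketch on simulated data. Everything in it is an assumption made for illustration (the linear functional form, the hourly time-of-week slots, the variable names, and the simulated numbers); it is not Lyft’s production model. It subtracts each time-of-week slot’s mean from the outcome, the policy variable, and a market control, then regresses residual on residual to recover the policy coefficient.</p>

```python
import numpy as np

# Assumption: everything below is simulated; names and numbers are illustrative only.
rng = np.random.default_rng(0)
N_WEEKS, SLOTS_PER_WEEK = 52, 168            # hourly time-of-week slots
n = N_WEEKS * SLOTS_PER_WEEK
slot = np.tile(np.arange(SLOTS_PER_WEEK), N_WEEKS)

weekly_rhythm = np.sin(2 * np.pi * slot / SLOTS_PER_WEEK)
TRUE_BETA = -2.0                             # simulated effect of price on wait time

price = 1.0 + 0.3 * weekly_rhythm + rng.normal(0, 0.1, n)     # policy variable
demand = 100 + 20 * weekly_rhythm + rng.normal(0, 5, n)       # market control
wait = (5 + 3 * weekly_rhythm + TRUE_BETA * price
        + 0.05 * demand + rng.normal(0, 0.2, n))              # negative experience


def residualize(values, groups):
    """Subtract each time-of-week slot's mean, leaving the deviation from normal."""
    slot_means = np.bincount(groups, weights=values) / np.bincount(groups)
    return values - slot_means[groups]


y_res = residualize(wait, slot)
X_res = np.column_stack([residualize(price, slot), residualize(demand, slot)])

# Ordinary least squares on the residuals; coef[0] plays the role of beta policy.
coef, *_ = np.linalg.lstsq(X_res, y_res, rcond=None)
beta_policy = coef[0]

# Naive slope of raw wait time on raw price, for comparison.
naive_slope = np.polyfit(price, wait, 1)[0]

print(f"residualized beta: {beta_policy:.2f}, naive slope: {naive_slope:.2f}")
```

<p>In this simulation the naive slope has the wrong sign because price and wait time share the same weekly rhythm, while the residualized coefficient lands close to the simulated value of -2.0.</p><p>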
The output is a calibrated response function that turns a policy change into a forecasted shift in the distribution of negative user experience, not just its average, with uncertainty to reflect real‑world variability.</p><p><strong>How we validate: </strong>To validate this mapping, we use switch‑back experiments that alternate policy settings across comparable time slots and compare the model’s predicted changes in negative user experience to the experimental lifts we observe. The experiment either verifies our model or informs its iteration (e.g., changing controls or introducing additional ones).</p><h3>Step 2: from negative user experiences to future outcomes</h3><p>The next question is: <em>given a change in a certain negative user experience today, how do future outcomes move (e.g., future rides, retention, driver hours)?</em> In the broader surrogacy framing, Step 1 captures short‑term shifts in experience, and Step 2 translates those shifts into long‑term behavior under the assumption that long‑term market-mediated effects are completely mediated by short‑term negative user experiences.</p><p><strong>What we model</strong>: The long-term effects of negative user experiences. We treat these negative experiences as exposures that vary naturally across users, times, and places, and estimate their impact on future outcomes (such as future rides) while controlling for confounders (time, location, and rider history).</p><p><strong>How we estimate</strong> (observational, doubly robust): We use Augmented Inverse Probability Weighting (AIPW, <a href="https://academic.oup.com/ectj/article/21/1/C1/5056401">Chernozhukov et al., 2021</a>), a doubly robust causal estimator combining (i) a propensity model for exposure (the likelihood of facing a given level of negative user experience, given context) and (ii) outcome models for future metrics, conditional on confounders. This yields average treatment effects for negative user experience. 
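</p><p>To make the AIPW estimator concrete, here is a minimal Python sketch on simulated data. It rests on several simplifying assumptions that are ours, not the article’s: the exposure is binary (faced the negative experience or not), there is a single discrete confounder, and both nuisance models are plain group means with no cross-fitting. All names and numbers are hypothetical.</p>

```python
import numpy as np

# Assumption: simulated data with a binary exposure and one discrete confounder.
rng = np.random.default_rng(1)
n, N_SEGMENTS = 20_000, 4
segment = rng.integers(0, N_SEGMENTS, n)                 # confounder (context bucket)
p_exposure = np.array([0.2, 0.35, 0.5, 0.65])[segment]   # exposure depends on context
exposed = rng.binomial(1, p_exposure)                    # 1 = faced the negative experience
TRUE_EFFECT = -0.5                                       # simulated effect on future rides
future_rides = 3.0 + 0.8 * segment + TRUE_EFFECT * exposed + rng.normal(0, 1, n)


def segment_mean(values, groups, mask):
    """Mean of `values` within each segment, using only the rows where `mask` is True."""
    sums = np.bincount(groups[mask], weights=values[mask], minlength=N_SEGMENTS)
    counts = np.bincount(groups[mask], minlength=N_SEGMENTS)
    return sums / counts


everyone = np.ones(n, dtype=bool)
# (i) Propensity model: probability of exposure given the confounder.
e_hat = segment_mean(exposed.astype(float), segment, everyone)[segment]
# (ii) Outcome models: expected future rides given the confounder, per exposure arm.
mu1 = segment_mean(future_rides, segment, exposed == 1)[segment]
mu0 = segment_mean(future_rides, segment, exposed == 0)[segment]

# AIPW score: outcome-model contrast plus inverse-probability-weighted residuals.
scores = (mu1 - mu0
          + exposed * (future_rides - mu1) / e_hat
          - (1 - exposed) * (future_rides - mu0) / (1 - e_hat))
ate_hat = scores.mean()
std_err = scores.std(ddof=1) / np.sqrt(n)

# Unadjusted difference in means, which is confounded by segment.
naive_diff = future_rides[exposed == 1].mean() - future_rides[exposed == 0].mean()

print(f"AIPW ATE: {ate_hat:.3f} (SE {std_err:.3f}), naive difference: {naive_diff:.3f}")
```

<p>The estimator stays consistent if either the propensity model or the outcome models are correctly specified, which is what “doubly robust” refers to. A production version would typically use flexible learners for both and add cross-fitting.</p><p>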
We summarize the mapping via a “surrogacy index” that quantifies how much short-term negative experiences will affect long‑term outcomes; this is the scaling we use to move from short‑term exposure to negative experiences to long‑term impact.</p><p><strong>How we validate</strong>: We run user‑split experiments that perturb negative experiences and compare the model’s predicted changes in future outcomes to the experimental lifts, checking calibration (predicted vs. observed).</p><p><strong>How it works together with Step 1</strong>: Step 1 converts a policy change into a shift in the distribution of negative user experiences; Step 2 converts that exposure shift into forecasted changes in future outcomes (e.g., future rides). Together, via the surrogacy index, they provide a causal link from today’s decision to long‑term business impact.</p><h3>Step 3: verify overall long-term effects by region-split experiments</h3><p>We combine the market‑mediated effect from Steps 1–2 with the direct long-term effect (estimated separately) using a transparent formula grounded in market mechanics. This yields a single policy‑level forecast for long‑run rides and financials that reflects both mediated and direct channels.</p><p>To validate end‑to‑end predictions, we run region‑split experiments. We developed a forward‑selection algorithm, inspired by the forward difference‑in‑differences (FDiD, <a href="https://pubsonline.informs.org/doi/10.1287/mksc.2022.0212">Li, 2024</a>) approach, to choose treated and control regions: starting from a single treated region, we iteratively add treated regions that best improve pre‑period fit and expected power. The region-split experiments allow us to observe the overall long-term effects of a policy intervention on an entire market.</p><p>Below is a simulated example of a region-split experiment. 
We select a set of treated regions for which we can find a set of control regions that mimic the treated regions’ average behavior. Once we inject an intervention shock on the treated regions (e.g., an increase in incentives), we track the discrepancy between the average rides of treated and control regions.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xsMU4a9HfE_7aVrq39skSw.png" /></figure><h3>Conclusion</h3><p>We presented a framework to connect today’s resource decisions to long‑run marketplace impact. First, we translate policy changes into shifts in negative user experience using residualized modeling and verify those short‑term responses with switch‑back experiments. Second, we map negative user experience to future outcomes with doubly robust observational inference (AIPW) and a surrogacy index, then validate with user‑split experiments. Finally, we combine the market‑mediated and direct long-term effects and validate end‑to‑end predictions via region‑split experiments, using a forward‑selection design to choose treated and control regions.</p><p><strong>What this enables</strong>:</p><ul><li>Scenario planning and policy evaluation.</li><li>Budget allocation across levers (pricing, incentives) informed by the Pareto frontier of short-term profit vs. long-term rides.</li><li>Continuous calibration as markets evolve, grounded in experimentally-verified observational causal inference.</li></ul><p>The result is a model‑based, experiment-verified causal engine: decisions move user experience, user experience moves future behavior, and the composition yields long‑term business impact. This enables fully informed decisions on resource allocation.</p><p><em>Lyft is hiring! 
If you’re passionate about Data Science, visit </em><a href="https://www.lyft.com/careers"><em>Lyft Careers</em></a><em> to see our openings.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=9cb06d628f2d" width="1" height="1" alt=""><hr><p><a href="https://eng.lyft.com/beyond-a-b-testing-using-surrogacy-and-region-splits-to-measure-long-term-effects-in-marketplaces-9cb06d628f2d">Beyond A/B Testing: Using Surrogacy and Region-Splits to Measure Long-Term Effects in Marketplaces</a> was originally published in <a href="https://eng.lyft.com">Lyft Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Scaling Localization with AI at Lyft]]></title>
            <link>https://eng.lyft.com/scaling-localization-with-ai-at-lyft-b04dca99e6ee?source=rss----25cd379abb8---4</link>
            <guid isPermaLink="false">https://medium.com/p/b04dca99e6ee</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[distributed-systems]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[aws]]></category>
            <dc:creator><![CDATA[Stefan Zier]]></dc:creator>
            <pubDate>Thu, 19 Feb 2026 17:28:41 GMT</pubDate>
            <atom:updated>2026-02-19T17:28:40.735Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kdkhjQ3EWj3J8fHUe1xKIA.png" /></figure><p><em>Written by </em><a href="https://www.linkedin.com/in/stefan-zier/"><em>Stefan Zier</em></a></p><p>For years, Lyft’s localization infrastructure relied exclusively on human translation. While this model usually ensured excellent quality, it was bound by multi-day turnarounds and costs that scaled linearly with every new language. For the few languages Lyft initially supported (Spanish, Portuguese, and French), these limits were acceptable.</p><p>However, Lyft’s expansion goals quickly outpaced what traditional workflows could support. Lyft’s recent Québec launch required compliance with Bill 96 (legislation mandating French-first user experiences), which demanded faster turnaround than multi-day cycles allowed. Simultaneously, the Lyft Urban Solutions (“LUS”: Bikes &amp; Scooters) division sought to expand into European markets, requiring six new languages. The business need had changed: we now needed to move faster without sacrificing quality.</p><p>This post explores how we re-architected Lyft’s Translation Pipeline to leverage AI alongside linguist oversight and ultimately unlock new market launches. We will walk through context injection, decoupling content generation from evaluation, implementing guardrails, and treating prompts as version-controlled production code. The new pipeline reduces translation latency from days to minutes while maintaining the fidelity required for legal compliance and brand integrity.</p><p><strong>Note</strong>: We will walk through our <strong><em>batch translation pipeline</em></strong> (used for 99% of app and web content), which targets a 30-minute SLA for 95% of translations. 
We also support real-time translation (e.g., ride chat), which uses a different architecture.</p><h3>How Translations Reach Hundreds of Services</h3><p>Before diving into the LLM pipeline, it helps to understand how translations flow through Lyft’s infrastructure. <a href="https://eng.lyft.com/translating-lyft-into-spanish-6f14cb37c7fd">This 2020 post</a> explains the internationalization architecture initially built to move beyond one language/currency/country. Since then, the platform has grown to serve 11 locales across 150+ services.</p><p>At its core, the pipeline does two things in parallel: it uploads source strings to <a href="https://www.smartling.com/">Smartling</a>, our Translation Management System (TMS), for human oversight, and submits them to LLM workers for rapid draft generation. This dual-path architecture lets us accelerate turnaround while keeping the TMS as the system of record for quality control. The AI-translated strings ship immediately to unblock launches, and linguists review them asynchronously. Once approved, the linguist-reviewed version replaces the early release.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ueDmtP1p_qb9o8p9G0QMmA.png" /><figcaption>Figure 1: Batch Translation Pipeline Components</figcaption></figure><p>Lyft’s localization pipeline operates in three phases:</p><ol><li><strong>Drafting (Ingest):</strong> Requesters submit source strings along with context (where the text appears in the UI and the intended tone). The Drafter uses this context to generate translation candidates.</li><li><strong>Unblocking Launches (Early Release):</strong> The “Evaluator” runs various quality checks to select the optimal translation candidate. This version is distributed in minutes, unblocking product launches while linguist review takes place in parallel.</li><li><strong>Finalization (Review):</strong> Professional linguists review the drafts within the TMS. 
Approved translations replace the early-release versions and are established as the definitive system of record. Flagged translations are corrected before being distributed.</li></ol><h3>Why LLMs?</h3><p>We started with an evaluation of traditional Neural Machine Translation (NMT) providers. We found NMT was fast, but its translations often did not preserve Lyft-specific terminology or context. This pointed us toward LLMs, and recent research validated that direction.</p><p><strong>LLMs have reached human-level translation quality for resource-rich languages.</strong> A <a href="https://arxiv.org/html/2407.03658v1">2024 study</a> found GPT-4 produced error rates comparable to junior translators for major language pairs, though performance weakens for lower-resource languages. Industry adoption reflects this shift, with over <a href="https://lokalise.com/library/data-reports/localization-trends-2025/">70% of translations now machine-assisted</a> and growing overall trust in AI-provided translation.</p><p><strong>Context handling is the key differentiator.</strong> Traditional NMT services process each string in isolation with no awareness of where or how it will be used. In our system, requesters provide context about each string: where it appears, its purpose, and any relevant UI constraints. LLMs can incorporate this metadata directly, which often produces more accurate translations.</p><p><strong>LLMs can evaluate, not just translate.</strong> Multiple valid translations can exist for the same source string. Research has established that LLMs are state-of-the-art evaluators of translation quality, with <a href="https://aclanthology.org/events/wmt-2025/">WMT25 confirming</a> that large LLMs show strong system-level evaluation performance. 
This enabled the following architecture, in which one LLM translates, another evaluates and provides feedback, and the translator then refines its output based on the critique.</p><h3>Building the Iterative Pipeline</h3><p>A naive approach to machine translation is a single API call: send English text, receive translated text. This fails at scale for several reasons:</p><ol><li><strong>No nuance.</strong> Single-shot translations are often “correct enough” but rarely optimal for brand voice or regional idioms.</li><li><strong>No quality signal.</strong> Without evaluation, there’s no way to know if a translation is acceptable before shipping it to users.</li><li><strong>No recovery path.</strong> When a translation fails validation, the system has no mechanism to try again with corrective feedback.</li></ol><p>We needed a system that could generate options, critique them, and iterate: a workflow that mirrors how human translators actually work. This led us to an architecture where multiple specialized processes translate and evaluate through structured handoffs.</p><h4>The “Drafter” — Translation Generation</h4><p>The job of the Drafter is primarily creative: it aims to produce diverse, high-quality translation candidates. We configure it to generate three distinct candidates for every source string.</p><p><strong>Why three?</strong> We find a single translation often converges on the most likely phrasing, which may not be optimal for Lyft’s brand voice or the specific UI context. Multiple candidates increase the probability that at least one captures the right tone, handles edge cases correctly, and uses terminology naturally.</p><p><strong>Model selection:</strong> For the Drafter, we use a fast, non-reasoning model, as translation is primarily a generative task where standard models already perform very well. 
Additionally, a faster model comes with lower cost and allows us to iterate.</p><p><strong>Sample Prompt</strong></p><pre>DRAFTER_PROMPT = &quot;&quot;&quot;<br>You are a professional translator for Lyft.<br>Translate into {language} for {country}.<br>Give {num_translations} translations of the following text.<br><br>GLOSSARY: {glossary}<br>PLACEHOLDERS (preserve exactly): {placeholders}<br><br>Text: {source_text}<br>&quot;&quot;&quot;</pre><p><strong>Sample Input/Output</strong></p><p><strong>Note</strong>: The LLM interactions return structured data via <a href="https://docs.pydantic.dev/latest/">Pydantic</a> schemas rather than free-form text. This ensures type safety, reliable parsing, and clear contracts between Drafter and Evaluator.</p><pre># Input <br>source_text = &quot;Your {vehicle_type} is arriving in {eta_minutes} minutes&quot;<br>language = &quot;French&quot;<br>country = &quot;Canada&quot;<br><br># Output (parsed)<br>DrafterOutput(<br>    candidates=[<br>        TranslationCandidate(text=&quot;Votre {vehicle_type} arrive dans {eta_minutes} minutes&quot;),<br>        TranslationCandidate(text=&quot;Votre {vehicle_type} sera là dans {eta_minutes} minutes&quot;),<br>        TranslationCandidate(text=&quot;Votre {vehicle_type} arrivera d&#39;ici {eta_minutes} minutes&quot;),<br>    ]<br>)</pre><h4>The “Evaluator” — Translation Evaluation</h4><p>The Evaluator acts as a strict quality gate. It receives all candidates from the Drafter and scores each against a rubric, ultimately selecting the best one or rejecting them all.</p><p><strong>Model selection:</strong> We use a reasoning-focused model for evaluation. Unlike generation, evaluation requires analytical comparison: checking source versus target for semantic drift, verifying terminology compliance, catching subtle tone mismatches. 
The deliberate reasoning process helps surface errors that a faster model might miss.</p><p>The Evaluator grades each candidate on four dimensions:</p><ol><li><strong>Accuracy &amp; Clarity</strong>: Does the translation preserve the full meaning of the source? Is it unambiguous?</li><li><strong>Fluency &amp; Adaptation</strong>: Does it read naturally to a native speaker? Is it culturally appropriate for the target region?</li><li><strong>Brand Alignment</strong>: Does it use official Lyft terminology? Are proper nouns, airport codes, and brand names preserved in English?</li><li><strong>Technical Correctness</strong>: Is it free of spelling and grammar errors? Are all Lyft terms/phrases applied correctly?</li></ol><p>Each candidate receives a grade: pass or revise. If any candidate passes, we ask the Evaluator to select the best one. If all fail, the Evaluator provides a detailed critique explaining why each failed.</p><p><strong>Sample Output</strong></p><pre>EvaluatorOutput(<br>    evaluations=[<br>        CandidateEvaluation(candidate_index=0, grade=Grade.PASS, <br>                           explanation=&quot;Accurate, natural phrasing.&quot;),<br>        CandidateEvaluation(candidate_index=1, grade=Grade.PASS,<br>                           explanation=&quot;Natural and conversational.&quot;),<br>        CandidateEvaluation(candidate_index=2, grade=Grade.REVISE,<br>                           explanation=&quot;&#39;d&#39;ici&#39; implies uncertainty, inappropriate for ETA.&quot;),<br>    ],<br>    best_candidate_index=0,<br>)</pre><h4>Why separate Drafter and Evaluator?</h4><p>Now that we’ve seen both components, it’s worth explaining why we separate them. 
The critique-and-refine pattern has several benefits:</p><ol><li><strong>Easier Evaluation</strong>: Spotting errors is simpler than perfect generation, so the Evaluator doesn’t need to be a flawless translator.</li><li><strong>Context Preservation:</strong> The original translator retains the reasoning for its choices when refining based on feedback.</li><li><strong>Bias Avoidance:</strong> Separating roles prevents the self-approval bias of a single model translating and evaluating its own work.</li><li><strong>Flexibility/Cost:</strong> Different models can be used for each role (e.g., a fast drafting model and a more capable evaluator).</li></ol><h4>Retry, Reflection, and Self-Correction</h4><p>The feedback loop between Drafter and Evaluator ensures that when all candidates fail evaluation, the system doesn’t give up. It learns from the failure and tries again, up to three times.</p><p>We find this iterative refinement yields the largest gains in the first 1–2 cycles, so the three-attempt limit balances quality improvement against latency and cost.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*375V9mKLMKDKDwT1iLOgXg.png" /><figcaption><em>Figure 2: The Translation Feedback Loop</em></figcaption></figure><p>When the Evaluator rejects all candidates, its critique is captured and injected into the Drafter’s next attempt:</p><pre># This critique is prepended to the next Drafter prompt<br>critique_for_retry = &quot;&quot;&quot;<br>All candidates failed glossary compliance. Key issues:<br>- &quot;Ride&quot; must be translated as &quot;trajet&quot; per Lyft Quebec glossary<br>- Do not use &quot;course&quot; which is European French<br>&quot;&quot;&quot;</pre><p>The retry prompt explicitly instructs the Drafter to address previous failures: correct glossary usage, fix placeholders, adjust tone. 
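</p><p>The control flow of this feedback loop can be sketched in a few lines of Python. This is a hypothetical illustration, not the pipeline’s actual code: <em>translate_with_retries</em>, <em>Evaluation</em>, and the two stand-in functions that replace the LLM calls are names invented here, and standard-library dataclasses are used instead of Pydantic so the example runs without dependencies.</p>

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

MAX_ATTEMPTS = 3  # the three-attempt limit described in the text


@dataclass
class Evaluation:
    passed: List[bool]          # one flag per candidate
    best_index: Optional[int]   # selected candidate, or None if all failed
    critique: str               # feedback to inject into the next draft


def translate_with_retries(
    source_text: str,
    draft: Callable[[str, str], List[str]],
    evaluate: Callable[[str, List[str]], Evaluation],
) -> Optional[str]:
    """Return an accepted translation, or None to defer to linguist review."""
    critique = ""
    for _ in range(MAX_ATTEMPTS):
        candidates = draft(source_text, critique)
        result = evaluate(source_text, candidates)
        if result.best_index is not None:
            return candidates[result.best_index]
        critique = result.critique  # carried into the next Drafter prompt
    return None


# Hypothetical stand-ins for the Drafter and Evaluator LLM calls.
def fake_draft(source_text: str, critique: str) -> List[str]:
    word = "trajet" if "trajet" in critique else "course"
    return [f"Votre {word} est confirmé"]


def fake_evaluate(source_text: str, candidates: List[str]) -> Evaluation:
    passed = ["trajet" in candidate for candidate in candidates]
    if any(passed):
        return Evaluation(passed, passed.index(True), "")
    return Evaluation(passed, None, '"Ride" must be translated as "trajet"')


# The first draft fails the glossary check; the second uses the critique and passes.
print(translate_with_retries("Your ride is confirmed", fake_draft, fake_evaluate))
# prints: Votre trajet est confirmé
```

<p>Returning None after the final attempt mirrors the fallback in which the string waits for a linguist in the TMS.</p><p>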
We find this iterative refinement process results in a success rate of over 95% across most languages.</p><p>In cases of persistent disagreement between the Drafter and Evaluator, we limit retries to three attempts. If no acceptable translation emerges, product teams must wait for a linguist in our TMS to handle the string, or they can internally request an expedited translation for urgent launches.</p><h4>Context Injection</h4><p>LLMs don’t have access to Lyft’s terminology guidelines, nor do they know, for example, that {driver_name} is a variable in code that must be preserved. We inject this context through careful prompt engineering.</p><p>Terminology and reference data aim to reduce the chances of LLM hallucination, a common failure mode. For example, “Driver” might become “Conducteur” in French when Lyft’s official term is “Chauffeur.” To solve this, we maintain data sources of Lyft terms and phrases that the Drafter and Evaluator reference when completing their tasks:</p><ul><li><strong>Glossary terms</strong>: Individual words and phrases with their official translations (e.g., “Driver” → “Chauffeur” in French)</li><li><strong>Do-not-translate lists</strong>: Brand names, product names, and proper nouns that should remain in English</li></ul><p>All of these assets are maintained in the TMS, where linguists routinely review and update them. The Translation Pipeline then pulls them in periodically to augment the prompts sent to the Drafter and Evaluator at translation time:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DlC6BZdNL2XaPWyyOHQaDg.png" /><figcaption><em>Figure 3: Injecting Terminology and Reference Data</em></figcaption></figure><h4>Deterministic Guardrails</h4><p>LLMs are probabilistic: they may hallucinate variable placeholders, mangle URLs, or strip formatting. For production translation, we need deterministic validation. 
Without guardrails, we observed consistent hallucination patterns: placeholder translation was the most common failure mode (e.g., {driver_name} becoming {nom_du_conducteur}), followed by placeholder omission and format string corruption.</p><p>Our guardrail system operates in two phases: <strong>pre-translation extraction</strong> and <strong>post-translation validation</strong>. This layered approach ensures protected elements survive the LLM round-trip intact.</p><p><strong>Pre-Translation: Extract and Tokenize</strong></p><p>Before any content reaches the LLM, we extract elements that must be preserved exactly. This includes:</p><ul><li><strong>Variables and format strings</strong>: Curly-brace variables like {driver_name} and {eta}, printf-style placeholders (%s, %1$s, %@)</li><li><strong>URLs and identifiers</strong>: Links, email addresses, and region codes requiring exact preservation</li><li><strong>Structural elements</strong>: HTML tags (with balance checking), escape sequences (e.g., \n, \t, \r)</li></ul><p>We use regex pattern matching to identify these elements and replace them with numbered tokens (__PH_0__, __PH_1__, etc.) that the LLM is less likely to accidentally mangle. The numbering also handles reordering: different languages have different grammatical structures, so {vehicle_type} might appear before {eta_minutes} in English but after it in German. The numbered tokens maintain the correct mapping regardless of where they end up in the translated string.</p><p><strong>Sample Transformation</strong></p><pre>Original: &quot;Hey {first_name}! Your Lyft arrives at {eta}.\nTrack: https://lyft.com/r/abc&quot;<br><br>Masked: &quot;Hey __PH_0__! 
Your Lyft arrives at __PH_1__.__PH_2__Track: __PH_3__&quot;<br><br>Mapping:<br>    __PH_0__ → {first_name}<br>    __PH_1__ → {eta}<br>    __PH_2__ → \n<br>    __PH_3__ → https://lyft.com/r/abc</pre><p>The mapping is injected into the prompt as human-readable context, giving the LLM explicit instructions about what to preserve.</p><p><strong>Post-Translation: Validate and Restore</strong></p><p>After the LLM returns a translation, we run deterministic validation before accepting it. Validation checks three conditions:</p><ol><li><strong>Presence</strong>: Every expected token appears exactly once</li><li><strong>No hallucination</strong>: No unexpected tokens were introduced</li><li><strong>Structure</strong>: Structured content, such as HTML tags, is balanced (i.e., open tags have corresponding close tags)</li></ol><p>When validation fails, we inject the specific errors into a retry prompt rather than giving up. The LLM knows exactly what went wrong (e.g., <em>“Missing placeholder: __PH_2__ (\n)”</em>) and can correct it on the next attempt. This creates a deterministic feedback loop that catches mistakes prompt instructions alone would miss.</p><p>Once validation passes, restoration is a simple token-for-original swap:</p><pre>Translated: &quot;Salut __PH_0__! Votre Lyft arrive à __PH_1__.__PH_2__Suivre: __PH_3__&quot;<br><br>Restored: &quot;Salut {first_name}! Votre Lyft arrive à {eta}.\nSuivre: https://lyft.com/r/abc&quot;</pre><h4>Experimentation &amp; Configuration</h4><p>Early in development, we treated prompts as configuration: tweaking them ad hoc, testing manually, and deploying with little review. We found, for example, that a small wording change in the Evaluator’s prompt could cause a rise in false rejections, and we had no way to identify the regression or roll back.</p><p>Now, prompts are version-controlled production code. Every prompt template lives in our Translation Pipeline alongside the code, subject to the same review process. 
Each prompt version includes a changelog documenting what changed and why. Prompt changes require testing to demonstrate the change improves (or at least doesn’t regress) translation quality on our evaluation suite.</p><p>To do this, we evaluate changes against ground truth translations in our TMS, flagging divergences from linguist-approved versions. These differences are reviewed before any prompt or model change receives production traffic.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FBe_Ou3DhM8TvUBh-kU6Fw.png" /><figcaption><em>Figure 4: Prompt Rollouts</em></figcaption></figure><p><strong>Multi-Model Experimentation</strong></p><p>We don’t commit to a single model provider. The pipeline abstracts the LLM layer, but experimentation happens through Lyft’s configuration infrastructure. This gives us:</p><ul><li><strong>Traffic splitting</strong>: Roll out new models to 5%, 20%, 50% of requests</li><li><strong>Shadow mode</strong>: Run new models in parallel without affecting production</li><li><strong>Per-locale overrides</strong>: Different models/prompts for different markets</li><li><strong>Instant rollback</strong>: Revert to previous configuration without a deploy</li></ul><pre># Example config schema<br>drafter:<br>  model: &lt;fast-generation-model&gt;<br>  fallback: &lt;fallback-model&gt;<br>  prompt: &lt;prompt-id&gt;<br>  shadow:<br>    model: &lt;candidate-model&gt;<br>    prompt: &lt;prompt-id&gt;<br>    traffic_percent: 10<br><br>evaluator:<br>  model: &lt;reasoning-model&gt;<br>  fallback: &lt;fallback-model&gt;<br>  prompt: &lt;prompt-id&gt;<br>  reasoning_effort: medium<br><br>locale_overrides:<br>  &lt;locale_code&gt;:<br>    evaluator:<br>      reasoning_effort: high</pre><p>We run experiments comparing model combinations across major model providers. 
This is how we determined that faster, cheaper models (GPT’s mini models, Claude Haiku) perform comparably to frontier models for initial generation while reasoning-focused models significantly outperform on catching subtle errors.</p><p><strong>Locale-Specific Tuning</strong></p><p>Not all locales need the same prompt. We discovered this when expanding to English variants (en-GB and en-CA). These locales require only orthographic changes like spelling (“color” → “colour”), punctuation, and occasional vocabulary swaps (“trunk” → “boot”).</p><p>Without locale-specific guidance, the LLM interprets “translate to British English” as license for entire rewrites. A simple “Try again” button became “Have another go” — technically valid British English, but a jarring tone shift that didn’t match our UI voice.</p><p>As shown in the config above, we use <strong>locale overrides </strong>that constrain the transformation scope. For example:</p><pre>LOCALE_OVERRIDE_EN_GB = &quot;&quot;&quot;<br>You are adapting American English text for British English speakers.<br><br>IMPORTANT: This is an ORTHOGRAPHIC adaptation, not a full translation.<br><br>Only change:<br>- Spelling (color → colour, center → centre, organize → organise)<br>- Punctuation conventions where required<br><br>DO NOT change:<br>- Tone or voice<br>- Sentence structure  <br>- Casual vs. formal register<br>- Idioms or expressions (unless they are specifically American and would confuse UK readers)<br><br>The goal is that a British reader sees familiar spelling, not that the text &quot;sounds British.&quot;<br>&quot;&quot;&quot;</pre><p>This constraint dramatically reduced over-adaptation errors for English variants while maintaining consistent brand voice across locales. 
We now apply similar scoping constraints for other closely related language pairs where full translation would be overkill.</p><h3>Conclusion</h3><p>Through careful oversight, we see that 95% of translations need no significant changes after linguist review. The remaining 5% usually represent genuinely difficult cases (regional idioms, legal disclaimers, brand voice decisions) where human oversight adds real value.</p><p>Re-architecting Lyft’s translation pipeline taught us that LLMs excel at translation when you design for their limitations. The iterative Drafter/Evaluator pattern mirrors how human translators actually work: generate options, critique them, refine. Deterministic guardrails catch what prompts alone cannot enforce. And treating prompts as production code (versioned, tested, reviewed) prevents the silent regressions that plague ad hoc LLM deployments.</p><h3>Acknowledgements</h3><p>Thank you to the following team members for making this possible: <em>Janani Sundarrajan, Sebastiano Bea, Jiachen Jiang, Alex Hartwell, Yousra Saidani, Adriana Deneault, Yuantao Ji, Miles Krell, Alex Atencio.</em></p><p><em>Lyft is hiring! If you’re passionate about leveraging AI to enhance millions of user experiences, visit </em><a href="https://www.lyft.com/careers"><em>Lyft Careers</em></a><em> to see our openings.</em></p><p>*<em>The Lyft Terms of Service and other product-specific terms and conditions remain strictly human translated.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b04dca99e6ee" width="1" height="1" alt=""><hr><p><a href="https://eng.lyft.com/scaling-localization-with-ai-at-lyft-b04dca99e6ee">Scaling Localization with AI at Lyft</a> was originally published in <a href="https://eng.lyft.com">Lyft Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Trusting the Untestable: Validation and Diagnostics for the Doubly Robust Models]]></title>
            <link>https://eng.lyft.com/trusting-the-untestable-validation-and-diagnostics-for-the-doubly-robust-models-00853df009df?source=rss----25cd379abb8---4</link>
            <guid isPermaLink="false">https://medium.com/p/00853df009df</guid>
            <category><![CDATA[validation]]></category>
            <category><![CDATA[aipw]]></category>
            <category><![CDATA[doubleml]]></category>
            <category><![CDATA[quasi-experiment]]></category>
            <category><![CDATA[rideshare]]></category>
            <dc:creator><![CDATA[Shima Nassiri]]></dc:creator>
            <pubDate>Thu, 12 Feb 2026 17:07:13 GMT</pubDate>
            <atom:updated>2026-02-17T21:35:07.027Z</atom:updated>
            <content:encoded><![CDATA[<p><em>written by </em><a href="https://www.linkedin.com/in/rosschu/"><em>Ross Chu</em></a><em> and </em><a href="https://www.linkedin.com/in/shima-nassiri-phd-7a030826/"><em>Shima Nassiri</em></a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1h8xn6vIPteLe8ssJuFQtg.png" /></figure><h3>The Causal Frontier: Measurement Beyond Randomization</h3><p>The gold standard for determining the causal impact of a policy or product change at a company like Lyft is the <strong>A/B test</strong> (randomized experiment). By randomly assigning users to a treatment or control group, A/B tests inherently eliminate bias, providing clean estimates of the Average Treatment Effect (ATE). However, many critical business questions and large-scale initiatives simply <strong>cannot be randomized</strong>. This forces scientists to move past traditional experimentation and leverage <strong>quasi-experimental</strong> methods.</p><p>We rely on non-randomized measurement in several key scenarios across Lyft:</p><ul><li><strong>Partnerships and Policies:</strong> Assessing the incremental impact of a partnership (e.g., linking two company accounts) often involves non-randomized assignment. 
Since these collaborations require coordinated operational work across both companies and are typically announced or promoted broadly, controlled randomization is impractical.</li><li><strong>Long-Term Effect (LTE):</strong> Measuring effects that unfold over a long period, like the LTE of high prices on future rides, is typically handled by observational studies.</li><li><strong>Post-Launch Evaluation:</strong> Continuous monitoring of a policy after it has been fully rolled out requires a method that doesn’t involve costly holdout groups or degradation tests.</li><li><strong>Biased Data:</strong> In cases where pre-existing experimental data is found to have an imbalance, a quasi-experimental approach can potentially leverage the biased data instead of requiring a costly rerun.</li></ul><h3>Introducing Doubly Robust Models: Causal Inference Without Randomness</h3><p>To address these non-randomized measurement needs, Lyft relies on various <strong>quasi-experimental estimators</strong>. In this post, we focus on the <strong>Augmented Inverse Propensity Weighting (AIPW) model</strong>. 
This model was first established at Lyft to measure the impact of a negative user experience on future topline metrics like rides and bookings; AIPW is a form of <strong>doubly robust</strong> estimation used to estimate the Average Treatment Effect (ATE) or Average Treatment Effect on Treated (ATT).</p><p>The <strong>doubly robust</strong> nature is what makes the AIPW model so powerful: the formula relies on fitting two separate models for a given set of confounders <em>X</em> and treatment <em>D </em>— the outcome model, <em>g(D, X)</em>, and the propensity score model, <em>e(X)</em> — and the overall estimator consistently estimates the true ATE if <em>at least one</em> of these two models is correctly specified.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/656/1*v0kci-jzz56hnoXDdypPQw.png" /></figure><h3>The Critical Need for Validation</h3><p>Unlike A/B tests, quasi-experiments lack the fundamental guarantee of unbiasedness provided by randomization. This makes <strong>validation, diagnostics, and trust-building</strong> absolutely critical. If we cannot randomize the treatment assignment, we must rely on a <strong>large set of control variables (confounders)</strong> to adjust for pre-exposure differences between the treatment and control groups. This adjustment process makes the analysis highly susceptible to bias if not correctly executed and rigorously checked. The core validation pillars are centered on managing confounders and monitoring model diagnostics, as we’ll explore in the next section.</p><p><strong>The Validation Engine: Confounders and Diagnostics</strong></p><p>Since quasi-experiments lack the inherent randomization of A/B tests, we must prove the validity of our causal estimates through rigorous inputs and explicit model diagnostics. 
We build trust in every single result through <strong>Confounder Management</strong> and a <strong>Diagnostic Scorecard.</strong></p><ol><li><strong>Rigorous Data and Confounder Inputs</strong></li></ol><p>The greatest threat to a quasi-experiment’s validity is <strong>selection bias</strong> — users with certain characteristics are more likely to be treated by the policy, making the control group not a proper counterfactual for the treatment group. The only way to correct for this is by meticulously identifying and controlling for these pre-existing differences, known as confounders. At Lyft, we have developed a quasi-experimentation platform that makes this process mandatory and customizable.</p><p><strong>a) The Confounder Set Requirement</strong></p><p>The analysis requires potentially <strong>hundreds of features/confounders</strong> to adequately reduce bias. The quasi-experimentation platform doesn’t allow users to skip this step; users <strong>are required</strong> to select or define one confounder set for AIPW analysis.</p><ul><li><strong>Pre-defined sets:</strong> Users can choose from established sets on the platform that are pre-defined as SQL queries to pull the relevant data.</li><li><strong>Customization is key:</strong> Crucially, the platform exposes the underlying SQL query, allowing users to <strong>customize, modify, add, or remove</strong> variables within a set to perfectly match their specific use case.</li><li><strong>Preventing leakage:</strong> The system automatically ensures confounder data is gathered from <em>before</em> the user’s first exposure date to the treatment, preventing <strong>leaky covariates</strong> that would incorrectly attribute the treatment’s effect to a non-causal variable.</li></ul><p><strong>b) The Balancing Act: Correcting for Downsampling Bias</strong></p><p>A core task in data preparation for AIPW is balancing the size of the treatment and control groups in the presence of imbalanced data. 
When one group is smaller than the other, the system <strong>randomly downsamples the larger group</strong> to achieve balance. However, this random downsampling introduces a new scientific challenge:</p><ul><li><strong>Non-Representative Samples:</strong> Even if the original sample satisfies model assumptions across treatment groups, taking a random subset of the larger group may make that subset non-representative of the true population distribution with respect to the confounders.</li></ul><p>To recover the true population-level ATE, we apply a correction in the AIPW estimates per <a href="https://arxiv.org/pdf/2403.01585">Ballinari (2024)</a>, which involves two related concepts:</p><p><strong>i. Propensity Score Correction:</strong> We must convert the sample-estimated propensity score, <em>p_s(X)</em>, back into the true population propensity score, <em>p(X)</em>, using a conversion formula that accounts for the downsampling ratio <em>L</em>. This ensures the model uses the real probability of being treated in the population, not just in the sample.</p><p><strong>ii. Outcome Reweighting:</strong> After the propensity score correction, the efficient scores must be subjected to a <strong>weighted average</strong> based on the sampling ratio. Specifically, for every observation in the downsampled control group, we must account for the fact that it represents <em>1/L</em> copies of the original population. This process involves uniformly rescaling the weights of each observation so they average to 1.</p><p>This <strong>reweighting of outcomes</strong> is a critical scientific refinement currently being implemented to debias the results and ensure the final ATE estimate accurately reflects the total impact on the original population.</p><p><strong>2. 
Model Diagnostics and Assumptions</strong></p><p>The output of every AIPW analysis on our quasi-experimentation platform is a scorecard with two essential tabs: the <strong>Scores Tab</strong> and the <strong>Diagnostic Tab</strong>. The Diagnostic Tab is where we evaluate the model’s health to look for clues that its fundamental assumptions hold, providing visual evidence of the estimation quality. Below are two of the dozens of diagnostics we show users:</p><p><strong>i. Checking Common Support (Propensity Overlap)</strong></p><p>The AIPW model relies on the <strong>common support assumption</strong> (or <strong>overlap assumption</strong>), meaning that for any given set of confounders, there must be a non-zero probability of being in <em>either</em> the treatment or control group (<em>0&lt;e(X)&lt;1</em>). If this assumption is violated, the inverse propensity score weights explode: observations with propensity scores near 0 or 1 receive extreme weights and dominate the overall estimate.</p><p>The platform provides a histogram visualizing the <strong>Propensity Overlap</strong> between the control and treatment groups. Below are examples of sufficient and insufficient overlap between treated and control groups.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ENQZ7MZXY93UEgCpK8wbXw.png" /></figure><p>If the overlap is poor, the causal estimates are unreliable. The analysis is further refined by applying a user-defined <strong>trim level</strong> to discard extreme propensity score values and satisfy the common support assumption.</p><p><strong>ii. 
Ensuring Covariate Balance</strong></p><p>The primary goal of using confounders is to achieve a state where the treatment and control groups are statistically similar on all observed characteristics.</p><p>A key diagnostic graph illustrates the <strong>Covariate Balance before and after adjustment</strong>.</p><ul><li><strong>Before Adjustment (Original):</strong> Shows the initial difference (bias) in each feature between the groups.</li><li><strong>After Adjustment (IPW/IPW Trimmed):</strong> Shows how well the model has reduced the difference. Successful adjustment moves the metrics close to the dotted zero line, <strong>confirming that the confounders are properly balanced</strong>.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*v1_j3J-2Hz9weli99jA79A.png" /></figure><p><strong>Measuring Bias in AIPW</strong></p><p>Since AIPW is a new addition to our causal inference platform, we needed to verify its performance. We decided to use randomized experiments as our “ground truth” to see how closely our observational AIPW estimates matched experimental results. We looked for an empirical setting where there was both experimental and observational data for the same intervention, time period, and regional markets. This allows us to compare experimental results directly against AIPW estimates derived from observational data, where we controlled for biases using a large set of confounders.</p><p><strong>Empirical Setting: Weekly Ride Challenges</strong></p><p>Weekly ride challenges are conditional bonuses given to drivers if they complete a certain number of rides within a week. Since these bonus payments are costly, Lyft uses an algorithm to target ride challenges efficiently given a fixed budget. Specifically, the algorithm allocates ride challenges to drivers who are expected to be most responsive. 
AIPW estimates the effect of ride challenges using observational data by comparing drivers within the targeted group who do versus do not receive the challenges but have similar propensity scores.</p><p>The experimental ground truth leverages an ongoing experiment to evaluate the effectiveness of ride challenges. In this holdout experiment, drivers are randomly assigned into a “control” group that will not receive a ride challenge. Remaining drivers in the “targeted” group may or may not receive a ride challenge depending on whether they are targeted by the algorithm (illustrated below). AIPW estimates the treatment effect on observational data by comparing drivers who do versus do not receive the challenge within the targeted group.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*G9qaFkLsyl_dW81XQxyz9w.png" /></figure><p>For the experimental ground truth, we cannot directly compare drivers in targeted versus control groups. This is because many drivers in the targeted group do not receive the ride challenge, so a direct comparison would yield an unbiased estimate only of the effect of “potentially” being exposed to ride challenges (the intention-to-treat effect). This diluted effect would not be comparable with AIPW, which compares drivers with versus without the ride challenge.</p><p>To get around this issue, we used instrumental variables (IV) to estimate the Local Average Treatment Effect (LATE). Randomized assignment into targeted versus control groups is the “instrument” that influences whether the driver receives the ride challenge, but it does not perfectly determine it. LATE is the treatment effect of ride challenges for “compliers”: drivers who receive the challenge if assigned to the targeted group, and drivers in the control group who would have received the ride challenge if they had been assigned to the targeted group. 
Conceptually, compliers are “potential targets” who would be targeted by the algorithm for ride challenges, so whether they receive the challenge is solely determined by random assignment into targeted versus control groups. We compare LATE from experimental data with Average Treatment Effect for Treated units (ATET) from observational data, since they both correspond to the effect of ride challenges for drivers who are targeted by the algorithm.</p><p><strong>What We Found</strong></p><p>Through this careful “apples-to-apples” comparison, we found that AIPW estimates obtained from observational data understate ground truth magnitudes obtained from experimental data through IV estimates. In the first row of the table below, AIPW estimates that ride challenges increase driver hours by 11.1%, while the experimental estimate suggests a 13.3% effect. The AIPW estimate is 2.1 percentage points below the experimental estimate, or equivalently, 16.2% lower than the ground truth magnitude.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0aB4438vl0ceZ9y0orGTow.png" /></figure><p>To understand the root cause, we explored whether this discrepancy is because the analysis sample for AIPW is different from the overall population. Since AIPW requires the same propensity score values to be observed in treated and control groups, the analysis sample trims out drivers whose propensity scores are too high or too low. Since this is a subset of the overall population, the discrepancy observed earlier may reflect differences in the populations they represent.</p><p>To test this hypothesis, we used the propensity score model to find a comparable subset in the experimental data with the same range of hypothetical propensity scores. The second row of the table shows that the discrepancy between observational AIPW and experimental IV is minimal when comparing within this subset. 
The estimated effect from AIPW is only 0.3 percentage points below the experimental ground truth (2.57% smaller). This large reduction in discrepancy shows that AIPW understates the ground truth primarily because the trimmed analysis sample is different from the overall population.</p><p>However, this doesn’t mean that AIPW will always understate the experimental ground truth. In fact, there is a <a href="https://pubsonline.informs.org/doi/10.1287/mksc.2022.1413">well-known study</a> at Facebook that actually found the opposite. Whether AIPW understates or overstates the true effect heavily depends on the empirical context. In our case, ride challenges are targeted to the most effective drivers, who tend to have higher propensity scores. It is difficult to find drivers who have similar propensity scores but did not receive the ride challenge, since most of them are targeted by the algorithm. As a result, the most effective drivers in the treated group end up being trimmed out of the analysis sample due to a lack of comparable drivers in the control group. Since AIPW estimates the treatment effect on an analysis sample that excludes the most effective drivers, it makes sense that AIPW understates the true effect on the overall population that includes such drivers.</p><p><strong>Improving the AIPW Platform Through our Learnings</strong></p><p>To summarize our findings, we initially found that the estimated effect from AIPW understates the experimental ground truth. A deep dive revealed two key drivers:</p><ol><li><strong>Hidden Confounders:</strong> Even with a rich feature set, hidden confounders can still bias estimates.</li><li><strong>Propensity Trimming:</strong> Trimming users with extreme propensity scores (to satisfy the common support assumption) shifted our analysis sample so it was no longer representative of the target population.</li></ol><p>This discrepancy disappeared once we subset the experimental data to match the analysis sample in AIPW. 
The “bias” was largely due to measuring the effect on a narrower segment of users who behaved differently than the general population.</p><p>Unobserved confounding and trimming on propensity scores can be more problematic in some applications than others. Since AIPW is not reliable when these issues are severe, we introduced two additional diagnostics to recognize when our estimates are vulnerable to these issues:</p><ul><li><strong>Marginal Sensitivity Model (</strong><a href="https://academic.oup.com/jrsssb/article/81/4/735/7048357"><strong>Zhao et al., 2018</strong></a><strong>):</strong> This metric quantifies the robustness of estimates to hidden confounders, which bias our propensity scores. It measures the required discrepancy between true and estimated propensities to change our directional conclusions. More specifically, the discrepancy factor <em>G</em> quantifies the extent to which true versus observed odds ratios diverge due to unobserved confounding (<em>1/G ≤ odds ratio ≤ G</em>). This is used to infer bounds on true propensity scores given observed propensities <em>e(X)</em> and factor <em>G</em>. We use this to construct worst-case bounds on treatment effects, and the diagnostic metric is the smallest value of <em>G</em> at which the worst-case bounds touch zero. Higher values for this sensitivity metric indicate that the direction of true impact does not change even with large discrepancies in propensity scores due to unobserved confounding.</li><li><strong>Covariate Comparisons for Trimmed vs. Untrimmed:</strong> We plot observed characteristics of trimmed users against untrimmed users. If trimmed users look significantly different from untrimmed users, then our analysis sample would lack external validity to the overall population. Each point below reports the normalized difference between trimmed versus untrimmed users for each observed covariate, separately for treated and control groups. 
In the example below, users with certain characteristics are more likely to be trimmed in the treated group than they are in the control group. If users with certain characteristics are more likely to be trimmed, then the estimated ATE from the analysis sample may overstate or understate the true ATE in the overall population.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*f9npe_a_THxSzZEqr2yrsw.png" /></figure><h3>Conclusion: Trustworthy Causal Impact at Scale</h3><p>Integrating AIPW into our platform has unlocked the ability to measure causal impact when A/B tests are infeasible. Since the quasi-experimentation platform is a new capability at Lyft, we needed to rigorously validate the model to build stakeholders’ trust in the tool. Validation will continue to be essential as we onboard additional tools to the platform, which will help us deliver trustworthy insights for the most complex challenges in Lyft’s marketplace.</p><p><em>Lyft is hiring! If you’re passionate about experimentation and measurement, visit </em><a href="https://www.lyft.com/careers"><em>Lyft Careers</em></a><em> to see our openings.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=00853df009df" width="1" height="1" alt=""><hr><p><a href="https://eng.lyft.com/trusting-the-untestable-validation-and-diagnostics-for-the-doubly-robust-models-00853df009df">Trusting the Untestable: Validation and Diagnostics for the Doubly Robust Models</a> was originally published in <a href="https://eng.lyft.com">Lyft Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Lyft’s Feature Store: Architecture, Optimization, and Evolution]]></title>
            <link>https://eng.lyft.com/lyfts-feature-store-architecture-optimization-and-evolution-7835f8962b99?source=rss----25cd379abb8---4</link>
            <guid isPermaLink="false">https://medium.com/p/7835f8962b99</guid>
            <category><![CDATA[feature-engineering]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[features]]></category>
            <category><![CDATA[feature-store]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[Rohan Varshney]]></dc:creator>
            <pubDate>Tue, 06 Jan 2026 18:10:32 GMT</pubDate>
            <atom:updated>2026-01-06T18:10:25.923Z</atom:updated>
            <content:encoded><![CDATA[<p><em>Written by </em><a href="https://www.linkedin.com/in/rohanvarshney"><em>Rohan Varshney</em></a><em>, with support from </em><a href="https://www.linkedin.com/in/devon-mittow-9a54ab99/"><em>Devon Mittow</em></a><em> &amp; </em><a href="https://www.linkedin.com/in/janiceyrlee/"><em>Janice Lee</em></a><em>.</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*YGtkZilNCULvGZ_qjB1Kuw.jpeg" /></figure><p><em>This article expands upon a </em><a href="https://www.featurestoresummit.com/session/lyfts-feature-store-architecture-optimization-and-evolution"><em>presentation</em></a><em> from the Feature Store Summit 2025, which can be viewed in full </em><a href="https://www.youtube.com/watch?v=918ph1fIfvs"><em>here</em></a><em>. There is also another </em><a href="https://www.youtube.com/watch?v=tI974V6yMsQ"><em>video</em></a><em> available on the evolution of Lyft’s Feature Store from DE4AI 2024.</em></p><h3>Introduction and Core Purpose</h3><p>Lyft’s Feature Store stands as a core infrastructural pillar within its Data Platform organization, designed to optimize the management and deployment of Machine Learning (ML) features at massive scale. Its primary objective is to centralize feature engineering efforts, guaranteeing uniformity across diverse models and workflows that perform important data-driven decision making across the entire rideshare stack. 
By streamlining the entire lifecycle — from feature creation and storage to low-latency access and high-throughput processing — it facilitates effective offline and online model training and inference.</p><p>This post provides a refreshed look (<a href="https://eng.lyft.com/ml-feature-serving-infrastructure-at-lyft-d30bf2d3c32a">our previous overview was 5 years ago</a>) at the architectural evolution, practical applications, performance tuning, and significant developer experience improvements we’ve made over the past few years to improve efficiency, scalability, performance, and user accessibility. Ultimately, we aim to illustrate how the Feature Store empowers Lyft engineers to develop highly effective service components and ML models, a capability that is becoming vital for emerging AI and Large Language Model (LLM) applications.</p><h3>Defining Our Audience and Impact</h3><p>Before diving into the system’s architecture, it’s essential to understand the importance and breadth of our user base. Our product is mission-critical, serving diverse engineering functions across the company.</p><figure><img alt="5 use cases of Lyft’s Feature Store with corresponding descriptions on details &amp; impact." src="https://cdn-images-1.medium.com/max/1024/1*rNbn4VjDgPn0lHQscY_yGg.png" /><figcaption>Feature Service impact goes far beyond just these examples.</figcaption></figure><p>Understanding this subset of varied high-impact use cases (just five of 60+) is key to appreciating the necessary robustness and flexibility of the Feature Store design.</p><h3>The Feature Store Architecture: A Platform of Platforms</h3><p>Our system is best described as a <strong>platform of platforms</strong>. While the full architecture diagram is complex, we can break it down into three digestible components: Batch, Online, and Streaming features.</p><figure><img alt="Architecture diagram of Lyft’s Feature Store." 
src="https://cdn-images-1.medium.com/max/1024/1*Fxs9OFEwm_K5JVwSXC2HCw.png" /><figcaption>A “draw.io” architecture diagram of Lyft’s Feature Service.</figcaption></figure><h3>Batch Feature Ingestion and Serving</h3><p>Batch features are the most widely used family of features within our platform. These features are defined from existing Hive data tables and represent a set of standardized data points that are calculated and refreshed on a set cadence, typically on a daily basis.</p><p>The ingestion process begins when customers define features using a Spark SQL query and a simple JSON file representing the dedicated configuration metadata.</p><p>A Python cron service reads these configurations and automatically generates an Astronomer-hosted Airflow Directed Acyclic Graph (DAG). Crucially, these generated DAGs are production-ready out-of-the-box. They handle:</p><ol><li>Executing the Spark SQL query to compute the feature data</li><li>Storing the feature data to both the offline and online data paths</li><li>Running integrated data quality checks</li><li>Compatibility for feature discovery</li></ol><p>The executed DAG generates a dataframe and delivers the results to two distinct paths:</p><ul><li><strong>Offline Data Path:</strong> The feature data is stored in Hive tables for historical data analysis and machine learning model training.</li><li><strong>Online Data Path:</strong> The processed features are translated and sent to our low-latency online serving layer for real-time inference.</li></ul><h3>The Online Serving Layer</h3><p>Our online serving layer, referred to as dsfeatures (short for “data science features”), is central to our feature serving capability. It is an optimized wrapper over various AWS data stores, providing a reliable and ultra-low-latency retrieval mechanism for real-time serving.</p><p>The core structure of dsfeatures is:</p><ul><li><strong>Backing Store:</strong> DynamoDB is utilized as the primary, persistent source for features. 
It uses various metadata fields as the primary key, with a global secondary index (GSI) for GDPR deletion efficiency.</li><li><strong>Performance Cache:</strong> A Valkey write-through LRU cache is deployed on top of DynamoDB to facilitate ultra-low-latency retrievals by storing the most frequently accessed (meta)data with a generous TTL.</li><li><strong>Embeddings:</strong> An OpenSearch integration is utilized specifically for serving embedding features, which require specialized indexing and retrieval capabilities.</li></ul><h3>Customer Interaction and Data Retrieval</h3><p>The dsfeatures service centralizes how both internal DAGs and external customers interact with the feature data. From a customer’s perspective, data retrieval and management are straightforward, facilitated by our dedicated Software Development Kits (SDKs): go-lyft-features (Golang) and lyft-dsp-features (Python).</p><p>Services use these SDKs to make API calls directly to the dsfeatures service. The most common retrieval methods are Get and BatchGet calls, which the service handles by returning the requested data in a developer-friendly format.</p><p>Crucially, the SDK libraries expose full CRUD (Create, Read, Update, Delete) operations. 
This capability allows system components, such as our internal Airflow DAGs, to read and write features, and even lets customers manage real-time features ad hoc by directly invoking these API calls against our data stores.</p><h3>The Streaming Pipeline</h3><p>While batch features are essential, we also rely on streaming features to ensure data recency for low-latency applications and customer demands.</p><p>Our streaming pipeline follows a robust, multi-stage architecture to process features in real time.</p><ol><li><strong>Ingestion:</strong> Streaming applications, developed primarily using Apache Flink, read analytic events from Kafka topics (or sometimes Kinesis streams).</li><li><strong>Transformation:</strong> The Flink applications perform necessary initial transformations on the data. This includes manual metadata creation and proper value formatting.</li><li><strong>Ingest Service:</strong> The feature payloads from customer applications are sunk to spfeaturesingest, our “Streaming Platform feature ingest” Flink application. It handles the (de)serialization of the payloads and the subsequent interaction with dsfeatures via write API calls to ensure the features are processed in the right format, guaranteeing availability for online retrieval by other services.</li></ol><p>Regardless of the ingestion method (batch, streaming, or on-demand), the Feature Store maintains uniform metadata and strongly consistent reads. This is crucial for ensuring feature accuracy and availability across all consuming applications and services.</p><h3>Prioritizing User Experience and Feature Governance</h3><p>Understanding our architecture is only half the picture; the user experience is central to maximizing productivity. Our Feature Store primarily serves two frequent personas: <strong>Software Engineers</strong> (who drive service activity) and <strong>ML Modelers</strong> (who design features and models). 
Since developers can often embody both roles or work in mixed teams, we’ve designed our system to simplify interaction for everyone.</p><h3>Ease of Use and Quick Iteration</h3><p>We learned early on that our core personas are particularly proficient in SQL and place a high value on quick iteration. To facilitate this, our design centers on:</p><ul><li><strong>Performant Spark SQL</strong> as the preferred processing engine and language for batch feature queries.</li><li><strong>Simple JSON configuration files</strong> to define feature behavior/metadata.</li></ul><figure><img alt="Example JSON configuration for a simple feature group (collection)." src="https://cdn-images-1.medium.com/max/896/1*QxoBvSQeFLzo0ZiT6NiddA.png" /><figcaption>Example JSON configuration</figcaption></figure><figure><img alt="Example simple SQL query for a feature group (collection)." src="https://cdn-images-1.medium.com/max/926/1*kAmnDbAUlVz7mjaHDOIuGQ.png" /><figcaption>Example (simplified) Spark SQL query</figcaption></figure><p>This approach ensures that developers can focus on their primary responsibilities without technical intricacies getting in their way. The Feature Store presents a user-friendly interface and APIs that simplify interaction, minimizing the learning curve and facilitating rapid adoption. Engineers can readily register, update, and retrieve features using well-documented APIs and well-supported examples.</p><h3>Feature Governance and Metadata</h3><p>Our configuration files include essential <strong>metadata</strong> such as ownership details, urgency tiering, run-to-run carryover/rollup logic, and explicit feature naming and data typing. 
This metadata is crucial for more than just customer clarity; it is vital for our monitoring and observability systems, aiding in debugging and preserving a record of feature history (both metadata and values).</p><p>To support robust feature management, the Feature Store incorporates versioning and lineage tracking capabilities, encapsulated in our metadata:</p><ul><li><strong>Versioning</strong> allows developers to monitor changes to features over time, ensuring the use of correct versions for their models/services. If the SQL or expected feature behavior undergoes business logic changes, a version bump is expected.</li><li><strong>Lineage tracking</strong> offers crucial insights into the origin and transformation of features, enhancing both transparency and accountability across the platform.</li></ul><h3>Accelerating the Feature Engineering Workflow</h3><p>To complement our simple SQL/JSON foundation, we’ve integrated with <strong>Kyte</strong> to accelerate the development lifecycle. This homegrown solution is central to Airflow local development at Lyft; more about Kyte can be learned <a href="https://airflowsummit.org/sessions/2022/kyte-dag-development-experience-at-lyft/">here</a>.</p><p>We provide a custom Command Line Interface (CLI) within the Kyte environment that significantly improves the feature prototyping experience, allowing users to:</p><ul><li>Perform feature validation against their configurations.</li><li>Test SQL runs for immediate feedback and inspectable results.</li><li>Execute DAG runs in a local environment.</li><li>Confidently backfill previous dates against their DAGs.</li></ul><h3>Feature Discoverability</h3><p>Once features are generating data, discoverability is the next crucial step. Our generated DAGs automatically tag feature metadata within <strong>Amundsen</strong>, Lyft’s central data discovery platform. 
This integration allows users to easily search for existing features, a critical step in preventing the duplication of efforts and reducing wasted engineering work.</p><figure><img alt="UI for Amundsen showing a subset of discoverable ML features." src="https://cdn-images-1.medium.com/max/1024/1*pXRNX3OiA_J8t6Xqf5xDQQ.png" /><figcaption>Example of Amundsen UI for ML features</figcaption></figure><p>By simplifying data discovery and feature engineering, we solidify the Feature Store’s crucial role in the ML Model Development lifecycle, ensuring a strong partnership with our Machine Learning Platform (MLP) team, which owns the remaining model-building steps.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*CN6rR6SFhBiT4wXHvayMJg.png" /><figcaption><em>Image originally from Konstantin Gizdarski’s article, “</em><a href="https://eng.lyft.com/building-real-time-machine-learning-foundations-at-lyft-6dd99b385a4e"><em>Building Real-time Machine Learning Foundations at Lyft</em></a><em>.”</em></figcaption></figure><h3>Platform Evolution and Prioritization</h3><p>While our current setup is robust, it has evolved significantly. We made deliberate, strong steps to prioritize <strong>efficiency</strong> and <strong>improved customer experience</strong> over the last few years.</p><h4>Streamlining for Core Success</h4><ul><li><strong>Orchestration Migration:</strong> We completed two migration hops, ultimately transitioning our core orchestration platform from in-house Flyte to fully-managed Astronomer (tradeoffs discussed <a href="https://eng.lyft.com/orchestrating-data-pipelines-at-lyft-comparing-flyte-and-airflow-72c40d143aad">here</a>). 
This offloads ETL platform stability issues to external engineers while allowing our Orchestration team to invest in higher-priority internal initiatives.</li><li><strong>Niche Support:</strong> In the spirit of the <a href="https://en.wikipedia.org/wiki/Pareto_principle">Pareto principle</a>, we removed compatibility for alternative query engines like HiveQL and Redshift. This allowed us to reduce support for minor niche use cases and invest more deeply in our core use cases that drive the most value for Lyft.</li><li><strong>Staging Environment:</strong> We unlocked a reliable staging capability for our platform, boosting confidence in the release of urgent or sensitive features through easier prototyping and E2E testing in non-production environments.</li><li><strong>Standardizing Access:</strong> We developed the long-requested Golang SDK and an offline-data Python SDK. These normalize customer activity against our data, making it more monitorable, accessible, and understandable.</li><li><strong>Data Contracts:</strong> We are actively implementing the organization’s “Data Contracts” initiative, which enforces explicit expectations regarding feature freshness, ownership, and quality. This is crucial for maintaining trust as data generation scales rapidly.</li></ul><h4>Investing in New Capabilities</h4><ul><li><strong>Streaming Abstraction:</strong> We designed a RealtimeMLPipeline interface for Flink applications to make generating streaming features a low-friction process and to meet growing internal demand.</li><li><strong>Embedding Support:</strong> We integrated OpenSearch into the entire Feature Store stack to support embedding features, which has generated significant interest and new use cases.</li><li><strong>Staffing Investment:</strong> Significant new staffing is being dedicated to the Stream Compute team, reflecting the rising importance and complexity of this class of features. 
See <a href="https://www.lyft.com/careers"><em>Lyft Careers</em></a> for open roles!</li></ul><h3>Optimization: Transparency and Lean Retrieval</h3><p>With a platform of this size and importance, efficiency is in a continuous battle against bloat. Our recent optimization strategy focused on two core areas: transparency and reliability.</p><h4>Data Generation: Transparency</h4><p>For the generation pipeline, we focused on <strong>transparency</strong>. Improved monitoring of failed DAGs and tasks, coupled with strong ownership tracking, has made debugging faulty features significantly easier. This, in turn, has provided the confidence to actively deprecate unused or incorrectly used features, ensuring a healthier, more maintainable feature space for the platform team while saving resources for others (e.g., fewer wasted Spark compute and Astronomer task-scheduling resources).</p><h4>Data Retrieval: Reliability</h4><p>Our customers’ primary requests for retrieval improvements centered on two metrics: <strong>better latencies</strong> and <strong>higher success rates</strong>. 
Given that AWS datastores are our main source of unpredictable transient failures and high P999 tail latencies (latency is generally not part of their <a href="https://aws.amazon.com/dynamodb/sla/">SLAs</a>), our strategy was to focus on being as lean as possible to make such failures less likely and less disruptive:</p><ul><li><strong>Cache Modernization:</strong> We upgraded our cache technology and version, adopting Valkey as the latest solution over ElastiCache.</li><li><strong>Payload Optimization:</strong> We removed unnecessary fields from the retrieval code path and the cache payload to reduce data transfer size and processing time.</li><li><strong>Rightsizing:</strong> We right-sized our EKS pods in dsfeatures to minimize the aggregate number of necessary Redis connections, since high connection counts historically resulted in networking issues.</li><li><strong>Policy Hardening:</strong> We improved retry and timeout policies both within customer services and our own SDKs to prevent premature network exits and degraded success rates.</li><li><strong>TTL Management:</strong> We increased the cache TTL (Time-To-Live) as much as possible for both feature values and the metadata used in retrieval decision-making, carefully balancing latency performance against storage cost.</li></ul><h3>Results: Unprecedented Growth and Performance</h3><p>These focused adjustments and improvements led to remarkable trends across the platform.</p><ul><li><strong>Latency Reduction:</strong> We cut the typical <strong>P95 latency</strong> of read operations by a full <strong>third</strong>. This had a tangible downstream effect, evidenced by increased customer success rates (SRs) and a significant reduction in customer-support threads in our internal Slack channels, directly aiding on-call engineers.</li><li><strong>Batch Feature Growth:</strong> Our <strong>batch features</strong>, the largest family by volume, grew by over <strong>12% year-over-year</strong>. 
This growth occurred despite our active feature deprecation efforts, suggesting a highly positive experience and deepening partnership from the teams that utilize our platform.</li><li><strong>Caller Growth:</strong> The number of <em>distinct</em> production service callers increased by almost <strong>25%</strong> over the last year. Since each distinct caller represents a fundamentally unique use case — whether a separate service or a new facet within an existing one — it strongly validates the company’s increasing appetite for feature usage and the value it brings.</li><li><strong>Scale of Activity:</strong> Aggregate R/W activity on the platform increased by over a <strong>trillion</strong> in raw count, based on conservative extrapolation. This serves as a powerful reminder of the enormous and continuously growing scale at which our platform operates.</li></ul><p>Considering the remarkable progress of the past year, the potential for future growth and impact is truly immense.</p><h3>Conclusion</h3><p>Lyft’s Feature Store is a testament to robust data infrastructure, propelling ML excellence and operational efficiency within the company. Its architecture effectively addresses the complexities inherent in managing and deploying ML features at scale. This strategic approach ensures Lyft remains at the forefront of data-driven decision-making.</p><p>More than just a data management tool, the Feature Store at Lyft serves as a vital catalyst for innovation and a key enabler of Lyft’s overarching mission: <strong>to serve &amp; connect</strong>.</p><p>As the machine learning landscape continues its rapid evolution, the Feature Store will undoubtedly retain its critical role within Lyft’s data strategy, driving advancements and solidifying Lyft’s leadership in data-driven technology. 
We are truly excited to see what the future holds, and see how we can continue to serve &amp; connect our customers.</p><h3>Acknowledgements</h3><p>I would like to thank <a href="https://www.linkedin.com/in/devon-mittow-9a54ab99/">Devon Mittow</a>, <a href="https://www.linkedin.com/in/janiceyrlee/">Janice Lee</a>, <a href="https://www.linkedin.com/in/yigal-kassel/">Yigal Kassel</a> for all their direct contributions to the feature space. The charter’s achievements would not have been possible without them.</p><p>Further thanks to <a href="https://www.linkedin.com/in/maheep-myneni-750b51120/">Maheep Myneni</a>, <a href="https://www.linkedin.com/in/arda-kuyumcu-9712282a/">Arda Kuyumcu</a>, and <a href="https://www.linkedin.com/in/aniruddhadkar/">Aniruddh Adkar</a> for being incredible team members whose collaboration and support unlocked the confidence to embark on all these important projects &amp; developments.</p><p>Finally, thanks to <a href="https://www.linkedin.com/in/premsantosh/">Prem Santosh Udaya Shankar</a>, <a href="https://www.linkedin.com/in/menonrohit/">Rohit Menon</a>, <a href="https://www.linkedin.com/in/arbazmirza/">Arbaz Mirza</a>, <a href="https://www.linkedin.com/in/gizdarski/">Konstantin Gizdarski</a>, <a href="https://www.linkedin.com/in/yunhao-qing/">Yunhao Qing</a> and <a href="https://www.linkedin.com/in/balser/">Brian Balser</a> whose technical and management support, past and present, led to this charter’s overall success and growth.</p><h3>Further Reading</h3><ul><li>Learn about our RTML architecture: <a href="https://eng.lyft.com/building-real-time-machine-learning-foundations-at-lyft-6dd99b385a4e">Building Real-time Machine Learning Foundations at Lyft </a>by Konstantin Gizdarski and Martin Liu</li><li>Learn about a major RTML use case: <a href="https://eng.lyft.com/real-time-spatial-temporal-forecasting-lyft-fa90b3f3ec24">Real-Time Spatial Temporal Forecasting @ Lyft</a> by Rakesh Kumar and Josh Xi</li><li>Learn about our ML platform 
architecture: <a href="https://eng.lyft.com/lyftlearn-evolution-rethinking-ml-platform-architecture-547de6c950e1">LyftLearn Evolution: Rethinking ML Platform Architecture</a> by Yaroslav Yatsiuk</li></ul><p><em>Lyft is hiring! If you’re passionate about Infrastructure &amp; Data Platform, visit </em><a href="https://www.lyft.com/careers"><em>Lyft Careers</em></a><em> to see our openings.</em></p><hr><p><a href="https://eng.lyft.com/lyfts-feature-store-architecture-optimization-and-evolution-7835f8962b99">Lyft’s Feature Store: Architecture, Optimization, and Evolution</a> was originally published in <a href="https://eng.lyft.com">Lyft Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[From Python3.8 to Python3.10: Our Journey Through a Memory Leak]]></title>
            <link>https://eng.lyft.com/from-python3-8-to-python3-10-our-journey-through-a-memory-leak-1fd9b43cc01e?source=rss----25cd379abb8---4</link>
            <guid isPermaLink="false">https://medium.com/p/1fd9b43cc01e</guid>
            <category><![CDATA[memory-leak]]></category>
            <category><![CDATA[python]]></category>
            <dc:creator><![CDATA[Jay Patel]]></dc:creator>
            <pubDate>Mon, 15 Dec 2025 19:31:01 GMT</pubDate>
            <atom:updated>2025-12-15T19:30:48.214Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="Image generated with ChatGPT (OpenAI), 2025." src="https://cdn-images-1.medium.com/max/1024/1*QWBsfrlv7BNM8sZwFAsTfQ.png" /><figcaption><em>Image generated with ChatGPT (OpenAI), 2025.</em></figcaption></figure><h3>Intro</h3><p>When working with Python, memory management often feels like a solved problem. The garbage collector quietly does its job, and unlike C or C++, we rarely think about malloc or free. This doesn’t mean that there are no memory leaks in Python. Reference cycles, unreleased resources such as pooled connections, global caches, and the like can slowly inflate your process’s memory footprint. You might not notice it at first, until your worker starts OOM-ing, latency creeps up, or container restarts become mysteriously frequent.</p><p>In this post, we’ll share the story of a real-world memory leak we encountered during a Python upgrade: how we discovered it, the tools and techniques we used to investigate, and the lessons we learned.</p><h3>What happened after upgrading to Python 3.10?</h3><p>Back in the summer of 2024, we had an initiative at Lyft to upgrade all of our Python services from v3.8 to v3.10, as v3.8 was scheduled to reach end of life (EoL) by the end of 2024. You can find more details on how our awesome Backend Foundations team at Lyft handles Python upgrades across hundreds of repos at scale <a href="https://eng.lyft.com/python-upgrade-playbook-1479145d52f4">here</a>. The upgrade involved two phases: the first phase was to upgrade all the dependencies to be Python 3.10 compatible, and the second phase was to upgrade the services to Python 3.10. The dependency upgrades went smoothly for all services, and then the second phase, upgrading all services to Python 3.10, rolled out. 
While all other services ran Python 3.10 smoothly, there was one service for which the upgrade in the test environment caused a flurry of latency spikes, resulting in timeouts for downstream services.</p><figure><img alt="Graph: Increasing 5xx caused by timeouts after upgrading to Python 3.10" src="https://cdn-images-1.medium.com/max/1024/1*eZCQ8TRmMvT5_wSFiksPrA.png" /><figcaption><em>Increasing 5xx caused by timeouts after upgrading to Python 3.10</em></figcaption></figure><p>Using stats to profile the APIs with increased latency, we found that the source of the latency was repository queries to the DynamoDB tables. Specifically, the timeouts showed up in <a href="https://pynamodb.readthedocs.io/en/stable/">pynamodb</a>-based repository queries, which spin up a bunch of greenlets to fetch data from multiple tables and combine the results. The individual queries themselves were fine; it was the thread join which took the longest time, causing the worker to time out (default = 30 seconds).</p><figure><img alt="Graph: Individual Dynamo queries taking &lt; 100 ms to finish" src="https://cdn-images-1.medium.com/max/1024/1*1Nh7ZK1j2fI0fWNNw9hUew.png" /><figcaption><em>Individual Dynamo queries taking &lt; 100 ms to finish</em></figcaption></figure><figure><img alt="Graph: Gevent thread join takes 30 secs" src="https://cdn-images-1.medium.com/max/1024/1*rOsCnNVg66HEXNy1HqGCIg.png" /><figcaption>Gevent thread join takes 30 secs</figcaption></figure><p>The other interesting thing we found was memory consumption slowly creeping up with time in all of the pods.</p><figure><img alt="Graph: Memory usage % of all pods" src="https://cdn-images-1.medium.com/max/1024/1*eTRt4KtT7amhI-lDska_rQ.png" /><figcaption>Memory usage % of all pods</figcaption></figure><p>At this point, we weren’t sure whether something in gevent/greenlet was causing the memory leak or the memory leak was causing the latency, since reduced memory availability can 
increase page fetches from disk. We first checked if the <a href="https://www.gevent.org/monitoring.html#blocking">gevent monitoring thread</a> detected any event loop blocks, which could potentially cause these timeouts. We then pivoted to finding the root cause of the memory leak. Fortunately, Lyft has an internal library for profiling memory.</p><h3>Memory profiling tool</h3><p>The Lyft memory profiling tool is based on <a href="https://docs.python.org/3/library/tracemalloc.html">tracemalloc</a>. To capture the memory trace for a given gunicorn process, we registered the worker process to listen for the USR2 signal during the application initialization phase.</p><pre># app/__init__.py<br><br>MemoryProfiler().register_handlers()<br><br># mem_profiler.py (pseudo code: start_tracing and memory_dump are omitted)<br><br>import signal<br>import tracemalloc<br>from types import FrameType<br>from typing import Generator, Optional<br><br>class MemoryProfiler:<br>    def __init__(self) -&gt; None:<br>        self._state_machine = self._profiling_state_machine()<br><br>    def register_handlers(self) -&gt; None:<br>        # Register gunicorn worker to listen to USR2 to dump traces<br>        signal.signal(signal.SIGUSR2, self.handle_signal)<br><br>    def handle_signal(self, signum: int, frame: Optional[FrameType]) -&gt; None:<br>        # Each USR2 advances the state machine by one step<br>        next(self._state_machine)<br><br>    def _profiling_state_machine(self) -&gt; Generator[None, None, None]:<br>        while True:<br>            try:<br>                self.start_tracing()  # tracemalloc.start()<br>                self.memory_dump()  # Create snapshot1<br>                yield  # Wait for the next USR2<br>                self.memory_dump()  # Create snapshot2, compare with snapshot1, and dump the difference to a file<br>            finally:<br>                if tracemalloc.is_tracing():<br>                    tracemalloc.stop()</pre><h3>Let’s start the tracing!</h3><p>OK, now that we had the memory profiler set up, we were ready for some tracing to find the source of the leak. 
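</p><p>The snapshot comparison that such a profiler relies on can be sketched with the standard library alone. This is a minimal, hypothetical illustration rather than Lyft’s internal tool: the helper name top_growth and the simulated leak are assumptions made for this example, while tracemalloc.start, take_snapshot, and Snapshot.compare_to are documented tracemalloc calls.</p>

```python
import tracemalloc


def top_growth(before, after, limit=3):
    """Return the source lines whose allocations grew the most (hypothetical helper)."""
    # compare_to returns StatisticDiff entries, biggest absolute size change first.
    return after.compare_to(before, "lineno")[:limit]


tracemalloc.start()
before = tracemalloc.take_snapshot()  # plays the role of snapshot1
leaky_cache = [bytes(1024) for _ in range(2000)]  # simulated leak, roughly 2 MB
after = tracemalloc.take_snapshot()  # plays the role of snapshot2
tracemalloc.stop()

growth = top_growth(before, after)
for stat in growth:
    # Each entry names the file and line that allocated the extra memory.
    print(stat)
```

<p>In this sketch the line that builds leaky_cache should come first in the output, which is the kind of per-line difference a snapshot diff written to a file would contain.</p><p>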
To start the tracing, we send a <strong>USR2</strong> signal to the gunicorn process in the K8s pod, then send the signal again after some time interval to capture the stack traces with the highest memory usage.</p><pre>ps aux</pre><figure><img alt="Command line output: Initial process list before sending USR2 signal" src="https://cdn-images-1.medium.com/max/1024/1*LWqnkq8g2_BB5cXEBuyozg.png" /><figcaption><em>Initial process list before sending USR2 signal</em></figcaption></figure><p>Now, we will send a USR2 signal to the worker with PID 12:</p><pre>kill -USR2 12</pre><p>Upon checking the process list again…</p><pre>ps aux</pre><figure><img alt="Command line output: Tracing killing the gunicorn worker with PID=12" src="https://cdn-images-1.medium.com/max/1024/1*27fhy9fAMFix-1-BorV4pA.png" /><figcaption><em>Tracing killing the gunicorn worker with PID=12</em></figcaption></figure><p>… <strong>we observed that the gunicorn process we planned to trace got killed </strong>🙁</p><p>It took several hours of debugging and a journey back to one of my <a href="https://dartmouth.smartcatalogiq.com/en/2023/orc/departments-programs-undergraduate/computer-science/cosc-computer-science-undergraduate/cosc-58">favorite classes</a> to find the root of the issue: <a href="https://docs.gunicorn.org/en/stable/settings.html#preload-app">preload</a>. To understand why preload caused the process to be killed, we first need to understand how gunicorn works.</p><h3>Gunicorn</h3><p><a href="https://gunicorn.org/">Gunicorn</a> works on the pre-fork model. There is a leader process which forks a bunch of workers. 
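</p><p>To make the pre-fork model and the killed worker concrete, here is a minimal sketch that uses plain os.fork rather than gunicorn itself. The function run_worker and its install_handler flag are hypothetical names for this example. It only illustrates that a forked process with no SIGUSR2 handler in effect is terminated by the signal, while one that installs its own handler keeps running; it assumes a POSIX system.</p>

```python
import os
import signal
import time


def run_worker(install_handler):
    """Fork a child process, send it SIGUSR2, and report how it ended."""
    ready_r, ready_w = os.pipe()
    pid = os.fork()
    if pid == 0:  # child process, standing in for a worker
        received = []
        if install_handler:
            # Handler registered inside the worker, after the fork.
            signal.signal(signal.SIGUSR2, lambda signum, frame: received.append(signum))
        else:
            # No handler in effect: SIGUSR2 keeps its default action (terminate).
            signal.signal(signal.SIGUSR2, signal.SIG_DFL)
        os.write(ready_w, b"1")  # tell the parent the worker is ready
        while not received:
            time.sleep(0.01)
        os._exit(0)  # reached only if the handler ran
    os.read(ready_r, 1)  # wait until the worker is ready
    os.close(ready_r)
    os.close(ready_w)
    os.kill(pid, signal.SIGUSR2)
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        return "killed by " + signal.Signals(os.WTERMSIG(status)).name
    return "exited with code " + str(os.WEXITSTATUS(status))


print(run_worker(install_handler=False))
print(run_worker(install_handler=True))
```

<p>The first call reports that the child was killed by SIGUSR2 and the second that it exited with code 0, which mirrors what happens to a worker depending on whether a handler is in effect when the signal arrives.</p><p>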
There are two ways to fork the workers:</p><p><strong>No Preload</strong></p><figure><img alt="Graph: Gunicorn forked workers with no preload" src="https://cdn-images-1.medium.com/max/1024/1*GIDNjLI8fwZoU7vTSsA6Og.png" /><figcaption><em>Gunicorn forked workers with no preload</em></figcaption></figure><p>When the leader process forks a worker, the worker loads its own copy of the application code. This results in the worker process having a larger memory footprint than the leader.</p><pre>smem -a --sort=pid -k</pre><figure><img alt="Command line output: Service with no preload: Worker PSS mem = ~203MB" src="https://cdn-images-1.medium.com/max/1024/1*3CskCAq2yV0Lz2RKHAUhow.png" /><figcaption><em>Service with no preload: Worker PSS mem = ~203MB</em></figcaption></figure><p><strong>With Preload</strong></p><figure><img alt="Graph: Gunicorn forked workers with preload" src="https://cdn-images-1.medium.com/max/1024/1*xK8Eol5qNjHq5XRHV-FHqQ.png" /><figcaption><em>Gunicorn forked workers with preload</em></figcaption></figure><p>Preload is a memory optimization based on the concept of <a href="https://en.wikipedia.org/wiki/Copy-on-write">copy-on-write</a>. Essentially, the workers share the imports and application code with the leader, and only modified pages are written to the worker’s memory.</p><pre>smem -a --sort=pid -k</pre><figure><img alt="Command line output: Service with preload: Worker PSS mem reduced to ~41MB!!" 
src="https://cdn-images-1.medium.com/max/1024/1*howZgiwphM8tW7Yz5P5P2Q.png" /><figcaption><em>Service with preload: Worker PSS mem reduced to ~41MB!!</em></figcaption></figure><p><strong>So how does preload play a role in the USR2 signal killing the process?</strong></p><p>If you remember, we registered the signal handler during the app initialization by calling <strong>register_handlers().</strong></p><pre># app/__init__.py<br><br>MemoryProfiler().register_handlers()  <br><br><br># mem_profiler.py<br><br>class MemoryProfiler:<br><br>    def register_handlers(self) -&gt; None:<br>        # Register gunicorn worker to listen to USR2 to dump traces<br>        signal.signal(signal.SIGUSR2, self.handle_signal)</pre><p>Since the app had preload=True, the app initialization, and with it the USR2 handler registration, ran only once, in the leader process, before the workers were forked. The workers never registered the handler themselves, and since the default action for SIGUSR2 is to terminate the process, any <strong>kill -USR2</strong> sent to a worker actually killed it!</p><h3>Let’s start the tracing again (with no preload)!</h3><p>Now that we had figured out that preload caused the process to be killed, we turned off the preload option and started the tracing again.</p><pre>ps aux</pre><figure><img alt="Command line output: Initial process list before sending USR2 signal" src="https://cdn-images-1.medium.com/max/1024/1*WWqAGDxKxuOaC7xd9CpZ3A.png" /><figcaption><em>Initial process list before sending USR2 signal</em></figcaption></figure><pre>kill -USR2 12</pre><figure><img alt="Command line output: Successful USR2 signal not killing the gunicorn worker" src="https://cdn-images-1.medium.com/max/1024/1*LtPhlX5-R_IWxkIovsQR5w.png" /><figcaption><em>Successful USR2 signal not killing the gunicorn worker</em></figcaption></figure><p>The worker does not get killed!</p><p>We created a script which iterates through all the K8s pods, sends a USR2 signal to all the workers to start the tracing, and resends the signal to stop the tracing after a certain time interval. 
The traces had a lot of false positives, since a dump includes allocations that are not necessarily the source of the leak but simply have not been garbage collected yet.</p><h3>Root causing</h3><p>After sifting through hundreds of memory dump traces, the most interesting (and common) one was the following:</p><figure><img alt="Stack trace dump from memory profiler" src="https://cdn-images-1.medium.com/max/1024/1*mFiA_yzHXa2mzw463nnO5Q.png" /><figcaption><em>Stack trace dump from memory profiler</em></figcaption></figure><p>If you remember our initial conclusion from the following graph, we knew that the increase in timeouts had something to do with pynamodb and gevent/greenlets, since we saw thread joins taking a long time:</p><figure><img alt="Graph: Our initial observation of gevent thread join takes 30 secs" src="https://cdn-images-1.medium.com/max/1024/1*rOsCnNVg66HEXNy1HqGCIg.png" /><figcaption>Our initial observation of <em>gevent thread join takes 30 secs</em></figcaption></figure><p>The stack trace, combined with the graph above, narrowed down the issue to pynamodb/botocore. After digging online, we found the following <a href="https://github.com/urllib3/urllib3/issues/3061">issue with urllib3 v1.26.16</a>. Essentially, in a highly concurrent environment using <strong>gevent</strong>, connections were not being returned to the pool, which caused the pool to sit at its max size and block further requests. This particular stack trace confirmed our suspicion:</p><figure><img alt="Stack trace showing botocore/urllib3/connectionpool" src="https://cdn-images-1.medium.com/max/1024/1*wcV7ev9Zs45RSLesKlQsfw.png" /><figcaption>Stack trace showing botocore/urllib3/connectionpool</figcaption></figure><p>The root cause of the issue was an incompatibility between <a href="https://docs.python.org/3/library/weakref.html"><strong>weakref.finalize</strong></a> and gevent’s monkey patching, which caused a non-deterministic deadlock and made the issue hard to reproduce. 
The immediate fix was to downgrade the urllib3 version to <strong>1.26.15</strong>, after which the timeouts and the memory leak were gone!! <strong>The actual fix, which ensures urllib3 connection pooling is cooperative, was released in April 2025, and we have seen no issues upgrading both gevent to </strong><a href="https://github.com/gevent/gevent/issues/1769"><strong>v25.4.1</strong></a><strong> and urllib3 to 1.26.16+.</strong></p><p>It is still unclear, though, why the Python version upgrade exposed the issue. <strong>In fact, a urllib3 upgrade was not part of the dependency upgrades we had done to prepare for the Python 3.10 upgrade! </strong>We had actually been running Python 3.8 with urllib3 v1.26.16 for about a year without any problem. Ironically, we had upgraded to v1.26.16 specifically because it <a href="https://pypi.org/project/urllib3/1.26.16/">logged</a> the total connections whenever connection pools were full.</p><h3>Tips</h3><ol><li>If you run into memory leaks which are affecting the live production system, you can use gunicorn’s <a href="https://docs.gunicorn.org/en/stable/settings.html#max-requests"><strong>max-requests</strong></a> setting, which recycles each worker process after it has handled N requests. This keeps your process or container from running out of memory (OOM). While this helps mitigate the issue, it is critical to continue investigating the source of the memory leak.</li><li>Gevent’s <a href="https://www.gevent.org/monitoring.html#memory-usage">monitoring thread</a> has an option to print a trace for greenlets which exceed a certain memory threshold. While I have personally never tried this, it could help find objects which are holding large amounts of memory, but not necessarily the source of a leak.</li></ol><h3><strong>Closing Notes</strong></h3><p><strong>There is no silver bullet for debugging memory leaks</strong>; they are simply hard to debug. There are a few things you can look for, e.g. 
unbounded global caches, unreleased resources tied to database/network pooling, recently upgraded libraries, etc. If you check the actual gevent/urllib3 <a href="https://github.com/urllib3/urllib3/issues/3061">issue</a>, none of the reports mention memory leaks, only timeouts. We just happened to run into a memory leak and tried to find its root cause 😀</p><p>Lyft is hiring! If you’re passionate about efficient database connection management, visit <a href="https://www.lyft.com/careers">Lyft Careers</a> to see our openings.</p><hr><p><a href="https://eng.lyft.com/from-python3-8-to-python3-10-our-journey-through-a-memory-leak-1fd9b43cc01e">From Python3.8 to Python3.10: Our Journey Through a Memory Leak</a> was originally published in <a href="https://eng.lyft.com">Lyft Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[LyftLearn Evolution: Rethinking ML Platform Architecture]]></title>
            <link>https://eng.lyft.com/lyftlearn-evolution-rethinking-ml-platform-architecture-547de6c950e1?source=rss----25cd379abb8---4</link>
            <guid isPermaLink="false">https://medium.com/p/547de6c950e1</guid>
            <category><![CDATA[distributed-systems]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[aws]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[Yaroslav Yatsiuk]]></dc:creator>
            <pubDate>Tue, 18 Nov 2025 18:16:05 GMT</pubDate>
            <atom:updated>2025-11-18T18:16:04.544Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-r1saxCbUiFJf47gTbTyXg.png" /></figure><p><em>Written by </em><a href="https://www.linkedin.com/in/yaroslav-yatsiuk-945931160/"><em>Yaroslav Yatsiuk</em></a></p><p>At Lyft, machine learning (ML) is the engine behind our most critical business functions — from dispatch and pricing optimization to fraud detection and support automation. Our ML infrastructure serves thousands of production models making hundreds of millions of real-time predictions per day, supported by thousands of daily training jobs that keep ML models fresh and accurate.</p><p>As our scale grew, we faced a classic engineering challenge: the very complexity that powered our platform was becoming a bottleneck to its future growth. We needed to answer a fundamental question: How could we evolve our platform to accelerate innovation for our users while simplifying its underlying architecture?</p><p>This post explores how we rethought LyftLearn’s architecture to solve this problem. We’ll walk through our transition from a fully Kubernetes-based system to a hybrid platform, combining the simplicity of managed compute on AWS SageMaker for offline workloads with the flexibility of Kubernetes for online model serving. Afterwards, we’ll share the key technical decisions and trade-offs that made this evolution possible.</p><h3>LyftLearn Overview</h3><p>LyftLearn is Lyft’s end-to-end machine learning platform, managing the complete ML lifecycle from model development to production serving. Built to support hundreds of data scientists and ML engineers, it handles the full spectrum of ML workloads at scale. 
The platform is composed of three integrated products:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*SEJzxSPCjsnhPd0q5U2Zvg.png" /><figcaption>Figure 1: LyftLearn Components</figcaption></figure><p><strong>LyftLearn Compute (Offline Stack)</strong> handles model development and training workloads. ML Modelers use JupyterLab environments to prototype models, then run training jobs, batch processing, and hyperparameter optimization at scale. These workloads are elastic and on-demand — they spin up when needed, process large datasets, and terminate when complete.</p><p><strong>LyftLearn Serving (Online Stack)</strong> powers production inference, serving millions of predictions per minute with millisecond latency. It provides online model serving with real-time ML capabilities, automated deployment and promotion workflows, and online validation to ensure model quality before production traffic.</p><p><strong>LyftLearn Observability</strong> monitors model health and detects degradation across the platform. It tracks performance drift, identifies anomalies, scores model health, and monitors model activity to ensure production models maintain quality as data and business conditions evolve.</p><p>While all three components work together to provide a unified ML platform, the offline and online stacks have fundamentally different operational characteristics. Offline workloads need elastic, cost-efficient compute that scales to zero between jobs. Online model serving requires always-on infrastructure with strict latency guarantees and tight operational control. 
These differences led us to adopt different infrastructure strategies for each — and it’s the evolution of our offline stack that transformed how we deliver LyftLearn Compute today.</p><h3>The Original Architecture</h3><p>The original offline stack (LyftLearn Compute) ran entirely on Kubernetes — every training job, batch prediction, hyperparameter optimization run, and JupyterLab notebook environment executed as a Kubernetes workload, orchestrated through a collection of custom-built services. We documented this architecture in detail in our 2021 blog post,<a href="https://eng.lyft.com/lyftlearn-ml-model-training-infrastructure-built-on-kubernetes-aef8218842bb"> LyftLearn: ML Model Training Infrastructure built on Kubernetes</a>.</p><p>The following diagram shows a high-level view of the LyftLearn Compute 1.0 architecture:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*a-_qAccCYkUbkOj4mfMj9w.png" /><figcaption>Figure 2: LyftLearn Compute 1.0 high-level architecture</figcaption></figure><p>To understand the operational complexity, let’s look at some of the key components and how they worked together:</p><p><strong>LyftLearn Service </strong>served as the backend API, receiving requests from three primary sources: the LyftLearn UI for ad-hoc jobs, Airflow DAGs for scheduled training and batch prediction pipelines, and CI/CD pipelines that registered models along with their Docker images during deployments. It managed model configurations, job metadata, and coordinated with downstream services.</p><p><strong>K8s Orchestration Service</strong> translated job requests into Kubernetes resources. 
When LyftLearn Service called it to create a training job, it would:</p><ul><li>insert the job record into the LyftLearn DB (so watchers could track it)</li><li>construct the Kubernetes Job specification with containers, resource requests, environment variables, sidecars, references to Docker images in AWS Elastic Container Registry (ECR), and other K8s resources</li><li>submit the job to the Kubernetes cluster</li></ul><p><strong>Background Watchers</strong> ran continuously to manage the job lifecycle and infrastructure. We maintained multiple worker scripts handling different responsibilities:</p><ul><li>job status watcher (monitoring job state transitions and timing)</li><li>container status watcher (tracking individual container states)</li><li>ingress status watcher (managing notebook endpoint URLs)</li><li>job cleanup watcher (removing completed jobs from Kubernetes)</li><li>analytics event watcher (capturing usage events)</li><li>additional scripts for EFS cleanup, spending tracking, and stats publishing</li></ul><p>Creating any job meant assembling a complete set of Kubernetes resources: Pod specifications with init and sidecar containers for secrets and metrics, <em>ConfigMaps</em> for hyperparameters, <em>Secrets</em> for credentials, <em>PersistentVolumeClaims</em> for notebook storage, <em>Services</em> and <em>Ingresses</em> for network access, and role-based access control (RBAC) policies (<em>ServiceAccounts, Roles, RoleBindings</em>) for cluster permissions. 
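</p><p>To give a feel for how much had to be assembled per job, here is a hedged sketch of a training Job manifest built as a plain Python dictionary. Only the overall batch/v1 Job shape is standard Kubernetes. The function name, image names, labels, service account, and mount path are hypothetical and are not taken from the LyftLearn codebase.</p><pre>def build_training_job(job_id, image, cpu, memory, hyperparameter_configmap):
    """Return a batch/v1 Job manifest as a plain dict (all names are hypothetical)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": "train-" + job_id, "labels": {"job-id": job_id}},
        "spec": {
            "backoffLimit": 0,  # retries are left to the platform, not to Kubernetes
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "serviceAccountName": "training-job",  # bound to RBAC roles
                    "initContainers": [
                        # fetches secrets before the trainer starts
                        {"name": "fetch-secrets", "image": "example/secrets-init:latest"}
                    ],
                    "containers": [
                        {
                            "name": "trainer",
                            "image": image,
                            "resources": {"requests": {"cpu": str(cpu), "memory": memory}},
                            "volumeMounts": [
                                {"name": "hyperparameters", "mountPath": "/etc/hyperparameters"}
                            ],
                        },
                        # metrics sidecar running next to the trainer
                        {"name": "metrics-sidecar", "image": "example/statsd-sidecar:latest"},
                    ],
                    "volumes": [
                        {
                            "name": "hyperparameters",
                            "configMap": {"name": hyperparameter_configmap},
                        }
                    ],
                }
            },
        },
    }</pre><p>Even this simplified manifest leaves out the Secrets, PersistentVolumeClaims, Services, Ingresses, and RBAC objects that had to be created alongside it. 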
In essence, we owned the entire operational lifecycle — from scheduling and retries to cleanup and low-level resource management.</p><h4>What Worked Well</h4><p>The Kubernetes-based architecture successfully powered production ML workloads for years and delivered some real technical advantages, including:</p><p><strong>Unified Infrastructure Stack</strong> <br>ML workloads ran on the same Kubernetes infrastructure as Lyft’s production services, using the same networking stack, observability tooling, security patterns, and operational processes. This meant the platform team leveraged existing infrastructure expertise and tooling rather than maintaining separate systems for ML workloads.</p><p><strong>Fast Job Startup<br></strong>Jobs could launch as fast as 30–45 seconds on existing K8s cluster infrastructure. Unlike on-demand compute provisioning which requires waiting for instances to start and initialize, jobs scheduled immediately onto available nodes with cached images, making the approach particularly effective for frequently running training jobs and batch processing workflows.</p><p><strong>Flexible Resource Specifications</strong> <br>Engineers could request any CPU/memory combination their workload needed. Memory-intensive preprocessing jobs could request 16 CPUs with 512GB RAM, while CPU-intensive training jobs used 64 CPUs with 128GB RAM. These ratios didn’t map cleanly to fixed AWS instance types, so this flexibility allowed precise resource allocation based on workload needs.</p><p>This architecture served hundreds of engineers running thousands of daily jobs that powered business-critical ML workflows. 
However, as Lyft’s scale grew, so did the operational complexity of managing such a system.</p><h4>Challenges of a Growing Platform</h4><p>We identified several key challenges that were consuming an increasing amount of our focus:</p><p><strong>The Feature Tax <br></strong>Every new capability we added to the platform, from distributed hyperparameter optimization using Katib/Vizier to distributed training with Kubeflow operators, required building, deploying, and maintaining a corresponding set of custom Kubernetes orchestration logic. While this approach gave us maximum control, it also meant that a significant portion of our development cycle was dedicated to building and managing the infrastructure for each new feature, rather than the feature itself.</p><p><strong>Managing State in a Distributed System<br></strong>To keep our platform’s database synchronized with the cluster state, we relied on background watcher scripts that continuously monitored Kubernetes events for job status changes, container updates, and ingress resource availability.</p><p>The eventually-consistent nature of Kubernetes created operational complexity. Training containers could succeed while Kubernetes marked jobs as failed due to sidecar issues. Event streams would timeout or arrive out of order. Container statuses could transition between states as different watchers processed conflicting events. We developed sophisticated synchronization checks and logic to handle these cases, but managing state consistency for thousands of daily jobs required considerable on-call attention and directly impacted our development velocity.</p><p><strong>Kubernetes Cluster Management<br></strong>A persistent challenge in managing a large-scale ML compute platform is optimizing resource utilization for heterogeneous workloads. ML jobs often have distinct phases with conflicting resource profiles: data processing tends to be memory-intensive, while model training is often CPU- or GPU-intensive. 
This created a complex optimization puzzle, making it challenging to maximize node utilization across the cluster.</p><p>As the platform grew, we also had to proactively manage resource contention during bursts of highly parallel workloads. Ensuring that the cluster autoscaler could provision capacity quickly enough to prevent job queuing for critical workflows required careful planning and continuous management.</p><p>The pattern was clear: as the platform scaled, so did the operational investment required to manage its low-level infrastructure. To continue innovating for our users, we needed to abstract away this underlying complexity and refocus our efforts on what mattered most: building new platform capabilities, optimizing ML workflows, and accelerating the entire ML development lifecycle<em>.</em></p><h3>The Journey to LyftLearn 2.0</h3><p>The growing operational complexity of our Kubernetes stack was limiting our development velocity. This reality pushed us to explore how we could simplify operations while delivering more powerful capabilities to our users. We began evaluating managed solutions to abstract this infrastructure complexity, which led us to a deep evaluation of <a href="https://aws.amazon.com/sagemaker/">AWS SageMaker</a>.</p><p>We evaluated SageMaker across both our <strong>online</strong> (LyftLearn Serving) and <strong>offline</strong> (LyftLearn Compute) stacks.</p><p><strong>For LyftLearn Serving</strong>, adopting SageMaker would have required a fundamental re-architecture of our core workflows. Our model deployment, promotion, and serving solutions were deeply integrated with Lyft’s internal tooling. Observability relied on our standard monitoring infrastructure, not on AWS CloudWatch. 
Client services communicated via <a href="https://www.envoyproxy.io/">Envoy</a>, not via SageMaker’s specific invocation and authentication patterns.</p><p>Our analysis confirmed that the existing Kubernetes-based stack was exceptionally reliable and efficient, performing well within our latency requirements. We determined the right path forward was to retain our existing, battle-tested model serving infrastructure.</p><p><strong>For LyftLearn Compute</strong>, the evaluation pointed in a different direction. This was where our greatest operational complexity lived: managing eventually-consistent job states, optimizing cluster capacity for heterogeneous workloads, and building custom Kubernetes orchestration for new ML capabilities.</p><p>SageMaker’s managed infrastructure would address these challenges directly. It offered out-of-the-box support for a variety of job types, which would allow us to stop building and maintaining low-level orchestration logic. Its native state management would eliminate the need for our custom watcher system, and its elastic compute model would handle capacity automatically, removing the need for complex cluster planning and autoscaling management.</p><p>While SageMaker’s per-instance costs were higher, the Total Cost of Ownership (TCO) was clearly lower. 
By eliminating idle compute, cluster administration overhead, and the constant infrastructure firefighting, the economics of a managed service made sense.</p><p>The evaluation led to a clear strategy: adopt SageMaker for LyftLearn Compute, where we had the greatest opportunity to reduce operational complexity, and retain Kubernetes for LyftLearn Serving, where our existing solution was already highly reliable and efficient.</p><p>The diagram below provides a high-level, conceptual view of how we wanted to transform the offline stack:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1GXjfr0fXp4tcFBtapk_hQ.png" /><figcaption>Figure 3: LyftLearn Compute Evolution Plan</figcaption></figure><p><strong>On the left, the original architecture:</strong> LyftLearn Service sent job requests to a K8s Orchestration Service, which constructed Kubernetes Job specifications and submitted them to the Kubernetes API. This orchestration service was complex — it managed pod configurations, resource allocations, volumes, and all the low-level details of Kubernetes jobs. Background watchers continuously polled the Kubernetes API for events — job completions, container status changes, resource updates — and wrote those updates back to the LyftLearn database. The compute layer ran on Lyft-managed Kubernetes clusters.</p><p><strong>On the right, the new architecture: </strong>Under the hood, this is a significantly simpler solution. LyftLearn Service interacts with a lean SageMaker Manager Service that only makes AWS SDK calls — it doesn’t manage any low-level infrastructure. We replaced the fleet of problematic background watchers with a single, reliable SQS consumer that processes status updates pushed from EventBridge. The heavy lifting of orchestration and state management is delegated to AWS. 
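</p><p>The core of such a consumer can be quite small. The sketch below is an assumption-laden illustration rather than the actual LyftLearn consumer: it takes SQS message bodies that each contain one EventBridge event as JSON, handles only SageMaker training job state changes, and uses a plain dictionary as a stand-in for the jobs table. The function names and the rule for late events are hypothetical.</p><pre>import json

# Statuses after which a training job no longer changes state.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def parse_status_update(message_body):
    """Turn one SQS message body (an EventBridge event as JSON) into a job update."""
    event = json.loads(message_body)
    if event.get("source") != "aws.sagemaker":
        return None
    if event.get("detail-type") != "SageMaker Training Job State Change":
        return None
    detail = event["detail"]
    status = detail["TrainingJobStatus"]
    return {
        "job_name": detail["TrainingJobName"],
        "status": status,
        "is_terminal": status in TERMINAL_STATUSES,
        "event_time": event.get("time"),
    }


def apply_updates(message_bodies, job_table):
    """Apply updates to a dict standing in for the jobs table (hypothetical)."""
    for body in message_bodies:
        update = parse_status_update(body)
        if update is None:
            continue  # not a training job event, ignore it
        current = job_table.get(update["job_name"])
        # Assumption: a late event must never move a finished job back
        # to a non-terminal state.
        if current and current["is_terminal"] and not update["is_terminal"]:
            continue
        job_table[update["job_name"]] = update
    return job_table</pre><p>In a real service, the message bodies would come from polling the queue with the AWS SDK, and each message would be deleted only after its update had been stored.</p><p>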
The goal was simplification without losing power.</p><p>It looks simple on a diagram, but making this transition without disrupting critical ML workflows and hundreds of users was a significant engineering challenge. The following sections detail some of the most difficult technical challenges we solved to make this migration possible.</p><h4>Migration: Solving the Hard Problems</h4><p>Our core principle for the migration was to replace the execution engine — Kubernetes to SageMaker — while keeping our ML workflows completely unchanged. The actual ML code — the Python scripts that train models, process data, and run inference — had to work identically on both platforms. No modifications to model training logic, no changes to data preprocessing, no updates to inference code.</p><p>Forcing hundreds of users across dozens of teams to rewrite their business-critical ML workflows was not an option. The cost of such a disruption in terms of lost productivity and engineering effort would have made the migration untenable, which meant the burden of compatibility was entirely on our platform. The requirement of zero code changes transformed the project into a complex systems engineering challenge for the ML Platform team, but it was a necessary one. The real task wasn’t just running a container on a different platform — it was ensuring environmental parity.</p><p>During the transition, we solved numerous challenges across the stack. Here are a few of the most complex ones we solved to make this possible.</p><p><strong>Replicating the Kubernetes Runtime Environment<br></strong>Our Kubernetes environment provided automatic credential injection via webhooks, metrics collection through sidecars, and configuration management via <em>ConfigMaps</em>. SageMaker offered none of these primitives. 
We built a compatibility layer into cross-platform base Docker images to replicate this behavior:</p><ul><li><strong>Credentials</strong>: In Kubernetes, credentials from our internal secret management solution, <a href="https://lyft.github.io/confidant/">Confidant</a>, were automatically injected at pod creation. SageMaker has no equivalent mechanism. We built a custom solution, as part of the container entrypoint script, that fetches credentials at job startup and exposes them exactly as Kubernetes did, ensuring user code worked identically on both platforms.</li><li><strong>Environment Variables</strong>: SageMaker constrains the number of environment variables passed via its API. Similar to our credential solution, we moved most environment setup to runtime, fetching additional configuration at job startup.</li><li><strong>Metrics</strong>: Kubernetes workloads sent <a href="https://github.com/statsd/statsd">StatsD</a> metrics to sidecar containers. SageMaker has no sidecar support, so we reconfigured the runtime and networking to connect directly to our metrics aggregation gateway. The user-facing API remained unchanged.</li><li><strong>Hyperparameters</strong>: In Kubernetes, hyperparameters were stored in <em>ConfigMaps</em> and mounted as files. SageMaker’s API has much stricter size limits than K8s, making direct parameter passing impossible for our use cases. We developed a solution to upload hyperparameters to AWS S3 before each job and have SageMaker automatically download them to its standard input path. This overcame the API limitation while still using SageMaker’s native capabilities.</li></ul><p>These represent only a subset of the environmental differences we systematically solved across the migration.</p><p><strong>Building for the Hybrid Architecture<br></strong>We developed new SageMaker-compatible base images to replace our old LyftLearn images. 
The critical design requirement was that these images must work across our entire hybrid platform: in SageMaker (for training and batch processing) and in Kubernetes (for serving). This meant the same Docker image that trained a model would also serve it, guaranteeing consistency. These base images serve as a foundation that teams extend with their own dependencies.</p><p>We built SageMaker-compatible base images with different capabilities to match our workload diversity. Here are some of the most important ones:</p><ul><li><strong>LyftLearn image:</strong><em> </em>For traditional ML workloads</li><li><strong>LyftLearn Distributed image:</strong> Adds Spark ecosystem integration for distributed processing</li><li><strong>LyftLearn DL image:</strong> Adds GPU support and libraries for deep learning workloads</li></ul><p>The Spark-compatible images presented the biggest challenge. They needed to maintain full compatibility with our existing Spark infrastructure — custom wrappers, executor configurations, and JAR (Java Archive) dependencies. But they also had to run correctly in three distinct execution contexts: SageMaker Jobs, SageMaker Studio notebooks, and Model serving in K8s.</p><p>These images detect their execution environment at runtime and adapt. They automatically configure different environment variables, use different users and permissions, and set up Spark appropriately for each context, all while preserving an identical core runtime.</p><p><strong>Matching Kubernetes Job Launch Times<br></strong>In Kubernetes, notebooks, training, and processing jobs could start quickly because nodes were warm due to a significant percentage of cluster resources sitting idle. SageMaker provisions instances on-demand — no idle waste, but slower startup.</p><p>For JupyterLab notebooks, we adopted <em>SOCI</em> (Seekable Open Container Initiative) indexes. 
<em>SOCI</em> enables lazy loading: SageMaker fetches only the filesystem layers needed immediately rather than pulling entire multi-gigabyte images. This cut notebook startup times by 40–50%.</p><p>For training and batch processing jobs, <em>SOCI</em> wasn’t available. We optimized our Docker image sizes, which were sufficient for most of our workloads. However, this wasn’t enough for our most latency-sensitive workflows. Some models retrain every 15 minutes, making slower startup times unacceptable. For this subset of jobs, we adopted <a href="https://docs.aws.amazon.com/sagemaker/latest/dg/train-warm-pools-how-to-use.html">SageMaker’s warm pools</a>, which keep instances alive between runs.</p><p>These optimizations gave us Kubernetes-like startup times with fully serverless infrastructure.</p><p><strong>Cross-Cluster Networking for Spark<br></strong>Many of our ML Platform users rely heavily on the interactive Spark experience in JupyterLab notebooks. In Kubernetes, this was simple, as the driver and executors ran in the same cluster. The new architecture, however, required the Spark driver to run in a SageMaker Studio notebook while the executors remained on our EKS K8s cluster.</p><p>This hybrid model presented a major networking challenge, as shown in the diagram below. Spark client mode requires bidirectional communication:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0TVs7AcCQMI_t7FrJFpKrQ.png" /><figcaption>Figure 4: Spark Networking Architecture in LyftLearn 2.0</figcaption></figure><ul><li>The driver (in SageMaker) must call the EKS API Server Endpoint to request executor pods.</li><li>The executor pods must be able to establish inbound connections directly back to the driver’s SageMaker Instance Elastic Network Interface (ENI).</li></ul><p>The default SageMaker Studio networking blocked these critical inbound connections, breaking Spark’s communication model. 
This issue was a fundamental blocker that could jeopardize the entire migration. Without a solution for interactive Spark, we could not move our users to SageMaker Studio. To resolve this, we partnered closely with the AWS team. As a result of this collaboration, they introduced networking changes to the Studio Domains in our account that enabled the required inbound traffic from our EKS cluster. Despite the cross-cluster setup, Spark performance remained the same, and the interactive experience for ML Platform users was identical to the original Kubernetes environment.</p><h3>LyftLearn 2.0: The Hybrid Architecture</h3><p>As a result of this architectural transformation, we arrived at the hybrid architecture we planned: SageMaker for LyftLearn Compute and Kubernetes for LyftLearn Serving.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FDqyuwxPjGjobOPxWkfMCQ.png" /><figcaption><em>Figure 5: Complete LyftLearn 2.0 High-Level Architecture</em></figcaption></figure><p>As the diagram illustrates, the two systems are fully decoupled, each operating as a purpose-built stack:</p><p><strong>LyftLearn Serving</strong> runs on Kubernetes, powering a distributed architecture for real-time inference. Dozens of ML teams deploy their own model serving services — each containing their team’s models with custom prediction handlers and configurations — handling production predictions for specific use cases (pricing, fraud, dispatch, ETA, etc.). The Model Registry Service coordinates model deployments across these services. 
(We detailed this serving architecture in our 2023 blog post: <a href="https://eng.lyft.com/powering-millions-of-real-time-decisions-with-lyftlearn-serving-9bb1f73318dc">Powering Millions of Real-Time Decisions with LyftLearn Serving</a>.)</p><p><strong>LyftLearn Compute</strong> runs on SageMaker, where the SageMaker Manager Service orchestrates training, batch processing, Hyperparameter Optimization (HPO), and JupyterLab notebooks through AWS SDK calls. EventBridge and SQS provide event-driven state management, replacing our background watchers.</p><p>Integration happens through the Model Registry and S3. Training jobs in SageMaker generate model binaries and save them to S3. The Model Registry tracks these artifacts, and model serving services pull them for deployment. Docker images flow from CI/CD through ECR to both platforms. The LyftLearn database maintains job metadata and model configurations across both stacks.</p><p>Each LyftLearn product operates independently while maintaining seamless end-to-end ML workflows.</p><h3>Putting It All Together</h3><p>We rolled out changes repository by repository, running both infrastructures in parallel. Our approach was systematic: build a comprehensive compatibility layer that made SageMaker feel like Kubernetes to ML code, validate each workflow type thoroughly, then migrate teams incrementally. Each repository required minimal changes — typically updating configuration files and workflow APIs — while the actual ML code remained untouched.</p><p>For our users, the migration was nearly invisible. But behind the scenes, the operational improvements were substantial. We reduced ML training and batch processing compute costs by eliminating idle cluster resources and moving to on-demand provisioning. System reliability improved significantly, with infrastructure-related incidents becoming rare occurrences. 
Most importantly, this stability and the serverless nature of the new compute freed our team to focus on building platform capabilities rather than managing low-level infrastructure components.</p><h4>Key Lessons</h4><p><strong>Build versus buy is a pragmatic decision, not an ideology</strong> <br>We adopted SageMaker for training because managing custom batch compute infrastructure was consuming engineering capacity better spent on ML platform capabilities. We kept our serving infrastructure custom-built because it delivered the cost efficiency and control we needed. The decision wasn’t about preferring managed services or custom infrastructure — it was about choosing the right tool for each specific workload.</p><p><strong>Abstract complexity from users</strong>. <br>The migration succeeded because we absorbed all the complexity. Users didn’t rewrite ML code or learn SageMaker APIs — they continued their work while we handled secrets management, networking, metrics collection, and environmental parity. The platform’s job is to evolve infrastructure while preserving velocity and avoiding disruptions, not to distribute migration work across hundreds of teams.</p><p><strong>Invest in compatibility layers<br></strong> The cross-platform base images were the foundation of the migration’s success. They enabled gradual, repository-by-repository migration with easy rollbacks. Most importantly, they guaranteed that the same Docker image for model training in SageMaker would serve it in Kubernetes, eliminating train-serve inconsistencies. The upfront investment in cross-platform compatibility paid dividends throughout the migration.</p><blockquote>The best platform engineering isn’t about the technology stack you run — it’s about the complexity you hide and the velocity you unlock.</blockquote><h4>Acknowledgment</h4><p>This platform evolution was a massive team effort. 
Special thanks to <a href="https://ua.linkedin.com/in/vladyermakov"><strong>Vlad Yermakov</strong></a><strong>, </strong><a href="https://ua.linkedin.com/in/herman-khivrenko-ab488618b"><strong>Herman Khivrenko</strong></a><strong>, </strong><a href="https://ca.linkedin.com/in/nimanasiri"><strong>Nima Nasiri</strong></a><strong> and </strong><a href="https://www.linkedin.com/in/andyrosales"><strong>Andy Rosales-Elias</strong></a><strong> </strong>for their exceptional work making it a success.</p><p>We’re also grateful to <a href="https://www.linkedin.com/in/rajesh-bagwe-1995762/"><strong>Raj Bagwe</strong></a> and <a href="https://www.linkedin.com/in/vikramsawant/"><strong>Vikram Sawant</strong></a>, our partners from AWS, for their invaluable support on this initiative.</p><p><strong><em>Lyft is hiring!</em></strong><em> If you’re passionate about building AI/ML platforms and applications at scale, visit </em><a href="https://www.lyft.com/careers"><em>Lyft Careers</em></a><em> to see our openings.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=547de6c950e1" width="1" height="1" alt=""><hr><p><a href="https://eng.lyft.com/lyftlearn-evolution-rethinking-ml-platform-architecture-547de6c950e1">LyftLearn Evolution: Rethinking ML Platform Architecture</a> was originally published in <a href="https://eng.lyft.com">Lyft Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[My Starter Project on the Lyft Rider Data Science Team]]></title>
            <link>https://eng.lyft.com/my-starter-project-on-the-lyft-rider-data-science-team-86a60dddd935?source=rss----25cd379abb8---4</link>
            <guid isPermaLink="false">https://medium.com/p/86a60dddd935</guid>
            <category><![CDATA[causal-inference]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[rideshare]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[artifical-intellegence]]></category>
            <dc:creator><![CDATA[Jacob Nogas]]></dc:creator>
            <pubDate>Tue, 07 Oct 2025 14:41:38 GMT</pubDate>
            <atom:updated>2025-10-07T14:41:34.115Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/450/1*XmS2ckVGZJ6GDLfnTwp3UQ.png" /><figcaption>Credit to Brian Wu for creating the illustrations in this post.</figcaption></figure><p>I joined Lyft in January 2024 as a Data Scientist (Decisions) on the Rider Science Core Experience team. My journey at Lyft began with a starter project, which focused on using the Rider Experience Score (RES) tool to measure the long-term effects of various rider experiences at Lyft.</p><p>In this blog post, I will discuss my experience at Lyft as a new hire, focusing on this starter project.</p><h3>What is RES?</h3><h4>Motivation</h4><p>At Lyft, we aim to deliver seamless and reliable experiences for our riders. To continuously improve the platform, it’s important to understand which experiences most impact our riders, and how those experiences influence their decision to continue using Lyft over time (rider retention).</p><p>For example, imagine a hypothetical scenario where a rider experiences a lower-than-normal ETA (the estimated time from request to pickup). A lower ETA is potentially a better experience for riders, which motivates a product change that decreases ETA. To justify introducing this product change, we first want to quantify how low ETA impacts long-term rider retention. Using an A/B test for this may be problematic: an A/B test typically runs for 2–6 weeks, which may not be long enough to accurately measure long-term effects. In other cases, an A/B test may not be possible at all, such as when a new feature is rolled out to all users.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*RjN8moizEc9uQPwhu40Mug.png" /><figcaption>Figure 1: Illustration of total rides per customer vs. time. Group 1 had a positive experience (such as low ETA), whereas Group 2 did not. 
Δ shows the difference in long-term total rides between Groups 1 and 2. This image is adapted from the presentation Customers Obsessed Experimentation and Metrics by Ricky Chachra (December 14, 2023).</figcaption></figure><p>RES is a tool for estimating how various user experiences (low ETA, early driver arrival, etc.) impact long-term rides taken, which is the Δ in Figure 1, without requiring an A/B test.</p><h4>Challenges in Estimating the Effects of User Experiences</h4><p>To explain RES, I will first discuss the challenges in estimating how user experiences impact rider retention.</p><p>The essence of this problem is estimating the causal effect on rider retention of exposing riders to a particular experience. For simplicity, we define the rider retention effect as the impact on the number of rides taken in the next 28 days. The true causal effect of a rider experiencing low ETA would be obtained by observing the rides taken in the next 28 days after a rider session in which low ETA is encountered, and comparing that to the rides taken in the next 28 days after the exact same rider session, except that low ETA is not encountered (the counterfactual). But it’s impossible to observe counterfactuals: for a given session, we observe either the outcome with low ETA or the outcome without it, never both (the Fundamental Problem of Causal Inference).</p><p>A simple way to work around the unobservable counterfactual might be to take historical observational data and compare the average retention of riders who experienced low ETA with that of riders who didn’t. This is the difference-in-means estimator of the average treatment effect (ATE).</p><p>However, this naive approach can lead us astray. Let’s see what happens when we apply this method. 
Figure 2 shows hypothetical results (not real data) that illustrate the problem we might encounter.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/450/1*rshk2JsslTUutmNRp_tvOg.png" /><figcaption>Figure 2: Plotting low ETA vs. # of rides in the next 28 days for all regions. Riders left of the dashed line didn’t see low ETA; riders to the right did.</figcaption></figure><p>Based on the observations in Figure 2, we would conclude that low ETA actually has a negative effect on rider retention, which must be incorrect. To understand how we arrived at this incorrect conclusion, we segment our data into two regions, City A and City B, and again indicate on the x-axis whether or not the rider experienced low ETA (riders left of the dashed line didn’t see low ETA; riders to the right did). The results are shown in Figure 3.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/450/1*XmS2ckVGZJ6GDLfnTwp3UQ.png" /><figcaption>Figure 3: Plotting low ETA vs. # of rides in the next 28 days with regions highlighted.</figcaption></figure><p>We now see that within City A and within City B, low ETA shows a positive trend, as expected. We also see that City B riders are much less likely to be in the “treatment” group (right of the dashed line, low ETA), and that City B has higher baseline rides in the next 28 days. Our low ETA and non-low ETA groups are therefore not comparable (selection bias), which results in an incorrect estimate of the causal effect of low ETA. The difference-in-means estimator works well for randomized experiments, but is biased for non-randomized cases; in the above example, region is correlated with both the treatment (low ETA) and the outcome (# of rides in the next 28 days), which makes region a confounder. This is an example of an issue that can arise when estimating causal effects from observational data.</p><p>The gold standard approach for estimating causal effects is randomized experimentation (A/B test). 
With randomization, we would end up with roughly the same mix of City A and City B riders in the control and treatment groups, thus mitigating the bias discussed above. But a limitation of an A/B test is that we can typically only run it for a short, fixed period of time, which doesn’t allow us to measure long-term effects on rider retention. We can thus benefit from methodology that mitigates the bias in causal effects estimated from observational data.</p><h4>RES Methodology</h4><p>The RES tool employs causal inference methodology to mitigate the bias in causal effect measurements obtained from observational data.</p><p>There are various methods to control for confounding variables. Propensity score methods model the relationship between confounders X and treatments W (e.g., with an ML model), outcome methods model the relationship between confounders X and the outcome Y, and double ML methods model both the relationship between X and Y and the relationship between X and W.</p><p>RES employs the Augmented Inverse Propensity Score Weighting estimator (AIPW; more info on AIPW can be found <a href="https://www.law.berkeley.edu/files/AIPW(1).pdf">here</a>), which is an example of a double ML method.</p><p>At a high level, AIPW consists of estimating potential outcome functions of treatment and control experiences (Direct Method), as well as adjusting estimation bias with propensity-weighted residuals. AIPW has the desirable theoretical properties of Neyman orthogonality (robustness to ML estimation errors) and double robustness (the estimate remains consistent if either the outcome models or the propensity model is correctly specified). 
We also apply a <a href="https://academic.oup.com/ectj/article-abstract/21/1/C1/5056401?redirectedFrom=fulltext">cross-fitting</a> procedure (sample splitting) to get Double-ML properties (unbiased estimation).</p><p>More precisely, we compute the treatment effect as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Mu1m_p4C_ofzN0ywAIVW5w.png" /></figure><p>In order to compute the treatment effect, we train three XGBoost or LightGBM models:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/490/1*bGEjtKvZLmL5YodYR3Oy6A.png" /></figure><p>For every observation i, using the above models (and cross-fitting), we compute the predictions <em>μ₁*(xᵢ)</em>, <em>μ₀*(xᵢ)</em>, and <em>e*(xᵢ)</em>. Finally, we plug <em>μ₁*(xᵢ)</em>, <em>μ₀*(xᵢ)</em>, and <em>e*(xᵢ)</em> into the equation for AIPW to get an estimate of the ATE for low ETA. Though the relationships between outcomes, treatment, and confounders in the above low ETA example are relatively simple, using machine learning models for <em>e*(xᵢ)</em> and <em>μ*(xᵢ)</em> allows us to model complex non-linear relationships which may arise in other settings.</p><p>In the example above, AIPW helps us avoid drawing the wrong conclusion in two ways:</p><ol><li>By modeling the relationship between region and 28-day rides (<em>μ*(xᵢ)</em>), we are able to isolate the effect of region on our outcome Y, and capture that City B riders tend to take more rides in 28 days than City A riders.</li><li>By modeling the relationship between region and low vs. normal ETA (<em>e*(xᵢ)</em>), we capture that City B riders are less likely to experience low ETA than City A riders. AIPW reweights observations inversely to the estimated probability of the treatment they actually received, effectively upweighting points in the bottom-left and top-right of Figure 3. This reweighting reveals the underlying upward slope, allowing us to conclude that low ETA has a positive impact on 28-day rides.</li></ol><p>The AIPW estimator combines both 1. 
and 2., which gives favorable statistical properties, such as treatment effect estimates that are insensitive to errors in the models <em>e*(xᵢ)</em> and <em>μ*(xᵢ)</em>.</p><h3>What I did as a new hire</h3><p>The existing RES estimates were due for an update, and more rider experiences needed to be added to the RES pipeline. I was tasked with this refresh.</p><p>I started by gaining more familiarity with AIPW. Reading through the internal causal inference lecture series at Lyft was extremely helpful. I also found Stefan Wager’s STATS 361 <a href="https://web.stanford.edu/~swager/stats361.pdf">course notes</a> very useful for learning about AIPW.</p><p>Next, I spent time learning how to use <a href="https://eng.lyft.com/lyftlearn-ml-model-training-infrastructure-built-on-kubernetes-aef8218842bb">LyftLearn</a>, Lyft’s internal computing platform for Big Data and Machine Learning. I then familiarized myself with the RES codebase and looked for opportunities to improve its reliability and efficiency. I identified several aspects which could be improved, and submitted a pull request with the corresponding changes. For example, I fixed a subtle issue which prevented model diagnostics from completing in certain cases.</p><p>With the RES pipeline unblocked, I set out to compute estimates of the long-term effects of various Lyft user experiences. I began by identifying the most important experiences to add to the RES pipeline.</p><p>I reached out to other teams to see which experiences would be most important to their work. Examples of experiences that were identified are “Improved Match Time Prediction” and “Improved ETA Reliability”.</p><p>After gathering a list of experiences, I prioritized and selected 23 based on discussions with my manager about their potential impact. 
This process provided great insight into the experiences that matter most to our riders, and where we could have the greatest positive impact on retention. I also refreshed estimates for the pre-existing reliability experiences (with a newly added High Value Mode subgroup analysis).</p><p>A major challenge I faced in computing estimates for my chosen experiences was selecting confounders. Internal RES documentation provides excellent guidance on selecting confounders, but significant trial and error was still required. For example, I computed RES estimates for the Prime Time experience, where Prime Time refers to a multiplier on ride price during high-demand periods, and I had included Neighborhood Supply as a confounder. The ROC AUC of our trained model was suspiciously high, which we realized was partly due to Neighborhood Supply being a leaky confounder for Prime Time. A leaky confounder is a covariate which contains information that is concurrent with or subsequent to the experience; this is an issue, as it can cause some of the treatment effect to be attributed to the confounder, leading to biased estimates.</p><p>For each of the 23 experiences I worked on, I had to make sure to include key confounders while avoiding problematic ones, which was very time-consuming. But this process provided a great opportunity to learn about the various data sources that exist at Lyft, and to reflect on how key covariates may be causally related to Lyft rider experiences.</p><h3>Conclusion</h3><p>Though I faced many challenges in my starter project, I had excellent support from my colleagues at Lyft, and was able to successfully generate causal estimates for the 23 user experiences identified.</p><p>Navigating the various challenges I encountered was a very intellectually stimulating and fulfilling experience. 
The insights from RES play a crucial role in helping teams focus on the experiences that matter most to our riders; contributing to a workstream with such a high impact on improving rider experience has been very rewarding.</p><p>Getting started at Lyft has been an enriching journey. I’ve had the opportunity to apply real-world causal inference techniques and collaborate with amazing colleagues. I’m excited to continue contributing to impactful projects, and to see what the future holds.</p><p><em>Lyft is hiring! If you’re passionate about Data Science, visit </em><a href="https://www.lyft.com/careers"><em>Lyft Careers</em></a><em> to see our openings.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=86a60dddd935" width="1" height="1" alt=""><hr><p><a href="https://eng.lyft.com/my-starter-project-on-the-lyft-rider-data-science-team-86a60dddd935">My Starter Project on the Lyft Rider Data Science Team</a> was originally published in <a href="https://eng.lyft.com">Lyft Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Migrating Lyft’s Android Codebase to Kotlin]]></title>
            <link>https://eng.lyft.com/migrating-lyfts-android-codebase-to-kotlin-53b231dfecb5?source=rss----25cd379abb8---4</link>
            <guid isPermaLink="false">https://medium.com/p/53b231dfecb5</guid>
            <category><![CDATA[android]]></category>
            <category><![CDATA[kotlin]]></category>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[lyft]]></category>
            <dc:creator><![CDATA[Oleksii Chyrkov]]></dc:creator>
            <pubDate>Tue, 09 Sep 2025 20:34:03 GMT</pubDate>
            <atom:updated>2025-09-09T13:48:32.011Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9HXuWJOvxM-TqitBdgZ6oQ.jpeg" /></figure><p><strong>Introduction</strong></p><p>Lyft started adopting Kotlin in our Android codebase in 2018. Fast forward 7 years, and we are finally done! The Lyft Rider, Driver and Lyft Urban Solutions apps are now fully Kotlin-based.</p><p>I joined Lyft in 2022, so this post will describe the efforts undertaken after that.</p><p>Our motivation included several points:</p><ul><li>Kotlin code is more concise than Java, often dramatically so. In some cases, 10 lines of Java could be turned into a 1-liner in Kotlin.</li><li>We get compile speed benefits by using the newest K2 compiler.</li><li>All the new UIs in Lyft are written using Compose, the modern declarative UI framework which has become the industry standard. All the existing UIs will eventually be migrated to Compose. That also means Kotlin was the only option, as Compose only supports Kotlin.</li><li>We started adopting Coroutines, the structured concurrency framework which greatly simplifies writing asynchronous code. Coroutines are a Kotlin-only feature (language-level support plus the first-party kotlinx.coroutines library), so that was an extra argument to adopt Kotlin faster.</li><li>Our codebase is huge, so we often run automated migrations; as long as Java remained, the migration scripts had to be adapted for Java as well.</li><li>Working entirely in Kotlin is a big plus for engineers considering Lyft.</li></ul><p><strong>Pre-migration</strong></p><p>The first thing to do when undertaking a project this large is to know where we stand. To achieve this, Lyft has an internal tool called Migration Tracker, which tracks all of the migrations in both Android and iOS codebases. 
Examples of migrations are:</p><ul><li>Migrating from RxJava to Coroutines</li><li>Switching from our old UI approach to the new declarative one</li><li>Eliminating uses of Java in favor of Kotlin</li></ul><p>A daily cron job runs the Migration Tracker and updates an internal website with graphs that help us ensure we meet the migration deadlines.</p><p>As of Feb 24, 2025, the Kotlin migration was 85% complete. That means we still needed to migrate about 1,000 files scattered across 20+ teams and 150+ Bazel modules.</p><p><strong>Migration</strong></p><p>Fellow developer Oleksii Zaiats built a tool which greatly sped up the migration process: the Migration Script. It leveraged Android Studio IDE Scripting, a powerful but rarely used tool that was a perfect fit for this kind of task.</p><p>Put simply, the script flow was as follows:</p><ul><li>For a given team, it found all modules owned by the team.</li><li>For each module, it ran the automatic migration mechanism on each Java file.</li><li>It automatically fixed some common imperfections of Android Studio’s built-in Java to Kotlin converter.</li><li>After all the files in the module were migrated to Kotlin, it created a git commit with all the changes on an appropriately named git branch.</li><li>The owning team was notified of the changes and reviews were requested.</li></ul><p>This simple approach was not without its flaws, but it gave us a real productivity boost and allowed us to migrate a couple of modules per day.</p><p><strong>Challenges and Caveats</strong></p><p>The major pain point in the migration process was that the automatic migration tool was not perfect:</p><ul><li>In many cases, it uses nullable types where non-nullable types would do, resulting in code like <strong>Observable&lt;Optional&lt;List&lt;Ride?&gt;?&gt;?&gt;?</strong></li><li>It is unnecessarily verbose, using explicit types everywhere.</li><li>It is not smart enough to convert an explicit for loop into a Kotlin 
one-liner using <strong>map</strong> or <strong>filter</strong>.</li><li>It cannot automatically use <strong>lateinit var</strong>, which is often needed when writing View-based UIs.</li></ul><p>In addition, the legacy code itself presented complications. We had some very old code implementing a hand-written <strong>INullable</strong> interface, which was similar to <strong>Optional</strong> but not quite the same. The semantics of <strong>INullable</strong> required us to do extensive code review each time we touched one of these classes.</p><p>Last but not least, we once encountered a class which absolutely had to be written in Java! It implemented an interface with a signature like this:</p><pre>public void onTouchEvent(@NonNull Float x, @NonNull Float y)</pre><p>However, on some devices, the API contract was broken, so <strong>x</strong> and <strong>y</strong> could in fact be null! In Java this was harmless, but in Kotlin it resulted in a crash, because Kotlin generates runtime null checks for parameters declared non-null. Thankfully, we have since rewritten the entire screen using another approach that no longer relies on this interface.</p><p><strong>Post-migration</strong></p><p>After we were finally done, we needed to ensure that no one would accidentally add a Java file to our codebase. To achieve this, we added a Lint check, integrated into our CI system, which checks every pull request and explicitly prohibits Java code.</p><p><strong>Conclusion</strong></p><p>After concluding this years-long effort, Lyft developers can skip the hassle of Java-Kotlin interop and concentrate on solving the problem that actually matters: providing our users with the world’s best transportation!</p><p><em>Lyft is hiring! 
If you’re passionate about working on a Kotlin-only app, visit </em><a href="https://www.lyft.com/careers"><em>Lyft Careers</em></a><em> to see our openings.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=53b231dfecb5" width="1" height="1" alt=""><hr><p><a href="https://eng.lyft.com/migrating-lyfts-android-codebase-to-kotlin-53b231dfecb5">Migrating Lyft’s Android Codebase to Kotlin</a> was originally published in <a href="https://eng.lyft.com">Lyft Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>