Experimentation in a Ridesharing Marketplace

Nicholas Chamandy
Lyft Engineering
Dec 15, 2016


Part 3 of 3: Bias and Variance

This is the final installment in a three-part story on strategies for A/B testing in Lyft’s unique two-sided marketplace. In Part 1 we introduced interference bias, explained why experimentation is non-trivial when so-called network effects are present, and qualitatively explored bias-variance tradeoffs. Part 2 described the Lyft Data Science team’s simulation infrastructure, an indispensable tool in understanding market dynamics and sanity-testing new algorithms. If you have already read those posts — welcome back! If not, and you find yourself wondering “What is the point of all this?”, we suggest going back to give them a quick read.

We have two main goals with this post. The first is to use the simulation methodology from Part 2 to quantify the phenomena outlined in Part 1. Second, we will borrow arguments from causal inference to motivate a two-stage experiment design and some ideas for how to construct interference-free treatment effect estimators.

Quantifying the Bias-Variance Tradeoff

We simulated a 4-week experiment using historical demand and supply in one of Lyft’s medium-sized markets. Specifically, we applied the simple experimental manipulation described in Part 1: we subsidized Prime Time pricing (PT) for all passengers in the treatment group. Such a treatment is known to have strong network effects, as subsidizing PT for user A can mean that user B is less likely to have a driver nearby when she opens the Lyft app. This happens because during times of undersupply, users who see a cheaper price will tend to snap up the available drivers.

First we ran our simulator over the entire period both with and without the PT subsidy, in order to take measurements in the global treatment and global control configurations. This gives us a ground truth comparison for each metric. Next, we simulated experiments under the following designs: (a) alternating time intervals (one hour), (b) randomized coarse and fine spatial units (geohashes at granularities 5 and 6, respectively), (c) randomized user sessions. See Figure 1 for an illustration of the two different spatial randomizations in an arbitrarily chosen Lyft market.
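
Concretely, each design is just a different rule for mapping a session to a treatment bucket. The short R sketch below illustrates the idea; the sessions data frame and its columns (ts, geohash5, geohash6) are hypothetical placeholders, not our actual schema.

```
# Illustrative sketch of the randomization schemes compared in this study.
set.seed(42)

# (a) Alternating one-hour intervals: even hours -> treatment, odd -> control.
sessions$z_hourly <- as.integer(format(sessions$ts, "%H")) %% 2

# (b) Spatial units: one independent coin flip per geohash cell.
#     (In the study below, cells were re-randomized every hour; this sketch
#      randomizes each cell once to keep things short.)
flip_cells <- function(cells) setNames(rbinom(length(cells), 1, 0.5), cells)
gh5 <- flip_cells(unique(as.character(sessions$geohash5)))
gh6 <- flip_cells(unique(as.character(sessions$geohash6)))
sessions$z_geohash5 <- gh5[as.character(sessions$geohash5)]
sessions$z_geohash6 <- gh6[as.character(sessions$geohash6)]

# (c) User sessions: independent coin flip per session.
sessions$z_session <- rbinom(nrow(sessions), 1, 0.5)
```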

Figure 1. Spatial randomization at the geohash-5 (top) and geohash-6 (bottom) levels, for a randomly chosen Lyft market (Denver, CO). In this realization of the designs, pink cells might be given the hypothetical treatment while remaining cells would get the control.

We focus on three key metrics for this simulation study. The first is availability, defined as the proportion of user app opens for which there is an available Lyft driver within some context-dependent radius.

The second metric is the average (across user sessions) estimated time of arrival (ETA) of the nearest available driver. ETA is one measure of the quality of Lyft’s service levels.

Third, we consider the number of completed Lyft rides, normalized by group size. Rides is our most important top-line business metric, but in a simulation setting is somewhat dependent on our models of passenger and driver behavior.

In all cases we measure percent changes in the metric in the treatment group relative to the control group. When we compare the global Prime Time subsidy simulation to the global control simulation, we measure a decrease in availability, an increase in ETAs, and an increase in total rides. These changes are in line with the example in Part 1, and reflect the fact that passengers are more likely to request a ride when PT is subsidized, potentially worsening undersupply situations.
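
As a rough illustration of how these comparisons are computed (with hypothetical column names, not our actual logging schema), the metric comparison boils down to something like the following.

```
# Illustrative metric computation from a simulated session log.
# Assumed columns of `sessions`: group ("treatment"/"control"),
# driver_within_radius (0/1 at app open), nearest_eta (seconds), rides (count).
metrics <- function(df) c(
  availability      = mean(df$driver_within_radius),       # share of app opens with a nearby driver
  eta               = mean(df$nearest_eta, na.rm = TRUE),   # avg ETA of nearest available driver
  rides_per_session = sum(df$rides) / nrow(df)              # completed rides, normalized by group size
)

trt <- metrics(sessions[sessions$group == "treatment", ])
ctl <- metrics(sessions[sessions$group == "control", ])

# Percent change of treatment relative to control, as reported in Figure 2.
pct_change <- 100 * (trt - ctl) / ctl
```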

Figure 2 illustrates the claim made in Part 1 that different randomization schemes inhabit a spectrum of bias-variance tradeoffs. The hourly scheme is almost unbiased for all metrics, but can have large variance, e.g. for the ETA metric. This is because hour-to-hour variability in Lyft supply and demand is substantial, having components which are non-cyclical at the week level — including weather, traffic, special events and Lyft promotions. Broadly speaking, bias increases and variance decreases as the experimental units become finer.

Figure 2. Bias, standard deviation and root-mean-squared error (all on % change scale) for treatment effect estimators of three key metrics, based on four randomization schemes: alternating hourly time intervals; coarse (geohash-5) and fine (geohash-6) spatial units; random user-sessions. (Y-axis is obscured for confidentiality reasons.) Spatial units were re-randomized every hour. Bias was computed with reference to the ground truth treatment effect obtained by comparing the simulated global treatment to the simulated global control. Dashed lines show the absolute value of the ground truth treatment effect for each metric, to give a sense for the relative magnitude of errors. The base of all bars is at zero.

Interestingly, a different randomization scheme minimizes root-mean-squared error (RMSE) for each metric. However, even having the smallest RMSE is not sufficient. For example, if we blindly applied a random-session design, we would wrongly conclude that subsidizing Prime Time has no effect on driver availability — with a tight confidence interval! The session design is most susceptible to network interference bias, since two users standing right next to each other, and thus sharing an identical pool of candidate drivers and Lyft Line matches, may well end up in different treatment groups.

The hourly experiment did well at estimating the change in rides resulting from a Prime Time subsidy. However, because the alternating hour design has only two possible random treatment allocations (corresponding to the assignment of the first hour), a generic and accurate variance estimator does not exist to our knowledge. This is a major flaw in the alternating interval methodology. Moreover, that design failed badly on the task of estimating the change in ETA, a metric that is more sensitive to hourly fluctuations. In a sense, alternating time interval experiments perform well if the effect size is very large or we just happen to get lucky with temporal fluctuations — something that’s difficult if not impossible to know in practice. Spatial designs represent a nice compromise. But in this case they may have led to a bad product decision, by systematically underestimating the negative impact of the PT subsidy on Lyft service levels.

Interference bias is clearly a problem when it comes to estimating the network-wide changes in our metrics of interest. And naive attempts to design our way out of this problem can indeed eliminate the bias, but bring in unwanted variance. The remainder of this post presents a useful way of decomposing and understanding experimental effects contaminated by interference. First, let’s take a step back and ask a pretty fundamental question: What treatment effect do we wish to estimate?

Direct and Indirect Causal Effects

There are three important types of treatment effects in the context of interference: direct, indirect, and total effects (see this excellent article by Hudgens and Halloran). In a nutshell, the total effect is what we care about. It’s a sum of

  • direct effects: how does a unit’s treatment assignment affect its outcome?
  • indirect effects: how do the treatment assignments of other units affect a unit’s outcome?

An experiment assignment mechanism is a rule for randomizing the population of units to either treatment or control. A direct causal effect is a function of one such rule, while an indirect effect is a function of a pair of rules. For simplicity, we assume simple random sampling and identify the assignment mechanism with the proportion of units that will be randomly selected for the treatment. (In general, assignment can be a much more complicated random function.) A unit here could mean different things: user, user session, geohash, etc. — more on that later.

To be more precise, we need some mathematical notation. Feel free to skip ahead if that’s not your thing. Let Y_i (Z; p) denote the response of unit i exposed to Z (= 0 for control or 1 for treatment) assuming that a proportion p of the N units are assigned to the treatment. Recall that these responses are potential outcomes, as discussed in Part 1 of this post. They are not all observable in the same experiment — a unit i only gets one condition assigned in reality. But they are nonetheless well-defined. We have:

  • Direct causal effect of the assignment mechanism p:
    D(p) = (1/N) Σ_i [ Y_i(1; p) − Y_i(0; p) ]
  • Indirect causal effects of the assignment mechanism p vs. mechanism q (there are two such effects, corresponding to units in the control or treatment):
    I_0(p, q) = (1/N) Σ_i [ Y_i(0; p) − Y_i(0; q) ]   and   I_1(p, q) = (1/N) Σ_i [ Y_i(1; p) − Y_i(1; q) ]
  • Total causal effect of the assignment mechanism p vs. mechanism q:
    T(p, q) = (1/N) Σ_i [ Y_i(1; p) − Y_i(0; q) ]

The total effect can be decomposed as a sum of direct and indirect effects in two different ways:

T(p, q) = D(p) + I_0(p, q) = D(q) + I_1(p, q)

Note that when the assignment mechanism is held fixed (p = q), the total effect is the same as the direct effect, because both indirect effects are zero. Although we work with absolute-scale causal effects for simplicity, relative-scale effects can be defined analogously.

Now let’s go back to our subsidized PT example. The direct and indirect effect functions may be interesting in their own right. But as discussed in Part 1, the treatment effect that’s most critical is the difference between our usual PT algorithm applied throughout the network, and subsidizing every user’s PT. In the above notation, this is just T(1, 0). But the decomposition of T(1, 0) into direct and indirect effects doesn’t quite make sense, because neither Y(1; 0) nor Y(0; 1) is well-defined! Nevertheless, we can make a continuity assumption and assert the existence of a limiting total treatment effect

T(1, 0) = lim_{p → 1} [ D(p) + I_0(p, 0) ] = lim_{q → 0} [ D(q) + I_1(1, q) ],

in which every term on the right-hand side involves only well-defined potential outcomes.

This decomposition provides some pretty intriguing estimands — functions and scalars that we’d like to estimate. But how do we do it?

Two-Stage Randomization

One idea is to use a 2-stage experiment design motivated by the above formulas. In the first stage of randomization, we can divide the units into J independent, non-interfering groups. Say group j contains N_j units. Next we can draw a random sequence of group treatment proportions p_j for j = 1, …, J from some distribution on the unit interval. At the second stage, we can randomly assign p_j N_j units in group j to the treatment (and the rest to the control) for each j. In the Prime Time example, we might choose to use disjoint time intervals as groups, since they are approximately non-interfering. One could also make an argument for very coarse spatial groups. In defining within-group units we have a number of options, including fine spatial cells, users or user sessions.
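
Here is a minimal R sketch of the two-stage assignment under those choices (hourly groups, session-level units, uniform first-stage proportions); the sessions data frame and its hour column are illustrative placeholders.

```
# Two-stage randomization sketch: hourly groups, user sessions as units.
set.seed(7)
hours <- unique(sessions$hour)
J <- length(hours)

# Stage 1: draw a treatment proportion p_j for each group (Uniform(0, 1) here).
p_j <- runif(J)

# Stage 2: within group j, assign roughly p_j * N_j units to treatment.
assign_group <- function(n, p) {
  z <- c(rep(1L, round(p * n)), rep(0L, n - round(p * n)))
  z[sample.int(n)]  # random permutation of the assignment vector
}

sessions$z <- NA_integer_
for (k in seq_along(hours)) {
  idx <- which(sessions$hour == hours[k])
  sessions$z[idx] <- assign_group(length(idx), p_j[k])
}
```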

We simulated a two-stage design in the same region and four-week period as above, using hourly groups, user sessions as units, and a uniform distribution over treatment proportions. This design allows us to ‘observe’ a large number of different treatment assignment mechanisms acting on the network. From that spectrum of assignments, we can build up some fairly simple effect estimates.

Direct and Indirect Estimators

The most obvious way to estimate our decomposed causal effects in this 2-stage setting is by first discretizing the p’s into some number of bins. Let z and y denote, respectively, the realized binary treatment assignment and observed response for a given unit in a given group.

To be concrete, let ȳ_1(p) = Σ_{j: p_j ∈ bin(p)} Σ_i z_ij y_ij / Σ_{j: p_j ∈ bin(p)} Σ_i z_ij denote the pooled mean response of treated units over all groups whose treatment proportion falls in the bin containing p, and let ȳ_0(p) denote the analogous mean over control units. We can estimate the direct treatment effect at treatment proportion p as

D̂(p) = ȳ_1(p) − ȳ_0(p)

and the indirect treatment effects of p vs. q as

Î_0(p, q) = ȳ_0(p) − ȳ_0(q)   and   Î_1(p, q) = ȳ_1(p) − ȳ_1(q).

(It will often make sense to replace these sums with weighted sums, for example when the units are of different sizes.)

This gives us a vector of direct effect estimates and two matrices of indirect effect estimates. In practice these may be quite noisy, but we can smooth them. Smoothing should be done with care, as we wish to extrapolate the estimates to p, q in {0, 1}. In our simulation study we’ve obtained smooth estimates of the direct effect curve and the indirect effect surfaces using R’s loess function. See Figures 3 and 4 for visualizations of our estimates of causal effects on the rides metric. The raw effects were indeed quite noisy for these data, as can be seen in the plots.
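
As a rough sketch of that pipeline (illustrative code, not our production analysis), suppose d is a per-unit data frame from the two-stage simulation with hypothetical columns y (the unit’s response), z (its realized assignment) and p (its group’s realized treatment proportion).

```
# Bin the group treatment proportions, form the binned effect estimates,
# then smooth with loess.
breaks <- seq(0, 1, by = 0.1)
mids   <- head(breaks, -1) + 0.05                              # bin midpoints
d$p_bin <- cut(d$p, breaks = breaks, include.lowest = TRUE)    # 10 fixed bins

y1 <- tapply(d$y[d$z == 1], d$p_bin[d$z == 1], mean)   # pooled treated mean per bin
y0 <- tapply(d$y[d$z == 0], d$p_bin[d$z == 0], mean)   # pooled control mean per bin

direct  <- data.frame(p = mids, D = as.numeric(y1 - y0))       # D-hat(p)
grid    <- expand.grid(p = mids, q = mids)
grid$I0 <- as.vector(outer(y0, y0, "-"))                       # I0-hat(p, q)
grid$I1 <- as.vector(outer(y1, y1, "-"))                       # I1-hat(p, q)

# Local regression fits; surface = "direct" lets predict() extrapolate
# toward p, q in {0, 1}.
fit_D  <- loess(D ~ p, data = direct, degree = 1,
                control = loess.control(surface = "direct"))
fit_I0 <- loess(I0 ~ p * q, data = grid, degree = 1,
                control = loess.control(surface = "direct"))
fit_I1 <- loess(I1 ~ p * q, data = grid, degree = 1,
                control = loess.control(surface = "direct"))
```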

Figure 3. Direct treatment effect of subsidized PT as a function of treatment proportion (bins of size 0.1) for the rides metric. Bin-level effect estimates shown as points (x-coord at bin midpoint); smooth function estimate (from loess) in pink; dashed horizontal line at zero. Y-axis is again obscured for confidentiality.
Figure 4. Indirect treatment effects for the control (I_0(p, q); top row) and treatment (I_1(p, q); bottom row) of subsidized PT for the rides metric (with bins of size 0.1; p is along the vertical axis and q along the horizontal axis). Raw estimates are shown in the leftmost column, fitted values from the loess fit in the middle column, and the full prediction surface on the unit square is shown in the rightmost column. The colormap is pink at high values and purple at low values. By definition, indirect effects are zero along p = q.

Let a tilde (~) denote a smoothed functional estimate. We can estimate the global treatment effect of interest by plugging the smoothed estimates into either decomposition and extrapolating to the endpoints, for example

T̃(1, 0) = D̃(1) + Ĩ_0(1, 0),

or equivalently D̃(0) + Ĩ_1(1, 0) (or an average of the two).

Alternatively, we could directly smooth the 2-d total effects matrix (not shown here).
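
Continuing the same sketch, the plug-in estimate just evaluates the smoothed pieces at the extrapolated corners (shown here via the first decomposition; the second, or an average of the two, works the same way).

```
# Plug-in estimate of the global effect, T(1, 0) = D(1) + I0(1, 0).
D_1    <- predict(fit_D,  newdata = data.frame(p = 1))
I0_1_0 <- predict(fit_I0, newdata = data.frame(p = 1, q = 0))
T_hat  <- D_1 + I0_1_0
```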

In our simulation study, this procedure led to essentially unbiased estimates for the rides and ETA metrics (about as unbiased as the alternating hour design). It did not perform as well for the availability metric, however. This may be partly explained by the smaller baseline effect size for that metric — availability is very close to 100% in this market. Given its binary nature, availability may also vary less smoothly as a function of p.

A Promising Direction

The idea outlined above is nice for three main reasons:

  • It has the potential to reduce or eliminate interference bias.
  • It allows us to estimate a continuum of direct and indirect effects, which is useful in problems where the global treatment effect is not the only thing of interest. For example, Lyft may want to provide coupons to a subset of passengers, and understand the effect on system-wide ETAs as a function of subset size.
  • Because 2-stage randomization allocates fine units (within coarse units), there are natural resampling-based variance estimators for these statistics (a simple group-level bootstrap is sketched below).
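
As an illustration of that last point, one could bootstrap the first-stage groups (hours, in our simulation) and rerun the estimation on each resample; estimate_T below is a hypothetical wrapper around the binning, smoothing and extrapolation steps sketched earlier.

```
# Group-level bootstrap: resample hours with replacement, re-estimate T(1, 0).
bootstrap_T <- function(d, B = 200) {
  hours <- unique(d$hour)
  replicate(B, {
    resampled <- sample(hours, length(hours), replace = TRUE)
    d_b <- do.call(rbind, lapply(resampled, function(h) d[d$hour == h, ]))
    estimate_T(d_b)   # hypothetical wrapper around the steps above
  })
}
# sd(bootstrap_T(d)) then serves as a standard error for the total-effect estimate.
```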

Disaggregating the total treatment effect into direct and indirect components is a useful way to think about the network interference problem. However, it is not a silver bullet, and there are some caveats worth mentioning. The 2-stage estimator did not manage to decrease variance in the above example — in fact, the variance was higher than for the alternating hour design for all three metrics. This is probably not a surprise after looking at the raw estimates in Figures 3 and 4: fitting a smooth curve to noisy points is itself a noisy process, especially when extrapolating to the edges of the data.

Nevertheless, there are some obvious improvements that can help reduce variance, including

  • Sampling treatment proportions from a more U-shaped distribution (see the short sketch after this list)
  • Imposing negative correlation between treatment proportions in adjacent groups
  • Stratification or pairing of groups and units within groups
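
The first of these, for instance, is a one-line change to the first-stage draw; Beta(0.5, 0.5) below is just one illustrative U-shaped choice.

```
# Draw first-stage proportions from a U-shaped Beta instead of Uniform(0, 1),
# concentrating groups near p = 0 and p = 1 where we need to extrapolate.
p_j <- rbeta(J, shape1 = 0.5, shape2 = 0.5)
```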

The Lyft Data Science team is excitedly exploring these and other directions in our quest for the perfect experiment!

In this three-part journey we encountered a nonstandard data science challenge stemming from Lyft’s unique two-sided market dynamics. In order to shed light on it, we leveraged two very different technologies: scalable, high-fidelity simulation, and causal inference. In the process we illustrated two things that Lyft’s Data Science team members do on a regular basis: build cool machinery and think deeply about data!

If you’re interested in developing experimentation methodology, or in building the underlying algorithms that we pit head-to-head in these A/B tests, then join us — Lyft Data Science is hiring! Send me an email at chamandy@lyft.com.
