Interactions in fraud experiments: A case study in multivariable testing

Hao Yi Ong
Lyft Engineering
Oct 5, 2017 · 9 min read

A while ago, we observed something curious when we ran a set of simultaneous A/B tests around multiple antifraud features. The tests aimed to improve our passengers’ ride payment experience and our ability to collect fares to pay our drivers. The features centered on the temporary authorization hold we use to determine whether a passenger has enough money for a Lyft ride.

In general, we apply an authorization hold (or auth) to ride requests that get flagged by our system of business rules and machine learning models. While the auth decreases our fraud exposure, it increases passenger churn due to side effects such as banks not promptly releasing the credit held on the payment method. In rare cases, auth releases can take up to a week, which understandably frustrates our users.

The goal of these tests was to minimize fraud loss without sacrificing good user experience. Suppose that prior to these tests Lyft would auth users at most once per week. In one of these tests, we auth all system-flagged users if they have not been auth’ed within the past day. This was a user-split A/B test that we’ll henceforth refer to as the 24HoursTrust test. (Note: The information in this paragraph is not entirely true — the actual product and experiment variables were different and more nuanced. Did you think we’d simply show our hand? 😎)

Intuitively, because of the higher user friction from auths, we expected a relative increase in user churn in 24HoursTrust’s treatment group compared to its control group, where we auth users at most once every week. Surprisingly, we saw a statistically significant decrease in user churn rate in the test.

This got us scratching our heads for a while. What happened?

(Spoiler: After figuring out what the issue was, we revamped the experimental design and learned that what we intuited was right — user churn did increase with increased auth frequency.)

Overview

  • Introducing payment fraud at Lyft
  • Fine-tuning the auth experience
  • Designing bad auth experiments
  • Designing good auth experiments

The cause of the surprising result above may already be obvious to the careful reader. But let’s take a step back and get some context around this problem. Ultimately, we’d like to understand the cause of this issue and how to design these experiments better.

Introducing payment fraud at Lyft

Like many consumer-facing online services, Lyft faces the risk of fraudsters who use stolen credit cards to pay for rides. Often, these bad credit cards don’t have enough money and result in failed ride transactions. Lyft protects our drivers from passengers who take rides but can’t pay, but this protection also means that we can incur significant losses if we don’t adequately defend ourselves.

As a defense against fraudsters who don’t have the money to pay for the rides they request, Lyft may contact the user’s bank (e.g., credit card issuer) to place a temporary authorization hold and confirm the payment method. Unless the ride is completed, the authorization hold (or auth) is never actually charged, but it may show up as a pending transaction on the bank statement. The side effect of auth’ing the user is, unfortunately, the inconvenience of not being able to use the credit held by the bank while the transaction is pending. Additionally, passengers can misinterpret auths as actual charges because of their banks’ confusing interfaces and become frustrated with what seem to be Lyft charges.

Fine-tuning the auth experience

There are many aspects of the auth that Lyft actively tweaks to fine-tune the user experience. One is simply how often we auth a user: if we recently determined that a passenger has sufficient funds for rides, we shouldn’t need to challenge their creditworthiness again so soon.

The above set of tests challenges the assumption that one week is how long we can trust an auth’ed user after a successful check. On the one hand, not challenging our passengers as frequently reduces overall user friction. On the other hand, clever fraudsters who notice this business logic can take a cheap ride followed by an expensive one that results in fraud loss. The trick is finding the sweet spot to operate at.
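To make the trust-period knob concrete, here is a minimal sketch of the kind of gating logic described above, assuming a simple “has this user been successfully auth’ed within the trust window?” check. The function name, arguments, and default window are illustrative, not Lyft’s actual implementation.

```python
from datetime import datetime, timedelta

# Minimal sketch of the trust-period gate described above. The function name,
# arguments, and default window are illustrative, not Lyft's implementation.
def should_auth(last_successful_auth_at, now, trust_period=timedelta(weeks=1)):
    """Auth a flagged request only if the user isn't inside the trust window."""
    if last_successful_auth_at is None:
        return True
    return now - last_successful_auth_at > trust_period

# A user auth'ed two days ago is challenged again under a 24-hour trust
# period, but not under the one-week default.
now = datetime(2017, 10, 5, 12, 0)
last_auth = now - timedelta(days=2)
print(should_auth(last_auth, now))                                   # False
print(should_auth(last_auth, now, trust_period=timedelta(days=1)))   # True
```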

Designing bad auth experiments

So why did we observe the counterintuitive result of decreased user churn despite increased user friction?

To see why, we’ll have to consider the combined impact of the different tests we ran concurrently. As before, we have the 24HoursTrust test, which auths all flagged users if they have not been auth’ed within the past day. On top of that, we have the 1HourSubsetTrust user-split test, which auths a subset of the same users flagged in the 24HoursTrust test, except that it auths them if they have not been auth’ed within the past hour at ride request. (Not too important, but the subset considered in 1HourSubsetTrust is just the “most suspicious users” among the 24HoursTrust test’s subjects.)

Notice the two key differences here:

  1. 1HourSubsetTrust’s one hour versus 24HoursTrust’s one day post-auth “trust period.”
  2. 1HourSubsetTrust considers a subset of system-flagged users but 24HoursTrust considers all system-flagged users.

Crucially, this setup means that the flagged users in the 1HourSubsetTrust test formed a sizable, strict subset of the flagged users in the 24HoursTrust test. In fact, when we checked, it was more than half of all system-flagged users.

The overlap meant that the average user experienced auths more frequently in the 24HoursTrust control group.

At this point, the key insight reveals itself: Many users in the control group of the 24HoursTrust test were simultaneously users in the treatment group of the 1HourSubsetTrust test. Because of this peculiarity, the overlapping 24HoursTrust control users could be auth’ed up to once an hour. As a result, they ended up experiencing more instances of pending auth charges and greater user friction than those in the 24HoursTrust treatment group, where users cannot experience more than a single auth in a contiguous 24-hour period.

Counterintuitively, users in the 24HoursTrust control group see greater friction than the same test’s treatment group.
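A quick back-of-the-envelope simulation makes the confound easy to see. The population sizes and 50/50 splits below are made up; the point is only that independent bucketing leaves a large fraction of the 24HoursTrust control group simultaneously exposed to the 1HourSubsetTrust treatment.

```python
import random

random.seed(7)

# Hypothetical illustration of how independent per-test bucketing confounds
# the two concurrent tests. Population sizes and 50/50 splits are made up.
flagged = set(range(10_000))                                   # 24HoursTrust subjects
most_suspicious = set(random.sample(sorted(flagged), 6_000))   # 1HourSubsetTrust subjects

in_24h_treatment = {u for u in flagged if random.random() < 0.5}
in_1h_treatment = {u for u in most_suspicious if random.random() < 0.5}

control_24h = flagged - in_24h_treatment
confounded = control_24h & in_1h_treatment

# Roughly 30% of the "control" group is simultaneously receiving the more
# aggressive 1-hour policy, so it is anything but untouched.
print(len(confounded) / len(control_24h))
```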

The significant overlap in user assignments between two interacting product factors escaped us when we planned the experiments. We simply rarely encounter tests that interact strongly! And while this is no excuse, contemporary literature would also have us believe that A/B tests rarely coincide in practice [1, 2].

Conducting a post-mortem

Normally, our oversight wouldn’t pose a big problem: we can simply consider all combinations of test treatment/control groups separately and evaluate how each performs relative to the others. In our case, however, we had carefully sized the experiment to guarantee a level of statistical power and confidence. Sizing for statistical power and confidence translates to predetermining the experiment length and user bucket proportions to ensure some minimum probabilities of correctly rejecting and retaining the null hypothesis, respectively. That is, correctly rejecting or retaining the hypothesis that a treatment or a combination of treatments will not improve our metrics.
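For concreteness, here is a minimal sizing sketch using a two-proportion z-test under the normal approximation. The 10% baseline churn rate and one-point minimum detectable effect are made-up numbers, and this is not the internal tooling we actually use.

```python
from scipy.stats import norm

# Minimal sizing sketch: two-proportion z-test, normal approximation.
# The 10% baseline churn and 1-point detectable effect are made-up numbers.
def samples_per_variant(p_control, p_treatment, alpha=0.05, power=0.8):
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for two-sided significance
    z_beta = norm.ppf(power)            # critical value for the desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return (z_alpha + z_beta) ** 2 * variance / (p_control - p_treatment) ** 2

print(round(samples_per_variant(0.10, 0.11)))   # ~14,700 users per variant
```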

For the hypothetical pair of A/B tests above, we could have had enough samples for a somewhat lower but reasonable power and confidence. In our actual, full set of tests, however, there were a lot more interacting variants than the four described above. In fact, there were so many interacting features in our tests that the total number of treatment variant combinations was an order of magnitude higher than just the number of treatment variants. The sample sizes in each group were therefore far from sufficient to make any causal inferences. To overuse an over-quoted quote,

To consult the [data scientist] after an experiment is finished is often merely to ask [her] to conduct a post mortem examination. [She] can perhaps say what the experiment died of.

— paraphrasing R. A. Fisher, 1938.

Unfortunately, our flawed experiment design resulted in data that was useless.

Designing good auth experiments

A good set of experiments must capture the strong interactions between experimental factors and the influence of pairing various factors together. It is thus infeasible to simply run concurrent tests (as we did) or sequential tests. It’s clear from our semi-fictitious example above that the former didn’t work for us. In the latter case, there’s no guarantee that sequentially running tests and greedily choosing the best treatment variant will lead to the best combination of experimental factors.
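A toy example with made-up numbers shows how a greedy sequence of tests can miss the best combination when two factors interact strongly:

```python
# Toy example with made-up numbers: greedy sequential testing misses the best
# combination when the factors interact. Cells are some net-value metric
# (higher is better) for (trust period, auth amount) combinations.
net_value = {
    ("1w", "low"):   100,  # status quo
    ("1w", "high"):  110,
    ("24h", "low"):  112,
    ("24h", "high"):  90,  # strong negative interaction
}

# Greedy: test auth amount first with trust fixed at 1w -> "high" wins...
best_amount = max(["low", "high"], key=lambda a: net_value[("1w", a)])
# ...then test trust period with the amount frozen at the greedy pick -> "1w" wins.
best_trust = max(["1w", "24h"], key=lambda t: net_value[(t, best_amount)])
print(best_trust, best_amount, net_value[(best_trust, best_amount)])  # 1w high 110

# A full factorial MVT evaluates every combination and finds the true optimum.
print(max(net_value, key=net_value.get), max(net_value.values()))     # ('24h', 'low') 112
```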

In our revamped auth experiments, we finally settled on a sequential set of multivariable tests (MVTs). MVTs allow us to include more than a single factor and estimate interaction between them. In our MVTs, we tested different combinations of post-auth trust periods (as above), auth amounts, and so forth to improve the passenger’s experience.

And while we could simply have run one giant MVT, we had to split it into a non-overlapping sequence of MVTs because of experiment sizing requirements. Recall that because samples arrive at a fixed rate, the number of samples in each variant decreases as the number of variants increases. This implies that we must run an MVT longer than we would a simple A/B test.
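A rough back-of-the-envelope, with illustrative numbers (the arrival rate is made up and the per-variant requirement roughly matches the sizing sketch above), shows how quickly run time grows with the number of variants:

```python
import math

# Back-of-the-envelope with illustrative numbers: at a fixed arrival rate of
# eligible users, run time grows linearly with the number of variants.
users_per_day = 5_000          # flagged users entering the experiment daily (made up)
needed_per_variant = 15_000    # roughly the figure from the sizing sketch above

def days_to_run(n_variants):
    return math.ceil(n_variants * needed_per_variant / users_per_day)

print(days_to_run(2))    # simple A/B test: 6 days
print(days_to_run(12))   # e.g., a 3 x 4 full factorial MVT: 36 days
```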

A more nuanced problem that arises in analyzing MVTs is that of multiple comparisons. In a traditional A/B test, we need only compare the treatment variant with the control variant for an effect. In MVTs, however, we are not only comparing each treatment with the control, but also each treatment against one another to find the best combination of factors.

To see this, consider 10 treatment/control variants’ 95% confidence intervals (bounds that include the true population parameter of interest with 95% probability). If we wanted to pick the top treatment variant, there are 45 pairwise comparisons being made, and the expected number of intervals that fail to cover their true difference is 45 × 0.05 = 2.25. That is, we expect more than two of the relative performance metrics between variants to be wrong.

Here’s another way to think about it: If the intervals are statistically independent from each other, the probability that at least one interval does not contain the population parameter is 90.1%. Note that the intervals here refer to the 95% confidence intervals for the differences in the treatment/control variants’ impact. In other words, we’re more than 90% sure that at least one of the pairwise comparisons wouldn’t yield a correct result, which can lead to an incorrect decision about shipping the best treatment variant.
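The arithmetic behind those numbers is easy to reproduce:

```python
from math import comb

# Reproducing the arithmetic above for 10 variants at a 95% confidence level.
variants, alpha = 10, 0.05
pairs = comb(variants, 2)                        # 45 pairwise comparisons
expected_misses = pairs * alpha                  # 2.25 intervals expected to miss
p_at_least_one_miss = 1 - (1 - alpha) ** pairs   # ~0.901, assuming independence

print(pairs, expected_misses, round(p_at_least_one_miss, 3))
```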

It’s thus crucial to apply something like the Bonferroni correction to counteract this problem. Specifically, the Bonferroni correction compensates for the increased likelihood of making a wrong decision by testing each individual hypothesis at a significance level of alpha divided by the number of hypotheses, where alpha is the desired overall significance level. Here, each individual comparison’s confidence interval would need 100% − (5/45)% ≈ 99.9% coverage, which means we need even more samples to shrink the confidence intervals enough to make reasonable decisions.
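Here is a small sketch of the correction and of its cost in sample size; the roughly 2.8x inflation factor assumes a normal approximation with everything else held fixed.

```python
from scipy.stats import norm

# Bonferroni: test each of the 45 comparisons at alpha / m, so each interval
# needs ~99.9% coverage. The wider critical value translates into more samples
# (sample size scales with the square of the z multiplier, all else equal).
alpha, m = 0.05, 45
alpha_adj = alpha / m                    # ~0.0011 per comparison
coverage = 1 - alpha_adj                 # ~0.9989, i.e., ~99.9%
z_plain = norm.ppf(1 - alpha / 2)        # ~1.96
z_bonf = norm.ppf(1 - alpha_adj / 2)     # ~3.26
inflation = (z_bonf / z_plain) ** 2      # ~2.8x more samples per variant

print(round(coverage, 4), round(z_bonf, 2), round(inflation, 2))
```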

The sequence of MVTs we designed balanced the ability to test many factors simultaneously against maintaining a minimum sample size and hence statistical power and confidence in our results. Factors that we had strong reason to believe would interact were grouped together and tested within the same MVT. Specifically, we ran full factorial MVTs (all combinations of factor levels) because the interactions were important. In the end, we were able to reconcile our revamped experimental results with our intuition about how the auth experience improves with different features.
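As a sketch, enumerating one full factorial MVT in the sequence is just a cross product of the grouped factor levels; the factor names and levels below are made up.

```python
from itertools import product

# Sketch of one full factorial MVT in the sequence: every combination of the
# grouped factor levels becomes a variant. Factor names and levels are made up.
factors = {
    "trust_period": ["1w", "24h", "1h"],
    "auth_amount": ["low", "high"],
}

variants = [dict(zip(factors, levels)) for levels in product(*factors.values())]
print(len(variants))   # 6 combinations (however the control is defined on top)
for variant in variants:
    print(variant)
```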

Further reading

If you enjoyed this post, follow and recommend! Experimental design is a rich field and there is good literature on planning efficient tests. An example is the Taguchi method’s use of fractional factorial designs, which run carefully chosen subsets of a full factorial design while maintaining enough resolution to estimate the treatment effects and interactions of interest.

Also check out our other data science blog posts.

In SF and interested in hearing more about data science at Lyft? We’re hosting the SF Big Analytics meetup on Wednesday, October 25th at Lyft HQ — please join us if you can!

And, of course, Lyft data science is hiring! If you’re interested in developing experiments or building the artificial intelligence that powers them, read more about our role and reach out to me!
