Recovering from Crashes with Safe Mode

Michael Rebello
Published in Lyft Engineering
7 min read · Oct 17, 2022

Feature flags are everywhere in modern software development: They’re a great tool for running A/B experiments, slowly rolling out changes to users, and even turning off problematic codepaths during incidents. When an engineer implements a new feature, it’s practically second-nature to gate it behind a feature flag.

While this practice is largely beneficial, incidents occasionally occur when a feature flag enables a buggy codepath and causes a crash or an otherwise degraded user experience. A feature flag that causes a crash immediately upon app launch is particularly painful: even if the flag is disabled remotely after an engineer identifies the issue, an app that already has the bad configuration will continue to crash before it is able to fetch the corrected configuration.

We’ve experienced this issue a few times at Lyft over the years. When a crash on launch was introduced by turning on a feature flag or changing other remote configurations, we usually had to ship a hotfix to get users out of infinite crash loops since we had no way of pushing configuration updates to the app when it was crashing so early in its lifecycle. This inevitably resulted in disappointed users, fewer rides, and lost revenue.

To help mitigate these crash loops, we created Safe Mode.

Goals

We had 3 goals when we set out to create Safe Mode:

  1. Crash loop prevention: Identify crash loops that occur on launch (i.e., crashes that occur before the user sees the main screen and persist through multiple launch attempts) as a result of bad configuration changes, and prevent subsequent crashes from happening.
  2. Observability: Provide insights into when these situations occur via dashboards and alarms.
  3. Safe rollout: We needed to be confident that whatever we shipped to solve the first two goals didn’t itself result in crash loops.

Approach

Our strategy for tackling this problem was as follows:

  1. Track consumed feature flags. In order to determine which flags could have caused a crash on launch, we store a list of configurations that are accessed during the app’s lifecycle. In practice, this is done by updating the set of feature flags that our crash reporting library, Bugsnag, stores to be reported alongside crashes.
  2. Identify the crash on launch. To do this, we utilize Bugsnag’s support for tracking crashes on launch and configure the SDK to mark launches as “completed” when applicationDidFinishLaunching returns on iOS and once the main screen is displayed on Android. Safe Mode is the first thing to start in our app’s lifecycle after the Bugsnag SDK (essentially the second line of main), and it queries the SDK to see if any crashes occurred during the previous session before launch completed.
  3. Track the occurrence. We emit an analytics event safe_mode_engaged once a crash on launch has been identified by Safe Mode. This event powers our related dashboards and alarms.
  4. Engage Safe Mode. In order to de-risk our rollout of Safe Mode, we introduced a “shadow” setting within it. When shadow is enabled, Safe Mode only emits the analytics event mentioned above and does not take any further action. Without the shadow setting, it determines which feature flag configurations were consumed before the app crashed on the previous launch (by reading the feature flags previously stored in Bugsnag in step 1), and locks those configurations to their local default values for the remainder of the session. This effectively puts the potentially problematic features/codepaths into their default “safe” state and allows the user to use the app as they normally would (albeit with some functionality disabled).
  5. Alarm/page engineers. Dashboards and alarms are configured to page on-call engineers when Safe Mode begins to engage at a low volume. Engineers then inspect the configuration changelog and revert the offending update once it is identified.
  6. Refresh client-side configurations. Once the app successfully finishes launching, it refreshes its feature flags. If the problematic configuration has already been reverted, the correct values will be used on the following app launch. If the client does not receive the fixed configuration before the next launch, the app will likely crash and Safe Mode will engage again.
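Steps 1 through 4 above can be sketched in a few lines. This is a minimal, platform-neutral illustration in Python rather than the Swift/Kotlin code that ships in the apps. `CrashReporter`, `FlagStore`, and `start_safe_mode` are hypothetical names invented for this sketch; in particular, `CrashReporter` is a stand-in and does not mirror Bugsnag's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class CrashReporter:
    """Hypothetical stand-in for the crash reporting SDK (not Bugsnag's API)."""
    # Flags that were consumed during the previous launch, if it crashed.
    flags_from_crashed_launch: set = field(default_factory=set)
    # True if the previous session crashed before launch was marked complete.
    last_launch_crashed: bool = False
    # Flags consumed so far during the current launch.
    current_flags: set = field(default_factory=set)

    def record_flag(self, key):
        # Step 1: remember every flag consumed during this launch.
        self.current_flags.add(key)


class FlagStore:
    def __init__(self, remote, defaults, reporter):
        self.remote = remote        # last fetched remote configuration
        self.defaults = defaults    # local default ("safe") values
        self.reporter = reporter
        self.locked = set()         # keys pinned to defaults this session

    def get(self, key):
        self.reporter.record_flag(key)
        if key in self.locked:
            return self.defaults[key]
        return self.remote.get(key, self.defaults[key])


def start_safe_mode(store, reporter, emit_event, shadow):
    """Run right after the crash reporter starts, before any flag is read.

    Returns True if flags were locked to their defaults.
    """
    # Step 2: did the previous launch crash before completing?
    if not reporter.last_launch_crashed:
        return False
    # Step 3: always track the occurrence, even in shadow mode.
    emit_event("safe_mode_engaged")
    if shadow:
        return False                # Shadow mode: observe only.
    # Step 4: lock every flag consumed before the crash to its default.
    store.locked = set(reporter.flags_from_crashed_launch)
    return True
```

In this sketch, a flag that was consumed before the previous crash resolves to its local default for the rest of the session, while a flag that was not consumed keeps its remote value. With `shadow=True`, the event is still emitted but no flag is locked.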

We considered having the app do a force-refresh of configurations instead of locally resetting values, but decided against it because: 1) although this would have met the objective of avoiding a client hotfix, it would have prevented users from using the app until the configuration change was reverted, and 2) making a network call would require additional dependencies to be set up, and those dependencies could be (and have been) the cause of the previous crash.

Observability

Using the event described above, we constructed Grafana dashboards to surface instances of when a bad configuration change causes a spike in Safe Mode engagements:

Graph showing a spike in Safe Mode engagements after a bad configuration change.

PagerDuty alarms are also configured using raw event counts, set up to page on-call engineers when a spike in events occurs.
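The logic behind an alarm on raw event counts can be illustrated with a trailing-window counter. This is only a sketch of the idea: the real alarms are configured in the monitoring tools named above, and the `EngagementAlarm` class, window length, and threshold here are assumptions made for illustration.

```python
from collections import deque


class EngagementAlarm:
    """Signals a page when events in the trailing window exceed a threshold."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.timestamps = deque()

    def record(self, timestamp):
        """Record one safe_mode_engaged event.

        Timestamps are in seconds and assumed to arrive in order.
        Returns True if the alarm should page.
        """
        self.timestamps.append(timestamp)
        # Drop events that have fallen out of the trailing window.
        cutoff = timestamp - self.window_seconds
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        return len(self.timestamps) > self.threshold
```

A threshold above zero leaves room for a small baseline of unrelated crashes on launch without paging anyone.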

Rollout

Prior to shipping Safe Mode, we built a tool to test it on physical devices. Activating the test tool enables a feature flag locally which, in turn, causes a fatalError() on all following launches. We included it in a set of developer tools that’s part of the alpha (employee) version of the Lyft apps:

Screenshot demonstrating the Safe Mode simulation tool in the Lyft alpha app.

Once we were ready to roll out Safe Mode, we enabled it in stages. The primary reason for rolling out slowly was to ensure that Safe Mode did not engage more often than we expected it to, since engaging too often would mean being overly aggressive about disabling features and degrading the user experience. The stages were:

  • Enabled Safe Mode with its shadow setting active for alpha and beta users, which allowed us to validate that Safe Mode wouldn’t engage in unexpected circumstances.
  • Enabled Safe Mode without its shadow setting for alpha and beta.
  • Enabled Safe Mode with its shadow setting active for production, which gave us more exposure to potential instances of unexpected engagements.
  • Enabled Safe Mode for production users over the course of ~2 weeks.

False positives

When we shipped Safe Mode, we expected that it would never engage outside of an actual incident involving a bad configuration change. However, we noticed that there were actually a few crashes on launch that triggered Safe Mode and occurred at low volumes throughout each day:

Graph showing occasional Safe Mode engagements at low volumes rather than a baseline of 0.

Curious about why we were seeing this, we dug into the Safe Mode events and joined them with our crash reports to identify which stacktraces were causing Safe Mode to engage unexpectedly (i.e., when there was a crash on launch caused by something other than a bad configuration change). The issue turned out to be an obscure thread-safety bug related to headers we set on outgoing network requests. Furthermore, the bug tended to manifest itself during app launch because that’s when the data being read was usually updated (thus increasing the likelihood of race conditions). Fixing the bug reduced the number of false positives and allowed our graphs to settle down as expected.
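The post does not describe the actual fix, so the following is only a generic illustration of this class of bug, not what was shipped. It assumes a shared header store that one thread updates while others read it, and guards every access with a single lock. `RequestHeaders` is a hypothetical name.

```python
import threading


class RequestHeaders:
    """Shared header store; every read and write goes through one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._headers = {}

    def update(self, new_headers):
        # Writers hold the lock so readers never see a partial update.
        with self._lock:
            self._headers.update(new_headers)

    def snapshot(self):
        # Return a copy so callers never iterate the dict while it is mutated.
        with self._lock:
            return dict(self._headers)
```

Each outgoing request would build its headers from `snapshot()`, so an update that lands during launch cannot race with a request that is being assembled at the same moment.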

Impact

Safe Mode was rolled out just a couple of months ago. Since then, it has engaged multiple times — each time, triggering a prompt investigation and fix after paging, reducing the set of affected users, allowing affected users to continue using the Lyft apps, avoiding hotfixes, and saving the business real money.

Future improvements

Today’s implementation of Safe Mode alleviates a big pain point when it comes to resolving incidents on our mobile clients, but there are still some impactful ideas we would like to explore in the future:

  • Handling app hangs / ANRs. Although Safe Mode is able to mitigate crashes that happen on launch, there is an opportunity to handle app hangs (ANRs on Android) that occur as a result of feature flag configuration changes but don’t necessarily crash the app. We are currently investing in building out this functionality.
  • Reducing collateral damage. Safe Mode currently disables all feature flags that were consumed before the crash during the previous app launch. This usually includes tens of flags that had no part in causing the issue, but they are disabled regardless — largely because linking a stacktrace back to an individual feature flag is an incredibly challenging task. For now, we encourage our teams to remove flags that are no longer needed. Ideally, in the future we’d be able to reduce the impact that Safe Mode has on unrelated flags.
  • Automatically disabling problematic flags. At the moment, reverting a problematic flag change requires intervention from an engineer because identifying the problematic flag is not trivial (as mentioned above). In the future, we could potentially report the keys of the flags being reset by Safe Mode and compare those to the server-side configuration changelog, though this would likely require additional heuristics.
  • Expanding beyond handling bad configurations. What other things cause crash loops on launch? How can we build infrastructure to recover from those as well?
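As a rough illustration of the changelog comparison idea in the list above, the sketch below intersects the keys that Safe Mode reset with flags that changed shortly before the crash. The function name, the `(key, changed_at)` changelog shape, and the lookback window are all assumptions; a real system would likely need the additional heuristics mentioned above.

```python
def suspect_flags(reset_keys, changelog, crash_time, lookback_seconds):
    """Return reset flags that changed shortly before the crash.

    reset_keys: set of flag keys Safe Mode locked to defaults.
    changelog: iterable of (key, changed_at) pairs, times in seconds.
    Result is ordered from most recently changed to least, without duplicates.
    """
    earliest = crash_time - lookback_seconds
    recent = [
        (changed_at, key)
        for key, changed_at in changelog
        if key in reset_keys and earliest <= changed_at <= crash_time
    ]
    ordered = [key for _, key in sorted(recent, reverse=True)]
    # dict.fromkeys keeps the first (most recent) occurrence of each key.
    return list(dict.fromkeys(ordered))
```

Ranking by recency reflects the assumption that the most recent change is the likeliest culprit; it narrows the candidates for an engineer rather than proving which flag caused the crash.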

Summary

Safe Mode has allowed us to identify client-side incidents more quickly, resolve them faster, avoid hotfixes, and enable users to continue using our apps while we debug issues behind the scenes. Additionally, engineers are able to ship features with a little less worry, and release managers know that there is something in place to help mitigate incidents.

If this post inspired you or you’d like to join our reliability efforts, take a look at some of our job openings!
