Building a large scale unsupervised model anomaly detection system — Part 1

Distributed Profiling of Model Inference Logs

Anindya Saha

Published in

Lyft Engineering

8 min readApr 21, 2023

By Anindya Saha, Han Wang, Rajeev Prabhakar

Introduction

LyftLearn is Lyft’s ML Platform. It is a machine learning infrastructure built on top of Kubernetes that powers diverse applications such as dispatch, pricing, ETAs, fraud detection, and support. In a previous blog post, we explored the architecture and challenges of the platform.

In this post, we will focus on how we utilize the compute layer of LyftLearn to profile model features and predictions and perform anomaly detection at scale.

This article is divided into two parts. In this first part we will focus on how we profile model features and predictions at scale. In part 2, we will focus on how we use this profiled data for anomaly detection.

Motivation

Machine learning forms the backbone of the Lyft app and is used in diverse applications such as dispatch, pricing, fraud detection, support, and many more. There are several potential applications of anomaly detection to improve machine learning models at Lyft. By identifying faults or changing trends with the features and predictions of the models, we can quickly identify whether there is feature drift, or concept drift. Identifying anomalies in critical business metrics allows teams to streamline operations and plan for corrective actions.

In our previous blog, we discussed the various challenges we faced in model monitoring and our strategy to address some of these issues. We briefly discussed using z-scores to find anomalies. One challenge we faced with z-scores was that there tend to be many false positives because features and predictions can deviate statistically without implying a problem with the models for several reasons such as seasonality. This motivated us to continue improving the existing Anomaly Detection system.

Pain Points

LyftLearn hosts a large number of models for various business use cases, making hundreds of millions of predictions every day. Even a single user operation such as requesting a ride could initiate calls to several models for different parts of the operation. We instrument all inference requests, sample and store a certain percentage of model inference requests and emitted predictions.

However, consuming this raw data presents several pain points:

The number of requests varies across models; some receive a large number of requests, while others receive only a few. For some models, aggregating data with simple queries is easy, while for others the data is too large to process on a single machine.
The number of features and predictions emitted by models varies widely. Some have 10 features and emit probability scores while others may have 30 features and emit a ranking.
Models also differ significantly in the type of features, with some having more categorical features and others having more numerical features.
The traffic pattern also varies by hour of the day across different models. Some hours in the day have a peak amount of requests as compared to the rest of the day. The data is skewed.

Since the number of models is large and constantly growing, we cannot have a specific monitoring logic for each model.

Solution

Since the models vary so widely in the number and nature of features and predictions, it is imperative for us to devise a uniform way of processing them. Therefore, we decided to profile the features and predictions and extract only the essential metrics from these profiles, regardless of the data scale.

The purposes of profiling are:

To normalize and compress metric data while retaining maximal information.
We can unify data from totally different models and process them using the same pipeline in the following step.
The subsequent steps will only need to handle purely numerical time series.
Significantly reduce the scale of the problem, so the compute can be more efficient and cost effective.

Unified profiling of features and predictions from model inference logs

In order to profile features and predictions of multiple models at scale we needed a data profiling framework that can distribute the profile computation across many nodes. It should have the capability to generate and merge partially generated profiles into one big profile. The profiling should also be extremely fast since we have a lot of data to process. Since we already use Spark for distributed computing, it would be an added bonus if we can also leverage the same stack with this profiling framework to distribute the computing logic to create partial profiles from data partitions and then aggregate the partial profiles for any granularity of time. e.g. we can get daily profiles by merging hourly profiles.

We considered several open source and paid solutions for monitoring models available in the community, and Whylogs proved to be suitable because its mergeable profiles fit well with Spark’s map-reduce style processing. This is the single most important feature that drove our decision. Since Whylogs requires only a single pass of data, the integration is highly efficient: no shuffling is required to build Whylogs profiles with Spark. In addition to that, Whylogs data profiling is very fast since it uses pybind11 to hook into numpy directly to facilitate fast 1-pass data profiling. The profiles are very compact and efficiently describe the dataset with high fidelity.

Mergeable Profiles

Let’s consider an example of a continuous feature used in a model for illustration. In the figure, on the left side, we divide the input data frame into two parts, as shown by the red line. On the right side, we have the entire input dataframe. If we take the Whylogs profile of each of the halves of the original dataframe and then merge them, the resultant profile will be exactly the same if we would have taken the Whylogs profile for the whole original input dataframe.

Dividing the input data frame into two parts — *Partial merge feature of Whylogs*

As a concrete example, let part_df_1 and part_df_2 be two parts of a pandas dataframe. Below, we show the whylogs profile of the part_df_1 and part_df_1 partitions of the original dataframe. The median value is 65.0 and 61.0 for the two partial frames respectively.

Dataframe examples for partial profiles — *Partial profiles of two different sections of a dataframe*

On the left, we see the resultant merged profile when we merge the profiles from part_df_1 and part_df_2 partitions of the dataframe. While on the right side we see the profile from the original dataframe. The resultant median value is the same in both situations.

Getting overall profile by merging partial profiles

Whylogs profiles are realized in protobuf format which make them extremely efficient to serialize and deserialize and store into the database as binary blobs.

Outcome

We have built distributed computing support for LyftLearn on top of Spark and Kubernetes. In our previous blog we discussed how LyftLearn democratizes distributed compute through Kubernetes Spark and Fugue. Apache Spark offers superb scalability, performance, and reliability. We use spark to spawn multiple executors to load the features and predictions data for those models and generate Whylogs data profiles for those features per model, every hour, every day in a distributed manner.

We also use the open source framework called Fugue for its excellent abstraction layer that unifies the computing logic over Spark.

Distributing Whylogs profile generation

The model features and predictions are logged as JSON blobs in Hive. Below, we present the features of two example models.

Model feature jsons — *Model feature JSONS*

One of Fugue’s most popular features is the ability to use a simple Python function call to distribute logic across many partitions of a larger dataframe. Users can provide functions with type-annotated inputs and outputs, and Fugue then converts the data based on the type annotations. This makes the custom logic independent from Fugue and Spark, removing the need for any pre-existing knowledge.

Here’s a simple Python function that employs Whylogs to generate a profile for a given chunk of data.

Python function creating profiles and serializing — *Use Whylogs to create profiles and serialize*

The above dataset can be partitioned by model name and version, and then use Fugue to apply the profile_features() function across all the partitions and help to scale to any number of models.

Similarly, if we want to generate hourly profiles for monitoring hourly deviations, we just partition the above dataframe by model name, version and hour of the day and apply the same profile_features() function across all the partitions to get hourly profiles. We can move between any level of granularity by controlling the partition. The beauty of profile_features() function is that the logic is defined independent of scale.

The result shows that the binary feature profiles have been generated for each model for each hour (ts) in the day.

Dataframe showing hourly feature profiles for models — *Hourly feature profiles for models*

Since the hourly profiles are already calculated, we can now leverage the partial profile merge capability of Whylogs profiles to combine all the hourly profiles to generate a daily profile.

We define a custom profile_reduce() function which operates on each model, each day separately. For each such combination, we deserialize() the binary feature blobs of each hour and then use the merge() method of DatasetProfileView to combine them into a bigger profile representing the entire day to get the daily profile. We again call the serialize() method on the resultant profile to convert the profile into a daily profile binary blob.

Python function merging profiles and serializing — *Use Whylogs to merge profiles and serialize*

The result shows that the binary feature profiles have been generated for each model for each day.

Dataframe showing daily feature profiles for models — *Daily feature profiles for models*

Conclusion

By combining Spark, the open-source Fugue, and Whylogs, we have been able to scale profile generation processes for all models. We can go as deep as we want and persist the Whylogs profile into the database. Similarly, if we need a more coarse or aggregated level of Whylogs profile we can use Whylogs mergeable profile capability.

All our model features and predictions are automatically profiled daily. The profile generation jobs use the raw model inference logs containing the features and predictions. Users do not need to take any action when new models are introduced. Models are automatically onboarded into the profiling framework. This provides the groundwork for the Anomaly Detection framework which we will discuss in Part 2.

Acknowledgements

Our team, consisting of Han Wang, Rajeev Prabhakar, and Anindya Saha, focuses on continuously building these features onto LyftLearn at Lyft.

Special thanks to Shiraz Zaman and Mihir Mathur for the Engineering and Product Management support behind this work.

Lastly, we would like to thank the early adopting teams that provided crucial feedback and helped us improve our systems.

As always, Lyft is hiring! If you’re passionate about developing state of the art systems, join our team.