Open Sourcing Amundsen: A Data Discovery And Metadata Platform

Tao Feng
Published in Lyft Engineering · Oct 30, 2019 · 12 min read


By Tao Feng, Jin Hyuk Chang, Tamika Tannis, Daniel Won

In a modern data-driven company like Lyft, every interaction on the platform is powered by data. The challenges that arise from complex data generation, ETL processes, and analytics make metadata significantly important. Moreover, the types of data resources are constantly increasing. At Lyft, these resources include SQL tables and views in Redshift, Presto, Hive, and PostgreSQL, as well as dashboards in business intelligence tools like Mode, Superset, and Tableau. With growing data resources, it becomes increasingly difficult to know what data resources exist, how to access them, and what information is available in those resources.

Previously we introduced Amundsen — our solution for data discovery — from a high level product perspective. Over the past year since launching in production at Lyft, we’ve made all of Amundsen’s repositories public, explored various expansions and improvements, and established a growing open source community.

Today, we are proud to officially announce that we are open sourcing Amundsen (https://github.com/lyft/amundsen), our data discovery and metadata platform.

In this post, we discuss Amundsen’s architecture in depth, explain how this tool democratizes data discovery, and cover some challenges we faced when designing the product. At the end, we highlight some of the great contributions from our early adopters, and summarize our future roadmap.

Meet Amundsen

The diagram below represents Amundsen’s architecture at Lyft. Amundsen follows a microservice architecture and is composed of five major components:

  • Metadata Service handles metadata requests from the front-end service as well as other microservices. By default the persistent layer is Neo4j, but it can be substituted.
  • Search Service handles search requests from the front-end service. By default the search engine is powered by Elasticsearch, but it can be substituted.
  • Front-End Service hosts Amundsen’s web application.
  • Databuilder is a generic data ingestion framework which extracts metadata from various sources.
  • Common is a library repository which holds code shared among all of Amundsen’s microservices.
Amundsen Microservices Architecture

Metadata Service

Amundsen’s metadata service provides metadata about data resources. This metadata is displayed in the front-end service’s web application, and is also utilized by other services at Lyft. The metadata service currently supports three types of resources:

  • Table resources include metadata such as the table’s name, descriptions, columns, relevant statistics, and more.
  • User resources represent individual users or teams and include metadata such as names, team information, and contact information.
  • Dashboard resources (currently a work in progress).

Internally the metadata service connects with its persistent layer through a proxy and serves requests via REST APIs. By default, the persistent layer is powered by Neo4j, an open source graph database. The service can also be configured to use Apache Atlas — another metadata engine. This integration with Apache Atlas was the result of an early open source contribution.
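
To make this concrete, here is a minimal sketch of how a client such as the front-end service might fetch table metadata over the REST API. The endpoint path, port, and table-URI format shown here are illustrative assumptions, not the service’s exact contract.

```python
import requests

METADATA_SERVICE = "http://localhost:5002"   # assumed local metadata service address
table_uri = "hive://gold.core/fact_rides"    # hypothetical database://cluster.schema/table key

# Ask the metadata service for everything it knows about one table.
resp = requests.get(f"{METADATA_SERVICE}/table/{table_uri}")
resp.raise_for_status()
table = resp.json()

print(table.get("name"), "-", table.get("description"))
for column in table.get("columns", []):
    print(column.get("name"), column.get("col_type"))
```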

Graph model for Table resource metadata

Amundsen models metadata entities as a graph which makes it easy to extend the model when more entities are introduced. All of the entities — such as tables, columns, and schemas — are connected via edges that represent the relationships between them. Each entity is also provided with a unique resource ID for identification.
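
As a rough illustration of this model, the snippet below reads a table and its columns from Neo4j with the official Python driver. The node labels, the COLUMN relationship, and the key property are assumptions that mirror the description above rather than Amundsen’s exact schema.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "test"))

# Walk from a Table vertex to its Column vertices along the (assumed) COLUMN edge.
query = """
MATCH (t:Table {key: $table_key})-[:COLUMN]->(c:Column)
RETURN t.name AS table_name, collect(c.name) AS columns
"""

with driver.session() as session:
    record = session.run(query, table_key="hive://gold.core/fact_rides").single()
    if record:
        print(record["table_name"], record["columns"])
```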

Lyft currently uses Neo4j community version 3.3.5 to host the metadata. We back up the Neo4j graph to Amazon S3 every four hours.

Search Service

Amundsen’s search service provides an API for indexing resources into a search engine and serves search requests from the front-end service.

Similar to the metadata service, the search service can interact with different search engines through a proxy. By default the search service is integrated with Elasticsearch 6.x, but it can also be integrated with Apache Atlas, which provides similar search capabilities via Solr.

We currently support various search patterns for a more flexible search experience (illustrated in the sketch after this list):

  • Normal search: returns the most relevant results for the given search term and particular resource type.
  • Category search: filters for resources where a primary search term matches a given metadata category (e.g. search for database:hive), then returns results matching a secondary search term based on relevancy.
  • Wildcard search: allows users to perform wildcard searches over different resources.
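
Below is a rough sketch of how these patterns might translate into requests against the search service. The endpoint and parameter names are illustrative assumptions, not the service’s exact API.

```python
import requests

SEARCH_SERVICE = "http://localhost:5001"   # assumed local search service address

# Normal search: rank tables that are relevant to the raw term.
requests.get(f"{SEARCH_SERVICE}/search", params={"query_term": "rides", "resource": "table"})

# Category search: restrict to a metadata category, then rank by a secondary term.
requests.get(f"{SEARCH_SERVICE}/search", params={"query_term": "database:hive rides"})

# Wildcard search: match any table whose name starts with "fact_".
requests.get(f"{SEARCH_SERVICE}/search", params={"query_term": "fact_*"})
```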

Front-End Service

Amundsen’s front-end service is composed of two distinct parts:

  • A React application for client side rendering.
  • A Flask server for serving requests and acting as an intermediary for metadata or search service requests.
Frontend Stacks Used in Amundsen

The React application is written in TypeScript, leverages Redux for managing its application state, and follows the “Ducks” modular approach for code structuring. Redux-Saga was introduced as middleware to effectively manage the side effects of asynchronous requests on the application state.

Babel is used to transpile our source code across different ECMAScript standards, Webpack is used to bundle the source code, and Jest is used for unit testing. The application also uses NPM for managing dependencies.

Keeping the open source community in mind, Amundsen’s front-end application is highly configurable. There are feature flags and various configuration variables to account for the unique desires of different organizations. For example, at Lyft we ingest and index our employees as user resources. Realizing that not all organizations have a use case to integrate employee data, the features related to user resources are hidden by default in the application’s user interface but can be switched on with a feature flag.

Databuilder

Databuilder is an ETL ingestion framework for Amundsen, heavily inspired by Apache Gobblin. There are corresponding components for ETL (Extractor, Transformer, and Loader) that deal with record-level operations. A component called Task controls all three, while Job is the highest-level component in Databuilder; it controls Task and Publisher and is what a client uses to launch an ETL job.

Databuilder

In Databuilder, each component is highly modularized. Namespace-based configuration using HOCON (Human-Optimized Config Object Notation) makes Databuilder highly reusable and pluggable; for example, a transformer can be reused within an extractor, or an extractor within another extractor. The record-level flow through these components is sketched after the list below.

  • Extractor extracts records from the source. This does not necessarily mean it only supports the pull pattern of ETL; extracting records from a message bus, for example, makes it a push pattern.
  • Transformer takes records from either an extractor or another transformer (via ChainedTransformer) and transforms them.
  • Loader takes records from a transformer, or directly from an extractor, and loads them into the staging area. As the loader operates at the record level, it is not capable of supporting atomicity.
  • Publisher is an optional component. Its common usage is to support atomicity at the job level and/or to easily support bulk loading into the sink.
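
To illustrate the record-level flow, here is a toy sketch using stand-in classes rather than Databuilder’s actual base classes: a task pulls records from an extractor one at a time, pushes each through a chain of transformers, and hands the result to a loader.

```python
class DictExtractor:
    """Stand-in extractor that yields in-memory records one at a time."""
    def __init__(self, records):
        self._records = iter(records)

    def extract(self):
        return next(self._records, None)   # None signals that extraction is done


class LowercaseNameTransformer:
    """Stand-in transformer that normalizes a field on each record."""
    def transform(self, record):
        record["name"] = record["name"].lower()
        return record


class PrintLoader:
    """Stand-in loader; a real loader would write to a staging area instead."""
    def load(self, record):
        print("loaded:", record)


def run_task(extractor, transformers, loader):
    # The task drives the loop: extract -> transform (chained) -> load, record by record.
    record = extractor.extract()
    while record is not None:
        for transformer in transformers:
            record = transformer.transform(record)
        loader.load(record)
        record = extractor.extract()


run_task(
    DictExtractor([{"name": "Fact_Rides"}, {"name": "Dim_Drivers"}]),
    [LowercaseNameTransformer()],
    PrintLoader(),
)
```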

The diagram below shows an example of how a databuilder job fetches metadata from the Hive metastore and publishes that metadata to Neo4j. Each distinct metadata source can be fetched through a different databuilder job.

An example databuilder job to extract metadata from hive and persist into Neo4j
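
A condensed sketch of the kind of job pictured above is shown below, wiring an extractor, loader, and publisher into a Job. The module paths and class names come from the open source Databuilder repository, but the configuration keys and connection values are approximations; consult Databuilder’s sample scripts for the exact settings.

```python
from pyhocon import ConfigFactory

from databuilder.extractor.hive_table_metadata_extractor import HiveTableMetadataExtractor
from databuilder.loader.file_system_neo4j_csv_loader import FsNeo4jCSVLoader
from databuilder.publisher.neo4j_csv_publisher import Neo4jCsvPublisher
from databuilder.task.task import DefaultTask
from databuilder.job.job import DefaultJob

# Namespaced HOCON configuration; the key names below are approximate.
job_config = ConfigFactory.from_dict({
    # connection to the Hive metastore backing database (assumed key and value)
    "extractor.hive_table_metadata.extractor.sqlalchemy.conn_string": "postgresql://user:pass@metastore-db/hive",
    # staging directories where the loader writes CSVs and the publisher reads them
    "loader.filesystem_csv_neo4j.node_dir_path": "/tmp/amundsen/nodes",
    "loader.filesystem_csv_neo4j.relationship_dir_path": "/tmp/amundsen/relationships",
    "publisher.neo4j.node_files_directory": "/tmp/amundsen/nodes",
    "publisher.neo4j.relation_files_directory": "/tmp/amundsen/relationships",
    "publisher.neo4j.neo4j_endpoint": "bolt://localhost:7687",
    "publisher.neo4j.neo4j_user": "neo4j",
    "publisher.neo4j.neo4j_password": "test",
    "publisher.neo4j.job_publish_tag": "hive_2019_10_30",   # stamps every row written by this run
})

job = DefaultJob(
    conf=job_config,
    task=DefaultTask(extractor=HiveTableMetadataExtractor(), loader=FsNeo4jCSVLoader()),
    publisher=Neo4jCsvPublisher(),
)
job.launch()
```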

At Lyft we use Apache Airflow as the orchestration engine for Databuilder. Each databuilder job runs as an individual task within the DAG (Directed Acyclic Graph). Each type of data resource has a separate DAG since it may have to run on a different schedule.

Airflow DAG to extract metadata with databuilder
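
Scheduling a databuilder job in Airflow can be as simple as wrapping it in a PythonOperator. The DAG id, schedule, and helper function below are illustrative; each resource type would get its own DAG with its own schedule.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def run_hive_metadata_job():
    # Build and launch the Databuilder job for Hive table metadata here,
    # along the lines of the sketch in the previous section.
    ...


dag = DAG(
    dag_id="amundsen_hive_metadata",   # hypothetical DAG name
    schedule_interval="@daily",
    start_date=datetime(2019, 10, 1),
    catchup=False,
)

extract_hive_metadata = PythonOperator(
    task_id="extract_hive_metadata",
    python_callable=run_hive_metadata_job,
    dag=dag,
)
```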

How does Amundsen democratize data discovery?

Showing relevant data for heterogeneous data resources

At Lyft we initially started with Amazon Redshift as our main data warehouse. Eventually we began migrating from Redshift to Hive. Over time the volume, complexity, and breadth of data grew exponentially, and today we have on the order of a hundred thousand tables in Hive and a few thousand tables in Redshift.

Different metadata sources

As Lyft grew, finding relevant data resources became more and more important, yet it also became a bigger challenge as the data landscape became increasingly fragmented, complex, and siloed. Amundsen now empowers all employees at Lyft — from new hires to those more experienced with our data ecosystem — to quickly and independently find the data resources they need to be successful in their daily tasks.

Lyft’s data warehouse is on Hive and all physical partitions are stored in S3. Our users heavily rely on Presto as a live query engine for Hive tables as both share the same Hive metastore. Each query through Presto is recorded to a Hive table for auditing purposes. At Lyft, we would like to surface the most important or relevant tables to users. To achieve this, we leverage the Databuilder framework to build a query usage extractor that parses query logs to get table usage data. We then persist this table usage as an Elasticsearch table document. This information helps the search service surface the most relevant tables based on usage ranking from database access logs.
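
The idea behind the usage extractor can be sketched in a few lines: scan the audited query logs, count how often each table is referenced, and emit per-table usage records that can be indexed into Elasticsearch. The regular expression and record shape below are simplifications of what a production SQL parser would need.

```python
import re
from collections import Counter

# Naive pattern that catches "FROM schema.table" and "JOIN schema.table" references.
TABLE_PATTERN = re.compile(r"(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def table_usage(query_log_rows):
    usage = Counter()
    for row in query_log_rows:
        for table in TABLE_PATTERN.findall(row["query_text"]):
            usage[table.lower()] += 1
    # Each dict could become an Elasticsearch document used for usage-based ranking.
    return [{"table": table, "total_usage": count} for table, count in usage.most_common()]


rows = [
    {"query_text": "SELECT * FROM core.fact_rides JOIN core.dim_drivers ON ..."},
    {"query_text": "SELECT count(*) FROM core.fact_rides"},
]
print(table_usage(rows))
# [{'table': 'core.fact_rides', 'total_usage': 2}, {'table': 'core.dim_drivers', 'total_usage': 1}]
```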

Connecting different data resources with people

Data ownership can be quite confusing for users; many resources have no one clearly responsible for them. This problem is intertwined with the fact that data is hard to find, an issue that Amundsen tries to solve in a scalable way. Much of the process of finding data consists of interactions between people, which is flexible but error prone and not scalable, and time consuming unless you happen to know whom to ask.

Amundsen addresses this problem by creating relationships between users and data resources; tribal knowledge is shared by exposing these relationships. Below is an example of what a user profile looks like in Amundsen.

A user resource relationship graph

There are currently three types of relationships between users and resources: followed, owned, and used. By exposing this information, experienced users will themselves become helpful data resources for other members on their team, and for other employees in similar job roles. For example, new hires can visit Amundsen and search for experienced users on their team, or for other employees who conduct similar kinds of analysis. The new hire will then be able to see which resources they can consider diving deeper into, or contact that person for more guidance. To make this tribal knowledge easier to find we also add a link to each user’s Amundsen profile on the profile pages of our internal employee directory.

Furthermore, we have started building out a notification feature which allows users to request further information from owners of data resources if existing metadata context is lacking. For example, if a table is missing a description, a user can send a request in Amundsen directly to the owners of that table asking them to improve the description.

Linking rich metadata with data resources

Capturing the ABC (Application Context, Behaviour, Change) metadata for data resources makes users more productive. We would like to capture all metadata that is meaningful for each type of data resource.

Here is an example of a table detail page:

Example table detail page

There are two types of metadata on the table page above:

Programmatically curated

  • Description
  • Date range or watermark
  • Table stats
  • Owners
  • Frequent users
  • Source code of the table
  • Application (e.g. Airflow DAG) that generates the table
  • Number of records in the table

Manually curated

  • Description
  • Tags
  • Owners

Some metadata — such as descriptions, tags, and owners — can be curated both programmatically and manually. To ingest additional metadata into Amundsen, a developer needs only to leverage the databuilder ingestion framework. They can either reuse existing extractors, or build a new extractor and model based on the interface needed for accessing that metadata at its source.
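
For teams that want to ingest metadata from an internal system, a custom extractor can be a small class. The sketch below follows the init/extract/get_scope interface used by Databuilder’s extractors; the base-class import path and the wiki source are assumptions for illustration.

```python
from databuilder.extractor.base_extractor import Extractor


class InternalWikiExtractor(Extractor):
    """Hypothetical extractor that reads table descriptions from an internal wiki."""

    def init(self, conf):
        # Pull connection settings out of this extractor's namespaced HOCON config.
        self._api_url = conf.get_string("wiki_api_url")
        self._records = iter(self._fetch_descriptions())

    def extract(self):
        # Return one record at a time; None tells the task that extraction is done.
        return next(self._records, None)

    def get_scope(self):
        # Namespace under which this extractor's configuration keys live.
        return "extractor.internal_wiki"

    def _fetch_descriptions(self):
        # Call the wiki API here and yield model objects the loader understands.
        return []
```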

Design Challenges

Scaling indexing metadata across different organizations

There are two types of approaches to index metadata into Amundsen — pull and push:

  • Pull approach: Periodically update the metadata by pulling from the source via crawlers.
  • Push approach: The sources (e.g. databases) push metadata into a message queue (e.g. Kafka), to which downstream consumers can subscribe and consume changes.

When Amundsen was bootstrapped, we initially considered the push approach; however, Lyft had only just started building a reliable messaging platform based on Kafka. As a result, we built the data ingestion library Databuilder to use the pull approach to index metadata into Amundsen. We acknowledge that the pull approach is not scalable, as different organizations within Lyft would prefer to push their metadata; the benefit of the push model is that it allows metadata to be indexed in near real time. Consequently, we are currently moving toward a hybrid model: organizations will be able to push metadata into a Kafka topic based on a predefined schema, while Amundsen’s Databuilder still pulls metadata through a daily Airflow DAG.
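
The push half of this hybrid model might look like the sketch below: a producing service emits a metadata event onto a Kafka topic using an agreed-upon schema, which Amundsen can then consume and index. The topic name and message shape are assumptions for illustration.

```python
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda payload: json.dumps(payload).encode("utf-8"),
)

event = {
    "entity_type": "table",
    "key": "hive://gold.core/fact_rides",   # hypothetical resource key
    "schema": "core",
    "name": "fact_rides",
    "description": "One row per completed ride.",
}

# Publish the event; a downstream consumer would index it in near real time.
producer.send("amundsen-metadata-events", value=event)   # assumed topic name
producer.flush()
```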

Removing and cleansing stale data

Amundsen currently doesn’t provide metadata versioning. When the Databuilder Airflow DAG finishes, the latest metadata is upserted into the graph. Initially we did not delete stale metadata, so when a table was deleted from a data store the metadata for that table continued to exist in Amundsen. This created some confusion for users, so we added a separate graph-cleansing Airflow task to our workflow DAG to remove stale metadata. In the long term, we plan to add versioning to data sets, so users can see how a data set has evolved instead of simply looking at the current/latest snapshot of metadata.
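
The cleansing task can be thought of as a query that removes anything the latest load did not touch. The sketch below assumes each load stamps its rows with a published_tag property (as in the job sketch earlier); Amundsen’s actual cleanup task may differ in detail.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "test"))

# Delete Table nodes (and their edges) that were not stamped by the most recent publish.
CLEANUP = """
MATCH (t:Table)
WHERE t.published_tag <> $latest_tag
DETACH DELETE t
"""

with driver.session() as session:
    session.run(CLEANUP, latest_tag="hive_2019_10_30")
```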

Data modeling with a graph model

Amundsen chose a graph data model to represent its metadata. A graph model is not a common choice for most applications, but we believe it is a great fit for Amundsen because the platform deals with many relationships among entities. A graph data model represents entities as vertices and relationships as edges, which makes it easy to extend the model when more entities are introduced. One of the design decisions in graph data modeling is whether to add new information as a property on an existing entity or as a new entity. In Amundsen, we decided to add a new entity as the default choice, as it opens up the opportunity for relationships with other entities.
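
A small example of the entity-versus-property decision: modeling a tag as its own node lets many tables share it and be queried through it, whereas storing the tag as a property on the Table node would keep it private to that table. The labels and relationship name below are illustrative.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "test"))

# Create (or reuse) the Tag entity and connect the table to it with an edge,
# instead of writing the tag as a plain property on the Table node.
ADD_TAG_AS_ENTITY = """
MERGE (tag:Tag {key: $tag})
WITH tag
MATCH (t:Table {key: $table_key})
MERGE (t)-[:TAGGED_BY]->(tag)
"""

with driver.session() as session:
    session.run(ADD_TAG_AS_ENTITY, tag="core_metrics", table_key="hive://gold.core/fact_rides")
```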

Open Source Community

In February 2019, we made Amundsen’s code repositories public as part of an open source soft launch. Since then we have been collaborating with awesome engineers from early adopters at companies such as ING and Square. There are more adopters than just these organizations; we apologize for not being able to list them all.

Amundsen has support for some of the popular data warehouses in the industry, including but not limited to Snowflake, Hive, Redshift, Postgres, and Presto.

As of today, a few companies use Amundsen in production and 20+ people have contributed code back to Amundsen, adding valuable features like new extractors for BigQuery and Redshift (Databuilder), integration with Apache Atlas (metadata service and search service), and markdown support for description editing in the UI (front-end service). We have a thriving community of 60+ companies and 250 users at the time of this writing. We’d love for you to join our community.

Future

Moving forward, there are many improvements that we are currently working on and new features that we would like to tackle with the community. Our roadmap currently includes:

  • Search & Resource page UI/UX redesign
  • Email notifications system
  • Indexing Dashboards (Superset, Mode Analytics, Tableau, etc)
  • Lineage integration
  • Granular ACL (access control)

Further details can be found in Amundsen’s open source roadmap doc.

Amundsen has been running in production at Lyft for over a year now, with about 1,000 weekly active users internally. We have an extremely high penetration rate (90+%) among technical roles like Data Scientists, Research Scientists, and Data Engineers, while Amundsen is also used by business users like Marketing and Community associates.

In the future, we will continue working with our growing community, internally and externally, to enhance the data discovery experience and boost users’ productivity. Last but not least, please join the community through our open source Slack channel.

Acknowledgements

We would like to thank many people for making Amundsen happen:

  • Mark Grover, who provides product leadership.
  • Jin Chang, Tamika Tannis, Tao Feng, and Daniel Won, who are currently working on the project at Lyft.
  • Shenghu Yang, Yuko Yamazaki, and Prashant Kommireddi, who provide engineering leadership.
  • Matt Spiel, who provides UI design support.
  • Philippe Mizrahi, Alagappan Sethuraman, Junda Yang, and Ryan Lieu, who used to work on Amundsen at Lyft.
  • Verdan Mahmood, Bolke de Bruin, and Nanne Wielinga from ING, who have worked with us since the inception of the open source journey.
  • Many different people (e.g., Jørn Hansen, Alyssa Ransbury, and Joshua Hoskins, to name a few) from different companies who help the community continue growing.
  • The Data Portal team at Airbnb, who gave us a lot of inspiration and suggestions.

As always, Lyft is hiring! If you’re passionate about developing state-of-the-art machine learning/optimization models or building the infrastructure that powers them, read more about these roles on our blog and join our team.
