Presto Infrastructure at Lyft

Puneet Jaiswal · Lyft Engineering · Aug 22, 2019


Overview

Early in 2017 we started exploring Presto for OLAP use cases and realized the potential of this amazing query engine. It started as an ad hoc querying tool for data engineers and analysts who wanted to prototype their SQL faster than they could with Apache Hive. At the time, many internal dashboards were powered by AWS Redshift, which couples data storage and compute. Our data was growing exponentially (doubling every few days), which required frequent storage scaling as well. Because storage was coupled with compute, any maintenance or upgrade required downtime, and scaling nodes made querying extremely slow as massive amounts of data moved across nodes. We needed a system where data and compute were decoupled, and that is where Presto fit nicely into our use case. A pipeline to store event data in Parquet format had already been set up, with the data accessed via Hive, so adding Presto to this stack was icing on the cake.

Presto infra stack

Now thousands of dashboards are powered by Presto, and about 1.5K weekly active users run a couple of million queries every month on this platform. As of today we have 60 PB of queryable event data stored in an S3-based data lake, and about 10 PB of raw data is scanned every day using Presto. The following charts show the timeline of Presto usage growth.

Monthly query volume growth
Daily input data processed

The above chart shows raw data scan volume per day; daily scan volume has grown 4x in the last four months.

Presto Clients

Lyft users consume data through query tools such as Apache Superset, Mode, Looker, Tableau, Jupyter notebooks, and some internal ML tools that build and improve application-specific machine learning models. All of these tools connect to Presto via Presto Gateway, and the queries they send are load-balanced uniformly across multiple Presto clusters. With the gateway we perform zero-downtime upgrades that are transparent to the users and apps querying through these clients.

Setup: build, release and deployments

We have forked prestosql/presto under lyft/presto and created the lyft-stable branch (adding additional cherry picks if needed). We have a private repository where we use saltstack and aws-orca scripts to roll out deployments to our environments. In this private branch we have added additional dependencies specific to our environment and have added integration tests that run on every update or pull request. We have dockerized this repo and use the Docker container as a development environment. On every new commit, we trigger the integration tests against the dev environment through Jenkins.

Here are the various components we use while releasing Presto to production.

  1. Lyft’s presto query-log plugin: We have added a query logging and blocking plugin based on Presto’s EventListener framework. We use the EventListener to intercept each new query as it lands and block a few types of queries that we deem harmful to the system (e.g., some tools actively cache column names and query system.jdbc.columns or catalog.information_schema.columns, causing additional load on the system). We log query stats in both the queryCreated and queryCompleted events, which helps us analyze success rates, latencies, and the various types of failures and their root causes (a minimal listener sketch follows this list).
  2. Presto UDFs: We let users add custom UDFs when a function addressing their use case is not available in Presto; for example, some users have added custom geo functions.
  3. Python-based stats collection: We use the datadog-agent library and have added multiple checks that collect system/JVM/process metrics and push them to statsd. A statsd collector forwards the metrics to Wavefront, where we have set up various alerts, and we get paged through PagerDuty when a threshold is breached.
  4. Test suites: Before we roll a new Presto version to production, we run multiple types of tests to ensure the quality of the release candidate.

Following are the types of tests we run to ensure the quality of each release.

a. Integration test suite — Runs table-CRUD and UDF tests against each build, launching the devbox Docker container in a Jenkins build task.

b. Replayer test suite — After rolling a release candidate to pre-prod, we launch a day-long test with an older day’s worth of queries, replayed the same way they executed in the past. We feed the query log through our event-logging pipeline and compare performance against the previously logged data.

c. Verifier test suite — Similar to the replayer, we use presto-verifier to run a static set of queries against the old and new versions. We move to the new version only if performance has not degraded.

  5. PrestoInfra: a collection of configurations and scripts to build and release deployable artifacts, plus saltstack scripts to roll out environment-specific deployments.

  6. PrestoProxy: a customized presto-gateway with Lyft-specific routing rules and overrides.
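As a rough illustration of item 1 above, here is a minimal sketch of a Presto EventListener (the class name, logger, and blocking check are illustrative assumptions; the real Lyft plugin logs richer stats to the event pipeline and applies its own blocking rules):

```java
import io.prestosql.spi.eventlistener.EventListener;
import io.prestosql.spi.eventlistener.QueryCompletedEvent;
import io.prestosql.spi.eventlistener.QueryCreatedEvent;

import java.util.logging.Logger;

// Hypothetical query-log listener sketch, not the actual Lyft plugin.
public class QueryLogListener implements EventListener {
    private static final Logger LOG = Logger.getLogger(QueryLogListener.class.getName());

    @Override
    public void queryCreated(QueryCreatedEvent event) {
        String queryId = event.getMetadata().getQueryId();
        String sql = event.getMetadata().getQuery();
        LOG.info("query created: " + queryId);
        // A decision to block harmful patterns (e.g. repeated scans of
        // system.jdbc.columns by metadata-hungry tools) would be made here.
        if (sql.contains("system.jdbc.columns")) {
            LOG.warning("metadata-heavy query detected: " + queryId);
        }
    }

    @Override
    public void queryCompleted(QueryCompletedEvent event) {
        boolean failed = event.getFailureInfo().isPresent();
        LOG.info("query completed: " + event.getMetadata().getQueryId() + " success=" + !failed);
    }
}
```

Such a listener is wired into the server through an EventListenerFactory returned from a plugin’s getEventListenerFactories().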

Presto production environment and configurations

We are running multiple Presto clusters that share load through presto-gateway, each with 100+ worker nodes. The node specifications are as follows:

Presto cluster configuration

Presto configurations

On each cluster, we have set a maximum query concurrency of 25 and a maximum query queue of 200, with each user limited to 4 running and 20 queued queries. Each query has a 30-minute max run time limit and a 10-minute max execution time limit. We rotate the entire cluster once a day to avoid old-gen GC pauses building up. We let each query scan up to 1 TB, with each worker node loading up to 10 GB of data. Below are the exact configurations to achieve these settings.

Presto server properties
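The embedded properties gist is not reproduced here; as a hedged sketch, the time limits above map onto standard prestosql properties along these lines (the concurrency and per-user queueing limits normally live in the resource-groups configuration rather than config.properties, and treating the 1 TB / 10 GB figures as query memory limits is an assumption):

```properties
# Illustrative sketch only -- not the actual Lyft config.properties
query.max-run-time=30m
query.max-execution-time=10m
# Assumption: the per-query / per-worker data limits enforced as memory limits
query.max-memory=1TB
query.max-memory-per-node=10GB
```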

In Presto Gateway, each coordinator can be assigned a routing group; setting the X-Presto-Routing-Group header routes a request to one of the clusters in that routing group. We have one cluster that runs with extended memory limits and is assigned the nolimit routing group. A user adds the comment `-- higherlimit` to a query as a hint that it is resource heavy, and presto-gateway routes that query to the cluster with the higher resource limits.
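As an illustrative sketch (not the actual presto-gateway code), a routing rule like the one described above might look roughly like this, with the routing group taken from the hint comment or from the X-Presto-Routing-Group header; the "adhoc" default group is an assumption:

```java
import javax.servlet.http.HttpServletRequest;

// Hypothetical routing-rule sketch: choose a routing group for an incoming query.
public class RoutingGroupSelector {
    static final String ROUTING_GROUP_HEADER = "X-Presto-Routing-Group";

    public String selectRoutingGroup(HttpServletRequest request, String queryBody) {
        // The "-- higherlimit" comment hints that the query is resource heavy,
        // so send it to the cluster group running with extended memory limits.
        if (queryBody != null && queryBody.contains("-- higherlimit")) {
            return "nolimit";
        }
        String headerGroup = request.getHeader(ROUTING_GROUP_HEADER);
        return headerGroup != null ? headerGroup : "adhoc"; // default group is an assumption
    }
}
```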

JVM Configuration

We are running Java 11 on the Presto nodes. Java 11 parallelizes the old-gen (full GC) collection phase, reducing collection times significantly.

Following is the JVM configuration we run Presto processes with. With the Java 11 upgrade we were able to bring old-gen GC pauses down to the order of a couple of seconds, whereas with Java 8 they peaked at around 400 seconds, crippling the service every once in a while.

Presto JVM config salt template

The value of max_heap_size_mb is 0.6 * node_memory_mb, which comes to 114 GB on our worker nodes.

Presto node recycle

Despite all the optimizations, Presto consumes host resources over time, and from experience we have learned that the longer the service runs, the slower it becomes. Old-gen GC pause times increase as more and more queries execute on the cluster, and we have noticed that a newer cluster yields better query performance than an old one. Keeping that in mind, we have designed the infrastructure to recycle all the nodes in each cluster once every 24 hours.

We use PrestoGateway’s deactivate API to disable a cluster 30 minutes before its scheduled shutdown time and its activate API 30 minutes after the scheduled start time, driven by a cron, providing no-downtime maintenance. Since we have set a 30-minute max run time per query, no queries are lost during these operations. We use aws-orca scripts to trigger AWS ScheduledActions that bring an entire Presto cluster up or down on a given schedule.

This is the salt script for shutting down a cluster at 5:20 PM and bringing it back up at 7 PM Pacific Time.

ASG scheduled action based presto scaling

Presto Gateway

Presto-Gateway is a stateful load balancer, proxy, and router for multiple Presto clusters; it provides transparent access to the underlying Presto backends without changing the protocol. It started as a proxy service to expose Presto coordinators to external BI tools like Mode and Looker in a secure way. As query volume increased, we faced more frequent outages because a single cluster could not handle the load, and we soon realized we needed a multi-cluster setup, so we dedicated one cluster to each such query tool. Though that reduced the frequency of outages and incidents, we still had not achieved zero-downtime maintenance, and we noticed that one cluster would sit relatively idle while another suffered from heavy queuing. At that point we decided to implement a true proxying load-balancing router. Because the BI tools are external, they access the Presto infrastructure via agents deployed in our network, and those agents cannot follow HTTP redirects, which is why a true proxy router and load balancer was required.

We used JettyProxy, a proxy server that allows binding custom filters and plugging in a proxy handler. We implemented the routing rules in a proxy handler, which lets us intercept HTTP requests and responses, inspect the query, source, headers, etc., and based on that choose a backend host to serve the request. Clients send two types of requests: (1) a new query submission and (2) follow-up requests for a previously submitted query. The PrestoGateway proxy handler caches the query-id-to-backend mapping so that it can stick follow-up requests to the backend that ran the original query.
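The stickiness described above can be sketched as follows (a simplified illustration, not the gateway’s actual classes): the query id returned by the coordinator for a new submission is remembered, and follow-up calls for that query id are pinned to the same backend.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified sketch of query-id-to-backend stickiness inside a proxy handler.
public class StickyRouter {
    // query id -> backend coordinator URL (the real gateway persists this mapping)
    private final Map<String, String> queryIdToBackend = new ConcurrentHashMap<>();

    // Called after the coordinator responds to a new query submission
    // (POST /v1/statement) with the generated query id.
    public void recordMapping(String queryId, String backend) {
        queryIdToBackend.put(queryId, backend);
    }

    // Called for follow-up requests, e.g. polling the next result page for a query id.
    public String backendFor(String queryId, String defaultBackend) {
        return queryIdToBackend.getOrDefault(queryId, defaultBackend);
    }
}
```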

Presto Gateway has 3 components:

BaseApp — Provides boilerplate code to add/remove pluggable components through YAML configuration and has a built-in metrics registry module, making it easy to emit custom metrics from apps built on top of it.

BaseApp class diagram

ProxyServer — A library built on top of Jetty proxy that provides a proxy server implementation with a pluggable proxy handler.

ProxyServer class diagram

Gateway — This component acts as a container for the proxy server and plugs in ProxyHandlers to provide the proxy, routing, and load-balancing functionality. It also exposes a few endpoints and a UI to activate/deactivate backends, along with a query-history view for recently submitted queries.

Presto Gateway UI

We use Lombok to reduce a lot of boilerplate code (getters/setters/loggers/equals, etc.), speeding up development, and the app is built on the Dropwizard framework.

Cost aware scaling to meet query demand

We started with one Presto cluster, and as usage grew we kept adding worker nodes to support the higher compute demand. As we added more workers, the query concurrency setting had to be raised to utilize the cluster’s full potential, and that required a presto-coordinator restart, causing downtime. After introducing presto-gateway, we moved to a multi-cluster mode, using the gateway as a load balancer. Usage of the Presto infrastructure is not uniform throughout the day and our workloads are bursty, so we over-provision the infrastructure so that a burst of queries can be absorbed gracefully.

Query volume for a week (granularity — 1 hour)

To optimize cost we implemented dynamic scaling. Looking at the incoming query volume, we noticed that usage is usually higher during business hours. With the backend activate/deactivate API in the gateway we were already able to perform no-downtime upgrades, deployments, and maintenance; we took this a step further and added scheduled downtime for half of the clusters during non-business hours. Thirty minutes before triggering a shutdown, we disable the cluster using the Gateway APIs, and since every query has a 30-minute max run time, no query is penalized in the process. The following chart shows how the number of nodes varies over time.

Total number of nodes over a week time

Cutting 50% of the infrastructure during non-business hours resulted in 30% overall cost savings.

Google sheets connector plugin

This connector allows Presto to read data from a Google Sheet so that small dimension data can be used in a query.

Querying google sheet as table in Presto

We announced this feature at Presto Summit 2019 and contributed it back to open source.

Limitations

Currently, the gsheets connector plugin has the following restrictions.

  1. All sheets have to be shared with the service-account user, with at least view access, if the sheet is not public.
  2. The first row of a sheet is always treated as the header containing the column names.
  3. All columns are resolved as type VARCHAR.
  4. The Google Sheets API rate limit of 100 calls per 100 seconds (with no daily cap) applies if the Google project does not have billing enabled. To avoid hitting it, users may choose a higher value for the cache config property sheets-data-expire-after-write in the presto-gsheets configuration.

Setup instructions

Following are the steps to set up the Presto Google Sheets plugin:

Step 1.

To use this, the user has to generate a service-account JSON credential in the G Suite console after creating a project and granting read access to the Google Sheets API. This credential file should be stored and referenced in sheets.properties as credentials-path. The service-account user id is in the credentials file; the user should share all Google Sheets with this service-account user so that Presto can read them.

Step 2.

Now create a metadata sheet to hold the table-name-to-sheet-id mapping. Share this sheet with the service-account user and note down its sheet id. Add this sheet id to sheets.properties as the metadata-sheet-id config property.
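Putting steps 1 and 2 together, a minimal sheets.properties might look like the following (paths and ids are placeholders; the property names are the ones this post references for the prestosql gsheets connector):

```properties
# etc/catalog/sheets.properties -- placeholder values
connector.name=gsheets
credentials-path=/etc/presto/gsheets-service-account.json
metadata-sheet-id=<metadata-sheet-id>
# Optional: a longer cache TTL helps stay under the Sheets API rate limit
sheets-data-expire-after-write=5m
```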

Step 3.

Now add the table name and the corresponding sheet id (after sharing that sheet with the service-account user) to the metadata sheet, and you should be able to query your table in Presto.

Metadata sheet
Sample table

Step 4.

Happy querying!

Show tables
Desc table
Selecting from table

This feature is currently in beta at Lyft and has become quite popular since we rolled it out: a couple of hundred users are already running thousands of queries every day against 50+ Google Sheets as tables.

Google sheets as table usage — each color band is query volume per user

Superset Presto Integration Improvements

There are many challenges users face when developing a SQL-based workflow or pipeline. Query browsers only surface errors and suggestions for syntactical mistakes (bad SQL or wrong UDF usage) after executing the query, so it takes users multiple iterations to figure out and resolve such errors, which degrades the user experience. We implemented EXPLAIN (TYPE VALIDATE) queries in Presto and send these explain queries as the user types in SQL Lab in Superset, ahead of the actual query execution; the returned explain plan captures syntactical errors along with the validity of columns, tables, and UDF signatures. This performs deep query validation without executing the whole query, improving the overall querying experience by eliminating much of the debugging time involved in writing complex SQL (a small client-side sketch of this call follows the screenshots below).

Table name validation while writing presto query
Column validation while writing the query
UDF validation while writing the query

Apache Superset — pre-execution deep query validation

We have contributed these features back to open source.
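For reference, here is a minimal client-side sketch of the validation call itself, using the Presto JDBC driver (host, catalog, user, and query are placeholders; Superset issues the equivalent EXPLAIN (TYPE VALIDATE) statement through its own Presto connection as the user types):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Properties;

// Hedged sketch: validate a query with EXPLAIN (TYPE VALIDATE) before running it.
// Requires the presto-jdbc driver on the classpath.
public class QueryValidator {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:presto://presto-gateway.example.com:8080/hive/default"; // placeholder
        String userQuery = "SELECT ride_id FROM rides LIMIT 10";                   // placeholder

        Properties props = new Properties();
        props.setProperty("user", "analyst"); // placeholder user

        try (Connection conn = DriverManager.getConnection(url, props);
                Statement stmt = conn.createStatement()) {
            // Fails with the coordinator's error (syntax error, unknown table/column,
            // bad UDF signature) without executing the full query.
            stmt.execute("EXPLAIN (TYPE VALIDATE) " + userQuery);
            System.out.println("Query is valid");
        }
    }
}
```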

Summary

It is important for all teams at Lyft to make data-driven decisions, and it is the Data Platform team’s mission to put data at the heart of every decision made at Lyft. Presto is a key component in achieving this mission. In the last couple of years we have primarily focused on scaling the infrastructure, reducing data arrival latency, and improving the user querying experience, while staying cost-effective at the same time.

A huge thanks to the rest of the Data Platform Infra team who helped in improving, scaling and supporting Presto Infrastructure at Lyft.

Also, thanks to the rest of the PrestoSQL open source community for helping and supporting us through the development. Please check this page on how to get involved in the project.

The Data Platform team is actively looking for talented and driven engineers, data engineers, scientists, product managers, and marketers to join our team. Learn about life at Lyft and visit the careers section for the latest openings.
