Trino: Open Source Infrastructure Upgrading at Lyft

Charles Song
Lyft Engineering
May 10, 2022


Background

As a data-driven company, Lyft makes many critical business decisions based on insights from data. To support the varied use cases of data scientists, analysts, and engineering teams, the Data Infra team at Lyft runs Trino as the main company-wide service for interactive queries. Beyond that, Lyft encourages teams to use Trino for other major data workloads, such as dashboarding, reporting, and large-volume Extract, Transform, Load (ETL) jobs. Our users and services submit more than 250K queries that process over 50PB of data each day. Given the growing usage of Trino, the Data Infra team is dedicated to providing the latest Trino features and infrastructure upgrades to our internal customers so that we can more efficiently support their business scenarios.

Trino Upgrades at Lyft

The Trino community normally publishes one to two releases per month. The Lyft Data Infra team performs platform upgrades to bring newly released community features to our internal customers. However, we currently upgrade only about every six months due to the complexity of upgrading our internal infrastructure fleet-wide.

This blog post covers how we upgraded our Trino clusters from version 359 to 365 and what we learned along the way. In the open source world, failing to stay current with a project's recent releases can eventually turn into technical debt. The aim of this post is therefore to outline approaches and tips that can help other data infrastructure teams operating at a similar scale upgrade quickly while safely identifying regressions.

Compatibility Testing

Once the decision to upgrade Trino has been made, the team chooses a build to use. We do not use the latest version of Trino, as it may contain regressions that have not yet been found by the community. The team reviews the release notes and chooses the version with the most critical features. For example, one feature we needed for our ETL workloads was the following: “Add support for insert overwrite operations on S3-backed tables” (#9234), included in version 363. The team then pulls the branch of the selected release into a Lyft-specific fork and cherry-picks our patches so that it builds within our infrastructure. These patches include any required bug fixes that have not yet been merged to the master branch. To guarantee that the new branch compiles and works as expected, minor changes may be needed to either disable unit tests that are not compatible with our settings or adjust features that override default settings. Once the team is confident we have a working build, we create a snapshot build to use in the testing process.

Lyft has three main use cases for Trino: Adhoc, ETL, and Scheduled, each with its own clusters. The Adhoc clusters handle interactive queries from users such as engineers and analysts, and are designed for a large number of queries with lower query wall time and higher overall concurrency. The ETL clusters are designed for larger, asynchronous queries, which may have longer wall times and higher resource utilization. The Scheduled clusters are a special ETL cluster type dedicated to dashboards and reporting, with more predictable requests and volumes. Before moving forward with real use case testing, we launch the new version in staging clusters, each representing a different use case, to verify that it builds and executes. If a staging cluster does not start up, we make the necessary changes to fix it and file issues with the open source community.

Performance Testing

After the staging testing is complete, we release the version to a cluster dedicated to performance testing. In the first round of performance testing, we replay one to two days' worth of read-only, compute-intensive production queries in batches (we call this batch mode) against the old and new versions to compare the p50, p75 and p99 wall time values. Batch mode lets the team test high-load scenarios and uncover any new failure cases that were not present in previous versions.
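For illustration, a batch-mode replay harness along these lines can be quite small. The sketch below is a minimal version in Python using the open source trino client; the host names, user, and query file are hypothetical stand-ins, not the actual tooling we run internally.

Replay sketch (Python):
# replay_compare.py: run a batch of read-only queries against two clusters
# and compare wall time percentiles. Hosts, user, and file name are hypothetical.
import time
import statistics
import trino  # pip install trino

CLUSTERS = {
    "trino-359": "perf-359.example.internal",
    "trino-365": "perf-365.example.internal",
}

def replay(host, queries):
    # Run each query once and record its wall time in seconds.
    conn = trino.dbapi.connect(host=host, port=8080, user="perf-replay",
                               catalog="hive", schema="default")
    wall_times = []
    for sql in queries:
        cur = conn.cursor()
        start = time.monotonic()
        cur.execute(sql)
        cur.fetchall()  # drain the results so the query runs to completion
        wall_times.append(time.monotonic() - start)
    return wall_times

def percentiles(wall_times):
    # statistics.quantiles(n=100) returns 99 cut points: index 49 = p50, 74 = p75, 98 = p99.
    qs = statistics.quantiles(wall_times, n=100)
    return {"p50": qs[49], "p75": qs[74], "p99": qs[98]}

if __name__ == "__main__":
    # One to two days of read-only production queries, separated by semicolons.
    queries = [q.strip() for q in open("replay_queries.sql").read().split(";") if q.strip()]
    for name, host in CLUSTERS.items():
        print(name, percentiles(replay(host, queries)))

A real harness would parallelize the replay and pull statements from the production query log, but the core of the comparison is the percentile diff shown above.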

Why replay these queries on the previous version even though we already have the production data to compare p50, p75 and p99 values? Firstly, we haven't built a system to replay batches at identical timestamps (though it is on our roadmap). Without that, a replay run at a later time is not directly comparable to the original production run, since cluster load and the underlying source data may have changed; it is better to compare both versions against identical source data at the same point in time. Secondly, identical queries can scan more data when run at a later date, depending on which partitions they scan and how they are written. If the p50, p75 and p99 values are comparable on both versions in the performance testing cluster, we approve the release fleet-wide.

In the case of the upgrade from version 359 to 365, there were no semantic differences between the versions. However, the wall times increased significantly. From these data points (based on two days of volume with a total of 10K queries), the team concluded that there was a regression in version 365.

Trino 365: significant performance regression

We also wanted to see how this would impact real-time traffic to the Adhoc and Scheduled clusters. We rolled out version 365 to one instance each of our production Scheduled and Adhoc clusters. Over the course of a week, we monitored metrics and determined these regressions were also visible in the production environments. The team decided to investigate the regression and find the commits that were slowing down the clusters.

Troubleshooting

The first place we looked was at changes to default config settings, which can be a subtle cause of regressions since they are not always explicitly documented in the release notes. Sure enough, we found that the default value of hive.split-loader-concurrency had been changed from 4 to 64. Changing it back improved the situation, but we still saw a regression. Next, we wrote a simplified query to demonstrate the problem and look for large increases in wall time between versions 359 and 365:

Query:
EXPLAIN ANALYZE SELECT count(*)
FROM hive.default.event_client_action
WHERE ((ds >= '2021-08-26') AND ds <= '2021-10-21')

This query's wall time increased by about 2x between versions 359 and 365 when we ran it by itself on a Lyft Trino cluster. To find the culprit, we ran the suspect query on each release from 359 to 365 individually, which narrowed the regression down to version 360. We then ran the test query in a continuous loop (a minimal sketch of such a loop follows the profiler numbers below) and analyzed the Java profiler results from our Trino clusters. This is what we found:

Trino 359: addSplits is taking 0.47% of total CPU time
Trino 360: addSplits is taking 20.49% of total CPU time
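The continuous loop itself needs very little machinery: something like the sketch below keeps the coordinator busy with the suspect query while a Java profiler is attached to it. The host name and user are hypothetical, and the client is again the open source trino Python package.

Loop sketch (Python):
# loop_query.py: run the suspect query repeatedly so the profiler sees steady load.
import time
import trino  # pip install trino

SQL = """
EXPLAIN ANALYZE SELECT count(*)
FROM hive.default.event_client_action
WHERE ((ds >= '2021-08-26') AND ds <= '2021-10-21')
"""

# Hypothetical test coordinator running the release under investigation.
conn = trino.dbapi.connect(host="trino-bisect.example.internal", port=8080,
                           user="perf-loop", catalog="hive", schema="default")

while True:
    cur = conn.cursor()
    start = time.monotonic()
    cur.execute(SQL)
    cur.fetchall()
    print(f"wall time: {time.monotonic() - start:.1f}s")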

The team noticed a large difference in time spent in addSplits on the coordinator between the two versions: 0.47% of CPU time in version 359 versus 20.49% in version 360, an increase of roughly 4300%. We traced this to a commit that introduced a synchronized triggerUpdate call, which slowed down the scheduler when planning and adding splits for processing. We reverted the change, reran the query on version 360, and confirmed that the regression was gone. One of our engineers contributed a fix to the Trino community, and it was released in version 370.

When we applied the patch we had submitted for the 360 regression to version 365, the slowdowns improved but did not disappear, which meant there was another performance regression between versions 360 and 365. The team decided to try a different strategy to find it: we profiled a predetermined 30 minute window of a Scheduled cluster's workload. First, we bisected the releases again, looking for clear p50 and p75 slowdowns on that workload, and found that version 365 was 80% slower than version 364. We then enabled the profiler and generated flame graphs (see below). The results were striking: the coordinator spent 3.08% of CPU time in the filterRequest function on version 364 versus 15.37% on version 365. This pointed to a newly introduced library that used the Java ServiceLoader class when building a JWT builder, which introduced the performance regression in internal communication authentication. The Trino community patched this issue to use a default serializer/deserializer in version 371.

Trino 364: filterRequest is taking 3.08% of total CPU time
Trino 365: filterRequest is taking 15.37% of total CPU time

We then ran version 365 with the two patches and found that all of the regressions were gone: the p75 wall time for 550 queries within a 20 minute time frame was in line with version 359.

Trino 365: performance testing with/without patch

Summary & What’s Next

It was a long journey for the Lyft Data Infra team to get Trino version 365 fully seasoned and into our production clusters. We successfully fixed the performance regressions, set performance benchmarks, identified new release regressions, and shipped the new features to our customers. We were grateful to reach this finish line, but at the same time we are standing at another starting line.

We believe it should not be difficult to adopt new releases from the open source world. Manual effort should be reduced, if not fully removed, from our upgrade processes. Since our Lyft-customized branches already stay close to upstream, one step we could take is to fully automate the branch merging and building process so that every new release can easily be merged into our branch and automatically tested for compatibility.

We are one of the early adopters of the Trino community's new releases at large data scale. There is potential to leverage Lyft's data scale to help the Trino community better understand the performance of each release, and to avoid long-lived regressions introduced by commits that have not been tested at high volume. The Data Infra team is dedicated to further strengthening our performance testing strategy so that it covers real-world scenarios. We will restructure our analytical tools so that they can be used in more generic ways, both internally and externally in the community. We are also considering releasing Trino performance benchmarks based on our real day-to-day query scenarios and volume, to give the community more information before moving to a specific release. Lyft currently processes 250K queries over 50PB of data per day, so these benchmarks could help other Trino users with similar or smaller data volumes make decisions about version upgrades.

The Trino community is still growing rapidly and is full of supporters and new members. We’re looking forward to contributing back to the community and growing it together in the future. If you’re interested in joining us, we’re hiring!

A huge thanks to Akash Katipally and James Taylor who completed the last upgrade, kudos to Ritesh Varyani for identifying the regressions, and thanks to the rest of the Data Platform Infra team who helped to improve, scale, and support Trino at Lyft.
