How we learned to improve Kubernetes CronJobs at Scale (Part 1 of 2)

Kevin Yang
Published in Lyft Engineering
Aug 3, 2020 · 10 min read


At Lyft, we chose to move our server infrastructure onto Kubernetes, a distributed container orchestration system, in order to take advantage of automation, build on a solid platform, and lower overall cost through efficiency gains.

Distributed systems can be difficult to reason about and understand, and Kubernetes is no exception. Despite the many benefits of Kubernetes, we discovered several pain points while adopting Kubernetes’ built-in CronJob as a platform for running repeated, scheduled tasks. In this two-part blog series, we will dive deep into the technical and operational shortcomings of Kubernetes CronJob at scale and share what we did to overcome them.

Part 1 (this article) of this series discusses in detail the shortcomings we’ve encountered using Kubernetes CronJob at Lyft. In part 2, we share what we did to address these issues in our Kubernetes stack to improve usability and reliability.

Who is this for?

  • Users of Kubernetes CronJob
  • Anyone building a platform on top of Kubernetes
  • Anyone interested in running distributed, scheduled tasks on Kubernetes
  • Anyone interested in learning about Kubernetes usage at scale in the real-world
  • Kubernetes contributors

What will you gain from reading this?

  • Insight into how parts of Kubernetes (in particular, CronJob) behave at scale in the real-world.
  • Lessons learned from using Kubernetes as a platform at a company like Lyft, and how we addressed the shortcomings.

Prerequisites

  • Basic familiarity with the cron concept
  • Basic understanding of how CronJob works, specifically the relationship between the CronJob controller, the Jobs it creates, and their underlying Pods, in order to better understand the CronJob deep-dives and comparisons with Unix cron later in this article.
  • Familiarity with the sidecar container pattern and what it is used for. At Lyft, we make use of sidecar container ordering to make sure that runtime dependencies like Envoy, statsd, etc., packaged as sidecar containers, are up and running prior to the application container itself.

Background & Terminology

  • The cronjobcontroller is the piece of code in the Kubernetes control-plane that reconciles CronJobs
  • A cron is said to be invoked when it is executed by some machinery (usually in accordance to its schedule)
  • Lyft Engineering operates on a platform infrastructure model where there is an infrastructure team (henceforth referred to as platform team, platform engineers, or platform infrastructure) and the customers of the platform are other engineers at Lyft (henceforth referred to as developers, service developers, users, or customers). Engineers at Lyft own, operate, and maintain what they build, hence “operate” is used throughout this article.

CronJobs at Lyft

Today at Lyft, we run nearly 500 cron tasks with more than 1500 invocations per hour in our multi-tenant production Kubernetes environment.

Repeated, scheduled tasks are widely used at Lyft for a variety of use cases. Prior to adopting Kubernetes, these were executed using Unix cron directly on Linux boxes. Developer teams were responsible for writing their crontab definitions and provisioning the instances that run them using the Infrastructure As Code (IaC) pipelines that the platform infrastructure team maintained.

As part of a larger effort to containerize and migrate workloads to our internal Kubernetes platform, we chose to adopt Kubernetes CronJob* to replace Unix cron as a cron executor in this new, containerized environment. Like many others, we chose Kubernetes for its theoretical benefits, one of which is efficient resource usage.

Consider a cron that runs once a week for 15 minutes. In our old environment, the machine running that cron sits idle 99.85% of the time. With Kubernetes CronJob, compute resources (CPU, memory) are only used during the lifetime of a cron invocation. The rest of the time, Kubernetes can efficiently use those resources to run other CronJobs or scale down the cluster altogether. Given the previous method for executing cron tasks, there was much to gain by transitioning to a model where jobs are ephemeral.

The platform and developer ownership boundary in Lyft’s K8s stack

Since adopting Kubernetes as a platform, developer teams no longer provision and operate their own compute instances. Instead, the platform engineering team is responsible for maintaining and operating the compute resources and runtime dependencies used in our Kubernetes stack, as well as generating the Kubernetes CronJob objects themselves. Developers need only configure their cron schedule and application code.

This all sounds good on paper, but in practice, we discovered several pain points in moving crons away from the well-understood environment of traditional Unix cron to the distributed, ephemeral environment of Kubernetes using CronJob.

* while CronJob was, and still is (as of Kubernetes v1.18), a beta API, we found that it fit the bill for the requirements we had at the time, and further, it fit in nicely with the rest of the Kubernetes infrastructure tooling we had already built.

What’s so different about Kubernetes CronJob (versus Unix cron)?

A simplified sequence of events and K8s software components involved in executing a Kubernetes CronJob

To better understand why Kubernetes CronJobs can be difficult to work with in a production environment, we must first discuss what makes CronJob different. Kubernetes CronJobs promise to run like cron tasks on a Linux or Unix system; however, there are key differences in their behavior compared to Unix cron, most notably in two areas: startup performance and failure handling.

Startup Performance

We begin by defining start delay to be the wall time from expected cron start to application code actually executing. That is, if a cron is expected to run at 00:00:00, and the application code actually begins execution at 00:00:22, then the particular cron invocation has a start delay of 22 seconds.

Traditional Unix crons experience very minimal start delay. When it is time for a Unix cron to be invoked, the specified command just runs. To illustrate this, consider the following cron definition:

# run the date command at midnight every night
0 0 * * * date >> date-cron.log

With this cron definition, one can expect the following output:

# date-cron.log
Mon Jun 22 00:00:00 PDT 2020
Tue Jun 23 00:00:00 PDT 2020

On the other hand, Kubernetes CronJobs can experience significant start delays because they require several events to happen prior to any application code beginning to run. Just to name a few:

  1. cronjobcontroller processes and decides to invoke the CronJob
  2. cronjobcontroller creates a Job out of the CronJob’s Job spec
  3. jobcontroller notices the newly created Job and creates a Pod
  4. Kubernetes admission controllers inject sidecar Container specs into the Pod spec*
  5. kube-scheduler schedules the Pod onto a kubelet
  6. kubelet runs the Pod (pulling all container images)
  7. kubelet starts all sidecar containers*
  8. kubelet starts the application container*

* unique to Lyft’s Kubernetes stack
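
This chain of objects can be traced on a live cluster with standard kubectl commands. A quick sketch, using a hypothetical CronJob named nightly-report (the exact object names will differ in your environment):

# the CronJob object that the cronjobcontroller reconciles
kubectl get cronjob nightly-report

# Jobs stamped out from the CronJob's Job spec, one per invocation
kubectl get jobs | grep nightly-report

# Pods created by the jobcontroller carry a job-name label
kubectl get pods -l job-name=<job-name-from-above>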

At Lyft, we found that start delay was especially compounded by #1, #5, and #7 once we reached a certain scale of CronJobs in our Kubernetes environment.

Cronjobcontroller Processing Latency

To better understand where this latency comes from, let’s dive into the source-code of the built-in cronjobcontroller. Through Kubernetes 1.18, the cronjobcontroller simply lists all CronJobs every 10 seconds and does some controller logic over each. The cronjobcontroller implementation does so synchronously, issuing at least 1 additional API call for every CronJob. When the number of CronJobs exceeds a certain amount, these API calls begin to be rate-limited client-side. The latencies from the 10 second polling cycle and API client rate-limiting add up and contribute to a noticeable start-delay for CronJobs.
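
As a rough back-of-the-envelope illustration (the exact numbers depend on how kube-controller-manager is configured): with the client-side rate limit commonly left at its default of around 20 requests per second, a sync pass over roughly 500 CronJobs that issues at least one API call each needs on the order of 25 seconds just to drain its request queue, several times longer than the 10 second polling interval, and every one of those seconds shows up as start delay.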

Scheduling Cron Pods

Due to the nature of cron schedules, most crons are expected to run at the top of the minute (XX:YY:00). For example, an @hourly cron is expected to execute at 01:00:00, 02:00:00, and so on. In a multi-tenant cron platform with lots of crons scheduled to run every hour, every 15 minutes, every 5 minutes, etc., this produces hot-spots where lots of crons need to be invoked simultaneously. At Lyft, we noticed that one such hot-spot is the top of the hour (XX:00:00). These hot-spots put strain on the control-plane components involved in the happy path of CronJob execution, such as the kube-scheduler and kube-apiserver, and expose additional client-side rate-limiting, causing start delay to increase noticeably.
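
To make the hot-spot concrete, here are a few hypothetical schedules written crontab-style:

# all of the following fire together at the top of the hour (XX:00:00)
0 * * * *    hourly-report
*/15 * * * * sync-partner-data
*/5 * * * *  refresh-cache
0 0 * * *    nightly-cleanup

At XX:00:00 all four want to start at once; multiply that by hundreds of tenants, and the control-plane sees a burst of Job and Pod creations at the top of every hour.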

Additionally, if you do not provision compute for peak demand (and/or use a cloud-provider for compute instances) and instead use something like cluster autoscaler to dynamically scale nodes, then node launch times can contribute additional delays to launching CronJob Pods.

Pod Execution: Non-application Containers

Once a CronJob Pod has successfully scheduled onto a kubelet, the kubelet needs to pull and execute the container images of all sidecars and the application itself. Due to the way Lyft uses sidecar ordering to gate application containers, if any of these sidecar containers are slow to start, or need to be restarted, they will propagate additional start delay.
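
One plain-kubectl way (not specific to Lyft's stack) to see which container is holding a cron Pod up is to dump each container's status, or to read the Pod's event stream for slow image pulls and restarts:

# print each container's name and current state for a given cron Pod
kubectl get pod <pod-name> -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'

# or inspect the event stream for image pulls, restarts, etc.
kubectl describe pod <pod-name>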

To summarize, each of these events that happen prior to application code actually executing combined with the scale of CronJobs in a multi-tenant environment can introduce noticeable and unpredictable start delay. As we will see later on, this start delay can negatively affect the behavior of a CronJob in the real-world by causing CronJobs to miss runs.

Container Failure handling

It is good practice to monitor the execution of crons. With Unix cron, doing so is fairly straightforward. Unix crons interpret the given command with the specified $SHELL, and, when the command exits (whether successfully or not), that particular invocation is done. One rudimentary way of monitoring a Unix cron, then, is to introduce a command-wrapper script like so:
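
A minimal sketch of such a wrapper (my-cron-command stands in for the real command, and stat-and-log for whatever metric/log emitter is available):

#!/bin/sh
# run the actual cron command and capture its exit code
my-cron-command
exitcode=$?
# emit exactly one metric and log line per invocation, success or failure
stat-and-log "cron finished with exit code ${exitcode}"
exit $exitcode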

With Unix cron, stat-and-log will be executed exactly once per complete cron invocation, regardless of the $exitcode. One can then use these metrics for simple alerts on failed executions.

With Kubernetes CronJob, where there are retries on failures by default and an execution can have multiple failure states (Job failure and container failure), monitoring is not as straightforward.

Using a similar script in an application container, and with Jobs configured to restart on failure, a failing CronJob will instead repeatedly execute, spewing metrics and logs up to BackoffLimit times and introducing lots of noise for a developer trying to debug it. Additionally, a naive alert on the first failure from the wrapper script can be un-actionable noise, as the application container may recover and complete successfully on its own.

Alternatively, you could alert at the Job level instead of the application container level using an API-layer metric for Job failures like kube_job_status_failed from kube-state-metrics. The drawback of this approach is that an on-call won’t be alerted until the Job has reached the terminal failure state once BackoffLimit has been reached, which can be much later than the first application container failure.

What causes CronJobs to fail intermittently?

Non-negligible start delay and retry-on-failure loops contribute additional delay that can interfere with the repeated execution of Kubernetes CronJobs. For frequent CronJobs, or those with long application execution times relative to idling time, this additional delay can carry over into the next scheduled invocation. If the CronJob has ConcurrencyPolicy: Forbid set to disallow concurrent runs, then this carry-over causes future invocations to not execute on-time and get backed up.

Example timeline (from the perspective of the cronjobcontroller) where startingDeadlineSeconds is exceeded for a particular hourly CronJob — the CronJob misses its run and won’t be invoked until the next scheduled time

A more sinister scenario that we observed at Lyft where CronJobs can miss invocations entirely is when a CronJob has startingDeadlineSeconds set. In that scenario, when start delay exceeds the startingDeadlineSeconds, the CronJob will miss the run entirely. Additionally, if the CronJob also has ConcurrencyPolicy set to Forbid, a previous invocation’s retry-on-failure loop can also delay the next invocation, causing the CronJob to miss as well.
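
For reference, all of the knobs discussed above live on the CronJob object itself. A minimal, hypothetical example (applied here via a shell heredoc; names and values are illustrative only):

# hypothetical CronJob showing the fields discussed above
cat <<'EOF' | kubectl apply -f -
apiVersion: batch/v1beta1        # CronJob is still a beta API as of Kubernetes v1.18
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 0 * * *"          # midnight every night
  startingDeadlineSeconds: 300   # skip the run entirely if it cannot start within 5 minutes
  concurrencyPolicy: Forbid      # never allow two invocations to overlap
  jobTemplate:
    spec:
      backoffLimit: 3            # retry a failed Pod up to 3 times before the Job is marked failed
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: report
            image: example.com/report:latest
            command: ["run-report"]
EOF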

The Real-world operational burden of Kubernetes CronJobs

Since beginning to move these repeated, scheduled tasks onto Kubernetes, we found that using CronJob out-of-the-box introduced several pain-points from both the developers’ and the platform team’s points of view that began to negate the benefits and cost-savings we initially chose Kubernetes CronJob for. We soon realized that neither our developers nor the platform team were equipped with the necessary tools for operating and understanding the complex life cycles of CronJobs.

Developers at Lyft came to us with lots of questions and complaints when trying to operate and debug their Kubernetes CronJobs like:

  • “Why isn’t my cron running?”
  • “I think my cron stopped running. How can I tell if my cron is actually running?”
  • “I didn’t know the cron wasn’t running, I just assumed it was!”
  • “How do I remedy X failed cron? I can’t just ssh in and run the command myself.”
  • “Can you explain why this cron seemed to miss a few schedules between X and Y [time periods]?”
  • “We have X (large number) of crons, each with their own alarms, and it’s becoming tedious/painful to maintain them all.”
  • “What is all this Job, Pod, and sidecar nonsense?”

As a platform team, we were not equipped to answer questions like:

  • How do we quantify the performance characteristics of our Kubernetes Cron platform?
  • What is the impact of on-boarding more CronJobs onto our Kubernetes environment?
  • How does running multi-tenant Kubernetes CronJobs perform compared to single-tenant Unix cron?
  • How do we begin to define Service-Level-Objectives (SLOs) to communicate with our customers?
  • What do we monitor and alarm on as platform operators to make sure platform-wide issues are tended to quickly with minimal impact on our customers?

Debugging CronJob failures is no easy task, and often requires an intuition for where failures happen and where to look for proof. Sometimes that evidence is difficult to dig up: cronjobcontroller logs, for example, are only emitted at a high verbosity log-level, and Kubernetes Events on the CronJob, Job, and Pod objects themselves are retained for only one hour by default, so the traces simply disappear after a while and debugging becomes a game of “whack-a-mole”. None of these methods are easy to use, and they do not scale well from a support point-of-view as more and more CronJobs land on the platform.
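
Digging up those Events looks something like the following, and only yields anything inside the retention window (one hour by default):

# Events referencing a specific CronJob, Job, or Pod, newest last
kubectl get events --field-selector involvedObject.name=<object-name> --sort-by=.lastTimestamp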

In addition, sometimes the cronjobcontroller would simply give up on a CronJob entirely after it had missed too many runs, requiring someone to manually “un-stick” the CronJob. This happens in real-world usage more often than you would think, and became painful to remedy manually each time.

This concludes the dive into the technical and operational issues we’ve encountered using Kubernetes CronJob at scale. In Part 2 we share what we did to address these issues in our Kubernetes stack to improve the usability and reliability of CronJobs.

As always, Lyft is hiring! If you’re passionate about Kubernetes and building infrastructure platforms, read more about our work on the Lyft Engineering blog and join our team!
