
Application Security in a DevOps Environment
It seems like every AppSec vendor pitch talks about how you can shift security “to the left” and how they can help you transition to “DevSecOps”. When I hear these pitches, I often think to myself, “I don’t think that word means what you think it does.” While it’s great that security is embracing DevOps-style engineering (yay!), and it’s great that many vendors are thinking about how their tools fit in these environments (yay!), I don’t see much discussion of what DevSecOps looks like when it’s done well. When I joined Lyft’s Security Team, surviving (let alone thriving) in its DevOps environment was a challenge, but working there taught me a number of things about doing AppSec in a DevOps environment.
Are you my target audience?
There is plenty of advice for making your AppSec team function better in a DevOps environment, but this article is specifically about the characteristics of Lyft’s AppSec program that I feel support our DevSecOps efforts well. If you work on an AppSec team, maybe our experiences can help you think through how you work with your DevOps engineering teams. If you’re a vendor and your product doesn’t support these practices, then let’s delay talking until it does.
DevAppSecOps, a word which here means…
A vendor recently described Lyft’s engineering process as “extreme DevOps”. We have a lot of teams working on different products and features. Each team owns its services and is responsible for meeting SLAs for availability and response time. Teams have almost complete control over how their services are built and run, as long as they maintain their SLAs. Most teams use this freedom to move fast, often deploying new features several times a day. With that freedom comes responsibility: teams must secure their services and ensure security issues are fixed within an established timeframe.
Adding a traditional AppSec program to this would be labor intensive. When building the AppSec program at Lyft, we had to re-think how we engage teams, and develop tooling to automate integrating security throughout the development flow.
When looking at the projects in Lyft’s AppSec program that have been successful, a few themes stand out:
- Everything has to be measured
- Security’s input needs to be timely and respect the developer’s time
- We need continuous feedback loops between processes
We don’t embody these themes in everything we do, but we’ve had enough successful examples of each that I believe they’re worth sharing.
Everything must be Measured
Tools and processes need to be measurable. Not only do we need to collect the measurements, we need people watching those metrics. Some metrics can be watched by the AppSec team, but for many it’s far more effective when the development teams monitor the metrics and are held responsible for keeping them within a reasonable threshold.
At Lyft, each team maintains one or more dashboards for each of the services they run. These show metrics such as the service’s error rates, response times, and the percentage of time the service has been within the SLAs it promises to the services that rely on it. The Security Team is currently rolling out metrics on these service dashboards that track out-of-date patches, with an alarm that pages the team when security patches are left unapplied. Giving teams visibility into the risks lets them prioritize and schedule their own patching, instead of relying on the security team to monitor patch levels.
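As a rough illustration, here is a minimal sketch (not Lyft’s actual pipeline) of how a patch metric could be pushed into a team’s existing dashboards and alerting as a statsd-style gauge. The metric naming scheme, the service name, and the check that produces the list of missing patches are all assumptions made for the example.

```python
# Minimal sketch: emit a per-service "missing security patches" gauge so the
# owning team can graph it on their dashboard and page on it. The metric
# naming scheme and the patch-check helper are hypothetical.
from statsd import StatsClient  # pip install statsd

statsd_client = StatsClient(host="localhost", port=8125, prefix="security")

def report_patch_status(service_name, missing_patches):
    # One gauge per service; an alerting rule can page the owning team when
    # this value stays above zero past the agreed patching window.
    statsd_client.gauge("patches.missing.{}".format(service_name), len(missing_patches))

# Example: two security patches outstanding for a hypothetical service.
report_patch_status("rides-api", ["openssl-1.1.1k", "libxml2-2.9.12"])
```

The important design point is that the number and the alert threshold live in the team’s own dashboards and alerting config, not in a security-only console.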
There are many vendors who can show patch levels across your fleet. Most of those tools do a much better job than our naive scripting of analyzing packages on the system and figuring out which patches are missing. They produce prettier graphs. The problem with most of these systems is that they display the data to the security team, and rarely support getting that information to the people who have the power (and responsibility) to patch those instances, in a way that integrates with their workflow.
If you’re a security vendor, please support native exporting of your tool’s data to the tools our engineers are in every day: Wavefront, Grafana, Elasticsearch. Or let us write all the data into S3 so we can ingest it into our standard audit pipeline. We also need to scope reports on that data to the appropriate teams, so please support slicing data on AWS tags, ASG names, etc.
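To make the S3 ask concrete, here is a rough sketch of the kind of export we want: findings written into S3, keyed by the owning team (for example, derived from an AWS “team” tag), so a standard audit pipeline can ingest them and reports can be scoped to the right people. The bucket name, key layout, and findings format are assumptions for illustration.

```python
# Rough sketch of exporting scan findings to S3, partitioned by owning team
# and date. Bucket, key layout, and the shape of each finding are hypothetical.
import datetime
import json

import boto3

s3 = boto3.client("s3")

def export_findings(findings_by_team):
    today = datetime.date.today().isoformat()
    for team, findings in findings_by_team.items():
        key = "security/patch-scan/{}/{}.json".format(today, team)
        s3.put_object(
            Bucket="example-audit-pipeline",  # hypothetical bucket name
            Key=key,
            Body=json.dumps(findings).encode("utf-8"),
            ContentType="application/json",
        )

# Example payload: one instance owned by a hypothetical "payments" team.
export_findings({
    "payments": [{"instance_id": "i-0abc1234567890def", "missing": ["openssl-1.1.1k"]}],
})
```

Partitioning objects by team is what lets downstream reports slice cleanly to the people who can actually act on the data.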
Input must be Timely (and respect the engineer’s time)
When giving security input to an engineering team, if the input is not given at exactly the right time there’s a good chance it will just be filtered out as noise by the members of that team. With the speed of development and the velocity of change in our environment, engineers have to digest a firehose of information. Whether it’s all-engineering emails about changes to a service template, changes to the process for deploying a particular job, or best practices that teams have figured out and want other teams to adopt, engineers are bombarded by (good and helpful) information from other teams every day. To be productive as a developer, you have to filter out a lot of noise. The security team telling you it’s cybersecurity awareness month, so please don’t XSS or fall for phishing, becomes noise. Engineers need to be reminded about cross-site scripting when they are writing new frontend code, and about phishing when they are reading emails from questionable sources.
One place where I saw a significant gap in getting timely information to our developers was during the pull request (PR) process. We have static analysis tools running against PRs in GitHub that must all pass before the PR can be merged. But often those tests take 10–15 minutes to run. By the time our static analysis tools fail the build, the developer is often off working on another task, or working through code review with a peer. To make things worse, the UX for discovering why a Jenkins test failed isn’t intuitive for new engineers. This resulted in engineers asking the security team on Slack why a test was failing (interrupting flow for both the developer and the security team member who needed to answer their question), and a mean time to fix of over an hour. Worse, if the engineer didn’t understand the results, they would sometimes force-merge the PR, assuming it was a false positive or that they could fix it later.
Seeing this, we built a system (LASER) at Lyft to quickly give non-blocking security feedback on PRs. We try to give feedback within 30 seconds of the developer opening a PR or pushing a commit. This way the feedback is present before their peer looks at the PR for code review, and the developer is notified of the comment before they transition to another task. The comment from LASER gives a summary of the issue, with links to more information in case they aren’t familiar with the security issue that was found. This resulted in the average fix time dropping to 7 minutes.
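For a sense of the mechanics, here is a heavily simplified sketch of that style of feedback, not LASER’s actual implementation: on a pull request event, fetch the diff, run one fast check, and post a non-blocking comment through GitHub’s standard REST API. The regex rule, token handling, and documentation link are illustrative assumptions.

```python
# Simplified sketch of fast, non-blocking PR feedback: fetch the diff for a
# pull request, run a quick pattern check, and leave an advisory comment.
# The rule and the docs URL are placeholders, not real LASER checks.
import os
import re

import requests

GITHUB_API = "https://api.github.com"
HEADERS = {
    "Authorization": "token {}".format(os.environ["GITHUB_TOKEN"]),
    "Accept": "application/vnd.github.v3+json",
}

# Naive example rule: flag added lines that build SQL with string formatting.
SQL_FORMAT = re.compile(r'^\+.*execute\(\s*["\'].*%s', re.MULTILINE)

def review_pull_request(owner, repo, number):
    # Request the raw diff for the PR.
    diff = requests.get(
        "{}/repos/{}/{}/pulls/{}".format(GITHUB_API, owner, repo, number),
        headers={**HEADERS, "Accept": "application/vnd.github.v3.diff"},
    ).text

    if SQL_FORMAT.search(diff):
        # Post a non-blocking comment rather than failing the build.
        requests.post(
            "{}/repos/{}/{}/issues/{}/comments".format(GITHUB_API, owner, repo, number),
            headers=HEADERS,
            json={
                "body": "Heads up: this change appears to build SQL with string "
                        "formatting, which can lead to injection. Consider "
                        "parameterized queries. More info: "
                        "https://example.com/security/sql-injection"
            },
        )
```

Keeping the check fast and advisory is the point; anything that needs minutes of analysis belongs in the blocking CI checks instead.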
In addition to being fast, the tools’ results need to be very accurate so that developers’ time is respected. Every false positive wastes an engineer’s time (even if it’s only 7 minutes). Yes, tuning for precision means higher false-negative rates, but if we’re stopping the most common issues with no marginal work for the security team, then the team is freed up to work on surfacing the remaining issues in a fast and accurate way in our tooling.
Outputs from one process should be useful input to refine other processes
The best tools are ones that both address an existing set of issues and let us improve our entire AppSec process at the same time. What does this look like at Lyft? The security team tries to interact with the owner of a service at 13 points in their development process. For that to be possible with a relatively small AppSec team, those touchpoints are highly automated, and manual work is prioritized based on risk. When implementing the automation, we specifically look for ways that the outputs can be used as inputs into other automated processes.
- We use a self-assessment questionnaire that lets teams report what user data they are storing and where. The questionnaire gives developers automated feedback based on their answers, to head off common mistakes as they implement their service. Certain characteristics automatically flag the service for deeper review by the Security Team. We also use the results to update our data map and to inform how we prioritize that service for review and external security assessment.
- When vulnerabilities are found, we look for ways to implement high-signal rules in our scanning tools to detect them before deployment in the future (see the sketch after this list). Which brings up another issue for vendors: if your scanning tool doesn’t allow us to modify rules and write new ones, then your tool is significantly less useful.
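As a toy illustration of what we mean by adding a high-signal rule (this is not any particular vendor’s rule format), here is roughly the shape of a rule we might encode after finding, say, a bug that disabled TLS certificate verification. The rule IDs and scanner plumbing are hypothetical.

```python
# Toy illustration of encoding a lesson from a past vulnerability as a scanner
# rule. The rule format and IDs are made up for this example.
import re
from dataclasses import dataclass

@dataclass
class Rule:
    rule_id: str
    description: str
    pattern: re.Pattern

RULES = [
    # Example: a previous finding involved disabling TLS verification.
    Rule(
        rule_id="EXAMPLE-001",
        description="TLS certificate verification disabled",
        pattern=re.compile(r"verify\s*=\s*False"),
    ),
]

def scan_file(path):
    """Return (rule_id, line_number, description) tuples for each match."""
    findings = []
    with open(path) as handle:
        for lineno, line in enumerate(handle, start=1):
            for rule in RULES:
                if rule.pattern.search(line):
                    findings.append((rule.rule_id, lineno, rule.description))
    return findings
```

The value isn’t the regex itself; it’s being able to add a rule like this the day a vulnerability is found, so the same mistake is caught before it ships again.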
Most AppSec tooling and processes produce data that is useful input into other processes, but thinking through in advance how that will happen has helped ensure that when we invest in one aspect of our program, we improve several aspects of it simultaneously.
Summary
This would be a nice, popular Medium post if I promised that by doing these 3 things you’ll be super successful and someone will probably give you a unicorn. The reality is that AppSec is a lot of hard work, and there’s a reasonable chance I’m wrong about a lot of things. Please leave feedback in the comments, and I hope we can all learn more together!
Interested in working in an environment like this? Lyft is hiring! Apply through our application system, or drop me a note at csteipp@lyft.com.