Load Testing at Netflix: Virtual Interview with Coburn Watson

I exchanged several e-mails with Coburn Watson (@coburnw), Cloud Performance Engineering Manager at Netflix, and he was kind enough to share with me some very interesting information about load testing at Netflix. I believe that this information is too valuable to keep hidden and decided to share it in the form of a virtual interview (of course, after asking Coburn’s permission).

AP: I had a conversation with Adrian Cockcroft (@adrianco) at the Performance and Capacity conference about load testing at Netflix – he had a slide referring to using JMeter in Continuous Integration (CI) in his presentation. He suggested contacting you for details.

I would be interested in any detail you would be able to share. I saw your presentations on Slideshare – but you don’t talk about load testing there.

It is interesting that people don’t talk much about load testing – even if they do interesting things there. So it is difficult to understand what is going on in the industry. I still believe that load testing is an important part of performance risk mitigation and should be used in any mature company – but it looks like there are other opinions. Any thoughts?

CW: We do have an in-house utility for load testing which leverages JMeter as the driver with Jenkins as the coordinating framework.

With regard to the load testing we perform at Netflix, this takes two primary forms, both of which are driven by the service-oriented push model we operate under. With such a model, teams push anywhere from every two days to every three weeks, running canaries to identify both functional and performance risk that manifests under production load characteristics. In such a push model, the time required to execute a formal performance test with each deployment isn’t really there. This leads to us using the load-testing framework to test a new, or significantly refactored, service under production load to characterize its performance and scalability. Typically, it lets us validate whether we need EVCache (memcache) in addition to Cassandra, and perhaps additional optimizations. Of the multiple new or refactored services deployed last year (we have over a hundred distinct services), most were load tested by the service teams who developed them. Through such testing, all services made it into production without a hitch. The second use of our load-testing framework is to verify architectural changes to configuration (cross-coast load proxying) to quantify the system’s performance.

One additional use case involves an in-house benchmark suite used to quantify the variability of performance for identical loads on multiple instances of the same type, or cross-instances (e.g. m2.2xl vs m3.2xl) in some cases.
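As an illustration of what "quantifying variability" across instances can mean, here is a minimal sketch that computes the coefficient of variation (standard deviation divided by mean) of a benchmark result measured on several instances of one type. This is not Netflix's benchmark suite, which the interview does not describe; the function name and the throughput figures are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(samples):
    """Return sample standard deviation divided by mean for one benchmark.

    A value near 0 means the instances performed almost identically;
    larger values mean more instance-to-instance variability.
    """
    if len(samples) < 2:
        raise ValueError("need at least two instances to measure variability")
    return stdev(samples) / mean(samples)


# Hypothetical throughput (requests/s) for the same benchmark and load,
# measured on three instances of the same type.
same_type_throughput = [90.0, 100.0, 110.0]
cv = coefficient_of_variation(same_type_throughput)
print(f"coefficient of variation: {cv:.2f}")
```

Running the same calculation on results from a second instance type would give a comparable number for a cross-instance comparison.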

Having been through a few companies with quarterly or annual release cycles, subsequently landing at Netflix, I feel I have now seen a full spectrum of software development approaches. In the former model (quarterly, annual) it definitely makes sense to have formalized load testing as part of the development cycle, particularly when it’s a shipped product. Without such testing, the risk to the customer is too great, and a regression might not be caught until the product is in the customer’s hands. Running a SaaS provides Netflix with great flexibility. In the Netflix model, releases are backwards compatible, so if a push into production results in a significant performance regression (which escaped the canary analysis phase) we simply spin up instances on the old code base and take the new code out of the traffic path. I also believe the fact that we don’t “upgrade” our systems extends our flexibility, and that this is only possible when running on the cloud.

AP: Thank you very much for the detailed reply! So you don’t include load testing as a step in everyday CI, using canaries instead? And do you believe that load testing doesn’t add value for incremental service changes (on top of canaries)? A couple of concerns I see are that (1) small performance changes won’t be seen due to the variation of production workload and (2) you accept the risk that users routed to the canary would be exposed to performance issues or failures.

CW: Our production workload tends to be quite consistent week-over-week, and running a canary as part of the production cluster guarantees it sees the exact same workload traffic pattern and behavior, something very difficult to get right (or even close in most cases) with a load test. We have an “automated canary analysis (ACA)” framework that many services adopt. As part of a push, it deploys a multi-instance canary cluster alongside the production cluster (considered baseline). Approximately 300 metrics (in the base configuration) are evaluated over a period of many hours, comparing the canary and baseline data. It then scores the canary from 0 to 1. The score is a sliding scale and represents the risk of pushing the code base into production. Higher scores indicate minimal risk to a full push. The scoring guidelines and their interpretation vary by service and are constantly evolving. When applied effectively, we have seen it identify many performance problems that would have been difficult to detect in a load test.
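To make the idea of a 0-to-1 canary score concrete, here is a minimal sketch. The interview does not describe how ACA actually computes its score, so the rule used here is purely an assumption for illustration: a metric "passes" if the canary's mean is within a relative tolerance of the baseline's mean, and the score is the fraction of metrics that pass. The function name, the tolerance, and the metric names and values are all hypothetical.

```python
from statistics import mean


def canary_score(baseline, canary, tolerance=0.10):
    """Return a score in [0, 1] comparing canary metrics to baseline metrics.

    Assumed rule (not Netflix's ACA algorithm): a metric passes when the
    canary mean deviates from the baseline mean by at most `tolerance`
    (relative, in either direction). The score is the passing fraction.
    """
    shared = sorted(set(baseline) & set(canary))
    if not shared:
        raise ValueError("baseline and canary have no metrics in common")
    passed = 0
    for name in shared:
        base = mean(baseline[name])
        cand = mean(canary[name])
        if base == 0:
            ok = cand == 0
        else:
            ok = abs(cand - base) / abs(base) <= tolerance
        passed += 1 if ok else 0
    return passed / len(shared)


# Hypothetical samples collected from both clusters over the same period.
baseline = {
    "latency_ms": [100, 102, 98],
    "cpu_pct": [40, 42, 41],
    "error_rate": [0.010, 0.011, 0.009],
}
canary = {
    "latency_ms": [101, 103, 99],
    "cpu_pct": [55, 57, 56],  # noticeably higher CPU on the canary
    "error_rate": [0.010, 0.010, 0.010],
}

score = canary_score(baseline, canary)
print(f"canary score: {score:.2f}")
```

In this toy data two of the three metrics stay within tolerance, so the canary scores about 0.67. A real framework would weigh metrics differently per service, which matches the point above that scoring interpretation varies by service.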

One practice which isn’t yet widely adopted but is used consistently by our edge teams (who push most frequently) is automated squeeze testing. Once the canary has passed the functional and ACA analysis phases, the production traffic is differentially steered at an increased rate against the canary, increasing in well-defined steps. As the request rate goes up, key metrics are evaluated to determine the effective carrying capacity, automatically determining whether that capacity has decreased as part of the push.
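The squeeze-testing loop described above can be sketched as follows. This is a simulation under stated assumptions, not Netflix's tooling: traffic steering is replaced by a hypothetical `measure_latency_ms(rate)` callable, latency is the only "key metric", and the toy latency model and all numbers are invented for illustration.

```python
def carrying_capacity(measure_latency_ms, start_rps, step_rps, max_rps,
                      latency_limit_ms):
    """Step the request rate up and return the highest rate that stayed
    within the latency limit (0 if even the first step exceeds it)."""
    capacity = 0
    rate = start_rps
    while rate <= max_rps:
        if measure_latency_ms(rate) > latency_limit_ms:
            break  # the key metric degraded; stop squeezing
        capacity = rate
        rate += step_rps
    return capacity


def toy_latency(saturation_rps):
    """Hypothetical latency model: latency climbs steeply as the request
    rate approaches the instance's saturation point."""
    def latency(rate):
        if rate >= saturation_rps:
            return float("inf")
        return 50.0 / (1.0 - rate / saturation_rps)
    return latency


# Baseline build saturates at 1000 rps; the canary build at 900 rps.
baseline_capacity = carrying_capacity(toy_latency(1000), 100, 100, 2000, 200.0)
canary_capacity = carrying_capacity(toy_latency(900), 100, 100, 2000, 200.0)
capacity_decreased = canary_capacity < baseline_capacity

print(baseline_capacity, canary_capacity, capacity_decreased)
```

The comparison at the end mirrors the automatic check mentioned above: the push is flagged when the canary's effective carrying capacity is lower than the baseline's.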

A key factor that makes such a canary-based performance analysis model work well is that we are on the cloud. If you are in a fixed-footprint deployment model, your deployed capacity is your total capacity. Given that we autoscale our services, in aggregate, by many thousands of instances a day (based on traffic rates), if we have either an increase in workload or a change in the performance profile, we can easily absorb it by adding a few more instances to the cluster (push of a button). We also run slightly over-provisioned in most configurations to absorb the loss of a datacenter (AZ) within a given AWS region, so we have flex-carrying capacity in place for most services already to absorb minor performance regressions. Overall the architecture, supporting frameworks, and flexibility of the cloud make all this possible for us. So, even though I will gladly stand up in support of such a performance risk mitigation strategy, it’s not for everyone. Unless, of course, they move to the cloud and adopt our architecture. It also works for a SaaS provider, but probably not for someone who is shipping off-the-shelf software.

I cover the ACA framework in my Surge presentation from 2013. If you have time I would recommend watching it, as it provides much more context around how Netflix optimizes for engineering velocity, even if it results in occasional failures. The benefit of our model is that failures tend to be very limited in scope, are identified quickly, and many times our robust retry and fallback frameworks fully insulate end users from any negative experience.

AP: Actually, canary testing is performance testing that uses real users to create load instead of creating synthetic load with a load-testing tool. It makes sense in your case when (1) you have very homogeneous workloads and can control them; (2) potential issues have minimal impact on user satisfaction and company image, and you can easily roll back the changes; and (3) you have a fully parallel and scalable architecture. You just trade the need to generate (and validate) workload for the possibility of minor issues and minor load variability. I guess the further you are from these conditions, the more questionable such a practice would be.

By the way, I guess a major part of ACA may be used in the case of generated load, too – the analysis should be the same, regardless of the way you apply load. Is there more information available about ACA anywhere (beyond your presentation)? Are there any plans to make it public in any way?

CW: For us, canary testing represents both a functional and performance testing phase. I do agree that if a customer’s architecture and push model differ significantly from ours, then canary testing might not be a great approach, but it could still bring value. It wasn’t an accident that we arrived at such a model. Many intentional choices were made to get here and the benefits are incredible in terms of both engineering velocity and system reliability.

I am hopeful that more details on our ACA framework will make it into the public domain, but I cannot guarantee any timelines.

Alex Podelko

Over the last sixteen years Alex Podelko supported major performance initiatives for Oracle, Hyperion, Aetna, and Intel in different roles including performance tester, performance analyst, performance architect, and performance engineer. Currently he is Consulting Member of Technical Staff at Oracle, responsible for performance testing and tuning of Hyperion...