Detecting Tiny Performance Regressions at Hyperscale
Abstract
This paper presents Meta’s performance testing and monitoring systems, which advance the state of the art in performance regression detection by catching regressions as small as 0.005%. These tiny regressions matter because of our large fleet size and their potential to accumulate over time. Detecting them, however, is challenging due to many sources of noise: heterogeneous machines, server failures, maintenance operations, load spikes, and so on. To detect tiny regressions despite such noise, Meta has developed two complementary systems, ServiceLab and FBDetect. ServiceLab is a pre-production testing platform that conducts A/B experiments in an isolated environment by reserving servers from our private cloud. While this approach avoids most production noise, heterogeneous machines remain a major challenge, since even machines of the same instance type in our private cloud may differ in subtle ways. To address this challenge, we conduct a large-scale study with millions of performance experiments to identify the machine factors, such as the kernel, CPU, and datacenter location, that introduce variance into test results. Moreover, we present statistical analysis methods to robustly identify small regressions. Despite ServiceLab’s success, we observe that pre-production testing has an inherent limitation: given the limited resources available for testing, it cannot fully reproduce the scale and complexity of the production system. As a result, some buggy code or configuration changes slip through. To catch such buggy changes, FBDetect monitors performance in production as the last line of defense. Unlike ServiceLab, FBDetect cannot control the machines or run repeated experiments.
Instead, to combat noise, FBDetect introduces advanced techniques to capture stack traces fleet-wide, measure fine-grained subroutine-level performance differences, filter out deceptive false-positive regressions, deduplicate correlated regressions, and analyze root causes. Both systems have been in production for over seven years. They detect regressions in thousands of services and ML models running on millions of servers. Each year they catch performance regressions that would otherwise waste millions of machines over the following years. This paper integrates two works, one titled “ServiceLab: Preventing Tiny Performance Regressions at Hyperscale through Pre-Production Testing”, originally published in OSDI 2024, and the other titled “FBDetect: Catching Tiny Performance Regression at Hyperscale through In-Production Monitoring”, originally published in SOSP 2024, both of which were invited for extended publication in ACM TOCS upon the recommendation of the Chairs of OSDI 2024 and SOSP 2024, respectively.
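To make the core idea concrete, the sketch below illustrates one generic way to decide whether an A/B experiment shows a regression larger than a small threshold despite run-to-run noise: bootstrap the relative difference in median latency between control and treatment runs and flag a regression only when the lower confidence bound clears the threshold. This is a minimal illustration under assumed inputs, not the actual statistical method used by ServiceLab or FBDetect; the function name, parameters, and thresholds are hypothetical.

```python
import random
import statistics

def detect_regression(control, treatment, n_boot=2000,
                      threshold=0.005, alpha=0.01, seed=0):
    """Hypothetical bootstrap check (not Meta's actual method):
    report a regression only if the treatment's median latency is
    worse than the control's by more than `threshold` (fractional),
    even at the pessimistic end of the bootstrap distribution.

    control, treatment: lists of latency samples from A/B runs.
    """
    rng = random.Random(seed)
    rel_diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement and compare medians;
        # medians damp the effect of outlier samples (noise spikes).
        c = statistics.median(rng.choices(control, k=len(control)))
        t = statistics.median(rng.choices(treatment, k=len(treatment)))
        rel_diffs.append((t - c) / c)
    rel_diffs.sort()
    # Lower (1 - alpha) confidence bound on the relative slowdown.
    lower_bound = rel_diffs[int(alpha * n_boot)]
    return lower_bound > threshold
```

The key design point the sketch captures is that a tiny regression is declared only when it is statistically separable from noise: a 1% slowdown across stable runs would trip the 0.5% threshold, while identical distributions would not.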