Seminars & Colloquia

Marcelo d'Amorim

Federal University of Pernambuco (UFPE)

"Using Noise to Detect Test Flakiness"

Tuesday, March 22, 2022, 1:15 PM
Location: 3211, EB2 NCSU Centennial Campus


Abstract: Imagine that you are a software engineer who has pushed code changes to a remote repository shared with multiple developers. Half an hour later, the integration service notifies you that two tests failed during the last execution. You are confused, as your changes looked simple, so you start inspecting them for an explanation of those failures. One hour later, you receive another notification from the integration service. This time, the server indicates that all tests have passed. You start to suspect that the problem was not with the code you changed but with the non-deterministic behavior of those two tests. After investing some time debugging, you confirm that the two tests are flaky. A flaky test is one that non-deterministically passes or fails in a fixed environment (e.g., the same machine and OS). Test flakiness is costly because developers spend precious time looking for a problem in the application that may not exist. It is also a serious problem in industry (e.g., at Google, Facebook, Mozilla, Twitter, and Microsoft).


In this talk, I will present my work on flaky test detection. Prior studies have shown that concurrent behavior is the most common cause of test flakiness. Based on that observation, we hypothesize that adding noise to the environment can interfere with the ordering of program events and, consequently, influence test outputs. We propose Shaker, a practical technique to detect flaky tests. Shaker detects flakiness by comparing the outputs of tests executed in carefully selected 'noisy' environments. A test run in Shaker is slower than a regular test run because Shaker executes the tests in loaded environments: the process that runs a test competes for resources (e.g., memory or CPU) with stressor tasks that Shaker creates. However, we conjecture that Shaker pays off by detecting test flakiness in fewer runs than the alternative of running the test suite multiple times in a regular noiseless environment. We refer to that alternative as ReRun.
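
The idea of running a test under load and comparing outcomes can be illustrated with a small Python sketch. This is not Shaker's implementation: the function names (`run_with_noise`, `looks_flaky`), the thread-based CPU stressors, and the default run counts are all assumptions chosen for illustration, and a real tool would likely use heavier, configurable stressors.

```python
import threading
from typing import Callable


def _burn_cpu(stop: threading.Event) -> None:
    """Hypothetical stressor: keep the CPU busy until asked to stop."""
    while not stop.is_set():
        sum(i * i for i in range(1000))


def run_with_noise(test: Callable[[], bool], stressors: int) -> bool:
    """Run `test` once while `stressors` background threads compete for CPU.

    Returns True if the test passed, False if it failed or raised.
    """
    stop = threading.Event()
    workers = [
        threading.Thread(target=_burn_cpu, args=(stop,), daemon=True)
        for _ in range(stressors)
    ]
    for worker in workers:
        worker.start()
    try:
        try:
            return bool(test())
        except Exception:
            return False
    finally:
        # Always shut the stressors down, even if the test raised.
        stop.set()
        for worker in workers:
            worker.join()


def looks_flaky(test: Callable[[], bool], runs: int = 5, stressors: int = 4) -> bool:
    """Flag a test as flaky if its outcome differs across noisy runs."""
    outcomes = {run_with_noise(test, stressors) for _ in range(runs)}
    return len(outcomes) > 1
```

Calling `looks_flaky` with `stressors=0` corresponds roughly to the ReRun baseline in this sketch: the same test is simply repeated without any added load.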


We evaluated Shaker on a public benchmark of flaky tests for Android applications using standard performance metrics (e.g., precision and recall) and ReRun as a comparison baseline. Results are encouraging. For example, we found that (1) Shaker is 98% precise, almost as precise as ReRun, which by definition does not report false positives; (2) Shaker’s recall is much higher than ReRun’s (95% versus 65%); and (3) Shaker detects flaky tests much more efficiently than ReRun, despite the execution overhead associated with introducing noise.
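
For readers less familiar with the metrics, the following sketch shows how precision and recall are computed for a flaky-test detector. The function name and the test names in the usage note are hypothetical and are not taken from the benchmark; treating an empty set as a score of 1.0 is a convention adopted here only to avoid division by zero.

```python
from typing import Set, Tuple


def precision_recall(reported: Set[str], truly_flaky: Set[str]) -> Tuple[float, float]:
    """Compute precision and recall of a detector's report.

    precision = correctly reported flaky tests / all reported tests
    recall    = correctly reported flaky tests / all truly flaky tests
    """
    true_positives = len(reported & truly_flaky)
    # Convention (an assumption of this sketch): empty denominators score 1.0.
    precision = true_positives / len(reported) if reported else 1.0
    recall = true_positives / len(truly_flaky) if truly_flaky else 1.0
    return precision, recall
```

For instance, with the made-up sets `reported = {"testA", "testB"}` and `truly_flaky = {"testA", "testB", "testC"}`, every reported test is truly flaky, so precision is 1.0, while recall is 2/3 because `testC` was missed. A detector that only reports tests it has observed both passing and failing, as ReRun does, cannot produce false positives, which is why its precision is perfect by definition.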


In the future, I plan to evaluate other mechanisms for introducing noise into the environment (e.g., resource throttling and test-specific noise generators) and to explore the idea of selectively introducing noise to debug flaky tests (i.e., to explain to the developer why a test is flaky). Shaker paved the way for those ideas.

Short Bio: Marcelo d'Amorim is an Associate Professor at the Federal University of Pernambuco (UFPE), Brazil. He obtained his PhD from the University of Illinois at Urbana-Champaign in 2007 and his MS and BS degrees from UFPE in 2001 and 1997, respectively. Marcelo's research goal is to help developers build correct software.

Host: Kathryn Stolee, CSC
