Analyzing build logs to get rid of flaky tests
How build logs can unlock a better build setup
March 18th, 2020
Kasper Møller Andersen
Logging is our livelihood at Humio, and log management at scale is not a trivial problem. We rely heavily on automated tests to stay confident in our code, but we have unfortunately accumulated a number of tests that flip between failing and succeeding from one run to the next. That happens because of the complexity of testing software that runs distributed across multiple servers, all doing many things concurrently.
This blog post describes how we have started analyzing our build logs to find the tests that are causing trouble, and to make sure we quickly catch new tests that turn out to be flaky.
Shipping logs to Humio
Our build status in Humio. Every build has statistics derived from the build logs, and each build gets a link that takes you to a Humio search for the logs from that particular build.
Our original goal with getting the logs into Humio was simply to make them accessible, not specifically to improve our tests. Our experience is that making logs more accessible often unlocks a great deal of problem-solving, and indeed, with all our build logs now easy to dissect, a few of us decided to do something about those tests. We had been annoyed with them for a while, but how big was the problem really?
Finding failed builds
Let’s search our build logs to see how many builds failed during February:
How many failed builds were there of our master branch in February? 127.
The first line in the query is a raw text search, which limits us to log lines that include the text “TEST FAILED -”. The second line narrows the logs further, by saying we only want logs where the field ref equals “refs/heads/master”, which is how we ask to only get logs from builds of our master branch. We’ll refine this later, but for now, we filter out development branches, because test failures caused by incomplete code would be noise in the data. And finally, we count the distinct values of the runId field among the remaining log lines, which tells us how many build runs failed.
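Since the query only appears as a screenshot above, here is a sketch of what it can look like in Humio’s query language; the field names runId and ref follow the description above:

```
"TEST FAILED -"
| ref = "refs/heads/master"
| count(field=runId, distinct=true)
```

Each line in the pipeline filters or aggregates the events that the previous line passed through.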
127 failed builds sounds like a lot. Let’s count the total number of builds of master in the same time frame:
How many builds of master did we have in total during February? 314
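Counting all builds is mostly a matter of dropping the failure filter. A sketch, assuming every build emits at least one log line carrying the runId and ref fields:

```
ref = "refs/heads/master"
| count(field=runId, distinct=true)
```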
314 total builds means more than a third of our builds in a month failed. Yikes, that’s bad! But which tests are actually causing problems? Let’s see:
Which tests had failed the most times during February?
Which tests are failing?
Now we’ve changed the first line of the query from a raw string search to a regular expression with a capture group. This lets us both filter out log lines that don’t match the expression and extract new fields from the log data. Here we extract the name of the test, and then use the top function to find the tests that fail most often.
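A sketch of that query; the capture group name testName is an assumption, and the regular expression would need to match however your test names actually appear in the log line:

```
/TEST FAILED - (?<testName>\S+)/
| ref = "refs/heads/master"
| top(testName)
```

The regex acts as both a filter and a field extractor: only matching lines pass through, and each gets a testName field that top can aggregate on.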
This is somewhat useful, but how do we tell whether a failed test was actually a spurious failure, and not just someone pushing bad code to master and fixing it soon after? We could try to guess from the logs whether a failure is spurious (e.g. based on how often it occurs), but it would still just be guesswork. The best way forward is to add the data we need to our logs so we can determine this reliably.
Improving our logs with retries
Our solution is to introduce a retry mechanism in our builds: if a test fails, retry it some number of times to see if the failure is spurious, and let the test pass as soon as any attempt succeeds. More importantly, also log every retry that happens.
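A minimal sketch of such a retry wrapper, in Python for illustration (our actual build integration differs); the log line formats and the name run_with_retries are made up for this example:

```python
def run_with_retries(test_name, run_test, max_attempts=3, log=print):
    """Run a test up to max_attempts times, logging every retry.

    run_test is a callable returning True on success. The "TEST RETRY"
    log line is what makes spurious failures visible in the build logs.
    """
    for attempt in range(1, max_attempts + 1):
        if run_test():
            log(f"TEST PASSED - {test_name} attempt={attempt}")
            return True
        if attempt < max_attempts:
            # Record the spurious failure before trying again.
            log(f"TEST RETRY - {test_name} attempt={attempt}")
    log(f"TEST FAILED - {test_name}")
    return False
```

The key design point is that a retried-then-passing test leaves a trace in the logs, so a green build no longer hides the flakiness.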
Under other circumstances, retrying like this would be a big anti-pattern. Diluting what it means for a build to be green is a dangerous path to go down, but the reason we are okay with it is the realization that we have been on this path for a long time already; after all, green builds are already diluted if you need to manually retry a build to make the tests pass. And indeed, this change actually helps us turn around and walk back the other way. Retrying automatically, while logging the retry, allows us to relieve the frustrations we have with failed builds in the short term, and gives us the insight we need to correct the failing tests in the long term.
So which tests are actually spurious?
Which tests have been retried the most times since the retry mechanism was introduced?
To find the spurious tests, we look for retried tests in builds that ended without any test failures. To get this data, we first filter out any log lines not pertaining to retried tests, but we still need information from the log lines that summarize how the build went. We use the join function for this: a separate subquery finds the log lines summarizing builds without failing tests, and the two results are merged where the runId and testType fields match.
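A sketch of what that join can look like; the literal log lines (“TEST RETRY -”, “BUILD PASSED”) and the captured testName field are stand-ins for whatever your build actually emits:

```
/TEST RETRY - (?<testName>\S+)/
| join({"BUILD PASSED"}, field=[runId, testType])
| top(testName)
```

The subquery in braces selects the build-summary lines, and the join keeps only the retry lines whose runId and testType also appear in that result, i.e. retries from builds that ultimately went green.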
This gives us the full picture across all branches: since a retry only counts when the build ultimately succeeded, we no longer need to restrict the search to master to filter out genuinely broken code. It’s also worth noting that we didn’t have to do any work in Humio to make these new logs accessible; as soon as the retry log line got shipped to Humio, it was available for searching, and we didn’t have to update any schemas or the like.
Improving our builds
This is where we are today. Our next step is to assign an owner to each flaky test, whose job it is to fix it. While that is going on, we will also annotate all the known flaky tests. This way we can differentiate them from the new flaky tests that will undoubtedly be introduced in the future, and use Humio to warn us when new ones show up.
So while we’re not done cleaning up our tests, I hope this has made for an interesting case study of how your build logs can unlock a better build setup for your team.
1 We read the logs from stdin, create JSON objects from them, and send them to the structured ingest endpoint on Humio. For logs with higher business value, you should use a more robust shipper that can handle resending if there is downtime, etc., but for our build logs, we just decided on a quick and simple solution.
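A minimal sketch of such a shipper, assuming Humio’s structured ingest endpoint (/api/v1/ingest/humio-structured); the base URL, token, and tag names are placeholders, and the exact event shape should be checked against the ingest API docs for your Humio version:

```python
import json
import time
import urllib.request


def build_payload(lines, tags):
    """Wrap raw log lines as the JSON body the structured ingest endpoint expects:
    a list of {tags, events}, each event carrying a timestamp and attributes."""
    return [{
        "tags": tags,
        "events": [
            {"timestamp": int(time.time() * 1000),  # epoch millis; assumed accepted
             "attributes": {"line": line}}
            for line in lines
        ],
    }]


def ship(lines, base_url, token, tags):
    """Fire-and-forget POST; a production shipper would buffer and retry."""
    req = urllib.request.Request(
        f"{base_url}/api/v1/ingest/humio-structured",
        data=json.dumps(build_payload(lines, tags)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Reading stdin and calling ship on batches of lines is then a few lines of glue; as noted above, logs with higher business value deserve a more robust shipper.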