Author: Loic Mathieu
Originally posted on Foojay.
“Sleep is not a synchronization primitive.”
Every test engineer, eventually
What’s a flaky test?
A flaky test is a test that sometimes passes and sometimes fails without any code changes. They’re the by‑product of non‑determinism: timing, concurrency, eventual consistency, network hiccups, clock drift, resource contention, and (our favorite) tests leaking state across runs.
Kestra is an open-source declarative orchestration platform designed to run, coordinate, and monitor large-scale, event-driven workflows. It is built to handle parallelism, asynchronous execution, and distributed systems at scale, exactly the kind of environment where determinism is hard and flaky tests tend to emerge.
At Kestra, we run 6,000+ tests across our repositories and add dozens more every day. If only 1% of those are flaky, each failing 10% of the time, you’ve got ~60 flaky tests. Expectation math says ~6 failures per CI run; good luck spotting real regressions under that noise.
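The expectation math above can be checked in a few lines (the counts come from the paragraph and are illustrative, not exact measurements):

```java
public class FlakeMath {
    public static void main(String[] args) {
        int totalTests = 6_000;
        double flakyShare = 0.01;   // 1% of tests are flaky
        double failProb = 0.10;     // each flaky test fails 10% of the time

        double flakyTests = totalTests * flakyShare;       // ~60 flaky tests
        double expectedFailures = flakyTests * failProb;   // ~6 failures per CI run

        // Probability that at least one flaky test fails, i.e. the build is red:
        double redBuildProb = 1 - Math.pow(1 - failProb, (int) flakyTests);

        System.out.printf("flaky tests: %.0f%n", flakyTests);
        System.out.printf("expected failures per run: %.1f%n", expectedFailures);
        System.out.printf("P(red build): %.3f%n", redBuildProb);
    }
}
```

With these assumptions, virtually every CI run is red: 1 − 0.9^60 ≈ 0.998.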
As an orchestration platform, many of our tests execute parallel, asynchronous workflows. Async is powerful but naturally tricky to test: ordering isn’t guaranteed, and “eventually consistent” is not a helpful assertion.
One of our top issues comes from our queuing system: a test may receive a message intended for another test, or miss a message from the queue. We strive to properly close the queue and handle all messages so that none leak across tests, but it’s challenging to guarantee this.
Last year, CI was red often enough that we decided to go on a proper flake‑hunting journey.
First try: retry them all!
Our first attempt to beat them all was simply to retry the flaky tests.
Kestra is built in Java, and tests are written with the JUnit framework. The JUnit Pioneer extension provides an annotation that retries a test if it fails: @RetryingTest(5). We added this annotation to every test that frequently failed in our CI.
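Under the hood, @RetryingTest behaves roughly like this hand-rolled loop; this is a plain-Java sketch of the idea, not JUnit Pioneer’s actual implementation:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RetryDemo {
    /** Runs the check up to maxAttempts times; succeeds as soon as one attempt passes. */
    static boolean retry(int maxAttempts, Runnable check) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                check.run();
                return true;           // one green attempt is enough
            } catch (AssertionError e) {
                // swallow the failure and retry, like @RetryingTest does between attempts
            }
        }
        return false;                  // all attempts failed: the test is reported as failing
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Simulate a flaky assertion that only passes on the third attempt.
        boolean passed = retry(5, () -> {
            if (calls.incrementAndGet() < 3) throw new AssertionError("flaky!");
        });
        System.out.println(passed + " after " + calls.get() + " attempts");
        // prints "true after 3 attempts"
    }
}
```

Notice what the loop cannot fix: if a failure is structural rather than probabilistic, every attempt fails and the retry only adds time.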
This helped… a bit. But it also inflated test times and masked real issues. Worse, some failures are structural (leaked resources, race conditions): once they fail, they keep failing, no matter how often you retry.
Verdict: good band‑aid, bad cure.
Second try: fix them all!
We then decided to put effort into fixing the failing tests! We removed all usages of the @RetryingTest(5) annotation and either fixed each test or disabled it.
Most of the flaky tests launch a workflow and assert on its execution, so we improved our testing framework in this area to make sure that every test properly closes its resources and that every workflow and execution created by a test is deleted.
For that, we created a JUnit extension to manage test resource creation:
- A @KestraTest annotation handles starting and closing the Kestra runner in the scope of a test class.
- A @LoadFlows annotation handles loading and then removing flows in the scope of a test method.
- An @ExecuteFlow annotation handles starting and then removing a flow execution in the scope of a test method.
Using this test framework everywhere gives us more control over resource allocation and deallocation, and lets us clean up any flow or execution created by a test to avoid polluting other tests with unrelated resources.
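A test class using these annotations looks roughly like the sketch below. The annotation definitions here are simplified stand-ins so the snippet compiles on its own (the real ones ship with Kestra’s test framework), the flow path is hypothetical, and the main method only mimics how a JUnit extension would read @LoadFlows to create resources before a test and delete them after:

```java
import java.lang.annotation.*;
import java.lang.reflect.Method;

// Stand-in for Kestra's class-level annotation: the extension behind it
// starts the runner before the class and stops it after.
@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.TYPE)
@interface KestraTest { }

// Stand-in for Kestra's method-level annotation: flows listed here are
// loaded before the test method and removed afterwards.
@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD)
@interface LoadFlows { String[] value(); }

@KestraTest
public class FlowLifecycleDemo {

    @LoadFlows({"flows/valids/minimal.yaml"})  // hypothetical flow file
    void shouldRunMinimalFlow() { }

    public static void main(String[] args) throws Exception {
        // An extension resolves the annotation roughly like this around each test:
        Method test = FlowLifecycleDemo.class.getDeclaredMethod("shouldRunMinimalFlow");
        for (String flow : test.getAnnotation(LoadFlows.class).value()) {
            System.out.println("load before test, delete after test: " + flow);
        }
    }
}
```

Tying resource lifetimes to annotations means cleanup runs even when the test body fails, which is exactly what prevents state from leaking into the next test.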
But after weeks of effort, we had disabled too many tests. And even though the number of flaky tests decreased, some still failed, if only rarely; with the high number of tests we have, that was still enough to make our CI suffer.
Third try: embrace the inevitability!
So tests will fail. We had to accept that: some pretty often, some rarely, but tests will fail.
We have to be pragmatic and embrace the inevitability of tests being flaky.
We decided to flag flaky tests and allow them to fail in the CI! This was not an easy decision, as nobody wants to concede defeat. But to keep CI reliable without compromising test coverage or exploding test implementation time, we had to stop disabling tests and accept that some would fail fairly often.
To flag a flaky test, we annotate it with @FlakyTest, a custom marker annotation that encapsulates JUnit’s @Tag("flaky") annotation.
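The marker looks roughly like this. @Tag below is a simplified stand-in for org.junit.jupiter.api.Tag so the sketch is self-contained, but the mechanism is the same: JUnit resolves the tag from the meta-annotation, so annotating a test with @FlakyTest implicitly tags it "flaky":

```java
import java.lang.annotation.*;

// Stand-in for org.junit.jupiter.api.Tag, just to keep this sketch self-contained.
@Target({ElementType.TYPE, ElementType.METHOD})
@Retention(RetentionPolicy.RUNTIME)
@interface Tag { String value(); }

// The custom marker: any test annotated @FlakyTest carries the "flaky" tag.
@Target({ElementType.TYPE, ElementType.METHOD})
@Retention(RetentionPolicy.RUNTIME)
@Tag("flaky")
@interface FlakyTest { }

public class FlakyTagDemo {
    @FlakyTest
    void sometimesFails() { }

    public static void main(String[] args) throws Exception {
        // The test engine discovers the tag by reading the meta-annotation:
        FlakyTest marker = FlakyTagDemo.class
                .getDeclaredMethod("sometimesFails")
                .getAnnotation(FlakyTest.class);
        String tag = marker.annotationType().getAnnotation(Tag.class).value();
        System.out.println("tag = " + tag);  // prints "tag = flaky"
    }
}
```

A dedicated marker also gives you one grep-able symbol for the whole flaky set, instead of string tags scattered across the codebase.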
JUnit tags are well suited for this use case: they let you target a group of tagged tests when running your test suite.
Our CI now launches tests in two steps:
- First, tests not tagged as flaky: these must pass for the CI run to be green.
- Then, tests tagged as flaky: these are allowed to fail.
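The gating rule boils down to: a run is green if and only if every non-flaky test passed, while flaky failures are merely reported. A minimal sketch of that rule (test names hypothetical, not our actual CI code):

```java
import java.util.List;

public class TwoStepCi {
    record TestResult(String name, boolean flaky, boolean passed) { }

    /** CI is green iff every non-flaky test passed; flaky failures never block. */
    static boolean ciGreen(List<TestResult> results) {
        return results.stream()
                .filter(r -> !r.flaky())
                .allMatch(TestResult::passed);
    }

    public static void main(String[] args) {
        List<TestResult> run = List.of(
                new TestResult("coreSchedulerTest", false, true),
                new TestResult("queueConsumerTest", true, false),   // flaky, allowed to fail
                new TestResult("flowValidationTest", false, true));

        System.out.println("green: " + ciGreen(run));
        // Flaky failures are surfaced in the report, but don't turn the build red:
        run.stream()
           .filter(r -> r.flaky() && !r.passed())
           .forEach(r -> System.out.println("flaky failure (non-blocking): " + r.name()));
    }
}
```

Splitting the run this way keeps the binary signal (merge or don’t) clean while still surfacing every flaky failure for follow-up.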
We also improved our CI to report standard and flaky tests differently, with a test summary in the PR comment that directly contains the list of failing tests with their stack traces. This lets us pinpoint test issues faster.
Of course, flagging a test as flaky is easy to do, so we are careful to first try to fix the test and only tag it as flaky as a last resort.
We also have test observability in place to track flaky tests, so if their number grows significantly, we will know.
Conclusion
You won’t beat every flaky test. That’s fine. The goal is to get reliable signals back into CI so you can confidently merge and ship. Separate what must be green from what’s allowed to wobble, invest in deterministic test lifecycles, and keep an eye on the flaky set so it doesn’t quietly grow.
Flakes are inevitable. Letting flakes dictate your delivery is optional.
Want to try Kestra? You can get started in 5 minutes following the quickstart guide.
The post Flaky Tests: a journey to beat them all appeared first on foojay.