
The Difference Between Nearly Clean and Really Clean


Nearly clean often feels clean enough, but is it?

Your test suite has a 99.2% success rate. Sounds pretty good, right? You’ve got thousands of tests, and less than 1% of them fail on any given run. Your team should be celebrating.

Instead, they’re ignoring the test results entirely.

Why? Because when you have 1000 tests and 1% of them fail randomly, that’s 10 failing tests every single run. And when people see 10 failures, their brain doesn’t think “probably 9 false positives and 1 real issue.” It thinks “ugh, the tests are flaky again.”
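The arithmetic behind that intuition is worth a quick sanity check. A minimal sketch, using the example figures above (per-test flakiness assumed independent for simplicity):

```python
# Expected false failures per run, and the odds of an all-green run,
# assuming each test flakes independently at the given rate.
n_tests = 1000
flake_rate = 0.01  # 1% false positive rate per test

expected_false_failures = n_tests * flake_rate
p_green_run = (1 - flake_rate) ** n_tests

print(f"Expected false failures per run: {expected_false_failures:.0f}")
print(f"Chance of an all-green run: {p_green_run:.6f}")
```

At a 1% flake rate, a fully green run of 1000 tests is vanishingly rare - so "some red" becomes the normal state, and red stops meaning anything.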

The False Positive Death Spiral

I once inherited a testing organization running thousands of network protocol tests. The numbers looked impressive: ~10,000 automated test scripts, a false positive rate under 1%, and comprehensive coverage across multiple products.

But here’s what actually happened every day:

Someone spent an hour reviewing the “failed” results. The conversation was always the same: “Hmm, this looks like it might be a real issue… but it could be that intermittent timing thing. Let’s wait and see if it happens again.”

When we did suspect a real bug and sent it to a developer, their response was predictable: “Probably a false positive. This area’s been working fine.”

By the time we had multiple failures in the same area, different people had made changes, and we’d start the finger-pointing dance of who should investigate first - because the feedback loop took way too long.

Meanwhile, our false positive rate slowly ticked upward as stress accumulated, maintenance time got squeezed, and real bugs snuck in amongst the noise.

The Magic of Really Clean

But we had a few test suites that were different: around 1,000 tests, with a false positive rate of roughly 0.1%.

When someone saw a test failure from these suites, their default assumption was: “There’s a bug.”

This created a completely different dynamic: people trusted the results and investigated failures immediately, while the code was fresh. When they found an actual false positive, they fixed the test, so the suite got cleaner over time, not worse.

The difference between 1% and 0.1% false positives wasn’t just statistical - it was psychological. It crossed the threshold where people’s default assumption switched from “probably noise” to “probably real.”
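One way to see why that threshold exists: ask what fraction of failures a reader should expect to be real. A sketch of the ratio, assuming for illustration roughly one genuine failure per run (that figure is mine, not from the data above):

```python
# Rough signal-to-noise for a single failing test, assuming ~1 real
# failure per run (an illustrative assumption, not measured data).
def p_failure_is_real(n_tests: int, flake_rate: float, real_failures: float = 1.0) -> float:
    false_failures = n_tests * flake_rate  # expected false positives per run
    return real_failures / (real_failures + false_failures)

print(p_failure_is_real(1000, 0.01))   # ~0.09: a failure is probably noise
print(p_failure_is_real(1000, 0.001))  # 0.5: a failure deserves a look
```

Dropping the flake rate tenfold moves a failure from "one in eleven chance it's real" to "even odds" - which is exactly where the default assumption flips.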

Breaking the Psychology of Broken Tests

The key thing here is that test suite trust isn’t just about numbers. It’s about human psychology.

When people expect test failures to be false positives, they: batch review failures instead of investigating immediately, look for excuses to dismiss failures rather than reasons to investigate, stop fixing test flakiness because “those tests are always flaky”, and eventually ignore test results altogether.

When people expect failures to indicate real issues, they: investigate immediately while context is fresh, fix flakiness when they encounter it, and create a virtuous cycle of improving test reliability.

Getting from Nearly to Really

So how do you cross that trust threshold? Based on our experience moving multiple test suites from “nearly clean” to “really clean”:

1. Accept that this will take significant effort. With 1000 tests at 1% false positive rate, you’ve got ~10 failures per run to investigate and fix. You need to get that down to ~1 failure per run before people will trust it.

2. Focus on one area at a time. Don’t try to fix everything at once. Pick your 50 most important tests and make them bulletproof. Better to have one “really clean” suite people trust than ten “nearly clean” ones they ignore.

3. The key test isn’t statistical - it’s belief. You’ll know you’ve crossed the threshold when your team’s gut reaction to a failure is “what’s wrong?” instead of “tests are flaky again.”

4. Hold the quality bar ruthlessly. Once you achieve “really clean,” defend it fiercely. We used a “fix it or remove it” policy - no exceptions, no special cases. The moment you let quality slip “just this once,” you’re back to the death spiral.

5. Make the status obvious. Clearly separate your “really clean” suites from your “work in progress” ones. People need to know which results they can trust.

6. Delete tests. People will scream at you for this one, but your flaky tests have negative value - they cost energy to maintain and aren’t actually catching issues. For the remaining tests where the cost of getting to really clean is high, consider just deleting them.
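Point 5 - making the status obvious - can be as simple as keeping the trusted and work-in-progress tests in separately labelled suites and gating only on the trusted one. A minimal self-contained sketch (the suite names and registration helper are made up for illustration; in a real project this maps to something like pytest markers or separate CI jobs):

```python
# Sketch: two explicitly labelled suites, so everyone knows which
# results they can trust. Only REALLY_CLEAN failures gate the merge.
REALLY_CLEAN = []   # bulletproof tests; a failure here means a real bug
QUARANTINE = []     # flaky / work in progress; informational only

def suite(bucket):
    """Decorator that registers a test function into the given suite."""
    def register(test_fn):
        bucket.append(test_fn)
        return test_fn
    return register

@suite(REALLY_CLEAN)
def test_core_checkout():
    assert sum([10, 5]) == 15  # stand-in for a real user journey check

@suite(QUARANTINE)
def test_flaky_report_export():
    assert True  # stand-in; being stabilised, never blocks a merge

def gate_passes():
    """Run only the trusted suite; the quarantine suite never gates."""
    for test_fn in REALLY_CLEAN:
        test_fn()
    return True

print(gate_passes())
```

The design point is the separation itself: a red result from the trusted bucket demands investigation, and nothing from the quarantine bucket can dilute that signal.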

Modern Test Flakiness Challenges

Today’s CI/CD environments create their own challenges: tests that pass locally but fail in CI, race conditions that only surface under load, environment-specific flakiness (containers, cloud resources), and timing issues with microservice dependencies.

But at heart the same principle applies. The goal isn’t perfect tests - it’s tests that people trust enough to act on.

A Real World Example

One team I worked with had a mobile app test suite with a 97% success rate. As we’ve already covered, that’s great on paper but terrible in practice. Every merge request showed multiple test failures, so developers skimmed or ignored them and just merged anyway.

We spent two weeks focusing only on their core user journey tests - about 20 tests total - and removed the rest from the pipelines (oh, the howls of agony at the idea of increasing value by removing tests!). We then fixed every source of flakiness we could find: explicit wait-for-x-output instead of sleeps, better test data setup, more reliable selectors.
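The "wait for x instead of sleep" change can be sketched as a small polling helper: the test waits exactly as long as the condition takes, up to a timeout, rather than a fixed (and always wrong) duration. The `server.is_ready` name in the usage comment is hypothetical:

```python
# Poll a condition until it holds or a timeout elapses - the usual
# replacement for fixed sleeps in end-to-end tests.
import time

def wait_for(condition, timeout=10.0, interval=0.1):
    """Return True as soon as `condition()` is truthy, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return bool(condition())  # one final check at the deadline

# Instead of:  time.sleep(5); assert server.is_ready()
# Write:       assert wait_for(server.is_ready, timeout=5)
```

The fixed sleep fails two ways at once: too short and the test flakes, too long and the suite crawls. Polling removes both failure modes, which is also why the suite below got faster as it got cleaner.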

The result? Those 20 tests went from a 97% to a 99.8% success rate - and the suite’s running time was reduced by a factor of three. Suddenly, when a test failed, developers immediately knew something was wrong. The team fixed more real bugs in the next month than they had in the previous quarter. And while adding the remaining tests back in over time went on the backlog, we’d already made the significant difference.

The Bottom Line

Nearly clean isn’t good enough. The gap between 99% and 99.9% success rates (or wherever it is) isn’t just about numbers - it’s about crossing the psychological threshold where people trust and act on test results.

If your team is ignoring test failures because “they’re probably flaky,” you haven’t reached really clean yet. Pick your most important tests, invest the time to make them bulletproof, and defend that quality ruthlessly.

Because a small suite that people trust and act on is infinitely more valuable than a comprehensive suite that people ignore.


Originally published on Edmund Pringle’s Substack. Follow Ed for more on software quality and engineering leadership.