
Coasting on Reputation


Life is complex; working everything out from first principles is hard. And time-consuming. So we all use proxies. Marks & Spencer == good food. Tesco Value less so. Volkswagen == quality cars. Lada less so.

But quite how these proxies get established is rarely clear. Nor is whether they remain accurate. Regardless, we continue using them.

Software proxies

Something similar happens with software. One oft-used proxy is code review. “Does your team review all your code?” is slowly morphing into “Do humans review all your code?”

But, really, asking about code-review is a proxy for quality. Teams that care about quality have rigorous dev processes. And one of those processes is code review.

The thing is, code review isn’t great at finding bugs. Multiple studies have shown that human code review is most effective only within narrow constraints: under 400 lines reviewed in under an hour. Even then, the majority of what’s flagged is style and maintainability. Finding the subtle - yet critical - bugs is extremely hard. I’m not alone in admitting I never enjoyed reviewing code - it was hard and unrewarding. Spending multiple hours to find nothing more than a few typos didn’t seem like a great use of time.

But we did it because, once in a while, it turned up great bugs - like the time I spotted a whole new multi-threaded feature lacked any support for concurrency. Things like that make the pain worthwhile.

The past tense

You’ll have spotted I’ve been writing about human code review in the past tense. That’s deliberate - it’s increasingly clear that the era of human code review is over. If you are interested in quality - and you should be - nowadays manual code review offers a very poor return on the time invested.

Instead AI has opened the door to some incredible opportunities. Opportunities that could result in better quality software than we could ever have dreamed of.

Unit testing is now essentially free. By default, Claude and Codex will produce a paltry few tests. But give them a goal - say, 100% line, region, or function coverage - and they will diligently work for hours building a test suite to achieve it. Even better, the door is starting to open for MC/DC testing.

Modified Condition/Decision Coverage requires showing that every condition within a decision independently affects its outcome. So if you have something like if (A && B || C), you need a test where flipping A alone (while holding B and C fixed) changes the result, and likewise for B and C.
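To make that concrete, here’s a minimal sketch in Python of an MC/DC test set for exactly that expression. The `decision` function is a made-up stand-in for the conditional above; the point is that three conditions need only four test cases, arranged so each condition gets an independence pair.

```python
def decision(a: bool, b: bool, c: bool) -> bool:
    """The decision from the text: A && B || C."""
    return (a and b) or c

# A minimal MC/DC set for n = 3 conditions needs only n + 1 = 4 cases.
# Each pair below differs in exactly one condition, and that single flip
# changes the outcome - demonstrating that each condition independently
# affects the decision.
mcdc_cases = [
    # (a,     b,     c,     expected)
    (True,  True,  False, True),   # baseline for the A-pair and B-pair
    (False, True,  False, False),  # flip A alone -> outcome flips
    (True,  False, False, False),  # flip B alone -> outcome flips
    (True,  False, True,  True),   # flip C alone (vs the line above) -> flips
]

for a, b, c, expected in mcdc_cases:
    assert decision(a, b, c) == expected
print("MC/DC: all 4 cases pass")
```

Note that plain branch coverage would be satisfied with just two of these cases; the extra pairs are what make MC/DC so much more demanding - and so much more expensive to write by hand.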

MC/DC is the standard for life critical systems - think flight control systems, or aircraft engine control systems.

Until now it has been prohibitively expensive for normal software to use. You can need hundreds of lines of test code for each line of production code. But with AI, it suddenly comes within reach.

Then there’s property testing. AI can build comprehensive property test suites. Or mutation test suites. Or fuzz test suites.
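For readers who haven’t met property testing: instead of asserting specific input/output pairs, you assert invariants that must hold for all inputs, then throw generated inputs at them. Here’s a hand-rolled sketch using only the standard library (dedicated tools like Hypothesis add input shrinking and smarter generation); the run-length encoder is a hypothetical example, not anything from the emulator project.

```python
import random

# Hypothetical function under test: a run-length encoder.
def rle(s: str) -> list[tuple[str, int]]:
    out: list[tuple[str, int]] = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)
        else:
            out.append((ch, 1))
    return out

def decode(pairs: list[tuple[str, int]]) -> str:
    return "".join(ch * n for ch, n in pairs)

# Properties: (1) decode(rle(s)) == s for any s (round-trip), and
# (2) no two adjacent runs share a character (canonical form).
random.seed(0)
for _ in range(1000):
    s = "".join(random.choice("ab") for _ in range(random.randrange(20)))
    pairs = rle(s)
    assert decode(pairs) == s
    assert all(x[0] != y[0] for x, y in zip(pairs, pairs[1:]))
print("property tests passed")
```

A thousand random cases like this routinely catch edge cases (empty input, long runs, alternating characters) that a handful of hand-picked examples miss - and an AI agent is happy to invent the invariants and generators for you.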

E2E tests become easy too - my emulator project has hundreds of end-to-end tests that use real games or Windows apps.

Then there’s exploratory testing. Once again, Claude has you covered - build harnesses that let it drive your app, then let it explore all night. The emulator has emuscru (EMUlator SCript RUnner) and Gremlin (probably don’t need to explain that one), which allow Claude to create all sorts of test cases and havoc. Even better, another agent will fix up issues as soon as they are found.

What about code-review?

Of course, code review still has a role to play in this new world. Except this time you can run it every night, with multiple agents working on adversarial reviews - and you should. These things aren’t possible in the human world - imagine how many times you could ask an engineer to review the same code before they’d revolt. Twice? Three times?

And so?

Trouble is, proxies decay. Take Volkswagen. Despite all the “German engineering” marketing, Volkswagen came last in JD Power’s dependability survey last year. But if you haven’t bought one recently, you probably still think of them as quality cars.

And human code review?

In reality, it’s no longer the proxy for quality it once was. If anything, insisting on it signals an organisation that hasn’t adjusted its thinking for the new world. A world where you can do so much more than human code review ever could. A world where your time is far better spent orchestrating agents.

Who knows? Maybe “do you review your code” will get replaced with “what’s your MC/DC coverage?”


Originally published on Martin Davidson’s Substack. Follow Martin for more on AI and software engineering.