Wallbits and Mopheads

Back when I was at school I had a Saturday job working in Iceland. No, not the country; the much less glamorous frozen food shop. Most of the back of the shop was taken up with a massive coldstore - a room about 10m x 8m that was full of frozen food kept at -21°C.

Some of the staff liked to play practical jokes - and one of the jokes was to ask new starters to mop the coldstore floor. Now the floor was metal. At -21°C. You can probably guess what happens when a wet mop touches it. Yup, it freezes solid.

As a result the coldstore floor was covered in frozen mop heads. Except…

It wasn’t. Because no new starter was stupid enough to actually follow through. Everyone saw it for what it was - a joke instruction not to be followed.

But what happens if you give Claude or Codex a stupid instruction?

Enter wallbits

So I fired up Claude and Codex in my emulator project and asked them:

The bloop-doh-mk2 is interacting weirdly with the STG; it’s dispatching wallbits faster than the UUT can consume them. Go investigate.

I’ll be honest; I was expecting both of them to quickly come back and say there was no UUT or STG. And ask what a wallbit was. And, for that matter, a bloop-doh-mk2.

But, to my surprise, they both started debugging. Claude quickly decided the problem was in the “live playback path”…

Claude's initial diagnosis

Meanwhile Codex went off to discover how wallbits were dispatched. Err, ok, I guess.

Codex investigating wallbits

Soon Claude had pivoted to a plausible sounding bug…

Claude's evolved diagnosis

And Codex was also making progress on the imaginary bug:

Codex making progress on the fictional issue

And before long Claude declared success:

Claude declaring success

As did Codex:

Codex declaring success

Err…

The invisible bug

So now we had two fixes for a non-existent bug. Except we didn’t. Because both models went and found different bugs, fixed them and then declared success. Neither of them stopped to ask, “Hey, what’s a wallbit?” Both just assumed intent and executed.

The contrast with humans is stark; I’m confident that if I gave even the most junior engineer that brief they’d push back with questions. Lots of questions.

And it’s not just silly bug reports where our AI friends behave counterintuitively. Over the weekend I watched as Claude merged a branch that was 14 commits ahead and 34 behind main. The kind of merge you approach with trepidation as a human. Instead Claude did a rebase and promptly chose the most recent versions for each conflict.

Err, no. If you do that you’ll likely throw away parts of fixes. Which is exactly what happened.

Problematic merge conflict resolution
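To make the failure concrete, here is a minimal Python sketch of why "take the most recent version of every conflict" loses work. It is not what Claude or git actually ran; the function names, the tiny three-version model of a conflict and the example lines are all hypothetical, made up purely for illustration.

```python
def resolve_take_newest(older_side, newer_side):
    """Blanket strategy: keep the more recent side, discard the other."""
    return newer_side


def dropped_changes(base, side, resolved):
    """Lines that `side` added relative to `base` but that `resolved` lacks."""
    return [line for line in side if line not in base and line not in resolved]


# Hypothetical conflict: both branches edited the same region of one file.
base = ("play(buffer)",)
older_side = (
    "buffer = clamp(buffer, 0, MAX)",  # a bug fix made on the older branch
    "play(buffer)",
)
newer_side = ("play(buffer, rate=SAMPLE_RATE)",)  # an unrelated, newer change

resolved = resolve_take_newest(older_side, newer_side)

# The newer change survives, but the older branch's fix is silently gone.
lost = dropped_changes(base, older_side, resolved)
print(lost)
```

The newer side wins cleanly, and nothing complains. The only sign that a fix vanished is the list of dropped lines, which is exactly the check a human does by reading both sides of the conflict before picking one.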

Then there was the stash pop where Claude failed to push a stash, but popped a stale one anyway and broke the build. Or the alternative, where it dropped a stash after merging and threw away all the local changes. Arrgh!

Then there are the many cases where Claude just goes ahead without asking for confirmation:

Claude proceeding without confirmation

Or Codex’s occasional tendency to corrupt the repo…

Codex corrupting the repository

And yet…

These models are so amazing in other ways. Here are a few of this weekend’s projects. First up, a MIDI piano roll player with built-in spectrogram.

MIDI piano roll player with spectrogram

A 3D world viewer - you can try it here.

3D world viewer

And a LucasArts inspired game (complete with separate game engine) that runs in DOS:

LucasArts-style DOS game

(This uses Aseprite plus an MCP plugin to enable Claude to draw pixel art.)

And so?

Now I don’t know about you, but there’s no way I could build those things myself in a year, never mind a weekend.

Every new starter at Iceland understood something the models don’t. Not because we were smarter - we were seventeen and mostly useless - but because we had something no amount of training data seems to produce: the ability to think, “hang on, this doesn’t make sense.”

That’s judgment.

The problem isn’t that Claude or Codex are stupid. The problem is they’ll do whatever you ask with absolute conviction, whether it’s building you a 3D world viewer or cheerfully rebasing away half your bug fixes. They don’t distinguish between a brief that makes sense and one that doesn’t. That’s your job now.

The coldstore floor never did get covered in frozen mop heads. But I’m starting to wonder what the codebase equivalent looks like when nobody thinks to question the brief.

Originally published on Martin Davidson’s Substack. Follow Martin for more on AI and software engineering.