The 80% Fix for Whack-a-Mole Coding: A Testing Strategy You Can Actually Set Up

I read a post the other day that I could have written for half the founders I talk to.

Someone had built an AI tool that generates landing pages. It worked. Then they kept improving it, and the improvements started breaking the thing. Fix the image picker, the copy generation regresses. Fix that, something else goes weird. They were seriously considering rolling all the way back to an older, dumber version that "generated better pages," just to have something stable to ship.

A reply cut straight to it: "This problem is actually quite common in agentic coding without guardrails like regression tests. Getting setup with a test strategy can solve 80% of this headache."

That's the whole article, really. But the original poster asked two follow-up questions that nobody answered well, and they're the questions everyone has: Do the tests the AI prompts me to write after every change actually count? And where do I even learn to set up a test strategy? Let's answer both.

Why your app keeps breaking the same way

When you change one thing and four others break, it feels like the AI is being careless. It isn't. It literally cannot see what it might break.

The AI works from the small slice of code it has in front of it right now. It has no memory of the twenty other features you built last month and no way to check whether its new change quietly broke one of them. So it makes a reasonable-looking change, you nod, and you don't find out feature #14 is broken until a user trips over it. Or until you do, a week later, with no memory of what changed.

A human engineer has the exact same blind spot. The difference is they have a safety net under them: a pile of automated checks that run in seconds and shout "you just broke the login flow" before the code ever ships. That net is the thing you're missing. It's the whole gap.

You're not bad at this. You're just flying without instruments.

What a "test" actually is

Strip away the jargon. A test is just a recorded expectation:

"When a user with no email tries to sign up, the app should show an error and not create an account."

You write that down once, as a tiny piece of code. From then on, a machine re-checks it every single time anything in the codebase changes. Hundreds of these run in a few seconds. The moment a change violates one, you get a red flag pointing at exactly what broke, before it ever reaches a user.

That's it. No single test is clever. The power is in the pile of them. A hundred boring checks, run automatically on every change, is the whole difference between "I think this still works" and "I know this still works." Tests don't make your code smart. They make breakage loud.
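To make that concrete, here is a minimal sketch of the signup expectation above written as a test. The `sign_up` function, the `SignupError` class, and the in-memory `accounts` dict are hypothetical stand-ins for whatever your app really uses; a runner such as pytest would typically pick up the `test_` function on its own.

```python
class SignupError(Exception):
    """Raised when a signup request is rejected (hypothetical)."""


def sign_up(accounts: dict, email: str) -> None:
    """Hypothetical signup rule: an account needs a non-empty email."""
    cleaned = (email or "").strip().lower()
    if not cleaned:
        raise SignupError("Email is required")
    accounts[cleaned] = {"email": cleaned}


def test_signup_without_email_shows_error_and_creates_no_account():
    accounts = {}  # stand-in for a real database
    try:
        sign_up(accounts, email="")
    except SignupError as error:
        assert "required" in str(error)  # the user is shown an error
    else:
        raise AssertionError("signup with no email should be rejected")
    assert accounts == {}  # and no account was created
```

The test is longer than the rule it checks, and that's fine. It's written once and re-run forever.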

Do the tests the AI writes after every change count?

This was the founder's sharpest question. The honest answer: sometimes, but probably not the way you're using them right now.

Tools like Replit, Lovable, and Cursor will happily generate tests for the thing they just built. That's better than nothing, but there are two traps:

Trap one: the AI grades its own homework. If the AI writes the code and writes the test for that code, it tends to write a test that passes by describing what the code does, not what it should do. If the code has a bug, the test often just enshrines the bug. The test goes green and you feel safe. You're not.

Trap two: tests you never run are decoration. A test only protects you if it runs automatically on every change. If those AI-generated tests are sitting in your project but nothing executes them when you make your next edit, they're catching nothing. They're a smoke detector with the battery taken out.

So: AI-generated tests count if you point them at the behaviour you actually care about, and if something runs them on every change. Which brings us to the strategy.

A test strategy in four plain-English layers

You don't need all of this on day one. But here's the full shape, from "do this first" to "nice to have," in order of bang-for-buck.

1. Smoke tests: does the app even start? The cheapest, highest-value check there is. Can the app load? Can a user sign in? Does the main page render without errors? If you only ever have five tests, make them these. Most "everything is broken" panics get caught right here.

2. Critical-path tests: do the money flows work? Pick the three or four things that, if broken, make your product worthless. Signing up. The core action your product exists to do. Getting paid. Write a test that walks through each one end to end, the way a real user would. These are the ones that let you ship on a Friday.

3. Logic tests: do the rules behave at the edges? This is where bugs actually live: the empty input, the duplicate, the user who clicks twice, the discount that goes negative. For each rule that matters, write down what should happen for the normal case and for the weird ones. They're fast to run and they catch the subtle regressions smoke tests sail right past.

4. Guardrail tests: is anything leaking? If you store other people's data, you need tests that prove user A can never see user B's data. This is the one category where a single failure is a catastrophe rather than an inconvenience. Worth setting up the moment you have real users.
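A guardrail test can be sketched like this. The `Page` and `PageStore` names are hypothetical, and a real app would sit on a database with real authentication, but the shape of the check is the same: create data for two users, then prove one cannot reach the other's.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    id: int
    owner_id: str
    title: str


class PageStore:
    """Hypothetical in-memory store standing in for a real database."""

    def __init__(self) -> None:
        self._pages: dict[int, Page] = {}

    def add(self, page: Page) -> None:
        self._pages[page.id] = page

    def pages_visible_to(self, user_id: str) -> list[Page]:
        # Only ever return pages the requesting user owns.
        return [p for p in self._pages.values() if p.owner_id == user_id]

    def get(self, page_id: int, user_id: str) -> Page:
        page = self._pages[page_id]
        if page.owner_id != user_id:
            raise PermissionError("not your page")
        return page


def test_user_a_never_sees_user_b_pages():
    store = PageStore()
    store.add(Page(id=1, owner_id="user_a", title="A's launch page"))
    store.add(Page(id=2, owner_id="user_b", title="B's private draft"))

    # Listing: user A sees only their own page.
    assert [p.id for p in store.pages_visible_to("user_a")] == [1]

    # Direct access by id: user A is refused user B's page.
    try:
        store.get(page_id=2, user_id="user_a")
    except PermissionError:
        pass
    else:
        raise AssertionError("user_a was able to read user_b's page")
```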

Start at layer one and add layers as the stakes rise. Most of the safety lives in the first two.
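Going back to layer three for a moment, here is one way the "normal case plus the weird ones" idea might look for a single rule. The `apply_coupon` function and its rule (a coupon can never push a total below zero) are assumptions made up for illustration. Amounts are in cents to avoid rounding surprises.

```python
def apply_coupon(total_cents: int, coupon_cents: int) -> int:
    """Hypothetical rule: subtract the coupon, but never go below zero."""
    if coupon_cents < 0:
        raise ValueError("coupon cannot be negative")
    return max(total_cents - coupon_cents, 0)


# Each row: (label, order total, coupon, expected result), all in cents.
CASES = [
    ("normal coupon", 5000, 1000, 4000),
    ("no coupon", 5000, 0, 5000),
    ("coupon equals total", 5000, 5000, 0),
    ("coupon bigger than total", 5000, 8000, 0),  # must not go negative
]


def test_coupon_rules():
    for label, total, coupon, expected in CASES:
        assert apply_coupon(total, coupon) == expected, label


def test_negative_coupon_is_rejected():
    try:
        apply_coupon(5000, -100)
    except ValueError:
        return
    raise AssertionError("a negative coupon should be rejected")
```

Adding a new edge case later is one more row in the table, which is why this layer is cheap to grow.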

How to actually get the AI to do this for you

You're not going to write these by hand, and you don't need to. You need to direct the AI like an engineer would. The shift is small but it changes everything:

  • Make tests a standing rule, not an afterthought. Tell your tool, once, up front: "Every time you change behaviour, add or update a test that would fail if this behaviour broke. Run the full test suite before telling me you're done." Put it in the project's instructions so it applies to every prompt, not just the one you remember to ask.

  • Write the expectation yourself, in English. The AI is great at turning "a signup with no email should fail and create no account" into a working test. It's bad at deciding that's the rule in the first place. You own the what; let it own the how.

  • Make it run the tests, every time. The single most valuable sentence you can add to your workflow: "Run all the tests and show me the results before we move on." A change isn't done when it looks right. It's done when the suite is green.

  • When something breaks, add a test for it. Every bug a user reports is a test you were missing. Fix the bug, then ask: "Write a test that would have caught this." Do that consistently and your app gets harder to break every single week. That's the exact opposite of the death spiral the founder was stuck in.
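As an illustration of that last habit, suppose a user reported being charged twice after double-clicking "Pay." The sketch below is hypothetical throughout: the `Checkout` class, the `request_id` idea, and the bug itself are invented for the example. It shows the fix and the test that would have caught the problem.

```python
class Checkout:
    """Hypothetical checkout that tolerates a repeated click."""

    def __init__(self) -> None:
        self._orders: dict[str, dict] = {}

    def place_order(self, request_id: str, amount_cents: int) -> dict:
        # The fix: a repeated request_id reuses the existing order
        # instead of creating (and charging for) a second one.
        if request_id not in self._orders:
            self._orders[request_id] = {
                "request_id": request_id,
                "amount_cents": amount_cents,
            }
        return self._orders[request_id]

    def order_count(self) -> int:
        return len(self._orders)


def test_double_click_places_only_one_order():
    checkout = Checkout()

    first = checkout.place_order(request_id="req-1", amount_cents=2500)
    second = checkout.place_order(request_id="req-1", amount_cents=2500)

    assert checkout.order_count() == 1  # one order, not two
    assert first == second  # the second click gets the same order back
```

Once this test is in the suite, the double-charge bug can't quietly come back during some unrelated change.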

You don't have to roll back

The founder in that thread was about to ship a worse version of their product because they'd lost confidence that the good version was stable. That's the real cost of having no tests: not the bugs themselves, but the fear that freezes you.

A test suite hands back the one thing AI coding quietly takes away: the ability to change something and know you didn't break everything else. That's not a luxury reserved for "real" engineers. For a non-technical founder shipping with AI, it's the difference between moving fast and moving in circles.

So no, don't roll back. Spend an afternoon writing down what your product is supposed to do, get your AI to turn each line into a test, and make it run them every time. The version you were about to throw away is probably fine. You just couldn't see it yet.

