Building a reliable Spec Driven Development workflow, and deleting my src folder to test it (Claude Code pt 2)

"Remember kids, the only difference between screwing around and science is writing it down" - Adam Savage

This quote has stayed with me ever since I watched MythBusters as a kid. You could say it describes the difference between vibe coding and Spec Driven Development (SDD) pretty well. If you've tried SDD, you have probably realized that:

It's a major improvement over vibe coding when it comes to actually maintaining your code past the first prototype. You have a single source of truth (the spec!) that you can maintain, version and understand.
It's surprisingly hard to make the AI keep your code in sync with your spec. It wants to change the code and forgets that the spec exists. The constant entropy that comes with non-deterministic AI slowly pushes you back toward vibe-coding territory with every tool use.

Once you have tens or hundreds of pages of .md files, it becomes harder and harder to avoid code-spec drift. Details get lost, and soon your spec is just a guideline with good intentions, like a Confluence page you created 3 years ago.

So I'm sharing a cool solution I found that seems to be working rather well.

This week I converted an old hobby project (PHP + MySQL, from 15 years ago) to a modern stack with Bun, Next.js, Neon (a hosted Postgres), Biome, GCP infrastructure-as-code, all types of automated tests, and most importantly: a fully spec-driven development workflow.

I tried hard to make this (and only this) workflow happen: Spec change → E2E test change → Code change.

I started by moving specs and E2E tests into the same folder: what changes together should stay together, so maybe the AI would keep them in sync better this way. It didn't help much. There was still a lot of spec drift as I kept changing the system.

Next I tried to link my acceptance criteria (bullets in the spec) to specific tests in code (using tags), but this turned out not to be reliable at all. The many, many kilobytes of spec were still too much even for parallel subagents to reason about consistently and to get all the details right.
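To make the tag-linking idea concrete, here is a minimal sketch of a cross-check between spec bullets and test titles. The tag format (`[AC-1]` in the spec, `@AC-1` in a test title) and all function names are my assumptions for illustration; the actual tags used in the project are not shown in this post.

```typescript
// Hypothetical tag format, not taken from the article:
//   spec bullet:  "* [AC-1] User can log in"
//   test title:   test("login works @AC-1", ...)

// Collect the unique IDs captured by the first group of `pattern`.
function collectIds(text: string, pattern: RegExp): string[] {
  const ids: string[] = [];
  const re = new RegExp(pattern.source, "g");
  let match: RegExpExecArray | null;
  while ((match = re.exec(text)) !== null) {
    if (ids.indexOf(match[1]) === -1) {
      ids.push(match[1]);
    }
  }
  return ids;
}

interface LinkReport {
  untested: string[]; // criteria in the spec with no tagged test
  orphaned: string[]; // test tags that match no criterion in the spec
}

// Compare the IDs found in the spec text with the tags found in the test text.
function compareSpecAndTests(specText: string, testText: string): LinkReport {
  const criteria = collectIds(specText, /\[(AC-\d+)\]/);
  const tags = collectIds(testText, /@(AC-\d+)/);
  return {
    untested: criteria.filter((id) => tags.indexOf(id) === -1),
    orphaned: tags.filter((id) => criteria.indexOf(id) === -1),
  };
}
```

A check like this only proves that a tag exists on both sides. It says nothing about whether the tagged test actually verifies the criterion, which is the part that stayed unreliable.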

I thought there must be a deterministic solution for this.

Claude meets Gauge

I had heard of BDD (behaviour-driven development) frameworks before, so I looked in that direction. There are many solutions, but Gauge (by ThoughtWorks) caught my eye, because:

It uses markdown for its spec format (and all my specs are in markdown)
It can use Playwright for E2E automation (all my E2E tests used Playwright)

I told Claude what to do and, luckily, thanks to full E2E test coverage, converting all specs to Gauge format and then rewriting all the tests went quite smoothly. It was funny that Claude's plan mode estimated the first stage of the work to take "1 week". That happens when you learn estimating from humans.

This is how my spec folder looks now. It is important to realize that the file on the right is executed programmatically and deterministically, so there is zero possibility that your spec and E2E tests drift apart. A major bonus is that I don't need to maintain the test logic and the spec separately: they are one and the same.

This is what Gauge calls a "step": a function that executes one of the bullet points in the spec. As you can see, steps can accept parameters, which makes them reusable.
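To illustrate the idea of a parameterized step, here is a toy step registry. It is not Gauge's API or implementation, just a sketch of the mechanism: a template with `<placeholders>` is matched against a spec bullet, and the quoted values are passed to a function. The names `defineStep` and `runSpecLine` and the example steps are hypothetical.

```typescript
// Toy illustration only; NOT Gauge's actual API or implementation.
type StepFn = (...args: string[]) => string;

interface RegisteredStep {
  pattern: RegExp;
  run: StepFn;
}

const registry: RegisteredStep[] = [];

// Turn a template like 'Log in as <user>' into a regex where each
// <placeholder> captures a double-quoted value from the spec line.
function defineStep(template: string, run: StepFn): void {
  const source = template
    .split(/<[^>]+>/)
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"))
    .join('"([^"]*)"');
  registry.push({ pattern: new RegExp("^" + source + "$"), run });
}

// Execute one spec bullet such as '* Log in as "alice"'.
function runSpecLine(line: string): string {
  const text = line.replace(/^\*\s*/, "").trim();
  for (const step of registry) {
    const match = step.pattern.exec(text);
    if (match) {
      return step.run(...match.slice(1));
    }
  }
  throw new Error("No step implementation for: " + text);
}

// Hypothetical steps; a real one would drive the browser via Playwright.
defineStep("Log in as <user>", (user) => "logged in: " + user);
defineStep("Add <count> items of <product> to the cart", (count, product) =>
  "cart: " + count + " x " + product
);
```

Because the same template serves every bullet that matches it, one function can back many lines across many spec files, and a bullet with no matching function fails loudly instead of drifting silently.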

When you combine this with a slash command (/implement in my case), you can be pretty confident that your source of truth is your spec, and you can enforce a healthy TDD pattern. My 75 E2E tests run in 23 seconds, so they are fast enough even for a pre-commit hook.
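A pre-commit hook could also guard the "spec first" ordering itself. The sketch below is my own assumption, not part of the setup described here: it takes the list of staged paths and flags commits that touch source code without touching any spec. The folder names `specs/` and `src/` are hypothetical.

```typescript
// Hypothetical folder names; adjust to your own repo layout.
const SPEC_DIR = "specs/";
const SOURCE_DIR = "src/";

interface SpecFirstResult {
  ok: boolean;
  message: string;
}

// Flag commits that change source files without changing any spec file.
function checkSpecFirst(stagedPaths: string[]): SpecFirstResult {
  const sourceChanged = stagedPaths.some((p) => p.indexOf(SOURCE_DIR) === 0);
  const specChanged = stagedPaths.some((p) => p.indexOf(SPEC_DIR) === 0);
  if (sourceChanged && !specChanged) {
    return {
      ok: false,
      message: "Source changed without a spec change. Update the spec first.",
    };
  }
  return { ok: true, message: "Spec-first check passed." };
}
```

The staged paths could come from `git diff --cached --name-only`. This is a blunt heuristic: a pure refactor legitimately changes code without changing the spec, so it probably works better as a warning than as a hard block.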

To prove spec quality... just delete your source code

I finally felt confident enough to run a little experiment I had always wanted to try: delete the source code and rebuild the app from the spec. This was my prompt:

I want to make this experiment: I want to try out how well our spec describes our app. Make plan for this experiment:

  1. Go to a new folder
  2. Check out the app from git
  3. Remove all business logic source code except spec folder and config/gauge/bun/ts setup files etc.
  4. Generate the app from scratch just from the specs and e2e test specs. Plan this well while avoiding context rot and using parallel agents as much as possible
  5. Iterate until all tests succeed. 
  6. Do final visual validation using Chrome

It took about an hour to run and used only 1% of my weekly quota. Claude stopped at 94% test coverage and reported "pretty good success", so I had to interrupt once and say "you are not done yet". Soon after that, it was.

And it was beautiful! ♥️ Especially the last part (point 6) where it opened both sites (original and rewrite) sequentially in Chrome, looked at the screenshots, analyzed, criticized, compared.

The result was like jumping into a parallel universe where you could have built the same thing, but a tiny bit differently. There are small differences you realize you never even considered. Filling in the blanks is the job of the AI, to be so very helpful, and that's exactly what happened.

The new version was missing a couple of features I had accidentally implemented with a direct prompt (not using /implement, which always forces a spec update first). The site has some charts whose visuals were not documented, so they came out different. No log-out button, because who would think of that... But most of it was eerily similar, if not identical.

The best part was that after Claude finished, it gathered insights about:

What were the technical challenges that could have been documented (like figuring out some lazy loading patterns)
What were the biggest visual differences, where spec was ambiguous or missing

... and it offered right away to improve the specs in the original repo. After I wrote "do it", it was done. I'm pretty confident that this approach could handle a sizable "real" application if done in a disciplined manner.

More workflow insights from this week

I'm discovering that I run more and more shell commands by letting Claude run them. If there's a chance of failure, it saves me the trouble of prompting Claude to fix it. E.g. running tests.
One of my most used slash commands is now /push-and-monitor-ci-cd. It does exactly that: it uses the GitHub CLI to make sure the build and deploy succeeded. If they didn't, it iterates and fixes until they do.
My favourite Claude job assignment from this week is something that would have taken me a day at least, and I would have hated every second of it: fixing flaky E2E tests (that only manifest in CI/CD) by using Playwright browser traces and running the "push → monitor GitHub Actions → download traces → fix → push" loop for an hour until it found and fixed all the flakiness.
Using claude with --chrome was super helpful for: "Compare old and new site and check for incorrect calculations". Funnily enough, the incorrect numbers turned out to be a bug in the old system.
Playing around more with Chrome: I created a /ui-design-review skill. It goes through all pages of the site in desktop and mobile view, takes screenshots and compares them against the design system, but also tries to find common-sense issues I haven't asked it to look for. Initially I used the --chrome extension, but due to some bugs I finally switched over to Chrome DevTools MCP, which was a much faster and more reliable solution.
I am using a /code-review skill that launches parallel subagents to review code on different topics like code style, security etc. Inspired by Carl Rannaberg (again), I added a --codex flag to it to get a second opinion from a different model. It delegates the investigation topics to OpenAI Codex MCP (instead of Claude subagents). It works surprisingly well at finding totally different kinds of issues than Opus 4.5 finds. If you're doing something weird or new or fundamental to your app or framework, it makes sense to ask for a second opinion. (Unfortunately there seems to be a bug in the latest Claude Code that stops this from happening in parallel, so there is a latency penalty at the moment.)
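The "push → monitor → download traces → fix → push" loop from the flaky-test item above boils down to a bounded retry loop. Here is a minimal sketch with the real work injected as callbacks. Everything in it is an assumption for illustration: the names are hypothetical, and it is synchronous only for brevity, since in practice pushing and waiting for CI would be asynchronous.

```typescript
interface CiResult {
  passed: boolean;
  trace: string; // e.g. a summary of the downloaded Playwright trace
}

interface LoopSteps {
  pushAndMonitor: () => CiResult; // push, then wait for the CI run to finish
  applyFix: (trace: string) => void; // hand the trace to the agent to fix
}

// Repeat push -> monitor -> fix until CI is green or attempts run out.
// Returns the number of CI runs it took to get a green build.
function loopUntilGreen(steps: LoopSteps, maxAttempts: number): number {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = steps.pushAndMonitor();
    if (result.passed) {
      return attempt;
    }
    steps.applyFix(result.trace);
  }
  throw new Error("CI still failing after " + maxAttempts + " attempts");
}
```

The attempt cap matters here: with flaky tests, a single green run does not prove the flakiness is gone, and without a cap an agent can keep pushing forever.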

Final thoughts

It's cool to see how LLMs, with their non-deterministic nature, start affecting more and more domains that touch them. These domains need to adapt and change to become non-deterministic as well.

It started with using LLMs in your software product: suddenly your product is not unit testable anymore, it can produce unexpected outcomes, and quality becomes a thing you measure over time rather than build into the product.

Now, of course, it's changing how we make software. Suddenly we have these huge cannons everywhere that shoot out powerful code that can solve any problem. And instead of spending your time writing code (and competing with the cannons), you are building processes for steering these cannons at exactly the right targets. And of course you do the steering with more, smaller cannons, which you steer with even smaller ones. It's like playing Satisfactory! Or TIM!

This article was originally published on LinkedIn.