How to Claude Code: Latest insights into spec-driven agentic development (pt 3)

The last two weeks have been busy: in addition to a ski trip and a ski marathon (different skis!), I've had time to polish my behaviour-driven, AI-first development workflow since adopting Gauge as my primary tool for maintaining truth in spec-driven development.

I took up a new hobby project that has grown to some size and complexity. This seems to be my distribution of truth (docs, specs) vs implementation (code and tests).

So I'm sharing what I've discovered and what works best for me today.

Terminal is the new Jira

I've started using my terminal tabs like a task list. I didn't plan it; it happened naturally.

Cmd+T (new tab), reverse-search for "claude with yolo and chrome", and hit Enter
Always in plan mode: write the idea, start planning, open a new tab
Once a plan nears implementation-readiness, I move its tab to the left. The leftmost tabs are my implementation queue; the rightmost are active discussions.

So terminal tabs are like Kanban lanes in Jira. But without Jira. This is my Jira:

The new bottleneck: Need more Claudes, need more Air?

Obviously, there is a huge problem with this approach: I can't run multiple Claudes in parallel on tasks that overlap in the codebase. If the tasks touch sufficiently different parts, it can be fine, but there is a limit to how much confusion you want to cause an agent by letting other agents touch its stuff. The worst I've seen is one agent reverting all "other" changes to make its tests pass again. And asking "is this OK to run in parallel?" becomes a constant overhead when you need to do it 20 times per day.
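The "is this OK to run in parallel?" question can be partly mechanized. Below is a minimal sketch, not something from my actual setup: it assumes each plan can list the paths it expects to touch, and the names `PlannedTask`, `pathsOverlap` and `findConflicts` are made up for illustration. It only catches path-level collisions, not logical ones such as shared database migrations.

```typescript
// Hypothetical task description: the paths an agent's plan expects to touch.
interface PlannedTask {
  name: string;
  paths: string[]; // files or directories, relative to the repo root
}

// True when the two paths are equal or one is a directory containing the other.
function pathsOverlap(a: string, b: string): boolean {
  const normalize = (p: string): string => p.replace(/\/+$/, "");
  const x = normalize(a);
  const y = normalize(b);
  return x === y || x.startsWith(y + "/") || y.startsWith(x + "/");
}

// Returns every pair of task names whose planned paths collide.
function findConflicts(tasks: PlannedTask[]): Array<[string, string]> {
  const conflicts: Array<[string, string]> = [];
  tasks.forEach((first, index) => {
    tasks.slice(index + 1).forEach((second) => {
      const collide = first.paths.some((p) =>
        second.paths.some((q) => pathsOverlap(p, q)),
      );
      if (collide) conflicts.push([first.name, second.name]);
    });
  });
  return conflicts;
}
```

An empty result means the tasks are at least not stepping on the same files; anything else is a hint to run them one after another.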

There is a new kid on the block, JetBrains Air, that promises to be a solution for agent-first development: both a wrapper for and an improvement on CLI-based tools like Claude Code and Codex. The feature I was hyped about the most is the automatic git worktree management. All in all:

Well, the worktree thing works... until your changes involve database migrations. But that's another story.
It's based on JetBrains' new platform called Fleet, written from scratch for speed. It is very fast. It does not feel like your usual JetBrains IDE, which is both good and bad.
Lovely that you can connect your Anthropic and OpenAI accounts to it, and importantly, not only via API keys but also with your Max/Pro etc. subscriptions. This is a huge win over the Codex macOS app, for example, where you can only use GPT models, which is why I haven't found time to try that app yet.
One thing that slightly put me off is the diff view it shows once it completes a task. I really don't want to look at the code it generated, as my workflow relies on automated tooling instead. Throwing it in my face every time it wants to show what it did makes me feel guilty about not caring about the code, which is counterproductive.
The UI feels a bit clunky, e.g. your familiar shortcuts for changing plan/yolo mode don't work, and I pretty quickly switched back to my good old terminal. I spent time styling my terminal, so I start missing it easily.

Overall it shows promise and the advent of similar ADEs (agentic development environments) will be extremely interesting, so I'll definitely keep an eye on the developments.

But still, parallel work is an unsolved problem for me, so I'm very much looking forward to Claude Code implementing a native worktree solution very soon.

Postponing correctness

I made a change in my hooks and validators that works really well. Standard software development practice suggests that you might want your commits to be of consistent quality with passing tests. Turns out, with agents, especially working on the same codebase, some temporary regressions are totally fine.

I moved most of my git hooks from pre-commit to pre-push.
Pre-commit runs the quick things: unit tests + coverage threshold check, Biome (lint + fix)
Pre-push runs the e2e tests
The Claude Stop hook runs a Codex agent that receives the diff of what was done together with what was instructed, and suggests improvements
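The split above can be written down as a small lookup table. This is a sketch under assumptions, not my actual hook scripts: the commands are placeholders (`npm run test:unit` and `./scripts/codex-review.sh` are hypothetical names, and the Biome and Gauge invocations should be checked against the versions you have installed).

```typescript
// Stages from the list above; "stop" stands for the Claude Code Stop hook.
type Stage = "pre-commit" | "pre-push" | "stop";

interface Check {
  name: string;
  command: string; // placeholder command; substitute your project's real one
}

// Fast checks run on every commit, slow ones only before code leaves the machine.
const checksByStage: Record<Stage, Check[]> = {
  "pre-commit": [
    { name: "unit tests + coverage threshold", command: "npm run test:unit" },
    { name: "lint + fix", command: "npx @biomejs/biome check --write ." },
  ],
  "pre-push": [{ name: "e2e specs", command: "gauge run specs" }],
  stop: [{ name: "Codex review of the diff", command: "./scripts/codex-review.sh" }],
};

// Returns the commands a given hook script would execute, in order.
function commandsFor(stage: Stage): string[] {
  return checksByStage[stage].map((check) => check.command);
}
```

The design choice is simply cost: anything that takes seconds stays in pre-commit, anything that takes minutes moves to pre-push, so agents can commit temporary regressions without being blocked.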

Running Codex inside a Stop hook is actually quite a funny experiment. Here is a real example of two agents cutely collaborating to catch issues. Codex with GPT-5.2 points out a problem and Opus 4.6 politely accepts the criticism and fixes it.

 ⏺ Ran 2 stop hooks
  ⎿  Stop hook error: Codex review (pass 1/2) found issues:

  - specs/hooks.ts:111: `performSignOut()` clears cookies after navigating to `/login` but does not reload the page, so the login form can keep a stale CSRF token while the CSRF cookie was deleted, causing subsequent sign-in attempts in the same scenario to fail.

  Please fix these issues.

⏺ Good catch. After clearCookies() wipes the CSRF cookie, the /login page still holds a stale CSRF token in its form. A reload after clearing gives the page fresh state.

But the funny part comes with false positives. The ensuing argument between the two agents reads like two junior developers arguing over whose fault a bug is, why it's not a bug at all, or why it's pre-existing and "not my problem".
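The "pass 1/2" in the log above hints at why a cap on review rounds matters: without one, the two agents could argue over a false positive forever. Here is a minimal sketch of that decision. It assumes the hook blocks by exiting with code 2 and writing its message to stderr, which is my understanding of Claude Code's hook contract; the function name and the shape of `HookResult` are made up for illustration, and the real hook would also have to call Codex and read the diff.

```typescript
interface HookResult {
  exitCode: 0 | 2; // assumption: 2 blocks the stop and feeds stderr back to Claude
  stderr: string;
}

// Turns reviewer findings into a hook result, giving up after maxPasses rounds.
function reviewToHookResult(issues: string[], pass: number, maxPasses: number): HookResult {
  // Nothing found, or the pass budget is used up: let the agent stop.
  if (issues.length === 0 || pass > maxPasses) {
    return { exitCode: 0, stderr: "" };
  }
  const bullets = issues.map((issue) => `  - ${issue}`).join("\n");
  return {
    exitCode: 2,
    stderr: `Codex review (pass ${pass}/${maxPasses}) found issues:\n\n${bullets}\n\nPlease fix these issues.`,
  };
}
```

With two passes, the reviewer gets one chance to raise issues and one chance to confirm the fixes; after that the main agent has the last word.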

ELI5

My Gauge workflow (described here) means that my E2E tests are actually markdown spec files that can be executed directly. This is lovely for maintaining a spec of what the system should be able to do, but I realized it's too low-level for mere humans to fully comprehend and stay in sync with. Writing out all the ways a form must behave does not give me an overview of what its purpose is.

All problems in computer science can be solved by another level of indirection [abstraction]

This led me to separate docs from specs. Claude now generates high-level documentation based on the executable markdown specs. I have a "docs/overview.md" that tracks implementation progress and indexes all functionality, and a "docs/domains" folder with short, human-readable overviews of all the business functionality that has been implemented.
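To make the extra level of indirection concrete, here is a rough sketch of the mechanical part of such an index. In my setup Claude writes the docs, so this is only an illustration of going from spec to overview, not the actual generator. It assumes Gauge's hash-style markdown headings ("# " for the spec title, "## " for scenarios) and ignores underline-style headings; `outlineSpec` and `renderIndex` are hypothetical names.

```typescript
interface SpecOutline {
  title: string;
  scenarios: string[];
}

// Extracts the spec title ("# ...") and scenario names ("## ...") from a spec file's text.
function outlineSpec(markdown: string): SpecOutline {
  const outline: SpecOutline = { title: "", scenarios: [] };
  for (const raw of markdown.split("\n")) {
    const line = raw.trim();
    if (line.startsWith("## ")) {
      outline.scenarios.push(line.slice(3).trim());
    } else if (line.startsWith("# ") && outline.title === "") {
      outline.title = line.slice(2).trim();
    }
    // Step lines ("* ...") are deliberately dropped: too low-level for an overview.
  }
  return outline;
}

// Renders one block per spec: the title followed by an indented scenario list.
function renderIndex(outlines: SpecOutline[]): string {
  return outlines
    .map((o) => [o.title, ...o.scenarios.map((s) => `  - ${s}`)].join("\n"))
    .join("\n\n");
}
```

What a script like this cannot do is explain purpose, which is exactly the part worth delegating to Claude in "docs/domains".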

Use the breathalyzer

If you haven't already, install ccstatusline right now. It's a nice config tool where you can customize what you see under your prompt box. The "Ctx %" thing is absolutely critical, and not there by default. Whenever I see Claude starting to get tired and drunk, talk nonsense, take shortcuts and make hacks, I see that I've forgotten to clear context and the conversation has gotten too long. I try to /compact or /clear once it gets over 60%. The "Cost" is just lovely to see, to remind myself that the 200€ subscription was a good idea.

Claude's Agent Teams / Swarms

Claude released Agent Teams a while back (currently in beta and enabled with a special flag). I use it regularly now and, when it works, it seems to speed up work. Sometimes Claude picks it up on its own, but often I have to tell it, already in plan mode, to use an agent swarm to implement the plan.

It being in beta seems like a good idea though:

My Chrome extension does not seem to work reliably after enabling it
I tried to adopt it for my 7-parallel-agent code-review skill, but never got it to work. It could be that it does not support 7 parallel workers, so I reverted to subagents.
Oftentimes the agent team members keep hanging after the work is complete, and although the UI looks good, it does not seem to be quite stable. But it's in beta, so I'm not complaining.

Most used skills and commands

The skills and commands I use the most nowadays are:

/ship, with an optional monitor mode. Absolutely love this. In monitor mode it checks GitHub Actions with "gh" (the GitHub CLI tool) until it sees that the deployment has succeeded. If something fails, it will fix-push-loop until it's shipped. Awesome for changes that touch infra or can be flaky in CI/CD.
/audit-spec – checks consistency between code, e2e tests and human-readable documentation. I mostly run it once in a while to see how well my AGENTS file (which tells Claude to keep things in sync) works.
/implement – still required to enforce the checklist-based implementation order: spec -> tests -> code
/ui-design-review – good when something looks fishy in the UI and, instead of explaining, I want Claude to open Chrome and check for itself
/update-diagrams – something I've started to experiment with. It uses the Excalidraw MCP to draw diagrams of the more complex processes in the system. I've yet to produce a convenient skill that works reliably and universally, but I'll keep experimenting.
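The fix-push-loop behind /ship's monitor mode, described at the top of this list, boils down to a bounded polling loop. The sketch below is an illustration rather than the skill itself (which is a prompt, not code): the status check and the fix step are injected as hypothetical callbacks, where `checkStatus` would wrap a "gh" query and `fixAndPush` would hand the failure back to the agent. It is kept synchronous for brevity; a real version would be asynchronous.

```typescript
type DeployStatus = "pending" | "success" | "failure";

interface ShipDeps {
  checkStatus: () => DeployStatus; // hypothetical: asks CI for the latest run's state
  fixAndPush: () => void;          // hypothetical: agent fixes the failure and pushes
  wait: () => void;                // pause between polls
}

interface ShipResult {
  shipped: boolean;
  fixes: number;
  polls: number;
}

// Polls until the deployment succeeds, fixing and re-pushing on failure,
// and gives up once either budget (fix attempts or polls) is exhausted.
function shipAndMonitor(deps: ShipDeps, maxFixes = 3, maxPolls = 100): ShipResult {
  let fixes = 0;
  for (let polls = 1; polls <= maxPolls; polls++) {
    const status = deps.checkStatus();
    if (status === "success") return { shipped: true, fixes, polls };
    if (status === "failure") {
      if (fixes >= maxFixes) return { shipped: false, fixes, polls };
      deps.fixAndPush();
      fixes++;
    }
    deps.wait();
  }
  return { shipped: false, fixes, polls: maxPolls };
}
```

The two budgets are the important part: a flaky pipeline should cost a few retries, not an unattended agent pushing fixes all night.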

Brave new world

What is super exciting about this brave new world of software is that there is no one "correct" workflow for agentic development. There could be an article somewhere that suggests the opposite of what I have listed here and is still correct and useful for its author.

We're finding ourselves in a new Wild West: we can only guess what the new rules will be, we experiment with things that were firm laws for lifetimes, and we start writing new constitutions. The telegraphs are still being invented, the railroads are still being planned, but it's clear that the gold rush is on.