Accountability Over Speed
The goal is not maximum automation. The goal is repeatable delivery with accountability.
The wrong question and the right one
The dominant narrative around AI-assisted development is about going faster. But speed without direction is just expensive wandering. The same trap caught many Agile adoptions: teams interpreted "working software over comprehensive documentation" as permission to skip specification entirely. That reading rested on assumptions that never really held: infinite budget, the option of redoing everything, and endless customer patience.
To be clear, Agile's core principles (iteration, feedback loops, self-governing teams) remain sound and this book builds on them. The problem was never the Manifesto; it was how many teams practiced it, treating speed as a substitute for direction rather than a consequence of good direction. Spec-driven development restores what was lost in that translation, not what Agile intended to discard.
Going fast in the wrong direction is worse than going slowly in the right one. Or, in practice: burning your premium token quota to produce code that is toxic waste, and then burning the next day's quota to fix or redo it, is not a sustainable way to work.
Ask not how we make AI write code faster.
Ask instead how we make AI-assisted delivery repeatable and accountable when the whole team depends on it.

I have a cautionary tale. I once created a proof-of-concept that was vibe-coded to fit the design: fast, impressive, and it got the point across. Later, when I evaluated whether it was any good for the real thing, it turned out to be a patchwork of odd choices featuring almost every anti-pattern in the book. God components, big balls of mud, the full curriculum of the vibe coding academy. It also wasn't tracked anywhere (a classic agile move, right?), so figuring out afterwards which features were actually in there required some footwork. It wasn't useless, but I reckon it would have been less effort in total to do it properly from day one.
I suspect this happens often. Vibe-coded prototypes initially give an unrealistically optimistic impression of quality and speed. When things start to slow down (or deteriorate, if you will) into endless refactoring loops and subtle bug hunts, your burndown chart starts to look like the EKG of a cardiac arrest. Call it "shiny prototype syndrome" if you like, but really it's just a reminder that the last 20% of the work tends to require 80% of the effort, and with AI you should expect something similar. Winning a battle does not mean winning the war.
Generally, the reality of a vibe-coded codebase becomes apparent after a while: when somebody else (who knows things) takes a look, or when you yourself need to fix or extend it later. Even the best of the best of AIs will struggle with the mess.
Enforcement
So how do we build delivery that actually works the same way twice and leaves a trail you can follow? My suggestion: go back to a basic principle of engineering: divide the problem into smaller subproblems until each one is solvable. To make this tangible, imagine a multi-agent software delivery loop where each AI agent owns a single stage of the lifecycle, or a set of parallel agents working on the same stage but on different tasks. How do we prevent them from wandering off into speculative work, producing code that doesn't meet the requirements, or modifying tests to fit the code?
With CONTROL.
Three control mechanisms make this work:
Gates are enforced transitions between stages. An agent cannot advance from planning to implementation, or from implementation to testing, without meeting defined criteria. Say, the engineering agent can't start coding until its plan has been reviewed and its task list matches the acceptance criteria.
Tracking artifacts are the persistent records of what was planned, built, tested, and approved. They live near the code, in version control, not somewhere in the agent's chat history that vanishes when you close the session. Think a PLAN.md that links to the backlog item, commit messages that reference the task ID, and test results tied back to acceptance criteria.
Human checkpoints are hard stops where automation pauses for human approval. Not optional suggestions, but mandatory gates that cannot advance without a person signing off. You review the plan before coding starts, and you review the code before it gets merged. The agent waits.
Here's what it looks like in practice. A developer picks up a story: "Add date filtering to the task list." The planning agent produces a plan with tasks, references to the UI pattern library, and Gherkin acceptance criteria. The developer reviews the plan, catches that the agent missed the "no results" empty state, and approves the corrected version. Only then does the engineering agent start coding. When it's done, it produces a PR linked to the story. The testing agent runs the acceptance criteria. If something fails, it goes back to engineering with the failing test, not forward to review. At no point did anyone skip a step or approve something they hadn't read.
Chapter 10 describes how these mechanisms combine into a concrete pipeline with stages, agent roles, and state machines. For now, the key insight is that all three work together: gates enforce the boundaries, artifacts make the work traceable, and checkpoints keep humans in the loop.
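The stage flow from the walkthrough above can be sketched as a small state machine. The stage names and the fail-back edge follow the example; the table-of-transitions shape is just one simple way to encode it:

```python
# Allowed transitions between pipeline stages. Anything not listed here
# is rejected, so an agent cannot skip a stage or invent a shortcut.
TRANSITIONS = {
    "planning":       {"implementation"},            # only after plan approval
    "implementation": {"testing"},
    "testing":        {"review", "implementation"},  # failures go back, not forward
    "review":         {"done"},
}


def advance(stage: str, target: str) -> str:
    """Move to the target stage, or raise if the transition is not allowed."""
    if target not in TRANSITIONS.get(stage, set()):
        raise ValueError(f"illegal transition: {stage} -> {target}")
    return target


stage = "planning"
stage = advance(stage, "implementation")
stage = advance(stage, "testing")
stage = advance(stage, "implementation")  # a failing test sends the work back
```

Trying `advance("planning", "review")` raises immediately: the structure itself forbids skipping implementation and testing, which is exactly what a gate is.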
In the end, any backlog item should be traceable all the way through to code, tests, and a pull request, without your ever-so-helpful and hard-working agents wandering off into speculative work or getting stuck in endless loops.
Let me give you some concrete examples:
- Require your agent to link every commit or PR to a specific backlog item. This is low-hanging fruit (a good practice to have, AI or not, and yet another checkmark on your BS bingo sheet).
- Require your agent to produce a detailed task list that refers to the specific supporting documents and artifacts it should be working from. For example, when planning a task to add a new API method, link to the API specification in the task.
- Link tests to your source of truth, i.e. the place where your spec-driven design is documented. An .md file with acceptance criteria, something unstructured, anything. You'll later be able to answer the question: why is this test here?
And so on.
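The first bullet is trivial to automate. Here is a sketch of the kind of check you could wire into a commit-msg hook or a CI job; the `TASK-123` ticket pattern is an assumption on my part, so adapt the regex to whatever IDs your tracker issues:

```python
import re

# Matches tracker IDs like TASK-123 or PROJ-42 anywhere in the commit
# message. The exact pattern depends on your backlog tool; this one is
# an illustrative assumption.
TICKET = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def references_backlog_item(message: str) -> bool:
    """True if the commit message links to a specific backlog item."""
    return TICKET.search(message) is not None


# Wired into a hook or CI gate, this rejects untracked work outright:
assert references_backlog_item("TASK-123: add date filter to task list")
assert not references_backlog_item("wip, misc fixes")
```

It is a blunt instrument (it cannot tell whether the referenced item is the *right* one), but it makes untracked commits impossible rather than merely discouraged.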
What I'm suggesting here is not about replacing your judgment with AI automation, or about slowing things down with bureaucracy. It is about drawing a clear line between what the agent decides and what you decide, so neither of you wastes time on the other's job.
Why "Accountability" matters
In professional delivery, where my day job is (my roguelike is still not ready to be monetized, as you might have guessed), accountability is not a nice-to-have. Clients need to know what was built and why. Somebody might need to trace decisions. Teams need to onboard new members who can understand the history.
Without accountability, AI-assisted delivery becomes a black box: code appears, nobody can explain the decisions behind it, and the organization loses the ability to govern its own software. Well, that's an accurate description of most agile projects, right? After 20 sprints with 10 incremental changes to some feature, good luck figuring out why the code is the way it is and what the rationale behind it was.
Naturally, a solo developer building a side project or a research-oriented team creating a PoC does not need any of this. But for teams delivering software at scale, the tradeoff tends to pay off.
The chapters that follow describe how we have implemented these principles in practice. I describe the specific framework, its specification-driven design approach, and the governance boundaries that make it work. Treat my words as a starting point for your own thinking, not as a rigid methodology or an off-the-shelf Claude Code plugin to be installed and followed without question, customization, and considerable effort. What I really wish is that all of you who are voluntarily or involuntarily looking into this would gain the insights without stepping on every landmine I have.