13

The Specification Premium

What was forgotten during the Agile era


The acronym problem

If there's one thing IT people are creative at, it's coming up with TLAs and FLAs. The latest member of the family is SDD, Specification-Driven Development. Probably to align with Test-Driven Development (TDD), Behaviour-Driven Development (BDD), or Domain-Driven Design (DDD). Or, as I referred to the Agile method, "Ticket-Driven Development" (also TDD). Lots of D's.

Anyway, SDD is supposed to be the practice of writing detailed specifications for each feature before implementation. Funny that I thought something like this was supposed to happen all along, so I got on this ship already a bit annoyed.

While the acronym and the messaging are easy to make fun of, the idea is basically about giving your AI a fair chance of finding a good solution.

The winning formula

What if we could model the chance of success in a programming task with a formula? My hypothesis is that the probability of success (S) is a product of core factors divided by the task size (T). Let's also throw in a coefficient for model temperature and other factors that affect randomness.

S = (A × C × P × R) / T

Where

Architectural quality (A): The degree to which the codebase has clear separation of concerns, well-defined modules, and consistent patterns. A well-architected codebase has a small solution space for any given task, while a poorly architected codebase has an enormous solution space.
Context limits (C): The extent to which the agent can find and understand the relevant code and requirements within its context window. A codebase with clean architecture allows the agent to focus on a small subset of code, while a tangled codebase forces the agent to load vast amounts, increasing the likelihood of context overflow and dropped information.
Specification precision (P): The clarity and detail of the task specification. A precise specification with clear examples and constraints narrows the solution space. A vague specification leaves the agent to invent the approach, the "too creative zone."
LLM Fuzzy Factor (R): The inherent randomness in language model outputs, influenced by temperature settings and model architecture. Higher temperatures increase creativity but reduce reliability, while lower temperatures improve consistency at the cost of innovation.
Task size (T): The complexity of the task itself. Larger tasks have more potential solutions, more handoff points, and thus a lower success ratio.
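As a toy model, the formula can be put into code. This is a minimal sketch with made-up numbers; the function and all inputs are illustrative scores, not measurements from any study:

```python
# A toy reading of S = (A * C * P * R) / T.
# All inputs are illustrative 0..1 scores; the scale is invented, and the
# formula is a mental model, not a calibrated metric.

def success_probability(a: float, c: float, p: float, r: float, t: float) -> float:
    """Architecture (a), context fit (c), spec precision (p), fuzz factor (r),
    divided by task size (t). Larger tasks need stronger factors."""
    return (a * c * p * r) / t

# Same factor quality, growing task size: success drops as T grows.
small = success_probability(0.8, 0.8, 0.8, 0.9, t=1.0)
large = success_probability(0.8, 0.8, 0.8, 0.9, t=3.0)
assert small > large
```

The point the toy model makes is only directional: to keep S constant while T grows, some combination of A, C, and P has to improve.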

Here's a simple take on what this looks like in practice. Imagine tasks of three different sizes: S, M, and L. It's still pretty easy to get things right even with lower values of A, C, and P for small tasks. But as the task size grows, you need to raise the bar: the bigger the feature you want to one-shot, the more attention to architecture, context, and specification is needed.

[Figure: "S = (A × C × P) / T: success probability across different task sizes". Three panels, one per task size. Small task: A at 70%, C at 50%, P at 60% still yields medium success; imbalance is tolerated and the tolerance zone is narrow. Medium task: the minimum thresholds rise (A 80%, C 75%, P 70%); balance starts to matter and the tolerance zone is minimal. Large task: even with A at 90% and P at 85%, C at 60% is the weak link that breaks it; the tolerance zone is minimal. Legend: A = architecture quality, C = context availability, P = specification precision, T = task size.]
The winning formula: as task size grows, the tolerance for imbalance between Architecture, Context, and Specification shrinks dramatically.

Compare this with the "dumb zone" and "smart zone" analogy in "Concept Overload": The vaguer the architecture, boundaries, and context, and the bigger the chunk, the further your hardworking but essentially dumb assistant will wander from the actual solution.

The formula is anecdotal, based on a small sample size, not a controlled study, but it illustrates the point: specification is not just a nice-to-have, but a critical factor in the success of AI-assisted development.

Getting your specs in order is one central piece here, but not the only one. Strong specs can compensate for weakness elsewhere. If your specifications are rock-solid, you can take on bigger tasks at a time.

The specification premium

There was a good reason why writing software was painful in the old days. The lead time from planning to working software was measured in months due to intensive up-front paperwork, and the bar for changing course got high fast. Designs and specifications were written in stone, yet already outdated by the time you started to work from them. Getting anything changed required change-request (CR) theatre: a minimum of two meetings and a month of calendar time.

I thought the issue was never really about the specifications but about the non-incremental waterfall process. The general idea was sound, of course: by carefully planning things ahead, you would avoid costly mistakes and rewrites, and the odds of going the right way were higher. Often the problem was (and is) in the inflexibility and the long lead time from words to action.

So it turns out that iterative development is still a solid choice, now with the resurrection of detailed functional specifications. It does not matter if we call them "user stories", "PRDs", "ADRs", or "use cases": the important point is to let the AI know exactly what you want, under what circumstances, and what you don't want. Let's just not go back to the 'Word document stating which database column maps to which textbox' kind of thing.
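For illustration, a "good enough" spec can be a compact markdown file. The following is a hypothetical example; the file layout, section names, and referenced test file are my assumptions, not a prescribed template:

```markdown
# Feature: Export report as CSV

## Intent
Let a logged-in user download the current report view as a CSV file.

## Behaviour
- Given a rendered report, when the user clicks "Export", a CSV with the
  visible columns (in display order) is downloaded.
- Dates are exported in ISO 8601; numbers use a dot decimal separator.

## Out of scope / don'ts
- No XLSX export, no server-side scheduling.
- Do not change the existing report rendering code.

## Acceptance
- Covered by tests in `tests/export_csv.spec.ts` (written before the code).
```

Note what this gives the agent: the intent, the observable behaviour, and the explicit "don'ts" that fence off the too-creative zone.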

While writing documents is not entirely free, and often not the greatest fun, the cost of fixing and redoing is much higher. In my experience, having unbounded AI assistants write the code based on vague instructions just shifts the effort into the rework zone. Quite often the net hours spent would've been less doing it the old way.

Below, I've illustrated the effort balance between writing good specs up front or fixing/redoing them later. The numbers are for illustrative purposes, but probably not that unrealistic: when you take the 'left' part seriously, you can sometimes get those 5-10x speedups for the implementation phase.

In the end, it's your call: jump directly to the coding part and spend hours and hours later fixing it, or take those 10-15 minutes up front to write a good plan. Take your pick!

[Figure: "Where does the effort go? Same feature, two approaches: relative effort by phase." Prompt-and-pray (minimal upfront work): Code 30%, Test 18%, Fix & redo 45%; 100% relative effort; thinking 7%, rework 45%; most time is spent fixing. Specification-driven (invest in clarity upfront): Specify 14%, Plan 10%, Code 16%, Test 12%; 65% relative effort; thinking 31%, rework 6%; less total effort. Phases shown: Ideate, Specify, Plan, Code, Test, Fix & redo.]
When you invest in ideation, specification, and planning, the fixing and redoing phase shrinks, and so does total effort.

Purely economically, I understand that promises about cheaper software don't fit well with the suggestion that work just moves to the specification phase and the same amount of effort is needed. On top of that, setting up a proper factory — documents, working agent chains, some self-learning capability — is easily a month's investment. And people like me suggest that you actually need new roles like the "AI Development Supervisor" to maintain the infrastructure: agent recipes, orchestration, supporting documents, and tools.

So let me be honest about where this pays off. For small projects with one or two developers, basic Plan and Agent modes with good project context are probably enough — the governance overhead isn't justified. For medium and large projects, the investment tends to pay for itself through reduced rework and repeatable delivery. As for token costs and licence fees: at current pricing, they are negligible compared to average developer hourly rates. Where that lands in two years remains to be seen.

Well, as we consultants say, there's no such thing as a free lunch, but some are considerably cheaper than others! The real cost-benefit question isn't "is governed delivery cheaper than ungoverned?" It's "is the rework you're doing now cheaper than the specification work you're skipping?"

Investing in reasonably detailed specifications and planning in small steps doesn't mean going back to the old days of Document-Driven Design. It's enough to write a "good enough" specification and let the AI help you refine it, find the relevant information, and synthesize it together before pressing run.

Structured artifacts, not vague descriptions

What is the optimal level of detail then? The answer depends on your team and context, so as usual, 'it depends'. The direction is clear, however: structured, detailed planning artifacts beat vague descriptions every time. They give agents unambiguous inputs, enable automated validation, and make the specification machine-readable without sacrificing human readability. For a practical calibration (how much spec effort different story types actually need), see Chapter 16's calibration table.

Not all specification practices are equal. An analysis published by Thoughtworks examined specification-driven tools such as Amazon's Kiro, GitHub's spec-kit, and Tessl. The authors, Martin Fowler himself among them, identified three maturity levels that describe how tightly the specification is coupled to the code it describes.

Maturity level | Approach | Spec lifecycle
Spec-first | Write the specification before code | Created per task; may drift or be discarded once coding begins
Spec-anchored | Keep the spec alive through implementation | Persists alongside code, updated as requirements evolve
Spec-as-source | Spec is the single source of truth | Code is generated from the spec; humans edit only the spec

Most tools operate at the spec-first level: they help you write a plan before coding, but the plan is essentially a detailed prompt that gets consumed and forgotten. A throwaway idea paper, basically.

The spec-anchored level is more persistent. The specification lives alongside the code and evolves with it. Specifications are versioned, referenced during implementation, and updated when requirements change.

The spec-as-source level is the most ambitious: the specification becomes the maintained artifact, and code is a generated derivative. This echoes earlier attempts at model-driven development (MDD), which never took off for business applications because it sat at an awkward abstraction level.

If you're old enough to remember MDD, you might be thinking "we've seen this movie before." Fair. But the differences matter. MDD required formal models: UML, BPMN, DSLs, and specialized tooling like Rational Rose that cost more than your car. SDD specs are markdown files. The barrier to entry is a text editor and the ability to write sentences. MDD was also all-or-nothing: if the model was incomplete, the generator produced nothing. LLMs degrade gracefully: a vague spec produces mediocre output, a precise spec produces good output, but you're never stuck staring at a compilation error in your class diagram. That said, if this turns into "spec files that are as complex as the code they describe," we've learned nothing. Keep them simple or don't bother.

I'd personally go for a spec-as-source but refine-it-as-you-go approach. Have one source of truth, and this time don't let it be the code (even though the code is what ultimately runs).

Specifications force smaller steps

The step size principle from Chapter 5 applies directly here: specification-driven design naturally enforces smaller steps because the act of writing a specification forces you to decompose work into units that can be clearly described. If you can't describe it clearly, it's too big.

And if you invest a bit in your tooling, like a '3rd Degree Interrogator Agent' that rips your spec apart and asks you questions about it until it's clear enough, you can get to a good specification much faster than doing it manually.
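Such an interrogator can be sketched as a simple loop. This is a minimal sketch, assuming a hypothetical `ask_model` callable wrapping whatever LLM API you use; the `SPEC_CLEAR` stop token and the prompt wording are invented for illustration:

```python
# Sketch of a "3rd Degree Interrogator" loop. `ask_model` and `answer` are
# injected callables: the first talks to your LLM of choice, the second
# collects the human's reply. All names here are assumptions.

INTERROGATOR_PROMPT = """You are a specification interrogator.
Read the draft spec below. If anything is ambiguous, ask ONE clarifying
question. If the spec is unambiguous, reply exactly: SPEC_CLEAR.

Draft spec:
{spec}
"""

def refine_spec(spec: str, ask_model, answer, max_rounds: int = 10) -> str:
    """Loop: the model asks a question, the human answers, and the Q/A pair
    is appended to the spec, until the model declares the spec clear."""
    for _ in range(max_rounds):
        reply = ask_model(INTERROGATOR_PROMPT.format(spec=spec))
        if reply.strip() == "SPEC_CLEAR":
            break
        spec += f"\n\nQ: {reply}\nA: {answer(reply)}"
    return spec
```

The design choice worth noting is the one-question-at-a-time constraint: it keeps each round small enough that you actually answer, instead of skimming a wall of twenty questions.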

Every one of my game's ~200 features started as a specification. What the feature does, how it interacts with the player, what kind of color to use for the icon, where to render this or that. The AI-driven interview, where the agent asks clarifying questions before writing code, became my favorite part of the process. Not because it was efficient, but because it forced me to think about what I actually wanted before any code existed.

Context rot and how specification fights it

Qodo's 2025 survey of 609 developers found that while 78% reported productivity gains from AI coding tools, 65% said AI missed critical context during refactoring. Refactoring is probably the task most dependent on understanding how your code actually works: following call graphs, dependencies, and usage sites.

To make it worse, these kinds of links are often masked behind complex conditional logic, a DI framework, a dynamic plugin, or DLL hell -- or whatever evil thing the original developer was able to summon. For example, let's take a three-level nested if block that calls something else, which calls something else, which calls something else, and so on. Perhaps by passing nicely renamed state variables and some !boolean logic down the path. Good luck, AI.
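To make the trap concrete, here's a contrived sketch of that kind of chain; all names are invented for illustration:

```python
# A contrived sketch of control flow that hides the real call graph:
# state renamed at every hop, negated booleans passed down three levels.

def handle(evt, cfg, disabled):
    if not disabled:
        if evt.get("kind") == "sync":
            if cfg.get("legacy_mode"):
                # Double negation: "skip_hooks" flips into "flag".
                return _dispatch(evt, flag=not cfg.get("skip_hooks", False))
    return None

def _dispatch(payload, flag):      # "evt" is now "payload"
    return _apply(payload, enabled=flag)

def _apply(item, enabled):         # "payload" is now "item"
    return item["value"] * 2 if enabled else item["value"]
```

To see that `handle` ever reaches `_apply` with `enabled=True`, an agent has to hold three conditions, two renames, and one negation in its head at once. Multiply that by a real codebase and the 65% figure stops being surprising.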

Naturally, the distinction between 'refactoring', 'fixing', and 'building new features' is not always clear, but the point stands. The study cited here found similar gaps in test generation (60%) and code review (58%).

Qodo's study also found that context gaps were cited more often than hallucinations as the root cause of poor AI-generated code. As you'd guess, senior engineers were more likely to report these issues and reported more frustration than juniors.

This is an example of 'context rot' in action.

When conversations or 'agent runs' grow longer, more files are read, tool calls pile up (10 kB of JSON a pop), and your CLAUDE.md is already 56 kB, half of it in ALL CAPS (THIS IS IMPORTANT: [something the LLM keeps ignoring]) -- it's pretty obvious your AI's context window is full of noise and irrelevant information. Previous decisions and golden rules fade from the context window, and there's just too much conflicting data inside.
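A back-of-envelope calculation shows how fast this adds up. The sketch below reuses the numbers from the text (56 kB CLAUDE.md, ~10 kB per tool call) and assumes roughly 4 bytes per token and a 200k-token window; both figures are rough heuristics, not tokenizer-accurate:

```python
# Back-of-envelope context accounting. The bytes-per-token ratio and the
# window size are assumptions for illustration, not exact values.

BYTES_PER_TOKEN = 4          # rough heuristic, not a real tokenizer
CONTEXT_TOKENS = 200_000     # assumed window size

def tokens(kb: float) -> int:
    return int(kb * 1024 / BYTES_PER_TOKEN)

used = tokens(56)            # CLAUDE.md
used += 40 * tokens(10)      # 40 tool calls at ~10 kB of JSON each
used += tokens(300)          # files read into context along the way
print(f"{used / CONTEXT_TOKENS:.0%} of the window gone")  # prints: 97% of the window gone
```

One moderately busy agent run, and the window is nearly full before the model gets to the actual task.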

Don't rot your context, man!

Nobody said the waterfall is back

Critics of SDD sometimes object that specification-driven design sounds like waterfall: big upfront design, rigid requirements, slow response to change. This mischaracterizes the approach.

Guess what: nobody said that is needed. You can iterate as you like, but you need to go a bit further left than before, and that fantastic AI guy is going to write and cross-check most of it for you. And neither Rational Rose, Visio, nor anything from Microsoft for that matter is needed: nice, compact, clean MD files will do, all revision-controlled, with no 'Approved by: [insert name here]' and 'Date: [insert date here]' stamped on top of them either.

Yes, and you can refine them as you go.

How to not end up like Confluence

Every experienced developer has seen "living documentation" that doesn't look very much alive. Confluence, SharePoint, whatever: the place where information goes to die. So why would specifications be any different this time? I think that's a fair question. Here's how I think about it.

First, have a clear single source of truth for how the system should behave from the user's or external actor's point of view. Link your implementation artifacts to it. After you've implemented a feature, do a quick sync pass: does the spec still describe what you actually built? If not, update it. Takes minutes, not hours. That's the difference between a living document and a fossil.

Second, don't try to document every little detail. Look and feel, pixel-level UX flows, individual field validations: these change constantly and are better expressed in code anyway. Your spec should capture what the system does and why, not every checkbox and tooltip. The detailed behavior? That belongs in tests. Tests are documentation that actually runs. When someone asks "what does this feature actually do?", the answer should be in a test case, not in a Word document from six months ago.

Third, ideally keep specs in the repo, version-controlled, next to the code, reviewed in the same pull request. If that's not possible (and I know it's not always possible), at least make sure your task artifacts point explicitly to the external source, even if that's Confluence. The worst case is when the spec exists somewhere but nothing in the codebase even references it. If your specs live in one system and your code lives in another with no links between them, they will drift apart. That's not a hypothesis, that's a law of nature.
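A drift check like this can start very small. Below is a minimal sketch, assuming specs live in a `specs/` folder and are referenced by file name from Python sources; the paths and conventions are assumptions to adapt to your repo:

```python
# Minimal drift check: every spec file under specs/ should be referenced
# by at least one source file. Folder layout and file globs are assumed.

from pathlib import Path

def unreferenced_specs(repo: Path) -> list[str]:
    """Return spec file names that no source file mentions."""
    sources = " ".join(
        p.read_text(errors="ignore")
        for p in repo.rglob("*.py")        # adjust the glob to your languages
    )
    return sorted(
        spec.name
        for spec in (repo / "specs").glob("*.md")
        if spec.name not in sources
    )
```

Run it in CI and a spec that nothing points at becomes a visible failure instead of a silently rotting page.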

Now here's what I think is actually different this time. When your agents use specs as input, specs become load-bearing. They're not decorative. When a spec goes stale, the next agent run produces visibly wrong output and somebody notices. A Confluence page that nobody reads can be wrong for years and nobody cares. A spec that your agents actually read and act on? That breaks loudly. And that changes the incentive structure entirely.

We've been here before, of course. UML diagrams that diverged from the code after the first sprint. Figma designs that stopped matching the UI after the third iteration. Architecture documents that described the system as it was six months ago. The failure mode is always the same: the model and the reality drift apart, and nobody has the time or motivation to keep them in sync.

What might actually be different this time is that AI can work both sides. It can read the spec to generate code, and it can read the code to update the spec. The sync pass doesn't have to be a manual chore that everyone skips. An agent can compare what was specified with what was built and flag the gaps, or even propose the updates. We have a better fighting chance at maintaining a living description of our system than we ever did with Visio diagrams and Confluence pages. Whether that fighting chance turns into sustained practice remains to be seen.

And when spec and code diverge anyway (they will), that's a signal, not a sin. At 2 AM when production is on fire, nobody is going to update the spec before fixing the code. That's fine. Fix the code, update the spec as part of the post-incident cleanup, same as you'd update tests. The spec is the intended source of truth, not an inviolable law. The alternative, pretending the spec is always correct, is how MDD died.

I'll be honest about one thing though: external-facing documentation is a separate problem, and I haven't really figured that out. Wikis, user guides, API docs, these tend to go stale regardless of how much you preach about keeping them current. Making them part of the regular agentic workflow is the right direction, but doing it well in a larger project remains a challenge. That said, the situation is still considerably better when you at least have structured specs and tests as a baseline. In most projects I've seen, the documentation amounted to chained JIRA user stories, an oddball developer_readme.txt nobody ever updated, and code with meager comments. The bar is not exactly high.