Developer Intervention Required
User checkpoints, testing as confidence, and the regeneration option
Shift left, verify right
Where should we engineers, testers, and other specialists in the software business shift our focus to make the most impact? One thing I've noticed is that new roles will emerge, partly because the tools and processes are still immature. People need to rethink what their daily tasks will actually be. I see no return to handcrafting code line by line outside very specialist areas.
One dominant theme is the shift left, as already discussed in this book. More work needs to be done up front (left) and at the end (right), letting the factory run the boring part in the middle. The explorative aspect of coding (or design) remains, but the tools operate at a much higher level. In a way you can experiment and compare a hell of a lot faster than just a year or two ago.
We must also think about checkpoints, even hard stops, where human judgment is still required. What would these be?
I'd approach this puzzle by summoning the good ol' 5W+1H problem-framing technique:
5W+1H assumes that to solve something, all of these questions need decent answers, which appeals to those with an engineering mindset (that's the tunnel vision a member of my family keeps talking about, right?). We engineers, and especially software engineers, have always been obsessed with the H (How) and cared somewhat less about the Ws. That remains a problem to this day.
Now, engineers with their TLAs and IDEs who are told to take on an advising role about the "How" might finally give attention to "What, Why, When, Who, and Where". Perhaps this could've been a good idea all along. Anyway, in more practical terms, this new RACI could look something like this:
| 5W+1H | Checkpoint | Validation | What It Catches |
|---|---|---|---|
| What & Why | Requirements approval | AI understood the requirement correctly, scope is bounded, problem is worth solving | Building wrong things before any code is generated |
| When & Who | Plan approval | Phasing makes sense, task breakdown is realistic, dependencies are identified | Planning failures, unrealistic scope, missing sequencing |
| What & How | Code review | Implementation matches the plan, tests cover acceptance criteria, quality standards met | Quality drift before code enters main branch |
| How | Architecture compliance | Solution follows architectural boundaries, conventions, and patterns | Architectural erosion, structural decisions that compound over time |
| How | Tooling evaluation | Current models, tools, and agent setup are still the best fit | Stale infrastructure, falling behind on capabilities |
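To make the checkpoint table concrete, here is a minimal sketch of the same gates as data, plus a helper that tells you which gate still awaits human sign-off. The names and questions are illustrative, not a fixed schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Checkpoint:
    name: str
    questions: tuple  # what the human reviewer must confirm before sign-off

# Hypothetical pipeline mirroring the table above (names are my own).
PIPELINE = [
    Checkpoint("requirements_approval",
               ("AI understood the requirement", "scope is bounded",
                "problem is worth solving")),
    Checkpoint("plan_approval",
               ("phasing makes sense", "task breakdown is realistic",
                "dependencies are identified")),
    Checkpoint("code_review",
               ("implementation matches the plan",
                "tests cover acceptance criteria")),
    Checkpoint("architecture_compliance",
               ("follows architectural boundaries",
                "conventions and patterns respected")),
]

def next_gate(approved: dict) -> str:
    """Return the first checkpoint still awaiting human sign-off."""
    for cp in PIPELINE:
        if not approved.get(cp.name, False):
            return cp.name
    return "done"
```

For example, with only `requirements_approval` signed off, `next_gate` points the reviewer at `plan_approval`; when every gate is approved, it returns `"done"`.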
Obstacles to adoption
Getting people on board with these new job descriptions is going to be hard.
The specification discipline is new for many developers who, despite working in engineering, have been used to working like artists or craftspeople for quite some time. Luckily not in aeronautics or bridge building, but in the software industry. We are used to making decisions on the fly and adjusting as we go.
Another challenge is that in practice you'll end up verifying the same things over and over, and the artifacts are often wordy, repetitive, and not that fun to read.
Sooner or later you'll find yourself pressing "pass," "yes," "agreed," "do it" and moving on without reading or checking anything. That's not always a problem, as you can always revisit, adjust, and continue, but the risk of becoming too disconnected from the actual work is real.
Hence:
The checkpoints should be kept small, intuitive, and focused on the most critical things.
AI fatigue, and a kind of laziness, will get you too. It's just so tempting to throw smaller and smaller things at the fancy model at your fingertips, and essentially stop really thinking. Don't do that: first, it's not good for you; second, it's a waste of resources; and third, it leads to more mistakes and rework.
A picture is worth a thousand words
To keep the review business viable, try to make it graphical.
In the example below, I've provided a simple ASCII graphic. This is what my AI UX Agent gave me to review. Not impressed? There were more details down to individual controls, but this was what I thought about most and found most valuable.
Any intermediate planning artifacts for real persons must be visual, structured, and easy to review.
As rudimentary and boring as my example probably is, it is a sanitized real-world case of a dashboard generated from a short description: "give me a page layout and implementation plan for this kind of data with this kind of hierarchy, with search and action list options."
With the rough idea, I could throw in (via project instructions) a basic design system, some generic code principles, and a similar page already implemented as an example.
Here's what I got:
+------------------------------------------------------------------------+
| PageHeader: "ProcessControl" |
| [Search: Seek with order...] [Date: DD.MM.YYYY] [Actions] |
+------------------------------------------------------------------------+
| ProcessList (Master View) Card |
| Tabs: [Active] [History] |
| +------------------------------------------------------------------+ |
| | Id | Unit | Inf1 | Inf2 | Quality | Notes | Amount | ... | |
| |--------|--------|------|------|---------|--------|--------|------| |
| | 12542 | A12 | X | | prem | ... | 10032k | [X] | |
| | 54344 | B05 | | Y | 2nd | ! | 1233 | [-] | |
| +------------------------------------------------------------------+ |
| (flex: 0 0 40%, overflow: hidden) |
+------------------------------------------------------------------------+
| ProcessDetails (Detail View) Card outlined |
| Tabs: [Events] [KPIs] [BOM] |
| +------------------------------------------------------------------+ |
| | [+ Add] | |
| | Date | Event | Notes | Start | End | .. | | |
| |---------------|---------|------------|--------|-------------|----| |
| | 1202 | Bling | BP | 14:30:15 | 14:32 | 1m | | |
| +------------------------------------------------------------------+ |
| (flex: 1 1 auto, overflow: hidden) |
+------------------------------------------------------------------------+
Drawers (rendered at page level, not in content area):
- RecordDrawer (right side)
- QualityDrawer (bottom, placement="bottom", size="large")

Compare this to reading hundreds of lines of text and trying to figure out what I'm going to get, day in, day out, and then waiting an hour or two to discover the result was nothing I wanted. In case you wonder, no, your visual drafts don't have to be the 80x25 ASCII art I so proudly showcase here (motivated mostly by nostalgia, I reckon). Go ahead and spin up an HTML version that looks very close to the final product, and review that.
Which of the above would you rather read? Use AI to generate summaries and graphical representations of the tasks, designs, and plans. Like a Gantt chart of the tasks, a diagram of the solution components and blast radius, a draft screenshot of the UI. All this is available at your fingertips.
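As a sketch of what "graphical at your fingertips" can mean, here is a tiny hypothetical helper that turns a task breakdown into Mermaid Gantt source, which any Mermaid renderer can turn into a chart for review. The function and its input shape are my own illustration:

```python
def mermaid_gantt(tasks):
    """Render a task list as Mermaid Gantt source for human review.

    tasks: list of (name, start_date, duration_days) tuples.
    """
    lines = ["gantt", "  dateFormat YYYY-MM-DD", "  title Story plan"]
    for name, start, days in tasks:
        # One Gantt bar per task; dependencies could be layered in later.
        lines.append(f"  {name} : {start}, {days}d")
    return "\n".join(lines)

chart = mermaid_gantt([
    ("Design API", "2025-01-06", 3),
    ("Implement endpoints", "2025-01-09", 5),
])
```

A reviewer glances at the rendered chart instead of parsing a wall of task text, which is the whole point.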
While you can certainly still vibecode your way through all this, or "Lovablise" it, it might not give you anything solid to build on. The point of having somebody with a real brain in the loop is to keep the train on track, keeping things manageable and visual for the humans.
Your new role
You, in a governed loop, are not a programmer. You are an architect, a lead, a quality authority. The role shifts from producing code to governing delivery.
This is a genuine cultural shift. Most development organizations are structured around production: developers produce code, testers produce test cases, leads produce architecture documents. In a governed AI loop, the AI produces most of this. Your value is in the decisions: is this plan correct? Does this implementation meet our standards? Should this be shipped?
This requires different skills, different judgment, and different ways of measuring contribution. Organizations that try to run governed delivery with the old mental model, where "value" means "lines of code written", will find the framework frustrating. The value is in the governance, not the generation.
Organizations that embrace the shift often find their senior engineers are happier. The tedious parts of coding (boilerplate, routine transformations, repetitive patterns) are delegated to AI. The interesting parts (architecture, design decisions, quality judgment) become the focus of their attention.
Testing as the confidence layer
If AI output is probabilistic, testing converts probability into confidence.
| Testing Layer | What It Catches | Why It Matters for AI Code |
|---|---|---|
| Unit tests | Logic errors, incorrect return values | Verifies the agent got the core behavior right |
| Contract tests | API mismatches, schema violations | Catches when agents invent or misread API contracts |
| E2E tests | User flow breakage, integration failures | Validates the full story works, not just individual pieces |
| Static analysis | Convention violations, architecture drift | Enforces patterns agents should follow but sometimes don't |
| Policy-as-code | Gate criteria violations, compliance gaps | Automates governance checks that otherwise require someone's judgment |
In the governed pipeline, testing is a first-class stage with its own agent and gate. Tests must map to acceptance criteria before the gate advances. This is not an afterthought or a nice-to-have: it is the mechanism that makes probabilistic generation reliable enough for production.
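A minimal sketch of the "tests must map to acceptance criteria" gate check could look like this. The criterion IDs and test names are illustrative, assuming each test declares which criteria it verifies:

```python
def uncovered_criteria(acceptance_criteria, test_map):
    """Gate check: every acceptance criterion needs at least one mapped test.

    test_map maps test name -> list of criterion IDs it verifies.
    Returns the criteria with no test; an empty list means the gate passes.
    """
    covered = {c for ids in test_map.values() for c in ids}
    return sorted(set(acceptance_criteria) - covered)

missing = uncovered_criteria(
    ["AC-1", "AC-2", "AC-3"],
    {
        "test_login_happy_path": ["AC-1"],
        "test_login_lockout": ["AC-2"],
    },
)
# AC-3 has no test, so the gate blocks until one is written
```

Wiring a check like this into CI turns "tests map to criteria" from a review habit into an automated gate.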
The scope extends progressively: unit tests for logic, contract tests for APIs, E2E tests for user flows, static analysis for conventions and architecture rules, and eventually policy-as-code for gate criteria themselves.
This changes the economics of AI-generated code. When validation is automated and thorough, you can let AI generate aggressively and catch failures cheaply. The tradeoff shifts from "is this code perfect?" to "does this code pass all the checks?", which is exactly the tradeoff that CI/CD pipelines already manage for code people write.
The more you can automate quality verification, the more safely you can delegate generation to AI agents.
The regeneration option
Here is perhaps the most counterintuitive benefit of a spec-anchored approach: if your specifications are good enough and your traceability chain is complete, you can, in theory, regenerate the entire implementation.
Think about what governed delivery produces for every story: a structured plan (architecture decisions, component mapping, scope boundaries), an ordered task breakdown with dependencies and deliverables, tests that verify acceptance criteria, and a complete traceability chain from story ID through branch, commits, and PR back to the backlog item.
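The traceability chain can be sketched as a small record type. The field names here are my own illustration, not a fixed schema, but they capture the links the text describes:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TraceabilityRecord:
    """One story's chain from backlog item to merged code (illustrative fields)."""
    story_id: str    # backlog item
    plan_ref: str    # structured plan document
    task_ids: tuple  # ordered task breakdown
    test_ids: tuple  # tests verifying acceptance criteria
    branch: str
    pr_number: int

    def complete(self) -> bool:
        # Regeneration is only plausible when no link in the chain is missing.
        return all([self.story_id, self.plan_ref, self.task_ids,
                    self.test_ids, self.branch, self.pr_number])

record = TraceabilityRecord(
    story_id="S-101",
    plan_ref="plans/S-101-plan.md",
    task_ids=("T-1", "T-2", "T-3"),
    test_ids=("test_search_filters",),
    branch="feature/S-101-process-list",
    pr_number=42,
)
```

The point of making the chain explicit data, rather than tribal knowledge, is that a missing link becomes a checkable condition instead of a surprise.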
If the AI models improve, and they will, you could take the same specifications and re-run the Engineering Agent with a better model. If your architecture changes, you could update the plan and regenerate. If a new framework emerges, the tasks could be re-scoped while the requirements stay stable. The specification becomes the durable asset; the code becomes (partially) disposable.
This reframes the relationship between specifications and code. In traditional development, the code is the valuable artifact; specifications are documentation that quickly drifts from reality. In a spec-anchored world, the specification is the valuable artifact; code is a (verifiable) derivation of it.
I never needed to regenerate any part of the game. But the specifications GSD produced for each feature are still there, and the acceptance criteria are still valid. I'd be curious how close a fresh run from those same specs would come to the original result. That's the real test of whether specifications are truly the durable artifact.
The regeneration option is best understood as a theoretical endgame that motivates investment in specification qualityâa direction of travel rather than a current capability. Every improvement in specification precision, every addition to test coverage, moves you closer to this future. But teams should not plan on regeneration working reliably in the near term.
Feedback loop design
The governance boundary is not just about control; it is also where learning happens. Each gate rejection is information: the plan was unclear, the implementation diverged from intent, the tests missed an edge case. Without systematic feedback loops, this information is lost.
Effective feedback loops answer:
What information from review flows into future specifications? When plan approval rejects a specification because the scope was wrong or the architecture would not work, that insight should inform how similar stories are specified in the future.
How does the team systematically learn from gate rejections? Patterns in rejections indicate systemic problems: unclear story descriptions, missing architectural context, inadequate test coverage. Tracking rejection reasons reveals where the process needs improvement.
How do specifications evolve based on implementation experience? When coding reveals that a planned approach will not work, the specification should be updated to reflect the new understanding, not just the code changed while the spec drifts.
The governed delivery artifacts make these feedback loops possible. Because specifications are structured and versioned, you can analyze them. Because the traceability chain is complete, you can connect rejections at later gates back to specification quality at earlier ones.
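As a sketch of that analysis, assuming you log each rejection as a (gate, reason) pair, a few lines are enough to surface where the process is weakest:

```python
from collections import Counter

def rejection_hotspots(rejections):
    """Aggregate gate rejections to find systemic weak points.

    rejections: list of (gate, reason) tuples pulled from review logs.
    Returns gates ordered by rejection count, most rejected first.
    """
    return Counter(gate for gate, _ in rejections).most_common()

hotspots = rejection_hotspots([
    ("plan_approval", "scope unclear"),
    ("plan_approval", "missing dependency"),
    ("code_review", "tests missing"),
])
# plan_approval rejected most often: specification quality is where to invest
```

Grouping by reason instead of by gate is the obvious next step once the logs carry enough structure.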
The governance boundary is not just a control mechanism; it is also a measurement instrument for process improvement. Organizations that treat gate rejections as learning opportunities improve faster than those that treat them as annoyances.