Where the Framework Runs Out
Current state analysis
What I've been up to in my day job (which is not writing this stuff at all) has been building a governed agentic development framework. I would have preferred to get most of it from ready-made tools, but after some time in the scene I've found the governance side in particular largely missing from mainstream toolsets.
This book reflects what I learned while building that framework and putting it into action. The small side projects I've also used in this book covered the research side with more freedom, and I had fun doing it.
While the work I've done so far does a decent job, a lot of work remains. Keeping an agent setup up to date is a challenging task in itself. That's one of the reasons I think we will see new job openings for Agent Orchestrators or something like that, i.e. people responsible for taking care of your development infrastructure: agent recipes, orchestration, supporting documents, and tools. As with the cloud, the result might be more complicated than what we used to have. (I've actually already seen some such openings, and I'm slowly becoming one myself, too.)
I'll go through some of my observations and open items in the table below, with selected points of view in more detail later in this chapter.
| Problem | Current State | Why It's Hard | Potential Paths Forward |
|---|---|---|---|
| Circuit breakers | No good implementation yet | Current tooling doesn't support detection of wandering or looping | Monitoring agent outcomes, detecting divergence from requirements, automated halt triggers |
| Feedback loops | Review gates exist but closing the loop is manual | Gathering info on agentic failures and improving context engineering is still manual work | Structured failure analysis, automated feedback into context engineering |
| Learning from mistakes | Ad hoc at best | No systematic way to capture and learn from agentic failures | Coroner agent for post-mortems, pattern libraries, institutional memory |
| Pace of change | Framework is tool-agnostic but assumptions drift | Tools and categories shift faster than processes can track | Meta-practice: assigned owners, evaluation cadence, clear migration criteria |
| Human readiness | Many teams lack AI governance skills | Requires cultural shift from coding to governing agents | Training programs, skill development, architectural judgment emphasis |
| Measurement | Speed metrics tracked, direction metrics missing | Most orgs measure velocity, not requirement fitness or maintainability | Add flow metrics, track plan-stage catches, measure defect rates by governance |
| Security and regulatory gaps | Partial coverage via reviews and gates | AI-generated code introduces novel risks; legal situation still forming | Enhanced auditing, EU AI Act compliance tracking (August 2026 deadline), liability clarity |
| Directional velocity | Governance gates help catch directional errors | Organizations optimize for speed over direction | Culture shift needed toward valuing direction alongside speed |
| IPR and legal risks | Risk management, not resolution | Confidentiality, license contamination, and liability questions unresolved; law still catching up | Data residency requirements, explicit IP policies, human approval chains for defensibility |
| Sustainability | Not yet addressed | Energy consumption, skill degradation, overreliance risks | Monitor energy usage, maintain developer skill development, platform diversification |
Circuit breakers
Circuit breakers are a critical safety mechanism for catching and halting negative feedback loops before they cause significant damage. In the context of governed agentic development, I haven't figured out a good way to implement them yet, and the current tooling does not support this very well either.
What I'm after here is the ability to detect wandering, going off the rails, or getting stuck in a loop. The idea is to have monitoring in place that tracks the outcomes of agentic development runs and identifies when things are going wrong. Such observations could be as simple as an agent repeatedly generating code that fails tests or other constraints, or making repeated but failing tool calls.
When such patterns are detected, a circuit breaker would trigger, halting the process and alerting humans to intervene. From there you can investigate what went wrong: tool calls failing silently, documents not being followed, conflicting instructions, and so on.
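To make the idea concrete, here's a minimal sketch in Python of the kind of breaker I have in mind. The signals (consecutive failing test runs, identical repeated tool calls) and the thresholds are hypothetical examples for illustration, not part of any existing tool:

```python
from dataclasses import dataclass

@dataclass
class CircuitBreaker:
    """Trips when an agent shows signs of looping or wandering.

    Thresholds are illustrative; real values would be tuned per project.
    """
    max_consecutive_failures: int = 3   # e.g. repeated failing test runs
    max_repeated_calls: int = 5         # identical tool calls in a row
    _failures: int = 0
    _last_call: str = ""
    _repeats: int = 0
    tripped: bool = False

    def record_test_result(self, passed: bool) -> None:
        # A pass resets the streak; a failure extends it.
        self._failures = 0 if passed else self._failures + 1
        if self._failures >= self.max_consecutive_failures:
            self.tripped = True

    def record_tool_call(self, call_signature: str) -> None:
        # Count identical consecutive tool calls as a looping signal.
        if call_signature == self._last_call:
            self._repeats += 1
        else:
            self._last_call, self._repeats = call_signature, 1
        if self._repeats >= self.max_repeated_calls:
            self.tripped = True
```

The orchestrator would check `tripped` after every event and halt the run when it flips, handing the logs to a human.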
Given that our current setup is not yet the autonomous software factory running hundreds of agents in parallel in separate git worktrees, but still a human-triggered process, this hasn't been a priority-one item. It will become essential as we strive for more autonomy and scale.
In the game project, the agent actually worked pretty well within each task. The meandering and loops that plague larger scale solutions were rare and easy to spot when you're the only one watching. In a team setting that's not viable. The goal should be that agents perform their tasks independently and deviations are automatically detected.
Feedback loops
More about this later, but stemming from the circuit breaker topic, feedback loops are a broader challenge.
The current framework includes review gates and checklists that are designed to catch issues before they propagate, and help the developer correct course if needed.
The question remains: how do you gather information about agentic failures and improve the context engineering and agent framework to lessen the chances of repeating the same mistakes? Right now, detecting and analyzing failures is still pretty much down to somebody doing it manually. We've made some effort towards this, but it's still an open problem.
Learning from mistakes & reinforcement learning
A close relative to feedback loops is the idea of learning from mistakes. In traditional software development, when something goes wrong, developers analyze the failure, identify the root cause, and adjust their approach. If you are still on speaking terms with your colleagues, you might have a chat about it, too. In an agentic development framework, we need a systematic way to capture and learn from failures.
One way I've been thinking to semi-automate this is through a structured post-mortem process for significant failures. Imagine a coroner agent whose task is to analyze run logs, results, and changes, and suggest improvements to the context engineering, specification writing, or even the framework itself. This could create a feedback loop that helps the system learn from its mistakes and improve over time. We're still missing this piece.
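As a sketch of what the triage half of such a coroner agent could look like, here's a rule-based version in Python. The failure signatures and the suggested follow-ups are made-up examples; in a real implementation the interesting analysis would be handed to an LLM, with this kind of triage as a cheap first pass:

```python
import re
from collections import Counter

# Hypothetical failure signatures a coroner pass might look for in run logs.
PATTERNS = {
    "failing_tool_call": re.compile(r"tool call .* failed", re.IGNORECASE),
    "ignored_instruction": re.compile(r"instruction .* not followed", re.IGNORECASE),
    "test_failure": re.compile(r"tests? failed", re.IGNORECASE),
}

def triage_run_log(log_text: str) -> dict[str, int]:
    """Count occurrences of known failure signatures in a run log."""
    counts: Counter = Counter()
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return dict(counts)

def suggest_followups(counts: dict[str, int]) -> list[str]:
    """Turn triage counts into post-mortem action items (illustrative rules)."""
    suggestions = []
    if counts.get("failing_tool_call", 0) > 2:
        suggestions.append("Review tool configuration; tool calls failed repeatedly.")
    if counts.get("ignored_instruction", 0) > 0:
        suggestions.append("Tighten or shorten context documents; instructions were ignored.")
    if counts.get("test_failure", 0) > 3:
        suggestions.append("Check whether the spec and the tests contradict each other.")
    return suggestions
```

The output would feed the human post-mortem, or eventually an agent that proposes edits to the context documents themselves.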
Dealing with the pace of change
Keeping up with the pace of change is one of the hardest practical challenges.
Even the basic tools we rely on keep adding new features and capabilities. This could be anything from cross-tool compatibility changes, such as new YAML frontmatter options, to other updates that break the assumptions baked into your framework.
Imagine you've just gotten your agents, project docs, tools, and everything in order, ran a few successful runs, and then a piece of bad advice gets saved to a memory system and things deteriorate due to looping tool calls.
All of this is hard to debug because of the low transparency on what exactly is in the context. A hint: GitHub Copilot in VS Code offers very detailed logs that you can analyze to see what exactly was in the context and what the model was doing.
What I've been trying to build is a tool-agnostic framework with patterns to manage planning, phasing, basic context engineering, agentic job coordination, and splitting work to specialists when needed. Together, these help (sometimes more, sometimes less) to get more done in less time with equal or better quality. I've wanted to keep people in the loop to prepare and review things. And despite our stack being very common, we still need people who can really code, debug and refactor.
If you're doing the same as I am, i.e. trying to bend what's possible right now into your settings, remember that just getting things up and running takes a lot of time, effort, and research. Keeping it running requires constantly revisiting what already worked. Compared to old-fashioned maintenance tasks such as dealing with dependencies, build tools, and CI/CD pipelines, you'll have more to deal with. And all the previous things will still be there, too.
A good example of the field's immaturity: YAML frontmatter and skill/tool calling logic seem to change on every release. The frontmatter options are supposed to be "standards" but they're not. In practice, what works on one tool does not work the same way on another. So mixing different tools on the same project is a headache or at least a nuisance. As of early 2026, most common agentic tools are only beginning to read each other's configuration files, and their interoperability remains fragile.
Human readiness
Skeptics are there and they do have a point. The shift from "developer who writes code" to "delivery lead who governs agents" is cultural, not just procedural. Most developers were hired and trained to produce code. Governing AI-generated code requires different skills: architecture judgment, specification writing, quality evaluation, and the discipline to review what agents produce rather than rubber-stamping it. Chapters 5 and 6 explore this shift in depth.
The long-term question: will the ability to figure things out yourself become a lost art? If you haven't touched code directly for a long time, at what point have you lost the skill to do so?
The production feedback loop
Anyone ever built software that performed beautifully in development and testing but then failed in production? Yeah, me too. The feedback loop from production back to development is critical for catching issues that only surface in real-world usage. This time, it's even more important.
This is not an easy thing to solve. Even with real people in the loop, ensuring quality attributes are met (such as reliability, security, and maintainability) has always been hard. These kinds of cross-cutting concerns are by nature determined by patterns and architecture, not produced by individual backlog items.
With agents in the loop, we need to capture production signals (errors, performance metrics, user feedback) and feed them back into the development process to inform future planning and context engineering. This requires instrumentation, monitoring, and a cultural shift in how teams think about the relationship between development and operations. This was the core tenet of the DevOps movement, and we might actually have a better chance to live by it than ever before.
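As a minimal sketch of the capture side, here's a hypothetical event shape and a ranking function in Python: find the components generating the most production signals, so the noisiest ones can be fed back into planning context. The field names and event kinds are invented for illustration:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class ProductionEvent:
    """One production signal, e.g. from error tracking or user feedback (illustrative shape)."""
    component: str  # which part of the system produced the signal
    kind: str       # e.g. "error", "latency", "user_report"

def hotspots(events: list[ProductionEvent], top_n: int = 3) -> list[str]:
    """Rank components by production signal volume.

    The resulting list is what a planning step (or an agent's context
    document) would consume: "these areas hurt in production, prioritize them."
    """
    counts = Counter(e.component for e in events)
    return [name for name, _ in counts.most_common(top_n)]
```

The hard part isn't this aggregation; it's wiring real monitoring systems into it and getting teams to act on the output.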
Why this remains an open problem: the tools aren't there yet, and they are hard to integrate.
Measurement
Without flow metrics, you can't tell if the process adds value or ceremony. Most organizations measure speed (velocity, cycle time, deployment frequency) and almost none measure direction (requirement fitness, architectural coherence, long-term maintainability).
The framework needs measurement to justify its overhead: how much faster is story delivery with governance vs. without? How many plan-stage catches prevented expensive rework? What's the defect rate for governed vs. ungoverned stories? Without this data, adoption decisions are faith-based.
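As a sketch of what the bookkeeping could look like, here's a hypothetical Story record and two of the metrics above in Python. The fields are illustrative, not from any real tracker:

```python
from dataclasses import dataclass

@dataclass
class Story:
    """Per-story outcome data (illustrative schema)."""
    governed: bool           # went through the governance gates
    defects: int             # defects found after merge
    plan_stage_catches: int  # issues caught during planning/review

def defect_rate(stories: list[Story], governed: bool) -> float:
    """Average post-merge defects per story, split by governance."""
    group = [s for s in stories if s.governed == governed]
    return sum(s.defects for s in group) / len(group) if group else 0.0

def plan_stage_catch_rate(stories: list[Story]) -> float:
    """Share of all known issues caught at plan stage rather than after merge."""
    catches = sum(s.plan_stage_catches for s in stories)
    defects = sum(s.defects for s in stories)
    total = catches + defects
    return catches / total if total else 0.0
```

Even this much would move adoption decisions from faith to data: compare the two defect rates, and watch whether the plan-stage catch rate climbs as context engineering improves.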
Security and regulatory gaps
It has been argued that AI-generated code is more insecure or buggy than what people write manually. It's not that simple.
Yes, AI is probably prone to use outdated patterns and sometimes insecure samples from its training data. It can make stupid mistakes that a security-aware developer wouldn't. The Fu et al. data from Chapter 3 (security weaknesses in roughly a quarter of AI-generated snippets) is not reassuring. So the starting point is honest: AI-generated code needs at least as much security scrutiny as human-written code, and probably more.
But here's the other side. With AI, you can take your security auditing to a level that most teams never had the budget or the people for. You can build custom security scanning agents on top of existing tools like SAST, DAST, and dependency checkers. Don't ditch those, obviously, but you can layer AI on top to catch things the existing tools miss: business logic vulnerabilities, subtle auth flaws, misconfigured middleware. You can also use AI as a kind of fuzzy security tester, because it's naturally good at doing things slightly differently each run, which is exactly what you want when probing for edge cases. Very few organizations had the money or experts to do that kind of thorough security work before, let alone at this scale.
The net effect on security depends entirely on whether you actually use that capability or just let agents generate code and hope for the best. And regardless of what you do internally, anything going live on the internet should be audited by competent people, preferably a third party. That was true before AI and it's doubly true now.
On the regulatory side, AI introduces novel risks beyond just code quality: training data leakage, hallucinated dependencies, prompt injection vulnerabilities, and the tendency to generate plausible-looking code that passes superficial review. The regulatory picture is still forming. The EU AI Act enforcement deadline in August 2026 will force clearer answers on liability when an agent introduces a vulnerability. The Cyber Resilience Act adds secure-by-design requirements and mandatory security updates for 5+ years. Whether your governance framework satisfies these requirements is a question you'll need legal help to answer, not just engineering judgment.
IPR and legal risks
Every prompt you send to a model provider carries your code, your specifications, and your business logic across a network to someone else's infrastructure. Most enterprise agreements include data processing terms and opt-outs from training on customer data, but the practical enforcement of those terms is trust-based. If your specifications contain trade secrets or proprietary algorithms, you're trusting the provider's isolation guarantees the same way you trust a cloud vendor.
As a rule of thumb, never store sensitive information of any kind in a public cloud, and don't send it to an LLM either. If you handle that kind of data, you'll need to run your own models.
Then there's the question of what comes back. Models trained on vast codebases can reproduce patterns, idioms, and occasionally near-verbatim snippets from their training data. Personally I think this might be more of an issue for creative work, such as novelists, than for engineers who've been de facto doing the same thing for years with Stack Overflow.
Anyway, the risk of introducing copyleft-licensed code into a proprietary codebase is real but nearly impossible to detect at scale. No reliable tooling exists yet to scan AI-generated output for license contamination the way we scan dependencies.
Liability is a sharper issue when you run an LLM as part of your product. Recent debates about how AI can be used for purposes like mass surveillance, or as tools that lead to lethal use of force, have put this topic again in the spotlight. I think it's different if you're generating user-facing content, analyzing data, etc. than when you use AI as a development tool. If you run an LLM that processes your data, inputs or outputs, or makes decisions during runtime — which, as I've explained to death in this book, WILL go sideways regularly — you have the responsibility for it, pretty much the same way you have the responsibility for any other software component in your stack.
Taming the implications of having an LLM as part of your running system is not in the scope of this book, so we'll move on. For those of you who do that, best of luck.
Sustainability
Economics
How long will we enjoy nearly unlimited tokens per user at a few dozen euros per month? I'm not sure. Perhaps the next generation of LLM architectures will find vastly more efficient ways to produce those tokens. Perhaps specialized hardware will handle inference at a fraction of today's energy cost.
But the fact is that beyond Nvidia, not many are making money hosting LLMs. The biggest providers seem to be deep in the red. How long can they sustain that?
The longer-term risk is dependency. If we gradually lose the ability to create software without this technology, and the current consumer-friendly pricing goes up significantly, organizations that bet everything on AI-assisted delivery could find themselves locked in.
Environment and global issues
Since the early days of cloud computing, energy consumption of data centers has skyrocketed. The AI trend has been a step change in this steep upward curve. The numbers are huge, and it will not be just wind turbines powering the GPUs consuming your tokens.
Then there are geopolitical risks. We've already built deep dependencies on technology and services from a small number of sources. There has always been a confidentiality risk, but now we might potentially face something more consequential. So, diversifying across providers and maintaining the ability to operate without any single one seems prudent. Fortunately, the LLM field has genuine global competition despite the two biggest players making the most headlines.
Education and skill development
Last but not least, echoed by voices from schools and academia: we have a generation of people in the pipeline who might never learn to write properly, solve equations, or summarize anything by themselves, let alone write code. It's a real concern.
The same applies to us adults. Why bother going deep into anything if you can LLM your way around it? We are by nature opportunistic and take the path of least resistance. Some of this is indeed valid and smart, but the long-term effects of outsourcing most of your thinking to language models could be significant.
Applying a framework like the one described in this book requires people who understand systems deeply enough to govern them. If we stop developing that understanding, governance becomes impossible too.