20

Closing the Feedback Loop


Thinking about the agentic world, there are several distinct feedback loops at play. Some of them are familiar and not specific to AI-assisted development at all.

I listed some of the signals we can collect to improve our process. I'll add a few below and discuss ideas for incorporating them into your factory's tuning process.

  • How did the software we developed actually do? How many support tickets came in, and what do the usage patterns look like?
  • How did the software perform in production? Error rates, performance regressions, resource usage?
  • How do we keep the software up to date, secure, and well managed? How do we monitor it and make sure it continues to meet users' needs?
  • Does the code meet our quality standards and architectural intent? Are we accumulating tech debt faster than before?
  • Does the factory work well for the teams and developers? Is there so much to fix afterwards that it becomes more work than was saved?
  • How do we monitor our agents in the 'factory' and continuously improve their performance and accuracy?
  • How well did our specs capture the intent and guide the development? How do we improve them based on what we learn?

The questions above, and more, can be grouped into two separate buckets:

  1. The feedback loops we had before: user feedback, technical characteristics, and maintenance issues. These still exist, and we could no doubt build automation to improve our next delivery accordingly. Think DevOps, and perhaps the constant improvement that should've been part of the Agile way all along.

  2. The new one is the 'factory' feedback loop, where the sender or receiver is no longer a person but your agent. What often goes wrong? Are people struggling to get the most out of the machinery because of constant breakdowns and bad output? Are we feeding the factory correct instructions (specs), and can we find the right level of detail and content for them?

Seven Feedback Loops of the Software Factory

Traditional:
  • User Feedback: usage patterns, support tickets, adoption rates
  • Technical Health: error rates, latency, performance regressions
  • Maintenance: security alerts, dependency drift, incident patterns

Output quality:
  • Code Quality & Architecture: complexity metrics, architecture drift, coverage gaps

Factory-specific:
  • Developer Experience: cycle time, rework rate, developer sentiment
  • Agent Performance: accuracy, hallucination rate, gate rejection patterns
  • Spec Quality: spec-to-implementation drift, ambiguity-caused rework

Let's discuss these loops in detail.

User feedback loop

There's a lot we should learn, and can learn, from the users of our product. Support tickets, on-call feedback, and user sentiment are all important signals that help us understand how our software performs in the real world and what we can do to improve it. Some of it is simple fixes: imagine reporting an incident to support and discovering it's been automatically fixed in minutes instead of the usual days or weeks. Or asking for a small feature or a UX change to help you with your job.

The current way to handle issues is typically something like this:

  1. User reports an issue through support channels (email, chat, ticketing system).
  2. Somebody in the ITILized process triages the issue, assigns it to the right team, and adds it to the backlog.
  3. The team prioritizes the issue based on severity, impact, and other factors, and schedules it for a future sprint or release.
  4. Somebody does a root cause analysis, finds the fix, and implements it in the codebase.
  5. At some point in time, the fix is deployed.
  6. The user is notified (well... maybe).

Typical lead time in the process above, for anything other than showstoppers, is not minutes or hours. Think weeks or months. This kind of process exists in the first place to filter incoming requests and manage the team's workload; a bit like public healthcare, where you need to talk to a nurse (sometimes several) before you can see the doctor.

Wouldn't it be cool if steps 1-5 could be done in minutes, or half an hour (or if you got to see the doctor directly)? Keeping step 6 behind manual control for QC purposes might still be needed, but in clear cases, when for instance our ITIL triage agent can verify from other sources that a fix is indeed needed, go all the way.

So, coming to a software support center near you: the fully automated repair workshop, with the following service menu:

Agentic Issue Resolution

  User reports an issue
  → Triage: classify, reproduce, confirm
  → Diagnose & Fix: root cause, generate fix, test
  → Deploy & Verify: stage, smoke-test, deploy
  → User is notified of the resolution
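A pipeline like the one above can be sketched in a few lines. This is only a hedged sketch: the agent steps are stubbed out as plain functions, and all the names (`Ticket`, `triage`, `resolve`) are illustrative, not any real harness API.

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    id: str
    description: str
    status: str = "open"
    log: list = field(default_factory=list)


# Each step wraps a hypothetical agent call; here they are stubbed
# so only the pipeline shape is shown.
def triage(ticket: Ticket) -> bool:
    """Classify and try to reproduce the issue; True if confirmed."""
    ticket.log.append("triage: classified as bug, reproduction confirmed")
    return True

def diagnose_and_fix(ticket: Ticket) -> str:
    """Root-cause the issue and produce a candidate patch."""
    ticket.log.append("fix: patch generated and unit-tested")
    return "patch-001"

def deploy_and_verify(ticket: Ticket, patch: str) -> bool:
    """Stage the patch, smoke-test, and roll out if green."""
    ticket.log.append(f"deploy: {patch} staged, smoke tests passed")
    return True

def resolve(ticket: Ticket) -> Ticket:
    if not triage(ticket):
        ticket.status = "needs-human"   # escalate unclear cases to a person
        return ticket
    patch = diagnose_and_fix(ticket)
    if deploy_and_verify(ticket, patch):
        ticket.status = "resolved"
        ticket.log.append("user notified of resolution")
    return ticket
```

The important design choice is the escalation branch: anything triage cannot confirm drops out of the automated path and into the normal human process.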

Examples like the one above, where a software factory autonomously fixes products based purely on consumer feedback, may still sound a bit crazy, but they're no longer the 100% science fiction they would've been just a couple of years ago.

Technical feedback loop

Another class of signals a software team can and should respond to is the technical stuff. Error rates, performance regressions, resource usage, and other technical signals are all measurements of how our software is performing in production. By monitoring these signals and responding to them quickly, we can ensure that our software continues to meet the needs of our users and remains reliable and performant.

Automatic error detection has been in place in several areas for some time now, but the idea of an AI triage agent that can not only detect errors but also suggest fixes and even implement them is new. That is, if we go beyond naive rebooting or scaling capacity on demand.

This category of errors is harder to fix automatically beyond the obvious 'reboot if mem is >90%'. AI models have been applied here widely, well before the current generative AI wave, e.g. to detect anomalies in logs or network traffic.
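As a concrete example of that pre-generative-AI style of detection, a rolling z-score over a metric series (say, error rate per minute) flags points that deviate sharply from recent history. A minimal sketch; the window size and threshold are arbitrary choices, not recommendations:

```python
import statistics

def detect_anomalies(series, window=20, threshold=3.0):
    """Flag indices that deviate more than `threshold` standard
    deviations from the rolling mean of the previous `window` points."""
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mean = statistics.fmean(history)
        stdev = statistics.stdev(history)
        # Skip flat history (stdev 0) to avoid division by zero.
        if stdev > 0 and abs(series[i] - mean) / stdev > threshold:
            flagged.append(i)
    return flagged
```

An anomaly here is only a trigger; deciding whether the response is a reboot, a capacity change, or an agentic fix is the harder part discussed above.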

Maintenance feedback loop

The maintenance engineers who work in your factory, or the ones out in the field taking care of your equipment, are a great source of feedback too. They have a unique perspective on your software project, and typically a lot of experience with the software and its quirks.

In the real IT world, this group is the people running your tier 2 support, checking up on your CI/CD systems, handling your user accounts, and so on, but they are perhaps not directly involved in developing the software. Much of their feedback is hence about small improvements to the software, or to the process, that make their lives easier. For instance, they might report that a certain error is happening more often than it should, that a certain feature is not working as expected, or that a certain process is taking too long.

If we treated requests from these designated experts as first-class citizens to be trusted, a similar agentic feedback-response factory to the one suggested for the end-user feedback loop could handle the maintenance feedback loop as well. Or the maintenance engineers could act as curators of the technical signals and suggest changes to the product or even the software itself; instead of going through the usual process, the AI triage agent could take care of it directly.

Code quality and architecture compliance

There's a feedback loop that sits between the product-focused loops above and the factory-focused loops below: is the code we're shipping actually good, or at least up to our standards? Or is it yet another wildly different implementation that deviates from similar features we already have?

This matters more in the agentic world than it ever did before. When humans write code slowly, architecture drift is gradual and usually caught in review. When agents generate code fast, entire subsystems can diverge from the intended architecture between Monday and Friday. The volume amplifies every quality problem.

The signals here are familiar, and some can be auto-detected to a degree: code smells, duplication, complexity metrics, dependency violations, test coverage gaps. We need to address these effectively, or we risk our app becoming a big ball of mud spaghetti that will be hard even for agents to maintain later.

The first line of defence is to inject these compliance checks after each agentic handoff, from planning all the way to review. What I'm really after here is how to improve these handoff points, their instructions, and the architecture documentation to prevent these issues from happening in the first place.

Architecture compliance isn't just a gate at review time. When agents drift from architectural intent, the fix isn't more review. It's better constraints upstream: clearer conventions, tighter architecture decision records, and explicit boundaries in the agent's context.

Drift Detected: What Gets Refined?

Each deviation points to an upstream artifact that needs updating:

  • Inconsistent patterns (e.g. duplicated logic across modules) → Project instructions: add explicit pattern rules
  • Wrong architecture (e.g. direct DB calls from handlers) → Decision records: document chosen patterns with rationale
  • Style violations (e.g. naming conventions ignored) → Conventions & linters: encode as automated checks
  • Bad generation patterns (e.g. missing error boundaries in agent output) → Agent specs & prompts: constrain what agents generate
  • Weak review coverage (e.g. tests coupling to internals) → Gate criteria: add checks that catch it pre-merge

Fix the source, not the symptom.
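The 'direct DB calls from handlers' deviation is one of the few that is cheap to detect automatically. Here is a minimal sketch of an import-boundary check built on Python's `ast` module; the layer names and the forbidden-module list are hypothetical placeholders for whatever your architecture decision records actually say:

```python
import ast

# Hypothetical layering rule: handler code must not import the
# database layer directly.
FORBIDDEN = {"handlers": {"db", "sqlalchemy"}}

def boundary_violations(module_layer: str, source: str) -> list[str]:
    """Return the names of forbidden modules imported by `source`."""
    banned = FORBIDDEN.get(module_layer, set())
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [(node.module or "").split(".")[0]]
        else:
            continue
        violations.extend(n for n in names if n in banned)
    return violations
```

Run as a gate after each agentic handoff, a check like this catches the drift; feeding the violation back into the decision records and agent constraints is what closes the loop.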

I've listed above some simple examples of the kinds of deviations you're likely to encounter, and the corrective actions to take. The feedback loop here is about improving the project-level instructions, the architecture documentation, and perhaps even the agent recipes, so that the next time we run the factory we get better output.

After each run, or periodically, collect a list of deviations and verify what kind of adjustments your project-level instructions, architecture documentation, or agent recipes need.

The feedback loops need circuit breakers as well. As tempting as a fully self-correcting software factory sounds, the signal-to-noise ratio of this transformation might not end up very high. Essentially we're using AI to improve itself, and that has its limits. For now, collecting periodic performance data, even a 'self-diagnostics' report from the factory, and then reviewing it with the team to decide on improvements is perhaps the best way to go.
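One way to sketch such a circuit breaker: summarize each run's deviations into a report, and cap how many low-risk suggestions get auto-applied, holding everything else for the team review. The field names (`artifact`, `risk`, `suggestion`) and the cap are assumptions, not a defined schema:

```python
from collections import Counter

# Hypothetical circuit breaker: auto-apply at most this many
# low-risk suggestions per run; the rest wait for humans.
AUTO_APPLY_CAP = 3

def diagnostics_report(deviations: list[dict]) -> dict:
    """Summarise per-run deviations and split suggested fixes into
    auto-applicable ones and ones held for team review."""
    by_artifact = Counter(d["artifact"] for d in deviations)
    auto, held = [], []
    for d in deviations:
        if d.get("risk") == "low" and len(auto) < AUTO_APPLY_CAP:
            auto.append(d["suggestion"])
        else:
            held.append(d["suggestion"])
    return {"counts": dict(by_artifact),
            "auto_apply": auto,
            "hold_for_review": held}
```

The point is that the breaker is structural: even a perfectly confident factory can only change so much of itself before a person looks at the report.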

Developer and agentic feedback loops

Then there's the feedback loop that is perhaps the most important one for the software factory itself: feedback from the developers who use the factory to build their software. (Yes, people will be involved in this for some time, at least to some extent.)

Think of issues like wandering agents taking hours to do simple things, too much fixing and iterating, or unpredictable output and behavior. These are not easy problems to fix today. I have made some very limited attempts at this kind of feedback loop, albeit human-triggered: improving the agents and their context material, such as the documentation, after each run so they behave better the next time.

I've built a naive, manual loop around two example tasks. I collected all the logs, including task calls, spawns, and user adjustments, and used them as input for my best-practices agent to suggest improvements to agent recipes, tool usage, wandering, separation of concerns, and so on. In other words, I used AI to improve AI, and after two rounds of corrections our tool chain performed much better.

Factory Improvement Loops

Manual / batch improvement:
  User story → run agents → collect logs → analyse → suggestions → apply

Per-agent self-correction:
  For each agent (Planner, Engineer, Reviewer): task → analyse → improve

This loop was run manually with prepared examples that represented the typical tasks we had.
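The collect-and-analyse half of that manual loop might look roughly like this. The per-run JSON log layout is an assumption, and the best-practices agent is passed in as a plain callable because the real call depends entirely on which harness and model you use:

```python
import json
from pathlib import Path

def collect_run_logs(log_dir: str) -> list[dict]:
    """Gather tool calls, agent spawns, and user corrections from
    per-run JSON log files (file layout is an assumption)."""
    events = []
    for path in sorted(Path(log_dir).glob("*.json")):
        events.extend(json.loads(path.read_text()))
    return events

def suggest_improvements(events: list[dict], reviewer) -> list[str]:
    """Filter for notable events and hand them to a reviewer agent,
    injected as a plain callable so any model can be plugged in."""
    notable = [e for e in events
               if e.get("type") in {"user_correction", "retry", "spawn"}]
    return reviewer(notable)
```

In my runs, the `reviewer` was the best-practices agent; the suggestions it returned were then applied by hand to the agent recipes before the next round.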

Similar feedback collection could also be triggered by the factory itself. For instance, if an agent takes too long on a simple task, or produces output that is not useful, the factory could automatically trigger a feedback loop to improve that agent's performance: analyzing the agent's behavior and suggesting changes to its prompts, context material, or even its architecture.

Or the developer might tell an agent, 'this tool call keeps failing, why?' Or, 'why do my custom instructions or document keep being ignored?'

Factory Self-Improvement Loop

Agents that fix the agents: maintaining the machinery, not the product.

  Triggers: metrics threshold, post-run, developer report
  1. Diagnose (Diagnostics Agent)
  2. Plan fix (Maintenance Agent)
  3. Test (replay and compare)
  4. Deploy (update the factory: prompts & specs, conventions, tool configs, templates, docs)

We are in the early stages of self-diagnosis and healing, but this could be taken to the next level, as illustrated above. Even now it is possible to gather statistics about your runs: duration, token usage, number of tool calls, and so on. Define boundaries for these, add a 'Diagnostics Agent' to watch the agent workflows and suggest improvements and optimizations, and apply them automatically.

Depending on the agent harness you use (Codex, Claude Code, Open Code, Copilot), the detailed agent logs are an absolute goldmine and a deep dive into the agentic mind. The Copilot logs in particular are very detailed and rather educational, especially if you wonder where all that context window went and what was really in it. It might be a good idea to store them periodically to catch drift, token usage, runtime, and retry counts, to keep your machine working smoothly (or as smoothly as possible).
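Periodic statistics gathering can be sketched as follows: record the per-run metrics mentioned above and flag any that drift well past the historical median. The metric names and the 1.5x factor are illustrative assumptions, not values from any particular harness:

```python
from dataclasses import dataclass
import statistics

@dataclass
class RunStats:
    duration_s: float
    tokens: int
    tool_calls: int
    retries: int

def drift_alerts(history: list[RunStats], latest: RunStats,
                 factor: float = 1.5) -> list[str]:
    """Flag metrics where the latest run exceeds the historical
    median by `factor`: a cheap way to spot creeping degradation."""
    alerts = []
    for metric in ("duration_s", "tokens", "tool_calls", "retries"):
        baseline = statistics.median(getattr(r, metric) for r in history)
        if baseline and getattr(latest, metric) > factor * baseline:
            alerts.append(metric)
    return alerts
```

An alert here is exactly the 'metrics threshold' trigger from the self-improvement loop above: it hands the flagged run to the diagnostics step rather than fixing anything itself.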

Feature feedback loop

Finally, let's discuss how to assess the quality of the specifications we feed into the factory. Were they clear enough? Did they correctly capture your intent, and all the edge cases? Did they have the right level of detail? Did they include the right context material?

There are a few questions we could ask to develop this further.

  • Did we miss any features? For instance, we asked for a specific navigation path, and it's not there.
  • Did our implementation do something close but not exactly what we wanted?
  • Did we get unwanted features? A menu not needed, an entire page not requested, a new command-line switch for your tool that you didn't ask for?
  • Were too many E2E tests generated? What was the 'Farley score' of the tests?
  • Were there lots of user-reported bugs during the testing?
  • Were the architecture, patterns and other conventions followed?

So roughly speaking there are two different kinds of mistakes:

  1. Ignoring recurring patterns and practices, i.e. deviating from project-level instructions
  2. Deviating from the specific instructions for the task you're trying to get done

The two probably lead to different corrective actions. The first is more about improving the project-level instructions, the architecture documentation, and perhaps the agent recipes, so the factory produces better output next time. The second is more about improving the task-specific instructions, and perhaps also the review process, to catch these mistakes before they reach production. To conclude, I've thrown in a couple of ideas worth a shot if (when) you encounter these kinds of defects in your factory.

Fixing deviations from project standards and practices: semi-manual feedback loop

Creating a feedback loop for the first kind is perhaps more straightforward. You can have an agent that continuously monitors the codebase for patterns and practices. That's already baked into the flow, but as it goes with gen AI, they can still be missed or violated. The key is to detect why they were violated, and to suggest improvements to the project-level instructions or the architecture documentation.

A naive solution for this kind of deviation could be something like:

Example
You: "Hey factory, I noticed that the engineering agent keeps generating code with deeply nested conditionals, which violates our coding standards. Can you analyze why this is happening and suggest improvements to the agent's constraints or the architecture documentation to prevent this in the future?"
Factory: "Sure! I've analyzed the recent code generations and found that the agent is defaulting to a procedural style when it encounters complex logic. This is likely because our current architecture documentation doesn't provide clear guidance on handling such scenarios. I suggest we update our architecture decision records to include specific conventions for managing complex logic, such as using design patterns like Strategy or State. Additionally, we can enhance the agent's constraints to prioritize these patterns when generating code with complex logic. I'll go ahead and implement these changes to improve the agent's performance in this area."
You: "Ok, propose the changes, and let's review them before you implement them."
Factory: "Here are the proposed changes to the architecture documentation and the agent's constraints. [Provides a summary of the changes]. Please review and let me know if you have any feedback or if you'd like me to proceed with implementing these changes."
You: "The proposed changes look good. Proceed."

For now, I think this particular 'drift' should be solved case-by-case as there's a far bigger risk of breaking things by giving agents too much power on the general topics.

Fixing deviations from task-specific instructions

The second class of mistakes is related to the specific task at hand.

By nature (depending on the scale of your mistake, of course) these have a smaller blast radius than altering common practices. It's only one feature going bad, right? Well, that depends: you might end up bricking some common service, template, or script and cause havoc by breaking 100 E2E tests.

A common mistake is saying one thing at the beginning of your specification ('this box should be openable from the top') and the opposite at the end ('the box should not be openable at all'). The longer your specification gets, the harder this kind of inconsistency becomes to detect, even for AI.

So we're talking about errors in the specification itself. Back in the old days, a reviewer might have picked this up. Or not.

Fixing this class of errors should be central to your software factory, as you don't want to produce invalid products. So, instead of relying on ralph-looping your way through, invest in a good, standard format for your specifications (something easily understandable by both humans and machines), and put a good process in place to review and improve them based on the feedback you get from the factory.
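As a sketch of what a 'standard format' might look like, here is a minimal machine-checkable spec shape with a lint that catches only the crudest inconsistency: the same statement appearing in both the required and forbidden lists. The field names are illustrative, not any established standard:

```python
from dataclasses import dataclass, field

@dataclass
class Spec:
    feature: str
    must: list[str] = field(default_factory=list)        # required behaviour
    must_not: list[str] = field(default_factory=list)    # forbidden behaviour
    out_of_scope: list[str] = field(default_factory=list)

def lint_spec(spec: Spec) -> list[str]:
    """Flag statements listed under both 'must' and 'must_not',
    the 'openable from the top / not openable' class of mistake."""
    clashes = ({s.lower() for s in spec.must}
               & {s.lower() for s in spec.must_not})
    return [f"contradiction: '{c}'" for c in sorted(clashes)]
```

Real contradictions are rarely this literal, so in practice such a lint would be one cheap gate in front of a deeper (human or AI) consistency review, not a replacement for it.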

Becoming a good 'AI whisperer' takes time and practice. This is why 'prompt engineering' never really worked as a discipline: a small difference in the prompt, or a different or updated model, might give a different answer the next time you run it.