The Compound Probability Problem
Why AI coding tools are fundamentally probabilistic
Inner workings of LLMs and the probabilistic nature of the technology
An LLM produces the most likely next token given its context, building your output token by token. The software it produces, on the other hand, must be deterministic: the same input must produce the same output, every time.
Below is a simplified illustration of the problem. It doesn't aim to capture the full complexity of the issue, only to give a visual representation of the core tension. In the end, after feeding in our prompt and all the instructions, the model produces a probability distribution over all possible outputs. The more specific and constrained the prompt and context, the narrower the distribution becomes, and the higher the probability of getting what you wanted. Or at least that's the goal.
The 'thinking' also happens on a higher layer, in a way, as internal back-and-forths of the model before the output is produced. As those who have used modern LLM-based software engineering tools know, there's much more in play than I included in the image, but the fundamental mechanism is the same: when spitting out the next instruction for your code, the model picks it from a list of candidates, and the more you narrow down that list, the better your chances.
One approach to making models better has been to increase model size and training data, and up to this point, this has indeed improved output quality. The size of the input, i.e. the context, has also increased dramatically, so in theory you can solve more complex problems in a single run; having context in the range of 100k tokens is a big deal and a necessity for the more ambitious, 'smart' moves. There are signs, however, that this approach is nearing its limits, and that we might need to start thinking about alternatives. Throwing more GPU/TPU cycles at the problem, 'ralph looping', or 'making parallel runs' is probably not the best way to go about it either.
Finally, there's the training data problem. "The internet" has already been ingested, and the models have been trained on it. What happens to quality when the training material becomes more and more generated by the models themselves? From a signal processing perspective, I'd expect it to degrade over time.
Keeping up with the research side of LLMs would be a full-time job. Knowing the basics of how the technology works is still important, though, and it's at the core of my engineer's mindset: I could not imagine writing programs without knowing the basics of how a computer operates: CPU, memory, storage, and so forth. For agentic/GenAI software development, the corresponding building blocks are the transformer architecture, the attention mechanism, and the training process: how your prompt (plus context) is transformed into an output.
You'll understand how this all works (to a certain extent at least) just by looking at the logs. The IDEs and assistants go to great lengths not to show them to you, but they're there for you to study and learn from. What's the FULL context? What tools were actually called, and with what parameters? How many tokens were used, and how long did it take? What was the exact output? These are the things you need to understand to work with the technology in a more informed way.
TL;DR: In GitHub Copilot, it's the "Chat Debug View"; in Claude Code, it's the per-session JSONL files, and there are plugins around to view them nicely.
So when you give AI a precise specification, carefully curated context, and well-defined constraints, the odds of one-shotting your task are often pretty good. Give it a vague prompt and a million-line codebase with no documentation, and you get what probability gives you: plausible-looking code that may be subtly wrong in ways that take longer to debug than manual writing would have.
In summary, the AI frameworks and processes built around these models need constant tuning, just as agile development methods have always called for feedback loops. Agents wander? Check why. Instructions being misread or just ignored? Check the wording and ordering; make sure there's no overlap or conflicting text. And so on. The inherently probabilistic nature of AI isn't a bug that will be fixed in the next model release or with more GPU power, and you shouldn't think of agents as ordinary computer programs invoking functions and sending messages. It's the fundamental nature of the technology.
The compound probability problem
Let's do a simple (and naive) calculation of what happens when you chain LLMs. When LLMs effectively produce each other's inputs, as is the case in agentic usage, the odds of success multiply downward, at least when the agents cannot validate their inputs and outputs and reject the bad ones. I'll refer to this as the compound probability problem. The more steps you have, the more chances for things to go wrong, and the less likely you are to get what you wanted.
For instance, let's consider a multi-agent pipeline with several steps: Backlog → Planning → Engineering → Testing → Review → PR. Each stage involves an AI agent making probabilistic decisions. Even if each stage is 95% accurate, the compound probability across four agent stages is only 0.95^4 ≈ 81.5%:
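The compounding above can be sketched in a few lines. The 95% figure and the four-stage count come from the text; the independence assumption is the same simplification discussed later.

```python
# Illustrative only: success probability of a serial agent pipeline,
# assuming independent stages (a simplification acknowledged in the text).
def pipeline_success(per_stage_accuracy: float, stages: int) -> float:
    """Probability that every stage in the chain succeeds."""
    return per_stage_accuracy ** stages

print(f"{pipeline_success(0.95, 4):.1%}")  # 0.95^4 ≈ 81.5%
```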
If you've ever worked with reliability engineering, this is the familiar series-reliability math behind metrics like Mean Time Between Failures (MTBF). Chain components in series, and each one that can fail drags the overall reliability down. Exponential decay, as the engineers call it.
To make matters worse, a single 'step' an agent performs contains several probabilistic decisions of its own. Of course that didn't affect my 'master formula' above, but it illustrates the non-constant nature of per-step accuracy. These intra-agent decisions include:
- Tool selection
- Which MCP server to call, which function to invoke, what parameters to pass. Each is a probabilistic choice. Wrong tool → wrong data → wrong code.
- Document retrieval
- RAG (Retrieval-Augmented Generation) is similarity-based (kNN or similar). The agent may retrieve irrelevant docs, miss critical ones, or misinterpret what it reads. Stale docs poison the context.
- Result interpretation
- Even with the right tool and right docs, the agent must interpret results. A subtle misread of an API response or schema doc cascades into generated code.
- Agent-to-agent handoff
- Structured artifacts help, but the receiving agent still interprets them probabilistically. Any nuance lost in handoff accumulates across the chain.
So, if each agent stage involves six internal probabilistic decisions, a four-stage pipeline has 24 effective decision points:
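The same math at the decision level, using the text's numbers (six decisions per stage, four stages, 95% per decision):

```python
# Sketch: compounding at the decision level rather than the stage level.
per_decision = 0.95
decisions_per_stage = 6
stages = 4

total_decisions = decisions_per_stage * stages  # 24 effective decision points
print(f"{per_decision ** total_decisions:.1%}")  # 0.95^24 ≈ 29.2%
```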
These numbers are illustrative, not exact. The math assumes each decision is independent, which is a simplification. In practice, decisions are correlated: an early architectural mistake biases every subsequent step, and conversely, strong context can lift accuracy across the chain. But the basic point stands: the longer and more uncontrolled your pipeline, the worse it gets. So what can we do about it?
- Input Quality
- The more you can narrow down the space of plausible outputs, the higher the probability of getting what you wanted. Clear specs, curated context, and well-defined constraints are crucial.
- Checkpoints/validation
- Catch errors between steps before they propagate (like error detection codes)
- Human-in-the-loop
- Equivalent to a manual inspection step
- Input size
- Smaller tasks → fewer decisions → higher per-step accuracy
- Shorter chains
- Fewer steps = higher reliability (minimize series components)
- Redundancy
- Run the same call multiple times, vote/compare (like N-modular redundancy)
- Self-correction loops
- Retry with feedback, like automatic failover
All of these will be discussed in this book. Let's first address the most important one: making sure your input is correct.
Input quality
It's probably rather obvious that unless you are clear on what you want, you won't get it. But the probabilistic nature of the technology makes this even more critical. Things like wording, ordering, and formatting of the prompt and context can have a huge impact on the output. The more you can narrow down the space of plausible outputs, the higher the probability of getting what you wanted.
Roughly speaking, you can expect the following accuracy per decision based on the quality of your prompt and context:
- Vague prompts: 70–80% accuracy per decision (huge search space)
- Good specifications: 90–95% accuracy per decision (constrained space)
- Excellent specifications + clear context: 95–99% accuracy per decision (narrow solution corridor)
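To see why these tiers matter so much, compound the midpoint of each tier over the 24-decision pipeline from the previous section. The tier percentages are the text's rough estimates; the midpoints are my picks for illustration.

```python
# Hedged illustration: how per-decision accuracy compounds over 24 decisions.
# Tier ranges come from the text; the chosen midpoints are illustrative.
DECISIONS = 24
tiers = {
    "vague prompt (75%)": 0.75,
    "good specification (92.5%)": 0.925,
    "excellent spec + context (97%)": 0.97,
}
for name, p in tiers.items():
    print(f"{name:32s} -> {p ** DECISIONS:6.1%} end-to-end")
```

Roughly 0.1%, 15%, and 48% end-to-end: the difference between tiers is not incremental, it's the difference between hopeless and workable.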
Fear not: I'm not suggesting "big upfront waterfall design" here. This is decomposition for probability management. Breaking a large, ambiguous problem into small, specific tasks isn't waterfall; it's more like writing a detailed todo list before you start coding.
A common straw-man argument from AI skeptics is that SDD is a return to the stone age, where everything had to be written down up front. The key misunderstanding is the word "everything." Of course you need a solid roadmap, the necessary architecture, and guidelines specified, but you can still work iteratively at the feature level and create feedback loops.
Think of it like this: in the old days you had a vague user story and some Figma designs, often written by someone else ages ago. You took a quick look, figured you'd just start coding and see if you got it right, and refined in the next sprint.
You can still do this, but you need to shift the discovery phase outside the coding, or major parts of it. What pages or controls are needed and for what exactly? What messages, tables, and APIs will you send, read, or call? How do you verify that the thing actually did what it was supposed to do?
Checkpoints and validation, human in the loop
Having gates between your agents is, as illustrated above, a good way to keep your premium tokens from going to waste. Putting gates between your steps doesn't just enforce process; it breaks the probability chain. A human checkpoint resets error accumulation.
So, instead of p^24 (probability compounding across 24 decisions), you get gates resetting at checkpoints:
| Scenario | Formula | Result |
|---|---|---|
| No gates, 24 decisions at 95% | 0.95^24 | 29.2% |
| 4 gates (6 decisions each at 95%) | 0.95^6 per gate × 4, with corrections | ~75% (plus human catch at each gate) |
| Meaningful gates with review | 0.98^6 per gate × 4, with rework loops | ~90%+ effective |
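The gate model can be sketched as follows. The 95%/6-decision figures come from the table above; the human "catch rate" at each gate is an assumed parameter I've added for illustration, not a measured value.

```python
# Sketch of the gate-reset model: a gate catches some fraction of the
# failures accumulated in its segment before they propagate onward.
# The catch_rate parameter is an illustrative assumption.
def gated_success(p_decision: float, decisions: int, gates: int,
                  catch_rate: float) -> float:
    segment = p_decision ** decisions                  # one gate's worth of decisions
    corrected = segment + (1 - segment) * catch_rate   # gate catches & fixes failures
    return corrected ** gates

print(f"no gates:   {0.95 ** 24:.1%}")                 # ≈ 29.2%
print(f"with gates: {gated_success(0.95, 6, 4, 0.8):.1%}")
```

With a catch rate of zero the model collapses back to plain 0.95^24; the better the gate, the closer each segment gets to "reset to clean".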
The key is to catch drift at task boundaries. Think of the kind of sanity check a manual UAT tester would perform to tell whether your beloved piece of software actually sucks at something; now we introduce that kind of check much earlier in the chain, shifting left. How feasible it is to keep a human in the loop depends on the context, but if we rethink the role of this manual gating, it just might work.
For example, instead of manually code-reviewing every PR, you could let that nice-talking agent run static analysis, conformance, security, and other checks, and review the results yourself only when necessary, granted you've specified quantitative or boolean criteria for your quality-controller agent as triggers for manual intervention.
By applying checkpoints and validation, with optional manual review, we intervene early, evaluate what went wrong, and try again with, hopefully, increased odds of nailing it.
This is what the DevOps 'shift left' and 'continuous feedback' principles are all about. It's just that now we have to apply them to the AI-driven stages as well.
Input size
This one's straightforward but easy to ignore: step size management. The larger the task or step you ask an agent to do, the more decisions it has to make, and the more chances for things to go wrong. By breaking down your work into smaller, more specific tasks, you can increase the per-step accuracy and mitigate the compound probability problem.
This principle applies to both architecture and features. The smaller the step size for a feature, the higher the probability of success. The more modular the architecture, the more constrained the solution space, and the higher the probability of getting what you wanted.
My Lovecraftian roguelike has about 200 different features. Each was specified, generated, and tested as a bounded task: a new character class, a map event, an inventory mechanic, all bound to wider themes and phased across the game. I think I'm at Stage 33 of Milestone 3 right now.
I didn't call it "step size management," but that's what it was. Keeping each generation request small enough that compound probability never had room to multiply.
This is not a new idea, but it's more important than ever in the context of AI-driven development. Working like this was a good idea back when developers still did the majority of the heavy lifting: think before you start throwing GPU cycles at the problem.
We learned this on a frontend project with complex screens. The planning stage had already decomposed features into small, atomic tasks, the kind of bounded work that should have been straightforward for the engineering agent. But the execution harness was handing the agent an entire feature's worth of tasks in a single session. So the agent tried to build everything in one context window: layout, state management, API integration, validation, edge cases, all at once.
What happened? The agent's context filled up, quality degraded toward the end of the session, and retries started piling up. The fix was almost too simple to admit: execute one atomic task per session, each starting from a fresh context. Same tasks, same plans, same specifications. Just fed to the agent one at a time instead of all at once.
Retries decreased significantly. Not because the tasks changed, but because the step size as experienced by the agent changed. We had planned small. We just weren't executing small.
Having small tasks in your plan is necessary but not sufficient. The execution model matters just as much: if your agent consumes all the tasks in a single context window, you've effectively created one large task with all the compound probability that implies. Plan small, execute small, reset context between steps.
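The execution fix described above can be sketched as a harness loop. Everything here is a hypothetical stand-in: `run_agent` represents whatever your tooling uses to invoke an agent, and the task list is illustrative.

```python
# Hypothetical harness sketch: one atomic task per session, with a fresh
# context rebuilt for each task instead of one ever-growing context window.
# `run_agent` is a stand-in for your real agent invocation, not an actual API.
def run_agent(task: str, context: dict) -> str:
    # ... invoke the agent here with only this task's context ...
    return f"completed: {task}"

tasks = ["layout", "state management", "API integration", "validation"]

results = []
for task in tasks:
    fresh_context = {"spec": task}   # rebuilt per task, never carried over
    results.append(run_agent(task, fresh_context))
```

The anti-pattern is handing `tasks` to a single session in one go: same plan, same specifications, but effectively one large task with all the compound probability that implies.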
Shorter chains
'Keep it simple' has always been one of the core engineering principles. Or, as Occam put it, "Entia non sunt multiplicanda praeter necessitatem": entities should not be multiplied beyond necessity.
The same applies to the compound probability problem. The more steps you have, the more chances for things to go wrong, and the less likely you are to get what you wanted, unless you have good governance, checkpoints, and circuit breakers. Narrow the tool scope, control access to the codebase, and so on. Start with a minimal set like the standard 'Plan' and 'Do', and see if adding more agents and stages is really necessary for your use case.
Redundancy
Dual power supplies and multiple hard drives are common practice for increasing server reliability. For agents it's less obvious, but you can run the same call multiple times and compare the outputs. Better yet, use a different model for each run.
It's basically a voting system: run the same thing three times, compare, and pick the winner. Same idea as N-modular redundancy in hardware, where multiple components do the same job and a vote decides the correct output. If the spread of answers is too wide, you might resort to having a human (in case they're not out to lunch or asleep already) make the final call.
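The payoff of the three-run vote can be computed directly. This is a sketch of the standard triple-modular-redundancy math; the per-run accuracy values are illustrative, and it assumes runs fail independently and agree when correct, which real agent runs only approximate.

```python
# Triple-modular-redundancy sketch: three independent runs, 2-of-3 majority vote.
# Assumes independent failures and agreement among correct runs (simplifications).
def majority_of_three(p: float) -> float:
    """Probability the majority vote is correct given per-run accuracy p."""
    return p**3 + 3 * p**2 * (1 - p)   # all three right, or exactly two right

for p in (0.80, 0.90, 0.95):
    print(f"single run {p:.0%} -> voted {majority_of_three(p):.1%}")
```

A 90%-accurate run becomes roughly 97% after voting: you pay triple the tokens for the lift, which is why this tends to be reserved for the steps that matter most.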
Self-correction loops
I'm going to discuss this topic in more detail later in the book, but the idea is probably clear: agents should learn from their mistakes, adjust the assumptions (tools, context, constraints) and try again. Most likely at first you'll be figuring out the necessary corrective actions manually, but eventually you can automate some of this process.
For instance, if an agent fails a test, it could automatically analyze the failure, identify the root cause (e.g., wrong API call, missing doc, misunderstood requirement), adjust its prompt or context accordingly, and retry the generation. On a higher level, this could mean figuring out why the damned thing created a new page instead of editing the existing one, i.e. "why did you do this?"
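The value of retrying shows up in the math too. A minimal sketch, assuming each attempt fails independently with the same probability, which is optimistic since real retries only help when the failure is actually detectable (e.g., by a failing test):

```python
# Sketch: probability that at least one of k attempts succeeds,
# assuming independent attempts with constant per-attempt accuracy.
def with_retries(p: float, attempts: int) -> float:
    return 1 - (1 - p) ** attempts

print(f"{with_retries(0.70, 1):.1%}")  # 70.0%
print(f"{with_retries(0.70, 3):.1%}")  # 1 - 0.3^3 = 97.3%
```

Two retries lift a 70% step above 97%, but only when a validation step can reliably reject the failed attempts in the first place.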
Having this kind of self-healing learning capability to increase the reliability of your agentic system is a powerful way to mitigate the compound probability problem. But make no mistake, we're not quite there yet and much of the job description of the new engineering I'm going to suggest in this book would be about this part: how to make the agents learn from their mistakes and improve over time.
What remains unsolved
Even with excellent governance and all the mitigations described above, the compound probability problem is never going to be fully solved.
The fundamental probabilistic nature of the technology means that there will always be a non-zero chance of failure at each step.
So, looking into the crystal ball, here's what to expect on the road to the promised land of AI-assisted development:
- Model reliability continues to improve with better hardware and training, but remains probabilistic
- The failure rate per step across a larger agentic system is uncharted territory and probably highly context-dependent
- One-shot generation (entire systems from scratch) remains high-risk, expensive, and slow
- Drift in base model behavior between releases can break carefully tuned pipelines
- AI-driven development remains genuinely hard to stabilize
In the meantime, the best we can do is to understand the nature of the problem, apply the mitigation strategies, and design our systems with the compound probability problem in mind.
A hobby game with no users can afford a few probabilistic misfires; you just regenerate and move on. Production systems can't: regenerating the whole damned thing every few weeks is not an option. But the principle transfers directly: constrain the problem space per step, and the math works in your favor whether you're building a dungeon crawler or a payment system.