3

Why This Is Harder Than It Looks

What the Evidence Actually Shows


The narrative around AI coding tools is relentlessly optimistic. Wild claims, such as one-shotting a C compiler (at a $20,000 compute cost), emerge almost daily, but they rarely withstand closer scrutiny. Much of it is tech bro hype, and the sources are not exactly academic.

I think overselling the AI wave is counterproductive: it breeds skepticism, often entirely warranted, that hinders the effective rollout of these capabilities. The ground truth is more complicated.

Feeling productive and being productive are different things. The evidence shows a consistent gap between what developers perceive and what the data measures.

The perception gap

The METR study was a randomized controlled trial tracking 16 experienced open-source developers across 246 real-world tasks. These were not AI novices. Participants were selected for AI tool experience and used their own preferred setups (primarily Cursor Pro with Claude 3.5/3.7 Sonnet). The finding: developers using AI tools took 19% longer to complete their tasks than when working without AI. Before the study, they predicted AI would make them 24% faster. After experiencing the slowdown, they still believed AI had sped them up by 20%.

Sixteen developers is a small sample, and the specific 19% figure could be noise. But the perception gap, a 39-percentage-point chasm between what developers believe and what the clock shows, is large enough to be directionally meaningful. And it doesn't stand alone.

Google's 2024 DORA report, surveying over 39,000 professionals, found that every 25% increase in AI adoption corresponded to a 1.5% dip in delivery throughput and a 7.2% increase in delivery instability. You guessed it: most participants nevertheless reported feeling more productive.

What DORA actually measures

Worth pausing on what DORA is, because it's probably the most widely accepted baseline for measuring software delivery performance, and yet it is not often used in practice. I haven't personally measured or seen these metrics in the wild, though I've heard DevOps advocates preach about them in the past. That doesn't mean they aren't real or valuable metrics, and I think the AI software factory community, of which I've been a part for a while, should develop them further.

The research program, originating from the Accelerate book by Forsgren, Humble, and Kim, defined four key metrics:

Metric                     What It Measures
Deployment frequency       How often code reaches production
Lead time for changes      Time from commit to production
Time to restore service    How quickly you recover from failures
Change failure rate        What percentage of deployments cause incidents

These four metrics consistently distinguish high-performing teams from low-performing ones, regardless of technology stack or industry. They've been validated across over a decade of State of DevOps surveys.
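To make the four metrics concrete, here is a minimal sketch of how they could be computed from deployment records. The data, field layout, and two-week window are all hypothetical; a real implementation would pull these events from CI/CD and incident-tracking systems.

```python
from datetime import datetime

# Hypothetical deployment records for a two-week window:
# (commit_time, deploy_time, caused_incident, hours_to_restore)
deploys = [
    (datetime(2025, 1, 1, 9), datetime(2025, 1, 2, 9), False, 0.0),
    (datetime(2025, 1, 3, 9), datetime(2025, 1, 3, 21), True, 2.0),
    (datetime(2025, 1, 5, 9), datetime(2025, 1, 6, 9), False, 0.0),
    (datetime(2025, 1, 8, 9), datetime(2025, 1, 9, 9), True, 4.0),
]
window_days = 14

# Deployment frequency: deploys per day over the observation window
deployment_frequency = len(deploys) / window_days

# Lead time for changes: median hours from commit to production
lead_times = sorted((d - c).total_seconds() / 3600 for c, d, _i, _r in deploys)
median_lead_time = lead_times[len(lead_times) // 2]  # upper median for even n

# Change failure rate: share of deployments that caused an incident
change_failure_rate = sum(incident for _c, _d, incident, _r in deploys) / len(deploys)

# Time to restore service: mean hours to recover, over failed deploys only
restore_times = [r for _c, _d, incident, r in deploys if incident]
mean_time_to_restore = sum(restore_times) / len(restore_times)
```

Nothing here is sophisticated, and that is the point: the four numbers fall out of event timestamps that most delivery pipelines already record.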

What makes the 2024 DORA findings on AI adoption so interesting is that they use this same established framework. The report doesn't measure "lines produced" or "developer satisfaction"; it measures the delivery outcomes that the industry has broadly agreed matter. And against those metrics, increased AI adoption correlated with worse outcomes, even as teams reported feeling more productive.

The 2025 State of AI-Assisted Software Development report from Google DORA tells a more nuanced sequel. Surveying roughly 5,000 professionals, it found that 90% now use AI at work, up 14 percentage points from the year before. Here's where it gets interesting: the throughput penalty from 2024 has flipped. AI adoption now correlates with higher delivery throughput, higher code quality, and higher individual effectiveness. But, and this is a significant but, delivery instability still goes up, friction stays flat, and burnout stays flat.

The report's framing is that AI is an amplifier: it magnifies the strengths of well-run teams and the dysfunctions of struggling ones. Organizations with strong foundational practices, things like clear AI policies, healthy data ecosystems, and quality internal platforms, see the benefits multiply. Those without them just get more of whatever problems they already had. That's basically the thesis of this book in a single research finding.

So the tools are being adopted at speed, the productivity feeling is real, but the delivery metrics tell a different story. DORA gives us the vocabulary to have that conversation honestly.

The screen recording data from METR reveals where the time goes: developers spent roughly 9% of total task time just reviewing and cleaning up AI-generated code. They accepted fewer than 44% of suggestions, and even when they accepted code, 56% reported making major changes afterward. The productivity gains from generation are consumed by the overhead of validation, and the more experienced the developer, the worse the tradeoff, because experienced developers were already fast.
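The time accounting can be made concrete with a back-of-envelope calculation. The 100-minute baseline below is an invented number purely for illustration; the percentages come from the METR figures cited above.

```python
# Back-of-envelope using the METR figures; the 100-minute baseline is invented.
baseline_minutes = 100.0
measured_slowdown = 0.19   # METR: AI-assisted tasks took 19% longer
review_share = 0.09        # METR: ~9% of total task time reviewing/cleaning AI output

ai_minutes = baseline_minutes * (1 + measured_slowdown)  # total time with AI
review_minutes = ai_minutes * review_share               # time lost to review alone

# The perception gap: believed +20% speedup vs. measured -19% slowdown
perceived_speedup = 0.20
gap_pp = round((perceived_speedup + measured_slowdown) * 100)  # percentage points
```

On these numbers, a task that would take 100 minutes unassisted takes about 119 minutes with AI, more than 10 of which go to reviewing generated code, while the developer walks away believing they were 20% faster.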

These studies all have limitations, and they disagree on specifics. METR is rigorous (randomized, screen-recorded, real-world tasks) but tiny. DORA has massive sample sizes but relies on self-report. Stack Overflow captures sentiment, not delivery outcomes. No single study settles the question. What matters is the convergent pattern across all of them: developers consistently feel faster than the delivery metrics suggest they are. When someone cites "10x productivity," ask what the methodology was, and whether it measured output or feeling.

Trust erosion

Stack Overflow's 2025 survey shows only 29% of developers trust AI tool outputs, down from 40% a year earlier. 66% report spending more time fixing "almost-right" AI code than they save, and 75% still prefer asking another person when unsure. 46% of developers actively distrust AI-generated output, yet usage keeps climbing.

The tools feel good to use. They reduce cognitive load on individual keystrokes. But the aggregate effect on delivery is not what the marketing suggests.

Code quality and technical debt

GitClear's analysis of 211 million lines of code found "code churn" (code rewritten or deleted within two weeks) has doubled since 2021. Code duplication linked to AI is up 4x (8x between 2020 and 2024). Copy/paste is now more common than code reuse.

These metrics deserve some nuance though. Code reuse is not always the virtue we like to think it is, and AI-generated code tends to be structurally different from what humans write. It's simpler, more verbose, heavier on comments, and less likely to reach for abstract OO patterns or clever indirection. That means more lines and more apparent duplication, but not necessarily worse code. High churn could mean fragile output, or it could mean teams are now willing to rewrite and refactor at a scale that was previously too expensive to attempt. The jury is still out on what these shifts mean long-term.
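For readers who want to reason about churn on their own repositories, here is a minimal sketch of the metric as GitClear defines it: code rewritten or deleted within two weeks of being added. The line-history records are hypothetical; a real analysis would derive them from git log and git blame history.

```python
from datetime import datetime, timedelta

# Hypothetical line-history records derived from version control:
# (line_id, added_at, rewritten_or_deleted_at; None if the line is still present)
line_history = [
    ("a", datetime(2025, 3, 1), datetime(2025, 3, 8)),   # reworked within a week
    ("b", datetime(2025, 3, 1), None),                   # still in place
    ("c", datetime(2025, 3, 2), datetime(2025, 4, 20)),  # changed much later
    ("d", datetime(2025, 3, 5), datetime(2025, 3, 12)),  # reworked within a week
]

CHURN_WINDOW = timedelta(days=14)  # GitClear's two-week rework window

churned = sum(
    1 for _id, added, gone in line_history
    if gone is not None and gone - added <= CHURN_WINDOW
)
churn_rate = churned / len(line_history)
```

Note that the metric itself is neutral: a high churn rate here could indicate fragile AI output or healthy, newly affordable refactoring, which is exactly the ambiguity discussed above.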

What is harder to dismiss is the issue rate. CodeRabbit's analysis of 470 pull requests found that AI-generated code creates 1.7x more issues than code people write in open-source PRs. Fu et al. (2025) analyzed 733 real Copilot-generated snippets from GitHub projects and found security weaknesses in 29.5% of Python and 24.2% of JavaScript code, spanning 43 CWE categories including code injection and cross-site scripting.

The code quality evidence is the most concerning finding in this chapter. Speed gains mean nothing if you're accumulating technical debt faster than you can pay it down. This is why governance matters. It's not about slowing down, it's about not creating problems faster than you solve them.

The organizational reality

The vibe coding movement — Andrej Karpathy's term for "letting AI write your code while you embrace the vibes and forget that the code even exists" — has produced a flood of prototypes, MVPs, and demos. It has also produced what multiple analysts now call a coming technical debt tsunami.

Inconsistent coding patterns emerge because AI generates solutions based on different prompts without a unified architectural vision. Documentation becomes sparse because the focus shifts to prompt engineering. Security vulnerabilities appear because models lack awareness of security implications.

Stack Overflow captures the team dimension precisely: AI's most recognized impact is on personal efficiency. Only 17% of agent users report improved team collaboration, the lowest-rated impact by a wide margin. AI helps individuals, but it does far less to improve how teams deliver together.

Adoption and economics

  • The AI code tools market reached $7.7 billion in 2025
  • GitHub Copilot writes up to 46% of the code in files where it is enabled, but only ~30% of its suggestions are accepted by developers
  • JetBrains found 85% of developers regularly use AI tools, 62% rely on at least one AI coding assistant, and 15% have not adopted AI tools at all
  • Y Combinator's Winter 2025 batch: 25% of startups running on 95%+ AI-generated codebases
  • Stanford: employment among software developers aged 22–25 fell nearly 20% between 2022 and 2025

The regulatory dimension

The EU AI Act entered into force in August 2024 and becomes broadly enforceable in August 2026. High-risk rules for embedded products follow in August 2027. Fines reach up to €35M or 7% of global turnover for prohibited practices.
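The penalty structure can be expressed as a one-line rule. The "whichever is higher" logic below reflects the commonly cited summary of the fine cap for prohibited practices; the Act itself defines several lower tiers for other violations, which this sketch ignores.

```python
def max_ai_act_fine(global_turnover_eur: float) -> int:
    """Upper bound of an EU AI Act fine for prohibited practices:
    EUR 35M or 7% of global annual turnover, whichever is higher."""
    return max(35_000_000, round(global_turnover_eur * 0.07))
```

For a company with EUR 1 billion in turnover the cap is EUR 70 million; below EUR 500 million in turnover, the flat EUR 35 million dominates.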

The EU Cyber Resilience Act requires secure-by-design development, mandatory risk assessments, and ongoing security updates for 5+ years.

Stanford HAI's 2025 AI Index reports that AI-related security and privacy incidents rose 56.4% from 2023 to 2024.

Teams using AI agents to generate code need to answer: who is liable when an agent introduces a vulnerability? How do you demonstrate that your development process meets regulatory standards?

The regulatory question is not hypothetical. By August 2026, organizations deploying AI in high-risk contexts will need to demonstrate compliance. A governed pipeline with traceability (the subject of Part III) is one defensible answer.

What the evidence tells us

The evidence cited above doesn't suggest we should stop using AI. It does suggest we're still in the early days of figuring out how to adopt this new way of working. The perception gap, the trust erosion, the code quality concerns, the organizational blind spot. These are symptoms of a technology adopted faster than the practices around it.

One reservation I want to flag honestly: nearly all available studies measure short-term outcomes. Task completion time, sprint velocity, code produced per week. Nobody has yet tracked a governed AI pipeline, or an ungoverned one, over two or three years in production. The initial productivity story might look very different once maintenance costs, accumulated technical debt, and team turnover enter the picture. We don't know yet. That uncertainty cuts both ways: the long-term picture could be better or worse than the early signals suggest.

Which brings us back to the basic intent of this book: what engineering practices make probabilistic outputs reliable enough for production?

References and further reading

METR (2025). Early 2025 AI-Experienced Open-Source Developer Study. https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/
Google DORA (2024). Accelerate State of DevOps Report. https://cloud.google.com/devops/state-of-devops
Stack Overflow (2025). Developer Survey 2025. https://survey.stackoverflow.co/2025/
GitClear (2024). Coding on Copilot: AI's Downward Pressure on Code Quality. https://www.gitclear.com/coding_on_copilot_data_shows_ais_downward_pressure_on_code_quality
Qodo (2025). State of AI Code Quality Report. https://www.qodo.ai/reports/state-of-ai-code-quality/
CodeRabbit (2025). State of AI vs Human Code Generation Report. https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report
Forsgren, Humble, and Kim (2018). Accelerate: The Science of Lean Software and DevOps. The foundational DORA metrics research.
Fu et al. (2025). Security Weaknesses of Copilot-Generated Code in GitHub Projects. ACM TOSEM. https://arxiv.org/abs/2310.02059
Google DORA (2025). State of AI-Assisted Software Development. https://dora.dev/research/2025/ai-assisted-development/
JetBrains (2025). State of Developer Ecosystem Survey. https://www.jetbrains.com/lp/devecosystem-2025/
Stanford HAI (2025). AI Index Report. https://hai.stanford.edu/ai-index