The Pipeline: Stages, Gates, and Artifacts
The core building blocks of your Software Factory
The ticket lifecycle
After all this handwaving, it's time to get more practical. Let's consider one unit of work: a single user story that flows through five stages. Four are agent-run; one is your existing CI/CD, agentized or not. Each stage has a clear intent, inputs, and outputs, and is probably very familiar to you from the pre-AI era.
Backlog item → Plan → Engineering → Testing → Review → PR/Merge → CI/CD
Each of these arrows represents a gate: an enforced transition in a state machine. Agents cannot cross gate boundaries without meeting defined criteria.
The image below depicts the core stages, gates, and artifacts in the governed lifecycle. Each stage has specific inputs, outputs, and responsibilities. The gates enforce quality and alignment before allowing progression to the next stage. Perhaps at some point, you might be able to remove human involvement via some smart automation, or require it only if a deviation is detected. (More about this in the Feedback Loops chapter).
This is a good template to start building over the standard "Plan mode" and "Agent mode" once you've discovered the need for more structure and control.
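To make the gates concrete, here is a minimal sketch of such a state machine in Python. The stage names follow the flow above; the gate criteria (acceptance criteria present, approved plan, passing build, and so on) are illustrative assumptions, not a prescribed set.

```python
from enum import Enum

class Stage(Enum):
    BACKLOG = "backlog"
    PLAN = "plan"
    ENGINEERING = "engineering"
    TESTING = "testing"
    REVIEW = "review"
    MERGED = "merged"

# Illustrative gate criteria: each gate inspects the work item's artifacts.
GATES = {
    (Stage.BACKLOG, Stage.PLAN): lambda item: bool(item.get("acceptance_criteria")),
    (Stage.PLAN, Stage.ENGINEERING): lambda item: item.get("plan_approved", False),
    (Stage.ENGINEERING, Stage.TESTING): lambda item: item.get("build_passed", False),
    (Stage.TESTING, Stage.REVIEW): lambda item: item.get("tests_passed", False),
    (Stage.REVIEW, Stage.MERGED): lambda item: item.get("review_clean", False),
}

def advance(current: Stage, target: Stage, item: dict) -> Stage:
    """Cross a gate only if its criteria are met; otherwise refuse loudly."""
    gate = GATES.get((current, target))
    if gate is None:
        raise ValueError(f"No gate from {current.name} to {target.name}")
    if not gate(item):
        raise ValueError(f"Gate {current.name} -> {target.name} rejected the item")
    return target
```

The point is not the five lines of dictionary lookup; it's that an agent cannot "talk its way" past a gate, because the transition function refuses mechanically.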
Planning
In case you're wondering, this is the 'outer loop' work that happens before any code is written. It gathers up the requirements, designs, and whatnot, and turns them into atomic, structured, and ordered tasks that you can feed into your pipeline.
Specialist planning agents can be spawned to create and review planning tasks when needed. For instance, if you're porting an old frontend and need to learn from it and produce designs, you might use a differently instructed agent than for a greenfield feature. Same if you have a layered architecture or microservices; some parts of the system could benefit from a domain-specific agent with specialized knowledge and context. Apply this specialization as you see fit, but remember: the smaller and more focused an agent is, the better it will perform.
For example, a good plan for an agent might include:
| Plan Element | Example |
|---|---|
| Flow | Add a filter to the data table |
| Acceptance Criteria | Gherkin-style acceptance criteria that can be directly mapped to tests |
| Architecture | Implement as a new API endpoint that the frontend will call |
| Components | Backend service A, frontend component B |
| Scope | Don't change the database schema, just add a new API that returns filtered data |
| Task List | A detailed 'DoD'-style list of tasks, to be performed in a given order, each small enough that it won't take hours to complete |
| References | Links to relevant codebase files, design docs, UX mockups, and task-specific DoDs, if any |
| Exclusions and Constraints (the leash) | "Don't create any database tables", "Don't add pagination to the API or the screen", and so on. Without these, you risk getting things you never asked for in the first place, as AI tends to be rather creative! |
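Serialized, such a plan might look like the sketch below. The field names and values are illustrative, mirroring the table above; the point is that a structured artifact can be validated mechanically at the gate before engineering ever starts.

```python
import json

# Hypothetical serialized plan; field names mirror the plan-element table.
plan = {
    "flow": "Add a filter to the data table",
    "acceptance_criteria": [
        "Given a loaded table, when the user types in the filter box, "
        "then only matching rows are shown",
    ],
    "architecture": "New API endpoint called by the frontend",
    "components": ["backend-service-a", "frontend-component-b"],
    "scope": "No schema changes; add an API returning filtered data",
    "tasks": [
        {"order": 1, "task": "Add filtered-items API endpoint"},
        {"order": 2, "task": "Wire the filter box to the endpoint"},
    ],
    "references": ["docs/ux/filter-mockup.png"],
    "constraints": [
        "Don't create any database tables",
        "Don't add pagination to the API or the screen",
    ],
}

# A structured artifact is easy to validate at the gate before engineering starts.
required = {"flow", "acceptance_criteria", "tasks", "constraints"}
missing = required - plan.keys()
assert not missing, f"Plan rejected at gate, missing: {missing}"
serialized = json.dumps(plan, indent=2)
```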
Engineering
The engineering stage is where the rubber meets the road; in the pre-AI era it drew the most attention and, often, the most work. An Engineering Agent takes the approved plan and executes it, producing code and unit tests that meet the defined acceptance criteria. This stage is all about execution: translating the plan into a working implementation, exactly as planned, but nothing more.
At a minimum, the engineering agent should:
- Have a carefully crafted context that follows your project practices and architecture
- Execute a clear workflow that dictates the gating and required additional tasks, such as a requirement to build everything before declaring victory
- Enforce revision control discipline
- Do a self-review before asking for a human review (if required)
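The "build everything before declaring victory" rule can be enforced as a simple self-check the agent must pass before reporting a task as done. A minimal sketch; the commands are placeholders for your project's real build and test steps:

```python
import subprocess

# Placeholder commands; substitute your project's real build and test steps.
CHECKS = [
    ["git", "diff", "--check"],  # no conflict markers or stray whitespace
    ["make", "build"],           # everything must compile
    ["make", "test"],            # unit tests must pass
]

def declare_done(run=subprocess.run) -> bool:
    """Return True only if every check exits cleanly; otherwise keep working."""
    for cmd in CHECKS:
        if run(cmd, capture_output=True).returncode != 0:
            return False
    return True
```

Injecting the runner keeps the check testable; in the agent's workflow, a `False` here routes the work back to engineering instead of onward to review.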
Compressed Indexes: Context Engineering for Documents
One practical tip for controlling the task-specific project context fed to agents is a technique called Compressed Indexes. The basic idea: instead of cramming everything into your CLAUDE.md or maintaining separate reference sections for every agent, organize your documents into meaningful, themed chunks, backed by an accessible index, and let the agent consult it when needed.
A good index for LLMs is not just a table of contents; it should be designed to help the agent find the right context efficiently, without consuming unnecessary tokens.
The index should include:
- Document summaries: A brief description of each document's content and purpose.
- Relevance tags: Keywords or tags that indicate which agents or tasks the document is relevant for.
- Sample questions: Example queries that would lead the agent to consult that document. This helps the agent understand when to seek that context.
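As a sketch, such an index could be as simple as a structured list with a tag-based lookup. The document paths, tags, and sample questions below are made up for illustration:

```python
# Illustrative index entries; paths, tags, and questions are made up.
INDEX = [
    {
        "doc": "docs/architecture.md",
        "summary": "Layering rules, module boundaries, allowed dependencies.",
        "tags": {"architecture", "backend", "frontend"},
        "sample_questions": ["Which layer may call the database?"],
    },
    {
        "doc": "docs/ui-conventions.md",
        "summary": "Control usage, layout grid, error and empty states.",
        "tags": {"frontend", "ux"},
        "sample_questions": ["How should a filter control look and behave?"],
    },
]

def relevant_docs(task_tags: set) -> list:
    """Return only the documents whose tags overlap the task's tags."""
    return [entry["doc"] for entry in INDEX if task_tags & entry["tags"]]
```

The agent reads the small index up front and pulls in a full document only when its tags or sample questions match the task, rather than carrying every document in context.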
Check the link below for an open-source implementation of this technique.
Specialist agents
As you learn along the way, you may end up with specialized engineers for different aspects of the work, like frontend and backend engineer agents. A "critique" agent (think of the guy on your team who is vicious about indentation; the correct answer is 3 spaces, btw) that reads your code and shreds it to pieces can also be a nice touch. Spawn it in a fresh context and with a different model before asking for a human review.
The gate state machine
Here's one example of how you could design your own work-tracking engine. Essentially it's the same thing we had back in the day as 'Definition of Ready' and 'Definition of Done,' but with more structure and automation: when is the feature clear enough to be worked on, when is the code good enough to be reviewed, and so on.
Whatever the exact gates you erect, in the absence of humans in the inner loop, you need to have something like this to prevent runaways and to maintain traceability.
I've provided a more detailed breakdown of one possible setup that we've been using.
Agent roles and responsibilities
Every governed pipeline needs a set of specialized agent roles. The exact names and boundaries will vary with your stack, but the division of responsibility matters more than the labels. Each role should have a clear mandate, defined inputs and outputs, and explicit criteria for advancing or falling back.
| Role | Mandate | Advances When | Falls Back When |
|---|---|---|---|
| Roadmapper | Transforms use cases into structured backlog items with acceptance criteria | Items meet clarity, testability, and vision alignment | Items are vague or misaligned |
| Planner | Turns backlog items into ordered plans with tasks and references | Plan is complete, clear, and architecture-aligned | Plan is incomplete or misaligned |
| Engineer | Produces code and unit tests per plan | All tasks complete, build passes | Timeout or build failure |
| Tester | Validates implementation against plan and acceptance criteria | All tests pass | Tests fail or are incomplete |
| Reviewer | Runs quality checks against a defined checklist | All checks pass | Issues found |
The critical pattern is that each role consumes only the structured artifacts from the previous stage, never the conversational context. This prevents context bleed and keeps each agent's input clean. The reviewer, for example, checks code quality, performance implications, accessibility, functional correctness against spec, architecture alignment, and UI/UX adherence.
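That handoff discipline can be made explicit in code: the tester's input is a typed artifact, never the engineer's conversation. A minimal sketch with assumed field names:

```python
from dataclasses import dataclass

# Assumed artifact shape: the structured output of the engineering stage.
@dataclass
class EngineeringArtifact:
    pr_url: str
    changed_files: list
    build_passed: bool

def handoff_to_tester(artifact: EngineeringArtifact) -> dict:
    """The tester's entire input; the engineer's chat history never crosses over."""
    if not artifact.build_passed:
        raise ValueError("Gate: broken builds never reach testing")
    return {"pr": artifact.pr_url, "files": artifact.changed_files}
```

Because the handoff function only accepts the artifact type, there is simply no channel through which conversational context can bleed into the next stage.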
How to structure the backlog
The classic backlog, you know, the 'Epic/Feature/Story/Task' hierarchy or whatever your ticketing system calls it, is a good blueprint for organizing your work. And chances are you're required to use it anyway. The key is to keep it hierarchical and sequential, as you'd normally do when working with flesh-and-bone developers.
Here's a setup I've found to work well in practice.
| Artifact | Stage | Creator | Format | Consumer | Purpose |
|---|---|---|---|---|---|
| Feature | Planning | Roadmapper Agent | Markdown/JIRA | Backlog item | Project-level requirement |
| Story | Backlog | Product | Markdown/JIRA | Planning Agent | Requirements source of truth |
| Plan | Planning | Planning Agent | JSON/MD | Review gate + Engineering | Architecture and task breakdown |
| Tasks | Planning | Planning Agent | JSON/MD | Engineering Agent | Executable specification |
| Code + Tests | Engineering | Engineering Agent | PR | Testing Agent + Review Agent | Deliverable |
| Test Results | Testing | Testing Agent | Test Report | Review gate | Validation evidence |
| Review Checklist | Review | Review Agent | Structured | Deploy gate | Quality assurance |
| PR + CI Report | Integration | Your CI/CD | Standard | Deploy team | Production readiness |
Having this systematic and well-defined is a bit like programming the workflow. Clear inputs and outputs for each stage keep things on track. With humans, certain degrees of freedom and flexibility are expected; Markku will surely know how to add this cancel button without being told much more, but the agent won't. And you can still fix a single small thing without a ticket.
Making your backlog items AI-ready
Traditional agile stories are deliberately lightweight: a sentence of intent, a few acceptance criteria, a conversation placeholder. This works fine when the executor is a human who will ask clarifying questions and fill in the gaps from experience. It works considerably less well when the executor is a Planning Agent that fills gaps probabilistically, i.e. by guessing.
On one of my projects I found out the hard way that the standard 'As a user, I want to...' format was nowhere near enough. The planning agent would produce technically plausible but contextually wrong plans because it had no idea which APIs were available (and what kind of input they'd require and what kind of data you'd get back), what the UI conventions were, or where the architectural boundaries lived.
Naturally, these kinds of concerns, or at least some of them, might reside in project documentation, as they should. But at the pre-planning stage, a good 'Definition of Ready' checklist for backlog items is a solid way to make sure the planning agent has what it needs to produce a good plan. It also coaches the developer in charge of the agent factory's 'production line' to pay attention.
Think of it as the pre-flight checklist for the planning agent. With a good structure in place, even one with gaps and bare titles, your detailed plans end up looking much more like each other, limiting the agent's creativity in a good way.
So what helped was treating the analysis phase as a proper Definition of Ready checklist before handing anything to the planning agent. For each story, we added:
- Architectural guidance: which layers and modules the feature touches, and which it must not
- Technical analysis: relevant existing patterns, data models, and constraints
- Integration points: required APIs, services, and their contracts
- UI/UX decisions: layouts, controls, behaviors, and the reasoning behind them
- Acceptance criteria: clear, testable conditions that define when the story is done
This is not a requirements document per story. It's maybe ten to fifteen minutes of extra work during analysis, but the downstream impact on plan quality was significant. The agent stopped inventing architecture and started following it.
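Such a Definition of Ready check is trivial to automate before a story ever reaches the planner. A sketch, with field names taken from the checklist above:

```python
# Field names mirror the checklist above; a story missing any of them
# goes back to analysis instead of reaching the planning agent.
DOR_FIELDS = [
    "architectural_guidance",
    "technical_analysis",
    "integration_points",
    "ui_ux_decisions",
    "acceptance_criteria",
]

def missing_for_planning(story: dict) -> list:
    """Return the missing Definition of Ready fields; an empty list means ready."""
    return [f for f in DOR_FIELDS if not story.get(f)]
```

A non-empty result is both a gate rejection and a to-do list for whoever is doing the analysis.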
Bugs discovered later in the process should be judged by the same criteria: if they are big enough to require a plan, they should be treated as a new backlog item and go through the same process. If they are small enough to be fixed without a ticket, then just fix them.
While the details and complexity of your dream setup will certainly be something other than what's above, remember that you need some kind of clear system to maintain traceability and control.
In summary, a good agentic lifecycle is not just about the agents and their capabilities. It's also about how you structure the work, define the stages, and enforce the gates to maintain quality and alignment. A good one lives by these principles:
- Forces clarity: Vague specs become obvious when you serialize them to a structured format and break them down into tasks
- Enables automation: Each agent can validate entry criteria before starting and refuse to work if they are not met.
- Creates evidence: Audit trails show what was decided and by whom. Enables you to go back, improve the plans and agents, and perhaps regenerate the code.
- Allows rollback: You can revert from clean artifact boundaries. The tester won't fix a bug; it hands the fix back to engineering. Engineers can't figure out (trust me, I'm one) what the customer really wanted, so they ask the planner to clarify and update the plan. And so on.
- Keeps things in check: You know what was done and what wasn't. You can resume and validate the work at any point and maintain a clear picture of the progress and bottlenecks.
What not to do
Some things are perhaps not to be GenAI-automated due to several good reasons. I'm the first to admit I do have an 'infrastructure engineer agent' in my toolbox, whose idea really is not to act as a runtime (although it might be used to diagnose!), but to produce scripts and tools for the humans to execute. The risk of letting it execute them is just too high, and the potential benefits are not that great.
So, a few words of warning about what not to agentize just yet. You may thank me later.
**Don't replace CI/CD.** Your CI/CD practices should remain unchanged. You can and will CI much more frequently than before; leave the CD behind a manual gate.

**You don't want a DevOps Agent.** Don't let the AI manage your infrastructure, pipelines, or deployments. It can produce handy scripts and tools, but don't let it execute them.

**No DBA Agent either.** Don't let the AI modify your database schema or manage migrations. It can generate migration scripts, but the execution should be a human decision.

**No IaC or runtime work.** Modifying (and probably breaking) your pipeline or infrastructure will cause havoc. Use the AI to create handy tools, but don't let it execute them.

**Don't manage cross-team dependencies.** Getting simple tasks to flow will be hard enough. Leave project and team boundaries to your PM.

**Don't fix bad backlog items or invent new ones.** You're the only one who can tell what is really needed and what is not.
Feedback loops that actually work
The pipeline as described above is mostly forward-flowing: backlog to plan to code to test to review to merge. Rejections loop back one stage, and that's about it. But real delivery needs richer feedback than just "send it back to engineering." I'll cover the broader feedback loop design in Chapter 20, but two loops proved their worth early enough to mention here.
The first was capturing reusable elements. When the tester or reviewer spotted a pattern that kept recurring, a shared call pattern, a common UI control, or a cross-cutting concern, that information got fed back into the architecture context documents that agents read for future tasks. Without this, the agents reinvent the same patterns every time, and you end up fixing the same things over and over.
The second was the revert-to-engineering loop for bugs discovered by the testing agent. Instead of having the tester attempt fixes or the engineer debug within an already cluttered context, the bug report went back to a fresh engineering session with just the failing test and the relevant code. Concerns stayed separated, context stayed clean. Exactly the kind of thing that matters when your executor has a finite context window and no memory between sessions.
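This loop can be expressed as a small context-assembly step: the fresh engineering session receives only the failing test and the relevant code, nothing else. A sketch with assumed bug-report fields:

```python
# Assumed bug-report fields; the keys are illustrative.
def build_fix_context(bug_report: dict) -> dict:
    """Assemble the minimal context for a fresh engineering session."""
    return {
        "failing_test": bug_report["failing_test"],
        "relevant_files": bug_report["relevant_files"],
        "expected": bug_report["expected"],
        "actual": bug_report["actual"],
        # Deliberately excluded: chat history, unrelated diffs, planner notes.
    }
```

Whatever noise the tester's session accumulated simply never makes it into the fix session's context.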
The feedback loops that actually worked were the ones that fed information back into the system's memory: architecture docs, shared patterns, conventions. The ones that stayed aspirational were the meta-loops: metrics calibrating the process, retrospectives tuning the governance. Start with the concrete ones.
The game eventually acquired its own lightweight version of these mechanisms: a milestone plan, per-feature specs as structured prompts, and a personal review step before each merge. No state machine, no shared tracking. Just habits that emerged from getting burned too many times. The framework in this chapter is what those habits look like when they need to scale beyond one person.