Erik Perttu: Autonomous Engineering Pipelines

This blog documents an autonomous engineering pipeline in Python that takes a ticket from input to tested, reviewed code with no human in the execution loop. It has been proven on TypeScript and Python codebases, including a ~100k-line TypeScript monorepo.

Read the full pipeline overview.

JULY 15, 2026

Build for the Second Language You Don't Have Yet: Polyglot Discipline in a Code-Generation Pipeline

The discipline of asking whether each change holds for a second language caught a latent fixture-scanner bug before any second language existed to trigger it.

JULY 6, 2026

The Registry Is Not a Fixed Input: Extend It, Don't Ask the Model

Three times a new ticket type forced a new registry fact. Each was an extraction, not a prompt, and each kept the derivation deterministic forever after.

JUNE 21, 2026

Steering a Coding Agent Across Long Sessions: The Rules It Can't Drift From

How to keep an AI coding agent on course across sessions: an always-loaded contract, constraints backed by failure records, and self-enforcing rules.

JUNE 20, 2026

Steering a Coding Agent Across Long Sessions: Three Documents and One Rule

How three plain-markdown documents with separate jobs, and one mechanical rule, keep an AI coding agent coherent across hundreds of hand-steered sessions.

JUNE 20, 2026

The False Green Baseline: When a Passing Test Suite Hides a Broken Type-Check

How an AI coding agent's type-check gate misattributed pre-existing LSP errors to a correct rename, and why gating on the change delta is the answer.

JUNE 18, 2026

Raising the Floor under the Coder: Generalizing an AI Coding Agent across Ticket Types

An AI coding agent went from one ticket type to six, with four architectural moves that raised the floor to deliver zero-Coder results on four types.

JUNE 16, 2026

Proving a generated test can fail: mutation testing as a sufficiency gate for an AI coding agent

How mutation testing proves a generated behavioral oracle is sufficient, through three ticket types and five false passes that nearly went undetected.

JUNE 10, 2026

The Decision an AI Coding Agent Can't Make Alone: Operator-Grounded Intent Capture

A deterministic resolver passed every unit test, then failed 40% of runs. Operator confirmation worked where a smarter algorithm did not.

JUNE 2, 2026

The Behavioral Oracle: Testing What an AI Coding Agent's Route Suite Structurally Cannot See

The pipeline derives a behavioral test from the same operation that writes the guard, runs it below the flat-stub boundary, and proves it via mutation testing.

MAY 26, 2026

Error Messages as a Hypothesis Ladder: Five Techniques

Five techniques that turn a raw compiler error into a frequency-ordered hypothesis ladder, and why the pattern matters more for LLM debuggers than human ones.

MAY 23, 2026

The Debugger Becomes a Router: Sending Each Failure to the Stage That Owns the Fix

The Debugger in the autonomous engineering pipeline now routes each failure to the stage that owns the fix. Half the original destinations no longer exist.

MAY 20, 2026

Agent Pipeline Grounding Chat: From Free-Form Q&A to Typed Fields

Two open quality items are closed. Operator design decisions flow through typed handoff fields. The grounding chat has a structural defense against LLM drift.

MAY 18, 2026

From LLM Author to LLM Reviewer: An AI Coding Agent Authors a Production Feature With Zero LLM Code Generation

The pipeline completed a multi-layer TypeScript feature with zero LLM code generation: 6/6 ticket tests, 959/959 suite, $0.308 versus $0.681.

MAY 14, 2026

Grounding an Autonomous Engineering Pipeline in Operator Design: The Pre-Autonomous Chat Stage

A new pre-autonomous chat stage lets the operator ground design decisions in the registry before the pipeline runs, adding a new top to the trust hierarchy.

MAY 9, 2026

From LLM Luck to Structurally Guaranteed: One Ticket Across Four Architectural Eras

Seven pipeline runs, one ticket, four architectural eras. Per-test cost dropped from $0.385 to $0.074 by replacing LLM guesswork with structural derivation.

MAY 7, 2026

Who Reviews the Swarm? Why Probabilistic Verification Fails at Scale

Swarm parallelism is a throughput solution applied to a reliability problem. Probabilistic verification of probabilistic output does not converge.

MAY 6, 2026

Engineering Around LLM Non-Determinism: The Architectural Follow-Up to 248 Runs

What shipped in the three weeks after the 248-run hallucination ceiling: removing the LLM from computable decisions and validating everything else.

MAY 2, 2026

When the Pipeline Should Ask Instead of Guess

When a ticket has two equally plausible interpretations, a deterministic stage stops the pipeline and asks before any Coder agent runs.

MAY 1, 2026

Vendor-Agnostic by Configuration: Per-Stage Model Setup in an LLM Coding Agent

Each stage in the pipeline runs against its own model and vendor config. How that design enables per-stage cost control, model swaps, and vendor flexibility.

MAY 1, 2026

You Can't Diagnose an LLM Pipeline from Output Alone

The run archive is where pass/fail becomes diagnostic: per-stage operation logs, reasoning traces, and a correlation token that spans every stage.

APRIL 30, 2026

No Stage Runs Forever: Retry Budgets and Escalation in an Agent Pipeline

How per-stage retry budgets, wall-clock timeouts, and a global token cap keep any stage from running indefinitely, with the Debugger as the most complex case.

APRIL 28, 2026

How a Binding Validator Is Wired: Synchronous Pre-Commit, Structured Errors, Retry Folding

The mechanics behind a binding validator: why synchronous pre-commit timing, structured rejections, and retry folding are each individually load-bearing.

APRIL 27, 2026

Prompt Rules Are Advisory; Validators Are Binding

When the model has a strong prior, naming the failure mode in the prompt doesn't prevent it. Prompt rules are advisory; validators are binding.

APRIL 24, 2026

What the Symbol Registry Stores, and How It Stays Fresh

The data model behind the symbol registry: per-symbol records, file-level hashes, call-graph edges, and the invalidation strategy that keeps it current.

APRIL 21, 2026

How Filename Lookups Flood an AI Coding Agent's Context Window

Looking up symbols by filename instead of full path pulls every `index.ts` in the project into the agent's context. One line changed. 20 results down to 1.

APRIL 19, 2026

The Lego Instructions: An Architectural Principle for AI Coding Agents

Three properties of a Lego instruction set, mapped to an AI coding pipeline: why manifest quality matters more than builder quality.

APRIL 17, 2026

Per-Field Hallucination Fixes Hit a Ceiling: 248 Runs on an AI Coding Agent

A Bernoulli model predicted 36% first-pass success across 248 pipeline runs. Measured: 21%. The gap explains why per-field hallucination fixes have a ceiling.

APRIL 16, 2026

Stop Asking the Model What the Code Already Knows

Every field a Planner emits that the codebase already knows is a dice roll. Machine extraction replaces those dice rolls with deterministic lookups.

APRIL 15, 2026

Why Architecture Gaps Need a Close Condition, Not a Backlog

Why tracking known architectural gaps with specific close conditions is more useful than a backlog, and what makes each entry work.

APRIL 13, 2026

Fixture-First Development as an Early Warning System for AI Pipelines

Fixture-first development as an early warning system for AI pipelines: the first real-project run confirmed three known gaps instead of discovering new ones.

APRIL 11, 2026

How claude -p Silently Inflates Your Pipeline Token Costs

Using claude -p in a pipeline? The model has bash access you never granted. Each tool call re-sends your full context. One sentence cuts token spend by 52%.

APRIL 4, 2026

Silent Data Destruction: The Write Path Bug in Agentic Pipelines

The Coder added a new function to an existing file. The pipeline reported success. All seven existing functions were gone.

MARCH 31, 2026

Four Pipeline Bugs That Only Surface With Less Capable Models

A ticket that passed twice failed four times at lower model effort, exposing four structural pipeline bugs the higher-effort run had masked.

MARCH 28, 2026

LLM Non-Determinism Is a Pipeline Failure, Not a Model Problem

Same ticket, same pipeline config, different result two days apart. Why the first run passing was not confirmation that the constraint was enforced.

MARCH 21, 2026

Intentional Technical Debt: Building Features in the Wrong Order

The pipeline committed code before branch isolation existed. The risk was real, named, given a close condition. That is what makes it different from a shortcut.

MARCH 13, 2026

Why a Warning Is Worse Than a Hard Stop

When the pipeline detects zero test files, logging a warning and continuing produces output that looks correct, and no downstream gate can catch the problem.

MARCH 6, 2026

Correct Code, Wrong File: How the Write Gate Contains Scope Creep

On attempt 3, the Coder tried to write a file that was not in the manifest. The write gate stopped it before anything hit disk. This is what it is for.

FEBRUARY 27, 2026

Why the Debugger Never Inherits the Coder's Reasoning

The Debugger receives the test failure and the code on disk, not the Coder's reasoning. That isolation is not a constraint. It is the design.

FEBRUARY 20, 2026

What Calls This Function? Why AI Coding Agents Need a Language Server

Tree-sitter tells you where a symbol is defined. It cannot tell you where it is called. Finding that out cost one pipeline run 33,000 tokens.

FEBRUARY 13, 2026

The Quality Gate That Passed When It Failed

A Haiku optimization made the L2 quality gate silently pass on every run. The fix was removing the LLM call entirely.

FEBRUARY 6, 2026

Why AI Coding Agents Fail on Evolving Codebases

Not a model capability problem. An agent with the wrong codebase version produces output that is plausible but wrong in ways that are hard to catch.