May 4, 2026
Spec-Driven Development Works. The Hard Part Is Where You Cut It.
Two weeks ago I helped a client engineering team switch to Spec-Driven Development. The numbers from the first week: test coverage up 40%; roughly 80% of a new feature shipped from a single PRD with minimal engineer intervention.
That isn’t the story. The story is what showed up in week two: a bottleneck nobody on the team had a name for, sitting exactly where they’d just freed up the most capacity.
The bottleneck is figuring out where to cut the work. The bottleneck is chunking.
Before
The team had the workflow most product engineering teams have. A ticket lands. An engineer picks it up. They build, they push, somebody reviews, it ships. Product and design renegotiate inside the build window — sometimes politely in PR comments, sometimes via a Slack DM that starts with “quick question”.
The engineers were absorbing the ambiguity. They always do. That absorption was the slow part of the cycle, but because it lived inside engineering’s slice of the process, the rest of the org filed it under “that’s just how long building takes”.
Standard story. I see this exact shape in most engagements.
What we changed
The change wasn’t dramatic in shape. We moved to PRDs first — a real PRD, not a Notion doc with three bullet points and a Loom — and routed the implementation through an AI coding workflow. Engineers stopped writing first-draft implementations. They reviewed PRDs, they reviewed AI-generated PRs, they integrated, they fought the ambiguous bits.
If you’ve read why I think AI adoption isn’t actually stalled, this is the team-level version of that argument. The org-level claim is that your AI pilot stalled because your decision pipeline didn’t change. The team-level claim is what happens when you do change it: the decisions move out of the code, the build phase shrinks, and a different problem appears in their place.
I’m staying tool-agnostic on what we used. The interesting thing isn’t which AI agent. It’s what changed about the loop.
The first week: receipts
Two numbers came out of the first week.
Test coverage went up 40%. Not because anyone ran a coverage initiative. The PRDs specified behavior at a level of detail that made tests writeable as a side effect of implementation. The AI agent wrote the tests because the spec told it what the function should do. Engineers stopped deferring tests because there was nothing to defer — the test was part of the chunk, not a separate ticket.
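To make that concrete: a behavior line precise enough for an agent to implement is usually a test waiting to be written. Here's a hypothetical sketch — the coupon example, the names, and the vitest setup are all mine, not the client's:

```ts
// A hypothetical PRD behavior line (not from the client's spec):
//   "Applying an expired coupon returns COUPON_EXPIRED and leaves
//    the cart total unchanged."
import { describe, it, expect } from "vitest";

type Cart = { total: number };
type Coupon = { code: string; percentOff: number; expiresAt: Date };

// Minimal implementation so the test runs; in the real workflow, the
// spec line above is what the agent is handed.
function applyCoupon(cart: Cart, coupon: Coupon, now = new Date()) {
  if (coupon.expiresAt < now) {
    return { cart, error: "COUPON_EXPIRED" as const };
  }
  return {
    cart: { total: cart.total * (1 - coupon.percentOff / 100) },
    error: null,
  };
}

describe("applyCoupon", () => {
  it("rejects an expired coupon without touching the total", () => {
    const cart = { total: 2000 };
    const result = applyCoupon(cart, {
      code: "SPRING24",
      percentOff: 10,
      expiresAt: new Date("2020-01-01"),
    });
    expect(result.error).toBe("COUPON_EXPIRED");
    expect(result.cart.total).toBe(2000); // unchanged, exactly as the spec says
  });
});
```

The test is just the spec line restated as assertions. When the spec is that precise, skipping the test takes more effort than writing it.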
Roughly 80% of a new feature shipped from a single PRD with minimal engineer intervention. Engineers were involved in the boundaries, the integration points, and a handful of decisions that hadn’t been pre-resolved in the spec. They weren’t writing the lines. They were directing.
I’d seen these kinds of numbers in demos and thought “sure, on a toy”. Watching them hold up against real product work changed my read.
The second week: where the workflow stalled
Then we hit week two.
The team kept the cadence going on small, well-shaped work. PRDs that targeted a single endpoint, a single migration, a single component family — those flew. Engineers were starting to push back on the spec-writing time, which felt slower than just opening the editor.
Then a feature came up that was bigger. A real feature. Cross-cutting, touching three subsystems, with a handful of policy decisions baked in. The team wrote one PRD for the whole thing.
The AI started strong, but as the implementation moved across subsystems it began contradicting earlier decisions. It re-litigated function signatures it had defined three steps back. It introduced patterns that didn’t match what existed elsewhere in the codebase. The engineer reviewing it spent the day correcting drift instead of integrating.
The team’s response was to split the PRD. They re-spec’d the feature as four separate chunks. That worked. Each chunk shipped clean. But the team had spent two days arguing over where to draw the lines. Somebody said it out loud in standup: “we shipped this in a day but spent two arguing over where to split the spec”.
That was the moment. The work compressed but the deciding didn’t.
The two failure modes
What we kept finding was a narrow band where SDD works, and a failure mode on either side of it.
Chunk too big: the AI loses context partway through. It stops being able to hold the whole problem in working memory. Earlier decisions get re-decided, patterns drift, the integration cost on review eats whatever speed you gained.
Chunk too small: the spec-writing overhead exceeds the implementation cost. By the time an engineer has written a spec precise enough for the AI to act on, they could have just opened the file and made the change. The team spent more time documenting intent than building.
The interesting thing is that the working band — the chunk size where SDD pays — is narrower than I expected and shifts depending on the problem. Cross-cutting changes have a smaller working band than localized ones. Greenfield code has a bigger band than touching legacy. Code with strong existing patterns has a bigger band than code without.
This is the same insight that shows up one layer down, in the architecture itself: explicit boundaries are what AI needs to function. Chunking is the same problem at the workflow layer. A chunk is a boundary. Where you cut it determines whether the AI can reason about the whole problem at once.
You might read this as “AI is doing 80% of the work, so we need fewer engineers, or we can hire less-senior ones to handle the easy 20%”. That isn’t the claim. The 20% didn’t get smaller. It moved upstream, and the stakes on each call went up. The senior engineer who used to spend 80% of their week implementing now spends that 80% deciding where to cut. They’re not less needed. They’re applied to a problem with a much bigger blast radius — get the chunk wrong and you lose a day of AI drift; get it right and a feature ships from a single spec.
What this actually means
Two weeks isn’t long. Here’s what we know and what we don’t.
What we know: SDD ships real product work. The 40% / 80% numbers happened on a real engagement, on real code, in week one. They held up in week two on the parts of the work where the chunks were sized right.
What we don’t know: a method for sizing chunks correctly on a feature you’ve never built before. We have hunches. Localized changes have a working chunk size somewhere around “one PRD per file family”. Cross-cutting features want to be split before you write the PRD, not after. Greenfield code rewards bigger chunks than legacy. None of these are a method. They’re patterns we keep noticing.
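For what it's worth, here's what those hunches look like written down — a sketch, not the method. Every input and threshold below is a placeholder we eyeball while reading a PRD, not something a tool measures:

```ts
// The hunches above, encoded as a rough pre-PRD check. Placeholder
// thresholds; nothing here is validated beyond two weeks of pattern-matching.
type ChunkEstimate = {
  subsystemsTouched: number;  // cross-cutting work shrinks the working band
  isGreenfield: boolean;      // greenfield rewards bigger chunks than legacy
  hasStrongPatterns: boolean; // strong existing patterns widen the band
  fileFamilies: number;       // hunch: localized work wants ~1 PRD per file family
};

function shouldSplitBeforeWritingPRD(chunk: ChunkEstimate): boolean {
  // Cross-cutting features want to be split before the PRD exists, not after.
  if (chunk.subsystemsTouched > 1) return true;

  // Start from the "one PRD per file family" hunch, then widen the budget
  // when conditions favor bigger chunks.
  let familyBudget = 1;
  if (chunk.isGreenfield) familyBudget += 1;
  if (chunk.hasStrongPatterns) familyBudget += 1;

  return chunk.fileFamilies > familyBudget;
}
```

If some version of this survives contact with more engagements, that's the heuristic worth writing up properly.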
The lesson isn’t “do SDD”. The lesson is that SDD works the way good tooling always works. It eliminates the easy version of a problem and reveals the hard version underneath. The hard version of “build a feature” used to be “write the implementation”. It’s now “decide where the boundaries are”. That decision used to live inside engineering’s slice of the cycle, hidden inside the implementation work. Now it lives in front of the cycle, in the open, requiring an answer before any code gets written.
This is what your senior engineers should be doing. It’s also what most teams have spent the last five years not training engineers to do.
Where this leaves us
We’re still working on a chunking heuristic worth shipping as a method. When we have one, I’ll write it. For now: SDD works on the chunks you size right, fails on the ones you don’t, and the team that figures out chunking faster will outpace the team that just writes more PRDs.
If you’re a CTO or VP of Engineering thinking about running SDD on a real team and wondering how to think about chunking from day one, book a call. I’d rather you hit week two without spending two days arguing over a spec.