I Never Wrote the Spec. I Just Kept Asking “What Breaks?”

How a multi-agent context manager — design, stress test, working plugin, and all — got built across one session without a single written requirement. The whole thing ran on questions, not prompts.


I started with a fuzzy idea and no plan: I wanted multiple language models, skills, and agents working on the same project, doing different jobs at different times. Write a PRD with Claude, hand it to Codex to review, hand it to Gemini to stress-test, come back to Claude to fold in the findings. Each model brings something different. The work shouldn’t have to start over every time I switch windows.

What I did not do was write a spec, a PRD, or a requirements doc. I never described the system I wanted in any detail, because I didn’t know it in any detail. What I had was a hunch and a willingness to interrogate it.

By the end of the session there was a working architecture, a stress-tested protocol, a runtime skill, a routing table, and an installable plugin published to a live site. None of it came from a brief. All of it came from a loop: throw out a half-formed idea, make the model grill it, build the smallest slice, then ask “what breaks?” — and let the answer reshape the thing. This is the account of that loop, because the method turned out to be more transferable than the tool.

The build had no spec — it had a loop: idea, grill the forks, build the slice, ask what breaks, repeat

I refused to let it build anything yet

My opening prompt ended with an instruction that did more work than anything else I typed all session:

“grill me with any questions, blind spots, conflicts or other considerations we need to resolve before you can plan your PRD/Architecture”

I wasn’t asking for a design. I was asking to be cross-examined. The reasoning was simple: I knew my idea had holes, and I’d rather have them exposed by a model poking at it than discover them after something was built around the wrong assumption.

What came back was a single instinct in one breath — “this is a handoff journal problem, not a database problem” — followed by a refusal to commit until I’d settled the forks that actually changed the design. We went two rounds. Eight questions in total: where does the store live, what’s the canonical task source, how are tasks routed, how deep does mid-task resume go, how is the ritual enforced, do we trust a prior agent’s “done” mark, what’s the scope, can a fresh agent overturn an earlier decision.

I answered each one tersely — “in-repo, git-tracked,” “markdown in the store,” “task-level plus a handoff note,” “propose, you ratify.” Only after every fork was closed did the locked decision set appear:

ForkChoice
Store.context/ in-repo, git-tracked
Task sourceCanonical markdown (STATE.md)
RoutingTask-level, explicit in STATE.md
Mid-task resumeHandoff note only — no deep replay
EnforcementClaude hooks; instruction file for others
TrustPrior “done” marks trusted; contradictions surfaced
ScopeThis project only (project-local plugin)
OverridesFresh agent may propose; ratification required

Concept #1: Commit to the forks before you commit to the code. The questions that change the design are not the same as the questions that clarify it. Forks are the design decisions where a different answer produces a different system. Close every fork before building anything. Skipping this step doesn’t save time — it just moves the rework to after something exists.

The build loop

Once the forks were locked, the session moved into a rhythm that repeated three or four times: build the smallest slice, ship it, ask what breaks, rebuild. Never more than one thing at a time. Never more than was needed to answer the next question.

The first slice was the store itself — four files, a directory, a lock. The second was the session skill — the check-in/check-out ritual that made the store useful. The third was the routing table — the mechanism for assigning tasks to specific models. The fourth was the plugin wrapper — the thing that made all of it installable.

At each step, the question wasn’t “what should we add?” It was “what does this break, and is that the break we want?”

Concept #2: Build to find the next question, not to build a system. Every slice you ship is an instrument for discovering what you don’t know yet. If you’re not learning something from each slice, the slices are too large. The goal of each build step isn’t to make progress — it’s to surface the next constraint.

The stress test that changed the architecture

After the session skill was working, I asked the model to stress-test it: find the scenarios where the protocol would fail in practice. What came back was a list of six failure modes — none of which had been visible before the thing existed to break.

The most important one was model trust. The protocol assumed that each agent’s “done” mark was reliable — that if an agent marked a task complete, the next agent could trust that and move on. The stress test surfaced the obvious problem: what if the model got it wrong? What if “done” in the store didn’t match “done” in reality?

The fix was small — a ratification step, where a fresh agent proposes rather than asserts when overriding a prior decision — but it changed the protocol’s trust model. Without the stress test, that assumption would have stayed invisible until it broke in production.

Concept #3: Stress-test before shipping, not after something breaks. The best time to find the failure modes is when the system is small enough to change. Ask the model to try to break the thing you just built. Ask it to find the scenarios where the protocol doesn’t hold, the assumptions that might not survive contact with reality, the edge cases you didn’t think of. The answers are free. The consequences of skipping this step are not.

What transferred

The context manager works. But the session produced something more durable than the tool: a repeatable build method that doesn’t require knowing what you’re building before you start.

The method has four moves, in order:

  1. State the hunch, not the plan. Give the model the vague idea, not a spec. The vaguer the input, the more exposed the forks.
  2. Ask to be grilled, not helped. Explicitly request questions, blind spots, conflicts. Don’t let the model rush to build — make it interrogate first.
  3. Build the smallest slice. Not the MVP. The smallest thing that makes the next question answerable.
  4. Ask what breaks. After each slice, run the stress test. Let the answer reshape the next slice.

None of these steps require expertise in what you’re building. They require willingness to not know yet — and a model that’s been told to push back instead of agree.