Original Insight

“I told it, there are no sacred cows, and make sure everything you say works, it’s easy to implement, it’s modular and scalable. And you will see input from other AIs. You can comment on their input as well. Remember, the goal here is not to be right, but to have the best solution possible. And I did find a lot of times when Codex, or even Gemini, specifically, would come back and say, yeah, you know, Claude’s a little bit full of it right now, because what it’s saying is actually not correct. And then Claude said, oh yeah, Codex is right, it’s not correct. So I find that for really important things where you have to validate and have a little research done, this is a very useful kind of thing.” — Lou

Expanded Synthesis

The February 26th session introduced one of the most operationally practical workflows to emerge from the mastermind series: a multi-model AI debate in which Claude, Gemini, and Codex each contributed to a shared markdown file, with each model able to read and critique the others' contributions. The process culminated in a synthesis that explicitly acknowledged which ideas came from which model and why the best ideas won.

Lou used a combination of Ghostty (a terminal emulator that supports multiple side-by-side panes), Claude Code, Gemini CLI, and the Marked markdown viewer, which automatically refreshes when the underlying file changes, to create a live visual environment where he could watch the shared document evolve in real time as each model contributed.

The setup is worth describing in detail because the principle it embodies is more fundamental than the specific tools. Each model was given the same shared file and the same instruction: read what’s here, understand what’s being built, and add your constructive contribution. Critically, each model was also told it could comment on what the other models had said. The success criterion was explicit: not to be right, but to produce the best possible result.
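
For illustration, here is a minimal sketch of how the shared file might be seeded before the models are pointed at it. The file path, section headings, and exact wording of the instruction are assumptions for the example rather than Lou's actual setup; each model would then be launched in its own terminal pane and directed at the same file, with Marked watching that file for changes.

```python
from pathlib import Path

# Hypothetical file name; any shared path that every CLI pane can see will do.
SHARED_FILE = Path("debate/spec-debate.md")

# The instruction each model receives, paraphrasing the brief described above.
SHARED_INSTRUCTION = (
    "Read everything already in this file and understand what is being built. "
    "Add your constructive contribution under your own heading. "
    "You may comment on what the other models have written. "
    "There are no sacred cows: the goal is not to be right, "
    "but to produce the best possible result."
)

# Illustrative section headings so each model appends to a predictable place.
MODELS = ["Claude", "Gemini", "Codex"]

def seed_debate_file() -> None:
    """Create the shared markdown skeleton the models will take turns editing."""
    sections = "\n\n".join(f"## {m}'s Contribution\n\n_(pending)_" for m in MODELS)
    SHARED_FILE.parent.mkdir(parents=True, exist_ok=True)
    SHARED_FILE.write_text(
        "# Multi-Model Debate\n\n"
        f"> Instruction to every participant: {SHARED_INSTRUCTION}\n\n"
        f"{sections}\n\n"
        "## Synthesis\n\n_(pending)_\n"
    )

if __name__ == "__main__":
    seed_debate_file()
```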

What emerged was striking. Gemini identified the “human DNA problem” — that the initial spec would produce well-structured but generic content without a mechanism for injecting the author’s specific voice and experience. Codex identified the “operating modes” problem — that the spec needed explicit quality rubrics rather than vague craft guidelines. Claude, when confronted with Codex’s critique, agreed it had been wrong and incorporated the correction. The final spec was better than any single model would have produced, and the synthesis was transparent about attribution.

This structure addresses one of the most important failure modes in AI-assisted work: single-model overconfidence. When you work with one AI in one session, that model has no incentive to identify the limits of its own answers. It will produce fluent, confident output whether its reasoning is sound or not. The model is not being dishonest — it is doing what it was designed to do: produce the most probable next token. Probability is not epistemology.

Introducing a second model as a critic breaks this. When Claude knows that Codex is going to read its output and has been told to identify flaws, the effective quality bar changes. More importantly for the user, it creates explicit visibility into where models disagree — and that disagreement is itself diagnostic. When Gemini and Claude reach the same conclusion, that’s a reasonable confidence signal. When they diverge, that divergence marks a place that deserves human investigation.

For coaches and high-performers who are building consequential things — new frameworks, complex client strategies, technical systems, IP that needs to hold up over time — this is not just a workflow optimization. It is a quality control system. And it maps directly to principles that good coaches already know: the value of peer consultation, the blind spot that emerges from any single perspective, the productive friction of genuine disagreement.

Kasimir’s follow-up question in the session raised an important architectural point: can skills be chained? Lou confirmed that yes, in Claude Code, a skill can call other skills as long as they’re all present in the .claude folder. The implication is that the multi-model debate structure can eventually be automated — a chained skill that runs one model’s output through another’s critique and then synthesizes, without manual orchestration. The current manual version is superior for high-stakes, genuinely uncertain work because it allows for human steering at each handoff. The automated version will be superior for routine work where speed matters more than depth.
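
As a rough sketch only (the exact folder and file conventions should be checked against the current Claude Code documentation, and the skill names here are hypothetical), a chained version might sit on disk something like this:

```
.claude/
  skills/
    multi-model-debate/   # orchestrator: seeds the shared file and runs the rounds
      SKILL.md
    critique-round/       # called by the orchestrator for each model's critique pass
      SKILL.md
    synthesize-debate/    # called last: merges contributions and records attribution
      SKILL.md
```

The orchestrator skill calls the critique and synthesis skills in sequence, which is what would make the hands-off version possible while the manual version stays available for high-stakes work.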

Lou also introduced the Marked markdown viewer as a low-friction tool that makes the multi-model workflow practical: because it auto-refreshes on file change, you can watch the shared document evolve in real time as models work on it, without switching applications. This is a small but meaningful friction reduction — in sustained creative and analytical work, the quality of the environment directly affects the quality of the thinking.

The broader principle connecting to PowerUp Coaching is this: sustainable high performance is never single-threaded. The highest-performing people and systems build in redundancy, critique, and correction as structural features, not as occasional events. The multi-model debate is AI’s version of a principle Lou’s clients already need to build into their own work: don’t trust your own thinking without exposing it to genuine challenge.

Practical Application for PowerUp Clients

The Two-Model Critique Protocol (Accessible Version)

You don’t need a multi-pane terminal setup to apply this principle. The simplest version works in any AI environment.

  1. Generate the first-pass output in Model A. Use Claude, ChatGPT, or whichever model you normally use. Save the output to a file or copy it to a document.

  2. Ask Model B to critique it. Open a second conversation in a different model (Gemini, Perplexity, Copilot). Paste the output from Model A with this framing: “This was produced by another AI working on [specific problem]. Your job is to identify what’s wrong, what’s missing, what’s overconfident, and what could be substantially improved. There are no sacred cows. The goal is the best possible outcome, not defending this version.”

  3. Bring the critique back to Model A. Paste the critique from Model B into Model A and ask it to respond: “Here is a critique of your earlier output. Where is the critique correct? Where is it wrong? Produce a revised version that incorporates the valid criticisms.”

  4. Synthesize and document. Ask either model to write a synthesis that explicitly notes which ideas were kept, which were cut, and from which model the strongest contributions came. This attribution creates accountability and supports your own learning about which models are strongest for which task types.
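
The same four steps can also be scripted once you have programmatic access to two models. The sketch below is illustrative only: ask_model_a and ask_model_b are placeholder functions for whichever two models you actually use, and the prompt wording simply mirrors the framings in the steps above.

```python
# A minimal sketch of the two-model critique loop. ask_model_a / ask_model_b are
# placeholders: wire them to whichever two models you actually use (web UI copy-paste,
# an SDK call, or a CLI); nothing here depends on a specific vendor API.

def ask_model_a(prompt: str) -> str:
    raise NotImplementedError("Connect this to Model A (your usual model).")

def ask_model_b(prompt: str) -> str:
    raise NotImplementedError("Connect this to a different model for the critique pass.")

def two_model_critique(task: str) -> str:
    # Step 1: first-pass output from Model A.
    draft = ask_model_a(f"Work on the following problem and produce your best output:\n{task}")

    # Step 2: Model B critiques, using the framing from the protocol above.
    critique = ask_model_b(
        "This was produced by another AI working on the problem below. Your job is to "
        "identify what's wrong, what's missing, what's overconfident, and what could be "
        "substantially improved. There are no sacred cows. The goal is the best possible "
        f"outcome, not defending this version.\n\nProblem: {task}\n\nOutput to critique:\n{draft}"
    )

    # Step 3: Model A responds to the critique and revises.
    revised = ask_model_a(
        "Here is a critique of your earlier output. Where is the critique correct? Where is "
        "it wrong? Produce a revised version that incorporates the valid criticisms.\n\n"
        f"Earlier output:\n{draft}\n\nCritique:\n{critique}"
    )

    # Step 4: either model writes a synthesis with explicit attribution.
    return ask_model_a(
        "Write a synthesis that explicitly notes which ideas were kept, which were cut, and "
        "which model contributed the strongest points.\n\n"
        f"Original:\n{draft}\n\nCritique:\n{critique}\n\nRevision:\n{revised}"
    )
```

Even run manually, keeping these four prompts verbatim in a note captures most of the value; the script only removes the copy-paste.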

When to Use This:

  • Any high-stakes content (a framework, a methodology document, a key client deliverable)
  • Technical decisions where you’re not sure which approach is right
  • Strategic questions where you want to stress-test a conclusion before committing
  • Any situation where you’re aware you might be in an echo chamber

Coaching Application: The multi-perspective principle applies directly in coaching: before presenting a framework or interpretation to a client, ask yourself, “What would a skeptical peer say about this? What’s the weakest part of my analysis?” Running your coaching hypotheses through a second perspective — even a mental simulation of one — is the professional version of this quality control system.

Questions for Reflection:

  • “What would someone who disagrees with this conclude from the same evidence?”
  • “Where in my current thinking am I most likely to be wrong?”
  • “What would I need to see to change my mind about this?”

Additional Resources

  • The Intelligence Trap by David Robson — why smart people are more susceptible to certain kinds of overconfidence, and how to build in correction mechanisms
  • Superforecasting by Philip Tetlock — the research foundation for using structured disagreement and calibration in high-stakes prediction
  • Insight - Build Tiny Tools That Remove Real Friction — the companion implementation: building the workflow environment that makes multi-model work frictionless
  • Ghostty (ghostty.org) — the terminal emulator Lou used for the multi-pane setup
  • Marked (marked2.app) — the auto-refreshing markdown viewer that completes the workflow

Evolution Across Sessions

This workflow directly extends the eigenthinking framework from February 19th. Eigenthinking extracts the unique axes of your cognition; the multi-model debate stress-tests the frameworks that emerge from those axes. Together they form a complete IP development cycle: extract your cognitive fingerprint → build frameworks from your natural axes → stress-test those frameworks through multi-perspective critique → codify the tested result as a skill.

Next Actions

  • For Lou: The multi-model debate workflow is worth codifying as a skill for the mastermind group — a simple script that sets up a shared file and prompts each model with the right instruction template. This would make the workflow accessible to members who aren’t comfortable working in a terminal.
  • For clients: Assign a two-model critique exercise on the client’s most important current strategic assumption. Have them bring the result to the next coaching session and discuss what they learned about the assumption and about which critiques landed.

Derived Artifacts