“Stop fixing the input. We spend a lot of time on the input side — bigger models, memory, better prompts. But the model is not deterministic. Even great prompts produce slop on some runs. What if instead of monitoring that, we have a process where we turn it into a number from 0 to 1?” — Lou
Session context: 2026-06-04_Mastermind — Lou opened the session with a live walkthrough of a quality gate system he’d been building, inspired by an article about fixing AI slop using Hermes. He wanted the same capability without a third-party agent.
Core Idea
The standard fix for AI slop is to improve the prompt: add examples, sharpen the instruction, try a bigger model. But this misses the structural problem — the model is non-deterministic. Even a perfect prompt produces weak outputs on some runs, and every model upgrade reshuffles the behavior you were relying on. Optimizing the input can never fully solve an output problem.
Lou’s reframe: build a trained evaluation agent that sits downstream of your writing skills and scores everything before it publishes. The gate doesn’t fix bad work — it identifies it and sends it back. The difference between this and an inline quality rubric (like Insight - The Quality Gate Pattern — Embed 9-10 Self-Evaluation at Every Pipeline Handoff) is structural: instead of a single rubric baked into one skill, this is a separate ambient folder containing an evaluator that any skill can call, trained on your gold standard.
The mechanics:
- Collect your gold standard. 20–50 pieces of your best published work: the content you’d want all future output to match. It can be your own work or someone else’s that you aspire to match.
- Extract the rubric. A command reads through the gold standard and derives scoring criteria — not just grammar and structure, but substance: perspective, hook quality, specificity, angle. The rubric captures what made these pieces worth publishing.
- Score, don’t rewrite. The gate returns a number (0–1) and itemizes what failed. It never edits the output. The calling skill uses the score to decide whether to rerun or flag for review.
- Get harder to fool over time. The gate accumulates edge cases. Every time it runs and learns from a correction, it sharpens its discrimination. The first run is the least accurate it will ever be.
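A minimal sketch of the "score, don't rewrite" contract, assuming the rubric has already been reduced to weighted pass/fail criteria. The names `Criterion`, `GateResult`, `score_draft`, and the three example checks are hypothetical and not from the session. In a real gate each check would be a model-based judgment against the gold standard, not a string test.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float
    check: Callable[[str], bool]  # stand-in for a model-based judgment


@dataclass(frozen=True)
class GateResult:
    score: float               # weighted share of criteria passed, 0.0 to 1.0
    failures: tuple[str, ...]  # names of the criteria that failed


def _first_line(text: str) -> str:
    lines = text.strip().splitlines()
    return lines[0] if lines else ""


def score_draft(draft: str, rubric: list[Criterion]) -> GateResult:
    """Score a draft against a rubric. The draft itself is never modified."""
    total = sum(c.weight for c in rubric)
    if total <= 0:
        raise ValueError("rubric needs at least one positively weighted criterion")
    passed = 0.0
    failures = []
    for criterion in rubric:
        if criterion.check(draft):
            passed += criterion.weight
        else:
            failures.append(criterion.name)
    return GateResult(score=passed / total, failures=tuple(failures))


# Hypothetical criteria, loosely modeled on hook quality, specificity, perspective.
EXAMPLE_RUBRIC = [
    Criterion("opens_with_hook", 1.0, lambda t: 0 < len(_first_line(t)) <= 80),
    Criterion("is_specific", 2.0, lambda t: any(ch.isdigit() for ch in t)),
    Criterion("has_perspective", 1.0, lambda t: re.search(r"\bI\b", t) is not None),
]
```

Calling `score_draft(text, EXAMPLE_RUBRIC)` returns only the number and the names of the failed criteria. What to do with a low score stays with the calling skill.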
The gate runs across platforms: a LinkedIn post rubric, a newsletter rubric, a thought-leader article rubric. Each content type can have its own 20–50 gold examples and its own derived criteria. When a skill invokes the gate, it passes the content type, and the evaluator applies the right rubric.
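One way to picture the content-type dispatch is a small registry keyed by label. This is a sketch only: the registry `RUBRICS`, the function `select_rubric`, the label spellings, and the criterion names are all assumptions made for illustration.

```python
# Hypothetical registry: content-type label -> names of the criteria in its rubric.
RUBRICS: dict[str, tuple[str, ...]] = {
    "linkedin-post": ("opens_with_hook", "is_specific", "has_perspective"),
    "newsletter": ("clear_angle", "is_specific", "earns_the_click"),
    "thought-leader-article": ("original_argument", "evidence", "has_perspective"),
}


def select_rubric(
    content_type: str,
    rubrics: dict[str, tuple[str, ...]] = RUBRICS,
) -> tuple[str, ...]:
    """Return the rubric for a content type, failing loudly on unknown labels."""
    key = content_type.strip().lower().replace(" ", "-")
    try:
        return rubrics[key]
    except KeyError:
        known = ", ".join(sorted(rubrics))
        raise ValueError(
            f"no rubric for content type {content_type!r}; known types: {known}"
        ) from None
```

Raising on an unknown label, instead of falling back to a default rubric, keeps a newsletter from being silently scored against LinkedIn criteria.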
Why this matters for scaling: The aspiration is to remove yourself from the production loop as much as possible, so you can focus on the conversations that generate ideas while automation handles extraction, writing, and publishing. But automation needs a quality floor. Without a gate, anything that looks plausible ships. With a gate, the floor rises with each use, because every correction the gate learns from makes it harder to pass.
Practical Application
Build the eval loop as an ambient folder in your project:
- Create a gold-standard/ subfolder and populate it with 20–50 pieces of your best work by content type.
- Run a command (or prompt): “Read this collection and generate a scoring rubric that distinguishes pieces worth publishing from work that isn’t.”
- Create an evaluation agent that accepts any piece of content + a content-type label, runs it against the matching rubric, and returns a score + failure reasons.
- After each use, instruct the agent: “Learn from the corrections I made to the output and update the rubric.”
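The calling side of the steps above might look like the following sketch. Everything named here (`Verdict`, `Outcome`, `run_with_gate`, the 0.8 threshold, the three-attempt limit) is an assumption for illustration. `generate` stands in for the writing skill and `evaluate` for the evaluation agent, both passed in as plain callables.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Verdict:
    score: float               # 0.0 to 1.0, as returned by the gate
    failures: tuple[str, ...]  # reasons the gate itemized


@dataclass(frozen=True)
class Outcome:
    draft: str
    verdict: Verdict
    attempts: int
    needs_review: bool  # True when no attempt reached the threshold


def run_with_gate(
    generate: Callable[[tuple[str, ...]], str],
    evaluate: Callable[[str], Verdict],
    threshold: float = 0.8,   # assumed pass mark
    max_attempts: int = 3,    # assumed retry budget
) -> Outcome:
    """Regenerate until the gate passes a draft, else flag the best one for review."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    feedback: tuple[str, ...] = ()
    best = None
    for attempt in range(1, max_attempts + 1):
        draft = generate(feedback)  # the writing skill reruns; the gate never edits
        verdict = evaluate(draft)
        if best is None or verdict.score > best[1].score:
            best = (draft, verdict)
        if verdict.score >= threshold:
            return Outcome(draft, verdict, attempt, needs_review=False)
        feedback = verdict.failures  # send the failure reasons back to the writer
    best_draft, best_verdict = best
    return Outcome(best_draft, best_verdict, max_attempts, needs_review=True)
```

Drafts that end up with `needs_review=True` are the ones you correct by hand, and those corrections are the natural input for the rubric update in the last step.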
For coaching clients: frame this as “training your AI on what good means to you.” The rubric isn’t a generic best-practice checklist — it’s your taste, encoded. That’s what makes it durable as models change.
Related Insights
- Insight - The Quality Gate Pattern — Embed 9-10 Self-Evaluation at Every Pipeline Handoff — inline quality evaluation within a skill; this is the external, cross-skill equivalent trained on actual gold-standard examples
- Insight - The Self-Improving Skill Loop — Have the Skill Learn From Every Use — both are compounding loops; this one runs on the evaluator, not the writing skill
- Insight - Skills Encode Judgment Into Persistent, Composable Intelligence — the rubric is your judgment encoded; the eval loop is how it accretes over time
- Insight - Authentic AI Voice Is Built on Lived Experience, Not Style Prompts — style prompts drift; gold-standard training captures the thing prompts can only describe
- Insight - Process Architecture Transmits Judgment More Reliably Than Individual Prompts — the gate is a process layer, not a prompt layer; that’s what makes it stable across model changes
- Insight - Ambient Intelligence — Build a Skill in Every Folder to Make Your Entire Knowledge Base Alive — the eval loop is an ambient folder with its own intelligence that any other skill can inherit
Evolution Across Sessions
Builds on Insight - The Quality Gate Pattern — Embed 9-10 Self-Evaluation at Every Pipeline Handoff (2026-04-09), which established the inline self-evaluation pattern. New development: this session externalizes the gate into a separate ambient agent trained on gold-standard examples — a cross-skill quality layer that any skill can call, distinct from per-skill rubrics. Also connects to Insight - The Self-Improving Skill Loop — Have the Skill Learn From Every Use (2026-05-28): both involve loops that compound with each use, but this one operates on the evaluator rather than the writing skill itself.