PRD — Output Quality Gate

§1 Problem Statement

What fails without this: AI writing skills produce inconsistent output quality — even well-crafted prompts generate slop on some runs, and model updates reshuffle behavior without warning. Users currently review every output manually (expensive) or publish without review (risky). No structural quality floor exists that can run automatically and improve over time.

Transcript evidence:

“Stop fixing the input. The model is not deterministic. Even great prompts produce slop on some runs, which means even if everything is working the way we expected, it’s still going to produce some slop. But as the models change, then we increase the probability of something changing in the way that it performs. And so, the reframe was, well, what about if instead of monitoring that, we have a process where we turn it into a number from 0 to 1?” — Lou

§2 Trigger Surface

Should fire on (include indirect cases):

“Quality gate my content before it publishes”
“Score this output against my gold standard”
“Run the eval loop on this”
“I want to add a quality check to my writing skill”
“My AI content isn’t consistent — some runs are great, some are slop”
“Check this against my rubric”

Should NOT fire on:

“Improve this content” (that’s a writing/editing skill, not an evaluation skill — the gate scores, it never rewrites)
“Review this for factual accuracy” (the gate evaluates voice/substance quality, not factual correctness)
“Rate this content 1–10” (casual quality check, not the trained rubric system)

§3 User Journey (Happy Path)

User runs a writing skill and it produces content (e.g., a LinkedIn post)
The writing skill hands the content to the quality gate with a content-type label (“linkedin-post”)
Gate loads the matching gold-standard rubric for that content type
Gate evaluates the content against the rubric and returns a score (0–1) plus itemized failure reasons
If score ≥ threshold: passes through
If score < threshold: gate returns failure report to the writing skill, writing skill reruns
After N attempts (configurable, default 3) without reaching the threshold, gate passes the best-scoring version with a flag: “below threshold after 3 attempts — review recommended”
Gate appends to a learning log: score, content type, what failed, which version passed
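
The journey above can be sketched as a small retry loop. This is a minimal illustration, not the shipped implementation: `run_with_gate`, `GateResult`, `score_fn`, and `rewrite_fn` are hypothetical names, `score_fn` stands in for the rubric-score inference call, and `rewrite_fn` stands in for the writing skill's rerun (the gate itself never rewrites). The learning-log append is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class GateResult:
    content: str
    score: float
    passed: bool
    failures: list = field(default_factory=list)
    flag: Optional[str] = None  # set only when no attempt reached the threshold


def run_with_gate(
    content: str,
    score_fn: Callable[[str], tuple],        # hypothetical: returns (score, failures)
    rewrite_fn: Callable[[str, list], str],  # hypothetical: the writing skill's rerun
    threshold: float = 0.7,
    max_attempts: int = 3,
) -> GateResult:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    best: Optional[GateResult] = None
    for attempt in range(1, max_attempts + 1):
        score, failures = score_fn(content)
        result = GateResult(content, score, score >= threshold, failures)
        if result.passed:
            return result  # at or above threshold: passes through
        if best is None or result.score > best.score:
            best = result  # remember the best-scoring failed version
        if attempt < max_attempts:
            content = rewrite_fn(content, failures)  # caller-owned revision
    best.flag = f"below threshold after {max_attempts} attempts"
    return best
```

Returning the best-scoring version rather than the last one keeps a weaker final rerun from replacing a stronger earlier draft.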

§4 Step Classification

Step	Type	Justification
Load rubric by content-type	code	Simple file lookup by label
Score content against rubric	inference	Rubric application requires judgment about whether specific criteria are met — output space is unbounded per criterion
Itemize failure reasons	inference	Natural language explanation of which criteria failed and why
Compare score to threshold	code	Numerical comparison
Write to learning log	code	Structured append operation
Pass/fail routing back to caller	code	Conditional on score

Rule: Every “inference” classification requires a written justification. If you cannot state why code cannot handle a step, reclassify it as code.
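
To illustrate why the "code" rows need no inference, here is a rough sketch of the threshold comparison and the learning-log append. The function names, the JSON Lines log format, and the field names are assumptions made for this example; the PRD only specifies what the log should record.

```python
import json
from datetime import datetime, timezone


def route(score, threshold=0.7):
    """Pass/fail routing: a plain numerical comparison."""
    return "pass" if score >= threshold else "fail"


def format_log_entry(content_type, score, failures, passed_version, timestamp=None):
    """Build one learning-log line (JSON Lines is an assumed format)."""
    entry = {
        "timestamp": timestamp or datetime.now(timezone.utc).isoformat(),
        "content_type": content_type,
        "score": score,
        "failures": failures,              # what failed
        "passed_version": passed_version,  # which version passed, if any
    }
    return json.dumps(entry, ensure_ascii=False)


def append_to_log(log_path, line):
    """Structured append: one entry per line, never rewriting earlier entries."""
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(line + "\n")
```

An append-only log with one JSON object per line stays readable for the later manual rubric review.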

§5 Inference Call Contracts

Call	Input schema	Output schema	Why not code
rubric-score	{content: string, rubric: markdown, content-type: string}	{score: float 0-1, passed: bool, failures: [{criterion: string, reason: string}]}	Rubric criteria involve judgment about substance, perspective, and audience fit — cannot be reduced to regex or keyword matching
failure-explain	{criteria_failures: list}	{explanation: string, revision_hints: string}	Natural language explanation of why content failed requires generation
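
Because the rubric-score output comes from an inference call, the caller may want to validate its shape before routing on it. The sketch below checks the output schema from the table above; `validate_rubric_score` is a hypothetical helper, and returning a list of error strings is one possible design, not something the PRD prescribes.

```python
def validate_rubric_score(payload):
    """Return a list of schema problems; an empty list means the payload is usable."""
    errors = []

    score = payload.get("score")
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(score, bool) or not isinstance(score, (int, float)) or not 0.0 <= score <= 1.0:
        errors.append("score must be a number between 0 and 1")

    if not isinstance(payload.get("passed"), bool):
        errors.append("passed must be a bool")

    failures = payload.get("failures")
    if not isinstance(failures, list):
        errors.append("failures must be a list")
    else:
        for i, item in enumerate(failures):
            ok = (
                isinstance(item, dict)
                and isinstance(item.get("criterion"), str)
                and isinstance(item.get("reason"), str)
            )
            if not ok:
                errors.append(f"failures[{i}] needs string 'criterion' and 'reason'")
    return errors
```

A malformed response can then be treated as a gate error rather than silently counted as a pass or a fail.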

§6 References Needed

Always in body: Content-type to rubric file path mapping; scoring threshold (configurable, default 0.7); max retry count.

Conditional: gold-standard/[content-type].md — loaded per invocation based on content-type label passed by caller. Load only when a scoring call fires.
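
One way to hold the always-in-body values is a small config dictionary plus a strict lookup. This is a sketch under assumptions: `GATE_CONFIG`, its key names, and `rubric_path` are illustrative, and the default of 3 retries is taken from the user journey. The three labels are the ones this PRD mentions.

```python
from pathlib import PurePosixPath

# Illustrative config; key names are assumptions, defaults come from this PRD.
GATE_CONFIG = {
    "threshold": 0.7,
    "max_retries": 3,
    "rubrics": {
        "linkedin-post": "gold-standard/linkedin-post.md",
        "linkedin-post-technical": "gold-standard/linkedin-post-technical.md",
        "linkedin-post-insight": "gold-standard/linkedin-post-insight.md",
    },
}


def rubric_path(content_type, config=GATE_CONFIG):
    """Resolve a content-type label to its rubric file, refusing unknown labels."""
    try:
        return PurePosixPath(config["rubrics"][content_type])
    except KeyError:
        raise ValueError(f"no rubric registered for content type {content_type!r}") from None
```

Failing loudly on an unregistered label, instead of falling back to a default rubric, is what keeps the gate from scoring content against the wrong rubric.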

§7 Known Gotchas

Gate scores substance (hook quality, perspective, specificity, angle) — not just grammar/structure. Don’t conflate this with a style checker or grammar checker.
The gate never rewrites. It scores and blocks. Callers must handle the rerun logic.
“Getting harder to fool” requires a periodic update step: after a certain number of runs, review the learning log and update the rubric. In the first version this is a manual pass, not an automatic one; it applies the discipline from “Insight - The Self-Improving Skill Loop — Have the Skill Learn From Every Use” to the rubric.
Multiple rubrics per content type are fine (e.g., “linkedin-post-technical” vs. “linkedin-post-insight”). Content-type labels should be agreed conventions, not free-form.
Lou noted that Opus hit usage limits during the demo: this skill can be token-intensive if the gold-standard rubric is long. Consider condensing the rubric to its key criteria (<2K tokens) rather than loading the full examples.
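
A quick pre-flight check can warn when a rubric is likely to exceed the suggested 2K-token budget. The estimate below uses a rough rule of thumb of about four characters per token for English text; that ratio is an assumption, not a real tokenizer, and actual counts vary by model. `estimate_tokens` and `within_budget` are hypothetical helpers.

```python
import math

CHARS_PER_TOKEN = 4  # rough heuristic, NOT a real tokenizer; varies by model


def estimate_tokens(text):
    """Approximate the token count from character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def within_budget(rubric_text, budget=2000):
    """True if the rubric is estimated to stay under the token budget."""
    return estimate_tokens(rubric_text) < budget
```

For a firm limit, swap the heuristic for the tokenizer of whichever model runs the scoring call.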

§8 Eval Cases

Trigger Evals

User input	Expected	Rationale
“Score this LinkedIn post against my rubric”	fire	Clear invocation of scoring function
“Improve this blog post”	no-fire	Writing/editing task, not evaluation
“Check this newsletter before I send it”	fire	“Check before” implies quality gate
“Rate this from 1 to 10”	no-fire	Casual quality check, not the trained rubric system

Output Evals

Scenario	Input	Expected output shape	Pass criterion
Happy path	LinkedIn post scoring 0.82	{score: 0.82, passed: true, failures: []}	Score ≥ threshold, passes through
Failure case	LinkedIn post scoring 0.58	{score: 0.58, passed: false, failures: [{criterion: “hook”, reason: “…”}, …]}	Score < threshold, specific failures itemized
Max retries	Post still below threshold after 3 runs	Best-scoring version with “below threshold after 3 attempts” flag	Surfaces for human review rather than looping infinitely
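
The first two output evals can be expressed as a table-driven check. This is a sketch only: `meets_pass_criterion` and `EVAL_OUTPUTS` are hypothetical names, and the failure reason is a placeholder because the PRD elides it.

```python
THRESHOLD = 0.7  # default threshold from this PRD


def meets_pass_criterion(output, threshold=THRESHOLD):
    """Check that `passed` agrees with the score and that failures are itemized."""
    expected_pass = output["score"] >= threshold
    if output["passed"] != expected_pass:
        return False
    if not expected_pass and not output["failures"]:
        return False  # a failing score must come with itemized reasons
    return True


EVAL_OUTPUTS = {
    "happy path": {"score": 0.82, "passed": True, "failures": []},
    "failure case": {
        "score": 0.58,
        "passed": False,
        # placeholder reason: the source leaves this value elided
        "failures": [{"criterion": "hook", "reason": "placeholder reason"}],
    },
}

results = {name: meets_pass_criterion(out) for name, out in EVAL_OUTPUTS.items()}
```

The max-retries scenario needs the rerun loop and is better covered by an integration test around the caller.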

§9 Composition

Assumes loaded: Writing skills that produce content; gold-standard rubric files per content type.

Potential conflicts: “Insight - The Quality Gate Pattern — Embed 9-10 Self-Evaluation at Every Pipeline Handoff” describes an inline self-evaluation pattern; it and this external gate serve different functions. Ensure writing skills don’t run both (inline self-eval plus external gate) in a way that creates redundant loops.

Routing position: Downstream of all writing skills; upstream of any publishing or distribution skill.

§10 Success Criteria

Below-threshold content does not pass through to publishing
Score + itemized failure reasons returned in under 30 seconds for typical content
Rubric correctly loads per content-type label — never applies wrong rubric
Learning log grows with each run and is readable for future rubric refinement

§11 Out of Scope

The gate does NOT rewrite content — callers handle revision logic
The gate does NOT evaluate factual accuracy or citation quality
The gate does NOT manage the gold-standard collection — that’s user-owned input
The gate does NOT automatically update its own rubric — that’s a manual refinement step in v1

Source

2026-06-04_Mastermind (Lou — AIA quality gate walkthrough, 30+ minute demo with live scoring)

PowerUp Coaching — Living Knowledge Base
