PRD — Output Quality Gate
§1 Problem Statement
What fails without this: AI writing skills produce inconsistent output quality — even well-crafted prompts generate slop on some runs, and model updates reshuffle behavior without warning. Users currently review every output manually (expensive) or publish without review (risky). No structural quality floor exists that can run automatically and improve over time.
Transcript evidence:
“Stop fixing the input. The model is not deterministic. Even great prompts produce slop on some runs, which means even if everything is working the way we expected, it’s still going to produce some slop. But as the models change, then we increase the probability of something changing in the way that it performs. And so, the reframe was, well, what about if instead of monitoring that, we have a process where we turn it into a number from 0 to 1?” — Lou
§2 Trigger Surface
Should fire on (include indirect cases):
- “Quality gate my content before it publishes”
- “Score this output against my gold standard”
- “Run the eval loop on this”
- “I want to add a quality check to my writing skill”
- “My AI content isn’t consistent — some runs are great, some are slop”
- “Check this against my rubric”
Should NOT fire on:
- “Improve this content” (that’s a writing/editing skill, not an evaluation skill — the gate scores, it never rewrites)
- “Review this for factual accuracy” (the gate evaluates voice/substance quality, not factual correctness)
- “Rate this content 1–10” (casual quality check, not the trained rubric system)
§3 User Journey (Happy Path)
- User runs a writing skill and it produces content (e.g., a LinkedIn post)
- The writing skill hands the content to the quality gate with a content-type label (“linkedin-post”)
- Gate loads the matching gold-standard rubric for that content type
- Gate evaluates the content against the rubric and returns a score (0–1) plus itemized failure reasons
- If score ≥ threshold: passes through
- If score < threshold: gate returns failure report to the writing skill, writing skill reruns
- After N reruns (configurable, default 3), gate passes the best-scoring version with a flag: “below threshold after 3 attempts — review recommended”
- Gate appends to a learning log: score, content type, what failed, and which version passed
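The journey above can be sketched as a single loop. This is a minimal illustration under stated assumptions, not the skill's implementation: `run_gate`, `GateResult`, and the `generate` / `score` callables are hypothetical names, with `generate` standing in for the writing skill and `score` for the rubric-score inference call. The sketch also assumes "N reruns" means N total attempts, which the journey does not state explicitly.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class GateResult:
    content: str
    score: float
    passed: bool
    failures: List[dict] = field(default_factory=list)
    flag: Optional[str] = None


def run_gate(
    generate: Callable[[List[dict]], str],            # hypothetical: the writing skill
    score: Callable[[str], Tuple[float, List[dict]]],  # hypothetical: rubric-score call
    threshold: float = 0.7,
    max_attempts: int = 3,  # assumption: counts total attempts, not extra reruns
) -> GateResult:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    best: Optional[GateResult] = None
    feedback: List[dict] = []
    for _ in range(max_attempts):
        content = generate(feedback)           # the caller rewrites, never the gate
        value, failures = score(content)
        result = GateResult(content, value, value >= threshold, failures)
        if result.passed:
            return result
        if best is None or result.score > best.score:
            best = result                      # remember the best-scoring version
        feedback = failures                    # failure report goes back to the caller
    best.flag = f"below threshold after {max_attempts} attempts"
    return best
```

The gate only scores and routes here; all rewriting happens inside `generate`, which matches the rule that callers own the rerun logic.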
§4 Step Classification
| Step | Type | Justification |
|---|---|---|
| Load rubric by content-type | code | Simple file lookup by label |
| Score content against rubric | inference | Rubric application requires judgment about whether specific criteria are met — output space is unbounded per criterion |
| Itemize failure reasons | inference | Natural language explanation of which criteria failed and why |
| Compare score to threshold | code | Numerical comparison |
| Write to learning log | code | Structured append operation |
| Pass/fail routing back to caller | code | Conditional on score |
Rule: Every “inference” classification requires a written justification. If you cannot state why code cannot handle a step, reclassify it as code.
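To illustrate why the code-classified steps need no inference, the sketch below implements threshold comparison, routing, and the learning-log append in plain Python. The function names, the JSON Lines log format, and the log field names are assumptions made for illustration; the PRD only specifies what the log should capture.

```python
import json
from datetime import datetime, timezone


def route(score: float, threshold: float = 0.7) -> str:
    """Numerical comparison plus conditional routing."""
    return "pass" if score >= threshold else "rerun"


def append_log(path, content_type, score, failures, attempt):
    """Append one structured entry per scoring run (assumed JSON Lines format)."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "content_type": content_type,
        "score": score,
        "failed_criteria": [f["criterion"] for f in failures],
        "attempt": attempt,  # which version this score belongs to
    }
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry
```

An append-only, one-entry-per-line file keeps the log readable for the later manual rubric review.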
§5 Inference Call Contracts
| Call | Input schema | Output schema | Why not code |
|---|---|---|---|
| rubric-score | {content: string, rubric: markdown, content-type: string} | {score: float 0-1, passed: bool, failures: [{criterion: string, reason: string}]} | Rubric criteria involve judgment about substance, perspective, and audience fit — cannot be reduced to regex or keyword matching |
| failure-explain | {criteria_failures: list} | {explanation: string, revision_hints: string} | Natural language explanation of why content failed requires generation |
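Because the rubric-score output comes from an inference call, it is worth checking its shape in code before acting on it. The following is one possible validator for the output schema in the table above; `validate_rubric_score` is a hypothetical helper, not part of any existing library.

```python
def validate_rubric_score(payload):
    """Return a list of contract violations; an empty list means the payload is valid."""
    if not isinstance(payload, dict):
        return ["payload is not an object"]
    errors = []
    score = payload.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        errors.append("score must be a number")
    elif not 0.0 <= score <= 1.0:
        errors.append("score must be between 0 and 1")
    if not isinstance(payload.get("passed"), bool):
        errors.append("passed must be a boolean")
    failures = payload.get("failures")
    if not isinstance(failures, list):
        errors.append("failures must be a list")
    else:
        for i, item in enumerate(failures):
            if not isinstance(item, dict):
                errors.append(f"failures[{i}] must be an object")
                continue
            for key in ("criterion", "reason"):
                if not isinstance(item.get(key), str):
                    errors.append(f"failures[{i}].{key} must be a string")
    return errors
```

A malformed response can then be treated as a failed call and retried, rather than being routed as if it were a real score.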
§6 References Needed
Always in body: content-type-to-rubric file path mapping; scoring threshold (configurable, default 0.7); max retry count (configurable, default 3).
Conditional: gold-standard/[content-type].md — loaded per invocation based on content-type label passed by caller. Load only when a scoring call fires.
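A minimal sketch of how these references might be held in configuration, assuming the `gold-standard/[content-type].md` layout described above. `GateConfig`, `rubric_path`, and `load_rubric` are hypothetical names, and the registered label set is only an example. Unknown labels raise an error instead of falling back to another rubric, and the file is read only when `load_rubric` is called.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class GateConfig:
    threshold: float = 0.7           # default from this PRD
    max_retries: int = 3             # default from this PRD
    rubric_dir: str = "gold-standard"
    # Agreed content-type labels; this example set is an assumption
    content_types: frozenset = frozenset({"linkedin-post"})


def rubric_path(config: GateConfig, content_type: str) -> Path:
    """Resolve a label to its rubric file, refusing labels that are not registered."""
    if content_type not in config.content_types:
        raise KeyError(f"unknown content type: {content_type!r}")
    return Path(config.rubric_dir) / f"{content_type}.md"


def load_rubric(config: GateConfig, content_type: str) -> str:
    """Read the rubric from disk; call this only when a scoring call fires."""
    return rubric_path(config, content_type).read_text(encoding="utf-8")
```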
§7 Known Gotchas
- Gate scores substance (hook quality, perspective, specificity, angle) — not just grammar/structure. Don’t conflate this with a style checker or grammar checker.
- The gate never rewrites. It scores and blocks. Callers must handle the rerun logic.
- “Getting harder to fool” requires a periodic update step: after a certain number of runs, review the learning log and update the rubric. This is not automatic in the first version; it requires a manual update pass that applies the discipline from “Insight - The Self-Improving Skill Loop — Have the Skill Learn From Every Use” to the rubric.
- Multiple rubrics per content type are fine (e.g., “linkedin-post-technical” vs. “linkedin-post-insight”). Content-type labels should be agreed conventions, not free-form.
- Lou noted that Opus hit usage limits when run during the demo, so this skill can be token-intensive if the gold-standard rubric is long. Consider summarizing the rubric to its key criteria (<2K tokens) rather than loading the full examples.
§8 Eval Cases
Trigger Evals
| User input | Expected | Rationale |
|---|---|---|
| “Score this LinkedIn post against my rubric” | fire | Clear invocation of scoring function |
| “Improve this blog post” | no-fire | Writing/editing task, not evaluation |
| “Check this newsletter before I send it” | fire | “Check before” implies quality gate |
| “Rate this from 1 to 10” | no-fire | Casual quality check, not the trained rubric system |
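The trigger table can be kept as data and replayed against whatever decides when the skill fires. In this sketch `should_fire` is a hypothetical callable standing in for that routing decision (in practice a model call); only the four cases come from the table above.

```python
TRIGGER_CASES = [
    ("Score this LinkedIn post against my rubric", True),
    ("Improve this blog post", False),
    ("Check this newsletter before I send it", True),
    ("Rate this from 1 to 10", False),
]


def run_trigger_evals(should_fire, cases=TRIGGER_CASES):
    """Return every case where the routing decision differs from the expected one."""
    mismatches = []
    for text, expected in cases:
        actual = bool(should_fire(text))
        if actual != expected:
            mismatches.append({"input": text, "expected": expected, "actual": actual})
    return mismatches
```

An empty result means all trigger evals pass; any entries show which inputs fired or stayed silent unexpectedly.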
Output Evals
| Scenario | Input | Expected output shape | Pass criterion |
|---|---|---|---|
| Happy path | LinkedIn post scoring 0.82 | {score: 0.82, passed: true, failures: []} | Score ≥ threshold, passes through |
| Failure case | LinkedIn post scoring 0.58 | {score: 0.58, passed: false, failures: [{criterion: “hook”, reason: “…”}, …]} | Score < threshold, specific failures itemized |
| Max retries | Post still below threshold after 3 runs | Best-scoring version with “below threshold after 3 attempts” flag | Surfaces for human review rather than looping infinitely |
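The pass criteria for the first two scenarios reduce to two consistency rules that can be checked in code. `check_output` below is a hypothetical helper and assumes the default threshold of 0.7. It only checks that the fields agree with one another, not whether the score itself is well judged.

```python
def check_output(result, threshold=0.7):
    """Return a list of pass-criterion violations for one gate result."""
    problems = []
    # The passed flag must agree with the score and the threshold
    if result["passed"] != (result["score"] >= threshold):
        problems.append("passed flag disagrees with score and threshold")
    # A failing result must say what failed
    if not result["passed"] and not result["failures"]:
        problems.append("failing result must itemize at least one failure")
    return problems
```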
§9 Composition
Assumes loaded: Writing skills that produce content; gold-standard rubric files per content type.
Potential conflicts: The inline self-evaluation pattern from “Insight - The Quality Gate Pattern — Embed 9-10 Self-Evaluation at Every Pipeline Handoff” and this external gate serve different functions. Ensure writing skills don’t double-evaluate (inline self-eval plus external gate) in a way that creates redundant loops.
Routing position: Downstream of all writing skills; upstream of any publishing or distribution skill.
§10 Success Criteria
- Below-threshold content does not pass through to publishing
- Score and itemized failure reasons are returned in under 30 s for typical content
- Rubric correctly loads per content-type label — never applies wrong rubric
- Learning log grows with each run and is readable for future rubric refinement
§11 Out of Scope
- The gate does NOT rewrite content — callers handle revision logic
- The gate does NOT evaluate factual accuracy or citation quality
- The gate does NOT manage the gold-standard collection — that’s user-owned input
- The gate does NOT automatically update its own rubric — that’s a manual refinement step in v1
Source
- 2026-06-04_Mastermind (Lou — AIA quality gate walkthrough, 30+ minute demo with live scoring)