PRD — Output Quality Gate

§1 Problem Statement

What fails without this: AI writing skills produce inconsistent output quality — even well-crafted prompts generate slop on some runs, and model updates reshuffle behavior without warning. Users currently review every output manually (expensive) or publish without review (risky). No structural quality floor exists that can run automatically and improve over time.

Transcript evidence:

“Stop fixing the input. The model is not deterministic. Even great prompts produce slop on some runs, which means even if everything is working the way we expected, it’s still going to produce some slop. But as the models change, then we increase the probability of something changing in the way that it performs. And so, the reframe was, well, what about if instead of monitoring that, we have a process where we turn it into a number from 0 to 1?” — Lou

§2 Trigger Surface

Should fire on (include indirect cases):

  • “Quality gate my content before it publishes”
  • “Score this output against my gold standard”
  • “Run the eval loop on this”
  • “I want to add a quality check to my writing skill”
  • “My AI content isn’t consistent — some runs are great, some are slop”
  • “Check this against my rubric”

Should NOT fire on:

  • “Improve this content” (that’s a writing/editing skill, not an evaluation skill — the gate scores, it never rewrites)
  • “Review this for factual accuracy” (the gate evaluates voice/substance quality, not factual correctness)
  • “Rate this content 1–10” (casual quality check, not the trained rubric system)

§3 User Journey (Happy Path)

  1. User runs a writing skill and it produces content (e.g., a LinkedIn post)
  2. The writing skill hands the content to the quality gate with a content-type label (“linkedin-post”)
  3. Gate loads the matching gold-standard rubric for that content type
  4. Gate evaluates the content against the rubric and returns a score (0–1) plus itemized failure reasons
  5. If score ≥ threshold: passes through
  6. If score < threshold: gate returns a failure report to the writing skill, and the writing skill reruns
  7. After N reruns (configurable, default 3), gate passes the best-scoring version with a flag: “below threshold after 3 attempts — review recommended”
  8. Gate appends to a learning log: score, content type, what failed, what version passed

§4 Step Classification

Step | Type | Justification
Load rubric by content-type | code | Simple file lookup by label
Score content against rubric | inference | Rubric application requires judgment about whether specific criteria are met; output space is unbounded per criterion
Itemize failure reasons | inference | Natural language explanation of which criteria failed and why
Compare score to threshold | code | Numerical comparison
Write to learning log | code | Structured append operation
Pass/fail routing back to caller | code | Conditional on score

Rule: Every “inference” classification requires a written justification. If you cannot state why code cannot handle a step, reclassify it as code.
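
Two of the code-classified steps, the threshold comparison and the learning-log append, might look like the sketch below. The function names, the JSON Lines file format, and the record fields are assumptions made for illustration; the source only says the log records score, content type, what failed, and what version passed.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def route(score: float, threshold: float = 0.7) -> str:
    """Numerical comparison: pass at or above the threshold, otherwise rerun."""
    return "pass" if score >= threshold else "rerun"


def append_log(log_path: Path, content_type: str, score: float,
               failures: list, attempt: int, passed: bool) -> dict:
    """Append one structured record per scoring run (assumed format: JSON Lines)."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "content_type": content_type,
        "score": score,
        "failures": failures,          # itemized criteria that failed
        "attempt": attempt,            # which version this record describes
        "passed": passed,
    }
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return entry
```

Neither function needs a model call, which is the test the rule above applies to every step.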

§5 Inference Call Contracts

Call | Input schema | Output schema | Why not code
rubric-score | {content: string, rubric: markdown, content-type: string} | {score: float 0-1, passed: bool, failures: [{criterion: string, reason: string}]} | Rubric criteria involve judgment about substance, perspective, and audience fit; cannot be reduced to regex or keyword matching
failure-explain | {criteria_failures: list} | {explanation: string, revision_hints: string} | Natural language explanation of why content failed requires generation
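
Because the rubric-score output comes from an inference call, a caller may want to validate it before routing on it. The sketch below assumes the model returns the output schema as a JSON string; `parse_rubric_score`, `RubricScore`, and `Failure` are hypothetical names, not part of the source.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Failure:
    criterion: str
    reason: str


@dataclass(frozen=True)
class RubricScore:
    score: float
    passed: bool
    failures: List[Failure]


def parse_rubric_score(raw: str) -> RubricScore:
    """Validate a rubric-score response; raise ValueError on any contract breach."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")

    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be a number")
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")

    passed = data.get("passed")
    if not isinstance(passed, bool):
        raise ValueError("passed must be a boolean")

    failures = data.get("failures")
    if not isinstance(failures, list):
        raise ValueError("failures must be a list")
    parsed = []
    for item in failures:
        if (not isinstance(item, dict)
                or not isinstance(item.get("criterion"), str)
                or not isinstance(item.get("reason"), str)):
            raise ValueError("each failure needs string 'criterion' and 'reason'")
        parsed.append(Failure(item["criterion"], item["reason"]))
    return RubricScore(float(score), passed, parsed)
```

Rejecting a malformed response outright keeps a bad model reply from being mistaken for a pass.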

§6 References Needed

Always in body: Content-type to rubric file path mapping; scoring threshold (configurable, default 0.7); max retry count (configurable, default 3).

Conditional: gold-standard/[content-type].md — loaded per invocation based on content-type label passed by caller. Load only when a scoring call fires.
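
One way to hold the always-in-body settings and defer the rubric read until a scoring call fires is sketched below. `GateConfig` and `load_rubric` are hypothetical names, and the example path in the usage note is illustrative only.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Dict


@dataclass
class GateConfig:
    rubric_paths: Dict[str, Path]      # content-type label -> rubric file
    threshold: float = 0.7             # default from this section
    max_attempts: int = 3              # default from the user journey


def load_rubric(config: GateConfig, content_type: str) -> str:
    """Read the rubric for one content type, only when it is needed.

    Unknown labels raise instead of falling back, so the gate cannot
    silently apply the wrong rubric.
    """
    try:
        path = config.rubric_paths[content_type]
    except KeyError:
        raise KeyError(
            f"no rubric registered for content type {content_type!r}"
        ) from None
    return path.read_text(encoding="utf-8")
```

A caller would build the mapping once, for example `GateConfig({"linkedin-post": Path("gold-standard/linkedin-post.md")})`, and call `load_rubric` per invocation.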

§7 Known Gotchas

  • Gate scores substance (hook quality, perspective, specificity, angle) — not just grammar/structure. Don’t conflate this with a style checker or grammar checker.
  • The gate never rewrites. It scores and blocks. Callers must handle the rerun logic.
  • “Getting harder to fool” requires a periodic update step: after a set number of runs, review the learning log and update the rubric. This is not automatic in the first version; it requires a manual update pass, applying the discipline from the insight note “The Self-Improving Skill Loop — Have the Skill Learn From Every Use” to the rubric.
  • Multiple rubrics per content type are fine (e.g., “linkedin-post-technical” vs. “linkedin-post-insight”). Content-type labels should be agreed conventions, not free-form.
  • Lou noted that Opus hit usage limits during the demo, so this skill can be token-intensive if the gold-standard rubric is long. Consider summarizing the rubric to its key criteria (<2K tokens) rather than loading the full examples.
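
The manual rubric-update pass could start from a tally of which criteria fail most often. The sketch below assumes the learning log is stored as JSON Lines with `content_type` and `failures` fields; that record shape and the name `failure_counts` are assumptions, since the source does not define the log format.

```python
import json
from collections import Counter
from typing import Iterable


def failure_counts(log_lines: Iterable[str], content_type: str) -> Counter:
    """Count failed criteria for one content type across logged runs."""
    counts: Counter = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue                   # tolerate blank lines in the log
        entry = json.loads(line)
        if entry.get("content_type") != content_type:
            continue
        for failure in entry.get("failures", []):
            counts[failure["criterion"]] += 1
    return counts
```

Calling `failure_counts(lines, "linkedin-post").most_common(5)` gives a reviewer a short list of criteria to tighten or reword first.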

§8 Eval Cases

Trigger Evals

User input | Expected | Rationale
“Score this LinkedIn post against my rubric” | fire | Clear invocation of scoring function
“Improve this blog post” | no-fire | Writing/editing task, not evaluation
“Check this newsletter before I send it” | fire | “Check before” implies quality gate
“Rate this from 1 to 10” | no-fire | Casual quality check, not the trained rubric system
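
The trigger cases above can be kept as data and replayed against whatever decides when the skill fires. In this sketch `should_fire` is a hypothetical callable standing in for that router; the harness only reports mismatches and makes no claim about how routing is implemented.

```python
from typing import Callable, List, Tuple

# (user input, expected to fire?) taken from the trigger eval cases.
TRIGGER_CASES: List[Tuple[str, bool]] = [
    ("Score this LinkedIn post against my rubric", True),
    ("Improve this blog post", False),
    ("Check this newsletter before I send it", True),
    ("Rate this from 1 to 10", False),
]


def run_trigger_evals(
    should_fire: Callable[[str], bool],
    cases: List[Tuple[str, bool]] = TRIGGER_CASES,
) -> List[Tuple[str, bool, bool]]:
    """Return (input, expected, actual) for every case the router gets wrong."""
    mismatches = []
    for text, expected in cases:
        actual = should_fire(text)
        if actual != expected:
            mismatches.append((text, expected, actual))
    return mismatches
```

An empty return value means every case matched; rerunning the harness after a model update is one way to notice reshuffled trigger behavior.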

Output Evals

Scenario | Input | Expected output shape | Pass criterion
Happy path | LinkedIn post scoring 0.82 | {score: 0.82, passed: true, failures: []} | Score ≥ threshold, passes through
Failure case | LinkedIn post scoring 0.58 | {score: 0.58, passed: false, failures: [{criterion: “hook”, reason: “…”}, …]} | Score < threshold, specific failures itemized
Max retries | Post still below threshold after 3 runs | Best-scoring version with “below threshold after 3 attempts” flag | Surfaces for human review rather than looping infinitely

§9 Composition

Assumes loaded: Writing skills that produce content; gold-standard rubric files per content type.

Potential conflicts: The inline self-evaluation pattern described in “Insight - The Quality Gate Pattern — Embed 9-10 Self-Evaluation at Every Pipeline Handoff” and this external gate serve different functions. Ensure writing skills don’t double-evaluate (inline self-eval + external gate) in a way that creates redundant loops.

Routing position: Downstream of all writing skills; upstream of any publishing or distribution skill.

§10 Success Criteria

  • Below-threshold content does not pass through to publishing
  • Score + itemized failure reasons returned within reasonable latency (< 30s for typical content)
  • Rubric correctly loads per content-type label — never applies wrong rubric
  • Learning log grows with each run and is readable for future rubric refinement

§11 Out of Scope

  • The gate does NOT rewrite content — callers handle revision logic
  • The gate does NOT evaluate factual accuracy or citation quality
  • The gate does NOT manage the gold-standard collection — that’s user-owned input
  • The gate does NOT automatically update its own rubric — that’s a manual refinement step in v1
