Summary
Enforce the scientific correctness of ClawBio skills deterministically in CI rather than relying on per-PR human and agent review to catch errors. Across the 2026-06 hackathon rounds, the /pr-audit keyword scanner returned SAFE on PRs that were scientifically or clinically wrong, and every real catch came from deep per-PR domain agents. That works but does not scale. This issue proposes a three-layer gate, built on the patterns ClawBench already uses, that pushes each recurring failure class down to the lowest layer that can catch it deterministically.
Implements DEC-2026-059 (recurring failure classes migrate into ClawBench/CI gates) and extends DEC-2026-061 (validate skills against GIAB/gnomAD/VEP through ClawBench) by adding GWAS Catalog + PubMed as provenance oracles.
Evidence (why this is needed, not speculative)
Three failure classes shipped past the scanner this month and were only caught by manually resolving primary sources:
- Fabricated / mismatched citations: ancestry-risk-profiler (#297), three review rounds. Re-resolving PMIDs against PubMed showed 5 of 6 high-reuse PMIDs are wrong, including the flagship APOL1 citation 20566908, which resolves to a head-and-neck-cancer disability survey, not Genovese et al. One "representative" PMID per super-population was blanket-applied as the source for every variant in that population, regardless of trait or even the population the paper studied.
- Clinical-classification boundary error: cnv-acmg-classifier (#305) hard-coded a "terminal 2A" rule absent from ClinGen/Riggs 2020; the standard's own worked example (2A + inherited 5B = 0.70 VUS) was returned as 1.00 Pathogenic. The contributor's own tests passed because they encoded the wrong rule.
- Low-marker ancestry over-call / inflated risk math: ancestry-risk-profiler, same PR.
These are exactly the classes the North Star ("open trust layer for agentic genomics") cannot afford to merge. A mechanically-clean but scientifically-wrong skill is the worst failure mode for a trust layer.
Design: three layers, built on existing ClawBench patterns
ClawBench already establishes the pattern this should follow: SCHEMAS/acmg_evidence_schema.json (structural contract + x-provenance block), HARNESS/validate_evidence.py (fail-closed cross-field rules with machine-readable error codes), HARNESS/attribution.py (error attribution by layer), and TRUTH/{giab,clinvar,reference} oracles. The new work mirrors these rather than reinventing them.
Layer 1 — Provenance oracle (deterministic, blocking) — BUILD FIRST
The permanent fix for the citation class. Two parts:
1a. Provenance schema — new SCHEMAS/effect_size_provenance_schema.json, modelled on acmg_evidence_schema.json. Any skill that emits effect sizes / ancestry-stratified risk MUST ship association entries in this shape:
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://github.com/ClawBio/ClawBench/SCHEMAS/effect_size_provenance_schema.json",
  "title": "ClawBench effect-size provenance contract",
  "type": "object",
  "required": ["variant", "trait", "ancestry", "effect", "source"],
  "properties": {
    "variant": {
      "type": "object", "required": ["rsid"],
      "properties": {
        "rsid": { "type": "string", "pattern": "^rs[0-9]+$" }
      }
    },
    "trait": {
      "type": "object", "required": ["label", "efo_id"],
      "properties": {
        "label": { "type": "string" },
        "efo_id": { "type": "string", "pattern": "^EFO_[0-9]+$" }
      }
    },
    "ancestry": { "enum": ["AFR", "AMR", "EAS", "EUR", "SAS"] },
    "effect": {
      "type": "object", "required": ["measure", "value"],
      "properties": {
        "measure": { "enum": ["OR", "beta", "HR"] },
        "value": { "type": "number" },
        "ci_low": { "type": "number" },
        "ci_high": { "type": "number" }
      }
    },
    "source": {
      "type": "object", "required": ["pmid"],
      "properties": {
        "pmid": { "type": "string", "pattern": "^[0-9]{6,9}$" },
        "gwas_catalog_accession": { "type": "string", "pattern": "^GCST[0-9]+$" }
      }
    }
  }
}
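To make the contract concrete, here is a minimal stdlib-only sketch of the structural checks the schema implies. The function name `structural_errors` and the `missing:` / `bad:` message strings are hypothetical and not part of ClawBench; a real harness would more likely run a JSON Schema validator against the schema file itself.

```python
import re

# Patterns and enums copied from the schema; everything else is illustrative.
RSID = re.compile(r"^rs[0-9]+$")
EFO = re.compile(r"^EFO_[0-9]+$")
PMID = re.compile(r"^[0-9]{6,9}$")
GCST = re.compile(r"^GCST[0-9]+$")
ANCESTRIES = {"AFR", "AMR", "EAS", "EUR", "SAS"}
MEASURES = {"OR", "beta", "HR"}
REQUIRED = ("variant", "trait", "ancestry", "effect", "source")


def _field(obj, key):
    """Read a key from a nested object, tolerating non-dict values."""
    return obj.get(key) if isinstance(obj, dict) else None


def _matches(pattern, value):
    return isinstance(value, str) and pattern.fullmatch(value) is not None


def structural_errors(entry: dict) -> list[str]:
    """Return structural problems for one association entry (empty = valid)."""
    missing = [f"missing:{key}" for key in REQUIRED if key not in entry]
    if missing:
        return missing  # fail closed: do not inspect a partial entry

    errors = []
    if not _matches(RSID, _field(entry["variant"], "rsid")):
        errors.append("bad:variant.rsid")
    if _field(entry["trait"], "label") is None:
        errors.append("bad:trait.label")
    if not _matches(EFO, _field(entry["trait"], "efo_id")):
        errors.append("bad:trait.efo_id")
    ancestry = entry["ancestry"]
    if not isinstance(ancestry, str) or ancestry not in ANCESTRIES:
        errors.append("bad:ancestry")
    measure = _field(entry["effect"], "measure")
    if not isinstance(measure, str) or measure not in MEASURES:
        errors.append("bad:effect.measure")
    value = _field(entry["effect"], "value")
    # JSON Schema "number" excludes booleans, so exclude bool explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        errors.append("bad:effect.value")
    if not _matches(PMID, _field(entry["source"], "pmid")):
        errors.append("bad:source.pmid")
    accession = _field(entry["source"], "gwas_catalog_accession")
    if accession is not None and not _matches(GCST, accession):
        errors.append("bad:source.gwas_catalog_accession")
    return errors
```

Passing this check says only that an entry is well-formed. Whether the cited PMID actually supports the association is the resolver's job in 1b.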
1b. Resolver harness — new HARNESS/validate_provenance.py + a GWAS Catalog snapshot under TRUTH/gwas_catalog/. For each entry, fail-closed checks with machine-readable error codes:
| Check | Error code | Catches |
| --- | --- | --- |
| PMID resolves at all (PubMed efetch) | pmid_unresolvable | typos, fabricated IDs |
| GWAS Catalog has an association for this rsID + trait (EFO) | assoc_not_found | variant/trait mismatch |
| The supporting study's PMID matches the cited PMID | pmid_study_mismatch | 20566908 APOL1 case |
| The study's ancestry matches the entry's ancestry | ancestry_mismatch | 22158537 EAS-tagged-SAS case |
| Effect size within plausible range of the catalog value | effect_out_of_range | transcription / inversion errors |
| Where GWAS Catalog lacks coverage: PubMed title/abstract topic match | topic_match_low (WARN, human sign-off) | honest gaps, not auto-pass |
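The blocking checks above can be sketched as an ordered, fail-closed pipeline. This is an illustration under stated assumptions, not the proposed HARNESS/validate_provenance.py: the `CatalogAssociation` record, the `resolvable_pmids` set (standing in for a PubMed efetch lookup so the sketch runs offline), and the 25% relative tolerance are all hypothetical. The topic_match_low WARN path is left out because it depends on how coverage gaps are detected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogAssociation:
    """One row of a hypothetical, pre-parsed GWAS Catalog snapshot."""
    rsid: str
    efo_id: str
    pmid: str
    ancestry: str
    effect: float


def validate_entry(entry, snapshot, resolvable_pmids, rel_tol=0.25):
    """Return error codes for one entry; an empty list means it passes.

    Checks run in the table's order and stop at the first failure, because
    each later check only makes sense once the earlier one has held.
    """
    pmid = entry["source"]["pmid"]
    if pmid not in resolvable_pmids:
        return ["pmid_unresolvable"]

    rsid = entry["variant"]["rsid"]
    efo_id = entry["trait"]["efo_id"]
    matches = [a for a in snapshot if a.rsid == rsid and a.efo_id == efo_id]
    if not matches:
        return ["assoc_not_found"]

    matches = [a for a in matches if a.pmid == pmid]
    if not matches:
        return ["pmid_study_mismatch"]

    matches = [a for a in matches if a.ancestry == entry["ancestry"]]
    if not matches:
        return ["ancestry_mismatch"]

    value = entry["effect"]["value"]
    # Assumed plausibility rule: within rel_tol of some matching catalog value.
    if not any(abs(value - a.effect) <= rel_tol * abs(a.effect) for a in matches):
        return ["effect_out_of_range"]
    return []
```

Returning codes rather than raising lets CI report every failing entry in a single run. With a catalog odds ratio of 1.25, for example, an entry citing the inverted value 0.8 falls outside the assumed tolerance and is reported as effect_out_of_range.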
Backfill ancestry-risk-profiler as the first target (worst offender, perfect regression case): the gate must flag 20566908, 22158537, 27005778, 23945395, 17478679.
Layer 2 — Clinical-framework golden fixtures (deterministic, blocking)
Largely exists: SCHEMAS/acmg_evidence_schema.json + HARNESS/acmg_points.py. What's missing is a framework-owned worked-example fixture set wired as CI invariants, so a skill cannot pass by encoding its own wrong rule:
- ClinGen/ACMG CNV (Riggs 2020) worked examples: 2A + 5B inherited = 0.70 VUS, 2A + 5A de novo = 1.45 Pathogenic, and the five-tier boundaries at +/-0.90 / +/-0.99.
- Ancestry: panel size < 30 markers => abstain (Kosoy 2009); no absolute-risk % when ORs are applied on a baseline; from a sub-threshold panel, a soft posterior and never a hard single label.
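A minimal sketch of how the CNV worked examples could be pinned as framework-owned invariants follows. The point values are back-derived from the two worked examples quoted above (2A alone scoring 1.00), the tier cut-offs follow the +/-0.90 / +/-0.99 boundaries, and the names `GOLDEN`, `check_golden` and `terminal_2a_score` are hypothetical. Decimal is used so that totals sitting exactly on a boundary are not disturbed by float rounding.

```python
from decimal import Decimal

# Evidence points implied by the worked examples; a real fixture set would
# presumably take them from the framework's published scoring table.
POINTS = {
    "2A": Decimal("1.00"),
    "5A_de_novo": Decimal("0.45"),
    "5B_inherited": Decimal("-0.30"),
}

# Framework-owned fixtures: (evidence codes, expected total, expected tier).
GOLDEN = [
    (("2A", "5B_inherited"), Decimal("0.70"), "VUS"),
    (("2A", "5A_de_novo"), Decimal("1.45"), "Pathogenic"),
]


def classify(total: Decimal) -> str:
    """Map a point total onto the five tiers."""
    if total >= Decimal("0.99"):
        return "Pathogenic"
    if total >= Decimal("0.90"):
        return "Likely pathogenic"
    if total > Decimal("-0.90"):
        return "VUS"
    if total > Decimal("-0.99"):
        return "Likely benign"
    return "Benign"


def additive_score(codes) -> Decimal:
    """Reference scorer: evidence points simply add up."""
    return sum((POINTS[code] for code in codes), Decimal("0"))


def terminal_2a_score(codes) -> Decimal:
    """The buggy rule from the evidence section: 2A short-circuits to 1.00."""
    return Decimal("1.00") if "2A" in codes else additive_score(codes)


def check_golden(score_fn):
    """Return the fixtures a skill's scorer gets wrong (empty = pass)."""
    failures = []
    for codes, want_total, want_tier in GOLDEN:
        total = score_fn(codes)
        if total != want_total or classify(total) != want_tier:
            failures.append((codes, total, classify(total)))
    return failures
```

Because the fixtures live with the framework rather than the skill, `check_golden(terminal_2a_score)` reports both worked examples as failures even though a contributor's own tests, written against the same wrong rule, would pass.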
Layer 3 — Domain-agent review as an advisory CI check (non-blocking)
Wire the /pr-audit deep domain auditor into a GitHub Action on any skills/* PR, posting findings as a non-blocking check + human gate. Agents are non-deterministic and over-reach (a re-audit once flagged an out-of-diff regression that was not real), so they cover only the residue the deterministic gates cannot: novel clinical framing, LD-correlated risk math, judgement calls. This is the top of the pyramid, not the base.
Policy glue
Skill-contribution template declares a capability class: emits_effect_sizes, clinical_classification, infers_ancestry. The flag routes the PR to the required gates, so a contributor cannot opt out of a gate by omission.
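One way the routing could work is sketched below. The gate names and the rule that a missing declaration routes to every gate are assumptions made for illustration; the issue only fixes the three capability flags.

```python
# Hypothetical mapping from declared capability class to required CI gates.
GATES_BY_CAPABILITY = {
    "emits_effect_sizes": {"provenance_oracle"},
    "clinical_classification": {"clinical_golden_fixtures"},
    "infers_ancestry": {"ancestry_golden_fixtures"},
}
ALL_GATES = frozenset().union(*GATES_BY_CAPABILITY.values())


def required_gates(declared):
    """Return the gates a PR must pass given its declared capability classes.

    Assumption: `declared` is None when the template field was left out, in
    which case the PR is routed to every gate instead of to none.
    """
    if declared is None:
        return set(ALL_GATES)  # fail closed on omission
    unknown = set(declared) - set(GATES_BY_CAPABILITY)
    if unknown:
        raise ValueError(f"unknown capability class: {sorted(unknown)}")
    gates = set()
    for capability in declared:
        gates |= GATES_BY_CAPABILITY[capability]
    return gates
```

Under this sketch an explicit empty list is still possible, but it is then a visible claim in the PR that a reviewer can challenge, not a silent default.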
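Acceptance criteria
- SCHEMAS/effect_size_provenance_schema.json merged
- HARNESS/validate_provenance.py + TRUTH/gwas_catalog/ snapshot; unit-tested against the six PMIDs above (5 fail with the right error codes, 16415884 passes)
- ancestry-risk-profiler backfilled to the schema and passing/failing as expected
- [...] skills/* PRs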
Phasing
- Layer 1 (this sprint) — schema + resolver + GWAS Catalog oracle + ancestry-risk-profiler backfill. Permanently closes the citation class.
- Layer 2 — ACMG/ancestry golden fixtures.
- Layer 3 — agent CI check + template routing.
Strategic note
This is not overhead around the product; for a trust layer it is the product. "Every effect size resolves to a real, ancestry-matched, peer-reviewed source, enforced in CI" is the claim that makes an institution willing to depend on the catalog. The provenance oracle is also reusable as a GenomeQA scoring axis (provenance-grounding), not only a gate.