Provenance + clinical-correctness CI gates: make scientific safety deterministic, not per-PR #3

@manuelcorpas

Summary

Make scientific correctness of ClawBio skills enforced deterministically in CI, not caught per-PR by human + agent review. Across the 2026-06 hackathon rounds the /pr-audit keyword scanner returned SAFE on PRs that were scientifically or clinically wrong, and every real catch came from deep per-PR domain agents. That works but does not scale. This issue proposes a three-layer gate, built on the patterns ClawBench already uses, that pushes each recurring failure class down to the lowest layer that can catch it deterministically.

Implements DEC-2026-059 (recurring failure classes migrate into ClawBench/CI gates) and extends DEC-2026-061 (validate skills against GIAB/gnomAD/VEP through ClawBench) by adding GWAS Catalog + PubMed as provenance oracles.

Evidence (why this is needed, not speculative)

Three failure classes shipped past the scanner this month and were only caught by manually resolving primary sources:

  • Fabricated / mismatched citations: ancestry-risk-profiler (#297), three review rounds. Re-resolving PMIDs against PubMed showed 5 of 6 high-reuse PMIDs are wrong, including the flagship APOL1 citation 20566908, which resolves to a head-and-neck-cancer disability survey, not Genovese et al. One "representative" PMID per super-population was blanket-applied as the source for every variant in that population, regardless of trait or even the population the paper studied.
  • Clinical-classification boundary error: cnv-acmg-classifier (#305) hard-coded a "terminal 2A" rule absent from ClinGen/Riggs 2020; the standard's own worked example (2A + inherited 5B = 0.70 VUS) was returned as 1.00 Pathogenic. The contributor's own tests passed because they encoded the wrong rule.
  • Low-marker ancestry over-call / inflated risk math: ancestry-risk-profiler, same PR.

These are exactly the classes the North Star ("open trust layer for agentic genomics") cannot afford to merge. A mechanically-clean but scientifically-wrong skill is the worst failure mode for a trust layer.

Design: three layers, built on existing ClawBench patterns

ClawBench already establishes the pattern this should follow: SCHEMAS/acmg_evidence_schema.json (structural contract + x-provenance block), HARNESS/validate_evidence.py (fail-closed cross-field rules with machine-readable error codes), HARNESS/attribution.py (error attribution by layer), and TRUTH/{giab,clinvar,reference} oracles. The new work mirrors these; it does not reinvent them.

Layer 1 — Provenance oracle (deterministic, blocking) — BUILD FIRST

The permanent fix for the citation class. Two parts:

1a. Provenance schema — new SCHEMAS/effect_size_provenance_schema.json, modelled on acmg_evidence_schema.json. Any skill that emits effect sizes / ancestry-stratified risk MUST ship association entries in this shape:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://github.com/ClawBio/ClawBench/SCHEMAS/effect_size_provenance_schema.json",
  "title": "ClawBench effect-size provenance contract",
  "type": "object",
  "required": ["variant", "trait", "ancestry", "effect", "source"],
  "properties": {
    "variant":  { "type": "object", "required": ["rsid"],
                  "properties": { "rsid": { "type": "string", "pattern": "^rs[0-9]+$" } } },
    "trait":    { "type": "object", "required": ["label", "efo_id"],
                  "properties": { "efo_id": { "type": "string", "pattern": "^EFO_[0-9]+$" } } },
    "ancestry": { "enum": ["AFR", "AMR", "EAS", "EUR", "SAS"] },
    "effect":   { "type": "object", "required": ["measure", "value"],
                  "properties": { "measure": { "enum": ["OR", "beta", "HR"] },
                                  "value": { "type": "number" },
                                  "ci_low": { "type": "number" }, "ci_high": { "type": "number" } } },
    "source":   { "type": "object", "required": ["pmid"],
                  "properties": { "pmid": { "type": "string", "pattern": "^[0-9]{6,9}$" },
                                  "gwas_catalog_accession": { "type": "string", "pattern": "^GCST[0-9]+$" } } }
  }
}
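
As a rough illustration of what the contract accepts and rejects, here is a minimal standard-library sketch that mirrors the required fields, enums and patterns above. It is not the proposed harness: a real check would more likely load the schema file with a JSON Schema validator. The function name `check_shape` and every value in `EXAMPLE` are hypothetical placeholders, not real identifiers or data.

```python
import re

# Patterns and enums copied from the schema above.
RSID = re.compile(r"^rs[0-9]+$")
EFO = re.compile(r"^EFO_[0-9]+$")
PMID = re.compile(r"^[0-9]{6,9}$")
GCST = re.compile(r"^GCST[0-9]+$")
ANCESTRIES = ("AFR", "AMR", "EAS", "EUR", "SAS")
MEASURES = ("OR", "beta", "HR")
REQUIRED = ("variant", "trait", "ancestry", "effect", "source")


def _is_number(value):
    # JSON numbers only; Python's bool is an int subclass, so exclude it.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _matches(value, pattern):
    return isinstance(value, str) and pattern.fullmatch(value) is not None


def check_shape(entry):
    """Return a list of problems; an empty list means the entry fits the contract."""
    if not isinstance(entry, dict):
        return ["entry must be an object"]
    missing = [f"missing required field: {k}" for k in REQUIRED if k not in entry]
    if missing:
        return missing  # fail closed: do not inspect a partial entry
    for name in ("variant", "trait", "effect", "source"):
        if not isinstance(entry[name], dict):
            return [f"{name} must be an object"]

    problems = []
    if not _matches(entry["variant"].get("rsid"), RSID):
        problems.append("variant.rsid must match " + RSID.pattern)
    if "label" not in entry["trait"]:
        problems.append("trait.label is required")
    if not _matches(entry["trait"].get("efo_id"), EFO):
        problems.append("trait.efo_id must match " + EFO.pattern)
    if entry["ancestry"] not in ANCESTRIES:
        problems.append("ancestry must be one of " + ", ".join(ANCESTRIES))
    effect = entry["effect"]
    if effect.get("measure") not in MEASURES:
        problems.append("effect.measure must be one of " + ", ".join(MEASURES))
    if not _is_number(effect.get("value")):
        problems.append("effect.value must be a number")
    for bound in ("ci_low", "ci_high"):
        if bound in effect and not _is_number(effect[bound]):
            problems.append(f"effect.{bound} must be a number")
    source = entry["source"]
    if not _matches(source.get("pmid"), PMID):
        problems.append("source.pmid must match " + PMID.pattern)
    accession = source.get("gwas_catalog_accession")
    if "gwas_catalog_accession" in source and not _matches(accession, GCST):
        problems.append("source.gwas_catalog_accession must match " + GCST.pattern)
    return problems


# Placeholder entry: none of these identifiers refer to a real variant, trait or paper.
EXAMPLE = {
    "variant": {"rsid": "rs0000000"},
    "trait": {"label": "placeholder trait", "efo_id": "EFO_0000000"},
    "ancestry": "AFR",
    "effect": {"measure": "OR", "value": 1.0},
    "source": {"pmid": "00000000"},
}
```

Note that a shape check like this only proves an entry is well-formed; a fabricated PMID with the right number of digits passes it, which is why the resolver in 1b is the blocking part.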

1b. Resolver harness — new HARNESS/validate_provenance.py + a GWAS Catalog snapshot under TRUTH/gwas_catalog/. For each entry, fail-closed checks with machine-readable error codes:

| Check | Error code | Catches |
| --- | --- | --- |
| PMID resolves at all (PubMed efetch) | pmid_unresolvable | typos, fabricated IDs |
| GWAS Catalog has an association for this rsID + trait (EFO) | assoc_not_found | variant/trait mismatch |
| The supporting study's PMID matches the cited PMID | pmid_study_mismatch | 20566908 APOL1 case |
| The study's ancestry matches the entry's ancestry | ancestry_mismatch | 22158537 EAS-tagged-SAS case |
| Effect size within plausible range of the catalog value | effect_out_of_range | transcription / inversion errors |
| Where GWAS Catalog lacks coverage: PubMed title/abstract topic match | topic_match_low (WARN, human sign-off) | honest gaps, not auto-pass |
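
The five blocking checks above could be ordered roughly as in the sketch below. Everything here is an assumption made for illustration: the function name `validate_provenance`, the in-memory `PUBMED_IDS` and `CATALOG` stand-ins for PubMed efetch and the TRUTH/gwas_catalog/ snapshot, the 20% relative tolerance used as the "plausible range", and all identifiers in the sample data, which are placeholders rather than real records. The `topic_match_low` fallback for catalog gaps is left out.

```python
def validate_provenance(entry, pubmed_ids, catalog, tolerance=0.2):
    """Return error codes for one entry; an empty list means every blocking check passed.

    pubmed_ids: set of PMIDs that resolve (stand-in for a PubMed efetch lookup).
    catalog: dict mapping (rsid, efo_id) to a list of association dicts with keys
             "pmid", "ancestry" and "value" (stand-in for the catalog snapshot).
    """
    pmid = entry["source"]["pmid"]
    if pmid not in pubmed_ids:
        return ["pmid_unresolvable"]  # later checks are meaningless without a real paper

    key = (entry["variant"]["rsid"], entry["trait"]["efo_id"])
    associations = catalog.get(key, [])
    if not associations:
        return ["assoc_not_found"]

    cited = [a for a in associations if a["pmid"] == pmid]
    if not cited:
        return ["pmid_study_mismatch"]  # paper exists but does not support this association

    errors = []
    same_ancestry = [a for a in cited if a["ancestry"] == entry["ancestry"]]
    if not same_ancestry:
        errors.append("ancestry_mismatch")

    # Compare against the ancestry-matched rows when there are any.
    candidates = same_ancestry or cited
    value = entry["effect"]["value"]
    in_range = any(
        abs(value - a["value"]) <= tolerance * abs(a["value"]) for a in candidates
    )
    if not in_range:
        errors.append("effect_out_of_range")
    return errors


# Placeholder oracles and entry; none of these identifiers are real records.
PUBMED_IDS = {"10000001", "10000002"}
CATALOG = {
    ("rs0000001", "EFO_0000001"): [
        {"pmid": "10000001", "ancestry": "AFR", "value": 2.0},
    ],
}
ENTRY = {
    "variant": {"rsid": "rs0000001"},
    "trait": {"label": "placeholder trait", "efo_id": "EFO_0000001"},
    "ancestry": "AFR",
    "effect": {"measure": "OR", "value": 2.1},
    "source": {"pmid": "10000001"},
}

if __name__ == "__main__":
    print(validate_provenance(ENTRY, PUBMED_IDS, CATALOG))
    inverted = {**ENTRY, "effect": {"measure": "OR", "value": 0.5}}
    print(validate_provenance(inverted, PUBMED_IDS, CATALOG))
```

The early returns are deliberate: once a prerequisite fails, the sketch reports that single code instead of guessing at downstream checks. An inverted odds ratio (0.5 cited against a catalog value of 2.0) falls outside the tolerance and is reported as `effect_out_of_range`.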

Backfill ancestry-risk-profiler as the first target (worst offender, perfect regression case): the gate must flag 20566908, 22158537, 27005778, 23945395, 17478679.

Layer 2 — Clinical-framework golden fixtures (deterministic, blocking)

Largely exists: SCHEMAS/acmg_evidence_schema.json + HARNESS/acmg_points.py. What's missing is a framework-owned worked-example fixture set wired as CI invariants, so a skill cannot pass by encoding its own wrong rule:

  • ClinGen/ACMG CNV (Riggs 2020) worked examples: 2A + 5B inherited = 0.70 VUS, 2A + 5A de novo = 1.45 Pathogenic, the five-tier boundaries at +/-0.90 / +/-0.99.
  • Ancestry: panel size < 30 markers => abstain (Kosoy 2009); no absolute-risk % computed by applying ORs to a baseline; soft posterior, never a hard single label from a sub-threshold panel.
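
A minimal sketch of how framework-owned CNV fixtures could catch a skill that encodes its own rule. The per-criterion points in `POINTS` are inferred from the two totals quoted above (0.70 and 1.45) and should be checked against Riggs 2020 before use; the function names and the buggy `terminal_2a_classifier` are hypothetical. Scores are held as integer hundredths so the tier boundaries compare exactly.

```python
def cnv_tier(score_hundredths):
    """Map a total score (in hundredths of a point) to the five-tier classification."""
    if score_hundredths >= 99:
        return "Pathogenic"
    if score_hundredths >= 90:
        return "Likely pathogenic"
    if score_hundredths > -90:
        return "VUS"
    if score_hundredths > -99:
        return "Likely benign"
    return "Benign"


# Assumed points per evidence code, in hundredths, inferred from the quoted totals.
POINTS = {"2A": 100, "5B_inherited": -30, "5A_de_novo": 45}

# Framework-owned worked examples: (evidence codes, expected total, expected tier).
GOLDEN = [
    (("2A", "5B_inherited"), 70, "VUS"),
    (("2A", "5A_de_novo"), 145, "Pathogenic"),
]


def reference_classifier(codes):
    """Sum every evidence code, then map the total to a tier."""
    total = sum(POINTS[code] for code in codes)
    return total, cnv_tier(total)


def terminal_2a_classifier(codes):
    """Hypothetical buggy skill: treats 2A as terminal and ignores other evidence."""
    if "2A" in codes:
        return 100, "Pathogenic"
    return reference_classifier(codes)


def failed_fixtures(classifier):
    """Return the evidence-code tuples for which the classifier disagrees with GOLDEN."""
    return [
        codes for codes, total, tier in GOLDEN
        if classifier(codes) != (total, tier)
    ]


if __name__ == "__main__":
    print(failed_fixtures(reference_classifier))
    print(failed_fixtures(terminal_2a_classifier))
```

Because the expected values live in the framework and not in the skill's own test suite, the "terminal 2A" shortcut fails the fixtures even though the contributor's tests pass.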

Layer 3 — Domain-agent review as an advisory CI check (non-blocking)

Wire the /pr-audit deep domain auditor into a GitHub Action on any skills/* PR, posting findings as a non-blocking check + human gate. Agents are non-deterministic and over-reach (a re-audit once flagged an out-of-diff regression that was not real), so they cover only the residue the deterministic gates cannot: novel clinical framing, LD-correlated risk math, judgement calls. This is the top of the pyramid, not the base.

Policy glue

Skill-contribution template declares a capability class: emits_effect_sizes, clinical_classification, infers_ancestry. The flag routes the PR to the required gates, so a contributor cannot opt out of a gate by omission.
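
One way the routing could work, sketched under stated assumptions: the three capability-class names come from the paragraph above, but the gate names, the `required_gates` function and the choice to run every gate when the declaration is missing or empty are all hypothetical. A real template might instead offer an explicit "none of the above" class that a reviewer signs off.

```python
# Hypothetical gate names; only the capability-class keys come from the proposal.
GATES_BY_CAPABILITY = {
    "emits_effect_sizes": {"provenance_oracle"},
    "clinical_classification": {"clinical_golden_fixtures"},
    "infers_ancestry": {"ancestry_golden_fixtures"},
}
ALL_GATES = set().union(*GATES_BY_CAPABILITY.values())


def required_gates(declared):
    """Return the sorted gate names a PR must pass, given its declared capability classes.

    declared: list of capability-class names from the skill template, or None
    when the template field is absent.
    """
    if not declared:
        # Assumed fail-closed default: omitting the declaration selects every gate.
        return sorted(ALL_GATES)
    unknown = set(declared) - set(GATES_BY_CAPABILITY)
    if unknown:
        raise ValueError("unknown capability class: " + ", ".join(sorted(unknown)))
    gates = set()
    for capability in declared:
        gates |= GATES_BY_CAPABILITY[capability]
    return sorted(gates)


if __name__ == "__main__":
    print(required_gates(["emits_effect_sizes", "infers_ancestry"]))
    print(required_gates(None))
```

Rejecting unknown class names, instead of silently ignoring them, keeps a typo in the template from dropping a gate.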

Acceptance criteria

  • SCHEMAS/effect_size_provenance_schema.json merged
  • HARNESS/validate_provenance.py + TRUTH/gwas_catalog/ snapshot; unit-tested against the six high-reuse PMIDs (the five listed above fail with the right error codes; 16415884 passes)
  • ancestry-risk-profiler backfilled to the schema and passing/failing as expected
  • ACMG worked-example fixtures wired as blocking invariants
  • Domain-agent advisory check live on skills/* PRs
  • Skill template declares capability class; CI routes accordingly

Phasing

  1. Layer 1 (this sprint) — schema + resolver + GWAS Catalog oracle + ancestry-risk-profiler backfill. Permanently closes the citation class.
  2. Layer 2 — ACMG/ancestry golden fixtures.
  3. Layer 3 — agent CI check + template routing.

Strategic note

This is not overhead around the product; for a trust layer it is the product. "Every effect size resolves to a real, ancestry-matched, peer-reviewed source, enforced in CI" is the claim that makes an institution willing to depend on the catalog. The provenance oracle is also reusable as a GenomeQA scoring axis (provenance-grounding), not only a gate.
