Provenance + clinical-correctness CI gates: make scientific safety deterministic, not per-PR

## Summary

Make scientific correctness of ClawBio skills **enforced deterministically in CI**, not caught per-PR by human + agent review. Across the 2026-06 hackathon rounds the `/pr-audit` keyword scanner returned **SAFE** on PRs that were scientifically or clinically **wrong**, and every real catch came from deep per-PR domain agents. That works but does not scale. This issue proposes a three-layer gate, built on the patterns ClawBench already uses, that pushes each recurring failure class down to the lowest layer that can catch it deterministically.

Implements DEC-2026-059 (recurring failure classes migrate into ClawBench/CI gates) and extends DEC-2026-061 (validate skills against GIAB/gnomAD/VEP through ClawBench) by adding **GWAS Catalog + PubMed** as provenance oracles.

## Evidence (why this is needed, not speculative)

Three failure classes shipped past the scanner this month and were only caught by manually resolving primary sources:

- **Fabricated / mismatched citations** — `ancestry-risk-profiler` (#297), three review rounds. Re-resolving PMIDs against PubMed showed 5 of 6 high-reuse PMIDs are wrong, including the flagship APOL1 citation `20566908`, which resolves to a head-and-neck-cancer disability survey, not Genovese et al. One "representative" PMID per super-population was blanket-applied as the source for every variant in that population, regardless of trait or even the population the paper studied.
- **Clinical-classification boundary error** — `cnv-acmg-classifier` (#305) hard-coded a "terminal 2A" rule absent from ClinGen/Riggs 2020; the standard's own worked example (2A + inherited 5B = 0.70 VUS) was returned as 1.00 Pathogenic. The contributor's own tests passed because they encoded the wrong rule.
- **Low-marker ancestry over-call / inflated risk math** — `ancestry-risk-profiler`, same PR.

These are exactly the classes that a project whose North Star is an "open trust layer for agentic genomics" cannot afford to merge. A mechanically clean but scientifically wrong skill is the worst failure mode for a trust layer.

## Design: three layers, built on existing ClawBench patterns

ClawBench already establishes the pattern this should follow: `SCHEMAS/acmg_evidence_schema.json` (structural contract + `x-provenance` block), `HARNESS/validate_evidence.py` (fail-closed cross-field rules with machine-readable error codes), `HARNESS/attribution.py` (error attribution by layer), and `TRUTH/{giab,clinvar,reference}` oracles. The new work mirrors these patterns rather than reinventing them.

### Layer 1 — Provenance oracle (deterministic, blocking) — BUILD FIRST

The permanent fix for the citation class. Two parts:

**1a. Provenance schema** — new `SCHEMAS/effect_size_provenance_schema.json`, modelled on `acmg_evidence_schema.json`. Any skill that emits effect sizes / ancestry-stratified risk MUST ship association entries in this shape:

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://github.com/ClawBio/ClawBench/SCHEMAS/effect_size_provenance_schema.json",
  "title": "ClawBench effect-size provenance contract",
  "type": "object",
  "required": ["variant", "trait", "ancestry", "effect", "source"],
  "properties": {
    "variant":  { "type": "object", "required": ["rsid"],
                  "properties": { "rsid": { "type": "string", "pattern": "^rs[0-9]+$" } } },
    "trait":    { "type": "object", "required": ["label", "efo_id"],
                  "properties": { "efo_id": { "type": "string", "pattern": "^EFO_[0-9]+$" } } },
    "ancestry": { "enum": ["AFR", "AMR", "EAS", "EUR", "SAS"] },
    "effect":   { "type": "object", "required": ["measure", "value"],
                  "properties": { "measure": { "enum": ["OR", "beta", "HR"] },
                                  "value": { "type": "number" },
                                  "ci_low": { "type": "number" }, "ci_high": { "type": "number" } } },
    "source":   { "type": "object", "required": ["pmid"],
                  "properties": { "pmid": { "type": "string", "pattern": "^[0-9]{6,9}$" },
                                  "gwas_catalog_accession": { "type": "string", "pattern": "^GCST[0-9]+$" } } }
  }
}
```
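
To make the contract concrete, here is a minimal stdlib-only sketch of the structural checks the schema implies. It is not the proposed harness: the function name `structural_errors`, the `missing:`/`bad:` message format, and the placeholder entry are all assumptions for illustration, and a real implementation would more likely hand the schema to a JSON Schema validator.

```python
import re

# Patterns and enums copied from the schema above; everything else is illustrative.
RSID = re.compile(r"^rs[0-9]+$")
EFO = re.compile(r"^EFO_[0-9]+$")
PMID = re.compile(r"^[0-9]{6,9}$")
GCST = re.compile(r"^GCST[0-9]+$")
ANCESTRIES = {"AFR", "AMR", "EAS", "EUR", "SAS"}
MEASURES = {"OR", "beta", "HR"}


def _is_number(x):
    # JSON Schema "number" excludes booleans, which Python treats as ints.
    return isinstance(x, (int, float)) and not isinstance(x, bool)


def _matches(obj, key, pattern):
    value = obj.get(key) if isinstance(obj, dict) else None
    return isinstance(value, str) and pattern.fullmatch(value) is not None


def structural_errors(entry):
    """Return a list of problems; an empty list means the entry conforms."""
    errors = [f"missing:{k}"
              for k in ("variant", "trait", "ancestry", "effect", "source")
              if k not in entry]
    if errors:
        return errors  # fail closed before inspecting nested fields

    if not _matches(entry["variant"], "rsid", RSID):
        errors.append("bad:variant.rsid")
    trait = entry["trait"]
    if not (isinstance(trait, dict) and "label" in trait):
        errors.append("missing:trait.label")
    if not _matches(trait, "efo_id", EFO):
        errors.append("bad:trait.efo_id")
    ancestry = entry["ancestry"]
    if not (isinstance(ancestry, str) and ancestry in ANCESTRIES):
        errors.append("bad:ancestry")

    effect = entry["effect"] if isinstance(entry["effect"], dict) else {}
    if effect.get("measure") not in MEASURES:
        errors.append("bad:effect.measure")
    if not _is_number(effect.get("value")):
        errors.append("bad:effect.value")
    for bound in ("ci_low", "ci_high"):  # optional, but typed when present
        if bound in effect and not _is_number(effect[bound]):
            errors.append(f"bad:effect.{bound}")

    source = entry["source"]
    if not _matches(source, "pmid", PMID):
        errors.append("bad:source.pmid")
    if (isinstance(source, dict) and "gwas_catalog_accession" in source
            and not _matches(source, "gwas_catalog_accession", GCST)):
        errors.append("bad:source.gwas_catalog_accession")
    return errors


# Placeholder values only: not a real variant, trait, study, or effect size.
example_entry = {
    "variant": {"rsid": "rs0000001"},
    "trait": {"label": "placeholder trait", "efo_id": "EFO_0000000"},
    "ancestry": "AFR",
    "effect": {"measure": "OR", "value": 1.5, "ci_low": 1.2, "ci_high": 1.9},
    "source": {"pmid": "00000000", "gwas_catalog_accession": "GCST000000"},
}
```

Passing this check only proves the entry is well-formed. Whether the PMID actually supports the variant, trait, and ancestry is the resolver's job in 1b.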

**1b. Resolver harness** — new `HARNESS/validate_provenance.py` + a GWAS Catalog snapshot under `TRUTH/gwas_catalog/`. For each entry, fail-closed checks with machine-readable error codes:

| Check | Error code | Catches |
|---|---|---|
| PMID resolves at all (PubMed efetch) | `pmid_unresolvable` | typos, fabricated IDs |
| GWAS Catalog has an association for this rsID + trait (EFO) | `assoc_not_found` | variant/trait mismatch |
| The supporting study's PMID matches the cited PMID | `pmid_study_mismatch` | `20566908` APOL1 case |
| The study's ancestry matches the entry's ancestry | `ancestry_mismatch` | `22158537` EAS-tagged-SAS case |
| Effect size within plausible range of the catalog value | `effect_out_of_range` | transcription / inversion errors |
| Where GWAS Catalog lacks coverage: PubMed title/abstract topic match | `topic_match_low` (WARN, human sign-off) | honest gaps, not auto-pass |

Backfill `ancestry-risk-profiler` as the first target (worst offender, perfect regression case): the gate must flag `20566908`, `22158537`, `27005778`, `23945395`, `17478679`.
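
The check table can be sketched as an ordered, fail-closed cascade over an offline snapshot. Everything below is a hypothetical shape rather than the proposed `validate_provenance.py`: the `CatalogAssociation` record, the `known_pmids` set standing in for a PubMed efetch lookup, and the 25% relative tolerance used for "plausible range" are assumptions. The PubMed topic-match fallback (`topic_match_low`) is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogAssociation:
    """One row of a hypothetical GWAS Catalog snapshot."""
    rsid: str
    efo_id: str
    pmid: str
    ancestry: str
    effect_value: float


def check_entry(entry, snapshot, known_pmids, rel_tolerance=0.25):
    """Return error codes for one provenance entry; an empty list means pass.

    Checks run in table order and stop at the first failure, so each
    error code points at the earliest broken link in the provenance chain.
    """
    pmid = entry["source"]["pmid"]
    if pmid not in known_pmids:  # stand-in for a PubMed efetch lookup
        return ["pmid_unresolvable"]

    rsid = entry["variant"]["rsid"]
    efo_id = entry["trait"]["efo_id"]
    candidates = [a for a in snapshot if a.rsid == rsid and a.efo_id == efo_id]
    if not candidates:
        return ["assoc_not_found"]

    cited = [a for a in candidates if a.pmid == pmid]
    if not cited:
        return ["pmid_study_mismatch"]

    same_ancestry = [a for a in cited if a.ancestry == entry["ancestry"]]
    if not same_ancestry:
        return ["ancestry_mismatch"]

    value = entry["effect"]["value"]
    in_range = any(
        abs(value - a.effect_value) <= rel_tolerance * abs(a.effect_value)
        for a in same_ancestry
    )
    return [] if in_range else ["effect_out_of_range"]
```

Keeping the PubMed lookup behind a precomputed set (or an injected callable) would let the unit tests for the six PMIDs run without network access, which matters if the gate is meant to be deterministic in CI.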

### Layer 2 — Clinical-framework golden fixtures (deterministic, blocking)

Largely exists: `SCHEMAS/acmg_evidence_schema.json` + `HARNESS/acmg_points.py`. What's missing is a **framework-owned worked-example fixture set** wired as CI invariants, so a skill cannot pass by encoding its own wrong rule:

- ClinGen/ACMG CNV (Riggs 2020) worked examples: `2A + 5B inherited = 0.70 VUS`, `2A + 5A de novo = 1.45 Pathogenic`, the five-tier boundaries at +/-0.90 / +/-0.99.
- Ancestry: panel size < 30 markers => abstain (Kosoy 2009); no absolute-risk % when ORs are applied on a baseline; soft posterior, never a hard single label from a sub-threshold panel.
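
As an illustration of the CNV fixtures, the sketch below encodes the two worked examples and the tier boundaries as framework-owned data and runs a skill's scoring function against them. The per-criterion point values are inferred from the sums quoted in this issue (2A = 1.00, so 5B inherited = -0.30 and 5A de novo = +0.45). The criterion labels, the function names, and the use of integer hundredths are assumptions made for the example.

```python
# Points in hundredths so the tier boundaries compare exactly (no float rounding).
POINTS = {
    "2A": 100,            # implied by the 1.00 result quoted in this issue
    "5A_de_novo": 45,     # inferred from 2A + 5A de novo = 1.45
    "5B_inherited": -30,  # inferred from 2A + 5B inherited = 0.70
}


def score(criteria):
    """Additive score: every applied criterion contributes its points."""
    return sum(POINTS[c] for c in criteria)


def classify(total):
    """Map a total (hundredths) onto the five tiers at +/-0.90 and +/-0.99."""
    if total >= 99:
        return "Pathogenic"
    if total >= 90:
        return "Likely pathogenic"
    if total > -90:
        return "VUS"
    if total > -99:
        return "Likely benign"
    return "Benign"


# Framework-owned fixtures: (criteria, expected total, expected tier).
GOLDEN = [
    (("2A", "5B_inherited"), 70, "VUS"),
    (("2A", "5A_de_novo"), 145, "Pathogenic"),
]


def check_golden(score_fn, classify_fn):
    """Return the fixtures that a skill's implementation gets wrong."""
    failures = []
    for criteria, total, tier in GOLDEN:
        got_total = score_fn(criteria)
        got = (got_total, classify_fn(got_total))
        if got != (total, tier):
            failures.append((criteria, got))
    return failures


def terminal_2a_score(criteria):
    """The wrong rule described in the Evidence section: 2A ends scoring."""
    return 100 if "2A" in criteria else score(criteria)
```

Here `check_golden(score, classify)` returns an empty list, while `check_golden(terminal_2a_score, classify)` reports both fixtures, including the `2A + 5B inherited` case coming back as 1.00 Pathogenic. Because the fixtures live with the framework and not with the skill, a contributor's own tests cannot redefine the expected answer.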

### Layer 3 — Domain-agent review as an advisory CI check (non-blocking)

Wire the `/pr-audit` deep domain auditor into a GitHub Action on any `skills/*` PR, posting findings as a **non-blocking** check + human gate. Agents are non-deterministic and over-reach (a re-audit once flagged an out-of-diff regression that was not real), so they cover only the residue the deterministic gates cannot: novel clinical framing, LD-correlated risk math, judgement calls. This is the top of the pyramid, not the base.

## Policy glue

Skill-contribution template declares a capability class: `emits_effect_sizes`, `clinical_classification`, `infers_ancestry`. The flag routes the PR to the required gates, so a contributor cannot opt out of a gate by omission.
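
One way the routing could work is sketched below. The three capability flags come from the template description above. The gate names, the `capabilities` manifest key, and the `RoutingError` type are hypothetical. The point of the sketch is that a missing declaration is a CI failure and not an empty set of gates.

```python
# Capability flags are from this issue; gate names are placeholders.
REQUIRED_GATES = {
    "emits_effect_sizes": {"provenance_oracle"},
    "clinical_classification": {"clinical_golden_fixtures"},
    "infers_ancestry": {"ancestry_golden_fixtures"},
}


class RoutingError(ValueError):
    """Raised when a manifest cannot be routed; CI should treat it as a failure."""


def route_gates(manifest):
    """Return the set of blocking gates a skill PR must pass."""
    if "capabilities" not in manifest:
        # Omission is not a way to opt out: no declaration means no merge.
        raise RoutingError("capabilities_undeclared")

    declared = manifest["capabilities"]
    unknown = sorted(set(declared) - set(REQUIRED_GATES))
    if unknown:
        raise RoutingError("unknown_capability:" + ",".join(unknown))

    gates = set()
    for flag in declared:
        gates |= REQUIRED_GATES[flag]
    return gates
```

For example, a manifest declaring both `emits_effect_sizes` and `infers_ancestry` would be routed to both the provenance gate and the ancestry fixtures. In this sketch an explicit empty list is accepted and yields no gates, so whether a skill may truthfully declare "none" is still a policy decision for reviewers.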

## Acceptance criteria

- [ ] `SCHEMAS/effect_size_provenance_schema.json` merged
- [ ] `HARNESS/validate_provenance.py` + `TRUTH/gwas_catalog/` snapshot; unit-tested against the six high-reuse PMIDs (the five listed under Layer 1 fail with the right error codes, `16415884` passes)
- [ ] `ancestry-risk-profiler` backfilled to the schema and passing/failing as expected
- [ ] ACMG worked-example fixtures wired as blocking invariants
- [ ] Domain-agent advisory check live on `skills/*` PRs
- [ ] Skill template declares capability class; CI routes accordingly

## Phasing

1. **Layer 1** (this sprint) — schema + resolver + GWAS Catalog oracle + ancestry-risk-profiler backfill. Permanently closes the citation class.
2. **Layer 2** — ACMG/ancestry golden fixtures.
3. **Layer 3** — agent CI check + template routing.

## Strategic note

This is not overhead around the product; for a trust layer it **is** the product. "Every effect size resolves to a real, ancestry-matched, peer-reviewed source, enforced in CI" is the claim that makes an institution willing to depend on the catalog. The provenance oracle is also reusable as a GenomeQA scoring axis (provenance-grounding), not only a gate.

