Bug: audit integrity verifier/cron coverage gaps and latent chain hazards (org cap, unverified tail, broken resume, clock skew, intra-mutation fork)

## Summary

A set of coverage gaps and latent correctness hazards in the audit-chain verifier and its daily cron, found while investigating a production integrity alert. None of these is the active false-alarm root cause (tracked separately — see related issues), but several silently void the control's guarantees (orgs/rows that are never verified) and two can produce false tamper reports or an actually-forked chain under realistic conditions.

All line references at `08ca62581` (main).

## 1. Orgs beyond the first 500 are never checked (comment claims otherwise)

`listAuditedOrganizationIds` (`audit_logs/integrity_check.ts:39-57`) iterates `auditLogChainGenesis` and hard-stops at `MAX_ORGS_PER_RUN = 500` with no cursor. The action logs _"remaining orgs are checked on the next daily run"_ (`:107-111`) — false: every run takes the **same first 500** genesis rows. On a deployment with >500 audited orgs, the rest are permanently outside the monitored control. Fix: persist a fan-out cursor (round-robin window per run), or page through all orgs across scheduler-chained batches.

## 2. Rows past the oldest 5000 are never verified by the cron

The cron verifies the **oldest** `MAX_ENTRIES_PER_ORG = 5000` rows per org (`integrity_check.ts:36`, `:122`) and treats `truncated` as a warning, never paging (`:138-142`); `fromTimestamp` exists only on the public query. With the floor-clamped retention of ≥365 days an active org easily holds far more than 5000 rows, so the newest segment — where live tampering would land — is exactly the part never checked. Fix: page with `fromTimestamp` across runs (after fixing item 3 below), persisting per-org progress.

## 3. The documented `fromTimestamp` resume path always reports a false break

`verifyAuditChain`'s resume contract ("page from `lastVerifiedTimestamp + 1`", `verify_integrity.ts:150-157`) is broken at the first row of any resumed page: on resume, `previousExpectedHash` starts `''` and `isFirstEntry` is forced false (`:315-321`), the only seeding site sits inside the skipped `isFirstEntry` block (`:457`), so the linkage check (`:461`) compares the first resumed row's real `previousHash` against `''` → guaranteed `firstBrokenAt` with zero tampering. Additionally, `gte(fromTimestamp)` (`:291-295`) permanently skips unverified same-timestamp siblings when a page cut lands inside a same-`timestamp` group — and such groups are guaranteed by multi-append mutations (e.g. `governance/legal_hold.ts` bulk place: N per-target rows + summary row share one mutation's frozen `Date.now()`).

No production caller passes `fromTimestamp` today (the cron doesn't; the frontend never calls the query), so this is dead-but-documented — but it is the public admin query's own contract, and item 2's fix depends on it. Fix: accept a `previousExpectedHash` (or last verified row id) alongside `fromTimestamp` and seed from it; use an exclusive-bound resume keyed on (timestamp, \_id) rather than `timestamp + 1`.

## 4. No write-side timestamp monotonicity clamp — a clock regression permanently orphans a row

The chain orders by the app-level `timestamp` (`Date.now()`, `audit_logs/helpers.ts:200`) on `by_organizationId_and_timestamp` for both head-pick (desc, `:227-233`) and verification walk (asc, `verify_integrity.ts:288-297`). The genesis sentinel stores `lastInsertedAt` (`helpers.ts:218`) but **never reads it back** — there is no `Math.max(now, lastInsertedAt + 1)` clamp. If the backend wall clock steps backward (NTP step correction, VM snapshot restore), the next append chains off the true head but its `timestamp` sorts _before_ it; subsequent writers keep picking the higher-timestamp old head, so the row is permanently orphaned and the ascending verify walk reports a linkage break at it — a permanent false tamper alarm until retention deletes the row. Fix: clamp the write timestamp against the genesis row's `lastInsertedAt` (it is already read+patched in every append for OCC purposes, so this costs nothing).

## 5. No guard against same-org parallel appends inside one mutation (chain fork)

Cross-transaction forks are properly prevented (predecessor `chainSuccessor` patch + genesis sentinel → OCC conflict, `helpers.ts:208-224`, `:313-315`). But **within a single mutation there is no OCC**: two `createAuditLog` calls under `Promise.all` both read the same head before either inserts, and both commit with the same `previousHash` — a silent fork, which the next integrity run then reports as tampering (reproduced in a scratch convex-test: single-transaction same-org `Promise.all` → two rows with identical `previousHash` → `valid: false`). The one live parallel site, `users/update_user_password.ts:105-120`, is safe today only because each call targets a different org's chain — nothing enforces that, and `getUserOrganizations` does not dedupe membership rows. Fix: serialize appends within a mutation (per-ctx promise chain or an explicit `createAuditLogs(batch)` helper that chains sequentially), and add a lint/test guard against `Promise.all`-wrapped `createAuditLog`.

## How this was found

Investigating a production "Audit log integrity check failed" notification. Related umbrella: #1803. Sibling issues from the same investigation: #1842, #1843, #1844, #1845.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug: audit integrity verifier/cron coverage gaps and latent chain hazards (org cap, unverified tail, broken resume, clock skew, intra-mutation fork) #1846

Summary

1. Orgs beyond the first 500 are never checked (comment claims otherwise)

2. Rows past the oldest 5000 are never verified by the cron

3. The documented `fromTimestamp` resume path always reports a false break

4. No write-side timestamp monotonicity clamp — a clock regression permanently orphans a row

5. No guard against same-org parallel appends inside one mutation (chain fork)

How this was found

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Bug: audit integrity verifier/cron coverage gaps and latent chain hazards (org cap, unverified tail, broken resume, clock skew, intra-mutation fork) #1846

Description

Summary

1. Orgs beyond the first 500 are never checked (comment claims otherwise)

2. Rows past the oldest 5000 are never verified by the cron

3. The documented fromTimestamp resume path always reports a false break

4. No write-side timestamp monotonicity clamp — a clock regression permanently orphans a row

5. No guard against same-org parallel appends inside one mutation (chain fork)

How this was found

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

3. The documented `fromTimestamp` resume path always reports a false break