Skip to content

Bug: audit integrity verifier/cron coverage gaps and latent chain hazards (org cap, unverified tail, broken resume, clock skew, intra-mutation fork) #1846

@larryro

Description

@larryro

Summary

A set of coverage gaps and latent correctness hazards in the audit-chain verifier and its daily cron, found while investigating a production integrity alert. None of these is the active false-alarm root cause (tracked separately — see related issues), but several silently void the control's guarantees (orgs/rows that are never verified) and two can produce false tamper reports or an actually-forked chain under realistic conditions.

All line references at 08ca62581 (main).

1. Orgs beyond the first 500 are never checked (comment claims otherwise)

listAuditedOrganizationIds (audit_logs/integrity_check.ts:39-57) iterates auditLogChainGenesis and hard-stops at MAX_ORGS_PER_RUN = 500 with no cursor. The action logs "remaining orgs are checked on the next daily run" (:107-111) — false: every run takes the same first 500 genesis rows. On a deployment with >500 audited orgs, the rest are permanently outside the monitored control. Fix: persist a fan-out cursor (round-robin window per run), or page through all orgs across scheduler-chained batches.

2. Rows past the oldest 5000 are never verified by the cron

The cron verifies the oldest MAX_ENTRIES_PER_ORG = 5000 rows per org (integrity_check.ts:36, :122) and treats truncated as a warning, never paging (:138-142); fromTimestamp exists only on the public query. With the floor-clamped retention of ≥365 days an active org easily holds far more than 5000 rows, so the newest segment — where live tampering would land — is exactly the part never checked. Fix: page with fromTimestamp across runs (after fixing item 3 below), persisting per-org progress.

3. The documented fromTimestamp resume path always reports a false break

verifyAuditChain's resume contract ("page from lastVerifiedTimestamp + 1", verify_integrity.ts:150-157) is broken at the first row of any resumed page: on resume, previousExpectedHash starts '' and isFirstEntry is forced false (:315-321), the only seeding site sits inside the skipped isFirstEntry block (:457), so the linkage check (:461) compares the first resumed row's real previousHash against '' → guaranteed firstBrokenAt with zero tampering. Additionally, gte(fromTimestamp) (:291-295) permanently skips unverified same-timestamp siblings when a page cut lands inside a same-timestamp group — and such groups are guaranteed by multi-append mutations (e.g. governance/legal_hold.ts bulk place: N per-target rows + summary row share one mutation's frozen Date.now()).

No production caller passes fromTimestamp today (the cron doesn't; the frontend never calls the query), so this is dead-but-documented — but it is the public admin query's own contract, and item 2's fix depends on it. Fix: accept a previousExpectedHash (or last verified row id) alongside fromTimestamp and seed from it; use an exclusive-bound resume keyed on (timestamp, _id) rather than timestamp + 1.

4. No write-side timestamp monotonicity clamp — a clock regression permanently orphans a row

The chain orders by the app-level timestamp (Date.now(), audit_logs/helpers.ts:200) on by_organizationId_and_timestamp for both head-pick (desc, :227-233) and verification walk (asc, verify_integrity.ts:288-297). The genesis sentinel stores lastInsertedAt (helpers.ts:218) but never reads it back — there is no Math.max(now, lastInsertedAt + 1) clamp. If the backend wall clock steps backward (NTP step correction, VM snapshot restore), the next append chains off the true head but its timestamp sorts before it; subsequent writers keep picking the higher-timestamp old head, so the row is permanently orphaned and the ascending verify walk reports a linkage break at it — a permanent false tamper alarm until retention deletes the row. Fix: clamp the write timestamp against the genesis row's lastInsertedAt (it is already read+patched in every append for OCC purposes, so this costs nothing).

5. No guard against same-org parallel appends inside one mutation (chain fork)

Cross-transaction forks are properly prevented (predecessor chainSuccessor patch + genesis sentinel → OCC conflict, helpers.ts:208-224, :313-315). But within a single mutation there is no OCC: two createAuditLog calls under Promise.all both read the same head before either inserts, and both commit with the same previousHash — a silent fork, which the next integrity run then reports as tampering (reproduced in a scratch convex-test: single-transaction same-org Promise.all → two rows with identical previousHashvalid: false). The one live parallel site, users/update_user_password.ts:105-120, is safe today only because each call targets a different org's chain — nothing enforces that, and getUserOrganizations does not dedupe membership rows. Fix: serialize appends within a mutation (per-ctx promise chain or an explicit createAuditLogs(batch) helper that chains sequentially), and add a lint/test guard against Promise.all-wrapped createAuditLog.

How this was found

Investigating a production "Audit log integrity check failed" notification. Related umbrella: #1803. Sibling issues from the same investigation: #1842, #1843, #1844, #1845.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions