Summary
A set of coverage gaps and latent correctness hazards in the audit-chain verifier and its daily cron, found while investigating a production integrity alert. None of these is the active false-alarm root cause (tracked separately — see related issues), but several silently void the control's guarantees (orgs/rows that are never verified) and two can produce false tamper reports or an actually-forked chain under realistic conditions.
All line references at 08ca62581 (main).
1. Orgs beyond the first 500 are never checked (comment claims otherwise)
listAuditedOrganizationIds (audit_logs/integrity_check.ts:39-57) iterates auditLogChainGenesis and hard-stops at MAX_ORGS_PER_RUN = 500 with no cursor. The action logs "remaining orgs are checked on the next daily run" (:107-111) — false: every run takes the same first 500 genesis rows. On a deployment with >500 audited orgs, the rest are permanently outside the monitored control. Fix: persist a fan-out cursor (round-robin window per run), or page through all orgs across scheduler-chained batches.
2. Rows past the oldest 5000 are never verified by the cron
The cron verifies the oldest MAX_ENTRIES_PER_ORG = 5000 rows per org (integrity_check.ts:36, :122) and treats truncated as a warning, never paging (:138-142); fromTimestamp exists only on the public query. With the floor-clamped retention of ≥365 days an active org easily holds far more than 5000 rows, so the newest segment — where live tampering would land — is exactly the part never checked. Fix: page with fromTimestamp across runs (after fixing item 3 below), persisting per-org progress.
3. The documented fromTimestamp resume path always reports a false break
verifyAuditChain's resume contract ("page from lastVerifiedTimestamp + 1", verify_integrity.ts:150-157) is broken at the first row of any resumed page: on resume, previousExpectedHash starts '' and isFirstEntry is forced false (:315-321), the only seeding site sits inside the skipped isFirstEntry block (:457), so the linkage check (:461) compares the first resumed row's real previousHash against '' → guaranteed firstBrokenAt with zero tampering. Additionally, gte(fromTimestamp) (:291-295) permanently skips unverified same-timestamp siblings when a page cut lands inside a same-timestamp group — and such groups are guaranteed by multi-append mutations (e.g. governance/legal_hold.ts bulk place: N per-target rows + summary row share one mutation's frozen Date.now()).
No production caller passes fromTimestamp today (the cron doesn't; the frontend never calls the query), so this is dead-but-documented — but it is the public admin query's own contract, and item 2's fix depends on it. Fix: accept a previousExpectedHash (or last verified row id) alongside fromTimestamp and seed from it; use an exclusive-bound resume keyed on (timestamp, _id) rather than timestamp + 1.
4. No write-side timestamp monotonicity clamp — a clock regression permanently orphans a row
The chain orders by the app-level timestamp (Date.now(), audit_logs/helpers.ts:200) on by_organizationId_and_timestamp for both head-pick (desc, :227-233) and verification walk (asc, verify_integrity.ts:288-297). The genesis sentinel stores lastInsertedAt (helpers.ts:218) but never reads it back — there is no Math.max(now, lastInsertedAt + 1) clamp. If the backend wall clock steps backward (NTP step correction, VM snapshot restore), the next append chains off the true head but its timestamp sorts before it; subsequent writers keep picking the higher-timestamp old head, so the row is permanently orphaned and the ascending verify walk reports a linkage break at it — a permanent false tamper alarm until retention deletes the row. Fix: clamp the write timestamp against the genesis row's lastInsertedAt (it is already read+patched in every append for OCC purposes, so this costs nothing).
5. No guard against same-org parallel appends inside one mutation (chain fork)
Cross-transaction forks are properly prevented (predecessor chainSuccessor patch + genesis sentinel → OCC conflict, helpers.ts:208-224, :313-315). But within a single mutation there is no OCC: two createAuditLog calls under Promise.all both read the same head before either inserts, and both commit with the same previousHash — a silent fork, which the next integrity run then reports as tampering (reproduced in a scratch convex-test: single-transaction same-org Promise.all → two rows with identical previousHash → valid: false). The one live parallel site, users/update_user_password.ts:105-120, is safe today only because each call targets a different org's chain — nothing enforces that, and getUserOrganizations does not dedupe membership rows. Fix: serialize appends within a mutation (per-ctx promise chain or an explicit createAuditLogs(batch) helper that chains sequentially), and add a lint/test guard against Promise.all-wrapped createAuditLog.
How this was found
Investigating a production "Audit log integrity check failed" notification. Related umbrella: #1803. Sibling issues from the same investigation: #1842, #1843, #1844, #1845.
Summary
A set of coverage gaps and latent correctness hazards in the audit-chain verifier and its daily cron, found while investigating a production integrity alert. None of these is the active false-alarm root cause (tracked separately — see related issues), but several silently void the control's guarantees (orgs/rows that are never verified) and two can produce false tamper reports or an actually-forked chain under realistic conditions.
All line references at
08ca62581(main).1. Orgs beyond the first 500 are never checked (comment claims otherwise)
listAuditedOrganizationIds(audit_logs/integrity_check.ts:39-57) iteratesauditLogChainGenesisand hard-stops atMAX_ORGS_PER_RUN = 500with no cursor. The action logs "remaining orgs are checked on the next daily run" (:107-111) — false: every run takes the same first 500 genesis rows. On a deployment with >500 audited orgs, the rest are permanently outside the monitored control. Fix: persist a fan-out cursor (round-robin window per run), or page through all orgs across scheduler-chained batches.2. Rows past the oldest 5000 are never verified by the cron
The cron verifies the oldest
MAX_ENTRIES_PER_ORG = 5000rows per org (integrity_check.ts:36,:122) and treatstruncatedas a warning, never paging (:138-142);fromTimestampexists only on the public query. With the floor-clamped retention of ≥365 days an active org easily holds far more than 5000 rows, so the newest segment — where live tampering would land — is exactly the part never checked. Fix: page withfromTimestampacross runs (after fixing item 3 below), persisting per-org progress.3. The documented
fromTimestampresume path always reports a false breakverifyAuditChain's resume contract ("page fromlastVerifiedTimestamp + 1",verify_integrity.ts:150-157) is broken at the first row of any resumed page: on resume,previousExpectedHashstarts''andisFirstEntryis forced false (:315-321), the only seeding site sits inside the skippedisFirstEntryblock (:457), so the linkage check (:461) compares the first resumed row's realpreviousHashagainst''→ guaranteedfirstBrokenAtwith zero tampering. Additionally,gte(fromTimestamp)(:291-295) permanently skips unverified same-timestamp siblings when a page cut lands inside a same-timestampgroup — and such groups are guaranteed by multi-append mutations (e.g.governance/legal_hold.tsbulk place: N per-target rows + summary row share one mutation's frozenDate.now()).No production caller passes
fromTimestamptoday (the cron doesn't; the frontend never calls the query), so this is dead-but-documented — but it is the public admin query's own contract, and item 2's fix depends on it. Fix: accept apreviousExpectedHash(or last verified row id) alongsidefromTimestampand seed from it; use an exclusive-bound resume keyed on (timestamp, _id) rather thantimestamp + 1.4. No write-side timestamp monotonicity clamp — a clock regression permanently orphans a row
The chain orders by the app-level
timestamp(Date.now(),audit_logs/helpers.ts:200) onby_organizationId_and_timestampfor both head-pick (desc,:227-233) and verification walk (asc,verify_integrity.ts:288-297). The genesis sentinel storeslastInsertedAt(helpers.ts:218) but never reads it back — there is noMath.max(now, lastInsertedAt + 1)clamp. If the backend wall clock steps backward (NTP step correction, VM snapshot restore), the next append chains off the true head but itstimestampsorts before it; subsequent writers keep picking the higher-timestamp old head, so the row is permanently orphaned and the ascending verify walk reports a linkage break at it — a permanent false tamper alarm until retention deletes the row. Fix: clamp the write timestamp against the genesis row'slastInsertedAt(it is already read+patched in every append for OCC purposes, so this costs nothing).5. No guard against same-org parallel appends inside one mutation (chain fork)
Cross-transaction forks are properly prevented (predecessor
chainSuccessorpatch + genesis sentinel → OCC conflict,helpers.ts:208-224,:313-315). But within a single mutation there is no OCC: twocreateAuditLogcalls underPromise.allboth read the same head before either inserts, and both commit with the samepreviousHash— a silent fork, which the next integrity run then reports as tampering (reproduced in a scratch convex-test: single-transaction same-orgPromise.all→ two rows with identicalpreviousHash→valid: false). The one live parallel site,users/update_user_password.ts:105-120, is safe today only because each call targets a different org's chain — nothing enforces that, andgetUserOrganizationsdoes not dedupe membership rows. Fix: serialize appends within a mutation (per-ctx promise chain or an explicitcreateAuditLogs(batch)helper that chains sequentially), and add a lint/test guard againstPromise.all-wrappedcreateAuditLog.How this was found
Investigating a production "Audit log integrity check failed" notification. Related umbrella: #1803. Sibling issues from the same investigation: #1842, #1843, #1844, #1845.