refactor: derive Codex cost cache producer key from parser source hash #1042

Open
hhh2210 wants to merge 2 commits into steipete:main from hhh2210:codex/cost-cache-producer-key

hhh2210 (Contributor) commented May 19, 2026

Summary

  • stamp Codex cost caches with producerKey = "codex:cu:p<hash>", where <hash> is a generated SHA256 prefix over the included non-Claude Vendored/CostUsage/*.swift inputs
  • keep the existing codex-v8.json artifact path stable while ignoring legacy or mismatched Codex caches so they rebuild once
  • add a tracked generated CodexParserHash file plus a non-mutating Scripts/lint.sh check that fails when the hash is stale
  • leave non-Codex cost caches on their existing artifact-version behavior, so a Codex scanner/cache-source change does not cold-invalidate Claude caches
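The load-time rule these bullets describe can be sketched as below. This is a minimal illustration in shell, not the PR's code: the real check lives in Swift in CostUsageCache.swift, and the function names `expected_producer_key` and `cache_passes_producer_check`, along with the provider strings, are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: build the expected key from a 16-character hash prefix.
expected_producer_key() {  # $1 = parser hash prefix
  printf 'codex:cu:p%s' "$1"
}

# Hypothetical helper: succeed when a cache may be reused.
# $1 = provider, $2 = key stored in the cache ("" for a legacy cache),
# $3 = key expected by the running build.
cache_passes_producer_check() {
  if [ "$1" != "codex" ]; then
    # Non-Codex caches are not subject to the producer-key check; they keep
    # relying on the artifact version alone.
    return 0
  fi
  # Codex caches need a stored key that matches exactly; a missing (legacy)
  # or different key means the cache is ignored and rebuilt.
  [ -n "$2" ] && [ "$2" = "$3" ]
}
```

With this rule, a cache written before the change (no stored key) and a cache written by a build with different scanner sources both fail the check once, are rebuilt, and are then stamped with the current key.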

Why

This implements the auto-invalidation design proposed in #1020 along the forgotten-bump axis. The earlier correctness fix in #1014 made oversized turn_context rows recover the model from the retained prefix, but an already-written stale codex-v8.json cache can still mask a fixed parser. Continuing to solve that by renaming the file to codex-v9.json, then codex-v10.json, is fragile because every future Codex parser semantic fix depends on someone remembering to bump the filename.

The parser content hash is a conservative build-time producer identity for Codex cost-cache output. It changes when the included non-Claude Vendored/CostUsage/*.swift inputs change, so Codex caches rebuild after scanner/cache-source changes without requiring a manual codex-vN.json path bump. This may over-invalidate on refactors or comment-only edits in the included files, but it avoids under-invalidation after parser fixes.

The existing artifact version remains the coarse file/schema boundary. This PR removes the need to bump the Codex cache path for ordinary Codex scanner semantic changes; it does not remove the artifact-version mechanism for incompatible cache schema/path changes.

Implementation

  • Scripts/regenerate-codex-parser-hash.sh reads every *.swift under Sources/CodexBarCore/Vendored/CostUsage/ excluding Claude-specific files, concatenates them in a stable order with a separator, takes SHA256, and writes the first 16 hex characters into Sources/CodexBarCore/Generated/CodexParserHash.generated.swift.
  • The generated file is tracked in git so contributors and CI can compile without requiring a SwiftPM plugin or pre-build step.
  • Scripts/lint.sh runs the generator in --check mode, comparing against a temporary expected file without mutating the worktree.
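The generator and its check mode could look roughly like the sketch below. It is written under stated assumptions rather than copied from Scripts/regenerate-codex-parser-hash.sh: the `*Claude*` filename filter, the separator format, the shape of the generated Swift file, and all function names are hypothetical. Only the overall flow (stable order, separator, SHA256, first 16 hex characters, non-mutating check) comes from the description above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Use sha256sum where available (Linux), otherwise shasum (macOS).
sha256_stdin() {
  if command -v sha256sum >/dev/null 2>&1; then sha256sum; else shasum -a 256; fi
}

# Print the first 16 hex characters of a SHA256 over the included sources.
parser_hash() {  # $1 = source directory
  local dir="$1"
  # Assumption: Claude-specific files are recognizable by name.
  find "$dir" -name '*.swift' ! -name '*Claude*' -print \
    | LC_ALL=C sort \
    | while IFS= read -r file; do
        # Assumed separator: includes the relative path, so renaming a file
        # or moving bytes between files also changes the digest.
        printf '\n--- %s ---\n' "${file#"$dir"/}"
        cat "$file"
      done \
    | sha256_stdin | cut -c1-16
}

# Assumed shape of the generated file; the source only says it exposes
# CodexParserHash.value.
render_generated() {  # $1 = hash prefix
  printf '// Generated file. Do not edit by hand.\n'
  printf 'enum CodexParserHash {\n    static let value = "%s"\n}\n' "$1"
}

# Write the generated file, or with --check only compare against it.
regenerate() {  # $1 = source dir, $2 = generated file, $3 = optional --check
  local src="$1" out="$2" mode="${3:-}"
  local expected
  expected="$(mktemp)"
  render_generated "$(parser_hash "$src")" > "$expected"
  if [ "$mode" = "--check" ]; then
    # Check mode compares against the temporary file and never writes $out.
    if cmp -s "$expected" "$out"; then
      rm -f "$expected"
      return 0
    fi
    rm -f "$expected"
    echo "parser hash is stale; rerun the generator" >&2
    return 1
  fi
  cat "$expected" > "$out"
  rm -f "$expected"
}
```

Sorting under LC_ALL=C keeps the file order independent of the contributor's locale, which matters because the digest depends on concatenation order. A lint step would call `regenerate <source-dir> <generated-file> --check` and fail the build on a non-zero exit.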

Why this does not close #1020 entirely

#1020 also proposes invalidating on observed cli_version drift (an upstream-shape signal). I prototyped that on top of this PR and pulled back. The expected producer key needs to be known before CostUsageCacheIO.load returns, but the observed session_meta.payload.cli_version / originator set is only knowable after reading session files. Making that real would require either a preflight metadata scan or a separate producer-version index.

More importantly, the cache stores scanner output, not raw JSONL bytes. If upstream Codex CLI changes JSONL shape and CodexBar has not shipped a scanner fix, invalidating the cache only reruns the same unchanged scanner and produces the same wrong output. The useful follow-up for that axis is likely telemetry/logging for newly observed producer versions, not cache invalidation itself.

Local benchmark context

From the merged #1017 synthetic shape benchmark on this machine after this patch:

Codex JSONL shape benchmark: divisor=20 bytes=19621792 lines=7290 truncated=129 current=127.1MB/s baseline=47.3MB/s speedup=2.7x

From the real local 30-day Codex sessions tree used in #1016 (~1.24 GB / 545 files / 160k lines), a cold CLI cost refresh took about 14.18s wall-clock, while a cached read took about 1.15s (roughly 12x faster). The cost of this change is therefore a one-time Codex cache rebuild whenever the included Codex scanner/cache-source inputs actually change; in exchange, invalidation no longer depends on manual path-version bumps.

Tests

  • Scripts/regenerate-codex-parser-hash.sh --check
  • swift test --filter CostUsageCacheTests
  • swift test --filter CostUsage
  • Scripts/lint.sh lint
  • swift test

Refs #1020 (forgotten-bump axis only)

Copilot AI review requested due to automatic review settings May 19, 2026 07:28

Copilot AI left a comment

Pull request overview

This PR adds producer-key based invalidation for Codex cost caches while keeping the existing codex-v8.json path stable and leaving non-Codex cache behavior unchanged.

Changes:

  • Adds optional producerKey storage and matching checks for Codex cache load/save.
  • Derives Codex producer keys from app/CLI release version, with development executable fingerprint fallback.
  • Adds cache behavior and producer-key derivation tests.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| Sources/CodexBarCore/Vendored/CostUsage/CostUsageCache.swift | Adds producer-key stamping, validation, and release/development key derivation for Codex caches. |
| Tests/CodexBarTests/CostUsageCacheTests.swift | Adds coverage for producer-key matching, legacy cache invalidation, non-Codex behavior, and version-source selection. |

hhh2210 changed the title from "fix: stamp Codex cost cache producer identity" to "refactor: derive Codex cost cache producer key from parser source hash" May 19, 2026

The first iteration of this PR stamped the cache with the CodexBar
marketing version. That over-fires on cosmetic releases (every release
forces a one-time Codex cache rebuild, even when the scanner did not
change) and depends on a fallback chain through the executable bundle to
discover the version at runtime.

Replace it with a content hash of the Codex scanner sources, generated
at build time by Scripts/regenerate-codex-parser-hash.sh and emitted as
CodexParserHash.value. The producer key is now codex:cu:p<hash>:

- Only invalidates when the Codex scanner code actually changes, so
  cosmetic releases keep the existing cache.
- The hash is generated from source, so contributors cannot forget to
  bump a manual artifactVersion constant.
- Scripts/lint.sh re-runs the generator and fails if the tracked file is
  out of date, so CI catches stale hashes.

Non-Codex caches remain on their previous artifact-version behavior.

hhh2210 force-pushed the codex/cost-cache-producer-key branch from 1f9b0fa to e7dd706 May 19, 2026 07:55

Development

Successfully merging this pull request may close these issues.

Replace manual Codex cost cache version bumps with auto-invalidation

2 participants