Skip to content

feat(encryption): raft envelope + engine pre-apply hook (Stage 3)#744

Merged
bootjp merged 3 commits intomainfrom
feat/encryption-raft-envelope
May 10, 2026
Merged

feat(encryption): raft envelope + engine pre-apply hook (Stage 3)#744
bootjp merged 3 commits intomainfrom
feat/encryption-raft-envelope

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented May 7, 2026

Summary

Stage 3 of the data-at-rest encryption rollout (design doc: docs/design/2026_04_29_proposed_data_at_rest_encryption.md, §4.2 + §6.3). Wraps Raft proposal payloads in an authenticated AES-256-GCM envelope so the etcd/raft WAL on disk holds ciphertext only, and adds the engine pre-apply hook that unwraps committed entries past the §7.1 Phase 2 cutover index. Stage 3 is inert in production: with the new OpenConfig fields nil/unset, the apply hook never fires and every existing engine test continues to pass byte-for-byte.

  • New raft envelope wire format reuses §4.1 layout with a distinct AAD ('R' ‖ envelope_version ‖ key_id) so storage↔raft layer-confusion replay fails GCM.
  • applyNormalEntry becomes (any, error); on unwrap failure applyCommitted halts without advancing setApplied, mapping the entry-replay invariant directly into the runLoop's fatal-error path. Silent skip is rejected by design (§6.3).
  • applyCommitted arms refactored into applyNormalCommitted / applyConfChangeCommitted / applyConfChangeV2Committed to stay under the cyclomatic budget while threading the new error.
  • Coordinator-side RaftPayloadWrapper + wrappedProposer adapter — Stage 3 ships only the seam (nil wrapper = identity); Stage 6 will install the active wrapper from the sidecar.

Out of scope (deferred)

  • Stage 4: FSM-internal Raft entry types (raftEncodeEncryptionRegistration/Bootstrap/Rotation opcodes)
  • Stage 5: EncryptionAdmin gRPC + CLI
  • Stage 6: 3-phase rollout + cluster flags (sidecar publishes the cutover index + active wrapper)
  • Stage 7: writer registry + deterministic-nonce uniqueness across voters/learners/historical members
  • Stage 8: snapshot header v2 + authenticated MVCC encryption_state
  • Stage 9: KMS, compression, rotation/retire/rewrite, full Jepsen

Test plan

  • go test -race ./internal/encryption/... — round-trip, tag/keyID tamper, retired DEK, short input, layer-confusion, no-leak, AAD layout pin
  • go test -race ./internal/raftengine/etcd/... — cipher=nil pass-through, cutover boundary (strict greater-than), above-cutover Unwrap, unwrap-failure halts apply loop without advancing setApplied, proposal-id still resolvable after unwrap, OpenConfig nil-cutover default
  • go test -race ./kv/... — wrapper nil = identity, wrapper invoked exactly once, propagates wrap-side errors, end-to-end round-trip with real Cipher
  • golangci-lint run ./internal/raftengine/... ./internal/encryption/... ./kv/... clean
  • make test full sweep (CI verifies)
  • Existing engine tests unchanged (cipher=nil legacy path) (CI verifies)

Self-review (CLAUDE.md 5 passes)

  1. Data loss — tampered raft entries surface ErrRaftUnwrapFailed; setApplied never advances past a failing entry; runLoop's existing fatal-error path takes the process down for operator-supervised restart. The replay-not-skip invariant is the load-bearing safety property and is locked down by TestApplyCommitted_UnwrapFailure_DoesNotAdvanceApplied.
  2. ConcurrencyraftCipher / raftCutoverIndex are set once at Open and only read from the run-loop apply goroutine (same single-writer discipline as fsm). The wrapper closure type is documented as concurrent-safe; Stage 3's pass-through default sidesteps the question.
  3. Performancecipher=nil hot path is one nil branch + one uint64 compare per applied entry. With cipher set, the Unwrap path adds one DecodeEnvelope + one AEAD.Open per above-cutover entry; AES key expansion is amortised in the keystore's pre-initialised AEAD cache.
  4. Consistency — AAD purpose byte 'R' rules out storage↔raft layer confusion; key_id participates in AAD so on-disk rewrite fails GCM; cutover boundary is strict > so the enable-flag entry itself stays cleartext (design §7.1).
  5. Test coverage — 30+ new unit tests across three packages; existing raftengine_test.go continues to pass untouched.

Stage 3 of the encryption rollout (design doc §4.2 + §6.3). Adds
the AES-256-GCM raft envelope used to wrap Raft proposal payloads
end-to-end through the WAL, plus the engine-side pre-apply hook
that unwraps committed entries past a configurable cutover index.

Wire format. Reuses the §4.1 storage envelope layout but with a
distinct AAD: 'R' (0x52) ‖ envelope_version ‖ key_id BE32. No
pebble_key — the raft envelope is location-independent; the engine
identifies entries by raftpb.Entry.Index. Storage and raft envelopes
therefore cannot be cross-replayed: a storage ciphertext fed to
UnwrapRaftPayload (or vice versa) fails GCM tag verification on
the AAD prefix.

Engine pre-apply hook. internal/raftengine/etcd/engine.go gains
two optional OpenConfig fields (RaftCipher, RaftCutoverIndex) plus
matching Engine struct fields. applyNormalEntry now decodes the
proposal envelope FIRST (preserving the leading
0x01 proposalEnvelopeVersion byte that resolveProposal needs), then
— iff cipher!=nil AND entry.Index > cutoverIndex() — unwraps the
inner fsm payload through the raft DEK before handing the
cleartext to fsm.Apply. The FSM contract is unchanged at
Apply(data []byte) any. resolveProposal continues to receive the
original entry.Data, so per-proposal id resolution and response
delivery are untouched.

Fail-closed. applyNormalEntry now returns (any, error). On unwrap
failure (GCM tag mismatch / unknown DEK / malformed envelope)
applyCommitted propagates ErrRaftUnwrapFailed up to runLoop and
DOES NOT advance setApplied — the next restart must replay the
same entry, not skip it. Silent skip would let the local FSM
diverge from peers that successfully unwrapped, defeating the
integrity property the tag was added to detect (design §6.3).

applyCommitted refactor. Each entry-type arm extracted into its
own helper (applyNormalCommitted, applyConfChangeCommitted,
applyConfChangeV2Committed) so the dispatch loop stays under the
cyclomatic budget while the new (any, error) signature is woven
through.

Coordinator seam. kv/raft_payload_wrapper.go introduces the
RaftPayloadWrapper closure type and a wrappedProposer adapter that
runs the wrapper transparently before forwarding to a
raftengine.Proposer. Stage 3 ships only the seam: a nil wrapper is
returned verbatim by newWrappedProposer, preserving the existing
zero-encryption behaviour for every coordinator path. Stage 6
installs the active wrapper (cipher + active raft DEK + nonce
factory) from the sidecar.

Stage 3 is INERT in production. With cipher=nil and cutover=
^uint64(0) (the OpenConfig defaults from orInertCutover), every
existing engine_test continues to pass byte-for-byte; the new
behaviour only fires when Stage 5/6 wires the cipher and cutover.

Test coverage:

  - internal/encryption/raft_envelope_test.go: round-trip across 4
    sizes, deterministic-nonce determinism, tag tamper, key_id
    tamper, retired DEK, short-input, nil cipher, AAD layout pin
    (TestBuildRaftAAD), layer-confusion (storage envelope rejected
    by UnwrapRaftPayload), no-plaintext-leak.
  - internal/raftengine/etcd/encryption_test.go: cipher=nil pass-
    through, below-cutover pass-through (incl. boundary index ==
    cutover), above-cutover Unwrap, unwrap-failure halts apply
    loop without advancing setApplied, proposal id still
    resolvable after unwrap, OpenConfig nil-cutover default.
  - kv/raft_payload_wrapper_test.go: nil wrapper is identity,
    wrapper invoked exactly once per Propose, propagates wrap-side
    errors without touching the inner proposer, end-to-end round-
    trip through a real Cipher.

Self-review (CLAUDE.md 5 passes):

  1. Data loss: tampered raft entries fail closed via
     ErrRaftUnwrapFailed; setApplied never advances past a failing
     entry; runLoop's existing fatal-error path takes the process
     down for operator-supervised restart.
  2. Concurrency: raftCipher / raftCutoverIndex are set once at
     Open and only read from the run-loop apply goroutine — same
     single-writer discipline as fsm. The wrapper closure is
     documented as concurrent-safe; Stage 3's pass-through default
     avoids the question entirely.
  3. Performance: cipher=nil hot path is one nil branch + one
     uint64 compare per applied entry. With cipher set, the
     Unwrap path adds one DecodeEnvelope + one AEAD.Open per
     above-cutover entry; AES key expansion is amortised in the
     keystore's pre-initialised AEAD cache.
  4. Consistency: AAD purpose byte 'R' rules out storage↔raft
     layer confusion; key_id participates in AAD so on-disk
     rewrite fails GCM; cutover boundary is strict greater-than
     so the enable-flag entry itself stays cleartext (design
     §7.1).
  5. Test coverage: 30+ unit tests across three packages plus the
     existing raftengine_test.go which continues to pass with no
     changes (cipher=nil legacy path).
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 7, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7296cefa-dd8b-4c8a-b6c6-67fb09ca6db6

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/encryption-raft-envelope

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements Stage 3 of the Raft-level encryption design, introducing the core logic for wrapping and unwrapping Raft payloads using AES-GCM. It adds a new raft_envelope package for handling encrypted envelopes with distinct AAD, integrates an unwrap hook into the etcd-based Raft engine with cutover index support, and provides a coordinator-side wrapper for transparently encrypting proposals. The implementation ensures that decryption failures are fatal and prevent the advancement of the applied index to maintain consistency. I have no feedback to provide.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 9, 2026

@claude review

Stage 3 raft envelope + engine pre-apply hook の review trigger です。

主要変更点:

  • internal/encryption/raft_envelope.go (新規): WrapRaftPayload / UnwrapRaftPayload + AAD = 'R' ‖ envelope_version ‖ key_id BE32 (location-independent、storage envelope と層分離)
  • internal/raftengine/etcd/encryption.go (新規): ErrRaftUnwrapFailed sentinel + RaftCutoverIndex callback type + unwrapRaftPayload helper
  • internal/raftengine/etcd/engine.go 改修:
    • OpenConfigRaftCipher / RaftCutoverIndex 2 オプショナルフィールド追加 (default = nil で legacy 動作)
    • Engine struct に同2 フィールド追加 (single-writer discipline、fsm と同様)
    • applyNormalEntry(any, error) に signature 変更、cipher 有 + index > cutover() の場合に UnwrapRaftPayload 実行
    • applyCommitted の per-entry-type 分岐を applyNormalCommitted / applyConfChangeCommitted / applyConfChangeV2Committed に extract (cyclop budget 維持 + (any, error) 伝搬)
    • ErrRaftUnwrapFailed 時に setApplied を進めず error 伝搬 → runLoop fatal-error 経路で halt
  • kv/raft_payload_wrapper.go (新規): RaftPayloadWrapper 型 + wrappedProposer adapter (Stage 3 は seam のみ、nil 時 identity)

Stage 3 は production で inert (cipher=nil + cutover=^uint64(0) の OpenConfig default で従来動作 100% 維持)。Stage 5/6 で sidecar wiring 完了時に initial-active になります。

caller audit:

  • applyNormalEntry 戻り値変更: 唯一の caller applyCommitted (同 PR で更新済み)
  • raftengine.Proposer interface: 既存 caller applyRequests (kv/transaction.go) は wrappedProposer 経由で透過、wrap=nil で同一 Proposer instance を返すため挙動不変

特に確認したい点:

  • §4.2 raft envelope と §4.1 storage envelope の layer-confusion 検証
  • applyNormalEntry の (any, error) signature 変更と setApplied halt 不変条件
  • decodeProposalEnvelope先に 呼ぶ順序 (resolveProposal の id resolution 不破壊)

@claude
Copy link
Copy Markdown

claude Bot commented May 9, 2026

Claude finished @bootjp's task in 3m 3s —— View job


Stage 3 Raft Envelope Review

Tasks:

  • Gather context and understand request
  • Read internal/encryption/raft_envelope.go — AAD layout, layer-confusion separation
  • Read internal/raftengine/etcd/encryption.go — sentinel + cutover callback
  • Read internal/raftengine/etcd/engine.go (diff) — signature change, setApplied halt invariant, decode ordering
  • Read kv/raft_payload_wrapper.go — seam / nil identity
  • Read test files for coverage gaps
  • Cross-check §4.1 storage envelope AAD vs §4.2 raft envelope AAD
  • Post full review

Layer-Confusion Verdict (§4.2 vs §4.1)

Sound. The discriminant is byte 0 of the AAD:

Layer AAD byte[0] Full layout
Storage (§4.1) 0x01 (EnvelopeVersionV1) version ‖ flag ‖ key_id ‖ value_header(9B) ‖ pebble_key
Raft (§4.2) 0x52 ('R') 'R' ‖ version ‖ key_id

Any storage ciphertext fed to UnwrapRaftPayload reconstructs the wrong AAD (['R', 0x01, keyID...] vs the original [0x01, flag, keyID,...]) and GCM rejects with ErrIntegrity. The reverse replay likewise fails. TestRaftEnvelope_RejectsStorageEnvelope locks this down correctly.

One nuance worth documenting in the design doc: the flag byte that occupies position 1 in the storage header AAD is absent from the raft AAD (key_id goes directly after version at offset 2). This is intentional (raft proposals don't carry the compression bit), and the AAD pin test TestBuildRaftAAD catches any accidental drift.


applyNormalEntry (any, error) + setApplied Halt Invariant

Correct. The call chain is clean:

applyCommitted(entries)
  └─ applyNormalCommitted(entry)
       ├─ applyNormalEntry(entry) → (response, err)
       │    └─ on err: return nil, err  ← FSM never called
       ├─ if err != nil: return err     ← setApplied NOT called
       ├─ setApplied(entry.Index)       ← only on success
       └─ resolveProposal(...)

TestApplyCommitted_UnwrapFailure_DoesNotAdvanceApplied validates the exact invariant: the good entry at cutover+1 advances applied; the tampered entry at cutover+2 returns ErrRaftUnwrapFailed and applied stays at cutover+1. The never entry at cutover+3 is never reached.

One note: when applyNormalEntry returns an error, resolveProposal is not called and the proposal's req.done channel is never signalled. Since §6.3 designates this as a process-fatal event and runLoop halts, the dangling channel is cleaned up by process death rather than a graceful close. That's correct per design, but worth a comment near applyNormalCommitted noting that the pending-proposals map is intentionally abandoned on this path — otherwise future readers may wonder if it's a bug.


decodeProposalEnvelope Ordering and resolveProposal Non-Mutation

Correct. decodeProposalEnvelope at line 3707:

return binary.BigEndian.Uint64(data[1:envelopeHeaderSize]), data[envelopeHeaderSize:], true

…returns data[envelopeHeaderSize:] — a slice into entry.Data. The local variable payload is subsequently rebound to plain (a fresh Decrypt allocation):

plain, err := unwrapRaftPayload(e.raftCipher, payload)
// ...
payload = plain  // local variable rebind; entry.Data unchanged

entry.Data is never mutated. resolveProposal(entry.Index, entry.Data, response) can decode the proposal ID a second time safely. TestApplyNormalEntry_ProposalIDStillResolvable correctly verifies this by calling decodeProposalEnvelope(entry.Data) after the apply and checking the ID is intact.

Asymmetry to note (pre-existing, not a regression): On the non-cipher path, fsm.Apply receives payload = data[envelopeHeaderSize:], which is a slice aliased into entry.Data's backing array. On the cipher path it receives a fresh allocation from Decrypt. An FSM that retains the slice after Apply returns would see different ownership semantics depending on whether encryption is active. This isn't introduced by Stage 3, but Stage 6 should document the FSM contract or harden with a defensive copy at the apply boundary.


Other Observations

errors.Mark + errors.Wrap chain (encryption.go:74): the pattern errors.Wrap(errors.Mark(err, ErrRaftUnwrapFailed), "...") is correct. errors.Is(err, ErrRaftUnwrapFailed) and errors.Is(err, encryption.ErrIntegrity) both resolve, as verified by TestUnwrapRaftPayload_Helper. ✓

inertRaftCutoverIndex default: returning ^uint64(0) is clean — no entry index can be strictly greater, so the unwrap branch is unreachable in Stage 3. The corner case ^uint64(0) - 1 in TestApplyNormalEntry_NoCutoverDefault covers the penultimate possible index. ✓

newWrappedProposer nil shortcut (kv/raft_payload_wrapper.go:60-64): returning inner directly when wrap == nil avoids an extra allocation and an extra indirection on every Propose. ✓

Minor gap — above-cutover + decodeProposalEnvelope !ok path: applyNormalEntry returns nil, nil when the entry's data doesn't parse as a proposal envelope, even above the cutover. This is the same behavior as today (pre-encryption). A future Stage that introduces non-proposal-envelope entry types above the cutover will need to audit this path; a one-line comment at the !ok early return pointing to that concern would help.

RaftCutoverIndex callback performance contract: The comment in encryption.go:32 says implementations "should be lock-free (atomic.Uint64.Load)" but the type is just func() uint64. The inert default and the test closures satisfy this, but a future Stage 6 implementer could accidentally pass a mutex-protected function. A lint-time check isn't feasible, but a comment in OpenConfig.RaftCutoverIndex near the field doc reinforcing the "must not block" constraint would be worthwhile.


Summary

The implementation is clean and the three focal concerns are all handled correctly. No blocking issues. The two items worth following up before Stage 6 lands:

  1. A comment in applyNormalCommitted noting that pending proposals are intentionally not resolved on the fatal-unwrap path (process halt cleans up).
  2. FSM slice-aliasing note (or defensive copy) at the apply boundary, to be documented before the cipher path and non-cipher path can be active on the same cluster.

Two non-blocking documentation follow-ups from the Stage 3 review:

1. applyNormalCommitted: explicitly note that resolveProposal is
   intentionally NOT called on the unwrap-failure path. The
   originating coordinator's done channel is left dangling and
   reaped by process exit (§6.3 process-fatal contract). A graceful
   resolveProposal would deliver a misleading "committed-but-failed"
   signal that the coordinator could mis-interpret as a recoverable
   apply error.

2. applyNormalEntry: document the FSM payload aliasing asymmetry
   between cipher OFF (payload is a slice into entry.Data) and
   cipher ON (fresh allocation from Decrypt). FSM implementations
   that retain `data` past Apply's return without copying would see
   different ownership semantics depending on cipher state. Note
   that Stage 6 plans a defensive copy at the apply boundary to
   make the contract uniform.

No code change. Tests / lint green.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 9, 2026

@claude review

PR744 round-2: claude[bot] round-1 review の 2 件の non-blocking follow-up を反映しました (commit 3292ced)。

  • applyNormalCommitted doc: unwrap-failure path で resolveProposal を意図的に呼ばない理由を明記 (§6.3 process-fatal で channel は OS が reap、graceful 通知すると coordinator が "committed-but-failed" と誤解する)
  • applyNormalEntry doc: FSM payload の aliasing asymmetry (cipher OFF: entry.Data へのスライス / cipher ON: Decrypt の fresh alloc) を明記、Stage 6 の defensive copy 計画にリンク

コード変更なし、test/lint green。

@claude
Copy link
Copy Markdown

claude Bot commented May 9, 2026

Claude finished @bootjp's task in 1m 9s —— View job


Stage 3 Raft Envelope — Round 2 Review

Tasks:

  • Gather context and understand request
  • Read commit 3292ced changes — verify applyNormalCommitted doc comment
  • Read commit 3292ced changes — verify applyNormalEntry doc comment (aliasing asymmetry)
  • Verify round-1 follow-up items are fully addressed
  • Post final review

Both round-1 non-blocking follow-up items are correctly and fully resolved by commit 3292ced. No new issues found.

Item 1 — applyNormalCommitted: halt-path proposal channel

The added doc comment (engine.go:2030–2044) hits every relevant point:

  • States explicitly that resolveProposal is intentionally not called on the error path.
  • Gives the correct mechanical reason: the coordinator's done channel is left dangling and reaped by OS on process exit (§6.3 process-fatal contract).
  • Calls out the exact failure mode that a graceful resolveProposal would introduce — "committed-but-failed" — which a coordinator could mis-interpret as a recoverable apply error rather than a halt signal.

The reasoning is sound and the comment will orient future readers immediately. ✓

Item 2 — applyNormalEntry: FSM payload aliasing asymmetry

The block comment in applyNormalEntry (engine.go:2082–2097) correctly documents:

  • Cipher OFF: payload is a sub-slice of entry.Data's backing array (from decodeProposalEnvelope).
  • Cipher ON: payload is a fresh []byte allocation from cipher.Decrypt.
  • FSM implementations MUST NOT retain data past Apply's return without copying, or the two paths diverge in ownership semantics.
  • Stage 6 plans a defensive copy at the apply boundary to make the contract uniform.

One micro-note, not a blocker: the comment uses data (the FSM interface's parameter name) while the local variable is called payload. The intent is clear from context — it's describing the FSM's perspective — but a future reader scanning just the function body could briefly wonder which variable is meant. A parenthetical like (the \payload` slice passed as `data` to fsm.Apply)` would eliminate all ambiguity. Totally optional given how clear the rest of the comment is.

Overall

Commit 3292ced is documentation-only, 18 lines added, no code change. Both comments are accurate, placed at the right callsites, and address the exact concerns from round 1. Stage 3 is LGTM from my end. Ready to merge once CI is green.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3292cedb7e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread internal/raftengine/etcd/engine.go Outdated
Comment on lines +2119 to +2121
if e.raftCipher != nil && entry.Index > e.raftCutoverIndex() {
plain, err := unwrapRaftPayload(e.raftCipher, payload)
if err != nil {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Fail closed when cutover is active but cipher is nil

Once the cluster starts submitting wrapped raft payloads (indexes above the cutover), this branch silently skips UnwrapRaftPayload whenever raftCipher is nil and sends ciphertext bytes directly to fsm.Apply, then setApplied advances as if apply succeeded. A node that starts with RaftCutoverIndex wired but RaftCipher missing (e.g., sidecar/init race or misconfiguration) will permanently diverge state instead of halting on an unwrap error.

Useful? React with 👍 / 👎.

…R744 codex P1)

The previous applyNormalEntry hook gated on `raftCipher != nil &&
entry.Index > cutover()`. When `raftCipher == nil` AND
`raftCutoverIndex` returned a real cutover (any value other than
the inert MaxUint64 default), an above-cutover entry would silently
fall through to fsm.Apply with raw envelope bytes. The cluster
would then advance setApplied as if the entry succeeded, while
peers that DID unwrap applied a different (cleartext) payload —
permanent FSM divergence from a sidecar/init race or operator
misconfig, exactly the silent-skip failure mode the integrity tag
was added to detect.

Restructure the gate as: `if entry.Index > cutover() { ... }`,
and inside the branch, if the cipher is nil, log + return
ErrRaftUnwrapFailed wrapped with a "cutover active but no raft
cipher wired" message. applyCommitted's existing fail-closed
machinery then halts the apply loop without advancing setApplied,
the same as it does for any other unwrap failure.

The default OpenConfig (no cutover, no cipher) still uses
RaftCutoverIndex = MaxUint64 via orInertCutover, so no entry
satisfies entry.Index > cutover and the new branch never fires.
The existing TestApplyNormalEntry_NoCipher_PassThrough test
exercises this path verbatim and continues to pass.

Caller audit (semantic change: fail-closed on a previously-silent
misconfig path):

  - applyNormalEntry's only caller is applyNormalCommitted
    (engine.go:2027), which propagates errors through to
    applyCommitted's halt-without-setApplied path. New
    ErrRaftUnwrapFailed surface is handled identically to the
    existing tampered-envelope case.
  - No other code path constructs an Engine with cutover wired
    and cipher missing today (Stage 3 OpenConfig wiring is the
    only seam, and it accepts both fields together). Stage 6's
    cluster-flag pipeline must continue to wire both atomically
    or expect this engine refusal.

New regression test
TestApplyNormalEntry_CutoverActive_NoCipher_FailsClosed pins both
sides of the contract: above-cutover with no cipher returns
ErrRaftUnwrapFailed and the FSM is NOT called; below-cutover with
no cipher continues to pass through (legacy cleartext entries
remain readable).
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 9, 2026

@claude review

PR744 round-3: codex P1 Fail closed when cutover is active but cipher is nil を反映しました (commit 722aea5)。

Gap: raftCipher != nil && entry.Index > cutover() の AND-gate で、cipher=nil + cutover 設定済 (sidecar/init race) のとき above-cutover entry が silent fall-through → fsm.Apply に raw envelope bytes、setApplied advance → 他 peer (unwrap 成功側) と permanent FSM divergence。

修正: gate を if entry.Index > cutover() に変更、内側で raftCipher == nil を fail-closed (ErrRaftUnwrapFailed wrap, log で entry_index + cutover_index 出力)。applyCommitted の既存 halt 機構が runLoop fatal-error 経路で process exit。

caller audit:

  • applyNormalEntry caller は applyNormalCommitted のみ、新 ErrRaftUnwrapFailed surface は既存 tampered-envelope case と同じ halt 経路を通過。
  • 既存 TestApplyNormalEntry_NoCipher_PassThrough (cipher=nil + cutover=^uint64(0) inert default) は新 gate でも never-fire、挙動不変 → continue to pass。
  • Stage 6 の cluster-flag pipeline で cipher と cutover を atomic に wire する責任を明示 (commit message)。

新規テスト TestApplyNormalEntry_CutoverActive_NoCipher_FailsClosed:

  • above-cutover + cipher=nil → ErrRaftUnwrapFailed、FSM 非到達
  • below-cutover + cipher=nil → そのまま pass-through (legacy cleartext 互換)

make test/lint green。

@claude
Copy link
Copy Markdown

claude Bot commented May 9, 2026

Claude finished @bootjp's task in 1m 49s —— View job


Stage 3 Raft Envelope — Round 3 Review

Tasks:

  • Gather context and understand request
  • Read commit 722aea5 changes — verify gate change (raftCipher != nil removed from outer condition)
  • Verify fail-closed path: raftCipher == nil above-cutover → ErrRaftUnwrapFailed
  • Verify existing TestApplyNormalEntry_NoCipher_PassThrough still correct under new gate
  • Review new test TestApplyNormalEntry_CutoverActive_NoCipher_FailsClosed
  • Check logging output for diagnostic adequacy (entry_index + cutover_index)
  • Scan for any new issues introduced by the fix
  • Post final review

P1 Fix Verdict

Correct. The gap was real and the fix is the minimal, right restructuring.

The original gate:

if e.raftCipher != nil && entry.Index > e.raftCutoverIndex()

had the AND-short-circuit flaw: cipher=nil unconditionally bypassed the branch, silently sending raw envelope bytes to fsm.Apply on any above-cutover entry. A node in a sidecar/init race would advance setApplied with a different FSM state than peers that successfully unwrapped — permanent divergence with no observable error.

The fix correctly inverts the decomposition:

if entry.Index > e.raftCutoverIndex() {   // outer: "is this a post-cutover entry?"
    if e.raftCipher == nil {              // inner: "can we actually unwrap it?"
        // fail closed
    }
    // normal unwrap
}

The outer predicate answers "is decryption required for this entry?" and the inner answers "can we do it?" — which is the correct structure.


Error Wrapping Consistency

The nil-cipher fail-closed path uses errors.Wrap(ErrRaftUnwrapFailed, message). The tamper-detection path in unwrapRaftPayload uses errors.Wrap(errors.Mark(err, ErrRaftUnwrapFailed), message) to preserve the underlying ErrIntegrity. The distinction is correct: there's no encryption-layer error to wrap in the nil-cipher case (it's a configuration error, not a decryption failure), so errors.Wrap(ErrRaftUnwrapFailed, ...) is the right call. Both paths satisfy errors.Is(err, ErrRaftUnwrapFailed). ✓


raftCutoverIndex() Called Twice on the Log Path

Inside the fail-closed branch, e.raftCutoverIndex() is called once in the outer if gate and again in the slog.Error call. This is benign: the implementation contract requires lock-free reads (atomic.Uint64.Load), so the second call is safe. In theory the logged cutover value could differ from the gate-checked value (TOCTOU), but since the cutover index is set once during Stage 6 wiring and never decremented, this is purely diagnostic, not a correctness concern. Not worth a code change — worth knowing if you see a confusing log line during debugging.


New Test Coverage

TestApplyNormalEntry_CutoverActive_NoCipher_FailsClosed covers both sides of the contract correctly:

  • Index = cutover+1, cipher=nilErrRaftUnwrapFailed, fsm.calls == 0
  • Index = cutover, cipher=nil → pass-through, fsm.calls == 1 (strict > boundary) ✓

The below-cutover case uses Index = cutover (not cutover-1), which is the maximally-boundary value — exactly where an off-by-one in the gate would manifest. Good choice.

TestApplyNormalEntry_NoCipher_PassThrough still passes under the new gate: orInertCutover(nil) installs ^uint64(0), and 0 > ^uint64(0) is false, so the outer branch never fires. ✓


Stage 6 Contract Documented

The commit message explicitly states: "Stage 6's cluster-flag pipeline must continue to wire both atomically or expect this engine refusal." This is the right place for that constraint — commit messages and code comments are where future Stage 6 implementers will look. The engine now enforces the invariant defensively rather than trusting callers to get it right. ✓


Summary

Commit 722aea5 is a clean, minimal fix for the Codex P1 gap. The gate restructuring is correct, the error handling is consistent with the existing codebase pattern, the new test covers both arms of the fail-closed branch at the exact boundary, and no new issues were introduced. LGTM — ready to merge once CI is green.

Base automatically changed from feat/encryption-storage to main May 10, 2026 15:53
@bootjp bootjp merged commit 9303e52 into main May 10, 2026
5 checks passed
@bootjp bootjp deleted the feat/encryption-raft-envelope branch May 10, 2026 15:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant