Skip to content

feat: replace soft tx cache with revalidation-based validation cache#744

Open
mpskowron wants to merge 1 commit intomainfrom
skowron/tx-revalidation
Open

feat: replace soft tx cache with revalidation-based validation cache#744
mpskowron wants to merge 1 commit intomainfrom
skowron/tx-revalidation

Conversation

@mpskowron
Copy link
Contributor

@mpskowron mpskowron commented Feb 22, 2026

Overview

Remove the soft transaction cache and introduce tx revalidation: cached VerifiedTransaction entries are reused when state changes by revalidating against a RevalidationReference instead of re-running full ZK proof verification. Adds cache metrics (miss, strict hit, revalidation hit) and tests covering the full validation lifecycle. Metrics are not backwards compatible - require grafana dashboards update.

Revalidation skips most costly part of transaction verification which is independent of ledger state - ZK proofs. This should be a significant performance improvement. We also rerun apply on every mempool revalidation now - this is slower than the previous solution (cache it once and never rerun in mempool), but from the other hand it lets us invalidate transaction from the mempool earlier.

🗹 TODO before merging

  • Benchmark performance
  • Ready

📌 Submission Checklist

  • Changes are backward-compatible (or flagged if breaking): grafana metrics need updating
  • Pull request description explains why the change is needed
  • Self-reviewed the diff
  • I have included a change file, or skipped for this reason:
  • If the changes introduce a new feature, I have bumped the node minor version
  • Update documentation (if relevant)
  • Updated AGENTS.md if build commands, architecture, or workflows changed
  • No new todos introduced

🧪 Testing Evidence

Please describe any additional testing aside from CI:

TBD

  • Additional tests are provided (if possible)

🔱 Fork Strategy

  • Node Runtime Update
  • Node Client Update
  • Other:
  • N/A

Links

Jira1: https://shielded.atlassian.net/browse/PM-21736
Jira2: https://shielded.atlassian.net/browse/PM-18691

@github-actions
Copy link
Contributor

kics-logo

KICS version: v2.1.16

Category Results
CRITICAL CRITICAL 0
HIGH HIGH 0
MEDIUM MEDIUM 96
LOW LOW 12
INFO INFO 83
TRACE TRACE 0
TOTAL TOTAL 191
Metric Values
Files scanned placeholder 31
Files parsed placeholder 31
Files failed to scan placeholder 0
Total executed queries placeholder 73
Queries failed to execute placeholder 0
Execution time placeholder 9

@mpskowron mpskowron force-pushed the skowron/tx-revalidation branch from 747cee2 to a55f163 Compare February 23, 2026 11:32

#[derive(PartialEq, Eq, Hash)]
pub struct StrictTxValidationKey {
state_hash: Hash,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Twox128 has no cryptographic collision resistance, so using it as a cache key seems risky. Was there a specific reason it was chosen over tx.hash(), or is replacing it with the transaction's own hash straightforwardly safe here?

Copy link
Contributor Author

@mpskowron mpskowron Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes here require the update of grafana dashboards, but I think they will give us much better insight on the performance impact of this change. Where can I update them?

tx_validation_key,
Arc::new(TxValidationValue {
verified_tx: verified_tx.clone(),
state: ledger.state.clone(),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ledger state is a bag of pointers, so caching it directly should be fine. However, I'm wondering - does keeping multiple clones of ledger states have an influence on ledger performance?


// Dry-run apply to validate guaranteed execution against current state
let ctx = ledger.get_transaction_context(block_context.clone())?;
let (_next_state, result) = ledger.state.apply(&verified_tx, &ctx);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, in case of a cache hit (even for old state) we didn't rerun apply. This might impact performance. Should we revert to previous behavior or always run it? (We can also just rerun it only on strict cache hit, but strict hits should almost never happen here)


static NEXT_SPEC_VERSION: AtomicU32 = AtomicU32::new(1_000_000);

fn unique_spec_version() -> u32 {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the cache is lazy static, thus shared between tests. In order for the test to have distinct cache entries we force different runtime version per test

Copy link
Contributor Author

@mpskowron mpskowron Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't figure out a better way to test the cache than reading prometheus metrics. I'm open to other ideas

   Remove the soft transaction cache and introduce tx revalidation:
   cached VerifiedTransaction entries are reused when state changes by
   revalidating against a RevalidationReference instead of re-running
   full ZK proof verification. Adds cache metrics (miss, strict hit,
   revalidation hit) and tests covering the full validation lifecycle.
@mpskowron mpskowron force-pushed the skowron/tx-revalidation branch from a55f163 to 0009021 Compare February 23, 2026 11:56
@mpskowron mpskowron marked this pull request as ready for review February 23, 2026 11:57
@mpskowron mpskowron requested a review from a team as a code owner February 23, 2026 11:57
///
/// Uses `tx_hash` only for quick revalidation of transactions already in the pool.
/// The soft cache prevents redundant ZK proof verification for mempool housekeeping.
fn revalidate_transaction(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tkerber could you please double check if we're introducing revalidation correctly in this PR?:

  1. On cache miss: run full well_formed(), cache the VerifiedTransaction + current LedgerState keyed by (tx_hash, runtime_version)
  2. On cache hit with a different state hash: run revalidate_transaction against (cached_state, new_state) instead of full verification — this covers all three entry points: mempool (validate_transaction), block proposal (those are all possible places, aren't they?)
  3. On success: update the cache entry with the new VerifiedTransaction + new LedgerState, so subsequent state changes can revalidate incrementally again

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant