Skip to content

fix: harden supervisor recovery and stuck scan#207

Open
liobrasil wants to merge 4 commits intoholyfuchs/supervisor-fixfrom
lionel/fix-supervisor-scan-recovery
Open

fix: harden supervisor recovery and stuck scan#207
liobrasil wants to merge 4 commits intoholyfuchs/supervisor-fixfrom
lionel/fix-supervisor-scan-recovery

Conversation

@liobrasil
Copy link
Contributor

Summary

  • add the final supervisor regression coverage for duplicate recoveries and mixed recurring/non-recurring scan sets
  • fix duplicate recovery churn by treating in-flight executed recurring transactions as active until rebalance state advances
  • restrict stuck-scan ordering to recurring participants and lazily prune stale non-recurring entries during candidate walks
  • align scheduler docs/comments with the recurring-only scan behavior

Verification

  • flow test cadence/tests/scheduler_mixed_population_regression_test.cdc
  • flow test cadence/tests/scheduled_supervisor_test.cdc

return true
}

if status == FlowTransactionScheduler.Status.Executed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same issue we faced in onflow/FlowYieldVaultsEVM#70
The problem with the fix is that if the transaction panics, lastRebalanceTimestamp is not updated. This makes the rebalancer permanantely stuck because it's Executed and the lastRebalanceTimestamp was never updated.
You might want to consider a grace period based fix.

Comment on lines +87 to +95
/// A transaction is considered active when it is:
/// - still `Scheduled`, or
/// - already marked `Executed` by FlowTransactionScheduler, but the AutoBalancer has not
/// yet advanced its last rebalance timestamp past that transaction's scheduled time.
///
/// The second case matters because FlowTransactionScheduler flips status to `Executed`
/// before the handler actually runs. Without treating that in-flight window as active,
/// the Supervisor can falsely classify healthy vaults as stuck and recover them twice.
///
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain this?
I am not sure I understand.
The status can be Executed even though it is not executed?

yieldVaultID: uniqueID.id,
handlerCap: handlerCap,
scheduleCap: scheduleCap,
participatesInStuckScan: recurringConfig != nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The contract mentions its:
A registry of all yield vault IDs that participate in scheduled rebalancing

Would we ever have an instance where this is not the case?
To me it seems the better approach to ensure that SchedulerRegistry only has Vaults that are getting scheduled and we shouldn't add the ones which aren't.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants