Requeue on cancellation to avoid stale cache race condition#1625
Requeue on cancellation to avoid stale cache race condition#1625matheuscscp merged 1 commit intomainfrom
Conversation
Signed-off-by: Stefan Prodan <stefan.prodan@gmail.com>
| // of stale runtime cache that would cause the source predicate filter to drop the reconcile request. | ||
| // In the case where the cache is fresh and the object is already in the queue, the new reconcile request | ||
| // will be dropped by the controller runtime dedupe logic, so we don't risk reconciling twice the same revision. | ||
| return ctrl.Result{Requeue: true}, nil |
There was a problem hiding this comment.
| return ctrl.Result{Requeue: true}, nil | |
| return ctrl.Result{RequeueAfter: 1}, nil |
Requeue: true is deprecated, the equivalent is RequeueAfter: 1
Last night I also ran tons of tests with and without this fix. I never observed something being reconciled twice after the cancellation and never saw anything get stuck. Without a fix, I was also able to reproduce the issue and managed to make some Kustomizations get stuck.
There was a problem hiding this comment.
Let's go with Requeue: true in this PR as it's used everywhere. In Flux 2.9 we could replace it in all controllers.
|
Last night I also asked Opus 4.6 to try and root-cause the issue (which we managed to reproduce). It found a conceptually more correct fix that involves entirely replacing the default event sourcing logic of controller-runtime. The fix is very complicated and I'd rather not implement it in a patch release (or at all). I prefer the fix of this PR a lot more, because I also tested it and I never saw any double reconciliation for any objects, plus nothing got stuck. Click below to see the details of the root cause found by Opus. Details |
|
@matheuscscp this fix is not need in helm-controller since it had This also explains why users only see this in kustomize-controller |
|
Huh! I wonder why HC already had it 🤔 Maybe AI saw it would be needed while writing tests... |
|
It's not possible to reproduce this in envtest, all my attempts failed because the latency is almost zero, the cache syncs in real-time. The |
|
Successfully created backport PR for |
Handle health check cancellation by requeuing immediately to ensure the object is reconciled against the new revision in the eventuality of stale runtime cache that would cause the source predicate filter to drop the reconcile request.
Note that there is no proof that the requests gets dropped from the queue after the cancelation is triggered, so this "fix" may not do anything. I was not able to reproduce in envtest this issue as the cache doesn't become stale (this requires flaky network or etcd pressure). The scenario I'm relying on is:
PS. In the case where the cache is fresh and the object is already in the queue, the new reconcile request will be dropped by the controller runtime dedupe logic, so we don't risk reconciling twice the same revision.
Preview image for testing:
ghcr.io/fluxcd/kustomize-controller:rc-203b42e7Fix: #1624