Requeue on cancellation to avoid stale cache race condition#1625

Merged
matheuscscp merged 1 commit into main from requeue-on-cancelation on Mar 30, 2026

Conversation

@stefanprodan (Member) commented Mar 29, 2026

Handle health check cancellation by requeuing immediately to ensure the object is reconciled against the new revision in the event of a stale runtime cache that would cause the source predicate filter to drop the reconcile request.

Note that there is no proof that the request gets dropped from the queue after the cancellation is triggered, so this "fix" may not do anything. I was not able to reproduce this issue in envtest as the cache doesn't become stale (that requires a flaky network or etcd pressure). The scenario I'm relying on is:

  Rapid commits cause a stale-cache scenario:

  1. 124a229e reconcile → health check → cancelled by V_new dispatch
  2. Old code: ctrl.Result{}, nil. Dispatch item in queue.
  3. Worker picks up dispatch item → getSource → stale cache → 124a229e → build/apply (no-op)
  4. During build/apply, ANOTHER dispatch fires (V_newer) → interrupt context cancelled
  5. checkHealth → already cancelled → returns *QueueEventSource
  6. Reconcile() → HealthCheckCanceled → ctrl.Result{}, nil (old code)
  7. V_newer dispatch item in queue
  8. Repeat from step 3... until eventually no more dispatches, but by then ctrl.Result{}, nil means no requeue
  9. STUCK
                                                          
The rapid commits keep cancelling each stale-cache reconcile. Eventually the commits stop, but the LAST cancellation returns ctrl.Result{}, nil with no requeue.
The source cache has caught up, but nothing triggers a reconcile.
                                                                                                                                                                             
Requeue: true breaks the cycle: even after the last cancellation, a requeue fires, the source cache has V_final, the reconcile applies it, and health checks pass.

PS. In the case where the cache is fresh and the object is already in the queue, the new reconcile request will be dropped by the controller-runtime dedupe logic, so we don't risk reconciling the same revision twice.

Preview image for testing: ghcr.io/fluxcd/kustomize-controller:rc-203b42e7

Fix: #1624

Signed-off-by: Stefan Prodan <stefan.prodan@gmail.com>
// of stale runtime cache that would cause the source predicate filter to drop the reconcile request.
// In the case where the cache is fresh and the object is already in the queue, the new reconcile request
// will be dropped by the controller-runtime dedupe logic, so we don't risk reconciling the same revision twice.
return ctrl.Result{Requeue: true}, nil
@matheuscscp (Member) commented Mar 29, 2026

Suggested change:
- return ctrl.Result{Requeue: true}, nil
+ return ctrl.Result{RequeueAfter: 1}, nil

Requeue: true is deprecated; the equivalent is RequeueAfter: 1.

Last night I also ran tons of tests with and without this fix. I never observed anything being reconciled twice after the cancellation, and never saw anything get stuck. Without the fix, I was also able to reproduce the issue and managed to make some Kustomizations get stuck.

@stefanprodan (Member Author) commented:

Let's go with Requeue: true in this PR as it's used everywhere. In Flux 2.9 we could replace it in all controllers.

@matheuscscp (Member) commented:

Last night I also asked Opus 4.6 to root-cause the issue (which we managed to reproduce). It found a conceptually more correct fix that involves entirely replacing the default event-sourcing logic of controller-runtime. That fix is very complicated, and I'd rather not implement it in a patch release (or at all). I prefer the fix in this PR a lot more: I tested it as well, and I never saw any double reconciliation for any object, nor anything getting stuck.

Details of the root cause found by Opus:
❯ now we need to root cause it                                                                       

● Let me dig into the WrapReconciler code in fluxcd/pkg/runtime/controller — that's where the
  cancellation and enqueueing happens.

  Searched for 2 patterns, read 4 files

● This is key. Let me read the full filter logic and the reconcile function's status patching.

  Read 2 files

● Let me look at finalizeStatus and when Reconciling is set/cleared.

  Searched for 1 pattern, read 1 file

● Let me check when LastAttemptedRevision is set and what HasRevision does.

  Searched for 2 patterns, read 1 file

● Let me look at how the For() handler and predicate interact, and check ReconcileRequestedPredicate.

  Searched for 1 pattern, read 2 files

● Now let me check exactly how controller-runtime calls this for Update events — this is the key.

  Searched for 1 pattern

● Let me read the exact version used by this project.

  Searched for 1 pattern, read 1 file

● There it is. Lines 108-110:

  reqs := map[request]empty{}
  e.mapAndEnqueue(ctx, q, evt.ObjectOld, reqs, lowPriority)
  e.mapAndEnqueue(ctx, q, evt.ObjectNew, reqs, lowPriority)

  And mapAndEnqueue (lines 140-152):
  for _, req := range e.toRequests(ctx, o) {
      _, ok := reqs[req]
      if !ok {
          q.Add(req)
          reqs[req] = empty{}
      }
  }

  For an OCIRepository Update event (v4→v5), the map function is called twice: once with the old
  OCIRepository (artifact=v4) and once with the new (artifact=v5). The reqs map deduplicates — if a
  Kustomization was already returned by the ObjectOld call, the ObjectNew call's q.Add is skipped.

  But our wrapper's dispatch() runs inside e.toRequests() — before the requests are returned. So both
   calls dispatch (cancel), but only the first call's q.Add actually fires.

  Here's the race:

  1. ObjectOld call (artifact=v4): The wrapper calls requestsForRevisionChangeOf with the old
  OCIRepo. For a Kustomization X that was health-checking v3:
    - Filter: IsReconciling=true, HasRevision(v3) against v4 → mismatch → included
    - dispatch(X) → cancels X's health check context, removes listener
    - Request for X returned
    - mapAndEnqueue: X not in reqs → q.Add(X), add X to reqs
  2. Worker goroutine (concurrent): X detects cancelled context, returns ctrl.Result{}, nil,
  finalizeStatus patches status, Done(X) runs. X was marked dirty by q.Add(X) → re-queued. Worker
  immediately picks up X → registerListener(X) → new listener registered.
  3. ObjectNew call (artifact=v5): The wrapper calls requestsForRevisionChangeOf with the new
  OCIRepo:
    - Lists Kustomizations, finds X
    - Filter: IsReconciling=true, HasRevision(v3 or v4) against v5 → mismatch → included
    - dispatch(X) → finds the new listener from step 2 → cancels X again
    - Request for X returned
    - mapAndEnqueue: X IS in reqs → q.Add(X) is SKIPPED
  4. Worker goroutine: X detects second cancellation, returns ctrl.Result{}, nil, Done(X). X was not
  marked dirty (no q.Add happened since Get). X is removed from processing. X is not in the queue.
  STUCK.

  The race window for step 2 is between q.Add(X) in step 1 and dispatch(X) in step 3. That window
  includes processing remaining items in the ObjectOld loop plus the List call and iteration in the
  ObjectNew loop — microseconds to low milliseconds. An idle worker can pick up X and register a
  listener in that time.

  This explains why:
  - More Kustomizations → wider race window → more likely to hit
  - Operators with CRDs (external-secrets, cloudnative-pg) → longer health checks → more items
  concurrently reconciling
  - Rapid commits → multiple source events → more opportunities for the race

  The RequeueAfter fix is correct — it ensures the second cancellation in step 3 returns
  RequeueAfter: 30s, so even though q.Add was skipped, the object is re-queued via q.AddAfter.

✻ Crunched for 13m 47s

❯ so the bug is in controller-runtime? can you think of a fix?                                       

● The bug is in the interaction between controller-runtime's reqs deduplication and our wrapper's
  dispatch(). The fix belongs in fluxcd/pkg/runtime/controller — we should stop wrapping
  handler.EnqueueRequestsFromMapFunc and instead implement a custom EventHandler.

  The problem: controller-runtime's Update() calls the map function twice (ObjectOld, ObjectNew) and
  deduplicates q.Add via a shared reqs map. But our dispatch() runs inside the map function, so it
  fires twice while q.Add fires once. If a worker re-registers a listener between the two calls, the
  second dispatch cancels it with no corresponding q.Add.

  The fix: implement our own EventHandler that pairs every dispatch() with a q.Add() — no
  deduplication between ObjectOld and ObjectNew. Double q.Add is harmless (the workqueue deduplicates
   internally by marking items dirty).

  Read 1 file

● The current wrapper:

  func (r *reconcilerWrapper) EnqueueRequestsFromMapFunc(
      objKind string, fn handler.MapFunc) handler.EventHandler {
      return handler.EnqueueRequestsFromMapFunc(func(...) []ctrl.Request {
          reqs := fn(ctx, obj)
          for _, req := range reqs {
              r.hooks.dispatch(...)  // cancels inside toRequests
          }
          return reqs  // controller-runtime deduplicates q.Add via reqs map
      })
  }

  The fix: return a custom EventHandler instead of wrapping handler.EnqueueRequestsFromMapFunc:

  func (r *reconcilerWrapper) EnqueueRequestsFromMapFunc(
      objKind string, fn handler.MapFunc) handler.EventHandler {
      return &dispatchAndEnqueueHandler{
          mapFunc: fn,
          hooks:   r.hooks,
          objKind: objKind,
      }
  }

  type dispatchAndEnqueueHandler struct {
      mapFunc handler.MapFunc
      hooks   *queueHooks
      objKind string
  }

  func (h *dispatchAndEnqueueHandler) dispatchAndEnqueue(
      ctx context.Context, q workqueue.RateLimitingInterface, obj client.Object) {
      reqs := h.mapFunc(ctx, obj)
      payload := queueEventPayload{source: QueueEventSource{
          Kind:      h.objKind,
          Name:      obj.GetName(),
          Namespace: obj.GetNamespace(),
          ...
      }}
      for _, req := range reqs {
          h.hooks.dispatch(queueEvent{queueEventObjectEnqueued, req}, payload)
          q.Add(req)  // always Add — no cross-call deduplication
      }
  }

  func (h *dispatchAndEnqueueHandler) Update(ctx context.Context,
      evt event.UpdateEvent, q workqueue.RateLimitingInterface) {
      h.dispatchAndEnqueue(ctx, q, evt.ObjectOld)
      h.dispatchAndEnqueue(ctx, q, evt.ObjectNew)
  }

  // Create, Delete, Generic: same pattern, single call

  Each dispatch() is immediately followed by q.Add() in the same loop iteration. No shared reqs map.
  If the ObjectOld call dispatches+enqueues X, and a worker picks up X and re-registers a listener,
  the ObjectNew call dispatches X again AND calls q.Add(X) which marks X dirty. The worker's second
  cancellation is now backed by a queue entry.

  The double q.Add for the same request is harmless:
  - If X is in the queue: no-op (already queued)
  - If X is being processed: marks dirty → re-queued after Done()
  - If X is neither: added fresh

@stefanprodan stefanprodan added the bug and backport:release/v1.8.x labels Mar 29, 2026
@stefanprodan stefanprodan marked this pull request as ready for review March 29, 2026 10:46
@stefanprodan (Member Author) commented:

@matheuscscp this fix is not needed in helm-controller since it had Requeue: true from the start: https://github.com/fluxcd/helm-controller/blob/8a0f9d044fbec659f88db3e60a028901ab42b7ec/internal/controller/helmrelease_controller.go#L433

This also explains why users only see this issue in kustomize-controller.

@matheuscscp (Member) commented:

Huh! I wonder why HC already had it 🤔 Maybe AI saw it would be needed while writing tests...

@stefanprodan (Member Author) commented:

It's not possible to reproduce this in envtest; all my attempts failed because the latency is almost zero and the cache syncs in real time. The Requeue: true is from copy/pasting the conventions in HC.

@matheuscscp (Member) left a comment:

LGTM 🙌

@matheuscscp matheuscscp merged commit 877e664 into main Mar 30, 2026
11 checks passed
@matheuscscp matheuscscp deleted the requeue-on-cancelation branch March 30, 2026 11:49
@fluxcdbot (Member) commented:

Successfully created backport PR for release/v1.8.x:


Labels

backport:release/v1.8.x (To be backported to release/v1.8.x), bug (Something isn't working)

Development

Successfully merging this pull request may close these issues.

Specific Kustomizations hang with "New reconciliation triggered"

3 participants