Conversation

@JRBANCEL JRBANCEL commented Oct 1, 2025

Fixes #4256

This addresses an issue we observe often: pod metadata is not consistent.

Example

Rollout is done.

kubectl argo rollouts list rollouts -A
NAMESPACE      NAME                                         STRATEGY   STATUS        STEP   SET-WEIGHT  READY  DESIRED  UP-TO-DATE  AVAILABLE
platform       portal-app-rollout                           Canary     Healthy       10/10  100         2/2    2        2           2

Only a single ReplicaSet has replicas:

kubectl get replicaset -n platform | grep portal
...
portal-app-rollout-58ff85fcb                                          2         2         2       17h

With the correct spec:

spec:
  replicas: 2
  template:
    metadata:
      labels:
        ...
        role: stable

Yet, some pods don't have the expected labels:

kubectl get pods -n platform -l app=portal --show-labels         
NAME                                 READY   STATUS    RESTARTS   AGE   LABELS
portal-app-rollout-58ff85fcb-bk7kr   2/2     Running   0          17h   ...role=canary...
portal-app-rollout-58ff85fcb-tqd98   2/2     Running   0          17h   ...role=stable...

Reconciliation happens and does nothing to fix this:

time="2025-10-01T18:41:16Z" level=info msg="Started syncing rollout" generation=314 namespace=platform resourceVersion=398118957 rollout=portal-app-rollout
time="2025-10-01T18:41:16Z" level=info msg="Found 1 TrafficRouting Reconcilers" namespace=platform rollout=portal-app-rollout
time="2025-10-01T18:41:16Z" level=info msg="Reconciling TrafficRouting with type 'Istio'" namespace=platform rollout=portal-app-rollout
time="2025-10-01T18:41:16Z" level=info msg="No StableRS exists to reconcile or matches newRS" namespace=platform rollout=portal-app-rollout
time="2025-10-01T18:41:16Z" level=info msg="No Steps remain in the canary steps" namespace=platform rollout=portal-app-rollout
time="2025-10-01T18:41:16Z" level=info msg="No status changes. Skipping patch" generation=314 namespace=platform resourceVersion=398118957 rollout=portal-app-rollout
time="2025-10-01T18:41:16Z" level=info msg="Reconciliation completed" generation=314 namespace=platform resourceVersion=398118957 rollout=portal-app-rollout time_ms=61.302378

Why does it matter?

We have PDBs selecting on role=stable, and the mismatch between the ReplicaSet's replica count and the Pods actually matching that label leads to the PDB blocking any disruption.

Root Cause

This is because the code bails out early if the ReplicaSet metadata already matches the desired state. So if, for any reason (pod update failure, crash, OOM, etc.), one of the pods is not updated after the ReplicaSet is updated the first time, the controller never tries to rectify it and the inconsistent state persists.

Also, the race condition between the ReplicaSet update and the Pod listing is still present: it is entirely possible that the ReplicaSet controller is still reconciling the old spec and creates a Pod with the old metadata after the Pods have been listed.

This change ensures the logic that updates pod metadata always runs. Efficiency-wise this is mostly fine, because the current code already skips patching pods that are in the correct state, so in the standard case it adds at most one Pod list call.
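
For illustration, the idea can be sketched roughly as below using client-go. This is a minimal sketch, not the actual Argo Rollouts code; the function and parameter names (syncPodMetadata, desired, etc.) are made up for the example.

package podmetadata

import (
    "context"
    "encoding/json"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/kubernetes"
)

// syncPodMetadata lists the pods behind a ReplicaSet on every reconcile and
// patches only the ones whose labels diverge from the desired set.
// Hypothetical helper, not an Argo Rollouts identifier.
func syncPodMetadata(ctx context.Context, client kubernetes.Interface, namespace, selector string, desired map[string]string) error {
    pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
    if err != nil {
        return err
    }
    for i := range pods.Items {
        pod := &pods.Items[i]
        // Collect only the labels that are missing or wrong on this pod.
        delta := map[string]string{}
        for k, v := range desired {
            if pod.Labels[k] != v {
                delta[k] = v
            }
        }
        if len(delta) == 0 {
            // Pod is already consistent: no write, the list above is the only API call.
            continue
        }
        patch, err := json.Marshal(map[string]interface{}{
            "metadata": map[string]interface{}{"labels": delta},
        })
        if err != nil {
            return err
        }
        if _, err := client.CoreV1().Pods(namespace).Patch(ctx, pod.Name, types.StrategicMergePatchType, patch, metav1.PatchOptions{}); err != nil {
            return fmt.Errorf("patching pod %s: %w", pod.Name, err)
        }
    }
    return nil
}

Because consistent pods are skipped, the steady-state cost is a single Pod list per reconcile, and since the check reruns on every sync, a Pod created from a stale template by the race described above is eventually corrected.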

Testing

Asserting on reads doesn't seem to be an established pattern and has little value. It would require updating 80 tests, so it makes more sense to ignore it.

The duplication gate is broken and should be ignored here; besides, how does removing code lead to 100% duplication? 🤦


Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional with a list of types and scopes found here, (b) states what changed, and (c) suffixes the related issues number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I've signed my commits with DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My builds are green. Try syncing with master if they are not.
  • My organization is added to USERS.md.

@JRBANCEL JRBANCEL force-pushed the jrbancel/always-reconcile-pod-labels branch from 434294d to 0e4ec8f on October 1, 2025 23:12
github-actions bot commented Oct 2, 2025

Published E2E Test Results

  4 files    4 suites   9h 24m 35s ⏱️
117 tests  90 ✅  7 💤  20 ❌
684 runs  401 ✅ 28 💤 255 ❌

For more details on these failures, see this check.

Results for commit f427fec.

♻️ This comment has been updated with latest results.

github-actions bot commented Oct 2, 2025

Published Unit Test Results

2 370 tests   2 370 ✅  3m 3s ⏱️
  129 suites      0 💤
    1 files        0 ❌

Results for commit f427fec.

♻️ This comment has been updated with latest results.

@JRBANCEL JRBANCEL changed the title from "fix(controller): Always reconcile pod metadata. Fixes #4256" to "fix(controller): Always reconcile pod metadata" on Oct 2, 2025
@zachaller zachaller added this to the v1.9 milestone Oct 21, 2025
@zachaller zachaller force-pushed the jrbancel/always-reconcile-pod-labels branch from 74f40be to 30aa843 on October 21, 2025 23:30
@sonarqubecloud

Quality Gate failed

Failed conditions
41.4% Duplication on New Code (required ≤ 40%)

See analysis details on SonarQube Cloud

sonarqubecloud bot commented Dec 2, 2025

Quality Gate failed

Failed conditions
41.6% Duplication on New Code (required ≤ 40%)

See analysis details on SonarQube Cloud


Development

Successfully merging this pull request may close these issues.

Inconsistencies with pod metadata during progressive rollouts
