Enhancement: Add drift detection and automatic reconciliation #668
eshulman2 wants to merge 1 commit into k-orc:main from
Conversation
mandre left a comment
What part of the code needs changing? I'd expect us to detail how `shouldReconcile` changes.
enhancements/drift-detection.md
Outdated
Similar Kubernetes controllers for cloud resources have implemented drift detection:

- **AWS Controllers for Kubernetes (ACK)**: Uses a 10-hour default resync period for drift recovery
It would be good to detail the design choices of both ACK and ASO (beyond their default resync periods), in case we want to take inspiration from these projects.
Added a short reference.
- **Real-time drift detection**: Event-driven detection of changes (would require OpenStack webhooks or very short polling intervals)
- **Drift reporting without correction**: Alerting on drift without taking corrective action (future enhancement)
- **Selective field reconciliation**: Allowing some fields to drift while correcting others
Agreed that selective reconciliation is out of scope. In practice, we will likely have it anyway if we allow immutable (from our API standpoint) fields to drift while correcting mutable ones.
True. Even if someone deleted the resource (outside of ORC's scope) and re-created it with different immutable fields, there is not much we can do from a drift detection perspective, meaning this is more "partial" than "selective".
- **Conflict resolution with merge semantics**: Merging external changes with desired state
- **Drift detection for unmanaged resources**: Unmanaged resources are explicitly not modified by ORC
Does it mean we never refresh an external resource after it's been imported? While we don't want to correct drift on the OpenStack resource itself, perhaps we still want to correct our view of the system (to be able to report correct info in the object status)?
2. **Fetch**: On resync, ORC fetches the current state of the OpenStack resource
3. **Compare**: The current state is compared against the desired state in the Kubernetes spec
4. **Update**: If drift is detected, ORC updates the OpenStack resource to match the desired state
How is it different from our normal reconciles?
It is actually not that different from a normal reconcile; the main difference is the triggering. I will add a sentence clarifying it.
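For illustration only, a minimal controller-runtime sketch of the "same reconcile, different trigger" idea: the drift path just adds a periodic requeue on top of the existing reconcile logic. The `Reconciler` struct, `resyncPeriod` field and `reconcileNormally` helper are hypothetical names, not existing ORC code.

```go
package controllers

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// Reconciler stands in for an ORC resource controller; the resyncPeriod field
// and reconcileNormally helper are illustrative only.
type Reconciler struct {
	resyncPeriod time.Duration // e.g. 10h; 0 would mean no periodic resync
}

func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// The usual fetch/compare/update logic runs here, exactly as it would for
	// any other trigger (spec change, controller restart, ...). Drift is
	// corrected by the same code path.
	if err := r.reconcileNormally(ctx, req); err != nil {
		return ctrl.Result{}, err
	}

	// The only drift-detection-specific part: schedule the next periodic
	// resync so the same reconcile runs again even without a Kubernetes event.
	return ctrl.Result{RequeueAfter: r.resyncPeriod}, nil
}

// reconcileNormally is a placeholder for the existing reconcile implementation.
func (r *Reconciler) reconcileNormally(ctx context.Context, req ctrl.Request) error {
	return nil
}
```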
enhancements/drift-detection.md
Outdated
**Mitigation**:
- Conservative 10-hour default resync period
- Add random jitter to resync times to avoid thundering herd
What would that look like?
Could you describe what you mean by "thundering herd"? Don't we already have this problem when restarting ORC?
I mean that if many resources were created at once, they will all try to re-reconcile at T+10h (at the same time), which might flood the OpenStack API and cause slowness or other issues. The point of adding random jitter is to have resources created together start their drift detection at T+10h plus a random offset, which is safer. Since the suggested implementation only reschedules, adding the jitter should be as simple as adding a random number to the re-queue request (see the sketch after this reply). Added a note on that.
Regarding the restart issue, I assume it is the same conceptual problem, but unfortunately the suggested solution won't solve it on startup. We should consider a different approach for startup, like the one in ASO that limits the number of concurrent reconciles via MAX_CONCURRENT_RECONCILES.
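As a rough sketch of what the jitter could look like; the helper name and the 10% jitter fraction are assumptions, not part of the proposal:

```go
package controllers

import (
	"math/rand"
	"time"
)

// requeueWithJitter spreads resyncs of resources that were created together so
// they do not all hit the OpenStack API at exactly T+resyncPeriod. The 10%
// jitter fraction is an arbitrary illustrative choice.
func requeueWithJitter(resyncPeriod time.Duration) time.Duration {
	maxJitter := resyncPeriod / 10
	return resyncPeriod + time.Duration(rand.Int63n(int64(maxJitter)+1))
}
```

The returned duration would simply replace the plain resync period in the re-queue request (`ctrl.Result{RequeueAfter: ...}`).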
enhancements/drift-detection.md
Outdated
**Risk**: Frequent reconciliation increases CPU and memory usage on the ORC controller.

**Mitigation**:
- Implement hash-based comparison: compute a hash of the OpenStack resource state and store it in `status.observedStateHash`. Only proceed with update operations if the hash differs from the previous reconciliation.
I'm not clear exactly what problem the hash would solve.
The idea with the hash is to keep the old hash (from the last reconcile) in a status field. Instead of triggering reconciliation immediately, we get the resource from OpenStack, compute its hash, and if it matches the existing one we do not start a full reconciliation. This is an optional mechanism I was thinking of adding to reduce the number of full reconciliation cycles we run and the field-by-field comparison done on update. It is not a must, but it should save us time and resources on the controller side. Updated the doc to make it clearer.
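A minimal sketch of that optional check, assuming JSON marshalling plus SHA-256 as the hashing scheme; `status.observedStateHash` is the status field proposed in the doc, while the helper names are illustrative:

```go
package controllers

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// hashObservedState computes a stable hash of the observed OpenStack resource
// state. Storing the result in a status field such as status.observedStateHash
// lets the next resync skip the full reconcile when nothing changed on the
// OpenStack side.
func hashObservedState(observed any) (string, error) {
	// For a fixed Go struct definition, json.Marshal produces a deterministic
	// encoding, which is enough for an equality check between reconciles.
	raw, err := json.Marshal(observed)
	if err != nil {
		return "", fmt.Errorf("marshalling observed state: %w", err)
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:]), nil
}

// needsFullReconcile reports whether the observed state differs from the hash
// recorded at the previous reconcile, and returns the new hash to store.
func needsFullReconcile(observed any, previousHash string) (bool, string, error) {
	newHash, err := hashObservedState(observed)
	if err != nil {
		return false, "", err
	}
	return newHash != previousHash, newHash, nil
}
```

If `needsFullReconcile` reports no change, the controller would only re-queue the next resync instead of running the full compare/update path.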
enhancements/drift-detection.md
Outdated
Detect and report drift without automatically correcting it.

**Rejected because**: Adds operational burden requiring human intervention. Could be added as a separate management policy option in the future.
While I agree that drift detection without correction is currently out of scope, I disagree with the rejection reason given: notification that something changed under us is still better than no notification at all. It's just not the top-most priority, and we'd better focus on correcting drift first.
Although I agree this is useful for users, I believe it might be better addressed in a separate effort for alerting (or something similar). The idea is not to reject reporting in general, just to keep it out of the scope of this enhancement. I'll add a clarification on the reasoning.
### Resource Recreation on External Deletion

When a managed resource is deleted from OpenStack but the ORC object still exists:
What happens if `resyncPeriod: 0` (no drift detection)? Will it still be a Terminal Error as it is today, or will we try to re-create the resource?
I would say that if drift detection is disabled we should probably keep the behavior as-is, to avoid additional API changes and keep it predictable.
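A sketch of how keeping the behaviour unchanged could look, assuming `resyncPeriod` maps to the proposed API field and `0` means drift detection is disabled; the helper is hypothetical:

```go
package controllers

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// nextResync keeps the current behaviour when drift detection is disabled:
// with resyncPeriod set to 0 no periodic requeue is scheduled, so an externally
// deleted resource is handled exactly as it is today rather than being
// silently re-created.
func nextResync(resyncPeriod time.Duration) ctrl.Result {
	if resyncPeriod == 0 {
		// Drift detection disabled: no requeue, behaviour unchanged.
		return ctrl.Result{}
	}
	return ctrl.Result{RequeueAfter: resyncPeriod}
}
```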
Proposal for drift detection feature.
3134799 to a9f9abf