Skip to content

Enhancement: Add drift detection and automatic reconciliation#668

Open
eshulman2 wants to merge 1 commit intok-orc:mainfrom
eshulman2:drift-d-proposal
Open

Enhancement: Add drift detection and automatic reconciliation#668
eshulman2 wants to merge 1 commit intok-orc:mainfrom
eshulman2:drift-d-proposal

Conversation

@eshulman2
Copy link
Contributor

Proposal for drift detection feature.

@github-actions github-actions bot added the semver:patch No API change label Feb 3, 2026
Copy link
Collaborator

@mandre mandre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What part of the code needs changing? I expect we detail how shouldReconcile changes.


Similar Kubernetes controllers for cloud resources have implemented drift detection:

- **AWS Controllers for Kubernetes (ACK)**: Uses a 10-hour default resync period for drift recovery
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to detail the design choice of both ACK and ASO (apart from their default resync period), in case we want to take inspiration from these projects.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a short referance


- **Real-time drift detection**: Event-driven detection of changes (would require OpenStack webhooks or very short polling intervals)
- **Drift reporting without correction**: Alerting on drift without taking corrective action (future enhancement)
- **Selective field reconciliation**: Allowing some fields to drift while correcting others
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that this selective reconciliation is out of scope, in practice, we will likely have it if we allow immutable (from our API standpoint) fields to drift, while correcting mutable ones.

Copy link
Contributor Author

@eshulman2 eshulman2 Feb 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true, even if someone deleted (outside of the ORC scope) the resource and re-created it with different immutable fields there is not much we can do from drift detection perspective meaning this is more "partial" then "selective".

- **Drift reporting without correction**: Alerting on drift without taking corrective action (future enhancement)
- **Selective field reconciliation**: Allowing some fields to drift while correcting others
- **Conflict resolution with merge semantics**: Merging external changes with desired state
- **Drift detection for unmanaged resources**: Unmanaged resources are explicitly not modified by ORC
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it mean we never refresh an external resource after it's been imported? While we don't want to correct drift on the OpenStack resource itself, perhaps we still want to correct our view of the system (to be able to report correct info in the object status) ?

Comment on lines +59 to +61
2. **Fetch**: On resync, ORC fetches the current state of the OpenStack resource
3. **Compare**: The current state is compared against the desired state in the Kubernetes spec
4. **Update**: If drift is detected, ORC updates the OpenStack resource to match the desired state
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is it different from our normal reconciles?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is actually not that different from a normal reconcile the main difference is the triggering, I will add a sentence clarifying it.


**Mitigation**:
- Conservative 10-hour default resync period
- Add random jitter to resync times to avoid thundering herd
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would that look like?
Could you describe what you mean by thundering herd? Don't we already have the problem when restarting ORC?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean that in case that many resources were created at once and will try to re-reconcile in T+10H (all at the same time) it might flood the OSP API possibly causing slowness or other issues. The point in adding a random jitters is to have resources created together start their drift detection in T+10H+random number making it safer. As we only reschedule with the suggested implementation adding the jitter should be as simple as adding a random number to the re-queue request. added a note on that.

Regarding the restart issueI assume it is the same conceptual issue but unfortunately the suggested solution won't solve it on startup. We should consider possibly a different approach for this issue on startup like the one in ASO limiting the number of concurrent reconciles MAX_CONCURRENT_RECONCILES.

**Risk**: Frequent reconciliation increases CPU and memory usage on the ORC controller.

**Mitigation**:
- Implement hash-based comparison: compute a hash of the OpenStack resource state and store it in `status.observedStateHash`. Only proceed with update operations if the hash differs from the previous reconciliation.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not clear exactly what problem the hash would solve.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the idea with the hash is to have the old hash (from the last reconcile) kept in a status field and instead of triggering reconciliation immediately get the resource from openstack, compare the hash and if similar to the existing one do not start a full reconciliation this is an optional mechanism I was thinking on adding to reduce the full reconciliation cycles we commit and the value to value comparison done in the update. this is not a must but it should save us time and resources on the controller side. updated the doc to make it clearer.


Detect and report drift without automatically correcting it.

**Rejected because**: Adds operational burden requiring human intervention. Could be added as a separate management policy option in the future.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I agree we that drift detection without correction is currently out of scope, I disagree about the rejection reason given: notification that something changed under us is still better than no notification at all. It's just that it's not the top-most priority and we'd better focus on correcting drift first.

Copy link
Contributor Author

@eshulman2 eshulman2 Feb 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I agree this is useful for user I believe this might be better addressed in a separate effort for alerting (or something similar). The idea is not to reject the general idea of reporting it, just reject it for the scope of this enhancement. I'll add a clarification on the reasoning.


### Resource Recreation on External Deletion

When a managed resource is deleted from OpenStack but the ORC object still exists:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if resyncPeriod: 0 (no drift detection)? Will it still be a Terminal Error as it is today, or will be try to re-create the resource?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say that if drift detection is disabled we should probably keep the behavior as is to avoid additional api changes and make it expectable

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

semver:patch No API change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants