Enhancement: Add drift detection and automatic reconciliation #668
eshulman2 wants to merge 1 commit into k-orc:main from
Conversation
mandre left a comment
What part of the code needs changing? I'd expect us to detail how `shouldReconcile` changes.
enhancements/drift-detection.md
Outdated
Similar Kubernetes controllers for cloud resources have implemented drift detection:

- **AWS Controllers for Kubernetes (ACK)**: Uses a 10-hour default resync period for drift recovery
It would be good to detail the design choices of both ACK and ASO (beyond their default resync periods), in case we want to take inspiration from these projects.
Added a short reference.
- **Real-time drift detection**: Event-driven detection of changes (would require OpenStack webhooks or very short polling intervals)
- **Drift reporting without correction**: Alerting on drift without taking corrective action (future enhancement)
- **Selective field reconciliation**: Allowing some fields to drift while correcting others
Agreed that selective reconciliation is out of scope. In practice, we will likely have it anyway if we allow immutable (from our API standpoint) fields to drift while correcting mutable ones.
True. Even if someone deleted the resource (outside of ORC's scope) and re-created it with different immutable fields, there is not much we can do from a drift detection perspective, meaning this is more "partial" than "selective".
- **Conflict resolution with merge semantics**: Merging external changes with desired state
- **Drift detection for unmanaged resources**: Unmanaged resources are explicitly not modified by ORC
Does it mean we never refresh an external resource after it's been imported? While we don't want to correct drift on the OpenStack resource itself, perhaps we still want to correct our view of the system (to be able to report correct info in the object status)?
2. **Fetch**: On resync, ORC fetches the current state of the OpenStack resource
3. **Compare**: The current state is compared against the desired state in the Kubernetes spec
4. **Update**: If drift is detected, ORC updates the OpenStack resource to match the desired state
How is it different from our normal reconciles?
It is actually not that different from a normal reconcile; the main difference is the triggering. I will add a sentence clarifying it.
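For illustration only, a minimal controller-runtime sketch of the "same reconcile, different trigger" idea: the drift path just adds a periodic requeue on top of the existing reconcile logic. The `Reconciler` struct, `resyncPeriod` field and `reconcileNormally` helper are hypothetical names, not existing ORC code.

```go
package controllers

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// Reconciler stands in for an ORC resource controller; the resyncPeriod field
// and reconcileNormally helper are illustrative only.
type Reconciler struct {
	resyncPeriod time.Duration // e.g. 10h; 0 would mean no periodic resync
}

func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// The usual fetch/compare/update logic runs here, exactly as it would for
	// any other trigger (spec change, controller restart, ...). Drift is
	// corrected by the same code path.
	if err := r.reconcileNormally(ctx, req); err != nil {
		return ctrl.Result{}, err
	}

	// The only drift-detection-specific part: schedule the next periodic
	// resync so the same reconcile runs again even without a Kubernetes event.
	return ctrl.Result{RequeueAfter: r.resyncPeriod}, nil
}

// reconcileNormally is a placeholder for the existing reconcile implementation.
func (r *Reconciler) reconcileNormally(ctx context.Context, req ctrl.Request) error {
	return nil
}
```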
enhancements/drift-detection.md
Outdated
**Mitigation**:
- Conservative 10-hour default resync period
- Add random jitter to resync times to avoid thundering herd
What would that look like?
Could you describe what you mean by "thundering herd"? Don't we already have this problem when restarting ORC?
I mean that if many resources were created at once, they will all try to re-reconcile at T+10h (at the same time), which might flood the OpenStack API and cause slowness or other issues. The point of adding random jitter is to have resources created together start their drift detection at T+10h plus a random offset, which is safer. Since the suggested implementation only reschedules, adding the jitter should be as simple as adding a random number to the re-queue request (see the sketch after this reply). Added a note on that.
Regarding the restart issue, I assume it is the same conceptual problem, but unfortunately the suggested solution won't solve it on startup. We should consider a different approach for startup, like the one in ASO that limits the number of concurrent reconciles via MAX_CONCURRENT_RECONCILES.
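As a rough sketch of what the jitter could look like; the helper name and the 10% jitter fraction are assumptions, not part of the proposal:

```go
package controllers

import (
	"math/rand"
	"time"
)

// requeueWithJitter spreads resyncs of resources that were created together so
// they do not all hit the OpenStack API at exactly T+resyncPeriod. The 10%
// jitter fraction is an arbitrary illustrative choice.
func requeueWithJitter(resyncPeriod time.Duration) time.Duration {
	maxJitter := resyncPeriod / 10
	return resyncPeriod + time.Duration(rand.Int63n(int64(maxJitter)+1))
}
```

The returned duration would simply replace the plain resync period in the re-queue request (`ctrl.Result{RequeueAfter: ...}`).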
enhancements/drift-detection.md
Outdated
**Risk**: Frequent reconciliation increases CPU and memory usage on the ORC controller.

**Mitigation**:
- Implement hash-based comparison: compute a hash of the OpenStack resource state and store it in `status.observedStateHash`. Only proceed with update operations if the hash differs from the previous reconciliation.
I'm not clear exactly what problem the hash would solve.
The idea with the hash is to keep the old hash (from the last reconcile) in a status field. Instead of triggering reconciliation immediately, we get the resource from OpenStack, compute its hash, and if it matches the existing one we do not start a full reconciliation. This is an optional mechanism I was thinking of adding to reduce the number of full reconciliation cycles we run and the field-by-field comparison done on update. It is not a must, but it should save us time and resources on the controller side. Updated the doc to make it clearer.
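A minimal sketch of that optional check, assuming JSON marshalling plus SHA-256 as the hashing scheme; `status.observedStateHash` is the status field proposed in the doc, while the helper names are illustrative:

```go
package controllers

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// hashObservedState computes a stable hash of the observed OpenStack resource
// state. Storing the result in a status field such as status.observedStateHash
// lets the next resync skip the full reconcile when nothing changed on the
// OpenStack side.
func hashObservedState(observed any) (string, error) {
	// For a fixed Go struct definition, json.Marshal produces a deterministic
	// encoding, which is enough for an equality check between reconciles.
	raw, err := json.Marshal(observed)
	if err != nil {
		return "", fmt.Errorf("marshalling observed state: %w", err)
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:]), nil
}

// needsFullReconcile reports whether the observed state differs from the hash
// recorded at the previous reconcile, and returns the new hash to store.
func needsFullReconcile(observed any, previousHash string) (bool, string, error) {
	newHash, err := hashObservedState(observed)
	if err != nil {
		return false, "", err
	}
	return newHash != previousHash, newHash, nil
}
```

If `needsFullReconcile` reports no change, the controller would only re-queue the next resync instead of running the full compare/update path.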
enhancements/drift-detection.md
Outdated
Detect and report drift without automatically correcting it.

**Rejected because**: Adds operational burden requiring human intervention. Could be added as a separate management policy option in the future.
While I agree that drift detection without correction is currently out of scope, I disagree with the rejection reason given: notification that something changed under us is still better than no notification at all. It's just not the top-most priority, and we'd better focus on correcting drift first.
Although I agree this is useful for users, I believe it might be better addressed in a separate effort for alerting (or something similar). The idea is not to reject reporting in general, just to keep it out of the scope of this enhancement. I'll add a clarification on the reasoning.
### Resource Recreation on External Deletion

When a managed resource is deleted from OpenStack but the ORC object still exists:
What happens if `resyncPeriod: 0` (no drift detection)? Will it still be a Terminal Error as it is today, or will we try to re-create the resource?
I would say that if drift detection is disabled we should probably keep the behavior as-is, to avoid additional API changes and keep it predictable.
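A sketch of how keeping the behaviour unchanged could look, assuming `resyncPeriod` maps to the proposed API field and `0` means drift detection is disabled; the helper is hypothetical:

```go
package controllers

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// nextResync keeps the current behaviour when drift detection is disabled:
// with resyncPeriod set to 0 no periodic requeue is scheduled, so an externally
// deleted resource is handled exactly as it is today rather than being
// silently re-created.
func nextResync(resyncPeriod time.Duration) ctrl.Result {
	if resyncPeriod == 0 {
		// Drift detection disabled: no requeue, behaviour unchanged.
		return ctrl.Result{}
	}
	return ctrl.Result{RequeueAfter: resyncPeriod}
}
```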
Proposal for drift detection feature.
3134799 to a9f9abf