Provide hypervisor overcommit controller#584
Conversation
📝 WalkthroughWalkthroughAdds a HypervisorOvercommitController with trait-driven, ordered overcommit mappings, registers it at startup behind an enabled-controller flag, updates Helm values and RBAC, adds comprehensive unit tests, and changes config loading APIs to return errors rather than panicking. Changes
Sequence DiagramsequenceDiagram
participant Manager as Manager
participant Controller as HypervisorOvercommitController
participant Config as Config System
participant APIServer as API Server
participant Remote as Remote Cluster/Multicluster Client
Manager->>Controller: SetupWithManager()
Controller->>Config: GetConfig[HypervisorOvercommitConfig]()
Config-->>Controller: config (or error)
Controller->>Manager: Register local & multicluster Watches
Remote->>Controller: Remote Hypervisor event
Controller->>Manager: Enqueue reconcile request
Controller->>APIServer: Get Hypervisor
APIServer-->>Controller: Hypervisor (spec/status)
Controller->>Controller: Evaluate trait conditions, apply mappings in order
alt Overcommit differs
Controller->>APIServer: Patch Hypervisor.Spec.Overcommit (MergeFrom)
APIServer-->>Controller: Patch result
else No change
Controller-->>Controller: No patch
end
Estimated Code Review Effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
📝 Coding Plan
Comment |
|
To test this feature, it is possible to add the following hypervisor resource to the local dev setup: apiVersion: kvm.cloud.sap/v1
kind: Hypervisor
metadata:
name: example-hypervisor
status:
allocation:
cpu: 16
memory: 64Gi
capacity:
cpu: 32
memory: 128Gi
traits:
- "example-trait"And the following configuration: cortex-scheduling-controllers:
conf:
overcommitMappings:
- hasTrait: "example-trait"
overcommit:
cpu: 2
memory: 1.5Then logs should look like this: And the resulting resource: # ...
spec:
overcommit:
cpu: 2
memory: 1.5 |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@helm/bundles/cortex-nova/values.yaml`:
- Around line 113-128: The chart currently enables the reservations-controller
by default (enabledControllers includes "reservations-controller") while leaving
overcommitMappings empty, which causes benignly-set Hypervisor.spec.overcommit
to be cleared on reconcile; fix by either removing "reservations-controller"
from the enabledControllers list so the mutating controller is disabled by
default, or make the default overcommitMappings a true no-op (set
overcommitMappings to null or an explicit sentinel that your reconcile logic
treats as no-op) so an empty list does not trigger patches; update the values
for enabledControllers or overcommitMappings accordingly.
In `@helm/library/cortex/templates/rbac/hypervisor_role.yaml`:
- Around line 15-21: The shared Hypervisor ClusterRole currently includes
mutating verbs (create, delete, patch, update) in the verbs list; remove those
mutating verbs from the shared template (leave only read verbs like get and
list) in hypervisor_role.yaml, and instead add a new scheduling-only
Role/ClusterRole + RoleBinding/ClusterRoleBinding that grants create, delete,
patch, update on the same Hypervisor resources to the overcommit controller's
ServiceAccount (scheduling bundle) so only the scheduling-specific controller
gets write access.
In `@internal/scheduling/nova/hypervisor_overcommit_controller_test.go`:
- Around line 535-540: The test currently only checks
updatedHypervisor.Spec.Overcommit when tt.expectedOvercommitValues is non-nil;
update the assertion logic in the test loop (the block that reads
tt.expectedOvercommitValues and checks updatedHypervisor.Spec.Overcommit) to
also explicitly assert the cleared state for nil/empty cases: when
tt.expectedOvercommitValues is nil or an empty map/slice, assert that
updatedHypervisor.Spec.Overcommit is nil or empty (as appropriate) to ensure old
ratios were removed; use the existing test variables
(tt.expectedOvercommitValues and updatedHypervisor.Spec.Overcommit) and include
clear error messages for the nil/empty expectations.
In `@internal/scheduling/nova/hypervisor_overcommit_controller.go`:
- Around line 115-128: The current code collapses c.config.OvercommitMappings
into a map (traitMappings) then ranges that map, losing the original ordering
and making which mapping wins non-deterministic; instead iterate
c.config.OvercommitMappings in order, skip entries where mapping.Trait is nil or
the trait is not present in obj.Status.Traits, and when applying a mapping merge
its mapping.Overcommit into desiredOvercommit by iterating its key/value pairs
(e.g. for res,ratio := range mapping.Overcommit { desiredOvercommit[res] = ratio
}) so later slice entries deterministically override earlier ones and
non-overlapping resources are preserved; remove the intermediate traitMappings
and the maps.Copy usage and keep the log call when a trait is applied.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 6cb209a6-db1c-4311-b760-89623a9e9685
📒 Files selected for processing (6)
cmd/main.gohelm/bundles/cortex-nova/values.yamlhelm/library/cortex/templates/rbac/hypervisor_role.yamlinternal/scheduling/nova/hypervisor_overcommit_controller.gointernal/scheduling/nova/hypervisor_overcommit_controller_test.gopkg/conf/conf.go
internal/scheduling/nova/hypervisor_overcommit_controller_test.go
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
🧹 Nitpick comments (2)
internal/scheduling/nova/hypervisor_overcommit_controller_test.go (2)
529-550: Consider guarding against nil hypervisor in the test loop.Lines 531-534 check
if tt.hypervisor != nilwhen building the fake client objects, but line 548 accessestt.hypervisor.Nameunconditionally. While all current test cases define a hypervisor, this inconsistency could cause a panic if someone later adds a test case with a nil hypervisor.♻️ Suggested fix
req := ctrl.Request{ - NamespacedName: types.NamespacedName{ - Name: tt.hypervisor.Name, - }, } + if tt.hypervisor != nil { + req.NamespacedName = types.NamespacedName{ + Name: tt.hypervisor.Name, + } + }Alternatively, remove the nil check at lines 532-534 since all test cases require a hypervisor to be meaningful.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/scheduling/nova/hypervisor_overcommit_controller_test.go` around lines 529 - 550, The test accesses tt.hypervisor.Name when building req but only conditionally adds tt.hypervisor to fake client objects; guard against nil by checking tt.hypervisor before using it (e.g., ensure tt.hypervisor != nil before constructing ctrl.Request/NamespacedName) or remove the earlier conditional so tt.hypervisor is always present; update the test around the loop where HypervisorOvercommitController is created and req is constructed to consistently handle tt.hypervisor to avoid a panic.
714-761: Consider adding a negative test case for non-Hypervisor objects.The predicate test only verifies that Hypervisor objects return
true. Adding a test case with a different object type (e.g., a Pod or ConfigMap) that expectsfalsewould validate the type-check logic and guard against regressions.♻️ Suggested addition to the tests slice
{ name: "Non-Hypervisor object returns false", object: &corev1.Pod{ ObjectMeta: metav1.ObjectMeta{ Name: "test-pod", }, }, expectedResult: false, },This would require importing
corev1 "k8s.io/api/core/v1"and adding it to the test scheme.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/scheduling/nova/hypervisor_overcommit_controller_test.go` around lines 714 - 761, Add a negative test case to TestHypervisorOvercommitController_predicateRemoteHypervisor to assert predicateRemoteHypervisor returns false for non-Hypervisor types: extend the tests slice with an entry using a corev1.Pod (e.g., name "test-pod") and expectedResult false, import corev1 "k8s.io/api/core/v1", and ensure the test scheme used by the file includes corev1 so the predicate type-checking in HypervisorOvercommitController.predicateRemoteHypervisor is exercised for both true and false branches.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@internal/scheduling/nova/hypervisor_overcommit_controller_test.go`:
- Around line 529-550: The test accesses tt.hypervisor.Name when building req
but only conditionally adds tt.hypervisor to fake client objects; guard against
nil by checking tt.hypervisor before using it (e.g., ensure tt.hypervisor != nil
before constructing ctrl.Request/NamespacedName) or remove the earlier
conditional so tt.hypervisor is always present; update the test around the loop
where HypervisorOvercommitController is created and req is constructed to
consistently handle tt.hypervisor to avoid a panic.
- Around line 714-761: Add a negative test case to
TestHypervisorOvercommitController_predicateRemoteHypervisor to assert
predicateRemoteHypervisor returns false for non-Hypervisor types: extend the
tests slice with an entry using a corev1.Pod (e.g., name "test-pod") and
expectedResult false, import corev1 "k8s.io/api/core/v1", and ensure the test
scheme used by the file includes corev1 so the predicate type-checking in
HypervisorOvercommitController.predicateRemoteHypervisor is exercised for both
true and false branches.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b4acb683-22d3-444e-a3b3-a31d95694be7
📒 Files selected for processing (2)
internal/scheduling/nova/hypervisor_overcommit_controller.gointernal/scheduling/nova/hypervisor_overcommit_controller_test.go
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (1)
internal/scheduling/nova/hypervisor_overcommit_controller.go (1)
122-147:⚠️ Potential issue | 🔴 CriticalApply mappings in a single ordered pass to preserve override semantics.
Line 122 and Line 135 split evaluation into two passes (
HasTraitfirst,HasntTraitsecond), soHasntTraitmappings always win even when declared earlier. That breaks the ordered override contract documented in Line 64–Line 67.🐛 Proposed fix
- for _, mapping := range c.config.OvercommitMappings { - // Skip mappings without a trait specified. - if mapping.HasTrait == nil { - continue - } - // Only consider applied traits which are reflected in the status. - if !slices.Contains(obj.Status.Traits, *mapping.HasTrait) { - continue - } - log.Info("Applying overcommit mapping for trait present on hypervisor", - "trait", *mapping.HasTrait, "overcommit", mapping.Overcommit) - maps.Copy(desiredOvercommit, mapping.Overcommit) - } - for _, mapping := range c.config.OvercommitMappings { - // Skip mappings without a trait specified. - if mapping.HasntTrait == nil { - continue - } - // Only consider applied traits which are reflected in the status. - if slices.Contains(obj.Status.Traits, *mapping.HasntTrait) { - continue - } - log.Info("Applying overcommit mapping for trait not present on hypervisor", - "trait", *mapping.HasntTrait, "overcommit", mapping.Overcommit) - maps.Copy(desiredOvercommit, mapping.Overcommit) - } + for _, mapping := range c.config.OvercommitMappings { + applyByHasTrait := mapping.HasTrait != nil && slices.Contains(obj.Status.Traits, *mapping.HasTrait) + applyByHasntTrait := mapping.HasntTrait != nil && !slices.Contains(obj.Status.Traits, *mapping.HasntTrait) + if !applyByHasTrait && !applyByHasntTrait { + continue + } + + log.Info("Applying overcommit mapping", + "hasTrait", mapping.HasTrait, + "hasntTrait", mapping.HasntTrait, + "overcommit", mapping.Overcommit) + maps.Copy(desiredOvercommit, mapping.Overcommit) + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/scheduling/nova/hypervisor_overcommit_controller.go` around lines 122 - 147, The two-pass evaluation over c.config.OvercommitMappings breaks the declared ordering because HasTrait mappings are applied in one loop and HasntTrait in another; replace both loops with a single ordered pass over c.config.OvercommitMappings and for each mapping check its conditions in-place: if mapping.HasTrait != nil and slices.Contains(obj.Status.Traits, *mapping.HasTrait) then apply maps.Copy(desiredOvercommit, mapping.Overcommit); else if mapping.HasntTrait != nil and !slices.Contains(obj.Status.Traits, *mapping.HasntTrait) then apply maps.Copy(desiredOvercommit, mapping.Overcommit); otherwise skip—this preserves original declaration order and override semantics.
🧹 Nitpick comments (1)
internal/scheduling/nova/hypervisor_overcommit_controller_test.go (1)
216-217:expectNoUpdateis unused in the reconcile table.
expectNoUpdateis defined/set but never asserted, so this case intent is currently dead and misleading.♻️ Proposed cleanup
type tests := []struct { name string hypervisor *hv1.Hypervisor config HypervisorOvercommitConfig expectedOvercommit map[hv1.ResourceName]float64 - expectNoUpdate bool expectNotFoundError bool }{ ... - expectNoUpdate: true,Also applies to: 384-384
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/scheduling/nova/hypervisor_overcommit_controller_test.go` around lines 216 - 217, The test table defines expectNoUpdate but never asserts it; either remove the unused expectNoUpdate field from the reconcile table and its setters, or implement an assertion that verifies no update happened when expectNoUpdate is true (e.g., assert the fake client/update call count remained zero or that the resource was not modified). Update the table-driven test where expectNoUpdate is set and the test harness that validates updates (the reconcile test and any existing expectNotFoundError/assertion logic) so the intent is enforced or the dead field is removed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/scheduling/nova/hypervisor_overcommit_controller_test.go`:
- Around line 648-666: The test
TestHypervisorOvercommitController_SetupWithManager_InvalidClient currently
never calls SetupWithManager; update it to instantiate a manager (or mock
manager) and call controller.SetupWithManager(mgr) and then assert that it
returns a non-nil error (or fails) because HypervisorOvercommitController
requires a multicluster client; specifically modify the test to invoke
HypervisorOvercommitController.SetupWithManager and check the returned error is
not nil (and that the controller was not registered) instead of only asserting
controller.Client != nil.
- Around line 668-717: The test
TestHypervisorOvercommitController_Reconcile_PatchError is named to assert a
patch failure but currently uses a normal fake client and expects no error;
change it to simulate a Patch error and assert the controller returns an error:
create a small test double that embeds client.Client (or wraps the
fake.NewClientBuilder() result) and overrides Patch to return a specific error
(e.g., errors.New("patch failed")), assign that failing client to
controller.Client, call controller.Reconcile(ctx, req) and assert err is non-nil
and contains the expected message; keep all other setup (hypervisor object,
HypervisorOvercommitConfig) and reference
TestHypervisorOvercommitController_Reconcile_PatchError and controller.Reconcile
when locating where to replace the client.
---
Duplicate comments:
In `@internal/scheduling/nova/hypervisor_overcommit_controller.go`:
- Around line 122-147: The two-pass evaluation over c.config.OvercommitMappings
breaks the declared ordering because HasTrait mappings are applied in one loop
and HasntTrait in another; replace both loops with a single ordered pass over
c.config.OvercommitMappings and for each mapping check its conditions in-place:
if mapping.HasTrait != nil and slices.Contains(obj.Status.Traits,
*mapping.HasTrait) then apply maps.Copy(desiredOvercommit, mapping.Overcommit);
else if mapping.HasntTrait != nil and !slices.Contains(obj.Status.Traits,
*mapping.HasntTrait) then apply maps.Copy(desiredOvercommit,
mapping.Overcommit); otherwise skip—this preserves original declaration order
and override semantics.
---
Nitpick comments:
In `@internal/scheduling/nova/hypervisor_overcommit_controller_test.go`:
- Around line 216-217: The test table defines expectNoUpdate but never asserts
it; either remove the unused expectNoUpdate field from the reconcile table and
its setters, or implement an assertion that verifies no update happened when
expectNoUpdate is true (e.g., assert the fake client/update call count remained
zero or that the resource was not modified). Update the table-driven test where
expectNoUpdate is set and the test harness that validates updates (the reconcile
test and any existing expectNotFoundError/assertion logic) so the intent is
enforced or the dead field is removed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 78a6c187-12af-4420-bf62-f63d5669c14d
📒 Files selected for processing (2)
internal/scheduling/nova/hypervisor_overcommit_controller.gointernal/scheduling/nova/hypervisor_overcommit_controller_test.go
Test Coverage ReportTest Coverage 📊: 68.1% |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
internal/scheduling/nova/hypervisor_overcommit_controller_test.go (1)
589-617: Duplicate test case for not-found scenario.This standalone test duplicates the "hypervisor not found" case already covered in
TestHypervisorOvercommitController_Reconcile(lines 507-516). Consider removing this redundant test to keep the test file concise.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/scheduling/nova/hypervisor_overcommit_controller_test.go` around lines 589 - 617, The test TestHypervisorOvercommitController_ReconcileNotFound duplicates the "hypervisor not found" scenario already exercised in TestHypervisorOvercommitController_Reconcile; remove the redundant TestHypervisorOvercommitController_ReconcileNotFound function to avoid duplication and keep tests concise, ensuring any unique assertions from that function are preserved in TestHypervisorOvercommitController_Reconcile if needed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/scheduling/nova/hypervisor_overcommit_controller.go`:
- Around line 48-66: The Validate method on HypervisorOvercommitMapping
incorrectly places the HasTrait/HasntTrait mutual-exclusion and presence checks
inside the loop over m.Overcommit, so those checks are skipped when Overcommit
is empty; move the checks for m.HasTrait != nil && m.HasntTrait != nil and for
m.HasTrait == nil && m.HasntTrait == nil out of the for-loop in
HypervisorOvercommitMapping.Validate so they run unconditionally, and keep the
per-resource overcommit validation (overcommit < 1.0) inside the loop; ensure
the method still returns errors for invalid overcommit ratios and for invalid
trait combinations even when m.Overcommit is nil or empty.
---
Nitpick comments:
In `@internal/scheduling/nova/hypervisor_overcommit_controller_test.go`:
- Around line 589-617: The test
TestHypervisorOvercommitController_ReconcileNotFound duplicates the "hypervisor
not found" scenario already exercised in
TestHypervisorOvercommitController_Reconcile; remove the redundant
TestHypervisorOvercommitController_ReconcileNotFound function to avoid
duplication and keep tests concise, ensuring any unique assertions from that
function are preserved in TestHypervisorOvercommitController_Reconcile if
needed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8a188ec1-d95e-4dac-9814-9c7cfa67e7b3
📒 Files selected for processing (3)
helm/library/cortex/templates/rbac/hypervisor_role.yamlinternal/scheduling/nova/hypervisor_overcommit_controller.gointernal/scheduling/nova/hypervisor_overcommit_controller_test.go
This change adds a controller to the nova scheduling controller manager, which finds hypervisors with specific traits and assigns desired overcommit ratios to them.