When testing the super-slice feature manually:
- Create a JobSet
- (Slice CR gets created)
- Transition the Slice state to Ready
- Workload gets admitted
- Transition the Slice state to Error
- Workload gets suspended
- (after ~1 minute) Slice CR is garbage-collected
After that, the Workload is never unsuspended, and no new Slice CR object is created.
The reason is that we transition the admission check state to Rejected (xpk/slice/internal/controller/workload_controller.go, lines 619 to 620 in 34c7fc7):

    case len(slicesByState[v1alpha1.Error]) > 0 || len(slicesByState[v1alpha1.Deformed]) > 0:
        ac.State = kueue.CheckStateRejected
A Rejected admission check is not retried:
https://github.com/kubernetes-sigs/kueue/blob/6a1f89a58334b282f0c820b889d4137a4bdd6249/apis/kueue/v1beta1/admissioncheck_types.go#L32-L35
I think in this case we should transition to CheckStateRetry instead, in order to:
A) Give the slice some time to recover
B) Create a new slice once the old one is deleted, if it does not recover
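
To make the proposal concrete, here is a minimal sketch of the state mapping I have in mind. It is not the actual controller code: `CheckState`, `SliceState`, and `nextCheckState` are local stand-ins defined only so the snippet compiles on its own, and the Ready/Pending branches are a simplification I am assuming for illustration. The only intended change is the first `case`, which returns Retry where the current code sets Rejected.

```go
package main

// CheckState is a local stand-in for Kueue's admission check state type,
// redefined here so the sketch is self-contained.
type CheckState string

const (
	CheckStatePending  CheckState = "Pending"
	CheckStateReady    CheckState = "Ready"
	CheckStateRetry    CheckState = "Retry"
	CheckStateRejected CheckState = "Rejected"
)

// SliceState is a hypothetical stand-in for the Slice CR state in v1alpha1.
type SliceState string

const (
	SliceReady    SliceState = "Ready"
	SliceError    SliceState = "Error"
	SliceDeformed SliceState = "Deformed"
)

// nextCheckState is a hypothetical helper that maps the slices of a workload,
// grouped by state, to the admission check state the controller would set.
func nextCheckState(slicesByState map[SliceState][]string) CheckState {
	switch {
	case len(slicesByState[SliceError]) > 0 || len(slicesByState[SliceDeformed]) > 0:
		// Proposed change: Retry instead of Rejected, so the workload is
		// requeued and a new Slice can be created after the old one is gone.
		return CheckStateRetry
	case len(slicesByState[SliceReady]) > 0:
		// Assumed simplification: any Ready slice and no failed slice.
		return CheckStateReady
	default:
		// No slices yet, or none of them has reached a terminal state.
		return CheckStatePending
	}
}
```

With this mapping, a workload whose only slice is in Error ends up in Retry rather than Rejected, so it stays eligible for re-admission once the failed Slice CR has been garbage-collected.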