[slice] Reschedule after Slice Failed

When testing the super-slice feature manually:
1. Create a JobSet
2. (Slice CR gets created)
3. Transition the Slice state to Ready
4. Workload gets admitted
5. Transition the Slice state to Error
6. Workload gets suspended
7. (after ~1 minute) Slice CR is garbage-collected

The Workload is never unsuspended, and a new Slice CR object is never created.

The reason is that we transition the admission check state to `CheckStateRejected`:
https://github.com/AI-Hypercomputer/xpk/blob/34c7fc706b1c2a4320457b040c18a9d8c9edf03d/slice/internal/controller/workload_controller.go#L619-L620

Kueue does not retry a check in that state:

https://github.com/kubernetes-sigs/kueue/blob/6a1f89a58334b282f0c820b889d4137a4bdd6249/apis/kueue/v1beta1/admissioncheck_types.go#L32-L35

I think in this case we should transition to `CheckStateRetry` instead, in order to:
A) Give the slice some time to recover
B) Create a new Slice once the old one has been deleted, if it did not recover
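
A minimal sketch of the proposed change, under stated assumptions. It does not import Kueue or the slice controller: `CheckState` is redefined locally with the same string values as Kueue's `CheckState` constants, `SliceState` is a stand-in for the `v1alpha1` slice states (only Ready, Error and Deformed are named in this issue), and `nextCheckState` is a hypothetical helper, not a function in `workload_controller.go`. The Ready and Pending branches are simplified guesses; the point is only the first `case`, which returns Retry where the current code sets Rejected.

```go
package main

import "fmt"

// CheckState mirrors the string values of Kueue's admission check states.
// It is redefined here so the sketch compiles without the Kueue module.
type CheckState string

const (
	CheckStatePending  CheckState = "Pending"
	CheckStateReady    CheckState = "Ready"
	CheckStateRetry    CheckState = "Retry"
	CheckStateRejected CheckState = "Rejected"
)

// SliceState is a local stand-in for the Slice CR state type (assumption).
type SliceState string

const (
	SliceReady    SliceState = "Ready"
	SliceError    SliceState = "Error"
	SliceDeformed SliceState = "Deformed"
)

// nextCheckState is a hypothetical helper that maps the observed slice
// states to an admission check state. slicesByState groups slice names by
// state; total is the number of slices the workload expects.
func nextCheckState(slicesByState map[SliceState][]string, total int) CheckState {
	switch {
	case len(slicesByState[SliceError]) > 0 || len(slicesByState[SliceDeformed]) > 0:
		// Proposed: Retry instead of Rejected, so the workload is evicted
		// and later reconsidered rather than left suspended.
		return CheckStateRetry
	case total > 0 && len(slicesByState[SliceReady]) == total:
		// Simplified assumption: all expected slices are Ready.
		return CheckStateReady
	default:
		return CheckStatePending
	}
}

func main() {
	failed := map[SliceState][]string{SliceError: {"slice-0"}}
	fmt.Println(nextCheckState(failed, 1)) // prints "Retry"
}
```

Whether the controller should wait before deleting the failed Slice, and how long, is left open here; the sketch only covers the state transition.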

[slice] Reschedule after Slice Failed #685


	case len(slicesByState[v1alpha1.Error]) > 0 || len(slicesByState[v1alpha1.Deformed]) > 0:
		ac.State = kueue.CheckStateRejected
