Feature Request: Add spec.clusterDeletionBehavior to ClusterProfile #1329

@kahirokunn

Summary

Add clusterDeletionBehavior to ClusterProfile.spec so that a single key specifies what happens to deployed resources when a cluster is deleted. Three values are available, with RemovePolicies as the default.

  • LeavePolicies: Leave deployed resources (Helm/manifests) intact
  • RemovePolicies: Best-effort deletion (MUST NOT block cluster deletion via the Runtime Hook)
  • EnforceRemovePolicies: Ensure deletion (block via the CAPI Runtime Hook until deletion completes)

This provides explicit control for "cluster deletion" behavior, complementing stopMatchingBehavior (behavior when "match is lost").

Proposal (API)

# ClusterProfile (CRD sketch)
spec:
  # When the Cluster resource itself is being deleted,
  # what should Sveltos do with resources deployed by this ClusterProfile?
  clusterDeletionBehavior: LeavePolicies | RemovePolicies | EnforceRemovePolicies
  # default: RemovePolicies
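
On the Go side, the field could be modelled roughly as below. This is only a sketch: the type name ClusterDeletionBehavior and the helper Normalize are hypothetical names chosen for illustration, not existing Sveltos API; only the three values and the RemovePolicies default come from the proposal.

```go
package main

import "fmt"

// ClusterDeletionBehavior is a hypothetical Go type for the proposed field.
type ClusterDeletionBehavior string

const (
	LeavePolicies         ClusterDeletionBehavior = "LeavePolicies"
	RemovePolicies        ClusterDeletionBehavior = "RemovePolicies"
	EnforceRemovePolicies ClusterDeletionBehavior = "EnforceRemovePolicies"
)

// Normalize applies the proposed default (RemovePolicies) to an unset value
// and rejects anything outside the three proposed values.
func Normalize(b ClusterDeletionBehavior) (ClusterDeletionBehavior, error) {
	switch b {
	case "":
		return RemovePolicies, nil
	case LeavePolicies, RemovePolicies, EnforceRemovePolicies:
		return b, nil
	default:
		return "", fmt.Errorf("unsupported clusterDeletionBehavior %q", b)
	}
}

func main() {
	b, err := Normalize("") // field left unset in the ClusterProfile
	fmt.Println(b, err)
}
```

In a real CRD the default and the allowed values would more likely be expressed as schema validation; the function form is used here only to keep the sketch self-contained.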

Semantics

  • LeavePolicies
    Do not delete anything (leave resources in place).
  • RemovePolicies (default)
    Delete as much as possible on a best-effort basis. Cluster deletion MUST NOT be blocked by the Runtime Hook.
    Even if a Runtime Extension is registered, the Hook returns success immediately (or is not used), and cleanup proceeds without blocking.
  • EnforceRemovePolicies
    Hold cluster deletion until cleanup is observed to be complete. This uses the BeforeClusterDelete Hook described in CAPI's Lifecycle Hook Runtime Extensions: the Hook keeps returning retryAfterSeconds, so CAPI retries and deletion stays blocked until cleanup completes.

Dependency-aware deletion order (Important)

  • Take ClusterProfile's dependsOn into account and delete in reverse dependency order (dependents first, their dependencies last).
    Example: If a depends on b → Delete in order a → b.
  • The implementation builds a DAG of the ClusterProfiles for the target cluster and processes it in reverse topological order.
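
The ordering can be sketched with a Kahn-style traversal. This assumes the order given in the example (a profile is removed before the profiles it depends on). DeletionOrder and its map-based input are hypothetical and use the standard library only; they are not taken from the Sveltos code base.

```go
package main

import (
	"fmt"
	"sort"
)

// DeletionOrder returns profile names ordered so that every profile is
// removed before the profiles it depends on. dependsOn maps a profile name
// to the names it depends on. An error is returned if a cycle is found.
func DeletionOrder(dependsOn map[string][]string) ([]string, error) {
	// dependents[p] counts how many not-yet-removed profiles depend on p.
	dependents := map[string]int{}
	for p, deps := range dependsOn {
		if _, ok := dependents[p]; !ok {
			dependents[p] = 0
		}
		for _, d := range deps {
			dependents[d]++
		}
	}

	// Start with profiles nothing depends on; sort for deterministic output.
	var ready []string
	for p, n := range dependents {
		if n == 0 {
			ready = append(ready, p)
		}
	}
	sort.Strings(ready)

	var order []string
	for len(ready) > 0 {
		p := ready[0]
		ready = ready[1:]
		order = append(order, p)
		// Copy before sorting so the caller's slice is left untouched.
		deps := append([]string(nil), dependsOn[p]...)
		sort.Strings(deps)
		for _, d := range deps {
			dependents[d]--
			if dependents[d] == 0 {
				ready = append(ready, d)
			}
		}
	}
	if len(order) != len(dependents) {
		return nil, fmt.Errorf("dependsOn contains a cycle")
	}
	return order, nil
}

func main() {
	// a depends on b, b depends on c.
	order, err := DeletionOrder(map[string][]string{"a": {"b"}, "b": {"c"}, "c": nil})
	fmt.Println(order, err) // prints "[a b c] <nil>"
}
```

Returning an error on a cycle is a design choice of this sketch; the proposal does not say how cyclic dependsOn declarations should be handled.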

Controller behavior (high-level)

  1. Detect Cluster with metadata.deletionTimestamp and enumerate associated ClusterProfiles.
  2. Analyze dependsOn and sort in reverse topological order (dependents first, their dependencies last).
  3. Apply clusterDeletionBehavior in sorted order:
    • LeavePolicies → Leave in place
    • RemovePolicies → Best-effort deletion (no Hook blocking / async progress)
    • EnforceRemovePolicies → Wait for completion with Hook coordination (BeforeClusterDelete/retryAfterSeconds)
  4. Reflect progress/results in ClusterSummary / status.conditions.
  5. Backward compatibility: Unspecified defaults to RemovePolicies.
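
Step 3 could look roughly like the following, assuming the profiles have already been sorted. Behavior, Step and Plan are hypothetical names for illustration; the sketch only decides what to do per profile and leaves the actual deletion and status reporting out.

```go
package main

import "fmt"

// Behavior mirrors the three proposed clusterDeletionBehavior values.
type Behavior string

const (
	LeavePolicies         Behavior = "LeavePolicies"
	RemovePolicies        Behavior = "RemovePolicies"
	EnforceRemovePolicies Behavior = "EnforceRemovePolicies"
)

// Step is a hypothetical record of what the controller would do for one profile.
type Step struct {
	Profile string
	Delete  bool // start removing the resources deployed by the profile
	Gate    bool // hold cluster deletion (via the Hook) until removal completes
}

// Plan walks profiles in the already-sorted deletion order and maps each
// profile's behavior to a step.
func Plan(order []string, behaviors map[string]Behavior) []Step {
	steps := make([]Step, 0, len(order))
	for _, name := range order {
		switch behaviors[name] {
		case LeavePolicies:
			steps = append(steps, Step{Profile: name})
		case EnforceRemovePolicies:
			steps = append(steps, Step{Profile: name, Delete: true, Gate: true})
		default:
			// RemovePolicies, unset, or (in this sketch) any unrecognized value.
			steps = append(steps, Step{Profile: name, Delete: true})
		}
	}
	return steps
}

func main() {
	steps := Plan([]string{"a", "b"}, map[string]Behavior{"a": EnforceRemovePolicies})
	for _, s := range steps {
		fmt.Printf("%s delete=%t gate=%t\n", s.Profile, s.Delete, s.Gate)
	}
}
```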

Examples

Best-effort deletion (default/non-blocking)

apiVersion: config.projectsveltos.io/v1beta1
kind: ClusterProfile
metadata:
  name: cleanup-on-delete
spec:
  clusterSelector:
    matchLabels: { env: prod }
  stopMatchingBehavior: RemovePolicies
  clusterDeletionBehavior: RemovePolicies

Ensure deletion (block with Hook)

apiVersion: config.projectsveltos.io/v1beta1
kind: ClusterProfile
metadata:
  name: strict-cleanup
spec:
  clusterSelector:
    matchLabels: { env: prod }
  stopMatchingBehavior: RemovePolicies
  clusterDeletionBehavior: EnforceRemovePolicies

Reference Information

CAPI Runtime Hook (BeforeClusterDelete) Key Points

CAPI's Runtime SDK provides extensions (Runtime Extensions) that can hook into cluster lifecycle. BeforeClusterDelete is called immediately before cluster deletion starts and can block deletion until add-on cleanup completes (by returning retryAfterSeconds for retry). See Cluster API Book's Lifecycle Hook Runtime Extensions for details.

Runtime Extension is implemented as an HTTPS server, registering handlers (e.g., BeforeClusterDelete). Blocking behavior is achieved simply by returning retryAfterSeconds, causing CAPI to retry. For implementation details, see Cluster API Book's Implementing Runtime Extensions.

This design makes it possible to implement "wait until deletion completes" for clusterDeletionBehavior: EnforceRemovePolicies in this proposal. Conversely, with RemovePolicies the Hook returns success immediately (or is not registered) and cleanup runs asynchronously, so deletion is never blocked.
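
As a rough illustration of that split, the value a BeforeClusterDelete handler returns could be derived from per-profile state as follows. ProfileState and RetryAfter are hypothetical names, and how cleanup completion is actually observed is left open here.

```go
package main

import "fmt"

// ProfileState is a hypothetical summary of one ClusterProfile's cleanup.
type ProfileState struct {
	Name        string
	Enforce     bool // clusterDeletionBehavior is EnforceRemovePolicies
	CleanupDone bool // resources deployed by the profile are gone
}

// RetryAfter returns the retryAfterSeconds value a handler could send back:
// 0 lets deletion continue, a positive value asks CAPI to call the hook again.
// Only profiles using EnforceRemovePolicies can hold deletion back.
func RetryAfter(states []ProfileState, interval int32) int32 {
	for _, s := range states {
		if s.Enforce && !s.CleanupDone {
			return interval
		}
	}
	return 0
}

func main() {
	states := []ProfileState{
		{Name: "cleanup-on-delete", Enforce: false, CleanupDone: false},
		{Name: "strict-cleanup", Enforce: true, CleanupDone: false},
	}
	fmt.Println(RetryAfter(states, 10)) // prints "10": strict-cleanup is pending
}
```

In a handler, the returned value would be assigned to resp.RetryAfterSeconds while the status stays Success.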

ExtensionConfig Registration Example (CAPI side)

Minimal ExtensionConfig example for registering the Runtime Extension with the management cluster (Service and TLS are prepared separately):

apiVersion: runtime.cluster.x-k8s.io/v1alpha1
kind: ExtensionConfig
metadata:
  name: sveltos-cleanup-gate
  annotations:
    runtime.cluster.x-k8s.io/inject-ca-from-secret: sveltos-cleanup/ext-svc-cert
spec:
  clientConfig:
    service:
      name: sveltos-cleanup-svc    # Runtime Extension Service name
      namespace: sveltos-cleanup   # Deployment namespace
      port: 443
  namespaceSelector:
    matchExpressions:
      - key: kubernetes.io/metadata.name
        operator: In
        values:
          - default                # Example: Apply to Clusters in default namespace

The ExtensionConfig's namespaceSelector declares which Clusters the Runtime Extension applies to. In this example, the Hook is enabled for Clusters in the default namespace. For detailed configuration, see Cluster API Book's Implementing Runtime Extensions.

Hook Handler Implementation Minimal Code (Go/pseudo)

Minimal skeleton for "waiting for add-on uninstall completion" with BeforeClusterDelete (replace the decision logic to suit your operations):

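package main

import (
  "context"
  "os"

  ctrl "sigs.k8s.io/controller-runtime"

  runtimecatalog "sigs.k8s.io/cluster-api/exp/runtime/catalog"
  runtimehooksv1 "sigs.k8s.io/cluster-api/exp/runtime/hooks/api/v1alpha1"
  "sigs.k8s.io/cluster-api/exp/runtime/server"
)

var catalog = runtimecatalog.New()

func init() { _ = runtimehooksv1.AddToCatalog(catalog) }

func main() {
  log := ctrl.Log.WithName("sveltos-cleanup-gate")

  // Serve the Runtime Extension over HTTPS using the certificates in CertDir.
  s, err := server.New(server.Options{Catalog: catalog, Port: 9443, CertDir: "/certs"})
  if err != nil {
    log.Error(err, "unable to create Runtime Extension server")
    os.Exit(1)
  }

  // Register the handler that CAPI calls before it starts deleting a Cluster.
  if err := s.AddExtensionHandler(server.ExtensionHandler{
    Hook:        runtimehooksv1.BeforeClusterDelete,
    Name:        "before-cluster-delete",
    HandlerFunc: DoBeforeClusterDelete,
  }); err != nil {
    log.Error(err, "unable to register BeforeClusterDelete handler")
    os.Exit(1)
  }

  if err := s.Start(ctrl.SetupSignalHandler()); err != nil {
    log.Error(err, "Runtime Extension server stopped with an error")
    os.Exit(1)
  }
}

func DoBeforeClusterDelete(
  ctx context.Context,
  req *runtimehooksv1.BeforeClusterDeleteRequest,
  resp *runtimehooksv1.BeforeClusterDeleteResponse,
) {
  log := ctrl.LoggerFrom(ctx)

  // Success is returned in both branches; a non-zero RetryAfterSeconds is what blocks.
  resp.Status = runtimehooksv1.ResponseStatusSuccess

  if !addonsCleanupCompleted(ctx, req.Cluster.Namespace, req.Cluster.Name) {
    resp.Message = "waiting for add-on cleanup"
    resp.RetryAfterSeconds = 10 // Block until complete (CAPI will retry)
    log.Info(resp.Message)
    return
  }
  // Complete → RetryAfterSeconds stays 0 and deletion continues.
}

// addonsCleanupCompleted is a placeholder for the real decision logic.
// Example: implement the HelmChartProxy uninstall completion check here
// (read the management cluster API, check Sveltos/CAAPH status, etc.).
func addonsCleanupCompleted(ctx context.Context, namespace, name string) bool {
  return true
}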

Key point: Returning a non-zero resp.RetryAfterSeconds is enough to achieve "stop deletion → retry later". Return Success rather than Failure while waiting, and reserve Failure for permanent errors. This implementation pattern is recommended in Cluster API Book's Runtime Extensions implementation guide.

Actual PoC Testing Notes

In a local PoC, the flow "block deletion with the Hook → continue deletion after add-on cleanup completes" was confirmed as follows:

  1. Build and deploy the sample Runtime Extension (configuration as above)
    • Deploy the Service and TLS Secret, then apply the ExtensionConfig.
  2. Create a test Cluster and apply add-ons (Helm/manifests).
  3. Execute kubectl delete cluster ....
    • The Hook keeps returning retryAfterSeconds, so deletion pauses.
  4. Complete the uninstall during this time (delete the Helm release, etc.) → once the check reports ready, deletion resumes.

Test manifests/scripts are available at kubernetes-playground/capi/runtime-hooks (includes ExtensionConfig/server templates and manual test procedure notes).

Correspondence with This Proposal (clusterDeletionBehavior)

  • LeavePolicies … Runtime Hook unregistered (or always success) + no deletion.
  • RemovePolicies (default) … Delete what's possible non-blocking. Runtime Hook unused/immediate success, cleanup is async.
  • EnforceRemovePolicies … Leverage CAPI's Lifecycle Hook feature, enable BeforeClusterDelete and block with retryAfterSeconds. Continue deletion after observing cleanup completion.

Notes

The clusterDeletionBehavior: EnforceRemovePolicies option, which uses the CAPI Runtime Hook (BeforeClusterDelete), is only supported for clusters created using a ClusterClass.
For clusters not created with a ClusterClass, the Runtime Hook will not be invoked (see reference: kubernetes-sigs/cluster-api#11491).

Implementation details:

  • The controller logic for EnforceRemovePolicies starts, like RemovePolicies, when the target cluster has metadata.deletionTimestamp set.
  • For clusters created with a ClusterClass, the CAPI Runtime Hook (BeforeClusterDelete) will also be delivered to the controller.
    Upon receiving this Hook, the controller runs an additional blocking step to wait for add-on cleanup to complete.
  • For non-ClusterClass clusters, the Hook is not triggered, so deletion proceeds asynchronously without blocking, just like RemovePolicies.
  • The only difference between EnforceRemovePolicies and RemovePolicies is whether a blocking step is executed in the Runtime Hook server.

SveltosCluster case:
Deleting a SveltosCluster resource does not necessarily mean the cluster itself is being deleted.
It simply means the cluster is no longer managed by Sveltos. Therefore, no special deletion or blocking behavior is performed.
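
The notes above can be condensed into one decision function. This is a sketch under stated assumptions: Resolve, Outcome and the two boolean inputs are hypothetical, and how the controller would detect a ClusterClass-based cluster is not specified in this proposal.

```go
package main

import "fmt"

// Behavior mirrors the three proposed clusterDeletionBehavior values.
type Behavior string

const (
	LeavePolicies         Behavior = "LeavePolicies"
	RemovePolicies        Behavior = "RemovePolicies"
	EnforceRemovePolicies Behavior = "EnforceRemovePolicies"
)

// Outcome is a hypothetical label for what effectively happens on deletion.
type Outcome string

const (
	NoAction        Outcome = "no-action"
	AsyncCleanup    Outcome = "async-cleanup"
	BlockingCleanup Outcome = "blocking-cleanup"
)

// Resolve maps the cluster kind and behavior to the effective outcome:
//   - SveltosCluster deletion or LeavePolicies: nothing is removed
//   - EnforceRemovePolicies on a ClusterClass cluster: the Hook blocks
//   - everything else: best-effort, non-blocking cleanup
func Resolve(isSveltosCluster, usesClusterClass bool, b Behavior) Outcome {
	switch {
	case isSveltosCluster, b == LeavePolicies:
		return NoAction
	case b == EnforceRemovePolicies && usesClusterClass:
		return BlockingCleanup
	default:
		return AsyncCleanup
	}
}

func main() {
	// EnforceRemovePolicies on a cluster that was not created from a ClusterClass.
	fmt.Println(Resolve(false, false, EnforceRemovePolicies)) // prints "async-cleanup"
}
```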
