Conversation

@ary1992
Contributor

@ary1992 ary1992 commented Jan 12, 2026

What this PR does / why we need it:

Add technical steering proposal for live control plane migration (Live CPM)

This proposal describes the approach to be followed for implementing Live Control Plane Migration (Live CPM) in Gardener, enabling migration of a Shoot cluster’s control plane without API server downtime.

Which issue(s) this PR fixes:
Part of gardener/gardener#10686

Special notes for your reviewer:

/cc @ScheererJ @rfranzke @vlerenc @timebertt

/hold
until a committee meeting has been scheduled

Release note:

NONE

@ary1992 ary1992 requested a review from a team as a code owner January 12, 2026 10:22
@gardener-robot gardener-robot added the reviewed/do-not-merge Has no approval for merging as it may break things, be of poor quality or have (ext.) dependencies label Jan 12, 2026
@gardener-github-actions gardener-github-actions bot added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jan 12, 2026
@gardener-robot gardener-robot added needs/review Needs review size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jan 12, 2026
@github-actions github-actions bot added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jan 12, 2026
@gardener-github-actions gardener-github-actions bot added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jan 12, 2026
@gardener-robot gardener-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs/second-opinion Needs second review by someone else and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jan 12, 2026
@rfranzke
Member

Date proposal: Tue, 10.02., 3pm CET?

@ary1992 @ScheererJ @vlerenc @timebertt

@ScheererJ
Member

> Date proposal: Tue, 10.02., 3pm CET?
>
> @ary1992 @ScheererJ @vlerenc @timebertt

Works for me, but may be late for @ary1992 due to time zone differences.

@rfranzke
Member

True. Would Mon, 09.02., 1pm CET be better?

![LiveCPM etcd 6 member](assets/livecpm-six-member-etcd.png)

#### Member removal from the cluster
- During the six-member cluster formation, the destination seed cluster’s etcd CR contains fields to bootstrap from the source cluster. To complete the migraion, the etcd member count needs to be brought back to three by removing the members that are part of the source seed cluster. For this, the bootstrap with source cluster field is removed from the destination cluster’s etcd CR. At the time of writing the GEP, it was decided that druid will use [EtcdOpsTask](https://github.com/gardener/etcd-druid/issues/1047) to remove members, and a new member removal task will be introduced.
Contributor

Suggested change
- During the six-member cluster formation, the destination seed cluster’s etcd CR contains fields to bootstrap from the source cluster. To complete the migraion, the etcd member count needs to be brought back to three by removing the members that are part of the source seed cluster. For this, the bootstrap with source cluster field is removed from the destination cluster’s etcd CR. At the time of writing the GEP, it was decided that druid will use [EtcdOpsTask](https://github.com/gardener/etcd-druid/issues/1047) to remove members, and a new member removal task will be introduced.
- During the six-member cluster formation, the destination seed cluster’s etcd CR contains fields to bootstrap from the source cluster. To complete the migration, the etcd member count needs to be brought back to three by removing the members that are part of the source seed cluster. For this, the bootstrap with source cluster field is removed from the destination cluster’s etcd CR. At the time of writing the GEP, it was decided that druid will use [EtcdOpsTask](https://github.com/gardener/etcd-druid/issues/1047) to remove members, and a new member removal task will be introduced.
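
For readers of this step, here is a minimal sketch of the quorum arithmetic behind shrinking the cluster from six members back to three. It assumes only etcd's standard majority rule (quorum = floor(n/2) + 1); the function names `quorum` and `faultTolerance` are illustrative and are not part of etcd-druid or the proposed member removal task.

```go
package main

import "fmt"

// quorum returns the number of voting members that must agree
// in a cluster of n members (simple majority).
func quorum(n int) int {
	return n/2 + 1
}

// faultTolerance returns how many members may be unavailable
// while the remaining members can still form a quorum.
func faultTolerance(n int) int {
	return n - quorum(n)
}

func main() {
	// Walk the cluster size down from six to three, one removal at a time.
	for n := 6; n >= 3; n-- {
		fmt.Printf("members=%d quorum=%d tolerated failures=%d\n",
			n, quorum(n), faultTolerance(n))
	}
}
```

Under this rule, a six-member cluster needs four members for quorum, and every intermediate size on the way down to three still tolerates at least one unavailable member.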


### Failures and Recovery Strategy

#### **ETCD Quorum loss**
Contributor

Suggested change
#### **ETCD Quorum loss**
#### etcd Quorum Loss


### etcd-druid

#### Six member etcd cluster
Contributor

Suggested change
#### Six member etcd cluster
#### 6-Member etcd Cluster


If the Kube API Server (KAPI) is down because the underlying etcd has lost quorum, follow the recovery steps outlined under ETCD Quorum Loss.

##### ETCD is healthy
Contributor

Suggested change
##### ETCD is healthy
##### ETCD is Healthy


#### Kube API Server is unhealthy

##### ETCD is unhealthy
Contributor

Suggested change
##### ETCD is unhealthy
##### ETCD is Unhealthy

To trigger an abort, the annotation `migration.shoot.gardener.cloud/abort-live-migration=true` should be added to the Shoot. This is permitted only while the `liveMigration` status with step `SixMemberETCDReady` is in `Failed` state. Once annotated, you can switch the seed name back to the source seed in the Shoot spec; the gardenlet will then orchestrate the cleanup of migration-specific resources across both the source and destination seeds. The destination gardenlet will also take care to remove the newly added members from the etcd cluster.
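
The abort precondition described above could be sketched roughly as follows. The annotation key, the step name `SixMemberETCDReady`, and the `Failed` state come from the proposal text; the `liveMigrationStatus` type and both helper functions are hypothetical simplifications, not the actual Gardener API.

```go
package main

import "fmt"

// Annotation key and value as named in the proposal.
const (
	abortAnnotationKey   = "migration.shoot.gardener.cloud/abort-live-migration"
	abortAnnotationValue = "true"
)

// liveMigrationStatus is a hypothetical, simplified stand-in for the
// liveMigration status reported on the Shoot.
type liveMigrationStatus struct {
	Step  string
	State string
}

// abortAllowed reports whether the abort annotation may be set:
// only while step SixMemberETCDReady is in the Failed state.
func abortAllowed(s liveMigrationStatus) bool {
	return s.Step == "SixMemberETCDReady" && s.State == "Failed"
}

// abortRequested reports whether the Shoot annotations carry the abort request.
func abortRequested(annotations map[string]string) bool {
	return annotations[abortAnnotationKey] == abortAnnotationValue
}

func main() {
	status := liveMigrationStatus{Step: "SixMemberETCDReady", State: "Failed"}
	annotations := map[string]string{abortAnnotationKey: abortAnnotationValue}
	fmt.Println("abort can proceed:", abortAllowed(status) && abortRequested(annotations))
}
```

Keeping the two checks separate mirrors the text: one decides whether the annotation is permitted at all, the other whether an operator has actually requested the abort.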


##### Quorum is lost at any other stage
Contributor

Suggested change
##### Quorum is lost at any other stage
##### Quorum Is Lost at Any Other Stage


#### **ETCD Quorum loss**

##### The destination members are unable to join
Contributor

Suggested change
##### The destination members are unable to join
##### The Destination Members Are Unable to Join

Each etcd member pod is individually exposed to enable direct and controlled peer communication during migration required for the [six member etcd cluster](#six-member-etcd-cluster), allowing it to communicate with its peers in the destination Seed cluster. This exposure is achieved via Istio, using dedicated `Gateway` and `VirtualService` configurations.

#### Components with Shoot webhooks/Controller and Shoot-managed resources (TBD)
Contributor

Suggested change
#### Components with Shoot webhooks/Controller and Shoot-managed resources (TBD)
#### Components with Shoot Webhooks/Controller and Shoot-Managed Resources (TBD)

Comment on lines +40 to +49
- [Components with Shoot webhooks/Controller and Shoot-managed resources (TBD)](#components-with-shoot-webhookscontroller-and-shoot-managed-resources-tbd)
- [Live Migration Flow](#live-migration-flow)
- [etcd-druid](#etcd-druid)
- [Six member etcd cluster](#six-member-etcd-cluster)
- [Member removal from the cluster](#member-removal-from-the-cluster)
- [Decoupling Member Names from Pod Names](#decoupling-member-names-from-pod-names)
- [VPN](#vpn)
- [Failures and Recovery Strategy](#failures-and-recovery-strategy)
- [ETCD Quorum loss](#etcd-quorum-loss)
- [Kube API Server is unhealthy](#kube-api-server-is-unhealthy)
Contributor

Suggested change
- [Components with Shoot webhooks/Controller and Shoot-managed resources (TBD)](#components-with-shoot-webhookscontroller-and-shoot-managed-resources-tbd)
- [Live Migration Flow](#live-migration-flow)
- [etcd-druid](#etcd-druid)
- [Six member etcd cluster](#six-member-etcd-cluster)
- [Member removal from the cluster](#member-removal-from-the-cluster)
- [Decoupling Member Names from Pod Names](#decoupling-member-names-from-pod-names)
- [VPN](#vpn)
- [Failures and Recovery Strategy](#failures-and-recovery-strategy)
- [ETCD Quorum loss](#etcd-quorum-loss)
- [Kube API Server is unhealthy](#kube-api-server-is-unhealthy)
- [Components with Shoot Webhooks/Controller and Shoot-Managed Resources (TBD)](#components-with-shoot-webhookscontroller-and-shoot-managed-resources-tbd)
- [Live Migration Flow](#live-migration-flow)
- [etcd-druid](#etcd-druid)
- [6-Member etcd Cluster](#6-member-etcd-cluster)
- [Member Removal from the Cluster](#member-removal-from-the-cluster)
- [Decoupling Member Names from Pod Names](#decoupling-member-names-from-pod-names)
- [VPN](#vpn)
- [Failures and Recovery Strategy](#failures-and-recovery-strategy)
- [etcd Quorum Loss](#etcd-quorum-loss)
- [Kube API Server Is Unhealthy](#kube-api-server-is-unhealthy)


![LiveCPM etcd 6 member](assets/livecpm-six-member-etcd.png)

#### Member removal from the cluster
Contributor

Suggested change
#### Member removal from the cluster
#### Member Removal from the Cluster

@gardener-robot gardener-robot added the needs/changes Needs (more) changes label Jan 12, 2026
Labels

needs/changes Needs (more) changes needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) needs/review Needs review needs/second-opinion Needs second review by someone else reviewed/do-not-merge Has no approval for merging as it may break things, be of poor quality or have (ext.) dependencies reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

5 participants