Add technical steering proposal for live control plane migration #796
base: master
4b81714 to
b7792f6
Compare
Date proposal: Tue, 10.02., 3pm CET?
Works for me, but may be late for @ary1992 due to time zone differences.
True. Would Mon, 09.02., 1pm CET work better?
|  | ||
| #### Member removal from the cluster | ||
| - During the six-member cluster formation, the destination seed cluster’s etcd CR contains fields to bootstrap from the source cluster. To complete the migraion, the etcd member count needs to be brought back to three by removing the members that are part of the source seed cluster. For this, the bootstrap with source cluster field is removed from the destination cluster’s etcd CR. At the time of writing the GEP, it was decided that druid will use [EtcdOpsTask](https://github.com/gardener/etcd-druid/issues/1047) to remove members, and a new member removal task will be introduced. |
| - During the six-member cluster formation, the destination seed cluster’s etcd CR contains fields to bootstrap from the source cluster. To complete the migraion, the etcd member count needs to be brought back to three by removing the members that are part of the source seed cluster. For this, the bootstrap with source cluster field is removed from the destination cluster’s etcd CR. At the time of writing the GEP, it was decided that druid will use [EtcdOpsTask](https://github.com/gardener/etcd-druid/issues/1047) to remove members, and a new member removal task will be introduced. | |
| - During the six-member cluster formation, the destination seed cluster’s etcd CR contains fields to bootstrap from the source cluster. To complete the migration, the etcd member count needs to be brought back to three by removing the members that are part of the source seed cluster. For this, the bootstrap with source cluster field is removed from the destination cluster’s etcd CR. At the time of writing the GEP, it was decided that druid will use [EtcdOpsTask](https://github.com/gardener/etcd-druid/issues/1047) to remove members, and a new member removal task will be introduced. |
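A minimal Go sketch of the scale-down step described above, assuming a simplified in-memory view of the six-member cluster. The `Member` type, the `Seed` labels, and the member names are hypothetical and are not taken from etcd-druid or from the EtcdOpsTask API. The sketch only shows how the removal set could be selected and why the majority size drops from four to two once the source members are gone.

```go
package main

import "fmt"

// Member is a hypothetical, simplified view of one etcd cluster member.
type Member struct {
	Name string
	Seed string // illustrative label: "source" or "destination"
}

// membersToRemove returns the names of all members hosted in the given seed.
func membersToRemove(members []Member, seed string) []string {
	var names []string
	for _, m := range members {
		if m.Seed == seed {
			names = append(names, m.Name)
		}
	}
	return names
}

// quorum returns the majority size for a cluster of n voting members.
func quorum(n int) int {
	return n/2 + 1
}

func main() {
	// Hypothetical six-member cluster spanning both seeds.
	members := []Member{
		{Name: "etcd-src-0", Seed: "source"},
		{Name: "etcd-src-1", Seed: "source"},
		{Name: "etcd-src-2", Seed: "source"},
		{Name: "etcd-dst-0", Seed: "destination"},
		{Name: "etcd-dst-1", Seed: "destination"},
		{Name: "etcd-dst-2", Seed: "destination"},
	}

	remove := membersToRemove(members, "source")
	remaining := len(members) - len(remove)

	fmt.Printf("quorum before removal: %d of %d\n", quorum(len(members)), len(members))
	fmt.Printf("members to remove: %v\n", remove)
	fmt.Printf("quorum after removal: %d of %d\n", quorum(remaining), remaining)
}
```

In a real cluster the members would be removed one at a time through the etcd membership API, so the sketch should be read as an illustration of the target state rather than of the removal procedure itself.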
| ### Failures and Recovery Strategy | ||
| #### **ETCD Quorum loss** |
| #### **ETCD Quorum loss** | |
| #### etcd Quorum Loss |
| ### etcd-druid | ||
| #### Six member etcd cluster |
| #### Six member etcd cluster | |
| #### 6-Member etcd Cluster |
| If the Kube API Server (KAPI) is down because the underlying etcd has lost quorum, follow the recovery steps outlined under ETCD Quorum Loss. | ||
| ##### ETCD is healthy |
| ##### ETCD is healthy | |
| ##### ETCD is Healthy |
| #### Kube API Server is unhealthy | ||
| ##### ETCD is unhealthy |
| ##### ETCD is unhealthy | |
| ##### ETCD is Unhealthy |
| To trigger an abort, the annotation `migration.shoot.gardener.cloud/abort-live-migration=true` should be added to the Shoot. This is permitted only while the `liveMigration` status with step `SixMemberETCDReady` is in `Failed` state. Once annotated, you can switch the seed name back to the source seed in the Shoot spec; the gardenlet will then orchestrate the cleanup of migration-specific resources across both the source and destination seeds. The destination gardenlet will also take care to remove the newly added members from the etcd cluster. | ||
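The abort precondition above can be sketched in Go as follows. This is an illustration under stated assumptions, not the gardenlet implementation: the annotation key, the step name `SixMemberETCDReady`, and the `Failed` state come from the paragraph above, while the `LiveMigrationStatus` struct and both helper functions are hypothetical.

```go
package main

import "fmt"

// abortAnnotation is the annotation key named in the proposal text.
const abortAnnotation = "migration.shoot.gardener.cloud/abort-live-migration"

// LiveMigrationStatus is a hypothetical, simplified shape of the
// liveMigration status; the real API type may differ.
type LiveMigrationStatus struct {
	Step  string
	State string
}

// abortRequested reports whether the Shoot annotations ask for an abort.
func abortRequested(annotations map[string]string) bool {
	return annotations[abortAnnotation] == "true"
}

// abortPermitted reports whether an abort is allowed for the given status:
// only while the SixMemberETCDReady step is in the Failed state.
func abortPermitted(s LiveMigrationStatus) bool {
	return s.Step == "SixMemberETCDReady" && s.State == "Failed"
}

func main() {
	annotations := map[string]string{abortAnnotation: "true"}
	status := LiveMigrationStatus{Step: "SixMemberETCDReady", State: "Failed"}

	if abortRequested(annotations) && abortPermitted(status) {
		fmt.Println("abort accepted: seed name may be switched back to the source seed")
	} else {
		fmt.Println("abort rejected")
	}
}
```

Keeping the request check and the permission check separate mirrors the text: the annotation expresses intent, and the status decides whether that intent may be honored.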
| ##### Quorum is lost at any other stage |
| ##### Quorum is lost at any other stage | |
| ##### Quorum Is Lost at Any Other Stage |
| #### **ETCD Quorum loss** | ||
| ##### The destination members are unable to join |
| ##### The destination members are unable to join | |
| ##### The Destination Members Are Unable to Join |
| Each etcd member pod is individually exposed to enable direct and controlled peer communication during migration required for the [six member etcd cluster](#six-member-etcd-cluster), allowing it to communicate with its peers in the destination Seed cluster. This exposure is achieved via Istio, using dedicated `Gateway` and `VirtualService` configurations. | ||
| #### Components with Shoot webhooks/Controller and Shoot-managed resources (TBD) |
| #### Components with Shoot webhooks/Controller and Shoot-managed resources (TBD) | |
| #### Components with Shoot Webhooks/Controller and Shoot-Managed Resources (TBD) |
| - [Components with Shoot webhooks/Controller and Shoot-managed resources (TBD)](#components-with-shoot-webhookscontroller-and-shoot-managed-resources-tbd) | ||
| - [Live Migration Flow](#live-migration-flow) | ||
| - [etcd-druid](#etcd-druid) | ||
| - [Six member etcd cluster](#six-member-etcd-cluster) | ||
| - [Member removal from the cluster](#member-removal-from-the-cluster) | ||
| - [Decoupling Member Names from Pod Names](#decoupling-member-names-from-pod-names) | ||
| - [VPN](#vpn) | ||
| - [Failures and Recovery Strategy](#failures-and-recovery-strategy) | ||
| - [ETCD Quorum loss](#etcd-quorum-loss) | ||
| - [Kube API Server is unhealthy](#kube-api-server-is-unhealthy) |
| - [Components with Shoot webhooks/Controller and Shoot-managed resources (TBD)](#components-with-shoot-webhookscontroller-and-shoot-managed-resources-tbd) | |
| - [Live Migration Flow](#live-migration-flow) | |
| - [etcd-druid](#etcd-druid) | |
| - [Six member etcd cluster](#six-member-etcd-cluster) | |
| - [Member removal from the cluster](#member-removal-from-the-cluster) | |
| - [Decoupling Member Names from Pod Names](#decoupling-member-names-from-pod-names) | |
| - [VPN](#vpn) | |
| - [Failures and Recovery Strategy](#failures-and-recovery-strategy) | |
| - [ETCD Quorum loss](#etcd-quorum-loss) | |
| - [Kube API Server is unhealthy](#kube-api-server-is-unhealthy) | |
| - [Components with Shoot Webhooks/Controller and Shoot-Managed Resources (TBD)](#components-with-shoot-webhookscontroller-and-shoot-managed-resources-tbd) | |
| - [Live Migration Flow](#live-migration-flow) | |
| - [etcd-druid](#etcd-druid) | |
| - [6-Member etcd Cluster](#6-member-etcd-cluster) | |
| - [Member Removal from the Cluster](#member-removal-from-the-cluster) | |
| - [Decoupling Member Names from Pod Names](#decoupling-member-names-from-pod-names) | |
| - [VPN](#vpn) | |
| - [Failures and Recovery Strategy](#failures-and-recovery-strategy) | |
| - [ETCD Quorum Loss](#etcd-quorum-loss) | |
| - [Kube API Server Is Unhealthy](#kube-api-server-is-unhealthy) |
|  | ||
| #### Member removal from the cluster |
| #### Member removal from the cluster | |
| #### Member Removal from the Cluster |
What this PR does / why we need it:
Add technical steering proposal for live control plane migration (Live CPM)
This proposal describes the approach to be followed for implementing Live Control Plane Migration (Live CPM) in Gardener, enabling migration of a Shoot cluster’s control plane without API server downtime.
Which issue(s) this PR fixes:
Part of #gardener/gardener#10686
Special notes for your reviewer:
/cc @ScheererJ @rfranzke @vlerenc @timebertt
/hold
until a committee meeting has been scheduled
Release note: