Disruption to k8s postgres/redis caused by cluster upgrades


Joey Wang

Oct 30, 2021, 7:07:14 PM
to gce-discussion
We have a PostgreSQL database and a Redis cache running on GKE, and Google is pushing hard for us to upgrade our clusters to the latest version.

Every time we upgrade, our customers experience about five minutes of disruption. Does anyone have a better solution for this?


Digil (Google Cloud Platform Support)

Nov 1, 2021, 3:18:02 PM
to gce-discussion
GKE master upgrades are inevitable, as they deliver bug fixes, new features requested by customers, and so on. These upgrades can occur on any day of the week and at any time within the maintenance timeframe. However, you can control when they happen with a) a maintenance window, b) a maintenance exclusion, or c) a manual upgrade.

Configuring a maintenance window gives you more control over when upgrades to the Kubernetes software on your cluster or nodes occur. As explained here, a maintenance window is an arbitrary, repeating window of time during which automatic maintenance is permitted. A detailed example of how to configure a maintenance window for an existing cluster is provided here.
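As a rough sketch (the cluster name, zone, and times below are placeholders, not from the thread), a recurring weekend window can be set with gcloud:

```shell
# Sketch: allow automatic maintenance only in a 4-hour window
# on Saturday and Sunday nights (names/times are placeholders).
gcloud container clusters update my-cluster \
  --zone us-central1-a \
  --maintenance-window-start 2021-11-06T02:00:00Z \
  --maintenance-window-end 2021-11-06T06:00:00Z \
  --maintenance-window-recurrence 'FREQ=WEEKLY;BYDAY=SA,SU'
```

The recurrence uses RFC 5545 RRULE syntax, so you can express patterns like "every weekend" or "first Monday of the month".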

You can also configure a maintenance exclusion for your GKE cluster, which prevents auto-upgrades during the exclusion period. A maintenance exclusion is an arbitrary, non-repeating window of time during which automatic maintenance is forbidden. You can find more information about configuring a maintenance exclusion in this help center article; please also refer to the example provided here for additional reference.
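For instance (again a sketch, with placeholder names and dates), you could freeze maintenance over a busy sales period:

```shell
# Sketch: block automatic maintenance over a critical window
# (cluster name, zone, exclusion name, and dates are placeholders).
gcloud container clusters update my-cluster \
  --zone us-central1-a \
  --add-maintenance-exclusion-name black-friday-freeze \
  --add-maintenance-exclusion-start 2021-11-24T00:00:00Z \
  --add-maintenance-exclusion-end 2021-11-29T00:00:00Z
```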

A third approach is to manually upgrade your cluster at a time of your choosing. Manual upgrades begin immediately and ignore any maintenance windows. If you have a test cluster (in your case, your non-prod cluster), I would recommend doing a manual upgrade there first and checking for issues. If everything goes well, you can then start the manual upgrade on your production environment.
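A manual upgrade might look like this (cluster name, zone, and version are placeholders; the control plane must be upgraded before the node pools):

```shell
# List the versions currently available in this zone.
gcloud container get-server-config --zone us-central1-a

# Sketch: upgrade the control plane first...
gcloud container clusters upgrade my-cluster \
  --zone us-central1-a --master --cluster-version 1.21.5-gke.1302

# ...then upgrade the node pool to match.
gcloud container clusters upgrade my-cluster \
  --zone us-central1-a --node-pool default-pool
```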

Finally, you can make your services highly available by deploying them in a regional cluster, because auto-upgrades are minimally disruptive there: a regional cluster remains highly available throughout the upgrade. You can find more information about regional clusters and their upgrades in this help center article.
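Creating a regional cluster is a flag away at creation time (name and region below are placeholders); the control plane and node pools are replicated across the region's zones, so one replica can be upgraded while the others keep serving:

```shell
# Sketch: a regional cluster with a replicated control plane
# (cluster name and region are placeholders).
gcloud container clusters create my-regional-cluster \
  --region us-central1 \
  --num-nodes 1  # nodes per zone, so three zones give three nodes total
```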

Joey Wang

Nov 2, 2021, 7:32:52 AM
to gce-discussion
Hi Digil,

Thanks a lot for your help. As you advised, I've set up a maintenance window and put production on the Stable release channel so we can arrange upgrades ourselves.
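For anyone following along, enrolling a cluster in a release channel can be done with gcloud (a sketch; cluster name and zone are placeholders):

```shell
# Sketch: move an existing cluster to the Stable release channel,
# which receives upgrades later and more conservatively than Regular/Rapid.
gcloud container clusters update my-cluster \
  --zone us-central1-a \
  --release-channel stable
```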

The other option would be to create a new node pool with the new version, gradually migrate the pods to it, and then delete the old pool. We don't want to do this manually, so the challenge is how to detect that a node is upgrading and decide which pod should move first. At the same time, we might switch to Redis Cluster, so that the Redis server isn't the blocking part.

I haven't found any information yet on how to customize container migration; it seems to me the order is randomized by Google.
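One way to control the order yourself is to cordon the old pool and drain its nodes one at a time; `kubectl drain` honors PodDisruptionBudgets, so a database or cache can keep quorum while its pods reschedule onto the new pool. A sketch (pool and cluster names are placeholders; `cloud.google.com/gke-nodepool` is the label GKE puts on each node):

```shell
# Sketch: create the replacement pool at the new version.
gcloud container node-pools create new-pool \
  --cluster my-cluster --zone us-central1-a --num-nodes 3

# Stop the scheduler from placing new pods on the old pool...
for node in $(kubectl get nodes \
    -l cloud.google.com/gke-nodepool=old-pool -o name); do
  kubectl cordon "$node"
done

# ...then evict pods node by node, in whatever order you choose.
for node in $(kubectl get nodes \
    -l cloud.google.com/gke-nodepool=old-pool -o name); do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
done

# Once everything is running on new-pool, remove the old pool.
gcloud container node-pools delete old-pool \
  --cluster my-cluster --zone us-central1-a
```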


Marc Vintró Alonso

Nov 3, 2021, 5:53:29 AM
to gce-discussion

The best way to avoid downtime during a master upgrade is to create a regional cluster with master high availability [1].
This gives you zero-downtime master upgrades and resizes, and reduces downtime from master failures overall.
The idea behind a regional cluster is basically to add redundancy to your cluster's control plane.

Regarding workload migration, you can follow this documentation [2].
