Surge Docs

Niobe Hennigan

Aug 5, 2024, 12:12:14 PM

to chrisarineth

This document provides a brief overview of standard rolling updates and then goes into detail about surge updates, which are a special kind of rolling update. Compared to standard rolling updates, surge updates let you configure the speed of the update. Surge updates also let you exert some control over how disruptive updates are to your workloads.

Some updates to a node pool, such as when you modify a node pool's annotations, don't require the restart of nodes, and so don't elicit a rolling update. If GKE on AWS can apply changes to a node pool without having to restart or recreate resources, it will do so to prevent disruptions.

However, most updates to a node pool in GKE on AWS involve terminating existing nodes and launching new nodes with the updated settings. The process of terminating existing nodes can disrupt workloads.

By default, GKE on AWS performs standard rolling updates. This method updates nodes one at a time, and they are replaced using a "terminate before create" approach: a node is terminated first, and then a new updated node is launched. This minimizes disruption because only one node is terminated and replaced at any given moment.

In GKE on AWS, the standard rolling method updates nodes one at a time. Surge updates, which are a form of rolling update, let you update multiple nodes simultaneously. Surge updates are therefore faster than standard rolling updates. However, updating several nodes simultaneously can disrupt workloads. To mitigate this, surge updates provide options to modulate the level of disruption to your workloads.

Another way surge updates can differ from standard rolling updates is the way nodes are replaced. Standard rolling updates replace nodes using a "terminate before create" strategy. Depending on the settings you choose, surge updates can use either a "create before terminate" strategy, a "terminate before create" strategy, or even a combination of both.
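As a rough illustration of how the two settings relate to the replacement strategy, here is a minimal Python sketch. The function name `replacement_strategy` is hypothetical and not part of any GKE tooling. The mapping follows the configurations discussed in this document. Rejecting the case where both settings are zero is an assumption of the sketch.

```python
def replacement_strategy(max_surge: int, max_unavailable: int) -> str:
    """Return the replacement strategy implied by the two surge settings."""
    if max_surge < 0 or max_unavailable < 0:
        raise ValueError("settings must be non-negative")
    if max_surge == 0 and max_unavailable == 0:
        # Assumption of this sketch: with both settings at zero,
        # no node could ever be replaced, so treat it as invalid.
        raise ValueError("at least one setting must be greater than zero")
    if max_unavailable == 0:
        # Only surge nodes: new nodes come up before old ones go away.
        return "create before terminate"
    if max_surge == 0:
        # No surge nodes: old nodes go away before replacements exist.
        return "terminate before create"
    return "combination"


print(replacement_strategy(1, 0))  # create before terminate
print(replacement_strategy(0, 1))  # terminate before create
print(replacement_strategy(2, 1))  # combination
```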

The cluster autoscaler plays a more important role in surge updates than in standard rolling updates, which is why it figures prominently in the following list of actions GKE on AWS takes during a surge update:

In this example, max-surge-update is set to 2, max-unavailable-update is set to 1, and you're providing a new node pool version (that is, you're changing the GKE version that is running on the nodes in the node pool).

While standard rolling updates often use a "terminate before create" approach, surge updates introduce more flexibility. Depending on the configuration, surge updates can follow a "create before terminate" strategy, a "terminate before create" strategy, or a combination of both. This section describes different configurations to help you select the best approach for your workloads.

The most straightforward way to use surge updates is with the default configuration of max-surge-update=1 and max-unavailable-update=0. This configuration adds only 1 surge node to the node pool during the update, and only 1 node is updated at a time, following a "create before terminate" approach. Compared to the standard non-surge rolling update, which is equivalent to (max-surge-update=0, max-unavailable-update=1), this method is less disruptive, accelerates Pod restarts during updates, and is more conservative in its progression.

Note that the balanced setting can lead to extra costs because of the temporary surge node added during the update. This additional node incurs charges while it's active, slightly raising the overall expense compared to methods without surge nodes.

For workloads that can tolerate interruptions, a faster update approach might be suitable. Configuring max-surge-update=0 and max-unavailable-update=20 achieves this. With this configuration, 20 nodes can be updated simultaneously without adding any surge nodes. This update method follows a "terminate before create" approach. Because no additional surge nodes are introduced during the process, this method is also the most cost-effective, avoiding extra expenses associated with temporary nodes.

If your workloads are sensitive to disruption, you can increase the speed of the update with the following settings: max-surge-update=20 and max-unavailable-update=0. This configuration updates 20 nodes in parallel in a "create before terminate" fashion.

However, the overall speed of the update can be constrained if you've set up PodDisruptionBudgets (PDB) for your workloads. This is because the PDB restricts the number of Pods that can be drained at any given moment. Although the configurations of PDBs may vary, if you create a PDB with maxUnavailable equal to 1 for one or more workloads running on the node pool, then only one Pod of those workloads can be evicted at a time, limiting the parallelism of the entire update.
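To make the PDB constraint concrete, here is a small Python sketch of the arithmetic. It rests on two simplifying assumptions that are not from the documentation: every node runs exactly one Pod of the PDB-protected workload, and the number of nodes that can be in flight at once is the sum of the two surge settings. The function name is hypothetical.

```python
def effective_parallel_drains(max_surge: int,
                              max_unavailable: int,
                              pdb_max_unavailable: int) -> int:
    """Estimate how many nodes can actually drain at the same time.

    Assumptions of this sketch: one PDB-protected Pod per node, and
    node-level parallelism equal to max_surge + max_unavailable.
    """
    node_parallelism = max_surge + max_unavailable
    # The PDB caps concurrent evictions, so the smaller limit wins.
    return min(node_parallelism, pdb_max_unavailable)


# 20 nodes may be updated in parallel, but the PDB allows one eviction.
print(effective_parallel_drains(20, 0, 1))  # 1
# A looser PDB lets more of the configured parallelism through.
print(effective_parallel_drains(20, 0, 5))  # 5
```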

Recall that initiating multiple surge nodes at the start of the update process can lead to a temporary increase in costs, especially when compared to configurations that don't add extra nodes or add fewer nodes during updates.
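The speed and cost trade-off between the three configurations above can be sketched with a back-of-the-envelope estimate in Python. This is a simplification, not how GKE on AWS schedules work: it treats the update as fixed-size batches of max-surge-update + max-unavailable-update nodes, ignores PDBs, and uses a hypothetical pool of 40 nodes.

```python
import math


def estimate_update(pool_size: int, max_surge: int, max_unavailable: int) -> dict:
    """Rough estimate of update rounds and temporary extra nodes."""
    parallelism = max_surge + max_unavailable
    if parallelism <= 0:
        raise ValueError("at least one setting must be greater than zero")
    return {
        # Number of rounds if every round updates `parallelism` nodes.
        "batches": math.ceil(pool_size / parallelism),
        # Surge nodes are the only temporary nodes that add cost.
        "extra_nodes": max_surge,
    }


POOL_SIZE = 40  # hypothetical node pool size
for name, surge, unavailable in [
    ("balanced (default)", 1, 0),
    ("fast, no surge nodes", 0, 20),
    ("fast, create before terminate", 20, 0),
]:
    print(name, estimate_update(POOL_SIZE, surge, unavailable))
```

Under these assumptions, the default needs 40 rounds with 1 extra node, while both fast configurations need 2 rounds, with 0 and 20 extra nodes respectively.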

For information about how to enable and configure surge updates for GKE on AWS, see Configure surge updates of node pools.

By choosing an upgrade strategy for your Standard cluster node pool, you can pick the process with the right balance of speed, workload disruption, risk mitigation, and cost optimization. To learn more about which node upgrade strategy is right for your environment, see Choose surge upgrades and Choose blue-green upgrades.

With both strategies, you can configure upgrade settings to optimize the process based on your environment's needs. To learn more, see Configure your chosen upgrade strategy. Ensure that for the strategy that you pick, you have enough quota, resource availability, or reservation capacity to upgrade your nodes using that strategy. For more information, see Ensure resources for node upgrades.

Surge upgrades are the default upgrade strategy, and best for applications that can handle incremental changes. Surge upgrades use a rolling method to upgrade nodes, in an undefined order. Find the optimal balance of speed and disruption for your environment by choosing how many new, surge nodes can be created, with maxSurge, and how many existing nodes can be disrupted at once, with maxUnavailable.

Use the surge upgrade settings to select the appropriate balance between speed and disruption for your node pool during cluster maintenance. You can change how many nodes GKE attempts to upgrade at once by changing the surge upgrade parameters on a Standard node pool.

Set maxSurge to choose the maximum number of additional, surge nodes that can be added to the node pool during an upgrade, per zone, increasing the likelihood that workloads running on the existing node can migrate to a new node immediately. The default is one. To upgrade one node, GKE does the following steps:

For GKE to create surge nodes, your project must have the resources to temporarily create additional nodes. If you don't have additional capacity, GKE won't start upgrading a node until the resources are available. To learn more, see Resources for surge upgrades.

Set maxUnavailable to choose the maximum number of nodes that can be simultaneously unavailable during an upgrade, per zone. The default is zero. Workloads running on the existing node might need to wait for the existing node to upgrade, if no other nodes have capacity. To upgrade one node, GKE does the following steps:

When GKE recreates the existing node, GKE temporarily releases the capacity of the node if the capacity isn't from a reservation. This means that if there is limited capacity, you risk losing the existing capacity. So, if your environment is resource-constrained, use this setting only if you're using reserved nodes. To learn more, see Upgrade in a resource-constrained environment.

During a surge upgrade with this node pool, in a rolling window, GKE creates two upgraded nodes, and disrupts at most one existing node at a time. GKE brings down at most three existing nodes after the upgraded nodes are ready. During the upgrade process, the node pool will include between four and seven nodes.
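The node counts in the paragraph above can be checked with a short Python sketch. The example's setup is not included in the text as pasted, so the values used here (a single-zone pool of 5 nodes with maxSurge=2 and maxUnavailable=1) are inferred from the numbers in the paragraph and should be read as an assumption. The function name is hypothetical.

```python
def node_count_range(pool_size: int, max_surge: int, max_unavailable: int):
    """Return the (minimum, maximum) node count during a surge upgrade."""
    # Fewest nodes: the unavailable nodes are down, no surge node is up yet.
    minimum = pool_size - max_unavailable
    # Most nodes: every surge node is up alongside the full pool.
    maximum = pool_size + max_surge
    return minimum, maximum


# Inferred example values: 5 nodes, maxSurge=2, maxUnavailable=1.
low, high = node_count_range(5, 2, 1)
print(low, high)  # 4 7

# Nodes that can be in flight at once: surge plus unavailable.
print(2 + 1)  # 3
```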

The simplest way to take advantage of surge upgrades is to use the default configuration, maxSurge=1;maxUnavailable=0. With this configuration, upgrades progress slowly, with only one surge node added at a time, meaning only one node is upgraded at a time. Pods can restart immediately on the new, surge node. This configuration only requires the resources to temporarily create one new node.

If you have a large node pool and your workload isn't sensitive to disruption (for example, a batch job that has run to completion), use the following configuration to maximize speed without using any additional resources: maxSurge=0;maxUnavailable=20. This configuration does not bring up additional surge nodes and allows 20 nodes to be upgraded at the same time.

If your workload is sensitive to disruption, you have already set up PodDisruptionBudgets (PDB), and you are not using externalTrafficPolicy: Local (which does not work with parallel node drains), you can increase the speed of the upgrade by using maxSurge=20;maxUnavailable=0. This configuration upgrades 20 nodes in parallel while the PDB limits the number of Pods that can be drained at a given time. Although the configurations of PDBs may vary, if you create a PDB with maxUnavailable=1 for one or more workloads running on the node pool, then only one Pod of those workloads can be evicted at a time, limiting the parallelism of the entire upgrade. This configuration requires the resources to temporarily create 20 new nodes.

You can cancel an in-progress surge upgrade at any time during the upgrade process. Cancelling pauses the upgrade, stopping GKE from upgrading new nodes, but doesn't automatically roll back the upgrade of the already-upgraded nodes. After you cancel an upgrade, you can either resume or roll back.
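The cancel, resume, and roll back behavior can be pictured with a toy Python model. This is only an illustration of the state the node pool is left in, not GKE's implementation. The node names, version strings, and function names are hypothetical, and nodes are processed in sorted order purely to keep the example deterministic.

```python
def upgrade(nodes: dict, new_version: str, limit=None) -> list:
    """Upgrade up to `limit` pending nodes; return the names upgraded."""
    upgraded = []
    for name in sorted(nodes):
        if limit is not None and len(upgraded) >= limit:
            break  # models a cancel: remaining nodes are left untouched
        if nodes[name] != new_version:
            nodes[name] = new_version
            upgraded.append(name)
    return upgraded


def roll_back(nodes: dict, old_version: str) -> None:
    """Return every node to the old version."""
    for name in nodes:
        nodes[name] = old_version


nodes = {"node-a": "old", "node-b": "old", "node-c": "old"}

# Cancel after one node: the upgraded node stays upgraded.
upgrade(nodes, "new", limit=1)
print(nodes)  # {'node-a': 'new', 'node-b': 'old', 'node-c': 'old'}

# Option 1: resume, which finishes the remaining nodes.
resumed = dict(nodes)
upgrade(resumed, "new")
print(resumed)  # {'node-a': 'new', 'node-b': 'new', 'node-c': 'new'}

# Option 2: roll back, which reverts the already-upgraded node.
reverted = dict(nodes)
roll_back(reverted, "old")
print(reverted)  # {'node-a': 'old', 'node-b': 'old', 'node-c': 'old'}
```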