automatic scaling when using websockets

Toon Knapen

Jan 31, 2020, 8:59:40 AM

to Google App Engine

Can anybody point me to more info on how automatic scaling works when using websockets?

https://cloud.google.com/appengine/docs/flexible/python/how-instances-are-managed mentions that automatic scaling is based on response latency and request rate. But my App Engine app only serves websockets, so the request rate and the number of simultaneous requests in flight are less relevant.

And when scaling down, will the instance with the fewest connections be killed, or how does that selection process work?

And when scaling up, while the new instance is still warming up, will the websocket connection go to one of the existing instances, or will the request receive an error?

Thanks in advance for clarifying

toon

Olu

Feb 3, 2020, 11:12:06 AM

to Google App Engine

Hello, Toon

As you may already know, App Engine flex currently supports the WebSocket protocol in beta[1]. In general, the autoscaling policy of App Engine flex[2] is based on the scaling characteristics of the Compute Engine autoscaler. The App Engine autoscaler considers a number of parameters[3], including the ones you listed as well as other application metrics such as average CPU utilization. It can make scaling decisions based on multiple metrics, not only response latency and request rate.

When scaling down, the autoscaler simply sends shutdown signals[4] to instances that are considered idle. Instances are considered busy while they are handling requests; in your use case, an instance with no open connections is considered idle. Autoscaling in App Engine relies on algorithms that continually decide whether it is better to queue a request, spin up a new instance, shut down an instance, or adjust the resident instances to find the best setting for the use case.
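
A minimal sketch of reacting to that shutdown signal, assuming it arrives as SIGTERM (as described in [4]) and assuming each connection object exposes a close(reason) method. ConnectionRegistry, FakeConnection and install_shutdown_handler are hypothetical names for illustration, not part of App Engine or of any WebSocket library:

```python
import signal


class ConnectionRegistry:
    """Hypothetical in-process registry of open WebSocket connections."""

    def __init__(self):
        self.connections = set()
        self.shutting_down = False

    def request_shutdown(self, signum=None, frame=None):
        # Signal handlers should do as little as possible: only set a flag.
        self.shutting_down = True

    def drain(self, reason="instance shutting down"):
        """Close every tracked connection and return how many were closed."""
        closed = 0
        for conn in list(self.connections):
            conn.close(reason)
            self.connections.discard(conn)
            closed += 1
        return closed


class FakeConnection:
    """Stand-in for a real connection object (assumption: it has close())."""

    def __init__(self):
        self.closed_with = None

    def close(self, reason):
        self.closed_with = reason


def install_shutdown_handler(registry):
    # Must be called from the main thread, e.g. once at application startup.
    signal.signal(signal.SIGTERM, registry.request_shutdown)
```

The application's main loop would check registry.shutting_down, stop accepting new connections, and call registry.drain() so that clients can reconnect to another instance instead of being cut off without notice.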

As for how connections are handled while new instances are still warming up: the cool_down_period_sec parameter[5] controls how long the App Engine autoscaler waits before it starts collecting information from a newly created instance, so that readings taken during initialization do not affect scaling decisions. The default value is 120 seconds, and you can define a different value in app.yaml if it suits your use case.

[1]https://cloud.google.com/blog/products/application-development/introducing-websockets-support-for-app-engine-flexible-environment

[2]https://cloud.google.com/appengine/docs/flexible/python/flexible-for-standard-users#scaling_characteristics

[3]https://cloud.google.com/compute/docs/autoscaler/#policies

[4]https://cloud.google.com/appengine/docs/flexible/custom-runtimes/build#application_shutdown

[5]https://cloud.google.com/appengine/docs/flexible/custom-runtimes/configuring-your-app-with-app-yaml#automatic_scaling

Toon Knapen

Feb 3, 2020, 1:21:11 PM

to Google App Engine

Thanks for the answer together with all the references.

You say the autoscaler can make scaling decisions based on multiple metrics. But in the yaml config only cpu_utilization.target_utilization can be set. From that I deduce that scaling in App Engine flex is always based on CPU utilization; is that correct?

Next, given that my instances hold continuous connections, what happens if I have 3 instances that each have only 1 client connected and a CPU load of only 0.1? The average CPU load is then well below the (default) target CPU utilization of 0.5, but none of the instances is idle.

thanks in advance again!

toon

David Do

Feb 4, 2020, 4:52:53 PM

to Google App Engine

Hello,

As Olu mentioned, several settings, called resources, can affect scaling decisions, as described in this documentation [1], which lists CPU, as you stated, but also memory. Depending on your configuration in the app.yaml file, these settings determine the resources available per instance: App Engine creates a machine type based on the CPU and memory values. That machine is guaranteed to have at least what you specified, and may have more.

As for the target utilization, it is the average CPU utilization that the autoscaler should maintain [2]. If the average CPU utilization is higher than the target, the autoscaler creates new instances; if it is lower than the target, the autoscaler removes instances to maintain the utilization you have set [3]. So, in your case, with the target CPU utilization at 0.5 and your 3 instances at 0.1 each (not idle, since they still have load to work on), the average is below the target and the autoscaler would scale down by removing instances. You can prevent that by setting min_num_instances to 3 in app.yaml, as indicated in [1].
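
The arithmetic in the 3-instance scenario can be sketched with a simplified target-utilization model. This is an assumption for illustration only, not the autoscaler's actual algorithm; desired_instances is a hypothetical helper, and only the target utilization, min_num_instances and max_num_instances correspond to app.yaml settings:

```python
import math


def desired_instances(cpu_loads, target_utilization,
                      min_num_instances, max_num_instances):
    """Estimate how many instances keep average CPU near the target.

    Simplified model (an assumption, not the real autoscaler): total CPU
    load divided by the target utilization, rounded up, then clamped.
    """
    total_load = sum(cpu_loads)
    # round() guards against floating-point noise such as 2.0000000000000004
    needed = math.ceil(round(total_load / target_utilization, 9))
    return max(min_num_instances, min(max_num_instances, needed))


loads = [0.1, 0.1, 0.1]  # three instances, one client each

# Without a floor of 3, the model keeps a single instance.
print(desired_instances(loads, 0.5, min_num_instances=1, max_num_instances=20))
# With min_num_instances=3, all three instances are kept.
print(desired_instances(loads, 0.5, min_num_instances=3, max_num_instances=20))
```

Under this model the first call yields 1 and the second yields 3, which is why a floor on the instance count keeps all three lightly loaded instances running.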

[1] https://cloud.google.com/appengine/docs/flexible/python/reference/app-yaml#resource-settings

[2] https://cloud.google.com/compute/docs/autoscaler/scaling-cpu-load-balancing?hl=en_US&_ga=2.46193063.-1877344601.1575944051#scaling_based_on_cpu_utilization

[3] https://cloud.google.com/compute/docs/autoscaler/scaling-cpu-load-balancing?hl=en_US&_ga=2.46193063.-1877344601.1575944051#scaling_based_on_cpu_utilization
