Error while handling "large'ish" number of parallel requests

Vinay Chitlangia

Sep 28, 2016, 10:36:46 AM
to Google App Engine
Hi,
I am using appengine flexible environment.

I am facing issues when there is a sudden burst of requests to our system (about 30 requests in one second). The server gets back to normal in 5-10 seconds, perhaps because the traffic decreases again.
Normally requests arrive at about 5-8 per second.

I am running with 2 instances.
<automatic-scaling>
  <min-num-instances>2</min-num-instances>
  <max-num-instances>10</max-num-instances>
  <cool-down-period-sec>120</cool-down-period-sec>
  <cpu-utilization>
    <target-utilization>0.7</target-utilization>
  </cpu-utilization>
</automatic-scaling>

AFAICT the failed requests never reach my servlet. (I added a log statement as the first line of the servlet, and it is not printed for the failed requests.)
"loading_request" is 0 for the failed requests as well.

The error code is 502.

Are the requests being queued while a new instance comes up? If so, is it possible to override that, that is, start bringing up the new instances but keep using the existing ones if startup is taking a while? The log shows some of the requests waiting for 600 seconds, though most fail in 10-15 ms.
The failure affects between 0.5% and 2% of the requests.

Adam (Cloud Platform Support)

Oct 2, 2016, 12:59:31 PM
to Google App Engine
How many instances exist when the traffic burst occurs? Are you hitting your defined max of 10 or is it not able to scale up to the maximum?

Two possibilities come to mind: hitting the instance max and requests are then being discarded, or scaling isn't happening quickly enough and there are too many pending requests.

A value of 0 for "loading_request" means it was not the very first request to the instance which caused it to spin up, which is normal for the majority of requests.

Vinay Chitlangia

Oct 4, 2016, 12:43:43 AM
to Google App Engine


On Sunday, October 2, 2016 at 10:29:31 PM UTC+5:30, Adam (Cloud Platform Support) wrote:
> How many instances exist when the traffic burst occurs? Are you hitting your defined max of 10 or is it not able to scale up to the maximum?

The number of instances does not go beyond 5.

> Two possibilities come to mind: hitting the instance max and requests are then being discarded, or scaling isn't happening quickly enough and there are too many pending requests.

In the case of a "burst", there are ~20-30 concurrent requests.
And as I mentioned in my previous mail, the error code is 502, and there are no logs from my servlet.

Adam (Cloud Platform Support)

Oct 4, 2016, 1:21:06 PM
to Google App Engine
It's possible that new instances may not be spinning up quickly enough to respond to spikes in load, causing requests to be discarded. I'd recommend lowering the <cool-down-period-sec> value to 60 (the minimum) to allow the autoscaler checks to occur more frequently, and keep <target-utilization> at 0.5 to give new instances more headroom to spin up.

Try this first to see if the incidence of 502 errors decreases. You can adjust further by increasing <min-num-instances> or lowering <target-utilization> further.
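
A minimal sketch of that starting point, assuming the settings go in the same <automatic-scaling> block shown in the original post and that the instance limits are left unchanged (the thread does not say otherwise):

```xml
<automatic-scaling>
  <!-- Assumption: instance limits stay as in the original post -->
  <min-num-instances>2</min-num-instances>
  <max-num-instances>10</max-num-instances>
  <!-- Lowered from 120 to 60 (the minimum) so autoscaler checks occur more often -->
  <cool-down-period-sec>60</cool-down-period-sec>
  <cpu-utilization>
    <!-- The original post used 0.7; 0.5 leaves more headroom for new instances -->
    <target-utilization>0.5</target-utilization>
  </cpu-utilization>
</automatic-scaling>
```

If 502s persist, the next knobs to try would be a higher <min-num-instances> or a lower <target-utilization>, changed one at a time so the effect of each is visible.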

Java apps in particular can incur large startup times due to classpath scanning and initialization. If you can give some details about the types of frameworks and libraries you're using, as well as the request times you're seeing for loading requests (loading_request = 1), I may be able to give some further recommendations.

Deepak Singh

Oct 4, 2016, 1:44:06 PM
to google-a...@googlegroups.com
We have the same problem with our app as well.
It is a Java app on the flexible environment. Untraced time goes up to 140000, and then we get a 502 response code.

--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to google-appengine+unsubscribe@googlegroups.com.
To post to this group, send email to google-appengine@googlegroups.com.
Visit this group at https://groups.google.com/group/google-appengine.
To view this discussion on the web visit https://groups.google.com/d/msgid/google-appengine/f9fc1696-d71f-4684-a490-e612f9d17d23%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.



--
Deepak Singh

Vinay Chitlangia

Oct 5, 2016, 1:47:51 AM
to google-a...@googlegroups.com
On Tue, Oct 4, 2016 at 10:51 PM, 'Adam (Cloud Platform Support)' via Google App Engine <google-a...@googlegroups.com> wrote:
> It's possible that new instances may not be spinning up quickly enough to respond to spikes in load, causing requests to be discarded. I'd recommend lowering the <cool-down-period-sec> value to 60 (the minimum) to allow the autoscaler checks to occur more frequently, and keep <target-utilization> at 0.5 to give new instances more headroom to spin up.

Thanks Adam.
I have restarted our servers with a target-utilization of 0.4. (I have kept the cool-down period the same, in the interest of changing one variable at a time.)

> Try this first to see if the incidence of 502 errors decreases. You can adjust further by increasing <min-num-instances> or lowering <target-utilization> further.
>
> Java apps in particular can incur large startup times due to classpath scanning and initialization. If you can give some details about the types of frameworks and libraries you're using, as well as the request times you're seeing for loading requests (loading_request = 1), I may be able to give some further recommendations.

This is a backend server with Cloud Bigtable as its major dependency. I could not correlate the 502s in my server with the Bigtable errors (as reported by the cluster dashboard).
I will try to see if I can capture some loading_request=1 requests when the server is under duress. The very first requests at server startup are well behaved, that is to say, my server does not have a very big upfront cost: it does about 2 seconds' worth of initialization, which is done by a <load-on-startup> servlet.

Vinay Chitlangia

Oct 5, 2016, 3:11:53 AM
to google-a...@googlegroups.com


So this is what has happened, and I am not sure what to make of it.
The number of instances is now at 10. The error rate is 0.6% with the latest deployment, so it is at the lower end of the earlier 0.5-2% range, but it has certainly not broken the proverbial glass :)
The total number of requests for the day is well within the normal range, so there is no X factor: we are not having an unusually busy or slow day.
While I was at it, I have also changed the cool-down period to what you suggested. I am all in :)
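
For reference, the combination described in this message (target utilization 0.4, cool-down period 60) would look roughly like the sketch below. It assumes the instance limits were left as in the original post, which the thread does not confirm:

```xml
<automatic-scaling>
  <!-- Assumption: unchanged from the original post -->
  <min-num-instances>2</min-num-instances>
  <max-num-instances>10</max-num-instances>
  <!-- Changed from 120 to the suggested 60 -->
  <cool-down-period-sec>60</cool-down-period-sec>
  <cpu-utilization>
    <!-- Lowered from 0.7 to 0.4, below the suggested 0.5 -->
    <target-utilization>0.4</target-utilization>
  </cpu-utilization>
</automatic-scaling>
```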

Adam (Cloud Platform Support)

Oct 7, 2016, 5:52:37 PM
to Google App Engine
Thanks for the updates. If the issue happens again, it would also be useful to know the message detail from the 502 response. In the context of App Engine flexible, a 502 usually means there are no instances ready to serve the request; however, it's good to avoid any ambiguity.

Vinay Chitlangia

Oct 8, 2016, 1:14:28 AM
to Google App Engine
Things have improved with your suggested changes. I have not seen any 502s in the last 2 days.

There is nothing in the logs when a 502 does happen. Here is a copy/paste of a sample log entry:

13:38:54.761  POST  502  281 B  1,043.1 s  AppEngine-Google; (+http://code.google.com/appengine  /read
107.178.195.146 - - [06/Oct/2016:13:38:54 +0530] "POST /read HTTP/1.1" 502 281 - "AppEngine-Google; (+http://code.google.com/appengine;)" ms=1043100 cpu_ms=0 cpm_usd=3.1404e-8 loading_request=0 instance=- app_engine_release=1.9.44 trace_id=-
{
  metadata: {…}
  protoPayload: {…}
  insertId: "2016-10-06|01:26:18.075585-07|10.106.4.84|873295067"
  httpRequest: {
    status: 502
  }
  operation: {…}
}