502 Bad Gateway error

Vinay Chitlangia

Feb 7, 2017, 12:38:55 AM
to Google App Engine
Hi,
We are seeing intermittent occurrences of 502 Bad Gateway error in our server.
About 0.5% requests fail with this error.

Our setup is:
Flex running jetty9-compat
F1 machine
1 server

Our request pattern is bursty, so the server gets ~30 requests in parallel.
The failures, when they happen, are clustered: over a period of roughly 10 seconds one sees 3-4 errors.

The requests that complete successfully finish in 50-100 ms, so it does not appear that the server is under major load and unable to keep up.
To rule out this possibility, I started the server with 5 replicas; however, the failure percentage did not change.

From the looks of it, there appears to be some throttling or quota issue at play. I tried tweaking the max-concurrent-requests param, setting it to 300, but that did not make any difference either.
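As a stopgap while investigating, intermittent 502s like these can be absorbed on the client side with retries and backoff. A minimal JDK-only sketch (the local `HttpServer` merely stands in for a gateway that fails twice and then succeeds; the attempt and backoff limits are illustrative, not our production values):

```java
import com.sun.net.httpserver.HttpServer;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.util.concurrent.atomic.AtomicInteger;

public class RetryDemo {
    // Retry transient 502s with exponential backoff; limits are illustrative.
    static int fetchWithRetry(String url, int maxAttempts) throws Exception {
        long backoffMs = 100;
        for (int attempt = 1; ; attempt++) {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            int code = conn.getResponseCode();
            conn.disconnect();
            if (code != 502 || attempt == maxAttempts) {
                return code;
            }
            Thread.sleep(backoffMs);
            backoffMs *= 2;
        }
    }

    // Stand-in gateway: serves 502 for the first two requests, then 200.
    static int runDemo() throws Exception {
        AtomicInteger hits = new AtomicInteger();
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/read", ex -> {
            int status = hits.incrementAndGet() <= 2 ? 502 : 200;
            ex.sendResponseHeaders(status, -1); // -1: no response body
            ex.close();
        });
        server.start();
        try {
            return fetchWithRetry(
                "http://localhost:" + server.getAddress().getPort() + "/read", 5);
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runDemo()); // 200 after two retried 502s
    }
}
```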

I do not see new instances being created at the time of failure either.


The request log for the failed request:
09:57:30.686 POST /read 502 262 B 4 ms AppEngine-Google; (+http://code.google.com/appengine; appid: s~village-test)
107.178.194.3 - - [07/Feb/2017:09:57:30 +0530] "POST /read HTTP/1.1" 502 262 - "AppEngine-Google; (+http://code.google.com/appengine; appid: s~village-test)" ms=4 cpu_ms=0 cpm_usd=2.9279999999999998e-8 loading_request=0 instance=- app_engine_release=1.9.48 trace_id=-
{
protoPayload: {…} 
insertId: "58994cb30002335cb47fd364" 
httpRequest: {…} 
resource: {…} 
timestamp: "2017-02-07T04:27:30.686052Z" 
labels: {…} 

operation: {…} 
}

Looking at other logs around the time of the failure, I see:
09:57:30.000[error] 32#32: *35107 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 169.254.160.2, server: , request: "POST /read HTTP/1.1", upstream: "http://172.17.0.4:8080/read", host: "bigtable-dev.appspot.com"
AFAICT this request never made it to our servlet.
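That nginx line means its recv() from the upstream (our app container) hit ECONNRESET before any response headers were written. A hypothetical, JDK-only reproduction of that failure mode, using SO_LINGER(0) so the stand-in "backend" aborts the connection the way a crashed or overloaded process would:

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class ResetDemo {
    // Simulates a backend that accepts a request, then aborts the connection
    // before writing any response bytes, matching nginx's recv() error.
    static String attempt() throws Exception {
        try (ServerSocket backend = new ServerSocket(0)) {
            Thread server = new Thread(() -> {
                try (Socket s = backend.accept()) {
                    s.getInputStream().read(new byte[1024]); // consume the request
                    s.setSoLinger(true, 0);                  // close() now sends RST, not FIN
                } catch (IOException ignored) {
                }
            });
            server.start();
            try (Socket client = new Socket("localhost", backend.getLocalPort())) {
                client.getOutputStream().write("POST /read HTTP/1.1\r\n\r\n".getBytes());
                int first = client.getInputStream().read(); // where nginx's recv() fails
                return first == -1 ? "closed" : "data";
            } catch (IOException e) {
                return "reset"; // 104: Connection reset by peer
            } finally {
                server.join();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(attempt());
    }
}
```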

Vinay Chitlangia

Feb 8, 2017, 11:27:43 AM
to Google App Engine
Might be related:

The symptoms mentioned in this blog
  • Somewhat moderate requests
  • No logs
match our observations.

I do not see the "backend_connection_closed_before_data_sent_to_client" status in the logs.

The error message for a failed request received by the client is:
11:12:44.549com.yawasa.server.storage.RpcStorageService LogError: <html><head><title>502 Bad Gateway</title></head><body bgcolor="white"><center><h1>502 Bad Gateway</h1></center><hr><center>nginx</center></body></html> (RpcStorageService.java:137)

The mention of nginx in the log message appears promising. We are not deliberately using nginx, so I assume this is something happening under the hood.

Nicholas (Google Cloud Support)

Feb 8, 2017, 11:59:08 AM
to Google App Engine
Hey Vinay Chitlangia,

Thanks for some preliminary troubleshooting and for linking this interesting article.  App Engine runs Nginx processes to route requests to your application's handlers.  Handlers serving static assets, for instance, are served directly by this Nginx process, bypassing the application altogether to save on precious application resources.

The Nginx process will often serve a 502 if the application raises an exception, an internal API call raises an exception, or the request simply takes too long.  As such, the status code by itself does not tell us much.

Looking at the GAE logs for your application, I found the 502s you mentioned.  One thing I noticed is that they all occur from the /read endpoint.  From the naming, I assume this endpoint is reading some data from BigTable.  Investigating further, perhaps you could provide some additional information:
  • What exactly is happening at the /read endpoint?  A code sample would be ideal if that's not too sensitive.
  • What kind of error handling exists in said endpoint if the BigTable API returns non-success responses?
  • Can you log various steps in the /read endpoint?  This might help identify the progress the request reaches before the 502 is served.  It would also help in confirming that your application is actually even getting the request as I can't currently confirm that from the logs.
  • If said endpoint does in fact read from BigTable, what API and java library are you using?
Regarding the article you linked, while the configuration of an HTTPS load balancer and nginx.conf can be very important, both the load balancing component and nginx.conf are out of the hands of the developer with App Engine.  Your scaling settings, health check settings and handlers in the app.yaml are the only rules over which you have control that affect load balancing and nginx rules.
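The per-step logging suggested above could be as simple as the following sketch. The handler structure, names, and `lookup()` stub are hypothetical; in the real app these log lines would sit in the servlet's doPost, with the stub replaced by the actual BigTable call:

```java
import java.util.logging.Logger;

public class ReadHandlerSketch {
    private static final Logger LOG = Logger.getLogger("read");

    // Hypothetical /read handler instrumented at each step.
    static String handleRead(String rowKey) {
        LOG.info("received /read for " + rowKey); // proves the request reached the app at all
        try {
            LOG.info("starting BigTable lookup");
            String value = lookup(rowKey);
            LOG.info("lookup complete");
            return value;
        } catch (RuntimeException e) {
            LOG.severe("read failed: " + e);      // shows up in app logs, not just request_log
            throw e;
        }
    }

    static String lookup(String rowKey) {         // stand-in for the real BigTable read
        return "value-for-" + rowKey;
    }

    public static void main(String[] args) {
        System.out.println(handleRead("row1"));
    }
}
```

If a 502 is served and not even the "received /read" line appears, the request likely never reached the application.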

Vinay Chitlangia

Feb 8, 2017, 1:24:01 PM
to google-a...@googlegroups.com
On Wed, Feb 8, 2017 at 10:29 PM, 'Nicholas (Google Cloud Support)' via Google App Engine <google-a...@googlegroups.com> wrote:
Looking at the GAE logs for your application, I found the 502s you mentioned.  One thing I noticed is that they all occur from the /read endpoint.  From the naming, I assume this endpoint is reading some data from BigTable.  Investigating further, perhaps you could provide some additional information:
  • What exactly is happening at the /read endpoint?  A code sample would be ideal if that's not too sensitive.
As you surmised, we are reading some data from bigtable in this endpoint.
  • What kind of error handling exists in said endpoint if the BigTable API returns non-success responses?
The entire endpoint is in a try/catch block catching Exception; on failure, the exception stack trace is written to the logs.
The first line of the endpoint is a log message signalling receipt of the request (added for this debugging, of course!).
For successful requests, that introductory log message gets written; for the 502 ones, it never does.
For requests that fail because of BigTable-related errors, the logs have the stack trace, but not for 502s.
The 502 failures finish in <10 ms.
  • Can you log various steps in the /read endpoint?  This might help identify the progress the request reaches before the 502 is served.  It would also help in confirming that your application is actually even getting the request as I can't currently confirm that from the logs.
My best guess is that the request does not make it to the servlet: of the hundreds of failed 502 logs I have seen, not one has the log message that is the absolute first line in the code of the read handler.
  • If said endpoint does in fact read from BigTable, what API and java library are you using?
We are using the Google-provided bigtable-hbase-1.2 jars, version 0.9.4.

--
You received this message because you are subscribed to a topic in the Google Groups "Google App Engine" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/google-appengine/zHSuoxkmqjw/unsubscribe.
To unsubscribe from this group and all its topics, send an email to google-appengine+unsubscribe@googlegroups.com.
To post to this group, send email to google-appengine@googlegroups.com.
Visit this group at https://groups.google.com/group/google-appengine.
To view this discussion on the web visit https://groups.google.com/d/msgid/google-appengine/ea48946b-fbd9-47af-a7b4-136493f0d583%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Nicholas (Google Cloud Support)

Feb 9, 2017, 9:22:41 AM
to Google App Engine
I realize that we've already begun investigating this here, but I think this would be most appropriate for the App Engine public issue tracker.  The issue is leading to an increasingly specific situation, and I suspect it will require some exchange of code/project details to reproduce the behavior you've described.  We monitor that issue tracker closely.

When filing a new issue on the tracker, please link back to this thread for context while posting a link to the issue here so that others in the community can see the whole picture.
  • Be sure to include the latest logs related to the 502s.  When viewing the logs in Stackdriver Logging, for instance, include All logs rather than just request_log, as the nginx.error, stderr, stdout and vm.* logs may reveal clues as to a root cause.
  • Mention if you are using any middleware, such as servlet filters, that may receive requests before the actual handler.
  • Lastly, include what the CPU and/or memory usage looks like on the instance(s) at the time of the 502s.  Screenshots of Utilization and Memory Usage graphs from the Developers Console will likely be sufficient
I look forward to this issue report.


Vinay Chitlangia

Feb 10, 2017, 1:07:26 AM
to google-a...@googlegroups.com
On Thu, Feb 9, 2017 at 7:52 PM, 'Nicholas (Google Cloud Support)' via Google App Engine <google-a...@googlegroups.com> wrote:
The logs are "All logs" around the time of the incident, though captured as a copy/paste from the browser. I couldn't retrieve any logs using gcloud beta logging read. This is the command I tried:
gcloud beta logging read 'timestamp >= "2017-02-11T03:00:00Z" AND timestamp <="2017-02-12T03:05:00Z"' 

Tomas Erlingsson

Aug 15, 2017, 1:05:45 PM
to Google App Engine
Did this get resolved?  We have a Flex Java app running in development with almost no traffic. We are constantly getting 502s telling us to try again in 30 seconds, and our server app is rebooted many times a day. I am running this locally without any problems, and I am not seeing any errors in the log.


Karikalan Kumaresan

Aug 23, 2017, 10:38:49 AM
to Google App Engine
Hi, I am facing the same issue. I am running Spring Boot in GAE Flex. While it runs fine for around 30 concurrent users, when the number of users increases it throws 502 errors and Tomcat gets restarted. I am not sure what causes this issue and am not seeing any useful errors in the logs. Any resolution on this? We are blocked on this issue; any help would be appreciated.

Shivam(Google Cloud Support)

Aug 25, 2017, 5:36:00 PM
to Google App Engine
The Google Groups discussion forum is meant for open-ended discussions. Issues such as these tend to be project/application specific, so I would recommend posting on the App Engine public issue tracker.