GAE Intermittently showing '502 Server Error'

Suvodeep Pyne

Mar 24, 2018, 1:01:48 PM

to Google App Engine

Hi there!

We have a service running on GAE. It ran fine for the last few months, but since yesterday we have been seeing intermittent 502s, possibly from nginx. Clients are getting 5xx errors while our application server shows zero 5xx errors. Is anybody else facing this issue?

The error is simple to reproduce: 30 parallel curl requests to the domain are enough to choke the server into returning 502 errors.
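For anyone who wants to repeat the test without curl, here is a minimal sketch of the same idea in Python: fire 30 requests at once and count the status codes that come back. The function names (`fetch_status`, `tally_parallel`) and the URL in the usage note are placeholders of mine, not anything from the affected service.

```python
import collections
import concurrent.futures
import urllib.error
import urllib.request


def fetch_status(url, timeout=10.0):
    """Return the HTTP status code for one GET, or an error label on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses (including 502) are raised as HTTPError.
        return err.code
    except OSError as err:
        # Connection failures and timeouts: report the exception type.
        return type(err).__name__


def tally_parallel(url, parallel=30, fetch=fetch_status):
    """Issue `parallel` concurrent requests and count results by status."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=parallel) as pool:
        return collections.Counter(pool.map(fetch, [url] * parallel))
```

Calling `tally_parallel("https://your-app.example.com/")` (hypothetical URL) returns a `Counter` keyed by status code, so any 502s can be compared against what the application server itself logged for the same window.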

Any help would be greatly appreciated.

Regards

Suvodeep

Jordan (Cloud Platform Support)

Mar 24, 2018, 6:47:23 PM

to Google App Engine

This occurs when your actual application becomes too busy to respond to the nginx server Docker container that sits in front of your application Docker container within the instance. Nginx will send health checks to your application in order to make sure it is responsive so that it may accept new requests. If your code blocks the main thread for too long, nginx will not receive any response from your app and will assume it is not healthy. This will produce a 502 Bad Gateway error when App Engine asks nginx if the application is ready to accept new requests, and the instance will be restarted in order to make it healthy again.

It is therefore recommended to increase the number of instances you have (as I assume you are limiting your max to a very low number like one), and to ensure your code is properly configured to handle concurrent requests, returns quickly, and does not block the main thread for too long. You can also configure the health checks so that they do not mark your instance as unhealthy so quickly, which gives it more time to recover from traffic spikes.
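To illustrate the point about not blocking the main thread, here is a minimal sketch using only the Python standard library. It is not App Engine code: the server, the `probe` helper, and the one-second delay are made up for the example, and `/_ah/health` is assumed to be the health-check path, so adjust it to whatever your configuration actually uses. Each request runs in its own thread, so a health probe is answered immediately even while a slow request is still in progress.

```python
import http.client
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

SLOW_SECONDS = 1.0           # simulated duration of a blocking request
HEALTH_PATH = "/_ah/health"  # assumed health-check path; adjust as needed


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == HEALTH_PATH:
            body = b"ok"              # answer health probes right away
        else:
            time.sleep(SLOW_SECONDS)  # stand-in for slow application work
            body = b"done"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the example quiet


def start_server(port=0):
    """Start a thread-per-request server in the background and return it."""
    server = ThreadingHTTPServer(("127.0.0.1", port), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def probe(port, path):
    """GET `path` on the local server; return (status, seconds elapsed)."""
    started = time.monotonic()
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=10)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        resp.read()
        return resp.status, time.monotonic() - started
    finally:
        conn.close()
```

With a single-threaded `HTTPServer` in place of `ThreadingHTTPServer`, the same health probe would queue behind the slow request, which is the situation described above where the proxy concludes the application is unresponsive.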

- Note that Google Groups is reserved for general product discussions and not for technical support. For further technical support it is recommended to post your detailed questions to Stack Overflow using the supported Cloud tags.

Suvodeep Pyne

Mar 24, 2018, 11:11:09 PM

to Google App Engine

Hey Jordan

Thanks for that info. Now, take a look at this.

This is the second time in the past 4 months we faced a P1 issue which took 2 days to get resolved. The LB was throwing 502s intermittently while the app server was totally healthy. You can check my account for logs. And btw, there were others who were complaining about the same issue.

Since, as you mentioned, this is a forum for general product discussions, I want you to note that this is perhaps not the best customer experience. I am aware of your support tiers, but there should be a better interaction model with your customers. We are paying for the service and you are answerable to us in case of outages or service disruption.

Regards

Ani Hatzis

Mar 24, 2018, 11:42:51 PM

to google-a...@googlegroups.com

Hey Suvodeep, I understand the frustration, but from my point of view, I prefer to pay just for the resources I use and according to the SLA. I wouldn't want to indirectly pay for free premium support of other customers (which probably would increase all prices significantly). Anyway, wish you the best.

--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to google-appengine+unsubscribe@googlegroups.com.
To post to this group, send email to google-appengine@googlegroups.com.
Visit this group at https://groups.google.com/group/google-appengine.
To view this discussion on the web visit https://groups.google.com/d/msgid/google-appengine/e6ba838b-5def-4a14-8243-dda691f4fa64%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Suvodeep Pyne

Mar 25, 2018, 12:37:54 AM

to Google App Engine

Hey Ani

I agree that keeping the service cost low is important. But either you run a perfect service or you engage with the customer. I have seen 3 P1s causing service disruption (which Google fixed) in the last 4 months, plus some other issues causing degraded service. Clearly, the first one isn't the case. As a customer paying for the service, I don't expect to be redirected to Stack Overflow to report platform errors. So I believe there is a problem that needs to be addressed.

And yes, I posted on the Issue Tracker as well. It is nearing 24 hours now and the bug hasn't even been assigned yet.

Anyways, thank you and all the best to you too.

Anu V

Mar 25, 2018, 6:15:11 AM

to Google App Engine

I've already paid for support and the result is the same: no fix in a timely manner, no reply, nothing.

Here's their quote regarding this issue:

As of the moment, they confirmed that the 502 errors seem to be coming from the Cloud L2 Google Front End (GFE) and with error message "failed_to_connect_to_backend". They are still investigating about the connection. However, they confirmed that deployments are still fine and they are looking further if there is a certain subset where the backend is unable to be connected to from a GFE.

(My Support Case is: Case/0016000001LHlyt/U-15342525)

At the very least, the gcloud support team should share this information among themselves, not to mention raise this issue on the service status page.

And yeah, we're in the same boat; see this thread: https://groups.google.com/forum/#!topic/google-appengine/nmbstD7wIls

What makes me so angry is that they keep telling us it's our fault without checking their own service first.

Jordan (Cloud Platform Support)

Mar 25, 2018, 12:53:37 PM

to Google App Engine

Thank you for the further details. This issue is being tracked in this Public Issue Tracker. As stated in the last comment, only having one instance in the Flexible Environment will often cause this issue (as I see you have 'max_num_instances' set to '1'), and it is recommended to at least have two instances. All further communications about this issue should occur in the Public Issue Tracker.

Suvodeep Pyne

Mar 25, 2018, 10:26:37 PM

to Google App Engine

Jordan,

"having one instance in the Flexible Environment will often cause this issue (as I see you have 'max_num_instances' set to '1'), and it is recommended to at least have two instances."

B.S. Why should I pay for a new instance if I don't require it? If App Engine is incapable of handling a single instance, then it should be disabled and stated upfront. Dropping 40% of the traffic is a horrible way of suggesting a recommended value to the customer.

The presence of this public issue clearly indicates that this bug has been sitting for 3 MONTHS! This is appalling! Google should have at least had the decency to apologize for it. Our service is business critical and clearly, App Engine doesn't seem to be production ready.

Regards

Jordan (Cloud Platform Support)

Mar 26, 2018, 12:53:33 PM

to Google App Engine

You can opt out of the automatic system updates and restarts by placing your single instance in debugging mode. Just note that if you later take it out of debugging mode, the instance will be shut down and a new instance will take its place at that time.

For clarity, the recommended solution of increasing instances comes directly from the documented strategies to prevent application downtime due to instance restarts (and will aid in avoiding the 'failed_to_pick_backend' issue as it provides more instances for the load balancer to route to).

The 'failed_to_pick_backend' issue is high priority, and partial fixes have already been deployed. Since there are many causes for this issue and it isn't caused by any single point of failure, the engineering team is performing a deep dive to root out any and all fixes. In the meantime it is recommended to follow the best practices to ensure your application has instances up and running while our monitoring systems and health checks are configured to better handle dropped instances (which cause the load balancer 'failed_to_pick_backend' errors).

- Note I will also copy this reply to the issue report for completeness, as all further communications should continue there.
