Getting HTTP 502 "Bad Gateway" errors on our cluster

James Lampert

Sep 11, 2018, 12:15:31 PM
to gce-discussion
Once again, this is our webapp, serving both a UI and web services, running on GCE Tomcat 8 servers clustered in an HTTPS load balancer.

I've been occasionally getting an HTTP 502 error, and/or the following message:

The server encountered a temporary error and could not complete your request.
Please try again in 30 seconds.

I've seen it occur in both the UI and in web service responses. Is this by any chance coming from the load balancer? Can anybody shed any light on what to do about it, when it happens on a web service call?

--
JHHL

xchri...@google.com

Sep 12, 2018, 3:14:43 PM
to gce-dis...@googlegroups.com

A 502 Bad Gateway is usually caused by an intermediary getting a bad response from the server behind it. So the error is technically being sent by the load balancer, but the actual issue is likely somewhere behind it, whether in the instance itself or in a firewall rule. This[0] article does a great job of explaining what 502s are, and this[1] Stack Overflow thread addresses various 502 issues within GCP that were resolved.


What is the health check status of your instances? Are you able to track down which instance the 502 is coming from? Have you checked the HTTPS traffic from the log viewer? Also, if you can find out an instance that is producing these errors, check the logs of that machine & make sure it is functioning correctly. You can use a tool like Wireshark or tcpdump to check the traffic.


[0] How to Solve 502 Bad Gateway Issues?: https://www.keycdn.com/support/502-bad-gateway/

[1] HTTPS Load Balancer in Google Container Engine: https://stackoverflow.com/questions/32188284/https-load-balancer-in-google-container-engine

James Lampert

Sep 13, 2018, 6:13:41 PM
to gce-discussion
I've managed to make the web service consumer that had been hanging on the 502 errors tolerant of them.

But I have another observation:

It seems that occasionally, and sometimes consistently, the 502 errors are accompanied by one or more additional instances being launched. Most recently, when I was doing a fairly complex and time-consuming task (involving the wiping of an entire experimental database schema) through the user interface, it threw one, and then, a few minutes later, I happened to look at the instance group that had serviced the request, and there were at least four new instances, all created within a period of a few seconds. And within a minute of my seeing them, they all began shutting down, leaving the original instance in place.

Not sure what the causal relationship is here.

Could the health-check be involved, somehow?

James Lampert

Sep 13, 2018, 6:59:04 PM
to gce-discussion
Correction: I'm pretty sure that the 502 occurred after the request that wiped the experimental schema, on a subsequent request involving creation of a new database schema.

I've finally had a chance to read the links offered here by "xchri...@google.com," as well as those offered by "Nur" on my other, related thread about session affinity. And the idea that it could be a too-sensitive health-check, and/or a timeout too short for the more time-consuming requests, is looking more and more like the source of the problem. Can anybody offer any further insights?
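
As a rough illustration of the two suspicions above (a health check that is too sensitive, and a timeout shorter than the slowest requests), here is a minimal Python sketch of the arithmetic. It assumes that an instance is marked unhealthy after a configured number of consecutive failed probes, one probe per check interval. The function names and all numbers are hypothetical and are not taken from this cluster's configuration.

```python
def seconds_until_unhealthy(check_interval_s: float, unhealthy_threshold: int) -> float:
    """Approximate time of continuous failure before an instance is marked
    unhealthy: one failed probe per interval, threshold probes in a row."""
    return check_interval_s * unhealthy_threshold


def outlasts_timeout(request_duration_s: float, timeout_s: float) -> bool:
    """True if a request would still be running when the timeout expires."""
    return request_duration_s > timeout_s


# Illustrative numbers only, not taken from the cluster in this thread.
print(seconds_until_unhealthy(check_interval_s=5, unhealthy_threshold=2))  # 10
print(outlasts_timeout(request_duration_s=90, timeout_s=30))               # True
```

Under these assumptions, if the slowest legitimate request runs longer than the timeout, or the instance is too busy to answer probes for longer than the first figure, then errors and replacement instances would be the expected symptoms.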

Have you checked the HTTPS traffic from the log viewer?

I have not. I'm not entirely sure how I would access it. I can of course sign on to the instances, and look at the Tomcat log files, and our webapp's own log files, but I don't know much about a log viewer at the load balancer level, other than that if I go to the "monitoring" tab on our load balancer, all I see is a message telling me I don't have authority to look at the data (what additional role(s) do I need?) 

Fady (Google Cloud Platform)

Sep 19, 2018, 6:47:38 PM
to gce-discussion

Hello James,


To view the load balancer monitoring graph, I think you need the Stackdriver Monitoring Viewer role or the Project Viewer role, per this document. If neither works, or you already have those roles/permissions, please open a private issue tracker report that includes a screenshot of the error plus your account and project information, so it can be investigated.


As for the logs mentioned, I believe Christian meant checking Stackdriver load balancer logs, but that is still in alpha and only available to select customers. The same applies to Stackdriver Monitoring per this document.


James Lampert

Oct 5, 2018, 2:24:06 PM
to gce-discussion
Something weird just happened.

A tomcat server running in an instance group behind our load balancer became unresponsive. The only user connections to it were from me: I was signed on to the UI, through Chrome, and a long series of tens of thousands of web service calls had been running since yesterday; there was no reason for more than one instance to be active in the instance group.

I started getting messages in my browser session, in the form
The server encountered a temporary error and could not complete your request.
Please try again in 30 seconds.
and about the same time, the program making the web service calls got stuck, not logging a single successful call for several minutes (it recognizes the above message, and acts on it by waiting 30 seconds and retrying the request).
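
The wait-and-retry behavior described above could look roughly like the following Python sketch. It is not the actual consumer program from this thread: `call_with_retry` and the `send` callable are hypothetical names, and the attempt limit of 5 is an assumption. Only the 30-second wait and the error text come from the thread.

```python
import time
from typing import Callable, Tuple

# Text the load balancer returns with its 502 page, quoted from this thread.
TEMPORARY_ERROR_TEXT = "The server encountered a temporary error"


def call_with_retry(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,
    wait_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it stops returning the load balancer's 502 page.

    send is a hypothetical callable that performs one web service call and
    returns (status_code, body). The last response is returned even if every
    attempt failed, so the caller can decide what to do next.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status, body = send()
    for _ in range(max_attempts - 1):
        if status != 502 and TEMPORARY_ERROR_TEXT not in body:
            break
        sleep(wait_seconds)  # the error page itself suggests 30 seconds
        status, body = send()
    return status, body
```

Capping the number of attempts keeps the caller from hanging forever if the backend never recovers. Retrying blindly is only safe for calls that can be repeated without side effects, since a request that produced a 502 may still have been processed by the backend.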

On the Instances Page, I counted TEN instances for the group. I recognized the instance that I'd been in, and tried, unsuccessfully, to open an SSH session on it. Eventually, the SSH session gave me a message that the instance no longer existed.

I looked, and the UI session in my browser had once again become responsive. I looked at the program that had been calling the web service, and it was once again logging successful calls.

What just happened? Why did the instance count suddenly jump to 10?

Note: I haven't had a chance to research the links sent by earlier responses on this thread, mainly because I was on vacation the entire second half of September.
Also, if you can find out an instance that is producing these errors, check the logs of that machine & make sure it is functioning correctly.
I knew which instance it was; that's why I was trying to get an SSH connection to it.

James Lampert

Oct 5, 2018, 6:15:55 PM
to gce-discussion
Something else weird happened.

Nothing was calling web services, and I got another browser message, the same old familiar
The server encountered a temporary error and could not complete your request.
Please try again in 30 seconds.
And again, the number of instances in the group jumped, this time from one to five, all of them showing a creation time (as shown in the group's "members" tab) within two seconds of each other.

This time, when it started shutting down instances, it shut down the four it had abruptly created, leaving the one that was created this morning.

James Lampert

Oct 5, 2018, 8:24:34 PM
to gce-discussion
Would anybody here, by any chance, know of a way to put Tomcat (or Tomcat webapp) logs someplace they would survive having an instance in a managed group shut down?

I've given my Tomcat specialist essentially the same information I posted here today, and asked him the same question.

Nur

Oct 8, 2018, 2:58:56 PM
to gce-discussion

I believe you could use shutdown scripts, which execute commands right before an instance is terminated or restarted. Per the GCP documentation, shutdown scripts are very useful if you rely on automated scripts to start up and shut down instances, because they give instances time to clean up or perform tasks such as exporting logs or syncing with other systems.

"Shutdown scripts are especially useful for instances in a managed instance group with an autoscaler. If the autoscaler shuts down an instance in the group, the shutdown script runs before the instance stops and the shutdown script performs any actions that you define. The script runs during the limited shutdown period before the instance stops. For example, your shutdown script might copy processed data to Cloud Storage or back up any logs"[1]

I think reviewing the logs might tell you why the Tomcat servers are becoming unresponsive. It would also be worthwhile to get the system logs, which provide insight into a variety of events, including system error messages, system startups, and system shutdowns.
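
A shutdown script for preserving the Tomcat log could be sketched as follows, in Python. This is only an outline under stated assumptions: the log path and the bucket name are hypothetical placeholders, `gsutil` is assumed to be installed on the instance, and the instance's service account is assumed to have write access to the bucket.

```python
#!/usr/bin/env python3
import os
import shutil
import socket
import subprocess
from typing import List

# Assumed values; adjust to the actual Tomcat layout and bucket.
LOG_PATH = "/opt/tomcat/logs/catalina.out"   # hypothetical log location
BUCKET = "gs://example-tomcat-logs"          # hypothetical bucket name


def build_copy_command(log_path: str, bucket: str, instance: str) -> List[str]:
    """Return the gsutil command that copies one log file into the bucket,
    under a per-instance prefix so instances do not overwrite each other."""
    destination = f"{bucket}/{instance}/{os.path.basename(log_path)}"
    return ["gsutil", "cp", log_path, destination]


def main() -> None:
    # Do nothing if there is no log to copy or gsutil is not installed.
    if not os.path.exists(LOG_PATH) or shutil.which("gsutil") is None:
        return
    command = build_copy_command(LOG_PATH, BUCKET, socket.gethostname())
    subprocess.run(command, check=False)


if __name__ == "__main__":
    main()
```

The script would be supplied through the `shutdown-script` metadata key of the instance template. Because the shutdown period is limited, a single file copy is about as much as such a script should attempt; shipping logs continuously would be more robust if the instance becomes unresponsive before it is shut down.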

The instance count suddenly jumping to 10 might be explained by an autoscaling threshold having been reached, so that additional VM instances were created per the autoscaling configuration. However, if there is a configuration issue that leaves the backend VMs (in this case, the Tomcat servers) unable to respond, so that they fail their health checks, new VMs will also be created as per the autoscaling feature[2,3].

If you have configured autoscaling based on CPU utilization, which is the most basic form of autoscaling, you can check your VM instance's CPU utilization from the monitoring options on the VM details page.
More details on how autoscaler decisions are made can be found in this GCP documentation[4], and details on setting up health checking and autohealing for managed instance groups in [5].

Navigation Menu-->Compute-->Compute Engine--> VM instances
Select your VM instance then "Monitoring" from the top left. This might provide information on CPU utilization, which is commonly used for autoscaling policy [6]. Could you provide more details on your autoscaling policy?

Autohealing is another consideration here: "To improve the availability of your application and to verify that your application is responding, you can configure an autohealing policy for your managed instance group. An autohealing policy relies on an application-based health check to verify that an application is responding as expected. Checking that an application responds is more precise than simply verifying that an instance is in a RUNNING state" [7].

James Lampert

Oct 11, 2018, 8:15:12 PM
to gce-discussion

Another incident happened about half an hour ago.

[Attachment: please try again in 30 seconds.jpg]

Two instances started in the instance group. I found the instance I'd created with the last rolling update (a few hours ago), and it took nearly 20 minutes (and 2 or 3 retries) to get a terminal session on it. Once I did, I scp'd the catalina.out to a safe location where I could easily reach it. (It had some interesting stacktraces; I'm forwarding it to our Tomcat specialist.)


Meanwhile, of the two that had started, only one achieved "green light" status. The other just kept showing a throbber, and eventually the one that had shown a green light also went back to a throbber, and both of them eventually showed a yellow alert icon, with the following tooltip (sensitive information redacted):

Instance 'foo-cluster-01a-0p2w' creation failed: Code: '5684970836364578664'
 
And the original instance now appears to be responsive once again. Yet now, I see another instance that started and stopped during the time I was typing this, and it, too, is showing the same yellow alert icon, with the same tooltip. And yet again.

I am now attempting a rolling update without a template change, to get rid of that one instance . . . . FINALLY I got it to go away.


Justin Reiners

Oct 11, 2018, 8:29:40 PM
to James Lampert, gce-discussion
My SIP trunks flapped, so there was some type of event in us-central as well.

Fady (Google Cloud Platform)

Oct 12, 2018, 4:04:39 PM
to gce-discussion

We had an incident around that time, and it has already been mitigated. For more information, you can check the Google Cloud Status Dashboard.


rahul kumar

Dec 10, 2019, 8:17:16 AM
to gce-discussion
Bad Gateway refers to HTTP error 502, which means that a server trying to fulfill the client's request received an invalid response from an upstream server. It is often a network error between servers on the internet, meaning the problem is not with your computer or internet connection. Empty or incomplete headers or response bodies, typically caused by broken connections or a server-side crash, can cause 502 errors when the site is accessed via a gateway or proxy. Since it is a generic error, it doesn't tell you the exact issue with the website.

How to Solve 502 Errors
  • Perform a hard-refresh in your browser. On Macs, this is done by pressing Cmd + Shift + R.
  • Clear your browser cache and delete cookies. Your browser may be holding on to certain files that were saved once you visited the website with a 502 error.
  • Change your DNS servers. If you’ve never changed them, you likely still have the default servers assigned to you by your ISP; try using open DNS servers such as Google's Public DNS.
  • Finally, restart your computer/networking equipment. Some temporary issues with your computer and how it's connecting to your network could be causing 502 errors, especially if you're seeing the error on more than one website. In these cases, a restart would help.


rahul kumar

Mar 8, 2021, 11:20:28 AM
to gce-discussion
The 502 Bad Gateway error is an indication that something has gone wrong on the server side of your application, rather than with the client-side request. At its heart, the cause is simple: two online servers are having trouble communicating. Often, simply refreshing or reloading the page (Ctrl-F5) will work, but sometimes the problem can persist for days. There are five main problems that cause 502 Bad Gateway responses:

  • Server failure
  • Domain name not resolvable
  • Webserver overload
  • Firewall blocks request
  • Browser error