Google App Engine, Node.js, 502 Bad Gateway


Marco Galassi

Sep 25, 2017, 4:00:13 AM
to Google App Engine
We are performing a migration to a new backend hosted on Google App Engine and based on Node.js.

Before performing the switch from the old to the new backend, we need to do tests and see if packets that the legacy server receives are correctly
processed by the new backend.

So, on the old backend, we have modified the handler to "duplicate the request": it proceeds with the regular processing and also forwards the request to our new backend.
In this way, we are able to keep the legacy system up and at the same time replicate the requests on our new backend.

We want to slow down the legacy server as little as possible, so on the legacy server we do not wait for a response from the new backend.
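
Roughly, the forwarding on the legacy side looks like this (a simplified Node.js sketch, not our actual legacy code; the host and forwardToNewBackend are placeholders for illustration):

const https = require('https');

// Fire and forget: send a copy of the incoming request to the new backend
// and move on without waiting for (or reading) its response.
function forwardToNewBackend(originalReq, rawBody) {
  const copy = https.request({
    hostname: '[new-backend-host]',   // placeholder
    path: originalReq.url,
    method: originalReq.method,
    headers: { 'content-type': originalReq.headers['content-type'] },
  });
  copy.on('error', (err) => console.error('forward failed:', err.message));
  copy.end(rawBody);   // we never wait for the response here
}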

Instead, on Google App Engine, we have deployed a second service (at [our-proxy-service]-dot-[our-project].appspot.com) in our project that acts as a proxy and forwards the requests to the new backend.
We used the node-http-proxy package to implement the proxy and made it forward to our backend URL: [our-backend-service]-dot-[our-project].appspot.com

The proxy also gets the responses so we are able to track the process.
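
For reference, the proxy service is essentially this (a simplified sketch of our node-http-proxy setup with error handling trimmed; the target is a placeholder for our real URL):

const http = require('http');
const httpProxy = require('http-proxy');

// Forward every incoming request to the new backend service.
const proxy = httpProxy.createProxyServer({
  target: 'https://[our-backend-service]-dot-[our-project].appspot.com',
  changeOrigin: true,
});

// Log the backend's response status so we can track the process.
proxy.on('proxyRes', (proxyRes, req) => {
  console.log(req.method + ' ' + req.url + ' -> ' + proxyRes.statusCode);
});

proxy.on('error', (err, req, res) => {
  console.error('proxy error:', err.message);
  res.writeHead(502);
  res.end();
});

http.createServer((req, res) => proxy.web(req, res))
    .listen(process.env.PORT || 8080);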

The first question we have is whether this is a good solution on GAE,
whether it is a valid use case to forward requests from one service to another in this way, or whether there is a better way.

Here is an MSPaint version of our backend structure:



This brings us to the second problem: this structure actually seems to work, but every once in a while (no pattern discovered yet, it seems random) the proxy gets a 502 Bad Gateway response from the backend.
We investigated a little in the Stackdriver logs on the backend side and, filtering for nginx.error, we ended up finding this entry:

07:11:15.000 [error] 32#32: *84209 upstream prematurely closed connection while reading response header from upstream, client: 130.211.1.151, server: , request: "POST /collect HTTP/1.1", upstream: "http://172.17.0.1:8080/collect", host: "[ourprojectid].appspot.com"

We have also posted this on Stack Overflow, in case someone wants to read it.

We do not have any other ideas on where to go from here.
Any help, hint, or suggestion is very welcome.

Jordan (Cloud Platform Support)

Sep 25, 2017, 4:46:16 PM
to Google App Engine
You are very correct in your design pattern for testing a new backend. Duplicating live production requests and testing them against your new backend is actually the preferred way of testing before performing a migration. Having an App Engine proxy act like the actual client is also correct for performing your actual tests.

In terms of the actual 502 errors you are seeing, this is an error thrown by the nginx container that is checking the health of your application code. If your application code is too busy to respond to nginx, nginx will deem your application container unhealthy and close the instance's connection from App Engine, throwing a 502. App Engine will then automatically start a new instance in its place.

You can read more on how to code for the cloud and prevent 502s by looking at previous discussions in the Google Cloud Public Issue Tracker and on Stack Exchange.

Marco Galassi

Sep 26, 2017, 3:52:42 AM
to Google App Engine
Hi Jordan,
thank you for your answer.

I am wondering whether, when you talk about the nginx container checking the health of the application code, you are referring to the health check performed by Google,
or to another kind of health checking performed by the nginx server itself.

Assuming that you are referring to Google's health check, we have set custom settings, as you can see in the attached app.yaml file.
As you can see there, we have set a minimum of 3 instances. This used to be just 1, but we bumped it up to see if it would
solve the problem, which it did not.

I have also attached the plots we have from the GAE dashboard of our project.

Thank you again.
app.yaml
gae charts.png

Jordan (Cloud Platform Support)

Sep 26, 2017, 10:20:17 AM
to google-a...@googlegroups.com
You are very correct. As described in the previously linked Stack Exchange answer, nginx health checks are indeed the same health checks that are configurable via your app.yaml file.

To clarify, each App Engine instance has its own nginx container running in it that monitors the health of its own instance, so starting more instances will not fix the issue for any single instance. As mentioned in the solutions provided in the previously linked Public Issue Tracker, you could instead try increasing your instances' Resource Settings to make them more powerful. By increasing an instance's CPU, you are allowing your application code to finish faster, which in turn allows it to respond to nginx faster.

Though, as mentioned in the same Public Issue Tracker, this is not the actual solution; the real solution is to code your application to handle requests asynchronously. By allowing a single instance to handle more than one request at a time, you are freeing up CPU time for health checks. Of course you can always completely turn off the nginx health checks, but ideally you want to code your application for the Cloud to make it scalable and responsive.
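
As a rough illustration (not your actual code; processCollect below is a placeholder for work done through non-blocking I/O or handed off to a worker), the difference looks like this in Express:

const express = require('express');
const app = express();

// Anti-pattern: a long synchronous loop monopolizes the single Node.js thread,
// so the instance cannot answer nginx (health checks included) until it finishes.
app.post('/collect-blocking', (req, res) => {
  let total = 0;
  for (let i = 0; i < 1e9; i++) total += i;   // CPU-bound, blocks the event loop
  res.status(200).send(String(total));
});

// Preferred: await genuinely asynchronous work so the event loop keeps serving
// other requests, including /_ah/health, while this one completes.
app.post('/collect', async (req, res) => {
  try {
    const result = await processCollect(req);  // placeholder async function
    res.status(200).send(result);
  } catch (err) {
    res.status(500).send(err.message);
  }
});

app.listen(process.env.PORT || 8080);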

Marco Galassi

Sep 27, 2017, 3:35:04 AM
to Google App Engine
So, we took a step back in our project and deployed a Hello World application to see if the problem would still happen.

We used the hello-world example from the Google App Engine code samples and edited it (our file is attached). There are no differences
except that we don't have a .end() call after we send the response to the client with res.status(200).send(): we edited the code,
removing the .end(), because that is how we do it in our project.

We don't think this is the problem, as explained here, and seeing that the Express code for res.send() does in fact call res.end() itself.
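
For reference, the edited handler is essentially this (a sketch; the attached app.txt is the actual file):

const express = require('express');
const app = express();

// Same as the GAE sample, except we rely on res.send() to end the response
// instead of calling .end() explicitly.
app.get('/', (req, res) => {
  res.status(200).send('Hello, world!');
});

app.listen(process.env.PORT || 8080, () => {
  console.log('App listening on port ' + (process.env.PORT || 8080));
});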

Anyway, after the hello-world deployment, we still got a 502 Bad Gateway.
app.txt

Jordan (Cloud Platform Support)

Sep 27, 2017, 11:36:20 AM
to Google App Engine
If you have tested turning the nginx health checks off and you still see these 502 errors, then I highly suggest you report this in the Public Issue Tracker. If the issue is resolved when the health checks are off, then you may be able to reduce the thresholds and intervals of the checks to allow your instances more time to respond to nginx. 

- If you do open an issue report, it is recommended to provide your project ID, the type of health checks you are using (legacy or updated), and a stacktrace of the error you are seeing. 

Marco Galassi

Oct 2, 2017, 3:16:18 AM
to Google App Engine
Hi Jordan,
we would like to share with you what we have found.

We have deleted our custom health check, which was a handler responding with 200 OK, as follows:

app.get('/_ah/health', function (req, res) {
  res.sendStatus(200);
});

It looks like this has solved the problem, as we do not see any 502 Bad Gateway errors anymore.

We created the custom health check handler following the documentation, so we did not think that this could be the problem:
You can write your own custom health-checking code. It should reply to /_ah/health requests with a HTTP status code 200. 
The response must include a message body, however, the value of the body is ignored (it can be empty).

Maybe this could be a bug of some sort? Should I point this out in the Public Issue Tracker?

Marco

Jordan (Cloud Platform Support)

Oct 2, 2017, 10:53:43 AM
to Google App Engine
If properly responding to the health check with a message body, res.status(200).send(''); as the documentation states, causes the health checks to fail, that would indeed warrant an issue to be reported in the PIT (Public Issue Tracker).

Note that responding to health checks is not required, as even a 404 response from your application tells nginx that your application is alive (it responded saying it cannot find the requested resource).
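
For completeness, the documented form of the handler would be something like this (a sketch, in the same style as your handler above):

// Returns 200 with a message body; per the documentation the body's value is ignored.
app.get('/_ah/health', (req, res) => {
  res.status(200).send('ok');
});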



Marco Galassi

Oct 10, 2017, 3:37:00 AM
to Google App Engine

Hello Jordan, we are back because we have found the problem again.
We have deleted the custom health check handler and disabled the health check in the app.yaml configuration file,
but the problem is still there.

Just to recap, what we keep seeing is the following.
  • From the proxy's perspective, all we see is a 502 Bad Gateway response from the backend server.
  • On the backend server itself, we see the following error:

07:11:15.000 [error] 32#32: *84209 upstream prematurely closed connection while reading response header from upstream, client: 130.211.1.151, server: , request: "POST /collect HTTP/1.1", upstream: "http://172.17.0.1:8080/collect", host: "[ourprojectid].appspot.com"

It does not look like the problem is caused by long-running tasks, as we encountered the problem also with a simple hello-world application with no long-running tasks at all.
We also do not receive requests bigger than 12 MB, which could possibly approach the App Engine request size limit.
In any case, to be sure, we also put a filter at the beginning of the app to filter out requests bigger than 1500 bytes, so we are sure we are handling only small requests.
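
The filter is essentially this kind of middleware (a simplified sketch, added to the same Express app; our actual code differs):

const MAX_BYTES = 1500;

// Reject anything that declares a body larger than our cutoff, so only
// small requests reach the handlers.
app.use((req, res, next) => {
  const length = Number(req.headers['content-length'] || 0);
  if (length > MAX_BYTES) {
    return res.status(413).send('Payload too large');
  }
  next();
});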

Jordan (Cloud Platform Support)

Oct 11, 2017, 2:37:19 PM
to Google App Engine
So just for completeness and clarity, the error you are seeing on your backend service "upstream prematurely closed connection while reading response header from upstream" strictly means that the nginx proxy on that specific instance of your backend service fails to contact the webserver (aka your app) in that instance (it is all localized to a specific instance). 

These often occur due to nginx's health checks failing, but this can also happen when nginx does its normal job of proxying incoming requests to your application. Essentially, the connection between nginx and your application closes because your application is not responding to it, or because your application has completely stopped listening to nginx (it listens on port 8080, as seen in the error).

Therefore, the issue all comes down to your application code and the Node.js runtime. Node.js is a single-threaded runtime, which means that your code must be properly written to take full advantage of Node.js's event loop in order to be asynchronous. By having async code, you are allowing concurrent requests to be executed, which in turn allows your application to always listen and respond to nginx. Once your application becomes asynchronous, it is able to properly live and scale in the cloud.

Note that you should always perform exponential backoff-retry on the client side (aka your proxy service) in case your backend service becomes too busy and times out. This way, even if you see 5xx responses, your client should always eventually succeed (once your backend has recovered).
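
A minimal sketch of that retry loop on the proxy side (the delays, cap, and sendOnce are placeholders, not a prescription; sendOnce stands for whatever client call performs one forward attempt):

// Retry 5xx responses with exponentially growing delays: 1s, 2s, 4s, ...
async function forwardWithBackoff(sendOnce, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const response = await sendOnce();                 // one forward attempt
    if (response.statusCode < 500) return response;    // success or non-retryable error
    await new Promise((resolve) => setTimeout(resolve, 1000 * 2 ** attempt));
  }
  throw new Error('backend still failing after ' + maxAttempts + ' attempts');
}

Even if the backend returns 502 for a while, the proxy eventually gets its response once the unhealthy instance has been replaced.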


Marco Galassi

Oct 12, 2017, 3:51:53 AM
to Google App Engine
Hi Jordan, thank you for your detailed response.

 Node.js is a single-threaded runtime, which means that your code must be properly written to take full advantage of Node.js's event loop in order to be asynchronous.

We understand the concept of a single-threaded runtime, and we also suspected that our code was not correctly structured to take advantage of Node.js's asynchronous capabilities,
and that for some reason we were blocking the event loop, causing the application to be busy and unresponsive to other incoming requests.

This is actually why we deployed a hello-world test in the first place, consisting of one single GET request handler responding "hello-world" (actually taken from the GAE
examples). To our great surprise, though, we experienced the same issue there, and that's why we are even more confused.

Note that you should always perform exponential backoff-retry on the client side (aka your proxy service) in case your backend service becomes too busy and times out.

Absolutely, and in fact we wouldn't have missing packets with retries. But since we are still in preliminary tests, we want to reduce retransmissions as much as possible and
retry only when the backend is really overloaded (which is not our case, if you look at the plots I attached a couple of messages ago).

Jonas Klemming

Nov 26, 2017, 5:20:49 PM
to Google App Engine
I am seeing the same issue. Did you resolve this?