502 Bad Gateway on a PHP GAE Standard - no app logs


Maxime Lebastard

Oct 17, 2018, 8:22:17 AM
to Google App Engine

Hi,

We've got a Node Express app on a GAE standard instance. It calls a PHP service on another GAE standard instance through a server-to-server HTTP request.

Most of the time it works well, but since our first production deployment we have seen a few 502 Bad Gateway errors (32 occurrences) on that request every day.

Looking at the trace logs:
  • I can see the NodeJS errors (receiving a 502 response, throwing an exception, etc.).
  • I can see a first http_load_balancer WARN log at 22:59:52.974, written because a 502 error was returned. Its jsonPayload.statusDetail value is "failed_to_connect_to_backend".
  • I can see a second http_load_balancer WARN log at 22:59:52.979, also written because a 502 error was returned. Its jsonPayload.statusDetail value is "response_sent_by_backend".
But I can't see any trace on the PHP service. I checked the health checks for the PHP service in the same timeframe, and all of them return 200 OK.

My guess is a network error on the Google Cloud side, or maybe a latency issue, but is there a way I can diagnose and fix it? What can cause a load balancer to return a Bad Gateway when the GAE instance is up and running?

Max

Olu

Oct 17, 2018, 1:32:18 PM
to Google App Engine
The "failed_to_connect_to_backend" response is similar to "failed_to_pick_backend". It often occurs when the load balancer knows where it wants to send traffic but cannot route the request successfully or cannot find a healthy, available instance. It may be worth reviewing instance usage and health checks to see whether they correlate with these responses.

This error may also be returned when too few instances are serving your service. For example, it can appear when an unexpected spike in traffic occurs and there are not enough instances to handle the incoming requests. In that situation, the load balancer attempts to route your application's requests but has no instance available to serve them, so it produces the "failed_to_connect_to_backend" or "failed_to_pick_backend" error.

Could you share the app.yaml file of your App Engine application? It should show the configuration currently applied to your instance, including the number of instances you configured.

Maxime Lebastard

Oct 18, 2018, 5:49:06 AM
to Google App Engine
I don't think we experienced a peak. I only see requests from fewer than 5 users (it was during the night, and we usually don't have many users at that time).

One component I forgot to mention in this stack is an Nginx instance that acts as a gateway using the reverse proxy feature.
The request path is: NodeJS app ---> Nginx gateway ---> PHP app.

In the Nginx gateway logs, I can see two kinds of errors:

  • The Nginx gateway is unable to connect to the PHP service:

 "2018/10/16 22:03:45 [error] 7#7: *4276 connect() to [2a00:1450:400c:c04::99]:80 failed (101: Network is unreachable) while connecting to upstream, client: 172.17.0.4, server: , request: "GET /messagings HTTP/1.1", upstream: "http://[2a00:1450:400c:c04::99]:80/messagings", host: "services.maxime.pro"

  • The Nginx gateway seems to forward a 502 error returned by the PHP service (the error from my first message):

GET /find" 502 332 "-" "axios/0.18.0

What's weird to me is how random this error is. It happens a few times every day, but most of my tests pass, so it's quite hard to reproduce and debug.

Here is the app.yaml of the PHP app:

runtime: php
env: flex

service: platform

runtime_config:
 document_root: public

automatic_scaling:
 min_num_instances: 1
 max_num_instances: 10
 cpu_utilization:
   target_utilization: 0.6

env_variables:
 PHP_ENV: production

readiness_check:
 path: "/health"
 check_interval_sec: 5
 timeout_sec: 4
 failure_threshold: 2
 success_threshold: 2
 app_start_timeout_sec: 60

I don't know if it's the same issue, but some other threads mention a routing error in flex environments when min_num_instances is set to 1, and they suggest setting it to 2.
We'd rather use a standard environment instead (because of the cost), but I don't know whether that would fix the issue.
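
For reference, the change those threads suggest would be a one-line edit to the automatic_scaling section of the app.yaml above. This is only a sketch of that suggestion: everything except min_num_instances is copied from the file as posted, and it has not been confirmed to fix the 502 errors.

```yaml
automatic_scaling:
 # Raised from 1 to 2, as suggested in the other threads (untested assumption)
 min_num_instances: 2
 max_num_instances: 10
 cpu_utilization:
   target_utilization: 0.6
```

Note that keeping a second instance running at all times increases the cost of the flex service.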

Dan S (Cloud Platform Support)

Oct 18, 2018, 11:31:38 PM
to Google App Engine

The default number of instances deployed is 2, and the minimum required is 1, as you can confirm in the documentation [1]. However, a single instance may not be enough, even for ~5 users. Setting the minimum to 2 instances could help resolve this issue. You can also retrieve more logs from your PHP instance through Stackdriver Logging for App Engine apps, such as errors surfaced from within the Nginx process and the Nginx logs for health checks. You can find more details in the documentation [2].

[1] https://cloud.google.com/appengine/docs/flexible/python/reference/app-yaml#automatic_scaling
[2] https://cloud.google.com/appengine/articles/logging
