HTTP Cloud Load Balancer intermittent 502 responses; requests don't reach our backend services


Jorge Barrachina

Mar 10, 2017, 8:00:37 AM
to Google App Engine

Description of the problem:


The HTTP Cloud Load Balancer returns intermittent 502 responses and doesn't let requests reach our backend services. The problem seems to come up randomly. Maybe it is related to https://groups.google.com/forum/#!topic/google-appengine-downtime-notify/C_fCwHb73wc, issued on February 4th?


Also, while looking for alternative solutions, we found this post, which may be correlated with this issue and shed some light on what happened:


https://blog.percy.io/tuning-nginx-behind-google-cloud-platform-http-s-load-balancer-305982ddb340#.rw4tbv6gl




Issue time period:


March 9th, 14:12:29.697 (UTC+1) until March 10th, 09:28:36.085 (UTC+1)


Projects affected:


oa-staging


Services affected:


clientaccount (versions: release-2017-10-a, release-2017-8-a, release-2017-07-a)



Service URL:


https://clientaccount-dot-oa-staging.appspot-preview.com



Deployment details of our stack:



app.yaml


api_version: 1
service: clientaccount
runtime: python
env: flex
entrypoint: gunicorn 'client_account.wsgi:load_app("prod")'

runtime_config:
  python_version: 3

automatic_scaling:
  min_num_instances: 1
  max_num_instances: 5
  cool_down_period_sec: 120  # default value
  cpu_utilization:
    target_utilization: 0.8



requirements.txt


bingads==v10.4.11
boto3==1.3.0
botocore==1.4.10
docutils==0.12
Flask==0.10.1
Flask-Cors==2.1.2
Flask-Security==1.7.5
flask-swagger==0.2.12
gcloud==0.13.0
google-api-python-client==1.5.0
googleads==5.0.0
gunicorn==19.4.1
itsdangerous==0.24
Jinja2==2.8
jmespath==0.9.0
MarkupSafe==0.23
mongoengine==0.10.6
pymongo==3.2.2
python-dateutil==2.5.2
PyYAML==3.11
recurly==2.4.2
requests==2.9.1
sendgrid==3.0.1
six==1.10.0
Werkzeug==0.11.4
wheel==0.24.0



Transaction example details:


If you want to check the transaction in the Google Cloud Console, here is the link:


https://console.cloud.google.com/logs/viewer?project=oa-staging&hl=es&minLogLevel=0&expandAll=false&resource=http_load_balancer&advancedFilter=resource.type%3D%22http_load_balancer%22%0Aresource.labels.zone%3D%22global%22%0Aresource.labels.project_id%3D%22oa-staging%22%0Atimestamp%3D%222017-03-09T15:35:43.842872262Z%22%0AinsertId%3D%221snolhbg22kclfq%22&timestamp=2017-03-09T15:35:43.842872262Z


GET  https://clientaccount-dot-oa-staging.appspot-preview.com/check_token/**********LONG_TOKEN*************



This request arrived at the HTTP Cloud Load Balancer at 16:35:43.842 (Madrid time).



Trace log from the HTTP Cloud Load Balancer:


{
  "insertId": "1snolhbg22kclfq",
  "jsonPayload": {
    "statusDetails": "failed_to_connect_to_backend",
    "@type": "type.googleapis.com/google.cloud.loadbalancing.type.LoadBalancerLogEntry"
  },
  "httpRequest": {
    "requestMethod": "GET",
    "requestUrl": "https://clientaccount-dot-oa-staging.appspot-preview.com/check_token/*********LONG_TOKEN*************",
    "requestSize": "1370",
    "status": 502,
    "responseSize": "421",
    "remoteIp": "34.197.229.75"
  },
  "resource": {
    "type": "http_load_balancer",
    "labels": {
      "url_map_name": "",
      "forwarding_rule_name": "",
      "backend_service_name": "",
      "target_proxy_name": "",
      "zone": "global",
      "project_id": "oa-staging"
    }
  },
  "timestamp": "2017-03-09T15:35:43.842872262Z",
  "severity": "WARNING",
  "logName": "projects/oa-staging/logs/requests"
}



The connection to our backend (https://clientaccount-dot-oa-staging.appspot-preview.com) was dropped. If you look at the jsonPayload of the trace above, it says "failed_to_connect_to_backend", meaning the HTTP load balancer could not reach our backend, so the request was never processed.
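To get a feel for how often this was happening, entries like the trace above can be counted by their statusDetails value. A rough sketch (assuming the load balancer logs have been exported as one JSON object per line; the field path matches the trace above):

```python
import json
from collections import Counter

def count_status_details(path):
    """Count load balancer log entries by jsonPayload.statusDetails."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line)
            detail = entry.get("jsonPayload", {}).get("statusDetails", "unknown")
            counts[detail] += 1
    return counts
```

Entries with "failed_to_connect_to_backend" are the ones where the balancer never reached an instance at all.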


This is odd, because we didn't change our stack and it was working in the days before the dates reported.

Could you tell us what happened, or provide some logs for our team to review?


Thanks for your support


Adam (Cloud Platform Support)

Mar 11, 2017, 2:21:41 PM
to Google App Engine
I've posted a response to the public issue you filed at https://issuetracker.google.com/36144028, please feel free to direct any replies there.

julien silverston

Sep 29, 2017, 7:05:25 PM
to Google App Engine
Adam, access is denied to the issue you mentioned. Thank you.

Christian Aquino

Nov 3, 2017, 4:52:16 PM
to Google App Engine
Hi Jorge, I was wondering if you ever got to the bottom of this issue?


Thanks,
Christian



Jorge Barrachina

Nov 6, 2017, 5:04:22 PM
to Google App Engine
Yes,

I'll copy the response from the support team (sorry, this was solved a long time ago and I can't remember all the details):

-------
Getting back to your app configuration, the part which stands out the most is this:


    entrypoint: gunicorn 'client_account.wsgi:load_app("prod")'

It looks like you're using gunicorn with the default config, which is one sync worker; that limits your app to serving a single request at a time. This can cause health check pings and other requests to time out if the app is currently busy. One quick fix is simply to spawn more worker processes to handle concurrent requests, e.g.:

    entrypoint: gunicorn -w 4 'client_account.wsgi:load_app("prod")'
---------

In summary, gunicorn has to start with several workers. By default it launches only one, so if a health-check request and a regular request arrive at almost the same time, gunicorn cannot handle both of them. That's why the 5xx HTTP status codes appear.
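The worker count in the suggested entrypoint follows gunicorn's usual rule of thumb of (2 × cores) + 1 sync workers. A small sketch of that formula (the heuristic is the one from the gunicorn docs; the instance sizes below are just examples):

```python
import multiprocessing

def recommended_workers(cores=None):
    """Gunicorn's rule of thumb for sync workers: (2 * cores) + 1."""
    if cores is None:
        cores = multiprocessing.cpu_count()
    return 2 * cores + 1

# A 1-vCPU flex instance would get 3 workers, a 2-vCPU instance 5, e.g.:
#   entrypoint: gunicorn -w 3 'client_account.wsgi:load_app("prod")'
```

With more than one worker, a health-check ping no longer blocks behind a slow application request.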

At the time of this issue, tracking down these kinds of problems in the App Engine logs was a nightmare. I hope it's better now.

Hope this answer helps you in some way,

Cheers

julien silverston

Nov 7, 2017, 8:39:37 AM
to Google App Engine
Hi,

I think I've solved this issue. No HTTP 502s so far.

First, my load balancer config was wrong.
Check your named port, and check your health checks on both the instance groups and the load balancer (disable them if necessary to troubleshoot).
I also applied the following tuning:
- enabled gzip (it already was)
- increased the keepalive timeout in nginx
- added the GCP IP ranges to the firewall rules

Take a look at the same blog post from Percy.io; thanks to them:

https://blog.percy.io/tuning-nginx-behind-google-cloud-platform-http-s-load-balancer-305982ddb340
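For reference, the nginx side of that tuning is roughly this (values taken from the Percy.io post above; the idea is that nginx's keepalive must outlive the load balancer's ~600-second idle timeout, so the balancer never reuses a connection nginx has already closed; treat the exact numbers as a starting point):

```nginx
http {
    gzip on;

    # longer than the GCP HTTP(S) LB's ~600 s keepalive, so nginx
    # never closes a connection the LB still considers open
    keepalive_timeout 650;
    keepalive_requests 10000;
}
```

The firewall part means allowing the load balancer and health-check source ranges (130.211.0.0/22 and 35.191.0.0/16 at the time; check the current GCP docs) to reach your instances.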

Cheers