AppEngine down? - 502 errors


bFlood

Jul 17, 2018, 3:27:46 PM
to Google App Engine
AppEngine down? - 502 errors 

Anyone else?

Alexandru Farcaş

Jul 17, 2018, 3:29:57 PM
to Google App Engine
Same for me, for both of my applications, in the US and the EU.
It started around 20 minutes ago.

Thanasis Delenikas

Jul 17, 2018, 3:30:46 PM
to Google App Engine
Me too, EU location.

Will Shepherdson

Jul 17, 2018, 3:39:17 PM
to Google App Engine
An incident has now been created on the Google Cloud Platform status page: https://status.cloud.google.com/incident/cloud-networking/18012

Andrin von Rechenberg

Jul 17, 2018, 3:39:18 PM
to Google App Engine
All our projects are down too.

Armen Babikyan

Jul 17, 2018, 4:15:50 PM
to google-a...@googlegroups.com
It boggles the mind that Google has a release engineering and configuration management process that allows for a problem to have a blast radius of more than one region.

We've seen this before with other AppEngine outages, and it's supremely unsettling. It tells me that building applications for redundancy across multiple Google regions is basically a worthless effort.

Is this a distributed systems design issue, a human process issue, or something else? What is Google doing to rectify this?

Armen



Marcel Manz

Jul 17, 2018, 4:38:22 PM
to Google App Engine
I want to know the answer to this too. A major outage like this calls for a notice that a detailed post-mortem will follow, not just your default 'prevent or minimize future recurrence' message:

The issue with Google App Engine has been resolved for all affected users as of Tuesday, 2018-07-17 13:05 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Not only were our App Engine apps down, but so were our GCP auto-scaling groups. They didn't receive any further traffic from the load balancers until we forced a manual rolling restart/replacement of the instances in the group. It seems they somehow got detached from the load balancers, and Google's systems didn't detect or auto-resolve this.

Marcel

Jordan (Cloud Platform Support)

Jul 18, 2018, 1:52:50 PM
to Google App Engine
I have notified the team responsible for this specific post-mortem of the valid feedback provided here, to ensure that the questions raised will be addressed in the incident report.

The team is very aware of your concerns and is working very hard to thoroughly investigate the cause of this incident and prevent it from happening again. I agree that the initial incident status updates are very programmatic in wording, but this is done so that the entire team can focus on actually fixing the issue fast, leaving them time to properly investigate the detailed information for the final incident report after the issue has been resolved.