App Engine seeing elevated error rates

Google Cloud Platform Status

unread,

Feb 15, 2018, 3:42:06 PM2/15/18

to google-appengine...@googlegroups.com

We are investigating an issue with App Engine. We will provide more
information by 13:00 US/Pacific.

Google Cloud Platform Status

unread,

Feb 15, 2018, 4:04:41 PM2/15/18

to google-appengine...@googlegroups.com

We're seeing widespread improvement in error rates in many / most regions
since ~12:40 PST. We're continuing to investigate and will provide another
update by 13:30 PST.

Google Cloud Platform Status

unread,

Feb 15, 2018, 5:04:51 PM2/15/18

to google-appengine...@googlegroups.com

The issue with App Engine has been resolved for all affected
projects as of 12:44 US/Pacific. We will conduct an internal investigation
of this issue and make appropriate improvements to our systems to help
prevent or minimize future recurrence.

We will provide a more detailed analysis of this incident once we have
completed
our internal investigation.

Google Cloud Platform Status

unread,

Feb 22, 2018, 10:13:38 AM2/22/18

to google-appengine...@googlegroups.com

On Thursday 15 February 2018, specific Google Cloud Platform services
experienced elevated errors and latency for a period of 62 minutes from
11:42 to 12:44 PST. The following services were impacted:

Cloud Datastore experienced a 4% error rate for get calls and an 88% error
rate for put calls.

App Engine's serving infrastructure, which is responsible for routing
requests to instances, experienced a 45% error rate, most of which were
timeouts.

App Engine Task Queues would not accept new transactional tasks, and also
would not accept new tasks in regions outside us-central1 and europe-west1.
Tasks continued to be dispatched during the event but saw start delays of
0-30 minutes; additionally, a fraction of tasks executed with errors due to
the aforementioned Cloud Datastore and App Engine performance issues.

App Engine Memcache calls experienced a 5% error rate.

App Engine Admin API write calls failed during the incident, causing
unsuccessful application deployments. App Engine Admin API read calls
experienced a 13% error rate.

App Engine Search API index writes failed during the incident though search
queries did not experience elevated errors.

Stackdriver Logging experienced delays exporting logs to systems including
Cloud Console Logs Viewer, BigQuery and Cloud Pub/Sub. Stackdriver Logging
retries on failure so no logs were lost during the incident. Logs-based
Metrics failed to post some points during the incident.

We apologize for the impact of this outage on your application or service.
For Google Cloud Platform customers who rely on the products which were
part of this event, the impact was substantial and we recognize that it
caused significant disruption for those customers. We are conducting a
detailed post-mortem to ensure that all the root and contributing causes of
this event are understood and addressed promptly.

Reply all

Reply to author

Forward