High logging service error rate

Google Cloud Platform Status

unread,

Apr 17, 2015, 7:48:19 PM4/17/15

to google-appengine...@googlegroups.com

We're investigating an issue with Google App Engine Logging service
beginning at 2015-04-17 16:00 (all times are in US/Pacific). We will
provide more information within one hour.

Google Cloud Platform Status

unread,

Apr 17, 2015, 8:45:39 PM4/17/15

to google-appengine...@googlegroups.com

The problem with the Google App Engine Logging service was resolved as of
Friday April 18th, 2015 at 17:00 PDT. We apologize for the inconvenience
and thank you for your patience and continued support. Please rest assured
that system reliability is a top priority at Google, and we are making
continuous improvements to make our systems better.

Google Cloud Platform Status

unread,

Apr 17, 2015, 8:48:41 PM4/17/15

to google-appengine...@googlegroups.com

Apologies - the date in the previous post was incorrect. The resolution
time was Friday, April 17th, 2015 at 17:00 PDT.

Google Cloud Platform Status

unread,

Apr 24, 2015, 8:46:21 AM4/24/15

to google-appengine...@googlegroups.com

SUMMARY:

On Friday 17 April 2015, the Google App Engine Logs API experienced
intermittent failures and reduced throughput for read requests for a
duration of 54 minutes. If your service or application was affected, we
apologize — this is not the level of quality and reliability we strive to
offer you, and we are taking immediate steps to improve the platform’s
performance and availability.

DETAILED DESCRIPTION OF IMPACT:

On Friday 17 April 2015 from 16:02 to 16:56 PDT, 3% of read requests to the
Logs API failed and there was a 96% drop in throughput. The problem
affected 16% of applications that rely on this API to export logs. In this
time window, users experienced intermittent timeouts while attempting to
view application logs on App Engine Admin Console or Google Cloud
Developers console.

ROOT CAUSE:

Hotspotting in the App Engine Logs API's storage subsystem caused a number
of storage nodes to fail. This eventually resulted in resource depletion
and request failures.

REMEDIATION AND PREVENTION:

At 16:05 on Friday 17 April 2015, an automated alert on depletion of
available resources for the Logs API was sent out to Google Engineers. To
resolve the immediate problem they started redirecting traffic away from
the affected storage layer. The service started recovering at 16:51 and
normal operation was restored at 16:56.

To prevent similar incidents in future, we are implementing changes to
reallocate resources consumed by high use individual nodes of the storage
layer backing the Logs API.

Reply all

Reply to author

Forward