Serving Issues on 2012-11-22


Christina Ilvento

Nov 22, 2012, 9:02:11 PM
to google-appengine...@googlegroups.com
At about 5 PM US/Pacific time this evening, App Engine experienced a brief serving outage lasting about 20 minutes. Normal service has been restored and no action is needed from developers at this time. We are still investigating the root cause and scope of this incident, and will follow up with more details. We apologize for any inconvenience this incident may have caused you or your customers.



Happy Thanksgiving,
Regards,
Christina Ilvento on behalf of the Google App Engine Team

Christina Ilvento

Nov 23, 2012, 4:59:42 PM
to google-appengine...@googlegroups.com
On November 22, 2012 at approximately 5 PM US/Pacific time, App Engine experienced a serving outage of increasing scope with a total duration of 36 minutes. This incident was identified by our regular monitoring, and we were able to respond within minutes to take corrective action.

The root cause of this incident was a configuration change that impacted performance when routing traffic for applications serving from custom domains (URLs other than appspot.com). An increase in requests for applications serving from custom domains triggered the actual outage, which ultimately affected all applications. A more detailed timeline is included below (all times are in US/Pacific).

2012-11-19
  • A configuration file is updated, modifying request handling for applications serving from custom domains.
2012-11-22
  • 17:02 - Requests to custom domains start noticeably increasing and some applications begin experiencing elevated error rates.
  • 17:05 - Our monitoring systems alert us, and investigation begins.
  • 17:21 - As part of our normal remediation for an outage of this kind, traffic to all applications is stopped to allow the affected components to recover.
  • 17:33 - Most requests to all applications are successful.
  • 17:38 - All requests to all applications are successful, fully resolving the serving outage.


We have identified the performance problem as an unoptimized algorithm that performed a linear scan through a lookup table for custom domains. This was exacerbated by the recent configuration change, which greatly increased the number of entries in the lookup table. We have taken immediate steps to remediate this issue by removing unneeded entries, and are optimizing the algorithm in question to scale appropriately.
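
For readers curious why a linear scan scales poorly as the table grows, here is a minimal Python sketch. It is not App Engine's actual routing code, which is not described in this post; the table contents, `ROUTES`, `lookup_linear`, `build_index`, and `lookup_indexed` are all hypothetical names chosen for illustration. It contrasts a per-request scan, whose cost grows with the number of entries, with a hash index built once, whose lookups take roughly constant time on average.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical lookup table: (custom domain, application ID) pairs.
ROUTES: List[Tuple[str, str]] = [
    ("www.example.com", "app-a"),
    ("shop.example.org", "app-b"),
    ("blog.example.net", "app-c"),
]


def lookup_linear(table: List[Tuple[str, str]], host: str) -> Optional[str]:
    """Scan every entry until a match is found: O(n) work per request."""
    for domain, app_id in table:
        if domain == host:
            return app_id
    return None


def build_index(table: List[Tuple[str, str]]) -> Dict[str, str]:
    """Build a hash index once, keeping the first entry for each domain
    so that results agree with the linear scan."""
    index: Dict[str, str] = {}
    for domain, app_id in table:
        index.setdefault(domain, app_id)
    return index


def lookup_indexed(index: Dict[str, str], host: str) -> Optional[str]:
    """Hash lookup: average cost does not depend on the table size."""
    return index.get(host)


INDEX = build_index(ROUTES)
```

With the scan, every additional table entry adds work to each request that misses or matches late, so a configuration change that greatly enlarges the table raises per-request cost across the board. With the index, the growth is paid once at build time.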

We apologize for the inconvenience caused by this outage. If you believe your paid application experienced an SLA violation during this incident, please fill out our refund request form.

Regards,

Posted on behalf of Peter S. Magnusson, Engineering Director, Google App Engine