Google App Engine issue with URLFetch service beginning at 4:30 AM US/Pacific on February 11, 2014


Jose Montes de Oca

Feb 11, 2014, 12:57:49 PM
to google-appengine...@googlegroups.com
We are currently experiencing an issue with Google App Engine URLFetch. Some applications may experience increased latency and errors, beginning at approximately 04:30 AM US/Pacific Time on February 11, 2014. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by Tuesday, 2014-02-11 11:00 AM with current details.

Jose Montes de Oca

Feb 11, 2014, 2:05:30 PM
to google-appengine...@googlegroups.com
We are still investigating the issue with Google App Engine URLFetch. We will provide another status update by Tuesday, 2014-02-11 12:00 US/Pacific.

Jose Montes de Oca

Feb 11, 2014, 3:19:01 PM
to google-appengine...@googlegroups.com
We are still investigating the issue with Google App Engine URLFetch. At this point error rates for affected applications are declining. We will provide another status update by Tuesday, 2014-02-11 01:00 PM US/Pacific.

Jose Montes de Oca

Feb 11, 2014, 4:22:46 PM
to google-appengine...@googlegroups.com
We continue to work on a resolution to the issue with Google App Engine URLFetch. Error rates are continuing to decline for affected applications. We will provide another status update by Tuesday, 2014-02-11 03:00 PM US/Pacific.

John Lowry

Feb 11, 2014, 7:20:13 PM
to google-appengine...@googlegroups.com
The problem with Google App Engine URL Fetch was resolved as of Tuesday, 2014-02-11 16:00 (all times are in US/Pacific).

URL Fetch requests to Google Cloud Storage URLs continue to have slightly elevated latency and error rates, but URL Fetch requests to other URLs are performing normally.

We apologize for the inconvenience and thank you for your patience and continued support. Please rest assured that system reliability is a top priority at Google, and we are making continuous improvements to make our systems better.

Google App Engine Downtime Notify

Feb 24, 2014, 12:42:39 PM
to google-appengine...@googlegroups.com

SUMMARY:

On Tuesday 11 February 2014, Google App Engine applications using the URL Fetch API experienced elevated errors and latency from the API for 3 hours and 3 minutes.  If your service or application was affected, we apologize. This is not the level of reliability and performance we strive to offer you, and we have taken and are taking immediate steps to improve URL Fetch performance and availability.


DETAILED DESCRIPTION OF IMPACT:

On Tuesday 11 February 2014, most applications calling the URL Fetch API experienced high error rates from the API from 06:10 to 09:13 PST.  Error rates varied by application, but reached a peak of 60% by 06:40 and decreased to 25-35% at 07:45 PST.


ROOT CAUSE:

The URL Fetch API is built on top of bespoke Google infrastructure.  This infrastructure has capacity in the same sites as Google App Engine (GAE), and also in a remote backup site.  One of the critical components of URL Fetch was, for reasons unrelated to GAE, nonfunctional in the local sites on the morning of 11 February, and URL Fetch failed over to the remote site.  The increased latency of invoking the API at a remote site caused some applications that relied on low-latency fetches to retry their requests.  These retries produced an exponential increase in load on the remote site until the combined [original + retry] demand level stabilized.
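
The retry feedback described above can be pictured with a toy model. This is only a sketch under stated assumptions, not a reconstruction of the actual URL Fetch infrastructure: it assumes a fixed per-step capacity at the remote site, that the share of offered load beyond capacity fails, and that each failed request is retried on the next step up to a fixed limit. The function name `simulate_retry_load` and all parameter values are hypothetical.

```python
def simulate_retry_load(base_demand, capacity, max_retries, steps):
    """Toy model of retry amplification (all assumptions are hypothetical).

    Each step, the site receives `base_demand` fresh requests plus any
    retries carried over from the previous step. The fraction of offered
    load beyond `capacity` fails. Failed requests are retried on the next
    step, at most `max_retries` times each.

    Returns a list of (offered_load, fail_fraction) tuples, one per step.
    """
    # retries[k] holds the number of requests on their (k+1)-th retry.
    retries = [0.0] * max_retries
    history = []
    for _ in range(steps):
        offered = base_demand + sum(retries)
        if offered > 0:
            fail_fraction = max(0.0, 1.0 - capacity / offered)
        else:
            fail_fraction = 0.0
        history.append((offered, fail_fraction))
        # Failed fresh requests become first retries, failed k-th retries
        # become (k+1)-th retries, and requests that fail their final
        # retry are given up on.
        carried = [base_demand * fail_fraction]
        carried += [r * fail_fraction for r in retries[:-1]]
        retries = carried[:max_retries]
    return history


if __name__ == "__main__":
    # Hypothetical numbers: demand 25% above capacity, up to 3 retries.
    for step, (offered, failed) in enumerate(
        simulate_retry_load(100.0, 80.0, 3, 12)
    ):
        print(f"step {step:2d}: offered={offered:7.2f} fail={failed:.2%}")
```

In this model the offered load climbs step by step as retries pile onto the original demand, then levels off once the number of retries being given up on balances the number being created, which mirrors the "combined [original + retry] demand level stabilized" behavior in the report.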


REMEDIATION AND PREVENTION:

The immediate remediation was achieved by isolating applications with retry behavior from those which gracefully tolerated the increased latency.  Google engineers also selectively dropped retry requests, decreasing overall demand and therefore decreasing the overall error rate.  Finally, Google engineers brought new URL Fetch capacity online and assigned some applications to the new capacity.
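
The idea of selectively dropping retry requests can be sketched as a simple load shedder that admits first attempts ahead of retries when demand exceeds capacity. This is an illustration only: the report does not say how retries were identified or dropped, so the `Request` type, its `is_retry` flag, and the `shed_load` function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    app_id: str      # hypothetical identifier of the calling application
    is_retry: bool   # hypothetical flag: True if this is a repeated attempt


def shed_load(requests, capacity):
    """Admit at most `capacity` requests, preferring first attempts.

    Returns (admitted, dropped). Arrival order is preserved within each
    group. Retries are admitted only if capacity remains after all first
    attempts have been considered.
    """
    capacity = max(0, capacity)
    first_attempts = [r for r in requests if not r.is_retry]
    retries = [r for r in requests if r.is_retry]
    ordered = first_attempts + retries
    return ordered[:capacity], ordered[capacity:]


if __name__ == "__main__":
    incoming = [
        Request("app-a", True),
        Request("app-b", False),
        Request("app-c", True),
        Request("app-d", False),
    ]
    admitted, dropped = shed_load(incoming, capacity=3)
    print("admitted:", [r.app_id for r in admitted])
    print("dropped: ", [r.app_id for r in dropped])
```

Dropping retries first reduces total demand without penalizing applications that tolerate the added latency, which is the effect the remediation describes.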


To prevent recurrences, in the short term Google engineers have already increased the total capacity of the URL Fetch system to tolerate retry behavior without overloading the API.  In the intermediate term, we will be switching the URL Fetch service to use the same high-capacity systems used by other large Google infrastructure systems like Search, thus benefiting from scale and reliability improvements achieved by other Google services.

