SUMMARY:
On Tuesday 11 February 2014, Google App Engine applications using the URL Fetch API experienced elevated errors and latency from the API for 3 hours and 3 minutes. If your service or application was affected, we apologize. This is not the level of reliability and performance we strive to offer you, and we have taken and are taking immediate steps to improve URL Fetch performance and availability.
DETAILED DESCRIPTION OF IMPACT:
On Tuesday 11 February 2014, most applications calling the URL Fetch API experienced high error rates from the API from 0610 to 0913 PST. Error rates varied by application, but reached a peak of 60% by 0640 and decreased to 25-35% at 0745 PST.
ROOT CAUSE:
The URL Fetch API is built on top of bespoke Google infrastructure. This infrastructure has capacity in the same sites as Google App Engine (GAE), and also in a remote backup site. One of the critical components of URL Fetch was, for reasons unrelated to GAE, nonfunctional in the local sites on the morning of 11 February, and URL Fetch failed over to the remote site. Serving API calls from the remote site increased their latency, which caused some applications that relied on low-latency fetches to retry their requests. The retries produced an exponential increase in load on the remote site until the combined (original plus retry) demand level stabilized.
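As a rough illustration of how retries inflate demand, the Python sketch below computes the expected number of attempts generated per original request. It is a simplified model and not a description of the actual URL Fetch system: the function name `load_multiplier` is hypothetical, and the model assumes that every failed attempt is retried, that failures are independent, and that the failure rate stays fixed (in a real overload the failure rate rises with load, which makes matters worse).

```python
def load_multiplier(failure_rate: float, max_retries: int) -> float:
    """Expected attempts per original request under a simple retry model.

    Assumptions (illustrative only): each attempt fails independently with
    probability `failure_rate`, and a client retries a failed attempt up to
    `max_retries` times. The result is the geometric sum
    1 + p + p^2 + ... + p^max_retries.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    # Attempt k (k = 0 is the original request) happens only if the k
    # earlier attempts all failed, which has probability failure_rate ** k.
    return sum(failure_rate ** k for k in range(max_retries + 1))


if __name__ == "__main__":
    # Compare a few failure rates for a client that retries up to 3 times.
    for rate in (0.1, 0.3, 0.6):
        print(f"failure rate {rate:.0%}: {load_multiplier(rate, 3):.2f}x load")
```

Under these assumptions, a 60% failure rate with two retries per request yields roughly 1.96 attempts for every original request, so the backup site would see nearly double the original demand.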
REMEDIATION AND PREVENTION:
The immediate remediation was achieved by isolating applications with retry behavior from those which gracefully tolerated the increased latency. Google engineers also selectively dropped retry requests, decreasing overall demand and therefore decreasing the overall error rate. Finally, Google engineers brought new URL Fetch capacity online and assigned some applications to the new capacity.
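The idea of selectively dropping retry requests can be sketched as a small load-shedding routine. This is only an illustration under stated assumptions, not the mechanism Google engineers used: the `FetchRequest` type, its `is_retry` flag, and the `shed_retries` function are all hypothetical, and the sketch assumes the server can tell retries apart from original requests.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class FetchRequest:
    """Hypothetical request record; `is_retry` marks a repeated attempt."""
    app_id: str
    is_retry: bool


def shed_retries(
    requests: List[FetchRequest], capacity: int
) -> Tuple[List[FetchRequest], List[FetchRequest]]:
    """Admit original requests first; admit retries only with spare capacity.

    Returns (admitted, dropped). Arrival order is preserved within each group.
    """
    if capacity < 0:
        raise ValueError("capacity must be non-negative")
    originals = [r for r in requests if not r.is_retry]
    retries = [r for r in requests if r.is_retry]

    # Original requests get first claim on the available capacity.
    admitted = originals[:capacity]
    spare = capacity - len(admitted)

    # Retries fill whatever capacity is left; the rest are dropped.
    admitted = admitted + retries[:spare]
    dropped = originals[capacity:] + retries[spare:]
    return admitted, dropped


if __name__ == "__main__":
    batch = [
        FetchRequest("app-a", False),
        FetchRequest("app-a", True),
        FetchRequest("app-b", False),
        FetchRequest("app-b", True),
    ]
    admitted, dropped = shed_retries(batch, capacity=3)
    print("admitted:", admitted)
    print("dropped:", dropped)
```

Prioritizing original requests over retries keeps first attempts flowing while reducing the combined demand, which matches the intent described above of lowering the overall error rate by lowering overall load.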
To prevent recurrences, in the short term Google engineers have already increased the total capacity of the URL Fetch system so that it can tolerate retry behavior without overloading the API. In the medium term, we will be switching the URL Fetch service to use the same high-capacity systems used by other large Google infrastructure systems such as Search, thus benefiting from the scale and reliability improvements achieved by other Google services.