An update on App Engine URL Fetch problems on June 11, 2013

Andrew Jessup

Jul 30, 2013, 8:31:20 PM
to google-appengine...@googlegroups.com

We wanted to give you an update on an incident on June 11, in which many customers using the App Engine URL Fetch service were unable to fetch URLs from several sites hosted by Google, such as https://accounts.google.com/o/oauth2/token. We estimate that at peak, approximately 65% of all URL Fetch requests to such Google URLs were unable to connect successfully.


The incident began at approximately 2:00 AM US/Pacific on June 11, 2013 and continued until approximately 2:50 PM on the same day. The root cause was eventually determined to be a configuration change in Google's network. A more detailed timeline is included below (all times are in US/Pacific).


Tuesday, 2013-06-11

  • 2:15am - An App Engine customer notices an extremely high rate of urlfetch timeouts to Google OAuth2 URLs, and files an issue with App Engine support.

  • 2:54am - App Engine support makes an initial problem report to our on-call engineering team.

  • 10:30am - Additional customer complaints have been attached to the problem report; investigation begins.

  • 12:34pm - Problem is tentatively isolated to an interaction between App Engine’s URL Fetch infrastructure and the recent network configuration change. Investigation continues.

  • 2:11pm - Problem is confirmed to be an interaction between App Engine’s URL Fetch infrastructure and the recent network configuration change, when requests arrive at one particular Google datacenter.

  • 2:49pm - Analysis of the network configuration of the affected datacenter is complete, and it is judged safe and correct to take the affected datacenter offline for maintenance. The affected datacenter no longer handles requests for Google URLs, and the rate of URL Fetch errors to Google properties returns to normal.


When App Engine applications use URL Fetch to reach URLs hosted by Google, the requests can arrive at any number of Google datacenters, depending on factors such as datacenter or network load, network proximity, or whether some datacenters or portions of the network are temporarily offline for maintenance.


A recent audit of Google network traffic revealed an error in Google’s network routing configuration. The error in the routing configuration was corrected immediately upon discovery. Unfortunately, the correction to the routing configuration did not properly take into account a special network testing configuration in one of Google’s datacenters, and impeded that datacenter’s ability to respond to requests originating from App Engine applications.


As a result, when an App Engine application requested a Google-hosted URL and the request was routed to the datacenter affected by the updated routing configuration, the response could not be transmitted back to the application, and the application would see that the URL request had timed out.


We analyzed the patterns of the failing URL Fetch requests, and identified that all of them were directed to one particular datacenter. We took that datacenter offline for maintenance immediately, after which all requests were directed to Google datacenters without networking issues, which could respond to requests from App Engine applications without complication. This ended the incident for App Engine customers.


We recognize that it took longer for us to identify and correct the underlying issue than we would have liked. The analysis was complicated by the following issues:


  • Only traffic to Google URLs was potentially affected. Most URL Fetch traffic is not destined for Google URLs, so the overall URL Fetch error rate did not increase measurably enough to trip alerting thresholds.

  • Of traffic to Google URLs, only requests that arrived at a single specific datacenter resulted in timeouts. While we have existing monitoring and visibility into errors and latency by URL, we have not yet extended that monitoring to break down requests by URL and destination IP simultaneously. It was only that analysis which directed us to the misconfigured datacenter, and we had to create this breakdown dynamically during the outage.
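
To illustrate the two points above, here is a minimal Python sketch of how the same request log can look healthy in aggregate yet show a clear failure once it is grouped by URL and destination IP together. Everything in it is an assumption made for illustration: the record layout, the `error_rates` helper, the traffic mix, and the IP addresses (taken from documentation ranges) are invented and do not represent App Engine's actual monitoring data or tooling.

```python
from collections import defaultdict

TOKEN_URL = "https://accounts.google.com/o/oauth2/token"
OTHER_URL = "https://example.com/api"  # hypothetical non-Google URL


def error_rates(records, key):
    """Group records with `key` and return {group: fraction of failed requests}."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for rec in records:
        group = key(rec)
        totals[group] += 1
        if not rec["ok"]:
            failures[group] += 1
    return {group: failures[group] / totals[group] for group in totals}


def record(url, dest_ip, ok):
    """Build one hypothetical log record."""
    return {"url": url, "dest_ip": dest_ip, "ok": ok}


# Invented traffic mix: 9,600 non-Google requests that all succeed, plus
# 400 Google requests spread evenly over four destination IPs, one of
# which (203.0.113.3) never answers.
records = [record(OTHER_URL, "198.51.100.7", True)] * 9600
for ip in ("203.0.113.1", "203.0.113.2", "203.0.113.3", "203.0.113.4"):
    records += [record(TOKEN_URL, ip, ip != "203.0.113.3")] * 100

overall = error_rates(records, key=lambda r: "all")["all"]
by_url = error_rates(records, key=lambda r: r["url"])
by_url_and_ip = error_rates(records, key=lambda r: (r["url"], r["dest_ip"]))

print(overall)                                    # 0.01
print(by_url[TOKEN_URL])                          # 0.25
print(by_url_and_ip[(TOKEN_URL, "203.0.113.3")])  # 1.0
```

With these made-up numbers, the overall error rate is 1%, the per-URL view shows 25% for the OAuth2 endpoint, and only the combined URL and destination view isolates the one destination that fails every request.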


As a result of this outage, we’re taking the following steps to improve reliability:


  • Revising network routing configuration change procedures to ensure that updates function correctly in every Google datacenter, and do not have bad interactions with network testing configurations.

  • Investigating adding monitoring for errors and latency for a set of important Google URLs often requested via URL Fetch by App Engine applications for actions critical to their operation, e.g. Google APIs, OAuth2 endpoints.

  • Investigating adding monitoring for errors and latency for a set of important Google URLs, broken down by destination datacenter, to automatically surface problems that are restricted to a small set of destination datacenters.
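
As a rough sketch of what "automatically surfacing" such problems could look like, the Python snippet below flags any (URL, datacenter) pair whose error rate exceeds a threshold, skipping pairs with too few requests to judge. This is only an illustration under stated assumptions: the function name `flag_unhealthy_destinations`, the thresholds, the datacenter labels, and the counts are all hypothetical and are not a description of the monitoring Google is building.

```python
TOKEN_URL = "https://accounts.google.com/o/oauth2/token"


def flag_unhealthy_destinations(stats, max_error_rate=0.05, min_requests=50):
    """Return (url, datacenter, error_rate) tuples that exceed max_error_rate.

    `stats` maps (url, datacenter) -> (total_requests, failed_requests).
    Pairs with fewer than `min_requests` requests are ignored, since a
    handful of failures is not enough evidence to raise an alert.
    """
    flagged = []
    for (url, datacenter), (total, failed) in stats.items():
        if total < min_requests:
            continue
        rate = failed / total
        if rate > max_error_rate:
            flagged.append((url, datacenter, rate))
    # Worst offenders first.
    return sorted(flagged, key=lambda item: item[2], reverse=True)


# Hypothetical per-datacenter counts for one watched URL.
stats = {
    (TOKEN_URL, "dc-a"): (1000, 3),
    (TOKEN_URL, "dc-b"): (800, 2),
    (TOKEN_URL, "dc-c"): (900, 880),  # nearly every request times out
    (TOKEN_URL, "dc-d"): (10, 10),    # too few requests to judge
}

flagged = flag_unhealthy_destinations(stats)
for url, datacenter, rate in flagged:
    print(f"{datacenter}: {rate:.1%} of requests to {url} failed")
```

The minimum-request cutoff is a design choice in this sketch: it trades slower detection on low-traffic destinations for fewer false alarms.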


We apologize for any inconvenience experienced as a result of this issue. Many of our customers depend on being able to access Google services and APIs to run their applications. Ensuring the reliability of this infrastructure, including URL Fetch, remains a top priority for us.


Regards,


Andrew Jessup, on behalf of the Google App Engine Team