Information on Google App Engine's recent US datacenter relocations


Michael Handler

Aug 22, 2013, 4:31:44 PM
to google-appengine...@googlegroups.com

Google App Engine recently relocated our entire US serving footprint to a different location within the US, without requiring any scheduled maintenance period or reduction in functionality, and while serving normally throughout the work period. This work required multiple engineer-months of careful preparation, repeated refinement of our datacenter setup and data copy automation, and a world-class network backbone between our locations. Online migration of all stored data required an initial transfer of multiple petabytes, and online replication of all changes written to stored data after the initial transfer, until the relocation process was complete.


This was the first time the US serving footprint for App Engine had been relocated en masse since the launch of the High Replication Datastore (HRD) in January of 2011. Uncommon processes of this complexity and scale are rarely completed without some small number of bumps and missteps, and this migration proved to be no exception to that rule. While we worked hard to make this process as automatic and invisible to our customers as possible, we fell short in a few places, and we’d like to take this opportunity to explain what happened and what we’re doing to address the issues we encountered.

Storage Layer Overload

As we completed the migration of one of our replicas into its new datacenter, the storage infrastructure in that datacenter began consuming all resources allocated to it. This demand exceeded all previously observed levels and our capacity planning, without any corresponding increase in traffic that could have explained such a jump. As with any piece of infrastructure where demand exceeds supply, the performance of the portions of App Engine that depend heavily on the storage layer degraded.


We simultaneously began investigating potential causes of the increase in storage infrastructure resource demand and allocating more resources to the storage infrastructure. As the additional resources came online, demand grew quickly enough to consume them as well, making it clear that this was not a simple case of underprovisioning. We directed traffic away from the affected datacenter, and immediately saw storage infrastructure resource demand return to normal levels for a drained datacenter.


After further investigation, and detailed consultation with the engineers from the storage infrastructure teams, we were unable to determine the origin of the increased resource demand. We returned traffic to this datacenter during the period of lowest global App Engine load and saw resource demand increase to expected levels without incident. The datacenter has performed as expected since that time, without any unusual behavior or incidents similar to this one.


What you saw: Applications serving from this particular datacenter would have experienced elevated Datastore latency and errors (for both reads and writes), and irregular and delayed Task Queue performance. Additionally, all US applications may have experienced a slightly elevated error rate for datastore write operations, as a consequence of the localized overload.


When you saw it: Tuesday, 25 June 2013, from 7:30 AM to 3:10 PM US/Pacific


What we’re doing about it: We are chagrined to say that, since this event, we have been unable to further diagnose the origin of this issue or to reproduce it in our testing infrastructure or labs. Without a clear root cause, we cannot commit to specific fixes, other than continuing to monitor and improve our diagnostic tools, and investigating ways to gradually ramp up traffic into a datacenter, to avoid potential complications when beginning to serve from a datacenter for the first time.

Application Server Hotspots

Before beginning this move process, App Engine had recently completed an internal migration to the next generation of our scheduler system. The scheduler is responsible for deploying your applications to our serving environment, assigning more resources to your application as required by your traffic and directed by your performance and scaling settings, and rebalancing to ensure a uniform load across our infrastructure.


As the migration to the new scheduler had occurred exclusively within already-serving datacenters, historical load data about each application from the old scheduler was already on hand and available for use by the new scheduler’s algorithm, which provided for a smooth transition. However, when we tested bringing up a new datacenter for the first time under the new scheduler, with no historical load data for the applications assigned to that datacenter, we discovered that the new scheduler’s algorithm performed poorly in the absence of historical load data: it did not properly distribute load across application servers and created hotspots.


For the en masse migration to our new datacenters, we prepared for this situation by replicating the historical load data from each old datacenter to each new datacenter for use by the new scheduler during startup. While this process generally functioned well, the load distribution in the new datacenter still left a small percentage of application servers overloaded and prone to serving mostly errors. The new scheduler’s algorithm would have corrected these hotspots eventually, but at an unacceptably slow rate, so we manually intervened to rebalance and eliminate them.
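To illustrate the failure mode (this is a simplified sketch, not the actual App Engine scheduler; the names and the default estimate are hypothetical), consider a greedy least-loaded placement that depends on per-application load estimates. When history is missing, every application falls back to the same guess, and applications whose real demand far exceeds that guess can end up concentrated on the same servers:

    # Simplified illustration only -- not App Engine's scheduler.
    DEFAULT_LOAD_ESTIMATE = 1.0  # fallback guess when no history exists

    class Server(object):
        def __init__(self, capacity):
            self.capacity = capacity
            self.load = 0.0
            self.apps = []

    def assign(apps, servers, historical_load):
        """Place each app on the least-loaded server, heaviest estimates first.

        Apps missing from historical_load use DEFAULT_LOAD_ESTIMATE; if that
        guess is far below an app's real demand, the chosen server becomes a
        hotspot once real traffic arrives.
        """
        estimates = dict((a, historical_load.get(a, DEFAULT_LOAD_ESTIMATE))
                         for a in apps)
        for app in sorted(apps, key=lambda a: estimates[a], reverse=True):
            target = min(servers, key=lambda s: s.load / s.capacity)
            target.apps.append(app)
            target.load += estimates[app]
        return servers

A per-server safety cap of the kind described below bounds the damage when an estimate turns out to be badly wrong.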


What you saw: A small number of application servers in one datacenter were overloaded and unable to create new instances. Larger applications are assigned to multiple application servers in their datacenter, and any traffic arriving at the affected application server would be retried at a different application server after a small delay, at a cost of some latency.


Small applications are assigned a small number of application servers and do not automatically retry their traffic on other application servers unless the first application server is completely down. If the application server is unable to start instances at all, any request will return a 500 error. (This minimizes creating new instances on other application servers, which would elevate the instance hour cost of your application, but can result in substantial disruption in this particular case.)
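As a purely illustrative mitigation on the caller’s side (fetch_with_retry and its parameters are not part of any App Engine API), clients that treat 5xx responses as transient and retry with exponential backoff degrade more gracefully during incidents like this one than clients that fail on the first error:

    # Illustrative client-side retry sketch; not an App Engine API.
    import time
    import urllib2

    def fetch_with_retry(url, attempts=4, base_delay=0.5):
        for attempt in range(attempts):
            try:
                return urllib2.urlopen(url).read()
            except urllib2.HTTPError as e:
                # Retry only server-side (5xx) errors, and give up after
                # the final attempt.
                if e.code < 500 or attempt == attempts - 1:
                    raise
            time.sleep(base_delay * (2 ** attempt))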


When you saw it: Tuesday, 25 June 2013 through Wednesday, 26 June 2013


What we’re doing about it:

  • Improving the scheduler algorithm to make better load assignment choices when limited or no load data is available.

  • Installing safety limits in the scheduler to keep it from assigning too much load to a given application server, to guard against unexpected edge cases in the scheduler algorithm.

  • Instrumenting the application servers and the schedulers to automatically and aggressively detect and disable any application server that is consistently unable to start instances, or is serving a high percentage of internal errors in response to requests.

Delayed Datastore Eventual Consistency

The High Replication Datastore (HRD) has been designed and documented (Python, Java) from its creation as being eventually consistent, i.e. a non-ancestor query (a query that can return results from multiple entity groups) may not immediately return results that reflect recent writes to those entity groups.


We strive under normal circumstances to keep the eventual consistency delay (the time before all current writes will be reflected in non-ancestor queries) as small as possible. However, the process of moving App Engine en masse to its new datacenters required us to take each replica completely offline for some amount of time during the final phase of the move. Taking a replica completely offline prevents it from accepting writes and participating in Paxos majority consensus, and also prevents the background replication system from keeping the replica up to date. When the replica is re-enabled, the backlog of updates leaves it further behind than is typical during normal operation, and increases the eventual consistency delay to levels not seen during routine functioning of App Engine.


During the design of this migration, alternative processes were investigated which could have minimized the increase in eventual consistency delay that was observed, but analysis showed they would have required periods of elevated datastore and serving latency to complete, and were judged too risky to pursue.


Given that HRD is designed and documented to be eventually consistent with non-ancestor queries, we strongly encourage you to view the documentation linked above about eventual consistency and expected behavior for Datastore queries, and evaluate whether you may need to modify your application to better handle any potential issues that would arise during periods of elevated eventual consistency delay.
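Where your application needs strongly consistent results, the documented approach is to group related entities into an entity group and read them with ancestor queries, which do not exhibit the eventual consistency delay of non-ancestor queries. A minimal sketch using the Python ndb API (the Guestbook/Greeting model names are illustrative, not taken from this post):

    from google.appengine.ext import ndb

    class Greeting(ndb.Model):
        content = ndb.StringProperty()
        date = ndb.DateTimeProperty(auto_now_add=True)

    def post_greeting(guestbook_name, content):
        # Writing under a common ancestor places all greetings for this
        # guestbook in a single entity group.
        parent = ndb.Key('Guestbook', guestbook_name)
        Greeting(parent=parent, content=content).put()

    def recent_greetings(guestbook_name):
        # Ancestor queries are strongly consistent: they reflect all prior
        # writes to the entity group even while the global eventual
        # consistency delay is elevated.
        parent = ndb.Key('Guestbook', guestbook_name)
        return Greeting.query(ancestor=parent).order(-Greeting.date).fetch(10)

Note that this trades throughput for consistency: writes within a single entity group are limited to roughly one commit per second.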


What you saw: Non-ancestor queries may have returned results that did not reflect writes committed a significant amount of time earlier, i.e. an elevated eventual consistency delay.


When you saw it: Monday, 24 June 2013 through Friday, 28 June 2013


What we’re doing about it: Improving our infrastructure to generate better per-application views of eventual consistency delay, and using that to drive improvements that will reduce it systemwide. We are also changing our local development tools so that eventual consistency is enabled by default, allowing developers to experience and build for this behavior earlier in the development cycle.
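In the meantime, one way to exercise this behavior in local unit tests with the Python SDK is the testbed’s Datastore consistency policy. A short sketch (the helper name make_eventually_consistent_testbed is ours; probability=0 is the strictest setting and makes non-ancestor queries never immediately reflect new writes):

    from google.appengine.datastore import datastore_stub_util
    from google.appengine.ext import testbed

    def make_eventually_consistent_testbed():
        # Simulate HRD eventual consistency in local tests so code paths
        # that rely on fresh non-ancestor query results fail early.
        tb = testbed.Testbed()
        tb.activate()
        policy = datastore_stub_util.PseudoRandomHRConsistencyPolicy(
            probability=0)
        tb.init_datastore_v3_stub(consistency_policy=policy)
        tb.init_memcache_stub()
        return tb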

Closing Thoughts

If your application experienced any disruption due to the above issues, we would like to extend sincere apologies to you and your users. We hope that this document makes clear what happened, why, and what we’re doing to guard against any of these issues recurring, as well as to prevent similar issues from occurring at all in the future.


That said, despite the disruptions we outlined above, we consider this migration a success for Google App Engine. Through careful planning and technical innovation, we were able to relocate the entirety of the US serving footprint to a new location in the US without requiring a scheduled maintenance window or a read-only period of the Datastore (as was required of Master/Slave applications), and without causing substantial disruption to our serving traffic. The new datacenters allow Google Cloud Platform to expand even more rapidly than before to handle increased demand for our services. They also colocate more Google Cloud Platform services in close proximity, reducing latency for communication between them and enabling the development of new, innovative products and integrations.


This level of reliability and ongoing innovation and improvement is built into the design of Google Cloud Platform products from the start, and we’re glad to provide it to your applications as standard, at no additional cost, as part of the services we sell to you. We recognize that you have built your businesses on the Google Cloud Platform because you trust Google to handle the difficult and complicated task of growing and reliably maintaining computing resources as a service, letting you focus on growing your application and your business. We remain committed to continuous, thoughtful, strategic, and pre-emptive improvement in the reliability of all parts of the Google Cloud Platform, no matter the difficulty or the complexity.


As always, if you believe your paid application experienced an SLA violation due to any of the issues that we describe above, please fill out our refund request form.


Regards,


Michael Handler, on behalf of the Google App Engine Team
