I'll keep updating on this thread. Sorry for the inconvenience.
--
Takashi Matsuo | Developer Advocate | tma...@google.com
Thanks for the patience.
Summary
On Aug 19th, 2011, application deployments to App Engine began taking
longer than normal, and sometimes did not succeed (returned as a 500
error). The issue started at around 7:00pm PDT, and lasted for ~4.5
hours. This issue affected only application deployment; traffic to
currently deployed versions of applications was not affected by this
issue.
(Applications using the Master/Slave and the High Replication
Datastore configurations were affected equally by this issue.)
Root cause
As part of the application deployment process, your application’s code
is stored into a replicated code repository, for later retrieval when
creating instances of your application. During the above incident, one
replica of the code repository malfunctioned, and attempts to store
code to that replica failed.
The current implementation of the code storage system requires that
your application’s code is successfully committed to all members of
the replicated repository before your application is considered
successfully deployed. Failure to write to any one replica will block
application deployment.
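
To make the failure mode concrete, here is a minimal sketch of a write-all
commit rule. It is not the actual App Engine code: the Replica class, the
deploy_write_all function, and the in-memory storage are all hypothetical
stand-ins chosen for illustration.

```python
class Replica:
    """Hypothetical in-memory stand-in for one code repository replica."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.blobs = {}  # app version -> code bytes

    def store(self, app_version, code):
        # A malfunctioning replica rejects every write.
        if not self.healthy:
            raise IOError(f"replica {self.name} is unavailable")
        self.blobs[app_version] = code


def deploy_write_all(replicas, app_version, code):
    """Report success only if every replica accepts the write."""
    for replica in replicas:
        try:
            replica.store(app_version, code)
        except IOError:
            # One failed replica is enough to fail the whole deployment.
            return False
    return True


# Assumed scenario: three replicas, one of which has malfunctioned.
replicas = [Replica("a"), Replica("b", healthy=False), Replica("c")]
deployed = deploy_write_all(replicas, "v1", b"print('hello')")
print("deployed:", deployed)
```

Under this rule, availability of deployment is only as good as the least
available replica, which matches the behavior described above.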
Remediation
Monitoring: Immediately, we’re improving our internal monitoring to
bring malfunctions like this to our attention faster, so we can effect
repairs quickly.
Reliability: In the short to medium term, we’re modifying the code
storage system and how it interacts with the replicated code
repository. We’ve identified a simpler redesign where your code only
needs to be committed to a subset of the code repository replicas at
upload time. A failure in storing your code to one or more replicas
will not block uploads. Background replication processes will
synchronize your code to all replicas in case of temporary storage
failure at upload time. The code repository is queried in a
strongly-consistent manner, so your application’s code can always be
retrieved correctly, even if some of the replicas are temporarily not
up to date. This system is comparable to the replication scheme based
on the Paxos algorithm used in the High Replication Datastore
configuration.
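
The redesign described above can be sketched roughly as follows. This is an
illustration under stated assumptions, not the App Engine implementation: it
assumes the "subset" is a simple majority, uses a hypothetical per-upload
version number to tell newer code from older code, and is a plain majority
quorum scheme rather than Paxos itself. All names (Replica, upload, fetch,
background_sync) are invented for the example.

```python
class Replica:
    """Hypothetical in-memory stand-in for one code repository replica."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.data = {}  # app id -> (version, code)

    def put(self, app_id, version, code):
        if not self.healthy:
            raise IOError(f"replica {self.name} is unavailable")
        current = self.data.get(app_id)
        # Never overwrite newer code with older code.
        if current is None or version > current[0]:
            self.data[app_id] = (version, code)

    def get(self, app_id):
        if not self.healthy:
            raise IOError(f"replica {self.name} is unavailable")
        return self.data.get(app_id)


def majority(n):
    return n // 2 + 1


def upload(replicas, app_id, version, code):
    """Succeed once a majority of replicas has stored the code."""
    acks = 0
    for replica in replicas:
        try:
            replica.put(app_id, version, code)
            acks += 1
        except IOError:
            pass  # A failed replica no longer blocks the upload.
    return acks >= majority(len(replicas))


def fetch(replicas, app_id):
    """Read from a majority and return the newest (version, code) seen."""
    replies = []
    for replica in replicas:
        try:
            replies.append(replica.get(app_id))
        except IOError:
            pass
    if len(replies) < majority(len(replicas)):
        raise IOError("too few replicas reachable for a consistent read")
    found = [entry for entry in replies if entry is not None]
    return max(found, key=lambda entry: entry[0]) if found else None


def background_sync(replicas, app_id):
    """Copy the newest known code to every reachable replica."""
    known = [r.data[app_id] for r in replicas if r.healthy and app_id in r.data]
    if not known:
        return
    version, code = max(known, key=lambda entry: entry[0])
    for replica in replicas:
        if replica.healthy:
            replica.put(app_id, version, code)
```

Because any majority of writers overlaps any majority of readers in at least
one replica, fetch always sees the latest successful upload even while some
replicas are stale. For example, with three replicas and one down, upload
still succeeds; once the failed replica recovers, background_sync brings it
up to date.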
Once this redesigned code storage system is in place, application
deployments will be more resilient, and less prone to transient
errors. We’ll use our improved monitoring to verify the improvements
in the application deployment process, as we deploy this new design.
-- App Engine Team