Deployments are very slow and often end in an error, starting from 2011 Aug 19 7:00pm

Takashi Matsuo ♟

Aug 20, 2011, 12:51:30 AM

to google-appengine...@googlegroups.com

Currently, deployments to App Engine are very slow and often result
in an error.
We're aware of this issue and working on a fix.

I'll keep this thread updated. Sorry for the inconvenience.

--
Takashi Matsuo | Developer Advocate | tma...@google.com

Takashi Matsuo ♟

Aug 20, 2011, 4:16:23 AM

to google-appengine...@googlegroups.com

Things are getting back to normal now. We'll keep an eye on it
for a while, though.
A detailed postmortem should follow, hopefully next week.

Thanks for the patience.

Takashi Matsuo ♟

Aug 27, 2011, 8:17:06 PM

to google-appengine...@googlegroups.com

Here is a postmortem for the deployment issue on Aug 19.
Thanks for your patience.

Summary

On Aug 19th, 2011, application deployments to App Engine began taking
longer than normal, and sometimes failed (returning a 500
error). The issue started at around 7:00pm PDT and lasted for ~4.5
hours. This issue affected only application deployment; traffic to
currently deployed versions of applications was not affected by this
issue.

(Applications using the Master/Slave and the High Replication
Datastore configurations were affected equally by this issue.)

Root cause

As part of the application deployment process, your application’s code
is stored into a replicated code repository, for later retrieval when
creating instances of your application. During the above incident, one
replica of the code repository malfunctioned, and attempts to store
code to that replica failed.

The current implementation of the code storage system requires that
your application’s code is successfully committed to all members of
the replicated repository before your application is considered
successfully deployed. Failure to write to any one replica will block
application deployment.
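
To illustrate why a single bad replica was enough to block deployments, here is a
minimal sketch of a write-to-all commit. The Replica class and the
deploy_write_all function are hypothetical names made up for this example; they
are not App Engine's actual code, only an in-memory model of the behavior
described above.

```python
class Replica:
    """Hypothetical in-memory stand-in for one code repository replica."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.store = {}  # app version -> uploaded code

    def write(self, app_version, code):
        if not self.healthy:
            raise IOError(f"replica {self.name} is unavailable")
        self.store[app_version] = code


def deploy_write_all(replicas, app_version, code):
    """Report success only if every replica accepts the write."""
    for replica in replicas:
        try:
            replica.write(app_version, code)
        except IOError:
            # A single failed replica blocks the whole deployment.
            return False
    return True
```

With three replicas of which one is malfunctioning, deploy_write_all returns
False on every attempt, no matter how healthy the other two replicas are.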

Remediation

Monitoring: Immediately, we’re improving our internal monitoring to
bring malfunctions like this to our attention faster, so we can effect
repairs quickly.

Reliability: In the short to medium term, we’re modifying the code
storage system and how it interacts with the replicated code
repository. We’ve identified a simpler redesign where your code only
needs to be committed to a subset of the code repository replicas at
upload time. A failure in storing your code to one or more replicas
will not block uploads. Background replication processes will
synchronize your code to all replicas in case of temporary storage
failure at upload time. The code repository is queried in a
strongly-consistent manner, so your application’s code can always be
retrieved correctly, even if some of the replicas are temporarily not
up to date. This system is comparable to the replication scheme based
on the Paxos algorithm used in the High Replication Datastore
configuration.
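
As a rough sketch of the idea behind the redesign, the example below commits code
to a subset of replicas, reads it back through a majority, and repairs lagging
replicas in the background. Everything in it is an assumption for illustration:
the post does not say how large the subset is or how reads are implemented, so
the majority quorum, the Replica class, and the function names deploy_quorum,
read_code and background_sync are all hypothetical. It is not a Paxos
implementation.

```python
class Replica:
    """Hypothetical in-memory stand-in for one code repository replica."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.store = {}  # app version -> uploaded code

    def write(self, app_version, code):
        if not self.healthy:
            raise IOError(f"replica {self.name} is unavailable")
        self.store[app_version] = code

    def read(self, app_version):
        if not self.healthy:
            raise IOError(f"replica {self.name} is unavailable")
        return self.store.get(app_version)


def majority(replicas):
    # Assumption: the "subset" is a simple majority of the replicas.
    return len(replicas) // 2 + 1


def deploy_quorum(replicas, app_version, code):
    """Succeed once a majority of replicas has stored the code."""
    acks = 0
    for replica in replicas:
        try:
            replica.write(app_version, code)
            acks += 1
        except IOError:
            continue  # a failed replica no longer blocks the upload
    return acks >= majority(replicas)


def read_code(replicas, app_version):
    """Read from a majority; it overlaps any majority that took the write."""
    responses = []
    for replica in replicas:
        try:
            responses.append(replica.read(app_version))
        except IOError:
            continue
    if len(responses) < majority(replicas):
        raise IOError("too few replicas reachable for a consistent read")
    return next((code for code in responses if code is not None), None)


def background_sync(replicas, app_version):
    """Copy the code to healthy replicas that missed the original write."""
    source = next((r for r in replicas
                   if r.healthy and app_version in r.store), None)
    if source is None:
        return 0
    repaired = 0
    for replica in replicas:
        if replica.healthy and app_version not in replica.store:
            replica.store[app_version] = source.store[app_version]
            repaired += 1
    return repaired
```

In this model, a deployment with one of three replicas down still succeeds, the
code can be read back right away, and background_sync fills in the missing copy
once the failed replica recovers.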

Once this redesigned code storage system is in place, application
deployments will be more resilient, and less prone to transient
errors. We’ll use our improved monitoring to verify the improvements
in the application deployment process, as we deploy this new design.

-- App Engine Team
