AWSQS: Google App Engine Maintenance scheduled for 2009.09.02 00:00 UTC for one hour

C Sowa

Aug 27, 2009, 4:35:37 PM

to SowaCS Consulting

FYI, just received this notice:
http://groups.google.com/group/google-appengine-downtime-notify/browse_thread/thread/e46e77cf05de0c6

The impact on AWSQS will be variable.

Signing Requests:

Worst case is you will receive an XML document with an error message
and an HTTP 500 error code.

If, however, you have made a "recent" signing request and your AWS
Id / Key pair is still in memory at the time of the outage, AWSQS will
be able to sign your request and respond as usual, as long as GAE does
not flush the app from memory in the interim. The exact nature of
"recent", however, depends on many external factors.

Accumulated Usage Counts:

Counts of requests that are successful will not be accumulated for
this period and for up to twenty minutes before the outage.

Key Registration:

During this period no AWS Id / Key pairs may be registered or changed.

Thanks for your patience!

Piper

Aug 27, 2009, 8:30:02 PM

to SowaCS Consulting

How long are the requests cached?

On Aug 27, 4:35 pm, C Sowa <sow...@gmail.com> wrote:
> FYI, just received this notice: http://groups.google.com/group/google-appengine-downtime-notify/brows...

C Sowa

Aug 27, 2009, 11:51:06 PM

to SowaCS Consulting

On Aug 27, 8:30 pm, Piper <47digi...@gmail.com> wrote:
> How long are the requests cached?
>

It's not the requests that are cached, but the Id / Key pairs that the
requests use for signatures.

The lifecycle of this data for AWSQS signing requests is as follows:

- When a request reaches an application instance, the instance first
checks its own memory for the AWSAccessKeyId value sent with the request.
This storage lasts as long as the application instance is kept in
memory. Whether a particular request is serviced by a particular
application instance will depend on many things, including chance.
GAE tends to remove app instances fairly quickly when not in use, on
the order of a minute or two, sometimes far less. To help mitigate
this somewhat, AWSQS has a "ping" cron job running every minute. It
generally seems to keep at least one instance around. High loads will
cause new instances to be started with initially empty in-memory
caches. I'm not certain what specific load levels & types will fire
up new instances; this would take significant experimentation to
determine.

- If an Id / Key pair is not found in an instance's memory, then
memcache is checked. Memcache storage is shared by all instances of
an application. If found, the value is also stored in the current
application instance's memory for possible retrieval by any subsequent
requests. AWSQS memcache storage of Id / Key pairs is not set to
expire, but application or system memory pressure can cause early
cache evictions.

- If an Id / Key pair is not found in memcache or in-memory, the data
is retrieved from the datastore, and then placed in both memcache and
the current application instance's memory. The datastore is, of
course, persistent, but is an expensive resource to access when
compared to memcache, and memcache is expensive when compared to in-
memory storage.

- If an Id / Key pair is still not located, the miss itself is cached
(both in-memory and in memcache) to minimize resource usage by any
subsequent requests using the same Id. If an Id / Key pair is
registered later, the cached values in memcache and in the current
application instance's memory are removed so that the registered values
can be picked up by later requests. **
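
The lookup order described in the list above can be sketched roughly as
follows. This is only an illustration, not the AWSQS source: plain Python
dicts stand in for GAE memcache and the datastore, and every name here
(Instance, lookup, register, MISS) is hypothetical.

```python
# Hypothetical stand-ins: plain dicts play the roles of GAE memcache and the
# datastore. None of these names come from the actual AWSQS code.
MISS = object()  # sentinel marking a cached "key not found"


class Instance:
    """One application instance with its own in-memory cache."""

    def __init__(self, memcache, datastore):
        self.local = {}             # per-instance; lost when the instance is unloaded
        self.memcache = memcache    # shared by all instances; None simulates an outage
        self.datastore = datastore  # persistent, most expensive to read

    def lookup(self, key_id):
        """Return the secret key for key_id, or None if it is not registered."""
        # 1. Instance memory.
        if key_id in self.local:
            value = self.local[key_id]
            return None if value is MISS else value
        # 2. Shared memcache (skipped when unavailable); copy hits into memory.
        if self.memcache is not None and key_id in self.memcache:
            value = self.memcache[key_id]
            self.local[key_id] = value
            return None if value is MISS else value
        # 3. Datastore; cache the result in both layers, even if it is a miss.
        value = self.datastore.get(key_id, MISS)
        self.local[key_id] = value
        if self.memcache is not None:
            self.memcache[key_id] = value
        return None if value is MISS else value

    def register(self, key_id, secret):
        """Store a new pair and drop cached entries so it can be picked up."""
        self.datastore[key_id] = secret
        self.local.pop(key_id, None)
        if self.memcache is not None:
            self.memcache.pop(key_id, None)


# Usage: one instance, then a simulated memcache outage.
shared_memcache = {}
shared_datastore = {"EXAMPLE_ID": "example-secret"}
instance = Instance(shared_memcache, shared_datastore)

first = instance.lookup("EXAMPLE_ID")    # read from the datastore, then cached
instance.memcache = None                 # memcache goes away
second = instance.lookup("EXAMPLE_ID")   # still answered from instance memory
```

A fresh instance started while memcache is unavailable would simply fall
through to step 3 on its first request for each Id.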

In the upcoming downtime, supposedly memcache will be entirely down,
but datastore reads will be functioning, so in theory, only the middle
step will be missing, and no signing requests should fail. In
practice, it seems the GAE downtimes are often not as bad as stated;
it makes sense for them to be conservative about these matters. And I
add a bit more conservatism myself when describing the AWSQS service's
behavior in such circumstances, because things can & do go wrong with
system maintenance.

I have considered adding special error codes & messages for these
specific types of situations, just in case requests do fail, but I'm
not sure that this really means anything different to the service
consumers than any other failure.

-----------------------

** Yes, I see a hole in this logic: if there's a different running
instance that has the "miss" already cached in memory, it will
continue to return "key not found" errors until - well, various
events, some obvious such as instance shutdown, and some fairly
complex, such as seemingly unrelated events causing in-memory cache
registry maintenance. Setting cache expiration times is
unsatisfactory. I'll have to think about this some more. Luckily,
this sort of thing is both rare, and will tend to resolve itself
anyway. It would help to know more about how GAE allocates requests
when there is more than one application instance running, in
particular, whether sessions are "sticky" at all.
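
To make the hole concrete, here is a small simulation of two instances
sharing one memcache. It is a sketch under assumptions, not the AWSQS
code: dicts stand in for memcache, the datastore and each instance's
memory, and all names (lookup, register, local_a, local_b, MISS) are
made up for the illustration.

```python
MISS = object()  # sentinel for a cached "key not found"

datastore = {}               # persistent registrations (stand-in)
memcache = {}                # shared across instances (stand-in)
local_a, local_b = {}, {}    # in-memory caches of two running instances


def lookup(local, key_id):
    """Check instance memory, then memcache, then the datastore; cache misses too."""
    if key_id not in local:
        if key_id not in memcache:
            memcache[key_id] = datastore.get(key_id, MISS)
        local[key_id] = memcache[key_id]
    value = local[key_id]
    return None if value is MISS else value


def register(local, key_id, secret):
    """Registration clears memcache and only the handling instance's memory."""
    datastore[key_id] = secret
    memcache.pop(key_id, None)
    local.pop(key_id, None)


# Both instances receive a request for an unregistered Id and cache the miss.
lookup(local_a, "NEW_ID")
lookup(local_b, "NEW_ID")

# The registration request happens to be handled by instance A.
register(local_a, "NEW_ID", "example-secret")

result_a = lookup(local_a, "NEW_ID")  # finds the newly registered key
result_b = lookup(local_b, "NEW_ID")  # still None: B's cached miss was never cleared
```

Instance B keeps answering "key not found" until its in-memory entry is
dropped, for example when the instance is shut down.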

C Sowa

Sep 1, 2009, 9:46:57 PM

to SowaCS Consulting

GAE downtime follow-up:

No errors reported in the server logs, despite (or perhaps because
of?) an in-progress two+ request per second "session" of several
hours, among others.

Accumulated usage counts were undoubtedly affected, as they rely on
memcache operation. However, these are merely informative.
