It's not the requests that are cached, but the Id / Key pairs that the
requests use for signatures.
The lifecycle of this data for AWSQS signing requests is as follows:
- When a request reaches an application instance, it first checks the
instance's memory for the AWSAccessKeyId value sent with the request.
This storage lasts as long as an application instance is kept in
memory. Whether a particular request is serviced by a particular
application instance will depend on many things, including chance.
GAE tends to remove app instances fairly quickly when not in use, on
the order of a minute or two, sometimes far less. To help mitigate
this somewhat, AWSQS has a "ping" cron job running every minute. It
generally seems to keep at least one instance around. High loads will
cause new instances to be started with initially empty in-memory
caches. I'm not certain what specific load levels & types will fire
up new instances; this would take significant experimentation to
determine.
- If an Id / Key pair is not found in an instance's memory, then
memcache is checked. Memcache storage is shared by all instances of
an application. If found, the value is also stored in the current
application instance's memory for possible retrieval by any subsequent
requests. AWSQS memcache storage of Id / Key pairs is not set to
expire, but application or system memory pressure can cause early
cache evictions.
- If an Id / Key pair is not found in memcache or in-memory, the data
is retrieved from the datastore, and then placed in both memcache and
the current application instance's memory. The datastore is, of
course, persistent, but is an expensive resource to access when
compared to memcache, and memcache is expensive when compared to in-
memory storage.
- If an Id / Key pair is not found in the datastore either, the miss
itself is still cached (both in-memory and in memcache) to minimize
resource usage by any subsequent requests using the same Id. If an
Id / Key pair is registered later, the values currently in memcache and
in the current application instance's memory are removed so that the
registered values can be picked up by later requests. **
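
The lookup order described in the list above can be sketched roughly as
follows. This is only an illustration under stated assumptions, not the
actual AWSQS code: plain dicts stand in for memcache and the datastore,
and the names KeyCache, MISS, lookup and register are hypothetical.

```python
# Illustrative sketch only: dicts stand in for memcache and the datastore.
MISS = object()  # sentinel used to cache "no such Id" results


class KeyCache:
    def __init__(self, memcache, datastore):
        self.local = {}             # per-instance memory, lost on shutdown
        self.memcache = memcache    # shared by all instances
        self.datastore = datastore  # persistent, most expensive to read

    def lookup(self, access_key_id):
        """Return the secret key for access_key_id, or None if unregistered."""
        value = self.local.get(access_key_id)
        if value is None:
            value = self.memcache.get(access_key_id)
            if value is None:
                # Fall through to the datastore; remember misses too.
                value = self.datastore.get(access_key_id, MISS)
                self.memcache[access_key_id] = value
            # Copy whatever was found into this instance's memory.
            self.local[access_key_id] = value
        return None if value is MISS else value

    def register(self, access_key_id, secret_key):
        """Store a new pair and drop any cached value (or miss) for it."""
        self.datastore[access_key_id] = secret_key
        self.memcache.pop(access_key_id, None)
        self.local.pop(access_key_id, None)
```

Each KeyCache object plays the role of one application instance; passing
the same memcache dict to several of them mimics the shared memcache.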
In the upcoming downtime, supposedly memcache will be entirely down,
but datastore reads will be functioning, so in theory, only the middle
step will be missing, and no signing requests should fail. In
practice, it seems the GAE downtimes are often not as bad as stated;
it makes sense for them to be conservative about these matters. And I
add a bit more conservatism myself when describing the AWSQS service's
behavior in such circumstances, because things can & do go wrong with
system maintenance.
I have considered adding special error codes & messages for these
specific types of situations, just in case requests do fail, but I'm
not sure that this really means anything different to the service
consumers than any other failure.
-----------------------
** Yes, I see a hole in this logic: if there's a different running
instance that has the "miss" already cached in memory, it will
continue to return "key not found" errors until - well, various
events, some obvious such as instance shutdown, and some fairly
complex, such as seemingly unrelated events causing in-memory cache
registry maintenance. Setting cache expiration times is
unsatisfactory. I'll have to think about this some more. Luckily,
this sort of thing is both rare, and will tend to resolve itself
anyway. It would help to know more about how GAE allocates requests
when there is more than one application instance running, in
particular, whether sessions are "sticky" at all.
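
The hole described in this footnote can be reproduced in a few lines.
Again, this is a hypothetical sketch rather than the AWSQS code: dicts
stand in for memcache, the datastore, and the in-memory caches of two
instances, and the function names are made up for illustration.

```python
# Illustrative sketch only: two dicts play the in-memory caches of two instances.
MISS = object()  # sentinel used to cache "no such Id" results

memcache, datastore = {}, {}
instance_a, instance_b = {}, {}


def lookup(local, key_id):
    """Check instance memory, then memcache, then the datastore."""
    value = local.get(key_id)
    if value is None:
        value = memcache.get(key_id)
        if value is None:
            value = datastore.get(key_id, MISS)
            memcache[key_id] = value
        local[key_id] = value
    return None if value is MISS else value


def register(local, key_id, secret):
    """Store the pair; clear memcache and only the handling instance's memory."""
    datastore[key_id] = secret
    memcache.pop(key_id, None)
    local.pop(key_id, None)


lookup(instance_b, "NEWID")               # instance B caches the miss
register(instance_a, "NEWID", "secret")   # registration is handled by instance A
found_on_a = lookup(instance_a, "NEWID")  # "secret"
found_on_b = lookup(instance_b, "NEWID")  # still None: stale miss in B's memory
```

Instance B keeps answering "key not found" until its own in-memory entry
goes away, which is exactly the case the registration step cannot reach.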