Mysteriously ramping request times

Robert Morgan

May 22, 2012, 3:02:27 PM
to google-a...@googlegroups.com
Beginning about 18 hours ago, the milliseconds/Request on one of our apps began to ramp and is now over 100 seconds for a trivial response (which historically takes 2 seconds). Our other apps are fine, and the AE status page is clear.

It's an app that we use for development, it's been around for ages and still runs the Master-Slave datastore. Python. The ramping just started -- we had not deployed a new server version in about a week, and our client usage remains low and patterns haven't changed.

I've poked at it trying to see if I can determine some threshold, but even a very simple return now fails due to DeadlineExceeded. 

This really feels like it's a system issue -- any ideas on what I can do next?

Thanks,
:R

Rishi Arora

May 22, 2012, 5:01:17 PM
to google-a...@googlegroups.com
Perhaps this is similar to what I'm seeing, although I was able to use the Appstats tool to determine that my deadlines were being missed specifically because of memcache API calls. Perhaps you can determine from Appstats whether it's the same for you?

If so, please star this issue that I logged:

code.google.com/p/googleappengine/issues/detail?id=7554


--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To view this discussion on the web visit https://groups.google.com/d/msg/google-appengine/-/rYY5Gw9UuE8J.
To post to this group, send email to google-a...@googlegroups.com.
To unsubscribe from this group, send email to google-appengi...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.

Rishi Arora

May 22, 2012, 5:04:53 PM
to google-a...@googlegroups.com
Here's what my milliseconds/request looks like. The #instances graph follows a similar pattern, and it translates directly into higher costs (5x normal usage today).
Screen Shot 2012-05-22 at 4.03.26 PM.png

Robert Morgan

May 22, 2012, 5:15:46 PM
to google-a...@googlegroups.com
My graph is attached -- interesting, eh?
I don't think we're using memcache in this instance.

:R
chart.png

Rishi Arora

May 22, 2012, 5:33:07 PM
to google-a...@googlegroups.com
Hmmm.  That means the problem isn't specific to memcache.  And that means it might take longer for Google to fix this, if indeed they determine there's a problem.

Rishi Arora

May 22, 2012, 5:36:15 PM
to google-a...@googlegroups.com
Here's my appstats output for a specific request showing my problem is squarely related to memcache:
Screen Shot 2012-05-22 at 4.35.10 PM.png

Takashi Matsuo

May 22, 2012, 6:19:44 PM
to google-a...@googlegroups.com
Thanks for reporting, guys!

Ideally the memcache service should have steady latencies, but unfortunately it is not as stable as the HR datastore, and our SLA does not currently cover the memcache service. In general we're working hard to improve the stability of our services, but of course we prioritize SLA-covered services first.

Thus, you may want to file a feature request asking for the memcache service to be covered by the SLA. That way, others can show their interest and business needs by starring it, and if the issue gathers many stars, our engineering team will eventually prioritize providing an SLA-covered memcache service.

For the time being, you can put an appropriate deadline on your memcache calls to prevent this kind of latency problem. The Python memcache Client class has methods like get_multi_async and set_multi_async that let you set a deadline (by specifying an rpc object).

For more details, please see:
https://developers.google.com/appengine/docs/python/memcache/clientclass#Client_get_multi_async
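[Editor's note: the fallback-on-deadline pattern Takashi describes can be sketched generically. The snippet below is a minimal, self-contained illustration and does not use App Engine APIs; slow_cache_get, datastore_get, and get_with_deadline are hypothetical stand-ins. On App Engine itself you would instead pass an rpc object carrying the deadline to get_multi_async, as the linked documentation describes.]

```python
import concurrent.futures
import time

# Hypothetical stand-ins for illustration only: slow_cache_get simulates a
# memcache call whose latency has blown up; datastore_get simulates the
# authoritative (slower but bounded) source we fall back to.
def slow_cache_get(key):
    time.sleep(2.0)  # pathological cache latency
    return "cached:" + key

def datastore_get(key):
    return "stored:" + key

def get_with_deadline(key, deadline=0.1):
    """Try the cache, but give up after `deadline` seconds and fall back
    to the datastore instead of letting the whole request die with
    DeadlineExceeded."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(slow_cache_get, key)
        try:
            return future.result(timeout=deadline)
        except concurrent.futures.TimeoutError:
            return datastore_get(key)  # cache too slow: fall back
    finally:
        pool.shutdown(wait=False)

print(get_with_deadline("user:42"))  # cache misses its deadline -> stored:user:42
```

The point is that a bounded cache lookup plus a fallback caps your worst case at roughly deadline + datastore latency, rather than letting one slow memcache RPC consume the entire request deadline.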

-- Takashi
Takashi Matsuo | Developer Advocate | tma...@google.com

Rishi Arora

May 22, 2012, 11:21:35 PM
to google-a...@googlegroups.com
Takashi,
Thanks for a meaningful suggestion. I'm on my way to implementing timeouts/deadlines for my memcache API calls. However, I wonder what the rationale is behind not having an SLA for memcache. An SLA can be made lenient, but it should still exist; otherwise, how do we hold you accountable for the features you provide? Not having any SLA sounds like you are absolving yourself of all responsibility.

Memcache, as a feature, should by definition provide "cheaper" access to underlying data in exchange for some loss of reliability. If memcache API calls have huge latencies associated with them, then it's not cheap anymore, and if the net cost of using memcache rises above the cost of using the datastore, why does memcache even exist? A strictly temporary spike in latency would be understandable, but if an issue persists through an entire day, it calls into question the very existence of memcache. And finally, the issues described in this email chain are frighteningly correlated with the latest 1.6.6 upgrade of Google App Engine.

Here's the production issue I have logged, starred by several others, and nothing concrete has been done about it:

http://code.google.com/p/googleappengine/issues/detail?id=7554

Daniel

May 23, 2012, 12:44:42 AM
to google-a...@googlegroups.com
I'm seeing the same issue. For the last 3 days latency has skyrocketed and my costs have more than doubled. Here's my graph for the past 7 days.

Rishi Arora

May 23, 2012, 12:55:59 AM
to google-a...@googlegroups.com
If you believe this may be because of memcache API latency, please star this issue:
http://code.google.com/p/googleappengine/issues/detail?id=7554


Jeff Schnitzer

May 23, 2012, 2:37:28 AM
to google-a...@googlegroups.com
While I am sympathetic to your plight, I think there's a mistaken assumption in this issue.

A number of Google comments in public threads have led me to believe that M/S apps run in entirely separate clusters (possibly in entirely separate datacenters) from HRD apps. It's likely that all of the infrastructure - memcache, task queue, urlfetch, etc. - is separate as well. When Google says "we're deprecating Master/Slave," they're really thinking "we're deprecating the entire cluster," including all of these other services. So when anything at all goes wrong in the old cluster, the solution is the same: move to HRD.

There may very well be problems with memcache in the HRD cluster(s), but it will be hard to get anyone to pay attention unless the issue is demonstrated.  This is a little like someone reporting a bug in a very old version of a piece of software.  Yes, that bug shouldn't be there, and maybe it is still present in the new version, but the first step towards a solution is still "upgrade".  Nobody pays attention to bug reports in obsolete software.

So... upgrade to HRD.  At the very least this will cause Google to pay closer attention to your bug reports.  It might even fix the issue.

Jeff

Rishi Arora

May 23, 2012, 7:29:50 AM
to google-a...@googlegroups.com
Jeff,
If what you said is true, then Google should be honest about it, state that all M/S app hosting clusters are deprecated, and clearly say that "any latency-related issues on M/S-based apps will not be investigated." I have not seen such unambiguous statements from anyone at Google. I do agree, though, that the strongest reason yet to move to HRD is to make Google pay closer attention to bug reports.

Ajax

May 31, 2012, 4:22:15 AM
to google-a...@googlegroups.com
I would say the strongest reason to move to HRD is that you don't have to pay instance hours waiting for datastore indexes to write; once your operation is committed, your thread doesn't have to wait any longer.

It will definitely save you money, and if they are newer clusters, you can expect better performance. 

Ronoaldo José de Lana Pereira

May 31, 2012, 7:31:06 AM
to google-a...@googlegroups.com
Hello Takashi,

Can you please show me how to accomplish this timeout in Java?

Thanks!