Downtime or memcache issue? Latency spike (up to 5/8 times) and memcache write error


Cristian Marastoni

Jul 16, 2015, 3:40:31 AM
to google-a...@googlegroups.com
My app is experiencing latency problems (up to 5× normal). I've also noticed many problems accessing and writing to memcache.
Are there any reported issues?

Nick (Cloud Platform Support)

Jul 16, 2015, 7:24:20 PM
to google-a...@googlegroups.com, c.mar...@reludo.com
Hey Cristian,

This is understandable given that memcache is shared across all apps using datacenter resources. It's likely that apps in the same location as yours also experienced the same latency for that period. As the Memcache docs note, there is no SLA for response times. That said, the response time was likely still significantly faster than Datastore, Cloud SQL, your own MySQL instance, etc., so there's that to keep in mind.

Usually, if an issue is large enough to violate some SLA, or if many apps are affected, a status alert will go out at status.cloud.google.com, although in this case, the latency you saw was not enough to trigger a detailed issue report. 

If you have any further questions about memcache, feel free to ask, and also to consult the docs to learn more.

Regards,

Nick

Cristian Marastoni

Jul 17, 2015, 4:11:22 AM
to google-a...@googlegroups.com, c.mar...@reludo.com
Hi Nick,

Thanks for the response.
I know that memcache response time is not covered by an SLA (at least for best effort; dedicated probably has one), but yesterday it was very high indeed. Because I'm using ndb with its default memcache policy, my servers were very slow (at a certain point it was terrible, for sure).
Honestly speaking, yesterday was the first day we had such a high load, 20-30 requests per second (about 10× the days before). The front-end tier was F1; those servers probably couldn't cope with that many concurrent requests (the module is configured to handle up to 10), so the response times couldn't be better. What really surprised me is that there were 10 servers up to handle that load (so at most ~2 concurrent requests per server).
I'm still investigating; potentially there is something wrong in my code. Today I changed the tier to F2, and that (obviously) is much better.
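For reference, the instance class and per-instance concurrency cap mentioned above are both set in app.yaml; a sketch with illustrative values:

```yaml
# app.yaml sketch (values illustrative, not Cristian's actual config)
instance_class: F2
automatic_scaling:
  max_concurrent_requests: 10
```

Raising the instance class buys more CPU per instance, while `max_concurrent_requests` controls how many requests each instance accepts before the scheduler spins up another one.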

Nick (Cloud Platform Support)

Jul 20, 2015, 4:16:35 PM
to google-a...@googlegroups.com, c.mar...@reludo.com
Hey Cristian,

If you didn't have it installed on your app, it might be too late to diagnose the past issue, but using Appstats (Java | Python) you could determine exactly where the latency occurred: in the memcache calls, or in the instances (and instance class) themselves. Given that the instance class increase seems to have solved the issue, it could even have been a mix.
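For the record, Appstats on the Python runtime is enabled via a builtin in app.yaml (the Java setup uses a servlet filter instead); a sketch:

```yaml
# app.yaml sketch: turn on the Appstats builtin (Python runtime)
builtins:
- appstats: on
```

Once enabled, per-request RPC timelines are browsable at the app's `/_ah/stats/` path.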

Best wishes,

Nick

troberti

Jul 21, 2015, 5:23:46 AM
to google-a...@googlegroups.com, c.mar...@reludo.com
Instead of Appstats, you could just use Cloud Trace. AFAIK it does everything Appstats does, but without the overhead.

Nick (Cloud Platform Support)

Jul 21, 2015, 11:51:20 AM
to google-a...@googlegroups.com, tij...@firigames.com, c.mar...@reludo.com
Great point, troberti. Very true!

Cristian Marastoni

Jul 21, 2015, 12:57:46 PM
to Nick (Cloud Platform Support), google-a...@googlegroups.com, tij...@firigames.com
Thanks Nick and troberti, great suggestions indeed. In fact I had both: Appstats and Cloud Trace. Appstats records a bit more, but I found Cloud Trace a good compromise for a rough analysis.

I investigated a bit more and found 5 major problems:
1 - Given ndb's usage pattern, the memcache layer has a huge impact if it starts to fail or slow down (even with batching). But that should be expected. I need to detect when memcache is down or slow, and probably disable its use in the ndb context during that period.
2 - My app had a bug due to a reader/writer lock I wrote (I tested it with up to 50 threads, but it seems the test wasn't enough). I removed that part and the threads started to work better (I will ask for suggestions on a usage pattern in a different thread).
3 - Even though I used the ndb *_multi and async calls a lot and cached practically all the static data in memory (at app start), an F1 machine couldn't sustain more than 3-4 concurrent threads while keeping latency under 1-1.5 seconds. (Now it's time to understand why, because the handler code is quite simple, to be honest.)
4 - I was mixing some slow task-queue work onto the same instances, and because those tasks hold a thread for a while, they could slow down the other queued handlers a lot.
5 - I noticed that datastore transactions and puts sometimes have big latency spikes. Normally they take 0.2-0.5 seconds, but sometimes they need 6 seconds to complete. I'm a bit worried about the entity ids: many entities (belonging to different ancestors) use the same key name, so maybe they end up on or near the same datastore shard.
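On point 2: reader/writer locks are easy to get subtly wrong under contention. A minimal writer-preferring sketch built on `threading.Condition` (this is an illustration of the pattern, not Cristian's original code):

```python
import threading

class ReadWriteLock:
    """Writer-preferring read/write lock (sketch).

    Many readers may hold the lock at once; a writer waits until all
    readers drain, and new readers block while any writer is waiting,
    so writers cannot be starved by a steady stream of readers.
    """

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0           # readers currently holding the lock
        self._writer = False        # is a writer holding the lock?
        self._waiting_writers = 0   # writers queued for the lock

    def acquire_read(self):
        with self._cond:
            while self._writer or self._waiting_writers:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()  # wake any waiting writer

    def acquire_write(self):
        with self._cond:
            self._waiting_writers += 1
            while self._writer or self._readers:
                self._cond.wait()
            self._waiting_writers -= 1
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()  # wake readers and writers alike
```

A stress test like the 50-thread one mentioned above can still miss ordering bugs; the invariant to assert is that writers are exclusive (no reader or other writer overlaps a write section).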


As usual, we learn by doing things :)

PS: troberti, are you from Firi Games? Then I have to thank you for both Phoenix HD and for your great BTree library! I use it a lot for my leaderboards ;)


troberti

Jul 21, 2015, 5:46:58 PM
to Google App Engine

> PS: troberti, are you from Firi Games? Then I have to thank you for both Phoenix HD and for your great BTree library! I use it a lot for my leaderboards ;)

Heh, yes that's me :) Great to hear you use and like my BTree library!

