What I did wrong? 2000 instances!


Ronoaldo José de Lana Pereira

unread,
Feb 12, 2012, 6:44:16 PM2/12/12
to google-a...@googlegroups.com
Some inputs (dashboard graphs: Latency, Req/Sec, Instances):

Our app usually needs about 40-60 instances under normal traffic. Our setup is the Java runtime with multithreading disabled (we tried enabling it, but the error rate was too high due to DeadlineExceededExceptions and HardDeadlineExceededExceptions). Currently we are on M/S, but we are in the process of migrating to HRD.

Since Friday, we have been running an operation to sync 500k contacts with an external app; each sync requires about 10 API calls to the remote server (urlfetch calls). The overall operation of syncing one contact is slow, and because of limitations of the remote service, we need to sync each contact individually. We started running this sync in a queue with a rate of 1/s. This proved to work, and to be extremely slow.

Today I decided to go faster and configured the queue to run at 20/s with max_concurrent of 3000, since this is a Sunday, with less traffic than usual on both our app and the remote service. At that point, there were around 350k contacts left to sync. A few hours later, our app was running with 2000 instances, responding slowly, and there were still 150k contacts remaining.
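For reference, the queue definition in queue.xml looked roughly like this (the queue name is illustrative; the rate and concurrency are the values I described):

```xml
<queue-entries>
  <queue>
    <!-- Hypothetical name; 20/s and 3000 are the values described above -->
    <name>contact-sync</name>
    <rate>20/s</rate>
    <max-concurrent-requests>3000</max-concurrent-requests>
  </queue>
</queue-entries>
```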

I'm assuming that I did something very, very wrong, but I don't know where to start. What I found weird was that the instance count grew in a strange, unstoppable way while the req/sec stayed stable. So, my question: what did I do so wrong that it cost me around $320 in a few hours!?

Any tips on how to solve this problem more efficiently? I followed the suggestion to do small amounts of work per task, so 1 contact sync (~10 urlfetch calls + ~5 datastore read ops) = 1 task.

Thanks in advance for any suggestion, and sorry for the long post.

Best Regards,

-Ronoaldo Pereira

Brandon Wirtz

unread,
Feb 12, 2012, 7:19:35 PM2/12/12
to google-a...@googlegroups.com

>Today I decided to go faster and configured the queue to run at 20/s with max_concurrent of 3000, s

 

I am surprised that, with multithreading disabled, you don't have 3000 instances.

 

You probably hit a hard quota limit.

 

I would have borrowed Ikai’s code.
http://ikaisays.com/2010/06/29/using-asynchronous-urlfetch-on-java-app-engine/

 

Done the fetches.

 

Then async puts to the DB.
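Something like this (a sketch against the App Engine SDK, not Ronoaldo's actual code; the entity kind and property names are hypothetical):

```java
import com.google.appengine.api.datastore.AsyncDatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.urlfetch.HTTPResponse;
import com.google.appengine.api.urlfetch.URLFetchService;
import com.google.appengine.api.urlfetch.URLFetchServiceFactory;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Future;

public class ContactSyncTask {

  void syncOneContact(String contactId, List<URL> endpoints) throws Exception {
    URLFetchService fetcher = URLFetchServiceFactory.getURLFetchService();

    // Start all ~10 fetches at once instead of one after another.
    List<Future<HTTPResponse>> pending = new ArrayList<Future<HTTPResponse>>();
    for (URL url : endpoints) {
      pending.add(fetcher.fetchAsync(url));
    }

    // Collect the results; the network waits now overlap instead of adding up.
    Entity contact = new Entity("Contact", contactId);
    int i = 0;
    for (Future<HTTPResponse> response : pending) {
      contact.setProperty("payload" + (i++), new String(response.get().getContent()));
    }

    // Async put; App Engine waits for pending async calls before the request ends.
    AsyncDatastoreService datastore = DatastoreServiceFactory.getAsyncDatastoreService();
    datastore.put(contact);
  }
}
```

With serial fetches the task latency is the sum of the ~10 round trips; overlapped, it is roughly the slowest one.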

 

I’d have done this on a Back End Instance so the task could take longer.

 

150k isn’t very many. Assuming each fetch completes in a reasonable amount of time, I’d do 8 or 15 at a time. I’d have another app/URL set the pace. You could set a cron to fire every 5s or something similar on a bit of code whose only job is to fire 4 of the backends, or 8 or 16 or 64, so that you could change the speed as you saw how things were working.
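Sketched in cron.xml (handler URL hypothetical; note that cron's finest schedule granularity is one minute, so sub-minute pacing would need the handler itself to loop and sleep):

```xml
<cronentries>
  <cron>
    <!-- Hypothetical pacing endpoint that fires the next batch of backend tasks -->
    <url>/tasks/pace-sync</url>
    <description>Fire the next 4/8/16/64 backend sync tasks</description>
    <schedule>every 1 minutes</schedule>
  </cron>
</cronentries>
```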

 

 

Barry Hunter

unread,
Feb 12, 2012, 7:22:43 PM2/12/12
to google-a...@googlegroups.com

> Today I decided to go faster and configured the queue to run at 20/s with max_concurrent of 3000, 

Doesn't that explicitly say that it's allowed to process 3000 tasks at once? With no multithreading, that means pretty much 3000 instances.

The 20/s is just the starting point - it can go higher.


In effect, the queue is trying to spawn 20 new tasks each and every second. But because tasks are taking longer than a second, they can't run on a fixed number of instances: in the first second, 20 instances are spawned. In the second second, 20 more instances, because the first 20 are still busy. In the third, 20 more. Basically this pattern continues. Some tasks do eventually finish, freeing up an instance, but new tasks keep arriving quicker than your instances can process them.

And because you are loading the remote service even more, the RPCs are probably getting slower, just compounding your problem.
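To put rough numbers on it - a back-of-envelope model, assuming each task takes on the order of 100 seconds because of the ~10 slow urlfetches:

```java
public class InstanceGrowth {

    // Busy single-threaded instances after t seconds, when the queue starts
    // `rate` tasks per second and each task runs for `taskSeconds`.
    // The steady state is rate * taskSeconds (Little's law), capped by maxConcurrent.
    static int busyInstances(int t, int rate, int taskSeconds, int maxConcurrent) {
        return Math.min(rate * Math.min(t, taskSeconds), maxConcurrent);
    }

    public static void main(String[] args) {
        // Assumed numbers: 20 tasks/s, each task ~100s, max_concurrent 3000.
        System.out.println(busyInstances(60, 20, 100, 3000));   // still climbing: 1200
        System.out.println(busyInstances(600, 20, 100, 3000));  // levels off at 2000
        // The same load with max_concurrent 100 holds it to 100 instances.
        System.out.println(busyInstances(600, 20, 100, 100));   // 100
    }
}
```

With those assumed numbers the model levels off at 20/s x ~100s = 2000 concurrent tasks, which matches the 2000 instances observed.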


Lower max_concurrent right away! 

Brandon Wirtz

unread,
Feb 12, 2012, 7:37:15 PM2/12/12
to google-a...@googlegroups.com

> Lower max_concurrent right away! 

 

Disable the app right away. Fix the code. Try again.

 

Always do 1 before you do 50

Always do 50 before you do 1000

Always do 1000 before you do 100k

 

Robert Kluin

unread,
Feb 13, 2012, 2:26:50 AM2/13/12
to google-a...@googlegroups.com
Hi Ronoaldo,
  I'd probably go with either Barry's or Brandon's suggestion: lower the max concurrency, or disable the process while you're debugging.

  Do you know what is causing the latency to slowly increase?  If it is due to loading the remote servers, have you investigated whether "batching" will help reduce the load?  Or is it on the App Engine side? Perhaps you're using offsets to get to the next entity to process.



Robert




2012/2/12 Ronoaldo José de Lana Pereira <rper...@beneficiofacil.com.br>

--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To view this discussion on the web visit https://groups.google.com/d/msg/google-appengine/-/Kmmd_14YDmQJ.
To post to this group, send email to google-a...@googlegroups.com.
To unsubscribe from this group, send email to google-appengi...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.

Ronoaldo José de Lana Pereira

unread,
Feb 13, 2012, 3:42:56 AM2/13/12
to google-a...@googlegroups.com
Brandon,

Thanks for the tip. I'll try enabling multithreading again. There is another huge sync next weekend.

Best Regards,

-Ronoaldo

Ronoaldo José de Lana Pereira

unread,
Feb 13, 2012, 3:55:57 AM2/13/12
to google-a...@googlegroups.com
Thanks for posting.

That's right. I just started without any limit, just allowing the 20/s. I only limited max_concurrent when I noticed how quickly the instance count was growing.

Thanks for pointing that out.

Ronoaldo José de Lana Pereira

unread,
Feb 13, 2012, 4:00:39 AM2/13/12
to google-a...@googlegroups.com
That's too drastic! My boss will pay the bills with the app revenue ;)

I can't disable the app; it is a real-life multi-tenant e-commerce site with 14 stores. The traffic is getting higher now, but it usually runs at a reasonable cost. This sync was a side problem: I just misconfigured the concurrency, and I depend on another service for better response times. In any case, doing this sync at a high cost was better than not doing it.

Anyway, lesson learned: I'll fix the code and run all this again next week, with async urlfetch and multithreading.

Thanks!

Ronoaldo José de Lana Pereira

unread,
Feb 13, 2012, 4:04:38 AM2/13/12
to google-a...@googlegroups.com
Thanks, Robert.

I guess that as we were doing more RPCs, the remote servers were getting slower. I'll propose that they implement batches, and I'll follow the previous suggestions too: multithreading, properly configuring the queue, and also trying async fetches to make better use of the instance hours.

The App Engine side was performing gracefully. I didn't find a single datastore timeout while reading, even on my M/S app...

Best Regards,

-Ronoaldo

Brandon Wirtz

unread,
Feb 13, 2012, 4:08:54 AM2/13/12
to google-a...@googlegroups.com

Imports should not be in your primary app.

 

You should have a version specific to the task, and if the world ends or things go wrong, you upload a new version over the top of the importer that doesn’t do anything.
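On the Java runtime that just means giving the importer its own version id in appengine-web.xml (application id hypothetical), so a no-op build can be deployed over it at any time:

```xml
<appengine-web-app xmlns="http://appengine.google.com/ns/1.0">
  <!-- Hypothetical app id; the point is the dedicated version for the import job -->
  <application>myapp</application>
  <version>contact-importer</version>
  <threadsafe>false</threadsafe>
</appengine-web-app>
```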

 

 

 

Brandon Wirtz
BlackWaterOps: President / Lead Mercenary


Work: 510-992-6548
Toll Free: 866-400-4536

IM: dra...@gmail.com (Google Talk)
Skype: drakegreene
YouTube: BlackWaterOpsDotCom

BlackWater Ops

Cloud On A String Mastermind Group


 

 

