Backend Deferred Tasks eating CPU Time?

Prateek Malhotra

Oct 4, 2011, 12:12:10 AM

to Google App Engine

I'm not sure I understand the usage correctly, but I wrote a function
that is meant to run as a deferred task. The task is designed to "fan
out" into hundreds of other tasks (which can in turn create hundreds
more) for some data mining I need to do. Each run usually involves
about 30 minutes of intensive data mining spread over 4500 tasks.

When I ran this under my normal app, I would quickly eat up all my CPU
time (about $10/day with billing enabled). I looked into backends and
thought my use case was a good fit, so I added the
_target='backend_id' parameter to all my deferred task calls. But my
CPU time is still being used up quickly, and on top of that I'm paying
for the backend uptime as well (I split the budget: $5 to backends and
$5 to CPU).

Am I wrong to assume that my deferred tasks should use only my backend
instance uptime and not my normal CPU time? This was my hope in
trying to reduce the daily cost of my application. Are deferred tasks
the wrong way to utilize backends?

I am open to any suggestions or advice that can be provided.

Thanks in advance.

-Prateek

Gerald Tan

Oct 4, 2011, 1:06:45 AM

to google-a...@googlegroups.com

I believe CPU time will no longer be billable once the new pricing is out.

Rishi Arora

Oct 4, 2011, 8:38:45 AM

to google-a...@googlegroups.com

I think deferred tasks are an excellent use case for backends. That's how I use my backend as well. Can you confirm from your logs that your tasks are indeed being processed on the backend? In the drop-down for app versions, there's a special "version" which is named after your backend. Select that to check the logs specific to the backend.

Also, I'm assuming the reason you're blowing through your budget is that you're spawning multiple, possibly hundreds of, instances. Can you find out how many instances get spawned for your deferred tasks? Can you find out how many backend instances are being spawned, if the backend is indeed being used for your tasks? Finally, when you configured your backend, what did you set as your "instances" parameter in backends.yaml? I don't know what the default is, but it is likely "unlimited". In your case, a value of 1 or 2 sounds sufficient, but you'll have to play around with that, based on how much queueing occurs for your tasks.

Prateek Malhotra

Oct 4, 2011, 10:33:37 AM

to Google App Engine

Hello,

Thank you for the replies.

The tasks are indeed being run on my backend, as I am being charged for
backend usage (I don't use backends in any other way). I also see the
_ah/deferred logs under my backend ID but not under my app. The CPU
time being used seems to correspond to the Datastore CPU time. Are the
two currently linked? Even under the new pricing scheme, would I be
billed for my main app's uptime when executing deferred tasks on my
backend? What part of my main app is considered in use when executing
tasks on the backend?

The max number of instances allowed on any backend is 20. My backend
is set up as dynamic B1 with 20 instances. I let Google determine how
many instances need to be up and running when I queue up my tasks
(usually all 20 run for 20-30 minutes each).

Again, I'd just like to know what part of my main app is being used
during the backend operation that is eating up my CPU. I really
haven't coded anything else on my app except for this data mining
portion which should all be run on the backend now.

Thanks,
Prateek

Rishi Arora

Oct 4, 2011, 11:27:27 AM

to google-a...@googlegroups.com

I don't think backend CPU time counts against your front-end instance CPU hours quota. Backends are billed purely on uptime, and I think this is true for both the current and the new pricing (starting November 1). But you mentioned something about datastore CPU? It is likely that most of your bill comes from this.

A few more questions: Is the reason for 20 backend instances that you want to execute 20 deferred tasks in parallel? You mentioned you have ~4500 tasks. How long does each one take, and how often does each one need to execute in a day? Let's assume that none of these tasks is sensitive to latency, and that they can be executed at any time during the day. If each task takes 10 seconds on average and needs to execute, for example, 6 times a day, that's a total of 4500 * 10 * 6 / 3600 = 75 instance hours. You should try to "schedule" your backends yourself so that you only pay for 75 instance hours.
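
The estimate above can be written out as a small calculation. This is only a sketch of the arithmetic: the function name instance_hours is hypothetical, and it assumes tasks run back to back with no startup or idle time.

```python
def instance_hours(num_tasks, seconds_per_task, runs_per_day):
    """Busy time per day in instance hours, assuming tasks run back to back."""
    total_seconds = num_tasks * seconds_per_task * runs_per_day
    return total_seconds / 3600.0


# Figures from the example above: 4500 tasks, 10 s each, 6 runs per day.
print(instance_hours(4500, 10, 6))  # 75.0
```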

If you allow 20 instances to get created, then at some point all 20 instances will complete their work and then idle for 15 minutes (or at least be billed for 15 minutes of idle time after the last task completes) before they're shut down. You'll be wasting 20 * 15 minutes = 5 instance hours each time this happens. I think your focus should be to minimize your instance hours by minimizing the number of parallel instances you allow to run. In my calculation above you only need a total of 75 instance hours per day, which is about 75 / 24 ≈ 3.1 instances running around the clock, so you should set "instances" to 3 or 4 in backends.yaml.
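
A rough sketch of the two numbers in this paragraph, the idle overhead and the instance count. The helper names are hypothetical, the 15-minute idle period is taken from the text above as an assumption, and the 24-hour window assumes the work can be spread evenly over the whole day.

```python
import math

IDLE_MINUTES_BILLED = 15  # idle period quoted above; treated as an assumption


def idle_overhead_hours(num_instances, idle_minutes=IDLE_MINUTES_BILLED):
    """Instance hours billed while instances sit idle before shutdown."""
    return num_instances * idle_minutes / 60.0


def instances_for_daily_load(instance_hours_needed, hours_available=24):
    """Smallest whole number of instances that covers the load in the window."""
    return int(math.ceil(instance_hours_needed / float(hours_available)))


print(idle_overhead_hours(20))       # 5.0
print(instances_for_daily_load(75))  # 4
```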

Yet another way of doing this, in a more controlled fashion, is to use "pull" queues instead of enqueueing tasks on the regular "push" task queues. You can enqueue all your 4500 tasks on a single pull queue, and your backends will keep pulling tasks out of it and executing them until the queue is empty. Then they can be woken up again by a cron job to go check the pull queue again.
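
A minimal sketch of the drain-until-empty worker loop described above. It uses an in-memory collections.deque as a stand-in for the pull queue and does not call the App Engine task queue API; drain_queue, handler, and batch_size are hypothetical names chosen for illustration.

```python
from collections import deque


def drain_queue(queue, handler, batch_size=100):
    """Take tasks in batches and process them until the queue is empty."""
    processed = 0
    while queue:
        # Stand-in for leasing a batch of tasks from a pull queue.
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        for payload in batch:
            handler(payload)
            processed += 1
    return processed


pending = deque(range(4500))  # 4500 placeholder task payloads
results = []
count = drain_queue(pending, results.append)
print(count)  # 4500
```

With a real pull queue, a task would normally be deleted only after its handler succeeds, so that a failed task becomes available again once its lease expires.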

Lastly, any cost you're incurring because of Datastore CPU hours in the current pricing model, or because of Datastore writes in the new pricing model - those can't be avoided. You will incur those costs regardless of the context of execution of your tasks - front-end or backend.

Hope this helps.

Prateek Malhotra

Oct 4, 2011, 12:53:32 PM

to Google App Engine

Thanks for the reply, Rishi.

I think I will try the pull queue idea, since I was going to move to a
model like that after switching to backend instances. I will also try
to see how fast my process can finish using fewer instances with this
in place. Each task takes about 4 seconds and we'd like to finish
running within 30 minutes, so 10 instances should be fine.
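
The 10-instance figure follows from a quick calculation, sketched here. The function name instances_for_deadline is hypothetical, and the sketch ignores instance startup time and any uneven distribution of tasks.

```python
import math


def instances_for_deadline(num_tasks, seconds_per_task, deadline_minutes):
    """Instances needed to finish all tasks within the deadline."""
    total_seconds = num_tasks * seconds_per_task
    return int(math.ceil(total_seconds / (deadline_minutes * 60.0)))


# 4500 tasks at about 4 s each, to be finished within 30 minutes.
print(instances_for_deadline(4500, 4, 30))  # 10
```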

If the datastore CPU time is being billed against what I see as CPU
Time, then this answers my question. The new pricing scheme estimates
millions of writes every time I run my task. I hope I can find a way
to reduce this, as it will end up costing more than what the current
pricing scheme costs me.

Thanks again!

-Prateek

Rishi Arora

Oct 4, 2011, 3:03:52 PM

to google-a...@googlegroups.com

You're welcome. I struggled with minimizing datastore writes as well. Some tips, based on what worked for me:

1. Multiple writes to the same entity. For example, for a "number of accesses per day" counter, instead of incrementing the counter and writing it to the datastore every time, it's better to add each increment operation to a pull queue. Then at the end of the day, run a cron job to pull items from the queue, increment the counter in memory, and write the final result to the datastore. I'm kind of cheating and using the pull queue as reliable transient storage, accessible from either front-ends or backends. The 100MB taskqueue capacity limitation and the 100k taskqueue API calls limitation work just fine for me.

2. Moved things like high-level application logging to an external database instead of the datastore. My GAE app now logs such events to a pull queue (again, another hack making use of pull queues), and a cron job batches these logs and sends them to an external database over an HTTP connection.

3. I discovered I had twice as many indexes as I really needed. Back when datastore writes weren't so expensive, I lavishly created indexes all over the place. I trimmed that down by 50%, which had a huge impact on the number of datastore writes. Inserting a new entity with 10 properties and 5 multi-property indexes will cause 16 write operations to the datastore (1 for the entity, 10 for the single-property indexes, one per property, and 5 for the multi-property indexes). A delete will cause 16 writes as well. An update will cause up to 16 writes, depending on which indexes are affected.
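
The write count in tip 3 can be sketched as a one-line formula. This follows the counting rule stated in the tip (one write for the entity, one per property index, one per multi-property index), which may not match the actual billing formula; the function name is hypothetical.

```python
def writes_per_insert(num_properties, num_multi_property_indexes):
    """Write operations for one insert, per the counting rule in tip 3."""
    entity_write = 1
    return entity_write + num_properties + num_multi_property_indexes


print(writes_per_insert(10, 5))  # 16
```

Under the same rule, every multi-property index that is dropped saves one write on each insert and each delete.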

I think, overall, creative uses of memcache and pull queues can help avoid using the datastore for "transient" storage. For instance, the GAE Appstats utility uses memcache exclusively for its functioning.
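
As an illustration of tip 1, the end-of-day step can be sketched as folding the queued increments in memory so that each counter is written once. The list of (name, amount) pairs stands in for items pulled from a pull queue, and fold_increments is a hypothetical helper, not part of any App Engine API.

```python
from collections import Counter


def fold_increments(increments):
    """Sum queued (counter_name, amount) pairs into one total per counter."""
    totals = Counter()
    for counter_name, amount in increments:
        totals[counter_name] += amount
    return dict(totals)


# 1003 queued increments collapse into 2 totals, one write per counter.
queued = [("page_views", 1)] * 1000 + [("signups", 1)] * 3
totals = fold_increments(queued)
print(len(totals))  # 2
```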

Nick Johnson

Oct 4, 2011, 10:47:10 PM

to google-a...@googlegroups.com

On Wed, Oct 5, 2011 at 2:27 AM, Rishi Arora <rishi...@ship-rack.com> wrote:

> I don't think backend CPU time counts against your front-end instance CPU hours quota. Backends are purely billed based on uptime - and I think this is true for both current and new pricing (starting November 1). But you mentioned something about datastore CPU? It is likely that most of your billing is because of this.

That's correct - datastore operations are currently billed by charging for CPU time. If you do any datastore operations on a backend, they'll show up as regular CPU hours.