How to increase Task Queue Execution timing

Maneesh Tripathi

Dec 6, 2014, 6:58:03 AM
to google-a...@googlegroups.com
I have created a task queue whose tasks stop running after 10 minutes.
I want to increase this time limit.
Please help me with this.

Vinny P

Dec 10, 2014, 1:24:30 AM
to google-a...@googlegroups.com
Task queue requests are limited to 10 minutes of execution time: https://cloud.google.com/appengine/docs/java/taskqueue/overview-push#task_deadlines

If you need to go past the 10 minute deadline, you're better off using a manual or basic scaling module: https://cloud.google.com/appengine/docs/java/modules/#scaling_types
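
For reference, a minimal sketch of what the appengine-web.xml for such a module might look like (the "worker" module name and instance settings below are just placeholders) - requests to basic or manual scaling modules can run for up to 24 hours:

<?xml version="1.0" encoding="utf-8"?>
<!-- Hypothetical appengine-web.xml for a long-running "worker" module. -->
<appengine-web-app xmlns="http://appspot.com/ns/1.0">
  <application>your-app-id</application>
  <module>worker</module>
  <version>1</version>
  <threadsafe>true</threadsafe>
  <instance-class>B4</instance-class>
  <basic-scaling>
    <max-instances>2</max-instances>
    <idle-timeout>10m</idle-timeout>
  </basic-scaling>
</appengine-web-app>

You can then point a queue at that module with a <target>worker</target> element in queue.xml, so the long-running tasks execute there instead of on a frontend instance.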

 
-----------------
-Vinny P
Technology & Media Consultant
Chicago, IL

App Engine Code Samples: http://www.learntogoogleit.com

Emanuele Ziglioli

Dec 10, 2014, 3:22:17 PM
to google-a...@googlegroups.com
It all comes at a cost: increased complexity.
You can't beat the simplicity of task queues, and the 10m limit seems artificially imposed to me. After all, we pay for CPU time, so we would pay just the same for 20m, 30m, or 1h tasks.
I've got a simple task that takes a long time, looping through hundreds of thousands of rows to produce ordered files as output.
The current code is simple and elegant, but I have to keep increasing the instance size in order to finish the task within 10m.
A solution could be using MapReduce, but I haven't figured out yet how MapReduce would solve my problem without hitting the memory limit: with my simple task there are only about 1000 rows in memory at any given time (give or take what the GC hasn't reclaimed yet). A MapReduce shuffle stage would require all entities, or at least their keys, to be kept in memory, and that's impossible with F1s or F2s.
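
For what it's worth, this is roughly the shape of my current loop (a simplified sketch; the "Row" kind, property names and CSV format are made up) - asIterable with a chunk size streams entities, so only about one chunk is resident at a time:

import com.google.appengine.api.datastore.*;
import java.io.PrintWriter;

// Simplified sketch of a bounded-memory export loop (hypothetical "Row" kind).
// The iterable fetches 1000 entities per round trip; earlier chunks become
// garbage as the loop advances, so memory use stays roughly constant.
public class CsvExport {
  public static void export(PrintWriter out) {
    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
    Query q = new Query("Row").addSort("ordering");  // ordered output
    FetchOptions opts = FetchOptions.Builder.withChunkSize(1000);
    for (Entity e : ds.prepare(q).asIterable(opts)) {
      out.println(e.getProperty("colA") + "," + e.getProperty("colB"));
    }
  }
}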

Emanuele

Gilberto Torrezan Filho

Dec 11, 2014, 8:29:54 AM
to google-a...@googlegroups.com
I've used MapReduce myself for a while, and I can tell you: 100+MB of keys means A LOT of keys at the shuffle stage. And the real limitation of MapReduce is:

"The total size of all the instances of Mapper, InputReader, OutputWriter and Counters must be less than 1MB between slices. This is because these instances are serialized and saved to the datastore between slices."

Source
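
To make that concrete, here's a rough sketch (hedged - this assumes the appengine-mapreduce Mapper API as it was at the time): the mapper instance itself is what gets serialized between slices, so its instance fields have to stay small:

import com.google.appengine.api.datastore.Entity;
import com.google.appengine.tools.mapreduce.Mapper;

// Sketch only: the Mapper instance is serialized to the datastore between
// slices, so its instance fields count toward the ~1MB limit quoted above.
public class RowMapper extends Mapper<Entity, String, Long> {
  private long rowsSeen;           // small scalar state: safe to carry across slices
  // private List<Entity> buffer;  // DON'T: large buffered state would blow the limit

  @Override
  public void map(Entity row) {
    rowsSeen++;
    emit((String) row.getProperty("groupKey"), 1L);
  }
}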

The real problem with MapReduce, in my opinion, is the latency of the operations and the huge number of datastore reads/writes needed to keep things running between slices (which considerably increases costs). You can't rely on MapReduce to do real-time or near-real-time work the way you can with pure task queues. And it only really shines when you can afford a large number of machines to run your logic - running MapReduce on a few machines is sometimes worse than pure sequential brute force.

Fitting your problem into a MapReduce process is actually good for your code - even if you don't use the library itself. It forces you to think about how you can split your huge tasks into smaller, more manageable and more scalable pieces. It's a good exercise - sometimes you think you can't parallelize your problem, but when you're forced into the MapReduce workflow you might find you were actually wrong, and by the end of the day you have better code.
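
As an illustration of that splitting idea without the MapReduce library itself (a sketch only - the /process-batch path, "Row" kind and batch size are made up): each task handles one slice and re-enqueues itself with a datastore cursor, so no single request comes near the 10-minute deadline:

import com.google.appengine.api.datastore.*;
import com.google.appengine.api.taskqueue.*;
import javax.servlet.http.*;

// Sketch of the classic "task chaining" pattern: process one slice,
// then re-enqueue the next task with a cursor to continue where we left off.
public class ProcessBatchServlet extends HttpServlet {
  @Override
  protected void doPost(HttpServletRequest req, HttpServletResponse resp) {
    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
    FetchOptions opts = FetchOptions.Builder.withLimit(1000);
    String cursor = req.getParameter("cursor");
    if (cursor != null) {
      opts.startCursor(Cursor.fromWebSafeString(cursor));
    }
    QueryResultList<Entity> batch =
        ds.prepare(new Query("Row")).asQueryResultList(opts);
    for (Entity e : batch) {
      // ... process one entity ...
    }
    if (batch.size() == 1000) {  // more work remains: chain the next task
      QueueFactory.getDefaultQueue().add(TaskOptions.Builder
          .withUrl("/process-batch")
          .param("cursor", batch.getCursor().toWebSafeString()));
    }
  }
}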

Emanuele Ziglioli

Dec 11, 2014, 10:57:00 AM
to google-a...@googlegroups.com
Thank you very much Gilberto!

It's great to make contact with people out there who are in the same boat.
I've just been watching a series of videos on pipelines, and I'm starting to get the big data processing pattern that Google promotes:

Datastore -> Cloud Storage -> BigQuery

The key point is that BigQuery is "append only", something I hadn't realized before.
Here are the videos:
  1. Google I/O 2012 - Building Data Pipelines at Google Scale: http://youtu.be/lqQ6VFd3Tnw 
  2. BigQuery: Simple example of a data collection and analysis pipeline + Yo...: http://youtu.be/btJE659h5Bg
  3. GCP Cloud Platform Integration Demo: http://youtu.be/JcOEJXopmgo via @YouTube
All I need, it seems, is the Pipeline API, iterating over the Datastore (in order, I guess via a query) and producing CSV (and other formats) as output.
That should allow me to do what I already do, but on top of multiple (perhaps sequential) task queues rather than just one.
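
Something like this is what I have in mind for the Cloud Storage stage (a sketch using the appengine-gcs-client library; the bucket and object names are made up):

import com.google.appengine.tools.cloudstorage.*;
import java.io.PrintWriter;
import java.nio.channels.Channels;

// Sketch: stream the CSV output to Cloud Storage instead of the response,
// so a later stage (or a BigQuery load job) can pick it up.
public class GcsCsvWriter {
  public static void write() throws Exception {
    GcsService gcs = GcsServiceFactory.createGcsService(RetryParams.getDefaultInstance());
    GcsFilename file = new GcsFilename("my-bucket", "exports/rows.csv");
    GcsOutputChannel channel =
        gcs.createOrReplace(file, GcsFileOptions.getDefaultInstance());
    try (PrintWriter out = new PrintWriter(Channels.newWriter(channel, "UTF-8"))) {
      out.println("colA,colB");
      // ... stream rows here, e.g. from a datastore query ...
    }
  }
}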

From the point of view of costs, I currently rely heavily on memcache (possibly abusing it). Without memcache, I expect costs to go up.
A further improvement would be to update only subsets of the data rather than the whole lot. I've been designing a new datastore 'schema' so that my data is organized hierarchically in entity groups; that way I could generate a file per entity group (once it has changed) and have a final stage that assembles those files together.
I'm pretty happy with my current task because, as I wrote, it is simple and elegant.
If I could upgrade the same algorithm to a Datastore input reader for pipelines, that would do for us.

Emanuele

Gilberto Torrezan Filho

Dec 11, 2014, 12:17:22 PM
to google-a...@googlegroups.com
Actually, I just migrated my statistics job from MapReduce to BigQuery (using the Datastore -> Cloud Storage -> BigQuery pattern) =)
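
The load step itself is small. Roughly (a sketch against the google-api-services-bigquery client; project, dataset and table names are placeholders):

import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.model.*;
import java.util.Collections;

// Sketch: kick off a BigQuery load job that ingests a CSV from Cloud Storage.
// "bigquery" is an already-authenticated client.
public class LoadFromGcs {
  public static void load(Bigquery bigquery) throws Exception {
    JobConfigurationLoad load = new JobConfigurationLoad()
        .setSourceUris(Collections.singletonList("gs://my-bucket/exports/rows.csv"))
        .setSourceFormat("CSV")
        .setDestinationTable(new TableReference()
            .setProjectId("my-project")
            .setDatasetId("stats")
            .setTableId("rows"));
    Job job = new Job().setConfiguration(new JobConfiguration().setLoad(load));
    bigquery.jobs().insert("my-project", job).execute();
  }
}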

I strongly recommend the book "Google BigQuery Analytics" by Jordan Tigani and Siddharth Naidu if you plan to use BigQuery or want to know more about it. I got mine at I/O this year (the last book in the box) =)

BigQuery is awesome but has its quirks - the append-only tables are just one of them. You have to shape your business logic to handle that before starting to use it heavily.

If you don't need statistics, you probably don't need BigQuery.

The sad part is that I spent more than 2 months tweaking and improving my whole pipeline stack trying to get better performance (or cost-effectiveness), when I could just have been using BigQuery to solve my problems. Anyway, it was a good lesson.

Emanuele Ziglioli

Dec 15, 2014, 4:05:51 PM
to google-a...@googlegroups.com
Hi Gilberto,

quick question: do you think BigQuery could possibly replace the Datastore for queries?
A big Datastore pain point is the fact that each query requires an index, while BigQuery doesn't have this restriction.
Do you think it would be feasible for a GAE app to internally redirect client requests to BigQuery? 

I'm tempted to add support for BigQuery to Siena (a Java ORM); that would be a big win for this project, which I keep maintaining for our own use:

Emanuele

Nickolas Daskalou

Dec 15, 2014, 5:35:13 PM
to Google App Engine
We use BigQuery (Python) to analyse visitor and click data on followus.com pages.

From what we've seen, BigQuery queries scale really well over large datasets and complex queries.

However, there is an overhead to each BigQuery query which makes even simple queries over small datasets take a couple of seconds.

So keep that in mind.

There is also the ability to parallelise multiple different queries in order to reduce total query time.

We skip the Google Cloud Storage step and use streaming inserts with a combination of Memcache (App Engine side) and insertIds (BigQuery side) to avoid duplicate inserts. This could work for you too if your rows are not too large.
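
In rough Java terms (a sketch of what we do in Python; project, dataset, table and field names are made up), a streaming insert with an insertId looks like this:

import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.model.TableDataInsertAllRequest;
import com.google.api.services.bigquery.model.TableDataInsertAllResponse;
import java.util.Collections;
import java.util.Map;

// Sketch: streaming insert with an insertId. BigQuery uses the insertId for
// best-effort de-duplication, so retrying the same row is safe.
public class ClickStream {
  public static void insertClick(Bigquery bigquery, String clickId,
                                 Map<String, Object> row) throws Exception {
    TableDataInsertAllRequest.Rows r = new TableDataInsertAllRequest.Rows()
        .setInsertId(clickId)  // stable id per logical event
        .setJson(row);
    TableDataInsertAllRequest body = new TableDataInsertAllRequest()
        .setRows(Collections.singletonList(r));
    TableDataInsertAllResponse resp = bigquery.tabledata()
        .insertAll("my-project", "analytics", "clicks", body).execute();
    if (resp.getInsertErrors() != null) {
      // handle / log per-row errors here
    }
  }
}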

Since BigQuery is append-only, if you want it to replace the Datastore for queries, you will need to add versioning to each row you insert into BigQuery, and construct a query which only considers the latest version of an entity/record.
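
For example (a sketch; the entity_key and version columns are hypothetical, and the query uses the legacy BigQuery SQL of the time):

import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.model.QueryRequest;
import com.google.api.services.bigquery.model.QueryResponse;

// Sketch: pick only the latest version of each entity from an append-only
// table, using a window function to rank versions per key.
public class LatestVersionQuery {
  public static QueryResponse run(Bigquery bigquery) throws Exception {
    String sql =
        "SELECT * FROM ("
      + "  SELECT *, ROW_NUMBER() OVER ("
      + "    PARTITION BY entity_key ORDER BY version DESC) AS rn"
      + "  FROM [mydataset.entities]"
      + ") WHERE rn = 1";
    return bigquery.jobs().query("my-project", new QueryRequest().setQuery(sql))
        .execute();
  }
}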

Nick



Emanuele Ziglioli

Jan 20, 2015, 4:15:19 PM
to google-a...@googlegroups.com
Thanks, Nickolas, for sharing your experience.

I've just come across this simple yet very elegant solution by Nacho Coloma for breaking through the 10m time limit of task queues:
