In My Usage the Overhead of Jug Tasks Seems Too High


Simon Haendeler

May 28, 2019, 9:35:06 PM
to jug-users
Dear Jug Mailing list,

I have already run a few experiments using a small number of jug.Tasks and everything worked perfectly. I'm currently running one experiment with ~1,000,000 jug tasks and the overhead seems extremely high.
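
To give an idea of the shape of the experiment, the jugfile looks roughly like this (a simplified sketch, not my real code; the actual per-item computation is different, but the structure of one task per input item is the same):

```python
from jug import TaskGenerator

@TaskGenerator
def process(item):
    # stand-in for the real computation (~20 seconds per task in my case)
    return item * item

# one jug task per input item -> ~1,000,000 tasks in total
results = [process(item) for item in range(1_000_000)]
```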

Generating the jug task list takes around 25 seconds, which seems fine to me. Executing `jug status` takes ~4 minutes. A single task takes around 20 seconds of processing time to finish. `jug status` tells me that only ~80 out of 400 workers are running. I guess the other workers are busy finding a free task to lock. If this interpretation is correct, my jug workers are spending 80% of their time on overhead.

I'm using an NFS backend. I have already had some problems with my NFS server being slow in some (jug-unrelated) corner cases. So far I didn't think that jug was triggering that specific problem in my configuration.

Does anyone have experience running jug with millions of tasks, and what overhead should I expect? Which parameters influence jug's overhead, e.g. does reducing the size of the jug task output reduce the overhead? Is the overhead connected to how many tasks there are in total? It feels like enumerating task status shouldn't take so long and that I'm doing something wrong.

Cheers,
Simon

Luis Pedro Coelho

May 28, 2019, 9:44:53 PM
to jug-users
Hi Simon,

I have run millions of tasks without a problem using the redis backend (in fact, I initially wrote that code because I had a problem where it made sense to use millions of tasks and NFS was too slow). With NFS, each task will generate a file on disk and locking/checking will generate a network call to the NFS server, so it can easily overwhelm the process.

`jug status` will do the equivalent of `ls jugfile.jugdata/*`. As these directories fill up with small files, it becomes slower and slower (especially if other processes are also reading/writing to the same directories).
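
A quick way to see how full those directories have become is to count the files the file backend has written (a rough sketch, assuming the default `jugfile.jugdata` directory; adjust the path if you pass `--jugdir`):

```python
import os

# task results and locks end up as many small files under the jugdir
jugdir = 'jugfile.jugdata'
n_files = sum(len(files) for _, _, files in os.walk(jugdir))
print(f'{n_files} files under {jugdir}')
```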

My advice is to switch to the redis backend.

HTH
Luis

Luis Pedro Coelho | Fudan University | http://luispedro.org

Justin Porter

May 28, 2019, 9:47:30 PM
to jug-users
Have you tried the --cache option for jug status? (It works except when you change your task topology.) My largest project is about 50k tasks, and for me it cuts the time down from ~20 seconds to ~0.7 seconds. I would expect that you'd see a substantial performance benefit...

I don't know if there's a way to ask it to use --cache when you run jug execute, though.

HTH,
Justin

Simon Haendeler

May 31, 2019, 7:40:21 AM
to jug-users
Hey Luis,


I have run millions of tasks without a problem using the redis backend (in fact, I initially wrote that code because I had a problem where it made sense to use millions of tasks and NFS was too slow). With NFS, each task will generate a file on disk and locking/checking will generate a network call to the NFS server, so it can easily overwhelm the process.

I have now migrated all tasks and data in this experiment to a redis backend. `jug status` is now faster at ~3 minutes, although I would have hoped for sub-1-minute calls. But I'm more interested in the performance of workers finding a free task. Do workers use the same mechanism as `jug status` to find ready tasks? Do they iterate through all tasks to check whether one of them is free? `jug status` tells me that 70 of 100 workers are running. While that is of course better than the 80/400 with the NFS backend, 30% of the time spent on overhead still seems high. But is that number even meaningful? I guess "running" counts the number of currently locked tasks, but iterating through all tasks and checking whether they are locked can cause "race conditions" between the status process and the worker processes, so that not all actually running worker processes are detected.

I also tried looking into jug's source code to identify whether I'm doing something wrong. Maybe you can point me in the right direction in the source code? I think [jug.execution_loop](https://github.com/luispedro/jug/blob/master/jug/jug.py#L163) is the right place for the abstract logic of finding a runnable task, which should then calculate the hash of the task and check in redis whether the result is already present. So I would guess performance is similar to `jug status`? In that case I think it's reasonable to expect that the overhead of finding a free task can get high if the number of tasks is high. Do you have any idea how I can measure whether it is overhead?

@Justin

Have you tried the --cache option for jug status? (It works except when you change your task topology.) My largest project is about 50k tasks, and for me it cuts the time down from ~20 seconds to ~0.7 seconds. I would expect that you'd see a substantial performance benefit...

I don't know if there's a way to ask it to use --cache when you run jug execute, though.

I just tried it out and it cuts the execution time of `jug status` down to 6 seconds. But when comparing status with --cache and without, --cache seems not to refresh the numbers at all, only showing the number of ready/finished tasks from when status --cache was first executed.

Also, as --cache seems to rely on an sqlite3 db, I guess there is no good way to incorporate it into task execution, since sqlite3 isn't really designed for concurrent writes.

Thanks for your quick answers!
Simon

Luis Pedro Coelho

Jun 3, 2019, 12:51:18 AM
to jug-users

I have run millions of tasks without a problem using the redis backend (in fact, I initially wrote that code because I had a problem where it made sense to use millions of tasks and NFS was too slow). With NFS, each task will generate a file on disk and locking/checking will generate a network call to the NFS server, so it can easily overwhelm the process.

I have now migrated all tasks and data in this experiment to a redis backend. `jug status` is now faster at ~3 minutes, although I would have hoped for sub-1-minute calls. But I'm more interested in the performance of workers finding a free task. Do workers use the same mechanism as `jug status` to find ready tasks? Do they iterate through all tasks to check whether one of them is free?

They go through the list in topological order, so it's more likely than not that the loop will find a free task fast. Plus, there are a bunch of heuristics that try to avoid re-querying the backend too much.

`jug status` tells me that 70 of 100 workers are running. While that is of course better than the 80/400 with the NFS backend, 30% of the time spent on overhead still seems high. But is that number even meaningful? I guess "running" counts the number of currently locked tasks, but iterating through all tasks and checking whether they are locked can cause "race conditions" between the status process and the worker processes, so that not all actually running worker processes are detected.

Yes, this does happen. The "finished" numbers can be trusted, but the other ones may be slightly off and no care is taken to avoid this type of inconsistency.

I also tried looking into jug's source code to identify whether I'm doing something wrong. Maybe you can point me in the right direction in the source code? I think [jug.execution_loop](https://github.com/luispedro/jug/blob/master/jug/jug.py#L163) is the right place for the abstract logic of finding a runnable task, which should then calculate the hash of the task and check in redis whether the result is already present.

Yes, this is correct. The code started out simple, but has accumulated heuristics that try to avoid as much querying of the backend as possible.

So I would guess performance is similar to `jug status`?

A little different: running requires locking, which requires more roundtrips to the backend (redis or NFS), whilst jug status relies on listing the locked/finished tasks, so it's more efficient (although it can lead to slightly wrong results as mentioned above).

In that case I think it's reasonable to expect that the overhead of finding a free task can get high if the number of tasks is high. Do you have any idea how I can measure whether it is overhead?

The only really good measure is to see whether the jug processes are occupying close to 100% CPU. Most of the overhead is either disk access or waiting on the network, so it will show up as lowered CPU usage.
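
For example, a quick sketch like the following, run on one of the compute nodes, gives a rough number (it uses the third-party psutil package, which is not part of jug; watching the processes in `top` works just as well):

```python
import time
import psutil

# treat any process whose command line mentions 'jug' as a worker
workers = [p for p in psutil.process_iter(['cmdline'])
           if p.info['cmdline'] and any('jug' in part for part in p.info['cmdline'])]

for p in workers:
    p.cpu_percent(None)          # prime the per-process CPU counters
time.sleep(5)                    # sample over a few seconds

usage = []
for p in workers:
    try:
        usage.append(p.cpu_percent(None))
    except psutil.NoSuchProcess:
        pass                     # worker exited while we were sampling

if usage:
    print(f'{len(usage)} jug processes, mean CPU {sum(usage) / len(usage):.0f}%')
```

If the workers sit well below 100% CPU, most of their time is going to the backend rather than to your tasks.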

It is possible that you have tasks that are too short in running time and jug is not the right tool for you. Sometimes, one can just batch things (e.g., using jug.mapreduce: https://jug.readthedocs.io/en/latest/mapreduce.html?highlight=mapreduce).
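
If the per-item work is independent (no reduce step needed), you can also get the same effect by batching by hand with a plain TaskGenerator. This is not the jug.mapreduce API, just the same idea written out as a sketch; pick the batch size so that one task runs for at least a few minutes:

```python
from jug import TaskGenerator

def process_one(item):
    # stand-in for the real per-item computation
    return item * item

@TaskGenerator
def process_batch(items):
    # one jug task now covers a whole chunk of items, so the backend
    # has far fewer tasks to hash, lock and store
    return [process_one(item) for item in items]

inputs = list(range(1_000_000))
batch_size = 100   # 1,000,000 items -> 10,000 tasks
results = [process_batch(inputs[i:i + batch_size])
           for i in range(0, len(inputs), batch_size)]
```

The trade-off is that you lose per-item granularity in `jug status` and in result reuse.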



@Justin

Have you tried the --cache option for jug status? (It works except when you change your task topology.) My largest project is about 50k tasks, and for me it cuts the time down from ~20 seconds to ~0.7 seconds. I would expect that you'd see a substantial performance benefit...

I don't know if there's a way to ask it to use --cache when you run jug execute, though.

I just tried it out and it cuts the execution time of `jug status` down to 6 seconds. But when comparing status with --cache and without, --cache seems not to refresh the numbers at all, only showing the number of ready/finished tasks from when status --cache was first executed.

Also, as --cache seems to rely on an sqlite3 db, I guess there is no good way to incorporate it into task execution, since sqlite3 isn't really designed for concurrent writes.

The `--cache` option is only for `jug status`: it avoids having to recompute all the hashes from the Task tree. The worker processes are still all running and writing to the redis DB, so there should be no concurrency problem, but it's independent of the caching mechanism.

HTH
Luis
