Process large amounts of data with several workers in parallel?


Johann Petrak

May 8, 2019, 9:03:47 PM5/8/19
to jug-users
Is it possible to process large amounts of data with several workers in parallel? For example, if I have several terabytes of images and each image needs to be processed by a worker, and I want to have 100 workers on 10 machines running in parallel, how would I do this with Jug? Is it possible? How would those images get passed on to the workers in a way that ensures workers can process the next image as soon as they have finished the previous one, but also that the machines do not run out of memory or other resources, given that only a few images can be held in memory at any one time?

Luis Pedro Coelho

May 8, 2019, 9:21:37 PM5/8/19
to jug-users
Hi Johann et al.,

The goal of processing the images in parallel should be a standard use of jug. Start 10 `jug execute` jobs on each machine and they should chug along without problems...

If I understand you correctly, however, you want to avoid having too many workers load images at the same time. Presumably, there is other work that can be done too and is less memory intensive.

While jug has no built-in support for such complex requirements, you can get 90% of the way there by judicious use of the --target option, so that you start a small number of jobs targeting the memory-intensive tasks plus a few others targeting the remaining tasks.
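
To illustrate, here is a minimal sketch of that split; the task names (load_and_process, summarize) and the file layout are placeholders, not from this thread:

    # jugfile.py -- hypothetical split into a memory-heavy and a light task
    from glob import glob
    from jug import TaskGenerator

    @TaskGenerator
    def load_and_process(path):
        # memory-intensive step: would load the full image here
        with open(path, 'rb') as f:
            return len(f.read())  # placeholder for real processing

    @TaskGenerator
    def summarize(value):
        # lightweight follow-up work
        return value * 2

    results = [load_and_process(p) for p in sorted(glob('images/*.png'))]
    summaries = [summarize(r) for r in results]

Then, on each machine, run something like `jug execute --target load_and_process jugfile.py` for a couple of workers and a plain `jug execute jugfile.py` for the rest (check the exact --target matching rules against the jug docs for your version).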


This is a bit of a hassle to manage if you have to do it manually, which is why I also recommend you take a look at jug-schedule, which might fit your needs very well.

HTH
Luis

--
Luis Pedro Coelho | Fudan University | http://luispedro.org


Latest publication: "Meta-analysis of fecal metagenomes reveals global microbial signatures that are specific for colorectal cancer" https://doi.org/10.1038/s41591-019-0406-6




Johann Petrak

May 9, 2019, 2:08:41 AM5/9/19
to jug-...@googlegroups.com
Thank you very much! 
My main problem is not avoiding too many workers loading the images at once, though; sorry if my explanation was unclear.
What I struggle with is how work gets distributed among the workers: 
* every remote worker will take an unknown time to process an image
* there are many images to be processed
* workers should always be busy

So the process that provides the images to the workers needs a way to "send over" the images, right?
Clearly that cannot be done by calling the remote processes synchronously; we cannot wait for each of k remote workers to complete before sending the next image, can we?
So instead, one has to somehow asynchronously queue images so that the workers can retrieve them.
But that queue has to be able to limit how much data the producer can stuff into it at any one time, otherwise the queued images would eat up too many resources (memory, disk space).

This kind of asynchronous communication with the worker processes is what I do not understand about how it can work with Jug.
The example code does not show how the communication is done; it all looks magically identical to doing it synchronously, when of course there must be something else going on under the hood.

All the best,
  johann

Luis Pedro Coelho

May 9, 2019, 3:33:06 AM5/9/19
to jug-...@googlegroups.com
Ah, I see.


> My main problem is not avoiding too many workers loading the images at once, though; sorry if my explanation was unclear.
> What I struggle with is how work gets distributed among the workers:
> * every remote worker will take an unknown time to process an image
> * there are many images to be processed
> * workers should always be busy

Yes, jug tries to do this.

> So the process that provides the images to the workers needs a way to "send over" the images, right?

There is no process providing the images to the workers. Rather, each worker chugs along and takes available tasks when it can. All the workers communicate with a central database of tasks. This can be the file system (the default, in a shared-filesystem setting) or a redis database (it is not hard to add other backends, too).

If the workers can load the image data themselves, that is often a better option too: you just pass file paths around and have each worker load what it needs from the shared file system.
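
A minimal sketch of that path-passing pattern (the directory and function name are assumptions):

    # jugfile.py -- pass file paths around; each worker loads its own data
    from glob import glob
    from jug import TaskGenerator

    @TaskGenerator
    def process_image(path):
        # each worker opens the file itself from the shared filesystem,
        # so only the images currently being processed occupy memory
        with open(path, 'rb') as f:
            data = f.read()
        return len(data)  # placeholder for the real image processing

    results = [process_image(p) for p in sorted(glob('/shared/images/*.png'))]

Run `jug execute jugfile.py` once per desired worker; with a shared filesystem the default file-based jugdir suffices, and without one a redis jugdir plays the same role (see the jug docs for the exact URL syntax).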

HTH,
Luis


Johann Petrak

May 9, 2019, 5:17:04 AM5/9/19
to jug-...@googlegroups.com
On Thu, 9 May 2019 at 09:33, Luis Pedro Coelho <lu...@luispedro.org> wrote:
> There is no process providing the images to the workers. Rather, each worker chugs along and takes available tasks when it can. All the workers communicate with a central database of tasks. This can be the file system (the default, in a shared-filesystem setting) or a redis database (it is not hard to add other backends, too).

Hmm, I guess I get the gist of that, but I still see a problem: if there is no queue, how do workers know what an available task is, so that all tasks are distributed fairly between workers and no task is done twice?
* if each worker gets a list of ids to work on, then work may be extremely imbalanced between workers, because some workers may get all the long-lasting tasks, or some workers may be running on a much slower machine. There is no way to distribute the work in advance in order to maximize throughput.
* if workers do not get pre-allocated chunks of work assigned, how else do they determine what has to be done, but also know in a thread-safe manner what the other workers are working on?
* less important, because it probably can be worked around by passing around objects and configs: if workers directly fetch/retrieve the real data, it is hard to separate the source/sink processing (the details of how data comes in and goes out) from the actual data processing, because every worker has to know this.
 
Sorry for discussing this in so much detail; I just want to make sure I understand everything fully before I decide which library to use for my problem (and how much I would have to hack myself).

Thank you for your help!
  j.

Luis Pedro Coelho

May 9, 2019, 5:23:57 AM5/9/19
to jug-...@googlegroups.com


> Hmm, I guess I get the gist of that, but I still see a problem: if there is no queue, how do workers know what an available task is, so that all tasks are distributed fairly between workers and no task is done twice?

Each worker locks a task before starting to work on it and marks it as done when finished, so that's not a problem. There is a bit of overhead, which is why jug does not work well for very fine-grained parallelism; your tasks should take at least a couple of seconds (in which case the overhead can be written off as negligible).
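
If single images finish too quickly, a standard workaround (my suggestion, not something jug requires) is to batch several images per task so each task runs for seconds:

    # hypothetical batching: one task processes a chunk of images
    from glob import glob
    from jug import TaskGenerator

    CHUNK = 50  # sized so one task takes at least a few seconds

    @TaskGenerator
    def process_batch(paths):
        # amortize the per-task locking/bookkeeping overhead
        # over a whole list of images
        return [len(p) for p in paths]  # placeholder work

    paths = sorted(glob('/shared/images/*.png'))
    batches = [process_batch(paths[i:i + CHUNK])
               for i in range(0, len(paths), CHUNK)]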

> * if each worker gets a list of ids to work on, then work may be extremely imbalanced between workers, because some workers may get all the long-lasting tasks, or some workers may be running on a much slower machine. There is no way to distribute the work in advance in order to maximize throughput.

Yes, but each worker pulls a single task when it needs to, so this only occurs when you have fewer tasks than workers. At that point, it may matter which worker gets which task.

> * less important, because it probably can be worked around by passing around objects and configs: if workers directly fetch/retrieve the real data, it is hard to separate the source/sink processing (the details of how data comes in and goes out) from the actual data processing, because every worker has to know this.

There is no fundamental reason not to pass the data around by storing it in the jug backend; it's just often inefficient.
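
To make the trade-off concrete, a sketch (the function names are invented): returning the data itself stores it in the backend, while returning a path keeps the backend small.

    from jug import TaskGenerator

    @TaskGenerator
    def load_image_bytes(path):
        # the returned bytes get stored in the jug backend:
        # correct, but the backend then holds the full image
        with open(path, 'rb') as f:
            return f.read()

    @TaskGenerator
    def locate_image(path):
        # returning only the path keeps the backend tiny;
        # downstream tasks load the file themselves
        return path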

HTH
Luis

Johann Petrak

May 9, 2019, 6:14:47 AM5/9/19
to jug-...@googlegroups.com
On Thu, 9 May 2019 at 11:23, Luis Pedro Coelho <lu...@luispedro.org> wrote:
> Each worker locks a task before starting to work on it and marks it as done when finished, so that's not a problem. There is a bit of overhead, which is why jug does not work well for very fine-grained parallelism; your tasks should take at least a couple of seconds (in which case the overhead can be written off as negligible).


I have trouble understanding how this could work in general: it would require that "looking at the task and starting work on it" happen as an exclusive transaction.
Otherwise, several workers could look at the same task at the same time, see that it needs to be worked on, and all decide to work on it at once.
Even if a task being worked on gets marked somehow on the file system, there is a latency between reading the completed indicator and writing the worked-on indicator, which could lead to the problem described above.
So without a reliable locking mechanism that makes "check if done and, if not, work on it" a safe transaction, I cannot see how this can work.
Does Jug have some kind of built-in mechanism for this?

Luis Pedro Coelho

May 9, 2019, 6:20:09 AM5/9/19
to jug-...@googlegroups.com
Yes. This is all taken care of by jug with locks.

Basically:

1. Get the next non-completed but available task.
2. Lock it.
3. Recheck that it is not completed.
4. Run it.
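
In pseudocode, that loop looks something like this (the `store` object and its method names are illustrative stand-ins for jug's internals, not its real API):

    def worker_loop(store, tasks):
        for task in tasks:
            if store.is_done(task):
                continue
            if not store.acquire_lock(task):  # atomic: at most one worker wins
                continue                      # another worker has it
            try:
                # recheck after locking: the task may have finished between
                # the first check and the lock acquisition
                if not store.is_done(task):
                    result = task.run()
                    store.save(task, result)  # mark done before unlocking
            finally:
                store.release_lock(task)

On a filesystem backend the atomic step is typically an exclusive file creation; on redis, an atomic set-if-absent. The lock plus the recheck closes the race described above.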

Hth,
Luis
--
Luis Pedro Coelho

Renato Alves

May 9, 2019, 2:24:18 PM5/9/19
to jug-...@googlegroups.com
Adding to what Luis said, you might find https://jug.readthedocs.io/en/latest/tutorial.html#tasks helpful in understanding the concept of "Task" in jug and the convenient @TaskGenerator decorator.

If this is still unclear, I suggest you play a little with the example at https://jug.readthedocs.io/en/latest/index.html#short-example and change the functions to take somewhat longer (add time.sleep(...)). Then, by running 'jug execute --verbose debug', you will get a peek under the hood and see what each 'jug execute' instance is doing. Make sure to run multiple instances in separate terminals so you get an idea of how they decide which tasks to work on.
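
For reference, a sketch of the short example with the suggested sleep added (adapted from the linked docs; check the page for the current version):

    # jugfile.py -- jug's primes example, slowed down so that several
    # 'jug execute' workers can visibly share the tasks
    from time import sleep
    from jug import TaskGenerator

    @TaskGenerator
    def is_prime(n):
        sleep(1.0)  # artificial delay to make the scheduling observable
        for j in range(2, n - 1):
            if n % j == 0:
                return False
        return True

    primes100 = [is_prime(n) for n in range(2, 101)]

Run 'jug execute --verbose debug' in two or three terminals and 'jug status' in another to watch tasks being locked, run, and marked complete.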

Cheers,
Renato
