Is it possible to process large amounts of data with several workers in parallel? For example, if I have several terabytes of images and each image needs to be processed by a worker, and I want 100 workers on 10 machines running in parallel, how would I do this with Jug, and is it possible at all? How would those images be passed to the workers so that each worker can start on the next image as soon as it finishes the previous one, while also making sure the machines do not run out of memory or other resources, since only a few images can fit into memory at any one time?
--
You received this message because you are subscribed to the Google Groups "jug-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jug-users+...@googlegroups.com.
My main problem is not preventing too many workers from loading images at once, though; sorry if my explanation was unclear. What I struggle with is how work gets distributed among the workers:
* every remote worker will take an unknown amount of time to process an image
* there are many images to be processed
* workers should always be busy
So the process that provides the images to the workers needs a way to "send over" the images, right?
There is no process providing the images to the workers. Rather, each worker chugs along and takes available tasks when it can. All the workers communicate with a central database of tasks. This can be the file system (the default, in a shared-filesystem setting) or a redis database (it is not hard to add other backends, either).
Hmm, I guess I get the gist of that, but I still see a problem: if there is no queue, how do workers know what an available task is, so that all tasks are distributed fairly between workers and no task is done twice?
* if each worker gets a list of ids to work on, then work may be extremely imbalanced between workers, because some workers may get all the long-lasting tasks, or some workers may be running on much slower machines. There is no way to distribute the work in advance so as to maximize throughput.
* less important, because it can probably be worked around by passing around objects and configs: if workers directly fetch/retrieve the real data, it is hard to separate the source/sink processing (the details of how data comes in and goes out) from the actual data processing, because every worker has to know these details.
Each worker locks a task before starting to work on it and marks it as done when finished, so that is not a problem. There is a bit of overhead, which is why jug does not work well for very fine-grained parallelism; your tasks should take at least a couple of seconds each (in which case the overhead can be written off as negligible).
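The lock-then-run idea can be sketched in a few lines. This is an illustration of the mechanism with the filesystem backend in mind, not jug's actual implementation: claiming a task is an atomic file creation, so when two workers race for the same task, exactly one succeeds.

```python
# Sketch of how a shared filesystem can act as the task database:
# O_CREAT | O_EXCL makes the create-lock-file step atomic, so only
# one worker ever claims a given task.
import os


def try_lock(lockdir, task_id):
    """Return True if this worker claimed the task, False if taken."""
    path = os.path.join(lockdir, task_id + '.lock')
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another worker got there first
    os.close(fd)
    return True


def mark_done(lockdir, task_id):
    """Record completion so no worker ever re-runs this task."""
    with open(os.path.join(lockdir, task_id + '.done'), 'w'):
        pass
```

A worker loop then just iterates over all task ids, skips the ones it cannot lock or that are already done, and runs the rest; fairness falls out automatically because fast workers come back for more tasks sooner.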