Writing many tasks to the same file

17 views
Skip to first unread message

Justin Porter

unread,
May 5, 2018, 12:33:57 PM5/5/18
to jug-users
Hi all,

I am using jug to parallelize analysis of large sets of molecular dynamics trajectories, and so far it's been awesome.

Something that I'd like to do is take many computation results (more data than will fit in memory) and one by one write them to a single file (HDF5 in my case).

The naive solution was to write something like:

def write_stuff(task_results, output_file):
   
for task_result in task_results:
        output_file
.write(task_results)

   
return output_file

The problem here is that this loads _all_ the data into memory, which causes the process to be killed. I could also write each output as its own task, but that has the problem that it will clobber the output file by each process trying to write to that file (I could obviously implement some kind of locking thing but I'd rather avoid that).

I suspect that I'm doing this wrong, or have missed an important piece of documentation. Any guidance would be appreciated.

Cheers!

Justin

Luis Pedro Coelho

unread,
May 5, 2018, 4:56:24 PM5/5/18
to Justin Porter, jug-users
Hi Justin, 

Glad you like Jug!

> Something that I'd like to do is take many computation results (more data than will fit in memory) and one by one write them to a single file (HDF5 in my case).
>
> The naive solution was to write something like:
>
> def write_stuff(task_results, output_file):
>     for task_result in task_results:
>         output_file.write(task_results)
>
>     return output_file

There is an alternative, which is to "look under the hood" and use jug.io.NoLoad (this is not documented, but it should work).

Assuming your script looks something like:

...
write_stuff(task_results, 'output.hdf5')

You transform it to:

from jug.io import NoLoad

task_results_no_load = [NoLoad(r) for r in task_results]
write_stuff(task_results_no_load, 'output.hdf5')

Now, `write_stuff` will be called with the Task objects, so you need to load/unload explicitly:

@TaskGenerator
def write_stuff(task_results, output_file):
    for task_result in task_results:
val = task_results.load()
        output_file.write(val)
task_results.unload()

HTH
Luis

Justin Porter

unread,
May 5, 2018, 5:41:09 PM5/5/18
to jug-users
Luis,

Thanks for the quick reply!

I am a bit confused... NoLoad looks like it's a kind of wrapper that still hashes, so that it can be serialized but won't trigger loading, but I don't see it inheriting from Task, nor does it implement load/unload (here).

The reason I bring this up is that I get the following error inside the function:

[...snip...]
 
File "/home/jrporter/miniconda2/envs/std3/lib/python3.6/site-packages/jug/task.py", line 111, in run
   
self._result = self._execute()
 
File "/home/jrporter/miniconda2/envs/std3/lib/python3.6/site-packages/jug/task.py", line 121, in _execute
   
return self.f(*args,**kwargs)
 
File "/home/jrporter/modules/jugbook/pockets.py", line 369, in assemble_sasa_h5
    data
= sasa.load()
AttributeError: 'NoLoad' object has no attribute 'load'

The relevant task assembly code is:
 
    sidechain_sasas = [condense_sidechain_sasas(sasa, topology))
                     
for sasa in sasas]

    SC_SASA_FILE
= assemble_sasa_h5(
        sasas
=[NoLoad(r) for r in sidechain_sasas],
        filename
='sasas-sidechains.h5')


In my mind, I am thinking that 'sidechain_sasas' should take the place of 'task_results' and 'assemble_sasa_h5' is the equlivalent of 'write_stuff'

Thanks!

Cheers,
Justin

Luis Pedro Coelho

unread,
May 6, 2018, 3:37:42 AM5/6/18
to jug-users
Sorry, I wrote without testing my code.

Yes, what I sent would not work, but a slight variation does: 

from jug import value

@TaskGenerator
def write_stuff(task_results, output_file):
     for task_result in task_results:
         val = value(task_results.t)
         output_file.write(val)
         task_results.t.unload()


1. You need to access the `.t` member

2. You need to use the `value()` function and not just load()


The actual trick of NoLoad is that the __jug_value__ method does not really load anything.

HTH,
Luis

Justin Porter

unread,
May 6, 2018, 10:04:07 AM5/6/18
to jug-users
That's working great.

Thanks!

Justin Porter

unread,
May 7, 2018, 12:10:45 PM5/7/18
to jug-users
Ok, so I've implemented this in my workflow, but I'm seeing an interesting behavior where the task that depends on the NoLoads is being marked as ready when I don't think it should be.

Pipeline code:
    sasas = [map_sasa_shm(trj, topology, float(probe_radius)/10)
             
for trj in trajectories]

    sidechain_sasas
= [condense_sidechain_sasas(sasa, topology)
                       
for sasa in sasas]

    SC_SASA_FILE
= DATA_STEM+'sasas-sidechains.h5'
    sasa_sidechain_h5
= assemble_sasa_h5(

        sasas
=[NoLoad(r) for r in sidechain_sasas],

        filename
=SC_SASA_FILE)

A snapshot of `status` is: 

     Waiting       Ready    Finished     Running  Task name
           
0           1           0           0  jugbook.pockets.assemble_sasa_h5
         
59        4507           0           0  jugbook.pockets.condense_sidechain_sasas
           
0           0        4507          59  jugbook.pockets.map_sasa_shm

So, you can see that, even though none of the 'condense_sasa_sidechain' tasks have finished, the 'assemble_sasa_h5' task is still marked as ready.

Is this expected behavior? If not, how can I dig it and figure out what the problem is?

Cheers!

Luis Pedro Coelho

unread,
May 9, 2018, 11:55:56 AM5/9/18
to Justin Porter, jug-users
HI Justin,

Thanks for opening an issue on github. These last few days have been pretty busy, so I was late getting back to this.

You are right that NoLoad is not completely implemented. It was just used internally and so, it was never properly tested. I will give it a go, now.

(To answer your Q on github: it may just be a matter of giving it a dependencies() method indeed)

HTH,
Luis
--

Luis Pedro Coelho | EMBL | http://luispedro.org
My blog: http://metarabbit.wordpress.com
> --
> You received this message because you are subscribed to the Google Groups "jug-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to jug-users+...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Luis Pedro Coelho

unread,
May 9, 2018, 12:36:23 PM5/9/18
to jug-users
Ok, this is fixed (I also converted your example into a test, so it will stay fixed).

For your usage, I recommend just pulling out the new NoLoad object out of the code:

https://github.com/luispedro/jug/blob/f7f1c1a1044ba1b37aef67ff01d8b469a3a21571/jug/io.py#L19

HTH
Luis

Luis Pedro Coelho | EMBL | http://luispedro.org
My blog: http://metarabbit.wordpress.com

Reply all
Reply to author
Forward
0 new messages