jug.barrier() combined with sleep-until results in early exit


Zachary Hafen

Apr 1, 2018, 3:03:19 AM
to jug-users
Hello,

I've been using jug for a month or two now, and I've been very happy with it. I'm performing an embarrassingly parallel analysis of my scientific simulation data, and jug has made it very easy to speed up the analysis by distributing it across multiple nodes. It's also helped me avoid some nasty memory leaks in the default Python multiprocessing module. I'll definitely be recommending jug to my computational colleagues (and of course citing it properly in my forthcoming paper :) ).

However, I've encountered a bit of an issue, and I was hoping you might have some insight. I have a code that requires multiple merge points. At these merge points, a file is written, and the next section of the code uses that file. To address this, I've used jug.barrier(). Since I would like to submit this as a batch job, I created a script run_jug.sh, which will run multiple jug processes on a single node. In that script, I use sleep-until to prevent the batch job from ending early. The problem I'm encountering is that sleep-until seems to be satisfied when all jug processes hit the first jug.barrier(), and as such the batch job ends early.

I've attached my jugfile (test_jugfile.py) and submission file (run_jug_test.sh), which should reproduce the issue. When I execute run_jug_test.sh, I get the following output.

Sc497-111(4)$ ./run_jug_test.sh
0
1
2
3
4
[0.0, 1.0, 4.0, 9.0, 16.0]
10
11
12
Script finished, exiting!
Sc497-111(5)$ 13
14
[100.0, 121.0, 144.0, 169.0, 196.0]

    Executed      Loaded  Task name
---------------------------------------------------
           4           6  test_jugfile.dummy_fn
           1           2  test_jugfile.show_results
...................................................
           5           8  Total

    Executed      Loaded  Task name
---------------------------------------------------
           6           5  test_jugfile.dummy_fn
           1           2  test_jugfile.show_results
...................................................
           7           7  Total


In the expected output, "Script finished, exiting!" shouldn't occur until after "14".

As a note, I currently have wait-cycle-time=12 and nr-wait-cycles=150, which should give plenty of time before jug stops waiting.

Thanks a lot!
Zach
run_jug_test.sh
test_jugfile.py

Renato Alves

Apr 2, 2018, 8:39:48 AM
to jug-...@googlegroups.com, Zachary Hafen
Hi Zachary,

Indeed there seems to be a bug in sleep-until when using barriers.
The issue is that if a jugfile has barriers, not all tasks are visible from the start.
When sleep-until starts it will only see tasks up to the first barrier and exit once 'all' tasks are done.

A possible workaround until this is fixed is to call sleep-until as many times as the number of barriers.
The second call should have access to tasks just after the barrier which should produce the correct behavior.
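
To make the mechanism concrete, here is a toy model in Python. It is not jug's actual implementation: the function names, the task names, and the two-stage layout are all made up for illustration. It only assumes what is described above, namely that tasks after a barrier become visible only once everything before the barrier is done.

```python
def visible_tasks(stages, done):
    """Tasks exposed when the jugfile is loaded: every stage up to and
    including the first one that still has unfinished tasks."""
    visible = []
    for stage in stages:
        visible.extend(stage)
        if not all(t in done for t in stage):
            break  # barrier: later stages are not defined yet
    return visible


def one_shot_wait_returns(stages, done):
    """Model of a single wait that loaded the jugfile once, before any
    task had run, and only watches the tasks it saw at that point."""
    watched = visible_tasks(stages, set())
    return all(t in done for t in watched)


def check(stages, done):
    """Model of a fresh check: reload the jugfile, then test every
    task that is visible now."""
    return all(t in done for t in visible_tasks(stages, done))


# Hypothetical jugfile with one barrier between two groups of tasks.
stages = [["a0", "a1", "show_a"], ["b0", "b1", "show_b"]]
done_after_stage_one = {"a0", "a1", "show_a"}

print(one_shot_wait_returns(stages, done_after_stage_one))  # True: the wait ends early
print(check(stages, done_after_stage_one))                  # False: work remains
```

In this model the single wait is satisfied as soon as the first group finishes, while a fresh load of the jugfile sees the second group, which is why calling sleep-until again picks up the remaining tasks.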

I suggest reporting this on https://github.com/luispedro/jug/issues.

Cheers,
Renato


Luis Pedro Coelho

Apr 2, 2018, 10:29:11 AM
to Renato Alves, jug-...@googlegroups.com, Zachary Hafen
Indeed, this is a bug. Thanks for the report.

I went ahead and opened an issue:

https://github.com/luispedro/jug/issues/71

This way it will not be lost.

A temporary workaround is to do something like the following in your bash script:

while ! jug check "$JUGFILE"; do
    jug sleep-until "$JUGFILE"
done
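
The same loop can also be driven from Python, for instance from a batch wrapper script. This is only a sketch: wait_until_all_done and its run parameter are hypothetical names, and it assumes, as the bash loop above does, that jug check exits with a non-zero status while tasks remain.

```python
import subprocess


def wait_until_all_done(jugfile, run=subprocess.call):
    """Repeat `jug sleep-until` until `jug check` reports success.

    `run` takes an argument list and returns the exit status. It is a
    parameter only so the loop can be exercised without jug installed.
    Returns the number of sleep-until rounds that were needed.
    """
    rounds = 0
    while run(["jug", "check", jugfile]) != 0:
        # Each round waits for the tasks that are visible right now;
        # the next check reloads the jugfile and may see more tasks.
        run(["jug", "sleep-until", jugfile])
        rounds += 1
    return rounds
```

Calling wait_until_all_done("test_jugfile.py") at the end of the batch script would then block until every stage has finished, with one sleep-until round per group of tasks that was still pending.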

HTH,
Luis

Luis Pedro Coelho | EMBL | http://luispedro.org
My blog: http://metarabbit.wordpress.com

Zachary Hafen

Apr 2, 2018, 5:32:25 PM
to Luis Pedro Coelho, Renato Alves, jug-...@googlegroups.com
Hi Luis, Renato,

The workaround works great. Thanks for looking into this!

Zach