FUSE still hanging in parallel


Leonardo Borges

Aug 18, 2014, 8:34:48 AM
to po...@googlegroups.com
Dear all,

Over the last few days I have tried to run DO (10 sets of 3 SEARCHs) on one data set with POY 5.1 in parallel, but the analysis does not even get through the first search procedure.

I have spoken to Denis Carvalho and he told me that this may be due to problems with FUSE already reported for past versions.

He also told me that this was not supposed to happen in 5.1, but it looks like the bug is still there.

Of course, it could also just be a mistake on my part.


Anyway, I am sending you the folder with all the files I am using (matrices, script, etc.) so you can get a better look at the problem I am having.

The command I use to get it going on the cluster is "nohup mpiexec -36 poy5.1 search_m0111.poy >/dev/null 2>std.err </dev/null &".

After days of running, the only output in the std.err file is this:

177 seconds
Information :
               Timer:
                  178 seconds
Status : TBR Finished
Status : TBR Finished
Status : Automated Search : 0 Best tree: 2999.; Time left: -2 s; Hits: 6
Status : RAS + TBR Finished
Status : Fusing Trees : 0
Status : Fusing Trees Finished
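
A hang like this shows up as a std.err file that stops growing while the job is still alive. The following is a minimal sketch of how that could be checked automatically, assuming the log is written to a regular file such as the std.err above. The helper names `is_stalled` and `last_status_line` and the idle threshold are hypothetical, not part of POY.

```python
import os
import time


def is_stalled(log_path, max_idle_seconds, now=None):
    """Return True if log_path was last modified more than max_idle_seconds ago."""
    if now is None:
        now = time.time()
    idle_seconds = now - os.path.getmtime(log_path)
    return idle_seconds > max_idle_seconds


def last_status_line(log_path):
    """Return the last line that starts with 'Status', or None if there is none."""
    last = None
    with open(log_path, "r", errors="replace") as handle:
        for line in handle:
            if line.lstrip().startswith("Status"):
                last = line.strip()
    return last
```

For example, `is_stalled("std.err", 6 * 3600)` together with `last_status_line("std.err")` would report that the log has been silent for six hours and which step it reached last. A silent log is only a hint: a long search step can also be quiet for a while.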


By the way, I have run a test search without parallelization and with max_time set to a minimum, and it worked all right. I mean, it produced the expected files, but I am not sure if that is just because it did not have time to properly run FUSE.

Well, just letting you guys know about the issue. Hope there is a way to fix that.

Best,

Leonardo
FUSEhanging.zip

Leonardo Borges

Aug 18, 2014, 9:02:54 AM
to po...@googlegroups.com
I gave the wrong cluster command. Sorry for that.

Correcting it:

"nohup mpiexec -n 36 poy5.1 search_m0111.poy >/dev/null 2>std.err </dev/null &"

--
You received this message because you are subscribed to the Google Groups "POY - Phylogenetic Analysis Software" group.
To unsubscribe from this group and stop receiving emails from it, send an email to poy4+uns...@googlegroups.com.
To post to this group, send email to po...@googlegroups.com.
Visit this group at http://groups.google.com/group/poy4.
For more options, visit https://groups.google.com/d/optout.

Ward C Wheeler

Aug 19, 2014, 8:46:52 AM
to po...@googlegroups.com
Leonardo--thanks,
I'll take a look.
W

Leonardo Borges

Aug 22, 2014, 7:21:55 AM
to po...@googlegroups.com
Dear Ward,

Another data set not going through fuse. Totally different taxa, same script.

I am attaching the files of this one too just in case.
DO.zip

Ward C Wheeler

Aug 22, 2014, 9:06:34 AM
to po...@googlegroups.com
Leonardo--thanks,
More cases always help.
W


Ward Wheeler
Division of Invertebrate Zoology
American Museum of Natural History
Central Park West at 79th Street
New York, NY 10024-5192
USA

Leonardo Borges

Aug 22, 2014, 4:46:26 PM
to po...@googlegroups.com
Ward,

just an update.

I tried to run the last data set again and it worked with 20, 30 and 40 processors for max_time = 1 hour.

Before, when I got the bug, I was using 52 processors for this data set and 36 for the first one. It looks to me like it has problems with unusual numbers of processors...
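
One way to narrow this down would be to repeat the same run at several processor counts with a hard time limit, so that a hang is recorded instead of blocking the queue for days. This is a rough sketch only: the command line mirrors the one given earlier in the thread, while the function names, the list of counts and the time limit are illustrative assumptions.

```python
import subprocess


def poy_command(n_procs, script="search_m0111.poy", binary="poy5.1"):
    """Build the mpiexec command line from the thread for n_procs processes."""
    return ["mpiexec", "-n", str(n_procs), binary, script]


def run_with_timeout(command, timeout_seconds):
    """Run command and return 'finished', 'failed', 'timed out' or 'not found'."""
    try:
        result = subprocess.run(
            command,
            timeout=timeout_seconds,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the launcher it started before raising.
        return "timed out"
    except FileNotFoundError:
        return "not found"
    return "finished" if result.returncode == 0 else "failed"


def sweep(process_counts, timeout_seconds,
          build_command=poy_command, runner=run_with_timeout):
    """Run one job per process count and map each count to its outcome."""
    return {n: runner(build_command(n), timeout_seconds) for n in process_counts}
```

A call such as `sweep([20, 30, 36, 40, 52], timeout_seconds=2 * 3600)` would give one outcome per count. The limit has to be comfortably longer than the max_time values in the script, otherwise a healthy run is reported as timed out. Only the mpiexec launcher is killed on timeout, so leftover POY processes on the compute nodes may still need to be cleaned up by hand or by the scheduler.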

Leonardo

Nick Lucaroni

Aug 22, 2014, 5:11:18 PM
to POY Google Group
How many trees are in memory before you do the fuse? Unfortunately, if there are too few, this can affect the performance. Regardless, I wonder if each processor is getting only one tree, and that they are naively waiting for other nodes to finish.

nick

Leonardo Borges

Aug 22, 2014, 5:25:06 PM
to po...@googlegroups.com
Nick, the first data set had 6 trees in memory when FUSE got stuck. The second, 9.

The script I am using runs ten rounds of DO with these settings:

read("*.fas")
transform((names:("its.fas","matk.fas","trndt.fas","trnlf.fas"),(tcm:("m0111"),gap_opening:0)))
set(root:"cendna_00427")
search(max_time:0:05:00)
search(max_time:0:04:00)
search(max_time:0:01:00)
select()
report("trees_do_search.tre",trees:(nomargin),"score_do_m0111.sts",treestats,"search_statiscs.txt",searchstats)
wipe()


The plan is to IP the resulting trees after the whole DO procedure.

Do you think I should replace select() with something like select(within:5.0) in order to increase the number of trees and avoid this kind of problem with FUSE?

Indeed, it looks like the cluster is waiting for results from some of the processors before moving on to the next computational step.
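
To put numbers on that idea, here is a small sketch of how thin the trees get when they are spread over many processors. It assumes the trees are dealt out round-robin, which is only an illustration: nothing in this thread says how POY actually distributes trees before fusing. The function names are hypothetical.

```python
def trees_per_process(n_trees, n_procs):
    """Deal n_trees out round-robin and return the tree count of each process."""
    counts = [0] * n_procs
    for tree_index in range(n_trees):
        counts[tree_index % n_procs] += 1
    return counts


def processes_below(n_trees, n_procs, min_trees=2):
    """Count processes holding fewer than min_trees trees."""
    counts = trees_per_process(n_trees, n_procs)
    return sum(1 for count in counts if count < min_trees)


# Tree and processor counts reported in this thread for the two stuck runs.
for n_trees, n_procs in [(6, 36), (9, 52)]:
    idle = trees_per_process(n_trees, n_procs).count(0)
    print(n_trees, "trees on", n_procs, "processors:", idle, "with no tree at all")
```

Under this assumption, no processor in either stuck run would hold the two trees that a fuse step needs to combine. The sketch only shows the arithmetic behind the suspicion; it does not reproduce or explain the hang itself.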

Leonardo

Nick Lucaroni

Aug 22, 2014, 5:42:17 PM
to POY Google Group

Yeah, you should increase the number of trees. I'll look into the hanging issue this weekend.

Leonardo Borges

Aug 22, 2014, 5:44:36 PM
to po...@googlegroups.com
Cool. I will try that!

Thanks for the help.

Leo

Ward C Wheeler

Aug 22, 2014, 10:14:51 PM
to po...@googlegroups.com
Thanks Nick,
W

Nick Lucaroni

Aug 23, 2014, 9:29:22 PM
to POY Google Group
Do you know if you're using openMPI or MPICH2 as the backend? You should be able to examine the version information on mpiexec --argonne labs distributes.

nick

Leonardo Borges

Aug 23, 2014, 10:10:23 PM
to po...@googlegroups.com
Nick, the command you gave me did not work, but I ran this instead:

mpiexec --version
mpiexec (OpenRTE) 1.4.3

mpirun --version
mpirun (Open MPI) 1.4.3


Guess it is openMPI.
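
The same check can be scripted when several clusters are involved. This is a minimal sketch that classifies the text printed by `mpiexec --version`. The Open MPI markers come from the output above; the MPICH markers ("MPICH", "HYDRA") are an assumption about what its launcher typically prints, and the function names are hypothetical.

```python
import subprocess


def classify_mpi(version_output):
    """Guess the MPI implementation from the text of `mpiexec --version`."""
    text = version_output.lower()
    if "open mpi" in text or "openrte" in text:
        return "Open MPI"
    if "mpich" in text or "hydra" in text:
        return "MPICH"
    return "unknown"


def query_mpiexec_version(executable="mpiexec"):
    """Run `<executable> --version` and return whatever it prints."""
    result = subprocess.run(
        [executable, "--version"],
        capture_output=True,
        text=True,
    )
    # Some launchers print the version banner to stderr, so keep both streams.
    return result.stdout + result.stderr
```

With the output shown above, `classify_mpi("mpiexec (OpenRTE) 1.4.3")` returns "Open MPI".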

Leonardo Borges

Aug 23, 2014, 11:14:31 PM
to po...@googlegroups.com
By the way, I tried your suggestion to increase the number of trees found (commands below) and it got stuck at FUSE again, although it did find a few more trees.

Search Stats:
                                                
# of Builds + TBR  101
# of Fuse          567
# of Ratchets      34
                                
Tree length        Number of hits
4831.              50
4862.              1
4953.              1
4969.              1


search commands (1 of 10 rounds):

search(max_time:0:05:00, min_time:0:05:00)
search(max_time:0:04:00, min_time:0:04:00)
search(max_time:0:01:00, min_time:0:01:00)
select(unique, within:10.0)


Leonardo Borges

Aug 26, 2014, 8:10:06 AM
to po...@googlegroups.com
Nick, just an update.

I have run IP on the cluster with some trees I found with DO on a regular PC and I had no problems with FUSE.

Pretty weird...

Here is the script that I used with the first dataset I sent in this thread.

(* here compiling results with IP+FUSE *)
read("its.fas","ndhf.fas","psba-trnh.fas","trnl.fas", "trnq.fas")
transform((names:("its.fas","ndhf.fas","psba-trnh.fas","trnl.fas", "trnq.fas"),(tcm:("m0111"),gap_opening:0)))
set(root:"10347.Mt.communis")
read("trees_do_search.tre")
select(unique)
report("scores_unique_from_do.sts",treestats)
set(iterative:exact)
fuse()
select()
report("alignment_ip_sens.ia",ia,"score_ip_sens.sts",treestats,"trees_ip_sens.tre",trees:(branches))
transform (all, (static_approx))
report ("matrix_after_sensitivity.ss", phastwinclad)
exit()


Leonardo

Nick Lucaroni

Sep 2, 2014, 4:23:51 PM
to POY Google Group
Interesting. I ran the fuse hanging script successfully with four nodes. Could you let me know what version of POY you're using? There was a problem with fuse hanging that was resolved, in 5.1.0 if memory serves, but it may have been 5.1.1.

nick

Leonardo Borges

Sep 2, 2014, 4:34:47 PM
to po...@googlegroups.com
Nick, I am using 5.1.1.

Over the last few days, on Fernando Marques's advice, I have tried to start everything from scratch and to build up the script little by little to see what would happen.

I was using four nodes and everything worked with one round of DO with two hours of search, then with one round of DO with two replications of two hours of search, and then with 2 rounds of DO with 3 replications of 2 hours of search.

Now I am trying with 10 nodes and a script with 10 rounds of 3 replications of 2 hours of search, and it has been running fine so far.

By the way, Fernando told me he ran into the same bug with his data set, and that he solved it by running POY with the option "-e".

Interestingly I started to have problems when using more than 30 nodes.

Leo

Nick Lucaroni

Sep 2, 2014, 4:39:41 PM
to POY Google Group
Yes, previously we had an issue with how the exit() command is "parallelized". The -e is actually implied now when compiling with MPI, and you can remove all exit() commands from your script. I wonder if that is still an issue under certain conditions?

nick

Leonardo Borges

Sep 2, 2014, 4:46:28 PM
to po...@googlegroups.com
That is kind of weird, though, since we only have an exit() command on the last line and the analysis was getting jammed on the third or fourth round of DO.

If you can, try to run it with a large number of nodes. Maybe you will get to see the bug.