ExaML doesn't finish; batch script


Christian Rinke

Feb 8, 2018, 6:44:51 PM
to raxml
Hi ExaML team,

I have the following problem: I am calling ExaML using a batch script, like so:

srun -N 4 -n 16 -c 16 --cpu_bind=cores ./run_examl_01.sh


And the run_examl_01.sh looks like this:

#!/bin/bash
set -ex
start=`date +%s`
date '+%A %W %Y %X'
examl-AVX -t RAxML_randomTree.trimmed_bacteria.class.250x.phyx_raxmlStartingtree_2.tree -m PSR -s trimmed_bacteria.class.250x.bin.binary -n ExaML_ar122.official_1227x_done_04 -D
end=`date +%s`
date '+%A %W %Y %X'


The run itself works, and the output files are created after 2 hours. However, the job (the batch script is submitted to a Slurm queue) never finishes and eventually times out after 12 hours:


Best rearrangement radius: 25
ML search convergence criterion fast cycle 0->1 Relative Robinson-Foulds 0.663968
ML search convergence criterion fast cycle 1->2 Relative Robinson-Foulds 0.251012
ML search convergence criterion fast cycle 2->3 Relative Robinson-Foulds 0.182186
ML search convergence criterion fast cycle 3->4 Relative Robinson-Foulds 0.068826
ML search convergence criterion fast cycle 4->5 Relative Robinson-Foulds 0.064777
ML search convergence criterion fast cycle 5->6 Relative Robinson-Foulds 0.080972
ML search convergence criterion fast cycle 6->7 Relative Robinson-Foulds 0.024291
ML search convergence criterion fast cycle 7->8 Relative Robinson-Foulds 0.012146
ML search convergence criterion fast cycle 8->9 Relative Robinson-Foulds 0.012146
ML search convergence criterion fast cycle 9->10 Relative Robinson-Foulds 0.012146
ML search convergence criterion fast cycle 10->11 Relative Robinson-Foulds 0.012146
ML fast search converged at fast SPR cycle 12 with stopping criterion
Relative Robinson-Foulds (RF) distance between respective best trees after one succseful SPR cycle: 0.008097%
ML search convergence criterion thorough cycle 0->1 Relative Robinson-Foulds 0.040486
ML search convergence criterion thorough cycle 1->2 Relative Robinson-Foulds 0.012146
ML search converged at thorough SPR cycle 3 with stopping criterion
Relative Robinson-Foulds (RF) distance between respective best trees after one succseful SPR cycle: 0.008097%

Likelihood of best tree: -1567495.709699

Overall Time for 1 Inference 1230.279842

Overall accumulated Time (in case of restarts): 1230.279842

Likelihood   : -1567495.709699

Model parameters written to:           /global/u1/a/ace/01_ExaML/03_r80.bc.120.prot/5000AA/class_120markers/ExaML_modelFile.ExaML_ar122.official_1227x_done_03
Final tree written to:                 /global/u1/a/ace/01_ExaML/03_r80.bc.120.prot/5000AA/class_120markers/ExaML_result.ExaML_ar122.official_1227x_done_03
Execution Log File written to:         /global/u1/a/ace/01_ExaML/03_r80.bc.120.prot/5000AA/class_120markers/ExaML_log.ExaML_ar122.official_1227x_done_03
Execution information file written to: /global/u1/a/ace/01_ExaML/03_r80.bc.120.prot/5000AA/class_120markers/ExaML_info.ExaML_ar122.official_1227x_done_03
slurmstepd: error: *** STEP 10034985.0 ON nid00393 CANCELLED AT 2018-02-08T11:00:48 DUE TO TIME LIMIT ***
srun: got SIGCONT
srun: got SIGCONT
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: forcing job termination


The sysadmin tells me that this must be an ExaML error:
"I don't see this printing "ExaML is DONE" anywhere, which means ExaML isn't finishing, which is why your job is running out of time."


So I was wondering if you have any idea what's going on?
Cheers,
Chris
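
To tell a crashed run from one that wrote its results but never exited, one can look for the "Final tree written to:" line in the job output. The following is a minimal sketch, not part of ExaML: the function name `examl_wrote_tree` and the log file name in the usage comment are hypothetical, and the marker text is copied from the output quoted above.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the given job log contains the line that
# ExaML printed when it wrote the final tree (text taken from the log above).
examl_wrote_tree() {
    local log_file="$1"
    grep -q "Final tree written to:" "$log_file"
}

# Usage (the log name is a placeholder, not taken from the thread):
#   if examl_wrote_tree slurm-JOBID.out; then
#       echo "final tree was written before the time limit"
#   fi
```

If the marker is present, the search itself completed and only the shutdown of the job step is in question.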



Alexandros Stamatakis

Feb 9, 2018, 6:10:05 AM
to ra...@googlegroups.com
That looks weird. Are you using the latest ExaML version?

https://github.com/stamatak/ExaML/releases

Can you access and read the tree file?

Alexis

--
Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology

www.exelixis-lab.org

Christian Rinke

Feb 10, 2018, 10:46:28 PM
to ra...@googlegroups.com
Thanks Alexis,
It's 3.0.17, loaded with:
module load ExaML/3.0.17
That's the latest installed version.
Cheers,
Chris


Christian Rinke

Feb 11, 2018, 5:37:38 PM
to raxml
And yes, the tree files are created and are accessible; the trees look fine.
Cheers, Chris

Alexandros Stamatakis

Feb 12, 2018, 6:10:06 AM
to ra...@googlegroups.com
Okay, then you should probably update to v3.0.20. As far as I remember, we
recently fixed an MPI exit issue: the trees were computed correctly, but
the termination of all MPI processes after completion of the ML part was
not implemented correctly. So you can safely use those trees, but you
should update ExaML at some point.

Alexis
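
Whether an installed ExaML predates 3.0.20 can be checked with a simple version comparison. This is a minimal sketch under stated assumptions: the helper `version_ge` is hypothetical, it relies on GNU `sort -V`, and the version strings are typed in by hand (for example, read off the loaded module name) rather than queried from ExaML.

```shell
#!/bin/bash
# Hypothetical helper: succeed if version $1 is greater than or equal to $2.
# Uses GNU sort's -V (version sort); the smaller version sorts first.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

installed="3.0.17"   # assumption: taken from the module name ExaML/3.0.17
required="3.0.20"    # the release suggested in this thread

if version_ge "$installed" "$required"; then
    echo "ExaML $installed is at or above $required"
else
    echo "ExaML $installed predates $required; consider updating"
fi
```

With the values shown, the second branch runs, because 3.0.17 sorts before 3.0.20.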


Christian Rinke

Feb 13, 2018, 12:36:29 AM
to raxml
Thanks, I'll update to the latest version and see how it goes.
-Chris

Christian Rinke

Feb 14, 2018, 2:20:16 AM
to raxml
Here is the happy ending to this thread: updating to 3.0.20 fixed the problem.
Cheers, Chris


Alexandros Stamatakis

Feb 14, 2018, 2:51:30 AM
to ra...@googlegroups.com
great :-)

alexis


Karen

Feb 16, 2018, 12:56:14 PM
to raxml
Indeed, I had the same problem with 3.0.17 and 3.0.19. Thanks for fixing this!
Best, Karen