ExaML 3.0.19 does not properly finish


Karen

Jan 23, 2018, 6:08:46 AM
to raxml
Hi all,

I am running quartet analyses with predefined groups. After a crash with 3.0.17 (I could not restart it properly from a checkpoint), I repeated the analysis with 3.0.19. Again, since the walltime was not sufficient, I had to restart it from a checkpoint. The restart ran, but failed again after ca. 6 hours with ExaML 3.0.19_gcc.

The last lines of the ExaML_info file say:

Overall quartet computation time: 17455.538698 secs
All quartets and corresponding likelihoods written to file ExaML_quartets.aln

1) So it actually finished! But why the error message, then? At least the number of quartets is what it should be (954750 quartets).
2) The end of the job log file looks weird to me, see below.

Any help is appreciated. I have never experienced this bug before, and the analyses are quite urgent (resubmission).
For now I consider this analysis finished, hopefully. I have already contacted the admins of the cluster, but they would prefer to get hints from the developers first before rerunning and trying more options.

Thanks in advance, Karen

>
> ###
> from slurm-12490312.out, (used 200 CPUs)
> ...
> command: ExaML was called as follows
> examl-AVX -f q -Y group.txt -m GAMMA -s aln.binary -t RAxML_parsimonyTree.starttree -R ExaML_binaryCheckpoint.aln_598 -n XX_restart -p 2153
>
> ...
>
> Printing checkpoint after 17420.032806 seconds of run-time
> [c159:52265:0] Caught signal 11 (Segmentation fault)
> [c159:52266:0] Caught signal 11 (Segmentation fault)
> [c159:52267:0] Caught signal 11 (Segmentation fault)
> [c159:52270:0] Caught signal 11 (Segmentation fault)
> [c159:52272:0] Caught signal 11 (Segmentation fault)
> [c159:52274:0] Caught signal 11 (Segmentation fault)
> [c159:52276:0] Caught signal 11 (Segmentation fault)
> [c159:52278:0] Caught signal 11 (Segmentation fault)
> [c159:52280:0] Caught signal 11 (Segmentation fault)
> [c159:52282:0] Caught signal 11 (Segmentation fault)
> [c159:52284:0] Caught signal 11 (Segmentation fault)
> [c159:52286:0] Caught signal 11 (Segmentation fault)
> [c159:52288:0] Caught signal 11 (Segmentation fault)
> [c159:52290:0] Caught signal 11 (Segmentation fault)
> [c159:52292:0] Caught signal 11 (Segmentation fault)
> [c159:52294:0] Caught signal 11 (Segmentation fault)
> [c159:52296:0] Caught signal 11 (Segmentation fault)
> [c159:52298:0] Caught signal 11 (Segmentation fault)
> [c159:52300:0] Caught signal 11 (Segmentation fault)
>
> Computed all 954750 possible grouped quartets
> [c315:4076 :0] Caught signal 11 (Segmentation fault)
> [c329:10123:0] Caught signal 11 (Segmentation fault)
> [c327:116893:0] Caught signal 11 (Segmentation fault)
> [c305:74933:0] Caught signal 11 (Segmentation fault)
> [...]
>
> Overall quartet computation time: 17455.538698 secs
> [c327:116868:0] Caught signal 11 (Segmentation fault)
> [c305:74947:0] Caught signal 11 (Segmentation fault)
> [c356:123817:0] Caught signal 11 (Segmentation fault)
> [c376:87709:0] Caught signal 11 (Segmentation fault)
> [c368:139720:0] Caught signal 11 (Segmentation fault)
> [c315:4090 :0] Caught signal 11 (Segmentation fault)
> [c166:47781:0] Caught signal 11 (Segmentation fault)
> [c323:21460:0] Caught signal 11 (Segmentation fault)
> [c329:10099:0] Caught signal 11 (Segmentation fault)
> [c166:47783:0] Caught signal 11 (Segmentation fault)
> [...]
> All quartets and corresponding likelihoods written to file ExaML_quartets.aln
> [c327:116873:0] Caught signal 11 (Segmentation fault)
> [c305:74953:0] Caught signal 11 (Segmentation fault)
> [...]
>
> This is then followed, very often, by:
> ==== backtrace ====
>  2 0x00000000000ff9ac mxm_handle_error()  /var/tmp/OFED_topdir/BUILD/mxm-3.5.3093/src/mxm/util/debug/debug.c:641
>  3 0x00000000000ffefc mxm_error_signal_handler()  /var/tmp/OFED_topdir/BUILD/mxm-3.5.3093/src/mxm/util/debug/debug.c:616
>  4 0x0000000000034950 killpg()  ??:0
>  5 0x0000000000068834 _IO_new_fclose()  ??:0
>  6 0x000000000045fabe computeQuartets()  ??:0
>  7 0x0000000000403f9a main()  ??:0
>  8 0x00000000000206e5 __libc_start_main()  ??:0
>  9 0x0000000000404539 _start()  /home/abuild/rpmbuild/BUILD/glibc-2.22/csu/../sysdeps/x86_64/start.S:118
> ===================
> ==== backtrace ====
>  2 0x00000000000ff9ac mxm_handle_error()  /var/tmp/OFED_topdir/BUILD/mxm-3.5.3093/src/mxm/util/debug/debug.c:641
>  3 0x00000000000ffefc mxm_error_signal_handler()  /var/tmp/OFED_topdir/BUILD/mxm-3.5.3093/src/mxm/util/debug/debug.c:616
>  4 0x0000000000034950 killpg()  ??:0
>  5 0x0000000000068834 _IO_new_fclose()  ??:0
>  6 0x000000000045fabe computeQuartets()  ??:0
>  7 0x0000000000403f9a main()  ??:0
>  8 0x00000000000206e5 __libc_start_main()  ??:0
>  9 0x0000000000404539 _start()  /home/abuild/rpmbuild/BUILD/glibc-2.22/csu/../sysdeps/x86_64/start.S:118
> ===================
>
> ...
>
> ==== backtrace ====
>  2 0x00000000000ff9ac mxm_handle_error()  /var/tmp/OFED_topdir/BUILD/mxm-3.5.3093/src/mxm/util/debug/debug.c:641
>  3 0x00000000000ffefc mxm_error_signal_handler()  /var/tmp/OFED_topdir/BUILD/mxm-3.5.3093/src/mxm/util/debug/debug.c:616
>  4 0x0000000000034950 killpg()  ??:0
>  5 0x0000000000068834 _IO_new_fclose()  ??:0
>  6 0x000000000045fabe computeQuartets()  ??:0
>  7 0x0000000000403f9a main()  ??:0
>  8 0x00000000000206e5 __libc_start_main()  ??:0
>  9 0x0000000000404539 _start()  /home/abuild/rpmbuild/BUILD/glibc-2.22/csu/../sysdeps/x86_64/start.S:118
> ===================
> --------------------------------------------------------------------------
> WARNING: A process refused to die despite all the efforts!
> This process may still be running and/or consuming resources.
>
> Host: c159
> PID:  52265
>
> --------------------------------------------------------------------------
> [c159:52239] 7 more processes have sent help message help-orte-odls-base.txt / orte-odls-base:could-not-kill
> [c159:52239] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> --------------------------------------------------------------------------
> mpirun noticed that process rank 56 with PID 0 on node c305 exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
>
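
For anyone triaging a similar job log, here is a small Python sketch that checks whether the two completion messages appear and tallies the segfault lines per host. The function name `summarize_log` and the regular expressions are assumptions derived only from the log lines quoted above; nothing here is part of ExaML or Open MPI.

```python
import re
from collections import Counter

# Patterns modelled on the log lines quoted in this thread (assumed, not official).
SIGNAL_RE = re.compile(r"\[([\w.-]+):\s*(\d+)\s*:\d+\] Caught signal (\d+)")
COMPUTED_RE = re.compile(r"Computed all (\d+) possible grouped quartets")
WRITTEN_RE = re.compile(
    r"All quartets and corresponding likelihoods written to file (\S+)"
)


def summarize_log(lines):
    """Scan log lines and report completion messages and crashes per host."""
    crashes = Counter()   # host name -> number of "Caught signal" lines
    computed = None       # quartet count from the "Computed all ..." line
    written_to = None     # output file name from the "written to file" line
    for line in lines:
        m = SIGNAL_RE.search(line)
        if m:
            crashes[m.group(1)] += 1
            continue
        m = COMPUTED_RE.search(line)
        if m:
            computed = int(m.group(1))
            continue
        m = WRITTEN_RE.search(line)
        if m:
            written_to = m.group(1)
    return {
        "computed": computed,
        "written_to": written_to,
        "crashes_by_host": dict(crashes),
    }
```

It can be called on an open file, for example `summarize_log(open("slurm-12490312.out"))`. If both `computed` and `written_to` are set, the run printed its final messages before the processes died.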

Alexey Kozlov

Jan 23, 2018, 8:04:05 AM
to ra...@googlegroups.com
Hi Karen,

This error should not affect the results in any way (I figured out why it happens). Please just double-check that the output file contains all quartets; if it does, everything is all right.

Best,
Alexey
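
A minimal sketch of that double-check in Python, assuming each result line in the quartet file has the form `a b | c d: lnL` with integer taxon indices. That layout is an assumption here, so adjust the pattern if your file differs. `count_quartets` is a hypothetical helper, not part of ExaML. It counts distinct four-taxon sets, so it does not matter whether a quartet is reported on one line or once per topology.

```python
import re

# Assumed layout of a result line: "a b | c d: lnL" with integer taxon indices.
QUARTET_RE = re.compile(r"^\s*(\d+)\s+(\d+)\s*\|\s*(\d+)\s+(\d+)\s*:")


def count_quartets(lines):
    """Return the number of distinct four-taxon sets found in the given lines."""
    seen = set()
    for line in lines:
        m = QUARTET_RE.match(line)
        if m:
            # Sort the indices so all topologies of one quartet share a key.
            seen.add(tuple(sorted(int(g) for g in m.groups())))
    return len(seen)
```

Run it as `count_quartets(open("ExaML_quartets.aln"))` and compare the result with the number reported in the job log (954750 in this run).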

Karen

Jan 23, 2018, 9:22:01 AM
to raxml
Hi Alexey,

Thanks for your reply. Yes, it does contain all quartets, thanks! Maybe this can be fixed in a future release, so that the job itself does not crash at the end?
Thanks for considering!

Best, Karen

Alexey Kozlov

Jan 23, 2018, 9:24:22 AM
to ra...@googlegroups.com
Sure, the fix is trivial and will be included in the next release. Thanks for reporting!