[maker-devel] Maker-Error when started with OpenMPI


Rainer Rutka

Jan 27, 2017, 5:37:00 AM
to maker...@yandell-lab.org
Hi everybody.

My name is Rainer. I am an administrator for our HPC systems at our
university in Konstanz, Baden-Württemberg/Germany.
The project is called bwHPC-C5.

See: https://www.bwhpc-c5.de/en/index.php

I have been trying to get Maker running on our bwUniCluster for weeks.
Unfortunately, I get errors while running a Maker job in the MPI environment.

BUILD STATUS

==============================================================================
STATUS MAKER v2.31.9
==============================================================================
PERL Dependencies: VERIFIED
External Programs: VERIFIED
External C Libraries: VERIFIED
MPI SUPPORT: ENABLED
MWAS Web Interface: DISABLED
MAKER PACKAGE: CONFIGURATION OK

MODULES / INCLUDES / COMPILERS

# knbw03 20170117 r.rutka Initial revision knbw02 of module version 2.31.9
#
##### (B) Dependencies:
#
# conflict: any other maker version
# module load compiler/gnu/5.2
# module load mpi/openmpi/2.0-gnu-5.2
[...]

MPI/MOAB SUBMIT

[...]
### Queues ###
#MSUB -q fat
#MSUB -l nodes=1:ppn=16
#MSUB -l mem=20gb
#MSUB -l walltime=50:00:00
#
[...]
echo " "
echo "### Loading MAKER module:"
echo " "
module load bio/maker/2.31.9
[ "$MAKER_VERSION" ] || { echo "ERROR: Failed to load module
'bio/maker/2.31.9'."; exit 1; }
echo "MAKER_VERSION = $MAKER_VERSION"
module list
[...]
echo " "
echo "### Runing Maker example"
echo " "
export LD_PRELOAD=${MPI_LIB_DIR}/libmpi.so
export OMPI_MCA_mpi_warn_on_fork=0

echo "LD_PRELOAD=${LD_PRELOAD}"
#
# "STATUS: Processing and indexing input FASTA files..."
#
mpiexec -mca btl ^openib -n 16 maker
[...]


ERRORS
[...]
LD_PRELOAD=/opt/bwhpc/common/mpi/openmpi/2.0.1-gnu-5.2/lib/libmpi.so
STATUS: Parsing control files...
STATUS: Processing and indexing input FASTA files...
[uc1n338:113607] *** Process received signal ***
[uc1n338:113607] Signal: Segmentation fault (11)
[uc1n338:113607] Signal code: Address not mapped (1)
[uc1n338:113607] Failing at address: 0x4b0
[uc1n338:113608] *** Process received signal ***
[uc1n338:113608] Signal: Segmentation fault (11)
[uc1n338:113608] Signal code: Address not mapped (1)
[uc1n338:113608] Failing at address: 0x4b0
[uc1n338:113621] *** Process received signal ***
[uc1n338:113621] Signal: Segmentation fault (11)
[uc1n338:113621] Signal code: Address not mapped (1)
[uc1n338:113621] Failing at address: 0x4b0
--------------------------------------------------------------------------
mpiexec noticed that process rank 2 with PID 113608 on node uc1n338
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
[...]

WHAT'S WRONG HERE!?

Thank you for your help!

All the best,

Rainer

--
Rainer Rutka
University of Konstanz
Communication, Information, Media Centre (KIM)
* High-Performance-Computing (HPC)
* KIM-Support and -Base-Services
Room: V511
78457 Konstanz, Germany
+49 7531 88-5413

Carson Holt

Jan 28, 2017, 3:59:06 PM
to Rainer Rutka, maker...@yandell-lab.org
Try adding one of the following to your mpiexec command:

1. --mca btl ^openib
2. --mca btl vader,tcp,self --mca btl_tcp_if_include ib0
3. --mca btl vader,tcp,self --mca btl_tcp_if_include eth0

One or the other may fix your issue. The first causes OpenMPI not to use the InfiniBand transport (the InfiniBand libraries use registered memory in a way that causes system calls to segfault) and will usually force communication over another adapter. The second still uses the InfiniBand adapter, but runs TCP over InfiniBand, a way to indirectly bypass the problem-causing libraries. The third specifically forces the use of the Ethernet adapter instead of the InfiniBand adapter.
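For example, plugged into the job script from your first message, the three variants would look like this (a sketch only; ib0 and eth0 are typical interface names, so check what your nodes actually expose, e.g. with ip addr):

# Option 1: disable the InfiniBand (openib) transport entirely
mpiexec --mca btl ^openib -n 16 maker

# Option 2: TCP over the InfiniBand interface (IPoIB)
mpiexec --mca btl vader,tcp,self --mca btl_tcp_if_include ib0 -n 16 maker

# Option 3: force the Ethernet adapter
mpiexec --mca btl vader,tcp,self --mca btl_tcp_if_include eth0 -n 16 maker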

--Carson



_______________________________________________
maker-devel mailing list
maker...@box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Rainer Rutka

Jan 30, 2017, 3:37:32 AM
to Carson Holt, maker...@yandell-lab.org
Hi Carson!

Thank you VERY MUCH for your hints.

Much appreciated!

I'll test these today and let you know about the results.

Again: THANKS! :-)

BTW: I'm not a scientist, only a system operator.

:-)

On 28.01.2017 at 21:53, Carson Holt wrote:
> [...]

Rainer Rutka

Feb 16, 2017, 5:49:58 AM
to Carson Holt, maker...@yandell-lab.org
Hi!

Unfortunately, all of the options failed on our cluster.

See:


Most recent Maker test with
--mca btl vader,tcp,self --mca btl_tcp_if_include eth0
Error:
--> rank=2, hostname=uc1n518.localdomain
[uc1n518:67009] *** Process received signal ***
[uc1n518:67009] Signal: Segmentation fault (11)
[uc1n518:67009] Signal code: Address not mapped (1)
[uc1n518:67009] Failing at address: 0x4b0
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 67009 on node uc1n518
exited on signal 11 (Segmentation fault).


With:
--mca btl ^openib
and also with
--mca btl vader,tcp,self --mca btl_tcp_if_include ib0
Error:
### Running Maker example

STATUS: Parsing control files...
STATUS: Processing and indexing input FASTA files...
[uc1n514:59985] *** Process received signal ***
[uc1n514:59985] Signal: Segmentation fault (11)
[uc1n514:59985] Signal code: Address not mapped (1)
[uc1n514:59985] Failing at address: 0x4b0
--------------------------------------------------------------------------
mpiexec noticed that process rank 10 with PID 59985 on node uc1n514
exited on signal 11 (Segmentation fault).


--
Rainer Rutka
University of Konstanz
Communication, Information, Media Centre (KIM)
Room: V511, Tel: 54 13

Carson Holt

Feb 20, 2017, 12:49:09 AM
to Rainer Rutka, maker...@yandell-lab.org
Try running just on a single node (not across nodes). If it still fails, you might need to try installing an updated OpenMPI version and then reinstalling and running MAKER with that new version. You can install it in your home directory and test from there; just make sure to add it to your path (a rough sketch follows below).
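Something like this, as a sketch only (the version number, download URL, and paths are examples; adjust them to the tarball you actually download, and answer yes to MPI support when MAKER's Build.PL asks):

# build OpenMPI into your home directory (2.0.2 is just an example version)
wget https://www.open-mpi.org/software/ompi/v2.0/downloads/openmpi-2.0.2.tar.gz
tar xzf openmpi-2.0.2.tar.gz && cd openmpi-2.0.2
./configure --prefix=$HOME/openmpi
make -j4 && make install

# make the new MPI visible before rebuilding MAKER
export PATH=$HOME/openmpi/bin:$PATH
export LD_LIBRARY_PATH=$HOME/openmpi/lib:$LD_LIBRARY_PATH

# rebuild MAKER against the new MPI (standard MAKER source layout assumed)
cd maker/src
perl Build.PL && ./Build install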

Alternatively, MPICH3 or Intel MPI (with some extra configuration for Intel MPI) can be used. If you decide to try Intel MPI, let me know, and I can provide you with the configuration info.

—Carson

Rainer Rutka

Feb 22, 2017, 8:16:52 AM
to Carson Holt, maker...@yandell-lab.org, Robert Kraus
@Robert Kraus: FYI

On 20.02.2017 at 06:43, Carson Holt wrote:
> Try running just on a single node (not across nodes).
THAT'S WHAT I DID.


> If it still fails, you might need to try installing an updated OpenMPI version then reinstalling and
> running MAKER with that new version. You can install it in your home directory and test from there,
> just make sure to add it to your path.
Sure, it is.

> Alternatively MPICH3 and IntelMPI (with some extra configuration for IntelMPI) can be used.
> If you decide to try Intel MPI let me know, and I can provide you with the info on configuration.

OK, please send the info.

Thank you very much!

Carson Holt

Feb 22, 2017, 11:22:17 AM
to Rainer Rutka, maker...@yandell-lab.org, Robert Kraus
If OpenMPI fails on a single node, it means you have a compilation issue, which indicates a problem with your installation. This sometimes happens if you compiled on one node and ran on another (it could be either MAKER or OpenMPI itself that was compiled on another node).


A few options you will need if trying Intel MPI:
-binding pin=disable   # required to disable processor affinity (otherwise MAKER's calls to BLAST and other programs that are parallelized independently of MPI may not work)

Environment variables to set (a consolidated launch sketch follows below):
export I_MPI_PIN_DOMAIN=node   # otherwise MAKER's calls to BLAST and other programs that are parallelized independently of MPI may not work
export I_MPI_FABRICS='shm:tcp'   # avoid potential complications with the OpenFabrics libraries (they block system calls because of how they use registered memory, i.e. MAKER calling BLAST would fail)
export I_MPI_HYDRA_IFACE=ib0   # set to eth0 if you don't have an InfiniBand-over-IP interface (required because of the above I_MPI_FABRICS change)
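
Put together, a MAKER launch under Intel MPI might look like this sketch (the process count and interface are placeholders taken from your earlier script, not a recommendation):

export I_MPI_PIN_DOMAIN=node
export I_MPI_FABRICS='shm:tcp'
export I_MPI_HYDRA_IFACE=ib0    # or eth0 if there is no IPoIB interface

mpiexec -binding pin=disable -n 16 maker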

Also make sure to compile on the node where you will run. You can try expanding to other nodes after that.

—Carson
