ULFM most recent version for high performance?

Daniel Torres

Jan 29, 2021, 9:53:33 AM
to User Level Fault Mitigation

Hi everyone.

Recently I have been trying to update my ULFM installation in order to run high-performance tests on a cluster.
Previously, I was using the implementation in the "ulfm" branch, with an external LIBEVENT 2.1.12 and OpenPMIX 3.2.

As far as I know, the "export/ulfm-to-ompi5-expanded" branch is the newest one (still under active development).
I'm trying to install it on my machine but I'm getting some errors. These are the steps I follow:

CLONING OK
--------------------------------------------------------------------------------------------------------------------
git clone --recursive -b export/ulfm-to-ompi5-expanded https://bitbucket.org/icldistcomp/ulfm2.git
--------------------------------------------------------------------------------------------------------------------

AUTOGEN OK
--------------------------------------------------------------------------------------------------------------------
mv ulfm2 ulfm-to-ompi5-expanded
cd ulfm-to-ompi5-expanded
./autogen.pl

    Open MPI configuration:
    -----------------------
    Version: 5.0.0a1
    Build MPI C bindings: yes
    Build MPI Fortran bindings: mpif.h, use mpi, use mpi_f08
    Build MPI Java bindings (experimental): no
    Build Open SHMEM support: false (no spml)
    Debug build: no
    Platform file: (none)

    Miscellaneous
    -----------------------
    CUDA support: no
    Fault Tolerance support: mpi
    hwloc: internal
    libevent: internal
    pmix: internal
    PRRTE: yes
    Threading Package: pthreads
--------------------------------------------------------------------------------------------------------------------

CONFIGURING AND INSTALLING OK
--------------------------------------------------------------------------------------------------------------------
./configure --with-ft=mpi --prefix=$HOME/ULFM2 --enable-mpi-ext=ftmpi --disable-man-pages
make all install
make check
make clean
cd ..

export TOOLS=$HOME/ULFM2
export PATH=$TOOLS/bin:$PATH
export LD_LIBRARY_PATH=$TOOLS/lib:$LD_LIBRARY_PATH
--------------------------------------------------------------------------------------------------------------------

Up to this point everything works (even the "make check" step), but when I try to execute a test code I get the following error:

--------------------------------------------------------------------------------------------------------------------
[Daniel-Lap:00000] *** An error occurred in MPI_Comm_set_errhandler
[Daniel-Lap:00000] *** reported by process [638058497,2]
[Daniel-Lap:00000] *** on communicator MPI_COMM_SELF
[Daniel-Lap:00000] *** MPI_ERR_COMM: invalid communicator
[Daniel-Lap:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[Daniel-Lap:00000] ***    and MPI will try to terminate your MPI job as well)

[Daniel-Lap:00000] *** An error occurred in MPI_Comm_set_errhandler
[Daniel-Lap:00000] *** reported by process [638058497,1]
[Daniel-Lap:00000] *** on communicator MPI_COMM_SELF
[Daniel-Lap:00000] *** MPI_ERR_COMM: invalid communicator
[Daniel-Lap:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[Daniel-Lap:00000] ***    and MPI will try to terminate your MPI job as well)

[Daniel-Lap:00000] *** An error occurred in MPI_Comm_set_errhandler
[Daniel-Lap:00000] *** reported by process [638058497,0]
[Daniel-Lap:00000] *** on communicator MPI_COMM_SELF
[Daniel-Lap:00000] *** MPI_ERR_COMM: invalid communicator
[Daniel-Lap:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[Daniel-Lap:00000] ***    and MPI will try to terminate your MPI job as well)

[Daniel-Lap:00000] *** An error occurred in MPI_Comm_set_errhandler
[Daniel-Lap:00000] *** reported by process [638058497,3]
[Daniel-Lap:00000] *** on communicator MPI_COMM_SELF
[Daniel-Lap:00000] *** MPI_ERR_COMM: invalid communicator
[Daniel-Lap:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[Daniel-Lap:00000] ***    and MPI will try to terminate your MPI job as well)

[Daniel-Lap:235979] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2870
--------------------------------------------------------------------------------------------------------------------

From this, it looked like PMIX might be the source of the trouble, so I decided to try an external PMIX 4.0.0.
I installed it just as I had done for my previous ULFM installation, together with LIBEVENT 2.1.12, and added the corresponding environment variables.

--------------------------------------------------------------------------------------------------------------------
export LIBE=$HOME/LIBEVENT/
export PATH=$LIBE/bin:$PATH
export LD_LIBRARY_PATH=$LIBE/lib:$LD_LIBRARY_PATH

export PMIX=$HOME/PMIX/
export PATH=$PMIX/bin:$PATH
export LD_LIBRARY_PATH=$PMIX/lib:$LD_LIBRARY_PATH
--------------------------------------------------------------------------------------------------------------------
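
For reference, I built both libraries as standard autotools packages, roughly like this (directory names are from the release tarballs):

--------------------------------------------------------------------------------------------------------------------
cd libevent-2.1.12-stable
./configure --prefix=$HOME/LIBEVENT
make all install

cd ../pmix-4.0.0
./configure --prefix=$HOME/PMIX --with-libevent=$HOME/LIBEVENT
make all install
--------------------------------------------------------------------------------------------------------------------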

This time, the configuration step failed with the following error:
--------------------------------------------------------------------------------------------------------------------
./configure --with-ft=mpi --prefix=$HOME/ULFM2 --enable-mpi-ext=ftmpi --disable-man-pages --with-libevent=$HOME/LIBEVENT/ --with-pmix=$HOME/PMIX/

    *****************************************************************************
     THIS IS A DEBUG BUILD!  DO NOT USE THIS BUILD FOR PERFORMANCE MEASUREMENTS!
    *****************************************************************************

    configure: ===== done with 3rd-party/openpmix configure =====
    configure: error: Building against an external PMIx with an internal Libevent or HWLOC is unsupported.  Cannot continue.
--------------------------------------------------------------------------------------------------------------------

So I decided to also install an external HWLOC, and added the corresponding environment variables:

--------------------------------------------------------------------------------------------------------------------
export HWL=$HOME/HWLOC/
export PATH=$HWL/bin:$PATH
export LD_LIBRARY_PATH=$HWL/lib:$LD_LIBRARY_PATH
--------------------------------------------------------------------------------------------------------------------

Now, the configuration step ends fine:

--------------------------------------------------------------------------------------------------------------------
./configure --with-ft=mpi --prefix=$HOME/ULFM2 --enable-mpi-ext=ftmpi --disable-man-pages --with-hwloc=$HOME/HWLOC/ --with-libevent=$HOME/LIBEVENT/ --with-pmix=$HOME/PMIX/

    Miscellaneous
    -----------------------
    CUDA support: no
    Fault Tolerance support: mpi
    hwloc: external
    libevent: external
    pmix: external
    PRRTE: yes
    Threading Package: pthreads
--------------------------------------------------------------------------------------------------------------------

Finally, when I try to execute my test again, it simply does nothing: no output at all, even if I add the "--mca btl_base_verbose 99" flag.

If I execute the same code with a clean Open MPI 4.1.0 installation, everything works fine: it prints the expected messages and finishes correctly.

Do you know what could be wrong with my installation process?

Also, during the configuration step the message "THIS IS A DEBUG BUILD!  DO NOT USE THIS BUILD FOR PERFORMANCE MEASUREMENTS!" was shown.
If I want to use the most recent ULFM version for measurement purposes, which version should I use? And would you recommend internal or external HWLOC, LIBEVENT, and PMIX?

Thanks for your attention and help.

EXTRA INFO
--------------------------------------------------------------------------
The command line I use to compile is:
mpicc -g -O3 test.c -o test -lm

The command line I use to execute is:
mpiexec -np 4 --machinefile hostfile --mca btl_base_verbose 99 ./test 4096 4096

My machine is:
Linux Daniel-Lap 5.4.0-65-generic #73-Ubuntu SMP Mon Jan 18 17:25:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
--------------------------------------------------------------------------

Aurelien Bouteiller

Feb 10, 2021, 3:12:55 AM
to User Level Fault Mitigation
Hello Daniel, 

The good news is that the ulfm-expanded branch has been merged into Open MPI master, so I now recommend switching over to the Open MPI master branch.

The procedure for installing what is essentially a preview of the next release should be straightforward: ULFM is compiled-in by default, so just do the normal autogen/configure/make install sequence, no extra flags needed for FT.
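
A minimal sketch of that sequence, assuming you build from the GitHub master branch (the install prefix is only an example):

```
git clone --recursive https://github.com/open-mpi/ompi.git
cd ompi
./autogen.pl
./configure --prefix=$HOME/OMPI-MASTER
make -j 8 all install
export PATH=$HOME/OMPI-MASTER/bin:$PATH
export LD_LIBRARY_PATH=$HOME/OMPI-MASTER/lib:$LD_LIBRARY_PATH
```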

Internal everything is how we test it, and it is now the recommended setup for running with FT (using external components could pull in older versions that do not have full FT support).

I recommend you start from a fresh clone and install directories. Leftovers from prior compilations are known to cause issues.

To run your FT code, you will need to launch with `mpiexec --with-ft=mpi` and then pass the rest of your options as usual.
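
For example, adapting the command line from your message:

```
mpiexec --with-ft=mpi -np 4 --machinefile hostfile ./test 4096 4096
```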


~~~ 

Hopefully that will resolve your original issue altogether. If not, the error message indicates that you have called MPI_COMM_SET_ERRHANDLER on MPI_COMM_NULL, probably obtained from MPI_COMM_GET_PARENT or from MPI_COMM_SPAWN. The most common reason for spawn to fail is running out of free slots in your allocation. This can be resolved by passing the `--map-by :oversubscribe` mpiexec argument, or by providing explicit host mapping in the INFO argument of MPI_COMM_SPAWN (the example below provides a custom hostfile; you can also use the `host` key to select a list of spawn hosts; see man mpiexec).

```
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    MPI_Comm parentcomm, intercomm;
    MPI_Info info;
    char hostName[MPI_MAX_PROCESSOR_NAME];
    int rank, size, len, processesToRun;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(hostName, &len);
    MPI_Comm_get_parent(&parentcomm); /* MPI_COMM_NULL unless we were spawned */
    if (parentcomm == MPI_COMM_NULL) {
        if (argc < 2) {
            printf("Processes number needed!\n");
            MPI_Finalize();
            return 0;
        }
        processesToRun = atoi(argv[1]);
        MPI_Info_create(&info);
        /* Map the spawned processes onto the hosts listed in ./hostfile. */
        MPI_Info_set(info, "hostfile", "./hostfile");
        MPI_Info_set(info, "map_by", "node");
        printf("PARENT  h: %s  r/s: %i/%i.\n", hostName, rank, size);
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, processesToRun, info, 0,
                       MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
    } else {
        printf("SPAWN   h: %s  r/s: %i/%i.\n", hostName, rank, size);
    }
    MPI_Finalize();
    return 0;
}
```
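
As for the errhandler call itself, guard it so it never receives MPI_COMM_NULL. A minimal sketch (the MPI_ERRORS_RETURN choice is just an example):

```
MPI_Comm parentcomm;
MPI_Comm_get_parent(&parentcomm); /* MPI_COMM_NULL in the initially launched processes */
if (parentcomm != MPI_COMM_NULL) {
    /* Calling MPI_Comm_set_errhandler on MPI_COMM_NULL raises MPI_ERR_COMM, as in your log. */
    MPI_Comm_set_errhandler(parentcomm, MPI_ERRORS_RETURN);
}
```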

Best,
Aurelien
