Number of processors different from number of subenvironments


Ernesto Lima

Mar 1, 2016, 11:34:26 AM
to QUESO-users mailing list
Hello,

I am learning QUESO, and I am using the gravity example as a test subject.
When I run it with env_numSubEnvironments = 2 and launch it with
mpirun -np 2 ./gravity_gsl gravity_inv_fwd.inp

I have no problems and everything works. However, if I keep env_numSubEnvironments = 2 and instead run with
mpirun -np 4 ./gravity_gsl gravity_inv_fwd.inp

I get a problem with m_inter0Comm. Since I am running the example without any alteration, and the total number of processes in the environment is a multiple of the specified number of subenvironments, I suspect it is a mistake I made when installing.
Does anyone have any idea where I should be looking to solve this problem? (The relevant input-file setting is reproduced below for reference.) I installed
petsc-3.6.3
slepc-3.6.2
libmesh-0.9.5
queso-0.54.0
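
For reference, the subenvironment count is set by a single option in gravity_inv_fwd.inp, roughly as follows (the rest of the file is unchanged from the example):

env_numSubEnvironments = 2

QUESO requires the total number of MPI processes to be a multiple of this value, which both -np 2 and -np 4 satisfy here.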

Thanks a lot!

Ernesto

Error below:

--------------------------------------------------------------------------------------------------------------
QUESO Library: Version = 0.54.0 (5400)

External Release

Build Date   = 2016-01-27 16:34
Build Host   = MSI
Build User   = ernesto
Build Arch   = x86_64-unknown-linux-gnu
Build Rev    = N/A

C++ Config   = mpic++ -std=c++11

Trilinos DIR = 
GSL Libs     = -L/usr/lib -lgsl -lgslcblas -lm
GRVY DIR     = 
GLPK DIR     = 
HDF5 DIR     = 
--------------------------------------------------------------------------------------------------------------
Beginning run at Tue Mar  1 10:20:28 2016

MPI node of worldRank 0 has fullRank 0, belongs to subEnvironment of id 0, and has subRank 0
MPI node of worldRank 0 belongs to sub communicator with full ranks 0 1
MPI node of worldRank 0 also belongs to inter0 communicator with full ranks 0 2, and has inter0Rank 0

MPI node of worldRank 1 has fullRank 1, belongs to subEnvironment of id 0, and has subRank 1
MPI node of worldRank 1 belongs to sub communicator with full ranks 0 1


MPI node of worldRank 2 has fullRank 2, belongs to subEnvironment of id 1, and has subRank 0
MPI node of worldRank 2 belongs to sub communicator with full ranks 2 3
MPI node of worldRank 2 also belongs to inter0 communicator with full ranks 0 2, and has inter0Rank 1

MPI node of worldRank 3 has fullRank 3, belongs to subEnvironment of id 1, and has subRank 1
MPI node of worldRank 3 belongs to sub communicator with full ranks 2 3



Beginning run of 'Gravity + Projectile motion' example at Tue Mar  1 10:20:28 2016

 my fullRank = 0
 my subEnvironmentId = 0
 my subRank = 0
 my interRank = 0

Beginning 'SIP -> Gravity estimation' at Tue Mar  1 10:20:28 2016

*** Warning, this code is deprecated and likely to be removed in future library versions:  core/src/GslMatrix.C, line 429, compiled Jan 27 2016 at 16:35:41 ***
*** Warning, this code is deprecated and likely to be removed in future library versions:  core/src/GslMatrix.C, line 429, compiled Jan 27 2016 at 16:35:41 ***
*** Warning, this code is deprecated and likely to be removed in future library versions:  core/src/GslMatrix.C, line 429, compiled Jan 27 2016 at 16:35:41 ***
*** Warning, this code is deprecated and likely to be removed in future library versions:  core/src/GslMatrix.C, line 429, compiled Jan 27 2016 at 16:35:41 ***
Assertion `m_inter0Comm' failed.
m_inter0Comm variable is NULL
core/src/Environment.C, line 269, compiled Jan 27 2016 at 16:34:24

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD 
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
*** Warning, AutoPtr is deprecated and will be removed in a future library version! ./include/libmesh/auto_ptr.h, line 271, compiled Jan 26 2016 at 11:00:03 ***
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 8138 on
node MSI exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
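
The communicator layout printed at the top of the run is consistent with the assertion: only the sub-rank-0 process of each subenvironment (world ranks 0 and 2) belongs to the inter0 communicator, so on world ranks 1 and 3 the inter0 handle is null, and any code path that uses it unconditionally will trip exactly this check. The following is a minimal plain-MPI sketch (not QUESO's actual implementation; all names are illustrative) of how such a sub/inter0 split is built and why the handle comes back null on non-zero sub ranks:

// sub_inter0_sketch.C: illustrative only, not QUESO source.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  int fullRank = 0, fullSize = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &fullRank);
  MPI_Comm_size(MPI_COMM_WORLD, &fullSize);

  const int numSubEnvironments = 2;              // assumed to divide fullSize
  const int procsPerSub = fullSize / numSubEnvironments;
  const int subId = fullRank / procsPerSub;      // which subenvironment this rank joins

  // Sub communicator: one per subenvironment, grouping its ranks.
  MPI_Comm subComm;
  MPI_Comm_split(MPI_COMM_WORLD, subId, fullRank, &subComm);

  int subRank = 0;
  MPI_Comm_rank(subComm, &subRank);

  // Inter0 communicator: only sub-rank-0 processes participate; every other
  // rank passes MPI_UNDEFINED and receives MPI_COMM_NULL (the analogue of a
  // NULL m_inter0Comm).
  MPI_Comm inter0Comm;
  MPI_Comm_split(MPI_COMM_WORLD,
                 (subRank == 0) ? 0 : MPI_UNDEFINED,
                 fullRank,
                 &inter0Comm);

  std::printf("fullRank %d: subId %d, subRank %d, in inter0: %s\n",
              fullRank, subId, subRank,
              (inter0Comm != MPI_COMM_NULL) ? "yes" : "no");

  if (inter0Comm != MPI_COMM_NULL) MPI_Comm_free(&inter0Comm);
  MPI_Comm_free(&subComm);
  MPI_Finalize();
  return 0;
}

With mpirun -np 4, ranks 1 and 3 report "in inter0: no", mirroring the layout printed above.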




Damon McDougall

Mar 4, 2016, 1:55:15 PM
to Ernesto Lima, QUESO-users mailing list
I'm able to reproduce this. I'll let you know once I get to the bottom of it.


Damon McDougall

Mar 4, 2016, 3:27:02 PM
to Ernesto Lima, QUESO-users mailing list
Hi Ernesto,

I think I know what the problem is. I have a potential fix in my
`ensure_nonNULL_inter0Comm` branch here:
https://github.com/dmcdougall/queso/tree/ensure_nonNULL_inter0Comm

Would you mind trying that out, and checking that you still get the
correct output? If so, I'll write a test for this so it doesn't happen
again.
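
Trying the branch out amounts to roughly the following (a sketch assuming the same autotools workflow as the release tarballs; the install prefix here is hypothetical and the configure flags should be adapted to your setup):

git clone https://github.com/dmcdougall/queso.git
cd queso
git checkout ensure_nonNULL_inter0Comm
./bootstrap    # or ./autogen.sh, whichever script the checkout provides
./configure --prefix=$HOME/queso-inter0comm-test    # plus your usual --with-gsl / --with-boost flags
make
make install

Then rebuild the gravity example against that prefix and rerun the two mpirun commands from the first message.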

Best wishes,
Damon

--
Damon McDougall
http://www.damon-is-a-geek.com
Institute for Computational Engineering and Sciences
201 E. 24th St.
Stop C0200
The University of Texas at Austin
Austin, TX 78712-1229

Ernesto Lima

Mar 4, 2016, 5:59:22 PM
to QUESO-users mailing list, ernesto...@gmail.com
Hello Damon,

Thank you for the fix.
I just ran the gravity example to test it.
I tried 2 subenvironments with 2 processors, and also 2 subenvironments with 4 processors.
The first chain was the same in both cases, and the second chain was different only because of the way the seed works (MPI RANK+z). So everything is working.
I haven't tried an actual parallel forward problem yet; after I test it I will post here.
I am trying to get a parallel libMesh forward problem to work with QUESO with multiple subenvironments.

Thanks a lot!

Best,

Ernesto

Damon McDougall

Apr 13, 2016, 6:24:29 PM
to Ernesto Lima, QUESO-users mailing list
Ok great. The patch I wrote was merged into the v0.55.0-release branch
(with a test), so you should see the fix in the next QUESO release.

Thanks for the bug report.
