Sockets vs MPI


Chatura Atapattu

Dec 12, 2010, 1:54:40 PM
to coms4995pppp
So I know people have had problems with all kinds of issues.  I'm about to bring some new ones into the mix.


With sockets, my program compiles, and if I run with "-N2 -n16", it reports 16 places and runs as expected with a fork/join model.  However, it's really slow, even for basic async-and-print-asyncID code.

With lapi, my program compiles fine, but I get the following error:

[cpa2116@athos src]$ salloc -n16 srun.lapi ./BodySystem.lapi
salloc: Granted job allocation 67539
>> poe ./BodySystem.lapi -hostfile hosts.67539 -procs 16 -msg_api lapi
ERROR: 0031-212  pmd: node porthos.watson.ibm.com: user cpa2116 denied from access from host athos.watson.ibm.com
ERROR: 0031-024  porthos.watson.ibm.com: no response; rc = -1
salloc: Relinquishing job allocation 67539

With mpi, my program compiles fine, and if run with "-N2 -n16", the whole program runs 16 times, each copy reporting one place.  Is this the expected behavior?  If each has a placeID of 1, how can we index into different parts of an array, etc.?

I tried to run "ldd" on my program, but cannot due to permission issues.

Any help or suggestions are welcome.

Thanks,

- Chat

Chatura Atapattu

Dec 12, 2010, 2:00:32 PM
to coms4995pppp
For reference, this is my code I'm using to test this:

val numPlaces:Int = Place.MAX_PLACES;

clocked finish for ([placeID] in 0..numPlaces-1) {
    clocked async at (Place.place(placeID)) {
        val placeChunk = numBodies/numPlaces;
        val placeStart = placeChunk * placeID;
        val placeEnd = (placeID == (numPlaces - 1)) ? numBodies - 1 : placeStart + placeChunk - 1;
        Console.OUT.printf("Place %d: %d to %d\n", placeID, placeStart, placeEnd);
        next;
        Console.OUT.printf("Place %d returning\n", placeID);
    }
}
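A quick way to convince yourself that the chunking arithmetic above tiles the whole body array with no gaps or overlaps is a small sketch (Python here for brevity; the numBodies/numPlaces values are arbitrary placeholders, and chunk_bounds is just a mirror of the X10 expressions):

```python
# Mirror of the X10 chunking logic: each place gets a contiguous block,
# with the last place absorbing any remainder from the integer division.
def chunk_bounds(place_id, num_places, num_bodies):
    place_chunk = num_bodies // num_places
    place_start = place_chunk * place_id
    place_end = (num_bodies - 1 if place_id == num_places - 1
                 else place_start + place_chunk - 1)
    return place_start, place_end

# Check that the chunks cover 0..num_bodies-1 exactly once.
num_places, num_bodies = 4, 10
covered = []
for p in range(num_places):
    start, end = chunk_bounds(p, num_places, num_bodies)
    covered.extend(range(start, end + 1))
assert covered == list(range(num_bodies))
```

Note this only works as intended if every place sees a distinct placeID, which is exactly what breaks when each MPI rank runs as an independent copy with one place.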

Mashooq Muhaimen

Dec 12, 2010, 2:03:44 PM
to coms49...@googlegroups.com
I get the exact same errors with lapi and the exact same behavior with mpi, even when I am not using parallel constructs, so it's unlikely to be the code.  With regard to MPI, it's probably some env variable issue.
- Mashooq

John Gallagher

Dec 12, 2010, 2:28:30 PM
to coms49...@googlegroups.com
I get the feeling that X10LAUNCHER_NPROCS doesn't work for mpi. I
looked at the source code for the runtime. I think we would need to
actually have mpirun handle it for us (as is suggested in some
places on the x10 site), like this:

salloc --cpus-per-task=8 --ntasks=2 /opt/openmpi-1.4/bin/mpirun -n 2
-report-bindings /test.mpi

However, when mpirun tries to allocate the resources, it can't, maybe
because slurm is already consuming them?

--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------

Martha Kim

Dec 12, 2010, 3:16:50 PM
to coms49...@googlegroups.com
Something does seem funny.  Could one of the IBM folks have a look at this?  The attached code and makefile compile and run the example Shreedhar uses in his documentation here, but the results are different.

[martha@athos chat-help]$ export X10LAUNCHER_NPROCS=2
[martha@athos chat-help]$ srun -N2 -n2 ./FRASimpleDist.mpi 
Main table size   = 2^12*1 = 4096 words
Number of places = 1
Number of updates = 16384
Main table size   = 2^12*1 = 4096 words
Number of places = 1
Number of updates = 16384
CPU time used  = 0.049639999866486 seconds
3.3E-4 Billion(10^9) Updates per second (GUP/s)
Found 1 errors.
CPU time used  = 0.062922000419348 seconds
2.6E-4 Billion(10^9) Updates per second (GUP/s)
Found 2 errors.

It looks like I'm getting two instances of the program, each seeing Place.MAX_PLACES=1, whereas Shreedhar's runs were one instance in which Place.MAX_PLACES=2.

Martha





chat-help.tar.gz

Vijay Saraswat

Dec 12, 2010, 6:02:25 PM
to coms49...@googlegroups.com
You will get multiple instances of a program running unless the program links in pmi.

http://docs.codehaus.org/display/XTENLANG/X10+Application+Development

============================

MPI Transport

Building an mpi-based executable requires specifying the mpi value for the -x10rt option on the x10c++ command line. In addition, you must also link with the pmi library, which is part of the Slurm installation. This linkage ensures that MPICH2-based executables (MPICH2 is the default MPI distribution available on Three Musketeers) can be launched directly with Slurm. If you compile on athos with the default x10c++ (v 2.0.6) compiler, this is taken care of automatically. In all other cases (when you use your own x10c++ compiler on athos), specify the -post option with the value "# # -lpmi #" on the x10c++ command line.

=============================

Whether or not it does so can be determined with ldd.

Chatura Atapattu

Dec 12, 2010, 6:09:46 PM
to coms49...@googlegroups.com
I compile with the following command (as per the instructions on the webpage):

x10c++ -t -v -report postcompile=1 -o BodySystem.mpi -x10rt mpi -post "# # -lpmi #" -optimize BodySystem.x10

It compiles fine.  When I run "ldd BodySystem.mpi", I get the following:

libpmi.so.0 => /usr/lib64/libpmi.so.0 (0x00002b894d7f0000)
libx10.so => /opt/x10/lib/libx10.so (0x00002b894d9f5000)
libgc.so.1 => /opt/x10/lib/libgc.so.1 (0x00002b894e06d000)
libx10rt_mpi.so => /opt/x10/lib/libx10rt_mpi.so (0x00002b894e2c4000)
libdl.so.2 => /lib64/libdl.so.2 (0x000000365fc00000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003660000000)
librt.so.1 => /lib64/librt.so.1 (0x0000003662400000)
libmpi_cxx.so.0 => /opt/openmpi-1.4/lib/libmpi_cxx.so.0 (0x00002b894e4f8000)
libmpi.so.0 => /opt/openmpi-1.4/lib/libmpi.so.0 (0x00002b894e711000)
libopen-rte.so.0 => /opt/openmpi-1.4/lib/libopen-rte.so.0 (0x00002b894e9bb000)
libopen-pal.so.0 => /opt/openmpi-1.4/lib/libopen-pal.so.0 (0x00002b894ec06000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003666c00000)
libutil.so.1 => /lib64/libutil.so.1 (0x000000366e200000)
libm.so.6 => /lib64/libm.so.6 (0x000000365f800000)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x0000003665800000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003664400000)
libc.so.6 => /lib64/libc.so.6 (0x000000365f400000)
libslurm.so.21 => /usr/lib64/libslurm.so.21 (0x00002b894ee7b000)
/lib64/ld-linux-x86-64.so.2 (0x000000365f000000)

*Note: I did not have access to "ldd" before; thanks to whoever quietly gave me enough permission.

When I run the program, it still spawns multiple copies of the same program.

John Gallagher

Dec 12, 2010, 6:11:17 PM
to coms49...@googlegroups.com
If it is correctly linked, what exactly should we expect to see in ldd?
It seems like anything that actually links to mpich2 can't load,
because of the linker path problems I mentioned. Could we get an
example trace of the x10 2.1 compilation steps, the output of ldd
showing linkage to the mpich2 library, and program output showing that
more than one place is available to x10, all working in the current
environment? Just a printout of MAX_PLACES would do fine.

I'm definitely linking to the pmi lib, and it seems like everyone else
is too, based on their compile traces.


john

On Sun, Dec 12, 2010 at 6:02 PM, Vijay Saraswat <vi...@saraswat.org> wrote:

John Gallagher

Dec 12, 2010, 6:13:53 PM
to coms49...@googlegroups.com
Yeah, but note that none of those are mpich2 libs. I want to know how
to actually link those libs successfully if MPICH2 is really needed to
run.

The MPICH2 libs are:

[jmg2016@athos ~]$ rpm -qal mpich2 | grep .so
/usr/lib64/mpich2/lib/libfmpich.so.1
/usr/lib64/mpich2/lib/libfmpich.so.1.2
/usr/lib64/mpich2/lib/libmpich.so.1
/usr/lib64/mpich2/lib/libmpich.so.1.2
/usr/lib64/mpich2/lib/libmpichcxx.so.1
/usr/lib64/mpich2/lib/libmpichcxx.so.1.2
/usr/lib64/mpich2/lib/libmpichf90.so.1
/usr/lib64/mpich2/lib/libmpichf90.so.1.2


john

Vijay Saraswat

Dec 12, 2010, 6:25:13 PM
to coms49...@googlegroups.com
Yes, I don't expect you can use mpirun to launch jobs on the cluster;
you have to use slurm. Slurm guarantees that you have the resources allocated
to you for the duration of your run. This would be an empty guarantee if
there were other ways of launching a job on the cluster at the same time
(e.g. via mpirun).

Chatura Atapattu

Dec 12, 2010, 6:26:43 PM
to coms49...@googlegroups.com
I run with "srun -N2 -n16 ./BodySystem.mpi -a 16" and get 16 copies of my program.

Vijay Saraswat

Dec 12, 2010, 6:31:08 PM
to coms49...@googlegroups.com
On 12/12/2010 6:26 PM, Chatura Atapattu wrote:
I run with "srun -N2 -n16 ./BodySystem.mpi -a 16" and get 16 copies of my program.

And you can use ldd to establish that BodySystem.mpi is linked in with pmi?

John Gallagher

Dec 12, 2010, 6:33:34 PM
to coms49...@googlegroups.com
I believe that was in his previous post.

Chatura Atapattu

Dec 12, 2010, 6:37:15 PM
to coms49...@googlegroups.com
This is a sample of the code:

val numPlaces:Int = Place.MAX_PLACES;
Console.OUT.println("Number of places: " + numPlaces);
clocked finish for ([placeID] in 0..numPlaces-1) {
    clocked async at (Place.place(placeID)) {
        val placeChunk = numBodies/numPlaces;
        val placeStart = placeChunk * placeID;
        val placeEnd = (placeID == (numPlaces - 1)) ? numBodies - 1 : placeStart + placeChunk - 1;
        Console.OUT.printf("Place %d: %d to %d\n", placeID, placeStart, placeEnd);
        next;
        Console.OUT.printf("Place %d returning\n", placeID);
    }
}

I compile with the following command (as per the instructions on the webpage):

x10c++ -t -v -report postcompile=1 -o BodySystem.mpi -x10rt mpi -post "# # -lpmi #" -optimize BodySystem.x10

It compiles fine.  When I run "ldd BodySystem.mpi", I get the following:

libpmi.so.0 => /usr/lib64/libpmi.so.0 (0x00002b894d7f0000)
libx10.so => /opt/x10/lib/libx10.so (0x00002b894d9f5000)
libgc.so.1 => /opt/x10/lib/libgc.so.1 (0x00002b894e06d000)
libx10rt_mpi.so => /opt/x10/lib/libx10rt_mpi.so (0x00002b894e2c4000)
libdl.so.2 => /lib64/libdl.so.2 (0x000000365fc00000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003660000000)
librt.so.1 => /lib64/librt.so.1 (0x0000003662400000)
libmpi_cxx.so.0 => /opt/openmpi-1.4/lib/libmpi_cxx.so.0 (0x00002b894e4f8000)
libmpi.so.0 => /opt/openmpi-1.4/lib/libmpi.so.0 (0x00002b894e711000)
libopen-rte.so.0 => /opt/openmpi-1.4/lib/libopen-rte.so.0 (0x00002b894e9bb000)
libopen-pal.so.0 => /opt/openmpi-1.4/lib/libopen-pal.so.0 (0x00002b894ec06000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003666c00000)
libutil.so.1 => /lib64/libutil.so.1 (0x000000366e200000)
libm.so.6 => /lib64/libm.so.6 (0x000000365f800000)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x0000003665800000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003664400000)
libc.so.6 => /lib64/libc.so.6 (0x000000365f400000)
libslurm.so.21 => /usr/lib64/libslurm.so.21 (0x00002b894ee7b000)
/lib64/ld-linux-x86-64.so.2 (0x000000365f000000)

I run with "srun -N2 -n16 ./BodySystem.mpi -a 16".  The screen shows 16 copies, each saying that the number of places is 1 (followed by the rest of the program's output).

Vijay Saraswat

Dec 12, 2010, 7:14:22 PM
to coms49...@googlegroups.com
Unfortunately, I cannot reproduce the problem. Things work fine for me as advertised.

Here is what I did.

(1) I rebuilt X10 from SVN head:
cd x1021
svn co https://x10.svn.sf.net/svnroot/x10/trunk
cd trunk/x10.dist
ant -DX10RT_MPI=true distclean dist

I have given world read permission on the x1021 directory, so you should be able to use this X10 installation as well.

(2) Wrote and executed a HelloWorld.x10 over multiple places successfully, per the attached log.txt.

Does this help you to get your code running?

Best,
Vijay
log.txt

Vijay Saraswat

Dec 12, 2010, 7:21:00 PM
to coms49...@googlegroups.com
I copied your code over, compiled, linked, ran -- seems to work fine...?

See attached log.



On 12/12/2010 6:37 PM, Chatura Atapattu wrote:
bs_log.txt

Vijay Saraswat

Dec 12, 2010, 7:23:12 PM
to coms49...@googlegroups.com
FWIW, I've attached the output for

srun -N2 -n16 ./bs -a 16

as well


On 12/12/2010 6:37 PM, Chatura Atapattu wrote:
bs_log2.txt

John Gallagher

Dec 12, 2010, 7:29:18 PM
to coms49...@googlegroups.com
With your build, it works; thanks.

I think the problem is that:
/opt/x10-2.1.0 has the wrong mpich2 configuration, so it cannot
compile working versions (ldd always shows mpichcxx.so unlinked), and
/opt/x10 uses openmpi, which doesn't work with slurm.

So basically the only solution was to build the x10 dist ourselves or
use the one that you have provided.

Thanks,
john

Martha Kim

Dec 12, 2010, 7:29:57 PM
to coms49...@googlegroups.com
Chat, John, et al:  I was able to run the FRASimpleDist program (Shreedar's example) and get the same output simply by changing my path to X10_PATH=~vj/x1021/trunk/x10.dist/bin/

Vijay:  Do you know the difference between the latest release and the SVN head that would have caused things to behave differently?  

Vijay Saraswat

Dec 12, 2010, 7:46:14 PM
to coms49...@googlegroups.com
On 12/12/2010 7:29 PM, Martha Kim wrote:
Chat, John, et al:  I was able to run the FRASimpleDist program (Shreedar's example) and get the same output simply by changing my path to X10_PATH=~vj/x1021/trunk/x10.dist/bin/

Vijay:  Do you know the difference between the latest release and the SVN head that would have caused things to behave differently?  


I don't think anything has changed --- it's probably just that I compiled x10 with -DX10RT_MPI=true

(Sreedhar will know -- he should be online in a few hours.)

Chatura Atapattu

Dec 12, 2010, 7:48:18 PM
to coms49...@googlegroups.com
Thanks, it works now.  I then tried compiling my project; while it compiled fine against /opt, against either your build or the one I created myself (following your directions) I get a ton of errors regarding "Void".  Example:

/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:70: Could not find type "Void".
/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:85: Could not find type "Void".
/home/cpa2116/NBodyx10/trunk/src/StaticRandom.x10:14: Could not find type "Void".
/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:125: Could not find type "Void".
/home/cpa2116/NBodyx10/trunk/src/StaticPrimative.x10:12: Could not find type "Void".
/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:165: Could not find type "Void".
/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:288: Could not find type "Void".
/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:190: Could not find type "Void".
/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:234: Could not find field or local variable "placeID".
/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:235: Could not find field or local variable "placeID".
/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:237: Could not find field or local variable "placeID".
/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:239: Could not find field or local variable "placeID".
/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:263: Could not find type "Void".
/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:343: Could not find type "Void".
/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:303: Could not find type "Void".
/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:358: Could not find type "Void".
/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:373: Could not find type "Void".
/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:394: Could not find type "Void".
/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:438: Could not find type "Void".
/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:442: Could not find type "Void".
/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:514: Could not find type "Void".
/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:526: Could not find type "Void".
/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:540: Could not find type "Void".
/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:70-83: Method must return a value of type Void
/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:85-123: Method must return a value of type Void
/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:125-163: Method must return a value of type Void
/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:165-188: Method must return a value of type Void
/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:190-261: Method must return a value of type Void
/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:263-283: Method must return a value of type Void
/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:288-301: Method must return a value of type Void
/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:303-338: Method must return a value of type Void
/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:343-356: Method must return a value of type Void
/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:358-371: Method must return a value of type Void
/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:373-392: Method must return a value of type Void
/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:394-436: Method must return a value of type Void
/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:438-440: Method must return a value of type Void
/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:442-460: Method must return a value of type Void
/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:514-524: Method must return a value of type Void
/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:526-538: Method must return a value of type Void
/home/cpa2116/NBodyx10/trunk/src/BodySystem.x10:540-562: Method must return a value of type Void
/home/cpa2116/NBodyx10/trunk/src/StaticPrimative.x10:12-14: Method must return a value of type Void
/home/cpa2116/NBodyx10/trunk/src/StaticRandom.x10:14-16: Method must return a value of type Void

Vijay Saraswat

Dec 12, 2010, 7:58:39 PM
to coms49...@googlegroups.com
Replace Void with void.

Chatura Atapattu

Dec 12, 2010, 8:22:32 PM
to coms49...@googlegroups.com
I've run your version of my program a couple of times.  While it generally runs fine, every once in a while it blows up with the errors below.  Is this MPI-related and can it be ignored, or is it an issue in the program?  Code attached.

Fatal error in MPI_Iprobe: Other MPI error, error stack:
MPI_Iprobe(122)................: MPI_Iprobe(src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG, comm=0x84000002, flag=0x7fff5ec3b90c, status=0x7fff5ec3b8f0) failed
MPIDI_CH3I_Progress(150).......: 
MPID_nem_mpich2_test_recv(800).: 
MPID_nem_tcp_connpoll(1720)....: 
state_commrdy_handler(1556)....: 
MPID_nem_tcp_recv_handler(1446): socket closed
pthread_mutex_destroy: Operation now in progress
Fatal error in MPI_Test: Other MPI error, error stack:
MPI_Test(153).................: MPI_Test(request=0x8133074, flag=0x7fff0038ec9c, status=0x7fff0038ec80) failed
MPIDI_CH3I_Progress(150)......: 
MPID_nem_mpich2_test_recv(800): 
MPID_nem_tcp_connpoll(1709)...: Communication error
pthread_mutex_destroy: Resource temporarily unavailable
Fatal error in MPI_Iprobe: Other MPI error, error stack:
MPI_Iprobe(122)................: MPI_Iprobe(src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG, comm=0x84000002, flag=0x7fffa5924fbc, status=0x7fffa5924fa0) failed
MPIDI_CH3I_Progress(150).......: 
MPID_nem_mpich2_test_recv(800).: 
MPID_nem_tcp_connpoll(1720)....: 
state_commrdy_handler(1556)....: 
MPID_nem_tcp_recv_handler(1446): socket closed
pthread_mutex_destroy: Operation now in progress
Fatal error in MPI_Iprobe: Other MPI error, error stack:
MPI_Iprobe(122)................: MPI_Iprobe(src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG, comm=0x84000002, flag=0x7fffc79ba04c, status=0x7fffc79ba030) failed
MPIDI_CH3I_Progress(150).......: 
MPID_nem_mpich2_test_recv(800).: 
MPID_nem_tcp_connpoll(1720)....: 
state_commrdy_handler(1556)....: 
MPID_nem_tcp_recv_handler(1446): socket closed
pthread_mutex_destroy: Operation now in progress
Fatal error in MPI_Iprobe: Other MPI error, error stack:
MPI_Iprobe(122)................: MPI_Iprobe(src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG, comm=0x84000002, flag=0x7ffff12e097c, status=0x7ffff12e0960) failed
MPIDI_CH3I_Progress(150).......: 
MPID_nem_mpich2_test_recv(800).: 
MPID_nem_tcp_connpoll(1720)....: 
state_commrdy_handler(1556)....: 
MPID_nem_tcp_recv_handler(1446): socket closed
pthread_mutex_destroy: Resource temporarily unavailable
Fatal error in MPI_Iprobe: Other MPI error, error stack:
MPI_Iprobe(122)................: MPI_Iprobe(src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG, comm=0x84000002, flag=0x7fffd8ef158c, status=0x7fffd8ef1570) failed
MPIDI_CH3I_Progress(150).......: 
MPID_nem_mpich2_test_recv(800).: 
MPID_nem_tcp_connpoll(1720)....: 
state_commrdy_handler(1556)....: 
MPID_nem_tcp_recv_handler(1446): socket closed
pthread_mutex_destroy: Operation now in progress
Fatal error in MPI_Iprobe: Other MPI error, error stack:
MPI_Iprobe(122)...............: MPI_Iprobe(src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG, comm=0x84000002, flag=0x7fff7e5a277c, status=0x7fff7e5a2760) failed
MPIDI_CH3I_Progress(150)......: 
MPID_nem_mpich2_test_recv(800): 
MPID_nem_tcp_connpoll(1709)...: Communication error
pthread_mutex_destroy: Resource temporarily unavailable
Fatal error in MPI_Iprobe: Other MPI error, error stack:
MPI_Iprobe(122)................: MPI_Iprobe(src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG, comm=0x84000002, flag=0x7fff9b4c619c, status=0x7fff9b4c6180) failed
MPIDI_CH3I_Progress(150).......: 
MPID_nem_mpich2_test_recv(800).: 
MPID_nem_tcp_connpoll(1720)....: 
state_commrdy_handler(1556)....: 
MPID_nem_tcp_recv_handler(1446): socket closed
pthread_mutex_destroy: Resource temporarily unavailable
srun: error: dartagnan: tasks 0-4: Aborted
srun: error: dartagnan: tasks 5-7: Segmentation fault
srun: error: porthos: tasks 8-9,12-13,15: Segmentation fault
srun: error: porthos: tasks 10-11,14: Aborted
BodySystem.x10

Vijay Saraswat

Dec 13, 2010, 5:39:29 AM
to coms49...@googlegroups.com
On 12/12/2010 7:29 PM, John Gallagher wrote:
> With your build, it works, thanks.
>
> I think the problem is that
> /opt/x10-2.1.0 has the wrong mpich2 configuration, so it cannot
> compile working versions (ldd always has mpichcxx.so unlinked)
> /opt/x10 uses openmpi, which doesn't work with slurm.
>
> So basically the only solution was to build the x10 dist ourselves or
> use the one that you have provided.
>
> Thanks,
> john
OK, thanks. Hope you are not blocked any more.

Vijay Saraswat

Dec 13, 2010, 5:46:15 AM
to coms49...@googlegroups.com
On 12/12/2010 8:22 PM, Chatura Atapattu wrote:
I was running your version of my program a couple of times.  While it runs fine generally, every once in a while, it blows up with the following (below).  Is this MPI related and can it be ignored, or is it an issue in the program?  Code attached.
I tried a few times and it ran fine.

Are you able to run your application?

Chatura Atapattu

Dec 13, 2010, 8:55:49 AM
to coms49...@googlegroups.com
Thanks, Prof. Vijay.  With this we were able to build against your compiler/runtime (or the one I created following your directions) and get MPI to build and run successfully.  However, it caused a whole lot of problems for code which compiled fine before; some were easy (like the "Void"/"void" issue), and others are still unsolved (errors regarding a dictionary we were using; we will send that out if we can't solve it [it's non-critical for our implementation]).

Our code runs fine, although the test still continued to blow up occasionally.