GA 5-0B termination on OS X


Andy May

Nov 10, 2010, 8:00:18 AM
to hpct...@emsl.pnl.gov
Hi,

When building and running Molpro using GA 5-0B (TCGMSGMPI with MPICH2)
on Mac OS X all of our jobs exit with a strange error:

0:Child process terminated prematurely, status=: 0
(rank:0 hostname:pjkws9.chem.cf.ac.uk pid:58356):ARMCI DASSERT fail.
src/signaltrap.c:SigChldHandler():178 cond:0
Last System Error Message from Task 0:: No such file or directory
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 0
Last System Error Message from Task 1:: No such file or directory
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 1

This seems independent of compilers, certainly both

gcc 4.2.1 / pgf90 10.9
gcc 4.5.0 / gfortran 4.5.0

combinations have this problem.

If I build using the same compilers, MPICH2 etc. on a Linux system, then
we do not see such errors at the end of calculations.

This problem does not occur using GA 4-3-3.

The configure for GA 5-0B is:

./configure F77=/opt/gcc/bin/gfortran CC=/opt/gcc/bin/gcc
--prefix=/Users/andy/gfortran/molpro2010.2/src/ga-install --with-tcgmsg
--with-mpi='/Users/andy/gfortran/molpro2010.2/src/mpich2-install/lib
-lpmpich -lmpich -lopa -lmpl
-I/Users/andy/gfortran/molpro2010.2/src/mpich2-install/include'

Do you have any idea what might be the problem?

Many thanks,

Andy

Andy May

Nov 19, 2010, 7:18:50 AM
to hpct...@emsl.pnl.gov

After further investigation, I have a cut-down version of the code (open
and read from a pipe) that shows the problem on both Linux and Mac (see
below). Perhaps GA catches a signal from the sub-process?

This doesn't happen with GA 4-3-3, but does with GA 5. Could you take a
look?

Thanks,

Andy

#include <stdlib.h>
#include <stdio.h>
#include "ga.h"
#include "tcgmsg.h"

int main(int argc, char **argv) {
    FILE *s;
    char *p;

    tcg_pbegin(argc, argv);
    GA_Initialize();

    /* run a sub-process and read one line from its output */
    s = popen("hostname", "r");
    p = (char *) malloc(128);
    fgets(p, 128, s);
    fprintf(stdout, "hostname=%s", p);
    free(p);

    GA_Terminate();
    tcg_pend();
    return 0;
}

Jeff

Nov 19, 2010, 12:23:59 PM
to Andy May, hpct...@googlegroups.com
On Nov 19, 4:18 am, Andy May <May...@Cardiff.ac.uk> wrote:
> After further investigation, I have a cut down version of code (open and
> read from pipe) which shows the problem on Linux and Mac (see below).
> Perhaps GA catches a signal from the sub-process?
>
> This doesn't happen with GA 4-3-3, but does with GA 5. Could you take a
> look?
>
> Thanks,
>
> Andy

Thanks for the test code -- that helps considerably. I'm filing this
as a bug, as well. I hope once everyone is back from travel we can
start taking care of these.

Jeff

Nov 22, 2010, 3:59:27 PM
to Andy May, hpct...@googlegroups.com

For ga-5-0, I was able to reproduce this using GCC 4.6.0 (from
hpc.sourceforge.net) and mpich2-1.3.1 on OSX 10.6.5. I do not see
this using OpenMPI 1.4.2, however.
For ga-4-3, sure enough, it works just fine.
Thanks again for the small test program. That's what I used to verify
the behavior. I'm looking into it.

Jeff

Nov 22, 2010, 5:59:56 PM
to hpctools, May...@cardiff.ac.uk

I think I've narrowed this down to GA_Sync() which internally calls
MPI_Barrier(MPI_COMM_WORLD), but at that point it gets cloudy. I put
GA_Sync() before the popen block and another after. The first
GA_Sync() succeeds, but the latter does not. GA_Terminate calls
ga_sync_ internally, which is why it's failing. But ga_sync_ calls
ga_msg_sync_ which calls armci_msg_barrier which calls MPI_Barrier. I
then replaced GA_Sync() before/after the popen block with direct calls
to MPI_Barrier(MPI_COMM_WORLD) and got the same results. I don't know
what is causing the barrier to fail after the popen.

I also verified that this occurs if you build without TCGMSG. Using
an MPI-only build I'm getting the same problem.

I need to keep investigating. Hopefully someone else might have an
idea or two.
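
One way to check the earlier guess that a handler is catching a signal from the sub-process, without involving GA or MPI at all, is to install a SIGCHLD handler by hand and see whether popen()/pclose() triggers it. The sketch below does that. The names count_sigchld and count_sigchld_from_popen are invented for illustration, and the handler is only a stand-in that counts deliveries; it is not ARMCI's SigChldHandler and makes no claim about what that handler does internally.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

/* Number of SIGCHLD deliveries seen; sig_atomic_t is safe to update in a handler. */
static volatile sig_atomic_t sigchld_count = 0;

/* Hypothetical stand-in for a library-installed SIGCHLD handler: it only counts. */
static void count_sigchld(int signo) {
    (void)signo;
    sigchld_count++;
}

/* Installs the counting handler, runs cmd through popen()/pclose(), restores
 * the previous handler, and returns the number of SIGCHLDs delivered
 * (or -1 if any step failed). */
int count_sigchld_from_popen(const char *cmd) {
    struct sigaction sa, old;
    char line[128];
    FILE *s;
    int status;

    assert(cmd != NULL);

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = count_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART; /* restart reads/waits interrupted by the signal */
    if (sigaction(SIGCHLD, &sa, &old) != 0)
        return -1;

    sigchld_count = 0;
    s = popen(cmd, "r");
    if (s == NULL) {
        sigaction(SIGCHLD, &old, NULL);
        return -1;
    }
    while (fgets(line, sizeof line, s) != NULL) {
        /* drain the child's output */
    }
    status = pclose(s); /* waits for the child to exit */

    sigaction(SIGCHLD, &old, NULL); /* put the previous handler back */
    return status == -1 ? -1 : (int)sigchld_count;
}
```

Calling count_sigchld_from_popen("hostname") from a small main() and printing the result should show a non-zero count if the child's exit reaches the handler, which is the situation a library with its own SIGCHLD handler would be in.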

Andy May

Nov 23, 2010, 9:31:04 AM
to Jeff, hpctools
Jeff,

Actually the example I posted is missing a pclose. Below is an updated
version, which fails when GA_Sync() is called after a popen() and before
a pclose(), due to the signal SIGCHLD. If compiled with
-D_IGNORE_SIGCHLD_ it works fine.

Actually, a failure can even be seen when calling GA_Sync() after
pclose(), but only intermittently. This is probably a race condition
between pclose() and GA_Sync().

Hope this helps,

Andy

#include <stdlib.h>
#include <stdio.h>
#ifdef _IGNORE_SIGCHLD_
#include <signal.h>
#endif
#include "ga.h"
#include "tcgmsg.h"

int main(int argc, char **argv) {
    FILE *s;
    char *p;
#ifdef _IGNORE_SIGCHLD_
    void (*sigchld_handler)(int); /* handler in place before SIGCHLD is ignored */
#endif

    tcg_pbegin(argc, argv);
    GA_Initialize();

#ifdef _IGNORE_SIGCHLD_
    sigchld_handler = signal(SIGCHLD, SIG_IGN); /* ignore SIGCHLD, saving the old handler */
#endif

    s = popen("hostname", "r");
    p = (char *) malloc(128);
    fgets(p, 128, s);
    fprintf(stdout, "hostname=%s", p);
    free(p);

    GA_Sync(); /* called between popen() and pclose() */
    pclose(s);
#ifdef _IGNORE_SIGCHLD_
    signal(SIGCHLD, sigchld_handler); /* restore SIGCHLD */
#endif

    GA_Terminate();
    tcg_pend();
    return 0;
}
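
A possible variation on the -D_IGNORE_SIGCHLD_ workaround, offered as an untested assumption rather than anything verified against GA in this thread: set SIGCHLD to SIG_DFL instead of SIG_IGN around the popen()/pclose() pair. The default action for SIGCHLD already discards the signal, while POSIX lets an explicit SIG_IGN reap children automatically, in which case pclose() can fail with ECHILD and the exit status is lost. The helper name read_line_with_default_sigchld is hypothetical, and the sketch uses sigaction() in place of signal() so the previous disposition is saved and restored exactly.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

/* Runs cmd via popen() with SIGCHLD temporarily set to SIG_DFL, copies the
 * first line of output into buf (empty string if none), restores the previous
 * SIGCHLD disposition, and returns the pclose() status (-1 on failure). */
int read_line_with_default_sigchld(const char *cmd, char *buf, int len) {
    struct sigaction dfl, saved;
    FILE *s;
    int status = -1;

    assert(cmd != NULL);
    if (buf == NULL || len <= 0)
        return -1;
    buf[0] = '\0';

    memset(&dfl, 0, sizeof dfl);
    dfl.sa_handler = SIG_DFL; /* default action for SIGCHLD: discard the signal */
    sigemptyset(&dfl.sa_mask);
    if (sigaction(SIGCHLD, &dfl, &saved) != 0)
        return -1;

    s = popen(cmd, "r");
    if (s != NULL) {
        if (fgets(buf, len, s) == NULL)
            buf[0] = '\0';
        status = pclose(s); /* child is reaped while no handler is installed */
    }

    /* Put back whatever handler was installed before (for example, one set up
     * by a library during initialization). */
    sigaction(SIGCHLD, &saved, NULL);
    return status;
}
```

In the test program above, this would take the place of the popen()/fgets()/pclose() sequence, with any collective call such as GA_Sync() made only after the helper returns. Whether that avoids the intermittent failure seen after pclose() would still need to be checked against GA 5.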
