When building and running Molpro using GA 5-0B (TCGMSGMPI with MPICH2)
on Mac OS X, all of our jobs exit with the following error:
0:Child process terminated prematurely, status=: 0
(rank:0 hostname:pjkws9.chem.cf.ac.uk pid:58356):ARMCI DASSERT fail.
src/signaltrap.c:SigChldHandler():178 cond:0
Last System Error Message from Task 0:: No such file or directory
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 0
Last System Error Message from Task 1:: No such file or directory
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 1
This seems to be independent of the compilers; both of the following
combinations show the problem:
gcc 4.2.1 / pgf90 10.9
gcc 4.5.0 / gfortran 4.5.0
If I build using the same compilers, MPICH2 etc. on a Linux system, then
we do not see such errors at the end of calculations.
This problem does not occur using GA 4-3-3.
The configure for GA 5-0B is:
./configure F77=/opt/gcc/bin/gfortran CC=/opt/gcc/bin/gcc
--prefix=/Users/andy/gfortran/molpro2010.2/src/ga-install --with-tcgmsg
--with-mpi='/Users/andy/gfortran/molpro2010.2/src/mpich2-install/lib
-lpmpich -lmpich -lopa -lmpl
-I/Users/andy/gfortran/molpro2010.2/src/mpich2-install/include'
Do you have any idea what might be the problem?
Many thanks,
Andy
This doesn't happen with GA 4-3-3, but does with GA 5. Could you take a
look?
Thanks,
Andy
#include <stdlib.h>
#include <stdio.h>
#include "ga.h"
#include "tcgmsg.h"

int main(int argc, char **argv) {
    FILE *s;
    char *p;

    tcg_pbegin(argc, argv);
    GA_Initialize();

    /* Run "hostname" in a child process and print its output. */
    s = popen("hostname", "r");
    p = (char *) malloc(128);
    fgets(p, 128, s);
    fprintf(stdout, "hostname=%s", p);
    free(p);

    GA_Terminate();
    tcg_pend();
    return 0;
}
Thanks for the test code -- that helps considerably. I'm filing this
as a bug, as well. I hope once everyone is back from travel we can
start taking care of these.
For ga-5-0, I was able to reproduce this using GCC 4.6.0 (from
hpc.sourceforge.net) and mpich2-1.3.1 on OSX 10.6.5. I do not see
this using OpenMPI 1.4.2, however.
For ga-4-3, sure enough, it works just fine.
Thanks again for the small test program. That's what I used to verify
the behavior. I'm looking into it.
Actually the example I posted is missing a pclose. Below is an updated
version, which fails when GA_Sync() is called after a popen() and before
a pclose(), due to the signal SIGCHLD. If compiled with
-D_IGNORE_SIGCHLD_ it works fine.
A failure can even be seen when GA_Sync() is called after pclose(),
but in that case it is intermittent. This is probably a race condition
between pclose() and GA_Sync().
Hope this helps,
Andy
#include <stdlib.h>
#include <stdio.h>
#ifdef _IGNORE_SIGCHLD_
#include <signal.h>
#endif
#include "ga.h"
#include "tcgmsg.h"

int main(int argc, char **argv) {
    FILE *s;
    char *p;
#ifdef _IGNORE_SIGCHLD_
    void (*sigchld_handler)(int);   /* handler in place before we ignore SIGCHLD */
#endif

    tcg_pbegin(argc, argv);
    GA_Initialize();

#ifdef _IGNORE_SIGCHLD_
    sigchld_handler = signal(SIGCHLD, SIG_IGN); /* ignore SIGCHLD, save old handler */
#endif

    s = popen("hostname", "r");
    p = (char *) malloc(128);
    fgets(p, 128, s);
    fprintf(stdout, "hostname=%s", p);
    free(p);

    GA_Sync();   /* fails here unless SIGCHLD is ignored */
    pclose(s);

#ifdef _IGNORE_SIGCHLD_
    signal(SIGCHLD, sigchld_handler); /* restore SIGCHLD */
#endif

    GA_Terminate();
    tcg_pend();
    return 0;
}