Weird errors from running GraphLab HelloWorld on a 100-machine cluster


Zekai Jacob Gao

Sep 24, 2013, 5:14:30 PM
to graph...@googlegroups.com

Hey guys,

I am having problems running GraphLab on a 100-machine cluster.

When I run the following program:

#include <cstdio>
#include <graphlab.hpp>

int main(int argc, char** argv) {
  graphlab::mpi_tools::init(argc, argv);
  graphlab::distributed_control dc;

  printf("Hello World!\n");

  graphlab::mpi_tools::finalize();
  return 0;
}

Using 40 machines with the command: mpiexec -n 40 -hostfile ~/machines ./helloworld 

I get:

GRAPHLAB_SUBNET_ID/GRAPHLAB_SUBNET_MASK environment variables not defined.

Using default values

Subnet ID: 0.0.0.0

Subnet Mask: 0.0.0.0

Will find first IPv4 non-loopback address matching the subnet

INFO:     dc.cpp(init:554): Cluster of 40 instances created.

Hello World!

Hello World!

(... 38 more identical lines, one from each of the 40 processes)

This is exactly what I expect.
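(As an aside: the warning at the top of that log suggests the subnet defaults can be overridden by exporting the two variables before launching. The subnet values below are placeholders, not my cluster's actual network:)

```shell
# Placeholder values -- substitute the cluster's actual private subnet.
export GRAPHLAB_SUBNET_ID=10.0.0.0
export GRAPHLAB_SUBNET_MASK=255.0.0.0
echo "Subnet: $GRAPHLAB_SUBNET_ID/$GRAPHLAB_SUBNET_MASK"
```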

But when I run the program on 100 machines (actually, on any number > 41), I get the errors in the attachment: the processes try to allocate absurdly large amounts of memory!

(Part of the errors:

tcmalloc: large alloc 5066549580800000 bytes == (nil) @  0x65b851 0x7f9f1d417a89

tcmalloc: large alloc 5348024557510656 bytes == (nil) @  0x65b851 0x7f9d8bf47a89

tcmalloc: large alloc 3759995725912481792 bytes == (nil) @  0x65b851 0x7fa9b084ca89

terminate called after throwing an instance of 'std::bad_alloc'

tcmalloc: large alloc 5348024557510656 bytes == (nil) @  0x65b851 0x7f0dbf228a89

ERROR:    basic_types.hpp(exec:140): Check failed: !(iarc.fail())

ERROR:    basic_types.hpp(exec:140): Check failed: !(iarc.fail())

tcmalloc: large alloc 3472556787679371264 bytes == (nil) @  0x65b851 0x7f91e13dfa89

ERROR:    basic_types.hpp(exec:140): Check failed: !(iarc.fail())

terminate called after throwing an instance of 'std::bad_alloc'

tcmalloc: large alloc 3530822107858477056 bytes == (nil) @  0x65b851 0x7fdffe09fa89

tcmalloc: large alloc 3530822107858477056 bytes == (nil) @  0x65b851 0x7f90fcd2aa89

ERROR:    basic_types.hpp(exec:140): Check failed: !(iarc.fail())

tcmalloc: large alloc 5348024557510656 bytes == (nil) @  0x65b851 0x7fe0dde7fa89

ERROR:    basic_types.hpp(exec:140): Check failed: !(iarc.fail())

terminate called after throwing an instance of '  what():  std::bad_alloc

tcmalloc: large alloc 3472556787679371264 bytes == (nil) @  0x65b851 0x7f75d7ccea89

ERROR:    basic_types.hpp(exec:140): Check failed: !(iarc.fail())

tcmalloc: large alloc 3530822107858477056 bytes == (nil) @  0x65b851 0x7f24d966ea89

tcmalloc: large alloc 3328214000696565760 bytes == (nil) @  0x65b851 0x7fe5b45a9a89

tcmalloc: large alloc 3543822943798697984 bytes == (nil) @  0x65b851 0x7f4210902a89

ERROR:    basic_types.hpp(exec:140): Check failed: !(iarc.fail())

tcmalloc: large alloc 3759995725912481792 bytes == (nil) @  0x65b851 0x7f8183724a89

tcmalloc: large alloc 3472556787679371264 bytes == (nil) @  0x65b851 0x7f16a1a76a89

tcmalloc: large alloc 3905239000875810816 bytes == (nil) @  0x65b851 0x7f7b00cf7a89

tcmalloc: large alloc 3530822107858477056 bytes == (nil) @  0x65b851 0x7fc71b620a89

terminate called after throwing an instance of 'std::bad_alloc'

tcmalloc: large alloc 3530822107858477056 bytes == (nil) @  0x65b851 0x7fa8888f0a89

tcmalloc: large alloc 3530822107858477056 bytes == (nil) @  0x65b851 0x7f77f656fa89

[node006:02419] *** Process received signal ***

terminate called after throwing an instance of 'std::bad_alloc'

  what():  std::bad_alloc

[node019:02426] *** Process received signal ***

  what():  std::bad_alloc

terminate called after throwing an instance of 'std::bad_alloc'

  what():  std::bad_alloc

terminate called after throwing an instance of 'std::bad_alloc'

terminate called after throwing an instance of 'std::bad_alloc'

[node053:02219] *** Process received signal ***

[node053:02219] Signal: Aborted (6)

[node053:02219] Signal code:  (-6)

std::bad_alloc'

[node059:02144] *** Process received signal ***

terminate called after throwing an instance of 'std::bad_alloc'

terminate called after throwing an instance of 'std::bad_alloc'

terminate called after throwing an instance of 'std::bad_alloc'

terminate called after throwing an instance of 'std::bad_alloc'

[node081:02117] *** Process received signal ***

terminate called after throwing an instance of 'std::bad_alloc'

terminate called after throwing an instance of 'std::bad_alloc'

terminate called after throwing an instance of 'std::bad_alloc'

terminate called after throwing an instance of 'std::bad_alloc'

  what():  std::bad_alloc

)


The problem does not appear to be my cluster or MPI. I can run the following MPI code with no problem:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
   int myid, numprocs;

   MPI_Init(&argc, &argv);
   MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
   MPI_Comm_rank(MPI_COMM_WORLD, &myid);

   printf("Hello from %d\n", myid);
   printf("Numprocs is %d\n", numprocs);

   MPI_Finalize();
   return 0;
}


When I run the above program with the following command:

root@node001:~# mpiexec -n 100 -hostfile ~/machines ./c_ex00

Here are the results:

Hello from 0

Numprocs is 100

Hello from 1

Numprocs is 100

Hello from 32

Numprocs is 100

(... and so on: one "Hello from N" / "Numprocs is 100" pair from each of the 100 ranks, arriving in arbitrary order)


Do you have any ideas? Has anyone experienced problems like this when a large number of machines is used?


Any help is appreciated!

Thanks,

Zekai

100_machines_errors.txt

Yucheng Low

Sep 26, 2013, 2:36:34 AM
to graph...@googlegroups.com
Hi,

That seems like a deserialization failure.
I am assuming that you are up to date with the GitHub repository.

What compiler are you using?

I can't seem to reproduce that error locally. Is it possible to get me a core dump of one of the failing processes (preferably from a debug build)?



Yucheng

--
You received this message because you are subscribed to the Google Groups "GraphLab API" group.
To unsubscribe from this group and stop receiving emails from it, send an email to graphlabapi...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
<100_machines_errors.txt>

Yucheng Low

Sep 26, 2013, 2:37:22 AM
to graph...@googlegroups.com
Also, I am on vacation, so replies may be somewhat delayed.

Yucheng

Zekai Jacob Gao

Sep 26, 2013, 3:15:29 AM
to graph...@googlegroups.com
More information is included in the attached config.log.

The results of running mpiexec --version and ompi_info:

root@node001:~/graphlab# mpiexec --version
mpiexec (OpenRTE) 1.4.3

root@node001:~/graphlab# ompi_info
                 Package: Open MPI ro...@starnix.mit.edu Distribution
                Open MPI: 1.4.3
   Open MPI SVN revision: r23834
   Open MPI release date: Oct 05, 2010
                Open RTE: 1.4.3
   Open RTE SVN revision: r23834
   Open RTE release date: Oct 05, 2010
                    OPAL: 1.4.3
       OPAL SVN revision: r23834
       OPAL release date: Oct 05, 2010
            Ident string: 1.4.3
                  Prefix: /usr
 Configured architecture: x86_64-pc-linux-gnu
          Configure host: starnix.mit.edu
           Configured by: root
           Configured on: Wed Jan  9 02:26:24 UTC 2013
          Configure host: starnix.mit.edu
                Built by: root
                Built on: Wed Jan  9 02:34:00 UTC 2013
              Built host: starnix.mit.edu
              C bindings: yes
            C++ bindings: yes
      Fortran77 bindings: yes (all)
      Fortran90 bindings: yes
 Fortran90 bindings size: small
              C compiler: gcc
     C compiler absolute: /usr/bin/gcc
            C++ compiler: g++
   C++ compiler absolute: /usr/bin/g++
      Fortran77 compiler: gfortran
  Fortran77 compiler abs: /usr/bin/gfortran
      Fortran90 compiler: gfortran
  Fortran90 compiler abs: /usr/bin/gfortran
             C profiling: yes
           C++ profiling: yes
     Fortran77 profiling: yes
     Fortran90 profiling: yes
          C++ exceptions: no
          Thread support: posix (mpi: no, progress: no)
           Sparse Groups: no
  Internal debug support: no
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
         libltdl support: yes
   Heterogeneous support: yes
 mpirun default --prefix: no
         MPI I/O support: yes
       MPI_WTIME support: gettimeofday
Symbol visibility support: yes
   FT Checkpoint support: yes  (checkpoint thread: no)
           MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.4.3)
              MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.4.3)
           MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.4.3)
               MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.4.3)
               MCA carto: file (MCA v2.0, API v2.0, Component v1.4.3)
           MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.4.3)
           MCA maffinity: libnuma (MCA v2.0, API v2.0, Component v1.4.3)
               MCA timer: linux (MCA v2.0, API v2.0, Component v1.4.3)
         MCA installdirs: env (MCA v2.0, API v2.0, Component v1.4.3)
         MCA installdirs: config (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA crs: none (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA dpm: orte (MCA v2.0, API v2.0, Component v1.4.3)
              MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.4.3)
           MCA allocator: basic (MCA v2.0, API v2.0, Component v1.4.3)
           MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.4.3)
                MCA coll: basic (MCA v2.0, API v2.0, Component v1.4.3)
                MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.4.3)
                MCA coll: inter (MCA v2.0, API v2.0, Component v1.4.3)
                MCA coll: self (MCA v2.0, API v2.0, Component v1.4.3)
                MCA coll: sm (MCA v2.0, API v2.0, Component v1.4.3)
                MCA coll: sync (MCA v2.0, API v2.0, Component v1.4.3)
                MCA coll: tuned (MCA v2.0, API v2.0, Component v1.4.3)
                  MCA io: romio (MCA v2.0, API v2.0, Component v1.4.3)
               MCA mpool: fake (MCA v2.0, API v2.0, Component v1.4.3)
               MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.4.3)
               MCA mpool: sm (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA pml: cm (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA pml: crcpw (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA pml: csum (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA pml: v (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA bml: r2 (MCA v2.0, API v2.0, Component v1.4.3)
              MCA rcache: vma (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA btl: ofud (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA btl: openib (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA btl: self (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA btl: sm (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA btl: tcp (MCA v2.0, API v2.0, Component v1.4.3)
                MCA topo: unity (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA osc: pt2pt (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA osc: rdma (MCA v2.0, API v2.0, Component v1.4.3)
                MCA crcp: bkmrk (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA iof: hnp (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA iof: orted (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA iof: tool (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA oob: tcp (MCA v2.0, API v2.0, Component v1.4.3)
                MCA odls: default (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA ras: slurm (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA ras: tm (MCA v2.0, API v2.0, Component v1.4.3)
               MCA rmaps: load_balance (MCA v2.0, API v2.0, Component v1.4.3)
               MCA rmaps: rank_file (MCA v2.0, API v2.0, Component v1.4.3)
               MCA rmaps: round_robin (MCA v2.0, API v2.0, Component v1.4.3)
               MCA rmaps: seq (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA rml: ftrm (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA rml: oob (MCA v2.0, API v2.0, Component v1.4.3)
              MCA routed: binomial (MCA v2.0, API v2.0, Component v1.4.3)
              MCA routed: direct (MCA v2.0, API v2.0, Component v1.4.3)
              MCA routed: linear (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA plm: rsh (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA plm: slurm (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA plm: tm (MCA v2.0, API v2.0, Component v1.4.3)
               MCA snapc: full (MCA v2.0, API v2.0, Component v1.4.3)
               MCA filem: rsh (MCA v2.0, API v2.0, Component v1.4.3)
              MCA errmgr: default (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA ess: env (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA ess: hnp (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA ess: singleton (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA ess: slurm (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA ess: tool (MCA v2.0, API v2.0, Component v1.4.3)
             MCA grpcomm: bad (MCA v2.0, API v2.0, Component v1.4.3)
             MCA grpcomm: basic (MCA v2.0, API v2.0, Component v1.4.3)

I cloned the latest repository tonight and also tried it on 64 m2.4xlarge machines from Amazon EC2.

Some weird things happened again. For example, mpiexec -n 41 -hostfile ~/machines ./dc_consensus_test, -n 44, and -n 64 all work fine, but -n 42 and -n 43 fail! The errors look like this:


GRAPHLAB_SUBNET_ID/GRAPHLAB_SUBNET_MASK environment variables not defined.
Using default values
Subnet ID: 0.0.0.0
Subnet Mask: 0.0.0.0
Will find first IPv4 non-loopback address matching the subnet
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.87.101.209:45019
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.127.133.58:35323
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.120.46.150:33013
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.42.201.155:52191
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.125.118.65:45821
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.92.50.118:50689
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.120.117.30:51721
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.94.49.246:60382
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.42.149.98:57442
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.92.46.250:56321
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.87.106.251:38850
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.125.54.141:37106
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.125.42.194:57849
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.42.207.237:58351
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.44.107.189:44160
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.44.110.151:56499
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.94.103.137:60105
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.94.79.126:57874
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.125.43.43:47516
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.42.201.164:39652
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.42.35.220:52149
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.44.98.86:45670
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.92.94.109:46157
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.90.15.34:39570
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.92.59.39:51799
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.90.7.199:57803
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.125.37.179:38743
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.42.241.96:53170
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.94.106.182:35466
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.126.166.184:38463
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.127.138.186:48907
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.92.59.245:42414
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.44.11.190:33432
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.42.238.80:35112
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.42.49.12:37596
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.120.26.46:44069
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.92.50.163:52221
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.94.51.172:46435
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.86.99.181:56030
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.42.227.223:46326
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.42.219.157:35486
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.94.85.24:40527
INFO:     dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.92.107.53:44205
ERROR:    basic_types.hpp(exec:233): Check failed: c<=3  [0 <= 3]
ERROR:    basic_types.hpp(exec:233): Check failed: c<=3  [0 <= 3]
ERROR:    basic_types.hpp(exec:233): Check failed: c<=3  [0 <= 3]
ERROR:    basic_types.hpp(exec:233): Check failed: c<=3  [ <= 3]
ERROR:    basic_types.hpp(exec:233): Check failed: c<=3  [0 <= 3]
[node028:02775] *** Process received signal ***
[node025:02746] *** Process received signal ***
[node025:02746] Signal: Aborted (6)
[node025:02746] Signal code:  (-6)
[node024:02733] *** Process received signal ***
[node024:02733] Signal: Aborted (6)
[node024:02733] Signal code:  (-6)
[node026:02750] *** Process received signal ***
[node026:02750] Signal: Aborted (6)
[node026:02750] Signal code:  (-6)
[node028:02775] Signal: Aborted (6)
[node028:02775] Signal code:  (-6)
[node027:02552] *** Process received signal ***
[node027:02552] Signal: Aborted (6)
[node027:02552] Signal code:  (-6)
[node025:02746] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x7f8fcf24bcb0]
[node025:02746] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7f8fcdfce425]
[node025:02746] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x17b) [0x7f8fcdfd1b8b]
[node025:02746] [ 3] ./dc_consensus_test(_ZN8graphlab9mpi_tools10all_gatherISsEEvRKT_RSt6vectorIS2_SaIS2_EE+0xf5c) [0x473aac]
[node025:02746] [ 4] ./dc_consensus_test(_ZN8graphlab19init_param_from_mpiERNS_13dc_init_paramENS_12dc_comm_typeE+0x497) [0x471827]
[node025:02746] [ 5] ./dc_consensus_test(main+0xb1) [0x440301]
[node025:02746] [ 6] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7f8fcdfb976d]
[node025:02746] [ 7] ./dc_consensus_test() [0x4419b1]
[node025:02746] *** End of error message ***
[node024:02733] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x7faaa1da2cb0]
[node024:02733] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7faaa0b25425]
[node024:02733] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x17b) [0x7faaa0b28b8b]
[node024:02733] [ 3] ./dc_consensus_test(_ZN8graphlab9mpi_tools10all_gatherISsEEvRKT_RSt6vectorIS2_SaIS2_EE+0xf5c) [0x473aac]
[node024:02733] [ 4] ./dc_consensus_test(_ZN8graphlab19init_param_from_mpiERNS_13dc_init_paramENS_12dc_comm_typeE+0x497) [0x471827]
[node024:02733] [ 5] ./dc_consensus_test(main+0xb1) [0x440301]
[node024:02733] [ 6] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7faaa0b1076d]
[node024:02733] [ 7] ./dc_consensus_test() [0x4419b1]
[node026:02750] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x7f2971e5dcb0]
[node026:02750] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7f2970be0425]
[node026:02750] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x17b) [0x7f2970be3b8b]
[node026:02750] [ 3] ./dc_consensus_test(_ZN8graphlab9mpi_tools10all_gatherISsEEvRKT_RSt6vectorIS2_SaIS2_EE+0xf5c) [0x473aac]
[node026:02750] [ 4] ./dc_consensus_test(_ZN8graphlab19init_param_from_mpiERNS_13dc_init_paramENS_12dc_comm_typeE+0x497) [0x471827]
[node026:02750] [ 5] ./dc_consensus_test(main+0xb1) [0x440301]
[node026:02750] [ 6] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7f2970bcb76d]
[node026:02750] [ 7] ./dc_consensus_test() [0x4419b1]
[node026:02750] *** End of error message ***
[node028:02775] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x7fb6ee036cb0]
[node028:02775] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7fb6ecdb9425]
[node028:02775] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x17b) [0x7fb6ecdbcb8b]
[node028:02775] [ 3] ./dc_consensus_test(_ZN8graphlab9mpi_tools10all_gatherISsEEvRKT_RSt6vectorIS2_SaIS2_EE+0xf5c) [0x473aac]
[node028:02775] [ 4] ./dc_consensus_test(_ZN8graphlab19init_param_from_mpiERNS_13dc_init_paramENS_12dc_comm_typeE+0x497) [0x471827]
[node028:02775] [ 5] ./dc_consensus_test(main+0xb1) [0x440301]
[node028:02775] [ 6] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7fb6ecda476d]
[node028:02775] [ 7] ./dc_consensus_test() [0x4419b1]
[node028:02775] *** End of error message ***
[node024:02733] *** End of error message ***
[node027:02552] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x7f3bb8ccccb0]
[node027:02552] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7f3bb7a4f425]
[node027:02552] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x17b) [0x7f3bb7a52b8b]
[node027:02552] [ 3] ./dc_consensus_test(_ZN8graphlab9mpi_tools10all_gatherISsEEvRKT_RSt6vectorIS2_SaIS2_EE+0xf5c) [0x473aac]
[node027:02552] [ 4] ./dc_consensus_test(_ZN8graphlab19init_param_from_mpiERNS_13dc_init_paramENS_12dc_comm_typeE+0x497) [0x471827]
[node027:02552] [ 5] ./dc_consensus_test(main+0xb1) [0x440301]
[node027:02552] [ 6] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7f3bb7a3a76d]
[node027:02552] [ 7] ./dc_consensus_test() [0x4419b1]
[node027:02552] *** End of error message ***
[node023][[63639,1],22][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node009][[63639,1],8][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node017][[63639,1],16][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
--------------------------------------------------------------------------
mpiexec noticed that process rank 24 with PID 2746 on node node025 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
[node021][[63639,1],20][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)


I tried multiple times: 42/43 machines (and some other numbers, like 60) fail every time, while 40/41/44/64 succeed every time.

The same thing happens with the helloworld example from my first post in this thread: 42/43 machines always fail, and 40/41/44/64 always succeed.

I feel this error should be reproducible. I also tried to get a core dump, but failed; I'll try again tomorrow.

Thanks,
Jacob
config.log

Yucheng Low

Sep 26, 2013, 3:30:44 AM
to graph...@googlegroups.com
Aha. That backtrace actually isolates the error to an MPI all-gather call used to exchange the list of IP addresses.

For some reason, the all-gather returned something that could not be deserialized correctly; i.e., either the serializer is buggy, or something strange happened in the all_gather.

Have you tried switching machines around in your machines list? For example, move machines 42-43 to the end of the list.

Also, can you try running without -hostfile, i.e., all processes on localhost?

Yucheng

<config.log>

Zekai Jacob Gao

Sep 26, 2013, 3:42:03 PM
to graph...@googlegroups.com
Hey Yucheng,
I tried switching machines around in the list. Still getting the same results.

Tried running on localhost. Everything worked well.

The thing is, I simply ran the example dc_consensus_test under ./release/tests/ (and also helloworld, which just prints). Can you run it on 42/43 machines? Do you think the MPI version might have caused the problem?

Thanks,
Jacob 

Yucheng Low

Sep 27, 2013, 5:05:22 AM
to graph...@googlegroups.com
Hi,

Hmm.... That is very interesting.
I have not had issues running on a large number of machines before; we have run on up to 128.
I am not entirely sure what the issue is here, especially since it runs fine on localhost.
I will need to look into this further. Right now the key question is what was sent/received in the MPI all-gather call.
I will be back on Sunday and in the office on Monday, and I can do some interactive debugging then. (Can we arrange for you to give me SSH access to the machines?)

I am not sure whether the MPI version has anything to do with it, though we can certainly try a more recent version of OpenMPI and see how that works out.
You will have to reconfigure and recompile.

Yucheng