Hey guys,
I am having trouble running GraphLab on a 100-machine cluster.
When I run the following program:
#include <graphlab.hpp>

int main(int argc, char** argv) {
  graphlab::mpi_tools::init(argc, argv);
  graphlab::distributed_control dc;
  printf("Hello World!\n");
  graphlab::mpi_tools::finalize();
}
Using 40 machines with the command: mpiexec -n 40 -hostfile ~/machines ./helloworld
I get:
GRAPHLAB_SUBNET_ID/GRAPHLAB_SUBNET_MASK environment variables not defined.
Using default values
Subnet ID: 0.0.0.0
Subnet Mask: 0.0.0.0
Will find first IPv4 non-loopback address matching the subnet
INFO: dc.cpp(init:554): Cluster of 40 instances created.
Hello World!
Hello World!
Hello World!
... (40 "Hello World!" lines in total, one per process)
which is what I expect.
But when I run the same program on 100 machines (in fact, on any number greater than 41), I get the errors in the attached file, in which processes try to allocate absurdly large amounts of memory.
(An excerpt of the errors:
tcmalloc: large alloc 5066549580800000 bytes == (nil) @ 0x65b851 0x7f9f1d417a89
tcmalloc: large alloc 5348024557510656 bytes == (nil) @ 0x65b851 0x7f9d8bf47a89
tcmalloc: large alloc 3759995725912481792 bytes == (nil) @ 0x65b851 0x7fa9b084ca89
terminate called after throwing an instance of 'std::bad_alloc'
tcmalloc: large alloc 5348024557510656 bytes == (nil) @ 0x65b851 0x7f0dbf228a89
ERROR: basic_types.hpp(exec:140): Check failed: !(iarc.fail())
ERROR: basic_types.hpp(exec:140): Check failed: !(iarc.fail())
tcmalloc: large alloc 3472556787679371264 bytes == (nil) @ 0x65b851 0x7f91e13dfa89
ERROR: basic_types.hpp(exec:140): Check failed: !(iarc.fail())
terminate called after throwing an instance of 'std::bad_alloc'
tcmalloc: large alloc 3530822107858477056 bytes == (nil) @ 0x65b851 0x7fdffe09fa89
tcmalloc: large alloc 3530822107858477056 bytes == (nil) @ 0x65b851 0x7f90fcd2aa89
ERROR: basic_types.hpp(exec:140): Check failed: !(iarc.fail())
tcmalloc: large alloc 5348024557510656 bytes == (nil) @ 0x65b851 0x7fe0dde7fa89
ERROR: basic_types.hpp(exec:140): Check failed: !(iarc.fail())
terminate called after throwing an instance of ' what(): std::bad_alloc
tcmalloc: large alloc 3472556787679371264 bytes == (nil) @ 0x65b851 0x7f75d7ccea89
ERROR: basic_types.hpp(exec:140): Check failed: !(iarc.fail())
tcmalloc: large alloc 3530822107858477056 bytes == (nil) @ 0x65b851 0x7f24d966ea89
tcmalloc: large alloc 3328214000696565760 bytes == (nil) @ 0x65b851 0x7fe5b45a9a89
tcmalloc: large alloc 3543822943798697984 bytes == (nil) @ 0x65b851 0x7f4210902a89
ERROR: basic_types.hpp(exec:140): Check failed: !(iarc.fail())
tcmalloc: large alloc 3759995725912481792 bytes == (nil) @ 0x65b851 0x7f8183724a89
tcmalloc: large alloc 3472556787679371264 bytes == (nil) @ 0x65b851 0x7f16a1a76a89
tcmalloc: large alloc 3905239000875810816 bytes == (nil) @ 0x65b851 0x7f7b00cf7a89
tcmalloc: large alloc 3530822107858477056 bytes == (nil) @ 0x65b851 0x7fc71b620a89
terminate called after throwing an instance of 'std::bad_alloc'
tcmalloc: large alloc 3530822107858477056 bytes == (nil) @ 0x65b851 0x7fa8888f0a89
tcmalloc: large alloc 3530822107858477056 bytes == (nil) @ 0x65b851 0x7f77f656fa89
[node006:02419] *** Process received signal ***
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
[node019:02426] *** Process received signal ***
what(): std::bad_alloc
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
terminate called after throwing an instance of 'std::bad_alloc'
terminate called after throwing an instance of 'std::bad_alloc'
[node053:02219] *** Process received signal ***
[node053:02219] Signal: Aborted (6)
[node053:02219] Signal code: (-6)
std::bad_alloc'
[node059:02144] *** Process received signal ***
terminate called after throwing an instance of 'std::bad_alloc'
terminate called after throwing an instance of 'std::bad_alloc'
terminate called after throwing an instance of 'std::bad_alloc'
terminate called after throwing an instance of 'std::bad_alloc'
[node081:02117] *** Process received signal ***
terminate called after throwing an instance of 'std::bad_alloc'
terminate called after throwing an instance of 'std::bad_alloc'
terminate called after throwing an instance of 'std::bad_alloc'
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
)
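One other thing I noticed: the startup log warns that GRAPHLAB_SUBNET_ID/GRAPHLAB_SUBNET_MASK are not defined and falls back to 0.0.0.0. In case it is relevant, this is how they could be set explicitly before launching. The subnet values below are placeholders; the real values depend on the cluster's private network:

```shell
# Placeholder subnet values -- substitute the cluster's actual private subnet.
export GRAPHLAB_SUBNET_ID=10.0.0.0
export GRAPHLAB_SUBNET_MASK=255.255.0.0
# With Open MPI, -x forwards the variables to the remote processes.
mpiexec -x GRAPHLAB_SUBNET_ID -x GRAPHLAB_SUBNET_MASK \
        -n 100 -hostfile ~/machines ./helloworld
```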
The problem does not appear to be with my cluster or with MPI itself: I can run the following plain MPI program with no problem:
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
  int myid, numprocs;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  printf("Hello from %d\n", myid);
  printf("Numprocs is %d\n", numprocs);
  MPI_Finalize();
  return 0;
}
When I run the above program with the following command:
root@node001:~# mpiexec -n 100 -hostfile ~/machines ./c_ex00
Here are the results:
Hello from 0
Numprocs is 100
Hello from 1
Numprocs is 100
Hello from 32
Numprocs is 100
... (one "Hello from N" / "Numprocs is 100" pair for each of the 100 ranks, in arbitrary order)
Do you guys have any ideas? Has anyone experienced problems like this when using a large number of machines?
Any help is appreciated!
Thanks,
Zekai
--
You received this message because you are subscribed to the Google Groups "GraphLab API" group.
To unsubscribe from this group and stop receiving emails from it, send an email to graphlabapi...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
<100_machines_errors.txt>