Hi,
I followed the guide at
http://graphlab.org/fine-tuning-graphlab-performance/ and verified that MPI is set up properly.
I had left out the "machines" argument when running the program, so the graph was being loaded to HDFS on each machine separately.
I am attaching the logs from the program's execution. After the first iteration, I get the following error:
Connection to ec2-50-17-123-14.compute-1.amazonaws.com closed by remote host.
Traceback (most recent call last):
  File "./gl_ec2.py", line 736, in <module>
    main()
  File "./gl_ec2.py", line 616, in main
    \"""" % (opts.identity_file, proxy_opt, master), shell=True)
  File "/usr/lib/python2.7/subprocess.py", line 511, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no -i /home/prashanth/.ssh/graphlab.pem
ubu...@ec2-50-17-123-14.compute-1.amazonaws.com "export PATH=$PATH:/opt/hadoop-1.0.1/bin;
export CLASSPATH=$CLASSPATH:.:\`hadoop classpath\`;
export JAVA_HOME=/usr/lib/jvm/java-6-sun;
cat ~/machines
mpiexec.mpich2 -f ~/machines -envlist CLASSPATH -n 7 /home/ubuntu/graphlabapi/release/toolkits/meng/set --graph=hdfs://\`head -n 1 ~/machines\`/input --iterations=3 --topic=0;
"' returned non-zero exit status 255
Yucheng mentioned that this is due to running out of memory. But I am using m1.xlarge instances, which have 15 GB of RAM, and I am running 7 slave instances.
My question is: whose memory is running out? Is all the intermediate data between iterations stored in the master's RAM, distributed among the slaves, or written to HDFS temporarily?
My program finds, for every vertex, all the neighbors at 3 degrees of separation. I am attaching the program as well, with the changes to the load and save functions that Yucheng suggested. I also reduced the amount of data flowing between vertices to work around the out-of-memory issue.
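
To make the computation concrete, here is a minimal single-machine Python sketch of the idea. It is not the attached GraphLab program: the function name k_hop_neighbors, the undirected-edge assumption, and the choice to collect everything within k hops (rather than exactly at k hops) are all assumptions made for illustration. Each vertex keeps a set of reached vertices and merges in its neighbors' sets once per iteration, which is why the per-vertex state can grow quickly from one iteration to the next.

```python
from collections import defaultdict


def k_hop_neighbors(edges, k):
    """Return {vertex: set of vertices within k hops}, excluding the vertex itself.

    Hypothetical helper for illustration; assumes an undirected graph
    given as an iterable of (u, v) pairs and k >= 1.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    # After the first "iteration" each vertex knows only its direct neighbors.
    reached = {v: set(nbrs) for v, nbrs in adj.items()}

    # Every further iteration, each vertex merges in the sets held by its
    # direct neighbors, extending its reach by one hop.
    for _ in range(k - 1):
        new_reached = {}
        for v in adj:
            merged = set(reached[v])
            for n in adj[v]:
                merged |= reached[n]
            merged.discard(v)  # a vertex is not its own neighbor
            new_reached[v] = merged
        reached = new_reached
    return reached


if __name__ == "__main__":
    # Small path graph 1-2-3-4-5 as a usage example.
    path = [(1, 2), (2, 3), (3, 4), (4, 5)]
    result = k_hop_neighbors(path, 3)
    for vertex in sorted(result):
        print(vertex, sorted(result[vertex]))
```

In this sketch the sets exchanged in iteration i are the ones built in iteration i-1, so the data moved between vertices grows with every iteration on a dense graph.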
Thanks a lot in advance,
Prashanth