GraphLab cannot read HDFS files by a specified file prefix?

kai

Jun 27, 2013, 2:25:09 AM
to graph...@googlegroups.com
[titan@cucstorage sdb]$ time mpiexec -n 3 pagerank --graph hdfs://172.16.12.5:9288/user/titan/data-4000000/graphlab-adj --format="adj" --saveprefix hdfs://172.16.12.5:9288/user/titan/data-4000000/output/pagerank/pagerank.log --ncpus=8

GRAPHLAB_SUBNET_ID/GRAPHLAB_SUBNET_MASK environment variables not defined.
Using default values
Subnet ID: 0.0.0.0
Subnet Mask: 0.0.0.0
Will find first IPv4 non-loopback address matching the subnet
GRAPHLAB_SUBNET_ID/GRAPHLAB_SUBNET_MASK environment variables not defined.
Using default values
Subnet ID: 0.0.0.0
Subnet Mask: 0.0.0.0
Will find first IPv4 non-loopback address matching the subnet
TCP Communication layer constructed.
TCP Communication layer constructed.
INFO:     distributed_graph.hpp(set_ingress_method:2822): Use random ingress
Loading graph in format: adj
Loading graph in format: adj
INFO:     distributed_graph.hpp(set_ingress_method:2822): Use random ingress
GRAPHLAB_SUBNET_ID/GRAPHLAB_SUBNET_MASK environment variables not defined.
Using default values
Subnet ID: 0.0.0.0
Subnet Mask: 0.0.0.0
Will find first IPv4 non-loopback address matching the subnet
TCP Communication layer constructed.
INFO:     distributed_graph.hpp(set_ingress_method:2822): Use random ingress
Loading graph in format: adj
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00002b3a2e50fbc0, pid=16132, tid=47528894618784
#
# JRE version: 6.0_31-b04
# Java VM: Java HotSpot(TM) 64-Bit Server VM (20.6-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x50fbc0]  unsigned+0xb0
#
# An error report file with more information is saved as:
# /sdb/hs_err_pid16132.log
#
# If you would like to submit a bug report, please visit:
#
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f564f748bc0, pid=5420, tid=140008671612000
#
# JRE version: 6.0_31-b04
# Java VM: Java HotSpot(TM) 64-Bit Server VM (20.6-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x50fbc0]  unsigned+0xb0
#
# An error report file with more information is saved as:
# /sdb/hs_err_pid5420.log
#
# If you would like to submit a bug report, please visit:
#
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f706a772bc0, pid=18710, tid=140120791805024
#
# JRE version: 6.0_31-b04
# Java VM: Java HotSpot(TM) 64-Bit Server VM (20.6-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x50fbc0]  unsigned+0xb0
#
# An error report file with more information is saved as:
# /sdb/hs_err_pid18710.log
#
# If you would like to submit a bug report, please visit:
#

-rw-r--r--   3 titan supergroup   38712961 2013-06-24 15:48 /user/titan/data-4000000/graphlab-adj-1790000.adj
-rw-r--r--   3 titan supergroup   38870630 2013-06-24 15:48 /user/titan/data-4000000/graphlab-adj-180000.adj
-rw-r--r--   3 titan supergroup   38958915 2013-06-24 15:48 /user/titan/data-4000000/graphlab-adj-1800000.adj
-rw-r--r--   3 titan supergroup   39038166 2013-06-24 15:48 /user/titan/data-4000000/graphlab-adj-1810000.adj
-rw-r--r--   3 titan supergroup   39437524 2013-06-24 15:48 /user/titan/data-4000000/graphlab-adj-1820000.adj
-rw-r--r--   3 titan supergroup   38646297 2013-06-24 15:48 /user/titan/data-4000000/graphlab-adj-1830000.adj
-rw-r--r--   3 titan supergroup   38594307 2013-06-24 15:48 /user/titan/data-4000000/graphlab-adj-1840000.adj
...

Haijie Gu

Jun 27, 2013, 1:03:43 PM
to graph...@googlegroups.com
Please add "env CLASSPATH=`hadoop classpath`" before pagerank: because you are loading from HDFS, the program needs to know the Hadoop classpath.

Also, there are still 3 independent instances of pagerank running. Try specifying the hostfile for mpiexec. For example:

mpiexec -n 3 -f hostfile env CLASSPATH=`hadoop classpath` ./pagerank …
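(For reference, a hostfile for MPICH is just a plain text file with one machine per line; the host names below are hypothetical placeholders:

node01
node02
node03

With the Hydra launcher an optional suffix such as "node01:2" caps the number of processes started on that host.)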


-jay

kai

Jul 4, 2013, 12:53:43 AM
to graph...@googlegroups.com
Hi,

pagerank fails when the graph is specified by an HDFS file prefix, but it runs normally when the full file path is used.

GraphLab Version is V2.2

[titan@smartstorage03 ~]$ time mpiexec --hostfile /home/titan/mpdmaster.hosts -n 6 env CLASSPATH=`hadoop classpath` /home/titan/graphlab/graphlab-v2.2/release/toolkits/graph_analytics/pagerank --graph hdfs://172.16.12.5:9288/user/titan/data-4000000/graphlab-adj --format="adj" --saveprefix hdfs://172.16.12.5:9288/user/titan/data-4000000/output/pagerank/pagerank.log --ncpus=8
GRAPHLAB_SUBNET_ID/GRAPHLAB_SUBNET_MASK environment variables not defined.
Using default values
Subnet ID: 0.0.0.0
Subnet Mask: 0.0.0.0
Will find first IPv4 non-loopback address matching the subnet
INFO:     dc.cpp(init:554): Cluster of 6 instances created.
Loading graph in format: adj
INFO:     distributed_graph.hpp(set_ingress_method:2902): Automatically determine ingress method: grid
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f17815febc0, pid=16500, tid=139738926193568
#
# JRE version: 6.0_31-b04
# Java VM: Java HotSpot(TM) 64-Bit Server VM (20.6-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x50fbc0]  unsigned+0xb0
#
# An error report file with more information is saved as:
# /home/titan/hs_err_pid16500.log
#
# If you would like to submit a bug report, please visit:
#
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f95674dfbc0, pid=26213, tid=140279654689696
#
# JRE version: 6.0_31-b04
# Java VM: Java HotSpot(TM) 64-Bit Server VM (20.6-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x50fbc0]  unsigned+0xb0
#
# An error report file with more information is saved as:
# /home/titan/hs_err_pid26213.log
#
# If you would like to submit a bug report, please visit:
#

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 6
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:5...@smartstorage03.yoyoyws.com] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:883): assert (!closed) failed
[proxy:0:5...@smartstorage03.yoyoyws.com] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:5...@smartstorage03.yoyoyws.com] main (./pm/pmiserv/pmip.c:210): demux engine error waiting for event
[proxy:0:0...@cucstorage.wocloud.com.cn] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:883): assert (!closed) failed
[proxy:0:0...@cucstorage.wocloud.com.cn] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0...@cucstorage.wocloud.com.cn] main (./pm/pmiserv/pmip.c:210): demux engine error waiting for event
[mpi...@smartstorage03.yoyoyws.com] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpi...@smartstorage03.yoyoyws.com] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpi...@smartstorage03.yoyoyws.com] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:216): launcher returned error waiting for completion
[mpi...@smartstorage03.yoyoyws.com] main (./ui/mpich/mpiexec.c:325): process manager error waiting for completion

real    0m2.349s
user    0m0.029s
sys     0m0.024s

On Friday, June 28, 2013 at 1:03:43 AM UTC+8, Jay wrote:
(Attachment: hs_err_pid16500.log)

Yucheng Low

Jul 9, 2013, 12:49:46 AM
to graph...@googlegroups.com
Hi,

Interesting. We are not emitting enough details in the HDFS error message.
Make sure that all machines can enumerate the path hdfs://172.16.12.5:9288/user/titan/data-4000000/
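For example, a quick sanity check (assuming the Hadoop client is installed and configured on every node) is to run the same listing on each machine:

hadoop fs -ls hdfs://172.16.12.5:9288/user/titan/data-4000000/

Every node should return the same set of graphlab-adj-*.adj files.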

Also, what version of Hadoop are you using?

Yucheng

Yingxia Shao

Nov 26, 2013, 8:40:05 PM
to graph...@googlegroups.com
Hi,
   I encountered the same problem. I have 60 separate files on HDFS with the same prefix.
When I run undirected_triangle_count with 10 or more instances, it fails with the following "SIGSEGV" error.

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000000000541c45, pid=18741, tid=139898438657808
#
# JRE version: 7.0_05-b05
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.1-b03 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [undirected_triangle_count+0x141c45]  graphlab::cuckoo_map_pow2<unsigned long, graphlab::fixed_dense_bitset<128>, 3ul, unsigned int, boost::hash<unsigned long>, std::equal_to<unsigned long> >::do_insert(std::pair<unsigned long const, graphlab::fixed_dense_bitset<128> > const&)+0x345
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /tmp/hs_err_pid18741.log
#
# If you would like to submit a bug report, please visit:
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
rank 7 in job 18  changping11_49111   caused collective abort of all ranks
  exit status of rank 7: killed by signal 9 

It looks like the do_insert method causes the error.

In summary,

the "mpiexec -n 9 undirected_triangle_count --graph xxx --format xxx" run successfully.
the  "mpiexec -n 10 undirected_triangle_count --graph xxx --format xxx" failed with above error.

Thanks for the help.


The following is the environment information:

Hadoop 0.20.2, no_secure

MPICH2 Info:
MPICH2 Version:         1.5
MPICH2 Release date:    Mon Oct  8 14:00:48 CDT 2012
MPICH2 Device:          ch3:nemesis
MPICH2 configure:       --enable-shared --with-pm=mpd
MPICH2 CC:      gcc    -O2
MPICH2 CXX:     c++   -O2
MPICH2 F77:     gfortran   -O2
MPICH2 FC:      gfortran   -O2

Java version:
java version "1.7.0_05"
Java(TM) SE Runtime Environment (build 1.7.0_05-b05)
Java HotSpot(TM) 64-Bit Server VM (build 23.1-b03, mixed mode)

Rong Chen

Dec 11, 2013, 10:02:05 PM
to graph...@googlegroups.com
Maybe your problem is from the graph ingress.

GraphLab uses "grid" ingress for 9 machines, uses "oblivious" ingress for 10 machines.
(you can get it from log "Automatically determine ingress method: grid")

To confirm it, you can try "mpiexec -n 9 undirected_triangle_count --graph xxx --format xxx --graph_opts ingress=oblivious".
If it fails, the above assumption is probably right.


The problem comes from loading multiple files in parallel (even when using NFS).
There is a simple patch (https://github.com/graphlab-code/graphlab/pull/103).
It only restores correctness for "oblivious" ingress, where performance is still poor,
but it also speeds up the loading phase of "grid" ingress (real parallel loading).
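(If you want to try the patch, one way to check out a GitHub pull request locally, assuming your remote points at the graphlab-code/graphlab repository, is:

git fetch origin pull/103/head:pr-103
git checkout pr-103

then rebuild the toolkits.)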


Thanks,
Rong
Institute of Parallel and Distributed Systems (IPADS)
Shanghai Jiao Tong University
http://ipads.se.sjtu.edu.cn/projects/powerlyra.html