PageRank on Large Graph

Joe Obernberger

Sep 22, 2017, 9:41:12 AM
to JanusGraph users
Hi All - I've been experimenting with SparkGraphComputer and have it
working, but I'm running into performance issues.  What is the best way to run
PageRank against a very large graph stored in JanusGraph?

Thank you!

-Joe

HadoopMarc

Sep 22, 2017, 1:05:32 PM
to JanusGraph users
Hi Joe,

This question reminds me of an earlier discussion we had about the performance of OLAP traversals for janusgraph-hbase. My conclusion there was that janusgraph-hbase needs a better HBaseInputFormat, one that delivers more than one partition per HBase region. I suspect PageRank suffers from this in the same way. Do you maybe have the option to use Cassandra, which has a configurable cassandra.input.split.size? I did not try this myself.
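
For example, in the Hadoop graph properties file, something like this might do it (untested, and the value is only illustrative; the key reportedly sets the approximate number of rows per input split, so smaller values mean more Spark partitions):

cassandra.input.split.size=4096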

HTH,    Marc

Joe Obernberger

Sep 25, 2017, 11:24:55 AM
to HadoopMarc, JanusGraph users

It reminds me of that one too!  At present I'm locked into HBase, so I can't make the switch to Cassandra very easily.  I did try:
result = graph.compute().program(PageRankVertexProgram.build().create()).submit().get()

It took a little over 8 hours to run, but it did complete once I adjusted hbase.client.scanner.timeout.period to something very long.  Interestingly, I had to modify that setting in the included jar file, not in the file in /etc/hbase/conf.
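
For reference, that setting goes into hbase-site.xml and looks like this (the one-hour value here is just illustrative):

<property>
  <name>hbase.client.scanner.timeout.period</name>
  <!-- milliseconds; one hour -->
  <value>3600000</value>
</property>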

I'd really like to get this runtime way down, but I'm not sure what other method to try.

-Joe

HadoopMarc

Sep 25, 2017, 3:21:13 PM
to JanusGraph users
Hi Joe,

Maybe a suggestion after all. I believe you ran the PageRankVertexProgram directly on the JanusGraph instance, but it should also be possible to run it on a HadoopGraph with compute(SparkGraphComputer) via JanusGraph's HBaseInputFormat. That would at least parallelize the table scan across the HBase regions. In my previous answer I assumed you had done that!

Cheers,     Marc

Joe Obernberger

Sep 25, 2017, 6:46:26 PM
to HadoopMarc, JanusGraph users

Thank you Marc.  I assume this would be Java code executed via spark-submit?

-Joe

HadoopMarc

Sep 26, 2017, 3:40:06 PM
to JanusGraph users
Hi Joe,

No, not exactly, because the TinkerPop recipe points at spark-submit as the source of most of the version conflicts. Spark-submit is just a big wrapper around the Spark launch API that sets up the environment, but it does not do so in an application-friendly way. I would first try it from the Gremlin Console, for which the recipe was written. Doing the OLAP PageRank in a Java project without spark-submit will require some effort to get the classpath right.
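
Concretely, that means starting the console and activating the Hadoop and Spark plugins first, roughly like this:

bin/gremlin.sh
gremlin> :plugin use tinkerpop.hadoop
gremlin> :plugin use tinkerpop.spark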

HTH,   Marc

Joe Obernberger

Sep 27, 2017, 9:06:19 AM
to HadoopMarc, JanusGraph users

Hi Marc - I'm not sure I understand.  I tried this:

gremlin> g=graph.traversal()
==>graphtraversalsource[standardjanusgraph[hbase:[10.22.5.63:2181, 10.22.5.64:2181, 10.22.5.65:2181]], standard]
gremlin> result=graph.compute().program(PageRankVertexProgram.build().create()).submit().get()

Is that what you mean?  That does not work on very large graphs.  Even on a small graph (about 9 million nodes), it took 8 hours to complete and used only one machine to do the work.  I'm looking for methods to calculate values on very large graphs.  Any ideas?
Thank you!

-Joe

HadoopMarc

Sep 27, 2017, 12:04:20 PM
to JanusGraph users
Hi Joe,

My thoughts were more like:

graph = GraphFactory.open('conf/hadoop-graph/read-hbase-spark-yarn.properties')
result=graph.compute(SparkGraphComputer).program(PageRankVertexProgram.build().create()).submit().get()

along the lines of "Exporting with BulkDumperVertexProgram" in http://tinkerpop.apache.org/docs/3.2.3/reference/#sparkgraphcomputer
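
The properties file itself would look roughly like this (a sketch based on the janusgraph-hbase examples, untested here; depending on the TinkerPop version, the reader/writer keys are gremlin.hadoop.graphInputFormat/graphOutputFormat or gremlin.hadoop.graphReader/graphWriter):

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=org.janusgraph.hadoop.formats.hbase.HBaseInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
janusgraphmr.ioformat.conf.storage.backend=hbase
janusgraphmr.ioformat.conf.storage.hostname=10.22.5.63,10.22.5.64,10.22.5.65
spark.master=yarn-client
spark.executor.memory=10g
spark.serializer=org.apache.spark.serializer.KryoSerializer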

I am curious whether it works!

Marc

Joe Obernberger

Sep 27, 2017, 4:30:33 PM
to HadoopMarc, JanusGraph users

Thank you Marc.  That runs on my cluster, but takes a very long time.  If I try it on a larger graph, the YARN jobs run out of heap.  Right now I'm giving them 10G each.

On a small graph, I can run it OK, and I can run the BulkDumperVertexProgram as well.  What I can't do, when I run with SparkGraphComputer, is look at the results.

After running:
result = graph.compute(SparkGraphComputer).program(PageRankVertexProgram.build().create()).submit().get()
I can do a result.memory().runtime, which returns a number (in my case 609821).
I then do:

g = result.graph().traversal(computer(SparkGraphComputer))
Unfortunately, any command with g gives the same error - for example:
g.V().valueMap() returns:
java.io.IOException: No input paths specified in job

Since this is a small graph, if I run it without SparkGraphComputer, those commands on g work fine, such as:
g.V(id).valueMap('gremlin.pageRankVertexProgram.pageRank')

I'm trying to find any method to run PageRank on a very large graph that is stored in JanusGraph.  Thanks!  Anything you would like me to try?

-Joe

HadoopMarc

Sep 28, 2017, 4:09:59 PM
to JanusGraph users
Hi Joe,

Thanks for reporting back. So it indeed seems to be the same problem as for OLAP traversals: the input splits of HBaseInputFormat have the size of a complete region, which is a bit too much for SparkGraphComputer. I think it should be fairly easy to adapt JanusGraph's HBaseInputFormat a bit, such that the splits coming from the parent HBase TableInputFormat are split into smaller parts, say smaller than some configurable janusgraph.hbase.mapreduce.maxinputsplitsize=128M. All the necessary variables and methods are present in HBase's TableInputFormat. I plan to do it some time in the future, but please do not rely on that. If someone else wants to take up the work sooner, please create a ticket first so that others know.
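
A rough sketch of the idea in Java (untested; the subclass name and the fixed sub-split count are invented here, and a real implementation would derive the count from the configurable maximum split size mentioned above):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.mapreduce.TableSplit;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.janusgraph.hadoop.formats.hbase.HBaseInputFormat;

// Hypothetical subclass that subdivides each region-sized split coming
// from the parent input format into a fixed number of smaller splits.
public class SubSplittingHBaseInputFormat extends HBaseInputFormat {

    // Invented constant; a real implementation would compute this from
    // region size / janusgraph.hbase.mapreduce.maxinputsplitsize.
    private static final int SUB_SPLITS_PER_REGION = 8;

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException, InterruptedException {
        List<InputSplit> subSplits = new ArrayList<>();
        for (InputSplit split : super.getSplits(context)) {
            TableSplit ts = (TableSplit) split;
            byte[] start = ts.getStartRow();
            byte[] end = ts.getEndRow();
            // The first and last regions have open-ended row ranges,
            // which Bytes.split cannot subdivide; keep those whole.
            if (start.length == 0 || end.length == 0) {
                subSplits.add(ts);
                continue;
            }
            // Bytes.split returns the range boundaries including start and
            // end, or null when the range cannot be divided further.
            byte[][] bounds = Bytes.split(start, end, SUB_SPLITS_PER_REGION - 1);
            if (bounds == null) {
                subSplits.add(ts);
                continue;
            }
            for (int i = 0; i < bounds.length - 1; i++) {
                subSplits.add(new TableSplit(ts.getTable(), bounds[i], bounds[i + 1],
                        ts.getRegionLocation()));
            }
        }
        return subSplits;
    }
}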

Cheers,    Marc

Joe Obernberger

Sep 28, 2017, 5:51:18 PM
to HadoopMarc, JanusGraph users

Thank you Marc.  This seems to suggest that if I split the HBase table up into many, many regions, that would address the issue with running PageRank.

Any idea why I can't execute any commands on the graph once the SparkGraphComputer job completes?  They all return java.io.IOException: No input paths specified in job

Thanks again!

-Joe

HadoopMarc

Sep 29, 2017, 11:27:30 AM
to JanusGraph users
Hi Joe,

Regarding not finding the OLAP output, did you try this section of the TinkerPop ref docs?
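
Something along these lines might work (untested; it reuses the same properties file but repoints the input at the gryo output that the PageRank run wrote under output/~g):

graph = GraphFactory.open('conf/hadoop-graph/read-hbase-spark-yarn.properties')
// on newer TinkerPop versions this key is gremlin.hadoop.graphReader
graph.configuration().setProperty('gremlin.hadoop.graphInputFormat',
    'org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat')
graph.configuration().setProperty('gremlin.hadoop.inputLocation', 'output/~g')
g = graph.traversal().withComputer(SparkGraphComputer)
g.V().valueMap('gremlin.pageRankVertexProgram.pageRank')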

Cheers,     Marc

Joe Obernberger

Oct 3, 2017, 11:47:50 AM
to HadoopMarc, JanusGraph users

Hi Marc,

Ah - I see the output in /user/username/output/~g.  This appears to be gryo format.  Thank you!  Do you know of a way to update the actual JanusGraph with a new page rank property on each vertex, instead of writing out an entire graph to HDFS?  Would that be a modification of the PageRank code?
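
One idea I haven't tried yet: read the gryo output back without Spark and copy the score into JanusGraph over OLTP.  Something like this (both properties file names are made up here; the first would point gremlin.hadoop.inputLocation at output/~g with the GryoInputFormat, and it assumes the gryo vertices keep their original JanusGraph ids):

// read side: HadoopGraph over the gryo output; write side: plain JanusGraph
outGraph = GraphFactory.open('conf/hadoop-graph/read-pagerank-output.properties')
janus = JanusGraphFactory.open('conf/janusgraph-hbase.properties')
g = janus.traversal()
count = 0
outGraph.vertices().each { v ->
    g.V(v.id()).property('pageRank',
        v.value('gremlin.pageRankVertexProgram.pageRank')).iterate()
    if (++count % 10000 == 0) janus.tx().commit()  // commit in batches
}
janus.tx().commit()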

What appears to work, improving performance and reducing memory requirements, is splitting the table up into many regions.  I have a graph of about 24.4 million vertices that uses 7.8G of space in HBase, and I've split it into 462 regions.  I can run PageRank on that graph in 44 minutes on a 5-server cluster with 128G of RAM in each server.  In this case, I gave each task 10G of RAM with a max memory per node of 96G.  I think what may work is to set the max file size in HBase to something very small, like 16M, to force splits with:
alter 't1', MAX_FILESIZE => '16777216'
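
One caveat: lowering MAX_FILESIZE should only trigger splits as regions are written and compacted, so to split existing regions right away the HBase shell's split command may also be needed (repeated until the region count looks right):

split 't1'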

Interestingly, I lowered spark.executor.memory from 10G to 4G and the process completed, but it took almost twice as long.  I was thinking that since more executors could then fit on each node (96G/4G instead of 96G/10G), it would run faster.  I'm running more tests.  Thanks again for the help on this!

-Joe