Hi dev team,
The JanusGraph users list has seen a number of threads about OLAP performance with janusgraph-hbase. In particular, the initial loading of a graph is problematic when the HBase table is stored in a small number of large regions of, say, 10 GB. HBase performs best with such large regions, so system administrators are unlikely to welcome HBase-backed graphs with the many small regions needed for good parallelism during OLAP operations. For this reason, HBase 2.0 alpha has introduced a mappers.per.region option to TableInputFormatBase, which allows a single region to be spread over multiple mappers (or, in our case, Spark tasks). Eager to use this feature before HBase 2.0 and a JG version supporting it come out, I made a quick attempt to backport the feature. This turns out to be quite doable, see:
https://github.com/vtslab/janusgraph/commit/87bf1000c01dfce92e857349ba479db0d3ef6bd1
This is initial work, and I plan to run a performance benchmark with the Friendster graph, as the TinkerPop team did.
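To illustrate what spreading one region over several mappers amounts to, here is a minimal, self-contained sketch. It is not the code from the commit above, nor HBase's own implementation: the class and method names (RegionSplitSketch, boundaries, toFixedWidth) are hypothetical, and it assumes that the region has non-empty start and end keys and that row keys are spread roughly evenly over the key range.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: cuts one region's key range into n sub-ranges,
// so that each sub-range could be handed to its own mapper or Spark task.
class RegionSplitSketch {

    // Returns n + 1 boundary keys; consecutive pairs delimit the n sub-ranges
    // [key_i, key_i+1) that together cover [start, end).
    // Assumes start < end and that both keys are non-empty.
    static List<byte[]> boundaries(byte[] start, byte[] end, int n) {
        // Right-pad the shorter key with zero bytes so both have the same width;
        // this keeps their lexicographic order intact.
        int width = Math.max(start.length, end.length);
        BigInteger lo = new BigInteger(1, Arrays.copyOf(start, width));
        BigInteger hi = new BigInteger(1, Arrays.copyOf(end, width));
        BigInteger span = hi.subtract(lo);

        List<byte[]> keys = new ArrayList<>();
        for (int i = 0; i <= n; i++) {
            // Boundary i sits at lo + span * i / n (integer division).
            BigInteger offset = span.multiply(BigInteger.valueOf(i))
                                    .divide(BigInteger.valueOf(n));
            keys.add(toFixedWidth(lo.add(offset), width));
        }
        return keys;
    }

    // Converts a non-negative value back to an unsigned big-endian key of the given width.
    static byte[] toFixedWidth(BigInteger value, int width) {
        byte[] raw = value.toByteArray(); // may carry a leading sign byte or be shorter
        byte[] out = new byte[width];
        int copy = Math.min(raw.length, width);
        System.arraycopy(raw, raw.length - copy, out, width - copy, copy);
        return out;
    }
}
```

With this sketch, splitting a region spanning the keys 0x00 to 0x10 over four mappers gives the boundaries 0x00, 0x04, 0x08, 0x0c and 0x10. If the row keys are skewed, such an even cut of the key space will not give evenly sized tasks.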
My questions to you:
- Would this work be welcome as a JanusGraph PR before a release based on HBase 2.0 comes out?
- If so, do you have any suggestions for improving the work?
Some additional notes:
- SparkGraphComputer has an option to repartition the graph using the workers() method of the GraphComputer builder, but this does not improve the parallelism of the initial load.
- The current HBaseInputFormat has a rather intricate inheritance structure, which will probably need substantial refactoring to use the HBase 2.0 TableInputFormatBase.
Cheers, Marc