Hadoop result graph as RDD


HadoopMarc

Feb 9, 2018, 3:28:36 AM
to Gremlin-users

The TinkerPop reference docs give a nice example of how to get the Hadoop result graph as an RDD for the PeerPressureVertexProgram:

gremlin> Spark.create('local[4]')
==>org.apache.spark.SparkContext@4293e066
gremlin> graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties')
==>hadoopgraph[gryoinputformat->gryooutputformat]
gremlin> graph.configuration().setProperty('gremlin.hadoop.graphWriter', PersistedOutputRDD.class.getCanonicalName())
gremlin> graph.configuration().setProperty('gremlin.spark.persistContext', true)
gremlin> graph.compute(SparkGraphComputer).program(PeerPressureVertexProgram.build().create(graph)).mapReduce(ClusterCountMapReduce.build().memoryKey('clusterCount').create()).submit().get()
==>result[hadoopgraph[persistedinputrdd->persistedoutputrdd],memory[size:1]]
gremlin> spark.ls()
==>output/clusterCount [Memory Deserialized 1x Replicated]
==>output/~g [Memory Deserialized 1x Replicated]
gremlin> spark.ls('output')
==>output/clusterCount [Memory Deserialized 1x Replicated]
==>output/~g [Memory Deserialized 1x Replicated]
gremlin> Spark.getRDD('output/~g').getClass()
==>class org.apache.spark.rdd.MapPartitionsRDD


However, when you try to do the same for the TraversalVertexProgram, spark.ls() and Spark.getRDDs() return null. Looking in the code of TraversalVertexProgram, I find the following hardcoded values, which I believe are the cause of this behavior:

    @Override
    public GraphComputer.ResultGraph getPreferredResultGraph() {
        return GraphComputer.ResultGraph.ORIGINAL;
    }

    @Override
    public GraphComputer.Persist getPreferredPersist() {
        return GraphComputer.Persist.NOTHING;
    }


Of course, it is easy to subclass the TraversalVertexProgram and change these values, but is there any reason for this fixed behaviour?
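
For illustration, such a subclass might look roughly like this (a hypothetical sketch only — the class name is made up, and builder/constructor access on TraversalVertexProgram is glossed over):

```groovy
// Hypothetical sketch: a TraversalVertexProgram variant whose preferred
// result graph is persisted. Untested; constructor/builder wiring omitted.
class PersistingTraversalVertexProgram extends TraversalVertexProgram {

    @Override
    GraphComputer.ResultGraph getPreferredResultGraph() {
        return GraphComputer.ResultGraph.NEW
    }

    @Override
    GraphComputer.Persist getPreferredPersist() {
        return GraphComputer.Persist.VERTEX_PROPERTIES
    }
}
```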

Cheers,     Marc


HadoopMarc

Feb 9, 2018, 3:53:43 AM
to Gremlin-users
Answering my own question: the properties I assumed fixed are accessible via the GraphComputer API:

graph.compute(SparkGraphComputer).
      program(TraversalVertexProgram.build().traversal(graph.traversal().V()).create(graph)).
      result(GraphComputer.ResultGraph.NEW).
      persist(GraphComputer.Persist.VERTEX_PROPERTIES).
      submit().get()
rdd = Spark.getRDD('output/~g')
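
The persisted RDD can then be inspected directly. A sketch, assuming (as with TinkerPop's Spark I/O) that the elements are (id, VertexWritable) pairs and that VertexWritable.get() yields the vertex:

```groovy
// Sketch: peek at the persisted result graph. Assumes Tuple2 pairs of
// (vertex id, VertexWritable); adjust if your RDD element type differs.
rdd = Spark.getRDD('output/~g')
rdd.take(2).each { pair ->
    vertex = pair._2().get()
    println "${vertex.id()}: ${vertex.keys()}"
}
```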

Happy Sparking!

Marc


On Friday, February 9, 2018 at 09:28:36 UTC+1, HadoopMarc wrote:

HadoopMarc

Jun 23, 2019, 8:02:50 AM
to Gremlin-users
The above only clones the input graph rather than presenting the result graph. For DKuppitz's bug report and workaround, see:


On Friday, February 9, 2018 at 09:53:43 UTC+1, HadoopMarc wrote: