Hadoop result graph as RDD


HadoopMarc

Feb 9, 2018, 3:28:36 AM
to Gremlin-users

The TinkerPop reference docs give a nice example of how to get the Hadoop result graph as an RDD for the PeerPressureVertexProgram:

gremlin> Spark.create('local[4]')
==>org.apache.spark.SparkContext@4293e066
gremlin> graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties')
==>hadoopgraph[gryoinputformat->gryooutputformat]
gremlin> graph.configuration().setProperty('gremlin.hadoop.graphWriter', PersistedOutputRDD.class.getCanonicalName())
gremlin> graph.configuration().setProperty('gremlin.spark.persistContext', true)
gremlin> graph.compute(SparkGraphComputer).program(PeerPressureVertexProgram.build().create(graph)).mapReduce(ClusterCountMapReduce.build().memoryKey('clusterCount').create()).submit().get()
==>result[hadoopgraph[persistedinputrdd->persistedoutputrdd],memory[size:1]]
gremlin> spark.ls()
==>output/clusterCount [Memory Deserialized 1x Replicated]
==>output/~g [Memory Deserialized 1x Replicated]
gremlin> spark.ls('output')
==>output/clusterCount [Memory Deserialized 1x Replicated]
==>output/~g [Memory Deserialized 1x Replicated]
gremlin> Spark.getRDD('output/~g').getClass()
==>class org.apache.spark.rdd.MapPartitionsRDD


However, when you try to do the same for the TraversalVertexProgram, spark.ls() and Spark.getRDDs() return null. Looking in the code of TraversalVertexProgram, I find the following hardcoded values, which I believe are the cause of this behavior:

    @Override
    public GraphComputer.ResultGraph getPreferredResultGraph() {
        return GraphComputer.ResultGraph.ORIGINAL;
    }

    @Override
    public GraphComputer.Persist getPreferredPersist() {
        return GraphComputer.Persist.NOTHING;
    }


Of course, it is easy to subclass the TraversalVertexProgram and change these values, but is there any reason for this fixed behaviour?
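
For illustration, such a subclass might look roughly like this (a hypothetical sketch only — the class name is made up, and builder/constructor access on TraversalVertexProgram is glossed over):

```groovy
// Hypothetical sketch: a TraversalVertexProgram variant whose preferred
// result graph is persisted. Untested; constructor/builder wiring omitted.
class PersistingTraversalVertexProgram extends TraversalVertexProgram {

    @Override
    GraphComputer.ResultGraph getPreferredResultGraph() {
        return GraphComputer.ResultGraph.NEW
    }

    @Override
    GraphComputer.Persist getPreferredPersist() {
        return GraphComputer.Persist.VERTEX_PROPERTIES
    }
}
```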

Cheers,     Marc


HadoopMarc

Feb 9, 2018, 3:53:43 AM
to Gremlin-users
Answering my own question: the properties I assumed fixed are accessible via the GraphComputer API:

graph.compute(SparkGraphComputer).
      program(TraversalVertexProgram.build().traversal(graph.traversal().V()).create(graph)).
      result(GraphComputer.ResultGraph.NEW).
      persist(GraphComputer.Persist.VERTEX_PROPERTIES).
      submit().get()
rdd = Spark.getRDD('output/~g')
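
The persisted RDD can then be inspected directly. A sketch, assuming (as with TinkerPop's Spark I/O) that the elements are (id, VertexWritable) pairs and that VertexWritable.get() yields the vertex:

```groovy
// Sketch: peek at the persisted result graph. Assumes Tuple2 pairs of
// (vertex id, VertexWritable); adjust if your RDD element type differs.
rdd = Spark.getRDD('output/~g')
rdd.take(2).each { pair ->
    vertex = pair._2().get()
    println "${vertex.id()}: ${vertex.keys()}"
}
```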

Happy Sparking!

Marc


On Friday, February 9, 2018 at 09:28:36 UTC+1, HadoopMarc wrote:

HadoopMarc

Jun 23, 2019, 8:02:50 AM
to Gremlin-users
The above only clones the input graph rather than presenting the result graph. For DKuppitz's bug report and workaround, see:


On Friday, February 9, 2018 at 09:53:43 UTC+1, HadoopMarc wrote: