Persist TraversalVertexProgram results to RDD using Java


MarKri

Dec 15, 2019, 11:38:27 AM
to Gremlin-users
Hello,

What I am trying to do is extract some data from a JanusGraph DB with a Cassandra storage backend, using a Gremlin OLAP (TinkerPop 3.4.3) traversal with the SparkGraphComputer, persist the result to a Spark RDD, convert it to a Dataset, and do some Spark calculations on the extracted data. I am using Java. To be a little more precise, I am using a TraversalVertexProgram as the VertexProgram, and I am able to get an RDD as output, but it contains just the input graph data, which I saw here
and the corresponding issue here
is known to be a bug.

Nonetheless, there is a workaround described in the issue,
but so far I have not been able to implement a working Java application that solves my problem.
 
To test the RDD persistence on a small example, I used the toy graph "tinkerpop-modern" and the properties file "hadoop-gryo.properties". I slightly modified the properties file, so I have:

####################################hadoop-gryo.properties##################
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat
gremlin.hadoop.jarsInDistributedCache=true

gremlin.spark.persistContext=true
gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.spark.structure.io.PersistedOutputRDD

gremlin.hadoop.inputLocation=/path_to_janusgraph/janusgraph-0.4.0-hadoop2/data/tinkerpop-modern.kryo
gremlin.hadoop.outputLocation=output

spark.master=local[4]
spark.executor.memory=1g
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoRegistrator
############################################################################

My Java code looks like this:

##############################Java code#####################################
//required imports (TinkerPop 3.4.x / Spark)
import org.apache.spark.rdd.RDD;
import org.apache.tinkerpop.gremlin.process.computer.GraphComputer;
import org.apache.tinkerpop.gremlin.process.computer.traversal.TraversalVertexProgram;
import org.apache.tinkerpop.gremlin.process.traversal.P;
import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer;
import org.apache.tinkerpop.gremlin.spark.structure.Spark;
import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.structure.util.GraphFactory;

//read example graph
Graph graph = GraphFactory.open("/path_to_janusgraph/janusgraph-0.4.0-hadoop2/conf/hadoop-graph/hadoop-gryo.properties");

//calculate TraversalVertexProgram with SparkGraphComputer
graph.compute(SparkGraphComputer.class)
        .program(TraversalVertexProgram.build()
                .traversal(
                        //some example traversal
                        graph.traversal().V().has("age", P.gt(30)).asAdmin())
                .create(graph))
        .persist(GraphComputer.Persist.VERTEX_PROPERTIES)
        .submit()
        .get();

//get persisted RDD
RDD rdd = Spark.getRDD("output/~g");

//do some example calculation, which shows just the input graph was persisted
System.out.println("RDD size: " + rdd.count());
#############################################################################

Whenever I tried the CloneVertexProgram workaround, I just got a copy of the input graph RDD.
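For reference, this is roughly how I check what is in the persisted RDD. This is only a sketch: it assumes the RDD holds (id, VertexWritable) pairs as written by PersistedOutputRDD, that TraversalVertexProgram.HALTED_TRAVERSERS marks vertices carrying a traversal result, and TinkerPop 3.4.x class names.

```java
import java.util.List;

import org.apache.spark.rdd.RDD;
import org.apache.tinkerpop.gremlin.hadoop.structure.io.VertexWritable;
import org.apache.tinkerpop.gremlin.process.computer.traversal.TraversalVertexProgram;
import org.apache.tinkerpop.gremlin.spark.structure.Spark;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import scala.Tuple2;

//look up the persisted RDD by name and pull its pairs to the driver
RDD<?> raw = Spark.getRDD("output/~g");
@SuppressWarnings("unchecked")
List<Tuple2<Object, VertexWritable>> pairs =
        (List<Tuple2<Object, VertexWritable>>) (List<?>) raw.toJavaRDD().collect();

//a vertex with a HALTED_TRAVERSERS property would carry a traversal result;
//in my runs no vertex has it, i.e. only the input graph was persisted
for (Tuple2<Object, VertexWritable> pair : pairs) {
    Vertex v = pair._2().get();
    System.out.println(v.id() + " haltedTraversers="
            + v.property(TraversalVertexProgram.HALTED_TRAVERSERS).isPresent());
}
```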

Thank you for your help!

HadoopMarc

Dec 21, 2019, 9:57:37 AM
to Gremlin-users
I checked and experienced the same problem. I will add this to the bug report you already mentioned.

Regarding the analytics you want to do, the following options come to mind:
  1. If your traversal is just a simple filter, you can use the CassandraInputFormat with Spark's newAPIHadoopFile method right away and do the filtering without Gremlin.
  2. If you want to use Gremlin for preprocessing, use the GryoOutputFormat as the graphWriter in the traversal + clone job. After that, recreate the Spark dataset from HDFS using the GryoInputFormat and Spark's newAPIHadoopFile. If your Gremlin traversal outputs vertices, you still have to filter for the vertices that have the TraversalVertexProgram.HALTED_TRAVERSERS property.
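A rough, untested sketch of option 2 in Java; it assumes the gryo output of the traversal + clone job landed under "output/~g" and uses TinkerPop 3.4.x class names:

```java
import org.apache.hadoop.io.NullWritable;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.tinkerpop.gremlin.hadoop.structure.io.VertexWritable;
import org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat;
import org.apache.tinkerpop.gremlin.process.computer.traversal.TraversalVertexProgram;

SparkConf conf = new SparkConf()
        .setAppName("read-gryo")
        .setMaster("local[4]")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.kryo.registrator",
                "org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoRegistrator");
JavaSparkContext sc = new JavaSparkContext(conf);

//GryoInputFormat yields (NullWritable, VertexWritable) pairs
JavaPairRDD<NullWritable, VertexWritable> graphRDD = sc.newAPIHadoopFile(
        "output/~g",
        GryoInputFormat.class, NullWritable.class, VertexWritable.class,
        sc.hadoopConfiguration());

//keep only the vertices that carry halted traversers, i.e. the traversal result
JavaPairRDD<NullWritable, VertexWritable> resultRDD = graphRDD.filter(
        tuple -> tuple._2().get()
                .property(TraversalVertexProgram.HALTED_TRAVERSERS).isPresent());
```

From resultRDD you can then build the Dataset you need.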
HTH,    Marc


On Sunday, December 15, 2019 at 5:38:27 PM UTC+1, MarKri wrote: