Persist TraversalVertexProgram results to RDD using Java


MarKri

Dec 15, 2019, 11:38:27 AM
to Gremlin-users
Hello,

What I am trying to do is extract some data from a JanusGraph DB with a Cassandra storage backend, using a Gremlin OLAP (TinkerPop 3.4.3) traversal with the SparkGraphComputer, persist the result to a Spark RDD, convert it to a Dataset, and do some Spark calculations on the extracted data. I am using Java. To be a little more precise, I am using a TraversalVertexProgram as the VertexProgram, and I am able to get an RDD as output, but it contains just the input graph data, which I saw here
and the corresponding issue here
is known to be a bug.

Nonetheless, there is a workaround described in the issue,
but so far I have not been able to implement a working Java application that solves my problem.
 
To test the RDD persistence on a small example, I used the toy graph "tinkerpop-modern" and the properties file "hadoop-gryo.properties". I slightly modified the properties file, so I have:

####################################hadoop-gryo.properties##################
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat
gremlin.hadoop.jarsInDistributedCache=true

gremlin.spark.persistContext=true
gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.spark.structure.io.PersistedOutputRDD

gremlin.hadoop.inputLocation=/path_to_janusgraph/janusgraph-0.4.0-hadoop2/data/tinkerpop-modern.kryo
gremlin.hadoop.outputLocation=output

spark.master=local[4]
spark.executor.memory=1g
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoRegistrator
############################################################################

My Java code looks like this:

##############################Java code#####################################
//required imports (TinkerPop 3.4.x / Spark)
import org.apache.spark.rdd.RDD;
import org.apache.tinkerpop.gremlin.process.computer.GraphComputer;
import org.apache.tinkerpop.gremlin.process.computer.traversal.TraversalVertexProgram;
import org.apache.tinkerpop.gremlin.process.traversal.P;
import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer;
import org.apache.tinkerpop.gremlin.spark.structure.Spark;
import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.structure.util.GraphFactory;

//read example graph
Graph graph = GraphFactory.open("/path_to_janusgraph/janusgraph-0.4.0-hadoop2/conf/hadoop-graph/hadoop-gryo.properties");

//calculate TraversalVertexProgram with SparkGraphComputer
graph.compute(SparkGraphComputer.class)
        .program(TraversalVertexProgram.build()
                .traversal(
                        //some example traversal
                        graph.traversal().V().has("age", P.gt(30)).asAdmin())
                .create(graph))
        .persist(GraphComputer.Persist.VERTEX_PROPERTIES)
        .submit()
        .get();

//get persisted RDD
RDD rdd = Spark.getRDD("output/~g");

//do some example calculation, which shows just the input graph was persisted
System.out.println("RDD size: " + rdd.count());
#############################################################################

Whenever I tried the CloneVertexProgram workaround, I just got a copy of the input graph RDD.
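For reference, this is roughly how I check what is in the persisted RDD. This is only a sketch: it assumes the RDD holds (id, VertexWritable) pairs as written by PersistedOutputRDD, that TraversalVertexProgram.HALTED_TRAVERSERS marks vertices carrying a traversal result, and TinkerPop 3.4.x class names.

```java
import java.util.List;

import org.apache.spark.rdd.RDD;
import org.apache.tinkerpop.gremlin.hadoop.structure.io.VertexWritable;
import org.apache.tinkerpop.gremlin.process.computer.traversal.TraversalVertexProgram;
import org.apache.tinkerpop.gremlin.spark.structure.Spark;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import scala.Tuple2;

//look up the persisted RDD by name and pull its pairs to the driver
RDD<?> raw = Spark.getRDD("output/~g");
@SuppressWarnings("unchecked")
List<Tuple2<Object, VertexWritable>> pairs =
        (List<Tuple2<Object, VertexWritable>>) (List<?>) raw.toJavaRDD().collect();

//a vertex with a HALTED_TRAVERSERS property would carry a traversal result;
//in my runs no vertex has it, i.e. only the input graph was persisted
for (Tuple2<Object, VertexWritable> pair : pairs) {
    Vertex v = pair._2().get();
    System.out.println(v.id() + " haltedTraversers="
            + v.property(TraversalVertexProgram.HALTED_TRAVERSERS).isPresent());
}
```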

Thank you for your help!

HadoopMarc

Dec 21, 2019, 9:57:37 AM
to Gremlin-users
I checked and experienced the same problem. I will add this to the bug report you already mentioned.

Regarding the analytics you want to do, the following options come to mind:
  1. If your traversal is just a simple filter, you can use the CassandraInputFormat with Spark's newAPIHadoopFile method right away and do the filtering without Gremlin.
  2. If you want to use Gremlin for preprocessing, use the GryoOutputFormat as the graphWriter in the traversal + clone job. After that, recreate the Spark dataset from HDFS using the GryoInputFormat and Spark's newAPIHadoopFile. If your Gremlin traversal outputs vertices, you still have to filter for the vertices that have the TraversalVertexProgram.HALTED_TRAVERSERS property.
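A rough, untested sketch of option 2 in Java; it assumes the gryo output of the traversal + clone job landed under "output/~g" and uses TinkerPop 3.4.x class names:

```java
import org.apache.hadoop.io.NullWritable;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.tinkerpop.gremlin.hadoop.structure.io.VertexWritable;
import org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat;
import org.apache.tinkerpop.gremlin.process.computer.traversal.TraversalVertexProgram;

SparkConf conf = new SparkConf()
        .setAppName("read-gryo")
        .setMaster("local[4]")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.kryo.registrator",
                "org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoRegistrator");
JavaSparkContext sc = new JavaSparkContext(conf);

//GryoInputFormat yields (NullWritable, VertexWritable) pairs
JavaPairRDD<NullWritable, VertexWritable> graphRDD = sc.newAPIHadoopFile(
        "output/~g",
        GryoInputFormat.class, NullWritable.class, VertexWritable.class,
        sc.hadoopConfiguration());

//keep only the vertices that carry halted traversers, i.e. the traversal result
JavaPairRDD<NullWritable, VertexWritable> resultRDD = graphRDD.filter(
        tuple -> tuple._2().get()
                .property(TraversalVertexProgram.HALTED_TRAVERSERS).isPresent());
```

From resultRDD you can then build the Dataset you need.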
HTH,    Marc


On Sunday, December 15, 2019 at 5:38:27 PM UTC+1, MarKri wrote: