Hi,
We are using JanusGraph (0.5.2) with ScyllaDB as the backend. So far we have only used the OLTP capabilities, but we would now like to do some more advanced batch processing to create shortcut edges, for example for recommendations. For that, I would like to use the OLAP features.
Reading the documentation, this sounds pretty straightforward, assuming one has a Hadoop cluster up and running. But here comes my problem: I would like to use Dataproc, Google's managed solution for Hadoop and Spark. Unfortunately, I couldn't find any further information on how to get those two things working together.
Does anyone have any experience, hints, or documentation on how to properly configure JanusGraph with Dataproc?
As a very first step, I tried the following (a Java application with embedded JanusGraph):
GraphTraversalSource g = GraphFactory.open("graph.properties").traversal().withComputer(SparkGraphComputer.class);
long count = g.V().count().next();
...
g.close();
with the graph.properties file looking like this:
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.cql.CqlInputFormat
gremlin.hadoop.graphWriter=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
gremlin.spark.persistContext=true
# Cassandra
janusgraphmr.ioformat.conf.storage.backend=cql
janusgraphmr.ioformat.conf.storage.hostname=myhost
janusgraphmr.ioformat.conf.storage.port=9042
janusgraphmr.ioformat.conf.index.search.backend=lucene
janusgraphmr.ioformat.conf.index.search.directory=/tmp/
janusgraphmr.ioformat.conf.index.search.hostname=127.0.0.1
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
cassandra.input.widerows=true
# Spark
spark.master=local[*]
spark.executor.memory=1g
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator=org.janusgraph.hadoop.serialize.JanusGraphKryoRegistrator
If I just run the code like this, without specifying anything else, nothing happens, and I get endless log output like this:
18:39:07.749 [Executor task launch worker for task 3] DEBUG o.j.g.t.StandardJanusGraphTx - Guava vertex cache size: requested=20000 effective=20000 (min=100)
18:39:07.749 [Executor task launch worker for task 3] DEBUG o.j.g.t.vertexcache.GuavaVertexCache - Created dirty vertex map with initial size 32
18:39:07.749 [Executor task launch worker for task 3] DEBUG o.j.g.t.vertexcache.GuavaVertexCache - Created vertex cache with max size 20000
Additionally, I added the hdfs-site.xml extracted from Dataproc to my classpath, but that didn't help either.
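My suspicion is that spark.master=local[*] is part of the problem and that Spark would instead need to be pointed at Dataproc's YARN cluster. Something like the following is what I have in mind, but this is just a guess on my part and I haven't been able to verify it:

spark.master=yarn
spark.submit.deployMode=client

Presumably this would also require the Dataproc YARN client configuration (yarn-site.xml, core-site.xml) to be on the classpath, which I'm not sure how to wire up correctly from an embedded application.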
The same setup works like a charm in the OLTP world (using a proper query, of course, i.e. one that doesn't iterate over the whole graph... :D).
Any hints, ideas, experiences, or links are greatly appreciated.
Looking forward to some answers,
Claire