Janusgraph - OLAP using Dataproc


bobo...@gmail.com

Jun 18, 2020, 1:09:19 PM
to JanusGraph users
Hi,

We are using Janusgraph (0.5.2) with Scylladb as backend. So far we are only using OLTP capabilities but would now like to also do some more advanced batch processing to create shortcut edges, for example for recommendations. To do that, I would like to use the OLAP features.

Reading the documentation this sounds pretty straightforward, assuming one has a Hadoop cluster up and running. But here comes my problem: I would like to use Dataproc - Google's managed solution for Hadoop and Spark. Unfortunately I couldn't find any further information on how to get those two things playing well together.

Does anyone have any experience, hints or documentation on how to properly configure Janusgraph with Dataproc?

As a very first step, I tried the following (a Java application with embedded JanusGraph):

GraphTraversalSource g = GraphFactory.open("graph.properties").traversal().withComputer(SparkGraphComputer.class);
long count = g.V().count().next();
...
g.close();

with the graph.properties file looking like this:

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.cql.CqlInputFormat
gremlin.hadoop.graphWriter=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
gremlin.spark.persistContext=true

# Cassandra
janusgraphmr.ioformat.conf.storage.backend=cql
janusgraphmr.ioformat.conf.storage.hostname=myhost
janusgraphmr.ioformat.conf.storage.port=9042
janusgraphmr.ioformat.conf.index.search.backend=lucene
janusgraphmr.ioformat.conf.index.search.directory=/tmp/
janusgraphmr.ioformat.conf.index.search.hostname=127.0.0.1
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
cassandra.input.widerows=true

# Spark
spark.master=local[*]
spark.executor.memory=1g
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator=org.janusgraph.hadoop.serialize.JanusGraphKryoRegistrator


If I just run the code like this, without specifying anything else, nothing happens apart from endless log output like this:
18:39:07.749 [Executor task launch worker for task 3] DEBUG o.j.g.t.StandardJanusGraphTx - Guava vertex cache size: requested=20000 effective=20000 (min=100)
18:39:07.749 [Executor task launch worker for task 3] DEBUG o.j.g.t.vertexcache.GuavaVertexCache - Created dirty vertex map with initial size 32
18:39:07.749 [Executor task launch worker for task 3] DEBUG o.j.g.t.vertexcache.GuavaVertexCache - Created vertex cache with max size 20000

Additionally, I added the hdfs-site extracted from Dataproc to my classpath, but that didn't help.

The same in the OLTP world works like a charm. (of course using a proper query, one not iterating over the whole graph .... :D )

Any hints, ideas, experiences or links are greatly appreciated.

Looking forward to some answers,
Claire

SAURABH VERMA

Jun 18, 2020, 1:59:50 PM
to janusgra...@googlegroups.com
We've set up JanusGraph OLAP with spark-yarn. Is that something you are looking for?

Thanks

--
You received this message because you are subscribed to the Google Groups "JanusGraph users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to janusgraph-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/janusgraph-users/7dc9a3f1-82bc-47d5-89a1-5f3d4e21e5cdo%40googlegroups.com.


--
Thanks & Regards,
Saurabh Verma,
India


Claire F

Jun 18, 2020, 2:20:50 PM
to janusgra...@googlegroups.com
Hi Saurabh,

Thanks for your reply. 
I am specifically looking for a setup using Dataproc.

Regards
Claire


HadoopMarc

Jun 19, 2020, 2:08:40 AM
to JanusGraph users
Hi Claire,

As also indicated by Saurabh, your current config runs spark locally on your client node and does not use dataproc at all.

What possibly could work (I never used dataproc myself):
Best wishes,   Marc


Claire F

Jun 19, 2020, 2:40:02 AM
to janusgra...@googlegroups.com
Hi Marc,

Thanks a lot for your detailed answer. I will give that a try and see if I can get it to work.
Then I hope I'll find a way to marry all that into my Java code once I get it working with the Gremlin console, but that shouldn't be an issue then.

I am aware that my current config uses Spark locally. However, I seem to have misunderstood the documentation: I thought the Hadoop cluster was still needed for some temporary files, and that is why I thought I'd need Dataproc's Hadoop component as well. Even better if I don't.

Regards and thanks again
Claire


bobo...@gmail.com

Jun 22, 2020, 6:12:27 AM
to JanusGraph users
Hi

After resolving some version conflicts, I was able to run the SparkGraphComputer on Dataproc's managed Spark. The code is contained within a Java application (with embedded JanusGraph).

As this might be interesting to other people in the future, I wanted to share my setup here:

Janusgraph Version: 0.5.2
TinkerPop Version : 3.4.7
Dataproc Version: 1.2.100-debian9 (because we need Spark 2.2.x)

(Very basic example) Java code

// Load the configuration explicitly; otherwise GraphFactory requires an absolute path
Configuration configuration = new PropertiesConfiguration("graph.properties");
GraphTraversalSource g = GraphFactory.open(configuration).traversal().withComputer(SparkGraphComputer.class);

long count = g.V().count().next();

...


graph.properties


# Hadoop Graph Configuration
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.cql.CqlInputFormat
gremlin.hadoop.graphWriter=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
gremlin.spark.persistContext=true

# Scylla
janusgraphmr.ioformat.conf.storage.backend=cql
janusgraphmr.ioformat.conf.storage.hostname=scylla-host
janusgraphmr.ioformat.conf.storage.port=9042
janusgraphmr.ioformat.conf.index.search.backend=lucene
janusgraphmr.ioformat.conf.index.search.directory=/tmp/
janusgraphmr.ioformat.conf.index.search.hostname=127.0.0.1
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
cassandra.input.widerows=true

# Spark
spark.master=yarn-client
spark.executor.memory=1g
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator=org.janusgraph.hadoop.serialize.JanusGraphKryoRegistrator
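
Compared with the first attempt earlier in this thread, the key change is spark.master: local[*] keeps Spark inside the client JVM, while yarn-client hands the work to the cluster. A minimal sketch of a pre-flight check that could catch this before a job is started is shown here. SparkMasterCheck and its methods are hypothetical names made up for illustration; they are not part of JanusGraph, TinkerPop, or Spark, and the sketch only uses java.util.Properties.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical pre-flight helper; not part of JanusGraph, TinkerPop, or Spark. */
final class SparkMasterCheck {

    /** Returns true when the spark.master value starts with "local" (e.g. local, local[4], local[*]). */
    static boolean isLocalMaster(String master) {
        return master != null && master.trim().startsWith("local");
    }

    /** Reads spark.master from properties-formatted text; returns null if the key is absent. */
    static String readMaster(Reader source) {
        Properties props = new Properties();
        try {
            props.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("spark.master");
    }

    public static void main(String[] args) {
        // Inline text stands in for the contents of graph.properties.
        String master = readMaster(new StringReader("spark.master=yarn-client\n"));
        if (isLocalMaster(master)) {
            System.out.println("Warning: spark.master=" + master + " runs Spark locally, not on the cluster");
        } else {
            System.out.println("spark.master=" + master);
        }
    }
}
```

In a real application the Reader would come from the actual graph.properties file rather than an inline string.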






Some additional notes:
In our Maven setup we needed to
  • exclude the jackson-databind dependency from spark-gremlin due to version conflicts with JanusGraph
  • use the Maven Shade plugin to build an uber-jar and shade com.google.common to overcome dependency conflicts between the Guava version on Dataproc and the one in our application
  • use Java 8 (dictated by the Dataproc version we need for Spark 2.2.x)
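
Conflicts like the Guava and Jackson ones above can be hard to pin down, because it is not always obvious which jar a class was actually loaded from. One way to check, sketched under assumptions: ClassOrigin is a hypothetical helper written for illustration (not part of any library), and the two class names in main are simply well-known Guava and Jackson classes picked as probes.

```java
import java.security.CodeSource;

/** Hypothetical diagnostic helper for classpath conflicts; not part of any library. */
final class ClassOrigin {

    /**
     * Returns the location (jar or directory) a class was loaded from,
     * "<bootstrap>" when no code source is available (typical for core JDK classes),
     * or "<not found>" when the class cannot be loaded.
     */
    static String locate(String className) {
        try {
            Class<?> type = Class.forName(className);
            CodeSource source = type.getProtectionDomain().getCodeSource();
            if (source == null || source.getLocation() == null) {
                return "<bootstrap>";
            }
            return source.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "<not found>";
        }
    }

    public static void main(String[] args) {
        // Probe classes from Guava and Jackson; each prints "<not found>" if absent from the classpath.
        System.out.println(locate("com.google.common.base.Preconditions"));
        System.out.println(locate("com.fasterxml.jackson.databind.ObjectMapper"));
    }
}
```

Running this once locally and once inside the submitted job would show whether the application's copy or the cluster's copy wins. After shading, the probe would need the relocated package name chosen in the shade configuration.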


Finally, we simply build a JAR archive with all the (shaded) dependencies and upload it to Google Cloud Storage. We then submit the Spark job as follows:

gcloud dataproc jobs submit spark --cluster=<cluster> --class=<mainClass> --jars=gs://<bucket>/<folder>/<shaded-jar-with-dependencies>.jar  --region=<region>


Regards
Claire

HadoopMarc

Jun 22, 2020, 9:02:25 AM
to JanusGraph users
Great work!

Marc
