SparkGraphComputer work on any Graph Implementation


NQuinn

Jul 13, 2017, 3:47:45 PM
to Gremlin-users
Will the SparkGraphComputer work as a processing engine on top of any graph impl? I noticed that in the tinkerpop tests, they only use HadoopGraph and TinkerGraph. Thanks!

Marko Rodriguez

Jul 13, 2017, 3:59:53 PM
to gremli...@googlegroups.com
Hello,

> Will the SparkGraphComputer work as a processing engine on top of any graph impl? I noticed that in the tinkerpop tests, they only use HadoopGraph and TinkerGraph. Thanks!

Yes! You just need to be able to generate an InputRDD from your graph database and voilà!

HTH,
Marko.
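To make Marko's point concrete: SparkGraphComputer reads the graph through TinkerPop's `InputRDD` interface from the `spark-gremlin` module, so a vendor only has to implement that one method. A hedged sketch (the `MyGraphInputRDD` class and its `loadVerticesFromDatabase` helper are hypothetical stand-ins for vendor-specific code; the interface signature matches the TinkerPop 3.2-era API):

```java
import org.apache.commons.configuration.Configuration;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.tinkerpop.gremlin.hadoop.structure.io.VertexWritable;
import org.apache.tinkerpop.gremlin.spark.structure.io.InputRDD;

// Hypothetical InputRDD implementation that reads vertices out of a
// vendor's graph database and hands them to SparkGraphComputer.
public final class MyGraphInputRDD implements InputRDD {
    @Override
    public JavaPairRDD<Object, VertexWritable> readGraphRDD(
            final Configuration configuration, final JavaSparkContext sparkContext) {
        // Produce (vertex id, vertex) pairs from the underlying store.
        return loadVerticesFromDatabase(configuration, sparkContext);
    }

    private JavaPairRDD<Object, VertexWritable> loadVerticesFromDatabase(
            final Configuration configuration, final JavaSparkContext sparkContext) {
        // Vendor-specific: query the database and parallelize the results.
        throw new UnsupportedOperationException("vendor-specific");
    }
}
```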

NQuinn

Jul 13, 2017, 4:59:53 PM
to Gremlin-users
Marko--

>> Will the SparkGraphComputer work as a processing engine on top of any graph impl? I noticed that in the tinkerpop tests, they only use HadoopGraph and TinkerGraph. Thanks! 
> Yes! You just need to be able to generate an InputRDD from your graph database and voilà!

Thanks for getting back to me.  That helps. I've looked at the SparkGraphComputer and it creates a new instance of the InputRDD, so I cannot pass a reference to the InputRDD implementation unless it is static. Is that right? 
Best,
Nick 

NQuinn

Jul 17, 2017, 7:58:52 PM
to Gremlin-users
Marko--

Maybe I am doing something wrong, but it seems like the input RDD is read every time a query is executed. I would have expected the graph below, as written, to be read once and then cached.

graph.traversal().withComputer(SparkGraphComputer.class);

Is there a reason why it is designed this way? It seems like there should be a way to cache the graph so that it is not read over and over.  Can you help me out? Thanks so much for your patience as I figure this stuff out. Hope you are well. 
Thanks!
Nick

Marko Rodriguez

Jul 17, 2017, 8:06:21 PM
to gremli...@googlegroups.com
Hello,

Yes, every query constructs a new InputRDD. If you want it cached, make your InputRDD cache the RDD when it is constructed, so that the previously provided RDD is provided again. In essence, the vendor has control over the InputRDD’s construction and what is ultimately provided to TinkerPop. Thus, build a simple “if (inCache) { return rdd; } else { return constructRdd(); }” clause into your InputRDD class.

Marko.
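Because SparkGraphComputer instantiates the configured InputRDD class anew for each job, the cache has to live in a static field so it survives across instances. A minimal, self-contained sketch of the cache-or-construct pattern (a plain `Object` stands in for Spark's `JavaPairRDD` so the example runs without Spark on the classpath; all names are illustrative):

```java
// Illustrative cache-or-construct pattern: the static field is shared by
// every instance SparkGraphComputer creates, so the expensive read runs once.
class CachingInputRDD {
    private static Object cachedRdd;   // stand-in for the real JavaPairRDD
    public static int loadCount = 0;   // counts expensive reads, for demonstration

    public Object readGraphRDD() {
        if (cachedRdd == null) {       // the "if (inCache) { return rdd; }" clause
            cachedRdd = constructRdd();
        }
        return cachedRdd;
    }

    private Object constructRdd() {
        loadCount++;                   // in real code: read the graph database here
        return new Object();
    }
}
```

With this in place, two separately constructed instances hand back the same RDD while the expensive read happens only once.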
--
You received this message because you are subscribed to the Google Groups "Gremlin-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gremlin-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gremlin-users/d19dadbb-359e-408e-878b-5aad3bf9b262%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

NQuinn

Jul 18, 2017, 1:49:03 PM
to Gremlin-users
> Thus, build a simple “if (inCache) { return rdd; } else { return constructRdd(); }” clause into your InputRDD class.

I don't think this will work. If it is cached on the InputRDD side and the SparkGraphComputer creates a new instance of the InputRDD every time, there is no way to cache it. I assume you create a new instance of the InputRDD every time because the Spark context could be shut down between executions of the SparkGraphComputer, which would invalidate the pair RDD, so that makes sense. I just wish there were a way to know when the context was shut down and restarted, so that the InputRDD could be constructed less frequently. Thanks Marko!

Marko Rodriguez

Jul 18, 2017, 1:51:46 PM
to gremli...@googlegroups.com
Hello,

> I don't think this will work. If it is cached on the input rdd side and the SparkGraphComputer is creating a new instance of the InputRDD every time, there is no way to cache it. I assume that you are creating a new instance of the input rdd every time because the spark context could be shut down in between the executions of the SparkGraphComputer which would invalidate the pair rdd, so that makes sense. I just wish there was a way to indicate when the context was shutdown and when it restarted, so that the input rdd could be constructed less frequently. Thanks Marko!

Two things:

1. TinkerPop does not create the RDD. The graph vendor does. Thus, if it's cached, return the cached value.
2. You can persist the Spark context. Please look into SparkContextStorage.

HTH,
Marko.
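On point 2, TinkerPop's Spark integration exposes a configuration flag to keep the SparkContext alive between jobs. A sketch of the relevant HadoopGraph properties (the InputRDD class name is hypothetical; the property keys are the ones used by the `spark-gremlin` module around this TinkerPop 3.2 era):

```properties
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
# point the graph reader at your vendor InputRDD implementation (hypothetical class)
gremlin.hadoop.graphReader=com.example.MyGraphInputRDD
# reuse the SparkContext across SparkGraphComputer jobs instead of closing it
gremlin.spark.persistContext=true
spark.master=local[4]
```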

NQuinn

Jul 18, 2017, 2:17:18 PM
to Gremlin-users

> Two things:
>
> 1. TinkerPop does not create the RDD. The graph vendor does. Thus, if it's cached, return the cached value.
> 2. You can persist the Spark context. Please look into SparkContextStorage.

Marko--
I believe that when you say RDD, you mean the PairRDD returned by readGraphRDD inside the InputRDD class (because the SparkGraphComputer calls newInstance() on the InputRDD). If so, that makes sense and it can be cached, but I am not sure about the SparkContext. At the end of the submitWithExecutor method, Spark.close() is called, which closes the context underneath. I don't see how it can be reused. Am I missing something?
Thanks!
Nick

Marko Rodriguez

Jul 18, 2017, 4:14:42 PM
to gremli...@googlegroups.com

NQuinn

Jul 18, 2017, 4:23:41 PM
to Gremlin-users
Wow! I don't know how I overlooked that. Thanks!