[DISCUSS] GraphComputer Provider TraversalStrategies


Marko Rodriguez

May 3, 2016, 5:04:06 PM
to d...@tinkerpop.incubator.apache.org, gremli...@googlegroups.com
Hello,

I was working with Russell Spitzer and Jeremy Hanna today and we noted that native Spark takes 2.6 minutes to "g.V().count()" while SparkGraphComputer takes 4.5 minutes. It's understandable that SparkGraphComputer will be slower for such simple traversals given all the machinery it has in place to support arbitrary graph traversals. However, why not make it faster?

...enter -- GraphComputer Provider-Specific TraversalStrategies.

With the release of TinkerPop 3.2.0, TraversalStrategies.GlobalCache can have TraversalStrategies registered that are associated with not only a Graph, but also a GraphComputer. The first such GraphComputer strategy was just created in TINKERPOP-1288 called SparkPartitionAwareStrategy [https://issues.apache.org/jira/browse/TINKERPOP-1288].


What does it do?
- If there is no message pass, then there is no need to partition the RDD across the cluster, as partitioning is a big shuffle and not worth the time and space.
How does it work?
- It analyzes the traversal for VertexSteps that move beyond the StarVertex (i.e. a message pass). If no such steps exist, then a SparkGraphComputer-specific configuration is set to skip partitioning.
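The core check can be sketched as follows. This is a simplified, self-contained model of the idea; the class and step names are illustrative, not the actual TinkerPop types or the real strategy's implementation:

```java
import java.util.List;

// Toy model of the strategy's core check: a traversal "passes messages"
// when it contains a step that walks beyond the local star vertex
// (e.g. an out()/in()/both() style VertexStep). If no such step exists,
// partitioning the RDD can be skipped.
public class PartitionCheck {

    // Illustrative names for steps that reach beyond the star vertex.
    private static final List<String> MESSAGE_PASSING_STEPS =
            List.of("VertexStep", "EdgeVertexStep");

    // Returns true when no message pass is required, i.e. the
    // SparkGraphComputer-specific "skip partitioning" flag can be set.
    public static boolean canSkipPartitioning(List<String> stepNames) {
        return stepNames.stream().noneMatch(MESSAGE_PASSING_STEPS::contains);
    }
}
```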

You can see how it's registered -- just like Graph-provider strategies.

Is it better?
- Native spark via SparkContext.newHadoopAPI().count() takes 2.6 minutes to count Friendster.
- Without SparkPartitionAwareStrategy, counting Friendster takes 4.5 minutes.
- With SparkPartitionAwareStrategy, counting Friendster takes 4.0 minutes.
*** Not crazy faster, but it's definitely faster. And given that applying strategies to OLAP traversals costs basically nothing (as opposed to OLTP, where every microsecond counts), why not save 30 seconds! :) 

So this is a simple use case that makes all non-traversal computations more efficient. However, we can imagine more useful strategies to write, such as using native Spark for counting instead of SparkGraphComputer. That is, once the InputRDD is loaded, a bypass can be used to simply do "inputRDD.count()" and generate the Iterator<Traverser<E>>, completely skipping all the semantics and infrastructure of SparkGraphComputer. I still need to think a bit on the best model for this, but already I know that TinkerGraphComputer and SparkGraphComputer will become blazing fast for such simple operations with GraphComputer provider-specific strategies!

Finally, you can see the types of traversals that SparkPartitionAwareStrategy applies to in its test case.

Thoughts?
Marko.

Marko Rodriguez

May 3, 2016, 7:25:17 PM
to d...@tinkerpop.incubator.apache.org, gremli...@googlegroups.com
Like Jackson Pollock, I just broke it wide open…..

In TINKERPOP-1288, we now have the concept of a "NativeInterceptor." This interface is tied to SparkGraphComputer, but I think I can generalize it to work for any GraphComputer provider. (Also, probably call it VertexProgramInterceptor…)


A NativeInterceptor bypasses the execution of a VertexProgram and instead does what it needs with the Graph and Memory (i.e. ComputerResult). 


As an example, I created VertexCountInterceptor which does inputRDD.count(). Classy.
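The bypass idea can be sketched like this. Note this is a minimal, self-contained model under my own assumed names (the interface shape and `submit` helper are hypothetical, not the actual TinkerPop/Spark interfaces): when an interceptor is registered for the submitted program, it runs directly against the loaded graph instead of the full vertex-program machinery:

```java
import java.util.function.Function;

// Hypothetical model of the interceptor bypass: the computer checks for a
// registered interceptor and, if one exists, skips the BSP vertex-program
// execution entirely.
public class InterceptorDemo {

    // Stand-in for the (to-be-generalized) VertexProgramInterceptor.
    interface VertexProgramInterceptor<G, R> {
        R intercept(G graph);
    }

    // Stand-in for a GraphComputer submit(): prefer the bypass when present.
    static <G, R> R submit(G graph,
                           VertexProgramInterceptor<G, R> interceptor,
                           Function<G, R> fullVertexProgramRun) {
        if (interceptor != null)
            return interceptor.intercept(graph);   // e.g. inputRDD.count()
        return fullVertexProgramRun.apply(graph);  // full vertex-program run
    }
}
```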


Drum roll…..

- Native spark via SparkContext.newHadoopAPI().count() on Friendster takes 2.6 minutes.
- Without SparkPartitionAwareStrategy, counting Friendster takes 4.5 minutes.
- With SparkPartitionAwareStrategy, counting Friendster takes 4.0 minutes.
- With both SparkPartitionAwareStrategy and SparkInterceptorStrategy, counting Friendster takes 2.4 minutes.

And that, my friends, is how the dishes get done.

Rip off shirt, flick off camera, and jump kick,
Marko.

Daniel Quest

May 3, 2016, 8:47:57 PM
to gremli...@googlegroups.com, d...@tinkerpop.incubator.apache.org
Awesome, 
And here is another way to do the dishes....

[attachment: image1.jpeg]

Sent from my iPhone

Daniel Kuppitz

May 4, 2016, 3:59:10 AM
to gremli...@googlegroups.com
- Native spark via SparkContext.newHadoopAPI().count() on Friendster takes 2.6 minutes.
- With both SparkPartitionAwareStrategy and SparkInterceptorStrategy, counting Friendster takes 2.4 minutes.

TinkerPop is faster than native Spark? Within the same test environment? :) I don't believe that but it's epic either way. 

I guess we need to rework some / all of our VertexPrograms.

Cheers,
Daniel


Marko Rodriguez

May 4, 2016, 9:04:31 AM
to gremli...@googlegroups.com
Hello,

- Native spark via SparkContext.newHadoopAPI().count() on Friendster takes 2.6 minutes.
- With both SparkPartitionAwareStrategy and SparkInterceptorStrategy, counting Friendster takes 2.4 minutes.
TinkerPop is faster than native Spark? Within the same test environment? :) I don't believe that but it's epic either way. 


There are slight differences between the two models -- graph filters being the primary one. However, I think these times are effectively "the same"; different runs just give slightly different times.

I guess we need to rework some / all of our VertexPrograms.

There are a couple of new things to consider.

1. TINKERPOP-1120 (don't waste space or you will waste time): 
* Don't leave vertex properties (compute keys) around if they are "empty." HALTED_TRAVERSERS are removed if they are just empty TraverserSets.
* Message passing code in SparkGraphComputer isn't as wasteful object-wise.
2. TINKERPOP-1288 (use traversal introspection and leverage provider-specific optimizations when possible):
* VertexProgramInterceptors are the "vertex-centric indices" of OLAP. Moreover, they are so elegantly located within the GraphComputer that it's very natural and easy for any implementation to leverage them.
* GraphComputer strategies that can reason on the traversal and set configurations for the GraphComputer (e.g. in Spark: don't partition, don't cache, etc.)
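The "don't waste space" point from TINKERPOP-1120 can be illustrated with a small sketch. This is not TinkerPop code; the class and key names are illustrative, under the assumption that compute keys are held per vertex as a key-to-collection map:

```java
import java.util.Collection;
import java.util.Map;

// Illustrative sketch of the TINKERPOP-1120 idea: after an iteration,
// compute keys whose values are empty (e.g. an empty HALTED_TRAVERSERS
// TraverserSet) are dropped so they are never serialized and shuffled
// around the cluster with the vertex.
public class ComputeKeyPruning {

    public static void pruneEmptyKeys(Map<String, Collection<?>> computeKeys) {
        // Removing the map entry frees the space the empty value would occupy.
        computeKeys.values().removeIf(Collection::isEmpty);
    }
}
```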

The combination of these two tickets takes GraphComputers up a notch in terms of performance.

Thanks,
Marko.



