Ruminations on SparkGraphComputer Part Quadtangaloid

36 views
Skip to first unread message

Marko Rodriguez

unread,
May 3, 2016, 9:12:09 AM5/3/16
to d...@tinkerpop.incubator.apache.org, gremli...@googlegroups.com
Hello,

A fellow named Russ Spitzer kept mentioned "a pointless reduceByKey()" when doing SparkGraphComputer jobs. I decided to look into his (probably erroneous) claims while working on https://issues.apache.org/jira/browse/TINKERPOP-1120. Lo and behold, he was right (impossible!). There is an "extra" reduceByKey() that is done that, while it still needs to be done, it shouldn't shuffle so much data. There are 5 things in particular I accomplished yesterday:

-------SparkGraphComputer
1. If the vertex doesn't pass any messages, don't serialize an empty list, serialize null.
2. If the vertex doesn't have a view, don't serialize an empty list of detached vertex properties, serialize null.
3. If the vertex doesn't have a view nor messages, don't do anything!
-------TraversalVertexProgram
4. Found a memory bug where halted traversers were still distributed amongst the vertices even though they were sent to the master traversal.
5. If a halted traverser TraverserSet is empty, remove the property (remove the vertex view!).

With that, TraversalVertexProgram (Gremlin OLAP) will typically have nothing to reduceByKey() on the final stage as there are no more messages being sent (messages) and thus, no more HALTED_TRAVERSERS distributed across the vertex set (views). Because of this, the final reduceByKey() that once took 1.2 minutes now only takes 0.7 seconds. Combine that with the other memory/data reductions listed above and we get the following times below.

g.V().count() -- answer 125000000 (125 million vertices)
- TinkerPop 3.0.0.MX2.5 hours
- TinkerPop 3.0.0: 1.5 hours
- TinkerPop 3.1.1: 23 minutes
- TinkerPop 3.2.0: 6.8 minutes (Spark 1.5.2)
- TinkerPop 3.2.0: 5.5 minutes (Spark 1.6.1)
- TinkerPop 3.2.1: 4.5 minutes (Spark 1.6.1)

g.V().out().count() -- answer 2586147869 (2.5 billion length-1 paths (i.e. edges))
- TinkerPop 3.0.0.MXunknown
- TinkerPop 3.0.0: 2.5 hours
- TinkerPop 3.1.1: 1.1 hours
- TinkerPop 3.2.0: 13 minutes (Spark 1.5.2)
- TinkerPop 3.2.0: 12 minutes (Spark 1.6.1)
- TinkerPop 3.2.1: 10 minutes (Spark 1.6.1)
g.V().out().out().count() -- answer 640528666156 (640 billion length-2 paths)
- TinkerPop 3.0.0.MXunknown
- TinkerPop 3.0.0: unknown
- TinkerPop 3.1.1: unknown
- TinkerPop 3.2.0: 55 minutes (Spark 1.5.2)
- TinkerPop 3.2.0: 50 minutes (Spark 1.6.1)
- TinkerPop 3.2.1: 45 minutes (Spark 1.6.1)

g.V().out().out().out().count() -- answer 215664338057221 (215 trillion length 3-paths)
- TinkerPop 3.0.0.MX12.8 hours
- TinkerPop 3.0.0: 8.6 hours
- TinkerPop 3.1.1: 2.4 hours
- TinkerPop 3.2.0: 1.6 hours (Spark 1.5.2)
- TinkerPop 3.2.0: 1.5 hours (Spark 1.6.1)
- TinkerPop 3.2.1: 1.3 hours (Spark 1.6.1)

g.V().out().out().out().out().count() -- answer 83841426570464575 (83 quadrillion length 4-paths)
- TinkerPop 3.0.0.MXunknown
- TinkerPop 3.0.0: unknown
- TinkerPop 3.1.1: unknown
- TinkerPop 3.2.0: unknown (Spark 1.5.2)
- TinkerPop 3.2.0: 2.1 hours (Spark 1.6.1)
- TinkerPop 3.2.1: 1.9 hours (Spark 1.6.1)

g.V().out().out().out().out().out().count() -- answer -2280190503167902456 !! I blew the long space -- 64-bit overflow.
- TinkerPop 3.0.0.MXunknown
- TinkerPop 3.0.0: unknown
- TinkerPop 3.1.1: unknown
- TinkerPop 3.2.0: unknown (Spark 1.5.2)
- TinkerPop 3.2.0: 2.8 hours (Spark 1.6.1)
- TinkerPop 3.2.1: 2.4 hours (Spark 1.6.1)

g.V().group().by(outE().count()).by(count()). 
- TinkerPop 3.2.0:  12 minutes (Spark 1.6.1)
- TinkerPop 3.2.1:  12 minutes (Spark 1.6.1)

NOTE: This work is currently not in master/ as it still needs to go through Apache TinkerPop VOTE. 

Please tip your waitress,
Marko.


Reply all
Reply to author
Forward
0 new messages