Ruminations on SparkGraphComputer Part 666

Marko Rodriguez

Sep 15, 2016, 5:17:06 PM
to gremli...@googlegroups.com, d...@tinkerpop.apache.org
Hello,

It’s about that time again. Spark 2.0 was released recently and, with the help of Chen Xin Yu, TINKERPOP-1389 has been updated to support it. How does it perform? A little faster here and a little slower there. Note that this work will go into TinkerPop 3.3.0. We don’t currently have a branch for TinkerPop 3.3.0 development, so until one exists the work will remain in TINKERPOP-1389. Finally, note that there have been no changes to SparkGraphComputer besides tweaks to align with Spark’s API and serialization updates.

g.V().count() -- answer 125000000 (125 million vertices)
- TinkerPop 3.0.0.MX: 2.5 hours
- TinkerPop 3.0.0: 1.5 hours
- TinkerPop 3.1.1: 23 minutes
- TinkerPop 3.2.0: 6.8 minutes (Spark 1.5.2)
- TinkerPop 3.2.0: 5.5 minutes (Spark 1.6.1)
- TinkerPop 3.2.1: 2.2 minutes (Spark 1.6.1)
- TinkerPop 3.3.x: 1.6 minutes (Spark 2.0.0)

g.V().out().count() -- answer 2586147869 (2.5 billion length-1 paths (i.e. edges))
- TinkerPop 3.0.0.MX: unknown
- TinkerPop 3.0.0: 2.5 hours
- TinkerPop 3.1.1: 1.1 hours
- TinkerPop 3.2.0: 13 minutes (Spark 1.5.2)
- TinkerPop 3.2.0: 12 minutes (Spark 1.6.1)
- TinkerPop 3.2.1: 2.4 minutes (Spark 1.6.1)
- TinkerPop 3.3.x: 2.1 minutes (Spark 2.0.0)

g.V().out().out().count() -- answer 640528666156 (640 billion length-2 paths)
- TinkerPop 3.2.0: 55 minutes (Spark 1.5.2)
- TinkerPop 3.2.0: 50 minutes (Spark 1.6.1)
- TinkerPop 3.2.1: 37 minutes (Spark 1.6.1)
- TinkerPop 3.3.x: 40 minutes (Spark 2.0.0)

g.V().out().out().out().count() -- answer 215664338057221 (215 trillion length-3 paths)
- TinkerPop 3.0.0.MX: 12.8 hours
- TinkerPop 3.0.0: 8.6 hours
- TinkerPop 3.1.1: 2.4 hours
- TinkerPop 3.2.0: 1.6 hours (Spark 1.5.2)
- TinkerPop 3.2.0: 1.5 hours (Spark 1.6.1)
- TinkerPop 3.2.1: 1.1 hours (Spark 1.6.1)
- TinkerPop 3.3.x: 1.3 hours (Spark 2.0.0)

g.V().out().out().out().out().count() -- answer 83841426570464575 (83 quadrillion length-4 paths)
- TinkerPop 3.2.0: 2.1 hours (Spark 1.6.1)
- TinkerPop 3.2.1: 1.7 hours (Spark 1.6.1)
- TinkerPop 3.3.x: 2.0 hours (Spark 2.0.0)

g.V().out().out().out().out().out().count() -- answer -2280190503167902456 !! I blew the long space -- 64-bit overflow.
- TinkerPop 3.2.0: 2.8 hours (Spark 1.6.1)
- TinkerPop 3.2.1: 2.2 hours (Spark 1.6.1)
- TinkerPop 3.3.x: 2.6 hours (Spark 2.0.0)
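
The negative answer above is what a 64-bit signed counter produces once the true total exceeds Long.MAX_VALUE. As a minimal sketch of that wrap-around in plain Java (not TinkerPop code; the class name, method names, and the two input values are made up for illustration), compare plain addition with Math.addExact and BigInteger:

```java
import java.math.BigInteger;

public class PathCountOverflow {
    // Plain long addition: silently wraps around when the total exceeds Long.MAX_VALUE.
    static long wrappingSum(long[] counts) {
        long total = 0L;
        for (long c : counts) {
            total += c;
        }
        return total;
    }

    // Math.addExact throws ArithmeticException instead of wrapping.
    static long checkedSum(long[] counts) {
        long total = 0L;
        for (long c : counts) {
            total = Math.addExact(total, c);
        }
        return total;
    }

    // BigInteger has no fixed width, so the true total is preserved.
    static BigInteger bigSum(long[] counts) {
        BigInteger total = BigInteger.ZERO;
        for (long c : counts) {
            total = total.add(BigInteger.valueOf(c));
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical partial counts chosen only to force an overflow.
        long[] counts = {Long.MAX_VALUE, 1L};
        System.out.println(wrappingSum(counts)); // -9223372036854775808
        System.out.println(bigSum(counts));      // 9223372036854775808
        try {
            checkedSum(counts);
        } catch (ArithmeticException e) {
            System.out.println("overflow detected");
        }
    }
}
```

A negative count is an obvious sign of overflow, but a total that wraps all the way back into the positive range would not be, which is why a checked or arbitrary-precision sum is the safer way to validate counts of this size.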

g.V().group().by(outE().count()).by(count())
- TinkerPop 3.2.0: 12 minutes (Spark 1.6.1)
- TinkerPop 3.2.1: 2.4 minutes (Spark 1.6.1)
- TinkerPop 3.3.x: 3.1 minutes (Spark 2.0.0)

g.V().groupCount().by(outE().count())
- TinkerPop 3.2.0: 12 minutes (Spark 1.6.1)
- TinkerPop 3.2.1: 2.7 minutes (Spark 1.6.1)
- TinkerPop 3.3.x: 2.2 minutes (Spark 2.0.0)
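
Both of the last two traversals compute the same result: a map from out-degree to the number of vertices having that out-degree. A minimal sketch of that aggregation in plain Java, outside of TinkerPop and Spark (the class name, method name, and sample degrees are made up for illustration):

```java
import java.util.Map;
import java.util.TreeMap;

public class OutDegreeHistogram {
    // Maps each distinct out-degree to the number of vertices that have it.
    static Map<Long, Long> histogram(long[] outDegrees) {
        Map<Long, Long> result = new TreeMap<>();
        for (long degree : outDegrees) {
            result.merge(degree, 1L, Long::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical out-degrees for four vertices.
        long[] outDegrees = {0L, 2L, 2L, 3L};
        System.out.println(histogram(outDegrees)); // {0=1, 2=2, 3=1}
    }
}
```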

Take care,
Marko.

Chen Xin Yu

Sep 17, 2016, 11:11:51 PM
to Gremlin-users
Very helpful information. A simple question: where is the benchmark? Is it the built-in benchmark? Thanks!

Marko Rodriguez

Sep 18, 2016, 11:28:16 AM
to gremli...@googlegroups.com
Hi,

> Very helpful information. A simple question: where is the benchmark? Is it the built-in benchmark? Thanks!

I just have those queries in a text file and I run them. The Spark server UI gives me the times.

Marko.

Bryan Thompson

Sep 19, 2016, 9:38:50 AM
to Gremlin-users
Against what data?

Marko Rodriguez

Sep 19, 2016, 10:18:21 AM
to gremli...@googlegroups.com
The same dataset as in all the other Ruminations on SparkGraphComputer posts, parts 1 through 5.

It is the Friendster dataset, which has 2.5 billion edges.

Marko.

http://markorodriguez.com


Jason Plurad

Sep 19, 2016, 10:46:22 AM
to Gremlin-users
This looks like the data set: the graph contains 117,751,379 nodes and 2,586,147,869 directed edges.
https://archive.org/details/friendster-dataset-201107

#1 in the series mentions a 4-blade cluster. I believe it's a Spark standalone cluster.
https://groups.google.com/d/msg/gremlin-users/Mlf5UqaBSBI/GuSmb997DQAJ

@dkupptiz, have you published a gist with the script to load the data?

-- Jason

Daniel Kuppitz

Sep 19, 2016, 2:05:40 PM
to gremli...@googlegroups.com
> @dkupptiz, have you published a gist with the script to load the data?

The parser script is pretty old, and I probably created a public gist back when we ran the first performance benchmarks. But here it is again:

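// Parses one input line, assumed to have the form "id:friend1,friend2,...".
def parse(line, factory) {
    // Placeholder values that are skipped rather than parsed as IDs.
    def skip = ["private", "notfound"]
    def parts = line.split(/:/)
    if (parts[0] in skip) return null
    // The part before the colon is the source vertex ID.
    def v1 = factory.vertex(Long.valueOf(parts[0]))
    if (parts.length == 2) {
        // Add a "knows" edge to every non-empty, non-placeholder friend ID.
        parts[1].split(/,/).grep { !it.isEmpty() && !(it in skip) }.each {
            def v2 = factory.vertex(Long.valueOf(it))
            factory.edge(v1, v2, "knows")
        }
    }
    return v1
}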
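
For readers who want to try the line format without a TinkerPop setup, here is a rough sketch of the same splitting and filtering logic in plain Java. It is not the script input format API: it returns source/target ID pairs instead of building vertices and edges through a factory, and the class name, method name, and sample lines are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FriendsterLineSketch {
    // Placeholder values that are skipped rather than parsed as IDs.
    private static final List<String> SKIP = Arrays.asList("private", "notfound");

    // Returns {source, target} pairs for one line, or null if the line itself is skipped.
    static List<long[]> parseEdges(String line) {
        String[] parts = line.split(":");
        if (SKIP.contains(parts[0])) {
            return null;
        }
        long source = Long.parseLong(parts[0]);
        List<long[]> edges = new ArrayList<>();
        if (parts.length == 2) {
            for (String id : parts[1].split(",")) {
                if (id.isEmpty() || SKIP.contains(id)) {
                    continue;
                }
                edges.add(new long[] {source, Long.parseLong(id)});
            }
        }
        return edges;
    }

    public static void main(String[] args) {
        // Hypothetical sample lines, not taken from the real dataset.
        System.out.println(parseEdges("101:102,103").size()); // 2
        System.out.println(parseEdges("104:private").size()); // 0
        System.out.println(parseEdges("105:").size());        // 0
    }
}
```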

Cheers,
Daniel