Connected components continued (scaling to a large graph)


Jen

Dec 5, 2016, 11:34:59 AM
to Gremlin-users
A few months ago, I posted this question on connected components. However, the connected components algorithm in that thread takes a very long time (over 24 hours) and runs out of memory on a large graph (hundreds of millions of vertices and edges) when run OLAP on TinkerPop 3.2.1 with Spark. The same thing happens with a simpler version of the connected components query that I wrote, which should work for my graph structure: "order" is a central node, and "item" and "client" vertices each connect to "order" with an out-edge, as in the example in the thread above.
g.V().hasLabel('order').as('order').
  emit().repeat(both().simplePath()).dedup().
  hasLabel('item','client').as('cluster').
  select('order','cluster').
  group().by(select('order')).by(select('cluster'))
I am using 32 GB+ driver and executor memory with ~100 MB partitions (the recommended partition size in my cluster).
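
For reference, the driver/executor memory is set in the HadoopGraph properties file passed to SparkGraphComputer; a rough sketch of that configuration is below (the input path, master and exact values are illustrative placeholders rather than my full setup):

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat
# placeholder location of the Gryo-formatted input graph
gremlin.hadoop.inputLocation=hdfs:///graph/tinkerpop-input
gremlin.hadoop.outputLocation=output
spark.master=yarn-client
spark.executor.memory=32g
spark.driver.memory=32g
spark.serializer=org.apache.spark.serializer.KryoSerializer
# spill the graph RDD to disk instead of failing when it does not fit in memory
gremlin.spark.graphStorageLevel=MEMORY_AND_DISK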

By comparison, loading the exact same graph into Spark's GraphX from the underlying vertex and edge Parquet files (before they are converted to the TinkerPop format) and running its connected components algorithm takes 20 minutes with approximately 8 GB driver and executor memory. It appears that GraphX uses a reasonably efficient connected components algorithm.
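
For reference, the GraphX run is essentially the following spark-shell sketch (the Parquet path and the src/dst column names are placeholders for our actual schema):

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SQLContext

// spark-shell already provides sc; the path and column names below are placeholders
val sqlContext = new SQLContext(sc)
val edges = sqlContext.read.parquet("hdfs:///graph/edges.parquet").
  select("src", "dst").rdd.
  map(row => Edge(row.getLong(0), row.getLong(1), ()))

// vertices that only appear in the edge file receive the default (empty) attribute
val graph = Graph.fromEdges(edges, defaultValue = ())

// connectedComponents() labels each vertex with the smallest vertex id in its component
val components = graph.connectedComponents().vertices
components.take(10).foreach(println)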

Is there anything that can be done to improve the performance of the Gremlin/TinkerPop connected components algorithm?

Thanks,
Jen

HadoopMarc

Dec 5, 2016, 1:32:26 PM
to Gremlin-users
Hi Jen,

A few questions to get more info:
 - the repeat() has no stop condition and runs to the end of your directed graph, right?
 - is the behaviour the same with partitions a factor of 10 smaller?
 - when you examine the job in the Spark History UI, what stages do you see, and with what execution times?
 - what happens in the problematic stage: excessive GC, excessive shuffling (see the Spark history again)?

Cheers,    Marc

On Monday, 5 December 2016 at 17:34:59 UTC+1, Jen wrote: