A few months ago, I posted this question on connected components. However, the connected components algorithm in that thread takes a very long time (over 24 hours) and runs out of memory on a large graph (hundreds of millions of edges and vertices) when run as an OLAP traversal on TinkerPop 3.2.1 with Spark. The same thing happens with a simpler version of the connected components query that I wrote, which should work for my graph structure: "order" is a central node, and "items" and "clients" are each connected to "order" by an out-edge, as in the example in the above thread.
g.V().hasLabel('order').as('order').emit().repeat(both().simplePath()).dedup().hasLabel('item','client').as('cluster').select('order','cluster').group().by(select('order')).by(select('cluster'))
I am using 32+ GB of driver and executor memory with ~100 MB partitions (the recommended partition size for my cluster).
By comparison, loading the exact same graph into Spark's GraphX from the underlying vertex and edge Parquet files (before they are converted to the TinkerPop format) and running its connected components algorithm takes 20 minutes with approximately 8 GB of driver and executor memory. It appears that GraphX is using a reasonably efficient connected components algorithm.
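For reference, GraphX's connected components labels each vertex with the lowest vertex id in its component by passing messages between neighbors over repeated rounds. Below is a minimal single-machine Python sketch of that min-label propagation idea. It is not the GraphX implementation, and the function name `connected_components` and the integer vertex ids are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def connected_components(vertices, edges):
    """Label every vertex with the smallest vertex id in its component.

    Hypothetical helper for illustration only; edges are treated as undirected.
    """
    neighbors = defaultdict(set)
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)

    # Every vertex starts out labeled with its own id.
    labels = {v: v for v in vertices}
    for v in neighbors:
        labels.setdefault(v, v)

    active = set(labels)
    while active:
        # Each active vertex offers its label to its neighbors; a neighbor
        # records only the smallest offer that beats its current label.
        inbox = {}
        for v in active:
            for n in neighbors[v]:
                if labels[v] < inbox.get(n, labels[n]):
                    inbox[n] = labels[v]

        # Only vertices whose label changed stay active in the next round.
        active = set()
        for v, label in inbox.items():
            if label < labels[v]:
                labels[v] = label
                active.add(v)
    return labels


# Two small components plus one isolated vertex.
result = connected_components([1, 2, 3, 4, 5, 6], [(1, 2), (2, 3), (4, 5)])
```

Each vertex keeps only a single label and only changed vertices send messages in the next round, so no paths are stored. The traversal above, by contrast, tracks the full path of every traverser for `simplePath()`, which may account for part of the memory difference.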
Is there anything that can be done to improve the performance of the Gremlin/TinkerPop connected components algorithm?
Thanks,
Jen