Hi,
I have been running a Gremlin traversal on my production instance for a couple of months now. The traversal fetches the set of vertices that satisfy certain criteria, sorts them by their updated_at timestamp, and then picks the latest 20. Something like:
g.v...(do something here).as('result_set')...(do something else).back('result_set').sort{it.updated_at}.reverse()_()[0..19].id
This worked fine initially but has slowed down as the size of result_set has grown. While debugging the delay, I found that removing sort{} returns the results much faster. I'm assuming this is because sort{} runs in Groovy space: the lazily evaluated pipeline has to be drained completely and the whole collection sorted before the first result can be returned, as opposed to Gremlin Pipes emitting the results already sorted.
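To illustrate what I mean, here is a minimal plain-Groovy sketch with no Gremlin or Neo4j involved. CountingIterator is a made-up stand-in for the vertex stream, and I'm assuming that sort{} on a pipe behaves like Groovy's sort{} on an ordinary Iterator, which I have not verified against the Pipes source:

```groovy
// Hypothetical stand-in for the vertex stream: a lazy iterator that
// counts how many elements have been pulled from it.
class CountingIterator implements Iterator<Integer> {
    int pulled = 0
    final int limit
    CountingIterator(int limit) { this.limit = limit }
    boolean hasNext() { pulled < limit }
    Integer next() { pulled++ }   // returns the old value, then increments
    void remove() { throw new UnsupportedOperationException() }
}

// Helper (also hypothetical): take at most n elements from an iterator.
def firstN = { Iterator it, int n ->
    def out = []
    while (out.size() < n && it.hasNext()) { out << it.next() }
    out
}

// Without sort: only the requested elements are pulled from the source.
def unsorted = new CountingIterator(15000)
def firstEleven = firstN(unsorted, 11)
println "without sort: pulled ${unsorted.pulled} elements"

// With sort: Groovy has to drain the whole iterator before it can
// return the first element. Sorting on -it gives descending order,
// like sort{}.reverse() in my traversal.
def sortedSource = new CountingIterator(15000)
def latest = firstN(sortedSource.sort { -it }, 11)
println "with sort:    pulled ${sortedSource.pulled} elements"
```

In this sketch the unsorted case stops after 11 elements, while the sorted case consumes all 15000 before returning anything. In the real traversal each pulled element also means running the rest of the pipeline for that vertex, so the cost is probably not the sort alone.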
Here are the benchmarks with sort:
Gremlin (35826.0ms) g.v....as('result_set').....back('result_set').sort{it.updated_at}.reverse()_()[0..10].id
Gremlin (22239.4ms) g.v....as('result_set').....back('result_set').sort{it.updated_at}.reverse()_()[0..10].id
Gremlin (20377.7ms) g.v....as('result_set').....back('result_set').sort{it.updated_at}.reverse()_()[0..10].id
And here are the benchmarks after removing sort{}.reverse():
Gremlin (29.0ms) g.v....as('result_set').....back('result_set')[0..10].id
Gremlin (18.7ms) g.v....as('result_set').....back('result_set')[0..10].id
Gremlin (15.7ms) g.v....as('result_set').....back('result_set')[0..10].id
As can be observed above, execution time increases roughly 1000x when sort{} is applied. I don't think reverse() is the expensive operation. result_set contains ~15k vertices, and the sorted query has to process all of them before it can return anything, whereas the unsorted query can stop after the first few. Is there any way to speed things up?
I'm running the Neo4j 1.7 stable release in production.
--
Nikhil Lanjewar
Engineering Lead at YourNextLeap
http://yournextleap.com