I am testing a Java query on datasets of different sizes, from 100 million to 1 billion edges.
The query does not return much data (10 to 20 vertices with their corresponding edges), but it needs to scan the whole dataset.
I see a big performance degradation once the database size exceeds 32 GB.
I am running the test on a 32-core virtual server with 244 GB of RAM, and the query is threaded to use all CPUs, roughly along the lines of the sketch below.
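To give an idea of the structure (this is only a simplified sketch, not the actual code: processNodeRange is a stand-in for the real Neo4j traversal over one node-id range):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelScanSketch {

    // Stand-in for the real work: traverse the nodes/relationships in one id range.
    static void processNodeRange(long start, long end) { /* Neo4j traversal goes here */ }

    public static void main(String[] args) throws Exception {
        long maxNodeId = 1_000_000_000L;                            // placeholder for the highest node id
        int threads = Runtime.getRuntime().availableProcessors();   // 32 on this server
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long chunk = maxNodeId / threads;

        List<Future<?>> futures = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            long start = i * chunk;
            long end = (i == threads - 1) ? maxNodeId : start + chunk;
            futures.add(pool.submit(() -> processNodeRange(start, end)));
        }
        for (Future<?> f : futures) {
            f.get();                                                 // wait for every partition to finish
        }
        pool.shutdown();
    }
}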
I changed the Java heap size to 96 GB and played with the garbage collector options (keeping -XX:+UseG1GC as the one that helped most; see the launch options after the numbers)
to get a better outcome, but I still see a big dip in performance. The threshold appears to be around 32 GB:
100M edges, 7.5 GB database: 12 min
250M edges, 19 GB database: 35 min
500M edges, 38 GB database: 12 hours (with -XX:+UseG1GC)
1B edges, 76 GB database: 51 hours (without -XX:+UseG1GC)
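For reference, the heap and GC options behind those runs look roughly like this (rest of the command line omitted):

java -Xmx96g -XX:+UseG1GC ...    (500M-edge run)
java -Xmx96g ...                 (1B-edge run, default collector)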
Furthermore, for the 0.5 billion and 1 billion edge tests I can see that the bulk of the CPU time is system time, about 60% system versus
40% user (from the Linux top command). When I run the smaller tests, 100% of the CPU time is user time.
Are the Java GC improvements in the Enterprise edition of Neo4j significant enough to bring the performance of the large-dataset query into the same range as the smaller ones?
Is there anything else I can do to improve the performance of queries on the larger datasets?
Thanks,
Patrice