Re: [Neo4j] cypher count query on large number of nodes cause OOM

Michael Hunger

Jul 21, 2012, 6:00:45 PM
to ne...@googlegroups.com
Sebastian,

Did you try RETURN count(*)?

Global aggregate queries are not really what you want to do with a graph database.

Usually the nodes you look at are connected to a set of start nodes, so you only work with those connected nodes or subgraphs, which are usually much smaller than your whole dataset.

Can you share the stack trace where it dies?

Michael

On 20.07.2012 at 10:52, Sebastian N wrote:

> Hi, I am evaluating Neo4j. When I run a unit test that inserts 1 million nodes (each with 2 String properties) and then check the number of inserted nodes with the Cypher query "START n=node(*) RETURN count(n)", my JVM dies. My memory settings are Xms1024 Xmx1024. The inserts work like a charm: really quick, with very little memory usage. Am I doing something wrong? Should I use a different query language for a graph of that size?
>
> In production, our graph will consist of more than a million nodes, and I need to run aggregate queries because I have to link them to a UI.
>
> Any help appreciated.
> Best regards,
> Sebastian

Giannis Skitsas

Aug 10, 2012, 9:02:04 AM
to ne...@googlegroups.com
Hi,

Just a question that might be related to this post.
My graph consists of 10M nodes and I am using a Cypher query to count the Actions created by a User. The query is:

START u=node:nodes(userId="someid")
MATCH (a)-[r:FROM]->(u)
RETURN count(r)

The result is 312311 and the query takes ~190705 ms to execute. Is this expected, or am I doing something terribly wrong?

Best regards,
Giannis

Michael Hunger

Aug 10, 2012, 9:23:36 AM
to ne...@googlegroups.com
Giannis,

So a single user created 300k actions? Impressive.

Can you try count(*) instead?

Did you measure the first query or subsequent ones?

Michael

Giannis Skitsas

Aug 10, 2012, 9:51:33 AM
to ne...@googlegroups.com
These actions could have been created over the course of one year :)
(I am just trying out a scenario where I want to track every action of a user on a web page.)

With count(*) I didn't see any improvement. Subsequent queries took ~1500-2000 ms.

Michael, do you get better results when running a similar query on a large number of nodes?

Michael Hunger

Aug 10, 2012, 1:07:00 PM
to ne...@googlegroups.com
Giannis,

The first query mostly measures disk speed, because it has to load the 300k relationships and their end nodes. Do you have a spinning disk or an SSD?

The second one should be much faster, but it depends on your memory settings.

How do you run the JVM for Neo4j? What is the heap size, and how much memory does the machine have?

Also, nodes with that many relationships are generally hard to handle. Any query that passes through them explodes in cardinality instantly.
Long paths through the graph are much less of a problem than nodes with that many relationships (at least for now).
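
To make the fan-out concrete, here is a small back-of-the-envelope sketch in plain Python (not Neo4j code; the function name is made up for illustration). It assumes a pattern that enters the hub node through one relationship and leaves through a different one, for example pairing up two actions of the same user:

```python
def rows_through_hub(degree):
    """Rows matched by a pattern like (a)-->(hub)<--(b) that uses two distinct relationships."""
    # Each of the `degree` relationships can be paired with any of the remaining ones.
    return degree * (degree - 1)


# Compare a small node, a busy node, and the 312311-relationship user from this thread.
for degree in (10, 1000, 312311):
    print(degree, rows_through_hub(degree))
```

For the user with 312311 relationships this comes to close to 10^11 intermediate rows, which is why a single dense node hurts far more than a long path.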

One option is to subdivide the actions by another dimension, e.g. time (month, day).
Another option is to pre-aggregate them, e.g. on a timescale: imagine a tree that models time (year - month - day - actions).

You can then store the number of actions as a property on each day node (computed in a batch fashion) and query over the time tree for the time ranges you want to cover.
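
A minimal sketch of that pre-aggregation idea, written in plain Python with nested dictionaries standing in for the year - month - day nodes. This is not Neo4j API code, and the names TimeTree, record_batch and count_between are hypothetical, chosen only for illustration:

```python
from collections import defaultdict
from datetime import date


class TimeTree:
    """In-memory stand-in for a year -> month -> day tree attached to one user."""

    def __init__(self):
        # years[year][month][day] holds the pre-aggregated action count for that day.
        self.years = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

    def record_batch(self, action_dates):
        """Batch step: fold raw action dates into the per-day counters."""
        for d in action_dates:
            self.years[d.year][d.month][d.day] += 1

    def count_between(self, start, end):
        """Sum the day counters for all days with start <= day <= end."""
        total = 0
        for year, months in self.years.items():
            if not start.year <= year <= end.year:
                continue  # skip whole years outside the range
            for month, days in months.items():
                for day, count in days.items():
                    if start <= date(year, month, day) <= end:
                        total += count
        return total


tree = TimeTree()
tree.record_batch([date(2012, 8, 10), date(2012, 8, 10), date(2012, 8, 13)])
print(tree.count_between(date(2012, 8, 1), date(2012, 8, 31)))
```

The point of the design is that a range query touches at most one counter per day instead of one relationship per action, so a year of activity means a few hundred reads rather than 300k.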

HTH,

Michael

Giannis Skitsas

Aug 13, 2012, 3:47:59 AM
to ne...@googlegroups.com
Thanks for your response, Michael!

I am running these queries on a laptop with 8 GB of RAM and a spinning disk. In embedded mode I start the JVM with -Xms512M -Xmx1024M -Xss128K; in server mode I use the default options.

The multilevel indexing structure you mention is definitely something I need, and I will also try to change my schema to avoid nodes with many relationships.

Thanks again,
Giannis