Speeding up cypher query

Paul Dubs

unread,

May 24, 2012, 6:05:40 PM5/24/12

to ne...@googlegroups.com

Hello,

I'm evaluating neo4j at the moment and I'm trying to formulate the queries that we use in Postgres into Cypher.

Now this query runs just fine (in about 143ms):

START a = node(3590), b = node(3588), c = node(3591) MATCH a<-[:CONTAINS]-r, b<-[:CONTAINS]-r, c<-[:CONTAINS]-r, r-[:CONTAINS]->i RETURN distinct id(i), i.title, count(r) as count order by count desc limit 10;

But I have to extend it such that I don't want any node r that can be reached via the relationship CONTAINS from node d:

START a = node(3590), b = node(3588), c = node(3591), d = node(3613) MATCH a<-[:CONTAINS]-r, b<-[:CONTAINS]-r, c<-[:CONTAINS]-r, r-[:CONTAINS]->i WHERE not(d<-[:CONTAINS]-r) RETURN distinct id(i), i.title, count(r) as count order by count desc limit 10;

The second query takes about 8 to 9 seconds right now, so I wonder there is anything I could do to rewrite it to be faster.

I tried to do it with gremlin, but failed to even recreate the first query.

So, is there anything I can do?

Peter Neubauer

unread,

May 25, 2012, 4:43:15 AM5/25/12

to ne...@googlegroups.com

Paul, others,
for reference I replicated your query here: http://tinyurl.com/7vy6ubw

I don't know how to speed it up right now, anyone else guys?

Cheers,

/peter neubauer

G: neubauer.peter
S: peter.neubauer
P: +46 704 106975
L: http://www.linkedin.com/in/neubauer
T: @peterneubauer

If you can write, you can code - @coderdojomalmo
If you can sketch, you can use a graph database - @neo4j

Paul Dubs

unread,

May 25, 2012, 7:40:14 AM5/25/12

to ne...@googlegroups.com

Peter, thanks for replicating the query.

Maybe I'm doing it wrong. The Question this Query tries to ask the Graph is:"Which nodes (i) and how often are they 'contain'ed in conjunction with all start nodes (a,b,c) but don't count the nodes that 'contain' also some other nodes (in this case just d)".

Or in a real world example:

Assume the Graph contains two types of nodes: Words and Texts. Words are related to Texts by the CONTAIN relationship.

Now find all words that are used together with the words a, b and c and order them by how often they appear in texts, but ignore texts that contain the word d.

Or more generalized: [...], but ignore thexts that contain the words d,e,f...

So how could I query a graph, to get the answer that question? (So that it performs better than the current query that I'm using).

Cheers,

Paul

Paul Dubs

unread,

May 26, 2012, 11:44:07 AM5/26/12

to ne...@googlegroups.com

Hello,

I have found a way to get a similiar result (it contains the starting nodes) but a much faster query time:

START r=node(1), a = node(3590), b = node(3588), c = node(3591), d=node(3613) MATCH r--re-[:CONTAINS]-i WHERE (re-[:CONTAINS]-a and re-[:CONTAINS]-b and re-[:CONTAINS]-c) and (not(re-[:CONTAINS]-d)) return distinct id(i), i.title, count(r) as count order by count desc limit 10;

This now runs in about 200ms. And it has the nice side effect that it gets faster as I add more constrains.

But is there a way to denote the nodes that I want to match as an array or something like that? I have tried

START r=node(1), a = node(3590, 3588, 3591), d=node(3613) MATCH r--re-[:CONTAINS]-i WHERE (ALL(x in a WHERE re-[:CONTAINS]-x)) and (not(re-[:CONTAINS]-d)) return distinct id(i), i.title, count(r) as count order by count desc limit 10;

but it just raises: "SyntaxException: Unknown identifier `a`"

Also, is there any way to remove the start nodes from the result?

Cheers,

Paul

Michael Hunger

unread,

May 26, 2012, 12:12:34 PM5/26/12

to ne...@googlegroups.com

Good catch, in this case you want to limit (where) and not expand (match) anyway. And starting with a simple pattern is really fast and filtering down the results too.

Your second question:

What you could do

#1 pass a in as parameter (collection of id's) and use that in connection with ID(x)

#2 prepend a little aggregation query and pass the results with WITH to the second part

START r=node(1), a = node(3590, 3588, 3591), d=node(3613) MATCH r--re-[:CONTAINS]-i

WITH r,d,re,i,collect(a) as nodes

WHERE (ALL(x in nodes WHERE re-[:CONTAINS]-x)) and (not(re-[:CONTAINS]-d)) return distinct id(i), i.title, count(r) as count order by count desc limit 10;

Cheers

Michael

Paul Dubs

unread,

May 26, 2012, 1:02:55 PM5/26/12

to ne...@googlegroups.com

Hello,

What you could do

#1 pass a in as parameter (collection of id's) and use that in connection with ID(x)

How would I do that exactly?

#2 prepend a little aggregation query and pass the results with WITH to the second part

START r=node(1), a = node(3590, 3588, 3591), d=node(3613) MATCH r--re-[:CONTAINS]-i
WITH r,d,re,i,collect(a) as nodes
WHERE (ALL(x in nodes WHERE re-[:CONTAINS]-x)) and (not(re-[:CONTAINS]-d)) return distinct id(i), i.title, count(r) as count order by count desc limit 10;

I had to modify this a bit because I got: "SyntaxException: Unknown identifier `count`.". So the query currently looks like this:

START

r = node(1),

a = node(3590, 3588, 3591),

d = node(3613)

MATCH

r--re-[:CONTAINS]-i

WITH

d, re, i, collect(a) as nodes

WHERE

ALL(x in nodes WHERE re-[:CONTAINS]-x)

AND

NOT(re-[:CONTAINS]-d)

WITH

i, count(re) as count

RETURN

DISTINCT id(i),

i.title,

count

ORDER BY

count DESC

LIMIT

10;

And it takes 1.5 Seconds to execute.

The more direct version takes only 218ms (according to neo4j-sh):

START

r = node(1),

a = node(3590),

b = node(3588),

c = node(3591),

d = node(3613)

MATCH

r--re-[:CONTAINS]-i

WHERE

(

re-[:CONTAINS]-a

AND

re-[:CONTAINS]-b

AND

re-[:CONTAINS]-c

)

AND

NOT(

re-[:CONTAINS]-d

)

RETURN

DISTINCT ID(i),

i.title,

count(re) as count

ORDER BY

count DESC

LIMIT

10;

So it seems that I probably should try the first variant, but I don't quite understand what you mean with it.

Cheers

Paul

Michael Hunger

unread,

May 26, 2012, 1:46:08 PM5/26/12

to ne...@googlegroups.com

It is slower b/c for the aggregation in with (or return) cypher has to pull through all results first to aggregate correctly.

So your first solution is probably the best,

the one that I had in mind for #1 would probably not work as you could use the passed in id's to compare to other node id's but not to initialize an identifier for matching.

Michael

Andres Taylor

unread,

May 26, 2012, 4:39:03 PM5/26/12

to ne...@googlegroups.com

Could you try another query?

START

a = node(3590,3588,3591)

WITH

collect(a) as nodes

START

r = node(1), d = node(3613)

MATCH

r--re-[:CONTAINS]-i

WHERE

ALL(x in nodes : re-[:CONTAINS]-x) AND

NOT re-[:CONTAINS]-d

RETURN

DISTINCT ID(i),

i.title,

count(re) as count

ORDER BY

count DESC

LIMIT

10;

HTH,

Andrés

Paul Dubs

unread,

May 27, 2012, 2:47:26 PM5/27/12

to ne...@googlegroups.com

Hi,

Doing it this way it runs almost as fast as the "unrolled" variant.

But as I generalize it such that I can also add any amount of "bad" nodes, it stays at around 500ms instead of getting faster, like it did with the "unrolled" variant.

It seems like the additional collect step takes about as much time as the rest.

The following queries return the exactly same results, but the first takes 180ms on average while the second takes 530ms on average.

Unrolled query:

START

r = node(1),

a = node(3590),

b = node(3588),

c = node(3591),

d = node(3613),

e = node(3597),

f = node(3627)

MATCH

r--re-[:CONTAINS]-i

WHERE

(

(re-[:CONTAINS]-a)

AND

(re-[:CONTAINS]-b)

AND

(re-[:CONTAINS]-c)

)

AND

NOT(

(re-[:CONTAINS]-d)

AND

(re-[:CONTAINS]-e)

AND

(re-[:CONTAINS]-f)

)

RETURN

DISTINCT ID(i),

i.title,

count(re)

ORDER BY

count(re) DESC

LIMIT

10;

Collect Query:

START

a = node(3590,3588,3591),

d = node(3613, 3597, 3627)

WITH

collect(a) as good, collect(d) as bad

START

r = node(1)

MATCH

r--re-[:CONTAINS]-i

WHERE

ALL(x in good WHERE re-[:CONTAINS]-x)

AND

NONE(x in bad WHERE re-[:CONTAINS]-x)

RETURN

DISTINCT ID(i),

i.title,

count(re)

ORDER BY

count(re) DESC

LIMIT

10;

I think in the ideal case there shouldn't be a difference in speed between the two. I still wonder why I can't use the Nodeset from the START section without a collect step. I tried to find the reason within the source, but I just couldn't wrap my head around it.

Anyway thank you all for your help. Looks like I will have to try to understand the query optimizer and see if there is a way to get a query plan out of it. Something like an EXPLAIN statement from postgres would really be a great addition to Cypher.

Peter Neubauer

unread,

May 27, 2012, 2:58:13 PM5/27/12

to ne...@googlegroups.com

Paul,
EXAPLAIN sounds like a great addition. Care to hack on it and issue a PR?

Cheers,

/peter neubauer

G: neubauer.peter
S: peter.neubauer
P: +46 704 106975
L: http://www.linkedin.com/in/neubauer
T: @peterneubauer

If you can write, you can code - @coderdojomalmo
If you can sketch, you can use a graph database - @neo4j

Michael Hunger

unread,

May 27, 2012, 3:54:53 PM5/27/12

to ne...@googlegroups.com

Paul,

you cannot use the "nodeset" from start, b/c there is no node-set :)

the combination of all start-items will be unrolled into individual executions of the query (i.e. a cross-product of the start-items), so in each of those executions there is exactly ONE node of your "a" start-set.

It is only aggregation that pulls together the results of different executions into aggregated values again.

The only way to pass in collections (right now) is to use parameters. But as I outlined those cannot be used like node identifier literals in match clauses (otherwise they can be used like any normal node).

Michael

Paul Dubs

unread,

May 28, 2012, 2:04:02 AM5/28/12

to ne...@googlegroups.com

EXAPLAIN sounds like a great addition. Care to hack on it and issue a PR?

I'm fairly busy right now, but as I was looking for a reason to learn Scala anyway, I'll give it a try.

Michael, thank you for the explanation.

Cheers,

Paul

Message has been deleted

Paul Dubs

unread,

Jul 12, 2012, 5:19:16 PM7/12/12

to ne...@googlegroups.com

After some more testing I have arrived at this query:

START

rootNode = node(1)

MATCH

(rootNode)-->(re)

WITH

re

START

good = node(3590,3588,3591),

bad = node(3613)

WITH

re, collect(good) as good, collect(bad) as bad

WHERE

ALL( g in good WHERE

(re)-[:CONTAINS]->(g)

)

AND

NONE( b in bad WHERE

(re)-[:CONTAINS]->(b)

)

WITH

re

MATCH

(re)-[:CONTAINS]->(i)

RETURN

DISTINCT id(i),

i.title,

count(i)

ORDER BY

count(i) DESCENDING

LIMIT

10;

While it is the fastest that I have managed yet, it is still takes about twice as long as the (quite complex) Postgres Query that works on the same data set and it only gets slower as I add more concurrent clients (on a machine with 12GB Ram and 12 Cores).

Is there any way to speed that query up any further? My attempts with gremlin were all slower. I haven't tried writing a custom traversal method yet as I have trouble understanding how to convert a query such as this to it.

Cheers,

Paul

Michael Hunger

unread,

Jul 12, 2012, 6:23:20 PM7/12/12

to ne...@googlegroups.com

Paul,

Would it be possible for you to share the dataset (or a generator) privately with us?

I'd love to have a look at the execution

Michael

Sent from mobile device

Paul Dubs

unread,

Jul 18, 2012, 7:25:46 AM7/18/12

to ne...@googlegroups.com

Hi,

at the moment I'm sorry to say that I may not share the dataset. When I have some free time I will create a dataset that I can share.

Paul

Reply all

Reply to author

Forward