Speeding up cypher query

170 views
Skip to first unread message

Paul Dubs

unread,
May 24, 2012, 6:05:40 PM5/24/12
to ne...@googlegroups.com
Hello,

I'm evaluating neo4j at the moment and I'm trying to formulate the queries that we use in Postgres into Cypher.

Now this query runs just fine (in about 143ms):
START a = node(3590), b = node(3588), c = node(3591) MATCH a<-[:CONTAINS]-r, b<-[:CONTAINS]-r, c<-[:CONTAINS]-r, r-[:CONTAINS]->i  RETURN distinct id(i), i.title, count(r) as count order by count desc limit 10;

But I have to extend it such that I don't want any node r that can be reached via the relationship CONTAINS from node d:
START a = node(3590), b = node(3588), c = node(3591), d = node(3613) MATCH a<-[:CONTAINS]-r, b<-[:CONTAINS]-r, c<-[:CONTAINS]-r, r-[:CONTAINS]->i WHERE not(d<-[:CONTAINS]-r) RETURN distinct id(i), i.title, count(r) as count order by count desc limit 10;

The second query takes about 8 to 9 seconds right now, so I wonder there is anything I could do to rewrite it to be faster.

I tried to do it with gremlin, but failed to even recreate the first query.

So, is there anything I can do?

Peter Neubauer

unread,
May 25, 2012, 4:43:15 AM5/25/12
to ne...@googlegroups.com
Paul, others,
for reference I replicated your query here: http://tinyurl.com/7vy6ubw

I don't know how to speed it up right now, anyone else guys?

Cheers,

/peter neubauer

G:  neubauer.peter
S:  peter.neubauer
P:  +46 704 106975
L:   http://www.linkedin.com/in/neubauer
T:   @peterneubauer

If you can write, you can code - @coderdojomalmo
If you can sketch, you can use a graph database - @neo4j

Paul Dubs

unread,
May 25, 2012, 7:40:14 AM5/25/12
to ne...@googlegroups.com
Peter, thanks for replicating the query.

Maybe I'm doing it wrong. The Question this Query tries to ask the Graph is:"Which nodes (i) and how often are they 'contain'ed in conjunction with all start nodes (a,b,c) but don't count the nodes that 'contain' also some other nodes (in this case just d)".

Or in a real world example:
Assume the Graph contains two types of nodes: Words and Texts. Words are related to Texts by the CONTAIN relationship.
Now find all words that are used together with the words a, b and c and order them by how often they appear in texts, but ignore texts that contain the word d.
Or more generalized: [...], but ignore thexts that contain the words d,e,f...

So how could I query a graph, to get the answer that question? (So that it performs better than the current query that I'm using).

Cheers,
Paul

Paul Dubs

unread,
May 26, 2012, 11:44:07 AM5/26/12
to ne...@googlegroups.com

Hello,

I have found a way to get a similiar result (it contains the starting nodes) but a much faster query time:
START r=node(1), a = node(3590), b = node(3588), c = node(3591), d=node(3613) MATCH r--re-[:CONTAINS]-i WHERE (re-[:CONTAINS]-a and re-[:CONTAINS]-b and re-[:CONTAINS]-c) and (not(re-[:CONTAINS]-d)) return distinct id(i), i.title, count(r) as count order by count desc limit 10;

This now runs in about 200ms. And it has the nice side effect that it gets faster as I add more constrains.

But is there a way to denote the nodes that I want to match as an array or something like that? I have tried
START r=node(1), a = node(3590, 3588, 3591), d=node(3613) MATCH r--re-[:CONTAINS]-i WHERE (ALL(x in a WHERE re-[:CONTAINS]-x)) and (not(re-[:CONTAINS]-d)) return distinct id(i), i.title, count(r) as count order by count desc limit 10;

but it just raises: "SyntaxException: Unknown identifier `a`"

Also, is there any way to remove the start nodes from the result?

Cheers,
Paul

Michael Hunger

unread,
May 26, 2012, 12:12:34 PM5/26/12
to ne...@googlegroups.com
Good catch, in this case you want to limit (where) and not expand (match) anyway. And starting with a simple pattern is really fast and filtering down the results too.

Your second question:

What you could do

#1 pass a in as parameter (collection of id's) and use that in connection with ID(x)
#2 prepend a little aggregation query and pass the results with WITH to the second part


START r=node(1), a = node(3590, 3588, 3591), d=node(3613) MATCH r--re-[:CONTAINS]-i
    WITH r,d,re,i,collect(a) as nodes
    WHERE (ALL(x in nodes WHERE re-[:CONTAINS]-x)) and (not(re-[:CONTAINS]-d)) return distinct id(i), i.title, count(r) as count order by count desc limit 10; 

Cheers

Michael

Paul Dubs

unread,
May 26, 2012, 1:02:55 PM5/26/12
to ne...@googlegroups.com
Hello,
 
What you could do

#1 pass a in as parameter (collection of id's) and use that in connection with ID(x)
How would I do that exactly?
 
#2 prepend a little aggregation query and pass the results with WITH to the second part


START r=node(1), a = node(3590, 3588, 3591), d=node(3613) MATCH r--re-[:CONTAINS]-i
    WITH r,d,re,i,collect(a) as nodes
    WHERE (ALL(x in nodes WHERE re-[:CONTAINS]-x)) and (not(re-[:CONTAINS]-d)) return distinct id(i), i.title, count(r) as count order by count desc limit 10; 
I had to modify this a bit because I got: "SyntaxException: Unknown identifier `count`.". So the query currently looks like this:

START
  r = node(1),
  a = node(3590, 3588, 3591),    
  d = node(3613)
MATCH
  r--re-[:CONTAINS]-i
WITH
  d, re, i, collect(a) as nodes
WHERE
  ALL(x in nodes WHERE re-[:CONTAINS]-x) 
  AND 
  NOT(re-[:CONTAINS]-d)
WITH
  i, count(re) as count
RETURN
  DISTINCT id(i),
  i.title,
  count             
ORDER BY
  count DESC
LIMIT
  10;


And it takes 1.5 Seconds to execute.

The more direct version takes only 218ms (according to neo4j-sh):
START
  r = node(1),
  a = node(3590),
  b = node(3588),
  c = node(3591),
  d = node(3613)
MATCH
  r--re-[:CONTAINS]-i
WHERE
  (
   re-[:CONTAINS]-a
   AND
   re-[:CONTAINS]-b
   AND
   re-[:CONTAINS]-c
  )
  AND
  NOT(
   re-[:CONTAINS]-d
  )
RETURN
  DISTINCT ID(i),
  i.title,
  count(re) as count
ORDER BY
  count DESC
LIMIT
  10;

So it seems that I probably should try the first variant, but I don't quite understand what you mean with it.

Cheers
Paul

Michael Hunger

unread,
May 26, 2012, 1:46:08 PM5/26/12
to ne...@googlegroups.com
It is slower b/c for the aggregation in with (or return) cypher has to pull through all results first to aggregate correctly.

So your first solution is probably the best,

the one that I had in mind for #1 would probably not work as you could use the passed in id's to compare to other node id's but not to initialize an identifier for matching.

Michael

Andres Taylor

unread,
May 26, 2012, 4:39:03 PM5/26/12
to ne...@googlegroups.com
Could you try another query?

START 
    a = node(3590,3588,3591)
WITH 
    collect(a) as nodes
START 
    r = node(1), d = node(3613)
MATCH 
    r--re-[:CONTAINS]-i
WHERE
    ALL(x in nodes : re-[:CONTAINS]-x) AND
    NOT   re-[:CONTAINS]-d
RETURN
  DISTINCT ID(i),
  i.title,
  count(re) as count
ORDER BY
  count DESC
LIMIT
  10;
  
HTH,
Andrés

Paul Dubs

unread,
May 27, 2012, 2:47:26 PM5/27/12
to ne...@googlegroups.com
Hi,
Doing it this way it runs almost as fast as the "unrolled" variant.
But as I generalize it such that I can also add any amount of "bad" nodes, it stays at around 500ms instead of getting faster, like it did with the "unrolled" variant.
It seems like the additional collect step takes about as much time as the rest.

The following queries return the exactly same results, but the first takes 180ms on average while the second takes 530ms on average.

Unrolled query:
START
    r = node(1),
    a = node(3590),
    b = node(3588),
    c = node(3591),
    d = node(3613),
    e = node(3597),
    f = node(3627)
MATCH
    r--re-[:CONTAINS]-i
WHERE
    (
      (re-[:CONTAINS]-a)
      AND
      (re-[:CONTAINS]-b)
      AND
      (re-[:CONTAINS]-c)
    )
    AND
    NOT(
      (re-[:CONTAINS]-d)
      AND
      (re-[:CONTAINS]-e)
      AND
      (re-[:CONTAINS]-f)
    )
RETURN
    DISTINCT ID(i),
    i.title,
    count(re)
ORDER BY
    count(re) DESC
LIMIT
    10;

Collect Query:
START
    a = node(3590,3588,3591),
    d = node(3613, 3597, 3627)
WITH
    collect(a) as good, collect(d) as bad
START
    r = node(1)
MATCH
    r--re-[:CONTAINS]-i
WHERE
    ALL(x in good WHERE re-[:CONTAINS]-x)
    AND
    NONE(x in bad WHERE re-[:CONTAINS]-x)
RETURN
    DISTINCT ID(i),
    i.title,
    count(re)
ORDER BY
    count(re) DESC
LIMIT
    10;

I think in the ideal case there shouldn't be a difference in speed between the two. I still wonder why I can't use the Nodeset from the START section without a collect step. I tried to find the reason within the source, but I just couldn't wrap my head around it.

Anyway thank you all for your help. Looks like I will have to try to understand the query optimizer and see if there is a way to get a query plan out of it. Something like an EXPLAIN statement from postgres would really be a great addition to Cypher.

Peter Neubauer

unread,
May 27, 2012, 2:58:13 PM5/27/12
to ne...@googlegroups.com
Paul,
EXAPLAIN sounds like a great addition. Care to hack on it and issue a PR?

Cheers,

/peter neubauer

G:  neubauer.peter
S:  peter.neubauer
P:  +46 704 106975
L:   http://www.linkedin.com/in/neubauer
T:   @peterneubauer

If you can write, you can code - @coderdojomalmo
If you can sketch, you can use a graph database - @neo4j


Michael Hunger

unread,
May 27, 2012, 3:54:53 PM5/27/12
to ne...@googlegroups.com
Paul,

you cannot use the "nodeset" from start, b/c there is no node-set :)

the combination of all start-items will be unrolled into individual executions of the query (i.e. a cross-product of the start-items), so in each of those executions there is exactly ONE node of your "a" start-set.

It is only aggregation that pulls together the results of different executions into aggregated values again.
The only way to pass in collections (right now) is to use parameters. But as I outlined those cannot be used like node identifier literals in match clauses (otherwise they can be used like any normal node).

Michael

Paul Dubs

unread,
May 28, 2012, 2:04:02 AM5/28/12
to ne...@googlegroups.com
EXAPLAIN sounds like a great addition. Care to hack on it and issue a PR?
I'm fairly busy right now, but as I was looking for a reason to learn Scala anyway, I'll give it a try.


Michael, thank you for the explanation.


Cheers,
Paul
Message has been deleted

Paul Dubs

unread,
Jul 12, 2012, 5:19:16 PM7/12/12
to ne...@googlegroups.com
After some more testing I have arrived at this query:

START
    rootNode = node(1)
MATCH
    (rootNode)-->(re)
WITH
    re
START
    good = node(3590,3588,3591),
    bad = node(3613)
WITH
    re, collect(good) as good, collect(bad) as bad
WHERE
    ALL( g in good WHERE
        (re)-[:CONTAINS]->(g)
    )
    AND
    NONE( b in bad WHERE
        (re)-[:CONTAINS]->(b)
    )
WITH
    re
MATCH
    (re)-[:CONTAINS]->(i)
RETURN
    DISTINCT id(i),
    i.title,
    count(i)
ORDER BY
    count(i) DESCENDING
LIMIT
    10;

While it is the fastest that I have managed yet, it is still takes about twice as long as the (quite complex) Postgres Query that works on the same data set and it only gets slower as I add more concurrent clients (on a machine with 12GB Ram and 12 Cores).

Is there any way to speed that query up any further? My attempts with gremlin were all slower. I haven't tried writing a custom traversal method yet as I have trouble understanding how to convert a query such as this to it.


Cheers,
Paul

Michael Hunger

unread,
Jul 12, 2012, 6:23:20 PM7/12/12
to ne...@googlegroups.com
Paul,

Would it be possible for you to share the dataset (or a generator) privately with us?

I'd love to have a look at the execution

Michael

Sent from mobile device

Paul Dubs

unread,
Jul 18, 2012, 7:25:46 AM7/18/12
to ne...@googlegroups.com
Hi,
at the moment I'm sorry to say that I may not share the dataset. When I have some free time I will create a dataset that I can share.

Paul
Reply all
Reply to author
Forward
0 new messages