Account Options

  1. Sign in
The old Google Groups will be going away soon, but your browser is incompatible with the new version.
Google Groups Home
« Groups Home
Speeding up cypher query
There are currently too many topics in this group that display first. To make this topic appear first, remove this option from another topic.
There was an error processing your request. Please try again.
flag
  15 messages - Collapse all  -  Translate all to Translated (View all originals)
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
Paul Dubs  
View profile  
 More options May 24 2012, 6:05 pm
From: Paul Dubs <paul.d...@gmail.com>
Date: Thu, 24 May 2012 15:05:40 -0700 (PDT)
Local: Thurs, May 24 2012 6:05 pm
Subject: Speeding up cypher query

Hello,

I'm evaluating neo4j at the moment and I'm trying to formulate the queries
that we use in Postgres into Cypher.

Now this query runs just fine (in about 143ms):
START a = node(3590), b = node(3588), c = node(3591) MATCH
a<-[:CONTAINS]-r, b<-[:CONTAINS]-r, c<-[:CONTAINS]-r, r-[:CONTAINS]->i
 RETURN distinct id(i), i.title, count(r) as count order by count desc
limit 10;

But I have to extend it such that I don't want any node r that can be
reached via the relationship CONTAINS from node d:
START a = node(3590), b = node(3588), c = node(3591), d = node(3613) MATCH
a<-[:CONTAINS]-r, b<-[:CONTAINS]-r, c<-[:CONTAINS]-r, r-[:CONTAINS]->i
WHERE not(d<-[:CONTAINS]-r) RETURN distinct id(i), i.title, count(r) as
count order by count desc limit 10;

The second query takes about 8 to 9 seconds right now, so I wonder there is
anything I could do to rewrite it to be faster.

I tried to do it with gremlin, but failed to even recreate the first query.

So, is there anything I can do?


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Peter Neubauer  
View profile  
 More options May 25 2012, 4:43 am
From: Peter Neubauer <peter.neuba...@neotechnology.com>
Date: Fri, 25 May 2012 10:43:15 +0200
Local: Fri, May 25 2012 4:43 am
Subject: Re: [Neo4j] Speeding up cypher query
Paul, others,
for reference I replicated your query here: http://tinyurl.com/7vy6ubw

I don't know how to speed it up right now, anyone else guys?

Cheers,

/peter neubauer

G:  neubauer.peter
S:  peter.neubauer
P:  +46 704 106975
L:   http://www.linkedin.com/in/neubauer
T:   @peterneubauer

If you can write, you can code - @coderdojomalmo
If you can sketch, you can use a graph database - @neo4j


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Paul Dubs  
View profile  
 More options May 25 2012, 7:40 am
From: Paul Dubs <paul.d...@gmail.com>
Date: Fri, 25 May 2012 04:40:14 -0700 (PDT)
Local: Fri, May 25 2012 7:40 am
Subject: Re: [Neo4j] Speeding up cypher query

Peter, thanks for replicating the query.

Maybe I'm doing it wrong. The Question this Query tries to ask the Graph
is:"Which nodes (i) and how often are they 'contain'ed in conjunction with
all start nodes (a,b,c) but don't count the nodes that 'contain' also some
other nodes (in this case just d)".

Or in a real world example:
Assume the Graph contains two types of nodes: Words and Texts. Words are
related to Texts by the CONTAIN relationship.
Now find all words that are used together with the words a, b and c and
order them by how often they appear in texts, but ignore texts that contain
the word d.
Or more generalized: [...], but ignore thexts that contain the words
d,e,f...

So how could I query a graph, to get the answer that question? (So that it
performs better than the current query that I'm using).

Cheers,
Paul


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Paul Dubs  
View profile   Translate to Translated (View Original)
 More options May 26 2012, 11:44 am
From: Paul Dubs <paul.d...@gmail.com>
Date: Sat, 26 May 2012 08:44:07 -0700 (PDT)
Local: Sat, May 26 2012 11:44 am
Subject: Re: [Neo4j] Speeding up cypher query

Hello,

I have found a way to get a similiar result (it contains the starting
nodes) but a much faster query time:
START r=node(1), a = node(3590), b = node(3588), c = node(3591),
d=node(3613) MATCH r--re-[:CONTAINS]-i WHERE (re-[:CONTAINS]-a and
re-[:CONTAINS]-b and re-[:CONTAINS]-c) and (not(re-[:CONTAINS]-d)) return
distinct id(i), i.title, count(r) as count order by count desc limit 10;

This now runs in about 200ms. And it has the nice side effect that it gets
faster as I add more constrains.

But is there a way to denote the nodes that I want to match as an array or
something like that? I have tried
START r=node(1), a = node(3590, 3588, 3591), d=node(3613) MATCH
r--re-[:CONTAINS]-i WHERE (ALL(x in a WHERE re-[:CONTAINS]-x)) and
(not(re-[:CONTAINS]-d)) return distinct id(i), i.title, count(r) as count
order by count desc limit 10;

but it just raises: "SyntaxException: Unknown identifier `a`"

Also, is there any way to remove the start nodes from the result?

Cheers,
Paul


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Michael Hunger  
View profile   Translate to Translated (View Original)
 More options May 26 2012, 12:12 pm
From: Michael Hunger <michael.hun...@neotechnology.com>
Date: Sat, 26 May 2012 18:12:34 +0200
Local: Sat, May 26 2012 12:12 pm
Subject: Re: [Neo4j] Speeding up cypher query

Good catch, in this case you want to limit (where) and not expand (match) anyway. And starting with a simple pattern is really fast and filtering down the results too.

Your second question:

What you could do

#1 pass a in as parameter (collection of id's) and use that in connection with ID(x)
#2 prepend a little aggregation query and pass the results with WITH to the second part

> START r=node(1), a = node(3590, 3588, 3591), d=node(3613) MATCH r--re-[:CONTAINS]-i

    WITH r,d,re,i,collect(a) as nodes
    WHERE (ALL(x in nodes WHERE re-[:CONTAINS]-x)) and (not(re-[:CONTAINS]-d)) return distinct id(i), i.title, count(r) as count order by count desc limit 10;

Cheers

Michael

Am 26.05.2012 um 17:44 schrieb Paul Dubs:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Paul Dubs  
View profile   Translate to Translated (View Original)
 More options May 26 2012, 1:02 pm
From: Paul Dubs <paul.d...@gmail.com>
Date: Sat, 26 May 2012 10:02:55 -0700 (PDT)
Local: Sat, May 26 2012 1:02 pm
Subject: Re: [Neo4j] Speeding up cypher query

Hello,

> What you could do

> #1 pass a in as parameter (collection of id's) and use that in connection
> with ID(x)

How would I do that exactly?

> #2 prepend a little aggregation query and pass the results with WITH to
> the second part

> START r=node(1), a = node(3590, 3588, 3591), d=node(3613) MATCH
> r--re-[:CONTAINS]-i

>     WITH r,d,re,i,collect(a) as nodes
>     WHERE (ALL(x in nodes WHERE re-[:CONTAINS]-x)) and
> (not(re-[:CONTAINS]-d)) return distinct id(i), i.title, count(r) as count
> order by count desc limit 10;

I had to modify this a bit because I got: "SyntaxException: Unknown
identifier `count`.". So the query currently looks like this:

START
  r = node(1),
  a = node(3590, 3588, 3591),    
  d = node(3613)
MATCH
  r--re-[:CONTAINS]-i
WITH
  d, re, i, collect(a) as nodes
WHERE
  ALL(x in nodes WHERE re-[:CONTAINS]-x)
  AND
  NOT(re-[:CONTAINS]-d)
WITH
  i, count(re) as count
RETURN
  DISTINCT id(i),
  i.title,
  count            
ORDER BY
  count DESC
LIMIT
  10;

And it takes 1.5 Seconds to execute.

The more direct version takes only 218ms (according to neo4j-sh):
START
  r = node(1),
  a = node(3590),
  b = node(3588),
  c = node(3591),
  d = node(3613)
MATCH
  r--re-[:CONTAINS]-i
WHERE
  (
   re-[:CONTAINS]-a
   AND
   re-[:CONTAINS]-b
   AND
   re-[:CONTAINS]-c
  )
  AND
  NOT(
   re-[:CONTAINS]-d
  )
RETURN
  DISTINCT ID(i),
  i.title,
  count(re) as count
ORDER BY
  count DESC
LIMIT
  10;

So it seems that I probably should try the first variant, but I don't quite
understand what you mean with it.

Cheers
Paul


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Michael Hunger  
View profile   Translate to Translated (View Original)
 More options May 26 2012, 1:46 pm
From: Michael Hunger <michael.hun...@neotechnology.com>
Date: Sat, 26 May 2012 19:46:08 +0200
Local: Sat, May 26 2012 1:46 pm
Subject: Re: [Neo4j] Speeding up cypher query

It is slower b/c for the aggregation in with (or return) cypher has to pull through all results first to aggregate correctly.

So your first solution is probably the best,

the one that I had in mind for #1 would probably not work as you could use the passed in id's to compare to other node id's but not to initialize an identifier for matching.

Michael

Am 26.05.2012 um 19:02 schrieb Paul Dubs:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Andres Taylor  
View profile   Translate to Translated (View Original)
 More options May 26 2012, 4:39 pm
From: Andres Taylor <andres.tay...@neotechnology.com>
Date: Sat, 26 May 2012 22:39:03 +0200
Local: Sat, May 26 2012 4:39 pm
Subject: Re: [Neo4j] Speeding up cypher query

Could you try another query?

START
    a = node(3590,3588,3591)
WITH
    collect(a) as nodes
START
    r = node(1), d = node(3613)
MATCH
    r--re-[:CONTAINS]-i
WHERE
    ALL(x in nodes : re-[:CONTAINS]-x) AND
    NOT   re-[:CONTAINS]-d
RETURN
  DISTINCT ID(i),
  i.title,
  count(re) as count
ORDER BY
  count DESC
LIMIT
  10;

HTH,
Andrés


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Paul Dubs  
View profile   Translate to Translated (View Original)
 More options May 27 2012, 2:47 pm
From: Paul Dubs <paul.d...@gmail.com>
Date: Sun, 27 May 2012 11:47:26 -0700 (PDT)
Local: Sun, May 27 2012 2:47 pm
Subject: Re: [Neo4j] Speeding up cypher query

Hi,
Doing it this way it runs almost as fast as the "unrolled" variant.
But as I generalize it such that I can also add any amount of "bad" nodes,
it stays at around 500ms instead of getting faster, like it did with the
"unrolled" variant.
It seems like the additional collect step takes about as much time as the
rest.

The following queries return the exactly same results, but the first takes
180ms on average while the second takes 530ms on average.

Unrolled query:
START
    r = node(1),
    a = node(3590),
    b = node(3588),
    c = node(3591),
    d = node(3613),
    e = node(3597),
    f = node(3627)
MATCH
    r--re-[:CONTAINS]-i
WHERE
    (
      (re-[:CONTAINS]-a)
      AND
      (re-[:CONTAINS]-b)
      AND
      (re-[:CONTAINS]-c)
    )
    AND
    NOT(
      (re-[:CONTAINS]-d)
      AND
      (re-[:CONTAINS]-e)
      AND
      (re-[:CONTAINS]-f)
    )
RETURN
    DISTINCT ID(i),
    i.title,
    count(re)
ORDER BY
    count(re) DESC
LIMIT
    10;

Collect Query:
START
    a = node(3590,3588,3591),
    d = node(3613, 3597, 3627)
WITH
    collect(a) as good, collect(d) as bad
START
    r = node(1)
MATCH
    r--re-[:CONTAINS]-i
WHERE
    ALL(x in good WHERE re-[:CONTAINS]-x)
    AND
    NONE(x in bad WHERE re-[:CONTAINS]-x)
RETURN
    DISTINCT ID(i),
    i.title,
    count(re)
ORDER BY
    count(re) DESC
LIMIT
    10;

I think in the ideal case there shouldn't be a difference in speed between
the two. I still wonder why I can't use the Nodeset from the START section
without a collect step. I tried to find the reason within the source, but I
just couldn't wrap my head around it.

Anyway thank you all for your help. Looks like I will have to try to
understand the query optimizer and see if there is a way to get a query
plan out of it. Something like an EXPLAIN statement from postgres would
really be a great addition to Cypher.


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Peter Neubauer  
View profile  
 More options May 27 2012, 2:58 pm
From: Peter Neubauer <peter.neuba...@neotechnology.com>
Date: Sun, 27 May 2012 20:58:13 +0200
Local: Sun, May 27 2012 2:58 pm
Subject: Re: [Neo4j] Speeding up cypher query
Paul,
EXAPLAIN sounds like a great addition. Care to hack on it and issue a PR?

Cheers,

/peter neubauer

G:  neubauer.peter
S:  peter.neubauer
P:  +46 704 106975
L:   http://www.linkedin.com/in/neubauer
T:   @peterneubauer

If you can write, you can code - @coderdojomalmo
If you can sketch, you can use a graph database - @neo4j


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Michael Hunger  
View profile  
 More options May 27 2012, 3:54 pm
From: Michael Hunger <michael.hun...@neotechnology.com>
Date: Sun, 27 May 2012 21:54:53 +0200
Local: Sun, May 27 2012 3:54 pm
Subject: Re: [Neo4j] Speeding up cypher query

Paul,

you cannot use the "nodeset" from start, b/c there is no node-set :)

the combination of all start-items will be unrolled into individual executions of the query (i.e. a cross-product of the start-items), so in each of those executions there is exactly ONE node of your "a" start-set.

It is only aggregation that pulls together the results of different executions into aggregated values again.
The only way to pass in collections (right now) is to use parameters. But as I outlined those cannot be used like node identifier literals in match clauses (otherwise they can be used like any normal node).

Michael

Am 27.05.2012 um 20:47 schrieb Paul Dubs:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Paul Dubs  
View profile   Translate to Translated (View Original)
 More options May 28 2012, 2:04 am
From: Paul Dubs <paul.d...@gmail.com>
Date: Sun, 27 May 2012 23:04:02 -0700 (PDT)
Local: Mon, May 28 2012 2:04 am
Subject: Re: [Neo4j] Speeding up cypher query

> EXAPLAIN sounds like a great addition. Care to hack on it and issue a PR?

I'm fairly busy right now, but as I was looking for a reason to learn Scala
anyway, I'll give it a try.

Michael, thank you for the explanation.

Cheers,
Paul


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Paul Dubs  
View profile  
 More options Jul 12 2012, 5:19 pm
From: Paul Dubs <paul.d...@gmail.com>
Date: Thu, 12 Jul 2012 14:19:16 -0700 (PDT)
Local: Thurs, Jul 12 2012 5:19 pm
Subject: Re: [Neo4j] Speeding up cypher query

After some more testing I have arrived at this query:

START
    rootNode = node(1)
MATCH
    (rootNode)-->(re)
WITH
    re
START
    good = node(3590,3588,3591),
    bad = node(3613)
WITH
    re, collect(good) as good, collect(bad) as bad
WHERE
    ALL( g in good WHERE
        (re)-[:CONTAINS]->(g)
    )
    AND
    NONE( b in bad WHERE
        (re)-[:CONTAINS]->(b)
    )
WITH
    re
MATCH
    (re)-[:CONTAINS]->(i)
RETURN
    DISTINCT id(i),
    i.title,
    count(i)
ORDER BY
    count(i) DESCENDING
LIMIT
    10;

While it is the fastest that I have managed yet, it is still takes about
twice as long as the (quite complex) Postgres Query that works on the same
data set and it only gets slower as I add more concurrent clients (on a
machine with 12GB Ram and 12 Cores).

Is there any way to speed that query up any further? My attempts with
gremlin were all slower. I haven't tried writing a custom traversal method
yet as I have trouble understanding how to convert a query such as this to
it.

Cheers,
Paul


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Michael Hunger  
View profile  
 More options Jul 12 2012, 6:23 pm
From: Michael Hunger <michael.hun...@neopersistence.com>
Date: Fri, 13 Jul 2012 00:23:20 +0200
Local: Thurs, Jul 12 2012 6:23 pm
Subject: Re: [Neo4j] Speeding up cypher query
Paul,

Would it be possible for you to share the dataset (or a generator) privately with us?

I'd love to have a look at the execution

Michael

Sent from mobile device

Am 12.07.2012 um 23:19 schrieb Paul Dubs <paul.d...@gmail.com>:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Paul Dubs  
View profile  
 More options Jul 18 2012, 7:25 am
From: Paul Dubs <paul.d...@gmail.com>
Date: Wed, 18 Jul 2012 04:25:46 -0700 (PDT)
Local: Wed, Jul 18 2012 7:25 am
Subject: Re: [Neo4j] Speeding up cypher query

Hi,
at the moment I'm sorry to say that I may not share the dataset. When I
have some free time I will create a dataset that I can share.

Paul


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
End of messages
« Back to Discussions « Newer topic     Older topic »