The old Google Groups will be going away soon, but your browser is incompatible with the new version.
Speeding up cypher query
 There are currently too many topics in this group that display first. To make this topic appear first, remove this option from another topic. There was an error processing your request. Please try again. Standard view   View as tree
 15 messages

From:
To:
Cc:
Followup To:
Subject:
 Validation: For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon.

More options May 24 2012, 6:05 pm
From: Paul Dubs <paul.d...@gmail.com>
Date: Thu, 24 May 2012 15:05:40 -0700 (PDT)
Local: Thurs, May 24 2012 6:05 pm
Subject: Speeding up cypher query

Hello,

I'm evaluating neo4j at the moment and I'm trying to formulate the queries
that we use in Postgres into Cypher.

Now this query runs just fine (in about 143ms):
START a = node(3590), b = node(3588), c = node(3591) MATCH
a<-[:CONTAINS]-r, b<-[:CONTAINS]-r, c<-[:CONTAINS]-r, r-[:CONTAINS]->i
RETURN distinct id(i), i.title, count(r) as count order by count desc
limit 10;

But I have to extend it such that I don't want any node r that can be
reached via the relationship CONTAINS from node d:
START a = node(3590), b = node(3588), c = node(3591), d = node(3613) MATCH
a<-[:CONTAINS]-r, b<-[:CONTAINS]-r, c<-[:CONTAINS]-r, r-[:CONTAINS]->i
WHERE not(d<-[:CONTAINS]-r) RETURN distinct id(i), i.title, count(r) as
count order by count desc limit 10;

The second query takes about 8 to 9 seconds right now, so I wonder there is
anything I could do to rewrite it to be faster.

I tried to do it with gremlin, but failed to even recreate the first query.

So, is there anything I can do?

To post a message you must first join this group.
You do not have the permission required to post.
More options May 25 2012, 4:43 am
From: Peter Neubauer <peter.neuba...@neotechnology.com>
Date: Fri, 25 May 2012 10:43:15 +0200
Local: Fri, May 25 2012 4:43 am
Subject: Re: [Neo4j] Speeding up cypher query
Paul, others,
for reference I replicated your query here: http://tinyurl.com/7vy6ubw

I don't know how to speed it up right now, anyone else guys?

Cheers,

/peter neubauer

G:  neubauer.peter
S:  peter.neubauer
P:  +46 704 106975
T:   @peterneubauer

If you can write, you can code - @coderdojomalmo
If you can sketch, you can use a graph database - @neo4j

To post a message you must first join this group.
You do not have the permission required to post.
More options May 25 2012, 7:40 am
From: Paul Dubs <paul.d...@gmail.com>
Date: Fri, 25 May 2012 04:40:14 -0700 (PDT)
Local: Fri, May 25 2012 7:40 am
Subject: Re: [Neo4j] Speeding up cypher query

Peter, thanks for replicating the query.

Maybe I'm doing it wrong. The Question this Query tries to ask the Graph
is:"Which nodes (i) and how often are they 'contain'ed in conjunction with
all start nodes (a,b,c) but don't count the nodes that 'contain' also some
other nodes (in this case just d)".

Or in a real world example:
Assume the Graph contains two types of nodes: Words and Texts. Words are
related to Texts by the CONTAIN relationship.
Now find all words that are used together with the words a, b and c and
order them by how often they appear in texts, but ignore texts that contain
the word d.
Or more generalized: [...], but ignore thexts that contain the words
d,e,f...

So how could I query a graph, to get the answer that question? (So that it
performs better than the current query that I'm using).

Cheers,
Paul

To post a message you must first join this group.
You do not have the permission required to post.
More options May 26 2012, 11:44 am
From: Paul Dubs <paul.d...@gmail.com>
Date: Sat, 26 May 2012 08:44:07 -0700 (PDT)
Local: Sat, May 26 2012 11:44 am
Subject: Re: [Neo4j] Speeding up cypher query

Hello,

I have found a way to get a similiar result (it contains the starting
nodes) but a much faster query time:
START r=node(1), a = node(3590), b = node(3588), c = node(3591),
d=node(3613) MATCH r--re-[:CONTAINS]-i WHERE (re-[:CONTAINS]-a and
re-[:CONTAINS]-b and re-[:CONTAINS]-c) and (not(re-[:CONTAINS]-d)) return
distinct id(i), i.title, count(r) as count order by count desc limit 10;

This now runs in about 200ms. And it has the nice side effect that it gets
faster as I add more constrains.

But is there a way to denote the nodes that I want to match as an array or
something like that? I have tried
START r=node(1), a = node(3590, 3588, 3591), d=node(3613) MATCH
r--re-[:CONTAINS]-i WHERE (ALL(x in a WHERE re-[:CONTAINS]-x)) and
(not(re-[:CONTAINS]-d)) return distinct id(i), i.title, count(r) as count
order by count desc limit 10;

but it just raises: "SyntaxException: Unknown identifier `a`"

Also, is there any way to remove the start nodes from the result?

Cheers,
Paul

To post a message you must first join this group.
You do not have the permission required to post.
More options May 26 2012, 12:12 pm
From: Michael Hunger <michael.hun...@neotechnology.com>
Date: Sat, 26 May 2012 18:12:34 +0200
Local: Sat, May 26 2012 12:12 pm
Subject: Re: [Neo4j] Speeding up cypher query

Good catch, in this case you want to limit (where) and not expand (match) anyway. And starting with a simple pattern is really fast and filtering down the results too.

What you could do

#1 pass a in as parameter (collection of id's) and use that in connection with ID(x)
#2 prepend a little aggregation query and pass the results with WITH to the second part

> START r=node(1), a = node(3590, 3588, 3591), d=node(3613) MATCH r--re-[:CONTAINS]-i

WITH r,d,re,i,collect(a) as nodes
WHERE (ALL(x in nodes WHERE re-[:CONTAINS]-x)) and (not(re-[:CONTAINS]-d)) return distinct id(i), i.title, count(r) as count order by count desc limit 10;

Cheers

Michael

Am 26.05.2012 um 17:44 schrieb Paul Dubs:

To post a message you must first join this group.
You do not have the permission required to post.
More options May 26 2012, 1:02 pm
From: Paul Dubs <paul.d...@gmail.com>
Date: Sat, 26 May 2012 10:02:55 -0700 (PDT)
Local: Sat, May 26 2012 1:02 pm
Subject: Re: [Neo4j] Speeding up cypher query

Hello,

> What you could do

> #1 pass a in as parameter (collection of id's) and use that in connection
> with ID(x)

How would I do that exactly?

> #2 prepend a little aggregation query and pass the results with WITH to
> the second part

> START r=node(1), a = node(3590, 3588, 3591), d=node(3613) MATCH
> r--re-[:CONTAINS]-i

>     WITH r,d,re,i,collect(a) as nodes
>     WHERE (ALL(x in nodes WHERE re-[:CONTAINS]-x)) and
> (not(re-[:CONTAINS]-d)) return distinct id(i), i.title, count(r) as count
> order by count desc limit 10;

I had to modify this a bit because I got: "SyntaxException: Unknown
identifier `count`.". So the query currently looks like this:

START
r = node(1),
a = node(3590, 3588, 3591),
d = node(3613)
MATCH
r--re-[:CONTAINS]-i
WITH
d, re, i, collect(a) as nodes
WHERE
ALL(x in nodes WHERE re-[:CONTAINS]-x)
AND
NOT(re-[:CONTAINS]-d)
WITH
i, count(re) as count
RETURN
DISTINCT id(i),
i.title,
count
ORDER BY
count DESC
LIMIT
10;

And it takes 1.5 Seconds to execute.

The more direct version takes only 218ms (according to neo4j-sh):
START
r = node(1),
a = node(3590),
b = node(3588),
c = node(3591),
d = node(3613)
MATCH
r--re-[:CONTAINS]-i
WHERE
(
re-[:CONTAINS]-a
AND
re-[:CONTAINS]-b
AND
re-[:CONTAINS]-c
)
AND
NOT(
re-[:CONTAINS]-d
)
RETURN
DISTINCT ID(i),
i.title,
count(re) as count
ORDER BY
count DESC
LIMIT
10;

So it seems that I probably should try the first variant, but I don't quite
understand what you mean with it.

Cheers
Paul

To post a message you must first join this group.
You do not have the permission required to post.
More options May 26 2012, 1:46 pm
From: Michael Hunger <michael.hun...@neotechnology.com>
Date: Sat, 26 May 2012 19:46:08 +0200
Local: Sat, May 26 2012 1:46 pm
Subject: Re: [Neo4j] Speeding up cypher query

It is slower b/c for the aggregation in with (or return) cypher has to pull through all results first to aggregate correctly.

So your first solution is probably the best,

the one that I had in mind for #1 would probably not work as you could use the passed in id's to compare to other node id's but not to initialize an identifier for matching.

Michael

Am 26.05.2012 um 19:02 schrieb Paul Dubs:

To post a message you must first join this group.
You do not have the permission required to post.
More options May 26 2012, 4:39 pm
From: Andres Taylor <andres.tay...@neotechnology.com>
Date: Sat, 26 May 2012 22:39:03 +0200
Local: Sat, May 26 2012 4:39 pm
Subject: Re: [Neo4j] Speeding up cypher query

Could you try another query?

START
a = node(3590,3588,3591)
WITH
collect(a) as nodes
START
r = node(1), d = node(3613)
MATCH
r--re-[:CONTAINS]-i
WHERE
ALL(x in nodes : re-[:CONTAINS]-x) AND
NOT   re-[:CONTAINS]-d
RETURN
DISTINCT ID(i),
i.title,
count(re) as count
ORDER BY
count DESC
LIMIT
10;

HTH,
Andrés

To post a message you must first join this group.
You do not have the permission required to post.
More options May 27 2012, 2:47 pm
From: Paul Dubs <paul.d...@gmail.com>
Date: Sun, 27 May 2012 11:47:26 -0700 (PDT)
Local: Sun, May 27 2012 2:47 pm
Subject: Re: [Neo4j] Speeding up cypher query

Hi,
Doing it this way it runs almost as fast as the "unrolled" variant.
But as I generalize it such that I can also add any amount of "bad" nodes,
it stays at around 500ms instead of getting faster, like it did with the
"unrolled" variant.
It seems like the additional collect step takes about as much time as the
rest.

The following queries return the exactly same results, but the first takes
180ms on average while the second takes 530ms on average.

Unrolled query:
START
r = node(1),
a = node(3590),
b = node(3588),
c = node(3591),
d = node(3613),
e = node(3597),
f = node(3627)
MATCH
r--re-[:CONTAINS]-i
WHERE
(
(re-[:CONTAINS]-a)
AND
(re-[:CONTAINS]-b)
AND
(re-[:CONTAINS]-c)
)
AND
NOT(
(re-[:CONTAINS]-d)
AND
(re-[:CONTAINS]-e)
AND
(re-[:CONTAINS]-f)
)
RETURN
DISTINCT ID(i),
i.title,
count(re)
ORDER BY
count(re) DESC
LIMIT
10;

Collect Query:
START
a = node(3590,3588,3591),
d = node(3613, 3597, 3627)
WITH
collect(a) as good, collect(d) as bad
START
r = node(1)
MATCH
r--re-[:CONTAINS]-i
WHERE
ALL(x in good WHERE re-[:CONTAINS]-x)
AND
RETURN
DISTINCT ID(i),
i.title,
count(re)
ORDER BY
count(re) DESC
LIMIT
10;

I think in the ideal case there shouldn't be a difference in speed between
the two. I still wonder why I can't use the Nodeset from the START section
without a collect step. I tried to find the reason within the source, but I
just couldn't wrap my head around it.

Anyway thank you all for your help. Looks like I will have to try to
understand the query optimizer and see if there is a way to get a query
plan out of it. Something like an EXPLAIN statement from postgres would
really be a great addition to Cypher.

To post a message you must first join this group.
You do not have the permission required to post.
More options May 27 2012, 2:58 pm
From: Peter Neubauer <peter.neuba...@neotechnology.com>
Date: Sun, 27 May 2012 20:58:13 +0200
Local: Sun, May 27 2012 2:58 pm
Subject: Re: [Neo4j] Speeding up cypher query
Paul,
EXAPLAIN sounds like a great addition. Care to hack on it and issue a PR?

Cheers,

/peter neubauer

G:  neubauer.peter
S:  peter.neubauer
P:  +46 704 106975
T:   @peterneubauer

If you can write, you can code - @coderdojomalmo
If you can sketch, you can use a graph database - @neo4j

To post a message you must first join this group.
You do not have the permission required to post.
More options May 27 2012, 3:54 pm
From: Michael Hunger <michael.hun...@neotechnology.com>
Date: Sun, 27 May 2012 21:54:53 +0200
Local: Sun, May 27 2012 3:54 pm
Subject: Re: [Neo4j] Speeding up cypher query

Paul,

you cannot use the "nodeset" from start, b/c there is no node-set :)

the combination of all start-items will be unrolled into individual executions of the query (i.e. a cross-product of the start-items), so in each of those executions there is exactly ONE node of your "a" start-set.

It is only aggregation that pulls together the results of different executions into aggregated values again.
The only way to pass in collections (right now) is to use parameters. But as I outlined those cannot be used like node identifier literals in match clauses (otherwise they can be used like any normal node).

Michael

Am 27.05.2012 um 20:47 schrieb Paul Dubs:

To post a message you must first join this group.
You do not have the permission required to post.
More options May 28 2012, 2:04 am
From: Paul Dubs <paul.d...@gmail.com>
Date: Sun, 27 May 2012 23:04:02 -0700 (PDT)
Local: Mon, May 28 2012 2:04 am
Subject: Re: [Neo4j] Speeding up cypher query

> EXAPLAIN sounds like a great addition. Care to hack on it and issue a PR?

I'm fairly busy right now, but as I was looking for a reason to learn Scala
anyway, I'll give it a try.

Michael, thank you for the explanation.

Cheers,
Paul

To post a message you must first join this group.
You do not have the permission required to post.
More options Jul 12 2012, 5:19 pm
From: Paul Dubs <paul.d...@gmail.com>
Date: Thu, 12 Jul 2012 14:19:16 -0700 (PDT)
Local: Thurs, Jul 12 2012 5:19 pm
Subject: Re: [Neo4j] Speeding up cypher query

After some more testing I have arrived at this query:

START
rootNode = node(1)
MATCH
(rootNode)-->(re)
WITH
re
START
good = node(3590,3588,3591),
WITH
WHERE
ALL( g in good WHERE
(re)-[:CONTAINS]->(g)
)
AND
(re)-[:CONTAINS]->(b)
)
WITH
re
MATCH
(re)-[:CONTAINS]->(i)
RETURN
DISTINCT id(i),
i.title,
count(i)
ORDER BY
count(i) DESCENDING
LIMIT
10;

While it is the fastest that I have managed yet, it is still takes about
twice as long as the (quite complex) Postgres Query that works on the same
data set and it only gets slower as I add more concurrent clients (on a
machine with 12GB Ram and 12 Cores).

Is there any way to speed that query up any further? My attempts with
gremlin were all slower. I haven't tried writing a custom traversal method
yet as I have trouble understanding how to convert a query such as this to
it.

Cheers,
Paul

To post a message you must first join this group.
You do not have the permission required to post.
More options Jul 12 2012, 6:23 pm
From: Michael Hunger <michael.hun...@neopersistence.com>
Date: Fri, 13 Jul 2012 00:23:20 +0200
Local: Thurs, Jul 12 2012 6:23 pm
Subject: Re: [Neo4j] Speeding up cypher query
Paul,

Would it be possible for you to share the dataset (or a generator) privately with us?

I'd love to have a look at the execution

Michael

Sent from mobile device

Am 12.07.2012 um 23:19 schrieb Paul Dubs <paul.d...@gmail.com>:

To post a message you must first join this group.
You do not have the permission required to post.
More options Jul 18 2012, 7:25 am
From: Paul Dubs <paul.d...@gmail.com>
Date: Wed, 18 Jul 2012 04:25:46 -0700 (PDT)
Local: Wed, Jul 18 2012 7:25 am
Subject: Re: [Neo4j] Speeding up cypher query

Hi,
at the moment I'm sorry to say that I may not share the dataset. When I
have some free time I will create a dataset that I can share.

Paul