I'm evaluating neo4j at the moment and I'm trying to formulate the queries that we use in Postgres into Cypher.
Now this query runs just fine (in about 143ms): START a = node(3590), b = node(3588), c = node(3591) MATCH a<-[:CONTAINS]-r, b<-[:CONTAINS]-r, c<-[:CONTAINS]-r, r-[:CONTAINS]->i RETURN distinct id(i), i.title, count(r) as count order by count desc limit 10;
But I have to extend it such that I don't want any node r that can be reached via the relationship CONTAINS from node d: START a = node(3590), b = node(3588), c = node(3591), d = node(3613) MATCH a<-[:CONTAINS]-r, b<-[:CONTAINS]-r, c<-[:CONTAINS]-r, r-[:CONTAINS]->i WHERE not(d<-[:CONTAINS]-r) RETURN distinct id(i), i.title, count(r) as count order by count desc limit 10;
The second query takes about 8 to 9 seconds right now, so I wonder there is anything I could do to rewrite it to be faster.
I tried to do it with gremlin, but failed to even recreate the first query.
On Fri, May 25, 2012 at 12:05 AM, Paul Dubs <paul.d...@gmail.com> wrote:
> Hello,
> I'm evaluating neo4j at the moment and I'm trying to formulate the queries
> that we use in Postgres into Cypher.
> Now this query runs just fine (in about 143ms):
> START a = node(3590), b = node(3588), c = node(3591) MATCH a<-[:CONTAINS]-r,
> b<-[:CONTAINS]-r, c<-[:CONTAINS]-r, r-[:CONTAINS]->i RETURN distinct id(i),
> i.title, count(r) as count order by count desc limit 10;
> But I have to extend it such that I don't want any node r that can be
> reached via the relationship CONTAINS from node d:
> START a = node(3590), b = node(3588), c = node(3591), d = node(3613) MATCH
> a<-[:CONTAINS]-r, b<-[:CONTAINS]-r, c<-[:CONTAINS]-r, r-[:CONTAINS]->i WHERE
> not(d<-[:CONTAINS]-r) RETURN distinct id(i), i.title, count(r) as count
> order by count desc limit 10;
> The second query takes about 8 to 9 seconds right now, so I wonder there is
> anything I could do to rewrite it to be faster.
> I tried to do it with gremlin, but failed to even recreate the first query.
Maybe I'm doing it wrong. The Question this Query tries to ask the Graph is:"Which nodes (i) and how often are they 'contain'ed in conjunction with all start nodes (a,b,c) but don't count the nodes that 'contain' also some other nodes (in this case just d)".
Or in a real world example: Assume the Graph contains two types of nodes: Words and Texts. Words are related to Texts by the CONTAIN relationship. Now find all words that are used together with the words a, b and c and order them by how often they appear in texts, but ignore texts that contain the word d. Or more generalized: [...], but ignore thexts that contain the words d,e,f...
So how could I query a graph, to get the answer that question? (So that it performs better than the current query that I'm using).
I have found a way to get a similiar result (it contains the starting nodes) but a much faster query time: START r=node(1), a = node(3590), b = node(3588), c = node(3591), d=node(3613) MATCH r--re-[:CONTAINS]-i WHERE (re-[:CONTAINS]-a and re-[:CONTAINS]-b and re-[:CONTAINS]-c) and (not(re-[:CONTAINS]-d)) return distinct id(i), i.title, count(r) as count order by count desc limit 10;
This now runs in about 200ms. And it has the nice side effect that it gets faster as I add more constrains.
But is there a way to denote the nodes that I want to match as an array or something like that? I have tried START r=node(1), a = node(3590, 3588, 3591), d=node(3613) MATCH r--re-[:CONTAINS]-i WHERE (ALL(x in a WHERE re-[:CONTAINS]-x)) and (not(re-[:CONTAINS]-d)) return distinct id(i), i.title, count(r) as count order by count desc limit 10;
but it just raises: "SyntaxException: Unknown identifier `a`"
Also, is there any way to remove the start nodes from the result?
On Friday, May 25, 2012 1:40:14 PM UTC+2, Paul Dubs wrote:
> Peter, thanks for replicating the query.
> Maybe I'm doing it wrong. The Question this Query tries to ask the Graph > is:"Which nodes (i) and how often are they 'contain'ed in conjunction with > all start nodes (a,b,c) but don't count the nodes that 'contain' also some > other nodes (in this case just d)".
> Or in a real world example: > Assume the Graph contains two types of nodes: Words and Texts. Words are > related to Texts by the CONTAIN relationship. > Now find all words that are used together with the words a, b and c and > order them by how often they appear in texts, but ignore texts that contain > the word d. > Or more generalized: [...], but ignore thexts that contain the words > d,e,f...
> So how could I query a graph, to get the answer that question? (So that it > performs better than the current query that I'm using).
Good catch, in this case you want to limit (where) and not expand (match) anyway. And starting with a simple pattern is really fast and filtering down the results too.
Your second question:
What you could do
#1 pass a in as parameter (collection of id's) and use that in connection with ID(x)
#2 prepend a little aggregation query and pass the results with WITH to the second part
> START r=node(1), a = node(3590, 3588, 3591), d=node(3613) MATCH r--re-[:CONTAINS]-i
WITH r,d,re,i,collect(a) as nodes
WHERE (ALL(x in nodes WHERE re-[:CONTAINS]-x)) and (not(re-[:CONTAINS]-d)) return distinct id(i), i.title, count(r) as count order by count desc limit 10;
> I have found a way to get a similiar result (it contains the starting nodes) but a much faster query time:
> START r=node(1), a = node(3590), b = node(3588), c = node(3591), d=node(3613) MATCH r--re-[:CONTAINS]-i WHERE (re-[:CONTAINS]-a and re-[:CONTAINS]-b and re-[:CONTAINS]-c) and (not(re-[:CONTAINS]-d)) return distinct id(i), i.title, count(r) as count order by count desc limit 10;
> This now runs in about 200ms. And it has the nice side effect that it gets faster as I add more constrains.
> But is there a way to denote the nodes that I want to match as an array or something like that? I have tried
> START r=node(1), a = node(3590, 3588, 3591), d=node(3613) MATCH r--re-[:CONTAINS]-i WHERE (ALL(x in a WHERE re-[:CONTAINS]-x)) and (not(re-[:CONTAINS]-d)) return distinct id(i), i.title, count(r) as count order by count desc limit 10;
> but it just raises: "SyntaxException: Unknown identifier `a`"
> Also, is there any way to remove the start nodes from the result?
> Cheers,
> Paul
> On Friday, May 25, 2012 1:40:14 PM UTC+2, Paul Dubs wrote:
> Peter, thanks for replicating the query.
> Maybe I'm doing it wrong. The Question this Query tries to ask the Graph is:"Which nodes (i) and how often are they 'contain'ed in conjunction with all start nodes (a,b,c) but don't count the nodes that 'contain' also some other nodes (in this case just d)".
> Or in a real world example:
> Assume the Graph contains two types of nodes: Words and Texts. Words are related to Texts by the CONTAIN relationship.
> Now find all words that are used together with the words a, b and c and order them by how often they appear in texts, but ignore texts that contain the word d.
> Or more generalized: [...], but ignore thexts that contain the words d,e,f...
> So how could I query a graph, to get the answer that question? (So that it performs better than the current query that I'm using).
> #1 pass a in as parameter (collection of id's) and use that in connection > with ID(x)
How would I do that exactly?
> #2 prepend a little aggregation query and pass the results with WITH to > the second part
> START r=node(1), a = node(3590, 3588, 3591), d=node(3613) MATCH > r--re-[:CONTAINS]-i
> WITH r,d,re,i,collect(a) as nodes > WHERE (ALL(x in nodes WHERE re-[:CONTAINS]-x)) and > (not(re-[:CONTAINS]-d)) return distinct id(i), i.title, count(r) as count > order by count desc limit 10;
I had to modify this a bit because I got: "SyntaxException: Unknown identifier `count`.". So the query currently looks like this:
START r = node(1), a = node(3590, 3588, 3591), d = node(3613) MATCH r--re-[:CONTAINS]-i WITH d, re, i, collect(a) as nodes WHERE ALL(x in nodes WHERE re-[:CONTAINS]-x) AND NOT(re-[:CONTAINS]-d) WITH i, count(re) as count RETURN DISTINCT id(i), i.title, count ORDER BY count DESC LIMIT 10;
And it takes 1.5 Seconds to execute.
The more direct version takes only 218ms (according to neo4j-sh): START r = node(1), a = node(3590), b = node(3588), c = node(3591), d = node(3613) MATCH r--re-[:CONTAINS]-i WHERE ( re-[:CONTAINS]-a AND re-[:CONTAINS]-b AND re-[:CONTAINS]-c ) AND NOT( re-[:CONTAINS]-d ) RETURN DISTINCT ID(i), i.title, count(re) as count ORDER BY count DESC LIMIT 10;
So it seems that I probably should try the first variant, but I don't quite understand what you mean with it.
It is slower b/c for the aggregation in with (or return) cypher has to pull through all results first to aggregate correctly.
So your first solution is probably the best,
the one that I had in mind for #1 would probably not work as you could use the passed in id's to compare to other node id's but not to initialize an identifier for matching.
> #1 pass a in as parameter (collection of id's) and use that in connection with ID(x)
> How would I do that exactly?
> #2 prepend a little aggregation query and pass the results with WITH to the second part
>> START r=node(1), a = node(3590, 3588, 3591), d=node(3613) MATCH r--re-[:CONTAINS]-i
> WITH r,d,re,i,collect(a) as nodes
> WHERE (ALL(x in nodes WHERE re-[:CONTAINS]-x)) and (not(re-[:CONTAINS]-d)) return distinct id(i), i.title, count(r) as count order by count desc limit 10; > I had to modify this a bit because I got: "SyntaxException: Unknown identifier `count`.". So the query currently looks like this:
> START
> r = node(1),
> a = node(3590, 3588, 3591),
> d = node(3613)
> MATCH
> r--re-[:CONTAINS]-i
> WITH
> d, re, i, collect(a) as nodes
> WHERE
> ALL(x in nodes WHERE re-[:CONTAINS]-x) > AND > NOT(re-[:CONTAINS]-d)
> WITH
> i, count(re) as count
> RETURN
> DISTINCT id(i),
> i.title,
> count > ORDER BY
> count DESC
> LIMIT
> 10;
> And it takes 1.5 Seconds to execute.
> The more direct version takes only 218ms (according to neo4j-sh):
> START
> r = node(1),
> a = node(3590),
> b = node(3588),
> c = node(3591),
> d = node(3613)
> MATCH
> r--re-[:CONTAINS]-i
> WHERE
> (
> re-[:CONTAINS]-a
> AND
> re-[:CONTAINS]-b
> AND
> re-[:CONTAINS]-c
> )
> AND
> NOT(
> re-[:CONTAINS]-d
> )
> RETURN
> DISTINCT ID(i),
> i.title,
> count(re) as count
> ORDER BY
> count DESC
> LIMIT
> 10;
> So it seems that I probably should try the first variant, but I don't quite understand what you mean with it.
START
a = node(3590,3588,3591)
WITH
collect(a) as nodes
START
r = node(1), d = node(3613)
MATCH
r--re-[:CONTAINS]-i
WHERE
ALL(x in nodes : re-[:CONTAINS]-x) AND
NOT re-[:CONTAINS]-d
RETURN
DISTINCT ID(i),
i.title,
count(re) as count
ORDER BY
count DESC
LIMIT
10;
Hi, Doing it this way it runs almost as fast as the "unrolled" variant. But as I generalize it such that I can also add any amount of "bad" nodes, it stays at around 500ms instead of getting faster, like it did with the "unrolled" variant. It seems like the additional collect step takes about as much time as the rest.
The following queries return the exactly same results, but the first takes 180ms on average while the second takes 530ms on average.
Unrolled query: START r = node(1), a = node(3590), b = node(3588), c = node(3591), d = node(3613), e = node(3597), f = node(3627) MATCH r--re-[:CONTAINS]-i WHERE ( (re-[:CONTAINS]-a) AND (re-[:CONTAINS]-b) AND (re-[:CONTAINS]-c) ) AND NOT( (re-[:CONTAINS]-d) AND (re-[:CONTAINS]-e) AND (re-[:CONTAINS]-f) ) RETURN DISTINCT ID(i), i.title, count(re) ORDER BY count(re) DESC LIMIT 10;
Collect Query: START a = node(3590,3588,3591), d = node(3613, 3597, 3627) WITH collect(a) as good, collect(d) as bad START r = node(1) MATCH r--re-[:CONTAINS]-i WHERE ALL(x in good WHERE re-[:CONTAINS]-x) AND NONE(x in bad WHERE re-[:CONTAINS]-x) RETURN DISTINCT ID(i), i.title, count(re) ORDER BY count(re) DESC LIMIT 10;
I think in the ideal case there shouldn't be a difference in speed between the two. I still wonder why I can't use the Nodeset from the START section without a collect step. I tried to find the reason within the source, but I just couldn't wrap my head around it.
Anyway thank you all for your help. Looks like I will have to try to understand the query optimizer and see if there is a way to get a query plan out of it. Something like an EXPLAIN statement from postgres would really be a great addition to Cypher.
On Sun, May 27, 2012 at 8:47 PM, Paul Dubs <paul.d...@gmail.com> wrote:
> Hi,
> Doing it this way it runs almost as fast as the "unrolled" variant.
> But as I generalize it such that I can also add any amount of "bad" nodes,
> it stays at around 500ms instead of getting faster, like it did with the
> "unrolled" variant.
> It seems like the additional collect step takes about as much time as the
> rest.
> The following queries return the exactly same results, but the first takes
> 180ms on average while the second takes 530ms on average.
> Unrolled query:
> START
> r = node(1),
> a = node(3590),
> b = node(3588),
> c = node(3591),
> d = node(3613),
> e = node(3597),
> f = node(3627)
> MATCH
> r--re-[:CONTAINS]-i
> WHERE
> (
> (re-[:CONTAINS]-a)
> AND
> (re-[:CONTAINS]-b)
> AND
> (re-[:CONTAINS]-c)
> )
> AND
> NOT(
> (re-[:CONTAINS]-d)
> AND
> (re-[:CONTAINS]-e)
> AND
> (re-[:CONTAINS]-f)
> )
> RETURN
> DISTINCT ID(i),
> i.title,
> count(re)
> ORDER BY
> count(re) DESC
> LIMIT
> 10;
> Collect Query:
> START
> a = node(3590,3588,3591),
> d = node(3613, 3597, 3627)
> WITH
> collect(a) as good, collect(d) as bad
> START
> r = node(1)
> MATCH
> r--re-[:CONTAINS]-i
> WHERE
> ALL(x in good WHERE re-[:CONTAINS]-x)
> AND
> NONE(x in bad WHERE re-[:CONTAINS]-x)
> RETURN
> DISTINCT ID(i),
> i.title,
> count(re)
> ORDER BY
> count(re) DESC
> LIMIT
> 10;
> I think in the ideal case there shouldn't be a difference in speed between
> the two. I still wonder why I can't use the Nodeset from the START section
> without a collect step. I tried to find the reason within the source, but I
> just couldn't wrap my head around it.
> Anyway thank you all for your help. Looks like I will have to try to
> understand the query optimizer and see if there is a way to get a query plan
> out of it. Something like an EXPLAIN statement from postgres would really be
> a great addition to Cypher.
you cannot use the "nodeset" from start, b/c there is no node-set :)
the combination of all start-items will be unrolled into individual executions of the query (i.e. a cross-product of the start-items), so in each of those executions there is exactly ONE node of your "a" start-set.
It is only aggregation that pulls together the results of different executions into aggregated values again.
The only way to pass in collections (right now) is to use parameters. But as I outlined those cannot be used like node identifier literals in match clauses (otherwise they can be used like any normal node).
> Hi,
> Doing it this way it runs almost as fast as the "unrolled" variant.
> But as I generalize it such that I can also add any amount of "bad" nodes, it stays at around 500ms instead of getting faster, like it did with the "unrolled" variant.
> It seems like the additional collect step takes about as much time as the rest.
> The following queries return the exactly same results, but the first takes 180ms on average while the second takes 530ms on average.
> Unrolled query:
> START
> r = node(1),
> a = node(3590),
> b = node(3588),
> c = node(3591),
> d = node(3613),
> e = node(3597),
> f = node(3627)
> MATCH
> r--re-[:CONTAINS]-i
> WHERE
> (
> (re-[:CONTAINS]-a)
> AND
> (re-[:CONTAINS]-b)
> AND
> (re-[:CONTAINS]-c)
> )
> AND
> NOT(
> (re-[:CONTAINS]-d)
> AND
> (re-[:CONTAINS]-e)
> AND
> (re-[:CONTAINS]-f)
> )
> RETURN
> DISTINCT ID(i),
> i.title,
> count(re)
> ORDER BY
> count(re) DESC
> LIMIT
> 10;
> Collect Query:
> START
> a = node(3590,3588,3591),
> d = node(3613, 3597, 3627)
> WITH
> collect(a) as good, collect(d) as bad
> START
> r = node(1)
> MATCH
> r--re-[:CONTAINS]-i
> WHERE
> ALL(x in good WHERE re-[:CONTAINS]-x)
> AND
> NONE(x in bad WHERE re-[:CONTAINS]-x)
> RETURN
> DISTINCT ID(i),
> i.title,
> count(re)
> ORDER BY
> count(re) DESC
> LIMIT
> 10;
> I think in the ideal case there shouldn't be a difference in speed between the two. I still wonder why I can't use the Nodeset from the START section without a collect step. I tried to find the reason within the source, but I just couldn't wrap my head around it.
> Anyway thank you all for your help. Looks like I will have to try to understand the query optimizer and see if there is a way to get a query plan out of it. Something like an EXPLAIN statement from postgres would really be a great addition to Cypher.
After some more testing I have arrived at this query:
START rootNode = node(1) MATCH (rootNode)-->(re) WITH re START good = node(3590,3588,3591), bad = node(3613) WITH re, collect(good) as good, collect(bad) as bad WHERE ALL( g in good WHERE (re)-[:CONTAINS]->(g) ) AND NONE( b in bad WHERE (re)-[:CONTAINS]->(b) ) WITH re MATCH (re)-[:CONTAINS]->(i) RETURN DISTINCT id(i), i.title, count(i) ORDER BY count(i) DESCENDING LIMIT 10;
While it is the fastest that I have managed yet, it is still takes about twice as long as the (quite complex) Postgres Query that works on the same data set and it only gets slower as I add more concurrent clients (on a machine with 12GB Ram and 12 Cores).
Is there any way to speed that query up any further? My attempts with gremlin were all slower. I haven't tried writing a custom traversal method yet as I have trouble understanding how to convert a query such as this to it.
> After some more testing I have arrived at this query:
> START
> rootNode = node(1)
> MATCH
> (rootNode)-->(re)
> WITH
> re
> START
> good = node(3590,3588,3591),
> bad = node(3613)
> WITH
> re, collect(good) as good, collect(bad) as bad
> WHERE
> ALL( g in good WHERE
> (re)-[:CONTAINS]->(g)
> )
> AND
> NONE( b in bad WHERE
> (re)-[:CONTAINS]->(b)
> )
> WITH
> re
> MATCH
> (re)-[:CONTAINS]->(i)
> RETURN
> DISTINCT id(i),
> i.title,
> count(i)
> ORDER BY
> count(i) DESCENDING
> LIMIT
> 10;
> While it is the fastest that I have managed yet, it is still takes about twice as long as the (quite complex) Postgres Query that works on the same data set and it only gets slower as I add more concurrent clients (on a machine with 12GB Ram and 12 Cores).
> Is there any way to speed that query up any further? My attempts with gremlin were all slower. I haven't tried writing a custom traversal method yet as I have trouble understanding how to convert a query such as this to it.