Slow Cypher query in a relatively small data set, and help writing a related query

Ziv Unger

Jan 28, 2014, 6:18:14 AM1/28/14
to ne...@googlegroups.com
Hi

I have a relatively small database, consisting of the following elements, each of which is defined as a Label:

Company
BusRel (business relationship)
Region
Sector

Sectors are a parent/child hierarchy (max. 1 level deep), with a [:has_child] relationship between a parent and children. (approximately 122 nodes total)

Regions are also a parent/child hierarchy, but a deeper one, for example: (:Region {name: 'Africa'})-[:has_child]->(:Region {name: 'South Africa'})-[:has_child]->(:Region {name: 'Gauteng'})-[:has_child]->(:Region {name: 'Pretoria'}), i.e. continent all the way down to city, with a :has_child relationship from parent to child. (approximately 50000 nodes total)

BusRels are business relationships. They come in matching pairs which are linked to each other (bi-directional links), for example: (:BusRel {name: 'Looking for a distributor'})<-[:looks_for]->(:BusRel {name: 'Looking to distribute'})  (approximately 18 nodes total)

Companies aren't linked to each other directly, but rather via one or more of the Labels above. A company could have multiple BusRels and Sectors, but only one Region (at present). (approximately 10000 nodes total)
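
To make the structure concrete, here's a rough sketch of how a small slice of this graph could be created (the company name and property values below are made up purely for illustration):

// Region hierarchy: continent -> country -> province -> city
create (af:Region {name: 'Africa'})-[:has_child]->(za:Region {name: 'South Africa'})-[:has_child]->(gp:Region {name: 'Gauteng'})-[:has_child]->(pta:Region {name: 'Pretoria'})
// Sector hierarchy, max. one level deep
create (it:Sector {name: 'IT'})-[:has_child]->(sw:Sector {name: 'Software'})
// a matched pair of business relationships, linked in both directions
create (d1:BusRel {name: 'Looking for a distributor'})-[:looks_for]->(d2:BusRel {name: 'Looking to distribute'}), (d2)-[:looks_for]->(d1)
// a company hangs off one region, one or more sectors and one or more busrels
create (co:Company {id: 6, name: 'Acme'})-[:operates_in_region]->(pta), (co)-[:operates_in_sector]->(sw), (co)-[:has_objective]->(d1)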

I've attached a screenshot from the web admin to illustrate an example, the nodes being:

Blue: Company
Grey: Region
Yellow: BusRel
Orange: Sector

I've written a query to start at a specified company (there is a unique constraint on Company property "id") and retrieve all other Companies which share (1) one or more corresponding business relationship (2) the same region (3) one or more sectors. It looks like this:

match (c:Company {id: 6})-[:operates_in_region]->(regions:Region)<-[:operates_in_region]-(p:Company)
match (c)-[:operates_in_sector]->(sectors:Sector)-[:has_child*0..1]-(:Sector)<-[:operates_in_sector]-(p)
match (c)-[:has_objective]->(:BusRel)-[:looks_for]->(busrels:BusRel)<-[:has_objective]-(p)
return p.name, collect(distinct(busrels.name)), collect(distinct(sectors.name)), collect(distinct(regions.name)) 

This is the product of some query tuning based on some reading and a recent video by Wes and Mark. The problem is that this query returns in about 100-120ms, which seems slow to me. (Please correct me if I'm wrong, but I've seen examples of far more complex databases and queries returning in under 20ms.)

I've tried profiling it, which returns the following:

ColumnFilter(symKeys=["p.name", "  INTERNAL_AGGREGATEd6949509-3054-42f0-99dd-d0389f6b0e7f", "  INTERNAL_AGGREGATE7016f1ed-68e5-4260-bbaf-7c513953e457", "  INTERNAL_AGGREGATEaf0e7813-ba3e-48b4-a3bf-a2aa25085a4f"], returnItemNames=["p.name", "collect(distinct(busrels.name))", "collect(distinct(sectors.name))", "collect(distinct(regions.name))"], _rows=4, _db_hits=0)
EagerAggregation(keys=["Cached(p.name of type Any)"], aggregates=["(  INTERNAL_AGGREGATEd6949509-3054-42f0-99dd-d0389f6b0e7f,Distinct(Collect(Property(busrels,name(1))),Property(busrels,name(1))))", "(  INTERNAL_AGGREGATE7016f1ed-68e5-4260-bbaf-7c513953e457,Distinct(Collect(Property(sectors,name(1))),Property(sectors,name(1))))", "(  INTERNAL_AGGREGATEaf0e7813-ba3e-48b4-a3bf-a2aa25085a4f,Distinct(Collect(Property(regions,name(1))),Property(regions,name(1))))"], _rows=4, _db_hits=48)
  Extract(symKeys=["sectors", "  UNNAMED170", "  UNNAMED178", "  UNNAMED215", "  UNNAMED150", "  UNNAMED110", "regions", "  UNNAMED274", "p", "  UNNAMED25", "c", "  UNNAMED65", "  UNNAMED235", "busrels", "  UNNAMED243"], exprKeys=["p.name"], _rows=10, _db_hits=10)
    Filter(pred="(((hasLabel(  UNNAMED235:BusRel(3)) AND hasLabel(  UNNAMED235:BusRel(3))) AND hasLabel(busrels:BusRel(3))) AND hasLabel(busrels:BusRel(3)))", _rows=10, _db_hits=0)
      PatternMatch(g="(c)-['  UNNAMED215']-(  UNNAMED235),(p)-['  UNNAMED274']-(busrels),(  UNNAMED235)-['  UNNAMED243']-(busrels)", _rows=10, _db_hits=0)
        Filter(pred="(((hasLabel(sectors:Sector(0)) AND hasLabel(  UNNAMED170:Sector(0))) AND hasLabel(  UNNAMED170:Sector(0))) AND hasLabel(sectors:Sector(0)))", _rows=130, _db_hits=0)
          PatternMatch(g="(c)-['  UNNAMED110']-(sectors),(p)-['  UNNAMED178']-(  UNNAMED170),(  UNNAMED170)-['  UNNAMED150']-(sectors)", _rows=130, _db_hits=2733)
            Filter(pred="hasLabel(p:Company(2))", _rows=584, _db_hits=0)
              TraversalMatcher(trail="(c)-[  UNNAMED25:operates_in_region WHERE (hasLabel(NodeIdentifier():Region(1)) AND hasLabel(NodeIdentifier():Region(1))) AND true]->(regions)<-[  UNNAMED65:operates_in_region WHERE hasLabel(NodeIdentifier():Company(2)) AND true]-(p)", _rows=584, _db_hits=586)

There hardly seem to be enough db_hits or rows being processed to warrant a 100ms execution time. I'd be happy retrieving only the first matched region, sector and busrel to optimize the query, so that it would stop searching for more matches between two companies and return faster. I.e. I just need one match across the bridging nodes, not all of them, but I'm not sure whether that's possible or how best to do it.
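
(The closest I could come up with is testing the connections as pattern predicates in WHERE, so that each one only needs to exist rather than be fully enumerated, but I don't know whether this is correct here or whether it would actually be any faster:)

match (c:Company {id: 6}), (p:Company)
where p <> c
  and (c)-[:operates_in_region]->()<-[:operates_in_region]-(p)
  and (c)-[:operates_in_sector]->()-[:has_child*0..1]-()<-[:operates_in_sector]-(p)
  and (c)-[:has_objective]->()-[:looks_for]->()<-[:has_objective]-(p)
return p.name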

This is running on my development machine, which is an i5 2500K with 8GB of RAM and the DB is on a fast SSD. I've tried increasing the JVM heap size from the default to 1GB, then 2GB and then 4GB with the same results. Any help / tips would be appreciated.

The second part of the question involves writing a "summary" query, to create a paged list of Companies and total number of matches, using the same criteria. I've come up with this, sans paging:

match (c:Company)-[:operates_in_region]->(:Region)<-[:operates_in_region]-(p:Company)
match (c)-[:operates_in_sector]->(:Sector)-[:has_child*0..1]-(:Sector)<-[:operates_in_sector]-(p)
match (c)-[:has_objective]->(:BusRel)-[:looks_for]->(:BusRel)<-[:has_objective]-(p)
return c.name, count(p)

But I suspect there is a serious flaw in the query, as this never returns (or at least it hasn't before hitting the 30s maximum execution time I've set).
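
(The paging itself I was planning to just bolt on with ORDER BY / SKIP / LIMIT once the aggregation part works, roughly like this:)

match (c:Company)-[:operates_in_region]->(:Region)<-[:operates_in_region]-(p:Company)
match (c)-[:operates_in_sector]->(:Sector)-[:has_child*0..1]-(:Sector)<-[:operates_in_sector]-(p)
match (c)-[:has_objective]->(:BusRel)-[:looks_for]->(:BusRel)<-[:has_objective]-(p)
return c.name as name, count(p) as matches
order by name
skip 0 limit 25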

How would I go about constructing something like that?

Many thanks in advance.
debug.png

Michael Hunger

Jan 28, 2014, 7:17:08 AM1/28/14
to ne...@googlegroups.com

I think you already did a really good job.

How did you measure? With the browser? As Wes said in the webinar, that's quite unreliable. So either measure with neo4j-shell or with your own, parameterized code.

What kind of indexes/constraints do you have in your db?
e.g. unique constraint on :Company(id) or :Region(name) ?

You might drop label checks in places where, due to the relationship types, only one label is possible anyway.

Did you try to pull more of the pattern in a single match?

It's also probably sensible to look at the cardinalities in between and try to reduce them if they are getting too high (e.g. with WITH DISTINCT c, p, regions.name, busrels.name).
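
Roughly along these lines, for example (untested against your data, just to show where the DISTINCT would go):

match (c:Company {id: 6})-[:operates_in_region]->(regions:Region)<-[:operates_in_region]-(p:Company)
with distinct c, p, regions.name as region
match (c)-[:has_objective]->(:BusRel)-[:looks_for]->(busrels:BusRel)<-[:has_objective]-(p)
with distinct c, p, region, busrels.name as busrel
match (c)-[:operates_in_sector]->(sectors:Sector)-[:has_child*0..1]-(:Sector)<-[:operates_in_sector]-(p)
return p.name, collect(distinct busrel), collect(distinct sectors.name), collect(distinct region)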

How do individual parts of the pattern perform?

match (c:Company {id: 6})-[:operates_in_region]->(regions:Region)<-[:operates_in_region]-(p:Company),(c)-[:has_objective]->(:BusRel)-[:looks_for]->(busrels:BusRel)<-[:has_objective]-(p)
match (c)-[:operates_in_sector]->(sectors:Sector)-[:has_child*0..1]-(:Sector)<-[:operates_in_sector]-(p)
return p.name, collect(distinct(busrels.name)), collect(distinct(sectors.name)), collect(distinct(regions.name)) 
Any chance to get your db (privately) for testing?


Ziv Unger

Jan 28, 2014, 9:47:04 AM1/28/14
to ne...@googlegroups.com
Thanks for the reply! I have forwarded you a link to the DB in an email.

I measured in browser and using the neo4j-shell. Both give me the same result, namely between 100-120ms once the cache has warmed up.

The DB is an export of the basic structure of a PostgreSQL DB, so at the moment lookups are based on the primary key id of the record in Postgres. As such, all of the nodes have id and name properties, and all of the labels have a unique constraint on id.
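
(i.e. roughly the following, one per label:)

create constraint on (c:Company) assert c.id is unique;
create constraint on (r:Region) assert r.id is unique;
create constraint on (s:Sector) assert s.id is unique;
create constraint on (b:BusRel) assert b.id is unique;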

I did try to put more of the pattern in a single match, but did not notice an improvement.

I tried various combinations of WITH in between the matches to try and filter out some nodes, but avoided DISTINCT as I thought that might slow the query down further. I'll give it a shot now though.

The individual matches all run around the same, 100-120ms.

Thanks again!

Ziv Unger

Jan 28, 2014, 9:55:35 AM1/28/14
to ne...@googlegroups.com
Completely missed your last line with the modified query!

I just ran that in a console and it returns around 33ms!

I tried having all 3 matches comma-separated in one statement, which raised the time to over 700ms, and I also had everything as a continuation of the Cypher pattern from the first (p:Company), but never thought to just have two. I was working on the theory from the webinar in which you said that multiple patterns should be separated out into separate MATCH statements. How does one know what will make a difference like this, as in my mind I'd expect the results to be similar?



Ziv Unger

Jan 28, 2014, 10:12:04 AM1/28/14
to ne...@googlegroups.com
Sorry for the email spam, but based on your suggestion I stripped the label checks from the path elements that can't be any other label, and distilled the query down to this:

match (c:Company {id: 6})-[:operates_in_region]->(regions)<-[:operates_in_region]-(p:Company),(c)-[:has_objective]->()-[:looks_for]->(busrels)<-[:has_objective]-(p)
match (c)-[:operates_in_sector]->(sectors)-[:has_child*0..1]-()<-[:operates_in_sector]-(p)
return p.name, collect(distinct(busrels.name)), collect(distinct(sectors.name)), collect(distinct(regions.name));

Which, incredibly, returns in 22ms! That's pretty damn impressive. I've implemented the same changes in the summary query, but still can't get it to return in a decent amount of time. I'll keep trying on my side; any more help in that regard is much appreciated.

Thanks again!

Michael Hunger

Jan 28, 2014, 8:10:15 PM1/28/14
to ne...@googlegroups.com
Neat.

The idea is to limit the # of rows processed at a time as much as possible.

In your case the subgraph at the beginning is quite limiting, and only then do you go further and match the rest (which expands again).

I got your global query down to 30s by reordering the matches.

Starting from the BusRels (fewest entries; it's not a big difference though, you could also start with c:Company), this order of matches proved most effective at pruning down the # of rows processed in between.

match (c)-[:has_objective]->(:BusRel)-[:looks_for]->(:BusRel)<-[:has_objective]-(p)
with distinct c,p
match (c)-[:operates_in_sector]->()-[:has_child*0..1]-()<-[:operates_in_sector]-(p)
with distinct c,p
match (c)-[:operates_in_region]->()<-[:operates_in_region]-(p)
with distinct c,p
with c.name as name, count(p) as cnt
return count(*)
;
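
(I only returned the total count there to time it; for the actual paged list, the tail would presumably become something like this, with SKIP/LIMIT doing the paging:)

match (c)-[:has_objective]->(:BusRel)-[:looks_for]->(:BusRel)<-[:has_objective]-(p)
with distinct c,p
match (c)-[:operates_in_sector]->()-[:has_child*0..1]-()<-[:operates_in_sector]-(p)
with distinct c,p
match (c)-[:operates_in_region]->()<-[:operates_in_region]-(p)
with distinct c,p
return c.name as name, count(p) as cnt
order by name
skip 0 limit 25
;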

Hope it helps.

Michael

Ziv Unger

Jan 28, 2014, 11:47:25 PM1/28/14
to ne...@googlegroups.com
Many thanks for your help, it has been invaluable!

Ziv

