Hi
I have a relatively small database, consisting of the following elements, each of which is defined as a Label:
Company
BusRel (business relationship)
Region
Sector
Sectors are a parent/child hierarchy (max. 1 level deep), with a [:has_child] relationship between a parent and children. (approximately 122 nodes total)
Regions are also a parent/child hierarchy, but are present for example as: (:Region {name: 'Africa'})-[:has_child]->(:Region {name: 'South Africa'})-[:has_child]->(:Region {name: 'Gauteng'})-[:has_child]->(:Region {name: 'Pretoria'}), ie. Continent all the way down to City with a :has_child relationship from parent to child. (approximately 50000 nodes total)
BusRel are business relationships. There are matching pairs which are linked to each other (bi-directional links), for example: {:BusRel {name: 'Looking for a distributor'})<-[:looks_for]->(:BusRel {name: 'Looking to distribute'}) (approximately 18 nodes total)
Companies aren't linked to each other directly, but rather via one or more of the Labels above. A company could have multiple BusRels and Sectors, but only one Region (at present). (approximately 10000 nodes total)
I've attached a screenshot from the web admin to illustrate an example, the nodes being:
Blue: Company
Grey: Region
Yellow: BusRel
Orange: Sector
I've written a query to start at a specified company (there is a unique constraint on Company property "id") and retrieve all other Companies which share (1) one or more corresponding business relationship (2) the same region (3) one or more sectors. It looks like this:
match (c:Company {id: 6})-[:operates_in_region]->(regions:Region)<-[:operates_in_region]-(p:Company)
match (c)-[:operates_in_sector]->(sectors:Sector)-[:has_child*0..1]-(:Sector)<-[:operates_in_sector]-(p)
match (c)-[:has_objective]->(:BusRel)-[:looks_for]->(busrels:BusRel)<-[:has_objective]-(p)
This is the product of some query tuning based on some reading and a recent video by Wes and Mark. The problem is that this query returns in about 100 - 120ms, which seems slow to me. (please correct me if I'm wrong, but I've seen examples of far more complex db's and queries returning in under 20ms)
I've tried profiling it, which returns the following:
ColumnFilter(symKeys=["
p.name", " INTERNAL_AGGREGATEd6949509-3054-42f0-99dd-d0389f6b0e7f", " INTERNAL_AGGREGATE7016f1ed-68e5-4260-bbaf-7c513953e457", " INTERNAL_AGGREGATEaf0e7813-ba3e-48b4-a3bf-a2aa25085a4f"], returnItemNames=["
p.name", "collect(distinct(
busrels.name))", "collect(distinct(
sectors.name))", "collect(distinct(
regions.name))"], _rows=4, _db_hits=0)
EagerAggregation(keys=["Cached(
p.name of type Any)"], aggregates=["( INTERNAL_AGGREGATEd6949509-3054-42f0-99dd-d0389f6b0e7f,Distinct(Collect(Property(busrels,name(1))),Property(busrels,name(1))))", "( INTERNAL_AGGREGATE7016f1ed-68e5-4260-bbaf-7c513953e457,Distinct(Collect(Property(sectors,name(1))),Property(sectors,name(1))))", "( INTERNAL_AGGREGATEaf0e7813-ba3e-48b4-a3bf-a2aa25085a4f,Distinct(Collect(Property(regions,name(1))),Property(regions,name(1))))"], _rows=4, _db_hits=48)
Extract(symKeys=["sectors", " UNNAMED170", " UNNAMED178", " UNNAMED215", " UNNAMED150", " UNNAMED110", "regions", " UNNAMED274", "p", " UNNAMED25", "c", " UNNAMED65", " UNNAMED235", "busrels", " UNNAMED243"], exprKeys=["
p.name"], _rows=10, _db_hits=10)
Filter(pred="(((hasLabel( UNNAMED235:BusRel(3)) AND hasLabel( UNNAMED235:BusRel(3))) AND hasLabel(busrels:BusRel(3))) AND hasLabel(busrels:BusRel(3)))", _rows=10, _db_hits=0)
PatternMatch(g="(c)-[' UNNAMED215']-( UNNAMED235),(p)-[' UNNAMED274']-(busrels),( UNNAMED235)-[' UNNAMED243']-(busrels)", _rows=10, _db_hits=0)
Filter(pred="(((hasLabel(sectors:Sector(0)) AND hasLabel( UNNAMED170:Sector(0))) AND hasLabel( UNNAMED170:Sector(0))) AND hasLabel(sectors:Sector(0)))", _rows=130, _db_hits=0)
PatternMatch(g="(c)-[' UNNAMED110']-(sectors),(p)-[' UNNAMED178']-( UNNAMED170),( UNNAMED170)-[' UNNAMED150']-(sectors)", _rows=130, _db_hits=2733)
Filter(pred="hasLabel(p:Company(2))", _rows=584, _db_hits=0)
TraversalMatcher(trail="(c)-[ UNNAMED25:operates_in_region WHERE (hasLabel(NodeIdentifier():Region(1)) AND hasLabel(NodeIdentifier():Region(1))) AND true]->(regions)<-[ UNNAMED65:operates_in_region WHERE hasLabel(NodeIdentifier():Company(2)) AND true]-(p)", _rows=584, _db_hits=586)
There hardly seem to be enough db_hits or rows being processed to warrant a 100ms execution time. I'd be happy retrieving only the first matched region, sector and busrel to optimize the query so that it would stop searching for more matches between two companies and return faster. Ie. I just need one match across the bridging nodes, not all of them, but I'm not sure if that's possible / how to do that.
This is running on my development machine, which is an i5 2500K with 8GB of RAM and the DB is on a fast SSD. I've tried increasing the JVM heap size from the default to 1GB, then 2GB and then 4GB with the same results. Any help / tips would be appreciated.
The second part of the question involves writing a "summary" query, to create a paged list of Companies and total number of matches, using the same criteria. I've come up with this, sans paging:
match (c:Company)-[:operates_in_region]->(:Region)<-[:operates_in_region]-(p:Company)
match (c)-[:operates_in_sector]->(:Sector)-[:has_child*0..1]-(:Sector)<-[:operates_in_sector]-(p)
match (c)-[:has_objective]->(:BusRel)-[:looks_for]->(:BusRel)<-[:has_objective]-(p)
But I suspect there is a serious flaw in the query as this never returns. (or at least it hasn't before hitting the 30s maximum execution time I've set)
How would I go about constructing something like that?
Many thanks in advance.