Complex nested queries in Cypher, with subqueries and unions

Marco Brandizi

Oct 29, 2017, 12:07:40 PM

to Neo4j

Hi all,

I've already asked this on StackOverflow; I'll try again here, with better phrasing and a different perspective.

As part of some work to evaluate the expressivity of Cypher, I've loaded a graph of the following kind into Neo4j (see the attachment for a more complete and more readable version):

:Protein ---> :is_a ---------------> :Enzyme ---> :activated_by|inhibited_by ---> :Compound
         \<-- :activated_by <---/ 

:Compound --> :consumed_by|:produced_by ---> :Transport

:Transport --> :catalyzed_by :Enzyme
:Transport --> :part_of ---> :Pathway

(Yes, it's biology, and yes, it's from BioPAX.) I want to write a query that returns (Protein, Pathway) pairs by going through all the possible paths/subgraphs. Moreover, I would like to write it in a compact form, as in SQL or SPARQL. In informal terms:

select distinct protein, pathway {  
   ( 
     protein  is_a -> enzyme 
     union protein <- activated_by enzyme
   )
   // I expect matching enzymes to be joined downstream
   (
     ( enzyme activated_by|inhibited_by -> compound - consumed_by | produced_by -> transport )
     union ( transport <- catalyzed_by enzyme )
   ) // I expect tuples of protein/enzyme/transport, where transport is obtained from the two sub-branches
   transport - part_of -> pathway
}

Is it possible to do this in Cypher while being explicit about the paths? I already know that I could simplify it a lot by writing something like ( protein:Protein ) -- ( pathway:Pathway ), but suppose you need to select exactly those paths and not others that might be there. The general question is: is it possible to nest subqueries of any complexity? And in particular, is it possible to write a union-based query that contains further union-based subqueries? Is there a simple way to do so?

I tried to follow the approach suggested by the documentation and by the answers to the StackOverflow question mentioned above, using WITH + COLLECT() + UNWIND. However, I get either a syntax error or a query plan where I can see that things are not joined as I expect. For instance, this query:

PROFILE MATCH (prot:Protein) - [:is_a] -> (enz:Enzyme)
WITH prot, COLLECT ( enz ) AS enzs
UNWIND enzs AS enz
MATCH (enz) - [:ac_by|:in_by] -> (comp:Comp)
WITH prot, enzs, COLLECT (comp) AS comps
UNWIND enzs AS enz
MATCH (enz) <- [:ca_by] - (tns:Transport)
WITH prot, comps, COLLECT (tns) AS tnss
UNWIND comps AS comp
MATCH (tns:Transport) <- [:cs_by|:pd_by] - (comp)
WITH prot, tnss + COLLECT ( tns ) AS tnss
UNWIND tnss AS tns
MATCH (tns) - [:part_of] -> (path:Path)
RETURN prot, path LIMIT 25

shows me a linear plan where, at some point, there are 0 resulting rows (and I'm very sure that's wrong, since single-path queries between protein and pathway return a few nodes). Even if it worked, frankly, I don't find it the easiest way to express this graph pattern: the nesting approach shown above, which other languages support (I've already tried the same with SPARQL), is much easier to write and read.

Thanks in advance for any help.

ara_knet_pattern.png

Kamal Murthy

Nov 2, 2017, 12:10:37 AM

to Neo4j

Hi,

Here are my suggestions:

I created the database with this script:

CREATE (prot:Protein {name:"P1"})
CREATE (enz:Enzyme {name:"E1"})
CREATE (comp:Comp {name:"C1"})
CREATE (tns:Transport {name:"T1"})
CREATE (path:Path {name:"PT1"})
CREATE (prot)-[:is_a]->(enz)
CREATE (enz)-[:ac_by {actvby:"compound"}]->(comp)
CREATE (enz)-[:in_by]->(comp)
CREATE (enz)-[:ac_by {actvby:"enzyme"}]->(prot)
CREATE (comp)-[:cs_by]->(tns)
CREATE (comp)-[:pd_by]->(tns)
CREATE (tns)-[:ca_by]->(enz)
CREATE (tns)-[:part_of]->(path);

1. Using COLLECT on Protein vs Enzyme:

// COLLECT on Enzyme:
MATCH (prot:Protein) - [:is_a] -> (enz:Enzyme)
WITH prot, COLLECT ( enz ) AS enzs
UNWIND enzs AS enz
MATCH (p2)-[]->(enz) - [:ac_by|:in_by] -> (comp:Comp)
RETURN p2, enz, comp;

// COLLECT on Protein:
MATCH (prot:Protein) - [:is_a] -> (enz:Enzyme)
WITH COLLECT(prot) as pn, enz
UNWIND pn AS p2
MATCH (p2)-[]->(enz) - [:ac_by|:in_by] -> (comp:Comp)
RETURN p2, enz, comp;

COLLECT on Protein corresponds to the first union in your select statement.
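As a hedged sketch of how both branches of that first union could be spelled out explicitly: OPTIONAL MATCH keeps a protein even when one branch finds nothing, and collect() skips nulls, so the two lists can simply be concatenated. This assumes the abbreviated labels and relationship types from the CREATE script; the names e1, e2 and enzs are only illustrative.

```cypher
// Branch 1: protein is_a enzyme; branch 2: enzyme activated_by protein.
MATCH (prot:Protein)
OPTIONAL MATCH (prot)-[:is_a]->(e1:Enzyme)
OPTIONAL MATCH (prot)<-[:ac_by]-(e2:Enzyme)
// collect() ignores nulls, so a branch with no match contributes an empty list.
WITH prot, collect(DISTINCT e1) + collect(DISTINCT e2) AS enzs
UNWIND enzs AS enz
// DISTINCT removes enzymes reached through both branches.
RETURN DISTINCT prot, enz;
```

On the sample data both branches reach E1 from P1, so this should return the single pair P1, E1. Note that two consecutive OPTIONAL MATCH clauses multiply rows per protein before the collect, which may matter on a large graph.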

The final query, with only one COLLECT/UNWIND and two further WITH clauses:

MATCH (prot:Protein) - [:is_a] -> (enz:Enzyme)
WITH COLLECT(prot) as pn, enz
UNWIND pn AS p2
MATCH (p2)-[]->(enz) - [:ac_by|:in_by] -> (comp:Comp)
WITH p2, enz, comp
MATCH (tns:Transport) <- [:cs_by|:pd_by] - (comp)
WITH p2, enz, comp, tns
MATCH (tns) - [:part_of] -> (path:Path)
RETURN p2, enz, comp, tns, path;

In this example (with one protein), the final result can also be obtained with this query:

MATCH (prot:Protein) - [:is_a] -> (enz:Enzyme)
WITH COLLECT(prot) as pn, enz
UNWIND pn AS p2
MATCH (p2)-[]->(enz) - [:ac_by|:in_by] -> (comp:Comp)-[:cs_by|:pd_by]->(tns:Transport)-[:part_of] -> (path:Path)
RETURN p2, enz, comp, tns, path;
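If every path has to be listed explicitly, another option is a top-level UNION with one MATCH per combination of branches. UNION combines complete queries, each returning the same column names, and removes duplicate rows. The following is only a sketch: it assumes the abbreviated labels and relationship types from the CREATE script, and that all four combinations of the two unions in the original pseudo-query are wanted.

```cypher
// is_a branch, then via compound
MATCH (prot:Protein)-[:is_a]->(:Enzyme)-[:ac_by|in_by]->(:Comp)-[:cs_by|pd_by]->(:Transport)-[:part_of]->(path:Path)
RETURN prot, path
UNION
// is_a branch, then via catalyzed_by
MATCH (prot:Protein)-[:is_a]->(:Enzyme)<-[:ca_by]-(:Transport)-[:part_of]->(path:Path)
RETURN prot, path
UNION
// activated_by branch, then via compound
MATCH (prot:Protein)<-[:ac_by]-(:Enzyme)-[:ac_by|in_by]->(:Comp)-[:cs_by|pd_by]->(:Transport)-[:part_of]->(path:Path)
RETURN prot, path
UNION
// activated_by branch, then via catalyzed_by
MATCH (prot:Protein)<-[:ac_by]-(:Enzyme)<-[:ca_by]-(:Transport)-[:part_of]->(path:Path)
RETURN prot, path;
```

On the sample data every branch leads from P1 to PT1, so the result should be that single pair. The drawback is that the number of MATCH parts grows with the product of the alternatives at each step, which is exactly the verbosity the nested form in the question tries to avoid.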

I used this COLLECT + UNWIND + WITH pattern for recommending restaurants based on user preferences (https://neo4j.com/graphgist/800a57b2-bbd1-40d3-9dee-a00c4ef624e6).