Query optimization question


Sunit Jain

Jul 10, 2015, 8:53:25 AM
to neo4j-...@googlegroups.com
Hi
I'm working on a dataset where
(Protein) - [BELONGS_TO] -> (Cluster)
(Protein) - [CONTAINS]   -> (PFam)
(Protein) - [FOUND_IN]   -> (Organism)
 
Here is what I'm trying to do: 
Given Organisms with certain properties, find the common PFams.

Here is the query I came up with:
MATCH (o1:ORGANISM{gene_A:"Yes"})<--(p1:PROTEIN)-->(pf:PFAM)<--(p2:PROTEIN)-->(o2:ORGANISM{gene_A:"Yes"})
WHERE o1.name <> o2.name
RETURN DISTINCT pf.name, pf.description, COUNT(DISTINCT o1.name) AS Weight
ORDER BY Weight DESC;

This query regularly takes ~80000ms! Am I doing something wrong? How can I optimize this query? Ideally, I would like to run this query for the common clusters, but I have ~30x more `Clusters` than I have `PFams`!

# Nodes = 164 565
# Properties = 525 025
# Relationships = 389 695 

 I'm running this on a server with 512GB RAM.

Thanks,
Sunit.

PS: Would you like me to ask questions like these in the larger neo4j group?

Michael Hunger

Jul 10, 2015, 10:34:41 AM
to Sunit Jain, neo4j-...@googlegroups.com
You just forgot to create an index/constraint:

create index on :ORGANISM(gene_A);

But your selection by gene_A might be too broad, in which case you get hundreds of millions of paths.
How selective is (o1:ORGANISM{gene_A:"Yes"})?

If it is very selective, try adding both of these hints to your original query:
USING INDEX o1:ORGANISM(gene_A)
USING INDEX o2:ORGANISM(gene_A)
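
For placement, USING hints go directly after the MATCH clause and before WHERE. A sketch of the original query from the first post with both hints added, assuming the index on :ORGANISM(gene_A) has already been created; whether the planner honours two index hints in one MATCH may depend on the Neo4j version:

```cypher
// Original query from the first post, relationship types still untyped.
MATCH (o1:ORGANISM{gene_A:"Yes"})<--(p1:PROTEIN)-->(pf:PFAM)<--(p2:PROTEIN)-->(o2:ORGANISM{gene_A:"Yes"})
// Hints sit between MATCH and WHERE, one per identifier.
USING INDEX o1:ORGANISM(gene_A)
USING INDEX o2:ORGANISM(gene_A)
WHERE o1.name <> o2.name
RETURN pf.name, pf.description, COUNT(DISTINCT o1.name) AS Weight
ORDER BY Weight DESC;
```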

Add the correct relationship types to your pattern.

You can also profile your query by prefixing it with PROFILE.

You don't need the DISTINCT if you already aggregate:

MATCH (o1:ORGANISM{gene_A:"Yes"})<-[:TYPE???]-(p1:PROTEIN)-[:TYPE???]->(pf:PFAM)<-[:TYPE???]-(p2:PROTEIN)-[:TYPE???]->(o2:ORGANISM{gene_A:"Yes"})
WHERE o1.name <> o2.name
RETURN pf.name, pf.description, COUNT(DISTINCT o1.name) AS Weight
ORDER BY Weight DESC;

And you can split up your pattern a bit:

MATCH (o1:ORGANISM{gene_A:"Yes"})<-[:TYPE???]-(p1:PROTEIN)
WITH DISTINCT o1, p1
MATCH (p1)-[:TYPE???]->(pf:PFAM)
WITH DISTINCT o1, pf
MATCH (pf)<-[:TYPE???]-(p2:PROTEIN)
WITH DISTINCT o1, pf, p2
MATCH (p2)-[:TYPE???]->(o2:ORGANISM{gene_A:"Yes"})
WHERE o1.name <> o2.name
RETURN pf.name, pf.description, COUNT(DISTINCT o1.name) AS Weight
ORDER BY Weight DESC;


Sunit Jain

Jul 13, 2015, 11:37:54 AM
to neo4j-...@googlegroups.com, sun...@umich.edu
Thank you!

> How selective is (o1:ORGANISM{gene_A:"Yes"}) ???
There are a total of 17 ORGANISM nodes, 10 of which have the "gene_A" property as "Yes".

I created an index on gene_A and added the relationship types. I also added an ID comparison so that I only get unique pairs of organisms. These changes alone brought the time down to ~50000ms, but that still seems high. Is there anything else I can do here? Here is my most recent query and the query plan:
PROFILE
MATCH (o1:ORGANISM{gene_A:"Yes"})<-[:FOUND_ON]-(p1:PROTEIN)-[:CONTAINS]->(pf:PFAM)<-[:CONTAINS]-(p2:PROTEIN)-[:FOUND_ON]->(o2:ORGANISM{gene_A:"Yes"})
USING SCAN o1:ORGANISM
USING SCAN o2:ORGANISM
WHERE o1.name <> o2.name AND (ID(o1) < ID(o2))
RETURN pf.name, pf.description, COUNT(DISTINCT o1.name) AS Weight
ORDER BY Weight DESC;
An image of the query plan is shown below.



Also, splitting up the pattern as you suggested yields 0 rows.
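
One further option that is not from the thread: since the result only needs, per PFam, the number of distinct gene_A organisms, the self-join through p2 and o2 could probably be avoided by aggregating once and keeping the PFams seen in at least two organisms. A minimal sketch, assuming organism names are unique and reusing the labels and relationship types from the query above:

```cypher
// Walk from each qualifying organism to its PFams once,
// instead of pairing every organism with every other one.
MATCH (o:ORGANISM {gene_A:"Yes"})<-[:FOUND_ON]-(:PROTEIN)-[:CONTAINS]->(pf:PFAM)
// Count distinct organisms per PFam; DISTINCT collapses multiple proteins.
WITH pf, COUNT(DISTINCT o) AS Weight
// Assumption: "common" means present in at least two organisms.
WHERE Weight > 1
RETURN pf.name, pf.description, Weight
ORDER BY Weight DESC;
```

Here Weight counts every qualifying organism that contains the PFam, as in the first version of the query. With the ID(o1) < ID(o2) filter, the organism with the highest ID in each group can never be matched as o1, so COUNT(DISTINCT o1.name) comes out one lower.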