Query optimization question

Sunit Jain

Jul 10, 2015, 8:53:25 AM

to neo4j-...@googlegroups.com

Hi

I'm working on a dataset where

(Protein) - [BELONGS_TO] -> (Cluster)
(Protein) - [CONTAINS] -> (PFam)
(Protein) - [FOUND_IN] -> (Organism)

Here is what I'm trying to do:

Given Organisms with certain properties, find the common PFams.

Here is the query I came up with:

MATCH (o1:ORGANISM{gene_A:"Yes"})<--(p1:PROTEIN)-->(pf:PFAM)<--(p2:PROTEIN)-->(o2:ORGANISM{gene_A:"Yes"})
WHERE o1.name <> o2.name
RETURN DISTINCT pf.name, pf.description, COUNT(DISTINCT o1.name) AS Weight
ORDER BY Weight DESC;

This query regularly takes ~80000ms! Am I doing something wrong? How can I optimize this query? Ideally, I would like to run this query for the common clusters, but I have ~30x more `Clusters` than I have `PFams`!

# Nodes = 164 565

# Properties = 525 025

# Relationships = 389 695

I'm running this on a server with 512GB RAM.

Thanks,

Sunit.

PS: Would you like me to ask questions like these in the larger neo4j group?

Michael Hunger

Jul 10, 2015, 10:34:41 AM

to Sunit Jain, neo4j-...@googlegroups.com

you just forgot to create an index/constraint

create index on :ORGANISM(gene_A);

but it might be that your subselection by gene_A is too broad so you get hundreds of millions of paths.
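A rough way to see where such a path count comes from: every organism<-protein->PFam "half" that reaches a given PFam can pair with every other half reaching the same PFam, so n halves yield on the order of n^2 full paths. The Python sketch below counts those pairings on toy data. The function name `count_full_paths` and the sample triples are made up for illustration, and it assumes each protein is found in exactly one organism.

```python
from collections import Counter


def count_full_paths(half_paths):
    """Count (o1, p1, pf, p2, o2) pairings with o1 != o2.

    half_paths: iterable of (organism, protein, pfam) triples, one per
    organism<-protein->pfam half of the pattern (toy data, not the real graph).
    """
    half_paths = list(half_paths)
    # Number of halves arriving at each PFam.
    per_pfam = Counter(pf for _, _, pf in half_paths)
    # Number of halves arriving at each PFam from one particular organism.
    per_pfam_org = Counter((pf, o) for o, _, pf in half_paths)
    # All ordered pairings of halves that meet at the same PFam ...
    all_pairs = sum(n * n for n in per_pfam.values())
    # ... minus the pairings where both halves start at the same organism.
    same_org = sum(n * n for n in per_pfam_org.values())
    return all_pairs - same_org


toy = [
    ("orgA", "p1", "PF1"),
    ("orgA", "p2", "PF1"),
    ("orgB", "p3", "PF1"),
    ("orgB", "p4", "PF2"),
]
print(count_full_paths(toy))  # 4
```

The quadratic growth is the point: a single PFam reached by 10,000 halves would already contribute close to 10^8 pairings before any aggregation happens.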

How selective is (o1:ORGANISM{gene_A:"Yes"}) ???

if it is very selective, try adding both of these hints to your original query

USING INDEX o1:ORGANISM(gene_A)

USING INDEX o2:ORGANISM(gene_A)

add the correct relationship-types to your pattern

you can also profile your query by prefixing it with PROFILE

you don't need the distinct if you already aggregate

MATCH (o1:ORGANISM{gene_A:"Yes"})<-[:TYPE???]-(p1:PROTEIN)-[:TYPE???]->(pf:PFAM)<-[:TYPE???]-(p2:PROTEIN)-[:TYPE???]->(o2:ORGANISM{gene_A:"Yes"})

WHERE o1.name <> o2.name

RETURN pf.name,pf.description, COUNT(DISTINCT o1.name) as Weight

ORDER BY Weight DESC;

and you can split up your pattern a bit

MATCH (o1:ORGANISM{gene_A:"Yes"})<-[:TYPE???]-(p1:PROTEIN)

WITH distinct o1,p1

MATCH (p1)-[:TYPE???]->(pf:PFAM)

WITH distinct o1, pf

MATCH (pf)<-[:TYPE???]-(p2:PROTEIN)

WITH distinct o1,pf,p2

MATCH (p2)-[:TYPE???]->(o2:ORGANISM{gene_A:"Yes"})
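The idea behind the intermediate `WITH distinct` steps is to collapse duplicate rows before the next expansion multiplies them. As a hedged illustration (a variation on that idea in Python, not the query above), the sketch below compares enumerating every pairing of halves with first reducing the data to distinct (organism, PFam) pairs. The function names and toy triples are hypothetical.

```python
from collections import defaultdict


def weights_naive(half_paths):
    """Pair every half with every other half, as the single long pattern does."""
    halves = list(half_paths)
    orgs = defaultdict(set)
    for o1, _p1, pf1 in halves:
        for o2, _p2, pf2 in halves:
            if pf1 == pf2 and o1 != o2:
                orgs[pf1].add(o1)
    return {pf: len(s) for pf, s in orgs.items()}


def weights_deduped(half_paths):
    """Reduce to distinct (organism, pfam) pairs first, then count per PFam."""
    orgs_by_pfam = defaultdict(set)
    for o, _p, pf in half_paths:
        orgs_by_pfam[pf].add(o)
    # A PFam is "common" only if at least two organisms reach it.
    return {pf: len(s) for pf, s in orgs_by_pfam.items() if len(s) >= 2}


toy = [
    ("orgA", "p1", "PF1"),
    ("orgA", "p2", "PF1"),
    ("orgB", "p3", "PF1"),
    ("orgB", "p4", "PF2"),
]
print(weights_naive(toy))    # {'PF1': 2}
print(weights_deduped(toy))  # {'PF1': 2}
```

Both functions give the same weights on this data, but the second does work proportional to the number of halves rather than to its square.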

Sunit Jain

Jul 13, 2015, 11:37:54 AM

to neo4j-...@googlegroups.com, sun...@umich.edu

Thank you!

> How selective is (o1:ORGANISM{gene_A:"Yes"}) ???

There are a total of 17 ORGANISM nodes, 10 of which have the "gene_A" property as "Yes".

I created an index on gene_A and added the relationship types. I also added an ID comparison so that each pair of organisms is matched only once. These changes alone brought the time down to ~50000ms, but that still seems high. Is there anything else I can do here? Here is my most recent query and the query plan:

PROFILE 
MATCH (o1:ORGANISM{gene_A:"Yes"})<-[:FOUND_ON]-(p1:PROTEIN)-[:CONTAINS]->(pf:PFAM)<-[:CONTAINS]-(p2:PROTEIN)-[:FOUND_ON]->(o2:ORGANISM{gene_A:"Yes"})
USING SCAN o1:ORGANISM
USING SCAN o2:ORGANISM
WHERE o1.name <> o2.name AND ID(o1) < ID(o2)
RETURN pf.name, pf.description, COUNT(DISTINCT o1.name) AS Weight
ORDER BY Weight DESC;
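One side effect of the `ID(o1) < ID(o2)` filter is worth checking: because `Weight` counts distinct `o1` values, keeping only ordered pairs appears to drop the organism with the highest ID from every count. The Python sketch below shows this on made-up organism IDs. The function name `weight` and the IDs are hypothetical, and the sketch models only the pairing logic, not Neo4j itself.

```python
from itertools import permutations


def weight(org_ids, ordered_only):
    """Distinct o1 values over all (o1, o2) pairs of organisms sharing one PFam."""
    pairs = permutations(org_ids, 2)  # every (o1, o2) with o1 != o2
    if ordered_only:
        # Mimics the extra ID(o1) < ID(o2) predicate.
        pairs = ((a, b) for a, b in pairs if a < b)
    return len({a for a, _ in pairs})


shared_by = [3, 7, 12]  # IDs of organisms that share one PFam (toy values)
print(weight(shared_by, ordered_only=False))  # 3
print(weight(shared_by, ordered_only=True))   # 2
```

If the weight is meant to be the number of organisms sharing a PFam, the ordered version under-reports it by one for every PFam, so results from the two query variants are not directly comparable.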