Neo4j and Gene-Disease Association ontology

Benny Kneissl

Jul 15, 2014, 3:10:56 AM

to neo4j-...@googlegroups.com

Hi all,

I have already integrated different biological data sources into Neo4j. For all biological entities (genes, proteins, etc.), I have created nodes which I now want to connect with nodes representing diseases.

I found DisGeNET; its ontology is given below. I'm wondering what the best way is to connect gene nodes with disease nodes using this ontology, in particular since the relationship types form a hierarchy. Suppose I have a relationship X = (a:Gene)-[r:GeneticVariationAssociation]->(b:Disease). When searching for all BiomarkerAssociations with

MATCH (g:Gene)-[r:BiomarkerAssociation]->(d:Disease) RETURN g,r,d

I would certainly also like to find relationship X. Do I have to create several relationships between g and d, or is there a smarter way to model hierarchies for relationships (which may result in a different query)?

Since I think modelling hierarchies for relationships might be of interest to several projects, I opened this discussion.

Looking forward to seeing your suggestions!

ben.but...@neotechnology.com

Jul 15, 2014, 5:00:41 AM

to neo4j-...@googlegroups.com

Hi Benny

Is this the model that you are working with? http://www.disgenet.org/web/DisGeNET/v2.1/rdf

In a case like this where there is a hierarchy within the associations themselves, the best idea is usually to model them as nodes. This gives much more flexibility and richness to the model.

If you have a node for the association, you could then either give it multiple labels, like this:

(g:Gene)<-[:REFERS_TO]-(a:BiomarkerAssociation:GeneticVariationAssociation)-[:REFERS_TO]->(d:Disease)

Node (a) would have properties and relationships applicable to all BiomarkerAssociations and also those applicable to GeneticVariationAssociations.
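
A minimal sketch of how that could look in practice. The lookup properties `symbol` and `name` and their values are placeholders I made up for illustration, not something from your data:

```cypher
// Hypothetical lookup keys; replace with whatever identifies your nodes.
MATCH (g:Gene {symbol: 'GENE_X'}), (d:Disease {name: 'Disease X'})
CREATE (g)<-[:REFERS_TO]-(a:BiomarkerAssociation:GeneticVariationAssociation)-[:REFERS_TO]->(d);

// Matching on the general label also returns the more specific association.
MATCH (g:Gene)<-[:REFERS_TO]-(a:BiomarkerAssociation)-[:REFERS_TO]->(d:Disease)
RETURN g, a, d;
```

Because the node carries both labels, a query for the general label needs no knowledge of the subtypes beneath it.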

Or you could model the relationship type hierarchy as nodes:

(g:Gene)<-[:REFERS_TO]-(ba:BiomarkerAssociation)-[:REFERS_TO]->(d:Disease)
(ba)-[:IS_A]->(gva:GeneticVariationAssociation)

Here the association properties and relationships are segregated to the appropriate nodes. Obviously this model allows arbitrary richness if you need to model the complete association ontology.
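
As a rough sketch, queries against this second model might look as follows, assuming the association node is linked by :IS_A to the node holding the more specific properties, as in the pattern above:

```cypher
// All biomarker associations, with the specific part where one exists.
MATCH (g:Gene)<-[:REFERS_TO]-(ba:BiomarkerAssociation)-[:REFERS_TO]->(d:Disease)
OPTIONAL MATCH (ba)-[:IS_A]->(gva:GeneticVariationAssociation)
RETURN g, d, ba, gva;

// Only those that are genetic variation associations.
MATCH (g:Gene)<-[:REFERS_TO]-(ba:BiomarkerAssociation)-[:REFERS_TO]->(d:Disease),
      (ba)-[:IS_A]->(gva:GeneticVariationAssociation)
RETURN g, d, ba, gva;
```

In the first query gva is null for associations that have no such specific node.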

There is a clear path here from the very simple (a relationship) to the very complex (a hierarchical association model). It's hard to know where to start. I have two pieces of advice: start from a simple model and expect to refactor it as you find out more about your domain and application; base your modelling on real questions that you want to answer, rather than an abstract overview of the complete domain.

Can you articulate a few such questions here?

-Ben

Benny Kneissl

Jul 15, 2014, 7:59:01 AM

to neo4j-...@googlegroups.com

Hi Ben,

Thanks for your suggestion. I like the idea of representing hierarchies using labels, and I already do this for other biological data, e.g. biological processes (interactions, biochemical reactions, etc.). But the main reason for using this concept there was that in many cases more than two biological entities are involved in one biological process, so it was not possible to use just relationships. For example, I have two educts and one product in one biochemical reaction.

In the case of disease-gene associations, however, I always have exactly one gene and one disease, and a direct relationship seems straightforward for two reasons: first, fewer nodes (and, depending on the level, also fewer relationships) have to be created and, second, the queries to access the data afterwards are shorter for real-world questions like

- Which biomarkers that are related to disease X (=BiomarkerAssociation) occur in the same pathway?

- Which genes with altered expression in disease X (=AlteredExpressedAssociation) encode cell surface proteins?

I just want to clarify whether there are potential problems I do not foresee at the moment when starting with this simple model of using several relationships to represent a hierarchy. If so, I could skip this step and use intermediate nodes directly, saving one refactoring step.

Do you know if there are plans to have different types (labels) for relationships or is this technically not possible at all?

ben.but...@neotechnology.com

Jul 15, 2014, 9:38:07 AM

to neo4j-...@googlegroups.com

On Tuesday, July 15, 2014 12:59:01 PM UTC+1, Benny Kneissl wrote:

> I just want to clarify whether there are potential problems I do not foresee at the moment when starting with this simple model of using several relationships to represent a hierarchy. If so, I could skip this step and use intermediate nodes directly, saving one refactoring step.

There certainly aren't any problems with the example queries that you give, but of course you learn new things about your domain the whole time. In general you will not be able to avoid refactoring your model and you'll end up doing extra work if you try to anticipate everything at the start. I would always start as simple as possible and assume that refactoring will be necessary.

> Do you know if there are plans to have different types (labels) for relationships or is this technically not possible at all?

It's certainly not possible with the current data storage format. I think it's very unlikely that it will ever be supported.
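
If you stay with plain relationships, one workaround is to list the subtypes in the query, since Cypher lets a pattern match any of several relationship types separated by `|`. A minimal sketch, assuming that GeneticVariationAssociation and AlteredExpressedAssociation are the subtypes stored beneath BiomarkerAssociation (the real list has to come from the ontology):

```cypher
// Match the general type together with the subtypes assumed to sit below it.
MATCH (g:Gene)-[r:BiomarkerAssociation|GeneticVariationAssociation|AlteredExpressedAssociation]->(d:Disease)
RETURN g, type(r) AS associationType, d;
```

This keeps a single relationship per gene-disease pair, at the cost of having to maintain the list of subtypes in the application that builds the query.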

-Ben
