Bioinformatics: Metabolic Pathways in Graph databases

Milton Yutaka Nishiyama Junior

Aug 26, 2014, 4:04:29 PM

to bio4j...@googlegroups.com

Hi everyone,

My name is Milton. I'm starting to work with graph databases and would appreciate some hints.

I work in bioinformatics and would like to integrate KEGG metabolic pathways with microarray data.

Just to introduce my problem:

- The metabolic pathways are graphs (~200), and in my case each node is a gene.

- The microarray data covers multiple experiments, where each experiment is a list of genes (nodes) with their respective signal intensities (values).

What I'm planning to do is integrate the gene intensities (values) from each experiment (microarray) with the nodes in the metabolic pathways, and include those pathways in the graph database.

In the end, if I have 5 experiments, I would have the same metabolic pathway in the graph database once per experiment, with different values for each node (gene).

My final goal is to run queries on each experiment and identify the differences in one pathway between the 5 experiments.
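A minimal sketch of that comparison in plain Python, assuming a pathway is given as gene-gene edges and each experiment as a gene-to-intensity mapping. The gene names, experiment names, and values are made up, and "difference" is taken here to mean the spread (max minus min) of a gene's intensity across experiments, which is only one possible choice:

```python
# Hypothetical toy data: one pathway (as gene-gene edges) and
# per-experiment signal intensities. All names and values are made up.
pathway_edges = [("geneA", "geneB"), ("geneB", "geneC")]

intensities = {
    "exp1": {"geneA": 5.0, "geneB": 2.0, "geneC": 7.5},
    "exp2": {"geneA": 5.5, "geneB": 6.0, "geneC": 7.0},
}


def pathway_genes(edges):
    """Return the sorted list of genes that appear in the pathway."""
    return sorted({gene for edge in edges for gene in edge})


def intensity_spread(edges, data):
    """For each pathway gene, return max - min intensity over experiments."""
    spread = {}
    for gene in pathway_genes(edges):
        values = [exp[gene] for exp in data.values() if gene in exp]
        # A gene missing from every experiment gets None instead of a number.
        spread[gene] = max(values) - min(values) if values else None
    return spread


spread = intensity_spread(pathway_edges, intensities)
for gene, delta in spread.items():
    print(gene, delta)
```

Genes with a large spread are the ones whose signal changes most between experiments within this pathway.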

I would appreciate any hints, and please tell me if I'm crazy to include the same pathway several times, once per experiment, with different node values.

Thanks a lot,

Milton

Andrei Kucharavy

Aug 27, 2014, 4:24:27 PM

to bio4j...@googlegroups.com

Hello Milton.

Your problem is quite common, and several approaches have already been developed for it.

You might want to look into the following tools:

Andrei Kucharavy

Aug 27, 2014, 4:43:08 PM

to bio4j...@googlegroups.com

- PARADIGM by Vaske et al.

- PathScan by Wendl et al.

- TieDie by Paull et al.

- Network-based stratification by Matan Hofree.

Most of these tools use KEGG pathways as one of their possible sources, along with others (KEGG on its own is insufficient most of the time).

If you want to run a deeper analysis on your own data, are willing to use Reactome instead of KEGG, and are comfortable with Python, you could use the neo4j_Importers subroutines from the following package (disclosure: I maintain it): https://github.com/chiffa/PolyPharma/tree/master/PolyPharma

Once you have it, you can either add your own data and process it further yourself, or use the rest of the pipeline: give it the genes that are differentially expressed between experiments as a CSV file, and it will output a GDF file, which is easy to visualize with a tool of your choice.

Hope this helps,

Andrei

On Tuesday, August 26, 2014 3:04:29 PM UTC-5, Milton Yutaka Nishiyama Junior wrote:

Milton Y. Nishiyama Jr.

Aug 27, 2014, 5:25:05 PM

to bio4j...@googlegroups.com

Hi Andrei,

Thank you for the references; some of them are new to me.

I will read the papers carefully, but I'm afraid they focus more on the algorithms and are a bit different from my goal.

I'm focusing more on developing a graph database to store the "processed" and "integrated" information, then on developing queries based on graph theory, and later on a web interface and a kind of declarative query language for building more complex queries, similar to the PathCase platform.

In this first step, I'm working on the graph database and on simple graph queries to interrogate it, and I'm starting to use Neo4j.

Best regards,

Milton

--
You received this message because you are subscribed to the Google Groups "bio4j-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bio4j-user+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--
=================================================
Milton Yutaka Nishiyama Jr.
PhD Student
Bioinformatics Group - IME/IQ - USP
Biochemistry Department, University of Sao Paulo
http://sucest-fun.org/
Brazil
=================================================

Eduardo Pareja Tobes

Aug 28, 2014, 12:37:01 PM

to bio4j...@googlegroups.com

Hi Milton

Right now Bio4j cannot directly help you here; the problem with KEGG is that its license terms are, let's say, not open-source friendly. We plan to integrate NCBI BioSystems, which could certainly be useful for what you want.

Concerning your problem, assuming the number of experiments is known beforehand, a simple model would be

1. store the pathways you are interested in as graphs, with gene nodes

2. create a rel type `ValueOfExperimentXXX` for each experiment `XXX`, with source gene nodes and target a node type representing the values that your experiment yields for a gene (intensity in this case?)

The key idea with all this is: Do not modify or add data to the pathways; if you have something (like an experiment) that produces some sort of output for a pathway, create a rel and a node storing that output.
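A rough sketch of this model, using a tiny in-memory graph in plain Python rather than any real graph database API. The `TinyGraph` class, the `INTERACTS_WITH` rel type, the helper functions, and all gene names and values are hypothetical and only serve to illustrate the two steps; only the `ValueOfExperimentXXX` naming comes from the description above:

```python
import itertools


class TinyGraph:
    """A minimal in-memory property graph (illustration only)."""

    def __init__(self):
        self.nodes = {}   # node id -> dict of properties
        self.rels = []    # (source id, rel type, target id)
        self._ids = itertools.count()

    def add_node(self, **props):
        node_id = next(self._ids)
        self.nodes[node_id] = props
        return node_id

    def add_rel(self, source, rel_type, target):
        self.rels.append((source, rel_type, target))


def load_pathway(graph, edges):
    """Step 1: store the pathway as gene nodes plus gene-gene rels."""
    gene_ids = {}
    for a, b in edges:
        for name in (a, b):
            if name not in gene_ids:
                gene_ids[name] = graph.add_node(kind="Gene", name=name)
        graph.add_rel(gene_ids[a], "INTERACTS_WITH", gene_ids[b])
    return gene_ids


def attach_experiment(graph, gene_ids, experiment, values):
    """Step 2: one rel type per experiment, pointing at new value nodes."""
    rel_type = "ValueOfExperiment" + experiment
    for gene, intensity in values.items():
        # The gene node itself is never modified; the value lives in its own node.
        value_node = graph.add_node(kind="Intensity", value=intensity)
        graph.add_rel(gene_ids[gene], rel_type, value_node)


def values_for(graph, gene_ids, gene):
    """Collect {rel type: intensity} for one gene across experiments."""
    return {
        rel_type: graph.nodes[target]["value"]
        for source, rel_type, target in graph.rels
        if source == gene_ids[gene] and rel_type.startswith("ValueOfExperiment")
    }


graph = TinyGraph()
genes = load_pathway(graph, [("geneA", "geneB"), ("geneB", "geneC")])
attach_experiment(graph, genes, "001", {"geneA": 5.0, "geneB": 2.0})
attach_experiment(graph, genes, "002", {"geneA": 5.5, "geneB": 6.0})
result = values_for(graph, genes, "geneB")
print(result)
```

Because the experiment data hangs off the gene nodes through its own rels, adding or dropping an experiment leaves the pathway structure untouched.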

I don't know which language you want to use for all this, or whether using Neo4j is a requirement (I'd recommend Titan); if you're OK with Scala, we at oh no sequences! have a nice (experimental) technology-independent library: ohnosequences/scarph. If you prefer Java, check ohnosequences/typed-graphs.

HTH

Andrei Kucharavy

Aug 29, 2014, 12:20:28 PM

to bio4j...@googlegroups.com

I agree with Eduardo.

I would also advise you not to create an "experiment" node to group your experimental results, but instead to use an additional field in the "experimental results" nodes to store and index the experiment id.
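A small sketch of the idea in plain Python, where each record stands in for one "experimental results" node and a dictionary stands in for the database index. The field names (`experiment_id`, `intensity`) and the data are assumptions made up for the example:

```python
from collections import defaultdict

# Hypothetical result records: each stands for one "experimental results"
# node that carries its experiment id as a plain field.
results = [
    {"gene": "geneA", "experiment_id": "exp1", "intensity": 5.0},
    {"gene": "geneB", "experiment_id": "exp1", "intensity": 2.0},
    {"gene": "geneA", "experiment_id": "exp2", "intensity": 5.5},
]


def build_index(records, field):
    """Map each value of `field` to the positions of the matching records."""
    index = defaultdict(list)
    for position, record in enumerate(records):
        index[record[field]].append(position)
    return index


by_experiment = build_index(results, "experiment_id")


def results_for(experiment_id):
    """Look up one experiment's results through the index, with no hub node."""
    return [results[i] for i in by_experiment.get(experiment_id, [])]


exp1_genes = [record["gene"] for record in results_for("exp1")]
print(exp1_genes)
```

The lookup goes through the index instead of traversing from a single experiment node, so no node ends up with one relation per result.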

In case you are using Python, https://github.com/espeed/bulbs is a nice ORM for working with data in Titan or Neo4j.

Andrei

Milton Y. Nishiyama Jr.

Sep 2, 2014, 3:53:51 PM

to bio4j...@googlegroups.com

Hi Eduardo and Andrei,

Thank you for the replies. I've added my questions below; sorry if they are naive.

About your suggested model:

1. store the pathways you are interested in as graphs, with gene nodes

Is the idea to store the full raw pathways of interest (with gene nodes) as graphs in the database?

So I would have a pathway to group the gene nodes, and an experiment to group the gene nodes, is that right?

2. create a rel type `ValueOfExperimentXXX` for each experiment `XXX`, with source gene nodes and target a node type representing the values that your experiment yields for a gene (intensity in this case?)

Is the idea to store each experiment XXX (grouping the gene nodes) in the same graph database, with a rel type (ValueOfExperimentXXX) connecting each gene node to its values (intensity)?

And after that, the idea would be to construct queries that integrate each experiment (gene nodes) with the pathways (gene nodes) and compare the experiments/pathways, is that right?

I don't know if it makes sense, or whether I will be able to write the queries later, but this is what I was thinking of inserting into the graph database:

Create a rel type ValueOfExperimentXXX for each experiment XXX, with pathway nodes as source, and a rel type ValueOfPathwayY, with gene nodes as source and, as target, a node type representing the values that the experiment yields for a gene in each pathway (value: intensity).

Andrei, why didn't you recommend creating an experiment node to group the experimental results (pathways, genes) nodes? What are the main consequences of doing that? Are they related to the queries?

I program in Perl and Python, and I can use Titan or Neo4j. I will take a look at the suggested libraries.

Thank you and Best regards,

Milton

Andrei Kucharavy

Sep 2, 2014, 7:16:15 PM

to bio4j...@googlegroups.com, yuta...@iq.usp.br

A node that has lots of relations starting or ending at it will slow down traversals (and, more generally, queries on its relations). It is recommended to avoid nodes with more than a couple of thousand relations.
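One way to spot such nodes before loading a model is to count relations per node. This is a plain-Python sketch over a list of (source, target) pairs; the node names are invented, and the threshold of 3 is chosen only to keep the toy example small (in practice it would be on the order of the couple of thousand mentioned above):

```python
from collections import Counter


def node_degrees(relationships):
    """Count relations starting or ending at each node."""
    degrees = Counter()
    for source, target in relationships:
        degrees[source] += 1
        degrees[target] += 1
    return degrees


def dense_nodes(relationships, threshold):
    """Return the nodes whose relation count exceeds the threshold."""
    degrees = node_degrees(relationships)
    return sorted(node for node, degree in degrees.items() if degree > threshold)


# Toy model with a single experiment node linked to every result node.
rels = [("exp1", "result%d" % i) for i in range(5)]
rels.append(("geneA", "result0"))

hubs = dense_nodes(rels, threshold=3)
print(hubs)
```

In this toy model the experiment node is the only one flagged, since its relation count grows with the number of results.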

Milton Yutaka Nishiyama Junior

Nov 17, 2014, 5:21:33 PM

to bio4j...@googlegroups.com

Hi Eduardo,

I tried to send this message to the group, but I think it didn't go through. Could you approve the following message and give me your opinion?
Actually, I tried to follow part of your ideas.

I'm returning to my graph database model for metabolic pathways and gene expression profiles, and I would like your opinion on the following implementations: whether they make sense, and which is the best model.

Is there any concern about using Neo4j? What is the gain with Titan (Eduardo's suggestion)? I'm using Neo4j and Java.

The main queries will look for neighbouring genes and for the longest, shortest, and all possible paths in a metabolic pathway, with some restrictions, and perform graph pattern matching, besides more basic queries.
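To make these path queries concrete, here is a plain-Python sketch over an adjacency dictionary, not the query API of Neo4j or Titan. The pathway and gene names are made up, edges are treated as directed, and "all possible paths" is interpreted as all simple paths (no repeated gene):

```python
from collections import deque

# Hypothetical directed pathway: gene -> genes it points to.
pathway = {
    "geneA": ["geneB", "geneC"],
    "geneB": ["geneC", "geneD"],
    "geneC": ["geneD"],
    "geneD": [],
}


def shortest_path(graph, start, goal):
    """Breadth-first search: returns one path with the fewest edges, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in graph.get(path[-1], []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


def all_simple_paths(graph, start, goal, path=None):
    """Depth-first enumeration of every path that visits no gene twice."""
    path = (path or []) + [start]
    if start == goal:
        return [path]
    found = []
    for neighbour in graph.get(start, []):
        if neighbour not in path:
            found.extend(all_simple_paths(graph, neighbour, goal, path))
    return found


shortest = shortest_path(pathway, "geneA", "geneD")
every = all_simple_paths(pathway, "geneA", "geneD")
longest = max(every, key=len)
print(shortest, longest, len(every))
```

Enumerating all simple paths can grow exponentially with graph size, so on real pathways it usually needs the kind of restrictions mentioned above, such as a maximum path length.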

I have created three models and would like your opinion and suggestions:

In my first model, I created three types of nodes: Metabolic Pathways (MP), genes (G), and MicroarrayExperiment (E), with relations between Gs and MPs, relations between Gs, and relations between Gs and Es carrying intensity.values.

In my second model, I created two types of nodes: genes (G) and MicroarrayExperiments (E). For each Metabolic Pathway (MP), I create a set of gene nodes with 2 properties (gene_name and pathway_name), with relations between Gs, and finally a relation (carrying intensity.values) from G to E.

My third model is similar to the second, but instead of connecting all genes (each gene node) to one MicroarrayExperiment node, I created, for each gene node, a rel type to each MicroarrayExperiment node, carrying intensity.value.