Storage of numeric values in Neo4j


Benny Kneissl

Sep 2, 2014, 9:32:38 AM
to neo4j-...@googlegroups.com
Hi,

I'm currently thinking about the integration of millions of numeric values, e.g. expression values for genes, in my Neo4j graph database.

The genes (G1, G2, ..., Gm) are represented as nodes in my graph, identified by their Entrez Gene ID (stored in the corresponding property map). 

Now I have a study S, which consists of several hundred experiments (E1, E2, ..., En), which have measured different values for several thousand genes (G1, G2, ..., Gm).


I want to use one numeric index (N1, N2, ..., Nn) for each experiment (E1, E2, ..., En) so that I can query for values within a range for each experiment separately.

This raises some questions:
  1. If I store the measured values ONLY in the index,
    1. how can I extract the value for a particular node from this index?
    2. or do I have to store the value additionally in the node's property map? I think storing relationships between the experiment and the genes is not an option at all.
  2. Is there any possibility to store a String (in my case the Entrez Gene ID) in an index instead of the node, i.e. Index<String> instead of Index<Node>? The reason is that the Entrez Gene ID is fixed, but the node entity might change due to updates of the data (for example, when I have to merge nodes after adding more data). The Index<Node> is invalid afterwards.


Is Neo4j in general suited to handle this kind of data, or should I directly use something else, e.g. a document store?

Thanks for your comments and sharing your experiences regarding the storage of numeric values in Neo4j.

Best,
Benny

Michael Hunger

Sep 2, 2014, 9:41:27 AM
to Benny Kneissl, neo4j-...@googlegroups.com
I'd probably store them on a relationship-property between the experiment and the gene.

On 02.09.2014 at 15:32, Benny Kneissl <benny....@googlemail.com> wrote:

> 1. If I have stored the measured values ONLY in the index
>    1. how can I extract the value for a particular node from this index?

Not in Neo4j. You could use Lucene directly (with something like Luke) to extract them, but we usually don't recommend storing values only in an index.

>    2. or do I have to store the value additionally in the node's property map? I think to store relationships between the experiment and the genes is no option at all.

Yes, that's better. But why do you think relationships are no option at all?

> 2. Is there any possibility to store a String (in my case the Entrez Gene ID) in an Index instead of the Node, i.e. Index<String> instead of Index<Node>? The reason is that the Entrez Gene ID is fixed, but the node entity might change due to updates of the data (for example, when I have to merge nodes after adding more data). The Index<Node> is invalid afterwards.

That's why you use automatic indexes in Neo4j 2.x+, which take care of that themselves.

For legacy indexes, you have to remove the entry from the index manually -> index.remove(node)

> Is Neo4j in general suited to handle these kind of data or should I use directly something else, e.g. a document store?

It rather depends on which use cases you want to run on your data.

--
You received this message because you are subscribed to the Google Groups "neo4j-biotech" group.
To unsubscribe from this group and stop receiving emails from it, send an email to neo4j-biotec...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Martin Preusse

Sep 2, 2014, 9:42:54 AM
to Benny Kneissl, neo4j-...@googlegroups.com
Hi Benny,

if you want to map experiments with an expression value to genes, why is it not an option to create relationships and store the expression value as a property on the relationship?

It's a matter of scale, but if you have hundreds of experiments with ~50,000 genes, you end up with ca. 50 million edges. That's fine, I think, and your query is much simpler.
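To make the relationship-property model concrete, here is a minimal in-memory sketch in plain Python rather than Neo4j. The relationship type name MEASURED, the class and function names, and the sample values are all hypothetical and only serve as an illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measured:
    """Stand-in for one (experiment)-[:MEASURED {value}]->(gene) relationship.

    MEASURED is a hypothetical relationship type, not taken from the thread.
    """
    experiment: str
    gene_id: str    # Entrez Gene ID of the gene node
    value: float    # expression value stored as a relationship property


# Made-up sample data for illustration only.
edges = [
    Measured("E1", "7157", 2.4),
    Measured("E1", "672", -0.8),
    Measured("E2", "7157", 0.3),
]


def genes_in_range(edges, experiment, low, high):
    """Gene IDs whose value in `experiment` lies in [low, high].

    Without an index this checks every relationship of the experiment.
    """
    return [e.gene_id for e in edges
            if e.experiment == experiment and low <= e.value <= high]


print(genes_in_range(edges, "E1", 0.0, 5.0))
```

The query itself is a single filter over the relationships of one experiment, which is the simplicity referred to above. The cost is that every relationship of that experiment is visited unless some index narrows the search.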


Cheers
Martin

Benny Kneissl

Sep 2, 2014, 11:18:10 AM
to neo4j-...@googlegroups.com, benny....@googlemail.com

Hi Michael and Martin,

you both prefer relationships... I should believe it. :-) But maybe I can explain in more detail why I don't like using relationships for this kind of data (although I also did in the beginning).

  1. The additional storage in a relationship is, in my opinion, more or less redundant, since you normally use only the index to filter the nodes by a range of values and will never use any kind of graph pattern / traversal algorithm (at least for the use cases I have regarding this kind of data).
  2. Moreover, as far as I know, range queries (on the basis of a numeric index) are not possible in Cypher due to the Lucene parser. You can only ask for a value to be smaller or greater than a given one, and in this case all relationships are scanned. Or am I wrong?
  3. It's also not about 50 million edges. There are a lot of public studies (and certainly more in-house data) that I want to compare / analyse. I just checked the publicly available TCGA data, which consists of 26 cancer types (= studies) containing altogether 26218 CNV experiments and 21121 miRNA abundance experiments (as well as others). Hence, we are talking about at least 50k (experiments) * 50k (genes) relationships. Do you still think it's appropriate? How much storage space is needed for this number of relationships?
I guess storing this kind of data in a node's properties might be more suitable. But if there is even one mistake in one linked database, such that I have to adapt a node (or at least its Entrez Gene ID), all my indexes will be invalid. It's possible to take care of this programmatically, but not so easy.


Best,
Benny

Martin Preusse

Sep 3, 2014, 6:01:41 AM
to Benny Kneissl, neo4j-...@googlegroups.com
Hi,

50k * 50k is 2.5 billion relationships. I think that's OK, but the downside is that Neo4j only allows storing about 34 billion edges.

As far as I know, this is because Neo4j can only use 34 billion unique node IDs, and this can be (will be?) changed in the future.
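The numbers above can be checked with a short back-of-the-envelope calculation. This is only a sketch: the 34 billion limit is taken to be 2^35 record IDs, and the 34-byte relationship record size is an assumption about the Neo4j 2.x store format. Property values live in a separate store and are not counted here.

```python
EXPERIMENTS = 50_000
GENES = 50_000

ID_LIMIT = 2 ** 35        # about 34.4 billion record IDs
REL_RECORD_BYTES = 34     # ASSUMPTION: fixed relationship record size in Neo4j 2.x

relationships = EXPERIMENTS * GENES               # one edge per (experiment, gene) pair
share_of_limit = relationships / ID_LIMIT         # fraction of the ID space used
store_gb = relationships * REL_RECORD_BYTES / 1e9 # relationship store only, in GB

print(f"{relationships:,} relationships")
print(f"{share_of_limit:.1%} of the ID limit")
print(f"{store_gb:.0f} GB for relationship records alone")
```

Under these assumptions the relationship count fits well within the ID limit, while the raw relationship store alone would be in the range of tens of gigabytes before any property values are added.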

I still don't quite get how you want to store the 50k expression values on the experiment node (or 50k experiments on the gene node).

Martin

Benny Kneissl

Sep 3, 2014, 10:33:30 AM
to neo4j-...@googlegroups.com, benny....@googlemail.com
Hi Martin,

I would store the measured values on the genes. The experiment name is the property key, the property value is the measured one stored numerically. 

Since schema indexes are not allowed for numeric values, I would use a legacy index - one for each experiment. The name of the legacy index equals again the experiment name. 

Hence, I'm able to 
a) get all genes measured per experiment (= get all nodes stored in the corresponding index)
b) get all genes within a range (= use a range query on that index)
c) get the measured value of a gene in an experiment (= lookup in the node's property map)

Do you think something is missing?
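The three access paths (a), (b) and (c) can be mimicked with a small in-memory sketch. This is plain Python, not the Neo4j legacy index API: the class GeneStore, its method names, and the sample values are hypothetical, and a sorted list with binary search stands in for the numeric index of each experiment.

```python
import bisect
from collections import defaultdict


class GeneStore:
    """In-memory stand-in for gene nodes plus one numeric index per experiment."""

    def __init__(self):
        self.props = defaultdict(dict)     # gene_id -> {experiment name: value}
        self.values = defaultdict(list)    # experiment -> values, kept sorted
        self.genes = defaultdict(list)     # experiment -> gene IDs, parallel to values

    def add(self, gene_id, experiment, value):
        # The value is written twice: to the property map and to the index.
        self.props[gene_id][experiment] = value
        pos = bisect.bisect_right(self.values[experiment], value)
        self.values[experiment].insert(pos, value)
        self.genes[experiment].insert(pos, gene_id)

    def genes_measured(self, experiment):
        """(a) all genes stored in the experiment's index."""
        return sorted(self.genes.get(experiment, []))

    def genes_in_range(self, experiment, low, high):
        """(b) range query on the experiment's index, bounds inclusive."""
        vals = self.values.get(experiment, [])
        lo = bisect.bisect_left(vals, low)
        hi = bisect.bisect_right(vals, high)
        return self.genes.get(experiment, [])[lo:hi]

    def value(self, gene_id, experiment):
        """(c) lookup in the gene's property map."""
        return self.props[gene_id][experiment]


store = GeneStore()
store.add("7157", "E1", 2.4)   # made-up sample values
store.add("672", "E1", -0.8)
store.add("1956", "E1", 5.1)
store.add("7157", "E2", 0.3)
print(store.genes_in_range("E1", 0.0, 3.0))
```

Note that every value exists in two places here, so any change to a gene entry has to update both the property map and each affected index, which mirrors the maintenance concern raised earlier in the thread.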


Benny

Craig Taverner

Sep 3, 2014, 11:00:44 AM
to Benny Kneissl, neo4j-...@googlegroups.com
Hi,

Looking back at both Michael's and Martin's suggestions to use relationships, I think it is important to understand first why you are considering a graph database at all. I personally think it's a great match for your use case, but only so if you actually decide to use the graph. Just storing data in properties and legacy indexes is not really using the graph at all.

The discussion so far seems to have been about whether to put the experiment values in relationships (logical structure, but taking more space) or in long lists of properties (somewhat less logical, messy, but less storage). I'd like to suggest something quite different: add more graph structure, something to suit your queries. Since you want to perform range queries on numerical values, and the schema indexes do not support this, how about making your own data model for the range queries? A graph model designed to answer the questions you want to ask?

For example, say you have experimental results, with values spread over thousands of experiments. Let's create nodes for those experiment results and connect them (with relationships) to the experiment node and the gene node involved. This is similar to what Michael and Martin suggested, storing values on relationships, except that I've added an extra node in the middle. This allows you to create an additional tree structure to query those values using range queries.

(gene)<--(result)<--(experiment)
            |
            V
         (range)

The result 'belongs' to a range node. Depending on your needs for fine grained resolution of ranges (domain specific needs), these could all be connected to a root result type node, or be part of a tree of different resolutions. Range trees as data structures for optimized, domain specific queries have been well discussed and used by many. See some recent articles on this by:
The pros and cons of this approach versus indexing relationship properties are (IMHO):
  • Pros:
    • Very fast range queries, optimized to your domain. Each query is a short traversal (easily expressible in Cypher) down a simple tree structure to your data.
    • Visual exploration of the results, since the ranges (or aggregations) are now part of your data model, you can explore your data in many ways.
    • Opportunity for optimization and refinement: as you get used to this approach, you can add structures to suit other queries you might want to perform. For example, adding counters and aggregations to the tree nodes to get instant results over extremely large datasets (data warehouse-like approaches)
  • Cons:
    • We've tripled the number of relationships and added a node for each result. So total storage goes up a lot. The properties are about the same though, so it's mostly the relationship store that will expand a lot.
    • You need to add the nodes to the tree during import of the results. This means you need to be careful to have only one way, or one API, for adding results, so you don't inadvertently add data without adding to the tree. This is a little like the old indexes that you needed to add to yourself, so the cost is similar.
Anyway, I don't know if this idea appeals to you, but I think it is one that really does make use of the graph a lot. Kind of building a domain-specific index structure in the graph for domain specific queries. Worth considering?
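One possible reading of the range-node idea, sketched in plain Python for a single experiment: each result is attached to a range node that covers a fixed-width interval, and a range query visits only the range nodes that overlap the requested interval. The bucket width, the class RangeTree, and the sample values are assumptions for illustration. A real model could use several levels of resolution instead of one flat level.

```python
import math
from collections import defaultdict

BUCKET_WIDTH = 1.0   # hypothetical resolution of one range node


class RangeTree:
    """Single-level stand-in for (result)-->(range) with fixed-width ranges."""

    def __init__(self, width=BUCKET_WIDTH):
        self.width = width
        # range node number -> list of (gene_id, value) result nodes
        self.buckets = defaultdict(list)

    def bucket_of(self, value):
        # Float values map to the range node covering [n * width, (n + 1) * width).
        return math.floor(value / self.width)

    def add_result(self, gene_id, value):
        self.buckets[self.bucket_of(value)].append((gene_id, value))

    def query(self, low, high):
        """Gene IDs with low <= value <= high, visiting only overlapping range nodes."""
        hits = []
        for b in range(self.bucket_of(low), self.bucket_of(high) + 1):
            for gene_id, value in self.buckets.get(b, ()):
                if low <= value <= high:   # exact check inside boundary buckets
                    hits.append(gene_id)
        return hits


tree = RangeTree()
tree.add_result("7157", 2.4)   # made-up sample values
tree.add_result("672", -0.8)
tree.add_result("1956", 5.1)
tree.add_result("2064", 2.9)
print(tree.query(2.0, 3.0))
```

In this sketch the range nodes do not need to match the query bounds exactly: the buckets only narrow the search, and the final comparison on the stored value decides membership. As noted in the cons above, every result has to pass through add_result so that the tree never gets out of sync with the data.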

Regards, Craig




Benny Kneissl

Sep 3, 2014, 11:53:19 AM
to neo4j-...@googlegroups.com, benny....@googlemail.com
Hi Craig,

I've seen this tree structure in a Neo4j workshop, but I thought it would not help me in my use case. My concerns are:

a) the numbers are floats and not integers, hence I can't query exact values
b) I don't know the ranges in advance, so it makes no sense to precompute all possibilities, in particular because of a)

I think this concept will not work then, or am I wrong?

Benny






