The relationship store

Showing 1-24 of 24 messages
The relationship store Niels Hoogeveen 3/23/13 6:44 AM
As promised, my three-monthly reminder on the issue of densly connected nodes. This time around, however, I'd like to start the discussion again on the method chosen to solve the problem.
 
The current relationship store has at least the following three issues:
  1. nodes having many relationships of one type, have slow access to relationships of other types
  2. fetching the relationships between two nodes, requires a scan of all relationships of one of the nodes
  3. relations get fragmented over the store 
issue 1:
 
Since all relationships of a node are organized as one linked list, there is no partitioning of relationships by relationship type. This means that all relationships attached to a node need to be scanned to fetch the relationships of a certain type. This is especially troubling when there are many relationships of one type attached to the node, while trying to fetch relationships of another type. It is acceptable that fetching many relationships takes time, it is not that acceptable that the presence of many relationships of one type, causes fetching of relationships of other types to slow down.
 
issue 2:
 
Since the id of the other node attached to the relationship is stored in the relationship record, and relationships are organized as two non-partitioned linked lists, it requires a full scan of the relationships of one of the nodes to fetch the relationships between two nodes. This can be troubling when both nodes have many relationships of the same type.
 
issue 3:
 
New relationships are added to the store, either by appending a relationship record to the store, or by reusing the space of a deleted relationship. This leads to fragmentation of relationships. Ideally we would like to have similar relationships stored together, so they can be fetched with one disk operation.
 
About a year-and-a-half ago, I proposed a solution to solve these three issues. At the same time, a new node store layout was prepared to solve issue 1. This solution is still awaiting QA and may end up in some future release of Neo4j.
 
At the time, I had a discussion with Mattias about the merits of both solutions. Unfortunately that discussion is no longer available, since the issue on GitHub no longer exists.
 
Serving from memory: Mattias and I agreed that the solution I proposed solved more problems than his, but was the more involved solution, since it also had more impact on the caching mechanism as well as the relationship store. I agreed that a quicker fix was desirable, given the nuisance of issue 1.
 
Now that we are more than a year further down the road, I no longer feel that argument applies. The quick fix was never applied and it is still uncertain when it will make it into a release. Therefore, I would like to repropose my original solution.
 
As I said before, the current relationship store, contains two linked lists of unpartitioned relationships. The record layout looks as follows:
 
private final long firstNode;
private final long secondNode;
private final int type;
private long firstPrevRel
private long firstNextRel
private long secondPrevRel
private long secondNextRel
 
The field firstNode, contains the id of the node on the outgoing side of the relationship, the field secondNode contains the id of the node on the incoming side of the relationship. For the incoming side the firstPrevRel and firstNextRel fields span up the list of relationships attached to that node, while the fields secondPrevRel and secondNextRel do the same for the node on the outgoing side of the relationship.
 
Having two linked lists combined in one record, makes it impossible to sort relationships for both the incoming and the outgoing side, so the first step we need to take is break up the relationships in two parts: the list on the incoming side and the list on the outgoing side.
 
Now that we have the two lists separated, we can sort the relationship lists. To do so, I proprose a Btree structure, storing the following record:
 
private final long node;
private final bool direction;
private final int type;
private final long otherNode;
private final long relationship;
 
Let's look the possible scenarios:
  • Fetch all relationships of one node, irrespective of direction, type, or the other node attached to the relationships. All those relationships are grouped together the Btree, requiring fewer read operations than the current relationship store layout. (solves issue 3)
  • Fetch all relationships of one node in one direction. All those relationships are grouped together and require fewer reads of the store. (solves issue 3)
  • Fetch all relationships of one node for a specific relationship type in both directions. Here we need to do two look ups in the relationship store, one for the incoming side and one for the outgoing side. Other than that, all relationships are further grouped together. (solves issue 1 and issue 3).
  • Fetch all relationships of one node for in one direction for a specific relationship type. All relationships are grouped together according to this criterion, requiring few reads of the store. (solves issue 1 and issue 3).
  • Fetch all relationships between two nodes, irrespective of direction and type. In this situation we need to do two lookups in the store for either side and additional lookups for each relationship type. The number of relationships types used with one particular node is relatively small, so we still have acceptable lookup times. (solves issue 2)
  • Fetch all relationship between two node with a given direction. For each relationship type known in that direction for that node, we need to do a lookup in the store. Again the number of relationship types used is likely to be small, so we still have accaptable lookup time. (solves issue 2)
  • Fetch all relationships between two nodes with a given relationship type for both directions. We need two lookups in the store, for either side. (solves issue 1, issue 2 and issue 3)
  • Fetch all relationships between two nodes with a given relationship type and a given direction. Requires one lookup in the store. (solves issue 1, issue 2 and issue 3)
What remains of the relationship record is the bare minimum, just two ids to both nodes contained in the relationship, and an implicit id (position in the file) for lookup of properties.
 
Altogether, it feel that this solution is preferable to the one now awaiting QA. The argument that this solution would take more time to make it to a release, in my opinion, no longer applies, because the quick solution already takes more than a year-and-a-half anyway. Given the choice between waiting a long time for a suboptimal solution and waiting long for a more optimal solution, I am personally in favor of the latter.
 
Kind regard,
Niels Hoogeveen
 
 
 
Re: [Neo4j] The relationship store Wes Freeman 3/23/13 9:53 AM
Cool ideas. I think this is clearly superior for fetching (especially from disk), but what about updating? What happens when you add more relationships/types to a node? Do you leave padding for adding new items to the btree, and when that gets filled do you move them to the end of the file (a la mongodb)--or how would that work?

I always thought it would be even better if a select property could be organized in a btree per node, sort of like a clustered index in relational database terms. This way, you could efficiently query on your selected property (in addition to relationship type). Optimizing the case where you have a billion relationships coming out of a node, but you only care about the outbound relationships of a certain type, where the timestamp is in a particular 24 hour range. Currently, there's no good way to do that query--you can build in graph indexes and such to reduce the number of nodes from that node, but it sure would be nice to be able to query it simply.

Wes

 
 
 

--
You received this message because you are subscribed to the Google Groups "Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email to neo4j+un...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Re: [Neo4j] The relationship store Niels Hoogeveen 3/23/13 12:19 PM
Hi Wes,
 
When adding a relationship to the store, the normal Btree operations would take place. So the index block the relationship will have to fit in is located. If the block is not full it will simply be added. If the block is full, a new block will be created and the normal btree rebalancing procedure will be followed. That way, the relationships will always remain in sorted order, grouped in blocks of similar relationships. New blocks will either be appended to the file, or deleted block will be reused. Since one block contains many relationships (with 2k blocks, 128 relationships can be stored in one block), fragmentation of relationships becomes less of an issue.
Re: [Neo4j] The relationship store Wes Freeman 3/23/13 7:27 PM
Ah, so you're saying it should be a btree for the whole graph?--I misunderstood, and thought it was per node or something. I guess if that were the case you wouldn't need to store the first node id every time. Interesting stuff...

More concerns (hopefully you can clear them up):
- why do you need to store direction? is it no longer implied? or is that just part of your data structure and not actually stored?
- doesn't this optimize for relationships in one direction as a btree scan is required for the other direction? or am I misunderstanding how your btree is organized? (should we consider a better multidimensional index that doesn't favor any particular query plan?)

Wes
Re: [Neo4j] The relationship store Wes Freeman 3/23/13 7:43 PM
Btw, here is your old github issue, which I'm reading through now: https://github.com/neo4j/neo4j/issues/198
Re: [Neo4j] The relationship store Niels Hoogeveen 3/24/13 5:58 AM
I see I wasn't completely clear in my proposal. For each relationship, two entries are stored in the Btree, one for the incoming side and one for the outgoing side. This makes relationships are treated equally irrespective of direction, To make a distinction between the incoming side and the outgoing side, the direction needs to be part of the key.
Re: [Neo4j] The relationship store Michael Hunger 3/24/13 8:15 AM
We started a document that covers some of that.

I can share it with you

Sent from mobile device
Re: [Neo4j] The relationship store Niels Hoogeveen 3/24/13 11:45 AM
Please do.
Re: [Neo4j] The relationship store Niels Hoogeveen 3/26/13 6:13 AM
Hi Michael,
 
You said you had a document you wanted to share with me.
 
Niels

On Sunday, March 24, 2013 4:15:24 PM UTC+1, Michael Hunger wrote:
Re: [Neo4j] The relationship store Mattias Persson 3/27/13 1:43 PM
One objection I have is around a b-tree that is for the whole db. What I would like to investigate more is something like a index/b-tree/block like structure per dense node to not introduce any scalability concerns related to a global index.


2013/3/24 Niels Hoogeveen <nielsh...@gmail.com>



--
Mattias Persson, [mat...@neotechnology.com]
Hacker, Neo Technology
www.neotechnology.com
Re: [Neo4j] The relationship store Wes Freeman 3/27/13 2:35 PM
+1 for Mattias's comment. Using a b-tree for the whole graph's relationships might introduce a performance requirement of fitting all relationships in RAM. I like the idea of the property index per node, but that also introduces some concerns, like how to store the mini-trees together on disk; maybe there's a good way of chunking them?

Wes
Re: [Neo4j] The relationship store Jim Webber 3/28/13 2:32 AM
I'd add to this that dense nodes become dense because they're written to a lot.

Any solution for dense nodes should make removing (or minimising) write contention a priority too.

Jim
Re: [Neo4j] The relationship store Mattias Persson 3/28/13 3:30 AM



2013/3/28 Jim Webber <j...@neotechnology.com>

I'd add to this that dense nodes become dense because they're written to a lot.

Any solution for dense nodes should make removing (or minimising) write contention a priority too.

That is true, good point. However that contention doesn't get worse the denser the nodes gets, it's quite constant. The loading/reading part gets worse and worse, hence the higher priority and focus there imho.
Re: [Neo4j] The relationship store Ian Robinson 3/28/13 4:30 AM
I think it's more that the number of writes per second may be higher for a dense node. If you're going to get from 0 to 100,000 relationships on a node in, say, 2 hrs, the contention will be higher than on a node that only attracts 100 new relationships in 2 hrs. Dense nodes have to become dense somehow, and in some scenarios, that can happen in a bursty manner.

ianUK: +44 7974 144424
Skype: iansrobinson
Re: [Neo4j] The relationship store Niels Hoogeveen 3/28/13 5:37 AM
Good point Mattias.
 
It is of course possible to eliminate the first entry of the key and store the absolute file position of the first index block within the node. That way index blocks can be maintained on a per-node basis.Of course this has to happen for both the incoming direction and the outgoing direction, leading to two file-pointers stored in the node and two root index blocks per node in the relationship-index file. 
 
Niels
Re: [Neo4j] The relationship store Niels Hoogeveen 3/28/13 5:44 AM
While we're at it, why not eleminate the direction entry of the key too and have two separate files, one for the incoming sides of the relationship, one for the outgoing side of the relationship, leading to an index key:
 
private final int type;
private final long otherNode;
private final long relationship;
 
Re: [Neo4j] The relationship store Mattias Persson 3/31/13 4:02 PM
Would you mind sketching this up in a nice picture?

Perhaps here? https://docs.google.com/a/neopersistence.com/drawings/d/10cjeetZ26k6hU2fTrAzzmARkUeWOB61YdSszJEuI84c/edit?usp=sharing


2013/3/28 Niels Hoogeveen <nielsh...@gmail.com>
Re: [Neo4j] The relationship store Niels Hoogeveen 4/1/13 5:51 PM
Hi Mattias,
 
I am not the greatest artist in the world, but I tried to sketch the proposed solution in the document you created. Feel free to improve on it.
 
In words:
 
Each node record contains a record-id pointing into a store (file) containing index blocks for outgoing relationships, and it has another record-id doing the same for the incoming relationships.
 
There are two stores of index blocks one for the outgoing relationships, one for the incoming relationships.
 
The record id's stored in the node record, point to the root index block of the Btree for that node. A store contains multiple Btrees (one for each node having relationships in the associated direction).
 
The key of the index is:
  1. relationshiptype-id
  2. node-id (the other node of the relationship)
  3. relationship-id
 
Niels
Re: [Neo4j] The relationship store Mattias Persson 5/2/13 12:30 AM
Hi,

thanks for the drawing.

I don't get the whole picture though... does each index entry in the b-tree point to a relationship record in the existing relationship store, or do you envision a new type of relationship store where relationships live together?


2013/4/2 Niels Hoogeveen <nielsh...@gmail.com>
Re: [Neo4j] The relationship store Niels Hoogeveen 5/7/13 12:33 PM
Hi Mattias,
 
The current relationship record has four fields that become obsolete in this proposal. These fields are: firstPrevRel, firstNextRel, secondPrevRel, secondNextRel. It seems to me that it would be better to start working with a slimmed down relationship file, since we can save 32 byte per relationship entry, a 60% reduction.
 
Over the last couple of weeks, I have been pondering how to efficiently deal with higher-order relationships and have come to realize that the solution I proposed can be used for this purpose.
 
Instead of creating a node that models the higher-arity relationship, it is possible to create a path that has the same function. Now the problem with ordinary paths is that we never know that two relationships that are positioned after one another actually belong together. This can be achieved with the solution I described. A path can be stored by using the same relationship-id for all relationships in the path. The relationship record itself can store the first and last node of the path. Instead of storing the direction in the btree records, it is possible to store a byte, which denotes the position in the path. This way binary relations can be used just like they are used now (except being stored in a btree), using 1 for outgoing and -1 for incoming relationships. Higher arity-relationships can be denoted by using numbers other than 1 or -1.
 
Niels
Re: [Neo4j] The relationship store Niels Hoogeveen 5/13/13 6:03 AM
Hi Mattias,
 
I hope this answered your question. As you can see, the lookup of the relationship record itself is actually only needed if you want to do a lookup by relationship-id. For all other queries a lookup within the index would suffice, since all information of the relationship record is already contained within the index. Only in the case where the relationship-id is all you know about the relation, would it be necessary to access the relationship store itself.
 
Niels
Re: [Neo4j] The relationship store Mattias Persson 5/15/13 1:21 PM
Hi Niels,

sorry I haven't put many thoughts into this recently, but I've read your replies.

So we'd have each dense node having its own little index in the form of a b-(or+)tree for example. And precisely, the relationship records can be stored directly in the index (it'd have to include the pointer to the first property record as well. I like that very much and thought about perhaps trying out MapDB (the lower-level Engine, which works on record level) and see how it scales. Building such an index from scratch would be great, but may be quite time-consuming.

So it becomes a matter of composing the right index key... and actually each dense node can have different key layout to optimise for how it's used... or even several indexes for different use cases.



2013/5/13 Niels Hoogeveen <nielsh...@gmail.com>
Re: [Neo4j] The relationship store Marko Rodriguez 5/15/13 1:29 PM
Hi,

The model you guys are talking about is called vertex-centric indices and you can learn more about it here:


Blueprints supports this via the VertexQuery interface:


Currently, Titan and OrientDB provide native support for it. Be great to see it in Neo4j.

HTH,
Marko.
Re: [Neo4j] The relationship store Niels Hoogeveen 5/18/13 1:19 PM
I am looking forward to seeing your experiments, Mattias. Indeed having good index functionality per node opens many new possibilities.
 
MapDb looks like an interesting product, although seeing that it's a one man's hobby project makes it a bit risky to use, unless of course, Neotech forks the project and helps to grow it into a maintained product.
More topics »