As promised, my three-monthly reminder on the issue of densly connected nodes. This time around, however, I'd like to start the discussion again on the method chosen to solve the problem.
The current relationship store has at least the following three issues:
- nodes having many relationships of one type, have slow access to relationships of other types
- fetching the relationships between two nodes, requires a scan of all relationships of one of the nodes
- relations get fragmented over the store
issue 1:
Since all relationships of a node are organized as one linked list, there is no partitioning of relationships by relationship type. This means that all relationships attached to a node need to be scanned to fetch the relationships of a certain type. This is especially troubling when there are many relationships of one type attached to the node, while trying to fetch relationships of another type. It is acceptable that fetching many relationships takes time, it is not that acceptable that the presence of many relationships of one type, causes fetching of relationships of other types to slow down.
issue 2:
Since the id of the other node attached to the relationship is stored in the relationship record, and relationships are organized as two non-partitioned linked lists, it requires a full scan of the relationships of one of the nodes to fetch the relationships between two nodes. This can be troubling when both nodes have many relationships of the same type.
issue 3:
New relationships are added to the store, either by appending a relationship record to the store, or by reusing the space of a deleted relationship. This leads to fragmentation of relationships. Ideally we would like to have similar relationships stored together, so they can be fetched with one disk operation.
About a year-and-a-half ago, I proposed a solution to solve these three issues. At the same time, a new node store layout was prepared to solve issue 1. This solution is still awaiting QA and may end up in some future release of Neo4j.
At the time, I had a discussion with Mattias about the merits of both solutions. Unfortunately that discussion is no longer available, since the issue on GitHub no longer exists.
Serving from memory: Mattias and I agreed that the solution I proposed solved more problems than his, but was the more involved solution, since it also had more impact on the caching mechanism as well as the relationship store. I agreed that a quicker fix was desirable, given the nuisance of issue 1.
Now that we are more than a year further down the road, I no longer feel that argument applies. The quick fix was never applied and it is still uncertain when it will make it into a release. Therefore, I would like to repropose my original solution.
As I said before, the current relationship store, contains two linked lists of unpartitioned relationships. The record layout looks as follows:
private final long firstNode;
private final long secondNode;
private final int type;
private long firstPrevRel
private long firstNextRel
private long secondPrevRel
private long secondNextRel
The field firstNode, contains the id of the node on the outgoing side of the relationship, the field secondNode contains the id of the node on the incoming side of the relationship. For the incoming side the firstPrevRel and firstNextRel fields span up the list of relationships attached to that node, while the fields secondPrevRel and secondNextRel do the same for the node on the outgoing side of the relationship.
Having two linked lists combined in one record, makes it impossible to sort relationships for both the incoming and the outgoing side, so the first step we need to take is break up the relationships in two parts: the list on the incoming side and the list on the outgoing side.
Now that we have the two lists separated, we can sort the relationship lists. To do so, I proprose a Btree structure, storing the following record:
private final long node;
private final bool direction;
private final int type;
private final long otherNode;
private final long relationship;
Let's look the possible scenarios:
- Fetch all relationships of one node, irrespective of direction, type, or the other node attached to the relationships. All those relationships are grouped together the Btree, requiring fewer read operations than the current relationship store layout. (solves issue 3)
- Fetch all relationships of one node in one direction. All those relationships are grouped together and require fewer reads of the store. (solves issue 3)
- Fetch all relationships of one node for a specific relationship type in both directions. Here we need to do two look ups in the relationship store, one for the incoming side and one for the outgoing side. Other than that, all relationships are further grouped together. (solves issue 1 and issue 3).
- Fetch all relationships of one node for in one direction for a specific relationship type. All relationships are grouped together according to this criterion, requiring few reads of the store. (solves issue 1 and issue 3).
- Fetch all relationships between two nodes, irrespective of direction and type. In this situation we need to do two lookups in the store for either side and additional lookups for each relationship type. The number of relationships types used with one particular node is relatively small, so we still have acceptable lookup times. (solves issue 2)
- Fetch all relationship between two node with a given direction. For each relationship type known in that direction for that node, we need to do a lookup in the store. Again the number of relationship types used is likely to be small, so we still have accaptable lookup time. (solves issue 2)
- Fetch all relationships between two nodes with a given relationship type for both directions. We need two lookups in the store, for either side. (solves issue 1, issue 2 and issue 3)
- Fetch all relationships between two nodes with a given relationship type and a given direction. Requires one lookup in the store. (solves issue 1, issue 2 and issue 3)
What remains of the relationship record is the bare minimum, just two ids to both nodes contained in the relationship, and an implicit id (position in the file) for lookup of properties.
Altogether, it feel that this solution is preferable to the one now awaiting QA. The argument that this solution would take more time to make it to a release, in my opinion, no longer applies, because the quick solution already takes more than a year-and-a-half anyway. Given the choice between waiting a long time for a suboptimal solution and waiting long for a more optimal solution, I am personally in favor of the latter.
Kind regard,
Niels Hoogeveen