performance with many relationships per node

928 views
Skip to first unread message

Jason W

unread,
Jan 5, 2014, 1:29:25 AM1/5/14
to ne...@googlegroups.com
I've been reading about Neo4j's internals and found it interesting that a node's relationships are stored as a doubly linked list of relationships. This seems to be ideal for traversing a graph with relatively few relationships per node. It does not seem ideal for traversing when a single node can have a large number of relationships. An example use case would be node "A" with 1 million relationships. If I want to determine whether a relationship exists between node "A" and another node "X", Neo4j will have to step through the entire linked list of relationships until it finds the corresponding relationship. This would result in linearly decreasing performance as the number of relationships per node increases. Am I understanding this correctly?

Is there any type of index that can be created to speed up this type of access pattern?



This message contains confidential information and is intended only for the individual named. If you are not the named addressee you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately by e-mail if you have received this e-mail by mistake and delete this e-mail from your system. If you are not the intended recipient you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this information is strictly prohibited.

Stefan Armbruster

unread,
Jan 5, 2014, 7:42:52 AM1/5/14
to ne...@googlegroups.com
Your understanding here is correct.
In your example: Given a node with 10E7 relationships and you want to know if it is connected to node x. If you know that x has significant lesser relationships you can reverse the traversal direction and check if x is connected to a. This will be cheaper as only the smaller linked list of node x needs to be iterated. 
 

Michael Hunger

unread,
Jan 5, 2014, 8:09:34 AM1/5/14
to ne...@googlegroups.com
This will be addressed in Neo4j 2.1 or 2.2 depending on the roadmap

What is your concrete use-case?

There are some ways to mitigate it manually for now depending on your use-cases

Eg introduce a hierarchy or a separate node for the large relationship-set

Or index certain rels that are important to access quickly in a legacy relationsship index





Sent from mobile device

Am 05.01.2014 um 13:42 schrieb Stefan Armbruster <ml...@armbruster-it.de>:

Your understanding here is correct.
In your example: Given a node with 10E7 relationships and you want to know if it is connected to node x. If you know that x has significant lesser relationships you can reverse the traversal direction and check if x is connected to a. This will be cheaper as only the smaller linked list of node x needs to be iterated. 
 

--
You received this message because you are subscribed to the Google Groups "Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email to neo4j+un...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Jason W

unread,
Jan 5, 2014, 5:38:16 PM1/5/14
to ne...@googlegroups.com
Michael,
This is a major performance bottleneck for me as I have nodes that may have 1M relationships of type 1, but only a hundred relationships of type 2. Traversal performance for type 2 will be brought down due to the type 1 relationships. I've read through the archives and it seems like this was being worked on over a year ago, but was never released. This is the main issue preventing me from purchasing an enterprise license, as I cannot commit to production use with this limitation. When is the planned release data on your roadmap?

My concrete use case is measuring cellular interactions in different people. For example, one person may have 1 million active cells, but only several hundred inactive cells. I need to be able to quickly query out the active and inactive cells with some filtering based on properties. Some type of index on the relationship that can help me quickly filter down to the cells I'm interested in. I also need to query out which active/inactive cells are shared between groups of people.

Michael Hunger

unread,
Jan 6, 2014, 6:43:13 AM1/6/14
to ne...@googlegroups.com, Max de Marzi
Jason,

The release is afaik planned either for some time in Q2 2014.

actually the workarounds are not so difficult.

The easiest one is to add a separate node which holds the 1M relationships. So your actual node has just 1+100, the 100 that you mentioned of type 2 and the one to the separate node.

So you can ignore the 1M relationships most (or all of the time) from this direction and only check from the other side.

Wouldn't you filter on the relationship-type if you are only looking for active / inactive ?

Cheers

Michael

Nicholas Stuart

unread,
Jan 6, 2014, 3:22:35 PM1/6/14
to ne...@googlegroups.com
I have suffered from this issue and we have a few workarounds currently that fit our data model. 

I do not recommend indexing as a solution. We attempted to do this, and it lead to GC problems as Lucene was create a lot of trash, so my high performance query was spending a lot of time doing GC. 

Your use case may differ, but for me, I was able to fix this issue by removing those relationships, and putting the node id's that they referenced in to an array on the node. My basic diagram was 

(User) -[HAS]-> (Value) <-[REQUIRES_VALUE]- (Item)

We had a lot of the HAS relationships on the value nodes, but very few of the REQUIRES_VALUE. We were able to change our data model such that out user had a property which was an array of longs that referenced the Value node ID. This was ideal for my use case as we only ever needed to go from User to Item in our search, but if we ever need to go from Item to User, we're SOL.

Jason W

unread,
Jan 7, 2014, 1:11:43 AM1/7/14
to ne...@googlegroups.com, Max de Marzi
Michael,
What do you mean by filter on the relationship type? I was under the impression that all relationships for a node are in the same linked list and there's no fast way to filter for the relationship type without traversing through the whole linked list.

Michael Hunger

unread,
Jan 7, 2014, 2:49:16 AM1/7/14
to ne...@googlegroups.com
I meant something like this:

Jason W

unread,
Jan 7, 2014, 8:54:13 AM1/7/14
to ne...@googlegroups.com
Thanks Michael - I understood your workaround and see that it would work.

However, I wanted to confirm that neo4j does not have native ability to filter quickly by relationship label? I just did some more testing using the "profile" statement, and it looks like it is doing some native filtering by relationship label. Also, can auto indexes or legacy indexes be useful here?

Michael Hunger

unread,
Jan 7, 2014, 9:07:55 AM1/7/14
to ne...@googlegroups.com
Jason,

we want to add a better internal structure for these dense nodes in Neo4j 2.1 then it will be handled internally.

These workarounds are only until then.

Auto-indexes or legacy-indexes of relationships would allow you to quickly access your type1 relationships even if all the other relationships have not yet been loaded from disk.
You would kind of skip the graph navigation for this node then and instead look up the type1 relationships from the index and follow them to their end-nodes.

But whenever you would navigate across your domain node you'd be hit again by the large number of rels. So the separation out to a "dense-meta-node" (I just made that up). is the better general solution.

You can even evolve that solution one step further by adding a tree of type2 relationships starting at the "dense-meta-node" which is filled incrementally and whose subtrees can be accessed by an additional selection criteria (like a property).

Florent Biville

unread,
Apr 24, 2014, 1:45:47 PM4/24/14
to ne...@googlegroups.com, michael...@neopersistence.com
Sorry to reply to this old thread, but I'm actually facing the situation you're describing at the end: having possibly many rels to traverse.
As you mentioned, the fix in 2.1 won't fix this and there is the workaround you describe above or even, as Mark Needham and Cédric Fauvet told me, the possibility of fanning out such dense relationships with Max de Marzi unmanaged extension: https://github.com/maxdemarzi/dense.

Do you know if this issue will still be tackled in 2.2? 
Would you mind sharing some insights (like the approach currently considered by the engineering team)?

This may be a very naive outlook on the matter, but wouldn't it be possible to actually implement the fanning out (as done with the unmanaged extension of Max de Marzi) at the storage level and introduce the notion of paging into core APIs? 
This notion of page, as in the unmanaged extension, is currently taking the shape of a relationship depth variable.
With no specified page (or ranges of page), this would remain backwards-compatible: all relationships would be traversed, else only the ones in the specified page.

With guaranteed backwards-compatibility, people who currently don't need to worry about this still wouldn't need to.
People in need of this wouldn't a workaround anymore.

Again, I'm aware this has huge impacts anyway and I may miss many others. But I'm really curious about the approaches that are considered.
Sorry if that does not make sense at all.
Reply all
Reply to author
Forward
0 new messages