Many super-nodes in the real world


sergei fed

Feb 17, 2012, 9:50:59 AM
to Neo4j
Hi Everyone,


The "super-node" or "densely connected node" (whatever we call it :))
issue is a major issue for Neo4J if we want to use it in a real social
website and a big loaded website.

There are several cases where one node is connected to hundreds or
thousands of other nodes, especially when we set a relation between
entities (rather than what we could call a "social link"), for example:
a movie posted by a member belongs to a category.

There is a super-node support branch in the git repos. Great. But I can't
find much documentation about its release date or what exactly this
feature will resolve. Maybe you have answers?

Moreover, could you tell us what the current problems are due to
super-nodes not being well handled by Neo4j? What can we do, and what can't we do?

I am working on a big project and, unfortunately, this kind of problem
is a big issue for me. I guess we could do some kind of manual
partitioning, but that doesn't seem like a good solution.

I should also say that not having a release date or a clear scope for
the super-node support (in git) is a problem for making decisions in the
short term.


Thanks for any replies



Peter Neubauer

Feb 17, 2012, 10:00:06 AM
to ne...@googlegroups.com
Sergei,
the supernode support is in lab status and delivers a number of
improvements on your issue. The main reason it has not entered master
yet is the change of storage format required, which means thorough
QA and migration work. So, until we find time to do that, it's hard to
commit to timelines. It currently looks like Neo4j 1.8 will be a
reasonable target for it. Have you checked the branch out to see if it
fits your bill? There is a discussion regarding this at
https://github.com/neo4j/community/issues/144 that you can participate in.

Cheers,

/peter neubauer

G:  neubauer.peter
S:  peter.neubauer
P:  +46 704 106975
L:   http://www.linkedin.com/in/neubauer
T:   @peterneubauer

Neo4j 1.6 released                 - dzone.com/6S4K
The Neo4j Heroku Challenge   - http://neo4j-challenge.herokuapp.com/

sergei

Feb 17, 2012, 10:05:50 AM
to ne...@googlegroups.com
Thanks Peter.
Could you give an (approximate) date for 1.8?


2012/2/17 Peter Neubauer <peter.n...@neotechnology.com>

Peter Neubauer

Feb 17, 2012, 10:07:45 AM
to ne...@googlegroups.com
Yes,
we are trying to release every third month, so this will be around
midsummer. Don't tweet me on this though!

Cheers,

/peter neubauer


sergei

Feb 17, 2012, 10:24:35 AM
to ne...@googlegroups.com
Thanks again. Last questions :)

What are the current limits of Neo4j regarding super-nodes?

How many relationships do you think a node can have before we experience problems?

The problem isn't continuous, is it? Is there a number of relationships above which it becomes slow?



Mattias Persson

Feb 17, 2012, 10:32:15 AM
to ne...@googlegroups.com
So, a little overview of the changes it makes: https://docs.google.com/a/neopersistence.com/drawings/d/1wjiBn2N3d0zRMP788yR-YNF58QNqBTzPyDt-0H_hHJE/edit (please let me know if you cannot access it; I couldn't figure out how to make a Google doc public).

It doesn't solve everything regarding these densely connected nodes. It adds Node#getRelationshipCount(type/direction) as well as efficient loading of relationships per type/direction, so that only the specific relationships requested are loaded, instead of all of them. A limitation is that it's still not efficient to check whether two nodes are connected to each other if they both have many relationships of the same type and direction between them. If only one of the two has many, then it's fast, because you can check from the one with the lower count, and the count is cheap to get. The store format changes will require slightly more disk space, although not significantly more.
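
For illustration, a minimal sketch of how this could look with the embedded Java API (the relationship type and property names are made up, and getRelationshipCount is the method from the branch, so treat it as tentative):

import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.RelationshipType;

public class DenseNodeSketch {
    // Hypothetical relationship types for the movie/category example.
    enum RelTypes implements RelationshipType { POSTED_BY, BELONGS_TO }

    static void printCategories(Node movie) {
        // From the branch: a cheap per-type/direction count that doesn't
        // load the node's whole relationship chain.
        // long n = movie.getRelationshipCount(RelTypes.BELONGS_TO, Direction.OUTGOING);

        // Standard 1.x call; with the branch, only BELONGS_TO/OUTGOING
        // relationships are loaded here, not every relationship on the node.
        for (Relationship rel : movie.getRelationships(RelTypes.BELONGS_TO, Direction.OUTGOING)) {
            System.out.println(rel.getEndNode().getProperty("name"));
        }
    }
}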

What are the main use cases you'd like it to address? Be as specific as possible.

Best,
Mattias

2012/2/17 Peter Neubauer <peter.n...@neotechnology.com>



--
Mattias Persson, [mat...@neotechnology.com]
Hacker, Neo Technology
www.neotechnology.com

sergei

Feb 17, 2012, 11:55:57 AM
to ne...@googlegroups.com
Thanks Mattias,


In fact, I need to know at roughly how many relationships we start to have performance issues with 1.6.

Another real case where I don't know whether problems can arise:
- I have a list of 100k nodes of type A (fetched by index, or one after another)
- these nodes of type A can have 2-3k links to other nodes of type B
- these nodes of type B can have some links to nodes of type C
- a node C can have a property: city
I would like to traverse all nodes A which have "connections" to the node C.london via a node B (a->b->c).
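
For concreteness, a naive version of this query in the embedded Java API might look like the sketch below (relationship type and property names are placeholders):

import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.RelationshipType;

public class ForwardFilterSketch {
    enum RelTypes implements RelationshipType { A_TO_B, B_TO_C }

    // Keep an A node if any path a -> b -> c ends in a C whose city is
    // "london". With 100k A nodes at 2-3k links each, this touches on the
    // order of hundreds of millions of relationships.
    static boolean connectedToLondon(Node a) {
        for (Relationship ab : a.getRelationships(RelTypes.A_TO_B, Direction.OUTGOING)) {
            Node b = ab.getEndNode();
            for (Relationship bc : b.getRelationships(RelTypes.B_TO_C, Direction.OUTGOING)) {
                if ("london".equals(bc.getEndNode().getProperty("city", null))) {
                    return true;
                }
            }
        }
        return false;
    }
}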

I understand that 1.8 resolves this, but I would like to understand how this "query" performs on 1.6.

Last question: I would like to know how this limit is handled in Neo4j. I mean, is it a continuous problem, or is there a threshold, a limit above which we have problems? I often see 100k cited as a limit; does that mean we have no issues at 99k?




2012/2/17 Mattias Persson <mat...@neotechnology.com>

Mattias Persson

Feb 17, 2012, 3:28:35 PM
to ne...@googlegroups.com
The solution in that branch focuses on solving the problem where "the other relationships which I don't care about right now" affect the traversal performance on nodes which have a lot of relationships. But it doesn't make the case where you need to traverse, say, 100 million relationships any faster.

Now, if you'd like to traverse from A (which are very many) --> B (which are many) --> C (which are few) and you can prune some branches along the way, it's probably much more efficient to start from C and walk to B and then A instead.
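
A rough sketch of that reversed traversal with the 1.x traversal framework (type names are hypothetical):

import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.traversal.Evaluators;
import org.neo4j.graphdb.traversal.TraversalDescription;
import org.neo4j.kernel.Traversal;

public class ReverseTraversalSketch {
    enum RelTypes implements RelationshipType { A_TO_B, B_TO_C }

    // Start from the few C nodes (e.g. the one with city = "london") and
    // walk the relationships backwards, C <- B <- A, returning the nodes
    // found at depth 2 (the A nodes).
    static Iterable<Node> aNodesConnectedTo(Node london) {
        TraversalDescription td = Traversal.description()
                .relationships(RelTypes.B_TO_C, Direction.INCOMING)
                .relationships(RelTypes.A_TO_B, Direction.INCOMING)
                .evaluator(Evaluators.atDepth(2));
        return td.traverse(london).nodes();
    }
}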

About the 100k "threshold": I don't recognize that number from anywhere, and there's no such limit in Neo4j. Where have you read about that?

2012/2/17 sergei <s.fed...@gmail.com>

sergei

Feb 17, 2012, 3:49:00 PM
to ne...@googlegroups.com
Unfortunately, node C can have "very many" relationships too, and B can't be a starting node in my case.

To put it simply, I want to traverse a lot of A nodes and filter them on a property of node C (reached via B), and the C node has a lot of connections, more than an A node. The difference is that I must traverse a lot of A nodes, so for each node of type A (which are many), I must traverse many connections.

Maybe I must "denormalize" some properties and thus do more work at update time (as we would proceed in other database types), but before doing that, I would like to know the real limits of Neo4j well. Denormalization needs to stay consistent and adds delete/update work, which can be a heavy and inelegant process...
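
Something like this sketch, where the city is copied onto each A node at write time and has to be maintained on every update (all names hypothetical):

import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;

public class DenormalizeSketch {
    enum RelTypes implements RelationshipType { A_TO_B, B_TO_C }

    // When C's city changes, push the new value onto every A reachable via
    // C <- B <- A, so reads can filter on a local "cityOfC" property instead
    // of doing a two-hop traversal per A node.
    static void updateCity(GraphDatabaseService db, Node c, String city) {
        Transaction tx = db.beginTx();
        try {
            c.setProperty("city", city);
            for (Relationship bc : c.getRelationships(RelTypes.B_TO_C, Direction.INCOMING)) {
                for (Relationship ab : bc.getStartNode().getRelationships(RelTypes.A_TO_B, Direction.INCOMING)) {
                    ab.getStartNode().setProperty("cityOfC", city);
                }
            }
            tx.success();
        } finally {
            tx.finish();
        }
    }
}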

Mattias Persson

Mar 5, 2012, 4:11:24 AM
to ne...@googlegroups.com


2012/2/17 sergei <s.fed...@gmail.com>


Maybe you could make use of a relationship index for the A-->B relationships? (A sketch follows below.)
About the limits: hmm, OK, that's not a limit in Neo4j; I'm guessing it's merely some notion of how many relationships define a pretty heavily connected node.
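
A minimal sketch of that workaround with the 1.x index API (index and key names are made up; the writes must happen inside a transaction):

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.index.RelationshipIndex;

public class RelIndexSketch {
    // At write time, tag each A-->B relationship with the city of the C node
    // reachable through its B, so lookups can skip the two-hop traversal.
    static void indexLink(GraphDatabaseService db, Relationship aToB, String city) {
        RelationshipIndex index = db.index().forRelationships("a_to_b");
        index.add(aToB, "city", city);
    }

    // Fetch only the "london" A-->B relationships starting at a given A node;
    // the last two arguments constrain the start/end nodes (null = any).
    static Iterable<Relationship> londonLinks(GraphDatabaseService db, Node a) {
        RelationshipIndex index = db.index().forRelationships("a_to_b");
        return index.get("city", "london", a, null);
    }
}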

Pablo Pareja

Mar 5, 2012, 5:11:42 AM
to ne...@googlegroups.com
Hi,

Most times when dealing with this sort of scenario, relationship indexing is IMHO not a good fit; at least not the way it is implemented right now.
The thing is that as long as this sort of index cannot be used, let's call it 'by default', in traversals and in Cypher & Gremlin queries, you can neither have clearly specified queries nor actually 'use' the indexes, because different components of the server and other tooling don't take them into account.

What's your view on this 'supernode' issue for the next release(s)?
I really think it's something that should be prioritized in the near future. Actually, I already know quite a few people who moved away to other technologies in important projects because of this problem.

Cheers,

Pablo

Miguel Gomard

Mar 5, 2012, 6:42:16 AM
to ne...@googlegroups.com
I totally agree with that.

We are forced to use Neo4j as a niche or exotic component in an architecture, handling only a few kinds of features, and in the end those features can be handled in other ways that are more sophisticated and complex, but also more scalable.

So I think that either:
- Neo4j is a great tool to handle (without sharding, OK) the most common uses of databases, like an efficient traversal of all nodes with "joins" and "order" (for example, the ordered result of a select over a database of millions of records), or
- Neo4j is for small databases/small volumes, or
- Neo4j will be an unsuitable tool, because it has to sit inside a big/complex architecture where it will be the only non-shardable element.

The lack of efficient global traversal (where we can have super-nodes like 200,000 products in one category) makes the use of Neo4j more theoretical than practical.

Sorry to be so rough, but I feel a big frustration when we go deep into Neo4j for an ambitious website. I have been looking at OrientDB, which seems more aware of these problems and tries to be more general (document + graph).
I think that approach (that focus) is better, even if the OrientDB team and community seem to be smaller and less advanced than the Neo4j team.

Regards

Johan Svensson

Mar 5, 2012, 7:37:28 AM
to ne...@googlegroups.com
Hi,

The problem described in the blog by Aleksa is:

You have many relationships of one (or a few) type(s) together with
few relationships of some other type(s) connected to the same node.
Then, when traversing, if the relationship chain has not been loaded
(and cached), getting all relationships of any one type requires the
full chain to be loaded.

This will be fixed.

I do not see how the document data model could help here, since it
cannot handle connected data. Regarding the "100k limit", that is
probably GC related, since there is no such limit. The time to read the
full relationship chain from disk will be directly proportional to the
number of entries in the chain (and in turn linked to how good the
disks are).

Indexing relationships as described in the blog is a workaround that
can be used. Another workaround is to introduce balance nodes: one node
holds the relationships there are few of, and another node holds the
ones there are many of.
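
For example, a minimal sketch (names are made up):

import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;

public class BalanceNodeSketch {
    enum RelTypes implements RelationshipType { DENSE_PART, IN_CATEGORY }

    // Keep the rare relationships on the main node and hang the many (say,
    // 200k products in one category) off a dedicated balance node, so that
    // loading the main node's relationship chain stays cheap.
    static void addProduct(GraphDatabaseService db, Node category, Node product) {
        Transaction tx = db.beginTx();
        try {
            Node densePart = category.getSingleRelationship(
                    RelTypes.DENSE_PART, Direction.OUTGOING).getEndNode();
            product.createRelationshipTo(densePart, RelTypes.IN_CATEGORY);
            tx.success();
        } finally {
            tx.finish();
        }
    }
}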

We are aware of this problem and have a very good idea of how to solve
it. It is, however, a big change to the store format, so it will take
some time before it makes it into the milestone releases.

Regards,
Johan

Pablo Pareja

Mar 6, 2012, 6:32:16 AM
to ne...@googlegroups.com
Hi Johan,

First of all, thanks for your answer.
Just to be clear: in my case I'm not talking about adding any sort of document data model.

It's as simple as the issue I opened 6 months ago:


As I said in my previous message (and after re-reading Aleksa's blog post), I keep thinking
relationship indexing is not the solution. I already gave a few important reasons for that in the other
message, and could still add a good few more if necessary...

I must admit that this 'issue' is something that surprised me from the very beginning. I assumed, from the
first moment I started using Neo4j, that this 'native' relationship filtering by name would be implemented and working.
Otherwise, IMHO, you cannot fully take advantage of modelling data as a graph because, in almost
every data model, there is a set of entities/concepts that may not be numerous themselves but that are
nevertheless very highly connected to the rest of the data.

You could think, OK, I will model those nodes as properties and problem solved, but then you would
end up taking that approach in all such cases, which brings up the point:
why should I use a graph DB at all? It may be pretty cool, but if I cannot model entities as nodes and take advantage of that
in traversals and so on, why bother dealing with specific indexes for relationships, 'ugly' Cypher/Gremlin queries,
and unexpected performance depending on the case, tool, component, and situation?

I don't want to be rough either, just to make my point clear.

In the project I'm working on (Bio4j) I have many so-called 'supernodes'; you can get an overall look at the size of
this DB in the post I published yesterday:


As you can guess after seeing all those numbers, this supernode thing is quite important for me
(I can see how it would also be important for almost everyone dealing with data sizes big enough).

Thus, I would deeply appreciate it if you could give an estimate of the release in which a solution to this issue
will be incorporated.

Mattias Persson

Mar 10, 2012, 10:16:37 AM
to ne...@googlegroups.com
If it's of any comfort, I'm fully with you on this and also hope that the time will come soon.

Although, this is "only" about the loading of relationships into the cache. Once they are in there, the behavior is what you'd expect: it doesn't matter how many "other" relationships there are on a node. But then again, the loading is the slowest part of it all, and if the graph gets bigger than what fits into memory, the problem is much more visible. I'm sure you're aware of this; I just wanted to put it out there.

Best,
Mattias

2012/3/6 Pablo Pareja <ppa...@era7.com>

Inder Pall

Nov 7, 2012, 12:56:16 AM
to ne...@googlegroups.com
Hello Folks,

Did the discussed fix, or any other fix, make it into master to handle super-nodes during queries?

- Inder

Mattias Persson

Nov 7, 2012, 8:03:34 AM
to Neo4j Development
Not yet; it's on the roadmap, not too far ahead. See the discussion here:


2012/11/7 Inder Pall <inder...@gmail.com>
