Re: [Neo4j] Neo4j vs Mongo

nirmal

Jun 17, 2012, 9:26:59 PM

to ne...@googlegroups.com

I wouldn't store metadata in Neo4j. It might be worth storing the details in Mongo and the graph information in Neo4j.

On 17 June 2012 16:03, ramya <ramy...@yahoo.com> wrote:

Hi - I'd like to understand the differences in implementation and performance between storing data in a graph database such as Neo4j vs. a document store like MongoDB. Can Mongo be used to create user profile documents, wherein each profile may contain user IDs for one or more other users (i.e., to represent a graph of users)? In that case, how does performance compare between MongoDB and Neo4j?
And also, are there areas where neo4j and mongo complement each other (rather than compete for a graph solution)? I am trying to pick the ideal stack to create a social network. Any recommendations will be great! Thanks in advance!

--
Nirmal Selvaraj

Emil Eifrem

Jun 18, 2012, 1:39:46 AM

to ne...@googlegroups.com

On Sun, Jun 17, 2012 at 6:26 PM, nirmal <nirmals...@gmail.com> wrote:
> I wouldn't store metadata on Neo4j. It might be worth storing the details in
> Mongo while the graph information in Neo4j.

Now, that is a very interesting statement. What do you feel is missing
to be able to store the metadata as conveniently in Neo4j as in Mongo?

Cheers,

--
Emil Eifrém, CEO [em...@neotechnology.com]
Neo Technology, www.neotechnology.com
Cell: +46 733 462 271 | US: 206 403 8808
http://blogs.neotechnology.com/emil
http://twitter.com/emileifrem

Markus Gattol

Jun 20, 2012, 4:43:35 AM

to ne...@googlegroups.com

Chiming in... Emil, first of all, thanks for moving the entire graph database world forward. I've always been fascinated by graphs in general, and for a few years now, thanks to you guys and all the other players in this corner, we have had proper products to build amazing things with.

I can understand the question the OP asked, and I guess many folks, including myself, have thought about it that way at one point or another... what if your dataset grows? How would I still be able to enjoy all the benefits of a graph database and maybe store crazy amounts of data, so much data that it's in fact too much for a single machine? My thought process led me down the same route... maybe use a layered approach in my data tier, i.e. a graph database atop some distributed filesystem (e.g. Ceph) and/or MongoDB.

Example: say I wanted to build Facebook (yes, a super-ridiculous example, but...). I would store connections and similar lightweight stuff in Neo4j and move anything 'heavy' (read: big) down to the fat second layer, which stores the 1 PiB of images and videos my users upload every day.

In the end, however (and you hear this from everybody who has been on the road long enough and/or is somehow directly involved in building the next generation of databases), we have to get used to having a variety of different databases/filesystems sitting within our data tier, each one used according to its particular strength, so that the entire logic tier (people 30 and younger, please read 'application' ;) gets the best 'deal' for anything data-related it has to ask for or remember.

Overall, though, I think such 'but when I need to scale to...' questions only become relevant in practice for about one in every 10^6 websites/apps out there. Neo4j can certainly drive anything the average project will ever require just fine... like every other database out there as well :)

nirmal

Jun 20, 2012, 9:47:11 AM

to ne...@googlegroups.com

Thanks, Markus. You have reflected every bit of my point. Adding to that, tools are generally built with particular strengths. Neo4j is strong at traversing quickly and returning graph results, while Mongo is strong at heavy data lifting, so I chose one or the other depending on the data to be stored.

--
Nirmal Selvaraj

Rick Otten

Jun 20, 2012, 10:28:25 AM

to ne...@googlegroups.com

MongoDB works much better when all of the data on a given partition can fit in memory. If any of the nodes get bogged down in disk I/O, Mongo doesn’t do so well. What makes it scale is that many of the data collections that can be easily stored in MongoDB are also easily partitioned. This lets you spread your data across a number of servers … so that data sets larger than the memory capacity of a single server still fit nearly entirely in memory.

Partitioning graph data can be hard – it depends on the graph – so it is particularly challenging to find ways to keep large graphs (in general) entirely in memory when you outgrow your commodity hardware resources.
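
To make the contrast concrete, here is a minimal sketch in plain Python (not MongoDB or Neo4j code). The user names, the `follows` edges, and the `shard_for` helper are all made up for illustration. Hashing a key places each document independently, but the same placement applied to a graph leaves some relationships spanning two servers, and how many depends on the graph.

```python
import zlib

def shard_for(key, n_shards):
    """Pick a shard from a stable hash of the document key (hypothetical helper)."""
    return zlib.crc32(key.encode("utf-8")) % n_shards

N_SHARDS = 3

# Made-up profile keys and "follows" relationships.
users = ["ann", "ben", "cho", "dee", "eli", "fay"]
follows = [("ann", "ben"), ("ben", "cho"), ("cho", "dee"),
           ("dee", "eli"), ("eli", "fay"), ("fay", "ann")]

# Documents: each key is placed on its own, with no knowledge of the others.
placement = {user: shard_for(user, N_SHARDS) for user in users}

def cut_edges(edges, placement):
    """Relationships whose two endpoints landed on different shards."""
    return [(a, b) for a, b in edges if placement[a] != placement[b]]

crossing = cut_edges(follows, placement)
print(f"{len(crossing)} of {len(follows)} relationships cross shards")
```

Every relationship in `crossing` would need a network hop during a traversal, which is the cost a graph-aware partitioning scheme tries to keep small.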

An interesting hybrid (graph-massive parallel datastore) technology I’ve been looking at lately is Titan. It lets you run the full Tinkerpop suite on top of HDFS. Neo4j is way slicker and much more mature, but if scaling is a serious concern, Titan may be worth a look - http://thinkaurelius.github.com/titan/

A similar, competing project, is Giraph - http://incubator.apache.org/giraph/

Peter Neubauer

Jun 20, 2012, 10:32:23 AM

to ne...@googlegroups.com

And, for this type of graph on top of Hadoop there is even infogrid...

/peter

Sent from mobile.

Marko Rodriguez

Jun 20, 2012, 11:15:53 AM

to ne...@googlegroups.com

Hi,

> An interesting hybrid (graph-massive parallel datastore) technology I’ve been looking at lately is Titan. It lets you run the full Tinkerpop suite on top of HDFS. Neo4j is way slicker and much more mature, but if scaling is a serious concern, Titan may be worth a look - http://thinkaurelius.github.com/titan/
>
> A similar, competing project, is Giraph - http://incubator.apache.org/giraph/

I would not say that Giraph and Titan are competing projects:

1. Giraph requires that the whole graph be maintained within the collective memory of the machine cluster.

(though, last I heard, they are trying to figure out a disk/memory model)

Titan, on the other hand, can store graphs that are the size of the collective disk space of the machine cluster.

2. Giraph is intended for global graph algorithms via BSP message passing.

Titan, on the other hand, is intended and optimized for local graph traversals via "link walking"/"pointer chasing"

3. Giraph is OLAP for single-user, non-transactional use cases.

Titan is OLTP for multi-user systems with numerous concurrent transactions.

I hope that helps,

Marko.

http://thinkaurelius.com

Nikhil

Jun 20, 2012, 11:49:12 AM

to ne...@googlegroups.com

I'm sorry I might be wrong, but if I were to implement a Facebook-like functionality, I would rather store all images/videos and other 1PiB data on some CDN and store URLs as node properties. I would use a graph to store 'processable' data (no, I don't plan to perform image/video processing using a graph there ;) ).

In my short experience with Neo4j so far (which is as short as a year in production), I have found Neo4j efficient enough to store denormalized or flat data (typically the case with documents) within nodes and still be able to perform efficient graph traversal. A comparison which I personally like is the one mentioned here. :)

Rick Otten has mentioned (in a parallel email in this thread) a valid point about partitioning, where most other data stores win over graph databases. Vertical scaling of your graph server is going to reach its limit when you exhaust all your hardware resources. However, after Titan was released by Aurelius, I feel much more confident that the problem of graph sharding can be solved gracefully.

To sum it up, I see comparing Neo4j with Mongo pretty much as comparing apples to oranges (I'd be happier if someone corrected me here). You use a graph when your data is highly interconnected and needs to be navigated, else if your data is nothing but a bunch of denormalized structures, you go with a plain document store!

--
Nikhil Lanjewar
Engineering Lead at YourNextLeap
http://yournextleap.com

http://twitter.com/rhetonik

Max De Marzi Jr.

Jun 20, 2012, 5:21:11 PM

to ne...@googlegroups.com

I am not a MongoDB expert, but I work with a dev who has some battle scars managing it. The issue with using Mongo to build a graph is you either:

1. Create a "connections" collection that stores both ids which just grows and grows and grows, which you'll have to index to use.

2. Create a "connections_array" field on each model... which leads to a bottleneck in Mongo if the objects are constantly being updated by adding more ids to the array. It has something to do with the document running out of its allocated space and Mongo moving it to a new location in memory/disk at the end of the data file. Consult your local MongoDB expert for details.

Either way, your traversal logic is going to hurt if you have deeper links, or if you have to find paths between two objects.
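
A minimal sketch of the two layouts above, using plain Python lists and dicts as stand-ins for Mongo collections. All names and data here are made up for illustration, and no MongoDB driver is involved. It also shows why path finding hurts: every hop is another lookup, which in a real document store would be another query.

```python
from collections import deque

# Option 1: a separate "connections" collection of id pairs.
connections = [
    {"from": "alice", "to": "bob"},
    {"from": "bob", "to": "carol"},
    {"from": "carol", "to": "dave"},
]

# Option 2: a "connections_array" field embedded in each profile document.
profiles = {
    "alice": {"name": "Alice", "connections_array": ["bob"]},
    "bob": {"name": "Bob", "connections_array": ["carol"]},
    "carol": {"name": "Carol", "connections_array": ["dave"]},
    "dave": {"name": "Dave", "connections_array": []},
}

def neighbours_from_collection(user_id):
    # A full scan of the pairs; this is the lookup an index on "from" would serve.
    return [c["to"] for c in connections if c["from"] == user_id]

def neighbours_from_array(user_id):
    # One document fetch per user.
    return profiles[user_id]["connections_array"]

def shortest_path(start, goal, neighbours):
    """Breadth-first search that also counts how many neighbour lookups it made."""
    queue = deque([[start]])
    seen = {start}
    lookups = 0
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path, lookups
        lookups += 1
        for nxt in neighbours(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None, lookups

path, lookups = shortest_path("alice", "dave", neighbours_from_array)
print(path, lookups)
```

Both layouts give the same path here; the difference is where the cost lands (a growing indexed collection versus ever-growing documents), and in both cases the traversal logic lives in the application.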

Jordan Pollard

Jun 22, 2012, 4:11:54 AM

to ne...@googlegroups.com

I haven't worked with Mongo since 1.9, but I believe the best way to support relationships between profile documents was to use a DBRef. The problem is that you might as well use a relational database at that point as behind the scenes you're really doing an expensive join operation from one document to another (you may find this article useful). This is exactly where the strength of a graph database like Neo4j comes into play so I would argue you would absolutely want to use a graph database for this.

I developed a hybrid approach that you may find useful, but make sure you are aware of the scalability concerns of each database. I use Neo4j to store relationships between Facebook users and other properties I know I will use frequently. I also keep a store of their profile information as a document in MongoDB. In Neo4j, I use a Mongo ObjectId to reference the user's profile document and other documents/data that I do not use very frequently.
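
A rough sketch of that split, with Python dicts standing in for both stores. The ids, field names, and helper functions are invented for illustration and are not taken from either database's API. The point is that relationship queries touch only the small graph records, and the profile document is fetched by its stored id only when it is actually needed.

```python
# Stand-in for a MongoDB collection: full profile documents keyed by an
# ObjectId-like string (the ids below are made up).
profile_docs = {
    "507f1f77bcf86cd799439011": {"bio": "Likes graphs", "albums": ["2011", "2012"]},
    "507f1f77bcf86cd799439012": {"bio": "Likes documents", "albums": []},
    "507f1f77bcf86cd799439013": {"bio": "Likes both", "albums": ["misc"]},
}

# Stand-in for the graph: nodes keep only frequently used properties plus
# a reference to the profile document.
graph_nodes = {
    1: {"name": "Ann", "profile_id": "507f1f77bcf86cd799439011"},
    2: {"name": "Ben", "profile_id": "507f1f77bcf86cd799439012"},
    3: {"name": "Cho", "profile_id": "507f1f77bcf86cd799439013"},
}
friend_of = {1: [2, 3], 2: [1], 3: [1]}

def friend_names(node_id):
    """Answered from the graph alone; no profile document is loaded."""
    return [graph_nodes[n]["name"] for n in friend_of[node_id]]

def load_profile(node_id):
    """Follow the stored id into the document store only on demand."""
    return profile_docs[graph_nodes[node_id]["profile_id"]]

print(friend_names(1))
print(load_profile(2)["bio"])
```

Nothing in this arrangement keeps the two stores consistent: if a profile document is removed, the `profile_id` on the node still points at it, so the application has to handle that case itself.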

Michael Hunger

Jun 22, 2012, 2:41:57 PM

to ne...@googlegroups.com

Does Mongo keep referential integrity of DBRefs? I think one of the big advantages of Neo4j is its one referential-integrity constraint: you never get dangling links.

Also, creating the actual links at insert time moves the join cost to a single point in time that can be easily controlled, instead of paying this JOIN cost on every query (as with a link in MongoDB or an RDBMS).
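
To illustrate the difference, here is a toy model in plain Python. It is not DBRef or Neo4j code, and `ToyGraph` and the field names are hypothetical. A stored id is only checked when someone follows it, so its target can disappear unnoticed. A relationship is checked once, when it is created, and a node that still has relationships cannot be deleted.

```python
# Document-style reference: just an id stored in a field.
docs = {
    "u1": {"name": "Ann", "friend_ref": "u2"},
    "u2": {"name": "Ben", "friend_ref": None},
}

def resolve(ref):
    """Follow a reference at query time; returns None if the target is gone."""
    return docs.get(ref)

del docs["u2"]                                # nothing checks who points at u2
dangling = resolve(docs["u1"]["friend_ref"])  # the lookup now finds nothing

class ToyGraph:
    """Hypothetical in-memory graph that enforces the no-dangling-links rule."""

    def __init__(self):
        self.nodes = {}
        self.rels = set()  # (start_id, end_id) pairs

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def relate(self, start, end):
        # The existence check is paid once, at insert time.
        if start not in self.nodes or end not in self.nodes:
            raise KeyError("both endpoints must exist")
        self.rels.add((start, end))

    def delete_node(self, node_id):
        if any(node_id in rel for rel in self.rels):
            raise ValueError("node still has relationships")
        del self.nodes[node_id]

g = ToyGraph()
g.add_node("u1", name="Ann")
g.add_node("u2", name="Ben")
g.relate("u1", "u2")
try:
    g.delete_node("u2")
    refused = False
except ValueError:
    refused = True

print(dangling, refused)
```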

Otherwise I'm all for polyglot persistence, use the datastore that's suited best for a certain type of data, so larger profiles (and pictures etc.) into mongo and rels and query-relevant information into neo4j.

Cheers

Michael

Jordan Pollard

Jun 29, 2012, 10:29:20 AM

to ne...@googlegroups.com

That was one of the problems I was trying to solve (re: referential integrity as well as efficiency since I knew there were nasty join operations going on behind the scenes) as there are not really any mature data modeling practices for document databases. After reading "REST In Practice", I posted an idea on the "REST In Practice" user group proposing a model inspired by REST... That's when Jim pointed out Neo4j :)

I'm pretty sure Mongo still doesn't maintain referential integrity between DBRefs as I would think it goes against the philosophy of a document database.

brian

Jun 30, 2012, 3:51:29 PM

to ne...@googlegroups.com

Completely agree with Nikhil. In general, I would store large-ish blobs or documents in a CDN, filesystem, or document store rather than in a property of a Node in a graph db. I believe I read in a separate thread, that when one property of a Node is read, all properties of the Node are loaded (correct me if I got this wrong). If so, you might be wasting memory during certain traversals or queries that are not concerned with the document property. In these cases, it might make more sense to store the traversable properties and metadata of your documents as Node properties including a URI to where the document/blob actually lives.

-brian

Michael Hunger

Jun 30, 2012, 4:01:16 PM

to ne...@googlegroups.com

Only the properties that fit directly into property records (short strings, numbers, and small arrays) are loaded; everything else (longer strings and larger arrays) is loaded only on demand.

So the large documents would not be loaded eagerly, but they would use more space in the store and transaction logs.

Michael

Jordan Pollard

Aug 8, 2012, 4:46:05 PM

to ne...@googlegroups.com

While I may end up with a bit of data duplication, I guess the other thing I was thinking of here is that I was not sure how large a graph I would end up with. It seems like a realistic possibility that I could end up in a situation where it is more beneficial to forgo the benefits of ACID transactions for faster reads/writes, so that data could be loaded on demand, similar to what Twitter does. Storing all of that data in Neo4j would make that transition a bit more difficult. The Facebook APIs also change so much that I would rather grab all the information I'm going to get from Facebook, put it into a data store of some type (like a document DB or some blob storage), and then write the logic for the data I want to parse out from there. That said, I have begun exploring other solutions for scalability with graph databases, especially since my company has a need for it while maintaining consistency (not related to what I've been messing around with for Facebook). I am not as familiar with Titan, but I suspect it is similar to LinkedIn's Norbert?
