| Comparing Titan+Cassandra with Neo4j | Puneet Arora | 15.11.13 00:29 | HI All, I am a Masters student at University of Buffalo. As my masters project I decided to "Analyse the performance of column-oriented databases (Titan+Cassandra) with traditional graph database (Neo4j) for Social Network Graphs." Question / Hypothesis : Traditional Graph databases should perform better as Social Networks is a typical Graph Database, but why big tech companies are still working with column-oriented such as Cassandra, HBase for their social networks. Do they really provide better performance? If yes, then what are the Pros and Cons or their limitations. On the process I am planning to create a small set of benchmarks including some micro-benchmarks and graph traversal algorithms. I would like to know your view on this or any suggestions which could help me in the study. Thanks Puneet Arora |
| Re: [Aurelius] Comparing Titan+Cassandra with Neo4j | Marko Rodriguez | 15.11.13 08:15 | Hello,
Excellent.
Why do you assume that hypothesis? You would do yourself a great service by understanding how the different databases work and the benefits and drawbacks you get from each. I recommend you start reading:
It would be smart to do both a micro and macro benchmark as in production, who runs a single thread against a tiny dataset (micro)? Finally, if you move forward with micro-benchmarking, please use Titan 0.4.1 (releasing next week). Titan 0.4.1 has a new caching system that in total (both on disk + in memory) will demonstrate Titan outperforming many "traditional graph databases". We will have a blog post coming out next week demonstrating these new advances and this should really help to educate the public so they don't have marketing driven hypotheses such as yours: "traditional graph database."
In conclusion, I would do two benchmarks: MICRO: <insert next blog post from Aurelius here>
Good luck, Marko. |
| Re: [Aurelius] Comparing Titan+Cassandra with Neo4j | Puneet Arora | 15.11.13 17:45 | Hi Marko, Thank for your input. That really serves as an encouragement. Moreover, I am looking forward for the release of Titan 0.4.1 and may use it for my study. Also, thanks for the links, they are really insightful. Will keep this tread posted with results. Thanks Puneet |
| Re: [Aurelius] Comparing Titan+Cassandra with Neo4j | Pieter Martin | 17.11.13 09:06 | Hi,
I think many people will be interested in your findings. I am definitely. As a suggestion it would be great if your benchmarks can run on any blueprints enabled db. The 4 dbs I work with (Titan/Neo4j/OrientDb and Bitsy) all have very different backends. It would be great to see a thorough benchmarks on all of them. Good luck Pieter -- |
| Re: [Aurelius] Comparing Titan+Cassandra with Neo4j | Marko Rodriguez | 17.11.13 11:57 | Hi,
Great. Please stay tuned. The post is complete -- but we won't publish it until Titan 0.4.1 is released.
Note that the benchmark we will present soon is particular to the context of read-only times around the disk and the cache. If you are interested in benchmarking, I urge you to do so according to the types of statistics you wish to elicit -- e.g. distributed writes, single machine read/writes, load times, OLAP queries, HDD vs. SSD, threaded traversals, bulk mutations, etc. In short, there are so many ways to slice and dice a system that any one benchmark will only demonstrate a particular behavior (moreover, only a particular behavior in a controlled environment). Take care, Marko.
|
| Re: [Aurelius] Comparing Titan+Cassandra with Neo4j | Puneet Arora | 17.11.13 14:10 | Hi Pieter, Thanks for your interest in my research. Currently we are targeting only Titan+Cassandra and Neo4j both are Blueprint enabled. But in order to do a "apples to apples" comparison we will restrict ourselves to the native API implementations as Blueprint may incur some additional overhead. We are planning to have following benchmarks.
We have decided these benchmarks keeping in mind their relevance to Social Networks. I will welcome any suggestions for benchmarks along with their relevance to social networks. Thanks Puneet Arora
|
| Re: [Aurelius] Comparing Titan+Cassandra with Neo4j | Naga Pavan Sripada | 14.05.14 00:06 | Hi Puneet, Can you share your benchmark results for the below mentioned scenarios. Thanks, Pa1. |
| Re: Comparing Titan+Cassandra with Neo4j | Adolfo Rodriguez | 15.05.14 12:05 |
I have a concern from time back after seeing some OrientDB's Luca presentation comparing performance RDBMS performance vs traversals performance so maybe this is the right thread to raise it. Presentation advocated that graph has always a better performance, no matter the graph size. I do not know how the details of a RDBMS JOIN works but I guess, that the "JOINs execution engine" is able to mobilise all implied resources to memory on one shot and perform the whole operation in memory. Only one access to the storage is required to fetch bulk data and the rest of the operation would perform in memory. On the contrary, when a graph is traversed, the path is only discovered at runtime and therefore an unknown number of (small) hits is required to the storage system for the 'just discovered' next steps. The workaround to this necessity of multiple access to the storage system is by caching data in memory and, I suppose, this is the motivation of the next TTL attr which requires a fine tuning to choose the right data. So, at the first view, I see an advantage in performance for how a JOIN is executed against how a graph is traversed (despite modeling and analysis for me is much more natural in graphs). Any better informed rationale? |
| Re: Comparing Titan+Cassandra with Neo4j | Adolfo Rodriguez | 15.05.14 12:11 | Well, after re-reading I realize my case is different and would be more referred to a MySQL vs Titan comparison. Anyway, if someone could provide a rationale 'traverse vs JOIN' would be appreciated. Thanks |
| Re: [Aurelius] Re: Comparing Titan+Cassandra with Neo4j | Puneet Arora | 15.05.14 12:34 | Hi Adolfo, Here is my guess : Yes, you are correct the JOIN will get all the data into memory. But still joining tables will create huge amount of data. A naive example could be If we have 2 tables with 1000 rows each. Now when I join them they create 1000x1000 (= 1,000,000) entries first. As the tables grow huge, the multiplication will grow huge as well. And it may not be able to store everything in memory, so it will perform paging which is costly. And work on pages of data, check required conditions and get the entities.
Whereas, in graph databases you don't have to perform join. Graph databases are stored in adjacency lists. Moreover they store relationships as well. So we directly get the entities which satisfy the relationship.
So we save time in doing joins, paging and unnecessary computations. I think this is where Graph Databases perform better. I hope that answers your question. Moreover I found a nice paper on it http://www.cs.olemiss.edu/~ychen/publications/conference/vicknair_acmse10.pdf. Maybe this helps you.
-- Puneet Arora On Thu, May 15, 2014 at 3:11 PM, 'Adolfo Rodriguez' via Aurelius <aureliu...@googlegroups.com> wrote:
On Thu, May 15, 2014 at 3:11 PM, 'Adolfo Rodriguez' via Aurelius <aureliu...@googlegroups.com> wrote:
|
| Re: [Aurelius] Re: Comparing Titan+Cassandra with Neo4j | Adolfo Rodriguez | 15.05.14 16:15 | Hi Puneet, thanks for your response, the paper empirically addresses a simple case of a 1 step traversal. I can agree that, in this specific case, graph can perform better. The problem of these papers if that they always compares pearls to apples so, despite they have industry value, I cannot find much academic value. When an app developer creates a traversal, he is actually trying to replicate what the JOIN engine does fully optimized out of the box. The problem is that traversal developer knows nothing about best practices. I figure out a lot of possible strategies in doing the traversal: * fetching vertices on demand (the most intuitive), * create extra relationships in the physical graph only for querying purposes, to reduce required steps.. * setting permanently key vertices in memory with minimum footprint (this minimum footprint should be provided by the manufacturer), * setting temporarily in memory some vertices with a TTL, * indexing here or there and how, * how compares execution in compiled C vs bytecode Java so regarding my original question, *I suppose there is not a single rule* (and statement in the mentioned slideshow is TOO generic/optimistic and does not compare equals). In the worst case, you can always mimic in the traversal how the JOIN engine works. What I really would find useful is identifying and evaluating all these alternatives separately rather that comparing products against each other in a simple test case. A checklist of available optimization options has to be offered by the graph engine provider. I suppose this will be covered by the graph providers in future years. Any comment very welcome. |
| Re: [Aurelius] Re: Comparing Titan+Cassandra with Neo4j | Matthias | 19.05.14 12:01 | Hello Adolfo, From an academic perspective, comparing "traversal" with "join" makes little sense since that is comparing apples to oranges (or is it pearls, or pears?) as you said. You can map a graph into an RDBMS and then execute a traversal as a join query. In fact, most of the early RDF databases (which are essentially graph databases) did just that. However, we noticed that traversals often lead to complex join queries (i.e. very many joins) and that RDBMS aren't very good at executing those. So, we set out to engineer specific graph engines.
There is a whole class of RDF databases that start with the RDBMS approach, but get rid of the triple or edge table and build a bunch of covering indexes (SPO, OPS, etc). This is essentially still a JOIN approach but the system is more tuned for executing joins with the covering indexes, no extra table and better join algorithms (there is a whole literature on join optimization for RDF that equally applies to graphs).
Next, graph databases are a class of databases that materialize the join. Hence, instead of executing a (potentially expensive) join, we have a pointers to the other vertices. So, to find all neighbors of a vertex, we follow those pointers. That's not unlike network databases from the 60s. Now, we have the extra benefit that we have enough RAM to load the entire graph in RAM and therefore the materialized joins turn into an object graph. That's the approach Trinity and Neo4j take.
So, bottom line is that most graph databases either execute joins or materialize the join and pre-load it into memory for performance. However, I believe there is an academic case to be made for graph databases. See, join processing ultimately relies on these big index structures (the SPO, OPS, etc indexes).
Those work fine for smaller graphs, but when you have billions of edges, the log N factor in building and maintaining those indexes starts to hurt you - not to mention the fact that building distributed index structures of this nature is really hard.
Likewise, materializing all data into memory does not scale either - or if it does it is very expensive to do so. So, for large scale graph databases, you need index structures that allow "stale" data to flow to cheap, secondary memory and quickly grab the important stuff but without building global index structures. We believe that the answer to that problem are vertex centric indexes which is what Titan is all about.
Cheers, Matthias
Matthias Broecheler http://www.matthiasb.com |
| Re: [Aurelius] Re: Comparing Titan+Cassandra with Neo4j | Adolfo Rodriguez | 19.05.14 17:26 | Thanks Matthias, this is a great explanation. In particular this sentence addresses my concern: [...] we noticed that traversals often lead to complex join queries [...]. Looking forward to see how this memory cache is fine tuned for future release. Regards. |