RavenDb vs Azure Document Db


Justin A
Aug 21, 2014, 9:01:06 PM
to rav...@googlegroups.com
*someone* has to do the obligatory post about this .. so it looks like it will be me :/

Ayende - can we expect a comparison blog post about the RavenDb 3.0 vs (the current preview of) Azure Document Db ? Why someone would prefer Raven over Azure DocDb?

-curious me-

Maverix
Aug 21, 2014, 9:59:38 PM
to rav...@googlegroups.com
Thanks Justin, you beat me to the post. Seems like this is what Azure Table Storage should have been.

The Missing Bits:
  • No Raven Management Studio Equivalent (Yet)
  • Map Reduce Missing
  • No Includes
  • No Lazy Loads
  • No User Defined indexes
  • Missing Advanced Indexing options:
    • load document
    • Map Reduce
    • Multi Map Indexes
  • No Full Text Search?
  • No set-based patching
  • No faceted search
  • No Changes API
  • Poor Paging Story (continuation tokens = yuk)
The Added Bits:
  • Fully Managed Service with SLA
  • Linear Scaling
  • Everything is Indexed
  • Backed by Microsoft
Questions in my mind:
  • Local dev Environment?
  • Unit Testing story?

Quote from Scott Gu:
"We now have DocumentDB databases that are each 100s of TBs in size, each processing millions of complex DocumentDB queries per day, with predictable performance of low single-digit ms latency."

Justin A
Aug 22, 2014, 12:52:21 AM
to rav...@googlegroups.com
My first question (which we've been discussing in JabbR: http://www.jabbr.net) was about their unit testing story and local dev story - which were the TWO questions you raised as well! (And my two _initial_ most important factors.)

I thought some of those *missing bits* were already available - like loading a doc ("SELECT * FROM doc WHERE id = 1"). Of course, I've only just skimmed the spin real quick.

Oren Eini (Ayende Rahien)
Aug 22, 2014, 1:51:58 AM
to ravendb
I'm looking into this, obviously, and it is very clear that the mindset is very different.
Note, I have no special knowledge, and I didn't actually play with that at all. I just went over the docs briefly. 

From just reading the docs, what pops to mind is that this is "Demoable by default", and that you are going to have a lot of issues working with it for real.

* SPs, UDFs and triggers in JavaScript replicate _all_ the usual issues around deploying schema in the relational world.
Have fun trying to manage that outside of your own source control, and have fun when someone makes changes to that on your systems in production, only to have them rolled back on the next update, or forgets to push them to production, and the application fails or has bugs.

* Indexing - I don't know if you noticed, but "index everything" is on by default. That makes it very easy to handle things if you have small documents and run at a small scale. But indexing has a cost, a non-trivial cost. That means that if you have big / deep documents and you run with the default indexing, you are going to slow down significantly.
They have an option to not index stuff, or to index only specific paths. But that just means _more_ stuff that you need to maintain, and you probably need to set the indexing configuration in both prod & dev. I expect issues going to prod as well (dev has index-everything by default, prod has specific stuff indexed).
There is no word on what happens when you need to change the index configuration for a collection. I'm not encouraged by that; my guess is that it is likely to be a BIG thing, especially if you have a lot of data in your collection.
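The dev/prod divergence worry can be sketched concretely. A toy illustration (not the DocumentDB API; the path syntax and documents are made up) of how "index everything" in dev and "only specific paths" in prod quietly disagree:

```python
# Toy sketch: flatten a document into (path, value) index terms.
# include_paths=None stands in for the "index everything" default.

def extract_terms(doc, include_paths=None, prefix=""):
    terms = []
    for key, value in doc.items():
        path = f"{prefix}/{key}"
        if isinstance(value, dict):
            terms.extend(extract_terms(value, include_paths, path))
        elif include_paths is None or path in include_paths:
            terms.append((path, value))
    return terms

doc = {"name": "Ada", "address": {"city": "Haifa", "zip": "31000"}}

dev_terms = extract_terms(doc)                            # dev: everything indexed
prod_terms = extract_terms(doc, include_paths={"/name"})  # prod: only /name

# A query on /address/city works in dev but has no index entry in prod.
print(dict(dev_terms))   # {'/name': 'Ada', '/address/city': 'Haifa', '/address/zip': '31000'}
print(dict(prod_terms))  # {'/name': 'Ada'}
```

The failure mode is exactly the one described above: a query that was served from the index in dev silently becomes a non-indexed query in prod.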

* Cross document queries - At least according to the documentation, I don't think those are possible. The queries they have in the docs all show operations on the same document only. It also makes sense considering they are running distributed; doing cross-document stuff would be very hard.
But that really limits the kinds of things you can do.

* Computation during query - they allow that, and that basically means you can kiss performance goodbye for anything complex. The use of UDFs in queries is also pretty scary to me, because they are a perf killer. I assume that a lot of advanced functionality is going to be exposed through those UDFs, maybe even LoadDocument or an equivalent. So you could try to query something based on a related document's field, but that would be an O(N) operation.
Welcome back, table scans.
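A toy illustration of that point (made-up data, not a real DocumentDB workload): an index answers the query directly, while an opaque UDF predicate forces the engine to evaluate it against every document:

```python
# Toy sketch: indexed lookup vs. a UDF predicate the engine can't see inside.

docs = [{"id": i, "total": i * 10} for i in range(100_000)]

# Indexed path: an inverted index jumps straight to the matching id.
index_by_total = {d["total"]: d["id"] for d in docs}

def indexed_lookup(total):
    return index_by_total.get(total)

# UDF path: the predicate is opaque, so it must be called on every
# document - O(N), i.e. a table scan.
def udf_matches(doc):
    return doc["total"] == 500_000

def scan_lookup():
    return [d["id"] for d in docs if udf_matches(d)]

print(indexed_lookup(500_000))  # one hash lookup
print(scan_lookup())            # touches all 100,000 documents
```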

* No computational indexing at all. No way to say "index the sum total of this order", for example. So you would have to do that in a UDF, which would run an aggregation over the entire document _during query_.

* Transactions - those appear to be handled via stored procedures running on the server. So if you want to do something as simple as saving a new order and incrementing the # of orders for the customer, you can't do that without writing an SP (with all the associated costs around that).

* Requests & Performance - They don't appear to have any concept of batching operations, and you appear to be charged / limited per # of operations / sec. Also note that they appear to be doing only 500 writes / sec and 1,000 (trivial) queries per second.
RavenDB can do three times that without even trying, and bulk insert gives you a lot more, especially if you are running on SSDs, as they appear to be doing.

* No id generation strategy - that appears to be left to you, or you use a GUID. Neither option is very good.

* No aggregation - I haven't seen any option to do any sort of aggregation. That means no GROUP BY clause in their SQL, no map/reduce indexing, nothing.

* No built-in caching system, and no _way_ to build a proper one. You are back to primitive time-based expiry and serving out-of-date data.

Note that those are things I figured out from the docs in about 30 minutes of reading them. And I'm focused solely on the core features. Things like geo distribution, data locality, large queries, management options, reporting and much more are all things that cannot be answered by the current docs.
I've also pretty much ignored the client API story.




Oren Eini

CEO


Mobile: +972-52-548-6969
Office: +972-4-622-7811
Fax: +972-153-4622-7811





--
You received this message because you are subscribed to the Google Groups "RavenDB - 2nd generation document database" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ravendb+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Itamar Syn-Hershko
Aug 22, 2014, 9:23:19 AM
to rav...@googlegroups.com
From the quick look I had at it (a very quick look) it seems like a Mongo clone, as if nothing happened in the last 10 years. There may be different design decisions, but the idea that you have SQL and sprocs is just wrong. A document DB requires a different mindset, and compromising on that, or even suggesting a compromise, effectively kills it.

--

Itamar Syn-Hershko
http://code972.com | @synhershko
Freelance Developer & Consultant



Federico Lois
Aug 22, 2014, 9:23:19 AM
to rav...@googlegroups.com
Today it would be difficult to say; I would wait for a few iterations before doing an in-depth comparison.

• Map Reduce Missing
• Missing Advanced Indexing options:
  • load document
  • Map Reduce
  • Multi Map Indexes

Not missing at all. For six months or so Azure has supported HDInsight (which is Hadoop running on Azure), and today they deployed HBase. The only thing you need there is to define a Hadoop map-reduce job (which are full programs - you can do whatever you want in them) and set the storage back into DocumentDB. The question that remains is how well integrated it is; we will probably have to wait 2 or 3 iterations (6 to 12 months) for deep integration, but Microsoft is playing hardball with Google and Amazon, so I would expect them to move there.

• No Full Text Search?

Again, it depends on how well integrated the DocumentDB indexes are (apparently they index everything o.O - autotune on the fly?) with Search-as-a-Service (also released today).

• Poor Paging Story (continuation tokens = yuk)

This is an API abstraction :) ... not such a big thing. It is freaking cool, but someone will build it *eventually*.

I prefer an integrated high-level API for development, but for some problems, if there are performance/management gains, even the high-level API is difficult to justify. Having said that... there is nothing preventing RavenDB internals from exploiting Azure when hosted in Azure. I don't know how many of you host on Azure, but I would be pretty damn happy for that to happen.



Oren Eini (Ayende Rahien)
Aug 22, 2014, 10:40:57 AM
to ravendb
Federico,
Paging story - you _can't_ cover that up. Look at CouchDB, where they have pretty much the same issue, and there is no better way to handle paging 6 - 7 years after it came out. This is a fundamental difference in how the query system works. For example, I'll bet that with continuation tokens you don't get the total number of results.
So you have no way of doing something as simple as showing the number of users in the system.

For that matter, they don't support _OrderBy_, and that is pretty much all you need to know about how good their paging support is.
They also don't appear to have any real good way to actually get the continuation token in the API that I saw.
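The total-count problem can be sketched with a made-up paging API: the server hands back an opaque token, and even a simple count means walking every page:

```python
# Toy sketch of continuation-token paging (not any real client API).

DATA = list(range(95))  # pretend these are documents
PAGE = 10

def query(token=None):
    """Return (page, next_token). The token is opaque to the client."""
    start = token or 0
    page = DATA[start:start + PAGE]
    next_token = start + PAGE if start + PAGE < len(DATA) else None
    return page, next_token

# Skip/take paging can report "95 results" from a single request.
# With tokens, counting means draining the whole result set:
def count_everything():
    total, token = 0, None
    while True:
        page, token = query(token)
        total += len(page)
        if token is None:
            return total

print(count_everything())  # 95 - only after visiting all ten pages
```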


Regarding Hadoop / HBase - that is nice, but it isn't really the same. While I'm sure they will do some level of integration with ADB, that concept is pretty much not relevant here.
The problem is that you are talking about map/reduce jobs that have to process the entire data set. In RavenDB, we update the map/reduce incrementally, so you have to do very little work to keep current. With external tools, you are back to running daily jobs and having aggregations that are days behind.

Same thing for full text search.

Also, good luck talking about _latencies_ for search here.
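A sketch of the incremental idea (toy code, not RavenDB's actual implementation): each new document updates only its own reduce key, so the aggregate is always current, instead of re-aggregating the entire data set the way a daily Hadoop job would:

```python
# Toy sketch: incremental map/reduce keeping a running total per customer.

from collections import defaultdict

class IncrementalMapReduce:
    def __init__(self):
        self.totals = defaultdict(int)   # reduce results, kept up to date

    def on_document_added(self, order):
        # map: emit (customer, amount); reduce: sum - applied incrementally,
        # touching only the key this document belongs to
        self.totals[order["customer"]] += order["amount"]

mr = IncrementalMapReduce()
mr.on_document_added({"customer": "joe", "amount": 30})
mr.on_document_added({"customer": "ann", "amount": 10})
mr.on_document_added({"customer": "joe", "amount": 12})

print(mr.totals["joe"])  # 42 - current immediately, no batch job needed
```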









Chris Marisic
Aug 22, 2014, 11:12:59 AM
to rav...@googlegroups.com


    On Friday, August 22, 2014 1:51:58 AM UTC-4, Oren Eini wrote:

    * Cross document queries - At least according to the documentation, I don't think that those are possible. The queries they have in the docs all show operations on the same document only. It also make sense considering they are running distributed, so doing cross document stuff would be very hard. 
    But that really limits the kind of things that you can do. 



That does not seem to be accurate; they support joins. Example:

SELECT T
FROM teams T
JOIN person IN T.members
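What that query does can be sketched in a few lines (made-up data): `JOIN person IN T.members` flattens an array nested inside each document, rather than joining across documents:

```python
# Toy sketch of DocumentDB's JOIN ... IN semantics: an intra-document
# flattening of a nested array, not a cross-document join.

teams = [
    {"id": "t1", "members": [{"name": "ann"}, {"name": "bob"}]},
    {"id": "t2", "members": [{"name": "cho"}]},
]

# Roughly: SELECT T.id, person.name FROM teams T JOIN person IN T.members
rows = [(t["id"], person["name"]) for t in teams for person in t["members"]]

print(rows)  # [('t1', 'ann'), ('t1', 'bob'), ('t2', 'cho')]
```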
     

Chris Marisic
Aug 22, 2014, 11:24:50 AM
to rav...@googlegroups.com


    On Friday, August 22, 2014 9:23:19 AM UTC-4, Itamar Syn-Hershko wrote:
    There may be diff design decisions but the idea that you have SQL and sprocs is just wrong. A document DB requires a different mindset, and compromising in that or even suggesting a compromise is effectively killing it.


I'm not so sure about these statements. Usage of SPs allows you fine-grained consistency control that would be difficult or impossible to emulate in RavenDB. Could one argue that if it's not reasonable to do in RavenDB, it is a poorly modeled document? Yes, very likely so. However, there are some very respected people, like https://twitter.com/kellabyte, who flatly disagree with RavenDB deserving to be called ACID. The design differences of DocumentDB might have people like her consider it worthy of the ACID label (I'm not sure offhand; I did just ask her).

Regarding the statements about SQL: I would advise extreme caution about being dismissive of SQL as a language. SQL has one of the longest track records in modern development, and it is nearly ubiquitous. DocumentDB supporting SQL as the query language is a radically lower barrier to entry than what you need to learn for RavenDB.

Kijana Woodard
Aug 22, 2014, 11:48:35 AM
to rav...@googlegroups.com
The problem with SQL is not as a query language, although I think it will be strange to query non-flat documents with SQL. The problem is that people will "port" relational designs into the db and then hit a wall.

That said, it could also be a benefit. They can get into the db using the skills they know. Everything will "work" with full consistency by default. Once they run into trouble, they can relax consistency here and there to proceed.

I don't _like_ that approach, but I can see it being "popular". Which means, once again, a bunch of terrible apps for my future self to rewrite. :-(

On @kellabyte's critique of RavenDB's ACID properties, I disagree with her analysis, as I understand it.

If you used, for instance, @itamar's ElasticSearch plugin for indexing, you wouldn't expect that to be ACID-consistent with session.SaveChanges.

She seems to draw the ACID boundary around the client API, or around the server process the client is talking to. I don't care that the Lucene indexing _happens_ to run inside the RavenDB server process; the ACID guarantees don't extend to it.

RavenDB gets to call itself ACID. Otherwise, you would be saying that if you did nothing more than remove the Lucene indexing feature and add the ElasticSearch plugin by default, suddenly it would be ACID? That doesn't make sense.



Oren Eini (Ayende Rahien)
Aug 22, 2014, 4:33:32 PM
to ravendb
That is for joining things _inside the same document_.
You can't join to another collection, or even to another document in the same collection.
In other words:

SELECT f1.id AS id1, f2.id AS id2
FROM Families f1
JOIN Families f2
WHERE f1.lastName = f2.lastName

This doesn't work.
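A sketch (made-up data) of what the client is left doing instead - fetching the documents and joining them in memory:

```python
# Toy sketch: emulating the cross-document join above on the client,
# since the server can't express it.

families = [
    {"id": "f1", "lastName": "Smith"},
    {"id": "f2", "lastName": "Smith"},
    {"id": "f3", "lastName": "Jones"},
]

# Equivalent of: SELECT f1.id, f2.id FROM Families f1 JOIN Families f2
#                WHERE f1.lastName = f2.lastName (excluding self-pairs).
# The client fetches everything and joins in memory - O(N^2) worst case.
pairs = [
    (a["id"], b["id"])
    for a in families
    for b in families
    if a["id"] != b["id"] and a["lastName"] == b["lastName"]
]

print(pairs)  # [('f1', 'f2'), ('f2', 'f1')]
```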




Oren Eini (Ayende Rahien)
Aug 22, 2014, 4:37:41 PM
to ravendb
Chris,
SPs have a LOT of well documented issues, and the way they handled this basically kills pretty much any chance to do reasonable optimizations.
It is pretty easy to get the same consistency guarantees; in fact, you can do it today by doing a write and waiting for it to be indexed.
In ADB they are basically doing that, except that you have the option to ask for the indexing to be lazy. I'm assuming that you'll use that fairly often for the common stuff, and then you are back in the same boat.

As for SQL, please note that they do NOT have SQL. They have something that looks like it, but with very limited support.
Joins only on internal sub-documents, WHERE and FROM, and that is it.

No GROUP BY, no HAVING, no real joins.

If we really wanted to, we could spend a week and do pretty much the same thing in RavenDB. We actually _had_ that in the 5xx builds.
We removed it because there was no good way to make it perform well at scale.




Chris Marisic
Aug 22, 2014, 4:44:07 PM
to rav...@googlegroups.com
I did not notice that at all; reviewing http://azure.microsoft.com/en-us/documentation/articles/documentdb-sql-query/ much more intently, I see it says self-joins.

Oren Eini (Ayende Rahien)
Aug 22, 2014, 5:05:37 PM
to ravendb
Okaaay, I just found this.

So...
You can store a _HARD_ maximum of 16KB per document?

Let us take a blog post of mine as an example:

The blog post takes 38.20 KB.
Its comments take about 10 KB in a separate document.

You _can't_ use this for any real world scenario without running into this limit all the time.

And RavenDB doesn't _have_ a limit. We have suggestions, but we have seen customers with document sizes in the multiple MB range, and several with tens of MB per _document_.

The limitations on queries are also _very_ bad. Only 3 AND / OR clauses allowed.
I foresee a lot of people moving their complex queries to UDFs (of which you can have just one per query).
Also, the number of stored procedures / UDFs per collection is _tiny_.






Federico Lois
Aug 26, 2014, 8:44:43 AM
to rav...@googlegroups.com
Don't get me wrong, incremental map/reduce is a great concept and VERY VERY USEFUL. However, I miss the ability to do complex calculations (to the point where it is a scalability pain for us). I am absolutely sure it is because of the domain of Codealike, but most normal aggregations do not work for us.

For example:
- We had to resort to services that poll for new data and preprocess it using very specific (and custom) aggregation functions (the mean is a very bad statistical measurement). The Focus Level is as real-time as it gets (data comes in and we have to update it eventually). The calculation is actually pretty complex, works in a sliding-window fashion, and the full equation is pretty nasty: http://blog.codealike.com/focus-what-are-we-talkin-about/
- We are still trying to avoid having services (which suck in every single way) to build regression models like the Codealike Index (which is in our test environment at the moment) that would allow a particular user to understand his performance based on his historical pattern.
- Some complex aggregations also trigger events, like On-Fire. And once they fire, they are done. They are history; a change of parameters must not modify the data (that rules out the type of map/reduce Raven does).
- We have already prototyped and tested offline a recurrent neural network that mimics your focus and can predict, based on your usual coding patterns, when it is the best time to interrupt you - predicting the general focus topology 2 hours in advance. It has quite similar issues (that is why it is not implemented live).

Other problems: we are using one server to do calculations and another to serve the users (input endpoints) and the web app. If we ever put those calculations in the main server, goodbye response times.

I don't know if others are abusing RavenDB as we are here :P ... but those are the problems we are dealing with on the data science side of the equation.
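The sliding-window flavor of the problem can be sketched like this (toy numbers and a toy statistic; the real Codealike focus equation is far more involved). A built-in map/reduce sums over all documents ever seen; a sliding window has to evict old contributions as new ones arrive:

```python
# Toy sketch: a sliding-window aggregate updated incrementally.

from collections import deque

class SlidingWindowMean:
    """Mean of only the last `size` events, maintained incrementally."""
    def __init__(self, size):
        self.window = deque(maxlen=size)
        self.total = 0.0

    def push(self, value):
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]  # evict oldest before deque drops it
        self.window.append(value)
        self.total += value
        return self.total / len(self.window)

w = SlidingWindowMean(size=3)
for v in [10, 20, 30, 40]:
    mean = w.push(v)

print(mean)  # 30.0 - mean of the last three events (20, 30, 40)
```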


Oren Eini (Ayende Rahien)
Aug 26, 2014, 8:49:00 AM
to ravendb
This is meant as a post-3.0 feature, but ...

James Tan
Aug 28, 2014, 12:03:12 PM
to rav...@googlegroups.com
In fact, to me, indexing all documents by default is a big thing, and I posted a suggestion for RavenDB a while ago with no follow-up: https://groups.google.com/forum/#!topic/ravendb/7laB5w7ZVDI
I do think full-document indexing is worthwhile, and it could be very useful for second-level specific indexing, because all terms are pre-indexed.

Thanks

James



Oren Eini (Ayende Rahien)
Aug 28, 2014, 12:08:05 PM
to ravendb

Marco
Aug 29, 2014, 3:18:27 AM
to rav...@googlegroups.com
In the sample from the blog post, one field in the index contains a value from every field.

I assume James's point is to index every property to a separate field by default, so you only have to create an index manually when you want to index nested properties and/or use specific settings (although some attributes could also help).

Oren Eini (Ayende Rahien)
Aug 29, 2014, 3:29:35 AM
to ravendb
It is pretty much the same thing.
Use dynamic indexing (CreateField) with Recurse to go over the entire thing.

Mircea Chirea
Aug 29, 2014, 4:00:16 AM
to rav...@googlegroups.com
It could also be a huge performance issue if you are not expressly aware of it. Indexing everything is usually done by search services like ElasticSearch, because by definition, if you put data there you want it indexed. In a database meant to store all kinds of data, it is not so useful. It can be done, if you are aware that you can shoot yourself in the foot.

James Tan
Feb 4, 2015, 11:15:52 AM
to rav...@googlegroups.com

With more documentation and details available for DocumentDB, I think some features could be more interesting than before.
For example, this explains the all-documents/fields indexing clearly.

Thanks

James