New Stable Build 700

Oren Eini (Ayende Rahien)

Mar 1, 2012, 7:55:28 AM
to ravendb
There is a new build out there, and it is awesome.

Indexing speed for the really tough scenarios? About 7 - 10 times faster.
Querying speed for really big indexes? About twice as fast.

We did a lot of major work in this build (actually, this represents over a month of work and just under a thousand separate commits) in two major areas:
- Performance for indexing
- Removing a set of annoying race conditions in Munin (not relevant to prod) and in the HTTP caching system (relevant to prod systems).

In particular, there was an instance where we would report a stale index result with the non-stale etag, causing a series of 304 responses that would eventually cause an error in the application.
The mitigating factor is that any change in the database would clear this issue.

You can check the full release notes here:

And you can download it from here:

Tobi

Mar 1, 2012, 1:35:45 PM
to rav...@googlegroups.com
On 01.03.2012 13:55, Oren Eini (Ayende Rahien) wrote:

> There is a new build out there, and it is awesome.

Great!

I'm just missing the build tag in Git, and one test fails on my machine:

Raven.Bundles.Tests.Authentication.SmugglerOAuth.Export_WithoutCredentials_WillReturnWithStatus401 [FAIL]
  Assert.Equal() Failure
  Position: First difference is at position 0
  Expected: The remote server returned an error: (401) Unauthorized.
  Actual:   Der Remoteserver hat einen Fehler zurückgegeben: (401)

Localization-issue!

Fixed in:

https://github.com/ravendb/ravendb/pull/463

Tobias

Oren Eini (Ayende Rahien)

Mar 1, 2012, 4:37:08 PM
to rav...@googlegroups.com
Added the tag, and pulled your change, thanks.


Justin A

Mar 2, 2012, 12:29:35 AM
to rav...@googlegroups.com

WoHoo!!!


Matt Warren

Mar 2, 2012, 5:41:00 AM
to rav...@googlegroups.com
Currently I've got access to a large dataset, so I thought I'd try putting it into Raven (using the latest build) and see what happens ;-)

Here are the stats so far:
  • Importing 47,4895,872 (47 million) docs took 4 hours 33 mins @ 2899 docs/sec (in embedded mode)
  • Creating an index (run after the full import) took 7 hours 13 mins @ 1586 docs/sec
After the import, the Esent data file was 47.8GB and the Lucene index took up 3.54GB. The docs have 19 fields: 2 are dates, 5 are text (but less than 20 chars each), and the rest are numbers. The index selects 2 date fields and 3 numerical fields.

During the import and the indexing RavenDB was well behaved with regards to memory usage. Previously, when I've done these types of tests, I've had to set lots of config settings to prevent it getting an OOME. But this time I didn't have to do anything and it kept below 2.5GB usage. The memory usage graph looked like a saw edge, i.e. it gradually worked its way up and then dropped down, I guess after a GC collection.

Whilst I have access to this data set, are there any other tests that I can run or any other info that it's worth sharing? I'm going to try a Map/Reduce one next, and I'll post the results here when it's done.

Note that during all these tests I carried on using my PC, so I had 2 versions of Visual Studio open, Chrome with lots of tabs, Outlook etc.

Oren Eini (Ayende Rahien)

Mar 2, 2012, 5:50:25 AM
to rav...@googlegroups.com
What kind of index did you run? Simple, full text, non-analyzed?
Something that would be interesting is doing an import after the index was created,
and measuring how far behind the indexing is. In our tests over 3 million docs, we saw about 1.5 seconds on average over the entire process. That means that about 2 seconds after the import was done, all the indexes were stable.

Matt Warren

Mar 2, 2012, 5:57:14 AM
to rav...@googlegroups.com
No, it was just a simple index, a straight select statement. I might do a full text and a non-analysed one next, but there are no fields with more than 20 characters in them, so I don't know how good a test it'll be.

I'll also try adding new docs with an index already in place and see what the timings are. How many new docs should I add? At the moment the import process takes 1 CSV file and turns it into 250,000 docs.

Oren Eini (Ayende Rahien)

Mar 2, 2012, 5:58:14 AM
to rav...@googlegroups.com
Those are big enough numbers to stress test what we are doing.

Matt Warren

Mar 8, 2012, 12:19:03 PM
to rav...@googlegroups.com
Just been able to do some more tests.

To import an additional 220,000 docs takes 150 secs (~1450 docs/sec).

After this it only takes an additional 30 secs for the 2 indexes (a custom one and Raven/DocumentsByEntityName) to be non-stale.

So overall it takes ~3 mins to add and index an additional 220,000 docs in a store/index that already contains 47 million docs.


Paul Hinett

Mar 8, 2012, 1:04:41 PM
to rav...@googlegroups.com

I wish I could get speeds like that.

I’m using 700, but my inserts are taking a lot longer: roughly 1.5 seconds per 512 inserts, although this does include a query to NHibernate, which takes around 750ms.

When you are importing 220,000 docs, where are you importing from? Is all your data loaded into memory first?

Paul

Oren Eini (Ayende Rahien)

Mar 8, 2012, 1:24:47 PM
to rav...@googlegroups.com
Paul,
That gives us about 750 ms per 512, which isn't good, but not that bad.
The major factor for this sort of thing is disk speed.
What kind of disk do you have?

Paul Hinett

Mar 8, 2012, 1:33:10 PM
to rav...@googlegroups.com

I installed a new disk today, not quite SSD but it’s the Seagate Hybrid SSD 750GB (7200rpm) 32MB Cache.

Matt Warren

Mar 9, 2012, 9:15:04 AM
to rav...@googlegroups.com

> When you are importing 220,000 docs, where are you importing from? Is all your data loaded into memory first?

Yep, I'm only timing how long it takes to put 220,000 in-memory POCOs into RavenDB, not the time to get the data into memory in the first place. The hard drive is a 7,200rpm SATA 250GB drive, not an SSD.

I think what's happening is that because I'm importing a large number of docs in one go, the indexing starts whilst the docs are still being saved. This means that most of the indexing is done in parallel with the import, so there's only a small wait at the end for the index to be non-stale.

Oren Eini (Ayende Rahien)

Mar 9, 2012, 9:21:04 AM
to rav...@googlegroups.com
Matt,
Exactly

Matt Warren

Mar 9, 2012, 9:29:26 AM
to rav...@googlegroups.com
One other thing I've noticed is that RavenDB faceted search is wayyyy faster than doing "group by.." queries in SQL that produce the same results.

I'm sure this is because the complexity is different. SQL group by must be O(num of rows), whereas RavenDB/Lucene faceted search is O(num of terms that the field has), which is drastically different. I guess the inverted index in Lucene really helps in this scenario.

The timings are below. There are the same number of rows in the SQL table as docs in the RavenDB store (now 55 million) and they represent the same data:

SQL takes 50 secs
    select Date, Count(*) from dbo.Table
    group by Date

Faceted search on the same field takes 4.5 secs!!
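
The complexity argument above can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: the class and method names are hypothetical, the "inverted index" is a plain dictionary rather than Lucene's actual data structures, and it ignores the per-term queries RavenDB issues when a facet is combined with a filter.

```csharp
using System;
using System.Collections.Generic;

// Toy model only; not RavenDB or Lucene code.
public static class FacetCountSketch
{
    // SQL-style GROUP BY: visits every row once, so the cost grows with the row count.
    public static Dictionary<string, int> CountByScan(IEnumerable<string> dateColumn)
    {
        var counts = new Dictionary<string, int>();
        foreach (var date in dateColumn)
        {
            int current;
            counts.TryGetValue(date, out current); // current stays 0 for an unseen date
            counts[date] = current + 1;
        }
        return counts;
    }

    // Builds the toy inverted index once, at "indexing time": term -> ids of docs containing it.
    public static Dictionary<string, List<int>> BuildIndex(IReadOnlyList<string> dateColumn)
    {
        var index = new Dictionary<string, List<int>>();
        for (var docId = 0; docId < dateColumn.Count; docId++)
        {
            List<int> postings;
            if (!index.TryGetValue(dateColumn[docId], out postings))
                index[dateColumn[docId]] = postings = new List<int>();
            postings.Add(docId);
        }
        return index;
    }

    // Unfiltered facet count: one step per distinct term, independent of the row count.
    public static Dictionary<string, int> CountByTerms(IReadOnlyDictionary<string, List<int>> invertedIndex)
    {
        var counts = new Dictionary<string, int>();
        foreach (var entry in invertedIndex)
            counts[entry.Key] = entry.Value.Count;
        return counts;
    }
}
```

Both methods return the same counts for the same data; the difference is that `CountByTerms` does its work per distinct term because the grouping was already paid for when the index was built.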

Oren Eini (Ayende Rahien)

Mar 9, 2012, 9:39:05 AM
to rav...@googlegroups.com
Wow, how many items do you have there?

Matt Warren

Mar 9, 2012, 9:54:42 AM
to rav...@googlegroups.com
There's 55 million docs in the RavenDB store and the same amount of rows in the database.

The groupby/faceted queries return 519 "groups" with group counts from 16,000 up to 150,000.

Oren Eini (Ayende Rahien)

Mar 9, 2012, 9:57:34 AM
to rav...@googlegroups.com
Matt,
Okay, that makes sense. 
I wonder if there are other ways to optimize this even further...
Right now we get the matching terms, then issue a query for each of them, and when you have that many, it is somewhat expensive.

One big thing that might help is actually enabling 304 on the endpoint. It currently does not support this.

On another subject, did you have a chance to look at INTERSECT?

Matt Warren

Mar 9, 2012, 10:12:37 AM
to rav...@googlegroups.com
The only way that I can think of to make it faster is outlined in this article: http://www.devatwork.nl/articles/lucenenet/faceted-search-and-drill-down-lucenenet/ (in fact, the comments there discuss the 2 approaches).

This was the approach that I was taking in my original (complicated) version of the faceted queries (before you simplified it ;-). However, the issue is how/when you create the initial BitArray; in theory it can be done at the end of any index update. If this approach was implemented, then you would only need to issue one query for a faceted search, regardless of the number of terms.

> One big thing that might help is actually enabling 304 on the endpoint. It currently does not support this.
What do you mean by this?

> On another subject, did you have a chance to look at INTERSECT ?
Yeah, I've been busy at work and sick, so not had much spare time! Can you hang on another couple of weeks, I'd really like to implement it?

Oren Eini (Ayende Rahien)

Mar 9, 2012, 10:40:42 AM
to rav...@googlegroups.com
The BitArray bit is interesting, because you might use the same facets for multiple queries, so we can save a lot there.
We can say that the first time we calculate a facet, we just execute the query to get the bit set, then save that (only until the index is updated, of course).
That way, we can share the bit arrays for the terms, and only AND them with each query.
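
A minimal sketch of that idea, assuming one cached `BitArray` per facet term and one `BitArray` for the query result, all indexed by the same document positions. The class and method names are hypothetical; only `System.Collections.BitArray` itself is a real .NET type.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Illustrative only; not RavenDB code.
public static class BitArrayFacetSketch
{
    // Counts the set bits in (queryBits AND termBits) without mutating either input.
    // Both arrays must have the same length, otherwise BitArray.And throws.
    public static int CountMatches(BitArray queryBits, BitArray termBits)
    {
        // BitArray.And modifies the instance it is called on, so work on a copy.
        var intersection = new BitArray(queryBits).And(termBits);
        var count = 0;
        foreach (bool bit in intersection)
            if (bit) count++;
        return count;
    }

    // One AND per cached term bit set yields the facet counts for a query.
    public static Dictionary<string, int> CountFacets(
        BitArray queryBits, IReadOnlyDictionary<string, BitArray> cachedTermBits)
    {
        var counts = new Dictionary<string, int>();
        foreach (var entry in cachedTermBits)
            counts[entry.Key] = CountMatches(queryBits, entry.Value);
        return counts;
    }
}
```

The term bit sets can be shared across queries because `CountMatches` never writes to them; only the per-query copy is modified.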

Thoughts?

More inline

On Fri, Mar 9, 2012 at 5:12 PM, Matt Warren <matt...@gmail.com> wrote:
>> One big thing that might help is actually enabling 304 on the endpoint. It currently does not support this.
> What do you mean by this?


304 is the HTTP Not Modified status code.
It allows us to detect whether something has changed or not, so we can skip all the work and just tell the client to keep its own version, because it is current.
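
As a rough illustration of that check, assuming the client echoes the etag of its cached copy in an `If-None-Match` request header: the helper below is hypothetical and simplified (a real header may carry several etags or `*`), and it is not the RavenDB endpoint code.

```csharp
using System;

// Illustrative only; simplified conditional-request check.
public static class NotModifiedSketch
{
    // Returns the HTTP status code to send: 304 when the etag the client sent in
    // If-None-Match equals the current one, otherwise 200 (do the work and send the body).
    public static int StatusFor(string currentEtag, string ifNoneMatchHeader)
    {
        if (ifNoneMatchHeader != null &&
            string.Equals(ifNoneMatchHeader, currentEtag, StringComparison.Ordinal))
            return 304; // Not Modified: the client's cached copy is still current

        return 200;     // OK: the result changed, or the client has no cached copy
    }
}
```

The scheme only works if the etag changes whenever the result does; an etag that stays the same while the result changes leads to the kind of repeated 304 responses described in the release announcement.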
 
>> On another subject, did you have a chance to look at INTERSECT ?
> Yeah, I've been busy at work and sick, so not had much spare time! Can you hang on another couple of weeks, I'd really like to implement it?


Sure, just making sure it is not forgotten

Matt Warren

Mar 9, 2012, 11:51:36 AM
to rav...@googlegroups.com
That's exactly how I did it in my complex version: I calculated all the necessary BitArrays and stored them in a dictionary keyed on the FacetName/Range. I then serialized this as a doc and stored it in Raven.

At facet query time I pulled out the doc and did an AND of the query bit array with each facet term bit array.

The issue I found was that it took a relatively long time to create all the facet bit arrays after the index had been updated. I seem to remember it took at least a minute (when there were hundreds of thousands of docs in the store). But it could all be done in the background, and I guess that the facets for that index could be marked as "STALE" until it was completed.

The other issue I never addressed was when the update should be done. It would be best to do it after all the batches of work for an index are done, because you have to completely re-create the BitArrays each time; I don't think you can update them incrementally.

Oren Eini (Ayende Rahien)

Mar 9, 2012, 1:23:19 PM
to rav...@googlegroups.com
How about this code:


var cachedFacets = GetCachedFacets(facetDoc);

// The cached facets are still valid if the index has not changed since they were computed.
if (cachedFacets.IndexEtag == docDb.GetIndexEtag(index))
    return cachedFacets;

// Otherwise, either generate and create the facets right here,
// or kick off a process for calculating the facets.

The second option is more complex, because you can have only one such process at any given point for any facet doc.
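
The etag check above could be fleshed out along the following lines. This is a self-contained sketch, not the RavenDB implementation: `FacetCacheSketch`, `GetOrCompute`, the string etag and the `compute` callback are all hypothetical stand-ins for whatever the server would actually use.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory facet cache; none of these names are RavenDB APIs.
public sealed class FacetCacheSketch
{
    private sealed class Entry
    {
        public string IndexEtag;
        public Dictionary<string, int> Facets;
    }

    private readonly object gate = new object();
    private readonly Dictionary<string, Entry> entries = new Dictionary<string, Entry>();

    // Returns the cached facets when they were computed against the current index etag;
    // otherwise recomputes them via 'compute' and remembers the result.
    public Dictionary<string, int> GetOrCompute(
        string facetDoc, string currentIndexEtag, Func<Dictionary<string, int>> compute)
    {
        lock (gate) // holding the lock while computing allows only one calculation at a time
        {
            Entry entry;
            if (entries.TryGetValue(facetDoc, out entry) && entry.IndexEtag == currentIndexEtag)
                return entry.Facets;

            var facets = compute();
            entries[facetDoc] = new Entry { IndexEtag = currentIndexEtag, Facets = facets };
            return facets;
        }
    }
}
```

Computing under the lock is the simplest way to get "only one such process at any given point", at the cost of blocking other callers; a background calculation that serves stale facets in the meantime would need more bookkeeping.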



Oren Eini (Ayende Rahien)

Mar 9, 2012, 1:21:49 PM
to rav...@googlegroups.com
inline

On Fri, Mar 9, 2012 at 6:51 PM, Matt Warren <matt...@gmail.com> wrote:
> That's exactly how I did it in my complex version, I calculated all the necessary BitArrays, and then stored them in a dictionary keyed on the FacetName/Range. I then serialized this as a doc and stored it in Raven.

That isn't good, it will force us to reindex (because you create a new doc).
We can keep this in memory instead.

> At facet query time I pulled out the doc and did an AND of the query bit array with each facet term bit array.
>
> The issue I found was that it took a relatively long time to create all the facet bit arrays after the index had been updated. I seem to remember it took at least a minute (when there were hundreds of thousands of docs in the store). But it could all be done in the background and I guess that the facet for that index could be marked as "STALE" until it was completed.

Facets are a relatively costly feature, so we can probably do this on demand, rather than all the time. And if we detect an index difference, we can recalc this, or maybe return stale facets?
How important is it to get up-to-the-millisecond facet info, really?

> The other issue I never addressed was when the update was done. It would be best if it was done after all the batches of work for an index were done, because you have to completely re-create the BitArrays each time, I don't think you can incrementally update it.

Itamar Syn-Hershko

Mar 10, 2012, 2:50:25 PM
to rav...@googlegroups.com
Just a nitpick here: the CDDB speed tests we did included the time it takes to parse the data out; we didn't create POCOs first and measure only the insert times like you did.

Itamar Syn-Hershko

Mar 10, 2012, 2:58:58 PM
to rav...@googlegroups.com
The problem with such BitArrays is every Lucene index change invalidates them. The Lucene doc ids are not guaranteed to survive a deletion or segment change. So if you were calculating those arrays while indexing was still operating in the background, it would have been cheaper to just do it the normal way (which is what we do now I think).

Also, what happens when you have, say, 100 different facets with 5 mil docs?

Oren Eini (Ayende Rahien)

Mar 11, 2012, 6:23:56 AM
to rav...@googlegroups.com
Itamar,
For most scenarios where we use facets, we work on top of a stable data set, so I don't think this would be too hard to do.
And a bit array over 5 mil documents is about 150K, so that turns out to be roughly 150 MB, which is acceptable, I think.

That is why I also said we need to consider stale facets as well, because we can avoid regenerating this all the time.
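
For a rough cross-check of such estimates, the arithmetic can be written down directly. The sketch assumes a plain, uncompressed bit set with one bit per document and one bit set per facet term; the post does not say which representation or how many terms it has in mind, and the helper names are hypothetical.

```csharp
using System;

// Back-of-the-envelope helper; illustrative only.
public static class BitSetMemorySketch
{
    // Bytes needed for one uncompressed bit set with one bit per document.
    public static long BytesPerBitSet(long documentCount)
    {
        return (documentCount + 7) / 8; // round up to whole bytes
    }

    // Total bytes for 'bitSetCount' such bit sets (one per facet term).
    public static long TotalBytes(long documentCount, long bitSetCount)
    {
        return BytesPerBitSet(documentCount) * bitSetCount;
    }
}
```

Under those assumptions, 5,000,000 documents need 5,000,000 / 8 = 625,000 bytes per bit set (about 610 KiB), so 100 bit sets come to 62,500,000 bytes (about 62.5 MB). The total scales linearly with the number of distinct terms across all facets.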

Chris Marisic

Mar 12, 2012, 11:43:23 AM
to rav...@googlegroups.com


On Friday, March 9, 2012 9:29:26 AM UTC-5, Matt Warren wrote:

> The timings are below, there are the same amount of rows in the SQL table as docs in the RavenDB store (now 55 million) and they represent the same data:
>
> SQL takes 50 secs
>     select Date, Count(*) from dbo.Table
>     group by Date
>
> Faceted search on the same field takes 4.5 secs!!


Can you show what the faceted search looks like since you put the sql version there?

Matt Warren

Mar 12, 2012, 11:54:42 AM
to rav...@googlegroups.com
Sure, it's pretty straight-forward though:

Create the facet setup doc:
    // Id of the document that describes which fields to facet on
    var facetSetupDoc = "facets/ForecastData";
    session.Store(new FacetSetup
    {
        Id = facetSetupDoc,
        Facets = new List<Facet>
        {
            new Facet { Name = "Date" },
        }
    });
    session.SaveChanges();

Query it:
    // Runs the facet query against the index, using the setup doc stored above
    var facetResults = session.Query<ForecastData>("ForecastIndex")
                              .ToFacets(facetSetupDoc);

Rémy van Duijkeren

Apr 4, 2012, 9:02:14 AM
to rav...@googlegroups.com

On Sunday, 11 March 2012 11:23:56 UTC+1, Oren Eini wrote:
> That is why I also said we need to consider stale facets as well, because we can avoid regenerating this all the time.

For the scenarios I used to work with (online travel agencies), the indexes and facets were considered stale together (so the counts were always correct). It was OK to have them stale for up to an hour.

When a document caused a trigger (added/changed/deleted), the indexes and facets were rebuilt in the background. When ready, they replaced the current set of indexes and facets.

Optionally, an x amount of time was waited after a trigger before rebuilding, so multiple triggers could be handled in one rebuild.

I think having only the facets stale is not a big issue, because it’s usually just an indication to the user, but I would prefer to have correct counts.
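
The "wait after a trigger, then rebuild once" scheme described above might look roughly like this. It is a sketch under assumptions: the class name, the explicit `now` parameter (used instead of a real timer so the logic is easy to follow) and the choice to measure the delay from the first pending trigger are all illustrative, not taken from the system described.

```csharp
using System;

// Hypothetical scheduler; illustrative only.
public sealed class DelayedRebuildSketch
{
    private readonly TimeSpan delay;
    private DateTime? pendingSince; // time of the first trigger not yet covered by a rebuild

    public DelayedRebuildSketch(TimeSpan delay)
    {
        this.delay = delay;
    }

    // Called whenever a document is added, changed or deleted.
    public void Trigger(DateTime now)
    {
        if (pendingSince == null)
            pendingSince = now; // later triggers join the same pending rebuild
    }

    // Returns true once the delay has passed since the first pending trigger;
    // that single rebuild then covers every trigger seen in the meantime.
    public bool TryStartRebuild(DateTime now)
    {
        if (pendingSince == null || now - pendingSince.Value < delay)
            return false;

        pendingSince = null;
        return true;
    }
}
```

Measuring from the first trigger keeps an upper bound on how stale the results can get; restarting the wait on every trigger would batch more aggressively but could postpone the rebuild indefinitely under constant writes.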

Stephen Panetta

Apr 17, 2012, 10:12:42 PM
to rav...@googlegroups.com
Hi Oren,
I have a RavenDB index which has a facet containing 500 categories.
For any given search, retrieving these facets can take up to 500ms over 10,000 documents.

I don't think I can reduce the number of categories any further, so my only option would be to use stale/bitwise facets if I'd like to improve performance.

Is this something likely to be included in a new build of RavenDB or would I be better off developing it myself using the method that Matt Warren mentioned?

Oren Eini (Ayende Rahien)

Apr 17, 2012, 10:20:56 PM
to rav...@googlegroups.com
Are you using build 700?
Can you try build 888 ?
We added caching there that should greatly help performance in your scenario.

Steve

Apr 17, 2012, 10:23:44 PM
to rav...@googlegroups.com
Certainly. Thanks for the reply.
I'll get back to you soon.

p.s. I've double posted as I didn't think that one went through. Sorry!



Steve

Apr 23, 2012, 8:49:56 AM
to rav...@googlegroups.com
Hi Oren,
Just thought I'd give you an update.
I did some preliminary tests and the new build did show some minor performance improvements. 
It's possible I didn't give the caching enough chance.

With the 500 category facet, for build 700, I'm running 10 identical searches in 6.3 seconds.
For build 888, it's taking 5.55 seconds.

I'm going to do a bit more testing soon to see if I can work out what it is.


Hannes Johansson

Oct 1, 2012, 7:04:06 PM
to rav...@googlegroups.com
I'd be very interested to see what your conclusions were. Also, I'd be really interested in knowing whether there are any other performance improvements on facets planned in the (very) near future.

The problem I have is that I have a lot of facets/terms and correspondingly a lot of filters that can be applied to query RavenDB. This means that

a) The facets query will be very slow as it's bound by the number of terms (or facets if you will), which is rather large (in some cases > 200). I know you might argue that it's not useful for the user to display that many facets, and while you may have a good point, I have to consider the fact that I am migrating an existing application and I need to preserve just about the same functionality (it wouldn't be acceptable for customers not to) and that's the way it works today.

b) Even though the facet queries are cached, since there are so many combinations of filters available, it's likely that most queries will be cache misses.

Making a facets query takes ~4-5 seconds in some cases, and that is of course unacceptable. Is there no way other than just trying to remove most of the terms (facets)? It's unlikely that I'll be able to push the number below ~50, and in some database tenants I have ~40k documents (though I know that number is not the most critical in the facets case). It's becoming a serious issue because this functionality is very important in the application and everything else works rather well.

I'm even starting to think about delegating the faceting to a different service that would use Solr or something, as they have managed to get ridiculously good performance for their faceted searches, but that would of course introduce all sorts of problems with trying to synchronize the indexes and keeping them reasonably consistent.

Any other ideas for an outline of a solution to this use case?

Kijana Woodard

Oct 1, 2012, 7:36:21 PM
to rav...@googlegroups.com
There's been a lot of movement on facets lately. Have you already seen this:

Hannes Johansson

Oct 2, 2012, 2:28:18 AM
to rav...@googlegroups.com
I hadn't seen that particular thread, so thanks for linking me to it. However, I have already tried running with that code, and while I did see clear performance improvements, it was still nowhere near enough. Instead of taking ~4-5 seconds it took ~2 seconds, which is a cool improvement, but it's still way too slow.

Oren Eini (Ayende Rahien)

Oct 2, 2012, 3:54:52 AM
to rav...@googlegroups.com
What is the data size? How many facets?
Did you see what happens on the next query?
What build are you using?