Indexing Query Performance Question

162 views
Skip to first unread message

KG

unread,
Jun 6, 2012, 1:23:39 PM6/6/12
to rav...@googlegroups.com
This might just be due to our being RavenDB virgins, but here's the issue we're having:

We have a "table" with about 8.5 million documents in it. Every time we need to write a new query against this in .NET RavenDB generates a new index for the query. Problem is, the index takes at least an hour to get all the current documents into the index query results, meanwhile whenever someone runs the query they're getting a 0 length result. 

Once the index has caught up to the 8.5 million documents, the queries work because the index query has been populated.

Is there something we're doing wrong? Is there a way to speed this up or to query without an index? This kind of performance impact is absolutely unacceptable.

Kijana Woodard

unread,
Jun 6, 2012, 1:42:09 PM6/6/12
to rav...@googlegroups.com
Wouldn't you have an index on 8.5 million rows in SQL Server? What's wrong with writing a few static indexes?

Is for speed of development on developing the new queries? If so, develop with a smaller set to get the functionality right and push to dev integration/build/automated testing/whatever to verify how it performs against the larger set.

But I'm a Raven noob too, so maybe someone with more experience has another suggestion.

KG

unread,
Jun 6, 2012, 1:46:25 PM6/6/12
to rav...@googlegroups.com
We tested with a static index, and it still takes forever for the index to populate. I understand we have a lot of data, but the fact that if we write a query based on an index that doesn't exist yet and our customer will have to wait hours for the query to return accurate results, that just seems crazy. Something's wrong here, either on our end or Raven's.

Mauro Servienti

unread,
Jun 6, 2012, 1:50:52 PM6/6/12
to rav...@googlegroups.com

The fact is simple: if you have a scenario that requires a sort of dynamic query, where queries can change their shape based on user input (for example a reporting server) a database such raven is not the correct choice.

If you are simply developing new scenario than you have to pay the wait time for the index to index all the data.

 

In my experience, on my machine (a laptop), however, indexing 20mil documents using 3 indexes takes not more than 40 minutes.

 

.m

Kijana Woodard

unread,
Jun 6, 2012, 1:56:39 PM6/6/12
to rav...@googlegroups.com
You say it takes "at least an hour". 8.5 million documents that's ~2000/second. Depending on your hardware, that doesn't bother me at all. 

But, it does sound like you're trying to use RavenDB as a data warehouse. My thoughts is that Raven is best suited for OLTP, but I'll let an official Rhino comment on that.

Chris Marisic

unread,
Jun 6, 2012, 2:00:16 PM6/6/12
to rav...@googlegroups.com
I might be old fashioned as I've used raven since the 1xx series, but I've never once used a dynamic index either in development or production.

The only time I've ever used a dynamic index was trying to figure out how raven would write an index to do what i wanted itself, which then i learned the output it creates isn't usable by the client anyway. I do think there should be a way to export a dynamic index definition to an AbstractIndexCreationTask<> from the silverlight client. Oren previously said this isn't really possible (which really doesn't make sense to me).

KG

unread,
Jun 6, 2012, 2:03:37 PM6/6/12
to rav...@googlegroups.com
What we were doing is copying the dynamic index definition, creating an index in code, and then deleting the dynamic one, so that it would always be there.

Perhaps we are attempting to use Raven for the wrong purpose, but all we're really trying to do is simple queries, nothing fancy. The fact that it just performs poorly with large volume data is the issue.

Chris Marisic

unread,
Jun 6, 2012, 2:37:53 PM6/6/12
to rav...@googlegroups.com
I see nothing really wrong here either. This is a one time paid cost. Even while it's indexing you can still even query it etc. I've seen lots of Oren's blog posts and forum posts showing the perf improvements they've made when working with big imports that the index only lags behind the actual inserting data by only a few seconds.

"meanwhile whenever someone runs the query they're getting a 0 length result. " this statement doesn't make sense. You should be able to materialize results nearly immediately after index creation, sure it won't have all 8.5M results, but it should be giving your results.

KG

unread,
Jun 6, 2012, 2:43:41 PM6/6/12
to rav...@googlegroups.com
I ran a query that basically said:

[linq].Where(x => x.Created > [a date] && x.Created < [a date]).ToList();

The list contained 0 elements until the elements that actually fit this criteria appeared in the index query. The elements were the last ones to appear in the index query, so it took forever for them to show up in the results of the query i wrote above.

KG

unread,
Jun 6, 2012, 2:50:46 PM6/6/12
to rav...@googlegroups.com
It might also be worth noting that the table that we're running this query against is constantly receiving updates.

Chris Marisic

unread,
Jun 6, 2012, 4:02:15 PM6/6/12
to rav...@googlegroups.com
Right, so there was nothing inherently wrong and Raven was working exactly as expected, when you asked it for Where(x => x.Created > [a date] && x.Created < [a date]).ToList(); it correctly told you that there were no items in that index that matched that at the time of the execution.

This is one of the few costs of RavenDB compared to sql databases from the fact that all indexes are precomputed.

Mainly you want to create 1 index per document type (unless you need to reduce over it as reduce indexes are always separate from map only) and stick every property you'll ever want to query on in it.


On Wednesday, June 6, 2012 2:43:41 PM UTC-4, KG wrote:

Matt Warren

unread,
Jun 6, 2012, 4:03:38 PM6/6/12
to rav...@googlegroups.com
> We tested with a static index, and it still takes forever for the index to populate. I understand we have a lot
> of data, but the fact that if we write a query based on an index that doesn't exist yet and our customer will
> have to wait hours for the query to return accurate results, that just seems crazy. Something's wrong here,
> either on our end or Raven's

This is all entirely expected, unless there is an existing index that it can use, RavenDB has to build an index before it can perform queries on it. Of course it can query the index whilst it is still building it, but if the query you issue should only match most recently added docs, they'll be indexed last, so they won't show up till the index is completely built.


> We have a "table" with about 8.5 million documents in it. Every time we need to write a new query against
> this in .NET RavenDB generates a new index for the query. Problem is, the index takes at least an hour to
> get all the current documents into the index query results, meanwhile whenever someone runs the query
> they're getting a 0 length result.

This is either a bug or something is causing the index to be deleted. RavenDB will only have to create a new index if the fields you are querying against are not in an existing index (the query optimiser handles this). So once you have created an index, it's presisted to disk and available after a re-start, it doesn't need to be re-created each time. What might be happening is that the existing index doesn't have the fields you need and RavenDB has to create a new one, is this the case?

To get round this, what you can do it build the largest possible static index that has all the available fields in it, for instance given a Customer like this:

public class Customer
{
    public string Name { get; set; }
    public int Age { get; set; }
    public DateTime DateOfBirth  { get; set; }
    public decmial Pay { get; set }
    .....
}

Then have an index like this:
    from customer in docs
         select new { customer.Name, customer.Age, customer.DateOfBirth, customer.Pay, .... }

Once this index has been built, RavenDB won't have to build any more indexes when you query for Customers, it can use the existing index.

But if you want to allow completely ad-hoc indexing or "reporting" then RavenDB isn't suited to that, as you've found out ;-)

Matt Warren

unread,
Jun 6, 2012, 4:06:52 PM6/6/12
to rav...@googlegroups.com
Once an index has been built, the time for new/updated docs to be added to it should be small. See Oren's blog post for some real-world numbers http://ayende.com/blog/154913/ravendb-amp-freedb-an-optimization-story

RavenDB can handle this scenario well, it will update the index as docs are updated, but you don't pay the cost of this "updating" at write time, because it's done in the background.

KG

unread,
Jun 6, 2012, 4:13:37 PM6/6/12
to rav...@googlegroups.com
Ok so i'm not completely crazy. If you run a query on an index, and the expected result document hasn't been indexed yet, it won't return that document even though it satisfies the query criteria. I thought I was losing it.

Matt Warren

unread,
Jun 6, 2012, 4:16:52 PM6/6/12
to rav...@googlegroups.com
Yeah that's right, it can't return the doc until it's been indexed.

However you can "Load" that doc straight way, loading and storing docs in RavenDB is completely ACID. It's just indexing that is "eventually consistent"

Matt Warren

unread,
Jun 6, 2012, 4:18:56 PM6/6/12
to rav...@googlegroups.com
Just to add, in your scenario (with 8.5 million docs), you don't want to be letting the end-user do something that causes a query to be issued that isn't covered by an existing index. Otherwise you'll have the long wait times that you've experienced.

KG

unread,
Jun 6, 2012, 4:20:26 PM6/6/12
to rav...@googlegroups.com
I'm intrigued. Can you explain to me the difference between load and query?

Matt Warren

unread,
Jun 6, 2012, 4:25:58 PM6/6/12
to rav...@googlegroups.com
Load means directly loading a doc but using it's ID, something like this:
    session.Load<Customer>("customer/4")
will load up the doc, straight after SaveChanges() has been called.

But this only work if you know the Id, which is where queries come in. If you want "all the customer with age > 5" or "all customers with D.O.B older than 1980" you have to use a query and that needs an index.

KG

unread,
Jun 6, 2012, 4:30:14 PM6/6/12
to rav...@googlegroups.com
Ok. So the moral of the story is that you need to anticipate early the fields on your object that you'd want to query on, and build the appropriate indicies. And if you don't, and you have a large data set down the road one day, you and your customers may be S.O.L. on query results for a while, until the index is fully populated. Also, you can use the Load method to directly access an object, assuming you have its ID, bypassing any indexing logic. Am I understanding this correctly?

Chris Marisic

unread,
Jun 6, 2012, 4:35:30 PM6/6/12
to rav...@googlegroups.com
Yes.

Generally when you're in production and need to add new indexes, or reindex existing data that more than tens of thousands of documents you want to treat this as an offline operation and schedule downtime for your application at night/weekends etc.

The biggest value you can get out of raven, is using Semantic ids, so that you can load your data as much as possible and that your indexes should really be used for displaying lists of data, not be used to query 1 single document.

Working with Load properly also guarantees you perfect consistency

Matt Warren

unread,
Jun 6, 2012, 4:44:03 PM6/6/12
to rav...@googlegroups.com
Yep, you've got it.

Just to add, whilst you have a small amount of docs, you can get away with letting RavenDB create "temp" indexes for you because it not only creates the ones you need, it "promotes" temp indexes if they get used enough time and keeps them (rather than cleaning them up). But with a large data set, this doesn't work, and so you really need to create the indexes yourself (as you say)

KG

unread,
Jun 6, 2012, 4:45:37 PM6/6/12
to rav...@googlegroups.com
Yeah definitely. We're using RavenDB because we need to store massive amounts of data that Sql Server doesn't provide enough storage for.

Matt Warren

unread,
Jun 6, 2012, 5:24:30 PM6/6/12
to rav...@googlegroups.com
Just out of intereset how much data do you intend to store in RavenDB? How many docs and roughly how many fields/properties per doc?

KG

unread,
Jun 7, 2012, 9:19:06 AM6/7/12
to rav...@googlegroups.com
Each doc has about 15-20 properties. We will be storing millions of docs. 10's of millions.

Oren Eini (Ayende Rahien)

unread,
Jun 7, 2012, 4:10:43 PM6/7/12
to rav...@googlegroups.com
inline

On Wed, Jun 6, 2012 at 11:03 PM, Matt Warren <matt...@gmail.com> wrote:


> We have a "table" with about 8.5 million documents in it. Every time we need to write a new query against
> this in .NET RavenDB generates a new index for the query. Problem is, the index takes at least an hour to
> get all the current documents into the index query results, meanwhile whenever someone runs the query
> they're getting a 0 length result.

This is either a bug or something is causing the index to be deleted. RavenDB will only have to create a new index if the fields you are querying against are not in an existing index (the query optimiser handles this). So once you have created an index, it's presisted to disk and available after a re-start, it doesn't need to be re-created each time. What might be happening is that the existing index doesn't have the fields you need and RavenDB has to create a new one, is this the case?

Actually, I would suspect that this is an issue with the temporary index getting cleaned up. 

Oren Eini (Ayende Rahien)

unread,
Jun 7, 2012, 4:12:04 PM6/7/12
to rav...@googlegroups.com
KG,
It can't return that document, since it hasn't been indexed.
I guess we could have scanned over all of the other documents to see if they match, but that would be extremely expensive.
Instead, we give you the results that we have right now, let you know as of when they are accurate and whatever they are stale or not.

Oren Eini (Ayende Rahien)

unread,
Jun 7, 2012, 4:13:57 PM6/7/12
to rav...@googlegroups.com
Actually, I would phrase this differently. If you have large data set and a customer issue an unexpected query somehow, you are NOT out of luck, because the entire system won't ground to a halt while the database is doing a full table scan of millions of documents.
Instead, we will give you a result straight away, and optimize ourselves for this query from that point onward.

Oren Eini (Ayende Rahien)

unread,
Jun 7, 2012, 4:16:14 PM6/7/12
to rav...@googlegroups.com
Chris,
I would strongly disagree with you here. One of the selling points for RavenDB is that introducing a new index is NOT an offline operation.
Unlike SQL Server, where online index build is only available at the Enterprise edition, RavenDB offers online index builds.

Sure, at large data sets you usually have to plan this, and you probably don't want to do that during peak load, but even then, RavenDB will manage.

I would actually say that part of upgrading the app would include (live) adding of new indexes, waiting for them to complete, then deploy the new software.

nightwatch

unread,
Jun 7, 2012, 5:12:35 PM6/7/12
to rav...@googlegroups.com
I'm curious how Raven will be better than SQL at storing a 'massive' amount of data.
After all, JSON document is usually much bigger than a corresponding table row, so you'll need approx 3-5 times more storage for Json data than for sql table with the same records in binary format.
Of course it depends on the data, I'm talking about some imaginative 'typical' case.
R

Oren Eini (Ayende Rahien)

unread,
Jun 7, 2012, 5:13:45 PM6/7/12
to rav...@googlegroups.com
Actually, we see smaller db sizes with RavenDB.
I am not really sure why, though. Probably the modeling makes a difference.

nightwatch

unread,
Jun 7, 2012, 5:25:18 PM6/7/12
to rav...@googlegroups.com
For sparse tables (lots of null columns) Json could be smaller. And if you're storing large texts then Json overhead will be small.
R

Chris Marisic

unread,
Jun 7, 2012, 5:58:28 PM6/7/12
to rav...@googlegroups.com
That's a somewhat interesting story of deploying index changes first, but the development story doesn't lend itself to that very well. The best story is code your app, and the indexes are created on deploy. Kinda makes it coupled to the app code.

Adding new indexes I can see being an online op, but one that still might be worth doing offline is modifying an existing key index that it will produce stale results for a long period and users of the system know the data enough to understand something is "wrong", like they go to search for someone that they know exists and get 404s. that confuses the hell out of users.

I have to say I'm excited to be deploying my next raven app even if i feel there's still a few maintenance sharp edges. That's because understanding my need for indexes very well and having natural unique keys there's going to be few times if ever we will have to touch the database as an admin except for setting up the backup scripts one time. And the latest tools for backups are as convenient as they've ever been.

Oren Eini (Ayende Rahien)

unread,
Jun 7, 2012, 6:03:30 PM6/7/12
to rav...@googlegroups.com
What sharp edges do you see? We will make every effort to remove them.

Chris Marisic

unread,
Jun 7, 2012, 7:42:46 PM6/7/12
to rav...@googlegroups.com
Mainly the scenario of needing to update an existing index and having to reindex.

Perhaps it could do it in the background, and then swap over when it's done, or nearly done, etc.

Matt Warren

unread,
Jun 8, 2012, 4:37:22 AM6/8/12
to rav...@googlegroups.com
I guess that storing everything in BSON helps?

But when you say "we see smaller db sizes with RavenDB" are you including the Lucene indexes in this? I think it's a fair to compare Esent + Lucene size to RDBMS size.

Oren Eini (Ayende Rahien)

unread,
Jun 8, 2012, 5:33:53 AM6/8/12
to rav...@googlegroups.com
Yes, I am doing that.
Case in point, my blog db is  137.51 MBytes 
But using RDBMS, it was 400 - 500 MB.

Matt Warren

unread,
Jun 8, 2012, 5:39:09 AM6/8/12
to rav...@googlegroups.com
Okay, that's impressive, I'd be interested in seeing why that's the case, because it's (pleasantly) surprising!

Oren Eini (Ayende Rahien)

unread,
Jun 8, 2012, 5:41:25 AM6/8/12
to rav...@googlegroups.com
No idea, and that might be because a lot of factors. Including things like different usage patterns, data that got stored for subtext that I am not doing in racoon, etc.
Reply all
Reply to author
Forward
0 new messages