Is OrderBy applied before filtering (WhereGreaterThanOrEqual) in LuceneQuery?

129 views
Skip to first unread message

Tobias Sebring

unread,
Aug 14, 2012, 6:31:39 AM8/14/12
to rav...@googlegroups.com
I'm seeing some strange results when adding OrderBy operands to my queries on a fairly large dataset (2.3M documents).

In the following queries FilesCount_Range:[0x00000001 TO NULL] will limit the dataset to 2600 out of the 2.3M documents in the full dataset.

The below query will return almost immediately:
query=FilesCount_Range:[0x00000001 TO NULL]
start=0
pageSize=25
aggregation=None
fetch=Id
fetch=Files

This query on the other hand takes 50 seconds to return:
query=FilesCount_Range:[0x00000001 TO NULL]
start=0
pageSize=25
aggregation=None
fetch=Id
fetch=Files
sort=Text
sort=Year
sort=ProjectId
sort=CaseId


This is taken from the mvc-profiler:
"49308 ms waiting for server in 1 request(s) for 1 sessions(s)Session opened for 50025.86 ms for 1 request(s) - Data/Index"

Subsequent calls with the same orderby are cached and return immediately.

The index is set up like so:
public class Data_Index : AbstractIndexCreationTask<Data, Data_ Index.Result>
{
public class Result
{
public string Text { get; set; }
public int Year { get; set; }
public int? ProjectId { get; set; }
public int? CaseId { get; set; }
public int FilesCount { get; set; }
public string[] Files { get; set; }
}

public Data_ Index()
{
Map = data => from d in data
select new
{
Text = d.Text,
Year = d.Year,
ProjectId = d.ProjectId,
CaseId = d.CaseId,
FilesCount = d.Files.Count,
Files = d.Files.Select(x => x.Path).ToArray()
};

Index(x => x.Text, FieldIndexing.NotAnalyzed);
Index(x => x.Year, FieldIndexing.Default);
Index(x => x.ProjectId, FieldIndexing.Default);
Index(x => x.CaseId, FieldIndexing.Default);
Index(x => x.FilesCount, FieldIndexing.Default);
Index(x => x.Files, FieldIndexing.Default);

Sort(x => x.Text, SortOptions.String);
Sort(x => x.Year, SortOptions.Int);
Sort(x => x.ProjectId, SortOptions.Int);
Sort(x => x.CaseId, SortOptions.Int);
Sort(x => x.FilesCount, SortOptions.Int);
}
}

This is on the latest build of RavenDb running as a console application. I've tried to reproduce this issue on a embedded server running in memory but to no avail.

I'm considering denormalizing my data even further by extracting all documents with files to another collection but hesitant because I figure these queries should work without this much overhead.

Tobias Sebring

unread,
Aug 15, 2012, 5:36:52 AM8/15/12
to rav...@googlegroups.com
Any clue as to why this is so slow? I have still not been able to figure this out.

Oren Eini (Ayende Rahien)

unread,
Aug 15, 2012, 5:53:53 AM8/15/12
to rav...@googlegroups.com
How many items match the result?

Tobias Sebring

unread,
Aug 15, 2012, 6:25:22 AM8/15/12
to rav...@googlegroups.com
"TotalResults": 2601,
"SkippedResults": 0

Oren Eini (Ayende Rahien)

unread,
Aug 15, 2012, 7:19:35 AM8/15/12
to rav...@googlegroups.com
Well, we do have to do extra work on your case, loading a lot more info to memory to be able to sort on this.
If this happens on a cold start, it may take a while.
Can you try a different query (so it won't be cached), and see what happens over time?

Tobias Sebring

unread,
Aug 15, 2012, 7:45:20 AM8/15/12
to rav...@googlegroups.com
Is 50 seconds reasonable for a cold start with the size of the data set I'm dealing with? If I remove the OrderBy on the text field the query will sort on Year, ProjectId and CaseId and return within 16 seconds which also seems slow for loading 2601 documents and doing an in-memory sort. A subsequent and modified query (e.g. adding another field to the query) will complete pretty fast. After a delay of inactivity the query will once again take 50 seconds to complete.

Considering these factors I'm guessing the sorting is done on the full document collection and then cached in memory? If so is there a way to increase the lifetime of this cache and/or possibly not sort on the full collection but on the documents returned from the query?

Oren Eini (Ayende Rahien)

unread,
Aug 15, 2012, 7:54:22 AM8/15/12
to rav...@googlegroups.com
Tobias,
What is the size of the text field?

We do a lot of complex caching there, and depending on the actual index size (can you check), loading through it might take a while.

Tobias Sebring

unread,
Aug 15, 2012, 7:59:08 AM8/15/12
to rav...@googlegroups.com
25-50 chars on average but in rare cases upwards of 100 chars.

Oren Eini (Ayende Rahien)

unread,
Aug 15, 2012, 7:59:50 AM8/15/12
to rav...@googlegroups.com
Strange.
At any rate, you need to warm up the index, make sure that a lot of it is in memory before you can really make perf comparisons. That is especially true with large indexes.

Tobias Sebring

unread,
Aug 15, 2012, 8:05:36 AM8/15/12
to rav...@googlegroups.com
What are my options for doing so and/or to make sure it stays in memory? The database is around 4-5GB and I don't expect it to become much larger than that. If I knew of a way to do it I could potentially pre-load the entire database into memory but it would still need to be persisted to disk.

Oren Eini (Ayende Rahien)

unread,
Aug 15, 2012, 8:15:13 AM8/15/12
to rav...@googlegroups.com
Tobias,
First, let us check some things.
Can you check the actual size of the index? 

Tobias Sebring

unread,
Aug 15, 2012, 8:45:34 AM8/15/12
to rav...@googlegroups.com
I will as soon as I can later today. Right now I don't have access to the database computer and rebuilding the index takes ~4 hours.

We are talking about the physical size of the index on disk right?

Oren Eini (Ayende Rahien)

unread,
Aug 15, 2012, 8:49:08 AM8/15/12
to rav...@googlegroups.com
YEד

Tobias Sebring

unread,
Aug 15, 2012, 9:53:06 AM8/15/12
to rav...@googlegroups.com
The index is 759mb.
Data file is 3,63 GB.

Indexing Attempts: 2,296,704
Indexing Successes: 2,296,704

Oren Eini (Ayende Rahien)

unread,
Aug 15, 2012, 9:55:09 AM8/15/12
to rav...@googlegroups.com
Hm,
Big enough that we will take several queries to load significant portions of it to memory.
In short, as long as the site is serving queries, you'll see good perf.
If you have long period of inactivity, we will drop all resources.

Do you expect long periods of inactivity?

Tobias Sebring

unread,
Aug 15, 2012, 10:01:40 AM8/15/12
to rav...@googlegroups.com
Yes, the site will be inactive for most hours out of the day and will only see sporadic use that will not be concentrated in any specific period of time. I expect the resources will have been dropped almost every time a user accesses the site.

Oren Eini (Ayende Rahien)

unread,
Aug 16, 2012, 12:02:33 PM8/16/12
to rav...@googlegroups.com
How are you running this?
In a service? In IIS?

I assume that you are running in a child db mode, you need to set:
Raven/Tenants/MaxIdleTimeForTenantDatabase

To a high value.

If you are running in IIS, you need to set IIS idle settings to zero and disable recycling.

Tobias Sebring

unread,
Aug 16, 2012, 12:24:38 PM8/16/12
to rav...@googlegroups.com
I'm currently running it as a console application (i.e. haven't installed it as a service but not running embedded). The site is running in IIS. Going to try that setting to see if it makes any difference.

Another thing I've noticed over the past day is that if I leave indexing on while doing the batch import (takes aprox. 40 min) around 2,200,000 / 2,300,000 documents will have already been processed in the index when it's done. If I repeat the same batch import  with indexing turned off, turn it back on, and wait for indexing to complete - the same index will take 3-4 hours to build with a very noticeable decrease in performance as the index gets bigger.

Another issue with the indexing seems to be that after you stop actively adding documents to the index the indexing will slow down and those last 100,000 documents may take 15-20 minutes to complete and more often than not the index will remain stale even though all documents have been processed (to the point that it does not return any results when querying). I've noticed that if I shutdown the ravendb server when this happens and then start it back up the index will no longer be stale.

Oren Eini (Ayende Rahien)

unread,
Aug 16, 2012, 12:28:30 PM8/16/12
to ravendb
Tobias,
Is it a child database?

Tobias Sebring

unread,
Aug 16, 2012, 12:32:43 PM8/16/12
to rav...@googlegroups.com
I'm not sure I understand "child database". It's a separate database from the system database and I've created it in the manager by clicking New database.

Oren Eini (Ayende Rahien)

unread,
Aug 16, 2012, 12:35:31 PM8/16/12
to rav...@googlegroups.com
Okay, that is what I meant.

I just fixed a bug that would likely cause non system database to shut down after inactivity, even if they are doing indexing, if you are doing a LOT of indexing.
That would probably match what you are seeing.
Can you wait for the next build and test that?

Tobias Sebring

unread,
Aug 16, 2012, 1:02:22 PM8/16/12
to rav...@googlegroups.com
Absolutely.

While I have you on the hook I have another problem I'm struggling with:

My batch import is processed in multiple phases to build my full denormalized dataset. In one of the later phases I need to query data in the database to find the relevant document to update using .LuceneQuery<>(). This works well for a majority of the updates in this phase but a few iterations will fail to find the relevant document to update.
For every update iteration I modify fields that are part of the index I'm querying and as such I call .WaitForNonStaleResults() on the subsequent query after every update to (hopefully) make sure the index is in good shape. The index is defined as an AbstractIndexCreationTask<Data, Data.Result> (Map/Reduce) but there is no defined reduce function but rather I'm using it to query the data using lambda syntax on computed index-fields and then selecting the actual document type for the result: 
session.Advanced.LuceneQuery<Datas_Index.Result, Datas_Index>()
   .WhereEqual(x => ..., ...)
   .SelectFields<Data>()
   .ToList()

After I rerun the same update code on the data that failed to find the relevant document to update - like magic - it's now returned and updates are successful. Any idea what I'm doing wrong here?

Oren Eini (Ayende Rahien)

unread,
Aug 16, 2012, 2:18:57 PM8/16/12
to rav...@googlegroups.com
No idea.
Can you try generating a failing test?
I am not sure that I am following.

Tobias Sebring

unread,
Aug 16, 2012, 3:51:05 PM8/16/12
to rav...@googlegroups.com
I'll see if I can get a test to fail to show the issue.
Reply all
Reply to author
Forward
0 new messages