Load all objects regardless of MaxPageSize


rfuller

unread,
Oct 17, 2011, 12:39:44 AM10/17/11
to ravendb
How can I get RavenDB to load all documents of a given type regardless
of MaxPageSize?

-Ryan

Ayende Rahien

unread,
Oct 17, 2011, 1:18:42 AM10/17/11
to rav...@googlegroups.com
You can't do this by default. See our Safe by Default page.

If you really want to do that, you can set Raven/MaxPageSize to a high value, but that isn't recommended.

rfuller

unread,
Oct 17, 2011, 10:26:05 AM10/17/11
to ravendb
Is there a way to specify this at the session level rather than for
the entire raven server? Sometimes it's just a single method or a
maintenance procedure that I'm running where I need to get around the
MaxPageSize.

-Ryan

On Oct 17, 1:18 am, Ayende Rahien <aye...@ayende.com> wrote:
> You can't do this by default. See our Safe by Default page. http://ravendb.net/documentation/safe-by-default

Itamar Syn-Hershko

unread,
Oct 17, 2011, 10:37:20 AM10/17/11
to rav...@googlegroups.com
No, by design this setting is on the server

Everett Muniz

unread,
Oct 17, 2011, 10:45:36 AM10/17/11
to rav...@googlegroups.com
Ayende / Itamar...

For those rare occasions when there's a good reason to get all results matching a particular query, is an extension method like the one below a valid approach?

public static class DocumentSessionExtensions
{
    public static List<T> AllResultsFrom<T>(this IDocumentSession instance, Func<IDocumentSession, IRavenQueryable<T>> buildQuery)
    {
        const int PageSize = 1024;

        var skipCount = 0;
        var totalCount = 0;
        var results = new List<T>();
        RavenQueryStatistics statistics;

        do
        {
            var query = buildQuery(instance);
            results.AddRange(query.Statistics(out statistics).Skip(skipCount).Take(PageSize));
            // Include the server-side skipped results so paging stays aligned
            skipCount = results.Count + statistics.SkippedResults;
            totalCount = statistics.TotalResults;
        }
        while (results.Count < totalCount);

        return results;
    }
}
Itamar Syn-Hershko

unread,
Oct 17, 2011, 10:56:03 AM10/17/11
to rav...@googlegroups.com
No, because the session will error out on the 30th request by default. You are also returning tracked entities, which will stop being tracked the moment the session is disposed.

You probably want to make this an extension on the doc store instead, and perform the action you want within the session scope (perhaps by accepting an Action<> as a parameter).
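Such a doc-store extension might look something like this. This is a sketch only, written against the client API of that era; `ForEachPage` and its parameters are made-up names, not part of RavenDB:

```csharp
// Sketch only: RavenDB client API circa 2011; "ForEachPage" is a made-up name.
using System;
using System.Collections.Generic;
using System.Linq;
using Raven.Client;
using Raven.Client.Linq;

public static class DocumentStoreExtensions
{
    // Runs the given action over every matching document, one page per session,
    // so neither the 30-requests-per-session limit nor entity tracking builds up.
    public static void ForEachPage<T>(this IDocumentStore store,
                                      Action<IList<T>> processPage,
                                      int pageSize = 1024)
    {
        var skipped = 0;
        while (true)
        {
            using (var session = store.OpenSession())
            {
                RavenQueryStatistics stats;
                var page = session.Query<T>()
                                  .Statistics(out stats)
                                  .Skip(skipped)
                                  .Take(pageSize)
                                  .ToList();
                if (page.Count == 0)
                    break;

                processPage(page);
                // Include server-side skipped results so paging stays aligned
                skipped += page.Count + stats.SkippedResults;
            }
        }
    }
}
```

Because each page runs in its own short-lived session, tracked entities and request counts never accumulate across pages.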

rfuller

unread,
Oct 17, 2011, 11:11:19 AM10/17/11
to ravendb
This setting is a huge hassle. It makes every query have the
potential of giving back the wrong answer without any warning or
indication of a problem.

I can see the argument for session.Query<MyObject>() falling back to
some sort of default paging, but if
session.Query<MyObject>().Take(2000) returns anything less than 2000
that's simply an incorrect answer to the query. The queries need to
do what they say they do or it makes it really hard to reason about
your programs.

I had some methods that I thought were working fine, but now I've
realized that they were only working on the first 128 objects of a
given type. Oops. But you can't realize this by reading the code.
You have to know that Raven will actually give you incomplete results
when you run a query like session.Query<MyObject>(). I'd rather it
throw an Exception and force me to do the paging myself. At least
then I'd know there was a problem. (Just like if you go over the
request limit for a given session, Raven throws an Exception. It
doesn't just start ignoring you.)

It sounds to me like the best thing to do is to crank MaxPageSize
super high so that at least you can be sure that your queries are
giving you the results you asked for.

-Ryan


Ayende Rahien

unread,
Oct 17, 2011, 11:15:46 AM10/17/11
to rav...@googlegroups.com
Rfuller,
This is intentional, because it is much better to show a bit less data than to crash because you tried to load 2 million items.

We are explicitly enforcing this on the server because otherwise the "fix" would be to always say Take(int.MaxValue).
There is a reason this is configurable at the server, if you really want that, you can, but we don't allow it by default, and that is a conscious design decision.
All queries allow you to get the Statistics for the query back, and you can check whether there are more items than you brought back.
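For example, something like this (a sketch; `Product` and `IsSoldOut` are illustrative, but `RavenQueryStatistics` and `.Statistics()` are the client API being referred to):

```csharp
// Detect a truncated result set via query statistics.
RavenQueryStatistics stats;
var soldOut = session.Query<Product>()
                     .Statistics(out stats)
                     .Where(p => p.IsSoldOut)
                     .ToList();

// stats.TotalResults is the full match count even when the page was capped,
// so a truncated result set can at least be detected and handled.
if (soldOut.Count < stats.TotalResults)
{
    // page through the remainder, or fail loudly -- your choice
}
```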

rfuller

unread,
Oct 17, 2011, 11:44:17 AM10/17/11
to ravendb
"because it is *much* better to show a bit less data than to crash"

I'm not sure that's always true. Sometimes an error is better than
the wrong answer. An error I go fix, but the wrong answer I act on as
if it were the right answer.

And what if the purpose of my query is to manipulate the data returned
rather than to show the data? Not every query ends with the results
being rendered to a web page.  I've written a bunch of functions that
perform some sort of action on sets of objects, and now each one is a
bug waiting to happen because the query result set might be
incomplete. This doesn't seem 'safe' to me. It seems insidiously
dangerous.

Also every single developer on my team has to be aware of this
problem. They can't just write queries that make sense. They have to
know that sometimes Raven won't do as it's asked.

This setting doesn't make me feel safe :)

-Ryan

On Oct 17, 11:15 am, Ayende Rahien <aye...@ayende.com> wrote:
> Rfuller,
> This is intentional, because it is *much* better to show a bit less data

Matt Warren

unread,
Oct 17, 2011, 11:49:41 AM10/17/11
to ravendb
Ryan

Just to add one thing, the behaviour in Raven is very similar to the
behaviour you get in the Azure Table storage, so it's not unique.

See http://msdn.microsoft.com/en-us/library/windowsazure/dd135718.aspx
for a bit more info:

"A query against the Table service may return a maximum of 1,000 items
at one time and may execute for a maximum of five seconds. If the
result set contains more than 1,000 items, if the query did not
complete within five seconds, or if the query crosses the partition
boundary, the response includes headers which provide the developer
with continuation tokens to use in order to resume the query at the
next item in the result set. Continuation token headers may be
returned for a Query Tables operation or a Query Entities operation."

So the query doesn't throw an exception if it can't get all the items
you asked for, you have to look in the header of the response and work
it out yourself.

rfuller

unread,
Oct 17, 2011, 12:02:35 PM10/17/11
to ravendb
Matt,
Yikes. That's even worse. Now if Azure is having a 'slow' day your
site isn't just slow, it's also returning incomplete results. (Sure,
5 seconds is a really long time. But still, to me, this all seems to
create an environment where my software is very unpredictable).

A query that returned the correct results yesterday can be wrong today
because a new User signed up and it put you over the 1,000 threshold.
That's horrific!

-Ryan

On Oct 17, 11:49 am, Matt Warren <mattd...@gmail.com> wrote:
> Ryan
>
> Just to add one thing, the behaviour in Raven is very similar to the
> behaviour  you get in the Azure Table storage, so it's not unique.
>
> See http://msdn.microsoft.com/en-us/library/windowsazure/dd135718.aspx

Matt Warren

unread,
Oct 17, 2011, 12:30:01 PM10/17/11
to ravendb
Yeah, but the whole point of Azure is that it *forces* you to handle
these scenarios.  If it didn't build in these timeouts it couldn't
provide the SLAs that it does as part of its Platform-As-A-Service
thingy.

FYI, some other ones include:
- web requests are automatically re-tried if they remain idle for
too long.  Basically the load-balancer sends them to the other role
(http://stackoverflow.com/questions/6419924/the-underlying-connection-was-closed-an-unexpected-error-occurred-on-a-receive/6422219#6422219).
- when accessing SQL Azure, if your query takes too long, or Azure
wants to throttle it for another reason, you have to re-try it (see
http://windowsazurecat.com/2010/10/best-practices-for-handling-transient-conditions-in-sql-azure-client-applications/).

Chris Marisic

unread,
Oct 17, 2011, 1:39:20 PM10/17/11
to rav...@googlegroups.com


On Monday, October 17, 2011 12:02:35 PM UTC-4, rfuller wrote:
Matt,
Yikes.  That's even worse.  Now if Azure is having a 'slow' day your
site isn't just slow, it's also returning incomplete results.  (Sure,
5 seconds is a really long time.  But still, to me, this all seems to
create an environment where my software is very unpredictable).

A query that returned the correct results yesterday can be wrong today
because a new User signed up and it put you over the 1,000 threshold.
That's horrific!

-Ryan


I disagree that it's horrific; it's a specific design convention, and very transparent.

"Also every single developer on my team has to be aware of this
problem.  They can't just write queries that make sense.  They have to
know that sometimes Raven won't do as it's asked."

Developers need to understand the tool they're using, at least on a basic level. The safe-by-default record limit is a core aspect of Raven. This means you need to take paging into account. If you want to disregard this, as has been mentioned, you can do that: change it on the server.

Chris Marisic

unread,
Oct 17, 2011, 1:41:40 PM10/17/11
to rav...@googlegroups.com
I have code very similar to this in some places, Everett (this extension method would make it far cleaner than my usage, where I manually wrote it each place I needed it).  As Itamar noted, this will by default explode at 30 requests, but I think that's pretty reasonable, because by that point you have 30K documents in memory, plus probably a whole bunch more stuff related to them. If you need to load more than 30K records, you probably need to be doing fancier things anyway.

Chris Marisic

unread,
Oct 17, 2011, 1:44:04 PM10/17/11
to rav...@googlegroups.com
As a random note, if you restructured your code slightly to use yield return, you could keep it such that you only have 1024 documents in memory at a time. If you do that, just be careful to avoid multiple enumerations across it.

Everett Muniz

unread,
Oct 17, 2011, 2:06:15 PM10/17/11
to rav...@googlegroups.com
Thanks for that feedback, Itamar and Chris.  It's good that you both pointed out how the request limit affects the efficacy of the approach.  I actually did it with that consciously in mind.  It was my attempt to not totally abandon the "safe by default" philosophy... something like "a little less safe by default, when I explicitly decide I'm OK with that" :-).  It's also good that you pointed out the downside of using this approach with entities.  I use it exclusively with projections, for the very reasons you point out.

Chris, I want to understand your suggestion about using yield return better.  Can you offer any more specifics?  I get the fundamental mechanics of yield return; I'm more interested in your ideas about applying it here to limit the total number of entities in the session at any given time.  I don't want to hijack the discussion any more than I already have (wasn't my intention to begin with), so feel free to e-mail me directly.

Ayende Rahien

unread,
Oct 17, 2011, 2:06:05 PM10/17/11
to rav...@googlegroups.com
Remember that the session keeps track of all loaded entities.

rfuller

unread,
Oct 17, 2011, 7:10:34 PM10/17/11
to ravendb
The MaxPageSize seems to be ignored when I do a query like
session.Query<MyObject>(). It still will only return 128 items. It
only seems to pay attention to MaxPageSize when I use .Take(). How
can I change the 128 limit?

-Ryan

Ayende Rahien

unread,
Oct 18, 2011, 1:53:38 AM10/18/11
to rav...@googlegroups.com
The 128 is simply the default limit that we use if you don't specify any limit.
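In other words (illustrative only, using the types and numbers from this thread):

```csharp
// No limit specified: the server applies the implicit default of 128.
var implicitPage = session.Query<MyObject>().ToList();

// An explicit Take() is honored, up to the server's Raven/MaxPageSize cap.
var explicitPage = session.Query<MyObject>().Take(1024).ToList();
```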

Ayende Rahien

unread,
Oct 18, 2011, 1:54:00 AM10/18/11
to rav...@googlegroups.com
I don't think I understand; MaxPageSize is a configuration variable, defined in the Web.config of the server.


On Tue, Oct 18, 2011 at 1:22 AM, rfuller <rfu...@gmail.com> wrote:
Also is there a way to get programmatic access to the MaxPageSize
variable so that I can write a 'GetAll' function like Everett's above,
without hard coding in the MaxPageSize?

-Ryan

Chris Marisic

unread,
Oct 18, 2011, 9:10:45 AM10/18/11
to rav...@googlegroups.com
Basically, instead of repeatedly querying the server all at once and stuffing everything into your list, you want to query the server to populate your list, then start iterating over the list and yield returning from it.

You want to keep a counter of your position and know the total results. Whenever you roll over the batch size, you need to invoke the next Raven query with the correct skip/take, and then continue the process of yield returning results from your list.

The advantage here: if someone uses your method and then just calls First(), it would invoke only a single query to the server. The main thing you have to be careful about, however, is that if you do enumer.First() and then foreach(enumer), you would reprocess the enumerable. This would be more pronounced if you did Last(), as you would double the queries to the server. If you need to do multiple LINQ operations over the one enumerable, at that point you would just want to call ToList or AsEnumerable, which would then load all the documents into memory, basically as your current code does.
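A sketch of that shape, using one short-lived session per batch so the per-session request limit isn't hit (`StreamAll` and its parameters are illustrative names, not from the thread):

```csharp
// Sketch of the yield-return approach described above; names are illustrative.
// Only one batch's worth of documents is held in memory at a time.
using System.Collections.Generic;
using System.Linq;
using Raven.Client;
using Raven.Client.Linq;

public static class StreamingQueries
{
    public static IEnumerable<T> StreamAll<T>(IDocumentStore store, int pageSize = 1024)
    {
        var skipped = 0;
        while (true)
        {
            List<T> batch;
            using (var session = store.OpenSession())
            {
                RavenQueryStatistics stats;
                batch = session.Query<T>()
                               .Statistics(out stats)
                               .Skip(skipped)
                               .Take(pageSize)
                               .ToList();   // materialize before the session is disposed
                skipped += batch.Count + stats.SkippedResults;
            }

            if (batch.Count == 0)
                yield break;

            foreach (var item in batch)
                yield return item;   // lazily hand out the current batch
        }
    }
}
```

Note the multiple-enumeration caveat applies here too: enumerating the result twice re-runs all the queries, so call ToList() first if you need more than one pass.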

rfuller

unread,
Oct 18, 2011, 10:49:58 AM10/18/11
to ravendb
Is there a way to change the 128 default limit?

-Ryan

Ayende Rahien

unread,
Oct 18, 2011, 10:54:25 AM10/18/11
to rav...@googlegroups.com
No

rfuller

unread,
Oct 18, 2011, 1:57:02 PM10/18/11
to ravendb
Ayende,

Would you consider an option to completely turn off this forced paging
in a future version of RavenDB?

For example, I need session.Query<Product>().Where(p=>p.IsSoldOut ==
true) to return ALL sold out products. Not 128 of them. This forced
behind the scenes paging (behind the scenes in that it's not explicit
in the code, and it's not how normal LinqToObjects works) makes it too
hard to reason about my program by reading the code.

-Ryan

Ayende Rahien

unread,
Oct 18, 2011, 5:57:40 PM10/18/11
to rav...@googlegroups.com
Ryan,
While only fools predict the future, at current, I don't see us doing that, no.
I wouldn't mind having a convention specifying the default limit on the client, though

rfuller

unread,
Oct 18, 2011, 6:19:05 PM10/18/11
to ravendb
A convention specifying the client limit would be great.

-Ryan

Ayende Rahien

unread,
Oct 18, 2011, 6:21:21 PM10/18/11
to rav...@googlegroups.com
Can you send a pull request?

rfuller

unread,
Oct 18, 2011, 6:42:25 PM10/18/11
to ravendb
I've never done the Git thing, but I'll give it a try.

-Ryan

jalchr

unread,
Oct 19, 2011, 5:11:19 AM10/19/11
to rav...@googlegroups.com

Is there a way to turn off this entity tracking... something like NHibernate's stateless session?
I find that sometimes I don't need the built-in unit-of-work features. I just want to query, and would avoid the extra memory and processing if possible.

Ayende Rahien

unread,
Oct 19, 2011, 5:16:48 AM10/19/11
to rav...@googlegroups.com
session.DatabaseCommands.Query

Chris Marisic

unread,
Oct 19, 2011, 10:07:09 AM10/19/11
to rav...@googlegroups.com
Is session.DatabaseCommands.Query interchangeable with any usage of session.Query, except you will get untracked entities?

Chris Marisic

unread,
Oct 19, 2011, 10:10:09 AM10/19/11
to rav...@googlegroups.com
If this gets added as a convention, I would strongly recommend it be available either at both the store convention level AND the session level, OR only at the session level. I actually feel it's more appropriate ONLY at the session level, so you have to be extremely explicit where you want to override the paging usage.

Ayende Rahien

unread,
Oct 19, 2011, 10:35:06 AM10/19/11
to rav...@googlegroups.com
No, it is the low-level interface; it gives you JSON objects, not entities.

Itamar Syn-Hershko

unread,
Oct 19, 2011, 10:37:26 AM10/19/11
to rav...@googlegroups.com
If you want something interchangeable you can create an extension method that uses the session API but then calls session.Advanced.Evict() on all the entities before it returns
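Such an extension might be sketched like this. `QueryUntracked` is a made-up name, but `session.Advanced.Evict()` is the client API being referred to:

```csharp
// Sketch of the suggestion above: query via the session as usual, then
// Evict() each result so the session stops tracking it.
using System;
using System.Collections.Generic;
using System.Linq;
using Raven.Client;
using Raven.Client.Linq;

public static class UntrackedQueryExtensions
{
    public static List<T> QueryUntracked<T>(this IDocumentSession session,
        Func<IRavenQueryable<T>, IQueryable<T>> build)
    {
        var results = build(session.Query<T>()).ToList();
        foreach (var entity in results)
            session.Advanced.Evict(entity);   // remove from the unit of work
        return results;
    }
}
```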

Chris Marisic

unread,
Oct 19, 2011, 11:11:24 AM10/19/11
to rav...@googlegroups.com


On Wednesday, October 19, 2011 10:37:26 AM UTC-4, Itamar Syn-Hershko wrote:
If you want something interchangeable you can create an extension method that uses the session API but then calls session.Advanced.Evict() on all the entities before it returns

 

Isn't that nearly the worst of all worlds? You do all of the processing to initialize change tracking, and then immediately throw it away?

jalchr

unread,
Oct 19, 2011, 5:30:37 PM10/19/11
to rav...@googlegroups.com
Agree with Chris; there should be a simpler way to tell the session that we don't care about tracking the entities, so it doesn't bother with it.

var result = session.Query<MyEntity>().Customize( x => x.SkipTrackingEntities()).Skip(0).Take(1024).ToArray();

or 

var session = store.OpenStatelessSession();



Chris Marisic

unread,
Oct 19, 2011, 5:52:03 PM10/19/11
to rav...@googlegroups.com
+1 this, I'm entirely fine with either (or both).

Dody Gunawinata

unread,
Oct 20, 2011, 5:27:40 AM10/20/11
to rav...@googlegroups.com
+1

This is especially common in web application scenarios, since the web is
stateless in the first place.

--
nomadlife.org

Ayende Rahien

unread,
Oct 20, 2011, 8:43:38 AM10/20/11
to rav...@googlegroups.com
Pull request?

Itamar Syn-Hershko

unread,
Oct 20, 2011, 12:44:18 PM10/20/11
to rav...@googlegroups.com
I think the Customize() method is the best approach here; it will allow this to work with all existing session types (sync, sharded, async).
