Querying and modifying a large set of documents


Anders Granberg

Aug 24, 2016, 4:53:40 AM
to RavenDB - 2nd generation document database

I have a use case where a background worker running in IIS needs to get a large set of documents by query and then track any changes made.

The scenario can be simplified down to something like this:
  1. Get documents by query.
    Note: The result may well contain more than 1024 documents.
  2. Do some operations.
    Note: These operations are often a nested set of operations, not just a simple update of a single property.
  3. Save any changes made.

I’m looking for any advice on the best practice for approaching this scenario.

## My approaches so far

So far I have tried out a couple of approaches and done some measuring regarding time spent and memory used.

1) Paged load with save changes
Use one DocumentSession for every set of 1024 documents and call SaveChanges on each open session after all operations have finished.

Note: This approach keeps all the sessions open until the operations are finished so that SaveChanges can be called, which is probably a bad idea.
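A minimal sketch of what I mean by this paged approach. `Item`, `ApplyOperations`, and the query are placeholders for my real types; note that this variant saves each page before opening the next, rather than keeping all sessions open:

```csharp
// Sketch of approach 1: one session per page of up to 1024 documents.
const int pageSize = 1024;
var page = 0;
while (true)
{
    using (var session = store.OpenSession())
    {
        var docs = session.Query<Item>()
                          .Skip(page * pageSize)
                          .Take(pageSize)
                          .ToList();
        if (docs.Count == 0)
            break;

        foreach (var doc in docs)
            ApplyOperations(doc);   // placeholder for the nested operations

        session.SaveChanges();      // one round-trip per page
        page++;
    }
}
```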

Measurements
Load (11000 documents) finished after 21.7991127 seconds
Operation finished after 0.0030559 seconds
Save finished after 54.1690771 seconds
Total time 75.9712457 seconds
Memory used: CurrentProcess: 1336008 Kb, GC: 1025541 Kb


2) Paged load with patching
Same approach as 1) but using patching instead of SaveChanges.

Note: This approach disposes all sessions as soon as the load is finished.
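Roughly, the patching I do looks like this. `changedDocs`, the `Status` field, and the values are placeholders for my real data; `PatchCommandData` lets several patches go out in a single Batch round-trip:

```csharp
// Sketch of approach 2: batch one Set patch per changed document
// into a single request instead of calling SaveChanges.
var commands = changedDocs
    .Select(d => (ICommandData)new PatchCommandData
    {
        Key = d.Id,
        Patches = new[]
        {
            new PatchRequest
            {
                Type = PatchCommandType.Set,
                Name = "Status",
                Value = new RavenJValue(d.NewStatus)
            }
        }
    })
    .ToList();

store.DatabaseCommands.Batch(commands);
```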

Measurements
Load (11000 documents) finished after 20.7642839 seconds
Operation finished after 0.004087 seconds
Save finished after 20.9501986 seconds
Total time 41.7185695 seconds
Memory used: CurrentProcess: 1467656 Kb, GC: 993595 Kb


3) Stream load with patching
Use streaming to load all documents and patching to save the changed properties.

Measurements
Load (11000 documents) finished after 12.9728393 seconds
Operation finished after 0.0034327 seconds
Save finished after 22.6796775 seconds
Total time 35.6559495 seconds
Memory used: CurrentProcess: 535576 Kb, GC: 106613 Kb

Conclusion
So far the stream approach seems much faster and uses far less memory. But it's quite a bit more difficult to work with, and the result needs some manual work and magic to deserialize back to the object type the operations need.

Just as one example, I had a hard time finding the document id in the returned RavenJObject (needed to create the patches later). I did manage to get it by calling ToJsonDocument().Key on my RavenJObject, but I have no idea whether that is the intended way of doing it.

Any thoughts on these approaches or any new ideas on how to load and modify a large set of documents would be much appreciated!

Thanks

Michael Yarichuk

Aug 24, 2016, 6:27:51 AM
to RavenDB - 2nd generation document database
Query streaming is designed for handling such scenarios.
Also, can you tell me more about the manual work and magic you needed to do?
I mean, if using streaming, you can get document key and strongly typed document with the following:

var query = session.Query<Foo, FooIndex>().OrderBy(x => x.Num);
var enumerator = session.Advanced.Stream(query);
while (enumerator.MoveNext())
{
    var docId = enumerator.Current.Key;
    var stronglyTypedDocument = enumerator.Current.Document;
}


--
You received this message because you are subscribed to the Google Groups "RavenDB - 2nd generation document database" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ravendb+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Best regards,

 

Hibernating Rhinos Ltd

Michael Yarichuk l RavenDB Core Team 

Office: +972-4-622-7811 l Fax: +972-153-4-622-7811

 

RavenDB paving the way to "Data Made Simple"   http://ravendb.net/  

Anders Granberg

Aug 24, 2016, 7:43:23 AM
to RavenDB - 2nd generation document database
Thanks for your answer.

When testing out streaming I have used the code example found in the documentation:

QueryHeaderInformation queryHeaderInfo;
IEnumerator<RavenJObject> enumerator = store
    .DatabaseCommands
    .StreamQuery(
        "Orders/Totals",
        new IndexQuery { Query = "Company:companies/1" },
        out queryHeaderInfo);

while (enumerator.MoveNext())
{
    RavenJObject order = enumerator.Current;
}

That returns a RavenJObject, which requires a lot of manual work to get the doc id and then cast back to my object type.

But your code example is exactly what I need and works like a charm. Thanks! And thanks for confirming that the stream approach is the right way to handle this scenario.

Michael Yarichuk

Aug 24, 2016, 7:57:04 AM
to RavenDB - 2nd generation document database
Good to know that stuff works.



Tobi

Aug 24, 2016, 8:33:37 AM
to rav...@googlegroups.com
A word of warning:

The data that you stream is a snapshot of the data as it was when you started reading it.

Modifying all the documents while you are streaming them might get you into trouble. I did this once and ran into "Version store out of memory" errors because RavenDB had to keep too many snapshots.

Probably not much of a problem if it's only up to 1024 documents.

Tobias

Anders Granberg

Aug 24, 2016, 9:54:35 AM
to RavenDB - 2nd generation document database
Thanks for the heads up Tobi.

I will probably need to add all streamed documents to an array and let the stream finish before making any modifications. Wouldn't that solve the issue with RavenDB keeping too many snapshots? Or am I missing your point?

/Anders

Tobi

Aug 24, 2016, 10:03:03 AM
to rav...@googlegroups.com
That's correct.

The problem is with modifying docs WHILE streaming. Doing updates AFTER streaming does not have this issue. As soon as you finish/dispose the stream-enumerator, the snapshots are released.
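That buffer-then-patch flow could look roughly like this (borrowing the Foo/FooIndex types from Michael's example; the "Processed" field is a placeholder). The ids are collected while streaming, and the patches are only sent once the enumerator has been disposed:

```csharp
// Collect document ids while streaming; the using block ensures the
// stream (and the server-side snapshot) is released before patching.
var ids = new List<string>();
var query = session.Query<Foo, FooIndex>();
using (var enumerator = session.Advanced.Stream(query))
{
    while (enumerator.MoveNext())
        ids.Add(enumerator.Current.Key);
}

// The stream is closed here, so the patches run against live documents.
var commands = ids
    .Select(id => (ICommandData)new PatchCommandData
    {
        Key = id,
        Patches = new[]
        {
            new PatchRequest
            {
                Type = PatchCommandType.Set,
                Name = "Processed",
                Value = new RavenJValue(true)
            }
        }
    })
    .ToList();

store.DatabaseCommands.Batch(commands);
```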

Tobias