Hi Oren / Tal,
I am indeed just manipulating the collection, but for a large DB. We used to use the patch API, but had trouble with it. We opted for a different approach which, granted, took longer and involved pulling all the docs across to the client to enact the change, but we have found it more reliable, and the logging, continuous feedback, and ability to cancel the operation and kick off a new one have been important to us. Furthermore, even the patch required a LoadDocument call to load up the manifest and decide what change to enact - we couldn't figure out how to avoid doing this every time within the patch, and that might have added to the problems we were having. (I notice Oren's eBook for RavenDB 3.0 mentions a 10k-changes limit for patches?)
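For context, the set-based patch we were running was roughly along these lines (the manifest id, field names and the pricing rule here are simplified placeholders, not our real logic):

var patch = new ScriptedPatchRequest
{
    // The manifest has to be re-loaded inside the script for every document
    // just to decide what change to apply to that document.
    Script = @"
        var manifest = LoadDocument('pricing/manifest');
        this.Price = this.Cost * manifest.Margin;"
};
_catalogStore.DatabaseCommands.UpdateByIndex(
    "Raven/DocumentsByEntityName",
    new IndexQuery { Query = "Tag:CatalogProducts" },
    patch);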
Let me elaborate a little further:
What I am doing is repricing an entire catalog. I'm testing with 200k docs, but in production it's more like 1.5M. We need to store the price on the document, as opposed to having a read-only property computed in memory, because we want to do a facet search on products that includes a price range. One other complication is that the catalog databases are built from a library database which is rebuilt every night from distributor company feeds. Often the overnight changes are small, but once every few weeks there'll be an increment that could roll over as much as 40% of the catalog (deleted products, updated products, new products never seen before).
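For reference, the price facet query is along these lines (the index name and the range boundaries are illustrative, not our production setup):

// Assumes a static index (here called "CatalogProducts/ByPrice") that indexes the Price field.
var priceFacet = new Facet<CatalogProduct>
{
    Name = p => p.Price,
    Ranges =
    {
        p => p.Price < 25,
        p => p.Price >= 25 && p.Price < 100,
        p => p.Price >= 100
    }
};
FacetResults results = session.Query<CatalogProduct>("CatalogProducts/ByPrice")
    .ToFacets(new List<Facet> { priceFacet });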
So the repricing algorithm that we have been using (working fine, but slow, which is why I am revisiting it) was to load and enact changes in batches. It resulted in code like:
batchOfMigratedItems.AddRange(
    session.Advanced.LuceneQuery<CatalogProduct>("Raven/DocumentsByEntityName")
        .WhereEquals("Tag", "CatalogProducts")
        .OrderBy("__document_id")
        .Skip(nextBatchStart + nextPageOffset)
        .Take(PAGESIZE)
        .ToArray()
);
We order by document id (rather than etag, for example) so that updated products don't make their way to the end of the queue and get picked up again on a later page (and so that other products aren't missed entirely for the same reason).
So I was revisiting this code to see if there was anything to be gained by using a frozen stream (Note: I'm not concerned about new products being added during the process - we can control this, so considered out of scope for now).
The stream idea was in case the skip and take (deep paging) were having a negative effect. Combining this with async SaveChanges calls then seemed like a reasonable way of getting through all the documents more quickly.
However, my test bed of 200k products isn't showing much of an improvement. I will retest with 1.5M and see if it makes a difference there.
For completeness, below is the stream code that I ended up with (it does work - thanks for your feedback), but as I said the data isn't showing much of a performance gain yet.
Based on this new information, can you offer any further advice?
[Note: this code hasn't made it out of test yet and won't be committed unless I can back it up with performance improvement data]
const int batchSize = 500;
const int sessionLimit = 30;
int currentBatchSizeCounter = 0;
int currentSessionLimitCounter = 0;
IAsyncDocumentSession bulkSession = null;
try
{
    // The read side streams products through a normal (sync) session; the writes go
    // through a separate async session. (The enclosing method is async so the
    // Store/SaveChanges calls can be awaited before the sessions are disposed.)
    using (var stream = _catalogStore.OpenSession())
    {
        bulkSession = _catalogStore.OpenAsyncSession();
        foreach (var product in GetAllProducts(stream))
        {
            ++currentBatchSizeCounter;
            RepriceProduct(product);
            await bulkSession.StoreAsync(product, product.Id);
            if (currentBatchSizeCounter >= batchSize)
            {
                currentBatchSizeCounter = 0;
                ++currentSessionLimitCounter;
                _logger.Log(LoggingLevel.Info, "Flushing batch of products to DB");
                if (currentSessionLimitCounter >= sessionLimit)
                {
                    // Every sessionLimit batches, flush and replace the write session
                    // so it doesn't accumulate too many tracked entities.
                    currentSessionLimitCounter = 0;
                    await bulkSession.SaveChangesAsync();
                    bulkSession.Dispose();
                    bulkSession = _catalogStore.OpenAsyncSession();
                }
                else
                {
                    await bulkSession.SaveChangesAsync();
                }
            }
        }
        // Flush the last (partial) batch before the sessions are disposed.
        _logger.Log(LoggingLevel.Info, "Flushing LAST batch of products to DB");
        await bulkSession.SaveChangesAsync();
    }
}
finally
{
    if (bulkSession != null)
    {
        bulkSession.Dispose();
    }
}
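For completeness, GetAllProducts isn't shown above; it's just a thin wrapper over the streaming API, roughly like the sketch below (this assumes all the docs share the "CatalogProducts/" key prefix - adjust to a query-based stream if that doesn't hold):

private IEnumerable<CatalogProduct> GetAllProducts(IDocumentSession session)
{
    // Results are streamed straight from the server and are not tracked
    // by the (read) session, so no Skip/Take paging is needed.
    using (var enumerator = session.Advanced.Stream<CatalogProduct>("CatalogProducts/"))
    {
        while (enumerator.MoveNext())
        {
            yield return enumerator.Current.Document;
        }
    }
}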