Re: [RavenDB] Document ID issue during bulk insert of 2M documents with subsequent patching


Oren Eini (Ayende Rahien)

Jul 27, 2012, 12:21:34 PM
to rav...@googlegroups.com
1) Keep a side document with the mapping, so you can easily do a load by id, something like:

"references/data/2012/07/27/js9am2ms8la91" - { "DocId": "data/1"}

2) It isn't exposed to the client API.

3)  http://ravendb.net/docs/server/administration/upgrade 
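
For point 1, here is a minimal sketch of the side-document idea, assuming the mapping id is simply "references" plus the datadump key. The `IdMapping` class, the `MappingIds.For` helper and the `InMemoryDocs` class are hypothetical names made up for illustration; `InMemoryDocs` is only a dictionary standing in for load-by-id so the example runs without a server. With the real client the two lookups would presumably be two `session.Load<T>(id)` calls, which do not depend on any index.

```csharp
using System;
using System.Collections.Generic;

public sealed class Data
{
    public string Id { get; set; }
    public string DatadumpKey { get; set; }
}

// Hypothetical side document: its id is derived from the datadump key,
// and its body points at the real document id.
public sealed class IdMapping
{
    public string Id { get; set; }
    public string DocId { get; set; }
}

public static class MappingIds
{
    // "/data/2012/07/27/js9am2ms8la91" -> "references/data/2012/07/27/js9am2ms8la91"
    public static string For(string datadumpKey)
    {
        return "references/" + datadumpKey.TrimStart('/');
    }
}

// Hypothetical in-memory stand-in for a store that supports load by id.
public sealed class InMemoryDocs
{
    private readonly Dictionary<string, object> _docs = new Dictionary<string, object>();

    public void Store(string id, object doc)
    {
        _docs[id] = doc;
    }

    public T Load<T>(string id) where T : class
    {
        object doc;
        return _docs.TryGetValue(id, out doc) ? doc as T : null;
    }
}

public static class Program
{
    public static void Main()
    {
        var docs = new InMemoryDocs();

        // Import step: store the document and its side mapping together.
        var data = new Data { Id = "data/1", DatadumpKey = "/data/2012/07/27/js9am2ms8la91" };
        var mappingId = MappingIds.For(data.DatadumpKey);
        docs.Store(data.Id, data);
        docs.Store(mappingId, new IdMapping { Id = mappingId, DocId = data.Id });

        // Patch step: two loads by id, no query and no index involved.
        var mapping = docs.Load<IdMapping>(MappingIds.For("/data/2012/07/27/js9am2ms8la91"));
        var loaded = docs.Load<Data>(mapping.DocId);
        Console.WriteLine(loaded.Id); // prints "data/1"
    }
}
```

The point of the design is that the datadump key never has to be queried: it is turned into a document id, and loading by id works even with indexing turned off.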

On Fri, Jul 27, 2012 at 7:16 PM, Tobias Sebring <tseb...@gmail.com> wrote:
I'm using RavenDB to do bulk inserts from a large datadump, similar to the process outlined by Ayende here: http://ayende.com/blog/4474/etl-process-using-raven. My problem is that, unlike the Stack Overflow datadump where Ayende uses the userId for document IDs, the datadump I'm working with uses complex string IDs that I would prefer not to use as my document IDs. With indexing turned off and auto-generated document IDs, I do not know how to load the inserted documents for patching.

Example process:
foreach (var item in datas)
{
    session.Store(new Data { Id = "data/1", DatadumpKey = "/data/2012/07/27/js9am2ms8la91" });
    session.SaveChanges();
}

foreach (var part in parts)
{
    // This does not work: there is no index, and with indexes enabled the result will always be stale.
    var data = session.Query<Data>().SingleOrDefault(p => p.DatadumpKey == part.DatadumpKey);
    data.Parts.Add(new Part { ... });
    session.SaveChanges();
}

1. What's the recommended solution to the issue explained above?
2. Is it possible to disable indexing from the client API rather than through HTTP?
3. What's the best way to export the new database and import it (overwrite) onto a production server in order to keep the downtime as low as possible?

Tobias Sebring

Jul 27, 2012, 4:09:35 PM
to rav...@googlegroups.com
For 1): I'm having trouble implementing this without a tenfold degradation in performance. Any idea why?
I'm using the client API to import batches of 1024 documents with the following code:
using (var session = Store.OpenSession())
{
    // Store the batch of documents.
    foreach (var i in data.Batch)
    {
        session.Store(i);
    }
    session.SaveChanges();

    // Store one side mapping document per item, keyed by the long ID.
    foreach (var i in data.Batch)
    {
        try
        {
            session.Store(new IdMapping { Id = "IdMappings/" + i.LongId, LongId = i.LongId });
        }
        catch (Raven.Client.Exceptions.NonUniqueObjectException)
        {
            session.Delete(i);
        }
    }
    session.SaveChanges();
}

Oren Eini (Ayende Rahien)

Jul 27, 2012, 4:17:05 PM
to rav...@googlegroups.com
If the session throws an exception, you may no longer use the session.

Tobias Sebring

Jul 27, 2012, 4:40:53 PM
to rav...@googlegroups.com
The session does not throw an exception. The try-catch is there to catch attempts to insert a duplicate key, because the dataset I'm working with is sometimes inconsistent. Are you saying the try-catch could be the cause of the performance degradation? Only two or so NonUniqueObjectExceptions are thrown in the entire 2M document dataset.

Oren Eini (Ayende Rahien)

Jul 27, 2012, 6:02:45 PM
to rav...@googlegroups.com
I don't _know_ what the issue is, but an exception from the session renders its state undefined.
Do this without try/catch.
Then see how many queries you make to the service.

Tobias Sebring

Jul 27, 2012, 7:17:16 PM
to rav...@googlegroups.com
Okay. Got it. It's working now, but I have to use a ConcurrentDictionary to make sure no duplicate keys are saved to RavenDB. Sadly, this means keeping 2,000,000 strings in memory throughout the entire import process, which brings the server to its knees. Speed is about 200k documents per minute.

Thank you for your help on this issue!

Oren Eini (Ayende Rahien)

Jul 28, 2012, 2:01:06 AM
to rav...@googlegroups.com
Use a Bloom filter instead.
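
A minimal sketch of what that could look like, not a tuned implementation. It assumes a plain bit array with indexes derived from two seeded FNV-1a hashes (double hashing); the `BloomFilter` class and its members are hypothetical names, not part of RavenDB. At roughly 10 bits per key and 7 hash functions, 2,000,000 keys need about 2.4 MB, with a false-positive rate in the region of 1%.

```csharp
using System;
using System.Collections;

// Bloom filter sketch: answers "definitely not seen" or "possibly seen".
public sealed class BloomFilter
{
    private readonly BitArray _bits;
    private readonly int _hashCount;

    public BloomFilter(int bitCount, int hashCount)
    {
        _bits = new BitArray(bitCount);
        _hashCount = hashCount;
    }

    // Seeded FNV-1a; different seeds give different hash functions.
    private static uint Fnv1a(string s, uint seed)
    {
        unchecked
        {
            uint h = 2166136261u ^ seed;
            foreach (char c in s)
            {
                h ^= c;
                h *= 16777619u;
            }
            return h;
        }
    }

    // Double hashing: the i-th index is (h1 + i * h2) mod bitCount.
    private int IndexFor(string key, int i)
    {
        uint h1 = Fnv1a(key, 0u);
        uint h2 = Fnv1a(key, 0x9747b28cu) | 1u; // force a non-zero step
        ulong combined = h1 + (ulong)i * h2;
        return (int)(combined % (ulong)_bits.Length);
    }

    public void Add(string key)
    {
        for (int i = 0; i < _hashCount; i++)
        {
            _bits[IndexFor(key, i)] = true;
        }
    }

    // false: the key was never added. true: the key was probably added.
    public bool MightContain(string key)
    {
        for (int i = 0; i < _hashCount; i++)
        {
            if (!_bits[IndexFor(key, i)]) return false;
        }
        return true;
    }
}

public static class Program
{
    public static void Main()
    {
        var seen = new BloomFilter(20000000, 7);
        seen.Add("/data/2012/07/27/js9am2ms8la91");
        Console.WriteLine(seen.MightContain("/data/2012/07/27/js9am2ms8la91")); // prints "True"
    }
}
```

A "possibly seen" answer can be a false positive, so for those keys you would still need a real check, for example a load of the mapping document by id, before treating the key as a duplicate. A "definitely not seen" answer is always safe to store.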

Tobias Sebring

Jul 28, 2012, 9:44:54 PM
to rav...@googlegroups.com
Got that fixed. Now I'm having trouble limiting the memory footprint of RavenDB. Memory consumption gradually rises to 98% of physical RAM, at which point Windows 7 starts displaying warnings to close the program and other applications crash randomly.

I've tried the following things to limit memory utilization in accordance with other threads in this group:

I've turned off indexing:
using (var webClient = new WebClient())
{
    webClient.UseDefaultCredentials = true;
    var result = webClient.UploadString(new Uri(new Uri("http://localhost:8080"), "/admin/stopindexing"), "POST", "");
}

Modified cache configuration settings (tried with different values - same result):
<appSettings>
    <add key="Raven/MemoryCacheLimitPercentage" value="50" />
    <add key="Raven/MemoryCacheLimitCheckInterval" value="00:00:15" />
    <add key="Raven/MemoryCacheExpiration" value="60" />
</appSettings>

And disabled all caching:
using (Store.DatabaseCommands.DisableAllCaching())
{
    ... batch store / savechanges
}

None of these seems to have any effect on the memory usage of the application, with RavenDB running in embedded mode. Commenting out the few lines of RavenDB code that handle the batch import results in a maximum memory usage of 125 MB on a system with 16 GB of physical RAM.

Oren Eini (Ayende Rahien)

Jul 29, 2012, 1:41:11 AM
to rav...@googlegroups.com
What lines did you comment?
What build are you using?
How many items are you using per SaveChanges call?

Tobias Sebring

Jul 29, 2012, 6:20:56 AM
to rav...@googlegroups.com
Code with commented out lines:
var bc = new BlockingCollection<IndexedBatch<TData>>();
var importTask = Task.Run(() =>
{
    bc.GetConsumingEnumerable()
        .AsParallel()
        .WithExecutionMode(ParallelExecutionMode.ForceParallelism)
        .WithMergeOptions(ParallelMergeOptions.NotBuffered)
        .ForAll(data =>
        {
            var st = Stopwatch.StartNew();
            //using (var session = Store.OpenSession())
            //{
            foreach (var i in data.Batch)
            {
                //session.Store(i);
            }
            //session.SaveChanges();
            //}

            Console.WriteLine(@"Batch imported {0} in {1} ms", data.Index, st.ElapsedMilliseconds);
        });
});


Build is from NuGet a few days ago:
  <package id="RavenDB.Client" version="1.2.2044-Unstable" />
  <package id="RavenDB.Database" version="1.2.2044-Unstable" />
  <package id="RavenDB.Embedded" version="1.2.2044-Unstable" />

Batch size is 1024, based on a recommendation I picked up here in the group. I'm running multiple import jobs concurrently, but I also tried limiting that with .WithDegreeOfParallelism(1) and got the same result.
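
For reference, a rough sketch of how the fixed-size batches could be cut from a sequence. The `InBatchesOf` extension method is a hypothetical helper written for this example, not something from the RavenDB client; each yielded list would then be stored inside its own session with a single SaveChanges call.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Batching
{
    // Lazily splits a sequence into lists of at most `size` items.
    public static IEnumerable<List<T>> InBatchesOf<T>(this IEnumerable<T> source, int size)
    {
        if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));

        var batch = new List<T>(size);
        foreach (var item in source)
        {
            batch.Add(item);
            if (batch.Count == size)
            {
                yield return batch;
                batch = new List<T>(size);
            }
        }

        // Emit the final, possibly smaller, batch.
        if (batch.Count > 0) yield return batch;
    }
}

public static class Program
{
    public static void Main()
    {
        // 2500 items in batches of 1024 gives batches of 1024, 1024 and 452.
        foreach (var batch in Enumerable.Range(0, 2500).InBatchesOf(1024))
        {
            Console.WriteLine(batch.Count);
        }
    }
}
```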

Oren Eini (Ayende Rahien)

Jul 29, 2012, 6:22:25 AM
to rav...@googlegroups.com
Run this sequentially, without parallelism, first.
What is the size of the documents?
Can you create a repro?

Tobias Sebring

Jul 29, 2012, 9:38:21 AM
to rav...@googlegroups.com

Repro: https://dl.dropbox.com/u/6420016/DataImport.zip

Sequentially - same result.

Document size is small. This document is from the repro:
{
  "LongText": "f93ht2b8is1usozq3nwqbc34ti1aln9fx5if5ra7u9mz444ktxpmc8bcg9xlaav5su7wfuukmz6",
  "MediumText": "f93ht2b8is1usozq3nwqbc34ti1aln9fx5if5ra7u9mz444ktx",
  "ShortText": "f93ht2b8is1usozq3nwqbc34t",
  "NumberIntervals": [
    {
      "NumberFrom": 75,
      "NumberTo": 1985
    },
    {
      "NumberFrom": 705,
      "NumberTo": 1391
    },
    {
      "NumberFrom": 456,
      "NumberTo": 1471
    }
  ],
  "Type": "Type1",
  "Categories": [
    "Category3",
    "Category2",
    "Category4"
  ]
}

Oren Eini (Ayende Rahien)

Jul 29, 2012, 9:52:27 AM
to rav...@googlegroups.com
I can't follow the code, please create a repro without all the threading complexity there.

Tobias Sebring

Jul 29, 2012, 10:57:39 AM
to rav...@googlegroups.com
I made the threading optional in the original repro, controlled by a boolean at the top of main(), to show the real code before I made it sequential:
var runInParallel = false;

Here's an updated repro with all the optional threading gone: https://dl.dropbox.com/u/6420016/DataImport2.zip

Note that the ConcurrentDictionary is only ever accessed sequentially. I left it in because it is one of the few things in the non-RavenDB code that allocates a big chunk of memory.

Tobias Sebring

Jul 29, 2012, 12:48:39 PM
to ravendb
I just noticed that Clean Solution doesn't delete the files under
obj/, hence the attached archive was quite large. This is the same
repro as DataImport2.zip, but it's 8 KB instead of 25 MB:
https://dl.dropbox.com/u/6420016/DataImport2-small.zip

Oren Eini (Ayende Rahien)

Jul 29, 2012, 1:55:46 PM
to rav...@googlegroups.com
Thanks, reproduced and testing this now.

Oren Eini (Ayende Rahien)

Jul 29, 2012, 1:58:44 PM
to rav...@googlegroups.com
*snort*
The problem was that you called StopIndexing, which caused us to hold stuff in memory until indexing resumed.
It is a bug that wasn't exposed until this exact scenario (a large import with indexing disabled). Since that is a rare case, we didn't notice it.
Thanks for this, fixed now and will be out in a few minutes.

Tobias Sebring

Jul 29, 2012, 2:01:06 PM
to rav...@googlegroups.com
Sweet. Thank you so much for taking the time to help me with this. Very much appreciated!