Document ID issue during bulk insert of 2M documents with subsequent patching
From: Tobias Sebring <tsebr...@gmail.com>
Date: Fri, 27 Jul 2012 09:16:03 -0700 (PDT)
Local: Fri, Jul 27 2012 12:16 pm
Subject: Document ID issue during bulk insert of 2M documents with subsequent patching

I'm using RavenDB to do bulk inserts from a large data dump, similar to the
process Ayende outlines here:
http://ayende.com/blog/4474/etl-process-using-raven. My problem is that,
unlike the Stack Overflow data dump, where Ayende uses the userId for
document IDs, the data dump I'm working with uses complex string IDs
that I would prefer not to use as my document IDs. With indexing turned
off and auto-generated document IDs, I do not know how to load the
inserted documents for patching.

Example process:
foreach (var d in datas)
{
    session.Store(new Data { Id = "data/1", DatadumpKey = "/data/2012/07/27/js9am2ms8la91" });
    session.SaveChanges();
}

foreach (var part in parts)
{
    // This does not work: there is no index, and with indexes enabled the result will always be stale.
    var data = session.Query<Data>().SingleOrDefault(p => p.DatadumpKey == part.DatadumpKey);
    data.Parts.Add(new Part { ... });
    session.SaveChanges();
}

1. What's the recommended solution to the issue explained above?
2. Is it possible to disable indexing from the client API rather than
through HTTP?
3. What's the best way to export the new database and import it (overwrite)
onto a production server in order to keep the downtime as low as possible?


 
From: "Oren Eini (Ayende Rahien)" <aye...@ayende.com>
Date: Fri, 27 Jul 2012 19:21:34 +0300

1) Keep a side document with the mapping, so you can easily do a load by
id, something like:

"references/data/2012/07/27/js9am2ms8la91" - { "DocId": "data/1"}

2) It isn't exposed to the client API.

3)  http://ravendb.net/docs/server/administration/upgrade
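
A minimal sketch of the side-document idea from point 1, for illustration only. The `DocReference` class and the `ReferenceIds` helper are hypothetical names, not part of RavenDB, and a `Dictionary` stands in for the document store so the example runs on its own. With a real session, the write would be a `session.Store(...)` of the mapping document and the lookup would be a load by ID.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical mapping document: its Id is derived from the data dump key,
// and DocId points at the real document, mirroring { "DocId": "data/1" }.
public class DocReference
{
    public string Id { get; set; }
    public string DocId { get; set; }
}

public static class ReferenceIds
{
    // "/data/2012/07/27/js9am2ms8la91" -> "references/data/2012/07/27/js9am2ms8la91"
    public static string For(string datadumpKey)
    {
        return "references/" + datadumpKey.TrimStart('/');
    }
}

public static class SideDocumentDemo
{
    public static void Main()
    {
        // Stand-in for the document store: lookups by document ID only.
        var documents = new Dictionary<string, DocReference>();

        // Import phase: save the mapping alongside the real document.
        var key = "/data/2012/07/27/js9am2ms8la91";
        var reference = new DocReference { Id = ReferenceIds.For(key), DocId = "data/1" };
        documents[reference.Id] = reference;

        // Patch phase: resolve the real ID from the key, with no index or query involved.
        DocReference found;
        if (documents.TryGetValue(ReferenceIds.For(key), out found))
        {
            Console.WriteLine(found.DocId); // prints data/1
        }
    }
}
```

Because the mapping document's ID is computed from the data dump key, the patch phase never depends on an index, so stale query results are not a concern.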


 
From: Tobias Sebring <tsebr...@gmail.com>
Date: Fri, 27 Jul 2012 13:09:35 -0700 (PDT)

For 1): I'm having trouble implementing this without a tenfold degradation
in performance. Any idea why?
I'm using the client API to import batches of 1024 documents with the
following code:
using (var session = Store.OpenSession())
{
    // Store the batch of documents first.
    foreach (var i in data.Batch)
    {
        session.Store(i);
    }

    session.SaveChanges();

    // Then store one ID mapping document per imported document.
    foreach (var i in data.Batch)
    {
        try
        {
            session.Store(new IdMapping { Id = "IdMappings/" + i.LongId, LongId = i.LongId });
        }
        catch (Raven.Client.Exceptions.NonUniqueObjectException)
        {
            // Duplicate key in the source data: remove the document again.
            session.Delete(i);
        }
    }

    session.SaveChanges();
}


 
From: "Oren Eini (Ayende Rahien)" <aye...@ayende.com>
Date: Fri, 27 Jul 2012 23:17:05 +0300

If the session throws an exception, you may no longer use the session.


 
From: Tobias Sebring <tsebr...@gmail.com>
Date: Fri, 27 Jul 2012 13:40:53 -0700 (PDT)

The session does not throw an exception. The try-catch is there to catch
attempts to insert a duplicate key, because the dataset I'm
working with is sometimes inconsistent. Are you saying the try-catch could be the
cause of the performance degradation? Only two or so
NonUniqueObjectExceptions are thrown across the entire 2M-document dataset.


 
From: "Oren Eini (Ayende Rahien)" <aye...@ayende.com>
Date: Sat, 28 Jul 2012 01:02:45 +0300

I don't _know_ what the issue is, but an exception from the session renders
its state undefined.
Do this without try/catch.
Then see how many queries you make to the service.


 
From: Tobias Sebring <tsebr...@gmail.com>
Date: Fri, 27 Jul 2012 16:17:16 -0700 (PDT)

Okay. Got it. It's working now, but I have to use a ConcurrentDictionary
to make sure no duplicate keys are saved to RavenDB. Sadly, this
means keeping 2,000,000 strings in memory throughout the entire import
process, which brings the server to its knees. Speed is about 200k
documents per minute.

Thank you for your help on this issue!


 
From: "Oren Eini (Ayende Rahien)" <aye...@ayende.com>
Date: Sat, 28 Jul 2012 09:01:06 +0300

Use a bloom filter instead
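
A minimal sketch of what a Bloom filter for duplicate-key detection might look like, for illustration only. The `BloomFilter` class, its FNV-1a based double hashing, and the sizing below are assumptions, not code from this thread or from RavenDB. At roughly 10 bits per key with 7 hash functions, 2,000,000 keys need about 2.5 MB and give a false-positive rate of around 1%.

```csharp
using System;
using System.Collections;
using System.Text;

// Bloom filter sketch: each key sets hashCount bit positions derived from two
// FNV-1a style hashes (double hashing). It can report false positives
// ("possibly seen") but never false negatives.
public class BloomFilter
{
    private readonly BitArray bits;
    private readonly int hashCount;

    public BloomFilter(int bitCount, int hashCount)
    {
        this.bits = new BitArray(bitCount);
        this.hashCount = hashCount;
    }

    // Records the key. Returns true if the key is definitely new,
    // false if it was possibly added before.
    public bool Add(string key)
    {
        byte[] data = Encoding.UTF8.GetBytes(key);
        uint h1 = Fnv1a(data, 2166136261u);
        uint h2 = Fnv1a(data, 0x9747b28cu) | 1u; // second hash, forced odd
        bool isNew = false;
        for (int i = 0; i < hashCount; i++)
        {
            int index;
            unchecked
            {
                index = (int)((h1 + (uint)i * h2) % (uint)bits.Length);
            }
            if (!bits[index])
            {
                isNew = true;
                bits[index] = true;
            }
        }
        return isNew;
    }

    private static uint Fnv1a(byte[] data, uint seed)
    {
        uint hash = seed;
        unchecked
        {
            foreach (byte b in data)
            {
                hash ^= b;
                hash *= 16777619u;
            }
        }
        return hash;
    }
}

public static class BloomFilterDemo
{
    public static void Main()
    {
        var seen = new BloomFilter(20000000, 7); // ~10 bits per key for 2M keys
        Console.WriteLine(seen.Add("/data/2012/07/27/js9am2ms8la91")); // True
        Console.WriteLine(seen.Add("/data/2012/07/27/js9am2ms8la91")); // False
    }
}
```

Note the trade-off: a `false` result only means "possibly a duplicate", so an occasional new key would be treated as a duplicate. If that is not acceptable, the rare "possibly seen" cases could be confirmed with an exact check, such as loading the mapping document by ID.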


 
From: Tobias Sebring <tsebr...@gmail.com>
Date: Sat, 28 Jul 2012 18:44:54 -0700 (PDT)

Got that fixed. Now I'm having trouble limiting the memory footprint of
RavenDB. Memory consumption gradually rises to 98% of physical RAM,
at which point Windows 7 starts displaying warnings to close the program
and other applications crash randomly.

I've tried the following things to limit memory utilization in accordance
with other threads in this group:

I've turned off indexing:
using (var webClient = new WebClient())
{
    webClient.UseDefaultCredentials = true;
    var result = webClient.UploadString(
        new Uri(new Uri("http://localhost:8080"), "/admin/stopindexing"), "POST", "");
}

Modified cache configuration settings (tried with different values - same
result):
<appSettings>
  <add key="Raven/MemoryCacheLimitPercentage" value="50" />
  <add key="Raven/MemoryCacheLimitCheckInterval" value="00:00:15" />
  <add key="Raven/MemoryCacheExpiration" value="60" />
</appSettings>

And disabled all caching:
using (Store.DatabaseCommands.DisableAllCaching())
{
    ... batch store / savechanges
}

None of these seems to have any effect on the memory usage of the application
with RavenDB running in embedded mode. Commenting out the few lines of
RavenDB code that handle batch imports results in a maximum memory usage of
125 MB on a system with 16 GB of physical RAM.


 
From: "Oren Eini (Ayende Rahien)" <aye...@ayende.com>
Date: Sun, 29 Jul 2012 08:41:11 +0300

What lines did you comment?
What build are you using?
How many items are you using per SaveChanges call?


 
From: Tobias Sebring <tsebr...@gmail.com>
Date: Sun, 29 Jul 2012 03:20:56 -0700 (PDT)

Code with commented out lines:
var bc = new BlockingCollection<IndexedBatch<TData>>();
var importTask = Task.Run(() =>
{
    bc.GetConsumingEnumerable()
        .AsParallel()
        .WithExecutionMode(ParallelExecutionMode.ForceParallelism)
        .WithMergeOptions(ParallelMergeOptions.NotBuffered)
        .ForAll(data =>
        {
            var st = Stopwatch.StartNew();
            //using (var session = Store.OpenSession())
            //{
            foreach (var i in data.Batch)
            {
                //session.Store(i);
            }

            //session.SaveChanges();
            //}

            Console.WriteLine(@"Batch imported {0} in {1} ms", data.Index, st.ElapsedMilliseconds);
        });
});

Build is from NuGet a few days ago:
  <package id="RavenDB.Client" version="1.2.2044-Unstable" />
  <package id="RavenDB.Database" version="1.2.2044-Unstable" />
  <package id="RavenDB.Embedded" version="1.2.2044-Unstable" />

Batch size is 1024, based on a recommendation I picked up here in the group. I'm
running multiple import jobs concurrently, but I also tried limiting that
with .WithDegreeOfParallelism(1) and got the same result.


 
From: "Oren Eini (Ayende Rahien)" <aye...@ayende.com>
Date: Sun, 29 Jul 2012 13:22:25 +0300

Run this sequentially, without parallelism, first.
What is the size of the documents?
Can you create a repro?


 
From: Tobias Sebring <tsebr...@gmail.com>
Date: Sun, 29 Jul 2012 06:38:21 -0700 (PDT)

Repro: https://dl.dropbox.com/u/6420016/DataImport.zip

Sequentially - same result.

Document size is small. This document is from the repro:
{
  "LongText": "f93ht2b8is1usozq3nwqbc34ti1aln9fx5if5ra7u9mz444ktxpmc8bcg9xlaav5su7wfuukmz6",
  "MediumText": "f93ht2b8is1usozq3nwqbc34ti1aln9fx5if5ra7u9mz444ktx",
  "ShortText": "f93ht2b8is1usozq3nwqbc34t",
  "NumberIntervals": [
    {
      "NumberFrom": 75,
      "NumberTo": 1985
    },
    {
      "NumberFrom": 705,
      "NumberTo": 1391
    },
    {
      "NumberFrom": 456,
      "NumberTo": 1471
    }
  ],
  "Type": "Type1",
  "Categories": [
    "Category3",
    "Category2",
    "Category4"
  ]
}


 
From: "Oren Eini (Ayende Rahien)" <aye...@ayende.com>
Date: Sun, 29 Jul 2012 16:52:27 +0300

I can't follow the code, please create a repro without all the threading
complexity there.


 
From: Tobias Sebring <tsebr...@gmail.com>
Date: Sun, 29 Jul 2012 07:57:39 -0700 (PDT)

I made the threading optional in the original repro, controlled by a
boolean at the top of main(), to show the real code before I made it
sequential:
var runInParallel = false;

Here's an updated repro with all the optional threading gone:
https://dl.dropbox.com/u/6420016/DataImport2.zip

Note that the ConcurrentDictionary is only ever accessed sequentially. I left
it in because it is one of the few things in the non-RavenDB code that
allocates a big chunk of memory.


 
From: Tobias Sebring <tsebr...@gmail.com>
Date: Sun, 29 Jul 2012 09:48:39 -0700 (PDT)

I just noticed that Clean Solution didn't delete the files under
obj/, hence the archive was quite large. This is the same
repro as DataImport2.zip, but it's 8 KB instead of 25 MB:
https://dl.dropbox.com/u/6420016/DataImport2-small.zip



 
From: "Oren Eini (Ayende Rahien)" <aye...@ayende.com>
Date: Sun, 29 Jul 2012 20:55:46 +0300

Thanks, reproduced and testing this now.


 
From: "Oren Eini (Ayende Rahien)" <aye...@ayende.com>
Date: Sun, 29 Jul 2012 20:58:44 +0300

*snort*
The problem was that you called StopIndexing, which caused us to hold data in
memory until indexing resumed.
It is a bug that wasn't exposed until this exact scenario (a large import
with indexing disabled). Since this is a rare case, we didn't notice it.
Thanks for this. It's fixed now and will be out in a few minutes.

From: Tobias Sebring <tsebr...@gmail.com>
Date: Sun, 29 Jul 2012 11:01:06 -0700 (PDT)

Sweet. Thank you so much for taking the time to help me with this. Very
much appreciated!
