RavenDB/ETL questions


fschwiet

May 13, 2011, 7:00:43 PM
to ravendb
I'm writing an ETL process that loads documents from SQL into RavenDB.
There may be multiple patch operations on the same target object. The
patch operations append to an array, so I need to be sure they happen
in order. I'm using Rhino.ETL for the first time, and I like it.

Can I assume that the patch operations within a single save call will
be applied in the order they were added to the session? I only really
care about the order of patches that apply to the same object.

Can I assume that the batches will be applied in the order they're
generated within the ETL process? (With the current batching, it's
possible that updates to the same object will be split across
batches...)

The ETL is loading from SQL into RavenDB. When I run it across the
full data set (~3 million rows) I noticed the ETL process went up to
~2 GB of memory, then stayed about there. Is that normal/OK? Perhaps
I'm holding onto memory too long, or maybe it's just not aggressive
about reclaiming memory until it needs to.

I see the Stack Overflow ETL sample writes to a file; in this case I'm
using the client API to store objects. It's going to be slower that
way, but I want to keep it simple for now.
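
The pattern in question is roughly the following: a minimal sketch of
appending to an array via batched patch commands, assuming the RavenDB
client API of that era (exact namespaces and type names vary by build);
the document key "events/1" and the "Entries" array property are
hypothetical.

```csharp
using Raven.Abstractions.Commands;
using Raven.Abstractions.Data;
using Raven.Client.Document;
using Raven.Json.Linq;

var store = new DocumentStore { Url = "http://localhost:8080" };
store.Initialize();

// Each PatchCommandData appends one element to the target document's
// array property; commands within a single Batch() call are sent
// together, in the order given.
store.DatabaseCommands.Batch(new ICommandData[]
{
    new PatchCommandData
    {
        Key = "events/1",
        Patches = new[]
        {
            new PatchRequest
            {
                Type = PatchCommandType.Add,   // append to an array
                Name = "Entries",
                Value = RavenJObject.FromObject(new { Source = "sql-row" })
            }
        }
    }
});
```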

Ayende Rahien

May 14, 2011, 2:58:01 AM
to rav...@googlegroups.com
inline

On Sat, May 14, 2011 at 2:00 AM, fschwiet <fsch...@gmail.com> wrote:
I'm writing an ETL process that loads documents from SQL into RavenDB.
There may be multiple patch operations on the same target object.  The
patch operations append to an array, so I need to be sure they happen
in order.  I'm using Rhino.ETL for the first time, and I like it.

Happy to hear that
 
Can I assume that the patch operations within a single save call will
be applied in the order they were added to the session?  I only really
care about the order of patches that apply to the same object.


Yes, they are executed in order
 
Can I assume that the batches will be applied in the order they're
generated within the ETL process?  (With the current batching, it's
possible that updates to the same object will be split across
batches...)


Yes, they are executed in order
 
The ETL is loading from SQL into RavenDB.  When I run it across the
full data set (~3 million rows) I noticed the ETL process went up to
~2 GB of memory, then stayed about there.  Is that normal/OK?  Perhaps
I'm holding onto memory too long, or maybe it's just not aggressive
about reclaiming memory until it needs to.


Which memory? By default, we limit RavenDB's memory consumption to ~2-3 GB.
Are you talking about the ETL process?

Matt Warren

May 14, 2011, 3:16:41 AM
to ravendb
I ran into a memory usage issue when importing the SO dataset on my
laptop (3 GB RAM). The work-around I found was to change the
PipelineExecuter. By default it seems to put each item in a
dictionary, probably for caching.

You create a new non-caching executor like this:

public class SimplePipelineExecutor : AbstractPipelineExecuter
{
    protected override IEnumerable<Row> DecorateEnumerableForExecution(
        IOperation operation, IEnumerable<Row> enumerator)
    {
        // Stream rows straight through instead of caching them
        foreach (Row row in new EventRaisingEnumerator(operation, enumerator))
        {
            yield return row;
        }
    }
}

and use it like so:
    PipelineExecuter = new SimplePipelineExecutor();

I got the code from this thread
http://groups.google.com/group/rhino-tools-dev/browse_thread/thread/d0d23df41edb6e28?pli=1
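
Putting that together, a sketch (not from the thread) of wiring the
non-caching executor into a Rhino.ETL process; ReadFromSqlOperation and
WriteToRavenOperation are hypothetical operations you would implement
yourself:

```csharp
using Rhino.Etl.Core;

public class SqlToRavenProcess : EtlProcess
{
    public SqlToRavenProcess()
    {
        // Swap out the default (caching) executer before Execute() runs
        PipelineExecuter = new SimplePipelineExecutor();
    }

    protected override void Initialize()
    {
        // Operations run in registration order, each feeding the next
        Register(new ReadFromSqlOperation());
        Register(new WriteToRavenOperation());
    }
}

// Usage: new SqlToRavenProcess().Execute();
```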

fschwiet

May 14, 2011, 1:39:01 PM
to ravendb
I didn't write it down, but as I recall the ETL process was using ~1.6
GB while the RavenDB server process was using ~0.5 GB. I was running
the ETL within NUnit (just being lazy).

Unfortunately I had to change my password last night (long story)
before it could complete. It did get halfway through, and I was able
to restart it today. Writing 1.5 million objects this way took ~3
hours (a very rough estimate); I will likely change the code to write
the PATCH requests to a file like the SO ETL sample. The machine I ran
this on is not particularly fast, and the database was on a VM.

Next time I'll start perfmon beforehand; the Windows UI became too
unresponsive to want to do much while it ran.

jalchr

May 16, 2011, 5:32:49 AM
to rav...@googlegroups.com
Just curious: why is there a cached version of the ETL pipeline
executer in the first place?

Will using the non-cached one effectively replace the cached ones?

fschwiet

May 16, 2011, 2:38:36 PM
to ravendb
Thanks Matt, that change made a pretty big difference -- the ETL
client went from 1.6 GB to 65k and the machine's UI is no longer
frozen. As expected, the RavenDB server is growing to use the extra
memory.

On May 14, 12:16 am, Matt Warren <mattd...@gmail.com> wrote:
> I ran into a memory usage issue when importing the SO dataset on my
> laptop (3 GB RAM). The work-around I found was to change the
> PipelineExecuter. By default it seems to put each item in a
> dictionary, probably for caching.
>
> You create a new non-caching executor like this:
>
>     public class SimplePipelineExecutor : AbstractPipelineExecuter
>     {
>         protected override IEnumerable<Row>
>              DecorateEnumerableForExecution(IOperation operation,
> IEnumerable<Row> enumerator)
>         {
>             foreach (Row row in new EventRaisingEnumerator(operation,
> enumerator))
>             {
>                 yield return row;
>             }
>         }
>     }
>
> and use it like so:
>     PipelineExecuter = new SimplePipelineExecutor();
>
> I got the code from this thread http://groups.google.com/group/rhino-tools-dev/browse_thread/thread/d...

Matt Warren

May 16, 2011, 4:45:59 PM
to ravendb
In my case swapping to the non-cached version caused no issues. But I
guess in other scenarios the caching would make the whole process
faster, for instance if certain records were re-used or if you had a
complex pipeline.

Matt Warren

May 16, 2011, 4:47:21 PM
to ravendb
I saw similar results when I ran it on my laptop; glad you got it
working.

I think it'll run fine on a 64-bit machine with lots of RAM, but with
much less than that you need to use the non-cached version.

fschwiet

May 28, 2011, 4:12:08 AM
to ravendb
The ETL was split to extract to files first like the SO sample, and
I am now able to import on a faster computer.

Reading from the files into RavenDB via batch commands, the ETL client
is running at ~50 MB of memory usage, but the RavenDB server eventually
reached ~7 GB. I am using the config values below**; I wonder if there
is something else I can use to limit memory usage?

** http://ravendb.net/faq/low-memory-footprint
<add key="Raven/Esent/CacheSizeMax" value="256"/>
<add key="Raven/Esent/MaxVerPages" value="32"/>
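
(For context, those keys sit in the server's appSettings; a sketch,
assuming the standalone Raven.Server.exe.config -- the file name depends
on how the server is hosted:)

```xml
<configuration>
  <appSettings>
    <!-- values taken from http://ravendb.net/faq/low-memory-footprint -->
    <add key="Raven/Esent/CacheSizeMax" value="256"/>
    <add key="Raven/Esent/MaxVerPages" value="32"/>
  </appSettings>
</configuration>
```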

Memory usage increased in a seesaw manner, occasionally dropping ~1
GB and then regaining a bit more, drifting from 2 GB up to 7 GB.
Eventually it was stuck seesawing at 7 GB, with the seesaw steps
flattened out to where it stays at 7. This is slowing down the ETL
now, but perhaps it will still finish.

Unfortunately I don't know how to share a repro, given that the data
is closed. Would it help for me to grab some kind of memory profile
trace? Or do you have other suggestions?

Ayende Rahien

May 28, 2011, 4:44:53 AM
to rav...@googlegroups.com
A memory profile would help. Also, you might want to place some limits
on the internal caches in RavenDB.