Concurrent Bulk Inserts


Nick P

Aug 25, 2016, 3:25:16 AM
to RavenDB - 2nd generation document database
Hello all,

We are currently working on a web application that allows users to query many remote data providers for information. The query responses could include 100,000 plus records. Each record is not large, around 0.5-1.5 KB. We are using RavenDB in our application to store the responses from the remote data providers and to allow our users to view the response data. When the remote data providers respond to a user query, we have a data processor service that parses the response and uploads each record to RavenDB. To speed up the processing we are working on multi-threading our data processing service. Each thread processes a single data response and uses the Bulk Insert operation while looping through all the response data to upload records to RavenDB. I am currently testing our data processing service on my local machine.

When attempting to process multiple (4-9) data responses of 85,000 records each simultaneously, the Bulk Insert operations began to throw timeout exceptions. I increased the WriteTimeout in the BulkInsertOptions and this resolved the issue for processing 4 responses simultaneously. When I attempted to increase to 9 simultaneously, however, the timeout reappeared. I can increase the timeout more, but continually increasing the timeout does not seem like a particularly good way to scale the amount of parallel processing. 

I found this thread ( https://groups.google.com/forum/#!msg/ravendb/1puQKCxqj8M/_zo8FhlBOAMJ ) from 2014 that suggests the Bulk Insert operation is optimized for running one at a time per RavenDB server. Does this remain true? Any suggestions for achieving many bulk inserts in parallel? Or is our best option to throttle the number of threads, or perhaps to use some sort of upload queue with a single bulk insert operation per RavenDB server?

Thanks.

Michael Yarichuk

Aug 25, 2016, 3:49:25 AM
to RavenDB - 2nd generation document database
Which build of RavenDB are you running?
Can you take a look at the disk queue length on the machine hosting RavenDB while you run a lot of Bulk Inserts?

--
You received this message because you are subscribed to the Google Groups "RavenDB - 2nd generation document database" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ravendb+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Best regards,

Michael Yarichuk | RavenDB Core Team
Hibernating Rhinos Ltd
Office: +972-4-622-7811 | Fax: +972-153-4-622-7811
RavenDB paving the way to "Data Made Simple"  http://ravendb.net/

Nick P

Aug 25, 2016, 11:03:31 AM
to RavenDB - 2nd generation document database
We are running RavenDB build #30151.

I just ran two tests with 5 threads doing bulk inserts to RavenDB as described in the original post. 

During the first test I recorded an average Avg. Disk Queue Length of 3.502 with a Max of 31.549 during the bulk inserts. During the second test I recorded an average Avg. Disk Queue Length of 2.034 with a Max of 72.545 during the bulk inserts. 


Jahmai Lay

Aug 25, 2016, 8:59:48 PM
to RavenDB - 2nd generation document database

Another option is just handling retryable errors. I would consider a timeout error to be retryable.

But it does sound like you should be able to do what you're doing anyway without a problem.
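
For illustration, the retry idea could be sketched roughly like this. The `Retry` helper and its parameters are hypothetical names, and catching `TimeoutException` is an assumption: check which exception type your RavenDB client build actually throws when a bulk insert times out. Keep in mind that a retried bulk insert may re-send documents that were already stored, so the retried unit of work has to be safe to repeat.

```csharp
using System;
using System.Threading;

public static class Retry
{
    // Runs `action`, retrying when it throws TimeoutException (an assumed
    // exception type), up to `maxAttempts` attempts in total. The delay
    // between attempts doubles each time to ease pressure on the server.
    public static void OnTimeout(Action action, int maxAttempts, TimeSpan initialDelay)
    {
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        var delay = initialDelay;
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                action();
                return;
            }
            // On the final attempt the filter is false, so the exception
            // propagates to the caller instead of being swallowed.
            catch (TimeoutException) when (attempt < maxAttempts)
            {
                Thread.Sleep(delay);
                delay = TimeSpan.FromTicks(delay.Ticks * 2);
            }
        }
    }
}
```

A worker thread would then wrap the processing of one data response in a call such as `Retry.OnTimeout(() => ProcessResponse(response), 3, TimeSpan.FromSeconds(1))`, where `ProcessResponse` stands in for your own method.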

Maxim Buryak

Aug 28, 2016, 5:02:57 AM
to rav...@googlegroups.com
Hi,
We almost always recommend having a single bulk insert operation that is accessed concurrently.
The only limitation is that the default "chunked" bulk insert is not thread safe, so you'll have to pass it BulkInsertOptions with ChunkedBulkInsertOptions value set to null:

 var bulkInsert = store.BulkInsert(options: new BulkInsertOptions
 {
     // Disable the default chunked mode, which is not thread safe.
     ChunkedBulkInsertOptions = null
 });

Don't forget to dispose of it at the end of the process.



Best Regards,

Maxim Buryak | Core Team Developer
Hibernating Rhinos Ltd
Mobile: +972-54-217-7751
Office: +972-4-622-7811 | Fax: +972-153-4-622-7811
RavenDB paving the way to "Data Made Simple"  http://ravendb.net/

Michael Yarichuk

Aug 28, 2016, 8:26:01 AM
to RavenDB - 2nd generation document database

Hmm... bulk insert in general is designed to take the available IO and network bandwidth, so having concurrent bulk inserts is not an issue by itself, except that concurrency has bulk insert threads 'fighting' for resources, hence the high disk queue length. 

From what it looks like, the solution you mentioned, an upload queue with a single bulk insert, should be a good one.
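
As a rough sketch of the upload-queue approach, many parser threads can hand records to one consumer that owns the only writer. `UploadQueue<T>` and the `store` callback are hypothetical names, not part of the RavenDB client; in this scenario the callback would presumably wrap a call to `Store` on the single bulk insert operation. The sketch does not handle a failing callback: if `store` throws, the consumer stops and the exception only surfaces from `Dispose`.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Funnels items from many producer threads into one consumer thread,
// so that only a single writer is ever active.
public sealed class UploadQueue<T> : IDisposable
{
    private readonly BlockingCollection<T> _queue;
    private readonly Task _consumer;

    // `store` is a hypothetical callback that persists one item.
    // `capacity` bounds the queue so producers cannot outrun the writer.
    public UploadQueue(Action<T> store, int capacity)
    {
        _queue = new BlockingCollection<T>(capacity);
        _consumer = Task.Factory.StartNew(() =>
        {
            // Blocks while the queue is empty and ends once
            // CompleteAdding has been called and the queue is drained.
            foreach (var item in _queue.GetConsumingEnumerable())
                store(item);
        }, TaskCreationOptions.LongRunning);
    }

    // Blocks while the queue is full, which throttles the producers.
    public void Enqueue(T item)
    {
        _queue.Add(item);
    }

    // Signals that no more items will arrive, then waits for the drain.
    public void Dispose()
    {
        _queue.CompleteAdding();
        _consumer.Wait();
        _queue.Dispose();
    }
}
```

Each data-processing thread would call `Enqueue` for every parsed record, and the queue would be disposed before the bulk insert itself is disposed. The bounded capacity is a design choice: it keeps memory use flat when the parsers are faster than the disk.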
