Deadlock during bulk migration (3.5)

Mark Warpool

Nov 29, 2017, 12:20:37 PM
to RavenDB - 2nd generation document database

I have a rather strange problem that I'm having a hard time eliminating. We are migrating one of our projects from SQL Server to RavenDB, and there's a lot of transformation that needs to be done during this migration, so I've written a Windows Forms app that does this work. It was originally written as a console app that did the same work sequentially, but it took over 24 hours to run, which is far too long, so I've converted it to WinForms and I'm trying to parallelize the work to make it go faster.

The code that does the migration starts up 4 new threads and queues up all of the work for the threads. The actual transformation/migration work is done by static members, so there is no shared state save for the work queue. Each member either: A) gets its own session, does its work, and disposes the session, or B) starts a new bulk insert operation and disposes that bulk insert when it has finished its work. The only other thing they share is the DocumentStore, which is in a singleton factory.
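The layout described above can be sketched roughly as follows. This is not the actual migration code: the `Customer` class, the server URL, and the method names are hypothetical, and the sketch assumes the RavenDB 3.5 client API (`DocumentStore`, `OpenSession`, `BulkInsert`) and a locally running server.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using Raven.Client;
using Raven.Client.Document;

// Hypothetical document type, for illustration only.
public class Customer
{
    public string Id { get; set; }
    public string Name { get; set; }
}

static class MigrationSketch
{
    // One shared store for the whole process (the "singleton factory").
    // The URL is a placeholder; the database name is taken from the request shown later in the post.
    static readonly Lazy<IDocumentStore> Store = new Lazy<IDocumentStore>(() =>
        new DocumentStore { Url = "http://localhost:8080", DefaultDatabase = "dmw-staging" }.Initialize());

    // The only other shared state: the queue of work items.
    static readonly BlockingCollection<Action<IDocumentStore>> Work =
        new BlockingCollection<Action<IDocumentStore>>();

    // Variant A: the work item opens its own session and disposes it when done.
    static void MigrateOne(IDocumentStore store, Customer doc)
    {
        using (var session = store.OpenSession())
        {
            session.Store(doc);
            session.SaveChanges();
        }
    }

    // Variant B: the work item opens its own bulk insert and disposes it when done.
    static void MigrateMany(IDocumentStore store, Customer[] docs)
    {
        using (var bulk = store.BulkInsert())
        {
            foreach (var doc in docs)
                bulk.Store(doc);
        }
    }

    static void Main()
    {
        // Four worker threads drain the shared queue until it is marked complete.
        var threads = new Thread[4];
        for (var i = 0; i < threads.Length; i++)
        {
            threads[i] = new Thread(() =>
            {
                foreach (var item in Work.GetConsumingEnumerable())
                    item(Store.Value);
            });
            threads[i].Start();
        }

        // Queue up some illustrative work, then signal that no more is coming.
        Work.Add(s => MigrateOne(s, new Customer { Name = "single" }));
        Work.Add(s => MigrateMany(s, new[] { new Customer { Name = "a" }, new Customer { Name = "b" } }));
        Work.CompleteAdding();

        foreach (var t in threads)
            t.Join();
        Store.Value.Dispose();
    }
}
```

In this arrangement every session and bulk insert goes through the same `IDocumentStore` instance, which is the setup in which the hang was observed.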

Running this in the debugger, it can run for about an hour and then everything seems to just stop.  Pausing the app in the debugger  shows that each of the threads is in a wait state, usually with 2 threads waiting at:
   Raven.Abstractions.dll!Raven.Abstractions.Util.AsyncHelpers.ExclusiveSynchronizationContext.BeginMessageLoop
And 2 others at:
   Raven.Client.Lightweight.dll!Raven.Client.Document.ChunkedRemoteBulkInsertOperation.Dispose Normal

Eventually the debugger breaks on a TaskCanceledException at a session.SaveChanges() call. The exception has a Response property that shows a 503 Service Unavailable status, although I never see this response coming back from the server (the server is running locally, and I never saw that response, or anything indicative of it, in the Raven server console).

Running outside of the debugger, the hang usually comes within just a few minutes: the app pegs all 16 of my CPUs at about 75% utilization and just stays in that state indefinitely. Running it with Fiddler open again shows no failures from the server (all responses from the server are 200s), but the very last request that I see going to the server is:

   GET /databases/dmw-staging/changes/config?id=1/5VRLYTIv0Q1/45gKD1fxF5u&command=unwatch-bulk-operation&value=64a2c4d4-3a89-4ea1-92de-d05788128799 HTTP/1.1

Inspecting the response I see that DocumentStore.DebugStatus.WatchedBulkInserts has a ton of entries.  I'm wondering if I need to clear those out, or recycle the DocumentStore after some number of iterations, or something like that.  

I realize this is a difficult issue to diagnose, so maybe someone has an idea where to look? Maybe there is something I'm doing wrong? I'm running with all the latest bits, Server Build #35241 and RavenDB.Client v3.5.4.

Oren Eini (Ayende Rahien)

Nov 30, 2017, 3:52:52 AM
to ravendb
How many bulk inserts do you have running concurrently? 
What is using the CPU? The server or client?

Hibernating Rhinos Ltd  

Oren Eini l CEO Mobile: + 972-52-548-6969

Office: +972-4-622-7811 l Fax: +972-153-4-622-7811

Mark Warpool

Nov 30, 2017, 6:51:32 AM
to RavenDB - 2nd generation document database
In theory there could be 4 bulk inserts running at the same time, 1 per worker thread.  It would be the client that ties up the CPU.

Oren Eini (Ayende Rahien)

Nov 30, 2017, 11:01:54 AM
to ravendb
Are the server and client running on the same machine?
This looks like a timeout issue, not a deadlock.

Mark Warpool

Nov 30, 2017, 11:26:49 AM
to RavenDB - 2nd generation document database
Yes, the server and client (as well as the SQL Server that I'm reading from) are all running on my local development machine. I was actually able to solve the problem by creating a separate IDocumentStore for each thread, but that is contrary to the documentation.

Oren Eini (Ayende Rahien)

Nov 30, 2017, 11:30:53 AM
to ravendb
Oh, I know what is going on.
You are running into the max number of concurrent requests per host issue. :-)
Use ServicePointManager to increase that.
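
A minimal sketch of what this suggestion might look like in a .NET Framework client. It assumes that the RavenDB client's HTTP requests go through the stack that `ServicePointManager` governs, which the thread does not confirm. `ServicePointManager.DefaultConnectionLimit` is a real .NET property; the class name `ConnectionLimitSetup` and the value 64 are illustrative only.

```csharp
using System;
using System.Net;

static class ConnectionLimitSetup
{
    // Raises the per-host connection limit for ServicePoint objects created afterwards.
    public static void Apply(int limit)
    {
        // For non-ASP.NET client applications on .NET Framework the default is 2 connections per host.
        ServicePointManager.DefaultConnectionLimit = limit;
    }

    static void Main()
    {
        Console.WriteLine("Before: " + ServicePointManager.DefaultConnectionLimit);

        // 64 is an arbitrary illustrative value, not a figure recommended in this thread.
        Apply(64);

        Console.WriteLine("After: " + ServicePointManager.DefaultConnectionLimit);
    }
}
```

Changing the default has no effect on ServicePoint objects that already exist, so a call like this would need to run at startup, before the DocumentStore issues its first request.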

Mark Warpool

Nov 30, 2017, 11:41:04 AM
to RavenDB - 2nd generation document database
Wouldn't that still be a problem with 1 document store per thread, since it's still the same number of requests (more, in fact, because the migration actually completes)?

Oren Eini (Ayende Rahien)

Nov 30, 2017, 11:43:25 AM
to ravendb
No, each uses its own HttpClient, IIRC

Romesh Wickramasekera

May 17, 2018, 2:10:26 AM
to RavenDB - 2nd generation document database
Do you have any more information/explanation on this issue? We are experiencing what seems like a very similar problem.

The client application (a Windows service) migrates data from SQL Server to RavenDB. About 5 bulk inserts each read from a separate queue, and the queues are fed by various API requests.
As of a few weeks ago, the client application has started to hang after 30-50 minutes of load. 

Remote debugging shows the same information that Mark outlined:
Threads waiting at: 
Raven.Abstractions.dll!Raven.Abstractions.Util.AsyncHelpers.ExclusiveSynchronizationContext.BeginMessageLoop
and
Raven.Client.Lightweight.dll!Raven.Client.Document.ChunkedRemoteBulkInsertOperation.Dispose Normal

"CPUs at about 75% utilization and just stays in that state indefinitely": this also rings true for us.
The main difference is that the client is sitting on its own server, so it's using all of that CPU itself.

Could you explain what you mean by "Use ServicePointManager to increase that", and what the "max number of concurrent requests per host issue" is?

Oren Eini (Ayende Rahien)

May 17, 2018, 4:16:25 AM
to ravendb
Are you sharing the same bulk insert for concurrent use? 

What is the exact build you are using?

Romesh Wickramasekera

May 17, 2018, 6:31:55 PM
to RavenDB - 2nd generation document database
3.5.4 (35215)

No, each bulk insert reads from its own concurrent queue. The queues are written to concurrently.

Oren Eini (Ayende Rahien)

May 18, 2018, 4:00:22 AM
to ravendb
Please try this with the latest version


Romesh Wickramasekera

May 18, 2018, 10:52:03 AM
to RavenDB - 2nd generation document database
We have tested upgrading to the latest version on our development machines and get the same problem.

Romesh Wickramasekera

May 19, 2018, 6:36:38 AM
to RavenDB - 2nd generation document database
Some of our developers have had some success using the same solution Mark mentioned: a separate document store for each bulk insert. Could you explain how this solution would help, and why your documentation recommends against doing this?

Oren Eini (Ayende Rahien)

May 20, 2018, 6:11:48 AM
to ravendb
It shouldn't, off the top of my head.
Can you try sending a repro for this?


Romesh Wickramasekera

May 28, 2018, 1:52:10 AM
to RavenDB - 2nd generation document database
We haven't been able to reproduce this in a simple way. It's worth noting, though, that we found another solution that stopped the deadlock: we restricted the client application to 1 core, and we no longer see the deadlock. We still don't know what caused this to start happening, or what it was really doing while stuck in BeginMessageLoop, but so far both reducing the cores and increasing the number of document stores have stopped the deadlocks.