Streaming + Bulk Insert issue


Kijana Woodard

Jun 10, 2015, 10:40:38 AM
to rav...@googlegroups.com
Server 3690 running as a Windows service.
Client 3690 running as a console app.

I'm consuming a Stream and dumping the results into Bulk Insert [batch size 2048].
Each document from the Stream results in ~3k documents given to Bulk Insert.
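Roughly, the pipeline looks like the sketch below (simplified; the document types, the id prefix, and the Expand() projection are placeholders rather than my real code - only the Stream-into-BulkInsert shape is what I'm actually doing):

using System.Collections.Generic;
using Raven.Abstractions.Data;
using Raven.Client;
using Raven.Client.Document;

namespace StreamToBulkInsertSketch
{
    // Placeholder document shapes - not the real ones.
    public class SourceDoc { public string Id { get; set; } }
    public class TargetDoc { public string SourceId { get; set; } }

    public static class Program
    {
        public static void Main()
        {
            using (IDocumentStore store = new DocumentStore { Url = "http://localhost:8080", DefaultDatabase = "test" }.Initialize())
            using (BulkInsertOperation bulkInsert = store.BulkInsert(options: new BulkInsertOptions { BatchSize = 2048 }))
            using (IDocumentSession session = store.OpenSession())
            {
                // Stream every document whose id starts with "SourceDocs/"...
                var stream = session.Advanced.Stream<SourceDoc>("SourceDocs/");
                while (stream.MoveNext())
                {
                    // ...and push the ~3k documents each one expands to into the bulk insert.
                    foreach (TargetDoc target in Expand(stream.Current.Document))
                        bulkInsert.Store(target);
                }
            }
        }

        // Stand-in for the real expansion logic.
        private static IEnumerable<TargetDoc> Expand(SourceDoc doc)
        {
            for (var i = 0; i < 3000; i++)
                yield return new TargetDoc { SourceId = doc.Id };
        }
    }
}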

Eventually it fails with "An existing connection was forcibly closed by the remote host".

The error occurs consistently around the 5 minute mark.

I can probably get around this by bringing all the data into memory. That's what the OP did in this thread: https://groups.google.com/forum/#!msg/ravendb/4t9CUtuMeQk/lkxi05fMBqsJ. But that defeats one benefit of streaming - keeping a light memory footprint.


I can confirm that this appears to be true. If I change the batch size from 2048 to 8192, the process makes it to the end [and then fails - more below].

1. Tweaking the batch size shouldn't matter. Raven Client and Raven Server should cooperate to keep the stream alive at the pace it's being consumed. In this case the "bottleneck" was pushing documents back through Bulk Insert. The documents were being inserted at ~2200/sec.

2. I thought the issue was solved when I set the batch size to 8192. However, at the end, the bulk insert still failed with {"Error":"This single use token has expired after 3098.5226051 seconds"}. For something like bulk insert, it should get its token refreshed on every batch sent.

3. After spending ~50 minutes inserting ~40 million documents, none of them made it into the database. I assume this is because of the token expiration.

4. With a smaller number of documents to insert (~360k), the time between the last "Wrote xxx..." and "Finished writing all results..." is ~3x the time to write all the documents. Is it waiting on them all to be indexed? This was surprising behavior.

Saz files:

p.s. Complete side note about something I noticed while working on this issue. If you have a collection with relatively large documents (~150k each), trying to look at a document results in the studio displaying an "unresponsive script" error dialog [Firefox]. I chose Continue and the dialog kept reappearing. After 5 attempts I just gave up and killed the browser. CPU spikes to 30%. I still have 33% free of 24GB of RAM.

Oren Eini (Ayende Rahien)

Jun 10, 2015, 11:25:25 AM
to ravendb
What is failing here? The bulk insert or the streaming?
Can't look at the SAZ files right now, but I think it might be that we are actually sending things too fast on the streaming side, and you can't process them fast enough to feed the bulk insert, right? So the streaming connection gets full, and then you get a timeout?




Kijana Woodard

Jun 10, 2015, 11:34:36 AM
to rav...@googlegroups.com
The streaming with a batch size of 2048.
The bulk insert with a batch size of 8192.

I believe the streaming connection is getting full when the batch size is 2048. 
My point #1 speaks to this situation.


Oren Eini (Ayende Rahien)

Jun 10, 2015, 11:38:36 AM
to ravendb
The problem is that I'm not sure what we _can_ do here.
Slow down the connection on the server? 

Kijana Woodard

Jun 10, 2015, 12:17:46 PM
to rav...@googlegroups.com
Have the client signal when it's ready for the next batch?

The token expiration for long-running Bulk Inserts is more straightforward, right?

Oren Eini (Ayende Rahien)

Jun 11, 2015, 6:44:21 AM
to ravendb
The single use token is valid for 2.5 minutes, and we have a max of 25 outstanding tokens.

We handle the client signaling back using Subscriptions. That ensures that we don't overload the client.
If you are just reading through everything, that is the best way to go.
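Something along these lines (a rough sketch against the 3.0 subscriptions API; the criteria, the batch size, and the document type are just example values):

using System;
using Raven.Abstractions.Data;
using Raven.Client;

public class SourceDoc { public string Id { get; set; } }

// The server won't send the next batch until the client has acknowledged the
// current one, so a slow consumer never gets flooded.
public class SourceDocObserver : IObserver<SourceDoc>
{
    public void OnNext(SourceDoc doc)
    {
        // do the per-document work here, as slowly as needed
    }

    public void OnError(Exception error) { }
    public void OnCompleted() { }
}

public static class SubscriptionSketch
{
    public static void Run(IDocumentStore store)
    {
        // Create once (or reuse an existing subscription id).
        var id = store.Subscriptions.Create(new SubscriptionCriteria
        {
            KeyStartsWith = "SourceDocs/"
        });

        var subscription = store.Subscriptions.Open<SourceDoc>(id, new SubscriptionConnectionOptions
        {
            BatchOptions = new SubscriptionBatchOptions { MaxDocCount = 2048 }
        });

        subscription.Subscribe(new SourceDocObserver());
    }
}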

Kijana Woodard

Jun 11, 2015, 11:42:12 AM
to rav...@googlegroups.com
"The single use token is valid for 2.5 minute, and we have a max of 25 outstanding tokens."

What happens if a bulk insert session takes 3+ minutes? Since the bulk insert is sending batches, wouldn't it make sense if the token is refreshed/renewed? How would one write code to say "bulk insert for 2.5 minutes, break, then resume the bulk insert process"?

"...using Subscriptions"

I looked at Data Subscriptions. They appear to live across sessions, which is nice for various scenarios. But if you want to read from the beginning every time, what's the best way to handle this? Create the subscription with a "well known name" and delete at startup? What if you have multiple threads traversing the same subscription? You could create a subscription with a guid, but that would leave a lot of orphaned subscriptions in the system. 

Any thoughts about #4? [long delay between "Wrote xxx..." and "Finished writing..."] Is Bulk Insert transactional in the sense that all documents get to the db or none do? If so, I didn't realize that. It's unfortunate behavior for certain scenarios. Take importing products from an external source: I'd rather have 90% of the products than 0%.

Is the "viewing large document hangs studio bug" reproducible?


Oren Eini (Ayende Rahien)

Jun 11, 2015, 1:22:57 PM
to ravendb
That isn't quite how it works.
When we start doing bulk insert, we ask for a token, then immediately use it to open a connection.
That connection is kept open for as long as the bulk insert is running.
Breaking for a long period during bulk insert is not recommended; the server might decide to kill the connection because of this.

Create a subscription from scratch every time. A subscription is single threaded. Note that subscriptions don't have names.

Bulk insert runs in batches; by default 512 items or so are committed in a single transaction.

Kijana Woodard

Jun 11, 2015, 1:32:19 PM
to rav...@googlegroups.com
"Breaking for a long period on bulk insert is not recommended, the server might decide to kill the connection because of this."

There were no breaks. The bulk insert process ran for ~50 minutes [lots of data]. Once the bulk insert tried to finish [exiting the using block], it got the expired token error.

"Bulk insert is running in batches, by default 512 items or so are commited in a single transaction."

You're saying each batch is committed independently? If so, that's what I thought happened. So that's why it's odd that exiting the using block takes 2-3x the combined time of sending all the batches.

Will re-read the docs on data subscriptions.

Oren Eini (Ayende Rahien)

Jun 11, 2015, 1:33:39 PM
to ravendb
I'm not sure how something like that can happen. Unless...
Did you enable streaming in Fiddler? Maybe it buffered the request?

Kijana Woodard

Jun 11, 2015, 1:41:46 PM
to rav...@googlegroups.com
http://fiddlerbook.com/fiddler/help/streaming.asp

Nope. Streaming Mode wasn't enabled. So maybe Buffering Mode affected the request.

Oren Eini (Ayende Rahien)

Jun 11, 2015, 1:44:04 PM
to ravendb
That might explain this. RavenDB never saw the request; when you were done, Fiddler sent it all, 50 minutes too late.

Arno den Uijl

Jun 14, 2015, 3:05:15 PM
to rav...@googlegroups.com
Hi,

I've been having the same issues. After stripping a lot of code out, I got to the conclusion that it is the streaming that fails. The minimal code I can reproduce it with is as follows. It generates 5000 documents and streams them back out with a Thread.Sleep of 100 ms per item (to simulate processing). It crashes after ca. 2 minutes with "An existing connection was forcibly closed by the remote host".


using System.Linq;
using System.Threading;
using NLog;
using Raven.Client;
using Raven.Client.Document;

namespace NatWa.MidOffice
{
    public static class Program
    {
        private static readonly Logger Logger = LogManager.GetCurrentClassLogger();

        public static void Main(string[] arguments)
        {
            using (var documentStore = new DocumentStore {Url = "http://localhost:8080", DefaultDatabase = "test"})
            {
                documentStore.Initialize();
                documentStore.DatabaseCommands.GlobalAdmin.DeleteDatabase("test", true);
                documentStore.DatabaseCommands.GlobalAdmin.EnsureDatabaseExists("test");

                using (IDocumentSession session = documentStore.OpenSession())
                {
                    for (int i = 1; i <= 5000; i++) // generate the 5000 documents
                    {
                        session.Store(new
                        {
                            a = i,
                            b = new string(Enumerable.Range(0, 5000).Select(i2 => 'a').ToArray()) // generate a long string to have some body
                        }, "thingy/" + i);
                    }
                    session.SaveChanges();
                }
                using (var documentSession = documentStore.OpenSession())
                {
                    var enumerator = documentSession.Advanced.Stream<dynamic>("thingy/");
                    int counter = 0;
                    while (enumerator.MoveNext())
                    {
                        Logger.Info("{0} ", counter++);
                        Thread.Sleep(100);
                    }
                }
            }
        }
    }
}
I have a Wireshark dump (using a database on a different machine, since Wireshark doesn't capture localhost, but with localhost it fails the same way):
at packet 31748, 20:33:00.54626, I see the 200 response from the server (IP .10) to the client (IP .1)
at packet 31821, 20:34:58.36879, I see three RSTs

I've also attached the log file from the code above (with a 2-hour timezone difference).
Around 20:33:00 (when we see the 200 in Wireshark) the code is processing streaming item number 12; the RSTs at 20:34:58 arrive only around the processing of item 1194.

So it looks like the client receives all the data very early (after a couple of seconds) and crashes well after it has already received all the data and gotten a 200 for it.

It looks similar to the opening post to me. Hope this helps.

cheers,

Arno
logging.log

Grisha Kotler

Jun 15, 2015, 2:53:02 AM
to rav...@googlegroups.com
Thanks. We'll look into this.


Oren Eini (Ayende Rahien)

Jun 15, 2015, 3:56:34 AM
to ravendb
Take a look here for why this is happening: http://ayende.com/blog/170401/timeouts-tcp-and-streaming-operations

If you want to handle this, you can use subscriptions, which require a client ack before sending the next batch, and will handle this much more easily.


Arno den Uijl

Jun 15, 2015, 5:06:02 AM
to rav...@googlegroups.com
Hi,

Thanks for the quick reply. I just read http://ayende.com/blog/170401/timeouts-tcp-and-streaming-operations. If that is the problem, then doesn't that mean streaming is very unpredictable (and therefore unusable) when streaming a lot of documents and doing things with them? How does one predict when the TCP buffers are going to overflow? You would have to take into account the performance of the server, the performance of the client, the processing you do with the streamed items, etc. I wouldn't know how to work with this in practice. Reading everything into memory as quickly as possible and hoping you're fast enough is an option, but that seems exactly opposite to what streaming is for.

I can think of two ways to handle differences in client and server speed:
1. With back-pressure: if the client can't keep up, the server must slow down.
2. With buffering: if the client can't keep up, the client must buffer.

The current implementation seems to be of type 2. We buffer, but already at reasonably low amounts of data the buffer breaks, and with an exception that keeps you guessing. From using streams in a couple of cases I've had the experience that they are ideal for streaming large amounts of documents and processing them, but this issue contradicts that. From following discussions in the Rx forum I know back-pressure is non-trivial to implement, so buffering is a valid alternative, but then I would presume you would buffer in memory. Then you can get into the situation where you pull in such a big amount of data that you get an OOM exception, but I think that is easier for the user to reason about than overflowing TCP buffers.

Data subscriptions look like a nice feature, but seem to be aimed at a somewhat different use case.
In our case we have a large number of documents and we want to go over them once, in a custom order. We have hundreds of thousands of events in our system. We create a query that goes over all events in a custom order and process them streaming (roughly the shape sketched below); it fits like a glove with RavenDB's streaming. From what I can see, data subscriptions only order by etag, so that poses a problem. Data subscriptions also seem to be aimed at persistent processing over a longer period, not a single sweep over your data. Doing a single sweep using data subscriptions feels like using the wrong tool for the job and brings some extra overhead code.
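Something like this (a sketch only; "Event", the "Events/ByDate" index and the Date field are made-up placeholders for our real types):

using System;
using System.Linq;
using Raven.Client;

public class Event
{
    public string Id { get; set; }
    public DateTime Date { get; set; }
}

public static class CustomOrderSweep
{
    public static void Run(IDocumentStore store)
    {
        using (IDocumentSession session = store.OpenSession())
        {
            // A query against an index that defines the custom order.
            var query = session.Query<Event>("Events/ByDate").OrderBy(e => e.Date);

            using (var stream = session.Advanced.Stream(query))
            {
                while (stream.MoveNext())
                {
                    // per-event processing happens here, while the stream is still open
                    var ev = stream.Current.Document;
                }
            }
        }
    }
}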

But the main issue, I think, is that streaming in its current implementation is unpredictable even when streaming only 5000 documents. The only advice I feel I can give my colleagues at the moment is to read everything into memory.

kind regards,

Arno



 


Oren Eini (Ayende Rahien)

Jun 15, 2015, 6:21:49 AM
to ravendb
The idea with streaming is that you are sending the data somewhere else, usually either a fast in-memory computation or sending it out to a file / client.
It isn't meant to be stoppable.

You cannot predict this, but there is enough wiggle room that you don't usually need to be concerned with it unless you are actually doing significant work _per item_ on a large set.

Streaming is intended for very large result sets, and buffering those in memory might cause a LOT of issues. The idea is that you read directly from the I/O buffer, with as little involvement as possible.

Note that you can quite easily stream to memory and handle it that way, but that is a design choice you make given your own experience and knowledge of the system, not something that RavenDB can make for you.
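For example, a variation on the repro above that drains the stream into memory first and only does the slow per-item work afterwards (a sketch; it trades the small memory footprint for not stalling the connection):

// Replace the second session block in Arno's repro above with something like this
// (also needs "using System.Collections.Generic;" at the top).
var buffered = new List<dynamic>();
using (var documentSession = documentStore.OpenSession())
{
    var enumerator = documentSession.Advanced.Stream<dynamic>("thingy/");
    while (enumerator.MoveNext())
        buffered.Add(enumerator.Current.Document);   // drain as fast as the server sends
}

foreach (var doc in buffered)
{
    Thread.Sleep(100);   // the slow "processing" now runs after the stream is closed
}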



Arno den Uijl

Jun 16, 2015, 7:47:50 AM
to rav...@googlegroups.com
Hi,

I've been digging around further. http://ayende.com/blog/170401/timeouts-tcp-and-streaming-operations states that the server pushes down the data (from beginning to end) and that, if the client can't keep up, packets start dropping from the TCP buffers. I think this is not exactly correct.

I've been running the server from source, and in the StreamToClient(...) method in StreamsController I've added some console logging to the addDocument action that writes how many documents have been pushed to the PushStreamContent:

Action<JsonDocument> addDocument = doc =>
{
    Console.WriteLine("streaming " + i++); // i is a counter I added alongside this logging
    timeout.Delay();
    doc.ToJson().WriteTo(writer);
    writer.WriteRaw(Environment.NewLine);
};

  • When I start streaming 50,000 documents to the client, I see the first ca. 6906 pushed from the server, and then the server stops pushing documents
  • At that point in Wireshark I see the TCP transmission window is now completely full, and a couple of TCP Zero Window segments and TCP Zero Window Probes
  • After a while in Wireshark I see a TCP window update, after which the server starts pushing more documents, up to document number 9790
  • At that point in Wireshark I see the window is again full
So it seems that the server does pause when the TCP stack on the client can't take any more.
The pause-and-resume process usually repeats a couple of times, depending on the processing time the client takes per document.

When running the server and the client in debug, it's always the server that throws the first exception. Specifically:
Message: ""
Stacktrace: "   at Microsoft.Owin.Host.HttpListener.RequestProcessing.ExceptionFilterStream.Write(Byte[] buffer, Int32 offset, Int32 count)
   at System.IO.Compression.DeflateStream.WriteDeflaterOutput(Boolean isAsync)
   at System.IO.Compression.DeflateStream.Write(Byte[] array, Int32 offset, Int32 count)
   at System.IO.StreamWriter.Flush(Boolean flushStream, Boolean flushEncoder)
   at System.IO.StreamWriter.Write(String value)
   at Raven.Imports.Newtonsoft.Json.Utilities.JavaScriptUtils.WriteEscapedJavaScriptString(TextWriter writer, String s, Char delimiter, Boolean appendDelimiters, Boolean[] charEscapeFlags, StringEscapeHandling stringEscapeHandling, Char[]& writeBuffer) in d:\Git\ravendb\Imports\Newtonsoft.Json\Src\Newtonsoft.Json\Utilities\JavaScriptUtils.cs:line 190

with inner exception 
Message:  {"The network connection was aborted by the local system"}
Stacktrace: "   at System.Net.HttpResponseStream.Write(Byte[] buffer, Int32 offset, Int32 size)
   at Microsoft.Owin.Host.HttpListener.RequestProcessing.ExceptionFilterStream.Write(Byte[] buffer, Int32 offset, Int32 count)"

After that the client also throws an exception (which isn't strange if the server has blown up the connection).

So, in short: the server does wait on the client, but at some point it bails out, and the server throws first. I don't know if this helps in any way, but I think it's good to know that there are no client buffers dropping data; it's the server that gets annoyed that it can't push its data.

kind regards,

Arno

Oren Eini (Ayende Rahien)

Jun 16, 2015, 9:34:34 AM
to ravendb
Yes, that is pretty much what I said.

I just simplified it a bit.

In TCP, the sender buffers data in memory until it receives an ACK from the client.
The sender will typically send several packets before getting an ACK from the client (the TCP window).

Let us examine what is going on on the client side. We have the TCP buffer, and slow reads from it.
So eventually it gets full. The sender still sends us stuff, but we have nowhere to put it, so we just discard it.
The sender will re-send it eventually.

The sender is writing to its TCP buffers, which send the data out.
However, when the buffer gets full (the client isn't responding in time), the sender's TCP stack starts to slow things down; it blocks until there is enough space in the buffer.
When the client is able to respond fast enough and free the sender-side buffer, it resumes working.
If it doesn't, a timeout occurs, and the sender closes the connection.
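A toy .NET illustration of that last step, nothing RavenDB-specific: a sender whose peer never reads eventually blocks on Send(), and once the send timeout passes it gives up and the connection is torn down.

using System;
using System.Net;
using System.Net.Sockets;

class SlowReaderDemo
{
    static void Main()
    {
        var listener = new TcpListener(IPAddress.Loopback, 9999);
        listener.Start();

        var slowClient = new TcpClient("127.0.0.1", 9999); // a "client" that never reads
        var senderSide = listener.AcceptTcpClient();
        senderSide.Client.SendTimeout = 5000;              // like a server-side write timeout

        var payload = new byte[64 * 1024];
        try
        {
            while (true)
                senderSide.Client.Send(payload);           // fills both TCP buffers, then blocks
        }
        catch (SocketException ex)
        {
            // Once the buffers stay full past the timeout, the send fails and the
            // sender tears the connection down - the peer then sees a reset.
            Console.WriteLine(ex.SocketErrorCode);
            senderSide.Close();
            slowClient.Close();
        }
    }
}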





Nguyễn Huy

Sep 14, 2015, 3:29:54 AM
to RavenDB - 2nd generation document database
Hi all,

I've hit this issue again. Is there any way to avoid it?

On Tuesday, June 16, 2015 at 20:34:34 UTC+7, Oren Eini wrote: