RavenDB 4 - Subscription processing stops after a while

64 views
Skip to first unread message

Andrej Krivulčík

unread,
Oct 20, 2017, 4:03:43 AM10/20/17
to RavenDB - 2nd generation document database
I process a collection using subscriptions, pretty much exactly as described in Listing 5.15 in https://github.com/ravendb/book/blob/v4.0/Ch05/Ch05.md .

The processing changes the documents which are 10-20 kB in size. Each batch (4096 items) is around 60 MB.

The processing works for a while, and then the program fails with the following exception:

Unhandled Exception: System.AggregateException: One or more errors occurred. ---> System.AggregateException: One or more errors occurred. ---> System.AggregateException: One or more errors occurred. ---> Raven.Client.Exceptions.Documents.Subscriptions.SubscriptionDoesNotBelongToNodeException: Subscription With Id '2059' cannot be processed by current node, it will be redirected to
   at Raven.Client.Documents.Subscriptions.Subscription`1.AssertConnectionState(SubscriptionConnectionServerMessage connectionStatus)
   at Raven.Client.Documents.Subscriptions.Subscription`1.<ReadSingleSubscriptionBatchFromServer>d__31.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.ValidateEnd(Task task)
   at Raven.Client.Documents.Subscriptions.Subscription`1.<ProcessSubscriptionAsync>d__30.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Raven.Client.Documents.Subscriptions.Subscription`1.<RunSubscriptionAsync>d__36.MoveNext()
   --- End of inner exception stack trace ---
   at Raven.Client.Documents.Subscriptions.Subscription`1.ShouldTryToReconnect(Exception ex)
   at Raven.Client.Documents.Subscriptions.Subscription`1.<RunSubscriptionAsync>d__36.MoveNext()
   --- End of inner exception stack trace ---
   at Raven.Client.Documents.Subscriptions.Subscription`1.<RunSubscriptionAsync>d__36.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.GetResult()
   at my code
   --- End of inner exception stack trace ---
   at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
   at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
   at System.Threading.Tasks.Task.Wait()
   at my code

At the time of the error, the /admin/cluster/observer/decisions endpoint shows the following (formatting mine, comments mine):

{
  "LeaderNode": "A",
  "Term": 7,
  "Suspended": false,
  "Iteration": 2400,
  "ObserverLog": [
    {
      "Date": "2017-10-20T07:38:27.9245983Z",
      "Iteration": 6,
      "Database": "data-prod",
      "Message": "Node A is currently not responding and moved to rehab"
    },
    {
      "Date": "2017-10-20T07:38:27.9261508Z",
      "Iteration": 6,
      "Database": "data-subset",
      "Message": "Node A is currently not responding and moved to rehab"
    },
    {
      "Date": "2017-10-20T07:38:27.9263065Z",
      "Iteration": 6,
      "Database": "data-subsub",
      "Message": "Node A is currently not responding and moved to rehab"
    },
    {
      "Date": "2017-10-20T07:38:27.9263554Z",
      "Iteration": 6,
      "Database": "TestFacetPerformance",
      "Message": "Node A is currently not responding and moved to rehab"
    },
    {
      "Date": "2017-10-20T07:38:27.9264052Z",
      "Iteration": 6,
      "Database": "TestInvalidIndex",
      "Message": "Node A is currently not responding and moved to rehab"
    },
    {
      "Date": "2017-10-20T07:38:32.7425663Z",
      "Iteration": 13,
      "Database": "data-subsub",
      "Message": "All nodes are not responding, promoting A from rehab"
    },
    {
      "Date": "2017-10-20T07:38:32.7430925Z",
      "Iteration": 13,
      "Database": "TestFacetPerformance",
      "Message": "All nodes are not responding, promoting A from rehab"
    },
    {
      "Date": "2017-10-20T07:38:32.7437067Z",
      "Iteration": 13,
      "Database": "TestInvalidIndex",
      "Message": "All nodes are not responding, promoting A from rehab"
    },
    {
      "Date": "2017-10-20T07:38:33.2240591Z",
      "Iteration": 14,
      "Database": "data-subsub",
      "Message": "Node A is online"
    },
    {
      "Date": "2017-10-20T07:38:33.2242940Z",
      "Iteration": 14,
      "Database": "TestFacetPerformance",
      "Message": "Node A is online"
    },
    {
      "Date": "2017-10-20T07:38:33.2245054Z",
      "Iteration": 14,
      "Database": "TestInvalidIndex",
      "Message": "Node A is online"
    },
    {
      "Date": "2017-10-20T07:38:34.7067482Z",
      "Iteration": 17,
      "Database": "data-subset",
      "Message": "All nodes are not responding, promoting A from rehab"
    },
    {
      "Date": "2017-10-20T07:38:35.2091272Z",
      "Iteration": 18,
      "Database": "data-subset",
      "Message": "Node A is online"
    },
    {
      "Date": "2017-10-20T07:39:11.7202646Z",
      "Iteration": 40,
      "Database": "data-prod",
      "Message": "All nodes are not responding, promoting A from rehab"
    },
    {
      "Date": "2017-10-20T07:39:12.2193681Z",
      "Iteration": 41,
      "Database": "data-prod",
      "Message": "Node A is online", "Comment": "End of database startup, all databases loaded. Processing started (subscription created) at 07:39:57Z"
    },
    {
      "Date": "2017-10-20T07:47:59.9453917Z",
      "Iteration": 1092,
      "Database": "data-prod",
      "Message": "Node A is currently not responding and moved to rehab", "Comment": "This is when the processing stops."
    },
    {
      "Date": "2017-10-20T07:47:59.9457484Z",
      "Iteration": 1092,
      "Database": "data-subset",
      "Message": "Node A is currently not responding and moved to rehab"
    },
    {
      "Date": "2017-10-20T07:47:59.9458942Z",
      "Iteration": 1092,
      "Database": "data-subsub",
      "Message": "Node A is currently not responding and moved to rehab"
    },
    {
      "Date": "2017-10-20T07:47:59.9459562Z",
      "Iteration": 1092,
      "Database": "TestFacetPerformance",
      "Message": "Node A is currently not responding and moved to rehab"
    },
    {
      "Date": "2017-10-20T07:47:59.9460398Z",
      "Iteration": 1092,
      "Database": "TestInvalidIndex",
      "Message": "Node A is currently not responding and moved to rehab"
    },
    {
      "Date": "2017-10-20T07:48:04.1991157Z",
      "Iteration": 1100,
      "Database": "data-prod",
      "Message": "All nodes are not responding, promoting A from rehab"
    },
    {
      "Date": "2017-10-20T07:48:04.1998809Z",
      "Iteration": 1100,
      "Database": "data-subset",
      "Message": "All nodes are not responding, promoting A from rehab"
    },
    {
      "Date": "2017-10-20T07:48:04.2004433Z",
      "Iteration": 1100,
      "Database": "data-subsub",
      "Message": "All nodes are not responding, promoting A from rehab"
    },
    {
      "Date": "2017-10-20T07:48:04.2009312Z",
      "Iteration": 1100,
      "Database": "TestFacetPerformance",
      "Message": "All nodes are not responding, promoting A from rehab"
    },
    {
      "Date": "2017-10-20T07:48:04.2014025Z",
      "Iteration": 1100,
      "Database": "TestInvalidIndex",
      "Message": "All nodes are not responding, promoting A from rehab"
    },
    {
      "Date": "2017-10-20T07:48:04.6984335Z",
      "Iteration": 1101,
      "Database": "data-prod",
      "Message": "Node A is online"
    },
    {
      "Date": "2017-10-20T07:48:04.6987047Z",
      "Iteration": 1101,
      "Database": "data-subset",
      "Message": "Node A is online"
    },
    {
      "Date": "2017-10-20T07:48:04.6988461Z",
      "Iteration": 1101,
      "Database": "data-subsub",
      "Message": "Node A is online"
    },
    {
      "Date": "2017-10-20T07:48:04.6989036Z",
      "Iteration": 1101,
      "Database": "TestFacetPerformance",
      "Message": "Node A is online"
    },
    {
      "Date": "2017-10-20T07:48:04.6989633Z",
      "Iteration": 1101,
      "Database": "TestInvalidIndex",
      "Message": "Node A is online"
    }
  ]
}

It seems that the database experiences some kind of connection problem and disconnects the client. Any ideas about how to investigate this without creating a synthetic test case?

Oren Eini (Ayende Rahien)

unread,
Oct 22, 2017, 3:52:04 AM10/22/17
to ravendb
You have just a single node, right?
From the error, it looks like it can't connect to itself, is that the IP problem you have before?

At any rate, we added some behavior to handle temporary issues like that, you can set a timeout that will be acceptable for the node to be unavailable before we throw.

Hibernating Rhinos Ltd  

Oren Eini l CEO Mobile: + 972-52-548-6969

Office: +972-4-622-7811 l Fax: +972-153-4-622-7811

 


--
You received this message because you are subscribed to the Google Groups "RavenDB - 2nd generation document database" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ravendb+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Andrej Krivulčík

unread,
Oct 22, 2017, 7:31:57 AM10/22/17
to RavenDB - 2nd generation document database
Correct, single node only, and the IP is allowed now. Otherwise, the processing didn't even start. This problem occurs every time after I start the processing, after around 25k documents, taking around 8-10 minutes. So it seems that the issue is systematic, not a random connection failure. How can I set the timeout? I'll try to configure it and see if that would help. If not, I'll also prepare a repro but probably tomorrow.

Maxim Buryak

unread,
Oct 22, 2017, 8:41:01 AM10/22/17
to rav...@googlegroups.com
Hi Andrej, 
Thanks for sharing your experience. Is there any way you could try it using the latest nightly? We've fixex the way we treat an "empty" redirect error, but also added a new field to the connection options: MaxErrorousPeriod, that limits the time during which the subscription can stay in errorous state.

At any rate, we'll also be glad to recieve logs from your server, in order to better understand the nodes "rehab" problem you experience.



Best Regards,

Hibernating Rhinos Ltd  cid:image001.png@01CF95E2.8ED1B7D0

Maxim Buryak l Core Team Developer Mobile:+972-54-217-7751

RavenDB paving the way to "Data Made Simplehttp://ravendb.net/  



--

Andrej Krivulčík

unread,
Oct 23, 2017, 9:10:42 AM10/23/17
to RavenDB - 2nd generation document database
Thanks Maxim for the suggestion.
Updating the server to today's nightly (4.0.0-nightly-20171023-0400) didn't help, the subscription was dropped anyway. After updating the client to today's nightly and setting MaxErrorousPeriod = TimeSpan.FromMinutes(10), the processing works reliably. I didn't test the updated client without the prolonged MaxErrorousPeriod, but it's 5 minutes by default so I think that it would be enough.

I assume that there are no logs present from the past attempts, as I didn't configure logging before. Could you point me to how should I enable logging properly and capture logs to be most use for you if I'll encounter more issues in the future?

Thanks
Andrej
To unsubscribe from this group and stop receiving emails from it, send an email to ravendb+u...@googlegroups.com.

Oren Eini (Ayende Rahien)

unread,
Oct 23, 2017, 9:40:09 AM10/23/17
to ravendb
In the settings.json, set Logs.Mode to be Information
Great to know that this now works properly

Hibernating Rhinos Ltd  

Oren Eini l CEO Mobile: + 972-52-548-6969

To unsubscribe from this group and stop receiving emails from it, send an email to ravendb+unsubscribe@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages