NServiceBus Saga Azure Worker Host Stops Receiving Messages


Corith Malin

Aug 9, 2016, 11:50:45 AM
to Particular Software

Helpful information to include

Product name: NServiceBus

Version: 

  • NHibernate 4.0.4.4000
  • NHibernate.SqlAzure 2.1.3
  • NServiceBus 5.2.14
  • NServiceBus.Autofac 5.0.0
  • NServiceBus.Azure 6.2.3
  • NServiceBus.Azure.Transports.WindowsAzureServiceBus 6.4.0
  • NServiceBus.Host 6.0.0
  • NServiceBus.Hosting.Azure 6.2.4
  • NServiceBus.NHibernate 6.2.7
  • WindowsAzure.ServiceBus 2.7.6

Stacktrace: None

Description:

So we have many NServiceBus services running as Azure Worker Roles.  Only one logical service uses NHibernate for saga persistence.  For that logical service we have 5 worker role instances (so 5 instances of the same cloud service, which translates to 5 PaaS instances).
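
To make the shape of the setup concrete, the endpoint configuration for this combination looks roughly like the sketch below (type names are placeholders, the exact host wiring may differ slightly, and this is not our actual code):

using NServiceBus;
using NServiceBus.Persistence;

// Rough sketch of an NServiceBus 5.x endpoint hosted as an Azure Worker Role,
// combining the legacy Azure Service Bus transport with NHibernate saga persistence.
public class EndpointConfig : IConfigureThisEndpoint, AsA_Worker
{
    public void Customize(BusConfiguration busConfiguration)
    {
        // Transport from NServiceBus.Azure.Transports.WindowsAzureServiceBus
        busConfiguration.UseTransport<AzureServiceBus>();

        // Saga storage via NServiceBus.NHibernate (SQL Azure through NHibernate.SqlAzure)
        busConfiguration.UsePersistence<NHibernatePersistence>();
    }
}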


All the services that don't use NHibernate saga persistence are working just fine.  But this one service (we call it the Orchestrator service) constantly STOPS receiving messages from its Azure Service Bus queue.  The CPU usage in Task Manager shows 0%, occasionally ticking up to 1% before falling back to 0%.  More curiously, there is always ONE instance of the Orchestrator that stays up and keeps receiving messages normally.  Which instance stays up varies.


So we write a log entry each time we receive a command and another when we exit the handler.  In our log we'll see things like:

command received

command processed

and then it just stops.  Nothing more is written to the log (log4net).  I've tried turning the logging level up to debug, and no new information is written to the event log.  The only way to fix the issue is to restart the instance or to kill the Microsoft Windows Azure Worker Host process (WaWorkerHost).  Windows Azure automatically detects that the worker host is down and restarts it.  Normal operation resumes, but within the next 20 minutes or so (it varies, and can be stable for hours or days) the process drops back down to 0% CPU.
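
For reference, the entry/exit logging described above is roughly of this shape (OrchestratorSaga, OrchestrationSagaData and StartOrchestration are illustrative stand-ins, not our real types):

using System;
using log4net;
using NServiceBus;
using NServiceBus.Saga;

public class StartOrchestration : ICommand
{
    public Guid OrderId { get; set; }
}

public class OrchestrationSagaData : ContainSagaData
{
    public virtual Guid OrderId { get; set; }
}

public class OrchestratorSaga : Saga<OrchestrationSagaData>,
    IAmStartedByMessages<StartOrchestration>
{
    static readonly ILog Log = LogManager.GetLogger(typeof(OrchestratorSaga));

    public void Handle(StartOrchestration message)
    {
        Log.InfoFormat("command received: {0}", message.OrderId);

        // ... orchestration work against the NHibernate-backed saga data ...

        Log.InfoFormat("command processed: {0}", message.OrderId);
    }

    protected override void ConfigureHowToFindSaga(SagaPropertyMapper<OrchestrationSagaData> mapper)
    {
        mapper.ConfigureMapping<StartOrchestration>(m => m.OrderId).ToSaga(s => s.OrderId);
    }
}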


We're not really sure where to begin testing, but a few things stand out:

  1. It's strange to us that ONE instance of the service always remains up and processing messages.
  2. It's the only service we have that runs a saga and uses NHibernate for saga persistence.
  3. It's the only service that has Azure Service Bus duplicate detection enabled on its queue (see the sketch just below).
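
A quick way to confirm the duplicate detection settings on that queue, using the WindowsAzure.ServiceBus SDK already listed above (the queue path and connection string are placeholders):

using System;
using Microsoft.ServiceBus;

static class QueueSettingsCheck
{
    public static void Run(string connectionString)
    {
        // Read the queue description and print the deduplication-related settings.
        var namespaceManager = NamespaceManager.CreateFromConnectionString(connectionString);
        var queue = namespaceManager.GetQueue("orchestrator");

        Console.WriteLine("RequiresDuplicateDetection: {0}", queue.RequiresDuplicateDetection);
        Console.WriteLine("DuplicateDetectionHistoryTimeWindow: {0}", queue.DuplicateDetectionHistoryTimeWindow);
    }
}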
- Craig

Corith Malin

Aug 9, 2016, 3:14:25 PM
to Particular Software
Additional information: we've noticed that it's never just one instance that stops responding.  It always takes out multiple instances (though not always a consistent number).

- Craig

Yves Goeleven

Aug 10, 2016, 5:50:01 AM
to Particular Software
Do you see any exception occurring at the time it stops, or some time before it?

How does authentication happen? Is it using ACS or SAS for auth? We had a similar report last week where it looked like the ASB client could not contact ACS anymore and therefore did not return any results (it didn't throw an exception; the result set was just empty).
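
For reference, the two styles are usually recognisable from the connection string itself (illustrative shapes only, with placeholder values, not real credentials):

SAS:  Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=<key>
ACS:  Endpoint=sb://<namespace>.servicebus.windows.net/;SharedSecretIssuer=owner;SharedSecretValue=<key>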

timde...@hotmail.com

Aug 11, 2016, 4:58:09 PM
to Particular Software
Hi Yves.  I'm working with Craig on this project and I wanted to add some information uncovered during debugging this afternoon.

I was running procdump.exe, set to log all first-chance exceptions without generating a dump file, hoping I would see something new.  I saw approximately five instances of the following exception while the process was stuck at 0%:

Exception: E0434F4D.System.ServiceModel.FaultException`1 ("A timeout has occurred during the operation. TrackingId:[snipped], SystemTracker:myservicename.retries, Timestamp:8/11/2016 4:21:25 PM")

I haven't seen this exception during normal operation so it may be significant.  It appears that NServiceBus is eating this exception because we haven't seen it in any log files (even with debug output enabled on the rolling file appender).  This also makes debugging difficult if we can't see retries (or perhaps are losing them due to repeated timeouts?).

The system was also under fairly significant load at the time, so perhaps there is a correlation between high load and failed attempts to contact the retries queue?  We did observe that the system became more stable if we dropped both the number of instances and the concurrency level.
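
For anyone following along: the concurrency level referred to here is the standard NServiceBus 5 TransportConfig setting, which can be overridden from code along these lines (the numbers are illustrative, not our production values):

using NServiceBus.Config;
using NServiceBus.Config.ConfigurationSource;

class LimitConcurrency : IProvideConfiguration<TransportConfig>
{
    public TransportConfig GetConfiguration()
    {
        return new TransportConfig
        {
            MaximumConcurrencyLevel = 4,           // threads pulling from the input queue per instance
            MaximumMessageThroughputPerSecond = 0, // 0 = unlimited
            MaxRetries = 5                         // first-level retries before second-level retries kick in
        };
    }
}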

Yet another possibility is that the retries queue was somehow corrupted.  As part of our debugging we deleted the retries queue and recreated it.  The problem has not come back in the last few hours and we're going to keep watching to see how the system performs overnight.  Even if that turns out to be the fix, it still seems that the above-mentioned FaultExceptions should be logged when debug is enabled, as timeout errors would be useful information to have on hand.
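
For completeness, a step like that can be done in the portal or with the management API from the WindowsAzure.ServiceBus package; a sketch (queue path as reported in the SystemTracker above, connection string is a placeholder):

using Microsoft.ServiceBus;

static class RetriesQueueReset
{
    public static void Run(string connectionString)
    {
        var namespaceManager = NamespaceManager.CreateFromConnectionString(connectionString);

        // Drop the suspect second-level-retries queue. The endpoint recreates its queues
        // on start-up when queue creation is enabled, so an explicit CreateQueue call is
        // usually not needed here.
        if (namespaceManager.QueueExists("myservicename.retries"))
        {
            namespaceManager.DeleteQueue("myservicename.retries");
        }
    }
}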

I'll follow up tomorrow after we see how things go.

Thank you!

Yves Goeleven

Aug 12, 2016, 4:25:57 AM
to Particular Software
Hi Tim,

It's starting to sound very similar to this case: https://github.com/Particular/NServiceBus.AzureServiceBus/issues/246. Not exactly the same, but very close.

We haven't been able to reproduce the problem yet; it seems to have disappeared after removing certain topics. My guess is that there are certain connectivity conditions (given the timeout) that the transport is not handling correctly; we just don't know what those conditions are yet.

If you have a project that you can share with us, and which shows the behavior in a reproducible way, that would be a very good asset to continue our search. Let me know if the problem persists.

Kind regards,
Yves

timde...@hotmail.com

Aug 15, 2016, 11:05:27 AM
to Particular Software
Unfortunately I don't have a way to reproduce the problem now that we've recreated the retries queue for that service.  As of this morning we're still running OK, so it does appear that something was wrong with that queue.  My request would be that the debug log output inside NSB be enhanced.  Procdump.exe confirmed that an exception was being thrown and that NSB was catching it (otherwise we'd have crashed with an unhandled exception), but NSB was not logging that an exception occurred.
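
One thing that may be worth double-checking here (a sketch, assuming the separate NServiceBus.Log4Net adapter package, which is not in the version list above): in NServiceBus 5 the internal logging only flows into log4net appenders if the adapter is registered before the endpoint is configured; otherwise NSB falls back to its built-in logger and transport warnings can end up somewhere else entirely.

using NServiceBus;
using NServiceBus.Logging;

static class LoggingSetup
{
    public static void Wire()
    {
        // Route NServiceBus internal logging (including transport warnings/errors)
        // through log4net so it lands in the same rolling file appender.
        // Must run before the bus/endpoint configuration is created.
        LogManager.Use<Log4NetFactory>();
    }
}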

We're going to keep watching for a repeat occurrence, and if that happens I'll see if I can write a test app that hits the retries queue to gather additional information.

Sean Feldman

Aug 18, 2016, 2:14:48 PM
to Particular Software
Hi Tim,

You're correct that NSB catches ASB exceptions and doesn't always rethrow them. Some exceptions are not supposed to be rethrown (throttling, timeouts, etc.).
If you could provide a repro for a case where you think we should not be swallowing exceptions (or should provide better log information), that would be great and very welcome.
With NSB v6 we're revamping a lot of that. It would be nice to hear your thoughts on the changes coming in v6.

Thank you.

Sean Feldman

Aug 24, 2016, 3:26:07 PM
to particula...@googlegroups.com, dae...@outlook.com
Hi Tim,

The transport doesn't treat the .retries queue in any special way; all queues are handled the same way, with messages received via AzureServiceBusDequeueStrategy. When an exception is thrown during processing, the NSB.ASB transport delegates decision making to the NServiceBus core, so the exception is passed on and eventually logged.

With repro code it would be easier to dive into the details and see what's happening.