Azure endpoints sending messages to deadletter queue

Adam Jones

Nov 6, 2014, 10:12:50 AM
to particula...@googlegroups.com
NSB 4.7.1 / NSB.Azure 5.3.8

We're seeing an intermittent issue where an endpoint will run perfectly for a while, but then get into a state where all messages sent to the endpoint will be delivered to the deadletter queue.  We are having trouble determining what is triggering this strange behavior.  In our environment we have a parent host and multiple child endpoints.  When the problem is observed it does not impact all children.  Typically we see just one of the child endpoints go into this state.  When in this state, messages delivered to the queue slowly get moved to the deadletter queue with the undeliverable message exception.  Restarting the endpoint resolves the issue.  We've turned up logging to see if there is some logged exception or event that correlates to this behavior, but so far we are not seeing anything in our logs.  What could we try next in order to identify the root cause?  Unfortunately, it is very hard to reproduce this behavior on demand, and once the endpoint enters this state, any change that requires a reboot eliminates the behavior.

Simon Cropp

Nov 7, 2014, 2:06:26 AM
to particula...@googlegroups.com
Adam

What is the exact error message?

Adam Jones

Nov 7, 2014, 9:06:46 AM
to particula...@googlegroups.com
Here are the headers from a sample message sent to the deadletter queue:

NServiceBus.MessageId String 72eac5f4-bfc1-49e2-98f0-a3db0137da7c
NServiceBus.CorrelationId String 72eac5f4-bfc1-49e2-98f0-a3db0137da7c
NServiceBus.OriginatingEndpoint String rollupgrouprecorder
$.diagnostics.originating.hostid String b4ef545a8adde112c481c382d5fe2c01
NServiceBus.MessageIntent String Publish
NServiceBus.Version String 4.7.1
NServiceBus.TimeSent String 2014-11-06 18:55:25:420441 Z
NServiceBus.OriginatingMachine String RD00155D518D54
NServiceBus.ContentType String application/json
NServiceBus.EnclosedMessageTypes String LHPT.Backend.Reporting.RollupGroupRecorder.Messages.IRollupGroupDataReportMessage, LHPT.Backend.Reporting.RollupGroupRecorder.Messages, Version=4.0.2.157, Culture=neutral, PublicKeyToken=null
WinIdName String
NServiceBus.RelatedTo String 9b9cd15a-2c13-4ee7-bd7c-a3db0137d90a
NServiceBus.ConversationId String 78e0a462-2ff4-4faa-97c4-a3db0137d7a4
DeadLetterReason String MaxDeliveryCountExceeded
DeadLetterErrorDescription String Message could not be consumed after 6 delivery attempts.
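
For anyone who wants to pull the same information, peeking the dead-letter sub-queue with the plain WindowsAzure.ServiceBus client produces this kind of dump.  The sketch below is only illustrative; the connection string and queue name are placeholders for the affected endpoint's input queue.

using System;
using Microsoft.ServiceBus.Messaging;

class PeekDeadLetters
{
    static void Main()
    {
        // Placeholder connection string and queue name (use the affected endpoint's input queue).
        var connectionString = "Endpoint=sb://yournamespace.servicebus.windows.net/;...";
        var deadLetterPath = QueueClient.FormatDeadLetterPath("yourendpointqueue");

        var client = QueueClient.CreateFromConnectionString(connectionString, deadLetterPath);

        // Peek is non-destructive: the dead-lettered messages stay in place.
        foreach (BrokeredMessage message in client.PeekBatch(10))
        {
            foreach (var property in message.Properties)
            {
                Console.WriteLine("{0} {1}", property.Key, property.Value);
            }
            Console.WriteLine();
        }
    }
}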

Adam Jones

Nov 9, 2014, 9:33:41 PM
to particula...@googlegroups.com
We had another occurrence of this issue yesterday around midnight UTC in our production environment.

We're currently using Log4Net directly for logging rather than the NServiceBus logging framework.  Is it possible to enable the NServiceBus logging framework at runtime, or does this require a reboot?  If it can be enabled at runtime, can it read the appender configuration in our Log4Net.config file?

Adam Jones

Nov 10, 2014, 1:27:10 PM
to particula...@googlegroups.com
And another incident this morning.  This is the second time we've observed the change in behavior correlate with the following error in the Windows Event Log (Application log):

License Activation (slui.exe) failed with the following error code:
hr=0xC004F074
Command-line arguments:
RuleId=502ff3ba-669a-4674-bbb1-601f34a3b968;Action=AutoActivateSilent;AppId=55c92734-d682-4d71-983e-d6ec3f16059f;SkuId=d3643d60-0c42-412d-a7d6-52e6635327f6;NotificationInterval=1440;Trigger=UserLogon;SessionId=2

I'm not sure how NSB behavior could be getting interrupted by this particular component, but so far it is the only lead we have.

Sean Feldman

Nov 10, 2014, 7:02:28 PM
to particula...@googlegroups.com
Adam,
It is a little odd that Windows would complain about licensing on Azure.  Could you open a ticket with MSFT just to rule that option out?

Yves Goeleven

Nov 12, 2014, 2:48:13 AM
to particula...@googlegroups.com
Adam,

Are there other exceptions in your logs? And how many instances of that endpoint are you running? If you're only running one, then a push to the DLQ probably means that the message doesn't deserialize or some other exception occurs on receive.

Adam Jones

Nov 12, 2014, 9:17:39 AM
to particula...@googlegroups.com
Good advice, I'll open a ticket.

Adam Jones

Nov 12, 2014, 9:24:32 AM
to particula...@googlegroups.com
Yves,

  Unfortunately we're not seeing any exceptions, either in our logging output or in the Windows event log.  As you say, I suspect that there is some issue prior to NSB attempting to deliver the message to the appropriate handler, but I'm not sure what it could be:
  • Deserialization - we've checked the messages, and the content is consistent between messages that are properly processed when the endpoint is healthy and those that go to the deadletter queue when the endpoint switches behaviors.
  • Dependencies - we're currently investigating whether one of our handler dependencies is getting corrupted in some way such that the constructor fails (i.e. a dependency whose instancing is set to InstancePerUnitOfWork is bad).
  • Infrastructure issues - in our dev environment we found many millions of timeout records for a particular endpoint, due to the earlier defect that prevented following continuation tokens.  When the endpoint started it would attempt to load the list of timeouts and eventually run the machine out of memory.  We tried deleting the timeout records just for this endpoint, but after 2.5 million records deleted we still had a long way to go, so we just deleted the whole table.  This was in a development environment, so no big loss, but I'm concerned that our production environment might be suffering from a similar problem (a rough way to check the table size is sketched after this list).
  • Something else?
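
A rough way to check the timeout table size without loading everything into memory is to page through it with continuation tokens and count.  Sketch only, using the Azure Storage table client; the connection string is a placeholder and "TimeoutData" is assumed to be the timeout table name.

using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

class CountTimeoutRecords
{
    static void Main()
    {
        // Placeholder connection string; "TimeoutData" is assumed to be the timeout table name.
        var account = CloudStorageAccount.Parse("DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...");
        var table = account.CreateCloudTableClient().GetTableReference("TimeoutData");

        var query = new TableQuery<DynamicTableEntity>();

        long count = 0;
        TableContinuationToken token = null;
        do
        {
            // Each segment returns at most 1000 entities; keep following the continuation token.
            var segment = table.ExecuteQuerySegmented(query, token);
            count += segment.Results.Count;
            token = segment.ContinuationToken;
        } while (token != null);

        Console.WriteLine("Timeout records: {0}", count);
    }
}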

We chose to use Log4Net directly due to the changes in V4 that restricted how the logging system could be configured.  Is there any way that, at runtime, we could enable NSB logging to use our current Log4Net configuration?  My fear is that NSB is generating log messages that I just can't see.

Yves Goeleven

Nov 12, 2014, 10:00:38 AM
to particula...@googlegroups.com
Would this help for injecting your own appender? 


Is there anything else that could possibly be wrong: maybe a missing header on the message or so?

PS: I registered a bug for your out-of-memory exception: https://github.com/Particular/NServiceBus.Azure/issues/208

Adam Jones

Nov 12, 2014, 10:06:24 AM
to particula...@googlegroups.com
Thanks for registering the bug.  

It wasn't clear to me how to use the SetLoggingLibrary methods to set up our custom SMTP appender.  We want the Log4Net framework to monitor our appender configuration file (Log4Net.config) live at runtime, so that if we log on to the machine and change the root log level or adjust our custom SMTP appender configuration, those changes take effect without requiring a reboot.  I was under the impression that this would be impossible using the logging configuration options available in NSB v4.
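
For what it's worth, what we're after looks roughly like the sketch below.  ConfigureAndWatch is plain log4net and gives the live-reload behavior; whether this exact SetLoggingLibrary overload and hosting interface exist in 4.7.1 is an assumption on my part.

using System.IO;
using log4net.Config;
using NServiceBus;

// Sketch only: interface/overload names not verified against NSB 4.7.1.
public class EndpointConfig : IConfigureThisEndpoint, AsA_Worker, IWantCustomLogging
{
    public void Init()
    {
        // Watch Log4Net.config so root level / SMTP appender changes made on the
        // machine are picked up at runtime without a reboot.
        SetLoggingLibrary.Log4Net(() =>
            XmlConfigurator.ConfigureAndWatch(new FileInfo("Log4Net.config")));
    }
}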

I'm fairly certain that there isn't any issue with the message itself.  Once we see the endpoint go into this behavior, restarting it resolves the issue.  We then return the messages from the deadletter queue to the input queue and they get processed correctly.

Yves Goeleven

Nov 12, 2014, 10:18:29 AM
to particula...@googlegroups.com

Hmm, that's even weirder... do you see this issue only on 4.X endpoints or also on the 5.X ones?

I hope that method can help you get some more insight into what is going on.  Alternatively, you could also try enabling IntelliTrace and looking through those logs to see if something strange is going on.

Adam Jones

Nov 12, 2014, 10:29:58 AM
to particula...@googlegroups.com
All of our endpoints are currently running NSB 4.7.1 / NSB.Azure 5.3.8.  We haven't run any tests with the new 5.x/6.x releases yet.  We did not observe this behavior while using NSB 4.1.3 / NSB.Azure 4.1.5.

Justin Alexander

Nov 12, 2014, 9:20:07 PM
to particula...@googlegroups.com
This feels like a shot in the dark, but could this be an unlogged licensing issue caused by the combination of NSB v4 and NSB.Azure v5?

I ask because, in theory, if NSB refused to process messages because the number of messages exceeded the license quota, the symptoms might mimic the serialization failure Yves speculated about.

Yves Goeleven

Nov 13, 2014, 4:53:18 AM
to particula...@googlegroups.com
Justin,

If that were the case, then resubmitting the messages from the DLQ to the main queue would also hit the max delivery count, which isn't what happens.

@Adam how many instances of that endpoint are you running?

Adam Jones

Nov 13, 2014, 9:33:35 AM
to particula...@googlegroups.com
The endpoints are set to auto-scale, with a default of a single instance and a maximum of 3 instances.  We do see that the endpoints scale from time to time.  

Yves Goeleven

Nov 13, 2014, 9:44:33 AM
to particula...@googlegroups.com
Oh ok, then a potential explanation is that transient errors occur during processing but that the message ends up in the DLQ instead of the error queue.

At the very least, your MaxDeliveryCount is configured with the wrong value.  The NSB first-level retry (FLR) count is kept per endpoint instance; each instance maintains its own retry count, while MaxDeliveryCount applies at the queue level.

So if you have concurrent readers, each of those readers has a retry count of 5.  With 3 instances that adds up to a potential of 15 retries across the instances before NSB decides to move the message to the error queue.

The ASB queue, however, allows only up to 6 delivery attempts, so it will move the message to the DLQ even though each of the 3 endpoint instances may have processed it only twice.

I suggest you bump MaxDeliveryCount to max retry count * max instance count + 1, i.e. 16 in your case, to rule this out.
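
Something along these lines should do it (sketch only; the connection string and queue name are placeholders for the affected endpoint's input queue):

using Microsoft.ServiceBus;

class BumpMaxDeliveryCount
{
    static void Main()
    {
        // Placeholder connection string and queue name.
        var namespaceManager = NamespaceManager.CreateFromConnectionString(
            "Endpoint=sb://yournamespace.servicebus.windows.net/;...");

        var queue = namespaceManager.GetQueue("yourendpointqueue");

        // 5 first-level retries per instance * 3 instances + 1 = 16
        queue.MaxDeliveryCount = 16;

        namespaceManager.UpdateQueue(queue);
    }
}

If I remember correctly the transport's AzureServiceBusQueueConfig section also exposes a MaxDeliveryCount setting for the queues it creates, but double-check that against your version.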

As I said, this is only a potential explanation; I need the exception info in order to make a better analysis.  Are you getting the logging to work?

Troglite

Nov 13, 2014, 5:14:37 PM
to particula...@googlegroups.com
Thanks Yves. That is an EXCELLENT point. :)

Adam Jones

Nov 14, 2014, 10:51:49 AM
to particula...@googlegroups.com
Thanks for the tips, Yves!  I have not yet had a chance to modify our logging setup for our endpoints.  I did submit a ticket to Microsoft in relation to the license validation errors we are seeing.  I received a response that these errors are to be expected from time to time and will have no impact on the performance of the machine.

We have not seen another case of this issue over the last few days.  I will continue to monitor for the next event and provide more details when I have them.