MDC propagation issue

288 views
Skip to first unread message

Jeff White

unread,
Jun 13, 2017, 11:49:47 AM6/13/17
to Play Framework
I know MDC thread propagation is not recommended by the Play team, but it works well 99%+ of the time, and I'd like to at least understand why it can't work 100% of the time, and then maybe I'll be convinced to give up. 

My implementation is based on this blog post (which is based James Roper's solution) http://yanns.github.io/blog/2014/05/04/slf4j-mapped-diagnostic-context-mdc-with-play-framework/

override def prepare(): ExecutionContext = new ExecutionContext {
 
// capture the MDC
  val mdcContext = MDC.getCopyOfContextMap
  val id = UUID.randomUUID().toString
 
logger.debug(s"${id} mdcContext=${mdcContext}")
 
def execute(r: Runnable) = self.execute(new Runnable {
   
def run() = {
     
// backup the callee MDC context
      val oldMDCContext = MDC.getCopyOfContextMap
      logger.debug(s"${id} oldMDCContext=${oldMDCContext}")
     
// Run the runnable with the captured context
      setContextMap(mdcContext)
     
logger.debug(s"${id} contextMap set")
     
try {
        r
.run()
     
} finally {
       
// restore the callee MDC context
        logger.debug(s"${id} Restoring to ${oldMDCContext}")
        setContextMap
(oldMDCContext)
       
logger.debug(s"${id} Restored")
     
}
   
}
 
})
 
def reportFailure(t: Throwable) = self.reportFailure(t)
}

In rare cases, the MDC is lost by the time the controller executes.

2017-06-13 06:56:44,026 [DEBUG] [application-akka.actor.default-dispatcher-1132] [requestId=KZ7iVA0XTR2/rHFoelN9sw] c.e.n.b.u.HttpContextFilter:29 requestId set to KZ7iVA0XTR2/rHFoelN9sw
2017-06-13 06:56:44,026 [DEBUG] [application-akka.actor.default-dispatcher-1132] [requestId=KZ7iVA0XTR2/rHFoelN9sw] c.e.n.a.MDCPropagatingDispatcher:67 e839a3d3-f868-4130-9ce1-f81fa64a73fb mdcContext={requestId=KZ7iVA0XTR2/rHFoelN9sw}
2017-06-13 06:56:44,026 [DEBUG] [application-akka.actor.default-dispatcher-1132] [requestId=KZ7iVA0XTR2/rHFoelN9sw] c.e.n.a.MDCPropagatingDispatcher:67 2e89056b-9773-4587-81d6-4241e4872622 mdcContext={requestId=KZ7iVA0XTR2/rHFoelN9sw}
2017-06-13 06:56:44,026 [DEBUG] [application-akka.actor.default-dispatcher-1132] [requestId=KZ7iVA0XTR2/rHFoelN9sw] c.e.n.a.MDCPropagatingDispatcher:80 cd5cebcc-71df-431b-8707-bbc5261c1731 Restoring to null
2017-06-13 06:56:44,026 [DEBUG] [application-akka.actor.default-dispatcher-1132] [] c.e.n.a.MDCPropagatingDispatcher:82 cd5cebcc-71df-431b-8707-bbc5261c1731 Restored
2017-06-13 06:56:44,026 [DEBUG] [application-akka.actor.default-dispatcher-1125] [] c.e.n.a.MDCPropagatingDispatcher:72 2e89056b-9773-4587-81d6-4241e4872622 oldMDCContext=null
2017-06-13 06:56:44,026 [DEBUG] [application-akka.actor.default-dispatcher-1125] [requestId=KZ7iVA0XTR2/rHFoelN9sw] c.e.n.a.MDCPropagatingDispatcher:75 2e89056b-9773-4587-81d6-4241e4872622 contextMap set
2017-06-13 06:56:44,027 [DEBUG] [application-akka.actor.default-dispatcher-1125] [requestId=KZ7iVA0XTR2/rHFoelN9sw] c.e.n.a.MDCPropagatingDispatcher:80 2e89056b-9773-4587-81d6-4241e4872622 Restoring to null
2017-06-13 06:56:44,027 [DEBUG] [application-akka.actor.default-dispatcher-1125] [] c.e.n.a.MDCPropagatingDispatcher:82 2e89056b-9773-4587-81d6-4241e4872622 Restored
2017-06-13 06:56:44,027 [DEBUG] [netty-event-loop-13] [] c.e.n.a.MDCPropagatingDispatcher:67 d8c74914-7524-4c2c-be33-7155d08487b8 mdcContext=null
2017-06-13 06:56:44,027 [DEBUG] [netty-event-loop-13] [] c.e.n.a.MDCPropagatingDispatcher:67 2e6bc121-5058-4230-ab9a-c39dd1e75568 mdcContext=null
2017-06-13 06:56:44,027 [DEBUG] [netty-event-loop-13] [] c.e.n.a.MDCPropagatingDispatcher:67 0ca6968c-cd06-4850-86c5-9114ec6b495f mdcContext=null
2017-06-13 06:56:44,028 [DEBUG] [netty-event-loop-13] [] c.e.n.a.MDCPropagatingDispatcher:67 0a8ac5a3-7cba-4bdd-bae1-386140a580d2 mdcContext=null
2017-06-13 06:56:44,028 [DEBUG] [application-akka.actor.default-dispatcher-1132] [] c.e.n.a.MDCPropagatingDispatcher:72 2e6bc121-5058-4230-ab9a-c39dd1e75568 oldMDCContext=null
2017-06-13 06:56:44,028 [DEBUG] [application-akka.actor.default-dispatcher-1132] [] c.e.n.a.MDCPropagatingDispatcher:75 2e6bc121-5058-4230-ab9a-c39dd1e75568 contextMap set
2017-06-13 06:56:44,028 [INFO ] [application-akka.actor.default-dispatcher-1132] [] c.e.n.b.c.Controller:24 ***MDC IS NULL HERE***

One clue I notice that is different than the happy case is the netty-event-loop thread runs between the filter and the controller, and it enters MDCPropagatingDispatcher with a null MDC.

What exactly is happening and is there any way to make this work?

jeff



Yann Simon

unread,
Jun 13, 2017, 12:06:37 PM6/13/17
to Play Framework
A possible explanation would be that a Runnable is run on a thread that is managed by another execution context.
For example, netty manages its own thread pool.
There was a current change so that it can use an ExecutionContext as well (I cannot find the doc/PR for the moment)

It should not be a problem, as you will `map` or `flatMap`  on the result, and your EC will set the MDC context again.

Propagating the MDC context is somewhat fragile, and penalize the performances. (the Maps are duplicated on each Runnable execution)
As written in another mail, you may consider another possibility: https://playframework.com/documentation/2.6.x/Highlights26#Logging-Marker-API

But if you can explain your problem precisely, I'd be very interested (and could update the post ;)

Regards, Yann

--
You received this message because you are subscribed to the Google Groups "Play Framework" group.
To unsubscribe from this group and stop receiving emails from it, send an email to play-framewor...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/play-framework/c7b0192f-1a6d-41dd-b73c-2059f14a684e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jeff White

unread,
Jun 13, 2017, 12:30:13 PM6/13/17
to Play Framework
Hi Yann,

1) logback is configured to include the MDC, which makes it possible to search logs by requestId for troubleshooting.
2) We pass the requestId to downstream services, so that they in turn set the MDC, and include it in log statements.

This gives us cross-stack request tracing.

jeff

Jeff White

unread,
Jun 15, 2017, 12:51:58 PM6/15/17
to Play Framework
Some additional thoughts:

1) It seems like the netty-event-loop threads are running in the appropriate ExecutionContext since they are going through the MDCPropagatingDispatcher.

2) Could it be a result of filters extending EssentialFilter instead of Filter? I wonder if this old issue is related https://github.com/playframework/playframework/issues/6670

jeff

Yann Simon

unread,
Jun 15, 2017, 12:54:44 PM6/15/17
to Play Framework

Jeff White

unread,
Jun 15, 2017, 12:58:39 PM6/15/17
to Play Framework
Should the netty-event-loop threads be flowing through MDCPropagatingDispatcher then? Maybe that's the problem?

jeff

Yann Simon

unread,
Jun 15, 2017, 1:01:24 PM6/15/17
to Play Framework
This how I understand it:
Netty does not uses any execution context, and though does not use the MDCPropagatingDispatcher.
As I wrote in the first email, there is a recent change to netty so that it can use an execution context.
If we change the `NettyServer` to use this new API, then we should be able to use the MDCPropagatingDispatcher.

Jeff White

unread,
Jun 15, 2017, 1:06:04 PM6/15/17
to Play Framework
Note from my log output that the netty-event-loop threads are going through the MDCPropagatingDispatcher.
I wonder if that's causing the MDC to improperly propagate.
Any chance you can find a reference to configuring the NettyServer ExecutionContext?

jeff

Yann Simon

unread,
Jun 15, 2017, 1:29:29 PM6/15/17
to Play Framework
Interesting!
The following PR allows using an Executor:
https://github.com/netty/netty/pull/1762

Jeff White

unread,
Jun 15, 2017, 1:47:24 PM6/15/17
to Play Framework
Thanks for that link. Looks like it made it into netty 4.1. 
The more I think about it, the more I think there might be something amiss in the play code. 
It seems reasonable and appropriate for the netty server to have its own execution context.
Seems like only the akka threads should be flowing through MDCPropagatingDispatcher.
I wonder if there is some condition that's causing the netty-event-loop threads to improperly "leak" into the MDCPropagatingDispatcher. 
For example, maybe the trampoline EC is being used somewhere that the play EC should be used.

jeff

Jeff White

unread,
Jun 19, 2017, 4:15:08 PM6/19/17
to Play Framework

Will Sargent

unread,
Jun 19, 2017, 7:17:10 PM6/19/17
to play-fr...@googlegroups.com
Hi Jeff,

Please file an issue / PR for this -- it looks straightforward enough to tweak, but we don't have a lot of time before 2.6.x is final.

Thanks,

--
Will Sargent
Engineer, Lightbend, Inc.


To unsubscribe from this group and stop receiving emails from it, send an email to play-framework+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/play-framework/de58622f-4971-4944-89c1-66aa68a69b40%40googlegroups.com.

Jeff White

unread,
Jun 19, 2017, 8:14:30 PM6/19/17
to Play Framework
Reply all
Reply to author
Forward
0 new messages