Hey guys.
(NOTE: I apologize if I duplicated this topic. I posted this once, then waited for an hour, and since it didn't show up yet I'm assuming something went wrong and I'm reposting it here. Luckily I had the text saved :P )
I _think_ I found a deadlock in the event router just now. This issue has apparently been in the event router for quite some time (
http://groups.google.com/group/mobicents-public/browse_thread/thread/36956c8c8b5eeb50/3fe9eba98441b1ba?lnk=gst&q=sbbentityfactory#3fe9eba98441b1ba ), but since it seems to be a timing issue it's very hard to reproduce. We hadn't seen it before, but we recently made some SBB service changes that made the issue appear on around 10% of our calls, so I started to investigate, and this is what I found:
Symptoms: System sometimes stalls for 10s during a call and then we get a SLEEException:
2012-03-28 14:11:56,693 WARN [EventRoutingTaskImpl] Failed to find next sbb entity to deliver the event EventContext[event type id = EventTypeID[name=our.FirstEvent,vendor=EmblaCom,version=1.0.0] , event = FirstEvent{} , local ac = ACH=NULL>68b4de3a:13658f01043:-7e24 , address = null , serviceID = null] in ACH=NULL>68b4de3a:13658f01043:-7e24
javax.slee.SLEEException: timeout while acquiring lock java.util.concurrent.locks.ReentrantLock@60bcbb4[Locked by thread pool-16-thread-1] for sbb entity with id /ServiceID[name=FirstService,vendor=com.emblacom,version=1.0.0-SNAPSHOT]/_____cdefad40-78c6-11e1-8321-00215e23b2f4
at org.mobicents.slee.runtime.sbbentity.SbbEntityFactoryImpl.lockOrFail(SbbEntityFactoryImpl.java:327)
at org.mobicents.slee.runtime.sbbentity.SbbEntityFactoryImpl.getSbbEntity(SbbEntityFactoryImpl.java:197)
at org.mobicents.slee.runtime.activity.SbbEntityComparator.compare(SbbEntityComparator.java:56)
at org.mobicents.slee.runtime.activity.SbbEntityComparator.compare(SbbEntityComparator.java:38)
at java.util.TreeMap.put(TreeMap.java:545)
at java.util.TreeSet.add(TreeSet.java:255)
at org.mobicents.slee.runtime.activity.ActivityContextImpl.getSortedSbbAttachmentSet(ActivityContextImpl.java:408)
at org.mobicents.slee.runtime.eventrouter.routingtask.NextSbbEntityFinder.next(NextSbbEntityFinder.java:86)
at org.mobicents.slee.runtime.eventrouter.routingtask.EventRoutingTaskImpl.routeQueuedEvent(EventRoutingTaskImpl.java:282)
at org.mobicents.slee.runtime.eventrouter.routingtask.EventRoutingTaskImpl.run(EventRoutingTaskImpl.java:126)
at org.mobicents.slee.runtime.eventrouter.EventRouterExecutorImpl$EventRoutingTaskStatsCollector.run(EventRouterExecutorImpl.java:73)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
2012-03-28 14:11:56,693 WARN [EventRoutingTaskImpl] Failed to find next sbb entity to deliver the event EventContext[event type id = EventTypeID[name=our.SecondEvent,vendor=EmblaCom,version=1.0] , event = SecondEvent{} , local ac = RA:EmblaCustomRa:CustomActivityHandle{} , address = null , serviceID = null] in RA:EmblaCustomRa:CustomActivityHandle{}
javax.slee.SLEEException: timeout while acquiring lock java.util.concurrent.locks.ReentrantLock@103bcf50[Locked by thread pool-18-thread-1] for sbb entity with id /ServiceID[name=SecondService,vendor=com.emblacom=1.0.0]/-120582819____
at org.mobicents.slee.runtime.sbbentity.SbbEntityFactoryImpl.lockOrFail(SbbEntityFactoryImpl.java:327)
at org.mobicents.slee.runtime.sbbentity.SbbEntityFactoryImpl.getSbbEntity(SbbEntityFactoryImpl.java:197)
at org.mobicents.slee.runtime.activity.SbbEntityComparator.compare(SbbEntityComparator.java:56)
at org.mobicents.slee.runtime.activity.SbbEntityComparator.compare(SbbEntityComparator.java:38)
at java.util.TreeMap.put(TreeMap.java:545)
at java.util.TreeSet.add(TreeSet.java:255)
at org.mobicents.slee.runtime.activity.ActivityContextImpl.getSortedSbbAttachmentSet(ActivityContextImpl.java:408)
at org.mobicents.slee.runtime.eventrouter.routingtask.NextSbbEntityFinder.next(NextSbbEntityFinder.java:86)
at org.mobicents.slee.runtime.eventrouter.routingtask.EventRoutingTaskImpl.routeQueuedEvent(EventRoutingTaskImpl.java:282)
at org.mobicents.slee.runtime.eventrouter.routingtask.EventRoutingTaskImpl.run(EventRoutingTaskImpl.java:126)
at org.mobicents.slee.runtime.eventrouter.EventRouterExecutorImpl$EventRoutingTaskStatsCollector.run(EventRouterExecutorImpl.java:73)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
After this the system continues without incident.
Setup: This only appears when we have both FirstService and SecondService deployed. The key to reproducing the issue is that both services are attached to the same two ACIs: one null activity and one custom activity from one of our RAs. In other words, both services receive the same events from the same ACIs. The timing aspect is that events need to be fired on both ACIs simultaneously.
Cause?: What I THINK is happening, after looking through the event router code, is that two threads try to lock the same root SBB entities, but in different order. Each thread gets one lock and then waits for the other, and hence we get a deadlock. When the timeout for the second lock hits, the first lock is released and the system continues operating normally.
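To make the theory concrete, here's a minimal standalone sketch of that lock-ordering pattern (this is NOT the actual SLEE code; the two ReentrantLocks and the 10-second tryLock just mimic the per-SBB-entity locks and the lockOrFail timeout we see in the trace):

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderingDeadlock {

    // stand-ins for the per-sbb-entity locks held inside SbbEntityFactoryImpl
    static final ReentrantLock lockA = new ReentrantLock();
    static final ReentrantLock lockB = new ReentrantLock();

    static void route(ReentrantLock first, ReentrantLock second, String name) {
        first.lock(); // grab "our" root sbb entity
        try {
            // the other thread already holds "second", so this waits the full
            // 10 s and then fails -- roughly the stall + SLEEException we observe
            if (second.tryLock(10, TimeUnit.SECONDS)) {
                try {
                    System.out.println(name + ": got both locks, routing event");
                } finally {
                    second.unlock();
                }
            } else {
                System.out.println(name + ": timeout while acquiring lock " + second);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            first.unlock();
        }
    }

    public static void main(String[] args) {
        // thread 1 locks A then B, thread 2 locks B then A -> classic lock-order
        // inversion, resolved only when the second tryLock times out
        new Thread(new Runnable() {
            public void run() { route(lockA, lockB, "pool-16-thread-1"); }
        }).start();
        new Thread(new Runnable() {
            public void run() { route(lockB, lockA, "pool-18-thread-1"); }
        }).start();
    }
}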
Details: Before the event routing begins, the event router queries the ACI attachment tree to determine the next SBB that should receive the event (full path in the stack trace above). During this check, all SbbIDs are fetched, and hence SbbEntityFactoryImpl acquires a lock for each of them (since getSbbEntity is called). I'm guessing that one or both of these is the culprit:
a) the set of attached SBBs (the for loop on line 406 in ActivityContextImpl.java) is unsorted and hence the locks might be acquired in a random order.
b) the set of attached SBBs is sorted one way for our null ACI and the other way around for the custom ACI.
Personally I'd put my money on b. :)
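Either way, if the ordering really is the problem, I'd guess the fix would be to always acquire the locks in one global order, e.g. sorted by SBB entity id. Just to show what I mean, a rough sketch (the map lookup and method names here are made up, not the actual SbbEntityFactoryImpl API):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical helper: acquire a batch of per-sbb-entity locks in one
// deterministic order, so two routing threads can never grab the same
// pair of locks in opposite order.
final class OrderedLockAcquisition {

    static List<ReentrantLock> lockInOrder(Map<String, ReentrantLock> locksByEntityId,
                                           List<String> entityIds) {
        List<String> ordered = new ArrayList<String>(entityIds);
        Collections.sort(ordered); // every thread sees the same acquisition order
        List<ReentrantLock> acquired = new ArrayList<ReentrantLock>();
        for (String id : ordered) {
            ReentrantLock lock = locksByEntityId.get(id);
            lock.lock();
            acquired.add(lock);
        }
        return acquired; // caller releases in reverse order when routing is done
    }
}

Whether something like that is feasible inside the comparator path I can't say; it's only meant to illustrate the direction.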
Unfortunately I'm unable to provide a test case for this. However, unless I'm missing something, I think it's clear from the code that a deadlock CAN happen here, even if it requires very special circumstances. Does anyone agree, or do you think I should continue looking for a bug in our own code?
If this is indeed the cause of the issue we're seeing, the obvious workaround would be to put both SBBs into the same service. That is certainly possible, but since they provide completely separate functionality, I would rather keep them as two different services if at all possible. Any other ideas for workarounds?
BR,
-Calle
--
Carl-Magnus Björkell
EmblaCom