Hey guys.
(NOTE: I apologize if I duplicated this topic. I posted this once, then waited for an hour, and since it didn't show up yet I'm assuming something went wrong and I'm reposting it here. Luckily I had the text saved :P )
I _think_ I found a deadlock in the event router just now. This issue has apparently been in the event router for quite some time (
http://groups.google.com/group/mobicents-public/browse_thread/thread/36956c8c8b5eeb50/3fe9eba98441b1ba?lnk=gst&q=sbbentityfactory#3fe9eba98441b1ba ), but since it seems to be a timing issue it's very hard to reproduce. We hadn't seen it before, but we recently made some SBB service changes that made the issue appear on around 10% of our calls, so I started to investigate, and this is what I found:
Symptoms: System sometimes stalls for 10s during a call and then we get a SLEEException:
2012-03-28 14:11:56,693 WARN [EventRoutingTaskImpl] Failed to find next sbb entity to deliver the event EventContext[event type id = EventTypeID[name=our.FirstEvent,vendor=EmblaCom,version=1.0.0] , event = FirstEvent{} , local ac = ACH=NULL>68b4de3a:13658f01043:-7e24 , address = null , serviceID = null] in ACH=NULL>68b4de3a:13658f01043:-7e24
javax.slee.SLEEException: timeout while acquiring lock java.util.concurrent.locks.ReentrantLock@60bcbb4[Locked by thread pool-16-thread-1] for sbb entity with id /ServiceID[name=FirstService,vendor=com.emblacom,version=1.0.0-SNAPSHOT]/_____cdefad40-78c6-11e1-8321-00215e23b2f4
at org.mobicents.slee.runtime.sbbentity.SbbEntityFactoryImpl.lockOrFail(SbbEntityFactoryImpl.java:327)
at org.mobicents.slee.runtime.sbbentity.SbbEntityFactoryImpl.getSbbEntity(SbbEntityFactoryImpl.java:197)
at org.mobicents.slee.runtime.activity.SbbEntityComparator.compare(SbbEntityComparator.java:56)
at org.mobicents.slee.runtime.activity.SbbEntityComparator.compare(SbbEntityComparator.java:38)
at java.util.TreeMap.put(TreeMap.java:545)
at java.util.TreeSet.add(TreeSet.java:255)
at org.mobicents.slee.runtime.activity.ActivityContextImpl.getSortedSbbAttachmentSet(ActivityContextImpl.java:408)
at org.mobicents.slee.runtime.eventrouter.routingtask.NextSbbEntityFinder.next(NextSbbEntityFinder.java:86)
at org.mobicents.slee.runtime.eventrouter.routingtask.EventRoutingTaskImpl.routeQueuedEvent(EventRoutingTaskImpl.java:282)
at org.mobicents.slee.runtime.eventrouter.routingtask.EventRoutingTaskImpl.run(EventRoutingTaskImpl.java:126)
at org.mobicents.slee.runtime.eventrouter.EventRouterExecutorImpl$EventRoutingTaskStatsCollector.run(EventRouterExecutorImpl.java:73)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
2012-03-28 14:11:56,693 WARN [EventRoutingTaskImpl] Failed to find next sbb entity to deliver the event EventContext[event type id = EventTypeID[name=our.SecondEvent,vendor=EmblaCom,version=1.0] , event = SecondEvent{} , local ac = RA:EmblaCustomRa:CustomActivityHandle{} , address = null , serviceID = null] in RA:EmblaCustomRa:CustomActivityHandle{}
javax.slee.SLEEException: timeout while acquiring lock java.util.concurrent.locks.ReentrantLock@103bcf50[Locked by thread pool-18-thread-1] for sbb entity with id /ServiceID[name=SecondService,vendor=com.emblacom=1.0.0]/-120582819____
at org.mobicents.slee.runtime.sbbentity.SbbEntityFactoryImpl.lockOrFail(SbbEntityFactoryImpl.java:327)
at org.mobicents.slee.runtime.sbbentity.SbbEntityFactoryImpl.getSbbEntity(SbbEntityFactoryImpl.java:197)
at org.mobicents.slee.runtime.activity.SbbEntityComparator.compare(SbbEntityComparator.java:56)
at org.mobicents.slee.runtime.activity.SbbEntityComparator.compare(SbbEntityComparator.java:38)
at java.util.TreeMap.put(TreeMap.java:545)
at java.util.TreeSet.add(TreeSet.java:255)
at org.mobicents.slee.runtime.activity.ActivityContextImpl.getSortedSbbAttachmentSet(ActivityContextImpl.java:408)
at org.mobicents.slee.runtime.eventrouter.routingtask.NextSbbEntityFinder.next(NextSbbEntityFinder.java:86)
at org.mobicents.slee.runtime.eventrouter.routingtask.EventRoutingTaskImpl.routeQueuedEvent(EventRoutingTaskImpl.java:282)
at org.mobicents.slee.runtime.eventrouter.routingtask.EventRoutingTaskImpl.run(EventRoutingTaskImpl.java:126)
at org.mobicents.slee.runtime.eventrouter.EventRouterExecutorImpl$EventRoutingTaskStatsCollector.run(EventRouterExecutorImpl.java:73)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
After this the system continues without incident.
Setup: This only appears when we have both FirstService and SecondService deployed. The key to reproducing the issue is that both services are attached to the same two ACIs: one null activity and one custom activity from one of our RAs. In other words, both services receive the same events from the same ACIs. The timing aspect is that events need to be fired on both ACIs simultaneously.
Cause?: What I THINK is happening, after looking through the event router code, is that two threads try to lock the same root SBB entities, but in different order. Each thread gets one lock and then waits for the other, and hence we get a deadlock. When the timeout for the second lock hits, the first lock is released and the system continues operating normally.
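To make the theory concrete, here's a minimal standalone sketch of that lock-ordering pattern (this is NOT the actual SLEE code; the two ReentrantLocks and the 10-second tryLock just mimic the per-SBB-entity locks and the lockOrFail timeout we see in the trace):

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderingDeadlock {

    // stand-ins for the per-sbb-entity locks held inside SbbEntityFactoryImpl
    static final ReentrantLock lockA = new ReentrantLock();
    static final ReentrantLock lockB = new ReentrantLock();

    static void route(ReentrantLock first, ReentrantLock second, String name) {
        first.lock(); // grab "our" root sbb entity
        try {
            // the other thread already holds "second", so this waits the full
            // 10 s and then fails -- roughly the stall + SLEEException we observe
            if (second.tryLock(10, TimeUnit.SECONDS)) {
                try {
                    System.out.println(name + ": got both locks, routing event");
                } finally {
                    second.unlock();
                }
            } else {
                System.out.println(name + ": timeout while acquiring lock " + second);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            first.unlock();
        }
    }

    public static void main(String[] args) {
        // thread 1 locks A then B, thread 2 locks B then A -> classic lock-order
        // inversion, resolved only when the second tryLock times out
        new Thread(new Runnable() {
            public void run() { route(lockA, lockB, "pool-16-thread-1"); }
        }).start();
        new Thread(new Runnable() {
            public void run() { route(lockB, lockA, "pool-18-thread-1"); }
        }).start();
    }
}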
Details: Before the event routing begins, the event router queries the ACI attachment tree to determine the next SBB that should receive the event (full path in the stack trace above). During this check, all SbbIDs are fetched, and hence SbbEntityFactoryImpl acquires a lock for each of them (since getSbbEntity is called). I'm guessing that one or both of these is the culprit:
a) the set of attached SBBs (the for loop on line 406 in ActivityContextImpl.java) is unsorted and hence the locks might be acquired in a random order.
b) the set of attached SBBs is sorted one way for our null ACI and the other way around for the custom ACI.
Personally I'd put my money on b. :)
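Either way, if the ordering really is the problem, I'd guess the fix would be to always acquire the locks in one global order, e.g. sorted by SBB entity id. Just to show what I mean, a rough sketch (the map lookup and method names here are made up, not the actual SbbEntityFactoryImpl API):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical helper: acquire a batch of per-sbb-entity locks in one
// deterministic order, so two routing threads can never grab the same
// pair of locks in opposite order.
final class OrderedLockAcquisition {

    static List<ReentrantLock> lockInOrder(Map<String, ReentrantLock> locksByEntityId,
                                           List<String> entityIds) {
        List<String> ordered = new ArrayList<String>(entityIds);
        Collections.sort(ordered); // every thread sees the same acquisition order
        List<ReentrantLock> acquired = new ArrayList<ReentrantLock>();
        for (String id : ordered) {
            ReentrantLock lock = locksByEntityId.get(id);
            lock.lock();
            acquired.add(lock);
        }
        return acquired; // caller releases in reverse order when routing is done
    }
}

Whether something like that is feasible inside the comparator path I can't say; it's only meant to illustrate the direction.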
Unfortunately I'm unable to provide a test case for this. However, unless I'm missing something, I think it's clear from the code that a deadlock CAN happen here, even if it requires very special circumstances. Does anyone agree, or do you think I should continue looking for a bug in our own code?
If this is indeed the cause of the issue we're seeing, the obvious workaround would be to put both SBBs into the same service. That is certainly possible, but since they provide completely separate functionality, I would rather keep them as two different services if at all possible. Any other ideas for workarounds?
BR,
-Calle
--
Carl-Magnus Björkell
EmblaCom