Re: [tao-users] Multi-threaded CORBA application can be deadlocked inside TAO

Chris Cleeland

unread,

Nov 23, 2009, 2:43:12 PM11/23/09

to <David.Kinder@sungard.com>, tao-...@cs.wustl.edu

Hi, David,

Thanks for using the PRF!

On Nov 23, 2009, at 2:42 AM, <David....@sungard.com> <David....@sungard.com
> wrote:

> SYNOPSIS:
> On a large client/server production system, a deadlock has been seen.
> The result of the deadlock is that there are threads that have called
> ORB::run() waiting on a lock, but no thread is actually doing anything
> to process incoming CORBA events.
>
> DESCRIPTION:
> [...]
>
> The problem seems to be that, once the thread in 2) has become the
> leader, no other thread can get into the reactor. Since the leader
> thread is handling a long-lived upcall, no other CORBA events can be
> processed until it returns.
> This can be a long time. In fact, in the actual system, this can be
> forever, as this upcall can end up depending on other CORBA calls
> which
> can now never be completed.

Yes, that is a known issue.

> Intuitively, this seems wrong. Once a thread is off handling an event,
> it surely should not be the leader any more? Ideally it would allow
> other threads to enter the reactor and handle further events.

While I agree with you, there is a reason. It's been awhile since
I've been intimate enough with that code, so I can't spout off the
precise reason why that which seems intuitive is not possible with
things as they are, but I can assure you that it's not an easy nut to
crack.

> REPEAT BY:
>
> Does anyone have any ideas about this? Are we doing something foolish
> that we've not spotted? We have experimented with "-ORBWaitStrategy
> MT_NOUPCALL"
> without success, though looking at the code for that, we cannot see
> how
> it could be expected to work at all.

In the past, OCI in general, and I specifically, have helped people
explore solutions to this, sometimes within the ORB, and sometimes at
the application level. One of those engagements led to the
implementation of MT_NOUPCALL, but that is really an incomplete
implementation (the customer stopped when it was sufficient for their
needs rather than pursuing a general solution to the end).

I believe that most of these mitigation strategies are document in the
FAQ. We can also provide you with support if you're interested.

---
Chris Cleeland, Principal Software Engineer
http://www.theaceorb.com AND http://www.ociweb.com

David....@sungard.com

unread,

Nov 26, 2009, 1:13:15 PM11/26/09

to tao-...@cs.wustl.edu

After some thinking about the way the client leader logic is followed,
and examining what Russel and Chris have posted, we have come up with
our own patch against 1.7.4 for this problem.

The patch is quite short, and hopefully is fairly clear. Logically, it
breaks down into two parts:

1) In TAO_Leader_Follower::set_upcall_thread(), if we're the client
leader thread, try to relinquish our client leader status.

2) Previously, the logic in TAO_Leader_Follower::wait_for_event() was
such that the thread starts out as a follower, and promotes itself to
client leader only if no-one else is doing any work. This has now been
made into a loop: when we're the client leader, after processing events
we check if we're still the client leader. If not, we go round the new
loop again, so that we go back to being just a follower.

This patch is sufficient to fix the test case I sent in my first
message. We also have a large test suite for our full system, and using
this patch did not cause any obvious problems there.

Would this patch be suitable for inclusion in TAO? The logic seems right
to me, but as Chris noted, why this code is as it is is not always
intuitive...

David

--- Leader_Follower.inl.old 2009-02-17 08:04:16.000000000 +0000
+++ Leader_Follower.inl 2009-11-26 13:03:24.046875000 +0000
@@ -132,10 +132,21 @@
{
TAO_ORB_Core_TSS_Resources *tss = this->get_tss_resources ();

- if (tss->event_loop_thread_ > 0)
+ ACE_ASSERT(tss->client_leader_thread_ == 0 ||
tss->client_leader_thread_ == 1);
+
+ if (tss->event_loop_thread_ > 0 || tss->client_leader_thread_ > 0)
{
ACE_GUARD (TAO_SYNCH_MUTEX, ace_mon, this->lock ());
- this->reset_event_loop_thread_i (tss);
+ if(tss->event_loop_thread_ > 0)
+ {
+ ACE_ASSERT(tss->client_leader_thread_ == 0);
+ this->reset_event_loop_thread_i (tss);
+ }
+ else
+ {
+ ACE_ASSERT(tss->client_leader_thread_ > 0);
+ this->reset_client_leader_thread();
+ }

this->elect_new_leader ();
}
@@ -160,9 +171,13 @@
TAO_Leader_Follower::reset_client_leader_thread (void)
{
TAO_ORB_Core_TSS_Resources *tss = this->get_tss_resources ();
- --tss->client_leader_thread_;
- --this->leaders_;
- --this->client_thread_is_leader_;
+ ACE_ASSERT(tss->client_leader_thread_ == 0 ||
tss->client_leader_thread_ == 1);
+ if(tss->client_leader_thread_ > 0)
+ {
+ --tss->client_leader_thread_;
+ --this->leaders_;
+ --this->client_thread_is_leader_;
+ }
}

ACE_INLINE int
--- Leader_Follower.cpp.old 2009-05-20 05:07:56.000000000 +0100
+++ Leader_Follower.cpp 2009-11-26 13:06:12.656250000 +0000
@@ -216,6 +216,7 @@
t_id = transport->id ();
}

+ while (event->keep_waiting())
{
// Calls this->set_client_thread () on construction and
// this->reset_client_thread () on destruction.
@@ -386,6 +387,7 @@
// the leader.
TAO_LF_Client_Leader_Thread_Helper client_leader_thread_helper
(*this);
ACE_UNUSED_ARG (client_leader_thread_helper);
+ TAO_ORB_Core_TSS_Resources *tss = this->get_tss_resources ();

{
ACE_GUARD_RETURN (ACE_Reverse_Lock<TAO_SYNCH_MUTEX>, rev_mon,
@@ -420,6 +422,24 @@
if (result == -1)
break;

+ // If we have lost the leader then become follower
+ if (tss->client_leader_thread_ == 0)
+ break;
+
+ // If there are event loop threads waiting let them take over
+ // so we are no longer single threaded
+ if (event->keep_waiting())
+ {
+ ACE_GUARD_RETURN (TAO_SYNCH_MUTEX, ace_mon, this->lock
(), -1);
+
+ if(this->event_loop_threads_waiting_)
+ {
+ this->reset_client_leader_thread();
+ this->event_loop_threads_condition_.broadcast ();
+ break;
+ }
+ }
+
// Otherwise, keep going...
}

Johnny Willemsen

unread,

Nov 26, 2009, 1:25:32 PM11/26/09

to David....@sungard.com, tao-...@cs.wustl.edu

Hi,

Please put this in bugzilla to not get lost. Integrating this is will take
time which we are really short for at this moment. It seems a small patch,
but the risks that things break are high because it is really in the core of
TAO. We are preparing for the release of 1.7.5, that will go out just after
the weekend.

Johnny

> _______________________________________________
> tao-users mailing list
> tao-...@list.isis.vanderbilt.edu
> http://list.isis.vanderbilt.edu/mailman/listinfo/tao-users

David....@sungard.com

unread,

Nov 26, 2009, 2:00:45 PM11/26/09

to tao-...@cs.wustl.edu

Johnny Willemsen wrote:
> Please put this in bugzilla to not get lost.

I've just tried to do this: I've created a new bugzilla account for the purpose. However, when I try to add a new bug report, I get to a page with the message

Either no products have been defined to enter bugs against or you have not been given access to any.

>From the "Preferences / Permissions" page in bugzilla I see

You have the following permission bits set on your account:
canconfirm Can confirm a bug.
editbugs Can edit all aspects of any bug.

Which would appear to indicate that I cannot create bug reports. Do I have to do something else to get my account to be allowed to create reports? I can't see any other steps mentioned anywhere.

David

Johnny Willemsen

unread,

Nov 26, 2009, 2:18:28 PM11/26/09

to David....@sungard.com, tao-...@cs.wustl.edu

Hi,

I have given you permission.

Johnny

David....@sungard.com

unread,

Nov 27, 2009, 4:14:44 AM11/27/09

to tao-...@cs.wustl.edu

> I have given you permission.

Thanks. I've added the description, test case and patch as bug #3768.

David

Johnny Willemsen

unread,

Nov 27, 2009, 4:17:39 AM11/27/09

to David....@sungard.com, tao-...@cs.wustl.edu

Hi,

> > I have given you permission.
>
> Thanks. I've added the description, test case and patch as bug #3768.

Great, than it doesn't get lost. Be aware that bugzilla is best effort, no
guarantees given when this are integrated

Johnny

David....@sungard.com

unread,

Nov 27, 2009, 4:28:48 AM11/27/09

to tao-...@cs.wustl.edu

> Great, than it doesn't get lost. Be aware that bugzilla is
> best effort, no guarantees given when this are integrated

Understood :-) I will also look into making the test case fit into the
test harness.

David