I've seen the same thing, i.e. that client-leader threads are special
and don't give up leadership in set_upcall_thread - I could find no
reason for it being that way, though. I modified my version of TAO so
that client-leader threads can give up leadership, and I have pasted
the patch below. My version is ancient, though (OCI version 1.3a_p10),
so you'll have to update it for your version. Also, I couldn't change
my ABI, so I had to use a hack to avoid doing so (setting
client_leader_thread_ to a negative value) - please feel free to
update the patch in that respect if you want. Once done, please
submit it for inclusion in the trunk.
HTH
Cheers,
Russell.
Index: Leader_Follower.i
===================================================================
RCS file: /src/apps/VxPSP/Current/source/ACE_wrappers/TAO/tao/Leader_Follower.i,v
retrieving revision 1.1.1.4
retrieving revision 1.5
diff -u -r1.1.1.4 -r1.5
--- Leader_Follower.i 24 Sep 2004 18:00:55 -0000 1.1.1.4
+++ Leader_Follower.i 24 Oct 2009 09:06:19 -0000 1.5
@@ -3,6 +3,8 @@
// ****************************************************************
+TAO_BEGIN_VERSIONED_NAMESPACE_DECL
+
ACE_INLINE
TAO_Leader_Follower::TAO_Leader_Follower (TAO_ORB_Core* orb_core,
TAO_New_Leader_Generator
*new_leader_generator)
@@ -61,6 +63,8 @@
ACE_INLINE int
TAO_Leader_Follower::set_event_loop_thread (ACE_Time_Value
*max_wait_time)
{
+ // No need to grab the lock here - this method is called from the
+ // LF_Strategy object which has already grabbed the lock.
TAO_ORB_Core_TSS_Resources *tss = this->get_tss_resources ();
// Make sure that there is no other client thread run the show. If
@@ -94,6 +98,9 @@
ACE_INLINE void
TAO_Leader_Follower::reset_event_loop_thread_i
(TAO_ORB_Core_TSS_Resources *tss)
{
+ // No need to grab the lock here - this method is called from the
+ // LF_Strategy object which has already grabbed the lock.
+
// Always decrement <event_loop_thread_>. If <event_loop_thread_>
// reaches 0 and we are not a client leader, we are done with our
// duties of running the event loop. Therefore, decrement the
@@ -133,6 +140,16 @@
this->elect_new_leader ();
}
+ else if (tss->client_leader_thread_ > 0)
+ {
+ ACE_GUARD (TAO_SYNCH_MUTEX, ace_mon, this->lock ());
+ this->reset_client_leader_thread ();
+ // We need some way to signal to reset_client_leader_thread that
+ // the leadership has already been given up - we do this by
+ // negating the client_leader_thread_ count.
+ tss->client_leader_thread_ = -tss->client_leader_thread_;
+ this->elect_new_leader ();
+ }
}
ACE_INLINE int
@@ -144,6 +161,8 @@
ACE_INLINE void
TAO_Leader_Follower::set_client_leader_thread (void)
{
+ // No need to grab the lock here - this method is called from inside
+ // wait_for_event where the lock is already held.
TAO_ORB_Core_TSS_Resources *tss = this->get_tss_resources ();
++this->leaders_;
++this->client_thread_is_leader_;
@@ -153,10 +172,27 @@
ACE_INLINE void
TAO_Leader_Follower::reset_client_leader_thread (void)
{
+ // No need to grab the lock here - this method is called from inside
+ // wait_for_event where the lock is already held.
TAO_ORB_Core_TSS_Resources *tss = this->get_tss_resources ();
+ // We may have already been called via set_upcall_thread before we
+ // get called via ~TAO_LF_Client_Leader_Thread_Helper, so we need to
+ // check that client_leader_thread_ is not already zero.
+ if (tss->client_leader_thread_ > 0)
+ {
--tss->client_leader_thread_;
--this->leaders_;
--this->client_thread_is_leader_;
+ }
+ else if (tss->client_leader_thread_ < 0)
+ {
+ // If the client_leader_thread_ count was negative that means
+ // that leadership was given up via a call to set_upcall_thread
+ // in this wait_for_event context - reset the value back to its
+ // positive value, which is what is required for the next
+ // wait_for_event context up the stack.
+ tss->client_leader_thread_ = -tss->client_leader_thread_;
+ }
}
ACE_INLINE int
@@ -217,3 +253,5 @@
{
this->leader_follower_.reset_client_leader_thread ();
}
+
+TAO_END_VERSIONED_NAMESPACE_DECL
Index: Leader_Follower.cpp
===================================================================
RCS file: /src/apps/VxPSP/Current/source/ACE_wrappers/TAO/tao/Leader_Follower.cpp,v
retrieving revision 1.3
retrieving revision 1.6
diff -u -r1.3 -r1.6
--- Leader_Follower.cpp 26 Apr 2007 20:56:46 -0000 1.3
+++ Leader_Follower.cpp 27 Jan 2009 16:12:42 -0000 1.6
@@ -161,6 +161,7 @@
if (tss->event_loop_thread_ ||
tss->client_leader_thread_)
{
+ // Will this ever be true? And will it ever work?
++this->leaders_;
}
@@ -199,16 +200,29 @@
if (TAO_debug_level)
t_id = transport->id ();
- {
+ { // Scope #1: All threads inside here are client threads
// Calls this->set_client_thread () on construction and
// this->reset_client_thread () on destruction.
TAO_LF_Client_Thread_Helper client_thread_helper (*this);
ACE_UNUSED_ARG (client_thread_helper);
+ // The loop here is for when we get elected (client) leader and
+ // then later relinquish the leader position and our event has
+ // still not completed (and we haven't run out of time).
+
+ // All the conditions below are basically the various ways the
+ // leader loop below can end, other than the event being complete
+
+ while (event->keep_waiting ()
+ && !(result == 0 &&
+ max_wait_time != 0 &&
+ *max_wait_time == ACE_Time_Value::zero)
+ && result != -1)
+ { // Scope #2: threads here alternate between being leader/followers
// Check if there is a leader. Note that it cannot be us since we
// gave up our leadership when we became a client.
if (this->leader_available ())
- {
+ { // Scope #3: threads here are followers
// = Wait as a follower.
// Grab a follower:
@@ -228,7 +242,7 @@
while (event->keep_waiting () &&
this->leader_available ())
- {
+ { // Scope #4: this loop handles spurious wake-ups
// Add ourselves to the list, do it everytime we wake up
// from the CV loop. Because:
//
@@ -324,7 +338,7 @@
return -1;
}
}
- }
+ } // End Scope #4: loop to handle spurious wakeups
countdown.update ();
@@ -354,7 +368,7 @@
// We only get here if we woke up but the reply is not
// complete yet, time to assume the leader role....
// i.e. ACE_ASSERT (event->successful () == 0);
- }
+ } // End Scope #3: we are no longer a follower
// = Leader Code.
@@ -367,15 +381,16 @@
// construction and this->reset_client_leader_thread ()
// on destruction. Note that this may increase the refcount of
// the leader.
+ { // Scope #5: We are now the client-leader
TAO_LF_Client_Leader_Thread_Helper client_leader_thread_helper
(*this);
ACE_UNUSED_ARG (client_leader_thread_helper);
- {
+ { // Scope #6: release the lock via a reverse lock
ACE_GUARD_RETURN (ACE_Reverse_Lock<TAO_SYNCH_MUTEX>, rev_mon,
this->reverse_lock (), -1);
// Become owner of the reactor.
- ACE_Reactor *reactor = this->reactor_;
+ ACE_Reactor *reactor = this->reactor ();
reactor->owner (ACE_Thread::self ());
// Run the reactor event loop.
@@ -403,6 +418,16 @@
if (result == -1)
break;
+ // Has an event loop thread become available to take over?
+ // Yes, we are checking this without the lock, however, if
+ // we get a false reading we'll just circle around and
+ // become leader again...
+ if (this->event_loop_threads_waiting_)
+ break;
+
+ // Did we give up leadership?
+ if (!this->is_client_leader_thread ())
+ break;
// Otherwise, keep going...
}
@@ -411,13 +436,15 @@
"TAO (%P|%t) - Leader_Follower[%d]::wait_for_event,"
" (leader) exit reactor event loop\n",
t_id));
- }
- }
+ } // End Scope #6: we should now hold the lock again
//
// End artificial scope for auto_ptr like helpers calling:
- // this->reset_client_thread () and (maybe)
// this->reset_client_leader_thread ().
//
+ } // End Scope #5: we are no longer a client-leader
+ // We only get here if we were the client leader and either our
+ // event completed or an event loop thread has become available to
+ // become leader.
// Wake up the next leader, we cannot do that in handle_input,
// because the woken up thread would try to get into handle_events,
@@ -433,6 +460,26 @@
t_id),
-1);
+ // Yield to any event loop threads that may be waiting to take
+ // leadership - otherwise we will just loop around and take
+ // leadership again (because we hold the lock).
+ if (this->event_loop_threads_waiting_)
+ {
+ ACE_GUARD_RETURN (ACE_Reverse_Lock<TAO_SYNCH_MUTEX>, rev_mon,
+ this->reverse_lock (), -1);
+ ACE_OS::thr_yield ();
+ }
+
+ } // End Scope #2: we loop here if our event is incomplete
+
+ //
+ // End artificial scope for auto_ptr like helpers calling:
+ // this->reset_client_thread ()
+ //
+
+ // We should only get here when our event is complete or timed-out
+ } // End Scope #1: we are no longer a client thread
+
if (result == -1)
ACE_ERROR_RETURN ((LM_ERROR,
"TAO (%P|%t) - Leader_Follower[%d]::wait_for_event,"
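One side note on the yield block in the patch: it uses
ACE_Reverse_Lock, which inverts a mutex so that a scoped guard
temporarily releases a lock that is already held and re-acquires it on
scope exit. A rough sketch of the idiom with std::mutex (an
illustration of the pattern only, not ACE's implementation):

```cpp
#include <mutex>

// A reverse lock: lock() releases the underlying mutex and unlock()
// re-acquires it, so a scoped guard drops an already-held lock for
// the duration of its scope (cf. ACE_Reverse_Lock<TAO_SYNCH_MUTEX>).
template <typename Mutex>
class ReverseLock
{
public:
  explicit ReverseLock (Mutex &m) : mutex_ (m) {}
  void lock ()   { mutex_.unlock (); } // guard construction drops the lock
  void unlock () { mutex_.lock (); }   // guard destruction re-takes it
private:
  Mutex &mutex_;
};

// Usage mirroring the patch's yield block: leader_lock must be held
// on entry; inside the guard's scope other threads can acquire it and
// take over leadership; on scope exit it is held again.
void
yield_while_unlocked (std::mutex &leader_lock)
{
  ReverseLock<std::mutex> rev (leader_lock);
  std::lock_guard<ReverseLock<std::mutex> > guard (rev);
  // ... e.g. ACE_OS::thr_yield () in the patch ...
}
```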
> -----Original Message-----
> From: tao-bugs...@list.isis.vanderbilt.edu [mailto:tao-bugs-
> bou...@list.isis.vanderbilt.edu] On Behalf Of David....@sungard.com
> Sent: Thursday, November 19, 2009 10:55 AM
> To: tao-...@cs.wustl.edu
> Subject: [tao-bugs] Multi-threaded CORBA application can be
> deadlocked inside TAO
>
> TAO VERSION: 1.7.4
> ACE VERSION: 5.7.4
>
> HOST MACHINE and OPERATING SYSTEM:
> Windows XP SP3
>
> THE $ACE_ROOT/ace/config.h FILE:
> config-win32.h
>
> BUILD METHOD USED:
> Microsoft Visual Studio 2005, TAO_ACE_vc8.sln
>
> AREA/CLASS/EXAMPLE AFFECTED:
> Fault seems to be a logic problem inside TAO
>
> DOES THE PROBLEM AFFECT:
> COMPILATION? No
> LINKING? No
> EXECUTION? Yes, causes deadlock with TAO and application
> OTHER (please specify)? No
>
> SYNOPSIS:
> On a large client/server production system, a deadlock has been seen.
> The
> result of the deadlock is that there are threads that have called
> ORB::run()
> waiting on a lock, but no thread is actually doing anything to process
> incoming CORBA events.
>
> DESCRIPTION:
> Our server architecture has a pool of CORBA worker threads, each of
> which
> calls ORB::run() to process incoming events (that is, CORBA calls from
> client applications). There also exist other threads in the server
> that
> perform various tasks, which sometimes involve calling out to other,
> remote
> CORBA objects.
>
> Analysis of the deadlocked server shows that the sequence of events is
> as
> follows:
>
> 1) Client applications make sufficiently many CORBA calls to the
> server
> that
> all the CORBA worker threads are busy handling the calls.
>
> 2) A thread (not one of the CORBA worker threads) makes a call to an
> external
> CORBA object.
>
> 3) While the above call is in progress, another application makes a
> CORBA call
> to the server. Normally, this would be handled by one of the CORBA
> worker
> threads. However, as described in 1), these are all busy. Therefore,
> the
> incoming event is handled by the thread in 2) that is waiting for the
> result
> of a call.
>
> 4) The thread in 2) dispatches the event to the appropriate servant,
> leading
> to a long-lived call to code inside the server that performs some
> work.
>
> 5) Some of the CORBA threads in 1) complete the dispatching of the
> CORBA
> calls they were handling.
>
> 6) A further CORBA call from a client occurs. However, this is handled
> neither
> by the thread in 2) (since this is handling the upcall) nor any of the
> CORBA
> worker threads, which are all stuck in
> TAO_Leader_Follower::wait_for_client_leader_to_complete(), waiting for
> an event
> to be signalled, so the event is not handled.
>
> The problem seems to be that, once the thread in 2) has become the
> leader, no
> other thread can get into the reactor. Since the leader thread is
> handling a
> long-lived upcall, no other CORBA events can be processed until it
> returns.
> This can be a long time. In fact, in the actual system, this can be
> forever,
> as this upcall can end up depending on other CORBA calls which can now
> never
> be completed.
>
> Intuitively, this seems wrong. Once a thread is off handling an event,
> it
> surely should not be the leader any more? Ideally it would allow other
> threads
> to enter the reactor and handle further events.
>
> REPEAT BY:
>
> The problem in the production system has been reduced to a relatively
> simple
> test case, the source code for which is attached. To reproduce, build
> the
> supplied source files to produce "server" and "client" executables,
> then
>
> 1) Run the server executable. This will wait for 10 seconds, waiting
> for
> a CORBA call from the client executable.
>
> 2) Run the client executable. This will call the server via the CORBA
> method
> Test2.call(), passing in a CORBA object it has instantiated.
>
> 3) The server immediately calls back to the client via Test1.call1().
>
> 4) Since (at this point) in the client application there is only one
> thread
> in the ORB (the main thread) this will handle the incoming call back
> from
> the server. This ends up in TestClient_i::call1(), which starts a pool
> of
> CORBA worker threads and then goes to sleep forever.
>
> 5) After 10 seconds, the server starts making repeated calls to the
> Test1.call2() CORBA method of the client. The first of these calls
> simply
> causes the system to hang: the client does not process the call.
>
> 6) If the final sleep() in TestClient_i::call1() is commented out, the
> system behaves as expected: repeated calls from the server to the
> client
> are
> seen. However, it seems to me that the system should work in any case:
> there
> are plenty of threads waiting to do work in the test client: it's just
> that
> none of them can get into the reactor.
>
> Does anyone have any ideas about this? Are we doing something foolish
> that
> we've not spotted? We have experimented with "-ORBWaitStrategy
> MT_NOUPCALL"
> without success, though looking at the code for that, we cannot see
> how
> it
> could be expected to work at all.
>
> David Kinder