[tao-bugs] `Wait_On_LF_No_Upcall::can_process_upcalls` doesn't block upcall (bug?)

Mickael P. Golovin
Nov 29, 2005, 3:50:51 PM

Hi!

TAO VERSION: 1.4.7 (1.4a_p6 OCI has the same problem)
ACE VERSION: 5.4.7

HOST MACHINE and OPERATING SYSTEM: Windows XP (with SP2 and all official fixes)
TARGET MACHINE and OPERATING SYSTEM, if different from HOST: same as host
COMPILER NAME AND VERSION (AND PATCHLEVEL): Microsoft Visual Studio .NET 2003

THE $ACE_ROOT/ace/config.h FILE [if you use a link to a platform-
specific file, simply state which one]:

#define FD_SETSIZE 1024

#define WINVER 0x0400

#define ACE_HAS_ANSI_CASTS 1
#define ACE_HAS_MFC 1
#define ACE_HAS_STANDARD_CPP_LIBRARY 1
#define ACE_HAS_WINNT4 1
#define ACE_HAS_WINSOCK2 1

#if defined (ACE_HAS_MFC) && (ACE_HAS_MFC != 0) && defined (ACE_AS_STATIC_LIBS)
#define ACE_USES_STATIC_MFC
#endif // defined (ACE_HAS_MFC) && (ACE_HAS_MFC != 0) && defined (ACE_AS_STATIC_LIBS)

#define TAO_RESET_OUTPUT_CDR_AFTER_SEND 1

#include "ace/config-win32.h"

THE $ACE_ROOT/include/makeinclude/platform_macros.GNU FILE [if you
use a link to a platform-specific file, simply state which one
(unless this isn't used in this case, e.g., with Microsoft Visual
C++)]: N/A

CONTENTS OF $ACE_ROOT/bin/MakeProjectCreator/config/default.features
(used by MPC when you generate your own makefiles):

mfc=1
qos=1
ssl=1

AREA/CLASS/EXAMPLE AFFECTED: TAO_ORB_Core, ORB_Table

DOES THE PROBLEM AFFECT:
COMPILATION? no
LINKING? no
EXECUTION? yes

SYNOPSIS: `Wait_On_LF_No_Upcall::can_process_upcalls` doesn't block upcall

DESCRIPTION:
I have been exploring the method `Wait_On_LF_No_Upcall::can_process_upcalls`, because inside my server (when I use `-ORBClientConnectionHandler mt_noupcall`) it does not work correctly, and I found some strange things.

Here is the implementation from `DOC 1.4.7` of `tao/Wait_On_LF_No_Upcall.cpp` (equal to the latest CVS state, rev. 1.7):
===
...
bool
Wait_On_LF_No_Upcall::can_process_upcalls (void) const
{
  TAO_ORB_Core_TSS_Resources *tss =
    this->transport_->orb_core ()->get_tss_resources ();

  if ((this->transport_->opened_as () == TAO::TAO_CLIENT_ROLE) &&
      (this->transport_->bidirectional_flag () == 0) &&
      (tss->upcalls_temporarily_suspended_on_this_thread_ == true))
    return false;

  return true;
}
...
===

If I turn off bidirectional support, `this->transport_->bidirectional_flag ()` always returns `-1`, so this method never returns `false`. If I turn on bidirectional support, the method still never returns `false`, because on the server side `this->transport_->opened_as ()` is `TAO::TAO_SERVER_ROLE`, not `TAO::TAO_CLIENT_ROLE`.

In the DOC `ChangeLog-04b` I found the entry `Sun Oct 3 13:38:01 2004 Balachandran Natarajan` that introduced this change.

I also examined `tao/Transport.inl` (rev. 1.17) and `tao/Connection_Handler.cpp` (rev. 1.50) and found that `TAO_Transport::acts_as_server` and `TAO_Connection_Handler::handle_input_eh` use different logic; the previous logic is not equivalent to the current implementation of `can_process_upcalls`.

Here is the implementation of `TAO_Transport::acts_as_server`:
===
...
ACE_INLINE int
TAO_Transport::acts_as_server (void) const
{
  return (this->opened_as () == TAO_SERVER_ROLE
          || this->bidirectional_flag_ == 1) ? 1 : 0;
}
...
===

Here is the implementation of `TAO_Connection_Handler::handle_input_eh`:
===
...
int
TAO_Connection_Handler::handle_input_eh (ACE_HANDLE h, ACE_Event_Handler *eh)
{
  if (this->transport ()->acts_as_server () &&
      this->orb_core_->get_tss_resources ()->upcalls_temporarily_suspended_on_this_thread_)
    {
#if 0 // DON'T IMPLEMENT YET, BUT RECORD THE IDEA FOR POSTERITY
      // ACE_Time_Value this->spin_prevention_backoff_delay_;
      ACE_OS::usleep (this->spin_prevention_backoff_delay_);
      this->spin_prevention_backoff_delay_ =
        2 * this->spin_prevention_backoff_delay_ + 1;
#endif
      if (TAO_debug_level > 6)
        ACE_DEBUG ((LM_DEBUG,
                    "(%P|%t) Connection_Handler[%d] - not going to handle_input "
                    "on Transport %d "
                    "because upcalls temporarily suspended on this thread\n",
                    this->transport ()->id (),
                    this->transport ()->id ()));
      return 0;
    }

#if 0
  this->spin_prevention_backoff_delay_ = 0;
#endif

  int result = this->handle_input_internal (h, eh);

  if (result == -1)
    {
      this->close_connection ();
      return 0;
    }

  return result;
}
...
===

I also think that the call to `this->transport_->orb_core ()->get_tss_resources ()` should be moved inside the `if`; this improves performance, because the TSS value is then fetched only when it is actually needed instead of on every call.

SAMPLE FIX/WORKAROUND:
Here is a fix for `tao/Wait_On_LF_No_Upcall.cpp` (the same fix applies to both DOC and OCI):
===
...
bool
Wait_On_LF_No_Upcall::can_process_upcalls (void) const
{
  // GARANT {
  if ((this->transport_->opened_as () == TAO::TAO_SERVER_ROLE) ||
      (this->transport_->bidirectional_flag () == 1))
    {
      TAO_ORB_Core_TSS_Resources *tss =
        this->transport_->orb_core ()->get_tss_resources ();

      if (tss->upcalls_temporarily_suspended_on_this_thread_)
        return false;
    }
  // } GARANT

  return true;
}
...
===

Thanks a lot and best regards.
---
Russia, Moscow, GARANT-SERVICE Company (www.garant.ru)
Mickael P. Golovin, phone://+7-(095)-930-8908+5018

Kostas Lykiardopoulos
Nov 30, 2005, 5:34:38 AM

Thanks Mickael!

I had also noticed this problem in TAO 1.4.6, where can_process_upcalls() always
returned true with MT_NOUPCALL, but I had been too intimidated by the logic
inside this method to touch it. I will give your fix a try.

I thought I would share with you some of our experiences and what route we
have taken...

We are not too happy with the MT strategy because it can result in deadlock if
application code happens to use mutexes to mutually exclude certain actions.
Suppose the action that was assigned the thread makes a remote call. The thread
can then be reassigned to another upcall, which could conceivably be one of the
very actions we were trying to "exclude" (with the mutex). Deadlock arises
because the same thread requests the same mutex twice; or rather, the behaviour
is dependent on the synchronisation implementation (some implementations will
abort). Changing the mutex to be recursive is a bad idea, because it can lead
to a much more insidious problem whereby an operation that is normally mutually
exclusive with some other operation, already taking place, is suddenly allowed
to proceed simply because it happens to get assigned to the same thread!
Therefore we would prefer MT_NOUPCALL over plain MT.
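
To make the failure mode concrete, here is a minimal sketch of the deadlock
described above (the servant and member names are made up for illustration;
only the locking pattern matters):
===
#include "ace/Thread_Mutex.h"
#include "ace/Guard_T.h"

// Illustrative servant fragment, not real application code.
class Account_i /* : POA skeleton of some IDL interface */
{
public:
  void transfer (void)
  {
    // Thread T acquires a non-recursive mutex...
    ACE_Guard<ACE_Thread_Mutex> guard (this->lock_);

    // ...and then makes a remote CORBA call.  With the plain MT client
    // strategy the ORB may reuse thread T to dispatch another incoming
    // request while T is blocked waiting for this reply:
    //   this->auditor_->record ();   // remote invocation

    // If that nested request is routed back into transfer () on the same
    // thread, T tries to take lock_ a second time: deadlock with a plain
    // mutex, or (as noted above) silently broken exclusion if the mutex is
    // made recursive.
  }

private:
  ACE_Thread_Mutex lock_;
};
===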

Furthermore, we have experienced problems with the MT strategy at start-up,
whereby incoming requests get processed before the ORB is running (premature
request processing). Consider a setup with three servers (A, B and C), where A
acts as a client and the other two servers are down. Imagine that A invokes a
method on B, which causes server B to come up (thanks to the IMR). Suppose that
while B is initialising (i.e. before orb->run()) it makes a remote call to
server C. What happens now is undesirable: because the thread in B has gone
remote, it gets reassigned to process the waiting request from A. The waiting
A->B request suddenly gets processed, even though B has not yet reached
orb->run(), nor completed the rest of its initialisation.
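
A rough sketch of server B's start-up in that scenario (the object names and
the corbaloc address are purely illustrative); the dangerous window is the
remote call made before orb->run():
===
#include "tao/corba.h"

// Server B (illustrative sketch of the start-up order described above).
int main (int argc, char *argv[])
{
  CORBA::ORB_var orb = CORBA::ORB_init (argc, argv);

  // ... activate POAs, register servants, etc.; B is still initialising ...

  // Remote call to server C made before orb->run().  With the MT client
  // strategy this thread can be reassigned to dispatch the request that A
  // has already queued on B (the premature-processing window).
  CORBA::Object_var obj =
    orb->string_to_object ("corbaloc:iiop:hostC:9999/SomeService");
  // SomeService_var c = SomeService::_narrow (obj.in ());
  // c->handshake ();

  // Only here should requests from A actually be serviced.
  orb->run ();
  orb->destroy ();
  return 0;
}
===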

The RW ClientConnectionHandler strategy of course does not have this drawback,
because it does not allow nesting of upcalls. But it has other drawbacks, such
as exhibiting a "pure client role" and having no event loop, which makes it
pretty useless for a persistent server. Nevertheless, to get around this, what
we did is start up all our servers with RW, and then, just before calling
orb->run(), we modify the ClientConnectionHandler strategy to MT on the fly!!
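
For reference, the initial strategy is the one selected when the ORB is
created, e.g. via `-ORBClientConnectionHandler RW`. A minimal sketch of that
part is below; the on-the-fly switch itself relies on TAO internals and is
deliberately not shown:
===
#include "tao/corba.h"

int main (int argc, char *argv[])
{
  // Start the server with the RW client connection handler, either by passing
  //   -ORBClientConnectionHandler RW
  // on the real command line or by forcing it here (illustrative only):
  int orb_argc = 3;
  char *orb_argv[] =
    {
      argv[0],
      const_cast<char *> ("-ORBClientConnectionHandler"),
      const_cast<char *> ("RW"),
      0
    };

  CORBA::ORB_var orb = CORBA::ORB_init (orb_argc, orb_argv);

  // ... single-threaded initialisation, including any bootstrap calls ...
  // (the RW -> MT switch described above happens here, before orb->run(),
  //  using TAO-specific internals that this sketch does not reproduce)

  orb->run ();
  orb->destroy ();
  return 0;
}
===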

If it is the case that you have now fixed MT_NOUPCALL, it should mean that we
no longer need to mutate the ClientConnectionHandler strategy. For us that is
very good news!!

Thanks again,

Kostas Lykiardopoulos.
INTRACOM SA.

Kostas Lykiardopoulos
Dec 1, 2005, 4:50:14 AM

Good morning Chris,

Many thanks for your detailed explanations.

Not wishing to glorify my gross hack (regarding switching from RW->MT on the
fly) ... I just wanted to add the following comments. See below:

On Wednesday 30 November 2005 19:08, you wrote:
> Hi,
>
> As the originator (perpetrator?) of MT_NOUPCALL, I feel I should weigh
> in...


>
> On Wed, 30 Nov 2005, Kostas Lykiardopoulos wrote:
> > The RW ClientConnectionHandler strategy of course does not have this
> > drawback, because it does not allow nesting of upcalls. But it has other
> > drawbacks such as exhibiting "pure client role" and no event loop which
> > makes it pretty useless for a persistent server. Nevertheless, to get
> > around this, what we did is we start up all our servers with RW, and then
> > just before calling orb->run() we modify the ClientConnectionHandler
> > strategy to MT on the fly!!
>

> Yow! That's living on the edge and a race condition waiting to happen.

When my server starts up it is single-threaded. The only "window of
opportunity" for an incoming request to get "prematurely" serviced (before
orb->run()) when the MT client strategy is used is if the server happens to
make a remote call. This window does not exist when the server is using the RW
client strategy.

Just before I reach the point in the code where the server goes multi-threaded
and I call orb->run() on each of the threads in the TP, the server switches its
mode to MT. The switch-over takes place while the server is still
single-threaded, and the request has no way of being serviced at that moment
because the one-and-only thread is out of the reactor. Only when the reactor
gets to run, afterwards, will the request be handled. It appears to be very
deterministic and reliable, although it is a hack. (-:
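
For context, the TP mentioned above is the usual ACE_Task-based pattern,
roughly as follows (a sketch, not our actual code):
===
#include "tao/corba.h"
#include "ace/Task.h"

// Thread pool: N threads all driving the same ORB's event loop.
class ORB_Task : public ACE_Task_Base
{
public:
  ORB_Task (CORBA::ORB_ptr orb)
    : orb_ (CORBA::ORB::_duplicate (orb)) {}

  virtual int svc (void)
  {
    this->orb_->run ();   // each TP thread enters the ORB event loop
    return 0;
  }

private:
  CORBA::ORB_var orb_;
};

// Usage: the strategy switch happens while still single-threaded, then
//   ORB_Task task (orb.in ());
//   task.activate (THR_NEW_LWP | THR_JOINABLE, 4);   // e.g. 4 threads
//   task.wait ();
===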

However, starting up in RW mode is not a feasible option for all servers. A
server which needs to do some sort of handshaking with another server during
start-up (as opposed to just pushing stuff to the other server) cannot work
this way. With RW at start-up, while still single-threaded, deadlock will
result if the other server calls back into the first.

>
> > If it is the case that you have now fixed the MT_NO_UPCALL, it should
> > mean that we no longer need to mutate the ClientConnectionHandler
> > strategy. For us that is very good news!!
>

> AFAIK, this change, while good, does nothing to address a more fundamental
> flaw in MT_NOUPCALL. Note that MT_NOUPCALL is marked *EXPERIMENTAL* in the
> docs.
>
> The problem is deep within the way TAO's leader-follower implementation
> works, but, in a nutshell, if the following conditions are met:
>
> 1. Process1: all orb->run() threads are dispatched doing upcalls;
> ==> therefore, there are currently no followers and no leader
>
> 2. Process1: another thread makes an invocation and becomes the special
> "Client Leader";
>
> 3. Process2: the server that's executing the invocation from #2 needs to
> make an invocation and receive a reply from the server in Process1
>
> Now, to the casual observer this does not present a terminal problem
> because even though the thread in #2 is blocked waiting for its reply and
> will not service any incoming calls during that waiting period, eventually
> some thread in #1 above will complete and become a follower, and get
> delegated to serve the incoming request(s), which will, in turn, allow the
> invocation in #3 to send the reply to #2, and everything is peaceful and
> happy.
>
> Unfortunately, this isn't how it works. TAO's leader-follower
> implementation grants special status to threads such as #2 that become
> leaders while acting in the client role. In this case, no other thread is
> allowed to even *become* a follower until the client leader completes.
> What that means is that, until the client leader's reply comes in, the size
> of the thread pool is reduced to zero in the case above, and AFAIK there's
> nothing the application can do to prevent this.
>
> In general, MT_NOUPCALL can still lead to deadlock if you run out of
> threads, which is exactly what happens (temporarily) above. In most cases,
> though, that's an application issue since in TAO applications are
> responsible for thread creation.
>
> Finally, the "segregated ORB" solution described by Ivo on this thread and
> further detailed by me does not suffer from any of these issues, uses
> nothing TAO-specific, and does not have a race-condition-in-waiting such as
> changing the wait strategy on the fly.
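
(For readers following the thread: the "segregated ORB" approach mentioned
above amounts to running two ORB instances with distinct ORB ids, one used only
for outgoing invocations and one, with its own threads, for servicing upcalls.
A rough sketch with illustrative names:)
===
#include "tao/corba.h"

int main (int argc, char *argv[])
{
  // One ORB dedicated to outgoing client invocations...
  CORBA::ORB_var client_orb = CORBA::ORB_init (argc, argv, "ClientORB");

  // ...and a second ORB, with its own thread(s), for servicing upcalls.
  CORBA::ORB_var server_orb = CORBA::ORB_init (argc, argv, "ServerORB");

  // Servants are activated only on server_orb's RootPOA; application code
  // makes its remote calls through references managed by client_orb, so a
  // blocked outgoing call cannot starve the upcall-processing threads.

  server_orb->run ();
  server_orb->destroy ();
  client_orb->destroy ();
  return 0;
}
===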

Regards,
Kostas.

