
RPC and network disruptions


Hector Santos

Apr 12, 1999
What is the best way to handle RPC client/server operations when
there are network disruptions, such as someone breaking a circuit in
a daisy-chained Ethernet network?

What we have found is that one of our RPC client's function calls
is failing, and I want to find out the best way to recover from
this.

Some detail:

The original designer of the code (we purchased it) had the function
call run on a 1-second timer to display server statistics. But if the
call failed, the code assumed there was a server disconnect and
immediately aborted the client applet.

I want to change this to handle the situation gracefully, here and
in similar situations throughout the code.

Suggestions, Ideas, Tips?

Felix Kasza [MVP]

Apr 13, 1999
Hector,

> What is the best way to handle RPC client/server operations when
> there are network disruptions, such as someone breaking a circuit in
> a daisy-chained Ethernet network?

Not a lot of choice there -- catch the exception (or use a comm_status
attribute on the RPC functions) and reinitialize/reconnect.
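
A minimal sketch of what that looks like with the exception macros, assuming
a MIDL-generated stub named GetServerStats() and an application-specific
ReconnectToServer() helper (both names are made up here, not from your code):

/* Hypothetical sketch: wrapping a MIDL-generated client call in the
 * RPC structured-exception macros from <rpc.h>. */
#include <windows.h>
#include <rpc.h>

void GetServerStats(handle_t hBinding);            /* MIDL-generated stub (hypothetical name) */
RPC_STATUS ReconnectToServer(handle_t *phBinding); /* app-specific rebind helper (hypothetical) */

RPC_STATUS CallWithRecovery(handle_t *phBinding)
{
    RPC_STATUS status = RPC_S_OK;

    RpcTryExcept
    {
        GetServerStats(*phBinding);          /* the remote call itself */
    }
    RpcExcept(EXCEPTION_EXECUTE_HANDLER)
    {
        /* e.g. RPC_S_SERVER_UNAVAILABLE or RPC_S_CALL_FAILED */
        status = RpcExceptionCode();
    }
    RpcEndExcept

    if (status != RPC_S_OK)
    {
        /* Drop the (possibly stale) binding and rebuild it instead of
         * aborting the client outright. */
        RpcBindingFree(phBinding);
        status = ReconnectToServer(phBinding);
    }
    return status;
}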

--

Cheers,

Felix.

If you post a reply, kindly refrain from emailing it, too.
Note to spammers: fel...@mvps.org is my real email address.
No anti-spam address here. Just one comment: IN YOUR FACE!

Michael Kelley

Apr 13, 1999
> > What is the best way to handle RPC client/server operations when
> > there are network disruptions

The best way is to not use Win32 RPC in the first place. After 4
years of intensive RPC development, testing, and consulting with MS,
it's become clear that RPC is not the best way to go when you need
reliable messaging in unreliable network environments.

NetBEUI completely ignores RpcMgmtSetComTimeout() suggestions and
tcp/ip is somewhat unreliable at detecting broken connections
(sometimes the 2 hr timeout will kick in and sometimes it won't and
you'll have an outstanding RPC call blocked indefinitely). Dynamic
endpoint resolution is not totally reliable, sometimes successfully returning a
bogus binding handle. I found that NT's endpoint resolution under
ncacn_nb_nb is actually the most problematic. Of course, Felix's
favorite OS, Win95, has tcp/ip memory leaks that you need both
kernel update 1 *and* the DCOM95 update to completely fix. Making and
breaking RPC connections on Win95 using lrpc, netbeui, and spx leak
memory. Under Win9X, there are a few interesting combinations of
making multiple RPC connections that can result in
RpcEpResolveBinding() hung in the kernel with no way to exit except a
hard reset (try 2 local ncacn_nb_nb dynamic ep connections or one
ncalrpc dynamic ep connection and one ncacn_nb_nb dynamic ep
connection). Sometimes RpcMgmtIsServerListening() will report 1726
(RPC_S_CALL_FAILED) when the server is up and running and otherwise
inactive, but if you retry again it works OK.
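
A small retry wrapper can paper over that spurious 1726 -- a sketch only,
with an arbitrary retry count and delay:

/* Sketch: retry RpcMgmtIsServerListening() a few times before deciding
 * the server is really down.  The retry count and delay are arbitrary. */
#include <windows.h>
#include <rpc.h>

RPC_STATUS IsServerListeningWithRetry(RPC_BINDING_HANDLE hBinding)
{
    RPC_STATUS status = RPC_S_CALL_FAILED;
    int        attempt;

    for (attempt = 0; attempt < 3; attempt++)
    {
        status = RpcMgmtIsServerListening(hBinding);
        if (status == RPC_S_OK)
            break;          /* server answered the management ping */
        Sleep(250);         /* brief pause before the next attempt */
    }
    return status;          /* RPC_S_OK, or the last error observed */
}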

You can cover up for many of RPC's shortcomings, but you won't know if
you've hit them all. Unfortunately even NT4 has its RPC problems.
Overall, I'd recommend tcp/ip with static endpoints as your best bet
if it weren't for the possible infinite hang when a connection is
lost. I have absolutely no experience with name-service or secure RPC
since my systems couldn't depend upon NT, so I had to stick to
functions available under Win9X and WinNT.

DCOM-based distributed systems will be real sources of problems in the
next few years unless MS has made significant improvements in their
networking implementations. I can't wait to see significant systems
written in VB start to have problems and their developers wonder why.

Mike Kelley


Felix Kasza [MVP]

Apr 13, 1999
Mike,

> NetBEUI [does lots of stupid things]

Gack. I don't even have this excuse for a protocol installed.

> tcp/ip is somewhat unreliable at detecting broken connections
> (sometimes the 2 hr timeout will kick in and sometimes it won't

I didn't notice that one yet. Teardown on loss of connectivity always
happened for me.

> Dynamic endpoint resolution is not totally reliable, successfully
> returning a bogus binding handle.

That's another one I never encountered. Is it possible that the server
process went away without unregistering its endpoints?

> I found that NT's endpoint resolution under ncacn_nb_nb is
> actually the most problematic.

Oh -- OK. I didn't think anybody used _nb protseqs. But hey, what do you
expect from a 15-year-old name resolution mechanism? Whoever came up
with mutilating the 16th name byte to express the endpoint must have
been a sadist anyway.

> Of course, Felix's favorite OS's, Win95, has tcp/ip memory leaks

Only minor ones. Why, I have had servers running for a whole day before
the machine augered into the ground. Praise Gates that we will get
_another_ major release of this mega-kluge.

In short, I fully agree with all your points re Win9x, but only
partially so re NT.

Michael Kelley

Apr 14, 1999
Some follow up info:

ncacn_nb_nb (ugly as it may be) has the advantage of quicker
connection-attempt timeouts, allowing an application to find out that
a server is unavailable more quickly than is possible under tcp/ip. A
portion of the product I work on used _nb for this particular reason
-- faster timeouts. The rest of the product always used tcp/ip.

I spent about 2 months last year really bearing down and doing tough
testing on rpc connection establishment on a single-segment isolated
network. Most of my testing focused on making connections when the
server was available, because we had a problem with false failures in
resolving bindings. I had 12 PCs, dual-booting between NTW4.0SP3 and
a combination of Win95 build 950, Win95 OSR2.1, and Win98. I
literally tested tens of millions of connection attempts. Dynamic ep
resolution under NT demonstrated failure rates between 0.1
failures/1000 attempts and 50/1000, with tcp/ip generally falling in
the range of 0.1-5.0 failures/1000 attempts and netbeui falling in the
range of 1-50 failures/1000 attempts. These failures gave us false
negatives, i.e. the server was up and running and the client couldn't
connect. Oftentimes, once dynamic ep resolution to a particular
server failed, it always failed. In my testing, 3 successive binding
failures stopped the test since continuing was pointless. MS's basic
conclusion was that RPCSS didn't handle multithreaded access well,
which I don't really understand since multiple processes can make
effectively the same demands as a single, multi-threaded application.
RPCSS is certainly of major importance to NT in general. Anyway, MS
stated no interest in improving _nb reliability, which isn't
surprising since it needs to go away.

Years ago I discovered interesting kernel hang scenarios under Win95
for local connections with lrpc, netbeui, and spx. Tcp/ip was
therefore the only protocol of choice for local Win95 connections.
Upgrading Win95 with kernel patch 1 and DCOM95 updates eliminated the
18-bytes-per-connection tcp/ip leak and the ~240-bytes-per-lookup dns resolution leak.

Moving to static endpoints and tcp/ip protocol finally allowed >20M
connections without a single error among complex interconnections of
WinNT, Win95, and Win98 clients and servers. My team decided that the
longer timeout for nonexistent servers was better than false failures
when the server was available, so we moved to tcp/ip exclusively.
Once I decided to move to static endpoints, MS was pleased to close
our cases. Maybe they took some of our results (that they also ran
and confirmed) and used them to work on RPCSS -- then again, maybe
not. Our testing was July-September 1998, so I doubt that NT SP4's
updated RPCSS got to address this unless MS was already working on the
problem. I haven't tested SP4 yet to see if our tests get different
results.
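
For anyone following along, the static-endpoint arrangement looks roughly
like this (a sketch with made-up interface, host, and port values, not our
product code); the point is that the endpoint mapper (RPCSS) never enters
the picture:

/* Sketch of a static-endpoint ncacn_ip_tcp setup.  The interface handle,
 * host name ("myserver"), and port ("5001") are made-up illustrations. */
#include <windows.h>
#include <rpc.h>

extern RPC_IF_HANDLE MyInterface_v1_0_s_ifspec;  /* MIDL-generated server ifspec (hypothetical) */

/* Server side: listen on a fixed TCP port. */
RPC_STATUS ServerInit(void)
{
    RPC_STATUS status;

    status = RpcServerUseProtseqEp((unsigned char *)"ncacn_ip_tcp",
                                   RPC_C_PROTSEQ_MAX_REQS_DEFAULT,
                                   (unsigned char *)"5001",      /* static endpoint */
                                   NULL);
    if (status != RPC_S_OK)
        return status;

    status = RpcServerRegisterIf(MyInterface_v1_0_s_ifspec, NULL, NULL);
    if (status != RPC_S_OK)
        return status;

    /* Blocks until the server is told to stop listening. */
    return RpcServerListen(1, RPC_C_LISTEN_MAX_CALLS_DEFAULT, FALSE);
}

/* Client side: compose a fully bound handle -- no endpoint mapper involved. */
RPC_STATUS ClientConnect(RPC_BINDING_HANDLE *phBinding)
{
    RPC_STATUS     status;
    unsigned char *stringBinding = NULL;

    status = RpcStringBindingCompose(NULL,
                                     (unsigned char *)"ncacn_ip_tcp",
                                     (unsigned char *)"myserver",   /* host */
                                     (unsigned char *)"5001",       /* same static endpoint */
                                     NULL,
                                     &stringBinding);
    if (status == RPC_S_OK)
    {
        status = RpcBindingFromStringBinding(stringBinding, phBinding);
        RpcStringFree(&stringBinding);
    }
    return status;
}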

Portions of our product used some blocking RPC calls that were
intentionally left blocked in the server and only returned when
updates were available. When the client nodes were disconnected from
the network, sometimes we'd get a 2 hr timeout, often not. Burning
many hours against our premier support contract with MS and
talking/testing directly with the RPC developer, we confirmed that, in
this style of usage, we were not even guaranteed to get the 2 hr
timeout on a pending RPC call. This was the final blow that convinced
us RPC wasn't right for our future development.

My advice to anyone who has bothered to read through all this is to do
extensive connectivity testing on prototyped systems very early on,
before you are solidly locked into RPC. You may come across important
RPC issues. In theory, RPC looks great. In practice, it's got a few
implementation problems.

Mike Kelley


Hector Santos

Apr 14, 1999, to Michael Kelley
Thanks for the wonderful insight. However, at this time, a complete
design change is out of the question.

We are finding that our current RPC-based client/server design is
extremely reliable, with excellent feedback. The RPC design is the
least of our problems.

We just took this product over, but it has 4 years of engineering
behind it, and we are very close with the original designers. We
believe this product's RPC client/server design has behaved exactly
as expected, with incredible reliability and low overhead, and it has
met our speed demands.

In short, we experience no customer support issues regarding
RPC-related problems.

I'm only saying this because I have seen some people remark on
how bad RPC is. We just have not encountered these problems. Mind
you, there are some things the original designers did to "compensate"
for problems generated by the MIDL compiler (i.e., server memory
leak problems).

Other than that, we have found no other issues.

Nonetheless, we were getting some reports of clients "shutting down" and
we recently and accidentally discovered why: network disruptions.

In this particular case, one of the RPC clients is making a simple call
every second to the server to grab some stats. If it failed, the
client closed down. No warning or error display.

Now that we know this, we can easily add some "smarts" to it
and not be so "direct" in closing down after a single failed
call.

Our current approach will be to add an error counter, i.e., if it
fails 10 times or so, then we can assume there was a permanent
network break.
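
Roughly what I have in mind for the timer handler (a sketch; the threshold
of 10 and the GetServerStats() / ShowDisconnectedWarning() names are
placeholders, not our actual code):

/* Sketch of the "error counter" idea for the 1-second stats timer. */
#include <windows.h>
#include <rpc.h>

#define MAX_CONSECUTIVE_FAILURES 10

static int g_failureCount = 0;

void GetServerStats(handle_t hBinding);   /* hypothetical MIDL-generated stub */
void ShowDisconnectedWarning(void);       /* hypothetical UI hook */
extern handle_t g_hBinding;               /* binding handle owned elsewhere */

void OnStatsTimer(void)
{
    RPC_STATUS status = RPC_S_OK;

    RpcTryExcept
    {
        GetServerStats(g_hBinding);
    }
    RpcExcept(EXCEPTION_EXECUTE_HANDLER)
    {
        status = RpcExceptionCode();
    }
    RpcEndExcept

    if (status == RPC_S_OK)
    {
        g_failureCount = 0;            /* healthy again: reset the counter */
    }
    else if (++g_failureCount >= MAX_CONSECUTIVE_FAILURES)
    {
        /* ~10 seconds of consecutive failures: treat it as a real break
         * and warn the user instead of silently closing the applet. */
        ShowDisconnectedWarning();
    }
}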

But I wanted to see if there is a more direct, RPC-suggested method
to trap a break, so that logic can be added to resolve it gracefully
rather than relying on an "error count" concept.

We are looking into getting a better error report back from
the server or the RPC middleware.

Thanks


Hector Santos

Apr 14, 1999, to Felix Kasza [MVP]

On Apr 13, 1999 07:35am, FEL...@MVPS.ORG (FELIX KASZA [MVP]) wrote to HECTOR
SANTOS:

>> What is the best way to handle RPC client/server operations when
>> there are network disruptions, such as someone breaking a circuit in
>> a daisy-chained Ethernet network?

FK> Not a lot of choice there -- catch the exception (or use a comm_status
FK> attribute on the RPC functions) and reinitialize/reconnect.

The MIDL-created function source does create an RPC exception trap,
and that is what we get back as a return error from the client function call.

Thanks Felix

Felix Kasza [MVP]

Apr 16, 1999
Mike,

thanks for the extensive info. I have been doing RPC for years, and I
have yet to see a single false-negative endpoint resolution that was not
caused by circumstances beyond the control of RPCSS; a quick test (only
50,000 connections) I did earlier today should have given at least one
failure. I'll let it run overnight, too.

Incidentally, what you may have been experiencing might be more of an IP
problem than an RPCSS problem: if the listen queue for rpcss gets full,
it gets full. Period.

Personally, I stay away from _nb protseqs (and from Win9x); and given
those two restrictions, I am perfectly happy with RPC, including some
large-scale production environments.

Michael Kelley

Apr 19, 1999
Felix,

Important architectural details to add. My testing reflected the
architecture used in the product I work on, where our RPC clients are
also RPC servers and vice versa. The test used multiple client
nodes, with multiple threads within each client process making
simultaneous connection attempts. These back-channel RPC connections
are used for async status messages in the product.

My test client application took multiple command line arguments, with
each one being the node name of a server to connect to. Each command
line arg started a background thread connecting/calling/disconnecting to
the corresponding server. A call from the "client" test application
to a server caused the server to establish a back-channel RPC connection
to the "client". Giving the same node name multiple times generated
multiple threads trying to connect simultaneously. Command line args
were also used to specify a single protocol, optional endpoint if
testing with static endpoints, and a flag to pass via remote procedure
to the server to indicate whether reverse-channel RPC connections should be
made.

In a simple test with 2 nodes, I'd start the rpc server on each node,
then start the rpc client on each node with 4 args, 2 being the local
node name and 2 being the other node's name. 4 threads going in each
client process, 2 going local, 2 going remote. The server processes
would receive 4 connection requests (asynchronously), followed by a
remote procedure call that would try to make a "back-channel" RPC
connection to a different RPC interface being served by the
originating client process. If the back-channel connection failed,
the remote procedure returned failure and the client process freed the
forward-path RPC binding and then treated the back-channel failure
basically the same as if it were a forward-channel binding failure.
We did keep separate statistics for forward-channel and back-channel
connection failures. In this test I described, you ended up with 16
RPC connections when all worked OK -- 4 forward connections on each of
2 nodes and 4 back-channel connections on each of 2 nodes.

A connection failure was defined as any failure in either direction.
In our test scenario, back-channel connection failures were probably
10x more likely. We found it was really necessary to stress RPC in order to
get the false error in RPC ep resolution to occur. Once the error
occurred, it was usually permanent and required the client test
process to be terminated and restarted. While one client process
encountered this error, other client processes on other nodes
continued to operate.

The convincing point was when I tried using static endpoints and the
connections *always* worked (and were obviously a whole lot faster
too). When using dynamic endpoints, sometimes RpcEpResolveBinding()
would succeed and the remote procedure call would immediately fail
with an invalid binding error (this was the failure I really
disliked).
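
For contrast with the static case, dynamic-endpoint setup composes a
partially bound handle (no endpoint) and then asks the endpoint mapper on
the server machine to fill it in -- roughly like this sketch, with made-up
names:

/* Sketch of dynamic endpoint resolution.  "MyInterface_v1_0_c_ifspec"
 * and the host name are hypothetical. */
#include <windows.h>
#include <rpc.h>

extern RPC_IF_HANDLE MyInterface_v1_0_c_ifspec;  /* MIDL-generated client ifspec (hypothetical) */

RPC_STATUS ConnectDynamic(RPC_BINDING_HANDLE *phBinding)
{
    RPC_STATUS     status;
    unsigned char *stringBinding = NULL;

    /* No endpoint argument: the handle is only partially bound. */
    status = RpcStringBindingCompose(NULL,
                                     (unsigned char *)"ncacn_ip_tcp",
                                     (unsigned char *)"myserver",
                                     NULL, NULL,
                                     &stringBinding);
    if (status != RPC_S_OK)
        return status;

    status = RpcBindingFromStringBinding(stringBinding, phBinding);
    RpcStringFree(&stringBinding);
    if (status != RPC_S_OK)
        return status;

    /* Ask the server's endpoint mapper (RPCSS) for the real endpoint.
     * This is the step that sometimes "succeeded" and still handed back
     * a binding the first remote call rejected as invalid. */
    return RpcEpResolveBinding(*phBinding, MyInterface_v1_0_c_ifspec);
}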

Several iterations of my test program were necessary before MS support
accepted it as a reproducible problem and forwarded it to the RPC
developers. The only real feedback I got was that they were looking
at it and thought the NT problems might be problems in protocol
stacks, not the RPC runtime. I basically brought them 3 problems, we
isolated two and worked around them, and the third we simply avoided.

1. ncacn_ip_tcp leaked under Win95 (kernel update and winsock2
fixed, winsock2 for 95 not required if using dotted-decimal addresses
instead of computer names when making rpc bindings)

2. dynamic endpoints sometimes failed to resolve (went to static
endpoints and the performance was so much better anyway). Failed
under 95, 98, and NT4SP3.

3. interesting RpcEpResolveBinding() hang scenarios under Win95
were previously avoided in my product and MS just wasn't interested in
this one.

I realize that my lengthy responses were basically piggybacked on a
simple request for info on RPC exception handling in a dirty network
environment. My particular RPC testing also pointed out some problems
with a clean, isolated network running with local hosts files.

If you are interested in my test code, email me privately.

Regards,
Mike Kelley
mke...@micros.com


Felix Kasza [MVP]

Apr 19, 1999
Mike,

> 1. ncacn_ip_tcp leaked under Win95 (kernel update and winsock2
> fixed, winsock2 for 95 not required if using dotted-decimal addresses
> instead of computer names when making rpc bindings)

The good old sockets/DNS leak again. For that alone, Win95 should have
been taken out into the yard and put up against the wall.

> 2. dynamic endpoints sometimes failed to resolve (went to static
> endpoints and the performance was so much better anyway). Failed
> under 95, 98, and NT4SP3.

I strongly suspect that the rpcss listen queue overflows, dropping
connections on the floor and leading to ep resolution failure. All the
details you have listed match this scenario -- and with static eps, no
talking to rpcss is needed, which would be the reason why it works.

When you call MS again, tell them to (a) up the backlog argument in the
rpcss listen() call and (b) if running on NT4 _Workstation_ or higher,
to optionally log a warning (hard upper limit of 5).
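
(For context, the backlog in question is just the second argument to
listen() -- how many pending connects the stack will queue before refusing
new ones. A generic Winsock illustration, not rpcss source:)

/* Generic Winsock sketch of the listen() backlog parameter; this is not
 * rpcss code, only an illustration of the knob being discussed. */
#include <windows.h>
#include <winsock.h>
#include <string.h>

SOCKET MakeListeningSocket(unsigned short port)
{
    WSADATA            wsa;
    SOCKET             s;
    struct sockaddr_in addr;

    if (WSAStartup(MAKEWORD(1, 1), &wsa) != 0)
        return INVALID_SOCKET;

    s = socket(AF_INET, SOCK_STREAM, 0);
    if (s == INVALID_SOCKET)
        return INVALID_SOCKET;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = INADDR_ANY;
    addr.sin_port        = htons(port);

    if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) == SOCKET_ERROR)
        return INVALID_SOCKET;

    /* The second argument is the backlog: connection attempts beyond this
     * number are dropped until the server accept()s some of the queued
     * ones.  NT 4 Workstation caps the effective value at 5 regardless
     * of what you request. */
    if (listen(s, 50) == SOCKET_ERROR)
        return INVALID_SOCKET;

    return s;
}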

Michael Kelley

Apr 22, 1999
>I strongly suspect that the rpcss listen queue overflows,

If this is the root of the problem, then the listen queue must be
quite small indeed. It seems that in the 2-node example I cited,
you would have at most 8 simultaneous bindings being created and
accessing a single RPCSS. Since this includes local connections, this
limitation really limits node interconnectivity.

I couldn't get MS support to delve into specific configuration of
RPCSS on various platforms. I clearly demonstrated the problem, and they
accepted it as an rpcss limitation, but I never got any real
quantification of the problem. The other nagging problem was
successful ep resolution that immediately failed as an invalid binding
when making a remote procedure call.

Thanks for the insight.
-Mike

