Hi all,
[TL;DR version: DISET BaseClient._getBaseStub is slow and (as far as I can
tell) doesn't appear to do anything useful any more; can it be removed?]
I've been tracking down the infamous "invalid action proposal" bug which
crops up from time to time... On the GridPP server, we mainly see this
showing up in our optimizer modules every couple of weeks, which causes all
the jobs to stick in the received state until an admin intervenes (restarts
the services).
The cause of this is a race condition between the client and server DISET
code when making a new connection. The client does the following steps[1]:
1. Make a connection to the server.
2. Find out extra information for the connection.
3. Send the action proposal.
4. Await & process the return values.
The problem here is step 2: step 1 starts a five-second timeout on the
server, so step 2 has to finish within five seconds for the connection to
be successful.
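
To make the timing issue concrete, here is a toy model of the four steps. It is only a sketch and not DISET code: ToyServer, FakeClock and toy_client_call are names made up for the illustration, and the only figure taken from the real system is the five-second timeout.

```python
import time

# Five seconds is the figure quoted above; everything else here is invented
# for illustration and is not DISET code.
SERVER_HANDSHAKE_TIMEOUT = 5.0


class FakeClock:
    """Manually advanced clock so the example runs instantly."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


class ToyServer:
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._accepted_at = None

    def accept(self):
        # Step 1: the timeout starts counting as soon as the connection is made.
        self._accepted_at = self._clock()

    def receive_proposal(self, proposal):
        # Step 3: a proposal arriving after the deadline is rejected.
        if self._clock() - self._accepted_at > SERVER_HANDSHAKE_TIMEOUT:
            return {"OK": False, "Message": "proposal arrived after timeout"}
        return {"OK": True, "Value": proposal}


def toy_client_call(server, gather_extra_info, proposal):
    server.accept()                            # step 1
    gather_extra_info()                        # step 2 (the slow part)
    return server.receive_proposal(proposal)   # steps 3 and 4


clock = FakeClock()
fast = toy_client_call(ToyServer(clock), lambda: clock.advance(0.5), "ping")
slow = toy_client_call(ToyServer(clock), lambda: clock.advance(6.0), "ping")
print(fast["OK"], slow["OK"])
```

The client does nothing wrong in the slow case; it simply spends too long in step 2 while the server's clock is already running.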
The core bit of step 2 is in the _getBaseStub function of the base
class[2]... It looks for the delegated group, but if it doesn't find one it
tries to work out the default for the user. In cases of server-server
connections, this code always runs as the hostcert doesn't have an
associated group. This lookup is very expensive as it has to iterate over
the CS groups (of which there are ~40 in the GridPP case). Most of the time it
still completes within the five-second window, but not always. If it goes over
the time limit, a retry loop does the same thing again a few times, which adds
to the load on the server, generally making things worse.
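
As a rough illustration of why the retries hurt, the sketch below counts how many groups get inspected when an identity with no group (such as a host certificate) connects. The data layout and function names are hypothetical and much simpler than the real CS lookup; it also assumes the worst case where every attempt repeats the full scan.

```python
def find_default_group(identity, groups, counter):
    """Linear scan over all groups; counts each group it has to look at."""
    for name, members in groups.items():
        counter["groups_inspected"] += 1
        if identity in members:
            return name
    return None  # e.g. a host certificate with no associated group


def groups_inspected_for_connection(identity, groups, attempts):
    """Total groups inspected if each attempt repeats the full lookup."""
    counter = {"groups_inspected": 0}
    for _ in range(attempts):
        find_default_group(identity, groups, counter)
    return counter["groups_inspected"]


# Hypothetical stand-in for the CS groups: 40 groups, one user in each.
toy_groups = {"group%d" % i: ("user%d" % i,) for i in range(40)}

single = groups_inspected_for_connection("host/server01", toy_groups, 1)
retried = groups_inspected_for_connection("host/server01", toy_groups, 3)
print(single, retried)
```

An identity that matches no group always pays for the full scan, and each retry pays for it again, so the cost grows with both the number of groups and the number of attempts.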
So this bug seems to mainly be triggered by the following conditions:
- Heavily loaded server.
- Large number of entries in CS groups.
- Frequent intra-DIRAC component connections.
- High numbers of client threads (I think some parts of the CS client are
sometimes serialised on cache updates, which makes the problem worse).
To fix this, _getBaseStub needs to run as quickly as possible (which is
just good for performance anyway). I couldn't find anything using the
information it collects: Even if it is sent to the server as a hint, the
server still has to do the same thing (and has the same information
available) as it can't trust a client to say which group it's in without
verification.
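
The reasoning about the hint can be sketched as follows. This is not how the DISET server is written; server_resolve_group and the registry layout are assumptions made up to show that the server ends up doing its own membership check whether or not the client sends a group.

```python
def server_resolve_group(identity, client_hint, registry):
    """Pick the group the server will use for this identity.

    registry is a hypothetical mapping of group name -> set of identities.
    The client's hint is only honoured if the server's own lookup confirms it.
    """
    # The server has to compute this regardless of what the client claims.
    allowed = [name for name, members in registry.items() if identity in members]
    if client_hint in allowed:
        return client_hint
    return allowed[0] if allowed else None


toy_registry = {
    "group_a": {"alice"},
    "group_b": {"alice", "bob"},
}

print(server_resolve_group("alice", "group_b", toy_registry))  # hint confirmed
print(server_resolve_group("bob", "group_a", toy_registry))    # hint rejected
```

In this model the client-side lookup saves the server nothing, which is the same conclusion as above.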
As a test, I commented the entire function out except for the first and
last lines. As far as I can tell, everything still works perfectly, but the
troublesome connections now run well over 10 times faster. I tried all
kinds of stuff with proxy delegations, but everything appeared to work as
it used to.
I know that DISET is probably going to be replaced with a new transport in
future versions, but I'm quite keen to get this fixed properly in the meantime.
So, I think my main question is: does anyone know what the code in
_getBaseStub is actually used for, or can I put in a pull request to remove
it?
If there is a BiLD meeting tomorrow, I should be able to join, so I'm happy
to discuss this there if needed.
Regards,
Simon
[1]
https://github.com/DIRACGrid/DIRAC/blob/integration/Core/DISET/private/InnerRPCClient.py#L35
[2]
https://github.com/DIRACGrid/DIRAC/blob/integration/Core/DISET/private/BaseClient.py#L602