Invalid action proposal / Optimizer problems

Simon Fayer

May 9, 2018, 4:10:53 AM
to diracgri...@googlegroups.com
Hi all,

[TL;DR version: DISET BaseClient._getBaseStub is slow and (as far as I can
tell) doesn't appear to do anything useful any more; can it be removed?]

I've been tracking down the infamous "invalid action proposal" bug which
crops up from time to time... On the GridPP server, we mainly see this
showing up in our optimizer modules every couple of weeks, which causes all
the jobs to stick in the received state until an admin intervenes (restarts
the services).

The cause of this is a race condition between the client and server DISET
code when making a new connection. The client does the following steps[1]:

1. Make a connection to the server.
2. Find out extra information for the connection.
3. Send the action proposal.
4. Await and process the return values.

The problem here is step 2: Step 1 starts a five second timeout on the
server, so step 2 has to finish within five seconds for the connection to
be successful.
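
A minimal sketch of this race, using a manually advanced clock so it runs instantly. The class and function names, the result dictionaries and the error text are illustrative assumptions, not DISET code; only the five-second window and the order of the steps come from the description above.

```python
HANDSHAKE_TIMEOUT = 5.0  # seconds; the server-side window described above


class FakeClock:
    """Manually advanced clock, so the example needs no real sleeping."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


class ToyServer:
    """Hypothetical stand-in for the server side of the handshake."""

    def __init__(self, clock, timeout=HANDSHAKE_TIMEOUT):
        self.clock = clock
        self.timeout = timeout
        self.accepted_at = None

    def accept(self):
        # Step 1: accepting the connection starts the timeout.
        self.accepted_at = self.clock.now

    def receive_proposal(self, proposal):
        # Step 3: a proposal that arrives too late is rejected.
        if self.clock.now - self.accepted_at > self.timeout:
            return {"OK": False, "Message": "Invalid action proposal"}
        return {"OK": True, "Value": proposal}


def connect(prep_seconds):
    """Run steps 1 to 3, with step 2 taking prep_seconds on the client."""
    clock = FakeClock()
    server = ToyServer(clock)
    server.accept()              # step 1
    clock.advance(prep_seconds)  # step 2: slow client-side lookup
    return server.receive_proposal(("Example/Service", "action"))  # step 3


print(connect(1.0)["OK"])  # True: step 2 finished inside the window
print(connect(6.0)["OK"])  # False: step 2 overran the window
```

The point of the model is that nothing in step 2 talks to the server, yet its duration alone decides whether step 3 is accepted.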

The core bit of step 2 is in the _getBaseStub function of the base
class[2]... It looks for the delegated group, but if it doesn't find one it
tries to work out the default for the user. In cases of server-server
connections, this code always runs as the hostcert doesn't have an
associated group. This lookup is very expensive as it has to iterate over
the CS groups (which in the GridPP case is ~40). Most of the time it still
completes in the five second window, but not always. If it goes over the
time limit, a retry loop does the same thing again a few times, which adds
to the load on the server, generally making things worse.
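
To make the cost concrete, here is a rough sketch of a linear scan over a group table, plus a memoised wrapper as one possible way to avoid repeating it. The data layout, the names and the cache are all assumptions for illustration; this is not how the CS client is implemented, and a real cache would have to be invalidated whenever the CS is updated.

```python
# Hypothetical stand-in for the CS groups section (about 40 groups, as at GridPP).
GROUPS = {f"group_{i}": {"Users": [f"user_{i}"]} for i in range(40)}


def find_default_group(username, groups):
    """Linear scan: return (group, entries_scanned) for the first match."""
    scanned = 0
    for name, options in groups.items():
        scanned += 1
        if username in options["Users"]:
            return name, scanned
    return None, scanned


class CachedGroupLookup:
    """Remember earlier answers, including 'no group found'."""

    def __init__(self, groups):
        self.groups = groups
        self.cache = {}
        self.total_scanned = 0

    def lookup(self, username):
        if username not in self.cache:
            group, scanned = find_default_group(username, self.groups)
            self.total_scanned += scanned
            self.cache[username] = group
        return self.cache[username]

    def invalidate(self):
        # Would need calling on every configuration refresh.
        self.cache.clear()


host = "host/server.example.org"  # an identity that belongs to no group

# Uncached: each of three attempts (first try plus retries) scans everything.
uncached_total = sum(find_default_group(host, GROUPS)[1] for _ in range(3))
print(uncached_total)  # 120

# Cached: only the first attempt pays for the scan.
cached = CachedGroupLookup(GROUPS)
for _ in range(3):
    cached.lookup(host)
print(cached.total_scanned)  # 40
```

A host identity is the worst case here, because a lookup that finds nothing always walks the whole table.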

So this bug seems to mainly be triggered by the following conditions:
- Heavily loaded server.
- Large number of entries in CS groups.
- Frequent intra-DIRAC component connections.
- High numbers of client threads (I think some parts of the CS client are
sometimes serialised on cache updates, which makes the problem worse).

To fix this, _getBaseStub needs to run as quickly as possible (which is
just good for performance anyway). I couldn't find anything using the
information it collects: Even if it is sent to the server as a hint, the
server still has to do the same thing (and has the same information
available) as it can't trust a client to say which group it's in without
verification.

As a test, I commented the entire function out except for the first and
last lines: As far as I can tell everything still works perfectly, but the
troublesome connections now run far more than 10 times faster. I tried all
kinds of stuff with proxy delegations, but everything appeared to work as
it used to.

I know that DISET is probably going to be replaced with a new transport in
future versions, but I'm quite keen to get this fixed properly until then.

So, I think my main question is: does anyone know what the code in
_getBaseStub is actually used for, or can I put in a pull request to remove
it?

If there is a BiLD meeting tomorrow, I should be able to join, so I'm happy
to discuss this there if needed.

Regards,
Simon

[1] https://github.com/DIRACGrid/DIRAC/blob/integration/Core/DISET/private/InnerRPCClient.py#L35
[2] https://github.com/DIRACGrid/DIRAC/blob/integration/Core/DISET/private/BaseClient.py#L602

Federico Stagni

May 9, 2018, 4:31:08 AM
to diracgri...@googlegroups.com, Christophe HAEN, Zoltan
Hi Simon,
there won't be a BiLD meeting tomorrow, as it's a holiday at CERN (and in France too), but there will be one next week.

Thanks for the investigation, it's really HIGHLY appreciated. I have cc-ed Chris and Zoltan, as they may have looked at this code more deeply than I ever have (at least Chris documented it). I myself would need to read your mail and the code a few more times to understand it properly.
The code itself was written in 2009.

I would like to know, in the meantime, the tests you have done (are they in a reproducible format?)

Cheers,
Federico




Mathe Zoltan

May 9, 2018, 4:52:31 AM
to Federico Stagni, diracgri...@googlegroups.com, Christophe HAEN
Hi,

This method cannot be removed. The solution is to implement faster CS access. I already see this issue myself, as I am last in the list of users.
If I remember correctly, the rpcStub is used in various places; one of the most important is forwarding DISET requests (it was also used in the old web, but that has been removed).
Your tests probably worked because you only tested one case. As soon as you have a failover request, the request cannot be forwarded, because the rpcStub contains all the information needed to forward the request.

We can discuss what to do at the next meeting.

Cheers,
Zoltan

Simon Fayer

May 9, 2018, 5:07:19 AM
to diracgri...@googlegroups.com, Federico Stagni, Christophe HAEN
Hi Zoltan,

On Wed, May 09, 2018 at 10:52:27AM +0200, Mathe Zoltan wrote:
> Probably your tests worked, because you only tested in one case. But as soon as you have failover request, the request can not be forwarded, because the rpcStub contains all information needed to forward the request.

Yes, my tests were limited to "things working normally"... I mainly ran
standard user workflows with different types of proxy (DIRAC with group,
VOMS & completely plain) to see if there was anywhere the user/group didn't
get picked up properly.

Which components use failover requests and what would I have to do to
trigger one? I would like to build a test case for this, so I can start
investigating other possible solutions...

Regards,
Simon

Federico Stagni

May 9, 2018, 5:28:27 AM
to Simon Fayer, diracgri...@googlegroups.com, Christophe HAEN
I think the operations Zoltan is talking about are the "forwardDISET" operations (handled by the RequestExecutingAgent). One is created, for example, in DataStoreClient.py.
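
As a rough illustration of the record-and-replay idea behind such an operation: the call is captured as a serialisable stub together with the delegation parameters, and replayed later on the caller's behalf. The stub layout, field names, service name and helper functions below are hypothetical and do not reproduce the actual rpcStub format.

```python
import json


def make_stub(target, delegated_dn, delegated_group, method, args):
    """Build a hypothetical, JSON-serialisable record of one RPC call."""
    return {
        "target": target,
        "kwargs": {"delegatedDN": delegated_dn, "delegatedGroup": delegated_group},
        "method": method,
        "args": list(args),
    }


def forward(stub_json, services):
    """Replay a stored call; refuse if the delegation details are missing."""
    stub = json.loads(stub_json)
    kwargs = stub["kwargs"]
    if not kwargs.get("delegatedDN") or not kwargs.get("delegatedGroup"):
        return {"OK": False, "Message": "stub lacks delegation parameters"}
    handler = services[stub["target"]]
    return handler(stub["method"], stub["args"], **kwargs)


def fake_service(method, args, delegatedDN, delegatedGroup):
    """Toy service that reports what it was asked to do, and for which group."""
    return {"OK": True, "Value": (method, args, delegatedGroup)}


services = {"Example/Service": fake_service}
good = json.dumps(make_stub("Example/Service", "/CN=someone", "example_group", "commit", [1, 2]))
bad = json.dumps(make_stub("Example/Service", "/CN=someone", None, "commit", [1, 2]))

print(forward(good, services)["OK"])  # True
print(forward(bad, services)["OK"])   # False: nothing to act on behalf of
```

In this model, dropping the group lookup on the client side would produce stubs like `bad`, which is the failure mode a test for the failover path would need to cover.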

Simon Fayer

May 9, 2018, 5:47:20 AM
to Federico Stagni, diracgri...@googlegroups.com, Christophe HAEN
Thanks, yes, I see how the delegation parameters are used there... I'll
write some tests to gain some familiarity with that part of the code.

Regards,
Simon