Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

[PATCH] better timeout enforcement (was: (no subject)

25 views
Skip to first unread message

Jared Johnson

unread,
Oct 5, 2011, 3:46:31 PM10/5/11
to perl...@perl.org, dc...@efolder.net
eh... didn't mean to send this with no subject :)

> Our software generates a whole lot of concurrent LDAP traffic right now
> and we started running into an issue where our child processes would hang
> forever waiting around for LDAP operations that had apparently hung on the
> server end. This would happen with start_tls(), search(), and bind()
> calls alike. I noticed that Net::LDAP passes the Timeout parameter passed
> to new() into the socket object, so I presume that the intention is for
> operations other than only connecting to time out (e.g. start_SSL()
> inherits the timeout, so $ldap->start_tls() would presumably time out in
> the same time that the initial connection would have). However this was
> not happening. We dug around debugged for days (repro was difficult) and
> found a couple of changes that would make things work a bit better.
>
> First, and maybe least controversially, the root issue we found was that
> process() winds up doing a can_read() call that potentially blocks
> forever. Changing this is pretty simple:
>
> @@ -824,10 +824,11 @@ sub process {
> my $ldap = shift;
> my $what = shift;
> my $sock = $ldap->socket or return LDAP_SERVER_DOWN;
> + my $timeout = $sock->timeout;
> my $sel = IO::Select->new($sock);
> my $ready;
>
> - for( $ready = 1 ; $ready ; $ready = $sel->can_read(0)) {
> + for( $ready = 1 ; $ready ; $ready = $sel->can_read( $timeout )) {
> my $pdu;
> asn_read($sock, $pdu)
> or return _drop_conn($ldap, LDAP_OPERATIONS_ERROR, "Communications
> Error");
>
>
> We tested that patch and it helped to mitigate this issue -- although not
> perfectly, given that multiple process() calls can happen for a given LDAP
> operation, so the effective timeout can be nudged upward if a lot of
> process() calls hang for a long time but don't quite reach the timeout.
> Still, it's better than nothing. If one were to be a bit pedantic, one
> could set some hash key for the life of an external call to indicate how
> long we've been waiting, and then in process(), do something like:
>
> ( $ldap->{waited} < $timeout ? $sel->can_read( $timeout - $ldap->{waited}
> ) : () )
>
> That would probably require more robust testing though.
>
> Something that actually worked a bit better for us, but which it seems may
> be less portable, was to set SO_SNDTIMEO and SO_RCVTIMEO on the underlying
> socket right after establishing the connection:
>
> @@ -117,6 +117,15 @@ sub new {
>
> return undef unless $obj->{net_ldap_socket};
>
> + if ( $arg->{timeout} and IO::Socket->can('SO_SNDTIMEO') ) {
> + my $timeval = pack 'L!L!', $ldap_opts->{timeout}, 0;
> + eval {
> + $obj->{net_ldap_socket}->sockopt( SO_SNDTIMEO, $timeval ) or die
> $!;
> + $obj->{net_ldap_socket}->sockopt( SO_RCVTIMEO, $timeval ) or die
> $!;
> + };
> + warn "Couldn't set socket timeout:$@\n";
> + }
> +
> $obj->{net_ldap_resp} = {};
> $obj->{net_ldap_version} = $arg->{version} || $LDAP_VERSION;
> $obj->{net_ldap_async} = $arg->{async} ? 1 : 0;
>
> I don't have any platforms readily available to test on that don't support
> SO_[SND|RCV]TIMEO, so the can() call there is only a guess at how to make
> sure we only bother with this on platforms that support it. I can say
> from my testing that it looks like even if the sockopt() call fails,
> Net::LDAP soldiers on anyway, so the can() thing may not be necessary at
> all. At any rate, our temporary solution for this issue, since we didn't
> want to run a locally modified LDAP.pm everywhere, was to do that
> sockopt() magic on $ldap->socket right after calling Net::LDAP->new. That
> is well tested, but I have not tested the patch above, though it is very
> similar. This also has the advantage of guarding against timeouts in
> syswrite() calls -- I actually wrote it when I still thought syswrite()
> was the culprit and not can_read().
>
> For more information on the sockopt() business, see, eh, google, and
> especially: http://perldesignpatterns.com/?SocketProgramming
>
> Finally, another change that partially mitigated the issue before we found
> our solution, and we found to be a good idea in general, was to add the
> 'timelimit' param equal to our timeout to the parameters we passed to
> search(). Another possibly controversial improvement to Net::LDAP
> timeouts would be to always add this parameter automatically from within
> search(), if a timeout is set and the parameter was not already passed.
> After all, why let the remote server potentially continue churning their
> answer if we've already given up on them?
>
> We've solved this problem for our own software but would love to see any
> or all of our fixes incorporated upstream, so I would welcome any feedback
> or suggestions on submitting these more suitably as useful patches.
>
> -Jared
>
>


Jared Johnson

unread,
Oct 6, 2011, 10:43:53 AM10/6/11
to Graham Barr, perl...@perl.org, dc...@efolder.net
> $sock->timeout is intended to be a connect timeout. Why should read
> timeout be the same.

well, it seemed like a good compromise short of providing a new API, which
I didn't have any bright ideas about. If you thought it best to provide
such an API, I wouldn't object to using it :) Also, I guess I incorrectly
interpreted Net::LDAP's passing the timeout from new() to the underlying
socket and doing nothing else, as intending to do more than just implement
a connect timeout; e.g. at least one subsequent operation,
$ldap->start_tls(), inherits the same timeout. I assumed there were other
such operations affected, maybe there are not.

>>> my $sel = IO::Select->new($sock);
>>> my $ready;
>>>
>>> - for( $ready = 1 ; $ready ; $ready = $sel->can_read(0)) {
>>> + for( $ready = 1 ; $ready ; $ready = $sel->can_read( $timeout )) {
>

> This would not be sufficient to stop the client hanging. ->process is
> called from within ->sync which will keep calling process until it gets
> the message response it is looking for.

The way I read that loop, as long as we can get can_read() to come back,
if the socket is not ready then it will behave the same as if the server
had dropped the connection - asn_read() will fail, causing process() to
return 'return _drop_conn($ldap, LDAP_OPERATIONS_ERROR, "Communications
Error")', which seems like a fairly appropriate error. If I'm reading it
wrong, then the effects of this change on our hanging-forever bug are odd
-- I can't imagine calling can_read($timeout) over and over again waiting
for it to come back would behave any differently than calling can_read(0),
and yet we did observe very different behavior to the effect of fixing our
bug. Granted it could be that I managed to break things differently
rather than fix them :P

> Also if this message being waited on was the result of a call to, say,
> ->search. The callback for the search should be called with a client side
> error

From my reading above I would think search() would return 'Communications
Error' as well. I was not thorough enough in my testing to say this for
sure though. I'd be willing to go back and test, if you felt this first
patch was conceptually a good idea at all. If you'd rather see this done
with larger design and/or API changes, I don't know that I'd be your man.

>>> + if ( $arg->{timeout} and IO::Socket->can('SO_SNDTIMEO') ) {
>>> + my $timeval = pack 'L!L!', $ldap_opts->{timeout}, 0;
>>> + eval {
>>> + $obj->{net_ldap_socket}->sockopt( SO_SNDTIMEO, $timeval ) or die
>>> $!;
>>> + $obj->{net_ldap_socket}->sockopt( SO_RCVTIMEO, $timeval ) or die
>>> $!;
>>> + };
>

> Using low level socket options is almost never the right thing to be
> doing. using select or poll would be the right solution.

Fair enough... I admit this was a cheap shortcut to deal with the observed
shortcomings of the above patch, that is, it seems to enforce that any
single operation will time out in T seconds rather than T * N seconds

-Jared

0 new messages