[Shib-Users] LDAP connection management; (was Re: increase pool size? (for ldap))


Russell Beall

Mar 10, 2009, 7:44:20 PM
to shibbole...@internet2.edu
Hi Brent,

More research here at USC has uncovered some more facts.  First of all, we discovered a clear bottleneck in the process which was key in backlogging responses.  There was a bug in Sun DirectoryServer 6.3 that made handling a StartTLS request take roughly ten times longer than it does with the hotfix applied.  Now that we have the hotfix in place, response times are great and Shibboleth is plowing through requests at a high clip.  Navigating through the Shibboleth login page is now as quick as it should be (approx. 1/2 to 2 seconds).

Regardless of this fix, it would be useful to address the thread locking in Shibboleth, which caused it to grind to a halt on LDAP rather than simply timing out some of the connections and keeping other threads flowing.  We have a 60000-millisecond timeout which was not being respected.  The threads appeared to be waiting and locking before the timer on this would even start.  (We have seen the timeout actually trigger in other circumstances where the response time really did exceed 60 seconds.)

I plan to look at some of the code which closes or manages LDAP connections in error cases.  Our LDAP administrator found one source of connection tie-up.  When someone types an incorrect username/password combination, an authentication bind is attempted against LDAP.  The bind fails and the user gets the authentication-failed message, but the LDAP connection is left open.  10-12 minutes later the connection is not closed out nicely but simply terminated.  Here is some detail from him:

Shib never sends the UNBIND command, I then see the connection stay open from 10-13 minutes and finally an error is seen in the log like "closing from xxx.xxx.xxx.xxx:42737 - B4 - Server failed to flush BER data back to client"
That error is documented on Sun's site http://docs.sun.com/app/docs/doc/820-2765/fnyss?l=en&a=view
which is - The server failed to flush data response back to client. This code can occur when the client closes the connection to the server, before the server finished sending data to the client.

One of our theories on thread locking was that when the load got high, a few too many users may have been incorrectly typing passwords and tying up the LDAP connection pool.  I added a logger to track the Authenticator object, and I found that we have 10-20 failed password attempts in any given 10-minute period during high load (varies greatly of course).

I tested this line of attack, though, and I couldn't jam a staging server.  I held down the enter key on a browser with a good username but a bad password, and managed to load up shib with well over 1000 threads, but there were not enough resources held open to actually jam the server, and following up with the correct password logged me in just fine.

I was uncertain of the result of your advice to Phil Brown and whether or not we should monkey around with the underlying LDAP engine and try to increase the connection count.  It seemed that the upshot was that we should not since it might lead to greater contention between the two pools trying to manage a collection of connections.

Also, I am curious if there is different pool management and separate configuration between the JAAS login module and the LDAP DataConnector, or is there some overlap?

Also...  see some responses below to your earlier questions/comments.

Thanks,
Russ.


On Mar 5, 2009, at 10:51 PM, Brent Putman wrote:

It would appear they are all waiting to acquire the monitor lock on that method.  That's just a simple synchronized method in the Apache pool class:

    public synchronized Object borrowObject() throws Exception {
        assertOpen();
        Object obj = null;
        try {
            obj = _pool.pop();
        } catch(EmptyStackException e) {
            if(null == _factory) {
                throw new NoSuchElementException();
            } else {
                obj = _factory.makeObject();
            }
        }
        if(null != _factory && null != obj) {
            _factory.activateObject(obj);
        }
        _numActive++;
        return obj;
    }




So I guess the question is what is causing all that lock contention.  I'm not familiar with whatever JVM issues there might be with per-object monitor lock limits, or whether there are JVM parameters that affect how that works.  I'll have to research that further, unless someone knows offhand.  What exact version of the Java runtime are you using?  Sounds like you are on Sun SPARC?

We are using Sun SPARC 5.9.  Tomcat 5.5.27.  Apache 2.2.9.  jdk 1.5.0_15

Since some of these things are outdated, I have a feeling that upgrading tomcat and java might perhaps address some of these threading issues.


The other possibility is that either the factory makeObject() or activateObject() is hanging for some threads.  makeObject() doesn't do much other than create the LDAP object.  However activateObject() does do the actual LDAP connection, so maybe there's a bottleneck there.   Is it possible the JVM process is running out of file descriptors?  At least one (maybe 2 or more, can't remember on Solaris) is required for each outbound network connection.  As I recall, the per-process default on Solaris is pretty low, like 1024 or 2048.  That seems at least like something to investigate.  If you encounter that situation again, might be good to grab that for the IdP JVM process from /proc or something, before killing it.

This is something to look into for sure.  During the time of contention there were many open connections, but there may have been many more in a state of limbo, since the load balancer was taking LDAP out of the loop occasionally and probably knocking out the connections.

The issue above of "mismanaged" LDAP connections may be involved here where some connections simply may not be closed appropriately.

--Brent  (Puts thinking cap on...)

Even though we've handled the LDAP performance issue and are not likely to see the overload conditions again, it would be good to address whatever threading issues may still be there for scalability.  (Thinking cap starts buzzing and popping ;-)

Brent Putman

Mar 10, 2009, 11:31:05 PM
to shibbole...@internet2.edu
Thanks for all the good info. More below


Russell Beall wrote:
> Hi Brent,


>
> There was a bug in Sun DirectoryServer 6.3 which 10x-ed the time
> required to handle a StartTLS request, as opposed to after the hotfix
> for this bug. Now that we have the hotfix in place, response times
> are great and Shibboleth is plowing through requests at a high clip.
> Navigating through the shibboleth login page proves to be as quick as
> it really should be (1/2 to 2 seconds approx.).

Glad to hear it's working better now. That's also very good info for
diagnosing what's going on. Wasn't clear whether there was a
programmatic bug or a JVM issue or just load etc. Now I think it's more
clear. Most likely what was happening was just simple contention for
the lock:

* thread acquires the exclusive lock for the LdapPool
* the pool either pulls from cache or has to create a new one. Under
increasing load, it's more likely to have to create a new one.
* when creating a new connection, setup takes much longer than it
should, so other threads waiting on the pool queue up waiting for the lock
* rinse/repeat

Under load, if threads show up at the pool and wait on the lock faster
than the pool can clear them, then you're going to get a backlog. Also,
I believe the waiting behavior in Java for an object instance's monitor
lock is not "fair", in the sense that it's not a FIFO queue or anything,
so that compounds things. Lock starvation for a given thread is a real
possibility; some threads may unfairly wait an excessively long time.

So unless there's reason to believe otherwise, I don't think there's
anything more mysterious than that going on.
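The contention pattern above can be sketched with a toy pool (the class and method names here are hypothetical stand-ins, not Commons Pool's actual StackObjectPool, but the locking shape is the same):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

// Toy sketch: the entire borrow, including possibly-slow object creation,
// runs inside one monitor lock. A slow factory call (e.g. a StartTLS
// handshake hit by the DirectoryServer bug) stalls every waiting thread.
class NaiveStackPool<T> {
    private final Deque<T> stack = new ArrayDeque<>();
    private final Supplier<T> factory;

    NaiveStackPool(Supplier<T> factory) {
        this.factory = factory;
    }

    public synchronized T borrow() {
        T obj = stack.poll();                        // reuse a cached object if any...
        return (obj != null) ? obj : factory.get();  // ...else create it under the lock
    }

    public synchronized void release(T obj) {
        stack.push(obj);
    }
}
```

Under light load most borrows hit the cached stack and the lock is held only briefly; under increasing load more borrows fall through to the factory, and the lock hold time balloons to include the whole connection setup.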


>
> Regardless of this fix, it would be useful to address the thread
> locking in shibboleth which caused shibboleth to grind to a halt on
> LDAP rather than simply timing out some of the connections and keeping
> other threads flowing.


Yeah, the real issue here is that the LdapPool, based on Commons Pool,
defines the entire borrowObject() method as the critical section, which
on a new object creation encompasses making the LDAP connection. So if
there's atypical delay in the actual LDAP connection being established,
it causes all other threads waiting on the pool to block.

FYI, we're using Pool 1.2, which is actually an oversight; we should
have upgraded to 1.3 when we upgraded VT LDAP to 2.8.x. We'll fix it in
the upcoming release. However, both versions (and also the latest Pool
1.4) have the same behavior in this respect, so upgrading isn't going to
help this issue.

Assuming the VT LdapPool isn't going to change, the only way we could
address this is to not use the VT LdapPool and instead use a different
pool implementation for its LDAP objects, one that doesn't have the LDAP
connection establishment within the critical section (or else use a
reader/writer lock or something). I guess that's a topic for discussion;
not my call to make.
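A pool along those lines might look like this sketch (hypothetical names, not VT LdapPool's actual API): lock only the stack operations, and do the slow creation outside the critical section.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

// Hypothetical alternative: only the stack push/pop is locked, so a
// thread stuck establishing an LDAP connection does not block other
// borrowers from reusing already-cached connections.
class ShortLockPool<T> {
    private final Deque<T> stack = new ArrayDeque<>();
    private final Supplier<T> factory;

    ShortLockPool(Supplier<T> factory) {
        this.factory = factory;
    }

    public T borrow() {
        T obj;
        synchronized (stack) {       // short critical section: just the pop
            obj = stack.poll();
        }
        if (obj == null) {
            obj = factory.get();     // slow creation happens unlocked
        }
        return obj;
    }

    public void release(T obj) {
        synchronized (stack) {
            stack.push(obj);
        }
    }
}
```

The trade-off: during a burst of empty-pool borrows this creates more objects than a strict max-size pool would, so a real implementation would still need to cap total creations.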

> We have a 60000 millisecond timeout which was not being respected.
> The threads appeared to be waiting and locking prior to when the timer
> on this would start. (We have seen the timeout actually trigger in
> other circumstances where the response time actually did go over 60
> seconds).

Sorry, what timeout is that? Request timeout in your container?


>
> I plan to look at some of the code which closes or manages LDAP
> connections for error cases. Our LDAP administrator found one source
> of connection tie-up. Where someone types an incorrect
> username/password combination an authentication bind is attempted
> against LDAP. The bind fails and the user gets the authentication
> failed message but the LDAP connection is left open. 10-12 minutes
> later the connection is not closed out nicely but simply terminated.

Yes, I believe you have found a bug in the VT Ldap class. I'll have to
punt this one off to Daniel at VT, but based on my read of that code,
it's clearly not explicitly closing the JNDI context on the failed
password case in authenticateAndAuthorize(). It does close on the
success case.
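The general fix for that class of leak is to close the JNDI context in a finally block, so the failed-password path releases the connection too.  This is not the vt.middleware code itself; BindCheck and bindSucceeds are hypothetical names for illustration.

```java
import java.util.Hashtable;
import javax.naming.AuthenticationException;
import javax.naming.Context;
import javax.naming.NamingException;
import javax.naming.directory.InitialDirContext;

// Sketch of a password-check bind that always releases the connection.
// Closing the context is what sends the LDAP UNBIND; without it, a bad
// password leaves the TCP connection open until the server reaps it.
class BindCheck {
    static boolean bindSucceeds(String url, String dn, String password) {
        Hashtable<String, Object> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, url);
        env.put(Context.SECURITY_AUTHENTICATION, "simple");
        env.put(Context.SECURITY_PRINCIPAL, dn);
        env.put(Context.SECURITY_CREDENTIALS, password);
        InitialDirContext ctx = null;
        try {
            ctx = new InitialDirContext(env);  // the constructor performs the bind
            return true;
        } catch (AuthenticationException e) {
            return false;                      // wrong password; still closed below
        } catch (NamingException e) {
            return false;                      // server unreachable, etc.
        } finally {
            if (ctx != null) {
                try { ctx.close(); } catch (NamingException ignored) { }
            }
        }
    }
}
```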

> One of our theories on thread locking was that when the load got high,
> a few too many users may have been incorrectly typing passwords and
> tying up the LDAP connection pool. I added a logger to track the
> Authenticator object, and I found that we have 10-20 failed password
> attempts in any given 10-minute period during high load (varies
> greatly of course).

That's interesting data. However, that can't really have been related to
the contention for the LdapPool. The latter
is only used within the attribute resolver LDAP DataConnector. The
Authenticator used in the JAAS module doesn't share that pool, and in
fact isn't pooled at all. It establishes a fresh connection on each
authentication attempt (and as you saw doesn't close it properly on the
bad password case).

>
> I was uncertain of the result of your advice to Phil Brown and whether
> or not we should monkey around with the underlying LDAP engine and try
> to increase the connection count. It seemed that the upshot was that
> we should not since it might lead to greater contention between the
> two pools trying to manage a collection of connections.

Well, IMHO you should probably pick one or the other, not try to mix
them. If you wanted to try JNDI pooling instead, you could. But a
couple of things:
1) Chad mentioned on the Shib dev call just yesterday that it was his
recollection that one of the reasons VT (while Chad was there) decided
to develop that library was that JNDI didn't always work as advertised
or at least as well, and specifically with connection pooling. That was
an older Java version, I imagine; maybe it's improved. You're welcome to
experiment and find out. :-)
2) My earlier advice to set the LDAP connector init capacity and max
sizes to 0 will work right now with pool 1.2, but when we upgrade next
release to 1.3, that will break due to changes in that lib. You have to
set max size to at least 1. Probably not a big difference, but I think
there's a small possibility for deadlock in that case.
3) You still have the same potential for this lock contention issue.
Because you're still going through the LdapPool, each connection from
the JNDI pool would be subject to the same locking, and blocking,
behavior. If the JNDI pool blocks waiting on a connection to become
available, that will block all other threads from obtaining (or
returning) a connection to the VT LdapPool.

I personally think it's probably not worth it. If you really, really
want JNDI pooling, you should probably look at a different custom data
connector. A basic one that just did JNDI directly wouldn't be terribly
onerous, esp. if you omitted all the bells and whistles that might be
required for general consumption as opposed to say just your own use.

>
> Also, I am curious if there is different pool management and separate
> configuration between the JAAS login module and the LDAP
> DataConnector, or is there some overlap?

Zero overlap. The JAAS module doesn't use the VT LdapPool instance used
by the data connector. Doesn't pool at all, in fact.

>
> We are using Sun SPARC 5.9. Tomcat 5.5.27. Apache 2.2.9. jdk 1.5.0_15
>
> Since some of these things are outdated, I have a feeling that
> upgrading tomcat and java might perhaps address some of these
> threading issues.

I believe that Tomcat 6 supposedly has better request threading
performance than 5.5, or so I've heard. I don't think it would have
really made much difference in this case however. The threads were
waiting on an exclusively locked resource unconnected to any of that software.

But as with all things multi-thread and concurrency-related, there's a
whole lot of detail and a whole lot of variables, and a whole lot of
things that could be affecting one another.

>
>>
>> The other possibility is that either the factory makeObject() or
>> activateObject() is hanging for some threads. makeObject() doesn't
>> do much other than create the LDAP object. However activateObject()
>> does do the actual LDAP connection, so maybe there's a bottleneck
>> there.

Pretty much sounds like this was the case.


>> Is it possible the JVM process is running out of file descriptors?
>> At least one (maybe 2 or more, can't remember on Solaris) is required
>> for each outbound network connection. As I recall, the per-process
>> default on Solaris is pretty low, like 1024 or 2048. That seems at
>> least like something to investigate. If you encounter that situation
>> again, might be good to grab that for the IdP JVM process from /proc
>> or something, before killing it.
>
> This is something to look into for sure. During the time of
> contention there were many open connections, but there may have been
> many more in a state of limbo since the loadbalancer was taking LDAP
> out of the loop occasionally and probably knocking out the connections.

I'd still keep a lookout for this. We recently had a similar
load-related issue with an app that made outbound connections for each
inbound request. If you have say 500 requests: 500 descriptors for
inbound TCP sockets + 500 for each outbound connection + a few for some
other app overhead, comes very close to maxing out a 1024 per-process limit.
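If you want to grab that number before the process dies, the JVM's own descriptor count can be read from /proc.  A rough sketch (Linux/Solaris-style /proc; class and method names are hypothetical, and this is not a portable API):

```java
import java.io.File;

// Counts the file descriptors currently held open by this JVM by
// listing /proc/self/fd. Returns -1 where /proc is not available.
class FdCount {
    static int openFdCount() {
        String[] fds = new File("/proc/self/fd").list();
        return (fds == null) ? -1 : fds.length;
    }
}
```

Comparing that count against the per-process limit (`ulimit -n`) during a contention event would confirm or rule out descriptor exhaustion.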


>
> The issue above of "mismanaged" LDAP connections may be involved here
> where some connections simply may not be closed appropriately.

Yes, that's definitely an issue, especially as it doesn't play nice with
the LDAP servers and makes them use up their resources for excessively
long periods.


--Brent

Scott Cantor

Mar 11, 2009, 11:16:14 AM
to shibbole...@internet2.edu
Brent Putman wrote on 2009-03-10:
> Yeah, the real issue here is that the LdapPool, based on Commons Pool,
> defines the entire borrowObject() method as the critical section, which
> on a new object creation encompasses making the LDAP connection. So if
> there's atypical delay in the actual LDAP connection being established,
> it causes all other threads waiting on the pool to block.

Umm, has anybody reported that as a bug? Since, you know, it's pretty much
rule #1 of how not to build a pooling library? ;-) You never assume
construction of the object being pooled is cheap.

-- Scott


Chad La Joie

Mar 11, 2009, 11:50:40 AM
to shibbole...@internet2.edu

I'm not sure what exactly you're saying here, Scott. If you mean that
having the entire borrowObject method be synchronized is a bug, then
yes, I'd agree with you. If you're saying that ever having an object
creation hold up a request is a bug, then I completely disagree.

--
SWITCH
Serving Swiss Universities
--------------------------
Chad La Joie, Software Engineer, Net Services
Werdstrasse 2, P.O. Box, 8021 Zürich, Switzerland
phone +41 44 268 15 75, fax +41 44 268 15 68
chad....@switch.ch, http://www.switch.ch

Scott Cantor

Mar 11, 2009, 11:56:28 AM
to shibbole...@internet2.edu
Chad La Joie wrote on 2009-03-11:
> I'm not sure what exactly you're saying here, Scott. If you mean that
> having the entire borrowObject method be synchronized is a bug, then
> yes, I'd agree with you.

Yes, that's what I'm saying. You never lock a pool while you create new instances of what's being pooled.



> If you're saying that ever having an object
> creation hold up a request is a bug, then I completely disagree.

I'm not sure what that means, exactly, but I don't think that's what I was saying. If you need to make a request using some object, obviously the object has to be created before you can do so.

-- Scott


Brent Putman

Mar 11, 2009, 2:40:44 PM
to shibbole...@internet2.edu


Scott Cantor wrote:
> Brent Putman wrote on 2009-03-10:
>> Yeah, the real issue here is that the LdapPool, based on Commons Pool,
>> defines the entire borrowObject() method as the critical section,
>
> Umm, has anybody reported that as a bug? Since, you know, it's pretty much
> rule #1 of how not to build a pooling library? ;-) You never assume
> construction of the object being pooled is cheap.

I couldn't find anything about the StackObjectPool specifically.  However, others have reported similar issues against the GenericObjectPool, and some work has been done in that regard.  So they're at least aware of it...

They did note in their roadmap that 2.0 would have performance improvements, vs. the current emphasis only on "stability and robustness".  However, their 2.0 dev branch doesn't currently have any fixes in this regard in StackObjectPool.

http://wiki.apache.org/commons/PoolRoadMap


BTW, while looking, I did run across this issue. If that info is correct (populating an exception stack trace requires acquiring a JVM-wide lock), then upgrading to 1.3 would definitely be advisable. This happens when the cache (stack) is empty and a new object has to be created.  (Remember, we're still mistakenly shipping 1.2, which has the issue below.)

http://issues.apache.org/jira/browse/POOL-58

We should have 1.3 in the next IdP release.
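Independent of the library version, the borrow path could sidestep POOL-58 entirely by testing for emptiness instead of catching EmptyStackException, so no throwable (and no stack-trace fill) is ever constructed on the hot path.  A sketch, not the actual Pool 1.3 fix:

```java
import java.util.Stack;

// Sketch of an empty-pool borrow without exception-driven control flow:
// EmptyStackException is never constructed, so the stack-trace fill
// (the contention point described in POOL-58) never happens on borrow.
class CheckFirstPool {
    private final Stack<Object> pool = new Stack<>();

    public synchronized Object borrow() {
        return pool.isEmpty() ? makeObject() : pool.pop();
    }

    public synchronized void release(Object obj) {
        pool.push(obj);
    }

    private Object makeObject() {   // stand-in for the real factory call
        return new Object();
    }
}
```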






Russell Beall

Mar 12, 2009, 1:26:27 PM
to shibbole...@internet2.edu

On Mar 10, 2009, at 8:31 PM, Brent Putman wrote:

> * thread acquires the exclusive lock for the LdapPool
> * the pool either pulls from cache or has to create a new one. Under
> increasing load, it's more likely to have to create a new one.
> * when creating a new connection, setup takes much longer than it
> should, so other threads waiting on the pool queue up waiting for
> the lock
> * rinse/repeat
>
> Under load, if threads show up at the pool and wait on the lock faster
> than the pool can clear them, then you're going to get a backlog.
> Also,
> I believe the waiting behavior in Java for an object instance's
> monitor
> lock is not "fair", in the sense that it's not a FIFO queue or
> anything,
> so that compounds things. Lock starvation for a given thread is a
> real
> possibility, some may unfairly wait an excessively long time.
>
> So unless there's reason to believe otherwise, I don't think there's
> anything more mysterious than that going on.

Yes, this could be the behavior seen, but I think there is more to it
since I believe we actually ran out of available connections.

More data has come to light through detailed log tracing. It turns
out that an LDAP connection is opened every time someone presses the
login button. This connection is left hanging in limbo and ultimately,
probably when the objects are finalized, it is closed (with a proper
unbind).

In the case where the password is typed incorrectly, two LDAP
connections are bound and left hanging. After approximately a
ten-minute delay, one of them is properly unbound and the other is not.

I have scanned the vt.middleware.ldap code, and on the surface it
appears to close the ldap context objects when it is finished, but
this may not be closing out the underlying connection.

On the upside, my log tracing shows that the LDAP pool for the
DataConnector object is functioning properly and connections from that
pool get reused.

Is there a JIRA to report the JAAS problem for the LDAP library?

It appears that the connection issues are just in the JAAS portion,
and not in the pooling for the DataConnector.

>
>> We have a 60000 millisecond timeout which was not being respected.
>> The threads appeared to be waiting and locking prior to when the
>> timer
>> on this would start. (We have seen the timeout actually trigger in
>> other circumstances where the response time actually did go over 60
>> seconds).
>
> Sorry, what timeout is that? Request timeout in your container?

There is a timelimit configuration for the data connector. Now I
understand why this timelimit did not apply: the problem was not with
the data connector's LDAP connection.

>
>
>>
>> I plan to look at some of the code which closes or manages LDAP
>> connections for error cases. Our LDAP administrator found one source
>> of connection tie-up. Where someone types an incorrect
>> username/password combination an authentication bind is attempted
>> against LDAP. The bind fails and the user gets the authentication
>> failed message but the LDAP connection is left open. 10-12 minutes
>> later the connection is not closed out nicely but simply terminated.
>
> Yes, I believe you have found a bug in the VT Ldap class. I'll have
> to
> punt this one off to Daniel at VT, but based on my read of that code,
> it's clearly not explicitly closing the JNDI context on the failed
> password case in authenticateAndAuthorize(). It does close on the
> success case.

Yes, and somewhere there in the code another connection is opened and
fails to be closed.

I have traced this specifically to the searchAttributes connection
where the user's DN is pulled based on the username attribute. The
LDAP logs indicate that the con


>> We are using Sun SPARC 5.9. Tomcat 5.5.27. Apache 2.2.9. jdk
>> 1.5.0_15
>>
>> Since some of these things are outdated, I have a feeling that
>> upgrading tomcat and java might perhaps address some of these
>> threading issues.
>
> I believe that Tomcat 6 supposedly has better request threading
> performance than 5.5, or so I've heard. I don't think it would have
> really made much difference in this case however. The threads were
> waiting on exclusively locked resource unconnected to any of that
> software.
>
> But as with all things multi-thread and concurrency-related, there's a
> whole lot of detail and a whole lot of variables, and a whole lot of
> things that could be affecting one another.

Perhaps with Tomcat 6 the threads would be finalized sooner and the
connections would be closed out...


> I'd still keep a lookout for this. We recently had a similar
> load-related issue with an app that made outbound connections for each
> inbound request. If you have say 500 requests: 500 descriptors for
> inbound TCP sockets + 500 for each outbound connection + a few for
> some
> other app overhead, comes very close to maxing out a 1024 per-
> process limit.

We have 65K available file descriptors on the LDAP server so we
probably didn't hit that limit, but we may have hit the limit on the
individual shib servers.

Russ.
