I started out using the June 2005 release. During testing, no problems
ever came up. Then when it went live, it would hang while doing a
synchronous search every day or so. The logic of the code is not
complicated:
1. Connect
2. Perform searches
3. If a search throws an LDAPException where the code is greater than
80, connect again, and continue with searches
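The three steps above can be sketched roughly as follows. This is only an illustration, not the actual program: Directory and DirectoryException are hypothetical stand-ins for LDAPConnection and LDAPException so the sketch compiles without the JLDAP jar, and the maxReconnects limit is an added assumption that the steps above do not mention.

```java
// Hypothetical stand-ins so the sketch compiles without the JLDAP jar.
// In real code these roles are played by LDAPException and LDAPConnection.
class DirectoryException extends Exception {
    private final int resultCode;

    DirectoryException(int resultCode, String message) {
        super(message);
        this.resultCode = resultCode;
    }

    int getResultCode() {
        return resultCode;
    }
}

interface Directory {
    void connect() throws DirectoryException;

    String search(String filter) throws DirectoryException;
}

class ReconnectingSearcher {
    // Step 3 above reconnects only when the code is greater than 80.
    static final int RECONNECT_ABOVE = 80;

    private final Directory directory;
    private final int maxReconnects; // assumption: cap on retries per search

    ReconnectingSearcher(Directory directory, int maxReconnects) {
        this.directory = directory;
        this.maxReconnects = maxReconnects;
    }

    String search(String filter) throws DirectoryException {
        int reconnects = 0;
        while (true) {
            try {
                return directory.search(filter);
            } catch (DirectoryException e) {
                boolean connectionLevel = e.getResultCode() > RECONNECT_ABOVE;
                if (!connectionLevel || reconnects >= maxReconnects) {
                    throw e; // not a connection-level code, or too many retries
                }
                reconnects++;
                directory.connect(); // step 3: connect again, then retry the search
            }
        }
    }
}
```

Note that a loop like this only helps when the failure surfaces as an exception. It does nothing for a search that never returns.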
In the live environment this program handles very high volume, so 5
instances of the process were created for load balancing. Each process has
only one thread and uses no connection pool, but it runs 24/7/365.
I've noticed from the log files that a search throws an LDAPException (91
CONNECT_ERROR), at the same time in all 5 processes. They usually just
connect again and continue on their way. But every so often, one of the
processes will connect again, then return to the search, and then hang.
This is what led me to believe that the hang is triggered by a server-side
shutdown.
When October 2005 came, I saw there was a new release, and in the change
log (http://developer.novell.com/ndk/jldap_whatsnew.htm) it said: "Fixed
defect so that the connection fails when the same connection is used to
connect to an LDAP server multiple times." This seemed to be the very fix
to my problem, so I upgraded, and saw no improvement.
In the meantime, it was very frustrating to try to solve this
because I could not recreate the problem outside the live environment.
Then I thought, since this only occurs after I connect again, instead
of calling connect again on the same LDAPConnection object, I would create
a new LDAPConnection object, then connect with that. Again I got the same
results.
So when the March 2006 release came out, I downloaded it and started
playing with it to see if it would fix the problem. While trying out
different values for the SocketTimeOut, I tried a low value (1 millisecond)
and as expected I got an "Unable to connect LDAP: LDAPException: Reader
thread terminated (91) Connect Error", and I tried a high value (2000
milliseconds) and it connected fine. Then I tried to find the lowest
value that would still connect. At 10-50 milliseconds I found
that it would hang nearly every time! As near as I can tell, this is the
same type of hang that is occurring in the live environment, but I'm not
certain.
So now, being able to reproduce the scenario, I read through most of this
forum and tried all the suggestions I saw, but none of them fixed it.
What happens is, the connect will return with no exceptions as though it
had connected, then it will go on to do a search, and that is where the
hang occurs. Upon further inspection, I found that isBound(),
isConnected(), and isConnectionAlive() all returned false. So then I
tried an immediate subsequent call to connect, and that also hangs.
I also tried adding an Unsolicited Notification Listener, but never
received any messages.
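Since connect returns without an exception in this state, one defensive step is to check the three state methods right after connect and bind, and fail loudly instead of going on to the search. The sketch below assumes a hypothetical ConnectionState interface that mirrors only the three LDAPConnection method names mentioned above, so that it compiles without the JLDAP jar. The ConnectionCheck helper is likewise made up for illustration.

```java
// Hypothetical view of the three state methods named above.
// A thin adapter around LDAPConnection could implement it.
interface ConnectionState {
    boolean isConnected();

    boolean isConnectionAlive();

    boolean isBound();
}

class ConnectionCheck {
    // Returns null when all three checks pass, otherwise a description
    // of every check that failed.
    static String describeProblem(ConnectionState c) {
        StringBuffer problems = new StringBuffer();
        if (!c.isConnected()) {
            append(problems, "not connected");
        }
        if (!c.isConnectionAlive()) {
            append(problems, "connection not alive");
        }
        if (!c.isBound()) {
            append(problems, "not bound");
        }
        return problems.length() == 0 ? null : problems.toString();
    }

    // Throws instead of letting the caller proceed to a search that may hang.
    static void requireUsable(ConnectionState c) {
        String problem = describeProblem(c);
        if (problem != null) {
            throw new IllegalStateException(
                "connection unusable after connect: " + problem);
        }
    }

    private static void append(StringBuffer sb, String text) {
        if (sb.length() > 0) {
            sb.append(", ");
        }
        sb.append(text);
    }
}
```

Because a second connect on the same object also hung in my tests, the caller would probably want to treat this failure as a reason to drop the object rather than call connect on it again. Whether a fresh object avoids the hang is not something this sketch can show.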
We are locked in to using Java 1.3, and I thought that could be a problem,
but trying each scenario in Java 1.4 did not solve the problem either.
I have not tried using the March 2006 version live with a reasonable
timeout value, but the fact that I can get it to hang at all makes me
extra cautious now. Any ideas would be appreciated.
The only two hangs I could reproduce were the easy ones: 1. pull the plug on
the server side and 2. bring down nldap on the server side. The socket
timeout change addresses number 1 and the shutdown handler addresses number
2. Another post describes a possible hang (see "possible bug in
Connection.java"), but that change is included in the March SDK.
> At 10-50 milliseconds I found
> that it would hang nearly every time! As near I can tell, this is the
> same type of hang that is occurring in the live environment, but I'm not
> certain.
So you have it somewhat reproducible (if it is the same problem) in your
development environment? Can you send me the code to reproduce it? I would
really appreciate it! (sperrin at novell dot com)
Thank you
Susan
This code causes a hang on the search method. My guess is that it gets
caught in a state that is partially connected. It doesn't hang every time
but it does most times, and the timeOut value that makes it hang would
likely vary from one environment to another. In either case, no exception
is thrown. The following code causes a hang in the second connect method:
int timeOut = 20; // socket timeout in milliseconds
LDAPConnection lc = new LDAPConnection(timeOut);
lc.connect(server, port);
lc.bind(LDAPConnection.LDAP_V3, userDN, password.getBytes("UTF8"));
if (lc.isConnected() && lc.isConnectionAlive() && lc.isBound()) {
    // a hang has been seen here, in the search
    LDAPSearchResults searchResults = lc.search(searchDN,
        LDAPConnection.SCOPE_SUB, filter, attributes, false);
}
else {
    // a hang has been seen here, in the second connect
    lc.connect(server, port);
    lc.bind(LDAPConnection.LDAP_V3, userDN, password.getBytes("UTF8"));
}
But I'm hoping a much larger value for timeOut will prevent all hangs in
the future. Thanks for your help on this.
Thank you for the interesting idea on how to reproduce this. I worked with
it a bit and there are different problems depending on timing. It looks like
the problem occurs when the socket timeout (or connect failure) hits the
reader thread under certain conditions.
In Connection.java you see this:
* 3) We receive a Server Shutdown notification.
* - Indicated by messageID equal to 0.
* - call Shutdown.
* 4) Another error occured
* - Indicated by an IOException AND notify is not NULL
* - call Shutdown.
*/
if( (! clientActive) || (notify != null)) { //#3 & 4
shutdown( reason, 0, notify );
Now, if the reason is error 91 in the reader thread or a server shutdown,
watch what happens. You call shutdown(), and shutdown() calls
info.abandon( null, notifyUser); // also notifies the application
info.abandon tries to write to the connection. It doesn't throw an
LDAPException, so the error is never received by the calling application.
Message.java then tries to notify the user by doing this:
if( informUserEx != null) {
    replies.addElement( new LDAPResponse( informUserEx,
        conn.getActiveReferral()));
But the application never gets that response because the reader has already
timed out. So the connect call returns normally, and the trace shows:
17:39:55.719 traceMessages: Connection(1): connect: setup complete
This is bad, because the connection is actually not connected, and the
application never got the LDAPException.
I don't have a quick fix, so I submitted the problem to engineering as
critical Bug 162943. For now you should probably avoid the situation by
making sure enough new client connections are available, as covered in the
time_wait discussion.
I'll post more if I see more or receive a fix to test.
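Until a fix is available, another possible stopgap is to run each blocking call on a worker thread and give up waiting after a deadline, so that one stuck search cannot stall the whole process. The sketch below is only an illustration under assumptions: BlockingCall, CallTimedOutException and Watchdog are hypothetical names, not part of the SDK, and the code sticks to Thread.join with a timeout so that it should also work on Java 1.3.

```java
// Hypothetical callback wrapping the blocking call, for example a
// synchronous search.
interface BlockingCall {
    Object call() throws Exception;
}

class CallTimedOutException extends Exception {
    CallTimedOutException(String message) {
        super(message);
    }
}

class Watchdog {
    // Runs the call on a worker thread and waits at most timeoutMillis for it.
    static Object callWithTimeout(final BlockingCall call, long timeoutMillis)
            throws Exception {
        if (timeoutMillis <= 0) {
            // join(0) would wait forever, which defeats the purpose
            throw new IllegalArgumentException("timeoutMillis must be positive");
        }
        final Object[] result = new Object[1];
        final Exception[] failure = new Exception[1];
        Thread worker = new Thread(new Runnable() {
            public void run() {
                try {
                    result[0] = call.call();
                } catch (Exception e) {
                    failure[0] = e;
                }
            }
        }, "blocking-call-worker");
        worker.setDaemon(true); // a stuck worker must not keep the JVM alive
        worker.start();
        worker.join(timeoutMillis);
        if (worker.isAlive()) {
            // Best effort only: a thread blocked in socket I/O may ignore this.
            worker.interrupt();
            throw new CallTimedOutException(
                "call did not finish within " + timeoutMillis + " ms");
        }
        if (failure[0] != null) {
            throw failure[0];
        }
        return result[0];
    }
}
```

This does not unblock the stuck thread. It only lets the caller notice the hang, abandon that connection object and carry on, so a process that hangs repeatedly would still accumulate stuck threads.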
thank you
Susan