Alright, I was unable to get an exact repro, but I think I saw enough, combined with the logs, to figure out what happens:
1) The PingFailureAnalyzer call in onDead in ChannelPinger.java will occasionally fail for various reasons: NPEs, findClass failures. Not always, but sometimes. When this happens, the master will NOT call close on the channel.
2) Normally this appears to be fine. The channel is somehow replaced or nulled out through other means. I think that in cases of process termination or other semi-orderly shutdowns, the socket connection is notified.
3) However, let's say there is a network anomaly: a power outage, an unplugged network cable, a connection dropped somewhere other than the server or client side. In this case the master will notice, potentially through a failed ping. If it then fails in the PingFailureAnalyzer code, it won't close the channel.
4) The slave comes back, say from a reboot or after the network cable is plugged back in, and attempts to reconnect. The channel is not null, and we get the error.
I think the keys to the repro are the absence of any "Terminate" text in the slave log and the definite issue of not closing the channel when an exception is thrown in the PFA. The missing "Terminate" indicates there was no orderly shutdown of the channel on the client side.
So the fix would be to wrap the call to the PFA in a try/catch to ensure that channel.close() is in fact called.
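Roughly what I mean, as a standalone sketch rather than the actual ChannelPinger code. SimpleChannel, FailureAnalyzer and OnDeadSketch are made-up stand-ins so the example compiles on its own; the real types and signatures in Jenkins may differ. The only point is that close() sits in a finally block, so an NPE or a class-loading failure in an analyzer cannot skip it.

```java
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

public class OnDeadSketch {
    private static final Logger LOGGER = Logger.getLogger(OnDeadSketch.class.getName());

    /** Hypothetical stand-in for the remoting channel (not the real Jenkins type). */
    public interface SimpleChannel {
        void close();
    }

    /** Hypothetical stand-in for a PingFailureAnalyzer. */
    public interface FailureAnalyzer {
        void onPingFailure(SimpleChannel channel, Throwable cause) throws Exception;
    }

    /** Runs the analyzers, then closes the channel whether or not they failed. */
    public static void onDead(SimpleChannel channel,
                              List<FailureAnalyzer> analyzers,
                              Throwable cause) {
        try {
            for (FailureAnalyzer analyzer : analyzers) {
                analyzer.onPingFailure(channel, cause);
            }
        } catch (Exception | LinkageError e) {
            // NPEs land under Exception; failed class lookups can surface as
            // ClassNotFoundException (Exception) or NoClassDefFoundError (LinkageError).
            LOGGER.log(Level.WARNING, "Ping failure analysis failed", e);
        } finally {
            // Always reached, so a stale channel is not left behind for the reconnect.
            channel.close();
        }
    }
}
```

With this shape the first failing analyzer stops the remaining ones from running; if that matters, the try/catch could go inside the loop instead, with the close still in an outer finally.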
The issues I was seeing in my installation have subsided. Note that I made this fix while they were already tailing off, but I don't think I saw any "already connected" errors after that.