Hey folks,
we ran into a problem where the MongoDB Java driver v3.0.2 wouldn't free up connections to a node (blade5-9) after that node stepped down from primary to secondary, even though we have a maxConnectionLifeTime of 20 minutes set for the pool (see the configuration sketch below). You would expect all connections to that (now) secondary to be invalidated and closed. After the initial invalidation, however, they appear to reconnect automatically:
2015-11-08T04:00:03.847+0000 I COMMAND [conn47669] command admin.$cmd command: replSetStepDown { replSetStepDown: 60.0 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:37 locks:{} 1875ms
2015-11-08T04:00:03.847+0000 I WRITE [conn47700] update aaaa_prd.bcwAAiaaic2t6aaadimBQMstoh query: { uuid: "bcxW6iaaiE6PaaaadokXVc88ta" }
[...] keyUpdates:0 writeConflicts:0 exception: Not primary while writing to aaaa_prd.bcwAAiaaic2t6aaadimBQMstoh code:10107 numYields:0 locks:{} 119ms
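For context, this is roughly how our clients configure the pool (a minimal sketch, not our literal code; the host names, port, and class wrapper are placeholders, but the two timeouts are the real values):

import com.mongodb.MongoClient;
import com.mongodb.MongoClientOptions;
import com.mongodb.ServerAddress;

import java.util.Arrays;
import java.util.concurrent.TimeUnit;

public class PoolConfigSketch {
    public static void main(String[] args) {
        // Connections should be retired after 20 minutes of total lifetime
        // and after 10 minutes of sitting idle in the pool.
        MongoClientOptions options = MongoClientOptions.builder()
                .maxConnectionLifeTime((int) TimeUnit.MINUTES.toMillis(20))
                .maxConnectionIdleTime((int) TimeUnit.MINUTES.toMillis(10))
                .build();

        // Seed list for replica set rs10 (addresses anonymized as in the logs)
        MongoClient client = new MongoClient(
                Arrays.asList(new ServerAddress("blade5-9-vip", 27017),
                              new ServerAddress("blade2-9-vip", 27017)),
                options);
    }
}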
The client now detects that the connections are invalid:
[2015-11-08 04:00:03.727 GMT] INFO org.mongodb.driver.connection Opened connection [connectionId{localValue:151324, serverValue:47700}] to blade5-9-vip:XXXX
[...]
[2015-11-08 04:00:03.847 GMT] WARN org.mongodb.driver.connection Got socket exception on connection [connectionId{localValue:151324, serverValue:47700}] to blade5-9-vip:XXXX. All connections to blade5-9-vip:XXXX will be closed.
On the mongod side, every connection invalidated this way logs the following message:
2015-11-08T04:00:03.882+0000 I NETWORK [conn47471] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [10.0.X.X:52310]
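From the application's point of view, the write that hits the stepped-down node looks roughly like this (sketch only; the $set payload is made up, the collection and uuid come from the anonymized log above, and 'client' is the MongoClient from the configuration sketch):

import com.mongodb.MongoException;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;

MongoCollection<Document> coll = client
        .getDatabase("aaaa_prd")
        .getCollection("bcwAAiaaic2t6aaadimBQMstoh");
try {
    coll.updateOne(eq("uuid", "bcxW6iaaiE6PaaaadokXVc88ta"),
                   new Document("$set", new Document("someField", "someValue")));
} catch (MongoException e) {
    // During the step-down this fails with "Not primary while writing ..."
    // (code 10107), and the driver invalidates the pool for that node.
}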
The client detects the transition to secondary for that node:
[2015-11-08 04:00:03.868 GMT] INFO org.mongodb.driver.cluster No server chosen by ReadPreferenceServerSelector{readPreference=primary} from cluster description ClusterDescription{type=REPLICA_SET, connectionMode=MULTIPLE, all=[ServerDescription{address=blade2-9-vip:XXXX, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 0, 1]}, minWireVersion=0, maxWireVersion=3, electionId=null, maxDocumentSize=16777216, roundTripTimeNanos=484423, setName='rs10', canonicalAddress=blade2-9-vip:XXXX, hosts=[blade5-9-vip:XXXX,
blade2-9-vip:XXXX], passives=[], arbiters=[blade4-4-mongo:XXXX], primary='blade5-9-vip:XXXX', tagSet=TagSet{[]}},
ServerDescription{address=blade4-4-mongo:XXXX, type=REPLICA_SET_ARBITER, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 0, 1]},
minWireVersion=0, maxWireVersion=3, electionId=null, maxDocumentSize=16777216, roundTripTimeNanos=505066, setName='rs10', canonicalAddress=blade4-4-mongo:XXXX,
hosts=[blade5-9-vip:XXXX, blade2-9-vip:XXXX], passives=[], arbiters=[blade4-4-mongo:XXXX], primary='blade5-9-vip:XXXX', tagSet=TagSet{[]}},
ServerDescription{address=blade5-9-vip:XXXX, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoInterruptedException: Interrupted acquiring a permit to retrieve an item from the pool }, caused by {java.lang.InterruptedException}}]}. Waiting for 1000 ms before timing out
[...]
[2015-11-08 04:00:03.884 GMT] INFO cluster-ClusterId{value='563bb0081a761c2eecf2a777', description='null'}-blade5-9-vip:XXXX org.mongodb.driver.cluster Monitor thread successfully connected to server with description ServerDescription{address=blade5-9-vip:XXXX, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 0, 1]}, minWireVersion=0, maxWireVersion=3, electionId=null, maxDocumentSize=16777216, roundTripTimeNanos=803898, setName='rs10', canonicalAddress=blade5-9-vip:XXXX, hosts=[blade5-9-vip:XXXX, blade2-9-vip:XXXX], passives=[], arbiters=[blade4-4-mongo:XXXX], primary='null', tagSet=TagSet{[]}}
Surprisingly, the client now opens a new connection to the former primary, before that node has even finished the election and settled as secondary:
2015-11-08T04:00:03.891+0000 I NETWORK [initandlisten] connection accepted from 10.0.X.X:63986 #47733 (23 connections now open)
2015-11-08T04:00:04.435+0000 I REPL [ReplicationExecutor] replSetElect voting yea for blade2-9-vip:XXXX (0)
2015-11-08T04:00:05.839+0000 I REPL [ReplicationExecutor] Member blade2-9-vip:XXXX is now in state PRIMARY
[...]
2015-11-08T04:10:01.845+0000 I REPL [ReplicationExecutor] transition to RECOVERING
[...]
2015-11-08T04:13:58.438+0000 I REPL [ReplicationExecutor] transition to SECONDARY
2015-11-08T04:13:58.439+0000 I COMMAND [conn47994] command aaaa_prd.$cmd command: compact { compact: "temp" } keyUpdates:0 writeConflicts:0 numYields:0 reslen:37 locks:{} 8298ms
2015-11-08T04:13:58.439+0000 I REPL [ReplicationExecutor] syncing from: blade2-9-vip:XXXX
2015-11-08T04:13:58.452+0000 I NETWORK [conn47994] end connection 10.0.4.129:34086 (120 connections now open)
While the recovery is running, all 100+ connections from our four clients are re-established, on top of the ~10 connections used internally.
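If it helps with reproducing, the step-down that triggers this is just the standard command; a rough sketch of what our maintenance run issues (same client as in the configuration sketch above):

import com.mongodb.MongoException;
import org.bson.Document;

try {
    // Ask the primary to step down for 60 seconds, as seen in the server log
    client.getDatabase("admin").runCommand(new Document("replSetStepDown", 60));
} catch (MongoException e) {
    // Expected: the server drops the connection while stepping down,
    // so the command itself ends in a socket error on the client side.
}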
A connection with server id 47733 had already been open to that server multiple days earlier, despite the 20-minute max connection lifetime and the 10-minute idle time. It looks like the pool reconnects these connections (which it shouldn't) and subsequently doesn't shut them down properly:
$ grep "serverValue:47733" * |grep blade5-9
[2015-11-06 00:47:05.486 GMT] INFO org.mongodb.driver.connection Opened connection [connectionId{localValue:39393, serverValue:47733}] to blade5-9-vip:XXXX
[2015-11-06 00:47:10.531 GMT] INFO org.mongodb.driver.connection Closed connection [connectionId{localValue:39393, serverValue:47733}] to blade5-9-vip:XXXX because the pool has been closed.
[2015-11-08 04:00:03.891 GMT] INFO cluster-ClusterId{value='563bb0081a761c2eecf2a777', description='null'}-blade5-9-vip:XXXX org.mongodb.driver.connection Opened connection [connectionId{localValue:417362, serverValue:47733}] to blade5-9-vip:XXXX
Our setup consists of a primary, a secondary and an arbiter, and typically there are only 10-12 connections open to any of our secondaries.
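The connection counts come from serverStatus; a minimal sketch of how they can be read through the driver (again assuming the same client; the 3.0 driver's runCommand accepts a read preference, so the command can be targeted at the secondary in question):

import com.mongodb.ReadPreference;
import org.bson.Document;

Document status = client.getDatabase("admin")
        .runCommand(new Document("serverStatus", 1), ReadPreference.secondary());
// Prints the connections sub-document, e.g. { current: 12, available: ..., totalCreated: ... }
System.out.println(status.get("connections"));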
Any ideas? Could this be a bug in the driver? Do you need more logs?
Thanks,
Christian