Dear mongo users,
We have been running multiple replica sets in production for quite a while without major issues. But yesterday we experienced
two outages (within 12 hours) in different datacenters with exactly the same symptoms and no obvious trigger or reason. I dug through the logs and searched on the internet trying to
find something similar, but could not identify anything related. At this point I am completely out of ideas and quite desperate. Any help or suggestion would be greatly appreciated.
Below is a detailed description of the issue:
At some point, under a normal workload (fairly simple put/get operations, not very high volume), the primary replica set member becomes completely unresponsive, causing timeouts for all clients.
This happens with no obvious trigger and no external configuration change or event.
The secondary starts reporting replication errors every 30 seconds, and at the same time our Java clients log read timeouts like the following:
[2015-11-01 14:42:55.909 GMT] INFO org.mongodb.driver.connection Opened connection [connectionId{localValue:111480, serverValue:1951132}] to xyz:xyz
[2015-11-01 14:43:00.914 GMT] INFO org.mongodb.driver.connection Closed connection [connectionId{localValue:111480, serverValue:1951132}] to xyz:xyz because the pool has been closed.
[2015-11-01 14:43:00.915 GMT] ERROR Caused by: com.mongodb.MongoSocketReadTimeoutException: Timeout while receiving message
at com.mongodb.connection.InternalStreamConnection.translateReadException(InternalStreamConnection.java:474)
at com.mongodb.connection.InternalStreamConnection.receiveMessage(InternalStreamConnection.java:225)
at com.mongodb.connection.UsageTrackingInternalConnection.receiveMessage(UsageTrackingInternalConnection.java:102)
at com.mongodb.connection.DefaultConnectionPool$PooledConnection.receiveMessage(DefaultConnectionPool.java:416)
at com.mongodb.connection.QueryProtocol.receiveMessage(QueryProtocol.java:282)
at com.mongodb.connection.QueryProtocol.execute(QueryProtocol.java:229)
at com.mongodb.connection.QueryProtocol.execute(QueryProtocol.java:38)
at com.mongodb.connection.DefaultServer$DefaultServerProtocolExecutor.execute(DefaultServer.java:155)
at com.mongodb.connection.DefaultServerConnection.executeProtocol(DefaultServerConnection.java:219)
at com.mongodb.connection.DefaultServerConnection.query(DefaultServerConnection.java:166)
at com.mongodb.operation.FindOperation$1.call(FindOperation.java:397)
at com.mongodb.operation.FindOperation$1.call(FindOperation.java:394)
at com.mongodb.operation.OperationHelper.withConnectionSource(OperationHelper.java:195)
at com.mongodb.operation.OperationHelper.withConnection(OperationHelper.java:168)
at com.mongodb.operation.FindOperation.execute(FindOperation.java:394)
at com.mongodb.operation.FindOperation.execute(FindOperation.java:57)
at com.mongodb.Mongo.execute(Mongo.java:738)
at com.mongodb.Mongo$2.execute(Mongo.java:725)
at com.mongodb.DBCollection.findOne(DBCollection.java:737)
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at com.mongodb.connection.SocketStream.read(SocketStream.java:85)
at com.mongodb.connection.InternalStreamConnection.receiveResponseBuffers(InternalStreamConnection.java:491)
at com.mongodb.connection.InternalStreamConnection.receiveMessage(InternalStreamConnection.java:221)
The primary node's log file shows nothing unusual before the moment the errors started.
But from the moment the timeouts began, it shows a rapid increase in the number of new open connections coming from the Java clients (about 1,500 in roughly a minute, compared to around 50 normally).
(I can see that at least some connections are eventually closed, but much later - around 50 s - not at the time the client log claims to close them.) Eventually the primary hits the WiredTiger session limit (see also the monitoring sketch after the log excerpt below):
2015-11-01T14:54:20.775+0000 E STORAGE [conn1972883] WiredTiger (12) [1446389660:775495][8418:0x7f61bb5d7700], connection.open_session: only configured to support 20010 sessions (including 10 internal): Cannot allocate memory
2015-11-01T14:54:20.775+0000 I - [conn1972883] Invariant failure: ret resulted in status UnknownError 12: Cannot allocate memory at src/mongo/db/storage/wiredtiger/wiredtiger_session_cache.cpp 49
2015-11-01T14:54:20.782+0000 E STORAGE [conn1972884] WiredTiger (12) [1446389660:782141][8418:0x7f61bb4d6700], connection.open_session: only configured to support 20010 sessions (including 10 internal): Cannot allocate memory
2015-11-01T14:54:20.782+0000 I - [conn1972884] Invariant failure: ret resulted in status UnknownError 12: Cannot allocate memory at src/mongo/db/storage/wiredtiger/wiredtiger_session_cache.cpp 49
2015-11-01T14:54:20.788+0000 I CONTROL [conn1972883]
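
For reference, here is a minimal sketch of the kind of check I am planning to run against the primary to watch the connection count and hopefully catch the spike earlier next time. The host name, port, and alert threshold are placeholders, and of course this call will presumably also time out once the node is already frozen:

import com.mongodb.MongoClient;
import com.mongodb.ServerAddress;
import org.bson.Document;

public class ConnectionWatcher {
    public static void main(String[] args) throws InterruptedException {
        // Placeholder host/port; point this at the current primary.
        MongoClient client = new MongoClient(new ServerAddress("primary.example.com", 27017));
        try {
            while (true) {
                // serverStatus.connections reports the server-side current/available counts.
                Document status = client.getDatabase("admin")
                        .runCommand(new Document("serverStatus", 1));
                Document connections = (Document) status.get("connections");
                int current = connections.getInteger("current");
                int available = connections.getInteger("available");
                System.out.printf("current=%d available=%d%n", current, available);
                // 500 is an arbitrary placeholder threshold; normal load for us is ~50 connections.
                if (current > 500) {
                    System.err.println("Connection count well above normal - possible repeat of the incident");
                }
                Thread.sleep(10_000);
            }
        } finally {
            client.close();
        }
    }
}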
After the primary process died, the secondary finally detected the problem and took over. After that, things were fine again.
At first I thought the problem was simply the number of connections, but now I actually think the system recovered only because the invariant failure eventually
killed the "frozen" primary node and let the healthy secondary take over. Otherwise the outage would have lasted much longer.
We are running WiredTiger on "db version v3.0.1". Our four Java clients are configured with a socketTimeout of 5 s and a connection pool size of 100.
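
For completeness, this is roughly how each client builds its MongoClient; the host names below are placeholders for our actual replica set members, and all other options are left at the driver defaults:

import java.util.Arrays;

import com.mongodb.MongoClient;
import com.mongodb.MongoClientOptions;
import com.mongodb.ServerAddress;

public class ClientSetup {
    public static MongoClient build() {
        MongoClientOptions options = MongoClientOptions.builder()
                .socketTimeout(5000)        // 5 s read timeout, as mentioned above
                .connectionsPerHost(100)    // pool size of 100 per host
                .build();
        // Placeholder replica set member addresses.
        return new MongoClient(
                Arrays.asList(
                        new ServerAddress("rs-member-1.example.com", 27017),
                        new ServerAddress("rs-member-2.example.com", 27017),
                        new ServerAddress("rs-member-3.example.com", 27017)),
                options);
    }
}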
Please reply if you have seen anything similar, or if you have any idea what I should try next to investigate further or to come up with a workaround.
Thanks in advance,
Sergei