WiredTiger primary node suddenly stops responding to clients, causing an outage.

Sergei Bobovich

Nov 1, 2015, 5:48:11 PM
to mongodb-user
Dear mongo users,
We have been running multiple replica sets in production for quite a while without major issues. But yesterday we experienced
two outages (within 12 hours) in different datacenters, with exactly the same symptoms and no obvious trigger or reason. I dug through the logs and searched the internet trying to
find something similar, but could not identify anything related. I must say that at this point I am totally out of ideas and quite desperate. Any help or suggestion would be very much appreciated.

Below is a detailed description of the issue:

At some point, under a normal workload (fairly simple put/get operations, not very high volume), the primary replica set member becomes completely unresponsive, causing timeouts for all clients.
This happens with no obvious trigger and no external configuration change or event.
The secondary's log starts reporting replication errors every 30 seconds:
====
2015-11-01T14:40:44.411+0000 I NETWORK  [rsBackgroundSync] Socket recv() timeout  xyz:xyz
2015-11-01T14:40:44.411+0000 I NETWORK  [rsBackgroundSync] SocketException: remote: xyz:xyz error: 9001 socket exception [RECV_TIMEOUT] server [xyz:xyz] 
2015-11-01T14:40:44.411+0000 E REPL     [rsBackgroundSync] sync producer problem: 10278 dbclient error communicating with server: xyz:xyz
2015-11-01T14:40:44.411+0000 I REPL     [ReplicationExecutor] syncing from: xyz:xyz
====

Java client calls start failing with read timeouts as well.
==== 
[2015-11-01 14:42:55.909 GMT] INFO  org.mongodb.driver.connection   Opened connection [connectionId{localValue:111480, serverValue:1951132}] to xyz:xyz
[2015-11-01 14:43:00.914 GMT] INFO  org.mongodb.driver.connection   Closed connection [connectionId{localValue:111480, serverValue:1951132}] to xyz:xyz because the pool has been closed.
[2015-11-01 14:43:00.915 GMT] ERROR Caused by: com.mongodb.MongoSocketReadTimeoutException: Timeout while receiving message
        at com.mongodb.connection.InternalStreamConnection.translateReadException(InternalStreamConnection.java:474)
        at com.mongodb.connection.InternalStreamConnection.receiveMessage(InternalStreamConnection.java:225)
        at com.mongodb.connection.UsageTrackingInternalConnection.receiveMessage(UsageTrackingInternalConnection.java:102)
        at com.mongodb.connection.DefaultConnectionPool$PooledConnection.receiveMessage(DefaultConnectionPool.java:416)
        at com.mongodb.connection.QueryProtocol.receiveMessage(QueryProtocol.java:282)
        at com.mongodb.connection.QueryProtocol.execute(QueryProtocol.java:229)
        at com.mongodb.connection.QueryProtocol.execute(QueryProtocol.java:38)
        at com.mongodb.connection.DefaultServer$DefaultServerProtocolExecutor.execute(DefaultServer.java:155)
        at com.mongodb.connection.DefaultServerConnection.executeProtocol(DefaultServerConnection.java:219)
        at com.mongodb.connection.DefaultServerConnection.query(DefaultServerConnection.java:166)
        at com.mongodb.operation.FindOperation$1.call(FindOperation.java:397)
        at com.mongodb.operation.FindOperation$1.call(FindOperation.java:394)
        at com.mongodb.operation.OperationHelper.withConnectionSource(OperationHelper.java:195)
        at com.mongodb.operation.OperationHelper.withConnection(OperationHelper.java:168)
        at com.mongodb.operation.FindOperation.execute(FindOperation.java:394)
        at com.mongodb.operation.FindOperation.execute(FindOperation.java:57)
        at com.mongodb.Mongo.execute(Mongo.java:738)
        at com.mongodb.Mongo$2.execute(Mongo.java:725)
        at com.mongodb.DBCollection.findOne(DBCollection.java:737)
        ... 59 more
Caused by: java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:170)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at com.mongodb.connection.SocketStream.read(SocketStream.java:85)
        at com.mongodb.connection.InternalStreamConnection.receiveResponseBuffers(InternalStreamConnection.java:491)
        at com.mongodb.connection.InternalStreamConnection.receiveMessage(InternalStreamConnection.java:221)
====

The primary node's log file shows nothing unusual before the moment the errors started.
But from the moment the timeouts started, it shows a rapid increase in the number of newly opened connections coming from the Java clients (about 1,500 in roughly a minute, versus around 50 normally).
At least some of these connections are closed, but much later (around 50 seconds afterwards), not at the moment the client log claims to close them. A small monitoring sketch follows the excerpt.
===
2015-11-01T14:40:15.763+0000 I NETWORK  [initandlisten] connection accepted from xyz:xyz #1947370 (48 connections now open)
2015-11-01T14:40:16.037+0000 I NETWORK  [initandlisten] connection accepted from xyz:xyz #1947371 (49 connections now open)
...
2015-11-01T14:41:21.496+0000 I NETWORK  [initandlisten] connection accepted from xyz:xyz #1949123 (1793 connections now open)
2015-11-01T14:41:21.507+0000 I NETWORK  [initandlisten] connection accepted from xyz:xyz #1949124 (1794 connections now open)
===
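
To catch a spike like this earlier, one option (assuming the node still answers commands at all once it starts freezing, which it may not) is to poll serverStatus from a separate process and watch its connections section. A minimal sketch with the same 3.x Java driver - the host name and the 10-second polling interval are placeholders, not values from our setup:
===
import com.mongodb.MongoClient;
import com.mongodb.ServerAddress;
import org.bson.Document;

// Polls serverStatus().connections so a jump like 48 -> ~1800 connections
// in a minute can be logged or alerted on. Host/port and the polling
// interval are illustrative placeholders.
public class ConnectionWatcher {
    public static void main(String[] args) throws InterruptedException {
        MongoClient client = new MongoClient(new ServerAddress("primary-host", 27017));
        try {
            while (true) {
                Document status = client.getDatabase("admin")
                        .runCommand(new Document("serverStatus", 1));
                Document connections = (Document) status.get("connections");
                System.out.println("current=" + connections.getInteger("current")
                        + " available=" + connections.getInteger("available"));
                Thread.sleep(10000);
            }
        } finally {
            client.close();
        }
    }
}
===
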
The log also shows several normal queries taking absurdly long; a getmore on local.oplog.rs took 85 seconds:
===
2015-11-01T14:41:39.640+0000 I QUERY    [conn1940885] getmore local.oplog.rs cursorid:219474599607 ntoreturn:0 keyUpdates:0 writeConflicts:0 numYields:3 nreturned:394 reslen:2794861 locks:{} 85229ms
===

Eventually, after almost 15 minutes, the primary dies because the number of connections exceeded the WiredTiger session limit (20,010 sessions).
===
2015-11-01T14:54:20.775+0000 E STORAGE  [conn1972883] WiredTiger (12) [1446389660:775495][8418:0x7f61bb5d7700], connection.open_session: only configured to support 20010 sessions (including 10 internal): Cannot allocate memory
2015-11-01T14:54:20.775+0000 I -        [conn1972883] Invariant failure: ret resulted in status UnknownError 12: Cannot allocate memory at src/mongo/db/storage/wiredtiger/wiredtiger_session_cache.cpp 49
2015-11-01T14:54:20.782+0000 E STORAGE  [conn1972884] WiredTiger (12) [1446389660:782141][8418:0x7f61bb4d6700], connection.open_session: only configured to support 20010 sessions (including 10 internal): Cannot allocate memory
2015-11-01T14:54:20.782+0000 I -        [conn1972884] Invariant failure: ret resulted in status UnknownError 12: Cannot allocate memory at src/mongo/db/storage/wiredtiger/wiredtiger_session_cache.cpp 49
2015-11-01T14:54:20.788+0000 I CONTROL  [conn1972883] 
===  

After the primary process died, the secondary finally detected the problem and took over. After that, things were fine again.

At first I thought the problem was the number of connections, but now I actually think the system recovered only because the "frozen" primary node was eventually killed, letting the healthy secondary take over. Otherwise the outage would have lasted much longer.

We are running WiredTiger on "db version v3.0.1". Our 4 Java clients are configured with a socketTimeout of 5s and a connection pool size of 100.
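
For reference, that client configuration would look roughly like the sketch below with the 3.x Java driver; the host names are placeholders, and connectTimeout/maxWaitTime are illustrative assumptions rather than our actual values. One plausible reading of the connection growth on the primary is that each read that hits the 5s socketTimeout has its pooled connection discarded (the "Closed connection ... because the pool has been closed" lines in the client log), so application threads keep opening fresh connections against the unresponsive primary.
===
import java.util.Arrays;

import com.mongodb.MongoClient;
import com.mongodb.MongoClientOptions;
import com.mongodb.ServerAddress;

public class ClientConfigSketch {
    // Builds a replica-set client with a 5 s socket timeout and up to 100
    // pooled connections per host, as described above. Host names are
    // placeholders; connectTimeout and maxWaitTime are assumed values.
    public static MongoClient buildClient() {
        MongoClientOptions options = MongoClientOptions.builder()
                .socketTimeout(5000)      // read timeout behind MongoSocketReadTimeoutException
                .connectTimeout(5000)     // assumed, not stated above
                .connectionsPerHost(100)  // pool size per host
                .maxWaitTime(10000)       // assumed cap on waiting for a pooled connection
                .build();
        return new MongoClient(
                Arrays.asList(new ServerAddress("rs-member-1", 27017),
                              new ServerAddress("rs-member-2", 27017),
                              new ServerAddress("rs-member-3", 27017)),
                options);
    }
}
===
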

Please reply if you have seen anything similar, or if you have any idea what I should try next to investigate further or come up with a workaround.

Thanks in advance,
Sergei 



Asya Kamsky

Nov 1, 2015, 7:02:22 PM
to mongodb-user
First, the connections are a symptom, not the cause. The main suggestion I have is to upgrade as soon as possible from "db version v3.0.1" to the current version, 3.0.7, as *many* bugs have been fixed in that time frame.

Asya






Sergei Bobovich

Nov 1, 2015, 7:35:30 PM
to mongodb-user
Hi Asya,

Thanks a lot for the prompt reply. By any chance, does anything fixed in 3.0.7 look similar to what we are experiencing? Upgrading to 3.0.7 is a very valid and logical suggestion, and it is in our plan to roll it out early next year. Unfortunately, given our release cycles and deployment procedures, moving to it quickly might not be an option. I was hoping for a possible workaround, or at least to gain a better understanding of the likely root cause, so that I could reproduce it in a QA environment or mitigate the issue in the meantime.

Best regards,
Sergei 

Asya Kamsky

Dec 6, 2015, 5:47:02 PM
to mongodb-user
The issue could be a number of things - you would need to examine your workload. There is a lot of information that you don't include.

For example (a sketch for gathering most of the server-side details follows this list) -

Do you frequently drop collections and recreate them (or create new ones)?
Do you use capped collections?
TTL indexes?
Are your documents large on average? How large?
What OS is mongod running on?
What options do you start mongod with?
The actual servers - how much RAM do they have, and what is the disk/file system configuration?
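
Most of the server-side details (OS, memory, startup options) can be pulled with the standard administrative commands buildInfo, hostInfo, and getCmdLineOpts. A minimal sketch with the Java driver, run once against each replica set member; the host name is a placeholder:
===
import com.mongodb.MongoClient;
import com.mongodb.ServerAddress;
import org.bson.Document;

public class ServerInfoDump {
    public static void main(String[] args) {
        // Placeholder host; point this at each replica set member in turn.
        MongoClient client = new MongoClient(new ServerAddress("rs-member-1", 27017));
        try {
            for (String cmd : new String[] {"buildInfo", "hostInfo", "getCmdLineOpts"}) {
                Document result = client.getDatabase("admin").runCommand(new Document(cmd, 1));
                System.out.println("----- " + cmd + " -----");
                System.out.println(result.toJson());
            }
        } finally {
            client.close();
        }
    }
}
===
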

Asya


