Hello,
I ran into very strange behavior from our MongoDB clusters in 2 projects at the same time. The projects are completely unrelated, so hitting the same issue simultaneously suggests the cause was not in the application running on top of MongoDB.
Here are some details about our setup:
1. I have 3 DCs - DC1, DC2 and DC3.
1.1. DC3 contains only MongoDB arbiter nodes.
1.2. DC1 and DC2 contain MongoDB replica members.
1.3. Each replica set contains 2 members from DC1, 2 members from DC2 and 1 member (an arbiter) from DC3. The DC3 member has priority 1, the DC1 members have priority 2 and the DC2 members have priority 3.
This setup ensures that the active MongoDB DC will normally be DC2, as its nodes have the highest priority. If DC2 goes down, DC1 takes over the primary role until the issue in DC2 is resolved.
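For illustration, a replica set configured this way would have a members array roughly like the following (hostnames and ports are placeholders, not our real ones; the arbiter can never become primary regardless of its priority):
{
    "_id" : "rs0",
    "members" : [
        { "_id" : 0, "host" : "dc1-db01:10000", "priority" : 2 },
        { "_id" : 1, "host" : "dc1-db02:10000", "priority" : 2 },
        { "_id" : 2, "host" : "dc2-db03:10000", "priority" : 3 },
        { "_id" : 3, "host" : "dc2-db04:10000", "priority" : 3 },
        { "_id" : 4, "host" : "dc3-arb01:10000", "arbiterOnly" : true, "priority" : 1 }
    ]
}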
I experienced an unexpected situation: the Cisco ASR router responsible for NAT translation in DC2 started rejecting new connections because its NAT connection pool was exhausted. Surprisingly, this did not trigger a re-election of the primary. As a result, our MongoDB clusters became inaccessible from our frontend (apache2) nodes, because the client side kept trying to open new connections to the primary node in DC2, and because of the NAT issue those new connections were impossible.
In our MongoDB logs I found that during the downtime there were no health check errors for our clusters. The only errors I found were related to oplog connectivity timeouts.
2018-12-02T06:10:42.412Z I ASIO [NetworkInterfaceASIO-RS-0] Failed to connect to 11.22.33.43:10000 - NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 299125357 -- target:11.22.33.43:10000 db:admin cmd:{ isMaster: 1 }
2018-12-02T06:10:42.413Z I ASIO [NetworkInterfaceASIO-RS-0] Dropping all pooled connections to 11.22.33.43:10000 due to failed operation on a connection
2018-12-02T06:10:42.413Z I ASIO [NetworkInterfaceASIO-RS-0] Failed to connect to 11.22.33.43:10000 - NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 299125359 -- target:11.22.33.43:10000 db:admin cmd:{ isMaster: 1 }
2018-12-02T06:10:42.413Z I REPL [replication-678] Error returned from oplog query (no more query restarts left): ExceededTimeLimit: Operation timed out, request was RemoteCommand 299125357 -- target:11.22.33.43:10000 db:admin cmd:{ isMaster: 1 }
2018-12-02T06:10:42.413Z I ASIO [NetworkInterfaceASIO-RS-0] Connecting to 11.22.33.43:10000
2018-12-02T06:10:42.413Z I REPL [SyncSourceFeedback] SyncSourceFeedback error sending update to 11.22.33.43:10000: ExceededTimeLimit: Operation timed out, request was RemoteCommand 299125357 -- target:11.22.33.43:10000 db:admin cmd:{ isMaster: 1 }
2018-12-02T06:10:42.413Z W REPL [rsBackgroundSync] Fetcher stopped querying remote oplog with error: ExceededTimeLimit: Operation timed out, request was RemoteCommand 299125357 -- target:11.22.33.43:10000 db:admin cmd:{ isMaster: 1 }
2018-12-02T06:10:42.413Z I REPL [rsBackgroundSync] sync source candidate: 11.22.33.46:10000
................
2018-12-02T06:11:02.413Z I ASIO [NetworkInterfaceASIO-RS-0] Failed to connect to 11.22.33.43:10000 - NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 299125480 -- target:11.22.33.43:10000 db:admin cmd:{ isMaster: 1 }
2018-12-02T06:11:02.413Z I ASIO [NetworkInterfaceASIO-RS-0] Dropping all pooled connections to 11.22.33.43:10000 due to failed operation on a connection
2018-12-02T06:11:02.413Z I ASIO [NetworkInterfaceASIO-RS-0] Failed to connect to 11.22.33.46:10000 - NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 299125482 -- target:11.22.33.46:10000 db:admin cmd:{ isMaster: 1 }
2018-12-02T06:11:02.414Z I ASIO [NetworkInterfaceASIO-RS-0] Dropping all pooled connections to 11.22.33.46:10000 due to failed operation on a connection
2018-12-02T06:11:02.414Z I REPL [replication-678] Blacklisting 11.22.33.46:10000 due to error: 'ExceededTimeLimit: Couldn't get a connection within the time limit' for 10s until: 2018-12-02T06:11:12.413+0000
2018-12-02T06:11:02.414Z I REPL [replication-678] sync source candidate: 11.22.33.43:10000
Here 11.22.33.43 corresponds to db03 in DC2 and 11.22.33.46 to db04 in DC2.
As you can see, the oplog synchronization timed out against db03 (DC2), which led to a synchronization attempt against db04 (also DC2). During the same period the health checks were still succeeding, as there are no such errors in our logs.
Once I logged into the database from a node in DC1, I could see that the health check timestamps were current. For more details, see the attached rs.status and rs.config output.
"optimeDate" : ISODate("2018-12-02T06:13:47Z"),
"optimeDurableDate" : ISODate("2018-12-02T06:13:47Z"),
"lastHeartbeat" : ISODate("2018-12-02T07:04:51.836Z"),
"lastHeartbeatRecv" : ISODate("2018-12-02T07:04:51.223Z"),
From this output I can see that optimeDate was 51 minutes behind lastHeartbeat, which confirms that the heartbeats were still succeeding while replication had stalled.
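The same comparison can be scripted from the mongo shell. A minimal sketch, using the rs.status() fields shown above, that prints how far each member's optimeDate trails its lastHeartbeat:
rs.status().members.forEach(function (m) {
    // lastHeartbeat is only reported for remote members, so guard for it
    if (m.optimeDate && m.lastHeartbeat) {
        var gapMin = Math.round((m.lastHeartbeat - m.optimeDate) / 60000);
        print(m.name + ": optimeDate trails lastHeartbeat by " + gapMin + " min");
    }
});
A large gap here with no heartbeat errors is exactly the pattern we saw: heartbeats succeeding while replication is stalled.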
To recover, I had to manually reconfigure our replica sets to exclude the DC2 members, which resolved the issue.
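For reference, the kind of manual reconfiguration I mean looks roughly like this, run from a node in DC1 (hostnames are the same placeholders as above; whether { force: true } is needed depends on whether the current primary is still reachable from where you run it):
cfg = rs.conf()
// keep only the DC1 members and the DC3 arbiter, dropping the DC2 members that clients cannot reach
cfg.members = cfg.members.filter(function (m) { return m.host.indexOf("dc2-") !== 0; })
rs.reconfig(cfg, { force: true })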
I was able to reproduce the same behavior by applying this rule on our primary node:
iptables -I INPUT -j DROP -p tcp --destination-port 10000 --syn
This resulted in the secondary nodes being unable to keep up with the primary:
"optimeDate" : ISODate("2018-12-04T14:09:58Z"),
"optimeDurableDate" : ISODate("2018-12-04T14:09:58Z"),
"lastHeartbeat" : ISODate("2018-12-04T14:45:51.137Z"),
"lastHeartbeatRecv" : ISODate("2018-12-04T14:45:48.889Z"),
I need your help with 2 things:
1. Confirming whether this is expected behavior.
2. Deciding on the best countermeasure, as I need a way to overcome this situation.
Best Regards,
Martin
Hi Martin,
It’s been some time since you posted this question, but I may have some thoughts worth sharing.
If I understand the issue correctly, new connections cannot be created to DC2, but previously established connections still work. This would explain why there was no election: the replica set members can still ping each other over their existing connections; they just cannot open new ones to move data between them. An election is called when communication is completely cut off or DC2 is actually down, which doesn’t sound like the case here. If my understanding of the situation is correct, then this is expected behaviour, since for all intents and purposes the nodes are still up and running.
To mitigate the situation in the future, you may need an additional metric to monitor the network health between the DCs, or to monitor the replication lag between the nodes, e.g. by regularly checking the output of rs.printSlaveReplicationInfo() to determine whether the lag is within your expected window.
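As a rough sketch of what such lag monitoring could look like (the 60-second threshold is only an example), something along these lines could be run periodically in the mongo shell:
var status = rs.status();
var primary = status.members.filter(function (m) { return m.stateStr === "PRIMARY"; })[0];
status.members.forEach(function (m) {
    if (primary && m.stateStr === "SECONDARY") {
        // compare each secondary's last applied optime with the primary's
        var lagSec = (primary.optimeDate - m.optimeDate) / 1000;
        print(m.name + " lags " + lagSec + "s behind the primary" + (lagSec > 60 ? " (over threshold)" : ""));
    }
});
This is essentially what rs.printSlaveReplicationInfo() reports, but in a form that is easier to feed into an alerting system.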
Best regards,
Kevin