Yes, I've noticed the chattiness increase with a higher log level. I'm just confused that there doesn't seem to be any extra heartbeat information; it's simply that more queries are being logged (at least from what I've seen in my logs so far).
I would like to increase my log level because I'm trying to troubleshoot an issue where my primary server is slow to respond to my secondary and arbiter. We've done a Wireshark trace on all the traffic between the members and determined it is not an infrastructure issue; the network looks perfectly healthy at the time of the slowness. So I want to see if there is any information available from the database side, since the boxes are new and look fine as well. I would simply raise the failover timeout slightly, since we are getting so many false positives, but I don't believe it is configurable. So I'd like to increase logging to get to the bottom of the issue, but I'd prefer not to have insanely large logs if I'm not going to get the information I need from them.
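For what it's worth, the verbosity can also be bumped at runtime from the shell rather than in the config file, which makes it easy to turn up during an incident window and back down afterwards so the logs don't balloon. A minimal sketch, assuming a shell connected directly to the primary:

    // Raise verbosity (0-5; higher values log more internal detail,
    // including replication activity)
    db.adminCommand({ setParameter: 1, logLevel: 2 })

    // Confirm the current value
    db.adminCommand({ getParameter: 1, logLevel: 1 })

    // Drop back to the default once the incident is captured
    db.adminCommand({ setParameter: 1, logLevel: 0 })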
When we looked through the trace we could see clear examples where the replica set heartbeats were, in fact, slow to respond. It would take 14-16s before the primary would send a response, even though we can see it ack the request from the secondary (or arbiter), and then we would see an election go through. At any rate, we haven't seen any latency in our application or in any monitoring on the mongos, and the logs look pretty normal. So it's curious that we don't see other symptoms, yet we see this happen every few weeks. Typically we don't have any problems, but recently there was a bug (which has been fixed, I believe) where mongo wouldn't roll back replica set transactions properly. I'm not sure which build it's been fixed in, but we are running 2.6.1 and just had it happen a couple of weeks ago, so the fix must be in a later build than the one we have. Aside from this heartbeat issue we haven't run into any other problems with our cluster, so it would be nice to iron this out so that we have a stable cluster again.
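One thing that might help narrow it down: the heartbeat round trips the members are seeing are reported in replSetGetStatus, so you can sample them without raising the log level at all. A minimal sketch, assuming 2.6's rs.status() output (pingMs and lastHeartbeatRecv are per-member fields; the self entry has neither):

    // Run in a shell on any member; prints each peer's state and
    // the most recent heartbeat round trip in milliseconds.
    rs.status().members.forEach(function (m) {
        if (m.self) return; // no heartbeat stats for the local member
        print(m.name + " state=" + m.stateStr +
              " pingMs=" + m.pingMs +
              " lastHeartbeatRecv=" + m.lastHeartbeatRecv);
    });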
I can post the traces when I get back in the office on Monday, if that will help.
And sorry for not being clear: the bug was that the primary that stepped down because of slow heartbeats wasn't able to roll back the transactions, and we ended up having to manually resync because it just shut down.
I would be willing to try upgrading to 2.6.5; it would just require a little testing beforehand, since we have production systems running on this cluster.
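Before and after that test it's worth confirming exactly which build each member reports; a quick check from the shell:

    db.version()                               // e.g. "2.6.1"
    db.adminCommand({ buildInfo: 1 }).version  // same value, via a command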
2014-11-20T16:59:06.575-0600 [conn60120] command admin.$cmd command: serverStatus { serverStatus: 1 } ntoreturn:1 keyUpdates:0 numYields:0 locks(micros) r:39 reslen:3656 866ms
2014-11-20T16:59:07.579-0600 [conn60120] end connection xx.x.xxx.xxx:xxxxx (173 connections now open)
2014-11-20T16:59:08.528-0600 [conn60124] end connection xx.x.xxx.xxx:xxxxx (172 connections now open)
2014-11-20T16:59:29.613-0600 [clientcursormon] mem (MB) res:236057 virt:3745418
2014-11-20T16:59:29.613-0600 [clientcursormon] mapped (incl journal view):3737954
2014-11-20T16:59:29.613-0600 [clientcursormon] connections:172
2014-11-20T16:59:29.613-0600 [clientcursormon] replication threads:32
2014-11-20T16:59:30.754-0600 [conn57521] end connection xx.x.xxx.xxx:xxxxx (171 connections now open)
2014-11-20T17:00:00.433-0600 [initandlisten] connection accepted from xx.x.xxx.xxx:xxxxx #60126 (172 connections now open)
2014-11-20T17:00:00.433-0600 [initandlisten] connection accepted from xx.x.xxx.xxx:xxxxx #60127 (173 connections now open)
2014-11-20T17:00:00.433-0600 [conn60125] end connection xx.x.xxx.xxx:xxxxx (172 connections now open)
2014-11-20T17:00:00.433-0600 [initandlisten] connection accepted from xx.x.xxx.xxx:xxxxx #60128 (174 connections now open)
2014-11-20T17:00:00.433-0600 [conn60126] end connection xx.x.xxx.xxx:xxxxx (172 connections now open)
2014-11-20T17:00:00.433-0600 [initandlisten] connection accepted from xx.x.xxx.xxx:xxxxx #60129 (174 connections now open)
2014-11-20T17:00:00.433-0600 [initandlisten] connection accepted from xx.x.xxx.xxx:xxxxx #60130 (174 connections now open)
2014-11-20T17:00:00.433-0600 [initandlisten] connection accepted from xx.x.xxx.xxx:xxxxx #60131 (175 connections now open)
2014-11-20T17:00:00.433-0600 [conn60127] end connection xx.x.xxx.xxx:xxxxx (174 connections now open)
2014-11-20T17:00:00.433-0600 [conn60128] end connection xx.x.xxx.xxx:xxxxx (174 connections now open)
2014-11-20T17:00:00.434-0600 [conn60130] end connection xx.x.xxx.xxx:xxxxx (174 connections now open)
2014-11-20T17:00:00.433-0600 [conn60129] end connection xx.x.xxx.xxx:xxxxx (174 connections now open)
2014-11-20T17:00:00.433-0600 [initandlisten] connection accepted from xx.x.xxx.xxx:xxxxx #60132 (176 connections now open)
2014-11-20T17:00:00.434-0600 [conn60131] end connection xx.x.xxx.xxx:xxxxx (171 connections now open)
2014-11-20T17:00:00.434-0600 [initandlisten] connection accepted from xx.x.xxx.xxx:xxxxx #60133 (173 connections now open)
2014-11-20T17:00:00.434-0600 [initandlisten] connection accepted from xx.x.xxx.xxx:xxxxx #60134 (173 connections now open)
2014-11-20T17:00:00.434-0600 [conn60132] end connection xx.x.xxx.xxx:xxxxx (172 connections now open)
2014-11-20T17:00:00.434-0600 [conn60134] authenticate db: local { authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2014-11-20T17:00:00.435-0600 [rsHealthPoll] replset info ARBITER:27017 thinks that we are down
2014-11-20T17:00:00.477-0600 [rsHealthPoll] replSet member MONGOSERVER2:27017 is now in state PRIMARY
2014-11-20T17:00:00.477-0600 [rsMgr] stepping down; another primary was elected more recently
2014-11-20T17:00:00.478-0600 [rsMgr] replSet relinquishing primary state
2014-11-20T17:00:00.478-0600 [rsMgr] replSet SECONDARY
2014-11-20T17:00:00.478-0600 [rsMgr] replSet closing client sockets after relinquishing primary
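As a side note, the 866ms serverStatus near the top of that excerpt is easy to sample directly, to see whether the node stalls on ordinary commands and not just heartbeats. A rough client-side sketch from a shell on the suspect primary:

    // Time serverStatus round trips; consistently high numbers would
    // point at a general stall on the node rather than the heartbeat path.
    for (var i = 0; i < 10; i++) {
        var t0 = new Date();
        db.adminCommand({ serverStatus: 1 });
        print("serverStatus round trip: " + (new Date() - t0) + " ms");
        sleep(1000); // shell built-in; milliseconds
    }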
On Nov 29, 2014 12:14 AM, "Adam C" <> wrote:
Well, it doesn't sound like an issue I recall from the earlier 2.6 builds, but I may just not remember the issue in question. There are >160 issues closed between 2.6.1 and 2.6.5; I'll have a look through them and see whether any I might have missed could account for such a slow heartbeat response. Is an upgrade to 2.6.5 something you would be willing to try? We don't have a specific smoking gun, but before we try to investigate a new potential bug it is always advisable to be on the latest point release of the build, to make sure the issue has not already been fixed.
It would certainly also be interesting to get a look at the traces showing the slow responses, though I understand that might not be possible.
Adam
2014-11-20T16:59:01.735-0600 [conn59639] authenticate db: admin { authenticate: 1, nonce: "xxx", user: "svcClusterAdmin", key: "xxx" }
2014-11-20T16:59:02.000-0600 [repl index builder 30] Index Build(background): 36200500/124435801 29%
2014-11-20T16:59:05.003-0600 [repl index builder 30] Index Build(background): 36577700/124435801 29%
2014-11-20T16:59:08.000-0600 [repl index builder 30] Index Build(background): 36958700/124435801 29%
2014-11-20T16:59:08.528-0600 [conn59637] end connection 10.0.240.250:59921 (70 connections now open)
2014-11-20T16:59:11.000-0600 [repl index builder 30] Index Build(background): 37338400/124435801 30%
2014-11-20T16:59:14.000-0600 [repl index builder 30] Index Build(background): 37697700/124435801 30%
2014-11-20T16:59:17.010-0600 [repl index builder 30] Index Build(background): 38079300/124435801 30%
2014-11-20T16:59:18.527-0600 [rsHealthPoll] DBClientCursor::init call() failed
2014-11-20T16:59:18.547-0600 [rsHealthPoll] can't authenticate to CHILXPROD061:27017 (10.0.240.250) failed as internal user, error: DBClientBase::findN: transport error: CHILXPROD061:27017 ns: local.$cmd query: { getnonce: 1 }
2014-11-20T16:59:20.000-0600 [repl index builder 30] Index Build(background): 38458700/124435801 30%
2014-11-20T16:59:23.000-0600 [repl index builder 30] Index Build(background): 38835100/124435801 31%
2014-11-20T16:59:26.000-0600 [repl index builder 30] Index Build(background): 39211100/124435801 31%
2014-11-20T16:59:26.645-0600 [conn59638] end connection xxx.xx.xxx.xxx:xxxxx (69 connections now open)
2014-11-20T16:59:26.645-0600 [initandlisten] connection accepted from xxx.xx.xxx.xxx:xxxxx #59640 (70 connections now open)
2014-11-20T16:59:26.678-0600 [conn59640] authenticate db: local { authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2014-11-20T16:59:28.549-0600 [rsHealthPoll] DBClientCursor::init call() failed
2014-11-20T16:59:28.556-0600 [rsHealthPoll] replSet info MONGOSERVER1:27017 is down (or slow to respond):
2014-11-20T16:59:28.556-0600 [rsHealthPoll] replSet member MONGOSERVER1:27017 is now in state DOWN
2014-11-20T16:59:28.556-0600 [rsMgr] replSet info electSelf 1
2014-11-20T16:59:29.000-0600 [repl index builder 30] Index Build(background): 39581200/124435801 31%
2014-11-20T16:59:30.766-0600 [rsMgr] replSet PRIMARY
2014-11-20T16:59:32.000-0600 [repl index builder 30] Index Build(background): 39965000/124435801 32%
2014-11-20T16:59:33.972-0600 [conn59639] command admin.$cmd command: serverStatus { serverStatus: 1 } ntoreturn:1 keyUpdates:0 numYields:0 locks(micros) r:49 reslen:3608 154ms
2014-11-20T16:59:35.000-0600 [repl index builder 30] Index Build(background): 40289100/124435801 32%
2014-11-20T16:59:38.000-0600 [repl index builder 30] Index Build(background): 40679700/124435801 32%
2014-11-20T16:59:38.735-0600 [journal] old journal file will be removed: /opt/mongodata/journal/j._87
2014-11-20T16:59:40.597-0600 [rsHealthPoll] DBClientCursor::init call() failed
2014-11-20T16:59:41.000-0600 [repl index builder 30] Index Build(background): 41064300/124435801 33%
2014-11-20T16:59:44.000-0600 [repl index builder 30] Index Build(background): 41449500/124435801 33%
2014-11-20T16:59:47.000-0600 [repl index builder 30] Index Build(background): 41858100/124435801 33%
2014-11-20T16:59:50.000-0600 [repl index builder 30] Index Build(background): 42233500/124435801 33%
2014-11-20T16:59:52.598-0600 [rsHealthPoll] DBClientCursor::init call() failed
2014-11-20T16:59:53.000-0600 [repl index builder 30] Index Build(background): 42613000/124435801 34%
2014-11-20T16:59:56.000-0600 [repl index builder 30] Index Build(background): 42954300/124435801 34%
2014-11-20T16:59:56.685-0600 [conn59640] end connection xxx.xx.xxx.xxx:xxxxx (69 connections now open)
2014-11-20T16:59:56.686-0600 [initandlisten] connection accepted from 10.0.104.189:50816 #59641 (70 connections now open)
2014-11-20T16:59:56.726-0600 [conn59641] authenticate db: local { authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2014-11-20T16:59:59.000-0600 [repl index builder 30] Index Build(background): 43317900/124435801 34%
2014-11-20T17:00:00.435-0600 [rsHealthPoll] replSet member MONGOSERVER1:27017 is up
2014-11-20T17:00:00.435-0600 [rsHealthPoll] replSet member MONGOSERVER1:27017 is now in state PRIMARY
2014-11-20T17:00:00.435-0600 [rsMgr] another PRIMARY detected but it should step down since it was elected earlier than me
2014-11-20T17:00:00.435-0600 [rsMgr] another PRIMARY detected but it should step down since it was elected earlier than me
2014-11-20T16:58:50.711-0600 [conn41035] authenticate db: local { authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2014-11-20T16:59:13.954-0600 [rsHealthPoll] DBClientCursor::init call() failed
2014-11-20T16:59:13.954-0600 [rsHealthPoll] can't authenticate to CHILXPROD061:27017 (10.0.240.250) failed as internal user, error: DBClientBase::findN: transport error: CHILXPROD061:27017 ns: local.$cmd query: { getnonce: 1 }
2014-11-20T16:59:20.741-0600 [conn41036] authenticate db: local { authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2014-11-20T16:59:23.955-0600 [rsHealthPoll] DBClientCursor::init call() failed
2014-11-20T16:59:23.955-0600 [rsHealthPoll] replSet info MONGOSERVER1:27017 is down (or slow to respond):
2014-11-20T16:59:23.955-0600 [rsHealthPoll] replSet member MONGOSERVER1:27017 is now in state DOWN
2014-11-20T16:59:28.556-0600 [conn41036] replSet info voting yea for CHILXPROD062:27017 (1)
2014-11-20T16:59:32.680-0600 [rsHealthPoll] replSet member MONGOSERVER2:27017 is now in state PRIMARY
2014-11-20T16:59:35.956-0600 [rsHealthPoll] DBClientCursor::init call() failed
2014-11-20T16:59:47.958-0600 [rsHealthPoll] DBClientCursor::init call() failed
2014-11-20T16:59:50.790-0600 [conn41037] authenticate db: local { authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2014-11-20T16:59:59.959-0600 [rsHealthPoll] DBClientCursor::init call() failed
2014-11-20T17:00:00.434-0600 [conn41038] authenticate db: local { authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2014-11-20T17:00:01.962-0600 [rsHealthPoll] replSet member MONGOSERVER1:27017 is up
2014-11-20T17:00:01.962-0600 [rsHealthPoll] replSet member MONGOSERVER1:27017 is now in state SECONDARY...
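Given the repeated DBClientCursor::init call() failed lines from rsHealthPoll above, it might also be worth measuring the round trip to the primary from a shell on the secondary or arbiter host, outside the heartbeat code path entirely. A minimal sketch (the hostname is taken from the excerpt above; the ping command does not require authentication):

    // Connect straight to the suspect primary and time a ping.
    var conn = new Mongo("MONGOSERVER1:27017");
    var t0 = new Date();
    conn.getDB("admin").runCommand({ ping: 1 });
    print("ping round trip: " + (new Date() - t0) + " ms");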