Strange Behavior with Replica Set Operations

152 views
Skip to first unread message

Mike Tewner

unread,
Jul 12, 2012, 3:37:57 AM7/12/12
to mongod...@googlegroups.com
Hi Guys - 

Thank you for your continued help in all things Mongo! We're very happy with the database - keep up the great work!

Yesterday, I replaced our arbiter node [ rs.addArb(newNode:port), then rs.remove(oldNode:port) - for each of 2 shards] - and we've identified two strange behaviors which I'm not sure are expected or not.
(We're using version 2.0.2 for everything)

Issue #1: After removing the old arbiter, It seems that, for an instant, our MongoDB cluster failed over:

Logs from the PRIMARY:
Wed Jul 11 11:22:33 [conn889074] replSet replSetReconfig config object parses ok, 3 members specified
Wed Jul 11 11:22:33 [conn889074] replSet replSetReconfig [2]
Wed Jul 11 11:22:33 [conn889074] replSet info saving a newer config version to local.system.replset
Wed Jul 11 11:22:33 [conn889074] replSet saveConfigLocally done
Wed Jul 11 11:22:33 [conn889074] replSet relinquishing primary state
Wed Jul 11 11:22:33 [conn889074] replSet SECONDARY
Wed Jul 11 11:22:33 [conn889074] replSet closing client sockets after reqlinquishing primary
Wed Jul 11 11:22:33 [conn889074] replSet PRIMARY
Wed Jul 11 11:22:33 [conn889074] replSet replSetReconfig new config saved locally
Wed Jul 11 11:22:33 [rsHealthPoll] replSet member db01srv02:10001 is now in state DOWN
Wed Jul 11 11:22:33 [rsHealthPoll] replSet member shard001arb:10001 is now in state DOWN
Wed Jul 11 11:22:35 [rsHealthPoll] replSet member db01arb01:10001 is now in state DOWN
Wed Jul 11 11:22:35 [rsHealthPoll] replSet member db01srv02:10001 is up
Wed Jul 11 11:22:35 [rsHealthPoll] replSet member db01srv02:10001 is now in state SECONDARY
Wed Jul 11 11:22:35 [rsHealthPoll] replSet member shard001arb:10001 is up
Wed Jul 11 11:22:35 [rsHealthPoll] replSet member shard001arb:10001 is now in state ARBITER

Issue #2, Since then, we've been getting a lot of errors, which I understand we're supposed to ignore - Example:

Thu Jul 12 06:33:05 [conn2] created new distributed lock for XXXXX.m_questions on db01srv01:20001,db02srv01:20001,db01srv02:20001 ( lock timeout : 900000, ping interval : 30000, process : 0 )
Thu Jul 12 06:33:05 [conn2] ChunkManager: time to load chunks for XXXXX.m_questions: 0ms sequenceNumber: 29 version: 1|0

some of this:

Thu Jul 12 06:27:41 [conn44059] ns: XXXX.m_venues could not initialize cursor across all shards because : stale config detected for ns: XXXX.m_venues ParallelCursor::_init @ shard001/db01srv01:10001,db01srv02:10001 attempt: 0

immediately followed by lots of this:

Thu Jul 12 06:27:58 [conn44061] Socket recv() errno:104 Connection reset by peer 192.168.100.76:10001
Thu Jul 12 06:27:58 [conn44061] SocketException: remote: 192.168.100.76:10001 error: 9001 socket exception [1] server [192.168.100.76:10001
Thu Jul 12 06:27:58 [conn44061] DBClientCursor::init call() failed
Thu Jul 12 06:27:58 [conn44061] DBException in process: could not initialize cursor across all shards because : DBClientBase::findN: transport error: db02srv01:10001 query: { setShardVersion: "XXXX.m_venues", configdb: "db01srv01:20001,db02srv01:20001,db01srv02:20001", version: Timestamp 2000|0, serverID: ObjectId('4fe16f879f73b9bd68e05321'), shard: "shard002", shardHost: "shard002/db02srv01:10001,db02srv02:10001" } @ shard001/db01srv01:10001,db01srv02:10001

It's all fine-and-good that these are just for for notification purposes, but these are triggering unhandled "com.mongodb.MongoException"s in the JAVA driver:

2012-07-12 06:27:58,288 ERROR ConsoleServiceImpl.getAllHeaders: Failed to get venues
com.mongodb.MongoException: could not initialize cursor across all shards because : DBClientBase::findN: transport error: db02srv01:10001 query: { setShardVersion: "XXXX.m_venues", configdb: "db01srv01:20001,db02srv01:20001,db01srv02:20001", version: Timestamp 2000|0, serverID: ObjectId('4fe16f879f73b9bd68e05321'), shard: "shard002", shardHost: "shard002/db02srv01:10001,db02srv02:10001" } @ shard001/db01srv01:10001,db01srv02:10001
        at com.mongodb.MongoException.parse(MongoException.java:82)
        at com.mongodb.DBApiLayer$MyCollection.__find(DBApiLayer.java:310)
        at com.mongodb.DBCursor._check(DBCursor.java:360)
        at com.mongodb.DBCursor._hasNext(DBCursor.java:490)
        at com.mongodb.DBCursor.hasNext(DBCursor.java:515)
        at com.google.code.morphia.query.MorphiaIterator.hasNext(MorphiaIterator.java:40)
        at com.google.code.morphia.query.QueryImpl.asList(QueryImpl.java:254)
        at com.XXXXX.XXXXX.dao.GenericDaoImpl.getAllEntities(GenericDaoImpl.java:92)
        at com.XXXXX.XXXXX.dao.GenericDaoImpl.getAllEntities(GenericDaoImpl.java:84)
        at com.XXXXX.XXXXX.services.venue.VenueServiceImpl.getVenueHeaders(VenueServiceImpl.java:1274)

I'm not sure whether (a) The JAVA driver should handle this exception and wait for the next attempt, and only throw the exception after three attempts or (b) Mongo shouldn't return an error to the driver at all and silently retry, but either way, a _Notification_ should NOT be passed to an application as a fatal error.

What am I missing?
-MIke


This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. Please note that any disclosure, copying or distribution of the content of this information is strictly forbidden. If you have received this email message in error, please destroy it immediately and notify its sender.

Nat

unread,
Jul 12, 2012, 4:03:11 AM7/12/12
to mongod...@googlegroups.com
Mike,

Does your error go away if you restart the whole cluster? It seems like you are impacted by the caching.

Nat
From: Mike Tewner <mi...@scene53.com>
Date: Thu, 12 Jul 2012 10:37:57 +0300
Subject: [mongodb-user] Strange Behavior with Replica Set Operations
--
You received this message because you are subscribed to the Google
Groups "mongodb-user" group.
To post to this group, send email to mongod...@googlegroups.com
To unsubscribe from this group, send email to
mongodb-user...@googlegroups.com
See also the IRC channel -- freenode.net#mongodb

Mike Tewner

unread,
Jul 12, 2012, 4:12:51 AM7/12/12
to mongod...@googlegroups.com, nat....@gmail.com
The errors seem to gradually disappear. I'm not able to restart things after every topology change, but I assume it should work. 
...but that doesn't help with either of the two issues mentioned below.

See also the IRC channel -- freenode.net#mongodb

See also the IRC channel -- freenode.net#mongodb

See also the IRC channel -- freenode.net#mongodb

See also the IRC channel -- freenode.net#mongodb
Reply all
Reply to author
Forward
0 new messages