Stuck in STARTUP2

ka...@sentimetrix.com

Oct 24, 2016, 5:05:25 PM
to mongodb-user
We have been having issues with our MongoDB installation (MongoDB 3.2, running on an AWS EC2 Linux cluster).

There are 2 servers in each replica set and 2 shards (sXrY, where X is the shard number and Y is the replica number):
- s1r1 (stuck in STARTUP2)
- s1r3 (no data but intended to be a replica of s1r1)
- s3r1
- s3r3

* First, we had some issues with the RAID 10 syncs slowing down the database, and eventually our `s1r3` server fell far behind the master `s1r1`.
  * We then started getting the following errors from our clients:

pymongo.errors.OperationFailure: could not find host matching read preference { mode: "primary", tags: [ {} ] } for set dev-s1r


  * `s1r1` refuses to come out of STARTUP2, even when it is the only member of the replica set (after removing `s1r3`).
  * We tried wiping the data off of `s1r3` and having it resync from `s1r1`, but it did not work.

Any insights?

Some logs: [addresses and ports redacted]

dev-s1r:STARTUP2> rs.status()
{
    "set" : "dev-s1r",
    "date" : ISODate("2016-10-20T17:39:11.825Z"),
    "myState" : 5,
    "term" : NumberLong(0),
    "heartbeatIntervalMillis" : NumberLong(2000),
    "members" : [
        {
            "_id" : 4,
            "name" : "yourserver:22222",
            "health" : 1,
            "state" : 5,
            "stateStr" : "STARTUP2",
            "uptime" : 281245,
            "optime" : {
                "ts" : Timestamp(0, 0),
                "t" : NumberLong(-1)
            },
            "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
            "infoMessage" : "could not find member to sync from",
            "configVersion" : 137490,
            "self" : true
        }
    ],
    "ok" : 1
}


mongod.log: 

2016-10-17T11:31:46.497+0000 I CONTROL  [main] ***** SERVER RESTARTED *****
2016-10-17T11:31:46.507+0000 I CONTROL  [initandlisten] MongoDB starting : pid=27201 port=22222 dbpath=/data 64-bit host=ip-10-229-63-46
2016-10-17T11:31:46.507+0000 I CONTROL  [initandlisten] db version v3.2.8
2016-10-17T11:31:46.507+0000 I CONTROL  [initandlisten] git version: ed70e33130c977bda0024c125b56d159573dbaf0
2016-10-17T11:31:46.507+0000 I CONTROL  [initandlisten] OpenSSL version: OpenSSL 1.0.0-fips 29 Mar 2010
2016-10-17T11:31:46.507+0000 I CONTROL  [initandlisten] allocator: tcmalloc
2016-10-17T11:31:46.507+0000 I CONTROL  [initandlisten] modules: none
2016-10-17T11:31:46.507+0000 I CONTROL  [initandlisten] build environment:
2016-10-17T11:31:46.507+0000 I CONTROL  [initandlisten]     distmod: amazon
2016-10-17T11:31:46.507+0000 I CONTROL  [initandlisten]     distarch: x86_64
2016-10-17T11:31:46.507+0000 I CONTROL  [initandlisten]     target_arch: x86_64
2016-10-17T11:31:46.507+0000 I CONTROL  [initandlisten] options: { config: "/etc/mongod.conf", net: { bindIp: "0.0.0.0", http: { RESTInterfaceEnabled: true, enabled: true }, port: 22222 }, processManagement: { fork: true, pidFilePath: "/var/run/mongodb/mongod.pid" }, replication: { replSetName: "dev-s1r" }, sharding: { clusterRole: "shardsvr" }, storage: { dbPath: "/data", journal: { enabled: true } }, systemLog: { destination: "file", logAppend: true, path: "/log/mongod.log" } }
2016-10-17T11:31:46.527+0000 I -        [initandlisten] Detected data files in /data created by the 'wiredTiger' storage engine, so setting the active storage engine to 'wiredTiger'.
2016-10-17T11:31:46.528+0000 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=3G,session_max=20000,eviction=(threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
2016-10-17T11:31:46.814+0000 I STORAGE  [initandlisten] Starting WiredTigerRecordStoreThread local.oplog.rs
2016-10-17T11:31:46.814+0000 I STORAGE  [initandlisten] The size storer reports that the oplog contains 0 records totaling to 0 bytes
2016-10-17T11:31:46.814+0000 I STORAGE  [initandlisten] Scanning the oplog to determine where to place markers for truncation
2016-10-17T11:31:47.608+0000 I FTDC     [initandlisten] Initializing full-time diagnostic data capture with directory '/data/diagnostic.data'
2016-10-17T11:31:47.609+0000 I NETWORK  [initandlisten] waiting for connections on port 22222
2016-10-17T11:31:47.609+0000 I NETWORK  [websvr] admin web console waiting for connections on port 22222
2016-10-17T11:31:47.609+0000 I NETWORK  [HostnameCanonicalizationWorker] Starting hostname canonicalization worker
2016-10-17T11:31:47.612+0000 W REPL     [ReplicationExecutor] Failed to load timestamp of most recently applied operation: NoMatchingDocument: Did not find any entries in local.oplog.rs
2016-10-17T11:31:47.612+0000 I REPL     [ReplicationExecutor] New replica set config in use: { _id: "dev-s1r", version: 137490, protocolVersion: 1, members: [ { _id: 4, host: "yourserver:22222", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('56f9da3e3110ecc7beced234') } }
2016-10-17T11:31:47.612+0000 I REPL     [ReplicationExecutor] This node is yourserver:22222 in the config
2016-10-17T11:31:47.612+0000 I REPL     [ReplicationExecutor] transition to STARTUP2
2016-10-17T11:31:47.612+0000 I REPL     [ReplicationExecutor] Starting replication applier threads
2016-10-17T11:31:47.616+0000 I REPL     [rsSync] initial sync pending
<SNIP>
2016-10-20T17:36:45.655+0000 I REPL     [rsSync] initial sync pending
2016-10-20T17:36:45.655+0000 I REPL     [rsSync] no valid sync sources found in current replset to do an initial sync
2016-10-20T17:36:46.656+0000 I REPL     [rsSync] initial sync pending
2016-10-20T17:36:46.656+0000 I REPL     [rsSync] no valid sync sources found in current replset to do an initial sync
2016-10-20T17:36:47.656+0000 I REPL     [rsSync] initial sync pending
2016-10-20T17:36:47.656+0000 I REPL     [rsSync] no valid sync sources found in current replset to do an initial sync
2016-10-20T17:36:48.657+0000 I REPL     [rsSync] initial sync pending
2016-10-20T17:36:48.657+0000 I REPL     [rsSync] no valid sync sources found in current replset to do an initial sync
2016-10-20T17:36:49.657+0000 I REPL     [rsSync] initial sync pending
2016-10-20T17:36:49.657+0000 I REPL     [rsSync] no valid sync sources found in current replset to do an initial sync
2016-10-20T17:36:50.658+0000 I REPL     [rsSync] initial sync pending
2016-10-20T17:36:50.659+0000 I REPL     [rsSync] no valid sync sources found in current replset to do an initial sync
2016-10-20T17:36:51.659+0000 I REPL     [rsSync] initial sync pending
2016-10-20T17:36:51.659+0000 I REPL     [rsSync] no valid sync sources found in current replset to do an initial sync
2016-10-20T17:36:52.659+0000 I REPL     [rsSync] initial sync pending
2016-10-20T17:36:52.660+0000 I REPL     [rsSync] no valid sync sources found in current replset to do an initial sync
2016-10-20T17:36:53.660+0000 I REPL     [rsSync] initial sync pending
2016-10-20T17:36:53.660+0000 I REPL     [rsSync] no valid sync sources found in current replset to do an initial sync

/log/mongos.log:

2016-10-20T17:37:53.386+0000 W NETWORK  [Balancer] No primary detected for set dev-s1r
2016-10-20T17:37:53.887+0000 W NETWORK  [Balancer] No primary detected for set dev-s1r
2016-10-20T17:37:54.388+0000 W NETWORK  [Balancer] No primary detected for set dev-s1r
2016-10-20T17:37:54.889+0000 W NETWORK  [Balancer] No primary detected for set dev-s1r
2016-10-20T17:37:55.390+0000 W NETWORK  [Balancer] No primary detected for set dev-s1r
2016-10-20T17:37:55.891+0000 W NETWORK  [Balancer] No primary detected for set dev-s1r
2016-10-20T17:37:56.392+0000 W NETWORK  [Balancer] No primary detected for set dev-s1r
2016-10-20T17:37:56.893+0000 W NETWORK  [Balancer] No primary detected for set dev-s1r

Stephen Steneker

Oct 25, 2016, 7:52:04 AM
to mongodb-user
On Tuesday, 25 October 2016 08:05:25 UTC+11, kagan wrote:
We have been having issues with our MongoDB installation (MongoDB 3.2, running on an AWS EC2 Linux cluster).

There are 2 servers in each replica set and 2 shards (sXrY, where X is the shard number and Y is the replica number):
- s1r1 (stuck in STARTUP2)
- s1r3 (no data but intended to be a replica of s1r1)
- s3r1
- s3r3

* First, we had some issues with the RAID 10 syncs slowing down the database, and eventually our `s1r3` server fell far behind the master `s1r1`.
  * We then started getting the following errors from our clients:

pymongo.errors.OperationFailure: could not find host matching read preference { mode: "primary", tags: [ {} ] } for set dev-s1r

  * `s1r1` refuses to come out of STARTUP2, even when it is the only member of the replica set (after removing `s1r3`).
  * We tried wiping the data off of `s1r3` and having it resync from `s1r1`, but it did not work.

Any insights?

Hi,

It looks like there are a few issues with your replica set:
  • s1r1 is stuck in STARTUP2 because:

    • It does not appear to have a valid oplog:
      • 2016-10-17T11:31:47.612+0000 W REPL     [ReplicationExecutor] Failed to load timestamp of most recently applied operation: NoMatchingDocument: Did not find any entries in local.oplog.rs

    • As a result, initial sync is pending:
      • 2016-10-17T11:31:47.616+0000 I REPL     [rsSync] initial sync pending

    • However, there are no other members in the replica set configuration, so it can't find another member to sync from:
      • "infoMessage" : "could not find member to sync from",

  • PyMongo & mongos are reporting the same underlying problem: there is no primary in this replica set (a quick shell check to confirm the empty oplog follows below).

Is there a third member of this replica set, or only the two data-bearing nodes?

I'm not sure exactly how you got into this state, but since you've already dropped the data on s1r3, I expect the fastest path to a fix would be to recreate the replica set (a rough shell sketch follows these steps):
  • Start up s1r1 in standalone mode (i.e. without the replSet parameter)
  • Drop the local database:
    • db.getSiblingDB('local').dropDatabase()
  • Restart s1r1 with the replSet parameter
  • Run rs.initiate() to create a new oplog and replica set configuration
  • Re-add s1r3 to the replica set to perform initial sync:
    • rs.add("s1r3")
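
Roughly, the whole sequence could look like the sketch below. Treat it as an outline rather than exact commands: it reuses the dbPath (/data), port (22222), log path (/log/mongod.log) and config file (/etc/mongod.conf) from your mongod.log, and the s1r3 host name is a placeholder you would replace with the real host:port for that server.

# 1. Stop mongod on s1r1, then start it as a standalone (no replSetName / shardsvr options):
mongod --dbpath /data --port 22222 --logpath /log/mongod.log --fork

# 2. From a mongo shell connected to s1r1 (mongo --port 22222), drop the local database:
> db.getSiblingDB('local').dropDatabase()

# 3. Shut the standalone down and restart with your normal config (replSetName: dev-s1r):
> db.getSiblingDB('admin').shutdownServer()
mongod --config /etc/mongod.conf

# 4. Re-initiate the replica set, then re-add the second member so it performs an initial sync:
> rs.initiate()
> rs.add("s1r3.example.internal:22222")    // placeholder host name; use the real s1r3 host:port
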
Once you have the replica set healthy again, I would consider upgrading from 3.2.8 to 3.2.10, as there have been a number of important fixes for general stability and performance. However, definitely fix the immediate replica set configuration problem before upgrading.

Let us know if you have any further questions, or if you were able to get your replica set back online.

Regards,
Stephen