Replica set stucked in "Recovery" mode

3,188 views
Skip to first unread message

Matías Waisgold

unread,
Feb 14, 2013, 8:36:49 AM2/14/13
to mongod...@googlegroups.com
Hi, i'm having an issue with my replica. I have 3 servers, 1 primary 2 secondaries and one of the secondaries cannot be restored.
I tried removing the directory with the files, and restarted it. After one day it's in Recovery status with the following message in logs:

Thu Feb 14 09:28:55 [rsBackgroundSync] replSet not trying to sync from e-00000XXX.cloud.com:27017, it is vetoed for 195 more seconds                                                                                                                                                                                    
Thu Feb 14 09:28:55 [rsBackgroundSync] replSet not trying to sync from e-00000ZZZ.cloud.com:27017, it is vetoed for 0 more seconds

Rs.status:

IndexShard1:RECOVERING> rs.status()
{
        "set" : "IndexShard1",
        "date" : ISODate("2013-02-14T13:33:24Z"),
        "myState" : 3,
        "members" : [
                {
                        "_id" : 1,
                        "name" : "e-00000XXX.cloud.com:27017",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",
                        "uptime" : 85903,
                        "optime" : Timestamp(1360848802000, 115),
                        "optimeDate" : ISODate("2013-02-14T13:33:22Z"),
                        "lastHeartbeat" : ISODate("2013-02-14T13:33:23Z"),
                        "pingMs" : 1
                },
                {
                        "_id" : 2,
                        "name" : "e-00000ZZZ.cloud.com:27017",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 85903,
                        "optime" : Timestamp(1360848802000, 92),
                        "optimeDate" : ISODate("2013-02-14T13:33:22Z"),
                        "lastHeartbeat" : ISODate("2013-02-14T13:33:24Z"),
                        "pingMs" : 1
                },
                {
                        "_id" : 3,
                        "name" : "e-00000YYY.cloud.com:27017",
                        "health" : 1,
                        "state" : 3,
                        "stateStr" : "RECOVERING",
                        "uptime" : 86030,
                        "optime" : Timestamp(1360762916000, 93),
                        "optimeDate" : ISODate("2013-02-13T13:41:56Z"),
                        "errmsg" : "error RS102 too stale to catch up",
                        "self" : true
                }
        ],
        "ok" : 1
}

IndexShard1:RECOVERING> db.printReplicationInfo()
configured oplog size:   9393.956640625MB
log length start to end: 0secs (0hrs)
oplog first event time:  Wed Feb 13 2013 09:41:56 GMT-0400
oplog last event time:   Wed Feb 13 2013 09:41:56 GMT-0400
now:                     Thu Feb 14 2013 09:34:46 GMT-0400

Any ideas? Should I increase oplog size and resync my replica?

Thanks
Matias

Matías Waisgold

unread,
Feb 18, 2013, 10:54:38 AM2/18/13
to mongodb-user
Can anyone help me with this?


2013/2/14 Matías Waisgold <mwai...@gmail.com>

--
--
You received this message because you are subscribed to the Google
Groups "mongodb-user" group.
To post to this group, send email to mongod...@googlegroups.com
To unsubscribe from this group, send email to
mongodb-user...@googlegroups.com
See also the IRC channel -- freenode.net#mongodb
 
---
You received this message because you are subscribed to the Google Groups "mongodb-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mongodb-user...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Ronald Stalder

unread,
Feb 18, 2013, 12:00:50 PM2/18/13
to mongod...@googlegroups.com
Hi Matias

you said you remobed the directory with the files - did you recreate that directory (I assume it's the --dbpath for this server) and does mongod have write access?

if you take down that mongod (on YYY) and restart it, what does it's console log say?

Matías Waisgold

unread,
Feb 18, 2013, 1:03:09 PM2/18/13
to mongodb-user
I have this:

Mon Feb 18 14:00:21 [rsStart] trying to contact XXX:27017
Mon Feb 18 14:00:21 [rsStart] trying to contact ZZZ:27017
Mon Feb 18 14:00:21 [rsStart] replSet I am YYY:27017
Mon Feb 18 14:00:21 [rsStart] replSet got config version 3 from a remote, saving locally
Mon Feb 18 14:00:21 [rsStart] replSet info saving a newer config version to local.system.replset
Mon Feb 18 14:00:21 [FileAllocator] allocating new datafile /mnt/mongodb_vol/local.ns, filling with zeroes...
Mon Feb 18 14:00:21 [FileAllocator] creating directory /mnt/mongodb_vol/_tmp
Mon Feb 18 14:00:21 [FileAllocator] done allocating datafile /mnt/mongodb_vol/local.ns, size: 16MB,  took 0.003 secs
Mon Feb 18 14:00:21 [FileAllocator] allocating new datafile /mnt/mongodb_vol/local.0, filling with zeroes...
Mon Feb 18 14:00:21 [FileAllocator] done allocating datafile /mnt/mongodb_vol/local.0, size: 64MB,  took 0.028 secs
Mon Feb 18 14:00:21 [FileAllocator] allocating new datafile /mnt/mongodb_vol/local.1, filling with zeroes...
Mon Feb 18 14:00:21 [rsStart] replSet saveConfigLocally done
Mon Feb 18 14:00:21 [rsStart] replSet STARTUP2
Mon Feb 18 14:00:21 [rsSync] ******
Mon Feb 18 14:00:21 [rsSync] creating replication oplog of size: 9393MB...
Mon Feb 18 14:00:21 [FileAllocator] done allocating datafile /mnt/mongodb_vol/local.1, size: 128MB,  took 0.024 secs
Mon Feb 18 14:00:21 [FileAllocator] allocating new datafile /mnt/mongodb_vol/local.2, filling with zeroes...
Mon Feb 18 14:00:21 [FileAllocator] done allocating datafile /mnt/mongodb_vol/local.2, size: 2047MB,  took 0.112 secs
Mon Feb 18 14:00:21 [FileAllocator] allocating new datafile /mnt/mongodb_vol/local.3, filling with zeroes...
Mon Feb 18 14:00:21 [FileAllocator] done allocating datafile /mnt/mongodb_vol/local.3, size: 2047MB,  took 0.119 secs
Mon Feb 18 14:00:21 [FileAllocator] allocating new datafile /mnt/mongodb_vol/local.4, filling with zeroes...
Mon Feb 18 14:00:21 [FileAllocator] done allocating datafile /mnt/mongodb_vol/local.4, size: 2047MB,  took 0.072 secs
Mon Feb 18 14:00:21 [FileAllocator] allocating new datafile /mnt/mongodb_vol/local.5, filling with zeroes...
Mon Feb 18 14:00:22 [FileAllocator] done allocating datafile /mnt/mongodb_vol/local.5, size: 2047MB,  took 0.089 secs
Mon Feb 18 14:00:22 [FileAllocator] allocating new datafile /mnt/mongodb_vol/local.6, filling with zeroes...
Mon Feb 18 14:00:22 [FileAllocator] done allocating datafile /mnt/mongodb_vol/local.6, size: 2047MB,  took 0.084 secs
Mon Feb 18 14:00:22 [rsSync] ******
Mon Feb 18 14:00:22 [rsSync] replSet initial sync pending
Mon Feb 18 14:00:22 [rsSync] replSet initial sync need a member to be primary or secondary to do our initial sync
Mon Feb 18 14:00:23 [rsHealthPoll] replSet member ZZZ:27017 is up
Mon Feb 18 14:00:23 [rsHealthPoll] replSet member ZZZ:27017 is now in state SECONDARY
Mon Feb 18 14:00:23 [rsHealthPoll] replSet member XXX:27017 is up
Mon Feb 18 14:00:23 [rsHealthPoll] replSet member XXX:27017 is now in state PRIMARY



2013/2/18 Ronald Stalder <rona...@gmail.com>
Hi Matias

you said you remobed the directory with the files - did you recreate that directory (I assume it's the --dbpath for this server) and does mongod have write access?

if you take down that mongod (on YYY) and restart it, what does it's console log say?

--

Ronald Stalder

unread,
Feb 18, 2013, 1:23:44 PM2/18/13
to mongod...@googlegroups.com
I don't get it...

It says "need primary or secondary for inital sync" and only a second later the heartbeat starts and it sees the primary. So, in my opinion, the initial sync should be starting at this time - did anything happen in YYY console log since posting ??

There are a few scenarios in the doc about re-syncing, see http://docs.mongodb.org/manual/administration/replica-sets/#replica-set-resync-stale-member

But I can't see what you'd do wrong in respect of these procedures..

Matías Waisgold

unread,
Feb 18, 2013, 1:28:02 PM2/18/13
to mongodb-user
Ronald, it starts cloning correctly, but after some time it changes to recovery state and dies.

This are my current logs about sync:

Mon Feb 18 14:26:03 [rsSync] clone items_index.active_items 1326719
Mon Feb 18 14:26:03 [rsSync] 1326720 objects cloned so far from collection items_index.active_items


2013/2/18 Ronald Stalder <rona...@gmail.com>

Ronald Stalder

unread,
Feb 18, 2013, 1:32:37 PM2/18/13
to mongod...@googlegroups.com
you seem to have quite a big database, if it takes more than half an
hour to perform the initial sync. May be you want to try this
http://docs.mongodb.org/manual/administration/replica-sets/#replica-set-resync-by-copying

Matías Waisgold

unread,
Feb 18, 2013, 1:50:54 PM2/18/13
to mongodb-user
I think I have a big database and a slow network :P.
I'll try running the mongodump from the remote machine and start from there.

Thank's Ronald


2013/2/18 Ronald Stalder <rona...@gmail.com>
you seem to have quite a big database, if it takes more than half an hour to perform the initial sync. May be you want to try this http://docs.mongodb.org/manual/administration/replica-sets/#replica-set-resync-by-copying
--
--
You received this message because you are subscribed to the Google
Groups "mongodb-user" group.
To post to this group, send email to mongod...@googlegroups.com
To unsubscribe from this group, send email to

See also the IRC channel -- freenode.net#mongodb

--- You received this message because you are subscribed to the Google Groups "mongodb-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mongodb-user+unsubscribe@googlegroups.com.

Ronald Stalder

unread,
Feb 18, 2013, 1:53:07 PM2/18/13
to mongod...@googlegroups.com
a big and BUSY data base, as the oplog gets stalled during initial sync....

Good luck!

Jeff Lee

unread,
Feb 18, 2013, 2:08:33 PM2/18/13
to mongod...@googlegroups.com
Hi Matias,

You can't use mongodump to seed a new replicaset member.  

Your best bet is probably to take a filesystem snapshot of the secondary and try to copy the datafiles over, but if you are not able to sync a new member in 24 hours due to network issues I'm not sure how much luck you'll have copying datafiles over manually.  You may also need to increase your oplog size.





See also the IRC channel -- freenode.net#mongodb
 
---
You received this message because you are subscribed to the Google Groups "mongodb-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mongodb-user...@googlegroups.com.

Jeff Lee

unread,
Feb 18, 2013, 3:16:36 PM2/18/13
to mongod...@googlegroups.com

On Mon, Feb 18, 2013 at 11:08 AM, Jeff Lee <jeff...@gmail.com> wrote:
You can't use mongodump to seed a new replicaset member.  

Actually, I take this back.  Although I haven't done it personally, it might be possible by dumping the oplog along with the data.  Probably still easier to do the filesystem snapshot though.


Matías Waisgold

unread,
Feb 25, 2013, 9:01:15 AM2/25/13
to mongodb-user
Finally I had to stop writes on the entire cluster and wait for secondaries to resync and then restarted writing.
Thank's for your help.


2013/2/18 Jeff Lee <jeff...@gmail.com>

On Mon, Feb 18, 2013 at 11:08 AM, Jeff Lee <jeff...@gmail.com> wrote:
You can't use mongodump to seed a new replicaset member.  

Actually, I take this back.  Although I haven't done it personally, it might be possible by dumping the oplog along with the data.  Probably still easier to do the filesystem snapshot though.


Dwayne Mcnab

unread,
Mar 2, 2015, 11:54:39 AM3/2/15
to mongod...@googlegroups.com
Hi Matias,

 Your database was stuck in recovering state because there was too much of a lag between secondary and primary. I think your method of recovery was a bad idea.Well it depends on if you can afford to have that much downtime on your database. This is how i would solve this problem:

1. take a dump of the last entry of the oplog from one of your working secondaries.
2. stop the secondary that is in recovery state. 
3. start is up in stand alone mode.
4. connect to the database and drop the 'oplog.rs' collection.
5. restore the dump you took previously.
6 restart the server as a member of the replicaSet.
7. server should now have the latest copy of oplog and should be synced up.

ps. I would search the mongo logs to see why there was such a hug replication lag.

Dwayne Mcnab

unread,
Mar 2, 2015, 12:44:18 PM3/2/15
to mongod...@googlegroups.com
I just used this method to recover a delayed slave. I had the same problem and I was able to restore with 0  downtime.

For other MongoDB technical support options, see: http://www.mongodb.org/about/support/.
---
You received this message because you are subscribed to a topic in the Google Groups "mongodb-user" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/mongodb-user/UxFxd6wmot0/unsubscribe.
To unsubscribe from this group and all its topics, send an email to mongodb-user...@googlegroups.com.
To post to this group, send email to mongod...@googlegroups.com.
Visit this group at http://groups.google.com/group/mongodb-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/mongodb-user/ca8588d7-db8d-43c9-a4b1-af13e3be25df%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Reply all
Reply to author
Forward
0 new messages