Replica stays in RECOVERING state forever

Anil Lingamallu

May 4, 2012, 5:58:22 PM

to mongod...@googlegroups.com

I am hitting an issue while creating a replica set to be used as a shard. I started the replica set with 3 members, and this is the first time I am creating it. One of the members stays in the "RECOVERING" state forever. The error message shown by rs.status() in the shell is "error RS102 too stale to catch up". I read that if the oplog size is too small, this error occurs when a member comes back up after a lot of writes. I do not understand how a replica set member can be stale when there is no data (apart from Mongo's initial data). I am not sure if I am missing something obvious. This does not happen every time. Please clarify.

The version I am using is 2.1.1-pre- downloaded from http://www.mongodb.org/display/DOCS/MongoDB+on+Azure.

Attaching the zipped log files of three mongod instances.

Here is the output of db.printReplicationInfo():

Shard1:RECOVERING> db.printReplicationInfo()
configured oplog size:   10MB
log length start to end: 0secs (0hrs)
oplog first event time: Fri May 04 2012 13:59:32 GMT-0700 (Pacific Daylight Time)
oplog last event time:   Fri May 04 2012 13:59:32 GMT-0700 (Pacific Daylight Time)
now:                     Fri May 04 2012 14:39:37 GMT-0700 (Pacific Daylight Time)

Here is the output of rs.status()

Shard1:RECOVERING> rs.status()
{
        "set" : "Shard1",
        "date" : ISODate("2012-05-04T21:39:57Z"),
        "myState" : 3,
        "members" : [
                {
                        "_id" : 0,
                        "name" : "127.0.0.1:27100",
                        "health" : 1,
                        "state" : 3,
                        "stateStr" : "RECOVERING",
                        "uptime" : 2437,
                        "optime" : Timestamp(1336165172000, 1),
                        "optimeDate" : ISODate("2012-05-04T20:59:32Z"),
                        "errmsg" : "error RS102 too stale to catch up",
                        "self" : true
                },
                {
                        "_id" : 1,
                        "name" : "127.0.0.1:27101",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 2417,
                        "optime" : Timestamp(1336165173000, 1),
                        "optimeDate" : ISODate("2012-05-04T20:59:33Z"),
                        "lastHeartbeat" : ISODate("2012-05-04T21:39:56Z"),
                        "pingMs" : 0
                },
                {
                        "_id" : 2,
                        "name" : "127.0.0.1:27102",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",
                        "uptime" : 2417,
                        "optime" : Timestamp(1336165173000, 1),
                        "optimeDate" : ISODate("2012-05-04T20:59:33Z"),
                        "lastHeartbeat" : ISODate("2012-05-04T21:39:56Z"),
                        "pingMs" : 0
                }
        ],
        "ok" : 1
}

Regards

Logs.zip

Dan Crosta

May 7, 2012, 3:06:40 PM

to mongodb-user

Your oplog size is 10 megabytes, which is very much on the small side.
Depending on the size of your operations (i.e. inserts and updates)
this may only be able to hold a few operations before it "rolls over,"
which could cause this RS102 error. Did you set it to so small a size
on purpose?
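
As a rough illustration of how quickly a small oplog can roll over, here is a minimal sketch in plain JavaScript. The function name, the average entry size, and the write rate are all hypothetical inputs chosen for illustration, not values measured on this replica set; real oplog entries vary in size.

```javascript
// Rough estimate of oplog capacity. All names and sample numbers here are
// illustrative assumptions, not measurements from this replica set.
function estimateOplogWindow(oplogSizeMB, avgEntryBytes, opsPerSecond) {
    var capacityBytes = oplogSizeMB * 1024 * 1024;
    // Number of entries that fit before the oldest ones are overwritten.
    var entries = Math.floor(capacityBytes / avgEntryBytes);
    // Seconds of history the oplog holds at a steady write rate.
    var windowSeconds = opsPerSecond > 0 ? entries / opsPerSecond : Infinity;
    return { entries: entries, windowSeconds: windowSeconds };
}

// Example: 10 MB oplog, hypothetical 1 KB entries, 100 writes per second.
var estimate = estimateOplogWindow(10, 1024, 100);
```

Under these assumed numbers the oplog holds about 10,240 entries, or a little over 100 seconds of writes. A member that is offline for longer than that window could not catch up from the oplog alone.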

As far as getting back to a healthy state, if there truly is no data
in any of the databases, you can probably just stop mongod on port
27100, remove the data files from the data path used by that mongod,
then restart it. It will reconnect to the other replica set members
and resync any data that there is.
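
To confirm that the member has recovered after the resync, the output of rs.status() can be scanned for members that still report a problem. The following is a minimal sketch in plain JavaScript; the helper name membersNeedingAttention is hypothetical, and it relies only on the members, name, stateStr, and errmsg fields that appear in the status output posted earlier in this thread.

```javascript
// Returns the members of an rs.status()-style document that are not
// PRIMARY or SECONDARY, or that report an errmsg.
// Note: this simple check would also flag arbiters, which is fine for a
// three-member set like the one in this thread but may need adjusting.
function membersNeedingAttention(status) {
    var result = [];
    var members = status.members || [];
    for (var i = 0; i < members.length; i++) {
        var m = members[i];
        var healthyState = m.stateStr === "PRIMARY" || m.stateStr === "SECONDARY";
        if (!healthyState || m.errmsg) {
            result.push({ name: m.name, stateStr: m.stateStr, errmsg: m.errmsg || null });
        }
    }
    return result;
}

// Trimmed copy of the status document from the original post.
var sampleStatus = {
    set: "Shard1",
    members: [
        { _id: 0, name: "127.0.0.1:27100", stateStr: "RECOVERING",
          errmsg: "error RS102 too stale to catch up" },
        { _id: 1, name: "127.0.0.1:27101", stateStr: "SECONDARY" },
        { _id: 2, name: "127.0.0.1:27102", stateStr: "PRIMARY" }
    ]
};
var flagged = membersNeedingAttention(sampleStatus);
```

In the mongo shell the same function could be called as membersNeedingAttention(rs.status()); an empty result after the resync would suggest that every member is back in a normal state.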

- Dan


Anil Lingamallu

May 9, 2012, 1:21:02 PM

to mongod...@googlegroups.com

Thank you, Dan. Increasing the oplog size fixed the issue. I am running on a test machine with low disk space, so I had reduced the oplog size.

Does the oplog size matter even on the first start of a replica set (when no user data has been added)? Please clarify.

Dan Crosta

May 9, 2012, 2:30:36 PM

to mongodb-user

Even on a test instance, it's probably best to leave the default oplog
size setting (5% of disk space) as-is. It's very difficult to change
the oplog size after the fact. The oplog will need to contain some
data at initial startup -- usually just a "no-op" operation so that
the secondaries have something to query for.

Note that even with "no data", you may still have oplog entries. This
could happen, for instance, if you create a collection, insert some
documents, and then drop it. Each of the inserts, as well as the drop,
will end up as oplog entries. If a secondary is down while you do
this, it is possible for it to "fall off the end of the oplog" before
it's brought back up, particularly with such a small oplog.
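
The "fall off the end of the oplog" scenario can be sketched with a toy model in plain JavaScript. This is only an illustration under simplifying assumptions: the CappedOplog type and its methods are hypothetical, the cap is counted in entries rather than bytes, and timestamps are plain integers rather than real optimes.

```javascript
// Toy model of a capped oplog: keeps only the newest maxEntries operations.
// This is a simplification; the real oplog is capped by bytes, not by count.
function CappedOplog(maxEntries) {
    this.maxEntries = maxEntries;
    this.entries = [];
    this.nextTs = 1;
}

CappedOplog.prototype.append = function (op) {
    this.entries.push({ ts: this.nextTs++, op: op });
    if (this.entries.length > this.maxEntries) {
        this.entries.shift(); // the oldest entry rolls off
    }
};

// A secondary can resume only if the entry it last applied is still present.
CappedOplog.prototype.canResumeFrom = function (lastAppliedTs) {
    return this.entries.length > 0 && this.entries[0].ts <= lastAppliedTs;
};

var oplog = new CappedOplog(5);
oplog.append("no-op");                 // ts 1: initial entry at startup
var secondaryLastApplied = 1;          // the secondary goes down here
var canResumeBefore = oplog.canResumeFrom(secondaryLastApplied);

for (var i = 0; i < 10; i++) {
    oplog.append("insert");            // ts 2..11: inserts into a collection
}
oplog.append("drop");                  // ts 12: the collection is dropped
var canResumeAfter = oplog.canResumeFrom(secondaryLastApplied);
```

Here canResumeBefore is true and canResumeAfter is false: the databases end up empty again, yet the entry the secondary last applied has already been overwritten, so it would need a full resync.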

- Dan
