Hi All,
I configured a replica set with 3 members: one primary, one secondary, and one arbiter.
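For reference, the set was initiated roughly like this (a mongo shell sketch; the set name and host names are placeholders rather than my real configuration, only the primary's port 27011 matches the logs below):

rs.initiate({
    _id: "rs0",                                                // placeholder set name
    members: [
        { _id: 0, host: "PrimaryHost:27011" },                 // primary
        { _id: 1, host: "SecondaryHost:27011" },               // secondary
        { _id: 2, host: "ArbiterHost:27011", arbiterOnly: true } // arbiter
    ]
});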
After a few days, the secondary's mongod service stopped unexpectedly. I found the following error in the mongod log:
==========================================================================================
2014-08-14T15:20:00.139+0530 [RangeDeleter] warning: Error encountered while trying to delete range: Error encountered while deleting range: database.collection from { _id: "value" } -> { _id: "value" }, cause by: :: caused by :: 10107 waitForReplication called but not master anymore
2014-08-14T15:20:00.139+0530 [RangeDeleter] Helpers::removeRangeUnlocked time spent waiting for replication: 0ms
2014-08-14T15:20:00.139+0530 [RangeDeleter] rangeDeleter deleted 0 documents for database.collection from { _id: "value" } -> { _id: "value" }
2014-08-14T15:20:00.139+0530 [RangeDeleter] Assertion: 10107:replicatedToNum called but not master anymore
2014-08-14T15:20:00.142+0530 [RangeDeleter] 0x11c0e91 0x1163109 0x11477e6 0xebb7ab 0xeb5cfd 0xde2830 0xddd2dd 0x1205829 0x3518607851 0x3517ee890d
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11c0e91]
/usr/bin/mongod(_ZN5mongo10logContextEPKc+0x159) [0x1163109]
/usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0xe6) [0x11477e6]
/usr/bin/mongod(_ZN5mongo13SlaveTracking18opReplicatedEnoughENS_6OpTimeERKSs+0x8b) [0xebb7ab]
/usr/bin/mongod(_ZN5mongo18opReplicatedEnoughENS_6OpTimeENS_11BSONElementE+0xdd) [0xeb5cfd]
/usr/bin/mongod(_ZN5mongo17RangeDeleterDBEnv11deleteRangeERKNS_10StringDataERKNS_7BSONObjES6_S6_bPSs+0xbb0) [0xde2830]
/usr/bin/mongod(_ZN5mongo12RangeDeleter6doWorkEv+0x86d) [0xddd2dd]
/usr/bin/mongod() [0x1205829]
/lib64/libpthread.so.0() [0x3518607851]
/lib64/libc.so.6(clone+0x6d) [0x3517ee890d]
2014-08-14T15:20:00.162+0530 [RangeDeleter] SEVERE: terminate() called, printing stack (if implemented for platform):
0x11c0e91 0x11c0118 0x351cebcbe6 0x351cebcc13 0x351cebcd0e 0x114788e 0xebb7ab 0xeb5cfd 0xde2830 0xddd2dd 0x1205829 0x3518607851 0x3517ee890d
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11c0e91]
/usr/bin/mongod() [0x11c0118]
/usr/lib64/libstdc++.so.6() [0x351cebcbe6]
/usr/lib64/libstdc++.so.6() [0x351cebcc13]
/usr/lib64/libstdc++.so.6() [0x351cebcd0e]
/usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x18e) [0x114788e]
/usr/bin/mongod(_ZN5mongo13SlaveTracking18opReplicatedEnoughENS_6OpTimeERKSs+0x8b) [0xebb7ab]
/usr/bin/mongod(_ZN5mongo18opReplicatedEnoughENS_6OpTimeENS_11BSONElementE+0xdd) [0xeb5cfd]
/usr/bin/mongod(_ZN5mongo17RangeDeleterDBEnv11deleteRangeERKNS_10StringDataERKNS_7BSONObjES6_S6_bPSs+0xbb0) [0xde2830]
/usr/bin/mongod(_ZN5mongo12RangeDeleter6doWorkEv+0x86d) [0xddd2dd]
/usr/bin/mongod() [0x1205829]
/lib64/libpthread.so.0() [0x3518607851]
/lib64/libc.so.6(clone+0x6d) [0x3517ee890d]
========================================================================================
After that I started the secondary service manually. It stayed in the RECOVERING state because the secondary could not catch up with the primary, so I then started a full resync of the member from scratch (roughly the procedure sketched after the log below). However, it remains stuck in the RECOVERING state. The following is the error log:
============================================================================================
2014-08-15T05:13:00.357+0530 [rsSync] building index using bulk method
2014-08-15T05:13:03.000+0530 [rsSync] Index Build: 674600/1018567 66%
2014-08-15T05:13:08.609+0530 [rsSync] build index done. scanned 1018567 total records. 8.252 secs
2014-08-15T05:13:08.751+0530 [rsSync] replSet initial sync data copy, starting syncup
2014-08-15T05:13:08.751+0530 [rsSync] oplog sync 1 of 3
2014-08-15T05:13:09.780+0530 [rsBackgroundSync] replSet syncing to: PrimaryHost:27011
2014-08-15T05:13:09.781+0530 [rsBackgroundSync] replSet error RS102 too stale to catch up, at least from PrimaryHost:27011
2014-08-15T05:13:09.781+0530 [rsBackgroundSync] replSet our last optime : Aug 14 23:45:46 53ecfcd2:42
2014-08-15T05:13:09.781+0530 [rsBackgroundSync] replSet oldest at PrimaryHost:27011 : Aug 15 03:59:32 53ed384c:4c8
2014-08-15T05:13:09.781+0530 [rsBackgroundSync] replSet See
http://dochub.mongodb.org/core/resyncingaverystalereplicasetmember
2014-08-15T05:13:10.054+0530 [rsBackgroundSync] replSet error RS102 too stale to catch up
2014-08-15T05:13:10.054+0530 [rsBackgroundSync] replSet RECOVERING
2014-08-15T05:13:12.408+0530 [conn1339] end connection Arbiter:53931 (26 connections now open)
2014-08-15T05:13:12.408+0530 [initandlisten] connection accepted from Arbiter:53938 #1341 (27 connections now open)
2014-08-15T05:13:24.103+0530 [conn1340] end connection PrimaryHost:45803 (26 connections now open)
2014-08-15T05:13:24.104+0530 [initandlisten] connection accepted from PrimaryHost:45809 #1342 (27 connections now open)
2014-08-15T05:13:42.421+0530 [conn1341] end connection Arbiter:53938 (26 connections now open)
2014-08-15T05:13:42.421+0530 [initandlisten] connection accepted from Arbiter:53944 #1343 (27 connections now open)
2014-08-15T05:13:54.115+0530 [conn1342] end connection PrimaryHost:45809 (26 connections now open)
2014-08-15T05:13:54.116+0530 [initandlisten] connection accepted from PrimaryHost:45815 #1344 (27 connections now open)
2014-08-15T05:14:08.748+0530 [clientcursormon] mem (MB) res:91991 virt:233739
2014-08-15T05:14:08.748+0530 [clientcursormon] mapped (incl journal view):232428
2014-08-15T05:14:08.748+0530 [clientcursormon] connections:27
2014-08-15T05:14:08.748+0530 [clientcursormon] replication threads:32
2014-08-15T05:14:12.432+0530 [conn1343] end connection Arbiter:53944 (26 connections now open)
2014-08-15T05:14:12.432+0530 [initandlisten] connection accepted from Arbiter:53950 #1345 (27 connections now open)
2014-08-15T05:14:24.128+0530 [conn1344] end connection PrimaryHost:45815 (26 connections now open)
2014-08-15T05:14:24.128+0530 [initandlisten] connection accepted from PrimaryHost:45821 #1346 (27 connections now open)
2014-08-15T05:14:42.444+0530 [conn1345] end connection Arbiter:53950 (26 connections now open)
2014-08-15T05:14:42.444+0530 [initandlisten] connection accepted from Arbiter:53956 #1347 (27 connections now open)
2014-08-15T05:14:54.145+0530 [conn1346] end connection PrimaryHost:45821 (26 connections now open)
2014-08-15T05:14:54.146+0530 [initandlisten] connection accepted from PrimaryHost:45827 #1348 (27 connections now open)
2014-08-15T05:15:12.455+0530 [conn1347] end connection Arbiter:53956 (26 connections now open)
2014-08-15T05:15:12.455+0530 [initandlisten] connection accepted from Arbiter:53962 #1349 (27 connections now open)
2014-08-15T05:15:15.055+0530 [rsBackgroundSync] replSet not trying to sync from PrimaryHost:27011, it is vetoed for 474 more seconds
2014-08-15T05:15:15.055+0530 [rsBackgroundSync] replSet not trying to sync from PrimaryHost:27011, it is vetoed for 474 more seconds
2014-08-15T05:15:20.056+0530 [rsBackgroundSync] replSet not trying to sync from PrimaryHost:27011, it is vetoed for 469 more seconds
2014-08-15T05:15:20.056+0530 [rsBackgroundSync] replSet not trying to sync from PrimaryHost:27011, it is vetoed for 469 more seconds
2014-08-15T05:15:24.164+0530 [conn1348] end connection PrimaryHost:45827 (26 connections now open)
2014-08-15T05:15:24.165+0530 [initandlisten] connection accepted from PrimaryHost:45833 #1350 (27 connections now open)
2014-08-15T05:15:25.056+0530 [rsBackgroundSync] replSet not trying to sync from PrimaryHost:27011, it is vetoed for 464 more seconds
2014-08-15T05:15:25.057+0530 [rsBackgroundSync] replSet not trying to sync from PrimaryHost:27011, it is vetoed for 464 more seconds
2014-08-15T05:15:29.692+0530 [conn8] command admin.$cmd command: writebacklisten { writebacklisten: ObjectId('53bfc43d429aac2a013bbe25') } ntoreturn:1 keyUpdates:0 numYields:0 reslen:44 300000ms
2014-08-15T05:15:29.710+0530 [conn10] command admin.$cmd command: writebacklisten { writebacklisten: ObjectId('53bfda689b3f9287fd61fa97') } ntoreturn:1 keyUpdates:0 numYields:0 reslen:44 300000ms
2014-08-15T05:15:29.711+0530 [conn12] command admin.$cmd command: writebacklisten { writebacklisten: ObjectId('53bfc5344fe0e2debbacc83d') } ntoreturn:1 keyUpdates:0 numYields:0 reslen:44 300000ms
2014-08-15T05:15:29.713+0530 [conn11] command admin.$cmd command: writebacklisten { writebacklisten: ObjectId('53bfe37114393bcbfc59e2dd') } ntoreturn:1 keyUpdates:0 numYields:0 reslen:44 300000ms
2014-08-15T05:15:29.713+0530 [conn9] command admin.$cmd command: writebacklisten { writebacklisten: ObjectId('53b7238fa3c2fc4f399d430f') } ntoreturn:1 keyUpdates:0 numYields:0 reslen:44 300000ms
2014-08-15T05:15:29.724+0530 [conn13] command admin.$cmd command: writebacklisten { writebacklisten: ObjectId('53bfc50ccbac799a6d9da8c8') } ntoreturn:1 keyUpdates:0 numYields:0 reslen:44 300000ms
2014-08-15T05:15:30.057+0530 [rsBackgroundSync] replSet not trying to sync from PrimaryHost:27011, it is vetoed for 459 more seconds
2014-08-15T05:15:30.057+0530 [rsBackgroundSync] replSet not trying to sync from PrimaryHost:27011, it is vetoed for 459 more seconds
====================================================================================================
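The resync I attempted was a full initial sync, along the lines of the procedure on the dochub page referenced in the log above (sketch only, exact paths omitted): stop the stale secondary's mongod, empty its data directory (dbpath), and start it again with the same replSet options so it performs a fresh initial sync. While it runs, I watch the member states from the mongo shell like this:

// Connected to any member; prints each member's current replication state.
rs.status().members.forEach(function (m) {
    print(m.name + " : " + m.stateStr + " (optime " + m.optimeDate + ")");
});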
I observed that the oplog size is 10 GB.

My questions are: why did the secondary member stop unexpectedly? Why does the resync not succeed? Does the resync depend on the oplog size?
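For context, this is roughly how I checked the oplog size and replication window on the primary (mongo shell sketch):

// From a mongo shell connected to the primary:
db.printReplicationInfo()   // configured oplog size and log length start to end
// or, to read the same values programmatically:
var info = db.getReplicationInfo();
print("oplog size (MB): " + info.logSizeMB + ", window (hours): " + info.timeDiffHours);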
Best Regards,
Satya D