We have found an issue with a replica set member going out of sync
after approximately 1 hour under high load. On the slave we see the
following in the log:
Sun Apr 24 20:44:42 [FileAllocator] allocating new datafile /data/sessions.1/session_cache.5, filling with zeroes...
Sun Apr 24 20:44:46 [FileAllocator] done allocating datafile /data/sessions.1/session_cache.5, size: 2047MB, took 4.217 secs
Sun Apr 24 20:54:31 [conn142] query admin.$cmd ntoreturn:1 command: { serverStatus: 1 } reslen:1368 145ms
Sun Apr 24 20:54:37 [conn144] query admin.$cmd ntoreturn:1 command: { serverStatus: 1 } reslen:1368 109ms
Sun Apr 24 20:55:30 [conn146] query admin.$cmd ntoreturn:1 command: { serverStatus: 1 } reslen:1368 200ms
Sun Apr 24 20:55:38 [conn150] query admin.$cmd ntoreturn:1 command: { serverStatus: 1 } reslen:1368 853ms
Sun Apr 24 20:58:30 [conn164] query admin.$cmd ntoreturn:1 command: { serverStatus: 1 } reslen:1368 360ms
Sun Apr 24 21:10:43 [conn268] query admin.$cmd ntoreturn:1 command: { serverStatus: 1 } reslen:1368 285ms
Sun Apr 24 21:10:43 [conn267] query admin.$cmd ntoreturn:1 command: { serverStatus: 1 } reslen:1368 314ms
Sun Apr 24 21:10:43 [conn269] query admin.$cmd ntoreturn:1 command: { serverStatus: 1 } reslen:1368 279ms
Sun Apr 24 21:10:46 [conn270] query admin.$cmd ntoreturn:1 command: { serverStatus: 1 } reslen:1368 150ms
Sun Apr 24 21:12:32 [conn278] query admin.$cmd ntoreturn:1 command: { serverStatus: 1 } reslen:1368 335ms
Sun Apr 24 21:20:41 [conn337] query admin.$cmd ntoreturn:1 command: { serverStatus: 1 } reslen:1368 538ms
Sun Apr 24 21:22:55 [replica set sync] repl: old cursor isDead, will initiate a new one
Sun Apr 24 21:22:56 [replica set sync] replSet error RS102 too stale to catch up, at least from sessionmgr01:27717
Sun Apr 24 21:22:56 [replica set sync] replSet our last optime : Apr 24 21:17:08 4db4e7b4:54
Sun Apr 24 21:22:56 [replica set sync] replSet oldest at sessionmgr01:27717 : Apr 24 21:17:11 4db4e7b7:4f7
Sun Apr 24 21:22:56 [replica set sync] replSet See http://www.mongodb.org/display/DOCS/Resyncing+a+Very+Stale+Replica+Set+Member
Sun Apr 24 21:22:56 [replica set sync] replSet error RS102 too stale to catch up
Sun Apr 24 21:22:56 [replica set sync] replSet RECOVERING
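
To make the RS102 lines easier to read, here is a small Python sketch
that decodes the two optimes from the log. It assumes the optime is
printed as hexadecimal seconds since the Unix epoch, a colon, and a
hexadecimal per-second counter; that matches the timestamps printed
next to them, but it is our reading of the log rather than a
documented format. The helper name parse_optime is our own.

```python
def parse_optime(text):
    """Split an optime such as '4db4e7b4:54' into (seconds, increment).

    Assumption: both parts are hexadecimal, as the log output suggests.
    """
    seconds_hex, increment_hex = text.split(":")
    return int(seconds_hex, 16), int(increment_hex, 16)


# Values copied from the RS102 lines in the slave log.
ours = parse_optime("4db4e7b4:54")     # "our last optime" on the slave
oldest = parse_optime("4db4e7b7:4f7")  # "oldest at sessionmgr01:27717"

# Seconds between our last applied op and the oldest op the master still has.
gap_seconds = oldest[0] - ours[0]

# Tuples compare by seconds first, then by increment.
is_too_stale = ours < oldest

print(gap_seconds, is_too_stale)
```

If that reading is right, the slave's last applied operation is only
about 3 seconds older than the oldest entry still in the master's
oplog, so it fell off the end of the oplog by a small margin.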
On the master we see the following:
Sun Apr 24 21:22:49 [conn35] getmore local.oplog.rs cid: 968739367957150436 getMore: { ts: { $gte: new Date(5599344019072101879) } } bytes:4194364 nreturned:5506 3902ms
Sun Apr 24 21:22:55 [conn35] getmore local.oplog.rs cid: 968739367957150436 getMore: { ts: { $gte: new Date(5599344019072101879) } } bytes:62797 nreturned:96 4185ms
Sun Apr 24 21:22:55 [conn35] getMore: cursorid not found local.oplog.rs 968739367957150436
Sun Apr 24 21:22:56 [ReplSetHealthPollTask] replSet member sessionmgr02:27717 RECOVERING
Our oplog is sized at 1 GB, which should fill up in about 3 minutes
based on the size of the records we are inserting and updating. Is
there any way to determine how far behind a replica is, and to
determine why it is not catching up? The replica is on identical
hardware to the master.
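
To make the question concrete, here is a rough Python sketch of the two
numbers we are after. The first function is just the arithmetic behind
the 1 GB / 3 minute estimate. The second computes lag from a document
shaped like the output of the replSetGetStatus command, using its
members, name, stateStr and optimeDate fields. The function names and
the sample timestamps are made up for illustration; only the host names
come from our logs. In practice the status document would come from the
server instead of being hard-coded.

```python
from datetime import datetime


def implied_write_rate_mb_per_s(oplog_mb, fill_seconds):
    """Oplog write rate implied by filling oplog_mb in fill_seconds."""
    return oplog_mb / fill_seconds


def secondary_lag_seconds(status):
    """Map each non-primary member name to seconds behind the primary.

    Expects a replSetGetStatus-style dict with a 'members' list whose
    entries carry 'name', 'stateStr' and 'optimeDate'.
    """
    primary = next(m for m in status["members"] if m["stateStr"] == "PRIMARY")
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in status["members"]
        if m["stateStr"] != "PRIMARY"
    }


# Hypothetical status document; the optimeDate values are invented.
sample_status = {
    "members": [
        {"name": "sessionmgr01:27717", "stateStr": "PRIMARY",
         "optimeDate": datetime(2011, 4, 24, 21, 20, 0)},
        {"name": "sessionmgr02:27717", "stateStr": "SECONDARY",
         "optimeDate": datetime(2011, 4, 24, 21, 18, 30)},
    ]
}

rate = implied_write_rate_mb_per_s(1024, 180)  # 1 GB oplog, about 3 minutes
lag = secondary_lag_seconds(sample_status)
print(round(rate, 2), lag)
```

With these numbers the oplog would be absorbing roughly 5.7 MB of
operations per second, and any member whose lag approaches the 3 minute
window is at risk of the RS102 error shown in the slave log.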