Replica sync hangs after initial sync


p_h_i_l_l_i_p

Jun 22, 2012, 7:27:11 AM
to mongodb-user
Hi,
We have mongo configured in a replica set (master + slave + arbiter).
One day the slave was too stale to catch up, so I did a full resync
(deleted the /data dir). Since then I can't get the slave to catch up with the
master: each time, after the initial sync it stops syncing and advances only a
few seconds per hour, so after a while it goes stale again.
mongo ver: 2.0.3
uname -a: Linux mongo-db-2 2.6.18-194.32.1.el5.centos.plusxen #1 SMP
Wed Jan 5 19:10:49 EST 2011 x86_64 x86_64 x86_64 GNU/Linux
rs.status: http://pastebin.com/5u9gu8W9
db.currentOp: http://pastebin.com/wCLumni7
db.printReplicationInfo: http://pastebin.com/s7k9QF7P
logs from slave: http://pastebin.com/fGkgiM9D

Any ideas?

A. Jesse Jiryu Davis

Jun 22, 2012, 5:05:20 PM
to mongod...@googlegroups.com
Can you try restarting the secondary, without wiping its data or anything? Connect to it, run db.adminCommand('shutdown'), and restart it. Let us know if that fixes the problem.
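
Something like this, roughly (a sketch; the dbpath, logpath, and startup flags here are assumptions based on this thread -- restart mongod however you normally start it):

$ mongo mongo-db-2:27017/admin
SECONDARY> db.adminCommand('shutdown')    // mongod exits; the shell reports the dropped connection
$ mongod --replSet mongodb-prod --dbpath /data --logpath /var/log/mongod.log --fork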

myCloudWatcher

Jun 23, 2012, 12:45:37 PM
to mongod...@googlegroups.com
I have an issue that is very close to the same problem.

I needed to re-sync a secondary, and after the sync was over it never started to replicate. Over time it became "stale", so I did the sync again. It then stalled again....

At the same time, the server that was used for the "clone" was up and down a few times.

So my question is: why did the sync "end" and the server not just start replicating?

If some error was caused by the "host" of the clone going up and down, why were there no errors in the logs?


Edward M. Goldberg
http://myCloudWatcher.com/

A. Jesse Jiryu Davis

Jun 26, 2012, 5:28:11 PM
to mongod...@googlegroups.com
Edward, try restarting the secondary's mongod with some verbosity (-vv should do it) and make sure you don't see errors in the logs. I'm particularly concerned about https://jira.mongodb.org/browse/SERVER-4781. Additionally, if your hosts are in MMS, check the faults-per-second metric on the primary and secondaries. If they're spending too much time hitting the disk they may not be able to catch up.
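
For example (a sketch; keep whatever options you normally run with and just add the verbosity flag -- the paths here are assumptions):

$ mongod --replSet mongodb-prod --dbpath /data --logpath /var/log/mongod.log --fork -vv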

p_h_i_l_l_i_p

Jun 27, 2012, 4:44:54 AM
to mongod...@googlegroups.com
Hi, sorry for the delay. I'll try it and post feedback later.

p_h_i_l_l_i_p

Jun 27, 2012, 8:04:17 AM
to mongod...@googlegroups.com
Unfortunately that didn't help. The secondary still can't catch up with the master.

SECONDARY> db.printReplicationInfo()
configured oplog size:   4605.766796875MB
log length start to end: 194secs (0.05hrs)
oplog first event time:  Wed Jun 27 2012 10:44:27 GMT+0200 (CEST)
oplog last event time:   Wed Jun 27 2012 10:47:41 GMT+0200 (CEST)
now:                     Wed Jun 27 2012 13:48:29 GMT+0200 (CEST)
SECONDARY> db.printReplicationInfo()
configured oplog size:   4605.766796875MB
log length start to end: 195secs (0.05hrs)
oplog first event time:  Wed Jun 27 2012 10:44:27 GMT+0200 (CEST)
oplog last event time:   Wed Jun 27 2012 10:47:42 GMT+0200 (CEST)
now:                     Wed Jun 27 2012 13:49:07 GMT+0200 (CEST)

I don't know if this is relevant, but here's the output of the shutdown command:

SECONDARY> db.adminCommand('shutdown')
Wed Jun 27 13:45:35 DBClientCursor::init call() failed
Wed Jun 27 13:45:35 query failed : admin.$cmd { shutdown: 1.0 } to: 127.0.0.1
Wed Jun 27 13:45:35 Error: error doing query: failed shell/collection.js:151
Wed Jun 27 13:45:35 trying reconnect to 127.0.0.1
Wed Jun 27 13:45:35 reconnect 127.0.0.1 failed couldn't connect to server 127.0.0.1

Adam Fields

Jun 27, 2012, 8:06:44 AM
to mongod...@googlegroups.com


On Jun 27, 2012, at 8:04 AM, p_h_i_l_l_i_p <fili...@gmail.com> wrote:

Unfortunately that didn't help. The secondary still can't catch up with the master.


Is the master under load? I've never been able to get a secondary to stay in sync when the master's writes are backed up.


p_h_i_l_l_i_p

Jun 27, 2012, 8:53:15 AM
to mongod...@googlegroups.com
No, the master's not under heavy load:

load average: 0.07, 0.04, 0.00

~ $vmstat 10 2
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0 348028  17656  43312 7889120    1    1   152   112    3    2  2  1 91  3  3
 1  0 348028  17688  43372 7889224    0    0     0    94 5527  379  0  0 96  2  2

PRIMARY> db.serverStatus()
{
    "host" : "mongo-db-1",
    "version" : "2.0.3",
    "process" : "mongod",
    "uptime" : 6047645,
    "uptimeEstimate" : 5017003,
    "localTime" : ISODate("2012-06-27T12:47:00.479Z"),
    "globalLock" : {
        "totalTime" : 6047644507690,
        "lockTime" : 222879637747,
        "ratio" : 0.03685395817554968,
        "currentQueue" : {
            "total" : 0,
            "readers" : 0,
            "writers" : 0
        },
        "activeClients" : {
            "total" : 0,
            "readers" : 0,
            "writers" : 0
        }
    },
    "mem" : {
        "bits" : 64,
        "resident" : 6832,
        "virtual" : 107541,
        "supported" : true,
        "mapped" : 53383,
        "mappedWithJournal" : 106766
    },
    "connections" : {
        "current" : 153,
        "available" : 666
    },
    "extra_info" : {
        "note" : "fields vary by platform",
        "heap_usage_bytes" : 3903328,
        "page_faults" : 8991390
    },
    "indexCounters" : {
        "btree" : {
            "accesses" : 38392165,
            "hits" : 38389982,
            "misses" : 2183,
            "resets" : 0,
            "missRatio" : 0.000056860559960606544
        }
    },
    "backgroundFlushing" : {
        "flushes" : 100404,
        "total_ms" : 125879832,
        "average_ms" : 1253.7332377196128,
        "last_ms" : 153,
        "last_finished" : ISODate("2012-06-27T12:46:31.535Z")
    },
    "cursors" : {
        "totalOpen" : 0,
        "clientCursors_size" : 0,
        "timedOut" : 143
    },
    "network" : {
        "bytesIn" : 14621019470,
        "bytesOut" : 164105909349,
        "numRequests" : 100053313
    },
    "repl" : {
        "setName" : "mongodb-prod",
        "ismaster" : true,
        "secondary" : false,
        "hosts" : [
            "mongo-db-1:27017",
            "mongo-db-2:27017"
        ],
        "arbiters" : [
            "mongo-db-3:27017"
        ],
        "primary" : "mongo-db-1:27017",
        "me" : "mongo-db-1:27017"
    },
    "opcounters" : {
        "insert" : 467804598,
        "query" : 30831,
        "update" : 19,
        "delete" : 104,
        "getmore" : 1861174,
        "command" : 25636871
    },
    "asserts" : {
        "regular" : 0,
        "warning" : 0,
        "msg" : 0,
        "user" : 28,
        "rollovers" : 0
    },
    "writeBacksQueued" : false,
    "dur" : {
        "commits" : 27,
        "journaledMB" : 0.16384,
        "writeToDataFilesMB" : 0.189508,
        "compression" : 0.9999938965216276,
        "commitsInWriteLock" : 0,
        "earlyCommits" : 0,
        "timeMs" : {
            "dt" : 3042,
            "prepLogBuffer" : 0,
            "writeToJournal" : 117,
            "writeToDataFiles" : 3,
            "remapPrivateView" : 0
        }
    },
    "ok" : 1

A. Jesse Jiryu Davis

Jun 27, 2012, 2:58:30 PM
to mongod...@googlegroups.com
Phillip, the errors in the shell after calling 'shutdown' are normal -- it's just the shell reporting the closed connection, messily.

Can you post the output of db.printReplicationInfo() on the PRIMARY?

I'm concerned that your configured oplog size is only about 4.5 GB. The primary may have wrapped around the oplog while the secondary was still doing its initial sync.

Can you grep for "RS102" in your secondary's log file?
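
Something like this (the log path is an assumption -- use wherever your secondary actually logs):

$ grep RS102 /var/log/mongodb/mongod.log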

It may be that you need to create a new secondary, either from scratch or from a backup of the primary, and resync.

myCloudWatcher

Jun 27, 2012, 10:23:34 PM
to mongod...@googlegroups.com
Sorry, I have lost all of the context for this error.

I stopped the servers, deleted all of the data, and did a re-sync.

I was not happy that the replicated data size looked too small relative to the server's total disk use.

After the new sync the server was very happy and had sensible disk usage. No issues since the new sync.

Sorry for the lost information about this error; next time I will remember the -vv trick....

I did do several stops and starts of mongod just to see if it would start.



p_h_i_l_l_i_p

Jun 28, 2012, 5:18:57 AM
to mongod...@googlegroups.com
PRIMARY> db.printReplicationInfo()
configured oplog size:   4612.6083984375MB
log length start to end: 109428secs (30.4hrs)
oplog first event time:  Wed Jun 27 2012 04:33:03 GMT+0200 (CEST)
oplog last event time:   Thu Jun 28 2012 10:56:51 GMT+0200 (CEST)
now:                     Thu Jun 28 2012 10:56:51 GMT+0200 (CEST)

The oplog size was my first idea, but since replication hangs after two hours or so, it shouldn't be the cause: the oplog still holds more than a day of data.

The RS102 problem hasn't occurred yet, since I started a full resync yesterday, but since the oplog only covers about 30 hours it will probably occur today.

As you're suggesting, since a full resync isn't working, I'll try taking a backup of the master's data and starting the secondary from that.

A. Jesse Jiryu Davis

Jun 28, 2012, 12:03:58 PM
to mongod...@googlegroups.com
So you've seen an RS102 error in the log? Can you post that error here?

How much data do you have in total?

Let us know how it goes when you create a new secondary from a backup of the primary.

p_h_i_l_l_i_p

Jun 29, 2012, 3:16:43 AM
to mongod...@googlegroups.com
Total size of the database on the master: 47.9306640625 GB
Size replicated to the slave: 23.9423828125 GB (sometimes it stops at 19 GB)

Log with RS102 (I hope it's enough; if not, I can post a larger piece of the log):

Wed Jun 27 10:37:41 [conn126] command: { replSetHeartbeat: "mongodb-prod", v: 3, pv: 1, checkEmpty: false, from: "mongo-db-1:27017" }
Wed Jun 27 10:37:41 [conn126] command admin.$cmd command: { replSetHeartbeat: "mongodb-prod", v: 3, pv: 1, checkEmpty: false, from: "mongo-db-1:27017" } ntoreturn:1 reslen:128 0ms
Wed Jun 27 10:37:41 [rsSync] replSet syncing to: mongo-db-1:27017
Wed Jun 27 10:37:41 BackgroundJob starting: ConnectBG
Wed Jun 27 10:37:41 [rsSync] replHandshake res not: 1 res: { ok: 1.0 }
Wed Jun 27 10:37:41 [rsSync] replSet remoteOldestOp: Jun 26 04:16:09 4fe91b69:6c9
Wed Jun 27 10:37:41 [rsSync] replSet error RS102 too stale to catch up, at least from mongo-db-1:27017
Wed Jun 27 10:37:41 [rsSync] replSet our last optime : Jun 22 11:42:02 4fe43dea:4
Wed Jun 27 10:37:41 [rsSync] replSet oldest at mongo-db-1:27017 : Jun 26 04:16:09 4fe91b69:6c9
Wed Jun 27 10:37:41 [rsSync] replSet See http://www.mongodb.org/display/DOCS/Resyncing+a+Very+Stale+Replica+Set+Member
Wed Jun 27 10:37:41 [rsSync] replSet error RS102 too stale to catch up
Wed Jun 27 10:37:41 [conn4] runQuery called admin.$cmd { ismaster: 1 }
Wed Jun 27 10:37:41 [conn4] run command admin.$cmd { ismaster: 1 }
Wed Jun 27 10:37:41 [conn4] command admin.$cmd command: { ismaster: 1 } ntoreturn:1 reslen:283 0ms
Wed Jun 27 10:37:42 [conn7] runQuery called admin.$cmd { ismaster: 1 }
Wed Jun 27 10:37:42 [conn7] run command admin.$cmd { ismaster: 1 }
Wed Jun 27 10:37:42 [conn7] command admin.$cmd command: { ismaster: 1 } ntoreturn:1 reslen:283 0ms
Wed Jun 27 10:37:42 [conn9] runQuery called admin.$cmd { ismaster: 1 }
Wed Jun 27 10:37:42 [conn9] run command admin.$cmd { ismaster: 1 }
Wed Jun 27 10:37:42 [conn9] command admin.$cmd command: { ismaster: 1 } ntoreturn:1 reslen:283 0ms
Wed Jun 27 10:37:42 [conn127] runQuery called admin.$cmd { replSetHeartbeat: "mongodb-prod", v: 3, pv: 1, checkEmpty: false, from: "mongo-db-3:27017" }
Wed Jun 27 10:37:42 [conn127] run command admin.$cmd { replSetHeartbeat: "mongodb-prod", v: 3, pv: 1, checkEmpty: false, from: "mongo-db-3:27017" }
Wed Jun 27 10:37:42 [conn127] command: { replSetHeartbeat: "mongodb-prod", v: 3, pv: 1, checkEmpty: false, from: "mongo-db-3:27017" }

A. Jesse Jiryu Davis

Jun 29, 2012, 10:58:45 AM
to mongod...@googlegroups.com
So, I'm confused about your situation. You say you started a full resync, and yet your log shows the secondary's last optime nearly five days prior:

Wed Jun 27 10:37:41 [rsSync] replSet our last optime : Jun 22 11:42:02 4fe43dea:4

Definitely shut down this secondary, replace its data files with a recent backup of the primary, and restart. Follow the "Copy data from another member" section at the bottom of that page. To minimize downtime on the primary, you probably want to fsync and lock it, copy the files, and then unlock:
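
Roughly like this (a sketch, not verbatim from the docs; it assumes the hostnames from this thread and that you copy the files with rsync or scp):

$ mongo mongo-db-1:27017/admin
PRIMARY> db.runCommand({ fsync: 1, lock: 1 })   // flush to disk and block writes (run against admin)
  ...copy the primary's dbpath to the secondary's dbpath while the lock is held...
PRIMARY> db.$cmd.sys.unlock.findOne()           // release the lock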

p_h_i_l_l_i_p

Jul 9, 2012, 7:02:38 AM
to mongod...@googlegroups.com
Sorry again for the delay; I was away for a week.

You asked me to show the RS102 problem in the logs; it hadn't occurred yet after the full resync, so what I posted was a piece of the logs from before the full resync.

I will try to replace the data files on the secondary with a backup of the primary's in the next few days.

Erez Zarum

Jul 9, 2012, 8:05:06 AM
to mongod...@googlegroups.com
This is a known bug in 2.0.3 that was fixed in 2.0.4; I suggest upgrading to 2.0.6.

p_h_i_l_l_i_p

Aug 13, 2012, 4:00:48 AM
to mongod...@googlegroups.com

Unfortunately I upgraded to 2.0.6 last week and the same thing happened again:

slave logs:

Fri Aug 10 13:44:01 [rsSync] replSet initialSyncOplogApplication applied 17000 operations, synced to Aug  9 14:31:05:2
Fri Aug 10 15:09:48 [rsSync] replSet initialSyncOplogApplication applied 18000 operations, synced to Aug  9 14:33:13:3
Fri Aug 10 16:33:17 [rsSync] replSet initialSyncOplogApplication applied 19000 operations, synced to Aug  9 14:35:18:6
Fri Aug 10 17:54:58 [rsSync] replSet initialSyncOplogApplication applied 20000 operations, synced to Aug  9 14:37:16:4
Fri Aug 10 18:19:35 [rsSync] replSet initial sync finishing up
Fri Aug 10 18:19:35 [rsSync] replSet set minValid=5023c081:1
Fri Aug 10 18:19:36 [rsSync] build index local.replset.minvalid { _id: 1 }
Fri Aug 10 18:19:36 [rsSync] build index done 0 records 0.011 secs
Fri Aug 10 18:19:36 [rsSync] replSet initial sync done
Fri Aug 10 18:19:37 [rsSync] replSet syncing to: mongo-db-1:27017
Fri Aug 10 18:27:07 [rsSync] replSet still syncing, not yet to minValid optime5023c081:1
Sat Aug 11 16:55:02 [rsSync] replSet still syncing, not yet to minValid optime5023c081:1
Sat Aug 11 16:55:02 [rsSync] repl: old cursor isDead, will initiate a new one
Sat Aug 11 16:55:03 [rsSync] replSet syncing to: mongo-db-1:27017
Sat Aug 11 16:55:03 [rsSync] replSet error RS102 too stale to catch up, at least from mongo-db-1:27017
Sat Aug 11 16:55:03 [rsSync] replSet our last optime : Aug  9 15:21:32 5023b95c:1
Sat Aug 11 16:55:03 [rsSync] replSet oldest at mongo-db-1:27017 : Aug 10 01:12:49 502443f1:20
Sat Aug 11 16:55:03 [rsSync] replSet See http://www.mongodb.org/display/DOCS/Resyncing+a+Very+Stale+Replica+Set+Member

Well, the last thing to try is to copy all the data from the master and then start the slave.

markh

Aug 13, 2012, 5:55:48 AM
to mongod...@googlegroups.com
Hi,

Yes, that is the recommended step to resolve the RS102 error.

Remember to either stop the mongod on the primary, or perform an fsync & lock on the primary, while performing the file copy.

Mark

Eagle Fan

Aug 14, 2012, 5:15:53 AM
to mongod...@googlegroups.com
Hi:

I have the same problem. My MongoDB version is 2.0.2. We have one primary and 4 secondaries, but all 4 secondaries got the RS102 error.

The initial sync stopped right after the index for the first, very small collection was built. I checked the data files under the /data directory: the timestamps don't change, and there is no more [rsSync] information in the mongod log.

Is there any way to build the slave without shutting down or locking the primary? We only have one primary now. The data size is about 170GB.

Thanks.




--
Eagle Fan (www.dbafan.com)

markh

Aug 14, 2012, 5:37:56 AM
to mongod...@googlegroups.com
Hi,

You could perform a full resync: stop the failed mongod, back up the data from this secondary, delete everything in its dbpath (including subdirectories), and restart it; it will then automatically resynchronize itself. Resyncing may take a long time if the database is huge or the network slow, so I would advise you to monitor the performance of the primary throughout this operation.
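
On the failed secondary, that's roughly (a sketch; it assumes the dbpath is /data as earlier in this thread, that mongod runs via an init script, and that you fix directory ownership for the user mongod runs as):

$ sudo service mongod stop
$ mv /data /data.bak            # keep a copy rather than deleting outright
$ mkdir /data                   # recreate an empty dbpath (adjust ownership if needed)
$ sudo service mongod start     # the member rejoins the set and begins its initial sync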

You can then sync the other secondaries off this newly updated secondary (obviously, don't sync them all at the same time), using firewall rules to prevent them from syncing off the primary. In 2.0 you cannot tell a member which node to replicate from, but in 2.2 you can.
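
For reference, in 2.2 that looks something like this, run on the secondary you want to redirect (the hostname here is assumed from this thread):

SECONDARY> db.adminCommand({ replSetSyncFrom: "mongo-db-2:27017" })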

This is the only option to avoid having to lock the primary to prevent writes whilst you perform a manual copy. If the full resync doesn't work, then you are looking at the stop/lock and manual copy solution.

Mark