Broken master-slave replication

141 views
Skip to first unread message

Paweł Tęcza

unread,
Dec 27, 2010, 8:00:46 AM12/27/10
to mongod...@googlegroups.com
Hello!

I set up master-slave replication about 1 month ago and I was thinking
that it works good, but recently I've noticed that probably something is
broken.

I have "mdstore" db on master, but it has been missing on slave. So I
was trying to resync slave, but mongoDB shell tells me that everything
is alright:

> db.runCommand( {resync:1} )
{ "errmsg" : "not dead, no need to resync", "ok" : 0 }

Below you can see a result of db.printReplicationInfo() for my master
and db.printSlaveReplicationInfo() for my slave and also content of
their dbpath directory:

> db.printReplicationInfo()
configured oplog size: 1048.576MB
log length start to end: 47962secs (13.32hrs)
oplog first event time: Thu Nov 25 2010 23:19:51 GMT+0000 (UTC)
oplog last event time: Fri Nov 26 2010 12:39:13 GMT+0000 (UTC)
now: Mon Dec 27 2010 10:54:22 GMT+0000 (UTC)

-rw------- 1 mongodb nogroup 64M 2010-11-25 23:20 local.0
-rw------- 1 mongodb nogroup 128M 2010-11-25 23:19 local.1
-rw------- 1 mongodb nogroup 1.0G 2010-12-27 10:25 local.2
-rw------- 1 mongodb nogroup 16M 2010-11-26 12:39 local.ns
-rw------- 1 mongodb nogroup 64M 2010-12-20 13:02 mdstore.0
-rw------- 1 mongodb nogroup 128M 2010-11-26 16:17 mdstore.1
-rw------- 1 mongodb nogroup 16M 2010-12-20 13:02 mdstore.ns
-rwxr-xr-x 1 mongodb nogroup 5 2010-11-26 15:02 mongod.lock
-rw------- 1 mongodb nogroup 64M 2010-11-25 23:33 mydb.0
-rw------- 1 mongodb nogroup 128M 2010-11-30 14:58 mydb.1
-rw------- 1 mongodb nogroup 16M 2010-11-25 23:33 mydb.ns
drwxr-xr-x 2 mongodb nogroup 6 2010-12-20 13:02 _tmp

> db.printSlaveReplicationInfo()
source: masterhost
syncedTo: Fri Nov 26 2010 12:39:13 GMT+0000 (UTC)
= 2672021secs ago (742.23hrs)

-rw------- 1 mongodb nogroup 64M 2010-12-27 10:50 local.0
-rw------- 1 mongodb nogroup 128M 2010-11-26 23:24 local.1
-rw------- 1 mongodb nogroup 16M 2010-11-25 23:24 local.ns
-rwxr-xr-x 1 mongodb nogroup 0 2010-12-27 10:50 mongod.lock
-rw------- 1 mongodb nogroup 64M 2010-11-25 23:33 mydb.0
-rw------- 1 mongodb nogroup 128M 2010-12-20 10:26 mydb.1
-rw------- 1 mongodb nogroup 16M 2010-11-25 23:33 mydb.ns

I thought that slave can't resync, because its data is too stale, so I
stopped slave, removed all files from its dbpath directory and started
it again.

It seems that it helped a bit, because now I have "mdstore" db on slave.
Unfortunately db.printSlaveReplicationInfo() still tells me that my
slave is synced to Fri Nov 26 2010 12:39:13 GMT+0000 and
db.printReplicationInfo() still shows me info about master's oplog event
times in November.

I inserted some data into testing "mydb" db on master in order to check
if replication works, but I can't see them on slave:

> db.foo.save( { 'b' : '2' } )
magic == 0x41424344
>
> db.foo.find()
{ "_id" : ObjectId("4ceef235be088fcabdc27864"), "a" : 1 }
{ "_id" : ObjectId("4d187752bd265e1fe7d22d8b"), "b" : "2" }

> db.foo.find()
{ "_id" : ObjectId("4ceef235be088fcabdc27864"), "a" : 1 }

It's rather strange, because I can see a lot of following lines in the
slave's log file:

Mon Dec 27 12:52:27 [replslave] repl: from host:masterhost
Mon Dec 27 12:52:32 [replslave] repl: from host:masterhost
Mon Dec 27 12:52:36 [replslave] repl: from host:masterhost
Mon Dec 27 12:52:40 [replslave] repl: from host:masterhost
Mon Dec 27 12:52:44 [replslave] repl: from host:masterhost

What's wrong with my mongoDB? How can I fix it?

Here are the details about my system:

* kernel: Linux 2.6.32-5-xen-amd64
* distro: Ubuntu Lucid 10.04 LTS
* mongoDB: stable 20101123

My best regards,
Pawel

Eliot Horowitz

unread,
Dec 27, 2010, 10:32:41 AM12/27/10
to mongod...@googlegroups.com
resync is only for pairs.
What kind of replication are you using?
Can you send the entire slave log?

2010/12/27 Paweł Tęcza <P.T...@icm.edu.pl>:

> --
> You received this message because you are subscribed to the Google Groups "mongodb-user" group.
> To post to this group, send email to mongod...@googlegroups.com.
> To unsubscribe from this group, send email to mongodb-user...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.
>
>

Paweł Tęcza

unread,
Dec 27, 2010, 11:40:47 AM12/27/10
to mongod...@googlegroups.com
Dnia 2010-12-27, pon o godzinie 10:32 -0500, Eliot Horowitz pisze:
> resync is only for pairs.
> What kind of replication are you using?
> Can you send the entire slave log?

Thanks a lot for your reply, Eliot!

This is simple master-slave replication with one master and only one
slave. Below is my configuration:

dbpath=/srv/mongodb
logpath=/var/log/mongodb/mongodb.log
logappend=true
master = true
oplogSize = 1000

dbpath=/srv/mongodb
logpath=/var/log/mongodb/mongodb.log
logappend=true
slave = true
source = masterhost

Unfortunately the slave's log in not rotated and it has about 32MB, so
it's not good idea to send it to the group... But I can send it to you,
if you agree, of course :)

Cheers,
Pawel


Eliot Horowitz

unread,
Dec 27, 2010, 11:38:24 AM12/27/10
to mongod...@googlegroups.com
You should upload to jira.mongodb.org, pastebin, or some other service.
Please do not send directly to me.

2010/12/27 Paweł Tęcza <P.T...@icm.edu.pl>:

Paweł Tęcza

unread,
Dec 27, 2010, 12:02:28 PM12/27/10
to mongod...@googlegroups.com
Dnia 2010-12-27, pon o godzinie 11:38 -0500, Eliot Horowitz pisze:
> You should upload to jira.mongodb.org, pastebin, or some other service.
> Please do not send directly to me.

OK, but I'm affraid the log is too big for uploading it there. Of course
I can try with Rapidshare or Mediafire :)

I wrote you that it has ~32MB. Now I can add that it contains ~559k
lines at all and ~548k lines of "[replslave] repl: from
host:masterhost"...

Do you really need entire log?

P.


Eliot Horowitz

unread,
Dec 27, 2010, 12:00:32 PM12/27/10
to mongod...@googlegroups.com
Its not too big for jira, especially zipped.

The key is probably the other lines, but hard to tell without looking...


2010/12/27 Paweł Tęcza <P.T...@icm.edu.pl>:

Paweł Tęcza

unread,
Dec 27, 2010, 12:15:55 PM12/27/10
to mongod...@googlegroups.com
Dnia 2010-12-27, pon o godzinie 18:02 +0100, Paweł Tęcza pisze:

> Do you really need entire log?

I took a deeper look at my log file and I've just discovered several
following entries there:

Sun Dec 5 12:32:10 MongoDB starting : pid=526 port=27017
dbpath=/srv/mongodb slave=1 64-bit
Sun Dec 5 12:32:11 db version v1.6.4, pdfile version 4.5
Sun Dec 5 12:32:11 git version:
4f5c02f8d92ff213b71b88f5eb643b7f62b50abc
Sun Dec 5 12:32:11 sys info: Linux domU-12-31-39-06-79-A1
2.6.21.7-2.ec2.v1.2.fc8xen #1 SMP Fri Nov 20 17:48:28 EST 2009
x86_64 BOOST_LIB_VERSION=1_41
**************
old lock file: /srv/mongodb/mongod.lock. probably means unclean
shutdown
recommend removing file and running --repair
see: http://dochub.mongodb.org/core/repair for more information
*************
Sun Dec 5 12:32:11 exception in initAndListen std::exception: old lock
file, terminating
Sun Dec 5 12:32:11 dbexit:

Sun Dec 5 12:32:11 shutdown: going to close listening sockets...
Sun Dec 5 12:32:11 shutdown: going to flush oplog...
Sun Dec 5 12:32:11 shutdown: going to close sockets...
Sun Dec 5 12:32:11 shutdown: waiting for fs preallocator...
Sun Dec 5 12:32:11 shutdown: closing all files...
Sun Dec 5 12:32:11 closeAllFiles() finished

Sun Dec 5 12:32:11 dbexit: really exiting now

Do you think it can be related to replication issues?

P.

Eliot Horowitz

unread,
Dec 27, 2010, 12:19:47 PM12/27/10
to mongod...@googlegroups.com
Well it could certainly cause some issues.
What happens after that?
What happened after you wiped the data and started over?

2010/12/27 Paweł Tęcza <P.T...@icm.edu.pl>:

Paweł Tęcza

unread,
Dec 27, 2010, 12:35:31 PM12/27/10
to mongod...@googlegroups.com
Dnia 2010-12-27, pon o godzinie 12:00 -0500, Eliot Horowitz pisze:
> Its not too big for jira, especially zipped.
>
> The key is probably the other lines, but hard to tell without looking...

I've created a new issue on your Jira:

http://jira.mongodb.org/browse/SERVER-2290

The slave log is attached, of course.

P.


Paweł Tęcza

unread,
Dec 27, 2010, 12:41:27 PM12/27/10
to mongod...@googlegroups.com
Dnia 2010-12-27, pon o godzinie 12:19 -0500, Eliot Horowitz pisze:
> Well it could certainly cause some issues.
> What happens after that?
> What happened after you wiped the data and started over?

It seems that slave took the data from master when today I have stopped
the slave, removed its files and started it again. But the log entries
about repairing mongoDB are older...

P.


Eliot Horowitz

unread,
Dec 27, 2010, 12:39:02 PM12/27/10
to mongod...@googlegroups.com
It looks you're using replica pairs? Is that correct?
If so, you should use sets or regular master slave/
If not, can you send all the options you're staring the master and slave with

2010/12/27 Paweł Tęcza <P.T...@icm.edu.pl>:

Paweł Tęcza

unread,
Dec 27, 2010, 5:38:00 PM12/27/10
to mongod...@googlegroups.com
Dnia 2010-12-27, pon o godzinie 12:39 -0500, Eliot Horowitz pisze:
> It looks you're using replica pairs? Is that correct?
> If so, you should use sets or regular master slave/
> If not, can you send all the options you're staring the master and slave with

No, I've never used replica pairs. I configured regular master/slave
replication using details from your web site:

http://www.mongodb.org/display/DOCS/Master+Slave

This is my configuration:

dbpath=/srv/mongodb
logpath=/var/log/mongodb/mongodb.log
logappend=true
master = true
oplogSize = 1000

dbpath=/srv/mongodb
logpath=/var/log/mongodb/mongodb.log
logappend=true
slave = true
source = masterhost

I haven't been using another settings for my mongoDB, probably except of
dbpath. But I believe it doesn't matter, because I'm sure I stopped
mongoDB before, next moved old data directory to new place and finally
started mongoDB one more time with fixed dbpath.

BTW, do you recommend me to upgrade to replica sets? Now I have very
simple scenario: only one master and only one slave. Of course,
automatic failover and recovery are great features for me, but I'm not
sure whether I need more nodes, for example for an arbiter.

P.


Eliot Horowitz

unread,
Dec 27, 2010, 11:33:53 PM12/27/10
to mongod...@googlegroups.com
Can you restart slave with -vvvv

2010/12/27 Paweł Tęcza <P.T...@icm.edu.pl>:

Paweł Tęcza

unread,
Dec 28, 2010, 4:04:17 AM12/28/10
to mongod...@googlegroups.com
Dnia 2010-12-27, pon o godzinie 23:33 -0500, Eliot Horowitz pisze:
> Can you restart slave with -vvvv

Sure! I copied and pasted the output at the Jira:

http://jira.mongodb.org/browse/SERVER-2290

P.


Nat

unread,
Dec 28, 2010, 6:24:07 AM12/28/10
to mongodb-user
I think there is something wrong with your oplog. oplog collection
should be updated every 10 seconds or so i.e. 'oplog last event time'
should be updated every 10 seconds. Maybe, you can try to delete
"local.*" files from master and start over.

> db.printReplicationInfo()

configured oplog size: 1048.576MB
log length start to end: 47962secs (13.32hrs)
oplog first event time: Thu Nov 25 2010 23:19:51 GMT+0000 (UTC)
oplog last event time: Fri Nov 26 2010 12:39:13 GMT+0000 (UTC)
now: Mon Dec 27 2010 10:54:22 GMT+0000 (UTC)

Paweł Tęcza

unread,
Dec 28, 2010, 7:33:10 AM12/28/10
to mongod...@googlegroups.com
Dnia 2010-12-28, wto o godzinie 03:24 -0800, Nat pisze:
> I think there is something wrong with your oplog. oplog collection
> should be updated every 10 seconds or so i.e. 'oplog last event time'
> should be updated every 10 seconds. Maybe, you can try to delete
> "local.*" files from master and start over.

Hi Nat,

Thank you for your response! Probably you're right, because I could see
the following messages in the master's log file:

Tue Dec 28 11:43:35 [replmaster] caught exception in replMasterThread()
Tue Dec 28 11:43:45 [replmaster] Assertion failure magic == 0x41424344
db/pdfile.h 241
0x534a61 0x54163f 0x6e56db 0x6e59e1 0x6e9c2a 0x6dfbc5 0x6dfc0a 0x72247f
0x6bb591 0x6babcf 0x650932 0x837550 0x7f1b50e7b9ca 0x7f1b5042a70d
/usr/bin/mongod(_ZN5mongo12sayDbContextEPKc+0xb1) [0x534a61]
/usr/bin/mongod(_ZN5mongo8assertedEPKcS1_j+0x10f) [0x54163f]
/usr/bin/mongod(_ZNK5mongo16NamespaceDetails11inCapExtentERKNS_7DiskLocE
+0xfb) [0x6e56db]
/usr/bin/mongod(_ZN5mongo16NamespaceDetails10__capAllocEi+0x101)
[0x6e59e1]
/usr/bin/mongod(_ZN5mongo16NamespaceDetails11cappedAllocEPKci+0x21a)
[0x6e9c2a]
/usr/bin/mongod(_ZN5mongo16NamespaceDetails6_allocEPKci+0x25) [0x6dfbc5]
/usr/bin/mongod(_ZN5mongo16NamespaceDetails5allocEPKciRNS_7DiskLocE
+0x3a) [0x6dfc0a]
/usr/bin/mongod(_ZN5mongo11DataFileMgr17fast_oplog_insertEPNS_16NamespaceDetailsEPKci+0x17f) [0x72247f]
/usr/bin/mongod() [0x6bb591]
/usr/bin/mongod(_ZN5mongo12logKeepaliveEv+0x4f) [0x6babcf]
/usr/bin/mongod() [0x650932]
/usr/bin/mongod(thread_proxy+0x80) [0x837550]
/lib/libpthread.so.0(+0x69ca) [0x7f1b50e7b9ca]
/lib/libc.so.6(clone+0x6d) [0x7f1b5042a70d]
Tue Dec 28 11:43:45 [replmaster] caught exception in replMasterThread()

I stopped the master (`/etc/init.d/mongodb stop`), purged its local.*
files and started again (`/usr/bin/mongod --config /etc/mongodb.conf`).
I had some troubles with it, because here the init script kills mongoDB
daemon, but its lock file still exists... Probably I should create a new
thread for reporting that issue.

Now it seems that master-slave replication works fine for me :D

> db.printReplicationInfo()
configured oplog size: 1048.576MB

log length start to end: 1314secs (0.37hrs)
oplog first event time: Tue Dec 28 2010 11:49:19 GMT+0000 (UTC)
oplog last event time: Tue Dec 28 2010 12:11:13 GMT+0000 (UTC)
now: Tue Dec 28 2010 12:11:15 GMT+0000 (UTC)

> db.printSlaveReplicationInfo()
source: masterhost
syncedTo: Tue Dec 28 2010 12:11:03 GMT+0000 (UTC)
= 13secs ago (0hrs)

Thanks a lot for your help!

My best regards,
Pawel


Paweł Tęcza

unread,
Dec 28, 2010, 7:51:10 AM12/28/10
to mongod...@googlegroups.com
Dnia 2010-12-28, wto o godzinie 13:33 +0100, Paweł Tęcza pisze:

> I stopped the master (`/etc/init.d/mongodb stop`), purged its local.*
> files and started again (`/usr/bin/mongod --config /etc/mongodb.conf`).
> I had some troubles with it, because here the init script kills mongoDB
> daemon, but its lock file still exists... Probably I should create a new
> thread for reporting that issue.
>
> Now it seems that master-slave replication works fine for me :D

I forgot to add I had to resync my slave:

> use admin
switched to db admin
> db.runCommand({resync: 1})
{ "info" : "triggered resync for all sources", "ok" : 1 }

I think it can be useful tip for another MongoDB users :)

P.


Eliot Horowitz

unread,
Dec 28, 2010, 10:19:12 AM12/28/10
to mongod...@googlegroups.com
It looks like your master crashed.
I would remove local.* and run a repair, then start replication over again.

2010/12/28 Paweł Tęcza <P.T...@icm.edu.pl>:

Reply all
Reply to author
Forward
0 new messages