"Invalid BSONObj size" in Primary, Secondary and oplog.rs

Mihnea Giurgea

Jun 29, 2012, 7:23:06 AM
to mongod...@googlegroups.com
1. One of our shards (Primary node) is throwing the following error:

Fri Jun 29 14:09:24 uncaught exception: error {
"$err" : "Invalid BSONObj size: 1159736947 (0x732E2045) first element: pecially Miss Rhode Island: ?type=115",
"code" : 10334
}

whenever we try any write (insert, update or delete) that touches a particular _id.

2. Running find_one on that _id works. However, trying to find the records immediately before or after that _id fails with the same error (see the probe sketch at the end of this message).

3. The highlighted text ("pecially Miss Rhode Island") is very probably a part of some mongo document, but not from the document with the bad _id.

4. We get the same error when we try to look into the oplog (local.oplog.rs) collection:

use local
db.oplog.rs.findOne() 

5. We took the Primary down and noticed the same errors on our Secondary node.

6. We repaired the Primary node and, although the repair command finished executing without any errors, the same errors remained.

Running MongoDB 2.0.5 on Ubuntu 12.04, with data on Amazon EBS volumes.
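(For illustration, "left or right of that _id" means probes along these lines, where mycollection and badId stand in for the real collection name and _id:)

db.mycollection.findOne({ _id: badId })                                    // works
db.mycollection.find({ _id: { $gt: badId } }).sort({ _id: 1 }).limit(1)    // does not work
db.mycollection.find({ _id: { $lt: badId } }).sort({ _id: -1 }).limit(1)   // does not work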

Mihnea Giurgea

Jun 29, 2012, 12:14:35 PM
to mongod...@googlegroups.com
Ran validate on the oplog.rs collection and got this response (truncated; the full response is attached as validate.out):

PRIMARY> db.oplog.rs.validate({ full: true })
{
    "ns" : "local.oplog.rs",
    "capped" : 1,
    "max" : 2147483647,
    "firstExtent" : "2:2000 ns:local.oplog.rs",
    "lastExtent" : "32:2000 ns:local.oplog.rs",
    "extentCount" : 49,
    "extents" : [ ... ],
    "datasize" : 102604459720,
    "nrecords" : 140819284,
    "lastExtentSize" : 1829110784,
    "padding" : 1,
    "firstExtentDetails" : {
        "loc" : "2:2000",
        "xnext" : "3:2000",
        "xprev" : "null",
        "nsdiag" : "local.oplog.rs",
        "size" : 2146426864,
        "firstRecord" : "2:7fefd33c",
        "lastRecord" : "2:7fefc764"
    },
    "valid" : false,
    "errors" : [
        "exception during validate"
    ],
    "advice" : "ns corrupt, requires repair",
    "ok" : 1
}
validate.out

Mihnea Giurgea

Jun 29, 2012, 12:21:52 PM
to mongod...@googlegroups.com
Tried to repair the local database, but it did not work:

> use local
switched to db local
> db.repairDatabase()
{
    "errmsg" : "exception: nextSafe(): { $err: \"Invalid BSONObj size: 1159736947 (0x732E2045) first element: pecially Miss Rhode Island: ?type=115\", code: 10334 }",
    "code" : 13106,
    "ok" : 0
}

Kevin Matulef

Jun 29, 2012, 12:54:17 PM
to mongod...@googlegroups.com
If repairDatabase doesn't work on the primary or the secondary, you can try using mongodump with the --repair option to dump the affected database, then restore it using mongorestore (see here: http://www.mongodb.org/display/DOCS/Import+Export+Tools).  

I would try this on the secondary machine, by temporarily taking it out of the replica set, and restarting it in standalone mode (without the --replSet option) on a different port.  If it works, then you can cycle the primary/secondary and repeat the process on the other machine.
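Roughly, the steps on the secondary would look something like this (a sketch only; the port, dbpath, database name and dump directory are placeholders):

# restart the secondary standalone (no --replSet) on a different port
mongod --dbpath /data/db --port 37017

# dump the affected database with --repair, then restore it
mongodump --port 37017 --repair --db mydb --out /backup/repair-dump
mongorestore --port 37017 --db mydb --drop /backup/repair-dump/mydb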

For future prevention, it'd be good to figure out how this corruption happened.  Did these machines experience unclean shutdowns?  Are you running with journaling on? 

Mihnea Giurgea

Jun 29, 2012, 2:05:48 PM
to mongod...@googlegroups.com
We are running with journaling on, and the machines were never restarted.

We erased the local database from our Primary node (before reading your reply), recreated it, then everything worked.
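(For reference, wiping local from the mongo shell amounts to roughly this, run with the node standalone; note it also removes the replica set config stored in local.system.replset, so the node has to be re-initiated afterwards:)

use local
db.dropDatabase()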

Mihnea Giurgea

Jun 29, 2012, 5:31:20 PM
to mongod...@googlegroups.com
Post-mortem analysis:

The entire problem was caused by a corrupt local.oplog.rs collection. We don't know what caused the corruption, but we do know it was present on both the Primary and the Secondary node: executing a findOne() on that collection raised an "Invalid BSONObj size" error. The repair command on the oplog.rs collection ran to completion, but did not remove the error. We did not try mongodump --repair.

The fact that the same error was present in both nodes was completely unexpected, as it ruled out the possibility of hardware failure (nodes were running on distinct AWS instances & EBS volumes).

The mongod instances had not been shut down or restarted (or had any other issues) in the last 2 weeks, which is significantly longer than our oplog window.

We fixed the issue by dropping the oplog.rs collection and inserting an oplog entry into it (just as described here: http://www.kchodorow.com/blog/2011/02/22/resizing-your-oplog/), and then cloning the working primary's local.oplog.rs to the secondary.
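Roughly, the rebuild looks like this (a sketch only, not our exact commands; the node is running standalone, and the capped-collection size and hostname are placeholders):

use local
db.oplog.rs.drop()
// recreate the oplog as a capped collection (size in bytes, placeholder value)
db.runCommand({ create: "oplog.rs", capped: true, size: 20 * 1024 * 1024 * 1024 })
// seed it with one valid oplog entry, e.g. the latest entry from a healthy node
var healthy = connect("healthy-primary:27017/local")
var last = healthy.oplog.rs.find().sort({ $natural: -1 }).limit(1).next()
db.oplog.rs.save(last)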