Data corruption, again

Sabin Iacob

unread,

Feb 9, 2012, 6:12:33 AM2/9/12

to mongod...@googlegroups.com

Hello

[long post ahead]

I had a data corruption problem a while ago, managed to get it repaired (but lost a lot of not-essential-but-nice-to-have data because db.repairDatabase and replication encountered an error and came to the conclusion that the database cloning/repair is finished, skipping a bunch of collections). I remember pasting errors from the logs and being told to produce the "original error", whatever that means (the errors were all the same).

Now it has happened again: data corruption in a heavily used collection, and I have the first occurrence in the log.

Here it is:

Wed Feb 8 09:02:00 [conn176788] Assertion: 10320:BSONElement: bad type 113

0x584722 0x506e7d 0x662809 0x663bfd 0x86b7dd 0x86ba18 0x9717b9 0x8c63c6 0x8d5f00 0x8d6556 0x8d9d8e 0x8db7a3 0x8dca77 0x964959 0x97db97 0x97987d 0x97a9df 0x962de5 0x966171 0x883877

/usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x112) [0x584722]

/usr/bin/mongod(_ZNK5mongo11BSONElement4sizeEv+0x1bd) [0x506e7d]

/usr/bin/mongod(_ZN5mongo7Matcher13matchesDottedEPKcRKNS_11BSONElementERKNS_7BSONObjEiRKNS_14ElementMatcherEbPNS_12MatchDetailsE+0x21c9) [0x662809]

/usr/bin/mongod(_ZN5mongo7Matcher7matchesERKNS_7BSONObjEPNS_12MatchDetailsE+0xdd) [0x663bfd]

/usr/bin/mongod(_ZN5mongo19CoveredIndexMatcher7matchesERKNS_7BSONObjERKNS_7DiskLocEPNS_12MatchDetailsEb+0xcd) [0x86b7dd]

/usr/bin/mongod(_ZN5mongo19CoveredIndexMatcher14matchesCurrentEPNS_6CursorEPNS_12MatchDetailsE+0xa8) [0x86ba18]

/usr/bin/mongod(_ZN5mongo7CountOp4nextEv+0x8e9) [0x9717b9]

/usr/bin/mongod(_ZN5mongo12QueryPlanSet6Runner6nextOpERNS_7QueryOpE+0x56) [0x8c63c6]

/usr/bin/mongod(_ZN5mongo12QueryPlanSet6Runner4nextEv+0x110) [0x8d5f00]

/usr/bin/mongod(_ZN5mongo12QueryPlanSet6Runner22runUntilFirstCompletesEv+0x56) [0x8d6556]

/usr/bin/mongod(_ZN5mongo12QueryPlanSet5runOpERNS_7QueryOpE+0x11e) [0x8d9d8e]

/usr/bin/mongod(_ZN5mongo16MultiPlanScanner9runOpOnceERNS_7QueryOpE+0x523) [0x8db7a3]

/usr/bin/mongod(_ZN5mongo16MultiPlanScanner5runOpERNS_7QueryOpE+0x17) [0x8dca77]

/usr/bin/mongod(_ZN5mongo8runCountEPKcRKNS_7BSONObjERSs+0x3b9) [0x964959]

/usr/bin/mongod(_ZN5mongo8CmdCount3runERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0xb7) [0x97db97]

/usr/bin/mongod(_ZN5mongo11execCommandEPNS_7CommandERNS_6ClientEiPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0xb3d) [0x97987d]

/usr/bin/mongod(_ZN5mongo12_runCommandsEPKcRNS_7BSONObjERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x6ff) [0x97a9df]

/usr/bin/mongod(_ZN5mongo11runCommandsEPKcRNS_7BSONObjERNS_5CurOpERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x35) [0x962de5]

/usr/bin/mongod(_ZN5mongo8runQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0x11e1) [0x966171]

/usr/bin/mongod() [0x883877]

Wed Feb 8 09:02:00 [conn176788] Count with ns: mailing.mail_pendingmail and query: { was_send: false } failed with exception: exception: 10320 BSONElement: bad type 113

Wed Feb 8 09:02:00 [conn176788] command mailing.$cmd command: { count: "mail_pendingmail", query: { was_send: false }, fields: null } ntoreturn:1 reslen:48 117588ms

It repeated itself when trying to access that collection until 1h ago when mongo just locked up[1]; I am running on the secondary and the primary is undergoing a --repair right now, I _hope_ it won't do like db.repairDatabase and skip all collections when it finds an error.

Incidentally, I am seeing a lot of these during the --repair:

Thu Feb 9 12:01:28 [initandlisten] warning: ClientCursor::yield can't unlock b/c of recursive lock ns: mailing.mail_pendingmail top: { opid: 4699, active: true, waitingForLock: false, secs_running: 0, op: "getmore", ns: "mailing.mail_pendingmail", query: {}, client: "0.0.0.0:0", desc: "initandlisten", threadId: "0x7f90eab3e720", numYields: 0 }

The repair process reached the error, said this:

Thu Feb 9 12:02:34 [initandlisten] Assertion: 10334:Invalid BSONObj size: 1931501856 (0x20612073) first element: [text removed]

--------------------------------------------------------------------------- : ?type=101

0x584722 0x508021 0x86b86f 0x86ba18 0x96395a 0x885cd4 0x88aaaa 0x88b9d1 0x5dfe7b 0x5e03ed 0x84cf59 0x84eb15 0x850409 0x8b937d 0xa8e7b8 0xa8fa6a 0xa91544 0xa91e7d 0xa9bbaa 0x7f90e9bf9c4d

mongod(_ZN5mongo11msgassertedEiPKc+0x112) [0x584722]

mongod(_ZNK5mongo7BSONObj14_assertInvalidEv+0x471) [0x508021]

mongod(_ZN5mongo19CoveredIndexMatcher7matchesERKNS_7BSONObjERKNS_7DiskLocEPNS_12MatchDetailsEb+0x15f) [0x86b86f]

mongod(_ZN5mongo19CoveredIndexMatcher14matchesCurrentEPNS_6CursorEPNS_12MatchDetailsE+0xa8) [0x86ba18]

mongod(_ZN5mongo14processGetMoreEPKcixRNS_5CurOpEiRb+0x3ca) [0x96395a]

mongod(_ZN5mongo15receivedGetMoreERNS_10DbResponseERNS_7MessageERNS_5CurOpE+0x1b4) [0x885cd4]

mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0xf6a) [0x88aaaa]

mongod(_ZN5mongo14DBDirectClient4callERNS_7MessageES2_bPSs+0x81) [0x88b9d1]

mongod(_ZN5mongo14DBClientCursor11requestMoreEv+0x35b) [0x5dfe7b]

mongod(_ZN5mongo14DBClientCursor4moreEv+0x4d) [0x5e03ed]

mongod(_ZN5mongo6Cloner4copyEPKcS2_bbbbbbNS_5QueryE+0x789) [0x84cf59]

mongod(_ZN5mongo6Cloner2goEPKcRSsRKSsbbbbbbPi+0x1665) [0x84eb15]

mongod(_ZN5mongo9cloneFromEPKcRSsRKSsbbbbbbPi+0x59) [0x850409]

mongod(_ZN5mongo14repairDatabaseESsRSsbb+0x4cd) [0x8b937d]

mongod(_ZN5mongo11doDBUpgradeERKSsSsPNS_14DataFileHeaderE+0x68) [0xa8e7b8]

mongod() [0xa8fa6a]

mongod(_ZN5mongo14_initAndListenEi+0x404) [0xa91544]

mongod(_ZN5mongo13initAndListenEi+0x1d) [0xa91e7d]

mongod(main+0x9aaa) [0xa9bbaa]

/lib/libc.so.6(__libc_start_main+0xfd) [0x7f90e9bf9c4d]

Thu Feb 9 12:02:34 [initandlisten] getmore mailing.mail_pendingmail cursorid:5944541588944789531 exception: Invalid BSONObj size: 1931501856 (0x20612073) first element: [text removed]

--------------------------------------------------------------------------- : ?type=101 code:10334 reslen:20 634ms

... and started building the indexes; now that _may_ have been the last object that was written, but I have no idea if it dropped the rest of the collection or not; unlike repairDatabase, it didn't skip the other collections, so I hope I'll end up with a consistent database when all this ends

[1] mongostat says things are fine, existing mongo connection works, the replicas see that the master is alive and well, but new connections hang; I had to stop the primary for the secondary to take over. I remember getting this a long time ago (1.8? 1.6?), with slight modifications (namely mongo and mongostat would not work AFAIR).

Scott Hernandez

unread,

Feb 9, 2012, 8:01:30 AM2/9/12

to mongod...@googlegroups.com

What version is this? Are you using journaling? Have you had any
unclean shutdowns? Do you know the cause of your corruption?

> --
> You received this message because you are subscribed to the Google Groups
> "mongodb-user" group.
> To post to this group, send email to mongod...@googlegroups.com.
> To unsubscribe from this group, send email to
> mongodb-user...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/mongodb-user?hl=en.

Sabin Iacob

unread,

Feb 9, 2012, 10:59:40 AM2/9/12

to mongod...@googlegroups.com

On 9 February 2012 15:01, Scott Hernandez <scotthe...@gmail.com> wrote:

What version is this?

2.0.2, official deb packages (mongodb-10gen), Ubuntu 10.04

Are you using journaling?

yes

Have you had any unclean shutdowns?

no

Do you know the cause of your corruption?

no; the hardware seems fine (I haven't detected any filesystem corruption, the ISP swears that everything else is fine), and coding errors on the client side shouldn't be able to cause data corruption on the server, right?

Eliot Horowitz

unread,

Feb 12, 2012, 1:27:54 AM2/12/12

to mongod...@googlegroups.com

Is this in a replica set?
If so, are the secondaries ok?
Can you start the server with --objcheck?

If not a replica set - I would highly recommend for both durability
and to help diagnose.

Sabin Iacob

unread,

Feb 12, 2012, 6:12:08 PM2/12/12

to mongod...@googlegroups.com

On 12 February 2012 08:27, Eliot Horowitz <el...@10gen.com> wrote:

Is this in a replica set?
If so, are the secondaries ok?
Can you start the server with --objcheck?

If not a replica set - I would highly recommend for both durability
and to help diagnose.

yes, it's a replica set, secondaries seem ok; I'll add objcheck to all just to be sure

it's broken again, BTW, in the same collection, so I it may be something in the way the guys write to it, or do their clean-ups :-/

What should I look for in the logs with objcheck enabled?

Eliot Horowitz

unread,

Feb 13, 2012, 12:10:56 AM2/13/12

to mongod...@googlegroups.com

objcheck will give you errors on insertion client side if you do safe
writes and ugly errors server side

have you tried the same queries on the secondary?

Sabin Iacob

unread,

Feb 13, 2012, 5:54:29 AM2/13/12

to mongod...@googlegroups.com

yes, generally after one such error all inserts on that collection seem to fail; it would be brilliant if I could repair it online instead of using mongod --repair, but online repairs skip the rest of the database when they reach that broken object and conclude that all is repaired

I asked the ISP to make the former secondary viable for production usage (didn't have enough RAM), and I'll let it be primary, to make sure it's not a hardware problem

It should break soon if it's in the software (took less than 2 days after the last repair).

Reply all

Reply to author

Forward