Hello
[long post ahead]
I had a data corruption problem a while ago, managed to get it repaired (but lost a lot of not-essential-but-nice-to-have data because db.repairDatabase and replication encountered an error and came to the conclusion that the database cloning/repair is finished, skipping a bunch of collections). I remember pasting errors from the logs and being told to produce the "original error", whatever that means (the errors were all the same).
Now it has happened again: data corruption in a heavily used collection, and I have the first occurrence in the log.
Here it is:
Wed Feb 8 09:02:00 [conn176788] Assertion: 10320:BSONElement: bad type 113
0x584722 0x506e7d 0x662809 0x663bfd 0x86b7dd 0x86ba18 0x9717b9 0x8c63c6 0x8d5f00 0x8d6556 0x8d9d8e 0x8db7a3 0x8dca77 0x964959 0x97db97 0x97987d 0x97a9df 0x962de5 0x966171 0x883877
/usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x112) [0x584722]
/usr/bin/mongod(_ZNK5mongo11BSONElement4sizeEv+0x1bd) [0x506e7d]
/usr/bin/mongod(_ZN5mongo7Matcher13matchesDottedEPKcRKNS_11BSONElementERKNS_7BSONObjEiRKNS_14ElementMatcherEbPNS_12MatchDetailsE+0x21c9) [0x662809]
/usr/bin/mongod(_ZN5mongo7Matcher7matchesERKNS_7BSONObjEPNS_12MatchDetailsE+0xdd) [0x663bfd]
/usr/bin/mongod(_ZN5mongo19CoveredIndexMatcher7matchesERKNS_7BSONObjERKNS_7DiskLocEPNS_12MatchDetailsEb+0xcd) [0x86b7dd]
/usr/bin/mongod(_ZN5mongo19CoveredIndexMatcher14matchesCurrentEPNS_6CursorEPNS_12MatchDetailsE+0xa8) [0x86ba18]
/usr/bin/mongod(_ZN5mongo7CountOp4nextEv+0x8e9) [0x9717b9]
/usr/bin/mongod(_ZN5mongo12QueryPlanSet6Runner6nextOpERNS_7QueryOpE+0x56) [0x8c63c6]
/usr/bin/mongod(_ZN5mongo12QueryPlanSet6Runner4nextEv+0x110) [0x8d5f00]
/usr/bin/mongod(_ZN5mongo12QueryPlanSet6Runner22runUntilFirstCompletesEv+0x56) [0x8d6556]
/usr/bin/mongod(_ZN5mongo12QueryPlanSet5runOpERNS_7QueryOpE+0x11e) [0x8d9d8e]
/usr/bin/mongod(_ZN5mongo16MultiPlanScanner9runOpOnceERNS_7QueryOpE+0x523) [0x8db7a3]
/usr/bin/mongod(_ZN5mongo16MultiPlanScanner5runOpERNS_7QueryOpE+0x17) [0x8dca77]
/usr/bin/mongod(_ZN5mongo8runCountEPKcRKNS_7BSONObjERSs+0x3b9) [0x964959]
/usr/bin/mongod(_ZN5mongo8CmdCount3runERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0xb7) [0x97db97]
/usr/bin/mongod(_ZN5mongo11execCommandEPNS_7CommandERNS_6ClientEiPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0xb3d) [0x97987d]
/usr/bin/mongod(_ZN5mongo12_runCommandsEPKcRNS_7BSONObjERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x6ff) [0x97a9df]
/usr/bin/mongod(_ZN5mongo11runCommandsEPKcRNS_7BSONObjERNS_5CurOpERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x35) [0x962de5]
/usr/bin/mongod(_ZN5mongo8runQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0x11e1) [0x966171]
/usr/bin/mongod() [0x883877]
Wed Feb 8 09:02:00 [conn176788] Count with ns: mailing.mail_pendingmail and query: { was_send: false } failed with exception: exception: 10320 BSONElement: bad type 113
Wed Feb 8 09:02:00 [conn176788] command mailing.$cmd command: { count: "mail_pendingmail", query: { was_send: false }, fields: null } ntoreturn:1 reslen:48 117588ms
It repeated itself when trying to access that collection until 1h ago when mongo just locked up[1]; I am running on the secondary and the primary is undergoing a --repair right now, I _hope_ it won't do like db.repairDatabase and skip all collections when it finds an error.
Incidentally, I am seeing a lot of these during the --repair:
Thu Feb 9 12:01:28 [initandlisten] warning: ClientCursor::yield can't unlock b/c of recursive lock ns: mailing.mail_pendingmail top: { opid: 4699, active: true, waitingForLock: false, secs_running: 0, op: "getmore", ns: "mailing.mail_pendingmail", query: {}, client: "
0.0.0.0:0", desc: "initandlisten", threadId: "0x7f90eab3e720", numYields: 0 }
The repair process reached the error, said this:
Thu Feb 9 12:02:34 [initandlisten] Assertion: 10334:Invalid BSONObj size: 1931501856 (0x20612073) first element: [text removed]
--------------------------------------------------------------------------- : ?type=101
0x584722 0x508021 0x86b86f 0x86ba18 0x96395a 0x885cd4 0x88aaaa 0x88b9d1 0x5dfe7b 0x5e03ed 0x84cf59 0x84eb15 0x850409 0x8b937d 0xa8e7b8 0xa8fa6a 0xa91544 0xa91e7d 0xa9bbaa 0x7f90e9bf9c4d
mongod(_ZN5mongo11msgassertedEiPKc+0x112) [0x584722]
mongod(_ZNK5mongo7BSONObj14_assertInvalidEv+0x471) [0x508021]
mongod(_ZN5mongo19CoveredIndexMatcher7matchesERKNS_7BSONObjERKNS_7DiskLocEPNS_12MatchDetailsEb+0x15f) [0x86b86f]
mongod(_ZN5mongo19CoveredIndexMatcher14matchesCurrentEPNS_6CursorEPNS_12MatchDetailsE+0xa8) [0x86ba18]
mongod(_ZN5mongo14processGetMoreEPKcixRNS_5CurOpEiRb+0x3ca) [0x96395a]
mongod(_ZN5mongo15receivedGetMoreERNS_10DbResponseERNS_7MessageERNS_5CurOpE+0x1b4) [0x885cd4]
mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0xf6a) [0x88aaaa]
mongod(_ZN5mongo14DBDirectClient4callERNS_7MessageES2_bPSs+0x81) [0x88b9d1]
mongod(_ZN5mongo14DBClientCursor11requestMoreEv+0x35b) [0x5dfe7b]
mongod(_ZN5mongo14DBClientCursor4moreEv+0x4d) [0x5e03ed]
mongod(_ZN5mongo6Cloner4copyEPKcS2_bbbbbbNS_5QueryE+0x789) [0x84cf59]
mongod(_ZN5mongo6Cloner2goEPKcRSsRKSsbbbbbbPi+0x1665) [0x84eb15]
mongod(_ZN5mongo9cloneFromEPKcRSsRKSsbbbbbbPi+0x59) [0x850409]
mongod(_ZN5mongo14repairDatabaseESsRSsbb+0x4cd) [0x8b937d]
mongod(_ZN5mongo11doDBUpgradeERKSsSsPNS_14DataFileHeaderE+0x68) [0xa8e7b8]
mongod() [0xa8fa6a]
mongod(_ZN5mongo14_initAndListenEi+0x404) [0xa91544]
mongod(_ZN5mongo13initAndListenEi+0x1d) [0xa91e7d]
mongod(main+0x9aaa) [0xa9bbaa]
/lib/libc.so.6(__libc_start_main+0xfd) [0x7f90e9bf9c4d]
Thu Feb 9 12:02:34 [initandlisten] getmore mailing.mail_pendingmail cursorid:5944541588944789531 exception: Invalid BSONObj size: 1931501856 (0x20612073) first element: [text removed]
--------------------------------------------------------------------------- : ?type=101 code:10334 reslen:20 634ms
... and started building the indexes; now that _may_ have been the last object that was written, but I have no idea if it dropped the rest of the collection or not; unlike repairDatabase, it didn't skip the other collections, so I hope I'll end up with a consistent database when all this ends
[1] mongostat says things are fine, existing mongo connection works, the replicas see that the master is alive and well, but new connections hang; I had to stop the primary for the secondary to take over. I remember getting this a long time ago (1.8? 1.6?), with slight modifications (namely mongo and mongostat would not work AFAIR).