Assertion related to TTLMonitor when running RS on Windows 64bit

264 views
Skip to first unread message

Assaf Lavie

unread,
Oct 17, 2012, 11:14:19 AM10/17/12
to mongod...@googlegroups.com
After upgrading Mongo to 2.3 on my local Windows dev machine, suddenly the replica set console windows report an assertion whenever any query is made against the DB:

[slaveTracking] update local.slaves query: { _id: ObjectId('4fba5ef4bc2c86368476beed'), host: "127.0.0.1", ns:
 "local.oplog.rs" } update: { $set: { syncedTo: Timestamp 1350486453000|1 } } keyUpdates:0 exception: Invalid BSONObj size: 0 (0x0
0000000) first element: EOO code:10334 locks(micros) w:44430 44ms

[TTLMonitor] assertion 0 assertion d:\slave\windows_64bit_v2.2\mongo\src\mongo\db\pdfile.h:360 ns:local.system
.indexes query:{ expireAfterSeconds: { $exists: true } }

TTLMonitor] problem detected during query over local.system.indexes : { $err: "assertion d:\slave\windows_64b
it_v2.2\mongo\src\mongo\db\pdfile.h:360" }

[TTLMonitor] ERROR: error processing ttl for db: local 10065 invalid parameter: expected an object ()

Any idea what could cause this? All 3 nodes in teh RS are on the same machine and aside from these errors the queries themselves seem to succeed.

Assaf

Tad Marshall

unread,
Oct 19, 2012, 3:53:09 PM10/19/12
to mongod...@googlegroups.com
Hi Assaf,

The error messages from TTLMonitor may be  https://jira.mongodb.org/browse/SERVER-6911 , which is fixed in version 2.2.1-rc1. Version 2.2.1-rc1 is close to release but not released as of this writing.  You could try a recent nightly build in the 2.2 branch ( http://downloads.mongodb.org/win32/mongodb-win32-x86_64-v2.2-latest.zip ) to see if it makes the TTLMonitor messages go away.

The [slaveTracking] message may be something different.  Was there a stack trace following this line?  This would help pin down where the error is coming from.

If there was no stack trace, could you try either restarting mongod.exe with an added "--traceExceptions" command line switch or issue a command in the shell to enable stack traces for all exceptions:

db.adminCommand( { setParameter: 1, traceExceptions: true} )

On Windows, you need to have the mongod.pdb file in the same directory as mongod.exe to get usable stack traces.

Let us know is this works for you, thanks!

Tad

Assaf Lavie

unread,
Oct 21, 2012, 4:07:00 AM10/21/12
to mongod...@googlegroups.com
Thanks Tad.

Running the recent nightly yield a different exception every second (regardless of actual queries):

Sun Oct 21 09:59:38 [rsSyncNotifier] kernel32.dll    BaseThreadInitThunk+0xd
Sun Oct 21 09:59:38 [rsSyncNotifier] replset tracking exception: exception: 0 assertion d:\slave\windows_64bit_v2.2\mongo\src\mongo\db\pdfile.h:360
Sun Oct 21 09:59:39 [rsSyncNotifier] replset setting oplog notifier to localhost:27017
Sun Oct 21 09:59:39 [rsSyncNotifier]  local.me Assertion failure isOk() d:\slave\windows_64bit_v2.2\mongo\src\mongo\db\pdfile.h 360

However, this only happens on one of the secondary instances.

If I run mongod with --traceExceptions I don't see an exception every second - I only see the following exceptions whenever a query is performed:

Sun Oct 21 10:02:56 [slaveTracking] update local.slaves query: { _id: ObjectId('4fba5e0d0c61cd1166308cac'), host: "127.0.0.1", ns: "local.oplog.rs" } update: { $set: { syncedTo: Timestamp 1350806576000|1 } } keyUpdates:0 exception: Invalid BSONObj size: 0 (0x00000000) first element: EOO code:10334 locks(micros) w:88764 88ms
Sun Oct 21 10:02:56 [slaveTracking] Assertion: 10334:Invalid BSONObj size: 0 (0x00000000) first element: EOO
Sun Oct 21 10:02:56 [slaveTracking] mongod.exe      ???
...
Sun Oct 21 10:02:56 [slaveTracking] mongod.exe      ???
Sun Oct 21 10:02:56 [slaveTracking] kernel32.dll    BaseThreadInitThunk+0xd
Sun Oct 21 10:02:56 [slaveTracking] warning: DBException thrown :: caused by :: 10334 Invalid BSONObj size: 0 (0x00000000) first element: EOO

Assaf

--
You received this message because you are subscribed to the Google
Groups "mongodb-user" group.
To post to this group, send email to mongod...@googlegroups.com
To unsubscribe from this group, send email to
mongodb-user...@googlegroups.com
See also the IRC channel -- freenode.net#mongodb

Tad Marshall

unread,
Oct 21, 2012, 6:33:06 AM10/21/12
to mongod...@googlegroups.com
Hi Assaf,

The assertion pointing to pdfile.h line 360 is indicating corruption.  This line verifies that an extent contains a valid signature, and the test is failing.

This was also happening in the earlier log snippet you posted, which I should have noticed, but I focused on the TTLMonitor aspect and missed the corruption part.  The second log post adds another indication of corruption; Assertion: 10334:Invalid BSONObj size.

The collections that are named are in your 'local' database; 'local.me' and 'local.slaves'.  It seems that both are corrupt.

The messages every second are triggered by a loop that tries to reconnect to the server that the secondary is trying to sync from.  It is failing each time, so you get repeating messages.  At log level 1 you would see an additional message; "replset could not connect to localhost:27017".

You are not getting a good stack trace (indicated by the "???" lines).  Is mongod.pdb in the same directory as mongod.exe?  It needs to be find-able by mongod.exe to get information about mongod.exe in the stack trace.

Has this server had hard shutdowns?  Can you check the Windows Event Log for NTFS errors?

Tad

Assaf Lavie

unread,
Oct 21, 2012, 7:09:36 AM10/21/12
to mongod...@googlegroups.com
Hi Tad,

I rebuilt the RS from a backup and the assertions are now gone, so it definitely looks like corruption.

The mongod instances are running on my development Windows machine and are usually killed off by closing their console windows. There haven't been many hard shutdowns (e.g. power loss) at all - can't remember when one happened (there's a UPS here).

In addition, the drive that contains the data is actually a mirrored RAID drive, so the chances of HW problems are very slim indeed.

Nothing in Window's event log related to NTFS.

I should mention, though, that almost every time I re-launch the mongod's of the RS I get the following:

 preallocating a journal file d:/data/mongosets/set3/journal/prealloc.2
52428800/1073741824     4%
...

Which usually takes about a minute. This happens even now, after completely rebuilding the RS.

Assaf

Assaf Lavie

unread,
Oct 21, 2012, 7:10:24 AM10/21/12
to mongod...@googlegroups.com
And, yes, mongod.pdb is there (part of the download package you linked to).

Assaf

Tad Marshall

unread,
Oct 21, 2012, 12:34:43 PM10/21/12
to mongod...@googlegroups.com
Hi Assaf,

Closing the console window by clicking the red X or using Close from the system menu will give you a clean shutdown, so that's fine.

NTFS problems are not necessarily the same as disk hardware problems, though of course hardware problems can cause NTFS problems.  We have seen one case of corruption that was accompanied by an NTFS error stating that NTFS was unable to flush buffers to disk, so it is worth looking at; thanks for checking.

mongod tests the speed of writing to a new file with the speed of overwriting the same file a second time and if the overwrite is faster it will preallocate journal files on the theory that this will make journaling faster.  If it is taking one minute to allocate three 1 GB files, it looks like this heuristic isn't working right on your system.  You can use the --nopreallocj option to disable preallocation of the journal files.

What is your RAID setup?  Is it using Intel's RAID chipset and Rapid Storage Technology software, Windows' software RAID, or some other solution?

What version of Windows are you running?

You may need to have your current working directory be the directory that mongod.exe is in for the pdb file to be found.  I just did some experimenting and that may be the missing ingredient.  I'll post a Jira ticket to get that fixed.

Tad

Assaf Lavie

unread,
Oct 22, 2012, 1:39:26 AM10/22/12
to mongod...@googlegroups.com
Thanks for all the explanations.

I'm running Windows 7 64 bit, and using Windows' software RAID to mirror two HDs.

Alas, I can't recreate the problem any more since I've rebuilt my RS, so running from the bin directory is something I could try only next time something similar happens.

Thanks,
Assaf

Tad Marshall

unread,
Oct 22, 2012, 8:19:32 AM10/22/12
to mongod...@googlegroups.com
Hi Assaf,

I filed  https://jira.mongodb.org/browse/SERVER-7435  to get the Windows stack traces to work right when starting mongod.exe in a directory other than the current working directory.

I'll do some testing with Windows' software RAID and see if I can duplicate the corruption you saw.  Was the corruption just on a secondary, and if so, had it been a primary at an earlier time?  Are you using capped collections or only normal collections?  How large are your disk drives?  They are in RAID 1 mode (mirroring), correct?  How large is your database, and does it consist of a single collection or some other number?  When writing to the database, are you mostly inserting or do you do a lot of updates?  If you do updates, are they mostly in-place updates that would not change the size of a record, or are they additions of new information that would force the records to move to a new location?  How many indexes do you have on your collections?  Do you index just simple fields, or do you have indexes on arrays?

I'm just probing for the parameters I should use in trying to reproduce your problem.  Thanks in advance!

Tad

Assaf Lavie

unread,
Oct 22, 2012, 8:54:55 AM10/22/12
to mongod...@googlegroups.com
Thanks, Tad.
I'll answer inline.



On Mon, Oct 22, 2012 at 2:19 PM, Tad Marshall <t...@10gen.com> wrote:
Hi Assaf,

I filed  https://jira.mongodb.org/browse/SERVER-7435  to get the Windows stack traces to work right when starting mongod.exe in a directory other than the current working directory.
Appreciated.
 

I'll do some testing with Windows' software RAID and see if I can duplicate the corruption you saw.  Was the corruption just on a secondary, and if so, had it been a primary at an earlier time?  
The errors I was getting were from secondary instances. However, I have no idea where exactly the corruption was. All I know is that I deleted all data directories (of all 3 nodes) and restored the data from a dump.
 
Are you using capped collections or only normal collections?  
Yes, some capped collections for logging as well.
 
How large are your disk drives?  They are in RAID 1 mode (mirroring), correct?  
Yes. 2x2TB in a mirror raid (1).
 
How large is your database, and does it consist of a single collection or some other number?  
The data dump is about 4GB in size and is comprised of about 10 collections.
 
When writing to the database, are you mostly inserting or do you do a lot of updates?  
Very few updates.
 
If you do updates, are they mostly in-place updates that would not change the size of a record, or are they additions of new information that would force the records to move to a new location?  
We do update some documents (seldom), and when we do we might very well change the size of the content in some field, but not by much. Only very tiny documents are updated (few 1..2KB...)
 
How many indexes do you have on your collections?  
Typically no more than a few per collection. Most of the times just one field indexed.
 
Do you index just simple fields, or do you have indexes on arrays?
No indexes on arrays.

I'm just probing for the parameters I should use in trying to reproduce your problem.  Thanks in advance!
No problem. I appreciate the thorough investigation.

Assaf
Reply all
Reply to author
Forward
0 new messages