Today, we had the following seemingly spontaneously occur:
Tue May 22 14:51:45 [conn27990] update default.instances query: { [...SNIP...] } idhack:1 upsert:1 124ms Tue May 22 14:51:47 access violation Tue May 22 14:52:15 [rsHealthPoll] replSet member iis-man-002:27021 is now in state PRIMARY Tue May 22 14:52:17 [rsSync] replSet syncing to: iis-man-002:27021
***** SERVER RESTARTED *****
Tue May 22 14:52:19 [initandlisten] MongoDB starting : pid=1960 port=27021 dbpath=C:\SitesData\testing-website\mongodata 64-bit host=iis-man-003 Tue May 22 14:52:19 [initandlisten] db version v2.0.4, pdfile version 4.5 Tue May 22 14:52:19 [initandlisten] git version: 329f3c47fe8136c03392c8f0e548506cb21f8ebf Tue May 22 14:52:19 [initandlisten] build info: windows sys.getwindowsversion(major=6, minor=1, build=7601, platform=2, service_pack='Service Pack 1') BOOST_LIB_VERSION=1_42 Tue May 22 14:52:19 [initandlisten] options: { dbpath: "C:\SitesData\testing-website\mongodata", logappend: true, logpath: "C:\SitesData\testing-website\mongo.log", port: 27021, quiet: true, replSet: "rsTest", rest: true, service: true } Tue May 22 14:52:19 [initandlisten] journal dir=C:/SitesData/testing-website/mongodata/journal Tue May 22 14:52:19 [initandlisten] recover begin Tue May 22 14:52:19 [initandlisten] recover lsn: 258699461 Tue May 22 14:52:19 [initandlisten] recover C:/SitesData/testing-website/mongodata/journal/j._4 Tue May 22 14:52:20 [initandlisten] recover skipping application of section seq:209344204 < lsn:258699461 Tue May 22 14:52:20 [initandlisten] recover skipping application of section seq:209382318 < lsn:258699461 Tue May 22 14:52:20 [initandlisten] recover skipping application of section seq:209420683 < lsn:258699461 Tue May 22 14:52:20 [initandlisten] recover skipping application of section seq:209458887 < lsn:258699461 Tue May 22 14:52:20 [initandlisten] recover skipping application of section seq:209497082 < lsn:258699461 Tue May 22 14:52:20 [initandlisten] recover skipping application of section seq:209535336 < lsn:258699461 Tue May 22 14:52:20 [initandlisten] recover skipping application of section seq:209573520 < lsn:258699461 Tue May 22 14:52:20 [initandlisten] recover skipping application of section seq:209611794 < lsn:258699461 Tue May 22 14:52:20 [initandlisten] recover skipping application of section seq:209649958 < lsn:258699461 Tue May 22 14:52:20 [initandlisten] recover skipping application of section more... Tue May 22 14:52:58 [initandlisten] recover C:/SitesData/testing-website/mongodata/journal/j._5 Tue May 22 14:53:17 [initandlisten] recover cleaning up Tue May 22 14:53:17 [initandlisten] removeJournalFiles Tue May 22 14:53:17 [initandlisten] recover done Tue May 22 14:53:18 [initandlisten] waiting for connections on port 27021 Tue May 22 14:53:18 [websvr] admin web console waiting for connections on port 28021 Tue May 22 14:53:19 [rsStart] trying to contact iis-man-002:27021 Tue May 22 14:53:19 [rsHealthPoll] replSet member iis-man-002:27021 is up Tue May 22 14:53:19 [rsHealthPoll] replSet member iis-man-002:27021 is now in state PRIMARY Tue May 22 14:53:19 [rsStart] replSet STARTUP2 Tue May 22 14:53:19 [rsHealthPoll] replSet member iis-man-001:27021 is up Tue May 22 14:53:19 [rsHealthPoll] replSet member iis-man-001:27021 is now in state ARBITER Tue May 22 14:53:19 [rsSync] replSet SECONDARY Tue May 22 14:53:23 [rsSync] replSet syncing to: iis-man-002:27021 Tue May 22 14:53:24 access violation
***** SERVER RESTARTED *****
Tue May 22 14:53:25 [initandlisten] MongoDB starting : pid=2356 port=27021 dbpath=C:\SitesData\testing-website\mongodata 64-bit host=iis-man-003 Tue May 22 14:53:25 [initandlisten] db version v2.0.4, pdfile version 4.5 Tue May 22 14:53:25 [initandlisten] git version: 329f3c47fe8136c03392c8f0e548506cb21f8ebf Tue May 22 14:53:25 [initandlisten] build info: windows sys.getwindowsversion(major=6, minor=1, build=7601, platform=2, service_pack='Service Pack 1') BOOST_LIB_VERSION=1_42 Tue May 22 14:53:25 [initandlisten] options: { dbpath: "C:\SitesData\testing-website\mongodata", logappend: true, logpath: "C:\SitesData\testing-website\mongo.log", port: 27021, quiet: true, replSet: "rsTest", rest: true, service: true } Tue May 22 14:53:25 [initandlisten] journal dir=C:/SitesData/testing-website/mongodata/journal Tue May 22 14:53:25 [initandlisten] recover begin Tue May 22 14:53:25 [initandlisten] recover lsn: 844 Tue May 22 14:53:25 [initandlisten] recover C:/SitesData/testing-website/mongodata/journal/j._0 Tue May 22 14:53:25 [initandlisten] recover cleaning up Tue May 22 14:53:25 [initandlisten] removeJournalFiles Tue May 22 14:53:25 [initandlisten] recover done Tue May 22 14:53:25 [initandlisten] waiting for connections on port 27021 Tue May 22 14:53:25 [websvr] admin web console waiting for connections on port 28021 Tue May 22 14:53:25 [rsStart] replSet STARTUP2 Tue May 22 14:53:25 [rsSync] replSet SECONDARY Tue May 22 14:53:25 [rsHealthPoll] replSet member iis-man-002:27021 is up Tue May 22 14:53:25 [rsHealthPoll] replSet member iis-man-002:27021 is now in state PRIMARY Tue May 22 14:53:25 [rsHealthPoll] replSet member iis-man-001:27021 is up Tue May 22 14:53:25 [rsHealthPoll] replSet member iis-man-001:27021 is now in state ARBITER Tue May 22 14:53:30 [rsSync] replSet syncing to: iis-man-002:27021 Tue May 22 14:53:30 access violation
***** SERVER RESTARTED *****
Tue May 22 14:53:30 [initandlisten] MongoDB starting : pid=2268 port=27021 dbpath=C:\SitesData\testing-website\mongodata 64-bit host=iis-man-003 Tue May 22 14:53:30 [initandlisten] db version v2.0.4, pdfile version 4.5 Tue May 22 14:53:30 [initandlisten] git version: 329f3c47fe8136c03392c8f0e548506cb21f8ebf Tue May 22 14:53:30 [initandlisten] build info: windows sys.getwindowsversion(major=6, minor=1, build=7601, platform=2, service_pack='Service Pack 1') BOOST_LIB_VERSION=1_42 Tue May 22 14:53:30 [initandlisten] options: { dbpath: "C:\SitesData\testing-website\mongodata", logappend: true, logpath: "C:\SitesData\testing-website\mongo.log", port: 27021, quiet: true, replSet: "rsTest", rest: true, service: true } Tue May 22 14:53:30 [initandlisten] journal dir=C:/SitesData/testing-website/mongodata/journal Tue May 22 14:53:30 [initandlisten] recover begin Tue May 22 14:53:30 [initandlisten] info no lsn file in journal/ directory Tue May 22 14:53:30 [initandlisten] recover lsn: 0 Tue May 22 14:53:30 [initandlisten] recover C:/SitesData/testing-website/mongodata/journal/j._0 Tue May 22 14:53:30 [initandlisten] recover cleaning up Tue May 22 14:53:30 [initandlisten] removeJournalFiles Tue May 22 14:53:30 [initandlisten] recover done Tue May 22 14:53:30 [initandlisten] waiting for connections on port 27021 Tue May 22 14:53:30 [websvr] admin web console waiting for connections on port 28021 Tue May 22 14:53:30 [rsStart] replSet STARTUP2 Tue May 22 14:53:30 [rsSync] replSet SECONDARY Tue May 22 14:53:30 [rsHealthPoll] replSet member iis-man-002:27021 is up Tue May 22 14:53:30 [rsHealthPoll] replSet member iis-man-002:27021 is now in state PRIMARY Tue May 22 14:53:30 [rsHealthPoll] replSet member iis-man-001:27021 is up Tue May 22 14:53:30 [rsHealthPoll] replSet member iis-man-001:27021 is now in state ARBITER Tue May 22 14:53:35 [rsSync] replSet syncing to: iis-man-002:27021 Tue May 22 14:53:35 access violation
***** SERVER RESTARTED *****
The last pattern above repeated many times until someone noticed the error in our website logging system. After seeing that, I manually stopped the service, which gave the following log messages:
***** SERVER RESTARTED *****
Tue May 22 15:20:03 [initandlisten] MongoDB starting : pid=3132 port=27021 dbpath=C:\SitesData\testing-website\mongodata 64-bit host=iis-man-003 Tue May 22 15:20:03 [initandlisten] db version v2.0.4, pdfile version 4.5 Tue May 22 15:20:03 [initandlisten] git version: 329f3c47fe8136c03392c8f0e548506cb21f8ebf Tue May 22 15:20:03 [initandlisten] build info: windows sys.getwindowsversion(major=6, minor=1, build=7601, platform=2, service_pack='Service Pack 1') BOOST_LIB_VERSION=1_42 Tue May 22 15:20:03 [initandlisten] options: { dbpath: "C:\SitesData\testing-website\mongodata", logappend: true, logpath: "C:\SitesData\testing-website\mongo.log", port: 27021, quiet: true, replSet: "rsTest", rest: true, service: true } Tue May 22 15:20:03 [initandlisten] journal dir=C:/SitesData/testing-website/mongodata/journal Tue May 22 15:20:03 [initandlisten] recover begin Tue May 22 15:20:03 [initandlisten] info no lsn file in journal/ directory Tue May 22 15:20:03 [initandlisten] recover lsn: 0 Tue May 22 15:20:03 [initandlisten] recover C:/SitesData/testing-website/mongodata/journal/j._0 Tue May 22 15:20:03 [initandlisten] recover cleaning up Tue May 22 15:20:03 [initandlisten] removeJournalFiles Tue May 22 15:20:03 [initandlisten] recover done Tue May 22 15:20:03 [initandlisten] waiting for connections on port 27021 Tue May 22 15:20:03 [websvr] admin web console waiting for connections on port 28021 Tue May 22 15:20:03 [rsStart] replSet STARTUP2 Tue May 22 15:20:03 [rsSync] replSet SECONDARY Tue May 22 15:20:03 [rsHealthPoll] replSet member iis-man-002:27021 is up Tue May 22 15:20:03 [rsHealthPoll] replSet member iis-man-002:27021 is now in state PRIMARY Tue May 22 15:20:03 [rsHealthPoll] replSet member iis-man-001:27021 is up Tue May 22 15:20:03 [rsHealthPoll] replSet member iis-man-001:27021 is now in state ARBITER Tue May 22 15:20:04 shutdown: going to close listening sockets... Tue May 22 15:20:04 closing listening socket: 320 Tue May 22 15:20:04 closing listening socket: 368 Tue May 22 15:20:04 shutdown: going to flush diaglog... Tue May 22 15:20:04 shutdown: going to close sockets... Tue May 22 15:20:04 shutdown: waiting for fs preallocator... Tue May 22 15:20:04 shutdown: lock for final commit... Tue May 22 15:20:04 shutdown: final commit... Tue May 22 15:20:04 shutdown: closing all files... Tue May 22
...
There are a couple of things going on here. The main one is that it seems that something being replicated is causing an access violation on your secondary. This is an error in memory addressing, the same thing as a segfault (signal 11) on Unix. The second thing is that the Windows service settings that were created by mongod.exe when you used --install to install the service instruct the Windows Service Control Manager to restart mongod.exe if it ever fails. Because you have journaling enabled, mongod is able to recover from each restart, but when it recovers it starts syncing from the primary again, reads the same oplog entry that killed it the previous time and so gets another access violation and the process repeats.
Version 2.0.4 gives very little information about the access violation, just the faulting address, which makes it hard to debug. In addition, Address Space Layout Randomization (ASLR) makes the address hard to use in finding out what source code line triggered the crash. Version 2.2 will behave a little better on access violations in Windows: it will show the thread name that crashed (which is almost certainly rsSync for your crash), the type of memory access and the address that was used. In addition, it will generate a minidump that can help diagnose the problem and then shut down "cleanly" so no recovery will be required on restart.
There are a couple of things that could be tried to get you out of this hole. One possibility is to temporarily try the most recent nightly build and see if the crashing problem has been fixed. If it hasn't been fixed, you will get a minidump that we could use to analyze the crash. If this fixes the problem, you could let it run long enough to fully resync from the primary and then switch back to your 2.0.4 version.
There are other options that could be tried, such as wiping the data and restoring from a dump from the primary, but it might be worth trying the nightly build first. If it worked, it would be a lot faster. The current nightly build is getting close to the version that will ship as 2.2.0-rc0 (the first release candidate for version 2.2) but it isn't there yet and we don't recommend running "unstable" nightly builds in production. As a short term test to get you past this outage, it might be worth trying.
On Tuesday, May 22, 2012 5:18:19 PM UTC-4, Mo Cassidy wrote:
> Today, we had the following seemingly spontaneously occur:
> Tue May 22 14:51:45 [conn27990] update default.instances query: { > [...SNIP...] } idhack:1 upsert:1 124ms > Tue May 22 14:51:47 access violation > Tue May 22 14:52:15 [rsHealthPoll] replSet member iis-man-002:27021 is now > in state PRIMARY > Tue May 22 14:52:17 [rsSync] replSet syncing to: iis-man-002:27021
> ***** SERVER RESTARTED *****
> Tue May 22 14:52:19 [initandlisten] MongoDB starting : pid=1960 port=27021 > dbpath=C:\SitesData\testing-website\mongodata 64-bit host=iis-man-003 > Tue May 22 14:52:19 [initandlisten] db version v2.0.4, pdfile version 4.5 > Tue May 22 14:52:19 [initandlisten] git version: > 329f3c47fe8136c03392c8f0e548506cb21f8ebf > Tue May 22 14:52:19 [initandlisten] build info: windows > sys.getwindowsversion(major=6, minor=1, build=7601, platform=2, > service_pack='Service Pack 1') BOOST_LIB_VERSION=1_42 > Tue May 22 14:52:19 [initandlisten] options: { dbpath: > "C:\SitesData\testing-website\mongodata", logappend: true, logpath: > "C:\SitesData\testing-website\mongo.log", port: 27021, quiet: true, > replSet: "rsTest", rest: true, service: true } > Tue May 22 14:52:19 [initandlisten] journal > dir=C:/SitesData/testing-website/mongodata/journal > Tue May 22 14:52:19 [initandlisten] recover begin > Tue May 22 14:52:19 [initandlisten] recover lsn: 258699461 > Tue May 22 14:52:19 [initandlisten] recover > C:/SitesData/testing-website/mongodata/journal/j._4 > Tue May 22 14:52:20 [initandlisten] recover skipping application of > section seq:209344204 < lsn:258699461 > Tue May 22 14:52:20 [initandlisten] recover skipping application of > section seq:209382318 < lsn:258699461 > Tue May 22 14:52:20 [initandlisten] recover skipping application of > section seq:209420683 < lsn:258699461 > Tue May 22 14:52:20 [initandlisten] recover skipping application of > section seq:209458887 < lsn:258699461 > Tue May 22 14:52:20 [initandlisten] recover skipping application of > section seq:209497082 < lsn:258699461 > Tue May 22 14:52:20 [initandlisten] recover skipping application of > section seq:209535336 < lsn:258699461 > Tue May 22 14:52:20 [initandlisten] recover skipping application of > section seq:209573520 < lsn:258699461 > Tue May 22 14:52:20 [initandlisten] recover skipping application of > section seq:209611794 < lsn:258699461 > Tue May 22 14:52:20 [initandlisten] recover skipping application of > section seq:209649958 < lsn:258699461 > Tue May 22 14:52:20 [initandlisten] recover skipping application of > section more... > Tue May 22 14:52:58 [initandlisten] recover > C:/SitesData/testing-website/mongodata/journal/j._5 > Tue May 22 14:53:17 [initandlisten] recover cleaning up > Tue May 22 14:53:17 [initandlisten] removeJournalFiles > Tue May 22 14:53:17 [initandlisten] recover done > Tue May 22 14:53:18 [initandlisten] waiting for connections on port 27021 > Tue May 22 14:53:18 [websvr] admin web console waiting for connections on > port 28021 > Tue May 22 14:53:19 [rsStart] trying to contact iis-man-002:27021 > Tue May 22 14:53:19 [rsHealthPoll] replSet member iis-man-002:27021 is up > Tue May 22 14:53:19 [rsHealthPoll] replSet member iis-man-002:27021 is now > in state PRIMARY > Tue May 22 14:53:19 [rsStart] replSet STARTUP2 > Tue May 22 14:53:19 [rsHealthPoll] replSet member iis-man-001:27021 is up > Tue May 22 14:53:19 [rsHealthPoll] replSet member iis-man-001:27021 is now > in state ARBITER > Tue May 22 14:53:19 [rsSync] replSet SECONDARY > Tue May 22 14:53:23 [rsSync] replSet syncing to: iis-man-002:27021 > Tue May 22 14:53:24 access violation
> ***** SERVER RESTARTED *****
> Tue May 22 14:53:25 [initandlisten] MongoDB starting : pid=2356 port=27021 > dbpath=C:\SitesData\testing-website\mongodata 64-bit host=iis-man-003 > Tue May 22 14:53:25 [initandlisten] db version v2.0.4, pdfile version 4.5 > Tue May 22 14:53:25 [initandlisten] git version: > 329f3c47fe8136c03392c8f0e548506cb21f8ebf > Tue May 22 14:53:25 [initandlisten] build info: windows > sys.getwindowsversion(major=6, minor=1, build=7601, platform=2, > service_pack='Service Pack 1') BOOST_LIB_VERSION=1_42 > Tue May 22 14:53:25 [initandlisten] options: { dbpath: > "C:\SitesData\testing-website\mongodata", logappend: true, logpath: > "C:\SitesData\testing-website\mongo.log", port: 27021, quiet: true, > replSet: "rsTest", rest: true, service: true } > Tue May 22 14:53:25 [initandlisten] journal > dir=C:/SitesData/testing-website/mongodata/journal > Tue May 22 14:53:25 [initandlisten] recover begin > Tue May 22 14:53:25 [initandlisten] recover lsn: 844 > Tue May 22 14:53:25 [initandlisten] recover > C:/SitesData/testing-website/mongodata/journal/j._0 > Tue May 22 14:53:25 [initandlisten] recover cleaning up > Tue May 22 14:53:25 [initandlisten] removeJournalFiles > Tue May 22 14:53:25 [initandlisten] recover done > Tue May 22 14:53:25 [initandlisten] waiting for connections on port 27021 > Tue May 22 14:53:25 [websvr] admin web console waiting for connections on > port 28021 > Tue May 22 14:53:25 [rsStart] replSet STARTUP2 > Tue May 22 14:53:25 [rsSync] replSet SECONDARY > Tue May 22 14:53:25 [rsHealthPoll] replSet member iis-man-002:27021 is up > Tue May 22 14:53:25 [rsHealthPoll] replSet member iis-man-002:27021 is now > in state PRIMARY > Tue May 22 14:53:25 [rsHealthPoll] replSet member iis-man-001:27021 is up > Tue May 22 14:53:25 [rsHealthPoll] replSet member iis-man-001:27021 is now > in state ARBITER > Tue May 22 14:53:30 [rsSync] replSet syncing to: iis-man-002:27021 > Tue May 22 14:53:30 access violation
> ***** SERVER RESTARTED *****
> Tue May 22 14:53:30 [initandlisten] MongoDB starting : pid=2268 port=27021 > dbpath=C:\SitesData\testing-website\mongodata 64-bit host=iis-man-003 > Tue May 22 14:53:30 [initandlisten] db version v2.0.4, pdfile version 4.5 > Tue May 22 14:53:30 [initandlisten] git version: > 329f3c47fe8136c03392c8f0e548506cb21f8ebf > Tue May 22 14:53:30 [initandlisten] build info: windows > sys.getwindowsversion(major=6, minor=1, build=7601, platform=2, > service_pack='Service Pack 1') BOOST_LIB_VERSION=1_42 > Tue May 22 14:53:30 [initandlisten] options: { dbpath: > "C:\SitesData\testing-website\mongodata", logappend: true, logpath: > "C:\SitesData\testing-website\mongo.log", port: 27021, quiet: true, > replSet: "rsTest", rest: true, service: true } > Tue May 22 14:53:30 [initandlisten] journal > dir=C:/SitesData/testing-website/mongodata/journal > Tue May 22 14:53:30 [initandlisten] recover begin > Tue May 22 14:53:30 [initandlisten] info no lsn file in journal/ directory > Tue May 22 14:53:30 [initandlisten] recover lsn: 0 > Tue May 22 14:53:30 [initandlisten] recover > C:/SitesData/testing-website/mongodata/journal/j._0 > Tue May 22 14:53:30 [initandlisten] recover cleaning up > Tue May 22 14:53:30 [initandlisten] removeJournalFiles > Tue May 22 14:53:30 [initandlisten] recover done > Tue May 22 14:53:30 [initandlisten] waiting for connections on port 27021 > Tue May 22 14:53:30 [websvr] admin web console waiting for connections on > port 28021 > Tue May 22 14:53:30 [rsStart] replSet STARTUP2 > Tue May 22 14:53:30 [rsSync] replSet SECONDARY > Tue May 22 14:53:30 [rsHealthPoll] replSet member iis-man-002:27021 is up > Tue May 22 14:53:30 [rsHealthPoll] replSet member iis-man-002:27021 is now > in state PRIMARY > Tue May 22 14:53:30 [rsHealthPoll] replSet member iis-man-001:27021 is up > Tue May 22 14:53:30 [rsHealthPoll] replSet member iis-man-001:27021 is now > in state ARBITER > Tue May 22 14:53:35
Wow, sorry Tad, I had totally forgotten about this until today (guess what happened again today). To get out of the hole, I just went ahead and did a full resync. It didn't affect performance a whole lot and was done relatively quickly (I think maybe 30 minutes). Thanks for all the info, it's really interesting. I'll just do the full resync again today, and I look forward to 2.2 so I'll have more info if this happens again! :)
On Wednesday, 23 May 2012 05:18:26 UTC-5, Tad Marshall wrote:
> There are a couple of things going on here. The main one is that it seems > that something being replicated is causing an access violation on your > secondary. This is an error in memory addressing, the same thing as a > segfault (signal 11) on Unix. The second thing is that the Windows service > settings that were created by mongod.exe when you used --install to install > the service instruct the Windows Service Control Manager to restart > mongod.exe if it ever fails. Because you have journaling enabled, mongod > is able to recover from each restart, but when it recovers it starts > syncing from the primary again, reads the same oplog entry that killed it > the previous time and so gets another access violation and the process > repeats.
> Version 2.0.4 gives very little information about the access violation, > just the faulting address, which makes it hard to debug. In addition, > Address Space Layout Randomization (ASLR) makes the address hard to use in > finding out what source code line triggered the crash. Version 2.2 will > behave a little better on access violations in Windows: it will show the > thread name that crashed (which is almost certainly rsSync for your crash), > the type of memory access and the address that was used. In addition, it > will generate a minidump that can help diagnose the problem and then shut > down "cleanly" so no recovery will be required on restart.
> There are a couple of things that could be tried to get you out of this > hole. One possibility is to temporarily try the most recent nightly build > and see if the crashing problem has been fixed. If it hasn't been fixed, > you will get a minidump that we could use to analyze the crash. If this > fixes the problem, you could let it run long enough to fully resync from > the primary and then switch back to your 2.0.4 version.
> There are other options that could be tried, such as wiping the data and > restoring from a dump from the primary, but it might be worth trying the > nightly build first. If it worked, it would be a lot faster. The current > nightly build is getting close to the version that will ship as 2.2.0-rc0 > (the first release candidate for version 2.2) but it isn't there yet and we > don't recommend running "unstable" nightly builds in production. As a > short term test to get you past this outage, it might be worth trying.
> On Tuesday, May 22, 2012 5:18:19 PM UTC-4, Mo Cassidy wrote:
>> Today, we had the following seemingly spontaneously occur:
>> Tue May 22 14:51:45 [conn27990] update default.instances query: { >> [...SNIP...] } idhack:1 upsert:1 124ms >> Tue May 22 14:51:47 access violation >> Tue May 22 14:52:15 [rsHealthPoll] replSet member iis-man-002:27021 is >> now in state PRIMARY >> Tue May 22 14:52:17 [rsSync] replSet syncing to: iis-man-002:27021
>> ***** SERVER RESTARTED *****
>> Tue May 22 14:52:19 [initandlisten] MongoDB starting : pid=1960 >> port=27021 dbpath=C:\SitesData\testing-website\mongodata 64-bit >> host=iis-man-003 >> Tue May 22 14:52:19 [initandlisten] db version v2.0.4, pdfile version 4.5 >> Tue May 22 14:52:19 [initandlisten] git version: >> 329f3c47fe8136c03392c8f0e548506cb21f8ebf >> Tue May 22 14:52:19 [initandlisten] build info: windows >> sys.getwindowsversion(major=6, minor=1, build=7601, platform=2, >> service_pack='Service Pack 1') BOOST_LIB_VERSION=1_42 >> Tue May 22 14:52:19 [initandlisten] options: { dbpath: >> "C:\SitesData\testing-website\mongodata", logappend: true, logpath: >> "C:\SitesData\testing-website\mongo.log", port: 27021, quiet: true, >> replSet: "rsTest", rest: true, service: true } >> Tue May 22 14:52:19 [initandlisten] journal >> dir=C:/SitesData/testing-website/mongodata/journal >> Tue May 22 14:52:19 [initandlisten] recover begin >> Tue May 22 14:52:19 [initandlisten] recover lsn: 258699461 >> Tue May 22 14:52:19 [initandlisten] recover >> C:/SitesData/testing-website/mongodata/journal/j._4 >> Tue May 22 14:52:20 [initandlisten] recover skipping application of >> section seq:209344204 < lsn:258699461 >> Tue May 22 14:52:20 [initandlisten] recover skipping application of >> section seq:209382318 < lsn:258699461 >> Tue May 22 14:52:20 [initandlisten] recover skipping application of >> section seq:209420683 < lsn:258699461 >> Tue May 22 14:52:20 [initandlisten] recover skipping application of >> section seq:209458887 < lsn:258699461 >> Tue May 22 14:52:20 [initandlisten] recover skipping application of >> section seq:209497082 < lsn:258699461 >> Tue May 22 14:52:20 [initandlisten] recover skipping application of >> section seq:209535336 < lsn:258699461 >> Tue May 22 14:52:20 [initandlisten] recover skipping application of >> section seq:209573520 < lsn:258699461 >> Tue May 22 14:52:20 [initandlisten] recover skipping application of >> section seq:209611794 < lsn:258699461 >> Tue May 22 14:52:20 [initandlisten] recover skipping application of >> section seq:209649958 < lsn:258699461 >> Tue May 22 14:52:20 [initandlisten] recover skipping application of >> section more... >> Tue May 22 14:52:58 [initandlisten] recover >> C:/SitesData/testing-website/mongodata/journal/j._5 >> Tue May 22 14:53:17 [initandlisten] recover cleaning up >> Tue May 22 14:53:17 [initandlisten] removeJournalFiles >> Tue May 22 14:53:17 [initandlisten] recover done >> Tue May 22 14:53:18 [initandlisten] waiting for connections on port 27021 >> Tue May 22 14:53:18 [websvr] admin web console waiting for connections on >> port 28021 >> Tue May 22 14:53:19 [rsStart] trying to contact iis-man-002:27021 >> Tue May 22 14:53:19 [rsHealthPoll] replSet member iis-man-002:27021 is up >> Tue May 22 14:53:19 [rsHealthPoll] replSet member iis-man-002:27021 is >> now in state PRIMARY >> Tue May 22 14:53:19 [rsStart] replSet STARTUP2 >> Tue May 22 14:53:19 [rsHealthPoll] replSet member iis-man-001:27021 is up >> Tue May 22 14:53:19 [rsHealthPoll] replSet member iis-man-001:27021 is >> now in state ARBITER >> Tue May 22 14:53:19 [rsSync] replSet SECONDARY >> Tue May 22 14:53:23 [rsSync] replSet syncing to: iis-man-002:27021 >> Tue May 22 14:53:24 access violation
>> ***** SERVER RESTARTED *****
>> Tue May 22 14:53:25 [initandlisten] MongoDB starting : pid=2356 >> port=27021 dbpath=C:\SitesData\testing-website\mongodata 64-bit >> host=iis-man-003 >> Tue May 22 14:53:25 [initandlisten] db version v2.0.4, pdfile version 4.5 >> Tue May 22 14:53:25 [initandlisten] git version: >> 329f3c47fe8136c03392c8f0e548506cb21f8ebf >> Tue May 22 14:53:25 [initandlisten] build info: windows >> sys.getwindowsversion(major=6, minor=1, build=7601, platform=2, >> service_pack='Service Pack 1') BOOST_LIB_VERSION=1_42 >> Tue May 22 14:53:25 [initandlisten] options: { dbpath: >> "C:\SitesData\testing-website\mongodata", logappend: true, logpath: >> "C:\SitesData\testing-website\mongo.log", port: 27021, quiet: true, >> replSet: "rsTest", rest: true, service: true } >> Tue May 22 14:53:25 [initandlisten] journal >> dir=C:/SitesData/testing-website/mongodata/journal >> Tue May 22 14:53:25 [initandlisten] recover begin >> Tue May 22 14:53:25 [initandlisten] recover lsn: 844 >> Tue May 22 14:53:25 [initandlisten] recover >> C:/SitesData/testing-website/mongodata/journal/j._0 >> Tue May 22 14:53:25 [initandlisten] recover cleaning up >> Tue May 22 14:53:25 [initandlisten] removeJournalFiles >> Tue May 22 14:53:25 [initandlisten] recover done >> Tue May 22 14:53:25 [initandlisten] waiting for connections on port 27021 >> Tue May 22 14:53:25 [websvr] admin web console waiting for connections on >> port 28021 >> Tue May 22 14:53:25 [rsStart] replSet STARTUP2 >> Tue May 22 14:53:25 [rsSync] replSet SECONDARY >> Tue May 22 14:53:25 [rsHealthPoll] replSet member iis-man-002:27021 is up >> Tue May 22 14:53:25 [rsHealthPoll] replSet member iis-man-002:27021 is >> now in state PRIMARY >> Tue May 22 14:53:25 [rsHealthPoll] replSet member iis-man-001:27021 is up >> Tue May 22 14:53:25 [rsHealthPoll] replSet member iis-man-001:27021 is >> now in state ARBITER >> Tue May 22 14:53:30 [rsSync] replSet syncing to: iis-man-002:27021 >> Tue May 22 14:53:30 access violation
>> ***** SERVER RESTARTED *****
>> Tue May 22 14:53:30 [initandlisten] MongoDB starting : pid=2268 >> port=27021 dbpath=C:\SitesData\testing-website\mongodata 64-bit >> host=iis-man-003 >> Tue May 22 14:53:30 [initandlisten] db version v2.0.4, pdfile version 4.5 >> Tue May 22 14:53:30 [initandlisten] git version: >> 329f3c47fe8136c03392c8f0e548506cb21f8ebf >> Tue May 22 14:53:30 [initandlisten] build info: windows >> sys.getwindowsversion(major=6, minor=1, build=7601, platform=2, >> service_pack='Service Pack 1') BOOST_LIB_VERSION=1_42 >> Tue May 22 14:53:30 [initandlisten] options: { dbpath: >> "C:\SitesData\testing-website\mongodata", logappend: true, logpath: >> "C:\SitesData\testing-website\mongo.log", port: 27021, quiet: true, >> replSet: "rsTest", rest: true, service: true } >> Tue May 22 14:53:30 [initandlisten] journal >> dir=C:/SitesData/testing-website/mongodata/journal >> Tue May 22 14:53:30 [initandlisten] recover begin >> Tue May 22 14:53:30 [initandlisten] info no lsn file in journal/ directory >> Tue May 22 14:53:30 [initandlisten] recover lsn: 0 >> Tue May 22 14:53:30 [initandlisten] recover >> C:/SitesData/testing-website/mongodata/journal/j._0 >> Tue May 22 14:53:30 [initandlisten] recover cleaning up >> Tue May 22 14:53:30
Hmmm. It happened a third time, today. I noticed something fishy about the timing, so I calculated the time between failures (some of it is estimated, because my records aren't complete), and it was ~38 days each time (930 hours between the 1st and 2nd failure, and 910 hours between the 2nd and 3rd failure). It's always my secondary (I have two active nodes and an arbiter) that crashes like this, but the secondary is always running on our 8GB RAM server, while the primary is on a 12GB server. Both servers are shared by Mongo and fairly busy IIS nodes of a webfarm. I haven't been closely watching what's going on with 2.2, but hopefully this is useful information for the devs.
On Friday, 29 June 2012 11:31:48 UTC-5, Mo Cassidy wrote:
> Wow, sorry Tad, I had totally forgotten about this until today (guess what > happened again today). To get out of the hole, I just went ahead and did a > full resync. It didn't affect performance a whole lot and was done > relatively quickly (I think maybe 30 minutes). Thanks for all the info, > it's really interesting. I'll just do the full resync again today, and I > look forward to 2.2 so I'll have more info if this happens again! :)
> On Wednesday, 23 May 2012 05:18:26 UTC-5, Tad Marshall wrote:
>> There are a couple of things going on here. The main one is that it >> seems that something being replicated is causing an access violation on >> your secondary. This is an error in memory addressing, the same thing as a >> segfault (signal 11) on Unix. The second thing is that the Windows service >> settings that were created by mongod.exe when you used --install to install >> the service instruct the Windows Service Control Manager to restart >> mongod.exe if it ever fails. Because you have journaling enabled, mongod >> is able to recover from each restart, but when it recovers it starts >> syncing from the primary again, reads the same oplog entry that killed it >> the previous time and so gets another access violation and the process >> repeats.
>> Version 2.0.4 gives very little information about the access violation, >> just the faulting address, which makes it hard to debug. In addition, >> Address Space Layout Randomization (ASLR) makes the address hard to use in >> finding out what source code line triggered the crash. Version 2.2 will >> behave a little better on access violations in Windows: it will show the >> thread name that crashed (which is almost certainly rsSync for your crash), >> the type of memory access and the address that was used. In addition, it >> will generate a minidump that can help diagnose the problem and then shut >> down "cleanly" so no recovery will be required on restart.
>> There are a couple of things that could be tried to get you out of this >> hole. One possibility is to temporarily try the most recent nightly build >> and see if the crashing problem has been fixed. If it hasn't been fixed, >> you will get a minidump that we could use to analyze the crash. If this >> fixes the problem, you could let it run long enough to fully resync from >> the primary and then switch back to your 2.0.4 version.
>> There are other options that could be tried, such as wiping the data and >> restoring from a dump from the primary, but it might be worth trying the >> nightly build first. If it worked, it would be a lot faster. The current >> nightly build is getting close to the version that will ship as 2.2.0-rc0 >> (the first release candidate for version 2.2) but it isn't there yet and we >> don't recommend running "unstable" nightly builds in production. As a >> short term test to get you past this outage, it might be worth trying.
>> On Tuesday, May 22, 2012 5:18:19 PM UTC-4, Mo Cassidy wrote:
>>> Today, we had the following seemingly spontaneously occur:
>>> Tue May 22 14:51:45 [conn27990] update default.instances query: { >>> [...SNIP...] } idhack:1 upsert:1 124ms >>> Tue May 22 14:51:47 access violation >>> Tue May 22 14:52:15 [rsHealthPoll] replSet member iis-man-002:27021 is >>> now in state PRIMARY >>> Tue May 22 14:52:17 [rsSync] replSet syncing to: iis-man-002:27021
>>> ***** SERVER RESTARTED *****
>>> Tue May 22 14:52:19 [initandlisten] MongoDB starting : pid=1960 >>> port=27021 dbpath=C:\SitesData\testing-website\mongodata 64-bit >>> host=iis-man-003 >>> Tue May 22 14:52:19 [initandlisten] db version v2.0.4, pdfile version 4.5 >>> Tue May 22 14:52:19 [initandlisten] git version: >>> 329f3c47fe8136c03392c8f0e548506cb21f8ebf >>> Tue May 22 14:52:19 [initandlisten] build info: windows >>> sys.getwindowsversion(major=6, minor=1, build=7601, platform=2, >>> service_pack='Service Pack 1') BOOST_LIB_VERSION=1_42 >>> Tue May 22 14:52:19 [initandlisten] options: { dbpath: >>> "C:\SitesData\testing-website\mongodata", logappend: true, logpath: >>> "C:\SitesData\testing-website\mongo.log", port: 27021, quiet: true, >>> replSet: "rsTest", rest: true, service: true } >>> Tue May 22 14:52:19 [initandlisten] journal >>> dir=C:/SitesData/testing-website/mongodata/journal >>> Tue May 22 14:52:19 [initandlisten] recover begin >>> Tue May 22 14:52:19 [initandlisten] recover lsn: 258699461 >>> Tue May 22 14:52:19 [initandlisten] recover >>> C:/SitesData/testing-website/mongodata/journal/j._4 >>> Tue May 22 14:52:20 [initandlisten] recover skipping application of >>> section seq:209344204 < lsn:258699461 >>> Tue May 22 14:52:20 [initandlisten] recover skipping application of >>> section seq:209382318 < lsn:258699461 >>> Tue May 22 14:52:20 [initandlisten] recover skipping application of >>> section seq:209420683 < lsn:258699461 >>> Tue May 22 14:52:20 [initandlisten] recover skipping application of >>> section seq:209458887 < lsn:258699461 >>> Tue May 22 14:52:20 [initandlisten] recover skipping application of >>> section seq:209497082 < lsn:258699461 >>> Tue May 22 14:52:20 [initandlisten] recover skipping application of >>> section seq:209535336 < lsn:258699461 >>> Tue May 22 14:52:20 [initandlisten] recover skipping application of >>> section seq:209573520 < lsn:258699461 >>> Tue May 22 14:52:20 [initandlisten] recover skipping application of >>> section seq:209611794 < lsn:258699461 >>> Tue May 22 14:52:20 [initandlisten] recover skipping application of >>> section seq:209649958 < lsn:258699461 >>> Tue May 22 14:52:20 [initandlisten] recover skipping application of >>> section more... >>> Tue May 22 14:52:58 [initandlisten] recover >>> C:/SitesData/testing-website/mongodata/journal/j._5 >>> Tue May 22 14:53:17 [initandlisten] recover cleaning up >>> Tue May 22 14:53:17 [initandlisten] removeJournalFiles >>> Tue May 22 14:53:17 [initandlisten] recover done >>> Tue May 22 14:53:18 [initandlisten] waiting for connections on port 27021 >>> Tue May 22 14:53:18 [websvr] admin web console waiting for connections >>> on port 28021 >>> Tue May 22 14:53:19 [rsStart] trying to contact iis-man-002:27021 >>> Tue May 22 14:53:19 [rsHealthPoll] replSet member iis-man-002:27021 is up >>> Tue May 22 14:53:19 [rsHealthPoll] replSet member iis-man-002:27021 is >>> now in state PRIMARY >>> Tue May 22 14:53:19 [rsStart] replSet STARTUP2 >>> Tue May 22 14:53:19 [rsHealthPoll] replSet member iis-man-001:27021 is up >>> Tue May 22 14:53:19 [rsHealthPoll] replSet member iis-man-001:27021 is >>> now in state ARBITER >>> Tue May 22 14:53:19 [rsSync] replSet SECONDARY >>> Tue May 22 14:53:23 [rsSync] replSet syncing to: iis-man-002:27021 >>> Tue May 22 14:53:24 access violation
>>> ***** SERVER RESTARTED *****
>>> Tue May 22 14:53:25 [initandlisten] MongoDB starting : pid=2356 >>> port=27021 dbpath=C:\SitesData\testing-website\mongodata 64-bit >>> host=iis-man-003 >>> Tue May 22 14:53:25 [initandlisten] db version v2.0.4, pdfile version 4.5 >>> Tue May 22 14:53:25 [initandlisten] git version: >>> 329f3c47fe8136c03392c8f0e548506cb21f8ebf >>> Tue May 22 14:53:25 [initandlisten] build info: windows >>> sys.getwindowsversion(major=6, minor=1, build=7601, platform=2, >>> service_pack='Service Pack 1') BOOST_LIB_VERSION=1_42 >>> Tue May 22 14:53:25 [initandlisten] options: { dbpath: >>> "C:\SitesData\testing-website\mongodata", logappend: true, logpath: >>> "C:\SitesData\testing-website\mongo.log", port: 27021, quiet: true, >>> replSet: "rsTest", rest: true, service: true } >>> Tue May 22 14:53:25 [initandlisten] journal >>> dir=C:/SitesData/testing-website/mongodata/journal >>> Tue May 22 14:53:25 [initandlisten] recover begin >>> Tue May 22 14:53:25 [initandlisten] recover lsn: 844 >>> Tue May 22 14:53:25 [initandlisten] recover >>> C:/SitesData/testing-website/mongodata/journal/j._0 >>> Tue May 22 14:53:25 [initandlisten] recover cleaning up >>> Tue May 22 14:53:25 [initandlisten] removeJournalFiles >>> Tue May 22 14:53:25 [initandlisten] recover done >>> Tue May 22 14:53:25 [initandlisten] waiting for connections on port 27021 >>> Tue May 22 14:53:25 [websvr] admin web console waiting for connections >>> on port 28021 >>> Tue May 22 14:53:25 [rsStart] replSet STARTUP2 >>> Tue May 22 14:53:25 [rsSync] replSet SECONDARY >>> Tue May 22 14:53:25 [rsHealthPoll] replSet member iis-man-002:27021 is up >>> Tue May 22 14:53:25 [rsHealthPoll] replSet member iis-man-002:27021 is >>> now in state PRIMARY >>> Tue May 22 14:53:25 [rsHealthPoll] replSet member iis-man-001:27021 is up >>> Tue May 22 14:53:25 [rsHealthPoll] replSet member iis-man-001:27021 is >>> now in state ARBITER >>> Tue May 22 14:53:30 [rsSync] replSet syncing to: iis-man-002:27021 >>> Tue May 22 14:53:30 access violation
>>> ***** SERVER RESTARTED *****
>>> Tue May 22 14:53:30 [initandlisten] MongoDB starting : pid=2268 >>> port=27021 dbpath=C:\SitesData\testing-website\mongodata 64-bit >>> host=iis-man-003 >>> Tue May 22 14:53:30 [initandlisten] db version v2.0.4, pdfile version 4.5 >>> Tue May 22 14:53:30 [initandlisten] git version: