As we know, Mongo doesn't fsync on every write. I understand the
reasoning behind this and I think I'm able to buy into the pro/cons
that come with it.
However, I feel like there are two different scenarios here that often
get mixed up (correct me if I'm wrong).
Namely, on-disk corruption and loss of data.
1) If my server has a power failure (god forbid), is there a scenario
where it's impossible to restore data due to complete, on-disk
corruption?
2) I have no problem losing X seconds worth of data, as long as it
happens atomically. What I mean, is it possible to get, say half-
written strings? Changing the content of string "Hello World" to
"Goodbye World", could that potentially yield a string that says
"Goolo World" in the repaired version of my database?
3) If MongoDB has the same characteristics as MyISAM when it comes to
disk consistency/fsync, how come MyISAM is so well accepted in the
database community? Can the same arguments pro/con MyISAM be applied
to MongoDB in respect of disk consistency?
This behavior seems to be an issue for many people (including yours
truly). Think it would be great if the documentation could be expanded
to include some reasoning and comparisons to other databases using
similar approaches.
As with all technology, there are always trade-offs :)
The characteristics of the storage mechanism are almost identical to MyISAM.
If there is a power failure, there will certainly be loss of X seconds
of data if there isn't replication (which is how we recommend handling
durability).
There is an exceedingly small chance of a record being mangled.
This could only happen if the object spanned disk blocks, and there
wasn't a battery on the drive.
In the highly unlikely chance that happened, on the repair when you
came back up, it should be marked as corrupt and skipped over.
Our general feeling on durability is the only real way to keep your
data secure is through replication, both locally and globally. Hard
drives can fail, entire data centers can go offline, raid controllers
can break, etc... At least in my experience, the biggest factors for
failures are hardware (drivers, raid controllers, ram) and not power
loss or os crash.
That being said, better durability is planned, just not at the top of
the list at the moment.
> --
> Carl
> --
> You received this message because you are subscribed to the Google Groups "mongodb-user" group.
> To post to this group, send email to mongodb-user@googlegroups.com.
> To unsubscribe from this group, send email to mongodb-user+unsubscribe@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.
> 3) If MongoDB has the same characteristics as MyISAM when it comes
> disk consistency/fsync, how come MyISAM is so well accepted in the
> database community?
MyISAM isn't commonly used in busy write/read environments (but because of its performance, not the durability as much).
On Dec 14, 2:50 am, Eliot Horowitz <eliothorow...@gmail.com> wrote:
> In the highly unlikely chance that happened, on the repair when you
> came back up, it should be marked as corrupt and skipped over.
Ok, I see. How does Mongo know it's corrupt?
Will this happen on a database/namespace/document or field level?
I'm fine with this, as long as I'm aware that it occurred. Silent fails
are the worst.
> That being said, better durability is planned, just not at the top of
> the list at the moment.
Is there anything documented regarding this durability improvement?
The BSON spec is pretty rigid, so if there is this unlikely single
object corruption you end up with invalid BSON.
For example, if you make a string longer, but there is a block
boundary in the middle of the string, the next BSON element type will
most likely be garbage.
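To illustrate why a torn write tends to show up as invalid BSON, here is a minimal sketch of a structural check. It follows the published BSON layout (int32 total length, typed elements, trailing zero byte) but only handles a handful of element types; the function names are made up for this example and this is not MongoDB's actual validation code.

```python
import struct

# Fixed-size BSON value types handled by this sketch: type byte -> value size.
# 0x01 double, 0x08 boolean, 0x0A null, 0x10 int32, 0x12 int64.
FIXED_SIZES = {0x01: 8, 0x08: 1, 0x0A: 0, 0x10: 4, 0x12: 8}
STRING = 0x02


def is_valid_bson(buf: bytes) -> bool:
    """Structurally validate one BSON document (subset of types only)."""
    if len(buf) < 5:
        return False
    (total,) = struct.unpack_from("<i", buf, 0)
    if total != len(buf) or buf[-1] != 0:
        return False
    pos, end = 4, total - 1  # elements live between the header and the trailer
    while pos < end:
        elem_type = buf[pos]
        pos += 1
        name_end = buf.find(b"\x00", pos, end)  # element name is a C string
        if name_end == -1:
            return False
        pos = name_end + 1
        if elem_type == STRING:
            if pos + 4 > end:
                return False
            (slen,) = struct.unpack_from("<i", buf, pos)
            pos += 4
            # slen counts the string bytes plus their trailing zero byte.
            if slen < 1 or pos + slen > end or buf[pos + slen - 1] != 0:
                return False
            pos += slen
        elif elem_type in FIXED_SIZES:
            pos += FIXED_SIZES[elem_type]
            if pos > end:
                return False
        else:
            return False  # unknown type byte: treat it as garbage
    return pos == end


def encode_string_doc(name: str, value: str) -> bytes:
    """Build a one-field BSON document {name: value} by hand (helper for the demo)."""
    v = value.encode("utf-8") + b"\x00"
    body = bytes([STRING]) + name.encode("utf-8") + b"\x00" + struct.pack("<i", len(v)) + v
    return struct.pack("<i", len(body) + 5) + body + b"\x00"


old = encode_string_doc("greeting", "Hello World")
new = encode_string_doc("greeting", "Goodbye World")
torn = new[:20] + old[20:]  # only the first part of the longer document landed
```

In this sketch `is_valid_bson(torn)` is false because the length header no longer matches the bytes that follow. Note that a purely structural check like this cannot notice a mix of two same-length strings; that case would need something like a checksum.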
> Will this happen on a database/namespace/document or field level?
document level
> I'm fine with this, as long I'm aware that it occurred. Silent fails
> are the worst.
It will try to print out the _id of the corrupted object, and if it's so
corrupt that it can't do that, it will at least say it skipped an object.
>> That being said, better durability is planned, just not at the top of
>> the list at the moment.
> Is there anything documented regarding this durability improvement?
No - sorry. It'll most likely be a transaction log that we can replay
though. Pretty traditional. You'll most likely be able to control
the frequency of the transaction log being flushed to disk.
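For readers unfamiliar with the idea, a replayable transaction log with a configurable flush could look roughly like the sketch below. This is only an illustration of the general technique, not the planned MongoDB design; the record layout (length plus CRC32 header) and the names `append_entry` and `replay` are assumptions made up for the example.

```python
import os
import struct
import zlib

HEADER = struct.Struct("<II")  # payload length, CRC32 of payload


def append_entry(f, payload: bytes, sync: bool = True) -> None:
    """Append one log entry; `sync` controls whether we fsync right away."""
    f.write(HEADER.pack(len(payload), zlib.crc32(payload)) + payload)
    if sync:
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to push it to stable storage


def replay(path: str) -> list:
    """Return every intact entry, stopping at the first torn or corrupt one."""
    entries = []
    with open(path, "rb") as f:
        while True:
            header = f.read(HEADER.size)
            if len(header) < HEADER.size:
                break  # clean end of log, or a header cut off by the crash
            length, crc = HEADER.unpack(header)
            payload = f.read(length)
            if len(payload) < length or zlib.crc32(payload) != crc:
                break  # torn tail: everything before it is still usable
            entries.append(payload)
    return entries
```

Calling `append_entry(..., sync=False)` and syncing every N entries or seconds is the "control the frequency" knob: a less frequent sync is faster but widens the window of entries that can be lost.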
I'm very skeptical about the mongodb consistency model, and came across this thread, so I thought I'd try to poke some holes in it.
> There is an exceedingly small chance of a record being mangled. This could only happen if the object spanned disk blocks, and there wasn't a battery on the drive.
Whether or not there was a battery on the drive (I assume you are referring to battery-backed RAID controllers and the like) should be irrelevant to the issue at hand since mongodb does not do any fsync() or similar at all. You have arbitrarily ordered writes, some of them written, some of them not, some of them partially written. Battery backed RAID caches do not become relevant until you have actually gotten to the point of getting the data onto the RAID controller; it will not help you at all at any layer above that.
> In the highly unlikely chance that happened, on the repair when you came back up, it should be marked as corrupt and skipped over.
Note that you cannot assume ordered writes. If you just write a bunch of stuff to your memory mapped area, several things can happen, usually for efficiency (doing in-order writes for all writes would be extremely inefficient; in some cases only slightly less efficient, and in some cases crazy inefficient):
* The operating system may write out dirty buffers in any arbitrary order.
* The file system might do something about ordering (probably not an issue for traditional fs:es though).
* The I/O scheduling in the operating system kernel will do out-of-order writing.
* The RAID controller, if any, may do out-of-order writes (probably, but not certainly, if the RAID controller has a battery backed cache).
* The disk drives may do out-of-order writes (probably, but not certainly, irrelevant if the RAID controller has a battery backed cache).
In short pretty much anything can happen, which is why one normally uses write barriers to ensure consistency.
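As a concrete picture of using a sync as an ordering barrier, here is a small sketch of the common write-then-rename pattern on POSIX systems. It is a generic illustration rather than anything MongoDB does; `atomic_write` is a made-up name, and the directory fsync assumes a POSIX platform where directories can be opened read-only.

```python
import os


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` so that readers see either the old or the new content."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())   # barrier 1: data is on disk before the rename
    os.replace(tmp, path)      # atomic rename on POSIX file systems
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)       # barrier 2: the rename itself is on disk
    finally:
        os.close(dir_fd)
```

Without the first fsync, the rename could reach the disk before the file contents do, which is exactly the kind of reordering listed above.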
You say that it is at the document level. I did some very brief poking around the source and see that mongodb seems to be using a btree. How is the consistency of the btree maintained for example? What prevents the btree from pointing to invalid values? What prevents the btree from being internally inconsistent, with nodes pointing to invalid would-be nodes?
Please do correct me if I am missing something, but as far as I can tell mongodb does not seem to have any internal consistency whatsoever, which results in arbitrary corruption on certain categories of crashes (definitely power outages and kernel panics - perhaps not process crashes if things happen in the right order in memory).
And please note that these concerns are not academic. On real production systems there is no reason to believe you would not have hundreds of megabytes or even gigabytes of dirty data, spanning a lot of objects and btree information. The probability of corruption really should be significant, unless, again, I am missing something about mongodb consistency.
> Our general feeling on durability is the only real way to keep your data secure is through replication, both locally and globally. Hard
Everyone who had a crash and actually immediately removed the database and restored from backup or got stuff back from a replica, raise your hand ;)
It sounds to me like you will effectively have no idea whether you have corruption or to what extent (except perhaps that it doesn't seem to be completely borked if you can run for a while without a crash). How many people, in real life, just start it and see if it "seems to work"?
> drives can fail, entire data centers can go offline, raid controllers can break, etc... At least in my experience, the biggest factors for failures are hardware (drivers, raid controllers, ram) and not power loss or os crash.
Stuff always breaks, agreed. But there is a difference between breaking by design and breaking by bug. For example, supposing you have a replicated setup, it may be unlikely that several independent systems fail in a buggy fashion at the same time - if they are all properly set up for correct storage semantics (correct fs, correct os, battery-backed caching RAID with correct configuration, etc).
First - thanks for thinking about this in detail :) We definitely have room for improvement, so I just want to make sure everyone understands the current state of things.
So, I'll start by making sure our stance on the current durability is clear. After an os crash/power failure, a repair is required. The repair will re-build every b-tree, and check all the extents and objects. To validate the data, we walk the extents, read each object, and write the valid ones into a new, clean data file. The durability is basically the same as MyISAM, so there is definitely a potential for data loss on a single machine.
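The repair pass described above can be pictured roughly like this. It is only a sketch under loud assumptions: a list of raw byte strings stands in for walking the extents of a data file, JSON stands in for BSON, and `repair` is a hypothetical name, not the real repair code.

```python
import json


def repair(records):
    """Copy every parseable record into a fresh list and report the rest.

    `records` is a list of raw byte strings (a stand-in for the extents).
    Returns (clean_documents, positions_of_skipped_records).
    """
    clean, skipped = [], []
    for index, raw in enumerate(records):
        try:
            doc = json.loads(raw.decode("utf-8"))
            if not isinstance(doc, dict):
                raise ValueError("not a document")
        except ValueError:  # also covers JSON and UTF-8 decode errors
            # Too damaged to recover an _id: at least say it was skipped.
            skipped.append(index)
            print(f"skipped corrupt object at position {index}")
            continue
        clean.append(doc)  # valid objects go into the new, clean copy
    return clean, skipped
```

The point of the design is that nothing from the old indexes is trusted: indexes would be rebuilt from whatever lands in the clean copy, and every skipped object is reported rather than dropped silently.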
We've talked about requiring a repair after an unclean shutdown, so people can't "cheat", but at least most of the people I've talked to run replicas, etc... and switch over in the event of a crash.
We do run a background fsync every minute (configurable), so there is a limit to the amount of possible data loss.
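A rough sketch of what a periodic flush of memory-mapped data means in practice, assuming a memory-mapped file and a caller that passes in the current time. `PeriodicFlusher`, `tick` and `map_file` are hypothetical names for this example; the real server does this internally.

```python
import mmap


def map_file(path: str, size: int) -> mmap.mmap:
    """Size an existing file to `size` bytes and memory-map it read/write."""
    with open(path, "r+b") as f:
        f.truncate(size)
        return mmap.mmap(f.fileno(), size)


class PeriodicFlusher:
    """Flush a mapping at most once per `interval_s` seconds."""

    def __init__(self, mm: mmap.mmap, interval_s: float = 60.0):
        self.mm = mm
        self.interval_s = interval_s
        self.last_flush = 0.0
        self.flushes = 0

    def tick(self, now: float) -> bool:
        """Call regularly with the current time; returns True if it flushed."""
        if now - self.last_flush >= self.interval_s:
            self.mm.flush()  # write dirty pages of the mapping back to the file
            self.last_flush = now
            self.flushes += 1
            return True
        return False
```

Anything written to the mapping after the last successful flush is what can be lost in a power failure, so the interval bounds the amount of recent data at risk; it does not by itself say anything about the order in which the OS writes pages in between.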
Practically, I think it makes sense to really look at the details between breaking by design or bug:
- Many, many people I've talked to running MySQL with InnoDB who assumed they are safe have various problems. Many people don't realize they need hardware buffering off of a battery backup. Also, many people assume their data is safe, and therefore when there is a disk/raid controller problem they lose data.
- server process crash - the mmapped data that hasn't been synced isn't tossed and is actually still synced back to disk. Of course, if there is a process crash, maybe that led to corruption, but that can happen anywhere.
- os crash - at this point in the linux kernel, it's at least 50% a hardware problem (in my experience) and data is suspect.
- raid controller/disk failure - fairly common, on disk durability doesn't really help you here.
- power failure - this is the one where on disk durability really does help. With mongo, your only way to protect yourself is off-site replication.
A lot of decisions we've made are from our experience running large systems, and may not work for everyone at the moment.
All this being said, single server durability is something we are going to work more on next year, and there are a lot of things we are planning that will make a single server more durable.
On Sat, Dec 19, 2009 at 6:14 PM, scode <peter.schul...@infidyne.com> wrote:
> Hello,
> / Peter Schuller
> Everyone who had a crash and actually immediately removed the database and restored from backup or got stuff back from a replica, raise your hand ;)
Huh? It happens all the time (well, not that often, but often enough to have lost count). I don't have any significant production MongoDBs yet; so that's just talking about MySQL.
I suspect we have a comparable number of "the box is completely lost/screwed up" failures as we have "stuff crashed". As Eliot said the only practical way to deal with the box going completely kaboom is to have another.
> It sounds to me like you will effectively have no idea whether you have corruption or to what extent (except perhaps that it doesn't seem to be completely borked if you can run for a while without a crash). How many people, in real life, just start it and see if it "seems to work"?
That is true ... On MySQL we use the wonderful maatkit tools to test the integrity of the replication mirrors from time to time; it'd be nice if there was something similar for MongoDB.
> That is true ... On MySQL we use the wonderful maatkit tools to test the integrity of the replication mirrors from time to time; it'd be nice if there was something similar for MongoDB.
> So, I'll start by making sure our stance on the current durability is clear. After an os crash/power failure - a repair is required. the repair will re-build every b-tree, and check all the extents and objects. To validate the data, we walk the extents for each object getting the objects, and then starting a new clean data file. The durability is basically the same as myisam, so there is definitely a potential for data loss on a single machine.
Ok. Just to be clear; so the input to the repair process does not include the btree? I.e., you do the repair (build a new db + index) based on some form of sequential (or otherwise) scan of the datafile, relying only on accurately detecting the existence of a data record, and leaving room for the possibility of it being partially written?
> We do run a background fsync every minute (configurable), so there is a limit to the amount of possible data loss.
A very key point though is whether the data loss is truly limited to losing N seconds of recent activity, or whether there is a possibility of arbitrary DB corruption (e.g. broken btree index) that leads to further loss. Also, it is relevant to know whether or not an update to a data record is atomic; by what I have read so far, that is not the case.
But we do believe that you would not lose older data not actively being written recently before the crash.
> - many many people i've talked to running mysql with innodb who assumed they are safe have various problems. Many people don't realize they need hardware buffering off of a battery backup. Also - many people assume their data is safe, and therefore when there is a disk/raid controller problem they lose data.
Almost everyone has a broken storage setup and doesn't know about it. It's a pet peeve of mine. Part of the problem is that it is very difficult to get the relevant information, and sometimes systems are delivered by the hardware vendor out of the box in such a way that they do not have correct semantics.
This may or may not be acceptable, but it is IMO a decision that needs to be explicitly taken with knowledge and understanding (i.e., if someone did not specifically choose a lack of consistency, I like the default, in terms of hardware/OS, to be that of consistency).
> - server process crash - the mmapped data that hasn't been synced isn't tossed and is actually still synced back to disk. of course, if there is a process crash, maybe that led to corruption, but that can happen anywhere
Yes, any arbitrary bug in the database software can of course lead to corruption. This is however a case where bugs/crashes are often localized in terms of their "domain", such that only a very small subset of bugs actually tend to cause corruption.
> - os crash - at this point in the linux kernel, it's at least 50% a hardware problem (in my experience) and data is suspect
Not sure I agree here. You can have various forms of panics that are non-arbitrary in nature, such as memory allocation issues in the kernel, or an assertion triggered panic, etc. However certainly if you do not know what caused the crash, all bets are off. And hardware issues are common, agreed.
> - raid controller/disk failure - fairly common, on disk durability doesn't really help you here
It does in the presence of replication though. If your systems (that you're running on) are designed for correctness, it should hopefully not be very likely that a single power outage breaks several of them. People do run production databases with high consistency guarantees (postgresql, oracle etc) on such systems that do survive power failures all the time. That doesn't mean it will always be okay, but there is a key difference between such a situation and a situation where you replicate to 5 hosts all of which are by design almost certainly going to suffer corruption when there is a power outage in the data center.
(You can of course make arguments about proper backups. But even with backups you prefer, for production systems with uptime requirements, to not have to restore from one.)
> - power failure - this is the one where on disk durability really does help. with mongo, your only way to protect yourself is off-site replication
(Interestingly though I believe you are in fact safe with, for example, ZFS where writes happen to be ordered (not actually ordered, but effectively due to transaction group commit semantics). I'm not sure how strong a guarantee that is though, and whether it applies to mmap() and not just regular writes.)
> A lot of decisions we've made are from our experience running large systems, and may not work for everyone at the moment.
I do not claim it is an unacceptable solution, however I would say that it would be good to be very open and clear about consistency guarantees so that one can make an accurate choice as to what solution to use for what problem. As I indicated, the lack of clear information on consistency (at all levels; RAID controllers, disks, file systems, etc) is something I do care about.
> All this being said, single server durability is something we are going to work more on next year, and there are a lot of things we are planning that will make a single server more durable.
>> Everyone who had a crash and actually immediately removed the database and restored from backup or got stuff back from a replica, raise your hand ;)
> Huh? It happens all the time (well, not that often, but often enough to have lost count). I don't have any significant production MongoDBs yet; so that's just talking about MySQL.
What I mean is, I do not think I have ever been in a situation where someone other than me has voluntarily agreed that "ok, we have a system that does not guarantee consistency, and we had a crash of type X, therefore let's restore from backup". Instead, they *invariably* want to "see if it works" first. And invariably, there is a high probability that it *does* seem to work. And if it does, you likely *have* corruption anyway, regardless of the fact that it "seems to work".
So I was trying to point out that in the real world people do not tend to take the safe/responsible route and adopt the "all bets are off" approach. I did not mean to imply (absolutely not) that crashes/inconsistencies do not happen.
> I suspect we have a comparable number of "the box is completely lost/screwed up" failures as we have "stuff crashed". As Eliot said the only practical way to deal with the box going completely kaboom is to have another.
The key point is - how do you know whether you're fine? And unless you have a completely obvious "it won't start" type of situation, most people, I believe, will just assume it's fine and truck along happily - until some problem happens as a result of the corruption in the future, at which point the association is never made.
> That is true ... On MySQL we use the wonderful maatkit tools to test the integrity of the replication mirrors from time to time; it'd be nice if there was something similar for MongoDB.
Yes, if we assume a replicated setup, which does seem to be the intended design, it would certainly be very interesting to have that. Effectively it provides a kind of high-level 'voting mechanism' that should be able to give some statistically good results almost independently of the failure mode being recovered from, without relying on perfect storage semantics on individual nodes.
I completely agree that such behavior is very desirable; I do not mean to imply that there is a black&white situation with storage either being correct or not, and that there is nothing to be done in the latter case.
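To sketch what such a replica comparison could look like (this is not an existing MongoDB tool, and not how maatkit works internally), one could hash each document in a canonical form and compare the digests per _id. The function names and the in-memory dicts standing in for two replicas are assumptions made up for this example.

```python
import hashlib
import json


def doc_digest(doc: dict) -> str:
    """Hash a document in a canonical form so key order does not matter."""
    canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_replicas(primary: dict, replica: dict) -> dict:
    """Compare two {_id: document} mappings and report where they disagree."""
    missing = sorted(set(primary) - set(replica))      # on primary only
    extra = sorted(set(replica) - set(primary))        # on replica only
    mismatched = sorted(
        key for key in set(primary) & set(replica)
        if doc_digest(primary[key]) != doc_digest(replica[key])
    )
    return {"missing": missing, "extra": extra, "mismatched": mismatched}
```

With more than two replicas, the same digests could feed a simple majority vote per _id, which is the 'voting mechanism' idea: a node whose digest disagrees with the others is the suspect one.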
>> That is true ... On MySQL we use the wonderful maatkit tools to test the integrity of the replication mirrors from time to time; it'd be nice if there was something similar for MongoDB.
> .validate() does this.
It compares the data on the various replicas and helps you figure out what, if anything, is different? If so that's not clear from the documentation I could find. :-)