On-disk corruption vs loss of data (fsync again)


Carl Byström

Dec 13, 2009, 7:49:29 PM
to mongodb-user
As we know, Mongo doesn't fsync on every write. I understand the
reasoning behind this and I think I'm able to buy into the pros and
cons that come with it.

However, I feel like there are two different scenarios here that often
get mixed up (correct me if I'm wrong).
Namely, on-disk corruption and loss of data.

1) If my server has a power failure (god forbid), is there a scenario
where it's impossible to restore data due to complete, on-disk
corruption?

2) I have no problem losing X seconds' worth of data, as long as it
happens atomically. What I mean is: is it possible to get, say, half-
written strings? If I change the content of the string "Hello World" to
"Goodbye World", could that potentially yield a string that says
"Goolo World" in the repaired version of my database?

3) If MongoDB has the same characteristics as MyISAM when it comes to
disk consistency/fsync, how come MyISAM is so well accepted in the
database community? Can the same arguments for and against MyISAM be
applied to MongoDB with respect to disk consistency?

This behavior seems to be an issue for many people (including yours
truly). I think it would be great if the documentation could be
expanded to include some reasoning and comparisons to other databases
using similar approaches.

As with all technology, there are always trade-offs :)

--
Carl

Eliot Horowitz

Dec 13, 2009, 8:50:19 PM
to mongod...@googlegroups.com
The characteristics of the storage mechanism are almost identical to MyISAM's.

If there is a power failure, there will certainly be loss of X seconds
of data if there isn't replication (which is how we recommend handling
durability).

There is an exceedingly small chance of a record being mangled.
This could only happen if the object spanned disk blocks, and there
wasn't a battery on the drive.

In the highly unlikely event that happened, the record should be marked
as corrupt and skipped over during the repair when you come back up.

Our general feeling on durability is that the only real way to keep your
data secure is through replication, both locally and globally. Hard
drives can fail, entire data centers can go offline, raid controllers
can break, etc... At least in my experience, the biggest factors for
failures are hardware (drivers, raid controllers, ram) and not power
loss or os crash.

That being said, better durability is planned, just not at the top of
the list at the moment.

-Eliot



Ask Bjørn Hansen

Dec 13, 2009, 11:59:16 PM
to mongod...@googlegroups.com

On Dec 13, 2009, at 16:49, Carl Byström wrote:

> 3) If MongoDB has the same characteristics as MyISAM when it comes to
> disk consistency/fsync, how come MyISAM is so well accepted in the
> database community?

MyISAM isn't commonly used in busy write/read environments (though that's because of its performance, not so much its durability).


- ask

--
http://develooper.com/ - http://askask.com/


Carl Byström

Dec 14, 2009, 4:16:11 AM
to mongodb-user
On Dec 14, 2:50 am, Eliot Horowitz <eliothorow...@gmail.com> wrote:

> In the highly unlikely event that happened, the record should be marked
> as corrupt and skipped over during the repair when you come back up.

Ok, I see. How does Mongo know it's corrupt?
Will this happen on a database/namespace/document or field level?
I'm fine with this, as long as I'm aware that it occurred. Silent
failures are the worst.

> That being said, better durability is planned, just not at the top of
> the list at the moment.

Is there anything documented regarding this durability improvement?

Thanks for answering, both of you.

--
Carl

Eliot Horowitz

Dec 14, 2009, 8:13:08 AM
to mongod...@googlegroups.com
> Ok, I see. How does Mongo know it's corrupt?

The BSON spec is pretty rigid, so if this unlikely single-object
corruption does happen, you end up with invalid BSON.
For example, if you make a string longer, but there is a block
boundary in the middle of the string, the next BSON element type will
most likely be garbage.
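
To illustrate the idea (a rough, hypothetical Python sketch of why
invalid BSON is easy to detect, not the actual repair code): every
element carries a type byte and declares its own lengths, so a torn or
mangled document tends to derail the walk almost immediately.

    import struct

    def looks_valid(doc):
        """Very rough structural check of one BSON document (a bytes object).
        Only a handful of element types are handled; the point is simply
        that a torn or mangled document fails the walk almost at once."""
        try:
            (total_len,) = struct.unpack_from("<i", doc, 0)
            if total_len != len(doc) or doc[-1] != 0x00:
                return False                       # declared length must match
            pos = 4
            while doc[pos] != 0x00:                # 0x00 ends the element list
                etype = doc[pos]
                pos = doc.index(b"\x00", pos + 1) + 1   # skip the field name
                if etype == 0x02:                  # string: int32 len + bytes + NUL
                    (slen,) = struct.unpack_from("<i", doc, pos)
                    if slen <= 0:
                        return False
                    pos += 4 + slen
                elif etype == 0x10:                # int32
                    pos += 4
                elif etype in (0x01, 0x09, 0x12):  # double / datetime / int64
                    pos += 8
                elif etype == 0x08:                # boolean
                    pos += 1
                elif etype == 0x0A:                # null: no value bytes
                    pass
                else:                              # unknown type byte -> garbage
                    return False
            return True
        except (IndexError, ValueError, struct.error):
            return False                           # ran off the end: corrupt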


> Will this happen on a database/namespace/document or field level?

document level

> I'm fine with this, as long as I'm aware that it occurred. Silent
> failures are the worst.

It will try to print out the _id of the corrupted object, and if it's
so corrupt it can't do that, it will at least say it skipped an object.

>> That being said, better durability is planned, just not at the top of
>> the list at the moment.
>
> Is there anything documented regarding this durability improvement?

No - sorry. It'll most likely be a transaction log that we can replay
though. Pretty traditional. You'll most likely be able to control
the frequency of the transaction log being flushed to disk.
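
Just to make the idea concrete, here is a toy sketch of that general
technique (an append-only, replayable log with a configurable flush
interval). Purely illustrative - it has nothing to do with what we
would actually ship.

    import json, os, time

    class ToyTxLog:
        """Toy write-ahead log: append operations as JSON lines, fsync at
        most every flush_interval seconds, replay the whole log after an
        unclean shutdown. Illustrative only."""

        def __init__(self, path, flush_interval=1.0):
            self.f = open(path, "a+b")
            self.flush_interval = flush_interval
            self.last_flush = time.time()

        def append(self, op):
            self.f.write(json.dumps(op).encode("utf-8") + b"\n")
            if time.time() - self.last_flush >= self.flush_interval:
                self.f.flush()
                os.fsync(self.f.fileno())        # the durability point
                self.last_flush = time.time()

        def replay(self, apply):
            self.f.seek(0)
            for line in self.f:
                try:
                    apply(json.loads(line.decode("utf-8")))
                except ValueError:
                    break                        # partially written tail entry

    # e.g. log = ToyTxLog("ops.log", flush_interval=5.0)
    #      log.append({"op": "set", "key": "a", "value": 1})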

Carl Byström

Dec 14, 2009, 5:25:45 PM
to mongod...@googlegroups.com
Thanks for bringing some clarity to this.

scode

Dec 19, 2009, 6:14:22 PM
to mongodb-user
Hello,

I'm very skeptical about the mongodb consistency model, and came
across this thread, so I thought I'd try to poke some holes in it.

> There is an exceedingly small chance of a record being mangled.
> This could only happen if the object spanned disk blocks, and there
> wasn't a battery on the drive.

Whether or not there was a battery on the drive (I assume you are
referring to battery-backed RAID controllers and the like) should be
irrelevant to the issue at hand, since mongodb does not do any fsync()
or similar at all. You have arbitrarily ordered writes: some of them
written, some of them not, some of them partially written. Battery-
backed RAID caches do not become relevant until you have actually
gotten the data onto the RAID controller; they will not help you at
all at any layer above that.

> In the highly unlikely event that happened, the record should be marked
> as corrupt and skipped over during the repair when you come back up.

Note that you cannot assume ordered writes. If you just write a bunch
of stuff to your memory-mapped area, several things can happen,
usually for efficiency reasons (forcing in-order writes for everything
would be expensive - in some cases only slightly less efficient than
reordering, in some cases wildly inefficient):

* The operating system may write out dirty buffers in any arbitrary
order.
* The file system might do something about ordering (probably not an
issue for traditional filesystems, though).
* The I/O scheduling in the operating system kernel will do out-of-
order writing.
* The RAID controller, if any, may do out-of-order writes (probably,
but not certainly, if the RAID controller has a battery-backed cache).
* The disk drives may do out-of-order writes (probably, but not
certainly, irrelevant if the RAID controller has a battery-backed
cache).

In short, pretty much anything can happen, which is why one normally
uses write barriers to ensure consistency.
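
(For reference, the usual trick looks roughly like the following - a
generic Python sketch of using fsync() as an ordering point before
writing the record that makes an update "count". This is the textbook
technique, not anything from MongoDB.)

    import os

    def ordered_commit(data_file, log_file, payload, commit_record):
        """Generic sketch: fsync() the data before writing the record that
        declares it committed, so the commit marker can never reach disk
        ahead of the data it refers to."""
        data_file.write(payload)
        data_file.flush()
        os.fsync(data_file.fileno())    # ordering/durability point

        log_file.write(commit_record)   # only now does the update "count"
        log_file.flush()
        os.fsync(log_file.fileno())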

You say that it is at the document level. I did some very brief poking
around the source and saw that mongodb seems to be using a btree. How
is the consistency of the btree maintained, for example? What prevents
the btree from pointing to invalid values? What prevents the btree
from being internally inconsistent, with nodes pointing to invalid
would-be nodes?

Please do correct me if I am missing something, but as far as I can
tell mongodb does not seem to have any internal consistency whatsoever,
which results in arbitrary corruption on certain categories of
crashes (definitely power outages and kernel panics - perhaps not
process crashes, if things happen in the right order in memory).

And please note that these concerns are not academic. On real
production systems there is no reason to believe you would not have
hundreds of megabytes or even gigabytes of dirty data, spanning a lot
of objects and btree information. The probability of corruption really
should be significant, unless, again, I am missing something about
mongodb consistency.

> Our general feeling on durability is that the only real way to keep your
> data secure is through replication, both locally and globally.  Hard

Everyone who had a crash and actually immediately removed the database
and restored from backup or got stuff back from a replica, raise
your hand ;)

It sounds to me like you will effectively have no idea whether you
have corruption or to what extent (except perhaps that it doesn't seem
to be completely borked if you can run for a while without a crash).
How many people, in real life, just start it and see if it "seems to
work"?

> drives can fail, entire data centers can go offline, raid controllers
> can break, etc...  At least in my experience, the biggest factors for
> failures are hardware (drivers, raid controllers, ram) and not power
> loss or os crash.

Stuff always breaks, agreed. But there is a difference between
breaking by design and breaking by bug. For example, supposing you
have a replicated setup, it may be unlikely that several independent
systems fail in a buggy fashion at the same time - if they are all
properly set up for correct storage semantics (correct fs, correct os,
battery-backed caching RAID with correct configuration, etc.).

/ Peter Schuller

Eliot Horowitz

Dec 19, 2009, 9:24:24 PM
to mongod...@googlegroups.com
First - thanks for thinking about this in detail :)  We definitely
have room for improvement, so I just want to make sure everyone
understands the current state of things.

So, I'll start by making sure our stance on the current durability is
clear.  After an OS crash/power failure, a repair is required.  The
repair will re-build every b-tree and check all the extents and
objects.  To validate the data, we walk the extents, getting each
object, and then start a new clean data file.  The durability is
basically the same as MyISAM, so there is definitely a potential for
data loss on a single machine.
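
In rough pseudocode the approach looks something like this (just an
outline - the callables stand in for the real storage layer, and none
of this is actual server code):

    def repair(records, write_object, rebuild_indexes, try_decode_bson):
        """Outline of the repair pass described above: scan every record in
        the old data files, keep the ones that still decode as valid BSON,
        write them into a fresh file, and rebuild the indexes from scratch.
        The callables passed in stand in for the real storage layer."""
        kept = skipped = 0
        for raw in records:                 # raw bytes, straight from the extents
            obj = try_decode_bson(raw)      # returns None for invalid BSON
            if obj is None:
                skipped += 1                # corrupt record: note it and move on
                continue
            write_object(obj)
            kept += 1
        rebuild_indexes()                   # the old b-trees are not trusted
        print("repair: kept %d objects, skipped %d" % (kept, skipped))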

We've talked about requiring a repair after an unclean shutdown, so
people can't "cheat", but at least most of the people I've talked to
run replicas, etc., and switch over in the event of a crash.

We do a background fsync every minute (configurable), so there
is a limit to the amount of possible data loss.
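
Mechanically that part is nothing fancy - roughly this, if you squint
(illustration only, not mongod code; the file name and interval are
placeholders):

    import mmap, os, threading, time

    def start_background_flusher(path, interval_secs=60.0):
        """Illustration only, not mongod code: map a data file and flush
        dirty pages every interval_secs seconds, which bounds how much
        unsynced data a power failure can take with it."""
        fd = os.open(path, os.O_RDWR)
        mapped = mmap.mmap(fd, 0)

        def loop():
            while True:
                time.sleep(interval_secs)
                mapped.flush()              # msync(): push dirty pages to disk

        threading.Thread(target=loop, daemon=True).start()
        return mapped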

Practically, I think it makes sense to really look at the details of
breaking by design versus breaking by bug:
 - many, many people I've talked to running mysql with innodb who
assumed they were safe have had various problems.  Many people don't
realize they need their hardware write buffering backed by a battery.
Also - many people assume their data is safe, and therefore when there
is a disk/raid controller problem they lose data.
 - server process crash - the mmapped data that hasn't been synced
isn't tossed and is actually still synced back to disk.  Of course, if
there is a process crash, maybe that led to corruption, but that can
happen anywhere.
 - os crash - at this point in the linux kernel, it's at least 50% a
hardware problem (in my experience) and the data is suspect.
 - raid controller/disk failure - fairly common; on-disk durability
doesn't really help you here.
 - power failure - this is the one where on-disk durability really
does help.  With mongo, your only way to protect yourself is
off-site replication.

A lot of decisions we've made are from our experience running large
systems, and may not work for everyone at the moment.

All this being said, single server durability is something we are
going to work more on next year, and there are a lot of things we are
planning that will make a single server more durable.

-Eliot



Ask Bjørn Hansen

Dec 20, 2009, 3:47:03 AM
to mongod...@googlegroups.com

On Dec 19, 2009, at 15:14, scode wrote:

> Everyone who had a crash and actually immediately removed the database
> and restored from backup or got stuff back from a replica, raise
> your hand ;)

Huh? It happens all the time (well, not that often, but often enough to have lost count). I don't have any significant production MongoDBs yet; so that's just talking about MySQL.

I suspect we have a comparable number of "the box is completely lost/screwed up" failures as we have "stuff crashed". As Eliot said the only practical way to deal with the box going completely kaboom is to have another.

> It sounds to me like you will effectively have no idea whether you
> have corruption or to what extent (except perhaps that it doesn't seem
> to be completely borked if you can run for a while without a crash).
> How many people, in real life, just start it and see if it "seems to
> work"?


That is true ... On MySQL we use the wonderful maatkit tools to test the integrity of the replication mirrors from time to time; it'd be nice if there was something similar for MongoDB.


- ask

Eliot Horowitz

Dec 20, 2009, 6:56:01 AM
to mongod...@googlegroups.com
> That is true ...   On MySQL we use the wonderful maatkit tools to test the integrity of the replication mirrors from time to time; it'd be nice if there was something similar for MongoDB.

.validate() does this.

Peter Schuller

Dec 20, 2009, 7:18:07 AM
to mongod...@googlegroups.com
> So, I'll start by making sure our stance on the current durability is
> clear.  After an OS crash/power failure, a repair is required.  The
> repair will re-build every b-tree and check all the extents and
> objects.  To validate the data, we walk the extents, getting each
> object, and then start a new clean data file.  The durability is
> basically the same as MyISAM, so there is definitely a potential for
> data loss on a single machine.

Ok. Just to be clear: so the input to the repair process does not
include the btree? I.e., you do the repair (build a new db + index)
based on some form of sequential (or otherwise) scan of the datafile,
relying only on accurately detecting the existence of a data record,
and leaving room for the possibility of it being partially written?

> We do a background fsync every minute (configurable), so there
> is a limit to the amount of possible data loss.

A very key point, though, is whether the data loss is truly limited to
losing N seconds of recent activity, or whether there is a
possibility of arbitrary DB corruption (e.g. a broken btree index) that
leads to further loss. Also, it is relevant to know whether or not an
update to a data record is atomic; from what I have read so far,
that is not the case.

But we do believe that you would not lose older data that was not
actively being written shortly before the crash.

>  - many, many people I've talked to running mysql with innodb who
> assumed they were safe have had various problems.  Many people don't
> realize they need their hardware write buffering backed by a battery.
> Also - many people assume their data is safe, and therefore when there
> is a disk/raid controller problem they lose data.

Almost everyone has a broken storage setup and doesn't know about it.
It's a pet peeve of mine. Part of the problem is that it is very
difficult to get the relevant information, and sometimes systems are
delivered by the hardware vendor out of the box in such a way that they
do not have correct semantics.

This may or may not be acceptable, but it is IMO a decision that needs
to be taken explicitly, with knowledge and understanding (i.e., if
someone did not specifically choose a lack of consistency, I would like
the default, in terms of hardware/OS, to be consistency).

>  - server process crash - the mmapped data that hasn't been synced
> isn't tossed and is actually still synced back to disk.  Of course, if
> there is a process crash, maybe that led to corruption, but that can
> happen anywhere.

Yes, an arbitrary bug in the database software can of course lead to
corruption. This is, however, a case where bugs/crashes are often
localized in terms of their "domain", such that only a very small
subset of bugs actually tends to cause corruption.

>  - os crash - at this point in the linux kernel, it's at least 50% a
> hardware problem (in my experience) and the data is suspect.

Not sure I agree here. You can have various forms of panics that are
non-arbitrary in nature, such as memory allocation issues in the
kernel, an assertion-triggered panic, etc. However, if you do not
know what caused the crash, all bets are certainly off. And hardware
issues are common, agreed.

>  - raid controller/disk failure - fairly common; on-disk durability
> doesn't really help you here.

It does in the presence of replication though. If the systems you're
running on are designed for correctness, it should hopefully not be
very likely that a single power outage breaks several of them.
People do run production databases with strong consistency guarantees
(PostgreSQL, Oracle, etc.) on such systems, and they do survive power
failures all the time. That doesn't mean it will always be okay, but
there is a key difference between such a situation and a situation
where you replicate to 5 hosts, all of which are by design almost
certainly going to suffer corruption when there is a power outage in
the data center.

(You can of course make arguments about proper backups. But even with
backups you prefer, for production systems with uptime requirements,
to not have to restore from one.)

>  - power failure - this is the one where on-disk durability really
> does help.  With mongo, your only way to protect yourself is
> off-site replication.

(Interestingly, though, I believe you are in fact safe with, for
example, ZFS, where writes happen to be ordered (not actually ordered,
but effectively so due to transaction group commit semantics). I'm not
sure how strong a guarantee that is though, or whether it applies to
mmap() and not just regular writes.)

> A lot of decisions we've made are from our experience running large
> systems, and may not work for everyone at the moment.

I do not claim it is an unacceptable solution; however, I would say
that it would be good to be very open and clear about consistency
guarantees so that one can make an accurate choice as to which solution
to use for which problem. As I indicated, the lack of clear information
on consistency (at all levels: RAID controllers, disks, file systems,
etc.) is something I do care about.

> All this being said, single server durability is something we are
> going to work more on next year, and there are a lot of things we are
> planning that will make a single server more durable.

Cool.

--
/ Peter Schuller

Peter Schuller

Dec 20, 2009, 7:23:53 AM
to mongod...@googlegroups.com
>> Everyone who had a crash and actually immediately removed the database
>> and restored from backup or got stuff back from a replica, raise
>> your hand ;)
>
> Huh?  It happens all the time (well, not that often, but often enough to have lost count).  I don't have any significant production MongoDBs yet; so that's just talking about MySQL.

What I mean is, I do not think I have ever been in a situation where
someone other than me has voluntarily agreed that "ok, we have a
system that does not guarantee consistency, and we had a crash of type
X, therefore let's restore from backup". Instead, they *invariably*
want to "see if it works" first. And invariably, there is a high
probability that it *does* seem to work. And if it does, you likely
*have* corruption anyway, regardless of the fact that it "seems to
work".

So I was trying to point out that in the real world people do not tend
to take the safe/responsible route and adopt the "all bets are off"
approach. I did not mean to imply (absolutely not) that
crashes/inconsistencies do not happen.

> I suspect we have a comparable number of "the box is completely lost/screwed up" failures as we have "stuff crashed".  As Eliot said the only practical way to deal with the box going completely kaboom is to have another.

The key point is - how do you know whether you're fine? And unless you
have a completely obvious "it won't start" type of situation, most
people, I believe, will just assume it's fine and truck along happily -
until some problem happens as a result of the corruption in the
future, at which point the association is never made.

> That is true ...   On MySQL we use the wonderful maatkit tools to test the integrity of the replication mirrors from time to time; it'd be nice if there was something similar for MongoDB.

Yes, if we assume a replicated setup, which does seem to be the
intended design, it would certainly be very interesting to have that.
Effectively it provides a kind of high-level 'voting mechanism' that
should be able to give some statistically good results almost
independently of the failure mode being recovered from, without
relying on perfect storage semantics on individual nodes.

I completely agree that such behavior is very desirable; I do not mean
to imply that there is a black-and-white situation with storage either
being correct or not, and that there is nothing to be done in the
latter case.

--
/ Peter Schuller

Ask Bjørn Hansen

Dec 21, 2009, 2:54:04 AM
to mongod...@googlegroups.com

On Dec 20, 2009, at 3:56, Eliot Horowitz wrote:

>> That is true ... On MySQL we use the wonderful maatkit tools to test the integrity of the replication mirrors from time to time; it'd be nice if there was something similar for MongoDB.
>
> .validate() does this.

It compares the data on the various replicas and helps you figure out what, if anything, is different? If so, that's not clear from the documentation I could find. :-)

http://www.mongodb.org/display/DOCS/Validate+Command

http://www.maatkit.org/doc/mk-table-checksum.html


- ask

Eliot Horowitz

Dec 21, 2009, 8:50:05 AM
to mongod...@googlegroups.com
Sorry - didn't realize that's what maatkit did.
Yes - that would be a good utility - pretty easy for anyone to write.
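
Something along these lines would probably do as a starting point - a
rough, untested sketch with pymongo; the host, database, and collection
names are placeholders, and a real tool would want to chunk the work
the way maatkit does:

    import hashlib
    from pymongo import Connection    # pymongo's client class at the time

    def collection_checksum(host, db_name, coll_name):
        """Hash every document in _id order so two replicas can be compared."""
        coll = Connection(host)[db_name][coll_name]
        digest = hashlib.md5()
        for doc in coll.find().sort("_id", 1):
            digest.update(repr(doc).encode("utf-8"))
        return digest.hexdigest()

    def compare(host_a, host_b, db_name, coll_name):
        a = collection_checksum(host_a, db_name, coll_name)
        b = collection_checksum(host_b, db_name, coll_name)
        print("%s.%s: %s" % (db_name, coll_name,
                             "match" if a == b else "MISMATCH"))

    # compare("db1.example.com", "db2.example.com", "mydb", "users")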
