Best practices for MongoDB + RAID on EBS?


Michael Conigliaro

May 17, 2011, 5:55:10 PM
to mongodb-user
Hey guys,

So I've been resisting using RAID in EC2 for a long time now, mostly
because of things people have told me (e.g. one slow drive in your
array can slow everything down), and various articles I've read
online. For example:

http://www.nevdull.com/2008/08/24/why-raid-10-doesnt-help-on-ebs/

So overall, it seemed like RAID just wasn't worth the extra effort and
complexity. But more recently, I've been reading more positive things
about using RAID on EC2. I know that MongoDB recommends a RAID 10
configuration on production clusters now, and since it looks like
we're starting to hit an IO bottleneck here, I figured I should at
least give it a try in our testing environment.

So first of all, I was curious about how people were configuring RAID
in their environments. Anyone care to share their experiences and/or
mdadm commands? Based on what I've read so far (e.g.
http://orion.heroku.com/past/2009/7/29/io_performance_on_ebs/),
something like this seems reasonable:

# mdadm --create /dev/md0 --level 10 --chunk 256 --raid-devices 4 /dev/sdc /dev/sdd /dev/sde /dev/sdf
# blockdev --setra 65536 /dev/md0

Though it looks like there are actually a bunch of different ways of
setting up RAID 10 with mdadm, and I wasn't sure if any one way was
more correct than the others (at least as it relates to MongoDB). It
would be nice if the recommended commands were somewhere in the
MongoDB manual (if they are there, I couldn't find them).

So once the RAID device is created, is everyone using (or is it
recommended to use) LVM on top of that? I can see how that could be
useful to resize the volume. But then again, is it even possible to
resize the underlying RAID device once it's created? Excuse my
ignorance here. I've never actually tried to permanently add/remove
devices from a RAID device like this.

And lastly, when using RAID on EC2, is it necessary to keep track of
which volumes are attached as which devices? It seems like you'd need
to know this in case the instance ever got terminated or whatever. Or
can mdadm automatically reassemble a RAID device somehow, given the
correct list of EBS volumes? I just want to make sure I'm keeping
track of everything I might need to know in case an instance
disappears.

Thanks in advance!

- Mike

Brendan W. McAdams

May 19, 2011, 1:46:44 PM
to mongod...@googlegroups.com
Sorry for the delay, been making sure I have all the right info and checked a few things.  See answers inline, below.

On Tue, May 17, 2011 at 5:55 PM, Michael Conigliaro <mike.co...@livingsocial.com> wrote:
Hey guys,

So I've been resisting using RAID in EC2 for a long time now, mostly
because of things people have told me (e.g. one slow drive in your
array can slow everything down), and various articles I've read
online. For example:

http://www.nevdull.com/2008/08/24/why-raid-10-doesnt-help-on-ebs/


Context matters, and these kinds of benchmarks are definitely all over the place (I mean there are lots of them, not that they vary in content per se, even if they do).

Unfortunately, I haven't been able to pull up the original benchmark he links to (or find a copy in Google's cache) --- partly to get a feel for what his "F2" variant of RAID 10 is (I suspect the one I usually build out isn't "F2"; I know in general how F2 looks, but not how to get it set up safely on EBS). [Addendum after I wrote this bit] It was pointed out in discussion here at our office that one of the classic problems with running any disk management software such as RAID on EBS is that it incorrectly assumes a standard physical disk layout and profile, which EBS definitely does not have. From our brief look-through on this end, F2 (and similar "far" configs) is a software RAID variant designed very much to optimize for physical disk and spindle layouts; the concepts wouldn't hold up on EBS, and it's likely not a safe way to go.

Many of these benchmarks push the max throughput numbers they see from various setups.  The one linked in particular points out that a single drive maxed out at 65 Mb/sec versus RAID 10 maxing out at 55 Mb/sec.  Note that we are talking about maxing out, and about a single drive.  What we're most concerned with on EBS, however, is different: inconsistent performance and failure tolerance.

That is to say --- the nature of RAID 10 is more likely to give us a consistently high average throughput versus a single disk.  If that single disk slows down at all, everything slows down.  And of course, if that single disk fails, any apprehensiveness about RAID on EBS quickly turns to regret for not having RAID ;)

 
So overall, it seemed like RAID just wasn't worth the extra effort and
complexity. But more recently, I've been reading more positive things
about using RAID on EC2. I know that MongoDB recommends a RAID 10
configuration on production clusters now, and since it looks like
we're starting to hit an IO bottleneck here, I figured I should at
least give it a try in our testing environment.


Tying this into my previous block: keep in mind that maximum throughput is not the same thing as what a typical database workload needs.

What we've found, in general, is that for a database workload (and specifically MongoDB), RAID 10 on EBS makes the most sense.  The most important bit of RAID 10 (at least as done by Linux's software RAID tools) is that each disk access ultimately gets split into full-speed accesses to different drives.  You get a lot of the read/write performance of RAID 0, but you don't depend on any one drive holding the whole stripe.  This seems to fit particularly well with the EBS model and the question of "what does the physical layout of EBS' underlying disks look like compared to an actual raw, bought-it-at-Best-Buy-and-plugged-it-in physical disk?"  And since underneath the stripe we have mirroring, you get redundancy too: parts of the stripe can fail or ... more importantly on EBS: get slower.

The big winner in RAID 10 is reading, of course, in that RAID 10 only requires you to read from half of the mirror.  There's some tradeoff in writing (e.g. you don't get that same "touch half the mirror" benefit) but again, given that most databases (and MongoDB in particular) are not constantly driving the write heads to stream gigabits per second of data onto disk ... RAID 10 works.  (See: http://www.intel.com/support/chipsets/imsm/sb/CS-020655.htm)

So first of all, I was curious about how people were configuring RAID
in their environments. Anyone care to share their experiences and/or
mdadm commands? Based on what I've read so far (e.g.
http://orion.heroku.com/past/2009/7/29/io_performance_on_ebs/),
something like this seems reasonable:

 # mdadm --create /dev/md0 --level 10 --chunk 256 --raid-devices 4 /dev/sdc /dev/sdd /dev/sde /dev/sdf
 # blockdev --setra 65536 /dev/md0

Though it looks like there are actually a bunch of different ways of
setting up RAID 10 with mdadm, and I wasn't sure if any one way was
more correct than the others (at least as it relates to MongoDB). It
would be nice if the recommended commands were somewhere in the
MongoDB manual (if they are there, I couldn't find them).

So once the RAID device is created, is everyone using (or is it
recommended to use) LVM on top of that? I can see how that could be
useful to resize the volume. But then again, is it even possible to
resize the underlying RAID device once it's created? Excuse my
ignorance here. I've never actually tried to permanently add/remove
devices from a RAID device like this.

I got tasked during the run-up to releasing MongoDB 1.8 with testing a few interesting hypotheses regarding MongoDB on top of EBS, both with and without LVM.  They mostly related to the question of "can you take snapshot (LVM or EBS) backups without locking MongoDB and use the journal to safely recover from that backup, thereby getting a non-blocking, reliable backup of MongoDB using system-level tools?‡"  As part of that, I spent a lot of time putting together scripts to bring up a full LVM RAID 10 quickly on top of 4 EBS volumes.

I can't guarantee they are the optimal configuration, but they're put together from a variety of best-practices writeups on building RAID 10 on Linux that I dug through online.  They use LVM on top of MDADM, but your mileage may vary.  I have reused the setup recently to test larger sharded MongoDB clusters for some new features in our Hadoop driver and been very pleased with it.  I was even able to migrate a disk array to a whole new instance at one point, too, when it became necessary because I fat-fingered "terminate" instead of "stop" ;)

I've attached my script; consider this a disclaimer:

Please don't use this script blindly, without reading through it and understanding it, and don't assume it is the best way to do things.  It is provided as a guide to something I've been using for testing, not an "official 10gen RAID 10 LVM+MDADM" installer of any kind.  It may become one later, but for now it's just a pointer in the right direction. If you run my script without thinking about it or verifying the commands work on a test box, you'll make me very disappointed.

This was passed out to a few other people here for a second round of testing, where they ran some of the same tests I ran while I tested other things, so I rejiggered it (as represented in its current form) to be slightly interactive.

Here's the basic rundown (with a condensed sketch of the commands after the list).

*** Remember that the capacity of RAID 10 is (N/2) * S(min), where N is the number of drives in the set and S(min) is the smallest volume size.  For 4 40-gigabyte volumes you'll get 20 gigs of capacity.  See my note below on sizing your physical volume (the view of a 'real' disk that MDADM constructs from the EBS volumes) versus your logical volume (the actual filesystem-bearing chunk of diskiness that LVM slaps on top of MDADM's physical).

- Create your EBS volumes, attach them to your instance

- I (and by extension, my script) assume the existence of 4 disks: /dev/sdf, /dev/sdg, /dev/sdh, /dev/sdi ... make sure you attach them as such

- This script assumes & requires that SFDISK is installed.  SFDISK is a cute little fdisk-alike that allows you to pipe arguments into it from the shell --- awfully useful for doing scripted disk creation.  Most distros ship sfdisk by default (the Amazon AMI does, and AFAIK Ubuntu); if not, you can find it easily.

- We need to create physical partitions on the volumes that we mapped from EBS.  MDADM and LVM will build the RAID on top of these partitions.  We *DO NOT FORMAT* these partitions, however, because when we're done creating the RAID there will be one single EXT4 filesystem that floats above the 4 actual disks; the magic that allows pieces of this filesystem to live on each volume is handled by MDADM.

- I didn't do any testing with unevenly sized volumes, so at least with this configuration I'd recommend that you create them all the same size.

- I don't think much analysis or tweaking went into optimal extent size.  A few people here suggested, from anecdotal evidence on past client engagements etc., that 64M was a good default.  YMMV!  I seem to recall it affected that "note below" on snapshotting a bit, though --- partly because you define the logical volume not by physical disk size but by # of extents (the logical volume is built, by # of extents, on top of a volume group, which is built on top of a physical volume and is where the extent size gets defined.  Still with me?)

- I do *everything* with EXT4 in this script.  See all of our other notes in various forums and writeups about why.  Short version: EXT3, to allocate a 2 gig file on disk for MongoDB, must physically do 2 gigs worth of IO, writing 0s to the new file.  EXT4 supports allocating large files on the filesystem without writing out all $n gigs of 0s to denote emptiness.  Much better on IO and performance, which helps a lot when you're creating a virtual disk on top of a virtual network disk.  If you want to change from EXT4, I'd suggest XFS as the only 'MongoDB recommended' alternative.

- Finally, we don't mount your new volume by default, but the script does spit out a suggested entry for fstab.  Note the end of the script: the volume will have a slightly funky name.  The physical disk isn't part of the picture; the path to your new volume will be /dev/<volume group name>/<logical volume name>
    * This script creates a physical volume of /dev/md0
    * This script creates a volume group of "mongodb_vg"
    * This script creates a logical volume of "mongodb_lv"
    ** I believe that all of these names matter when it comes time to restore to another machine. Thus you may want to deliberately name them in a way that makes them UNIQUE to the machine and its role, so you don't have any potential naming collisions should you need to recover onto another host that already has an LVM setup
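
For reference, here's a condensed sketch of what the script builds (illustrative only --- read the actual script; the exact sfdisk invocation, the single-partition-per-volume step, and the 80%VG sizing are my assumptions here, not gospel):

 # for d in /dev/sdf /dev/sdg /dev/sdh /dev/sdi; do echo ',,fd' | sfdisk $d; done
 # mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1
 # pvcreate /dev/md0
 # vgcreate mongodb_vg /dev/md0
 # lvcreate --name mongodb_lv --extents 80%VG mongodb_vg
 # mkfs.ext4 /dev/mongodb_vg/mongodb_lv

The 80%VG is deliberate --- it leaves headroom in the volume group for the LVM snapshot business I get into below.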

And lastly, when using RAID on EC2, is it necessary to keep track of
which volumes are attached as which devices? It seems like you'd need
to know this in case the instance ever got terminated or whatever. Or
can mdadm automatically reassemble a RAID device somehow, given the
correct list of EBS volumes? I just want to make sure I'm keeping
track of everything I might need to know in case an instance
disappears.


ABSOLUTELY.  Take a look at this swiped-from-Wikipedia diagram of a typical RAID 10 setup (attached as "RAID 10.png"):

Notably, the position of all 4 disks matters, as it controls which mirror set they are in and which side of the stripe.  I don't know all the internals of MDADM, but I am doubtful that it allows, for example, the order of members A & B of mirror 1 to be interchanged.  I am going to have to get back to you on the process for doing a full recovery --- we are going to write up a lot of what I've just dumped into this email for posterity, and I'll do a separate batch of research and documentation on the restore process.

However, you definitely want to keep track.  I would suggest that when you do the first installation you safely save (in a notebook, a post-it, or something else) a record that looks something like:

MACHINE NAME | INSTANCE ID  | VOLUME ID | PHYSICAL PATH
--------------------------------------------------------
shard1-node1 |   i-abc123a  | vol-1234a |  /dev/sdf
shard1-node1 |   i-abc123a  | vol-1234b |  /dev/sdg
shard1-node1 |   i-abc123a  | vol-1234c |  /dev/sdh
shard1-node1 |   i-abc123a  | vol-1234d |  /dev/sdi
shard2-node1 |   i-abc456a  | vol-5432a |  /dev/sdf
shard2-node1 |   i-abc456a  | vol-5432b |  /dev/sdg
shard2-node1 |   i-abc456a  | vol-5432c |  /dev/sdh
shard2-node1 |   i-abc456a  | vol-5432d |  /dev/sdi

 * You may want to store the md<n>, volume group name, and logical volume name too, just in case.

This will be an important reference chart if you ever need to migrate the LVM setup to a different instance; you should attach the same volumes, in the same order, at the same device paths as originally.
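
One thing that helps here (a sketch, not a tested recovery runbook --- the proper restore writeup is still coming): mdadm can record the array definition itself, which complements the chart above:

 # mdadm --detail --scan >> /etc/mdadm.conf

Then on a replacement instance, after attaching the same volumes at the same paths:

 # mdadm --assemble --scan
 # pvscan
 # vgchange -ay mongodb_vg
 # mount /dev/mongodb_vg/mongodb_lv /data

(The config path may be /etc/mdadm/mdadm.conf depending on distro, and /data is just wherever you keep your dbpath.)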

As I said, I'll follow up later with any "How to actually move to a new machine" details and maybe even a screen shot walkthrough (Anyone else who wants to step in on that one can feel free).


FINALLY ... and this I'll leave as an exercise to you, the reader...  If you want to run backups on this new setup, you can't use EBS snapshots, as each one represents a quarter (technically half, but in reality a quarter) of a very particular dataset whose metadata is spread across all four volumes.  You'll need to do LVM snapshots or a typical mongodump through your servers.

If you want to do LVM snapshots, you need overhead capacity in the logical volume group, separate from what you actually map as a logical volume.  Typically this is equivalent to at least 2x the total size of your logical volume (there are ways to do partial snapshots based on the diff from the original disk, but I am not well versed in that aspect of LVM snapshotting).

There are good general, non-MongoDB-or-Amazon-specific directions for this online, but the basic idea is: if we have a RAID 10 with our MongoDB data stored on a 20 gig logical volume, we'll need at least another 20 gigs available in the same volume group.  So yes, even though we can do these "diff"-sized LVM snapshots, the space we allocate alongside our MongoDB data probably needs to be the size of our full MongoDB data.  I am uncertain whether doing a "partial" snapshot of a MongoDB dataset would work correctly.
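
To make that concrete, the snapshot dance looks roughly like this (a sketch using my script's names; the snapshot size and the mount/backup paths are illustrative --- the snapshot comes out of the volume group's spare extents):

 # lvcreate --snapshot --size 20G --name mongodb_snap /dev/mongodb_vg/mongodb_lv
 # mount -o ro /dev/mongodb_vg/mongodb_snap /mnt/snap
 # tar czf /backup/mongodb-backup.tar.gz -C /mnt/snap .
 # umount /mnt/snap
 # lvremove -f /dev/mongodb_vg/mongodb_snap

Drop the snapshot as soon as you're done copying --- writes to the origin volume get slower while a snapshot exists.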


-b



 ‡ The answer, by the way, is "yes".  If you have journaling enabled, I was able to reliably and consistently take EBS snapshots (ONLY WITH A SINGLE EBS VOLUME --- YOU CAN'T EBS-SNAPSHOT AN LVM SETUP) or LVM snapshots (given a RAID) of a running MongoDB, without locking, and upon restoring the backup, MongoDB used the journal to recover reliably.
ebs_lvm.sh
RAID 10.png

Brendan W. McAdams

May 19, 2011, 2:10:36 PM
to mongod...@googlegroups.com
Er, Math fail.  4 40 gig volumes should give you 80 gigs of capacity.

I even simplified the formula in my script to <volume_size> * 2, since the script requires all volumes to be the same size.

Michael Conigliaro

May 19, 2011, 2:25:04 PM
to mongodb-user
This is great. Thanks! I have a couple questions though:

"We need to create physical partitions on the volumes that we mapped
from
EBS"

Is this really true? I've found that this command works fine *without*
partitioning the underlying devices first:

# mdadm --create /dev/md0 --force --metadata=1.1 --level 10 --chunk 256 --raid-devices 4 /dev/sdc /dev/sdd /dev/sde /dev/sdf

Is there some reason why everyone creates partitions first? I've
actually moved away from doing this when I know I'm going to use the
entire device and/or when I think I might expand it someday. For the
latter, this means I can always skip the partition expansion step. I
can just grow the filesystem and be done. But maybe I'm missing
something here?
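
For reference, the no-partition grow path I'm describing is just this (illustrative --- it assumes the device has been replaced with a larger EBS volume restored from a snapshot, and the filesystem is unmounted):

 # e2fsck -f /dev/sdc
 # resize2fs /dev/sdc

whereas with a partition table, I'd also have to grow the partition itself before growing the filesystem.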

Secondly, this guy reported that "Larger chunk sizes on the raid made
a (shockingly) HUGE difference in performance. The sweet spot seemed
to be at 256k."

http://orion.heroku.com/past/2009/7/29/io_performance_on_ebs/

This would mean your script should include the option "--chunk 256."

He also says "A larger read ahead buffer on the raid also made a HUGE
difference. I bumped it from 256 bytes to 64k." I believe the way to
set this is by running the following command:

# blockdev --setra 65536 /dev/md0
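
I believe --setra counts 512-byte sectors rather than bytes, so 65536 works out to 32MB of read-ahead.  You can check the current value with:

 # blockdev --getra /dev/md0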

Any comments on what this guy has to say on his blog?

I also see a lot of people using the "--metadata=1.1" option. My
understanding is that this allows you to create much larger RAID
devices made up of many more physical devices. Any reason why we
*shouldn't* use this? I noticed that your script doesn't, but this
leads me to another question. At some point, your data is going to be
so large that you'll want to start sharding. So I'm guessing it
wouldn't really make sense to have a >2T disk when a single MongoDB
won't perform well with that much data. So are there any
recommendations on how big our disks should be? Maybe a quick rule of
thumb formula of some sort? At the moment, I'm creating 500GB volumes
for each MongoDB server (which has ~35GB of memory).

- Mike


Brendan W. McAdams

May 19, 2011, 3:18:20 PM
to mongod...@googlegroups.com
On Thu, May 19, 2011 at 2:25 PM, Michael Conigliaro <mike.co...@livingsocial.com> wrote:
This is great. Thanks! I have a couple questions though:

"We need to create physical partitions on the volumes that we mapped
from
EBS"

Is this really true? I've found that this command works fine *without*
partitioning the underlying devices first:

 # mdadm --create /dev/md0 --force --metadata=1.1 --level 10 --chunk 256 --raid-devices 4 /dev/sdc /dev/sdd /dev/sde /dev/sdf


I can't give you a definitive answer on that.  I can tell you that I tried a few setups using various online versions of the same command and had weird errors and such.  I can also tell you that any time someone gives me an online tutorial with --force in there, my skin begins to itch.  That "force" makes me wary.  I'd rather take the extra steps to do it right and safely than hope the tool figures out what I "meant" to do.

 
Is there some reason why everyone creates partitions first? I've
actually moved away from doing this when I know I'm going to use the
entire device and/or when I think I might expand it someday. For the
latter, this means I can always skip the partition expansion step. I
can just grow the filesystem and be done. But maybe I'm missing
something here?


There's the other side of it --- you might create multiple partitions on each EBS volume and use parts of each volume for different RAIDs.  I create the partitions by hand because most guides suggest it, and it made me feel safer about understanding EVERYTHING that was going on under the covers.

When I put on my robe and devops hat,  I tend to prefer to not rely on the "magic" of any tools doing what the tools author thought was the best practice if there's a step I really should be doing myself.  Note this is also why I don't want you just using my script without questioning it, so I am far from telling you "stop questioning my last email!" ;)

Secondly, this guy reported that "Larger chunk sizes on the raid made
a (shockingly) HUGE difference in performance. The sweet spot seemed
to be at 256k."
This would mean your script should include the option "--chunk 256."


Remember: the plural of anecdote isn't data.  He has lots of benchmarkey goodness, but again --- just as maximum throughput wasn't important to the question of which disk setup is best for MongoDB --- what type of testing and application is he using?  What is the effect of changing that chunk size on MongoDB's behavior?

These chunk sizes are similar to extents in that the effect of changing them varies greatly... A bigger chunk may give you better performance, but may be worse for certain applications like MongoDB.  To my knowledge we don't yet have guidelines on best chunk sizes either, but we recommend you go with the default until you find a reason otherwise.  FTR, I'd (personally) consider any max-throughput or general benchmark without application-specific settings not to be a reason otherwise.

He also says "A larger read ahead buffer on the raid also made a HUGE
difference. I bumped it from 256 bytes to 64k." I believe the way to
set this is by running the following command:

 # blockdev --setra 65536 /dev/md0

Any comments on what this guy has to say on his blog?


Again, my earlier comments apply.  The read-ahead suggestion sounds much less specious than the others, in that we're reasonable people who can expect that, yes, a bigger read-ahead buffer sounds like a good idea.  It may be that precisely because it sounds so facepalmingly obvious, it's actually the worse idea to go with... I'm really not sure. Certainly with MongoDB, since even writes involve a good bit of reading, it could be a good thing.  Since I'm obviously being conservative about all of this, I would want to validate:

 a) Is this still true? This article was written in July 2009 --- the stone ages for EC2 and EBS.  Has Amazon made changes in the past 2 years that obviate these settings? Does their hardware or software do things better or differently now? If so, would these recommendations make things worse? [I really don't know, but you should think about these things.]
 b) Corollary to (a)... What changes were made to LVM, MDADM, and the Linux kernel itself that may affect this? The IO scheduler has likely gotten a whole hell of a lot better.  Given that MongoDB also relies on a lot of kernel-level code like mmap(), the question of what changed in Linux is even more important.
 c) What do you notice in this benchmark?  He mainly benchmarks EXT3, and recommends XFS or JFS as the alternatives.  Notably, according to Wikipedia, EXT4 wasn't even considered stable until October 2008... so he isn't deliberately ignoring EXT4.  And our recommendation for EXT4 comes with the caveat "as recent a version of EXT4 as possible", because EXT4 has gotten a lot of tweaks and improvements over the last few years that older distros don't ship by default.  This benchmark is already basing its findings and recommendations partially on the exact filesystem we recommend you don't use. ;)


I also see a lot of people using the "--metadata=1.1" option. My
understanding is that this allows you to create much larger RAID
devices made up of many more physical devices. Any reason why we
*shouldn't* use this? I noticed that your script doesn't....


I just did a bit of research to confirm this... 

The funny thing is, there are a couple of metadata versions, and as often happens in software, each one seems better suited to certain tasks than the others, and less suited to some.  Some of that may be regressions; some may be deliberate changes in behavior, which is why the version is tunable.

Using https://raid.wiki.kernel.org/index.php/RAID_setup as a guideline though, since it is most likely to reflect the recommendations of the kernel and RAID tools developers... they currently recommend 1.2, "except when creating a boot partition, in which case use version 1.0 metadata and RAID-1" [1] (http://neil.brown.name/blog/20100519043730-002).
There's no mention at all of 1.1, and I suspect that at the time many of these recommendations were written, 1.1 was the shiny best option.  The footnote link recommending 1.2 is from May 2010, so it's already a newer guideline than what the Heroku writeup was working from.  Specifically, it seems 1.2 has a whole new, better system for storing block tables on disk.

It seems, at least as of when the wiki was last updated, that all of the tools default to 0.9.  Given that, I would say "yes, you should set --metadata", but I would recommend going with the most up-to-date guideline of 1.2.
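
Concretely, that just means being explicit at creation time --- e.g. (a sketch, reusing your devices and chunk size from above):

 # mdadm --create /dev/md0 --metadata=1.2 --level 10 --chunk 256 --raid-devices 4 /dev/sdc /dev/sdd /dev/sde /dev/sdf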

 
this leads me to another question. At some point, your data is going to be
so large that you'll want to start sharding. So I'm guessing it
wouldn't really make sense to have a >2T disk when a single MongoDB
won't perform well with that much data. So are there any
recommendations on how big our disks should be? Maybe a quick rule of
thumb formula of some sort? At the moment, I'm creating 500GB volumes
for each MongoDB server (which has ~35GB of memory).


I'm going to have to hedge/defer a bit on this one.  It all comes down to working set, and unfortunately the answer is "there is no exact formula or rule of thumb for calculating your working set".  Your working set is, in a sense, a picture of how much of your TOTAL MongoDB data actually gets touched during a given operational period.  "Touched" includes both reads and writes, as both affect this number.  And it includes indexes.  That working set should fit into your physical RAM.  It's especially when it doesn't that sharding becomes a factor --- sharding is the best way to chop your working set into smaller pieces.  Taking things to extremes: regardless of how much total data you have, if your working set were more than 64 gigs you'd be forced to shard, as you can't get more RAM than that in one instance from Amazon at the moment.

In many applications there is a significant difference between "total data" and "working set".  There are exceptions, but many people regularly access only a percentage of their data; when that percentage exceeds the amount of RAM available, you'll start to see slowdowns and should shard.

If you only have 100 gigs of data, know your working set is 16 gigs, and are doing fine for now, you're probably more than OK using smaller disks.  How much "ceiling" you need on top of that will depend on how much TOTAL DATA (including indexes) ON DISK you add for every increment you add to your working set.

It may be that for every 100 gigs of data you add, only 2 - 3 gigs of that adds to your working set.

When it comes time to shard you can split that all in half.

FINALLY --- keep in mind that the current requirement of --repair (both for crash recovery and for using it to shrink allocated space) is that you have twice the amount of disk space available as your total data size.  I learned this the hard way in production when I accidentally filled up my disk because a cron job that cleaned up daily audit data hadn't run for ~6 months (this was a long time ago, before I worked on MongoDB full time, and it taught me an important lesson :P).  Mongo crashed, with partially mangled files.  But I couldn't repair them (which would both fix the corruption *AND* shrink the files on disk back to the size they were actually using, rather than what was allocated) after I cleaned out the audit table, because I had no overhead.  I had to get the hosting provider to give me another disk to clean up off of.

So keep that in mind with what you allocate as well, in case you need it.  An upcoming version of MongoDB will have a better command for compaction, which only needs (total data size + 2 gigs).*

Cost is obviously a factor when keeping a bunch of big EBS volumes around but know your caveats before trying to cut corners.

See some of the other threads on working set to better answer these (completely valid, sound, and oft-asked) questions. Anything Eliot has said on working sets in particular can usually be taken as canon: http://groups.google.com/group/mongodb-user/browse_thread/thread/37f80ff39258e6f4


* That was the overhead estimate last I heard; that may have changed, and I'm not certain whether it's in the next major release or a further-flung one this year.

Michael Conigliaro

May 19, 2011, 4:49:20 PM
to mongodb-user
I should mention that the only reason *I* had to use mdadm --force is
because I've been doing different experiments with the same 4 devices,
and mdadm --create complains whenever it finds traces of old
filesystems.

But in general, I agree that premature optimization is almost always a bad
idea, and that I should just stick with the default settings for most
things until there's a really compelling reason to change them (which
it sounds like there isn't). Since I couldn't find any detailed docs
in the MongoDB manual, I just wanted to make sure I wasn't missing any
crucial information when it comes to setting this up. Thanks again for
the help!

- Mike


Brendan W. McAdams

May 19, 2011, 4:58:17 PM
to mongod...@googlegroups.com
We haven't cemented a lot of this into formal recommendations yet, but we're beginning to, so any questions you have along these lines help (despite the side effect of a long reply from me); they give us an idea of the kinds of things that ARE a concern and need to be codified as definitive answers.

In the meantime, send along any other findings you have and we'll be starting to create some Wiki documentation standardizing much of what I've outlined here as well as some other findings such as the metadata stuff.

-b



Dominik Gehl

Sep 16, 2011, 4:06:50 PM
to Brendan W. McAdams, mongod...@googlegroups.com
Hi again,

You mentioned that you tested both with and without LVM ... I was just
curious to know whether there were any performance benefits to using or
not using LVM.

Many thanks,
Dominik


Alexandre Fouché

Nov 4, 2011, 10:09:46 AM
to mongodb-user
Hi,
I know this is an old discussion, but on Amazon EBS, why are you using
LVM on top of mdadm striping? Is there a reason you still use RAID 0
or RAID 10 and put LVM onto it, instead of directly using LVM striping
(similar to RAID 0, but able to grow) without mdadm?

Mike Conigliaro

Nov 4, 2011, 11:40:12 AM
to mongod...@googlegroups.com
As far as I knew, LVM couldn't do RAID 10 (which was the RAID level
specifically recommended to me by several 10gen engineers). As for
using LVM striping alone, I'd be pretty hesitant to go without any
mirroring on a production database.

- Mike


Brendan W. McAdams

Nov 4, 2011, 11:55:35 AM
to mongod...@googlegroups.com
I believe Mike is correct: you cannot do RAID 10 natively on LVM, and have to set up two layers.

Separately, with regards to Partitioning...

Friends don't let friends run RAID0 in production.

Unless those friends run a hard drive repair & recovery service, of course ;)




Alexandre Fouché

Nov 5, 2011, 4:05:27 PM
to mongodb-user
> Friends don't let friends run RAID0 in production.

:-)
Of course I would not do that on dedicated servers, but as far as the
cloud is concerned, I saw somewhere in the AWS documentation that EBS
has redundancy and resiliency built in. Isn't that the case? Having
mdadm+LVM to look after is quite a burden, and adds to the complexity.



Eli Jones

Nov 6, 2011, 11:32:16 AM
to mongod...@googlegroups.com
The main problem with using RAID 0 on EC2 with EBS is that
occasionally EBS drives will experience significant slowdowns.

Say you have an 8-stripe set: when one of those drives hits a
bottleneck, the entire stripe will slow down and make your database
unresponsive.

So even if your data is 100% safe somewhere deep down in the guts of
that EBS drive, it's locked away at the long end of a very narrow
pipe, and you don't know when that drive will become responsive again.


Dominik Gehl

Nov 7, 2011, 8:29:43 AM
to mongod...@googlegroups.com
Using RAID 10 instead of RAID 0 really makes a huge difference in performance …

Dominik
