As many of you are aware, Foursquare had a significant outage this
week. The outage was caused by capacity problems on one of the
machines hosting the MongoDB database used for check-ins. This is an
account of what happened, why it happened, how it can be prevented,
and how 10gen is working to improve MongoDB in light of this outage.
It’s important to note that throughout this week, 10gen and Foursquare
engineers have been working together very closely to resolve the
issue.
* Some history
Foursquare has been hosting check-ins on a MongoDB database for some
time now. The database was originally running on a single EC2
instance with 66GB of RAM. About 2 months ago, in response to
increased capacity requirements, Foursquare migrated that single
instance to a two-shard cluster. Now, each shard was running on its
own 66GB instance, and both shards were also replicating to a slave
for redundancy. This was an important migration because it allowed
Foursquare to keep all of their check-in data in RAM, which is
essential for maintaining acceptable performance.
The data had been split into 200 evenly distributed chunks based on
user id. The first half went to one server, and the other half to the
other. Each shard had about 33GB of data in RAM at this point, and
the whole system ran smoothly for two months.
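For readers unfamiliar with how such a cluster is declared, here is a
minimal sketch of the setup described above, run from the mongo shell
against a mongos router. The database, collection, and host names are
invented for illustration; only the shard key on user id reflects the
actual deployment.

    // run from the mongo shell connected to a mongos
    use admin

    // register the two 66GB shard servers (hostnames are hypothetical)
    db.runCommand({ addshard: "shard0.example.com:27017" })
    db.runCommand({ addshard: "shard1.example.com:27017" })

    // enable sharding for the database and shard the check-in
    // collection on user id
    db.runCommand({ enablesharding: "foursquare" })
    db.runCommand({ shardcollection: "foursquare.checkins",
                    key: { userid: 1 } })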
* What we missed in the interim
Over these two months, check-ins were being written continually to
each shard. Unfortunately, these check-ins did not grow evenly across
chunks. It’s easy to imagine how this might happen: assuming certain
subsets of users are more active than others, it’s conceivable that
their updates might all go to the same shard. That’s what occurred in
this case, resulting in one shard growing to 66GB and the other only
to 50GB. [1]
* What went wrong
On Monday morning, the data on one shard (we’ll call it shard0)
finally grew to about 67GB, surpassing the 66GB of RAM on the hosting
machine. Whenever data size grows beyond physical RAM, it becomes
necessary to read and write to disk, which is orders of magnitude
slower than reading and writing RAM. Thus, certain queries started to
become very slow, and this caused a backlog that brought the site
down.
We first attempted to fix the problem by adding a third shard. We
brought the third shard up and started migrating chunks. Queries were
now being distributed to all three shards, but shard0 continued to hit
disk very heavily. When this failed to correct itself, we ultimately
discovered that the problem was due to data fragmentation on shard0.
In essence, although we had moved 5% of the data from shard0 to the
new third shard, the data files, in their fragmented state, still
needed the same amount of RAM. This can be explained by the fact that
Foursquare check-in documents are small (around 300 bytes each), so
many of them can fit on a 4KB page. Removing 5% of these just made
each page a little more sparse, rather than removing pages
altogether.[2]
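To make the arithmetic concrete, here is a quick back-of-envelope
calculation; you can paste it into the mongo shell, which is just a
JavaScript interpreter. It assumes the moved documents are spread
uniformly at random across pages, which is a simplification.

    // roughly 13 check-in documents of ~300 bytes share each 4KB page
    var docsPerPage = Math.floor(4096 / 300);      // 13

    // if 5% of documents are moved off the shard at random, the chance
    // that every document on a given page is moved (i.e. that the page
    // can actually drop out of RAM) is vanishingly small
    var pPageFreed = Math.pow(0.05, docsPerPage);  // ~1.2e-17

    // instead, each page just goes from ~13 live documents to ~12.35
    // on average, so essentially every page must still be cached
    var avgDocsLeft = docsPerPage * 0.95;          // ~12.35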
After the first day's outage it had become clear that chunk migration,
sans compaction, was not going to solve the immediate problem. By the
time the second day's outage occurred, we had already moved 5% of the
data off of shard0, so we decided to start an offline process to
compact the database using MongoDB’s repairDatabase() feature. This
process took about 4 hours (partly due to the data size, and partly
because of the slowness of EBS volumes at the time). At the end of
the 4 hours, the RAM requirements for shard0 had in fact been reduced
by 5%, allowing us to bring the system back online.
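For reference, the compaction step itself is a single command run
against the shard's mongod directly (not through mongos), with that
shard taken out of service; it rewrites the data files, which is why it
needs offline time and, roughly, free disk space on the order of the
data size. The database name here is hypothetical.

    // on the shard0 mongod, after taking it out of rotation
    use foursquare
    db.repairDatabase()   // rewrites data files, reclaiming fragmented space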
* Afterwards
Since repairing shard0 and adding a third shard, we’ve set up even
more shards, and now the check-in data is evenly distributed and there
is a good deal of extra capacity. Still, we had to address the
fragmentation problem. We ran a repairDatabase() on the slaves, and
promoted the slaves to masters, further reducing the RAM needed on
each shard to about 20GB.
* How is this issue triggered?
Several conditions need to be met to trigger the issue that brought
down Foursquare:
1. Systems are at or over capacity. How capacity is defined varies; in
the case of Foursquare, all data needed to fit into RAM for acceptable
performance. Other deployments may not have such strict RAM
requirements.
2. Document size is less than 4KB. Such documents, when moved, may be
too small to free up pages and, thus, memory.
3. Shard key order and insertion order are different. This prevents
data from being moved in contiguous chunks.
Most sharded deployments will not meet these criteria. Anyone whose
documents are larger than 4KB will not suffer significant
fragmentation because the pages that aren’t being used won’t be
cached.
* Prevention
The main thing to remember here is that once you’re at max capacity,
it’s difficult to add more capacity without some downtime when objects
are small. However, if caught in advance, adding more shards on a
live system can be done with no downtime.
For example, if we had notifications in place to alert us 12 hours
earlier that we needed more capacity, we could have added a third
shard, migrated data, and then compacted the slaves.
Another salient point: when you're operating at or near capacity,
realize that if things get slow at your hosting provider, you may
suddenly find yourself effectively over capacity.
* Final Thoughts
The 10gen tech team is working hard to correct the issues exposed by
this outage. We will continue to work as hard as possible to ensure
that everyone using MongoDB has the best possible experience. We are
thankful for the support that we have received from Foursquare and our
community during this unfortunate episode. As always, please let us
know if you have any questions or concerns.
[1] Chunks get split when they reach 200MB, into two 100MB halves. This
means that even if the number of chunks on each shard is the same, the
data size is not always the same. This is something we are going to
address in MongoDB: we'll make splitting and balancing look for this
imbalance so that the system can act on it.
[2] The 10gen team is working on doing online incremental compaction
of both data files and indexes. We know this has been a concern in
non-sharded systems as well. More details about this will be coming
in the next few weeks.
Currently, no. We're going to be working on doing compaction in the
background, though, so you won't have to do this large offline
compaction at all.
> - It seems that when data grows out of memory, the performance seems
> to degrade significantly. If only part of the data pages are touched
> (because only some users are active), why is the paging/caching
> mechanism not effective, and why do the disk read/write rates get
> high and bog down request processing?
Foursquare needed read access to more blocks per second than even a
4-drive RAID could keep up with.
This is not normal for most sites, as many MongoDB installations have
storage sizes 2x or even up to 100x RAM.
MongoDB should be better about handling situations like this and
degrade much more gracefully.
We'll be working on these enhancements soon as well.
> - For shard key selection, it looks like if the shard key follows
> insertion order, it will cause an uneven load on one machine under a
> high insert rate. If not, it may cause some problems when data is
> moved. What should the best practices be? Does it depend on the type
> of load?
It's very hard to give a general answer. As you said, a key based on
insertion order means one shard takes all the writes.
In some cases that's fine. If total data size is massive, but insert
load is modest, then that might be a great solution.
If you need to distribute writes, then you might fall into this other
issue. We're working on compaction so it won't actually be an issue,
but in the meantime you can follow the instructions above to add
shards.
> - Is there some way to identify and alert that additional shards/
> machines are needed by looking at the statistics information? It would
> be very helpful to prevent this kind of problem. We have a similar
> situation as well. The performance is not degraded gradually. When
> something hits, it hits really hard and we have to take action
> immediately but additional machines may not be available immediately.
First you have to understand which resources your system cares most
about. It can be RAM, disk, or CPU. If it's RAM or CPU, you can use
the mongo db.stats() command to see total requirements.
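As a sketch of what that looks like (the database name and the numbers
are invented), the relevant fields are dataSize and indexSize:

    > use foursquare
    > db.stats()
    {
        "collections" : 4,
        "objects" : 150000000,
        "avgObjSize" : 300,
        "dataSize" : 45000000000,     // bytes of documents
        "storageSize" : 52000000000,  // bytes allocated on disk
        "indexSize" : 9000000000,     // bytes of indexes
        ...
    }

If everything needs to be resident, dataSize + indexSize is the number
to compare against the RAM on each shard.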
I'm not actually sure, but it does not really impact anything else.
> 2. were replicas used for distributing reads?
No - all reads needed to be consistent.
> 3. could you bring up read-only replicas to deal with the increased
> read access time?
The read load was so heavy that even doubling the number of disks
wouldn't have helped.
> 4. why the new shard could take only 5% of the data?
It could have taken more, and in fact it now has 50%; we were trying to
move as little as possible to get the site back up as fast as possible.
> 5. is there a real solution to the chunk migration/page size issue?
Yes - we are working on online compaction, which would mitigate this.
> 6. how could one plan for growth when sharding is based on a key that
> doesn't follow insertion order?
In the post we describe a method for adding shards without any
downtime in that case.
Briefly, it's: add a shard, let balancing move the data, compact the
slaves, then swap master and slave.
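Spelled out a little more, a sketch of that sequence from the mongo
shell (host and database names are made up; the promotion step depends
on whether you run master/slave or replica sets):

    // 1. add the new shard through a mongos and let the balancer
    //    migrate chunks off the overloaded shard
    use admin
    db.runCommand({ addshard: "shard2.example.com:27017" })

    // 2. once migration has settled, on each shard's slave:
    use foursquare
    db.repairDatabase()   // compacts the now-sparser data files

    // 3. swap master and slave so the compacted copy serves traffic,
    //    then repair the old master in turn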
>
> On Oct 8, 2:32 am, harryh <har...@gmail.com> wrote:
>> Just as a quick follow-up to Eliot's detailed message on our outage I
>> want to make it clear to the community how helpful Eliot and the rest
>> of the gang at 10gen has been to us in the last couple days. We've
>> been in near constant communication with them as we worked to repair
>> things and end the outage, and then to identify and get to work on
>> fixing things so that this sort of thing won't happen again.
>>
>> They've been a huge help to us, and things would have been much worse
>> without them.
>>
>> Further, I am very confident that some of the issues that we've
>> identified in Mongo will be dealt with so others don't encounter
>> similar problems. Overall we still remain huge fans of MongoDB @
>> foursquare, and expect to be using it for a long time to come.
>>
>> I'm more than happy to help answer any concerns that others might have
>> about MongoDB as it relates to this incident.
>>
>> -harryh, eng lead @ foursquare
>
> I think the most obvious question is: How do we avoid maxing out our
> mongodb nodes and know when to provision new nodes?
> With the assumption that you're monitoring everything.
>
> What do we need to look at? How do we tell? What happens if you're a
> company with lots of changing scaling needs and features...it's tough
> to capacity plan when you're a startup.
This is very application dependent. In some cases you need all your
indexes in RAM; in other cases it can be just a small working set.
One good way to determine this is to figure out how much data you
touch in a 10 minute period - indexes and documents. Make sure you can
either store that in RAM or read that from disk in the same amount of
time.
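As a rough, made-up example of that rule of thumb (all numbers are
invented):

    // data touched in a 10 minute window: documents plus the index
    // entries needed to find them
    var docsTouched  = 2000000;
    var bytesPerDoc  = 300;
    var indexBytes   = 200 * 1024 * 1024;
    var touchedBytes = docsTouched * bytesPerDoc + indexBytes;  // ~800MB

    // either this fits in RAM alongside everything else that is hot,
    // or the disks must deliver it within the same 10 minutes
    var mbPerSec = touchedBytes / (10 * 60) / (1024 * 1024);    // ~1.3 MB/s

    // remember that these are mostly random 4KB reads, so spinning
    // disks will deliver far less than their sequential throughput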
> One solution would've been to memory limit how much mongo uses but I
> don't think this is possible given Mongo uses mmap'd files right now.
> Merely watching IO and Swap I don't think is a solution since it may
> be too late at that point.
This wouldn't really help. Generally the more RAM you have the better;
the memory MongoDB uses is a cache for your data, so the more you give
it, the better it performs.
You need to determine your working set. See the comment a reply or two
above this one.
> Is the MongoDB formula significantly different from the traditional
> RDBMS, where performance is best when RAM buffers are large enough to
> contain most commonly accessed / MRU blocks?
It's really the same in MongoDB as in a traditional RDBMS.
> How much of Foursquare's problem is related to the lower IO rates of
> EC2/EBS compared to fast, local, dedicated disks?
I don't think even fast local disks would have prevented this. There
is a chance SSDs might have, but we have not done all the math to know
for sure.
Faster disks would probably have made it easier to manipulate things
to reduce downtime, however.
On 10/07/2010 04:47 PM, Nat wrote:
> - It seems that when data grows out of memory, the performance seems
> to degrade significantly.
It is when the working set grows out of memory that you have a problem,
not the data as a whole. Note that this is a problem with any kind of
system and is not specific to MongoDB.
What happens when you run out of memory (i.e. more operations proceed
at disk speed) is that things get a lot worse because of concurrency.
If a query now takes a second instead of 100ms, then it is far more
likely another query will come in during that extra time. Both queries
will now be competing for the disk. Then a third will come along,
making things even worse again. The disk seek times really hurt.
What MongoDB needs to do is decrease concurrency under load so that existing
queries can complete before allowing new ones to add fuel to the fire.
There is a ticket for this:
http://jira.mongodb.org/browse/SERVER-574
Roger
MattK> How does one determine what the RAM requirements are? For
MattK> databases that extend into the TB range, the same quantity of
MattK> RAM is not an option.
As already mentioned, every use case is different; what is important is
that you have enough RAM to house your working set and indexes.
http://www.markus-gattol.name/ws/mongodb.html#how_much_ram_does_mongodb_need
MattK> Is the MongoDB formula significantly different from the
MattK> traditional RDBMS, where performance is best when RAM buffers
MattK> are large enough to contain most commonly accessed / MRU blocks?
The formula is not different, but the impact in both directions (enough
vs. not enough RAM to hold the working set and indices) carries more
weight; i.e. you will usually (with default configurations) see MongoDB
being faster than an RDBMS if there is enough RAM, but worse if MongoDB
needs to hit the disk.
MattK> How much of Foursquare's problem is related to the lower IO
MattK> rates of EC2/EBS compared to fast, local, dedicated disks?
If you need to go to disk for I/O beyond the default fsyncs etc., then
you already have a problem. The best way to mitigate the pain at that
point is hardware that handles random I/O well, i.e. an SSD (Solid
State Drive) rather than the usual HDDs.
http://www.markus-gattol.name/ws/mongodb.html#speed_impact_of_not_having_enough_ram
Dinesh> I mean, the expectation that data will not grow beyond your RAM
Dinesh> is a very risky constraint.
I would disagree with that. The fact is that the optimal situation is
that your entire data set, including indices, fits into RAM. However,
this will only be true for small sites using MongoDB as a data tier of
some sort.
The key concern is whether or not the working set fits into RAM at all
times. This pattern is different per use case and situation (slashdot
effect). With sites like Foursquare, the data set will always be bigger
than the actual working set, and certainly bigger than the available
RAM. Monitoring whether or not your data set is growing bigger than RAM
therefore does not make sense. You as a technician need to know your
use case and plan/develop accordingly, as MongoDB cannot free you from
those duties.
Dinesh> So there should've been *some* sort of monitoring in place to
Dinesh> warn you that you're about to run out of resources.
There are statistics available that you can use:
http://www.markus-gattol.name/ws/mongodb.html#server_status
See also the links to the munin and nagios plugins from that page.
The bottom line is that the information is manifold and available; it
is up to the person who runs a setup to interpret it and act/plan
accordingly (rationale: see above).
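For example, a couple of the fields reported by db.serverStatus(), the
same ones the munin/nagios plugins graph, look like this (values
abridged and invented):

    > db.serverStatus().mem
    { "resident" : 61234, "virtual" : 72512, "mapped" : 68000 }   // MB

    > db.serverStatus().opcounters
    { "insert" : 13427, "query" : 91842, "update" : 22103, ... }

Resident memory creeping up toward physical RAM while mapped size keeps
growing is exactly the kind of pattern to alert on.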
> Placing a limit on the memory used by MongoDB artificially reduces the
> overall system capacity. But, it creates a more reliable early warning
> with a fast path to a temporary fix. The temporary fix will usually
> last long enough to get new shards online.
Theoretically you could lower the aforementioned dirty percentage using
the VM tuning tools (on Linux, /proc/sys/vm/...). If the working set is
dirtying pages beyond the thresholds, then the OS will start flushing
to disk and mongo will get slow; you could then simply raise the VM
thresholds and get this temporary runway.
>
> Are you sure the root problem was write throughput? That doesn't fit
> entirely with the explanation of why adding a new shard didn't fix the
> problem. Moving blocks to a new machine should reduce the write rate
> on the original shard immediately, even if there is fragmentation that
> prevents the working set size from shrinking.
>
> Isn't it more likely that the working set size for read exceeded
> memory, causing reads to hit disk sometimes? The new read load was big
> enough that moving a percentage of the write load onto a new shard did
> not fix the IO contention on the original machine. Increasing the
> available RAM would have removed the read IO entirely, fixing the
> problem and creating a runway.
>
Very possibly, yes. If I understand the VM (Linux at least), reading
all over an mmap'd space does count toward the dirty page ratio, but
reads are perhaps bounded by all of available RAM (swap too?). If one
reads beyond what's available, then other pages are evicted. I can't
seem to find any reference describing which pages are chosen for
eviction - perhaps LRU, but that's just a guess.
To provide the runway you proposed, using the dirty page settings
would not help in the case of heavily randomized reads, I guess.
Another way to do the mlock safeguard is to set the min_free_kbytes
setting to include this runway and simply remove that runway amount
when things get hairy enough.
I know SSD drives were already brought up, but if the problem was in
fact random reads and not necessarily writes, SSDs could have helped.
My guess is that it was probably both.
http://www.fusionio.com/products/
> 1) It sounds like all the database is in the working set. I would have
> guessed that only a small fraction of all historical check-ins are
> likely to be read (<1%), and that old check-ins are likely to be
> clustered on pages that aren't read often, so that is surprising to
> me. Can you say what fraction of the objects are likely to be read in
> a given hour and what the turnover of working set objects is?
I can't give a lot of details here, but a very high percentage of
documents were touched very often.
Much higher than you would expect.
> 2) Later in the thread Eliot noted:
> MongoDB should be better about handling situations like this and
> degrade much more gracefully.
> We'll be working on these enhancements soon as well.
>
> What kind of enhancements are you planning? Better VM management?
> Re-clustering objects to push inactive objects into pages that are on
> disk? A paging scheme like Redis uses (as described in
> http://antirez.com/post/what-is-wrong-with-2006-programming.html)?
The big issue really is concurrency. The VM side works well; the
problem is that the read-write lock is too coarse.
One thread that causes a fault can cause a bigger slowdown than it
should be able to.
We'll be addressing this in a few ways: making yielding more
intelligent, and adding real intra-collection concurrency.
> 3) From the description, it seems like you should be able to monitor
> the resident memory used by the mongo db process to get an alert as
> shard memory ran low. Is that viable? If not, what can be done to
> identify cases where the working set is approaching available? It
> seems like monitoring memory/working sets for mongo db instance would
> be a generally useful facility - are there plans to add this
> capability? What's the best practice today?
The key metric to look at is disk operations per second. If this
starts trending up, that's a good warning sign.
Memory is a bit harder, because some things don't have to be in RAM
(e.g. the transaction log) but will be if nothing else is using it.
> 4) With respect to using SSD's - what is the write pattern for pages
> that get evicted? If they are written randomly using OS paging, I
> wouldn't expect there to be a benefit from SSD's (you wouldn't be able
> to evict pages fast enough), although if mongodb were able to evict
> larger chunks of pages from RAM so you had fewer bigger random writes,
> with lots of smaller random reads, that could be a big win.
Not sure I follow. The random writes/reads are why we think SSDs are
so good (and they have been in our testing).
-Eliot
> Thanks,
> Ron