(Note: this is being posted with Foursquare’s permission.)
As many of you are aware, Foursquare had a significant outage this week. The outage was caused by capacity problems on one of the machines hosting the MongoDB database used for check-ins. This is an account of what happened, why it happened, how it can be prevented, and how 10gen is working to improve MongoDB in light of this outage.
It’s important to note that throughout this week, 10gen and Foursquare engineers have been working together very closely to resolve the issue.
* Some history Foursquare has been hosting check-ins on a MongoDB database for some time now. The database was originally running on a single EC2 instance with 66GB of RAM. About 2 months ago, in response to increased capacity requirements, Foursquare migrated that single instance to a two-shard cluster. Now, each shard was running on its own 66GB instance, and both shards were also replicating to a slave for redundancy. This was an important migration because it allowed Foursquare to keep all of their check-in data in RAM, which is essential for maintaining acceptable performance.
The data had been split into 200 evenly distributed chunks based on user id. The first half went to one server, and the other half to the other. Each shard had about 33GB of data in RAM at this point, and the whole system ran smoothly for two months.
* What we missed in the interim Over these two months, check-ins were being written continually to each shard. Unfortunately, these check-ins did not grow evenly across chunks. It’s easy to imagine how this might happen: assuming certain subsets of users are more active than others, it’s conceivable that their updates might all go to the same shard. That’s what occurred in this case, resulting in one shard growing to 66GB and the other only to 50GB. [1]
* What went wrong On Monday morning, the data on one shard (we’ll call it shard0) finally grew to about 67GB, surpassing the 66GB of RAM on the hosting machine. Whenever data size grows beyond physical RAM, it becomes necessary to read and write to disk, which is orders of magnitude slower than reading and writing RAM. Thus, certain queries started to become very slow, and this caused a backlog that brought the site down.
We first attempted to fix the problem by adding a third shard. We brought the third shard up and started migrating chunks. Queries were now being distributed to all three shards, but shard0 continued to hit disk very heavily. When this failed to correct itself, we ultimately discovered that the problem was due to data fragmentation on shard0. In essence, although we had moved 5% of the data from shard0 to the new third shard, the data files, in their fragmented state, still needed the same amount of RAM. This can be explained by the fact that Foursquare check-in documents are small (around 300 bytes each), so many of them can fit on a 4KB page. Removing 5% of these just made each page a little more sparse, rather than removing pages altogether.[2]
After the first day's outage it had become clear that chunk migration, sans compaction, was not going to solve the immediate problem. By the time the second day's outage occurred, we had already move 5% of the data off of shard0, so we decided to start an offline process to compact the database using MongoDB’s repairDatabase() feature. This process took about 4 hours (partly due to the data size, and partly because of the slowness of EBS volumes at the time). At the end of the 4 hours, the RAM requirements for shard0 had in fact been reduced by 5%, allowing us to bring the system back online.
* Afterwards Since repairing shard0 and adding a third shard, we’ve set up even more shards, and now the check-in data is evenly distributed and there is a good deal of extra capacity. Still, we had to address the fragmentation problem. We ran a repairDatabase() on the slaves, and promoted the slaves to masters, further reducing the RAM needed on each shard to about 20GB.
* How is this issue triggered? Several conditions need to be met to trigger the issue that brought down Foursquare: 1. Systems are at or over capacity. How capacity is defined varies; in the case of Foursquare, all data needed to fit into RAM for acceptable performance. Other deployments may not have such strict RAM requirements. 2. Document size is less than 4k. Such documents, when moved, may be too small to free up pages and, thus, memory. 3. Shard key order and insertion order are different. This prevents data from being moved in contiguous chunks.
Most sharded deployments will not meet these criteria. Anyone whose documents are larger than 4KB will not suffer significant fragmentation because the pages that aren’t being used won’t be cached.
* Prevention The main thing to remember here is that once you’re at max capacity, it’s difficult to add more capacity without some downtime when objects are small. However, if caught in advance, adding more shards on a live system can be done with no downtime.
For example, if we had notifications in place to alert us 12 hours earlier that we needed more capacity, we could have added a third shard, migrated data, and then compacted the slaves.
Another salient point: when you’re operating at or near capacity, realize that if things get slow at your hosting provider, you may find yourself all of a sudden effectively over capacity.
* Final Thoughts The 10gen tech team is working hard to correct the issues exposed by this outage. We will continue to work as hard as possible to ensure that everyone using MongoDB has the best possible experience. We are thankful for the support that we have received from Foursquare and our community during this unfortunate episode. As always, please let us know if you have any questions or concerns.
[1] Chunks get split when they are 200MB into 2 100MB halves. This means that even if the number of chunks on each shard was the same, data size is not always so. This is something we are going to be addressing in MongoDB. We'll be making splitting balancing look for this imbalance so it can act upon it.
[2] The 10gen team is working on doing online incremental compaction of both data files and indexes. We know this has been a concern in non-sharded systems as well. More details about this will be coming in the next few weeks.
Just as a quick follow-up to Eliot's detailed message on our outage I
want to make it clear to the community how helpful Eliot and the rest
of the gang at 10gen has been to us in the last couple days. We've
been in near constant communication with them as we worked to repair
things and end the outage, and then to identify and get to work on
fixing things so that this sort of thing won't happen again.
They've been a huge help to us, and things would have been much worse
without them.
Further, I am very confident that some of the issues that we've
identified in Mongo will be dealt with so others don't encounter
similar problems. Overall we still remain huge fans of MongoDB @
foursquare, and expect to be using it for a long time to come.
I'm more than happy to help answer any concerns that others might have
about MongoDB as it relates to this incident.
Thanks for very insightful post-mortem. I have a couple more questions
to ask:
- Can repairDatabase() leverage multiple cores? Given that data is
broken down into chunks, can we process them in parallel? 4 hours
downtime seems to be eternal in internet space.
- It seems that when data grows out of memory, the performance seems
to degrade significantly. If only part of the data pages are touched
(because only some users are active), why the paging/caching mechanism
is not effective and cause the disk read/write rates to be high and
bog down the request processing?
- For shard key selection, it looks like if insertion order determines
shard key, it will cause an uneven load on a machine for high insert.
If not, it may cause some problem when data is moved. What should be
the best practices? Does it depend on type of load?
- Is there some way to identify and alert that additional shards/
machines are needed by looking at the statistics information? It would
be very helpful to prevent this kind of problem. We have a similar
situation as well. The performance is not degraded gradually. When
something hits, it hits really hard and we have to take action
immediately but additional machines may not be available immediately.
Thanks,
On Oct 8, 6:05 am, Eliot Horowitz <eliothorow...@gmail.com> wrote:
> (Note: this is being posted with Foursquare’s permission.)
> As many of you are aware, Foursquare had a significant outage this
> week. The outage was caused by capacity problems on one of the
> machines hosting the MongoDB database used for check-ins. This is an
> account of what happened, why it happened, how it can be prevented,
> and how 10gen is working to improve MongoDB in light of this outage.
> It’s important to note that throughout this week, 10gen and Foursquare
> engineers have been working together very closely to resolve the
> issue.
> * Some history
> Foursquare has been hosting check-ins on a MongoDB database for some
> time now. The database was originally running on a single EC2
> instance with 66GB of RAM. About 2 months ago, in response to
> increased capacity requirements, Foursquare migrated that single
> instance to a two-shard cluster. Now, each shard was running on its
> own 66GB instance, and both shards were also replicating to a slave
> for redundancy. This was an important migration because it allowed
> Foursquare to keep all of their check-in data in RAM, which is
> essential for maintaining acceptable performance.
> The data had been split into 200 evenly distributed chunks based on
> user id. The first half went to one server, and the other half to the
> other. Each shard had about 33GB of data in RAM at this point, and
> the whole system ran smoothly for two months.
> * What we missed in the interim
> Over these two months, check-ins were being written continually to
> each shard. Unfortunately, these check-ins did not grow evenly across
> chunks. It’s easy to imagine how this might happen: assuming certain
> subsets of users are more active than others, it’s conceivable that
> their updates might all go to the same shard. That’s what occurred in
> this case, resulting in one shard growing to 66GB and the other only
> to 50GB. [1]
> * What went wrong
> On Monday morning, the data on one shard (we’ll call it shard0)
> finally grew to about 67GB, surpassing the 66GB of RAM on the hosting
> machine. Whenever data size grows beyond physical RAM, it becomes
> necessary to read and write to disk, which is orders of magnitude
> slower than reading and writing RAM. Thus, certain queries started to
> become very slow, and this caused a backlog that brought the site
> down.
> We first attempted to fix the problem by adding a third shard. We
> brought the third shard up and started migrating chunks. Queries were
> now being distributed to all three shards, but shard0 continued to hit
> disk very heavily. When this failed to correct itself, we ultimately
> discovered that the problem was due to data fragmentation on shard0.
> In essence, although we had moved 5% of the data from shard0 to the
> new third shard, the data files, in their fragmented state, still
> needed the same amount of RAM. This can be explained by the fact that
> Foursquare check-in documents are small (around 300 bytes each), so
> many of them can fit on a 4KB page. Removing 5% of these just made
> each page a little more sparse, rather than removing pages
> altogether.[2]
> After the first day's outage it had become clear that chunk migration,
> sans compaction, was not going to solve the immediate problem. By the
> time the second day's outage occurred, we had already move 5% of the
> data off of shard0, so we decided to start an offline process to
> compact the database using MongoDB’s repairDatabase() feature. This
> process took about 4 hours (partly due to the data size, and partly
> because of the slowness of EBS volumes at the time). At the end of
> the 4 hours, the RAM requirements for shard0 had in fact been reduced
> by 5%, allowing us to bring the system back online.
> * Afterwards
> Since repairing shard0 and adding a third shard, we’ve set up even
> more shards, and now the check-in data is evenly distributed and there
> is a good deal of extra capacity. Still, we had to address the
> fragmentation problem. We ran a repairDatabase() on the slaves, and
> promoted the slaves to masters, further reducing the RAM needed on
> each shard to about 20GB.
> * How is this issue triggered?
> Several conditions need to be met to trigger the issue that brought
> down Foursquare:
> 1. Systems are at or over capacity. How capacity is defined varies; in
> the case of Foursquare, all data needed to fit into RAM for acceptable
> performance. Other deployments may not have such strict RAM
> requirements.
> 2. Document size is less than 4k. Such documents, when moved, may be
> too small to free up pages and, thus, memory.
> 3. Shard key order and insertion order are different. This prevents
> data from being moved in contiguous chunks.
> Most sharded deployments will not meet these criteria. Anyone whose
> documents are larger than 4KB will not suffer significant
> fragmentation because the pages that aren’t being used won’t be
> cached.
> * Prevention
> The main thing to remember here is that once you’re at max capacity,
> it’s difficult to add more capacity without some downtime when objects
> are small. However, if caught in advance, adding more shards on a
> live system can be done with no downtime.
> For example, if we had notifications in place to alert us 12 hours
> earlier that we needed more capacity, we could have added a third
> shard, migrated data, and then compacted the slaves.
> Another salient point: when you’re operating at or near capacity,
> realize that if things get slow at your hosting provider, you may find
> yourself all of a sudden effectively over capacity.
> * Final Thoughts
> The 10gen tech team is working hard to correct the issues exposed by
> this outage. We will continue to work as hard as possible to ensure
> that everyone using MongoDB has the best possible experience. We are
> thankful for the support that we have received from Foursquare and our
> community during this unfortunate episode. As always, please let us
> know if you have any questions or concerns.
> [1] Chunks get split when they are 200MB into 2 100MB halves. This
> means that even if the number of chunks on each shard was the same,
> data size is not always so. This is something we are going to be
> addressing in MongoDB. We'll be making splitting balancing look for
> this imbalance so it can act upon it.
> [2] The 10gen team is working on doing online incremental compaction
> of both data files and indexes. We know this has been a concern in
> non-sharded systems as well. More details about this will be coming
> in the next few weeks.
> Just as a quick follow-up to Eliot's detailed message on our outage I
> want to make it clear to the community how helpful Eliot and the rest
> of the gang at 10gen has been to us in the last couple days. We've
> been in near constant communication with them as we worked to repair
> things and end the outage, and then to identify and get to work on
> fixing things so that this sort of thing won't happen again.
> They've been a huge help to us, and things would have been much worse
> without them.
> Further, I am very confident that some of the issues that we've
> identified in Mongo will be dealt with so others don't encounter
> similar problems. Overall we still remain huge fans of MongoDB @
> foursquare, and expect to be using it for a long time to come.
> I'm more than happy to help answer any concerns that others might have
> about MongoDB as it relates to this incident.
I'm curious if foursquare had other options for shard key. Can anyone
speak to whether, if they were starting from scratch, they have better
choices now for choosing shard keys?
On Oct 7, 8:01 pm, Alex Popescu <the.mindstorm.mailingl...@gmail.com>
wrote:
> 1. were writes configured to work with MongoDB's fire and forget
> behavior?
> 2. were replicas used for distributing reads?
> 3. could you bring up read-only replicas to deal with the increased
> read access time?
> 4. why the new shard could take only 5% of the data?
> 5. is there a real solution to the chunk migration/page size issue?
> 6. how could one plan for growth when sharding is based on a key that
> doesn't follow insertion order?
> thanks again,
> :- alex
> On Oct 8, 2:32 am, harryh <har...@gmail.com> wrote:
> > Just as a quick follow-up to Eliot's detailed message on our outage I
> > want to make it clear to the community how helpful Eliot and the rest
> > of the gang at 10gen has been to us in the last couple days. We've
> > been in near constant communication with them as we worked to repair
> > things and end the outage, and then to identify and get to work on
> > fixing things so that this sort of thing won't happen again.
> > They've been a huge help to us, and things would have been much worse
> > without them.
> > Further, I am very confident that some of the issues that we've
> > identified in Mongo will be dealt with so others don't encounter
> > similar problems. Overall we still remain huge fans of MongoDB @
> > foursquare, and expect to be using it for a long time to come.
> > I'm more than happy to help answer any concerns that others might have
> > about MongoDB as it relates to this incident.
> - Can repairDatabase() leverage multiple cores? Given that data is > broken down into chunks, can we process them in parallel? 4 hours > downtime seems to be eternal in internet space.
Currently no. We're going to be working on doing compaction in the background though so you won't have to do this large offline compaction at all.
> - It seems that when data grows out of memory, the performance seems > to degrade significantly. If only part of the data pages are touched > (because only some users are active), why the paging/caching mechanism > is not effective and cause the disk read/write rates to be high and > bog down the request processing?
Foursquare needed read access to more blocks per second than even a 4 drive raid could keep up with. This is not normal for most sites, as many MongoDB installations have storage sizes 2x or even up to 100x ram. MongoDB should be better about handling situations like this and degrade much more gracefully. We'll be working on these enhancements soon as well.
> - For shard key selection, it looks like if insertion order determines > shard key, it will cause an uneven load on a machine for high insert. > If not, it may cause some problem when data is moved. What should be > the best practices? Does it depend on type of load?
Its very hard to give a general answer. As you said one based on insertion order means 1 shard takes all the writes. In some cases that's fine. If total data size is massive, but insert load is modest, then that might be a great solution. If you need to distribue writes, then you might fall into this other issue. We're working on compaction so it won't actually be an issue, but in the meantime you can follow the instructions above to add shards.
> - Is there some way to identify and alert that additional shards/ > machines are needed by looking at the statistics information? It would > be very helpful to prevent this kind of problem. We have a similar > situation as well. The performance is not degraded gradually. When > something hits, it hits really hard and we have to take action > immediately but additional machines may not be available immediately.
First you have to understand what resources your system cares most about. It can be ram, disk or cpu. If its ram or cpu, you can use the mongo db.stats() command to see total requirements.
> On Oct 8, 6:05 am, Eliot Horowitz <eliothorow...@gmail.com> wrote: >> (Note: this is being posted with Foursquare’s permission.)
>> As many of you are aware, Foursquare had a significant outage this >> week. The outage was caused by capacity problems on one of the >> machines hosting the MongoDB database used for check-ins. This is an >> account of what happened, why it happened, how it can be prevented, >> and how 10gen is working to improve MongoDB in light of this outage.
>> It’s important to note that throughout this week, 10gen and Foursquare >> engineers have been working together very closely to resolve the >> issue.
>> * Some history >> Foursquare has been hosting check-ins on a MongoDB database for some >> time now. The database was originally running on a single EC2 >> instance with 66GB of RAM. About 2 months ago, in response to >> increased capacity requirements, Foursquare migrated that single >> instance to a two-shard cluster. Now, each shard was running on its >> own 66GB instance, and both shards were also replicating to a slave >> for redundancy. This was an important migration because it allowed >> Foursquare to keep all of their check-in data in RAM, which is >> essential for maintaining acceptable performance.
>> The data had been split into 200 evenly distributed chunks based on >> user id. The first half went to one server, and the other half to the >> other. Each shard had about 33GB of data in RAM at this point, and >> the whole system ran smoothly for two months.
>> * What we missed in the interim >> Over these two months, check-ins were being written continually to >> each shard. Unfortunately, these check-ins did not grow evenly across >> chunks. It’s easy to imagine how this might happen: assuming certain >> subsets of users are more active than others, it’s conceivable that >> their updates might all go to the same shard. That’s what occurred in >> this case, resulting in one shard growing to 66GB and the other only >> to 50GB. [1]
>> * What went wrong >> On Monday morning, the data on one shard (we’ll call it shard0) >> finally grew to about 67GB, surpassing the 66GB of RAM on the hosting >> machine. Whenever data size grows beyond physical RAM, it becomes >> necessary to read and write to disk, which is orders of magnitude >> slower than reading and writing RAM. Thus, certain queries started to >> become very slow, and this caused a backlog that brought the site >> down.
>> We first attempted to fix the problem by adding a third shard. We >> brought the third shard up and started migrating chunks. Queries were >> now being distributed to all three shards, but shard0 continued to hit >> disk very heavily. When this failed to correct itself, we ultimately >> discovered that the problem was due to data fragmentation on shard0. >> In essence, although we had moved 5% of the data from shard0 to the >> new third shard, the data files, in their fragmented state, still >> needed the same amount of RAM. This can be explained by the fact that >> Foursquare check-in documents are small (around 300 bytes each), so >> many of them can fit on a 4KB page. Removing 5% of these just made >> each page a little more sparse, rather than removing pages >> altogether.[2]
>> After the first day's outage it had become clear that chunk migration, >> sans compaction, was not going to solve the immediate problem. By the >> time the second day's outage occurred, we had already move 5% of the >> data off of shard0, so we decided to start an offline process to >> compact the database using MongoDB’s repairDatabase() feature. This >> process took about 4 hours (partly due to the data size, and partly >> because of the slowness of EBS volumes at the time). At the end of >> the 4 hours, the RAM requirements for shard0 had in fact been reduced >> by 5%, allowing us to bring the system back online.
>> * Afterwards >> Since repairing shard0 and adding a third shard, we’ve set up even >> more shards, and now the check-in data is evenly distributed and there >> is a good deal of extra capacity. Still, we had to address the >> fragmentation problem. We ran a repairDatabase() on the slaves, and >> promoted the slaves to masters, further reducing the RAM needed on >> each shard to about 20GB.
>> * How is this issue triggered? >> Several conditions need to be met to trigger the issue that brought >> down Foursquare: >> 1. Systems are at or over capacity. How capacity is defined varies; in >> the case of Foursquare, all data needed to fit into RAM for acceptable >> performance. Other deployments may not have such strict RAM >> requirements. >> 2. Document size is less than 4k. Such documents, when moved, may be >> too small to free up pages and, thus, memory. >> 3. Shard key order and insertion order are different. This prevents >> data from being moved in contiguous chunks.
>> Most sharded deployments will not meet these criteria. Anyone whose >> documents are larger than 4KB will not suffer significant >> fragmentation because the pages that aren’t being used won’t be >> cached.
>> * Prevention >> The main thing to remember here is that once you’re at max capacity, >> it’s difficult to add more capacity without some downtime when objects >> are small. However, if caught in advance, adding more shards on a >> live system can be done with no downtime.
>> For example, if we had notifications in place to alert us 12 hours >> earlier that we needed more capacity, we could have added a third >> shard, migrated data, and then compacted the slaves.
>> Another salient point: when you’re operating at or near capacity, >> realize that if things get slow at your hosting provider, you may find >> yourself all of a sudden effectively over capacity.
>> * Final Thoughts >> The 10gen tech team is working hard to correct the issues exposed by >> this outage. We will continue to work as hard as possible to ensure >> that everyone using MongoDB has the best possible experience. We are >> thankful for the support that we have received from Foursquare and our >> community during this unfortunate episode. As always, please let us >> know if you have any questions or concerns.
>> [1] Chunks get split when they are 200MB into 2 100MB halves. This >> means that even if the number of chunks on each shard was the same, >> data size is not always so. This is something we are going to be >> addressing in MongoDB. We'll be making splitting balancing look for >> this imbalance so it can act upon it.
>> [2] The 10gen team is working on doing online incremental compaction >> of both data files and indexes. We know this has been a concern in >> non-sharded systems as well. More details about this will be coming >> in the next few weeks.
> -- > You received this message because you are subscribed to the Google Groups "mongodb-user" group. > To post to this group, send email to mongodb-user@googlegroups.com. > To unsubscribe from this group, send email to mongodb-user+unsubscribe@googlegroups.com. > For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.
> 1. were writes configured to work with MongoDB's fire and forget > behavior?
I'm not actually sure - but does not really impact anything else.
> 2. were replicas used for distributing reads?
No - all reads needed to be consistent.
> 3. could you bring up read-only replicas to deal with the increased > read access time?
The read load is so heavy - that even double the number of disks wouldn't have helped.
> 4. why the new shard could take only 5% of the data?
It could take more, and in fact it now has 50%, we were trying to move as little as possible to get the site back up as fast as possible.
> 5. is there a real solution to the chunk migration/page size issue?
Yes - we are working on online compaction would mitigate this.
> 6. how could one plan for growth when sharding is based on a key that > doesn't follow insertion order?
In the post we describe a method for adding shards without any downtime in that case. Briefly, its add shard, let balancing move data, compact slaves, swap master and slave.
> On Oct 8, 2:32 am, harryh <har...@gmail.com> wrote: >> Just as a quick follow-up to Eliot's detailed message on our outage I >> want to make it clear to the community how helpful Eliot and the rest >> of the gang at 10gen has been to us in the last couple days. We've >> been in near constant communication with them as we worked to repair >> things and end the outage, and then to identify and get to work on >> fixing things so that this sort of thing won't happen again.
>> They've been a huge help to us, and things would have been much worse >> without them.
>> Further, I am very confident that some of the issues that we've >> identified in Mongo will be dealt with so others don't encounter >> similar problems. Overall we still remain huge fans of MongoDB @ >> foursquare, and expect to be using it for a long time to come.
>> I'm more than happy to help answer any concerns that others might have >> about MongoDB as it relates to this incident.
>> -harryh, eng lead @ foursquare
> -- > You received this message because you are subscribed to the Google Groups "mongodb-user" group. > To post to this group, send email to mongodb-user@googlegroups.com. > To unsubscribe from this group, send email to mongodb-user+unsubscribe@googlegroups.com. > For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.
The only shard key that might have helped was one based on insertion order. The problem with that is then loading data for a single user would hit multiple machines. While workable, it's a lot cleaner to shard based on uid. The choice of shard key is not wrong, MongoDB just needs to do online compaction so migrates reduce the number of active pages.
On Thu, Oct 7, 2010 at 8:14 PM, Marc Esher <marc.es...@gmail.com> wrote: > I'm curious if foursquare had other options for shard key. Can anyone > speak to whether, if they were starting from scratch, they have better > choices now for choosing shard keys?
> On Oct 7, 8:01 pm, Alex Popescu <the.mindstorm.mailingl...@gmail.com> > wrote: >> Firstly, thanks a lot to both Foursquare and 10gen engineers/people >> for sharing these details.
>> 1. were writes configured to work with MongoDB's fire and forget >> behavior?
>> 2. were replicas used for distributing reads?
>> 3. could you bring up read-only replicas to deal with the increased >> read access time?
>> 4. why the new shard could take only 5% of the data?
>> 5. is there a real solution to the chunk migration/page size issue?
>> 6. how could one plan for growth when sharding is based on a key that >> doesn't follow insertion order?
>> thanks again,
>> :- alex
>> On Oct 8, 2:32 am, harryh <har...@gmail.com> wrote:
>> > Just as a quick follow-up to Eliot's detailed message on our outage I >> > want to make it clear to the community how helpful Eliot and the rest >> > of the gang at 10gen has been to us in the last couple days. We've >> > been in near constant communication with them as we worked to repair >> > things and end the outage, and then to identify and get to work on >> > fixing things so that this sort of thing won't happen again.
>> > They've been a huge help to us, and things would have been much worse >> > without them.
>> > Further, I am very confident that some of the issues that we've >> > identified in Mongo will be dealt with so others don't encounter >> > similar problems. Overall we still remain huge fans of MongoDB @ >> > foursquare, and expect to be using it for a long time to come.
>> > I'm more than happy to help answer any concerns that others might have >> > about MongoDB as it relates to this incident.
>> > -harryh, eng lead @ foursquare
> -- > You received this message because you are subscribed to the Google Groups "mongodb-user" group. > To post to this group, send email to mongodb-user@googlegroups.com. > To unsubscribe from this group, send email to mongodb-user+unsubscribe@googlegroups.com. > For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.
I think the most obvious question is: How do we avoid maxing out our
mongodb nodes and know when to provision new nodes?
With the assumptions that: you're monitoring everything.
What do we need to look at? How do we tell? What happens if you're a
company with lots of changing scaling needs and features...it's tough
to capacity plan when you're a startup.
One solution would've been to memory limit how much mongo uses but I
don't think this is possible given Mongo uses mmpa'd files right now.
Merely watching IO and Swap I don't think is a solution since it may
be too late at that point.
Suhail
On Oct 7, 3:05 pm, Eliot Horowitz <eliothorow...@gmail.com> wrote:
> (Note: this is being posted with Foursquare’s permission.)
> As many of you are aware, Foursquare had a significant outage this
> week. The outage was caused by capacity problems on one of the
> machines hosting the MongoDB database used for check-ins. This is an
> account of what happened, why it happened, how it can be prevented,
> and how 10gen is working to improve MongoDB in light of this outage.
> It’s important to note that throughout this week, 10gen and Foursquare
> engineers have been working together very closely to resolve the
> issue.
> * Some history
> Foursquare has been hosting check-ins on a MongoDB database for some
> time now. The database was originally running on a single EC2
> instance with 66GB of RAM. About 2 months ago, in response to
> increased capacity requirements, Foursquare migrated that single
> instance to a two-shard cluster. Now, each shard was running on its
> own 66GB instance, and both shards were also replicating to a slave
> for redundancy. This was an important migration because it allowed
> Foursquare to keep all of their check-in data in RAM, which is
> essential for maintaining acceptable performance.
> The data had been split into 200 evenly distributed chunks based on
> user id. The first half went to one server, and the other half to the
> other. Each shard had about 33GB of data in RAM at this point, and
> the whole system ran smoothly for two months.
> * What we missed in the interim
> Over these two months, check-ins were being written continually to
> each shard. Unfortunately, these check-ins did not grow evenly across
> chunks. It’s easy to imagine how this might happen: assuming certain
> subsets of users are more active than others, it’s conceivable that
> their updates might all go to the same shard. That’s what occurred in
> this case, resulting in one shard growing to 66GB and the other only
> to 50GB. [1]
> * What went wrong
> On Monday morning, the data on one shard (we’ll call it shard0)
> finally grew to about 67GB, surpassing the 66GB of RAM on the hosting
> machine. Whenever data size grows beyond physical RAM, it becomes
> necessary to read and write to disk, which is orders of magnitude
> slower than reading and writing RAM. Thus, certain queries started to
> become very slow, and this caused a backlog that brought the site
> down.
> We first attempted to fix the problem by adding a third shard. We
> brought the third shard up and started migrating chunks. Queries were
> now being distributed to all three shards, but shard0 continued to hit
> disk very heavily. When this failed to correct itself, we ultimately
> discovered that the problem was due to data fragmentation on shard0.
> In essence, although we had moved 5% of the data from shard0 to the
> new third shard, the data files, in their fragmented state, still
> needed the same amount of RAM. This can be explained by the fact that
> Foursquare check-in documents are small (around 300 bytes each), so
> many of them can fit on a 4KB page. Removing 5% of these just made
> each page a little more sparse, rather than removing pages
> altogether.[2]
> After the first day's outage it had become clear that chunk migration,
> sans compaction, was not going to solve the immediate problem. By the
> time the second day's outage occurred, we had already move 5% of the
> data off of shard0, so we decided to start an offline process to
> compact the database using MongoDB’s repairDatabase() feature. This
> process took about 4 hours (partly due to the data size, and partly
> because of the slowness of EBS volumes at the time). At the end of
> the 4 hours, the RAM requirements for shard0 had in fact been reduced
> by 5%, allowing us to bring the system back online.
> * Afterwards
> Since repairing shard0 and adding a third shard, we’ve set up even
> more shards, and now the check-in data is evenly distributed and there
> is a good deal of extra capacity. Still, we had to address the
> fragmentation problem. We ran a repairDatabase() on the slaves, and
> promoted the slaves to masters, further reducing the RAM needed on
> each shard to about 20GB.
> * How is this issue triggered?
> Several conditions need to be met to trigger the issue that brought
> down Foursquare:
> 1. Systems are at or over capacity. How capacity is defined varies; in
> the case of Foursquare, all data needed to fit into RAM for acceptable
> performance. Other deployments may not have such strict RAM
> requirements.
> 2. Document size is less than 4k. Such documents, when moved, may be
> too small to free up pages and, thus, memory.
> 3. Shard key order and insertion order are different. This prevents
> data from being moved in contiguous chunks.
> Most sharded deployments will not meet these criteria. Anyone whose
> documents are larger than 4KB will not suffer significant
> fragmentation because the pages that aren’t being used won’t be
> cached.
> * Prevention
> The main thing to remember here is that once you’re at max capacity,
> it’s difficult to add more capacity without some downtime when objects
> are small. However, if caught in advance, adding more shards on a
> live system can be done with no downtime.
> For example, if we had notifications in place to alert us 12 hours
> earlier that we needed more capacity, we could have added a third
> shard, migrated data, and then compacted the slaves.
> Another salient point: when you’re operating at or near capacity,
> realize that if things get slow at your hosting provider, you may find
> yourself all of a sudden effectively over capacity.
> * Final Thoughts
> The 10gen tech team is working hard to correct the issues exposed by
> this outage. We will continue to work as hard as possible to ensure
> that everyone using MongoDB has the best possible experience. We are
> thankful for the support that we have received from Foursquare and our
> community during this unfortunate episode. As always, please let us
> know if you have any questions or concerns.
> [1] Chunks get split when they are 200MB into 2 100MB halves. This
> means that even if the number of chunks on each shard was the same,
> data size is not always so. This is something we are going to be
> addressing in MongoDB. We'll be making splitting balancing look for
> this imbalance so it can act upon it.
> [2] The 10gen team is working on doing online incremental compaction
> of both data files and indexes. We know this has been a concern in
> non-sharded systems as well. More details about this will be coming
> in the next few weeks.
I'm curious if foursquare had other options for shard key. Can anyone
speak to whether, if they were starting from scratch, they have better
choices now for choosing shard keys?
On Oct 7, 8:01 pm, Alex Popescu <the.mindstorm.mailingl...@gmail.com>
wrote:
> 1. were writes configured to work with MongoDB's fire and forget
> behavior?
> 2. were replicas used for distributing reads?
> 3. could you bring up read-only replicas to deal with the increased
> read access time?
> 4. why the new shard could take only 5% of the data?
> 5. is there a real solution to the chunk migration/page size issue?
> 6. how could one plan for growth when sharding is based on a key that
> doesn't follow insertion order?
> thanks again,
> :- alex
> On Oct 8, 2:32 am, harryh <har...@gmail.com> wrote:
> > Just as a quick follow-up to Eliot's detailed message on our outage I
> > want to make it clear to the community how helpful Eliot and the rest
> > of the gang at 10gen has been to us in the last couple days. We've
> > been in near constant communication with them as we worked to repair
> > things and end the outage, and then to identify and get to work on
> > fixing things so that this sort of thing won't happen again.
> > They've been a huge help to us, and things would have been much worse
> > without them.
> > Further, I am very confident that some of the issues that we've
> > identified in Mongo will be dealt with so others don't encounter
> > similar problems. Overall we still remain huge fans of MongoDB @
> > foursquare, and expect to be using it for a long time to come.
> > I'm more than happy to help answer any concerns that others might have
> > about MongoDB as it relates to this incident.
On Oct 7, 6:05 pm, Eliot Horowitz <eliothorow...@gmail.com> wrote:
> 1. Systems are at or over capacity. How capacity is defined varies; in
> the case of Foursquare, all data needed to fit into RAM for acceptable
> performance. Other deployments may not have such strict RAM
> requirements.
How does one determine what the RAM requirements are? For databases
that extend into the TB range, the same quantity of RAM is not an
option.
Is the MongoDB formula significantly different from the traditional
RDBMS, where performance is best when RAM buffers are large enough to
contain most commonly accessed / MRU blocks?
How much of Foursquare's problem is related to the lower IO rates of
EC2/EBS compared to fast, local, dedicated disks?
> I think the most obvious question is: How do we avoid maxing out our > mongodb nodes and know when to provision new nodes? > With the assumptions that: you're monitoring everything.
> What do we need to look at? How do we tell? What happens if you're a > company with lots of changing scaling needs and features...it's tough > to capacity plan when you're a startup.
This is very application dependent. In some cases you need all your indexes in ram, in other cases it can just be a small working set. One good way to determine this is figure out how much data you determine in a 10 minute period. Indexes and documents. Make sure you can either store that in ram or read that from disk in the same amount of time.
> One solution would've been to memory limit how much mongo uses but I > don't think this is possible given Mongo uses mmpa'd files right now. > Merely watching IO and Swap I don't think is a solution since it may > be too late at that point.
This wouldn't really help. Generally the more ram you have the better. The memory MongoDB uses is to cache data, so the more you give it the better.
> On Oct 7, 3:05 pm, Eliot Horowitz <eliothorow...@gmail.com> wrote: >> (Note: this is being posted with Foursquare’s permission.)
>> As many of you are aware, Foursquare had a significant outage this >> week. The outage was caused by capacity problems on one of the >> machines hosting the MongoDB database used for check-ins. This is an >> account of what happened, why it happened, how it can be prevented, >> and how 10gen is working to improve MongoDB in light of this outage.
>> It’s important to note that throughout this week, 10gen and Foursquare >> engineers have been working together very closely to resolve the >> issue.
>> * Some history >> Foursquare has been hosting check-ins on a MongoDB database for some >> time now. The database was originally running on a single EC2 >> instance with 66GB of RAM. About 2 months ago, in response to >> increased capacity requirements, Foursquare migrated that single >> instance to a two-shard cluster. Now, each shard was running on its >> own 66GB instance, and both shards were also replicating to a slave >> for redundancy. This was an important migration because it allowed >> Foursquare to keep all of their check-in data in RAM, which is >> essential for maintaining acceptable performance.
>> The data had been split into 200 evenly distributed chunks based on >> user id. The first half went to one server, and the other half to the >> other. Each shard had about 33GB of data in RAM at this point, and >> the whole system ran smoothly for two months.
>> * What we missed in the interim >> Over these two months, check-ins were being written continually to >> each shard. Unfortunately, these check-ins did not grow evenly across >> chunks. It’s easy to imagine how this might happen: assuming certain >> subsets of users are more active than others, it’s conceivable that >> their updates might all go to the same shard. That’s what occurred in >> this case, resulting in one shard growing to 66GB and the other only >> to 50GB. [1]
>> * What went wrong >> On Monday morning, the data on one shard (we’ll call it shard0) >> finally grew to about 67GB, surpassing the 66GB of RAM on the hosting >> machine. Whenever data size grows beyond physical RAM, it becomes >> necessary to read and write to disk, which is orders of magnitude >> slower than reading and writing RAM. Thus, certain queries started to >> become very slow, and this caused a backlog that brought the site >> down.
>> We first attempted to fix the problem by adding a third shard. We >> brought the third shard up and started migrating chunks. Queries were >> now being distributed to all three shards, but shard0 continued to hit >> disk very heavily. When this failed to correct itself, we ultimately >> discovered that the problem was due to data fragmentation on shard0. >> In essence, although we had moved 5% of the data from shard0 to the >> new third shard, the data files, in their fragmented state, still >> needed the same amount of RAM. This can be explained by the fact that >> Foursquare check-in documents are small (around 300 bytes each), so >> many of them can fit on a 4KB page. Removing 5% of these just made >> each page a little more sparse, rather than removing pages >> altogether.[2]
>> After the first day's outage it had become clear that chunk migration, >> sans compaction, was not going to solve the immediate problem. By the >> time the second day's outage occurred, we had already move 5% of the >> data off of shard0, so we decided to start an offline process to >> compact the database using MongoDB’s repairDatabase() feature. This >> process took about 4 hours (partly due to the data size, and partly >> because of the slowness of EBS volumes at the time). At the end of >> the 4 hours, the RAM requirements for shard0 had in fact been reduced >> by 5%, allowing us to bring the system back online.
>> * Afterwards >> Since repairing shard0 and adding a third shard, we’ve set up even >> more shards, and now the check-in data is evenly distributed and there >> is a good deal of extra capacity. Still, we had to address the >> fragmentation problem. We ran a repairDatabase() on the slaves, and >> promoted the slaves to masters, further reducing the RAM needed on >> each shard to about 20GB.
>> * How is this issue triggered? >> Several conditions need to be met to trigger the issue that brought >> down Foursquare: >> 1. Systems are at or over capacity. How capacity is defined varies; in >> the case of Foursquare, all data needed to fit into RAM for acceptable >> performance. Other deployments may not have such strict RAM >> requirements. >> 2. Document size is less than 4k. Such documents, when moved, may be >> too small to free up pages and, thus, memory. >> 3. Shard key order and insertion order are different. This prevents >> data from being moved in contiguous chunks.
>> Most sharded deployments will not meet these criteria. Anyone whose >> documents are larger than 4KB will not suffer significant >> fragmentation because the pages that aren’t being used won’t be >> cached.
>> * Prevention >> The main thing to remember here is that once you’re at max capacity, >> it’s difficult to add more capacity without some downtime when objects >> are small. However, if caught in advance, adding more shards on a >> live system can be done with no downtime.
>> For example, if we had notifications in place to alert us 12 hours >> earlier that we needed more capacity, we could have added a third >> shard, migrated data, and then compacted the slaves.
>> Another salient point: when you’re operating at or near capacity, >> realize that if things get slow at your hosting provider, you may find >> yourself all of a sudden effectively over capacity.
>> * Final Thoughts >> The 10gen tech team is working hard to correct the issues exposed by >> this outage. We will continue to work as hard as possible to ensure >> that everyone using MongoDB has the best possible experience. We are >> thankful for the support that we have received from Foursquare and our >> community during this unfortunate episode. As always, please let us >> know if you have any questions or concerns.
>> [1] Chunks get split when they are 200MB into 2 100MB halves. This >> means that even if the number of chunks on each shard was the same, >> data size is not always so. This is something we are going to be >> addressing in MongoDB. We'll be making splitting balancing look for >> this imbalance so it can act upon it.
>> [2] The 10gen team is working on doing online incremental compaction >> of both data files and indexes. We know this has been a concern in >> non-sharded systems as well. More details about this will be coming >> in the next few weeks.
> -- > You received this message because you are subscribed to the Google Groups "mongodb-user" group. > To post to this group, send email to mongodb-user@googlegroups.com. > To unsubscribe from this group, send email to mongodb-user+unsubscribe@googlegroups.com. > For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.
> How does one determine what the RAM requirements are? For databases > that extend into the TB range, the same quantity of RAM is not an > option.
You need to determine working set. See recent comment or 2 above this one.
> Is the MongoDB formula significantly different from the traditional > RDBMS, where performance is best when RAM buffers are large enough to > contain most commonly accessed / MRU blocks?
Its really the same in MongoDB as a traditional RDBMs.
> How much of Foursquare's problem is related to the lower IO rates of > EC2/EBS compared to fast, local, dedicated disks?
I don't think even fast local disks would have prevented this. There is a chance SSDs might have, but we have not done all the math to know for sure. Faster disks would probably have made it easier to manipulate things to reduce downtime howerver.
> -- > You received this message because you are subscribed to the Google Groups "mongodb-user" group. > To post to this group, send email to mongodb-user@googlegroups.com. > To unsubscribe from this group, send email to mongodb-user+unsubscribe@googlegroups.com. > For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.
One metric to keep an eye on is the database's and/or collection's
storageSize + the totalIndexSize. Also increased read rates can tip
you off that there might be a problem. Last suggestion is to get
comfortable with the mongostats tool - it's invaluable!
On Oct 7, 8:16 pm, Suhail Doshi <digitalwarf...@gmail.com> wrote:
> I think the most obvious question is: How do we avoid maxing out our
> mongodb nodes and know when to provision new nodes?
> With the assumptions that: you're monitoring everything.
> What do we need to look at? How do we tell? What happens if you're a
> company with lots of changing scaling needs and features...it's tough
> to capacity plan when you're a startup.
> One solution would've been to memory limit how much mongo uses but I
> don't think this is possible given Mongo uses mmpa'd files right now.
> Merely watching IO and Swap I don't think is a solution since it may
> be too late at that point.
> Suhail
> On Oct 7, 3:05 pm, Eliot Horowitz <eliothorow...@gmail.com> wrote:
> > (Note: this is being posted with Foursquare’s permission.)
> > As many of you are aware, Foursquare had a significant outage this
> > week. The outage was caused by capacity problems on one of the
> > machines hosting the MongoDB database used for check-ins. This is an
> > account of what happened, why it happened, how it can be prevented,
> > and how 10gen is working to improve MongoDB in light of this outage.
> > It’s important to note that throughout this week, 10gen and Foursquare
> > engineers have been working together very closely to resolve the
> > issue.
> > * Some history
> > Foursquare has been hosting check-ins on a MongoDB database for some
> > time now. The database was originally running on a single EC2
> > instance with 66GB of RAM. About 2 months ago, in response to
> > increased capacity requirements, Foursquare migrated that single
> > instance to a two-shard cluster. Now, each shard was running on its
> > own 66GB instance, and both shards were also replicating to a slave
> > for redundancy. This was an important migration because it allowed
> > Foursquare to keep all of their check-in data in RAM, which is
> > essential for maintaining acceptable performance.
> > The data had been split into 200 evenly distributed chunks based on
> > user id. The first half went to one server, and the other half to the
> > other. Each shard had about 33GB of data in RAM at this point, and
> > the whole system ran smoothly for two months.
> > * What we missed in the interim
> > Over these two months, check-ins were being written continually to
> > each shard. Unfortunately, these check-ins did not grow evenly across
> > chunks. It’s easy to imagine how this might happen: assuming certain
> > subsets of users are more active than others, it’s conceivable that
> > their updates might all go to the same shard. That’s what occurred in
> > this case, resulting in one shard growing to 66GB and the other only
> > to 50GB. [1]
> > * What went wrong
> > On Monday morning, the data on one shard (we’ll call it shard0)
> > finally grew to about 67GB, surpassing the 66GB of RAM on the hosting
> > machine. Whenever data size grows beyond physical RAM, it becomes
> > necessary to read and write to disk, which is orders of magnitude
> > slower than reading and writing RAM. Thus, certain queries started to
> > become very slow, and this caused a backlog that brought the site
> > down.
> > We first attempted to fix the problem by adding a third shard. We
> > brought the third shard up and started migrating chunks. Queries were
> > now being distributed to all three shards, but shard0 continued to hit
> > disk very heavily. When this failed to correct itself, we ultimately
> > discovered that the problem was due to data fragmentation on shard0.
> > In essence, although we had moved 5% of the data from shard0 to the
> > new third shard, the data files, in their fragmented state, still
> > needed the same amount of RAM. This can be explained by the fact that
> > Foursquare check-in documents are small (around 300 bytes each), so
> > many of them can fit on a 4KB page. Removing 5% of these just made
> > each page a little more sparse, rather than removing pages
> > altogether.[2]
> > After the first day's outage it had become clear that chunk migration,
> > sans compaction, was not going to solve the immediate problem. By the
> > time the second day's outage occurred, we had already move 5% of the
> > data off of shard0, so we decided to start an offline process to
> > compact the database using MongoDB’s repairDatabase() feature. This
> > process took about 4 hours (partly due to the data size, and partly
> > because of the slowness of EBS volumes at the time). At the end of
> > the 4 hours, the RAM requirements for shard0 had in fact been reduced
> > by 5%, allowing us to bring the system back online.
> > * Afterwards
> > Since repairing shard0 and adding a third shard, we’ve set up even
> > more shards, and now the check-in data is evenly distributed and there
> > is a good deal of extra capacity. Still, we had to address the
> > fragmentation problem. We ran a repairDatabase() on the slaves, and
> > promoted the slaves to masters, further reducing the RAM needed on
> > each shard to about 20GB.
> > * How is this issue triggered?
> > Several conditions need to be met to trigger the issue that brought
> > down Foursquare:
> > 1. Systems are at or over capacity. How capacity is defined varies; in
> > the case of Foursquare, all data needed to fit into RAM for acceptable
> > performance. Other deployments may not have such strict RAM
> > requirements.
> > 2. Document size is less than 4k. Such documents, when moved, may be
> > too small to free up pages and, thus, memory.
> > 3. Shard key order and insertion order are different. This prevents
> > data from being moved in contiguous chunks.
> > Most sharded deployments will not meet these criteria. Anyone whose
> > documents are larger than 4KB will not suffer significant
> > fragmentation because the pages that aren’t being used won’t be
> > cached.
> > * Prevention
> > The main thing to remember here is that once you’re at max capacity,
> > it’s difficult to add more capacity without some downtime when objects
> > are small. However, if caught in advance, adding more shards on a
> > live system can be done with no downtime.
> > For example, if we had notifications in place to alert us 12 hours
> > earlier that we needed more capacity, we could have added a third
> > shard, migrated data, and then compacted the slaves.
> > Another salient point: when you’re operating at or near capacity,
> > realize that if things get slow at your hosting provider, you may find
> > yourself all of a sudden effectively over capacity.
> > * Final Thoughts
> > The 10gen tech team is working hard to correct the issues exposed by
> > this outage. We will continue to work as hard as possible to ensure
> > that everyone using MongoDB has the best possible experience. We are
> > thankful for the support that we have received from Foursquare and our
> > community during this unfortunate episode. As always, please let us
> > know if you have any questions or concerns.
> > [1] Chunks get split when they are 200MB into 2 100MB halves. This
> > means that even if the number of chunks on each shard was the same,
> > data size is not always so. This is something we are going to be
> > addressing in MongoDB. We'll be making splitting balancing look for
> > this imbalance so it can act upon it.
> > [2] The 10gen team is working on doing online incremental compaction
> > of both data files and indexes. We know this has been a concern in
> > non-sharded systems as well. More details about this will be coming
> > in the next few weeks.
In our case we found that the EBS read rates were not sufficient if
the data + indexes didn't fit into RAM. I'd caution that this is
specific to how we're accessing this collection, and to our read/write
rates. We're running with 4 EBS volumes (RAID-0) with a small bit of
tuning, which definitely helps.
On Oct 7, 8:54 pm, MattK <bsg...@gmail.com> wrote:
> On Oct 7, 6:05 pm, Eliot Horowitz <eliothorow...@gmail.com> wrote:
> > 1. Systems are at or over capacity. How capacity is defined varies; in
> > the case of Foursquare, all data needed to fit into RAM for acceptable
> > performance. Other deployments may not have such strict RAM
> > requirements.
> How does one determine what the RAM requirements are? For databases
> that extend into the TB range, the same quantity of RAM is not an
> option.
> Is the MongoDB formula significantly different from the traditional
> RDBMS, where performance is best when RAM buffers are large enough to
> contain most commonly accessed / MRU blocks?
> How much of Foursquare's problem is related to the lower IO rates of
> EC2/EBS compared to fast, local, dedicated disks?
> - It seems that when data grows out of memory, the performance seems > to degrade significantly.
It is when the working set grows out of memory that you have a problem not "data". Note that this is a problem with any kind of system and not specific to MongoDB.
What happens when running out of memory (ie more operations proceed at disk speeds) is that things will get a lot worse because of concurrency. If a query now takes a second instead of 100ms then it is far more likely another query will come in during that extra time. Both queries will now be competing for the disk. Then a third will come along making things even worse again. The disk seek times really hurt.
What MongoDB needs to do is decrease concurrency under load so that existing queries can complete before allowing new ones to add fuel to the fire.
Would be possible to get a tool that could gather info / stats for a
few hours / days and do some suggestions.
Maybe it could suggest shard keys, suggest memory size, new indexes,
etc...
I'm sure that a few of us do wrong querys, and it could even warn of
using group instead of map_reduce if we have less than 10000 keys,
etc... ( Good practices )
Claim: The main problem was the fact that fragmentation was happening
under very specific unexpected conditions. And even though the system
was running for months, it was not foreseen or expected.
Counter-action: Try to simulate the production system to get an idea
about different kind of problems in advance!
Therefore, ask your biggest/most important customers about their
typical ways of using MongoDB.
Typical document sizes, number of queries per time unit, read/write
ratios, kind of queries, clustering and sharding setup, etc. May be
even for examples of real data.
Then try to simulate this setup at your own premises using the same/
similar characteristics as your customers do. Do some sort of
randomized/non-deterministic testing. Try to see what would happen to
the system, if major factors are changed, e.g. size of documents,
number of requests, network speed, etc. This way, many kinds of
problems can be detected in this simulated setup before they happen
for real in production.
Does it make sense for you?
- Leo
P.S. Of course, this approach does not apply only to MongoDB. It can
be used almost for any system, where simulation is possible.
MattK> How does one determine what the RAM requirements are? For MattK> databases that extend into the TB range, the same quantity of MattK> RAM is not an option.
As already mentioned, every use case is different; what is important is that you have enough RAM to house your working set and indexes
MattK> Is the MongoDB formula significantly different from the MattK> traditional RDBMS, where performance is best when RAM buffers MattK> are large enough to contain most commonly accessed / MRU blocks?
The formula is not different but the impact in both directions (enough vs not enough RAM to hold the working set and indices) carries more weight i.e. you will usually (with default configurations) see MongoDB being faster if there is enough RAM but worse (compared to RDBMs) if MongoDB needs to hit the disk.
MattK> How much of Foursquare's problem is related to the lower IO MattK> rates of EC2/EBS compared to fast, local, dedicated disks?
If you need to go to disks for I/O except the default fsync etc. then you already have a problem. The best solution to mitigate the pain then however is something that does random I/O i.e. SSD (Solid State Drive) rather than usual HDDs.
Your post was very informative. One question that remains (and I cant
believe that nobody has raised it yet) is - Why was there no
monitoring system in place to fore-warn your team?
Isn't this a basic requirement? I mean you dont expect the data to
grow beyond your RAM is a very risky constraint. So there should've
been *some* sort of monitoring in place to warn you that you're about
to run out of resources. Its a very basic requirement according to me?
Something like "hey, i'm doing malloc(). Let me check if theres
ACTUALLY memory available for me.". If such a monitoring system was in
place and hooked up to your foursquare app, users would've gotten
error messages while checking in, BUT their existing check-ins
would've been available and wouldn't have led to the whole site/
service going down.
Dinesh> I mean you dont expect the data to grow beyond your RAM is a Dinesh> very risky constraint.
I would disagree with that. The fact is that the optimal situation is that all you data set including indices fit into RAM. However, this will only be true for small sites using MongoDB as data tier in some sort.
The key concern is whether or not the working set fits into RAM at all times. This pattern is different per use case and situation (slashdot effect). With sites as foursquare it will always be so that your data set is bigger than your actual working set and certainly also available RAM. Monitoring whether or not your data set is growing bigger than RAM therefore does not make sense. You as a technician need to know about your use case and plan/develop accordingly as MongoDB can not free you from those duties.
Dinesh> So there should've been *some* sort of monitoring in place to Dinesh> warn you that you're about to run out of resources.
See also the links for the munin and nagios plugins from this link. Bottom line is, the information is manifold and available, it is up the person who runs a setup to interpret it and act/plan accordingly (rationale see above).
On Fri, Oct 8, 2010 at 1:25 AM, Dinesh <dinesh.a.jo...@gmail.com> wrote: > Hi Eliot,
> Your post was very informative. One question that remains (and I cant > believe that nobody has raised it yet) is - Why was there no > monitoring system in place to fore-warn your team?
> Isn't this a basic requirement? I mean you dont expect the data to > grow beyond your RAM is a very risky constraint. So there should've > been *some* sort of monitoring in place to warn you that you're about > to run out of resources. Its a very basic requirement according to me? > Something like "hey, i'm doing malloc(). Let me check if theres > ACTUALLY memory available for me.". If such a monitoring system was in > place and hooked up to your foursquare app, users would've gotten > error messages while checking in, BUT their existing check-ins > would've been available and wouldn't have led to the whole site/ > service going down.
> Dinesh
> -- > You received this message because you are subscribed to the Google Groups "mongodb-user" group. > To post to this group, send email to mongodb-user@googlegroups.com. > To unsubscribe from this group, send email to mongodb-user+unsubscribe@googlegroups.com. > For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.
That is why setting a limit on the RAM used by MongoDB would be
useful. If it were possible to set a limit, they could have limited
the machine to 64GB resident. Then, when the system started to fall
over, they could stop MongoDB, bump the limit to 66GB, then turn
MongoDB back on.
Placing a limit on the memory used by MongoDB artificially reduces the
overall system capacity. But, it creates a more reliable early warning
with a fast path to a temporary fix. The temporary fix will usually
last long enough to get new shards online.
Brainstorming.... one could allocate memory in another process, mlock
it so that it can't be swapped out, and then just let it sit in an
infinite sleep. If the server starts to fall over due to lack of
memory, kill the process with the mlocked memory, and boom. You have
runway.
-Jonathan
On Oct 8, 7:01 am, Eliot Horowitz <eliothorow...@gmail.com> wrote:
> Knowing what the actual limit can be tricky.
> There was monitoring in place, but the limit for when to add capacity
> wasn't quite right.
> On Fri, Oct 8, 2010 at 1:25 AM, Dinesh <dinesh.a.jo...@gmail.com> wrote:
> > Hi Eliot,
> > Your post was very informative. One question that remains (and I cant
> > believe that nobody has raised it yet) is - Why was there no
> > monitoring system in place to fore-warn your team?
> > Isn't this a basic requirement? I mean you dont expect the data to
> > grow beyond your RAM is a very risky constraint. So there should've
> > been *some* sort of monitoring in place to warn you that you're about
> > to run out of resources. Its a very basic requirement according to me?
> > Something like "hey, i'm doing malloc(). Let me check if theres
> > ACTUALLY memory available for me.". If such a monitoring system was in
> > place and hooked up to your foursquare app, users would've gotten
> > error messages while checking in, BUT their existing check-ins
> > would've been available and wouldn't have led to the whole site/
> > service going down.
> > Dinesh
> > --
> > You received this message because you are subscribed to the Google Groups "mongodb-user" group.
> > To post to this group, send email to mongodb-user@googlegroups.com.
> > To unsubscribe from this group, send email to mongodb-user+unsubscribe@googlegroups.com.
> > For more options, visit this group athttp://groups.google.com/group/mongodb-user?hl=en.
On Fri, Oct 8, 2010 at 9:15 AM, Jonathan Ultis <jonathan.ul...@gmail.com> wrote: > That is why setting a limit on the RAM used by MongoDB would be > useful. If it were possible to set a limit, they could have limited > the machine to 64GB resident. Then, when the system started to fall > over, they could stop MongoDB, bump the limit to 66GB, then turn > MongoDB back on.
AFAIK, Mongo isn't malloc'ing memory. you can mmap 100x physical RAM on the box. it isn't resident memory; it's all virtual. If mongo dirties pages beyond the configured amount that the system allows to be dirty, then the vm manager will flush pages to the backnig mmap file. Usually that configured amount is derived via percentages of the system RAM. In foursquare's case, they were dirtying pages above that threshold and the debt owed to IOwait piled up. I'm not negating your later idea though.
> Placing a limit on the memory used by MongoDB artificially reduces the > overall system capacity. But, it creates a more reliable early warning > with a fast path to a temporary fix. The temporary fix will usually > last long enough to get new shards online.
Theoretically you could lower that aforementioned dirty percentage using the vm tuning tools. In linux, /proc/sys/vm/... If the working set is diryting pages beyond the thresholds, then the os will start flushing to disk and mongo will get slow, you could simply raise the vm thresholds and get this temporary runway.
> Brainstorming.... one could allocate memory in another process, mlock > it so that it can't be swapped out, and then just let it sit in an > infinite sleep. If the server starts to fall over due to lack of > memory, kill the process with the mlocked memory, and boom. You have > runway.
> -Jonathan
> On Oct 8, 7:01 am, Eliot Horowitz <eliothorow...@gmail.com> wrote: >> Knowing what the actual limit can be tricky. >> There was monitoring in place, but the limit for when to add capacity >> wasn't quite right.
>> On Fri, Oct 8, 2010 at 1:25 AM, Dinesh <dinesh.a.jo...@gmail.com> wrote: >> > Hi Eliot,
>> > Your post was very informative. One question that remains (and I cant >> > believe that nobody has raised it yet) is - Why was there no >> > monitoring system in place to fore-warn your team?
>> > Isn't this a basic requirement? I mean you dont expect the data to >> > grow beyond your RAM is a very risky constraint. So there should've >> > been *some* sort of monitoring in place to warn you that you're about >> > to run out of resources. Its a very basic requirement according to me? >> > Something like "hey, i'm doing malloc(). Let me check if theres >> > ACTUALLY memory available for me.". If such a monitoring system was in >> > place and hooked up to your foursquare app, users would've gotten >> > error messages while checking in, BUT their existing check-ins >> > would've been available and wouldn't have led to the whole site/ >> > service going down.
>> > Dinesh
>> > -- >> > You received this message because you are subscribed to the Google Groups "mongodb-user" group. >> > To post to this group, send email to mongodb-user@googlegroups.com. >> > To unsubscribe from this group, send email to mongodb-user+unsubscribe@googlegroups.com. >> > For more options, visit this group athttp://groups.google.com/group/mongodb-user?hl=en.
> -- > You received this message because you are subscribed to the Google Groups "mongodb-user" group. > To post to this group, send email to mongodb-user@googlegroups.com. > To unsubscribe from this group, send email to mongodb-user+unsubscribe@googlegroups.com. > For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.
MongoDB uses range based partitioning of its shards. So even if some
power users were involved in doing loads of check-ins, what I don't
understand is why didn't MongoDB sharding split the user ranges
further and swap out the chunks and migrate to the other shard. Any
thoughts?
On Oct 7, 6:05 pm, Eliot Horowitz <eliothorow...@gmail.com> wrote:
> (Note: this is being posted with Foursquare’s permission.)
> As many of you are aware, Foursquare had a significant outage this
> week. The outage was caused by capacity problems on one of the
> machines hosting the MongoDB database used for check-ins. This is an
> account of what happened, why it happened, how it can be prevented,
> and how 10gen is working to improve MongoDB in light of this outage.
> It’s important to note that throughout this week, 10gen and Foursquare
> engineers have been working together very closely to resolve the
> issue.
> * Some history
> Foursquare has been hosting check-ins on a MongoDB database for some
> time now. The database was originally running on a single EC2
> instance with 66GB of RAM. About 2 months ago, in response to
> increased capacity requirements, Foursquare migrated that single
> instance to a two-shard cluster. Now, each shard was running on its
> own 66GB instance, and both shards were also replicating to a slave
> for redundancy. This was an important migration because it allowed
> Foursquare to keep all of their check-in data in RAM, which is
> essential for maintaining acceptable performance.
> The data had been split into 200 evenly distributed chunks based on
> user id. The first half went to one server, and the other half to the
> other. Each shard had about 33GB of data in RAM at this point, and
> the whole system ran smoothly for two months.
> * What we missed in the interim
> Over these two months, check-ins were being written continually to
> each shard. Unfortunately, these check-ins did not grow evenly across
> chunks. It’s easy to imagine how this might happen: assuming certain
> subsets of users are more active than others, it’s conceivable that
> their updates might all go to the same shard. That’s what occurred in
> this case, resulting in one shard growing to 66GB and the other only
> to 50GB. [1]
> * What went wrong
> On Monday morning, the data on one shard (we’ll call it shard0)
> finally grew to about 67GB, surpassing the 66GB of RAM on the hosting
> machine. Whenever data size grows beyond physical RAM, it becomes
> necessary to read and write to disk, which is orders of magnitude
> slower than reading and writing RAM. Thus, certain queries started to
> become very slow, and this caused a backlog that brought the site
> down.
> We first attempted to fix the problem by adding a third shard. We
> brought the third shard up and started migrating chunks. Queries were
> now being distributed to all three shards, but shard0 continued to hit
> disk very heavily. When this failed to correct itself, we ultimately
> discovered that the problem was due to data fragmentation on shard0.
> In essence, although we had moved 5% of the data from shard0 to the
> new third shard, the data files, in their fragmented state, still
> needed the same amount of RAM. This can be explained by the fact that
> Foursquare check-in documents are small (around 300 bytes each), so
> many of them can fit on a 4KB page. Removing 5% of these just made
> each page a little more sparse, rather than removing pages
> altogether.[2]
> After the first day's outage it had become clear that chunk migration,
> sans compaction, was not going to solve the immediate problem. By the
> time the second day's outage occurred, we had already move 5% of the
> data off of shard0, so we decided to start an offline process to
> compact the database using MongoDB’s repairDatabase() feature. This
> process took about 4 hours (partly due to the data size, and partly
> because of the slowness of EBS volumes at the time). At the end of
> the 4 hours, the RAM requirements for shard0 had in fact been reduced
> by 5%, allowing us to bring the system back online.
> * Afterwards
> Since repairing shard0 and adding a third shard, we’ve set up even
> more shards, and now the check-in data is evenly distributed and there
> is a good deal of extra capacity. Still, we had to address the
> fragmentation problem. We ran a repairDatabase() on the slaves, and
> promoted the slaves to masters, further reducing the RAM needed on
> each shard to about 20GB.
> * How is this issue triggered?
> Several conditions need to be met to trigger the issue that brought
> down Foursquare:
> 1. Systems are at or over capacity. How capacity is defined varies; in
> the case of Foursquare, all data needed to fit into RAM for acceptable
> performance. Other deployments may not have such strict RAM
> requirements.
> 2. Document size is less than 4k. Such documents, when moved, may be
> too small to free up pages and, thus, memory.
> 3. Shard key order and insertion order are different. This prevents
> data from being moved in contiguous chunks.
> Most sharded deployments will not meet these criteria. Anyone whose
> documents are larger than 4KB will not suffer significant
> fragmentation because the pages that aren’t being used won’t be
> cached.
> * Prevention
> The main thing to remember here is that once you’re at max capacity,
> it’s difficult to add more capacity without some downtime when objects
> are small. However, if caught in advance, adding more shards on a
> live system can be done with no downtime.
> For example, if we had notifications in place to alert us 12 hours
> earlier that we needed more capacity, we could have added a third
> shard, migrated data, and then compacted the slaves.
> Another salient point: when you’re operating at or near capacity,
> realize that if things get slow at your hosting provider, you may find
> yourself all of a sudden effectively over capacity.
> * Final Thoughts
> The 10gen tech team is working hard to correct the issues exposed by
> this outage. We will continue to work as hard as possible to ensure
> that everyone using MongoDB has the best possible experience. We are
> thankful for the support that we have received from Foursquare and our
> community during this unfortunate episode. As always, please let us
> know if you have any questions or concerns.
> [1] Chunks get split when they are 200MB into 2 100MB halves. This
> means that even if the number of chunks on each shard was the same,
> data size is not always so. This is something we are going to be
> addressing in MongoDB. We'll be making splitting balancing look for
> this imbalance so it can act upon it.
> [2] The 10gen team is working on doing online incremental compaction
> of both data files and indexes. We know this has been a concern in
> non-sharded systems as well. More details about this will be coming
> in the next few weeks.