On Fri, Oct 8, 2010 at 1:16 PM, diptamay <dipta...@gmail.com> wrote: > MongoDB uses range based partitioning of its shards. So even if some > power users were involved in doing loads of check-ins, what I don't > understand is why didn't MongoDB sharding split the user ranges > further and swap out the chunks and migrate to the other shard. Any > thoughts?
> On Oct 7, 6:05 pm, Eliot Horowitz <eliothorow...@gmail.com> wrote: >> (Note: this is being posted with Foursquare’s permission.)
>> As many of you are aware, Foursquare had a significant outage this >> week. The outage was caused by capacity problems on one of the >> machines hosting the MongoDB database used for check-ins. This is an >> account of what happened, why it happened, how it can be prevented, >> and how 10gen is working to improve MongoDB in light of this outage.
>> It’s important to note that throughout this week, 10gen and Foursquare >> engineers have been working together very closely to resolve the >> issue.
>> * Some history >> Foursquare has been hosting check-ins on a MongoDB database for some >> time now. The database was originally running on a single EC2 >> instance with 66GB of RAM. About 2 months ago, in response to >> increased capacity requirements, Foursquare migrated that single >> instance to a two-shard cluster. Now, each shard was running on its >> own 66GB instance, and both shards were also replicating to a slave >> for redundancy. This was an important migration because it allowed >> Foursquare to keep all of their check-in data in RAM, which is >> essential for maintaining acceptable performance.
>> The data had been split into 200 evenly distributed chunks based on >> user id. The first half went to one server, and the other half to the >> other. Each shard had about 33GB of data in RAM at this point, and >> the whole system ran smoothly for two months.
>> * What we missed in the interim >> Over these two months, check-ins were being written continually to >> each shard. Unfortunately, these check-ins did not grow evenly across >> chunks. It’s easy to imagine how this might happen: assuming certain >> subsets of users are more active than others, it’s conceivable that >> their updates might all go to the same shard. That’s what occurred in >> this case, resulting in one shard growing to 66GB and the other only >> to 50GB. [1]
>> * What went wrong >> On Monday morning, the data on one shard (we’ll call it shard0) >> finally grew to about 67GB, surpassing the 66GB of RAM on the hosting >> machine. Whenever data size grows beyond physical RAM, it becomes >> necessary to read and write to disk, which is orders of magnitude >> slower than reading and writing RAM. Thus, certain queries started to >> become very slow, and this caused a backlog that brought the site >> down.
>> We first attempted to fix the problem by adding a third shard. We >> brought the third shard up and started migrating chunks. Queries were >> now being distributed to all three shards, but shard0 continued to hit >> disk very heavily. When this failed to correct itself, we ultimately >> discovered that the problem was due to data fragmentation on shard0. >> In essence, although we had moved 5% of the data from shard0 to the >> new third shard, the data files, in their fragmented state, still >> needed the same amount of RAM. This can be explained by the fact that >> Foursquare check-in documents are small (around 300 bytes each), so >> many of them can fit on a 4KB page. Removing 5% of these just made >> each page a little more sparse, rather than removing pages >> altogether.[2]
>> After the first day's outage it had become clear that chunk migration, >> sans compaction, was not going to solve the immediate problem. By the >> time the second day's outage occurred, we had already move 5% of the >> data off of shard0, so we decided to start an offline process to >> compact the database using MongoDB’s repairDatabase() feature. This >> process took about 4 hours (partly due to the data size, and partly >> because of the slowness of EBS volumes at the time). At the end of >> the 4 hours, the RAM requirements for shard0 had in fact been reduced >> by 5%, allowing us to bring the system back online.
>> * Afterwards >> Since repairing shard0 and adding a third shard, we’ve set up even >> more shards, and now the check-in data is evenly distributed and there >> is a good deal of extra capacity. Still, we had to address the >> fragmentation problem. We ran a repairDatabase() on the slaves, and >> promoted the slaves to masters, further reducing the RAM needed on >> each shard to about 20GB.
>> * How is this issue triggered? >> Several conditions need to be met to trigger the issue that brought >> down Foursquare: >> 1. Systems are at or over capacity. How capacity is defined varies; in >> the case of Foursquare, all data needed to fit into RAM for acceptable >> performance. Other deployments may not have such strict RAM >> requirements. >> 2. Document size is less than 4k. Such documents, when moved, may be >> too small to free up pages and, thus, memory. >> 3. Shard key order and insertion order are different. This prevents >> data from being moved in contiguous chunks.
>> Most sharded deployments will not meet these criteria. Anyone whose >> documents are larger than 4KB will not suffer significant >> fragmentation because the pages that aren’t being used won’t be >> cached.
>> * Prevention >> The main thing to remember here is that once you’re at max capacity, >> it’s difficult to add more capacity without some downtime when objects >> are small. However, if caught in advance, adding more shards on a >> live system can be done with no downtime.
>> For example, if we had notifications in place to alert us 12 hours >> earlier that we needed more capacity, we could have added a third >> shard, migrated data, and then compacted the slaves.
>> Another salient point: when you’re operating at or near capacity, >> realize that if things get slow at your hosting provider, you may find >> yourself all of a sudden effectively over capacity.
>> * Final Thoughts >> The 10gen tech team is working hard to correct the issues exposed by >> this outage. We will continue to work as hard as possible to ensure >> that everyone using MongoDB has the best possible experience. We are >> thankful for the support that we have received from Foursquare and our >> community during this unfortunate episode. As always, please let us >> know if you have any questions or concerns.
>> [1] Chunks get split when they are 200MB into 2 100MB halves. This >> means that even if the number of chunks on each shard was the same, >> data size is not always so. This is something we are going to be >> addressing in MongoDB. We'll be making splitting balancing look for >> this imbalance so it can act upon it.
>> [2] The 10gen team is working on doing online incremental compaction >> of both data files and indexes. We know this has been a concern in >> non-sharded systems as well. More details about this will be coming >> in the next few weeks.
> -- > You received this message because you are subscribed to the Google Groups "mongodb-user" group. > To post to this group, send email to mongodb-user@googlegroups.com. > To unsubscribe from this group, send email to mongodb-user+unsubscribe@googlegroups.com. > For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.
On Oct 8, 6:16 pm, diptamay <dipta...@gmail.com> wrote:
> MongoDB uses range based partitioning of its shards. So even if some
> power users were involved in doing loads of check-ins, what I don't
> understand is why didn't MongoDB sharding split the user ranges
> further and swap out the chunks and migrate to the other shard. Any
> thoughts?
Read the "What we missed in the interim" again and also [1] which is
referenced from it. In a nutshell: even if you have the same number of
chunks on each shard, that does not mean that the data set size per
shard is the same on each shard. This lead to a situation where one
shard didn't have enough RAM left which then started a vicious
cycle ...
I like that thought. Yeah, mlocking memory will only provide a runway
for read throughput problems due to the working set size exceeding
memory. If the problem was write throughput...
Let's see, increasing the dirty_ratio would let the OS hold more dirty
pages in memory. That won't change the total write throughput by
itself. But, it might allow more edits to land on already dirty pages.
If the edits end up merging, then that would reduce the amount of
write throughput needed. Hopefully. Is that the idea?
Are you sure the root problem was write throughput? That doesn't fit
entirely with the explanation of why adding a new shard didn't fix the
problem. Moving blocks to a new machine should reduce the write rate
on the original shard immediately, even if there is fragmentation that
prevents the working set size from shrinking.
Isn't it more likely that the working set size for read exceeded
memory, causing reads to hit disk sometimes? The new read load was big
enough that moving a percentage of the write load onto a new shard did
not fix the IO contention on the original machine. Increasing the
available RAM would have removed the read IO entirely, fixing the
problem and creating a runway.
On Oct 8, 10:14 am, David Birdsong <david.birds...@gmail.com> wrote:
> On Fri, Oct 8, 2010 at 9:15 AM, Jonathan Ultis <jonathan.ul...@gmail.com> wrote:
> > That is why setting a limit on the RAM used by MongoDB would be
> > useful. If it were possible to set a limit, they could have limited
> > the machine to 64GB resident. Then, when the system started to fall
> > over, they could stop MongoDB, bump the limit to 66GB, then turn
> > MongoDB back on.
> AFAIK, Mongo isn't malloc'ing memory. you can mmap 100x physical RAM
> on the box. it isn't resident memory; it's all virtual. If mongo
> dirties pages beyond the configured amount that the system allows to
> be dirty, then the vm manager will flush pages to the backnig mmap
> file. Usually that configured amount is derived via percentages of
> the system RAM. In foursquare's case, they were dirtying pages above
> that threshold and the debt owed to IOwait piled up. I'm not negating
> your later idea though.
> > Placing a limit on the memory used by MongoDB artificially reduces the
> > overall system capacity. But, it creates a more reliable early warning
> > with a fast path to a temporary fix. The temporary fix will usually
> > last long enough to get new shards online.
> Theoretically you could lower that aforementioned dirty percentage
> using the vm tuning tools. In linux, /proc/sys/vm/... If the working
> set is diryting pages beyond the thresholds, then the os will start
> flushing to disk and mongo will get slow, you could simply raise the
> vm thresholds and get this temporary runway.
> > Brainstorming.... one could allocate memory in another process, mlock
> > it so that it can't be swapped out, and then just let it sit in an
> > infinite sleep. If the server starts to fall over due to lack of
> > memory, kill the process with the mlocked memory, and boom. You have
> > runway.
> > -Jonathan
> > On Oct 8, 7:01 am, Eliot Horowitz <eliothorow...@gmail.com> wrote:
> >> Knowing what the actual limit can be tricky.
> >> There was monitoring in place, but the limit for when to add capacity
> >> wasn't quite right.
> >> On Fri, Oct 8, 2010 at 1:25 AM, Dinesh <dinesh.a.jo...@gmail.com> wrote:
> >> > Hi Eliot,
> >> > Your post was very informative. One question that remains (and I cant
> >> > believe that nobody has raised it yet) is - Why was there no
> >> > monitoring system in place to fore-warn your team?
> >> > Isn't this a basic requirement? I mean you dont expect the data to
> >> > grow beyond your RAM is a very risky constraint. So there should've
> >> > been *some* sort of monitoring in place to warn you that you're about
> >> > to run out of resources. Its a very basic requirement according to me?
> >> > Something like "hey, i'm doing malloc(). Let me check if theres
> >> > ACTUALLY memory available for me.". If such a monitoring system was in
> >> > place and hooked up to your foursquare app, users would've gotten
> >> > error messages while checking in, BUT their existing check-ins
> >> > would've been available and wouldn't have led to the whole site/
> >> > service going down.
> >> > Dinesh
> >> > --
> >> > You received this message because you are subscribed to the Google Groups "mongodb-user" group.
> >> > To post to this group, send email to mongodb-user@googlegroups.com.
> >> > To unsubscribe from this group, send email to mongodb-user+unsubscribe@googlegroups.com.
> >> > For more options, visit this group athttp://groups.google.com/group/mongodb-user?hl=en.
> > --
> > You received this message because you are subscribed to the Google Groups "mongodb-user" group.
> > To post to this group, send email to mongodb-user@googlegroups.com.
> > To unsubscribe from this group, send email to mongodb-user+unsubscribe@googlegroups.com.
> > For more options, visit this group athttp://groups.google.com/group/mongodb-user?hl=en.
<jonathan.ul...@gmail.com> wrote: > I like that thought. Yeah, mlocking memory will only provide a runway > for read throughput problems due to the working set size exceeding > memory. If the problem was write throughput...
> Let's see, increasing the dirty_ratio would let the OS hold more dirty > pages in memory. That won't change the total write throughput by > itself. But, it might allow more edits to land on already dirty pages. > If the edits end up merging, then that would reduce the amount of > write throughput needed. Hopefully. Is that the idea?
sounds good to me.
> Are you sure the root problem was write throughput? That doesn't fit > entirely with the explanation of why adding a new shard didn't fix the > problem. Moving blocks to a new machine should reduce the write rate > on the original shard immediately, even if there is fragmentation that > prevents the working set size from shrinking.
> Isn't it more likely that the working set size for read exceeded > memory, causing reads to hit disk sometimes? The new read load was big > enough that moving a percentage of the write load onto a new shard did > not fix the IO contention on the original machine. Increasing the > available RAM would have removed the read IO entirely, fixing the > problem and creating a runway.
very possibly yes. if i understand the VM, linux at least, reading all over an mmap'd space does count toward the dirty page ratio, but perhaps are bounded by all of available RAM (swap too?). if one reads beyond what's available, then other pages are evicted. i can't seem to find any reference to describe which pages are chosen for eviction, perhaps LRU -that's just a guess.
to provide the runway you proposed, using the dirty page settings would not help in the case of heavily randomized reads i guess. another way to do the mlock safe guard is to set the the min_free_kbyes setting to include this runway and simply remove that runway amount when things get hairy enough.
i know ssd drives were already brought up, but if the problem was in fact random reads and not necessarily writes, ssd's could have helped. my guess is that it was probably both.
> On Oct 8, 10:14 am, David Birdsong <david.birds...@gmail.com> wrote: >> On Fri, Oct 8, 2010 at 9:15 AM, Jonathan Ultis <jonathan.ul...@gmail.com> wrote: >> > That is why setting a limit on the RAM used by MongoDB would be >> > useful. If it were possible to set a limit, they could have limited >> > the machine to 64GB resident. Then, when the system started to fall >> > over, they could stop MongoDB, bump the limit to 66GB, then turn >> > MongoDB back on.
>> AFAIK, Mongo isn't malloc'ing memory. you can mmap 100x physical RAM >> on the box. it isn't resident memory; it's all virtual. If mongo >> dirties pages beyond the configured amount that the system allows to >> be dirty, then the vm manager will flush pages to the backnig mmap >> file. Usually that configured amount is derived via percentages of >> the system RAM. In foursquare's case, they were dirtying pages above >> that threshold and the debt owed to IOwait piled up. I'm not negating >> your later idea though.
>> > Placing a limit on the memory used by MongoDB artificially reduces the >> > overall system capacity. But, it creates a more reliable early warning >> > with a fast path to a temporary fix. The temporary fix will usually >> > last long enough to get new shards online.
>> Theoretically you could lower that aforementioned dirty percentage >> using the vm tuning tools. In linux, /proc/sys/vm/... If the working >> set is diryting pages beyond the thresholds, then the os will start >> flushing to disk and mongo will get slow, you could simply raise the >> vm thresholds and get this temporary runway.
>> > Brainstorming.... one could allocate memory in another process, mlock >> > it so that it can't be swapped out, and then just let it sit in an >> > infinite sleep. If the server starts to fall over due to lack of >> > memory, kill the process with the mlocked memory, and boom. You have >> > runway.
>> > -Jonathan
>> > On Oct 8, 7:01 am, Eliot Horowitz <eliothorow...@gmail.com> wrote: >> >> Knowing what the actual limit can be tricky. >> >> There was monitoring in place, but the limit for when to add capacity >> >> wasn't quite right.
>> >> On Fri, Oct 8, 2010 at 1:25 AM, Dinesh <dinesh.a.jo...@gmail.com> wrote: >> >> > Hi Eliot,
>> >> > Your post was very informative. One question that remains (and I cant >> >> > believe that nobody has raised it yet) is - Why was there no >> >> > monitoring system in place to fore-warn your team?
>> >> > Isn't this a basic requirement? I mean you dont expect the data to >> >> > grow beyond your RAM is a very risky constraint. So there should've >> >> > been *some* sort of monitoring in place to warn you that you're about >> >> > to run out of resources. Its a very basic requirement according to me? >> >> > Something like "hey, i'm doing malloc(). Let me check if theres >> >> > ACTUALLY memory available for me.". If such a monitoring system was in >> >> > place and hooked up to your foursquare app, users would've gotten >> >> > error messages while checking in, BUT their existing check-ins >> >> > would've been available and wouldn't have led to the whole site/ >> >> > service going down.
>> >> > Dinesh
>> >> > -- >> >> > You received this message because you are subscribed to the Google Groups "mongodb-user" group. >> >> > To post to this group, send email to mongodb-user@googlegroups.com. >> >> > To unsubscribe from this group, send email to mongodb-user+unsubscribe@googlegroups.com. >> >> > For more options, visit this group athttp://groups.google.com/group/mongodb-user?hl=en.
>> > -- >> > You received this message because you are subscribed to the Google Groups "mongodb-user" group. >> > To post to this group, send email to mongodb-user@googlegroups.com. >> > To unsubscribe from this group, send email to mongodb-user+unsubscribe@googlegroups.com. >> > For more options, visit this group athttp://groups.google.com/group/mongodb-user?hl=en.
> -- > You received this message because you are subscribed to the Google Groups "mongodb-user" group. > To post to this group, send email to mongodb-user@googlegroups.com. > To unsubscribe from this group, send email to mongodb-user+unsubscribe@googlegroups.com. > For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.
On Fri, Oct 8, 2010 at 3:23 PM, David Birdsong <david.birds...@gmail.com> wrote: > On Fri, Oct 8, 2010 at 11:37 AM, Jonathan Ultis > <jonathan.ul...@gmail.com> wrote: >> I like that thought. Yeah, mlocking memory will only provide a runway >> for read throughput problems due to the working set size exceeding >> memory. If the problem was write throughput...
>> Let's see, increasing the dirty_ratio would let the OS hold more dirty >> pages in memory. That won't change the total write throughput by >> itself. But, it might allow more edits to land on already dirty pages. >> If the edits end up merging, then that would reduce the amount of >> write throughput needed. Hopefully. Is that the idea? > sounds good to me.
>> Are you sure the root problem was write throughput? That doesn't fit >> entirely with the explanation of why adding a new shard didn't fix the >> problem. Moving blocks to a new machine should reduce the write rate >> on the original shard immediately, even if there is fragmentation that >> prevents the working set size from shrinking.
>> Isn't it more likely that the working set size for read exceeded >> memory, causing reads to hit disk sometimes? The new read load was big >> enough that moving a percentage of the write load onto a new shard did >> not fix the IO contention on the original machine. Increasing the >> available RAM would have removed the read IO entirely, fixing the >> problem and creating a runway.
> very possibly yes. if i understand the VM, linux at least, reading > all over an mmap'd space does count toward the dirty page ratio, but > perhaps are bounded by all of available RAM (swap too?). if one reads > beyond what's available, then other pages are evicted. i can't seem > to find any reference to describe which pages are chosen for eviction, > perhaps LRU -that's just a guess.
> to provide the runway you proposed, using the dirty page settings > would not help in the case of heavily randomized reads i guess. > another way to do the mlock safe guard is to set the the > min_free_kbyes setting to include this runway and simply remove that > runway amount when things get hairy enough.
> i know ssd drives were already brought up, but if the problem was in > fact random reads and not necessarily writes, ssd's could have helped. > my guess is that it was probably both.
>> On Oct 8, 10:14 am, David Birdsong <david.birds...@gmail.com> wrote: >>> On Fri, Oct 8, 2010 at 9:15 AM, Jonathan Ultis <jonathan.ul...@gmail.com> wrote: >>> > That is why setting a limit on the RAM used by MongoDB would be >>> > useful. If it were possible to set a limit, they could have limited >>> > the machine to 64GB resident. Then, when the system started to fall >>> > over, they could stop MongoDB, bump the limit to 66GB, then turn >>> > MongoDB back on.
>>> AFAIK, Mongo isn't malloc'ing memory. you can mmap 100x physical RAM >>> on the box. it isn't resident memory; it's all virtual. If mongo >>> dirties pages beyond the configured amount that the system allows to >>> be dirty, then the vm manager will flush pages to the backnig mmap >>> file. Usually that configured amount is derived via percentages of >>> the system RAM. In foursquare's case, they were dirtying pages above >>> that threshold and the debt owed to IOwait piled up. I'm not negating >>> your later idea though.
>>> > Placing a limit on the memory used by MongoDB artificially reduces the >>> > overall system capacity. But, it creates a more reliable early warning >>> > with a fast path to a temporary fix. The temporary fix will usually >>> > last long enough to get new shards online.
>>> Theoretically you could lower that aforementioned dirty percentage >>> using the vm tuning tools. In linux, /proc/sys/vm/... If the working >>> set is diryting pages beyond the thresholds, then the os will start >>> flushing to disk and mongo will get slow, you could simply raise the >>> vm thresholds and get this temporary runway.
>>> > Brainstorming.... one could allocate memory in another process, mlock >>> > it so that it can't be swapped out, and then just let it sit in an >>> > infinite sleep. If the server starts to fall over due to lack of >>> > memory, kill the process with the mlocked memory, and boom. You have >>> > runway.
>>> > -Jonathan
>>> > On Oct 8, 7:01 am, Eliot Horowitz <eliothorow...@gmail.com> wrote: >>> >> Knowing what the actual limit can be tricky. >>> >> There was monitoring in place, but the limit for when to add capacity >>> >> wasn't quite right.
>>> >> On Fri, Oct 8, 2010 at 1:25 AM, Dinesh <dinesh.a.jo...@gmail.com> wrote: >>> >> > Hi Eliot,
>>> >> > Your post was very informative. One question that remains (and I cant >>> >> > believe that nobody has raised it yet) is - Why was there no >>> >> > monitoring system in place to fore-warn your team?
>>> >> > Isn't this a basic requirement? I mean you dont expect the data to >>> >> > grow beyond your RAM is a very risky constraint. So there should've >>> >> > been *some* sort of monitoring in place to warn you that you're about >>> >> > to run out of resources. Its a very basic requirement according to me? >>> >> > Something like "hey, i'm doing malloc(). Let me check if theres >>> >> > ACTUALLY memory available for me.". If such a monitoring system was in >>> >> > place and hooked up to your foursquare app, users would've gotten >>> >> > error messages while checking in, BUT their existing check-ins >>> >> > would've been available and wouldn't have led to the whole site/ >>> >> > service going down.
>>> >> > Dinesh
>>> >> > -- >>> >> > You received this message because you are subscribed to the Google Groups "mongodb-user" group. >>> >> > To post to this group, send email to mongodb-user@googlegroups.com. >>> >> > To unsubscribe from this group, send email to mongodb-user+unsubscribe@googlegroups.com. >>> >> > For more options, visit this group athttp://groups.google.com/group/mongodb-user?hl=en.
>>> > -- >>> > You received this message because you are subscribed to the Google Groups "mongodb-user" group. >>> > To post to this group, send email to mongodb-user@googlegroups.com. >>> > To unsubscribe from this group, send email to mongodb-user+unsubscribe@googlegroups.com. >>> > For more options, visit this group athttp://groups.google.com/group/mongodb-user?hl=en.
>> -- >> You received this message because you are subscribed to the Google Groups "mongodb-user" group. >> To post to this group, send email to mongodb-user@googlegroups.com. >> To unsubscribe from this group, send email to mongodb-user+unsubscribe@googlegroups.com. >> For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.
> -- > You received this message because you are subscribed to the Google Groups "mongodb-user" group. > To post to this group, send email to mongodb-user@googlegroups.com. > To unsubscribe from this group, send email to mongodb-user+unsubscribe@googlegroups.com. > For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.
I've enjoyed reading the thread. Like Alex I'm not a MongoDB expert,
but am trying to understand what happened better and what can be done
differently, I'd love to know:
1) It sounds like all the database is in the working set. I would have
guessed that only a small fraction of all historical check-ins are
likely to be read (<1%), and that old check-ins are likely to be
clustered on pages that aren't read often, so that is surprising to
me. Can you say what fraction of the objects are likely to be read in
a given hour and what the turnover of working set objects is?
2) Later in the thread Eliot noted:
MongoDB should be better about handling situations like this and
degrade much more gracefully.
We'll be working on these enhancements soon as well.
3) From the description, it seems like you should be able to monitor
the resident memory used by the mongo db process to get an alert as
shard memory ran low. Is that viable? If not, what can be done to
identify cases where the working set is approaching available? It
seems like monitoring memory/working sets for mongo db instance would
be a generally useful facility - are there plans to add this
capability? What's the best practice today?
4) With respect to using SSD's - what is the write pattern for pages
that get evicted? If they are written randomly using OS paging, I
wouldn't expect there to be a benefit from SSD's (you wouldn't be able
to evict pages fast enough), although if mongodb were able to evict
larger chunks of pages from RAM so you had fewer bigger random writes,
with lots of smaller random reads, that could be a big win.
Thanks,
Ron
On Oct 7, 7:32 pm, harryh <har...@gmail.com> wrote:
> Just as a quick follow-up to Eliot's detailed message on our outage I
> want to make it clear to the community how helpful Eliot and the rest
> of the gang at 10gen has been to us in the last couple days. We've
> been in near constant communication with them as we worked to repair
> things and end the outage, and then to identify and get to work on
> fixing things so that this sort of thing won't happen again.
> They've been a huge help to us, and things would have been much worse
> without them.
> Further, I am very confident that some of the issues that we've
> identified in Mongo will be dealt with so others don't encounter
> similar problems. Overall we still remain huge fans of MongoDB @
> foursquare, and expect to be using it for a long time to come.
> I'm more than happy to help answer any concerns that others might have
> about MongoDB as it relates to this incident.
> 1) It sounds like all the database is in the working set. I would have > guessed that only a small fraction of all historical check-ins are > likely to be read (<1%), and that old check-ins are likely to be > clustered on pages that aren't read often, so that is surprising to > me. Can you say what fraction of the objects are likely to be read in > a given hour and what the turnover of working set objects is?
I can't give a lot of details here, but a very high percentage of documents were touched very often. Much higher than you would expect.
> 2) Later in the thread Eliot noted: > MongoDB should be better about handling situations like this and > degrade much more gracefully. > We'll be working on these enhancements soon as well.
> What kind of enhancements are you planning? Better VM management? Re- > custering objects to push inactive objects into pages that are on > disk? A paging scheme like Redis uses (as described in > http://antirez.com/post/what-is-wrong-with-2006-programming.html)?
The big issue really is concurrency. The VM side works well, the problem is a read-write lock is too coarse. 1 thread that casues a fault can a bigger slowdown than it should be able to. We'll be addressing this in a few ways: making yielding more intelligent, real intra-collection concurrency.
> 3) From the description, it seems like you should be able to monitor > the resident memory used by the mongo db process to get an alert as > shard memory ran low. Is that viable? If not, what can be done to > identify cases where the working set is approaching available? It > seems like monitoring memory/working sets for mongo db instance would > be a generally useful facility - are there plans to add this > capability? What's the best practice today?
The key metric to look at is disk operations per second. If this starts trending up, that's a good warning side. Memory is a bit hard because some things don't have to be in ram (transaction log) but will be if nothing else is using it.
> 4) With respect to using SSD's - what is the write pattern for pages > that get evicted? If they are written randomly using OS paging, I > wouldn't expect there to be a benefit from SSD's (you wouldn't be able > to evict pages fast enough), although if mongodb were able to evict > larger chunks of pages from RAM so you had fewer bigger random writes, > with lots of smaller random reads, that could be a big win.
Not sure I follow. The random writes/reads are why we think SSDs are so good (and in our testing).
> On Oct 7, 7:32 pm, harryh <har...@gmail.com> wrote: >> Just as a quick follow-up to Eliot's detailed message on our outage I >> want to make it clear to the community how helpful Eliot and the rest >> of the gang at 10gen has been to us in the last couple days. We've >> been in near constant communication with them as we worked to repair >> things and end the outage, and then to identify and get to work on >> fixing things so that this sort of thing won't happen again.
>> They've been a huge help to us, and things would have been much worse >> without them.
>> Further, I am very confident that some of the issues that we've >> identified in Mongo will be dealt with so others don't encounter >> similar problems. Overall we still remain huge fans of MongoDB @ >> foursquare, and expect to be using it for a long time to come.
>> I'm more than happy to help answer any concerns that others might have >> about MongoDB as it relates to this incident.
>> -harryh, eng lead @ foursquare
> -- > You received this message because you are subscribed to the Google Groups "mongodb-user" group. > To post to this group, send email to mongodb-user@googlegroups.com. > To unsubscribe from this group, send email to mongodb-user+unsubscribe@googlegroups.com. > For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.
On Oct 11, 6:05 pm, Eliot Horowitz <eliothorow...@gmail.com> wrote:
> > 4) With respect to using SSD's - what is the write pattern for pages
> > that get evicted? If they are written randomly using OS paging, I
> > wouldn't expect there to be a benefit from SSD's (you wouldn't be able
> > to evict pages fast enough), although if mongodb were able to evict
> > larger chunks of pages from RAM so you had fewer bigger random writes,
> > with lots of smaller random reads, that could be a big win.
> Not sure I follow. The random writes/reads are why we think SSDs are
> so good (and in our testing).
SSD's generally have equally poor random writes compared to spinning
disk - it's only for random reads where they have a latency advantage.
If VM paging is resulting in lots of random writes then I wouldn't
expect SSD's to help. But if there a lot more random reads, it could
help a lot (e.g., if the system can just discard pages that are stale
but have no modifications, rather than writing them to disk through
paging).
On Thu, Oct 14, 2010 at 11:10 PM, Ron Bodkin <rbod...@gmail.com> wrote: > I wanted to follow up on one point here:
> On Oct 11, 6:05 pm, Eliot Horowitz <eliothorow...@gmail.com> wrote: > > > 4) With respect to using SSD's - what is the write pattern for pages > > > that get evicted? If they are written randomly using OS paging, I > > > wouldn't expect there to be a benefit from SSD's (you wouldn't be able > > > to evict pages fast enough), although if mongodb were able to evict > > > larger chunks of pages from RAM so you had fewer bigger random writes, > > > with lots of smaller random reads, that could be a big win.
> > Not sure I follow. The random writes/reads are why we think SSDs are > > so good (and in our testing).
> SSD's generally have equally poor random writes compared to spinning > disk - it's only for random reads where they have a latency advantage. > If VM paging is resulting in lots of random writes then I wouldn't > expect SSD's to help. But if there a lot more random reads, it could > help a lot (e.g., if the system can just discard pages that are stale > but have no modifications, rather than writing them to disk through > paging).
> Ron
> -- > You received this message because you are subscribed to the Google Groups > "mongodb-user" group. > To post to this group, send email to mongodb-user@googlegroups.com. > To unsubscribe from this group, send email to > mongodb-user+unsubscribe@googlegroups.com<mongodb-user%2Bunsubscribe@google groups.com> > . > For more options, visit this group at > http://groups.google.com/group/mongodb-user?hl=en.