Foursquare outage post mortem
Eliot Horowitz  
From: Eliot Horowitz <eliothorow...@gmail.com>
Date: Fri, 8 Oct 2010 13:19:53 -0400
Local: Fri, Oct 8 2010 1:19 pm
Subject: Re: [mongodb-user] Re: Foursquare outage post mortem
diptamay: see footnote 1.


 
Markus Gattol  
From: Markus Gattol <markus.gat...@gmail.com>
Date: Fri, 8 Oct 2010 11:03:13 -0700 (PDT)
Local: Fri, Oct 8 2010 2:03 pm
Subject: Re: Foursquare outage post mortem
On Oct 8, 6:16 pm, diptamay <dipta...@gmail.com> wrote:

> MongoDB uses range based partitioning of its shards. So even if some
> power users were involved in doing loads of check-ins, what I don't
> understand is why didn't MongoDB sharding split the user ranges
> further and swap out the chunks and migrate to the other shard. Any
> thoughts?

Read the "What we missed in the interim" section again and also [1], which is
referenced from it. In a nutshell: even if you have the same number of
chunks on each shard, that does not mean that the data set size is
the same on each shard. This led to a situation where one
shard didn't have enough RAM left, which then started a vicious
cycle ...
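
A minimal sketch of this point in Python. The shard names and chunk sizes are invented for illustration and are not Foursquare's numbers; the sketch only shows that equal chunk counts can hide very unequal data sizes.

```python
# Hypothetical chunk sizes in MB; the numbers are made up for illustration.
shards = {
    "shard0": [180, 190, 200, 195],  # chunks holding heavy users
    "shard1": [40, 55, 60, 45],      # same chunk count, far less data
}

def summarize(shards):
    """Return {shard: (chunk_count, total_mb)} for each shard."""
    return {name: (len(chunks), sum(chunks)) for name, chunks in shards.items()}

summary = summarize(shards)
for name, (count, total_mb) in sorted(summary.items()):
    print(f"{name}: {count} chunks, {total_mb} MB")
```

A balancer that looks only at the chunk count would consider these two shards even, although one holds several times as much data as the other.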

 
Jonathan Ultis  
From: Jonathan Ultis <jonathan.ul...@gmail.com>
Date: Fri, 8 Oct 2010 11:37:26 -0700 (PDT)
Local: Fri, Oct 8 2010 2:37 pm
Subject: Re: Foursquare outage post mortem
I like that thought. Yeah, mlocking memory will only provide a runway
for read throughput problems due to the working set size exceeding
memory. If the problem was write throughput...

Let's see, increasing the dirty_ratio would let the OS hold more dirty
pages in memory. That won't change the total write throughput by
itself. But, it might allow more edits to land on already dirty pages.
If the edits end up merging, then that would reduce the amount of
write throughput needed. Hopefully. Is that the idea?
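
The merging idea can be sketched with a toy model in Python. It is an assumption-laden simplification: `flush_every` stands in for how long the OS lets dirty pages accumulate (it is not a real kernel parameter), and the page counts and access pattern are arbitrary.

```python
import random

def pages_flushed(writes, flush_every):
    """Count page write-backs when dirty pages are flushed every `flush_every` writes.

    `writes` is a sequence of page numbers being modified. Repeated writes to a
    page that is already dirty are merged into one write-back at flush time.
    """
    dirty = set()
    flushed = 0
    for i, page in enumerate(writes, start=1):
        dirty.add(page)
        if i % flush_every == 0:
            flushed += len(dirty)
            dirty.clear()
    return flushed + len(dirty)  # final flush of whatever is still dirty

rng = random.Random(0)
writes = [rng.randrange(1000) for _ in range(20000)]  # 20000 edits over 1000 pages
small = pages_flushed(writes, flush_every=100)    # flush often: little merging
large = pages_flushed(writes, flush_every=5000)   # flush rarely: more merging
print(small, large)
```

In this model the number of edits is the same in both runs; only the number of page write-backs drops when dirty pages are allowed to sit longer.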

Are you sure the root problem was write throughput? That doesn't fit
entirely with the explanation of why adding a new shard didn't fix the
problem. Moving blocks to a new machine should reduce the write rate
on the original shard immediately, even if there is fragmentation that
prevents the working set size from shrinking.

Isn't it more likely that the working set size for read exceeded
memory, causing reads to hit disk sometimes? The new read load was big
enough that moving a percentage of the write load onto a new shard did
not fix the IO contention on the original machine. Increasing the
available RAM would have removed the read IO entirely, fixing the
problem and creating a runway.
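
A rough Python sketch of why reads start hitting disk once the working set outgrows memory. It assumes a strict LRU page cache and uniformly random reads, neither of which is confirmed for the real system; the page counts are arbitrary.

```python
import random
from collections import OrderedDict

def hit_ratio(accesses, capacity):
    """Fraction of accesses served from an LRU cache holding `capacity` pages."""
    cache = OrderedDict()
    hits = 0
    for page in accesses:
        if page in cache:
            hits += 1
            cache.move_to_end(page)        # mark as most recently used
        else:
            cache[page] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used page
    return hits / len(accesses)

rng = random.Random(1)
ram_pages = 1000  # hypothetical amount of RAM, in pages
for working_set in (800, 1000, 1200, 2000):
    accesses = [rng.randrange(working_set) for _ in range(50000)]
    print(working_set, round(hit_ratio(accesses, ram_pages), 3))
```

While the working set fits, nearly every read is a hit; once it exceeds the cache, a share of reads turns into disk reads, and that share grows with the overshoot.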



 
David Birdsong  
From: David Birdsong <david.birds...@gmail.com>
Date: Fri, 8 Oct 2010 15:23:38 -0700
Local: Fri, Oct 8 2010 6:23 pm
Subject: Re: [mongodb-user] Re: Foursquare outage post mortem
On Fri, Oct 8, 2010 at 11:37 AM, Jonathan Ultis
<jonathan.ul...@gmail.com> wrote:
> I like that thought. Yeah, mlocking memory will only provide a runway
> for read throughput problems due to the working set size exceeding
> memory. If the problem was write throughput...

> Let's see, increasing the dirty_ratio would let the OS hold more dirty
> pages in memory. That won't change the total write throughput by
> itself. But, it might allow more edits to land on already dirty pages.
> If the edits end up merging, then that would reduce the amount of
> write throughput needed. Hopefully. Is that the idea?

sounds good to me.

> Are you sure the root problem was write throughput? That doesn't fit
> entirely with the explanation of why adding a new shard didn't fix the
> problem. Moving blocks to a new machine should reduce the write rate
> on the original shard immediately, even if there is fragmentation that
> prevents the working set size from shrinking.

> Isn't it more likely that the working set size for read exceeded
> memory, causing reads to hit disk sometimes? The new read load was big
> enough that moving a percentage of the write load onto a new shard did
> not fix the IO contention on the original machine. Increasing the
> available RAM would have removed the read IO entirely, fixing the
> problem and creating a runway.

very possibly yes.  if i understand the VM, linux at least, reading
all over an mmap'd space does not count toward the dirty page ratio, but
the pages read are perhaps bounded by all of available RAM (swap too?).
if one reads beyond what's available, then other pages are evicted.  i
can't seem to find any reference to describe which pages are chosen for
eviction, perhaps LRU - that's just a guess.

to provide the runway you proposed, using the dirty page settings
would not help in the case of heavily randomized reads i guess.
another way to do the mlock safeguard is to set the
min_free_kbytes setting to include this runway and simply remove that
runway amount when things get hairy enough.
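
For reference, a small Python sketch that reads the settings mentioned here on a Linux box. The files under /proc/sys/vm are the standard sysctl locations; the helper name `read_vm_tunables` is made up for this example, and it only reads values, it does not change them.

```python
from pathlib import Path

TUNABLES = ("dirty_ratio", "dirty_background_ratio", "min_free_kbytes")

def read_vm_tunables(base="/proc/sys/vm", names=TUNABLES):
    """Return {name: int or None} for each tunable; None if it cannot be read."""
    values = {}
    for name in names:
        try:
            values[name] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            values[name] = None  # missing file, no permission, or non-numeric
    return values

if __name__ == "__main__":
    for name, value in read_vm_tunables().items():
        print(f"vm.{name} = {value}")
```

Changing the values requires root (for example through sysctl), and any runway amount would have to be picked per machine.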

i know ssd drives were already brought up, but if the problem was in
fact random reads and not necessarily writes, ssd's could have helped.
 my guess is that it was probably both.


 
Ethan Whitt  
From: Ethan Whitt <ethan.l.wh...@gmail.com>
Date: Fri, 8 Oct 2010 17:50:30 -0700
Local: Fri, Oct 8 2010 8:50 pm
Subject: Re: [mongodb-user] Re: Foursquare outage post mortem
Fusion-IO is another option

http://www.fusionio.com/products/


 
Ron Bodkin  
From: Ron Bodkin <rbod...@gmail.com>
Date: Mon, 11 Oct 2010 09:05:45 -0700 (PDT)
Local: Mon, Oct 11 2010 12:05 pm
Subject: Re: Foursquare outage post mortem
I've enjoyed reading the thread. Like Alex, I'm not a MongoDB expert,
but I am trying to better understand what happened and what can be done
differently. I'd love to know:

1) It sounds like all the database is in the working set. I would have
guessed that only a small fraction of all historical check-ins are
likely to be read (<1%), and that old check-ins are likely to be
clustered on pages that aren't read often, so that is surprising to
me. Can you say what fraction of the objects are likely to be read in
a given hour and what the turnover of working set objects is?

2) Later in the thread Eliot noted:
MongoDB should be better about handling situations like this and
degrade much more gracefully.
We'll be working on these enhancements soon as well.

What kind of enhancements are you planning? Better VM management?
Re-clustering objects to push inactive objects into pages that are on
disk? A paging scheme like Redis uses (as described in
http://antirez.com/post/what-is-wrong-with-2006-programming.html)?

3) From the description, it seems like you should be able to monitor
the resident memory used by the mongo db process to get an alert as
shard memory ran low. Is that viable? If not, what can be done to
identify cases where the working set is approaching available memory? It
seems like monitoring memory/working sets for a mongo db instance would
be a generally useful facility - are there plans to add this
capability? What's the best practice today?

4) With respect to using SSD's - what is the write pattern for pages
that get evicted? If they are written randomly using OS paging, I
wouldn't expect there to be a benefit from SSD's (you wouldn't be able
to evict pages fast enough), although if mongodb were able to evict
larger chunks of pages from RAM so you had fewer bigger random writes,
with lots of smaller random reads, that could be a big win.

Thanks,
Ron



 
Eliot Horowitz  
From: Eliot Horowitz <eliothorow...@gmail.com>
Date: Mon, 11 Oct 2010 18:05:59 -0400
Local: Mon, Oct 11 2010 6:05 pm
Subject: Re: [mongodb-user] Re: Foursquare outage post mortem
Answers below

> 1) It sounds like all the database is in the working set. I would have
> guessed that only a small fraction of all historical check-ins are
> likely to be read (<1%), and that old check-ins are likely to be
> clustered on pages that aren't read often, so that is surprising to
> me. Can you say what fraction of the objects are likely to be read in
> a given hour and what the turnover of working set objects is?

I can't give a lot of details here, but a very high percentage of
documents were touched very often.
Much higher than you would expect.

> 2) Later in the thread Eliot noted:
> MongoDB should be better about handling situations like this and
> degrade much more gracefully.
> We'll be working on these enhancements soon as well.

> What kind of enhancements are you planning? Better VM management?
> Re-clustering objects to push inactive objects into pages that are on
> disk? A paging scheme like Redis uses (as described in
> http://antirez.com/post/what-is-wrong-with-2006-programming.html)?

The big issue really is concurrency.  The VM side works well; the
problem is that a read-write lock is too coarse.
One thread that causes a fault can cause a bigger slowdown than it should be able to.
We'll be addressing this in a few ways: making yielding more
intelligent, and real intra-collection concurrency.
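
A back-of-the-envelope model of that effect, in Python. It assumes every operation runs under one lock and that a page fault holds the lock for the whole disk wait; the 0.1 ms and 10 ms timings are illustrative guesses, not measurements of MongoDB.

```python
def serialized_throughput(p_fault, t_hit_ms=0.1, t_fault_ms=10.0):
    """Ops/sec when all operations are serialized and a fault holds the lock.

    p_fault    -- fraction of operations that page-fault (assumed)
    t_hit_ms   -- time for an operation served from RAM (assumed)
    t_fault_ms -- extra time spent waiting on disk for a fault (assumed)
    """
    mean_ms = t_hit_ms + p_fault * t_fault_ms
    return 1000.0 / mean_ms

for p in (0.0, 0.01, 0.05, 0.10):
    print(f"{p:.0%} faults -> {serialized_throughput(p):.0f} ops/sec")
```

Under these assumptions, a small share of faulting operations cuts total throughput sharply, because every other operation waits behind the lock.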

> 3) From the description, it seems like you should be able to monitor
> the resident memory used by the mongo db process to get an alert as
> shard memory ran low. Is that viable? If not, what can be done to
> identify cases where the working set is approaching available memory? It
> seems like monitoring memory/working sets for a mongo db instance would
> be a generally useful facility - are there plans to add this
> capability? What's the best practice today?

The key metric to look at is disk operations per second.  If this
starts trending up, that's a good warning sign.
Memory is a bit hard because some things don't have to be in ram
(transaction log) but will be if nothing else is using it.
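
One way to watch that metric on Linux, sketched in Python. It relies on the /proc/diskstats layout (device name in the third column, reads completed in the fourth, writes completed in the eighth). The `trending_up` rule, its window and its 1.5x factor are arbitrary assumptions for illustration, not a recommended threshold.

```python
def parse_diskstats(text):
    """Map device name -> (reads_completed, writes_completed) from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # skip blank or truncated lines
        stats[fields[2]] = (int(fields[3]), int(fields[7]))
    return stats

def ops_per_second(before, after, interval_s):
    """Per-device (reads + writes) per second between two snapshots."""
    rates = {}
    for dev, (r1, w1) in after.items():
        if dev in before:
            r0, w0 = before[dev]
            rates[dev] = ((r1 - r0) + (w1 - w0)) / interval_s
    return rates

def trending_up(samples, window=3, factor=1.5):
    """True if the mean of the last `window` samples exceeds `factor` times
    the mean of all earlier samples."""
    if len(samples) < 2 * window:
        return False  # not enough history to compare
    recent = samples[-window:]
    earlier = samples[:-window]
    return sum(recent) / len(recent) > factor * (sum(earlier) / len(earlier))
```

To use it, read /proc/diskstats twice a known number of seconds apart, pass both snapshots to `ops_per_second`, and keep the per-device results in a list for `trending_up`.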

> 4) With respect to using SSD's - what is the write pattern for pages
> that get evicted? If they are written randomly using OS paging, I
> wouldn't expect there to be a benefit from SSD's (you wouldn't be able
> to evict pages fast enough), although if mongodb were able to evict
> larger chunks of pages from RAM so you had fewer bigger random writes,
> with lots of smaller random reads, that could be a big win.

Not sure I follow.  The random writes/reads are why we think SSDs are
so good (and in our testing).

-Eliot


 
Ron Bodkin  
From: Ron Bodkin <rbod...@gmail.com>
Date: Thu, 14 Oct 2010 20:10:06 -0700 (PDT)
Local: Thurs, Oct 14 2010 11:10 pm
Subject: Re: Foursquare outage post mortem
I wanted to follow up on one point here:

On Oct 11, 6:05 pm, Eliot Horowitz <eliothorow...@gmail.com> wrote:

> > 4) With respect to using SSD's - what is the write pattern for pages
> > that get evicted? If they are written randomly using OS paging, I
> > wouldn't expect there to be a benefit from SSD's (you wouldn't be able
> > to evict pages fast enough), although if mongodb were able to evict
> > larger chunks of pages from RAM so you had fewer bigger random writes,
> > with lots of smaller random reads, that could be a big win.

> Not sure I follow.  The random writes/reads are why we think SSDs are
> so good (and in our testing).

SSD's generally have equally poor random writes compared to spinning
disk - it's only for random reads where they have a latency advantage.
If VM paging is resulting in lots of random writes then I wouldn't
expect SSD's to help. But if there are a lot more random reads, it could
help a lot (e.g., if the system can just discard pages that are stale
but have no modifications, rather than writing them to disk through
paging).
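
The clean-versus-dirty distinction can be sketched in Python as follows. This is a toy LRU page cache, not how the Linux VM or MongoDB actually chooses pages: a miss costs one disk read, evicting a modified page costs one write-back, and evicting an unmodified page costs nothing.

```python
from collections import OrderedDict

def eviction_io(ops, capacity):
    """Return (disk_reads, write_backs) for an LRU page cache.

    `ops` is a sequence of (page, is_write) pairs. A miss costs one disk read.
    Evicting a dirty page costs one write-back; a clean page is simply dropped.
    """
    cache = OrderedDict()  # page -> dirty flag
    reads = write_backs = 0
    for page, is_write in ops:
        if page in cache:
            cache[page] = cache[page] or is_write  # stays dirty once written
            cache.move_to_end(page)
        else:
            reads += 1
            cache[page] = is_write
            if len(cache) > capacity:
                _, dirty = cache.popitem(last=False)  # evict least recently used
                if dirty:
                    write_backs += 1
    return reads, write_backs

# Read-only traffic: evictions are free, all I/O is random reads.
print(eviction_io([(1, False), (2, False), (3, False)], capacity=2))
# One page was modified before being evicted: it must be written back.
print(eviction_io([(1, True), (2, False), (3, False)], capacity=2))
```

In this model a read-mostly workload produces almost only random reads when memory runs short, which is the case where a read-latency advantage would matter most.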

Ron


 
Dwight Merriman  
From: Dwight Merriman <dwi...@10gen.com>
Date: Sun, 17 Oct 2010 17:30:32 -0400
Local: Sun, Oct 17 2010 5:30 pm
Subject: Re: [mongodb-user] Re: Foursquare outage post mortem

perhaps use a log structured file system like jffs2?


 