To expand on John Whitlock's summary ... Here are some recent examples,
ranging from "Everything's fine" to "Hmm ... something might be off" to
"OMG it's on fire".
The 60-minute window around yesterday's push of some MDN 10th anniversary
tweaks [1][2] is about as "everything's fine" as it currently gets for MDN.
[image: Inline image 1]
There's an expected spike in request queuing at the time of deployment:
while the 3 web-heads restart Apache, requests are queued for < 1 minute.
The response time before and after the push looks identical - around
100-120ms - and the error rate, throughput, and Apdex graphs are all
essentially unaffected by the push.
Wednesday's push of the original MDN 10th anniversary assets [3][4] looked
essentially the same; this is typical of deployments that push only
front-end assets.
[image: Inline image 2]
On Monday, we deployed a couple of days' worth of changes: the code sample
widget, the helpfulness widget, and main/head navigation changes. [5][6][7]
Hmm ... something looks a little off ...
[image: Inline image 3]
Note the two subsequent spikes after the initial one. These correspond to the
waffle flag changes, because*:
1. waffle uses cookies to put users into feature flag "buckets" to
determine whether they see the waffled feature or not
2. We send Vary: Cookie in our HTTP response headers, so cached responses
are keyed on users' cookies, and...
3. Once the new waffle cookies are set, a whole new batch of clients miss
the cache and send requests through the load-balancer to the web-heads to
get their next response, causing another spike of request queuing.
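To make that concrete, here's a minimal sketch - not kuma's actual code - of
a waffled Django view that sends Vary: Cookie. The view, flag name, and page
bodies are made up for illustration; the waffle and Django calls are real:

    import waffle  # django-waffle
    from django.http import HttpResponse
    from django.utils.cache import patch_vary_headers

    def document(request):
        # 1. waffle puts the user into a bucket; for a percentage flag the
        #    decision is remembered in a dwf_<flagname> cookie (set by
        #    waffle's middleware) so the user stays in that bucket.
        if waffle.flag_is_active(request, 'social_sharing'):
            body = '<html>... page with the share widget ...</html>'
        else:
            body = '<html>... page without it ...</html>'
        response = HttpResponse(body)

        # 2. We send Vary: Cookie, so anything caching our responses keys
        #    them on the user's cookies...
        patch_vary_headers(response, ['Cookie'])

        # 3. ...which means the first request after a new dwf_* cookie is
        #    set no longer matches a cached copy, and a whole batch of
        #    clients fall through the load-balancer to the web-heads at once.
        return response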
Couple things worth pointing out from this:
* We need to monitor "soft deploys" via waffle to make sure the request
spikes don't cause knock-on issues
* Those kinds of things look much scarier as they're happening, because the
up-ticks look like they might go on forever:
[image: Inline image 4]
So, give the request queuing spikes at least 1-2 minutes to come back down
(maybe even 3-5 minutes) before pressing the "OMG" button ...
A couple weeks ago, on July 9, we activated some code to restore the social
sharing A/B test. OMG it's on fire ...
[image: Inline image 2]
In this case, the spike doesn't come back down after 3-5 minutes, so
something is very wrong. The throughput is also dropping (spoiler alert:
it's heading to 0). Time to start digging in ...
If you look carefully, you can see a corresponding spike in "Web external"
in the graph. You can even click the "Request Queuing" label to toggle it
off to make the other components more pronounced ...
[image: Inline image 3]
Pretty obvious where the problem is. At this point, if we know we just
activated the social share widget, we may recognize this as external time
spent on bit.ly. But if we don't realize that, New Relic has a great
"Transactions" report where we can see more specifically where time is
being spent.
Clicking on "Transactions" on the left, we can see the up-tick is in
/kuma.wiki.views:document. (This is almost always the case, as the document
view is the majority of our traffic, so changes there can cause big
issues.) ...
[image: Inline image 4]
Clicking on "/kuma.wiki.views:document" gives us another detailed
break-down of where time is spent inside the transaction ...
[image: Inline image 5]
And here it's obvious that there's essentially nothing in
urllib2[api.bit.ly] until a sudden spike. So we know the problem, and we can
start fixing it. In this particular case, we were able to disable the
feature, and we got back to regular levels in 13 minutes. (Our target
monthly downtime is 22 minutes or less. [8])
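For anyone who hasn't hit this failure mode before, here's a minimal
sketch - not kuma's actual code - of the anti-pattern the breakdown points
at: a synchronous urllib2 call to bit.ly inside the document view's request
path (Python 2-era code to match the urllib2 label; the function, endpoint
path, and parameters are just illustrative):

    import json
    import urllib
    import urllib2

    def shorten_url(long_url, access_token):
        params = urllib.urlencode({'access_token': access_token,
                                   'longUrl': long_url})
        # Every document request blocks here while we wait on an external
        # service. If bit.ly slows down, each call ties up an Apache worker
        # for up to `timeout` seconds, request queuing climbs, and our own
        # throughput heads toward 0.
        response = urllib2.urlopen('https://api.bit.ly/v3/shorten?' + params,
                                   timeout=10)
        return json.loads(response.read())['data']['url']

A short timeout only limits the damage; the quick fix was the one described
above: turn the feature off.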
Couple things worth pointing out from this:
* Again, we need to monitor "soft deploys" via waffle; even mostly
front-end changes may be using back-end code that can cause problems at
scale
* We need to start load-testing our code on the stage server [9]
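On that last point, even something crude could help surface this kind of
problem on stage before it hits prod. A rough sketch - placeholder URL and
made-up numbers - that hammers a document page with concurrent requests and
reports response-time percentiles:

    import time
    import urllib2
    from threading import Thread

    STAGE_URL = 'https://developer.allizom.org/en-US/docs/Web/HTML'  # placeholder
    CONCURRENCY = 20
    REQUESTS_EACH = 25
    timings = []

    def worker():
        for _ in range(REQUESTS_EACH):
            start = time.time()
            try:
                urllib2.urlopen(STAGE_URL, timeout=30).read()
            except urllib2.URLError:
                continue  # only time successful requests in this sketch
            timings.append(time.time() - start)

    threads = [Thread(target=worker) for _ in range(CONCURRENCY)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    timings.sort()
    print 'successful requests: %d' % len(timings)
    print 'median: %.0fms' % (timings[len(timings) / 2] * 1000)
    print 'p95:    %.0fms' % (timings[int(len(timings) * 0.95)] * 1000)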
Hope that helps? I'm also happy to host a New Relic brown-bag or just have
a screen-sharing 1:1 if anyone would like a personalized tour of it.
-L
[1]
http://developeradm.private.scl3.mozilla.com/chief/developer.prod/logs/43e923e4d687e716673afa651ba38a964a42ced2.1437664747
[2]
https://rpm.newrelic.com/accounts/263620/applications/3172075/deployments/2291403
[3]
http://developeradm.private.scl3.mozilla.com/chief/developer.prod/logs/43e923e4d687e716673afa651ba38a964a42ced2.1437664747
[4]
https://rpm.newrelic.com/accounts/263620/applications/3172075/deployments/2287688
[5]
http://developeradm.private.scl3.mozilla.com/chief/developer.prod/logs/df4079242fbf9b5f03f08697c15de10d2fe4ef31.1437424147
[6]
https://rpm.newrelic.com/accounts/263620/applications/3172075/deployments/2278635
[7]
https://github.com/mozilla/kuma/compare/73cb0388bd0864075b098d7f4601a9697c26aadd...df4079242fbf9b5f03f08697c15de10d2fe4ef31
* At least, this is my current theory
[8]
https://docs.google.com/spreadsheets/d/1EAMlBjfuutHJo2ihFAn5dgZ0VrZ_Q3DUvi02BTe6ibI/edit#gid=1516878892
[9]
https://bugzilla.mozilla.org/show_bug.cgi?id=1186081