Re: MDN Reliability Plan

Luke Crouch

Jul 21, 2015, 6:20:34 PM
to dev-mdn, dev-mdc, m...@lists.mozilla.org
UPDATE 1 ...

We have been running with our new deployment guidelines since our last
down-time incident on 2015-07-09. Since then we have seen 0 down-time
incidents, and 1 non-critical slow-down incident. [1]

We have a master bug that's tracking a number of sub-projects and tasks for
MDN devs, WebQA, and WebOps to improve MDN site reliability. [2] Many dev
tasks are already in our dev board. [3]

We are solidifying our Maintenance & Reliability KPI targets [4], and are
using those as our metrics for success of our reliability work & projects.

Thanks,
-L

[1] http://screencast.com/t/1qet7GHGAR
[2] https://bugzilla.mozilla.org/showdependencytree.cgi?id=1186081
[3] https://trello.com/b/p56Gwq46/mdndev-rocks
[4]
https://docs.google.com/spreadsheets/d/1EAMlBjfuutHJo2ihFAn5dgZ0VrZ_Q3DUvi02BTe6ibI/edit#gid=1516878892


On Thu, Jul 9, 2015 at 5:09 PM, Luke Crouch <lcr...@mozilla.com> wrote:

> MDN devs have updated MDN dependencies and have been working on
> performance improvements to the MDN codebase. In the process, we have
> thrown a number of reliability issues with MDN into sharp relief.
> Primarily:
>
>
> - Down-times caused by code subjected to production load [1]
> - Historically high error rates
> - Periodic performance dips
>
>
> In essence - we rocked the MDN boat to make it go faster, and we
> dislodged some plugs and exposed leaks.
>
> So, we're going to take immediate actions to mitigate risks, and are
> starting some projects to address the issues.
>
> Immediate actions:
>
> - All pushes - including "soft pushes" via waffle - must be approved &
> attended by 2 devs
> - After every push, devs must monitor New Relic for 1 hour
> - File bugs for all outages to track cause & work to fix
> - Start a staging environment & process plan with WebQA & WebOps [2]
> - Start a load-testing plan with WebQA, WebOps, & Services Ops [3]
> - Invite WebQA to MDN bug triage to prioritize ISE bugs. [4]
>
>
> Medium term (the next couple of weeks)
>
> - Implement the new stage environment for MDN
> - Implement automated load-testing for MDN as a standard part of the
> acceptance process. Performance regressions should not be pushed to
> production.
> - Add monitoring based on change rates (e.g. rate of errors in Apache
> error_log)
> - Continue profiling and performance improvements.
>
>
> Long term (after that)
>
> - Move to AWS
> - Add better analytics so we know when things are going south
> - Run old stacks and new stacks in parallel so rollbacks are easy
> - Use autoscaling to avoid MDN being brought down by load
>
>
> You may follow the bugs to stay in the loop, and chime in on the etherpads
> with ideas, comments, and feedback on the plans.
>
> Thanks,
> -L
>
>
> [1]
> https://github.com/mozilla/kuma/pull/3324
> https://github.com/mozilla/kuma/pull/3124
>
> [2]
> https://bugzilla.mozilla.org/show_bug.cgi?id=1182182
> https://etherpad.mozilla.org/MDN-stage-environment
>
> [3]
> https://bugzilla.mozilla.org/show_bug.cgi?id=1182198
> https://etherpad.mozilla.org/MDN-load-testing
>
> [4]
> https://bugzilla.mozilla.org/show_bug.cgi?id=1174209
>
>

Stephanie Hobson

Jul 23, 2015, 12:14:24 PM
to Luke Crouch, dev-mdc, dev-mdn, m...@lists.mozilla.org
Could someone give us a short summary of what to look for in New Relic? I'm
not sure which spikes are normal and which are cause for panic. (Example: a
0.06% error rate is a big spike on this graph, but that's still a really
small number.)

Thanks,
Stephanie.

John Whitlock

Jul 23, 2015, 10:47:59 PM
to Stephanie Hobson, dev-mdc, dev-mdn, Luke Crouch, m...@lists.mozilla.org
There's a ton of stuff in New Relic, and it's very useful for finding the
one bit of code that is taking 50% of the time, or the service that
occasionally causes your site to start raising errors. It's worth spending
a little time clicking around and seeing what is there.

To monitor site reliability around pushes, I load the developer.mozilla.org
application, make sure I'm looking at the web transactions response time
(the default), switch the time picker to the last 60 minutes, and look for
the deployment event. It gets rendered as a vertical line, and usually
causes a spike in request queuing as web servers are restarted. After a
few minutes, everything should be happy and back to normal. There does
appear to be another spike in request queuing 10-20 minutes after a push,
but that may be human pattern matching. Once the deployment event is 60
minutes in the past, I can close New Relic.

These graphs take a lot of laptop space, so I like to use the iPad app
instead.

If you want to see what a deployment looks like, switch to the last 24
hours and find a vertical deployment line. Drag over the graph to zoom in
on that event.

To see if the web response time is similar to the past, I use the "compare
with yesterday and last week" checkbox. If the site is twice as slow as it
was last week, it will be visible here. A 10% difference is probably just
traffic patterns, unless you can see the change is correlated with the
production push.
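
A rough sketch of that comparison, in Python. The 10% and 2x thresholds are
the ones mentioned above; the function name and the middle "check the push"
band are illustrative assumptions, not anything New Relic provides.

```python
def classify_response_time(current_ms, last_week_ms):
    """Compare the current response time with the same window last week."""
    if last_week_ms <= 0:
        raise ValueError("baseline must be positive")
    ratio = current_ms / last_week_ms
    if ratio >= 2.0:
        return "investigate"      # twice as slow as last week
    if abs(ratio - 1.0) <= 0.10:
        return "normal"           # within ~10%: probably just traffic patterns
    return "check the push"       # in between: does it line up with a deployment?


print(classify_response_time(105, 100))   # normal
print(classify_response_time(240, 100))   # investigate
```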

An error rate of 0.06% isn't enough to worry about - that's 1 or 2 an
hour. You can load the errors page and see what the errors are. Celery
chord errors are part of the celery multi-task system, and are expected,
especially around restarts.

Also, keep an eye on IRC. Any real issue gets reported there pretty
quickly.

John

Luke Crouch

Jul 24, 2015, 8:49:03 AM
to Stephanie Hobson, dev-mdc, dev-mdn, m...@lists.mozilla.org
To expand on John Whitlock's summary ... Here are some recent examples from
"Everything's fine" to "Hmm ... something might be off" to "OMG it's on
fire".

Yesterday's 60m window of the push of some MDN 10th anniversary tweaks
[1][2] was about as "everything's fine" as it currently gets for MDN.

[image: Inline image 1]

There's an expected spike in request queuing at the time of deployment:
while the 3 web-heads restart Apache, requests are queued for < 1 minute.
The response time before and after the push looks identical - around
100-120ms. The error rate, throughput, and apdex graphs are all essentially
unaffected by the push.

Likewise, Wednesday's push of the original MDN 10th anniversary assets
[3][4] was essentially the same. This is typical of deployments that push
only front-end assets.

[image: Inline image 2]

On Monday, we deployed a couple days of changes: code sample widget,
helpfulness widget, and main/head navigation changes. [5][6][7] Hmm ...
something looks a little off ...

[image: Inline image 3]

Note the 2 subsequent spikes after the initial one. These correspond to the
waffle flag changes, because*:

1. waffle uses cookies to put users into feature flag "buckets" to
determine if they see the waffled feature or not
2. We send Vary: Cookie in our HTTP response headers, so...
3. A whole new batch of clients send requests thru the load-balancer to the
web-heads to get their next response, causing another spike of request
queuing.
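
A toy model of that theory, in Python. It is only a sketch: the class, the
URL, and the flag cookie name are made up for illustration, and it applies
to any cache that honors Vary: Cookie, wherever that cache sits (the thread
doesn't say whether it is the browser or something in front of the
web-heads).

```python
class VaryCache:
    """Toy HTTP cache keyed on the URL plus the request headers
    named in the response's Vary header."""

    def __init__(self):
        self.store = {}
        self.origin_hits = 0

    def get(self, url, request_headers, vary=("Cookie",)):
        key = (url,) + tuple(request_headers.get(h, "") for h in vary)
        if key not in self.store:
            self.origin_hits += 1      # cache miss: request reaches the web-heads
            self.store[key] = "response for %r" % (key,)
        return self.store[key]


cache = VaryCache()
url = "/en-US/docs/Web/HTML"                      # hypothetical page
cache.get(url, {"Cookie": ""})                    # first request: miss
cache.get(url, {"Cookie": ""})                    # same cookies: served from cache
cache.get(url, {"Cookie": "dwf_new_nav=True"})    # flag cookie set: new variant, miss
cache.get(url, {"Cookie": "dwf_new_nav=False"})   # other bucket: another miss
print(cache.origin_hits)                          # 3
```

Every client whose cookies change when the flag is flipped stops matching
its previously cached variant, which would explain a fresh burst of
requests to the web-heads.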

Couple things worth pointing out from this:

* We need to monitor "soft deploys" via waffle to make sure the request
spikes don't cause knock-on issues
* Those kinds of things look much scarier as they're happening, because the
up-ticks look like they might go on forever:

[image: Inline image 4]

So, give the request queuing spikes at least 1-2 minutes to come back down
(maybe even 3-5 minutes) before pressing the "OMG" button ...
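
That rule of thumb can be sketched in Python. The 5-minute grace period is
the upper bound mentioned above; the function name, the 1.5x tolerance, and
the sample numbers are assumptions for illustration, not real MDN data.

```python
def spike_recovered(samples_ms, baseline_ms, grace_minutes=5, tolerance=1.5):
    """Check whether request queuing is back near baseline after a push.

    samples_ms holds one queuing-time sample per minute, starting at the
    minute of the deployment. Returns None while it is too early to tell.
    """
    if len(samples_ms) <= grace_minutes:
        return None                     # not enough data yet: keep watching
    return samples_ms[grace_minutes] <= baseline_ms * tolerance


fine = [400, 250, 60, 12, 10, 10]               # spike, then back to normal
on_fire = [400, 900, 1500, 2200, 3000, 4100]    # spike that keeps climbing

print(spike_recovered(fine, baseline_ms=10))        # True
print(spike_recovered(on_fire, baseline_ms=10))     # False
print(spike_recovered([400, 900], baseline_ms=10))  # None
```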

A couple weeks ago, on July 9, we activated some code to restore the social
sharing A/B test. OMG it's on fire ...

[image: Inline image 2]

In this case, the spike doesn't come back down after 3-5m, so something is
very wrong. The throughput is also dropping (spoiler alert: it's heading to
0). Time to start digging in ...

If you look carefully, you can see a corresponding spike in "Web external"
in the graph. You can even click the "Request Queuing" label to toggle it
off to make the other components more pronounced ...

[image: Inline image 3]

Pretty obvious where the problem is. At this point, if we know we just
activated the social share widget we may recognize this as external time
spent on bitly. But if we don't realize that, New Relic has a great
"Transactions" report where we can see more specifically where time is
being spent.

Clicking on "Transactions" on the left, we can see the up-tick is in
/kuma.wiki.views:document. (This is almost always the case, as the document
view is the majority of our traffic, so changes there can cause big
issues.) ...

[image: Inline image 4]

Clicking on "/kuma.wiki.views:document" gives us another detailed
break-down of where time is spent inside the transaction ...

[image: Inline image 5]

And here it's obvious that there's nothing in urllib2[api.bit.ly], but then
a sudden spike. So, we know the problem and we can start fixing it. In this
particular case, we were able to disable the feature and we got back to
regular levels in 13 minutes. (Our target monthly down-time is 22 minutes
or less [8])
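
For a sense of scale, here is the arithmetic in Python. It assumes a 30-day
month and counts the whole 13 minutes as down-time; both are assumptions,
and the availability percentage is derived here rather than taken from the
KPI sheet.

```python
MINUTES_PER_MONTH = 30 * 24 * 60        # 43,200, assuming a 30-day month

target_downtime = 22                    # minutes per month, from the target [8]
incident_downtime = 13                  # minutes, the July 9 incident above

remaining_budget = target_downtime - incident_downtime
target_availability = 100 * (1 - target_downtime / MINUTES_PER_MONTH)

print(remaining_budget)                 # 9
print(round(target_availability, 2))    # 99.95
```

Under those assumptions, that one incident used up more than half of the
monthly budget.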

Couple things worth pointing out from this:

* Again, we need to monitor "soft deploys" via waffle; even mostly
front-end changes may be using back-end code that can cause problems at
scale
* We need to start load-testing our code on the stage server [9]

Hope that helps? I'm also happy to host a New Relic brown-bag or just have
a screen-sharing 1:1 if anyone would like a personalized tour of it.

-L

[1]
http://developeradm.private.scl3.mozilla.com/chief/developer.prod/logs/43e923e4d687e716673afa651ba38a964a42ced2.1437664747
[2]
https://rpm.newrelic.com/accounts/263620/applications/3172075/deployments/2291403

[3]
http://developeradm.private.scl3.mozilla.com/chief/developer.prod/logs/43e923e4d687e716673afa651ba38a964a42ced2.1437664747
[4]
https://rpm.newrelic.com/accounts/263620/applications/3172075/deployments/2287688

[5]
http://developeradm.private.scl3.mozilla.com/chief/developer.prod/logs/df4079242fbf9b5f03f08697c15de10d2fe4ef31.1437424147
[6]
https://rpm.newrelic.com/accounts/263620/applications/3172075/deployments/2278635
[7]
https://github.com/mozilla/kuma/compare/73cb0388bd0864075b098d7f4601a9697c26aadd...df4079242fbf9b5f03f08697c15de10d2fe4ef31

* At least, this is my current theory

[8]
https://docs.google.com/spreadsheets/d/1EAMlBjfuutHJo2ihFAn5dgZ0VrZ_Q3DUvi02BTe6ibI/edit#gid=1516878892
[9] https://bugzilla.mozilla.org/show_bug.cgi?id=1186081

Renoir Boulanger

Jul 24, 2015, 10:27:02 AM
[[ Apologies for double posting. I haven't properly set up ]]
[[ my Mozilla mailing list subscriptions yet. ]]
[[ PS: First time posting on a Mozilla mailing list. ]]

If I may,

What is the load balancer based on?

Varnish?

If that's the case, Vary on cookies isn't a good idea because Varnish already does this.

Otherwise, apologies for jumping in. The rest might not be useful to the problems.

But if you are using Varnish, read on.

Varnish is full of subtleties. For instance, don't send *ALL* cookies to the origin, only the ones that Kuma (the backend) cares about. You can do that with a regex in the VCL.
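
To illustrate the idea only: in Varnish this filtering would be written in VCL, but the same allowlist logic looks like this in Python. The cookie names are Django's defaults for the session and CSRF cookies; whether those are the ones Kuma actually needs is an assumption.

```python
import re

# Hypothetical allowlist: Django's default session and CSRF cookie names.
KEEP = re.compile(r"^(sessionid|csrftoken)$")


def filter_cookies(cookie_header):
    """Drop every cookie whose name is not on the allowlist."""
    kept = []
    for pair in cookie_header.split(";"):
        name = pair.strip().split("=", 1)[0]
        if KEEP.match(name):
            kept.append(pair.strip())
    return "; ".join(kept)


print(filter_cookies("_ga=GA1.2.1; sessionid=abc123; __utma=42; csrftoken=xyz"))
# sessionid=abc123; csrftoken=xyz
```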

As for Vary, anything you put there can create a very large set of variants, because Varnish is very quick to create them.

For example, Vary: Accept-Encoding

If browsers send requests like these, each one will be stored as a separate variant, even though we know they are all equivalent.

Accept-Encoding: gzip, deflate
Accept-Encoding: gzip,deflate
Accept-Encoding: deflate, gzip
Accept-Encoding: deflate,gzip

The number of variants might also double if a browser sends the header name in lower case ("accept-encoding: ...").
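
A minimal sketch of the kind of normalization that avoids this, written in Python for illustration (in Varnish it would live in the VCL). Collapsing everything to a single preferred encoding is a common approach, but the function name is made up and q-values are deliberately ignored here.

```python
def normalize_accept_encoding(value):
    """Collapse equivalent Accept-Encoding values into one canonical form."""
    encodings = {t.strip().lower() for t in value.split(",") if t.strip()}
    if "gzip" in encodings:
        return "gzip"          # prefer gzip whenever the client accepts it
    if "deflate" in encodings:
        return "deflate"
    return ""                  # nothing we care about: treat as no encoding


headers = [
    "gzip, deflate",
    "gzip,deflate",
    "deflate, gzip",
    "deflate,gzip",
]
print({normalize_accept_encoding(h) for h in headers})   # {'gzip'}
```

The four spellings above end up as one value, so the cache would store one variant instead of four.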

All of this is to say that, if MDN uses Varnish:

- Maybe we should dig into the VCL to see where Vary happens
- Filter/rewrite what's going on between Varnish and the origins
- Remove unneeded headers (e.g. images don't need cookies, ever)

The objective is to rewrite some headers so that they have fewer possible combinations, which reduces the number of possible variants.

I can share more about Varnish, but it's pointless if Mozilla doesn't use it. :) If you want to see some other notes of mine, you can take a look at my [discovery notes][1] and [some VCL I wrote][2].

Hope it helps.

Renoir Boulanger

[1] <https://docs.webplatform.org/wiki/WPD:Infrastructure/architecture/Things_to_consider_when_we_expose_service_via_Fastly_and_Varnish>
[2] <https://github.com/webplatform/varnish-configs/blob/master/docs.vcl>

Luke Crouch

Jul 29, 2015, 8:34:07 PM
to Renoir Boulanger, Stephanie Hobson, dev-mdc, m...@lists.mozilla.org, dev-mdn
Thanks for the input Renoir. Our load balancer is Stingray, formerly Zeus.
I sent your note along to our WebOps team though in case they can check for
a similar issue.

-L

On Fri, Jul 24, 2015 at 8:36 AM, Renoir Boulanger <he...@renoirboulanger.com
> wrote:

> If I may,
>
> What is the load balancer based on?
>
> Varnish?
>
> If that's the case, Vary on cookies isn't a good idea because Varnish
> already does this.
>
> otherwise, apologies for jumping in. The rest might not be useful to the
> problems.
>
> But if you are using Varnish, read on.
>
> Varnish is full of subtleties. For instance, don't send *ALL* cookies to
> origin. Only ones that Kuma (backend) cares about. You can do that by using
> RegEx in a VCL.
>
> As for Vary, anything you put there can create a very big set of variants
> because Varnish is very prompt to create them.
>
> For example, Vary: Accept-Encoding
>
> If browser sends requests like this, they will all be variants. Even
> though we know it shouldn't.
>
> Accept-Encoding: gzip, deflate
> Accept-Encoding: gzip,deflate
> Accept-Encoding: deflate, gzip
> Accept-Encoding: deflate,gzip
>
> The variants based on casing might also be taken into account.
>
> All of this is to say that if MDN uses Varnish;
>
> - maybe we should dig into the VCL to see where vary happens
> - filter/rewrite what's going on between Varnish and origins to improve
> HIT ratio
>
> I can share more about Varnish but its pointless if Mozilla don't use it.
> :). If you want to get some other notes of mine, you can take a look at my
> [discoveries notes][1] and [some VCL i wrote][2].
>
> Hope it helps.

Renoir Boulanger

Aug 12, 2015, 10:15:44 PM
Hi all,

I have a suggestion to help scale MDN.

The problem with web apps is that we make backend servers generate HTML as if it were unique when, in truth, most of it could be cached.

Let’s recall how HTTP caching works. Regardless of which software does it (Squid, Varnish, NGINX, Zeus), caching is done the same way.

In the end, the HTTP caching layer basically keeps the generated HTTP response body in RAM, indexed by the headers it had when it was passed through in answer to the original request. Only responses to GET requests, *without cookies*, are cacheable. Response bodies coming from [PUT, DELETE, POST] requests aren’t.

On a documentation website page, what’s unique to the current user compared to what an anonymous visitor gets? [1]

The content itself; the "chrome" (what’s always there); the links to edit or view details of the current page; the link to account settings; the username; and the link to log out.

I imagine that most of those, except maybe the username, could be the same for any context.

This makes me wonder if we could improve site resiliency by leveraging HTTP caching: strip off *any cookies*, and factor out what’s unique on a page so that we get the same output for any context.

As for the contextual parts of the site 'chrome', how about we expose a context-root that takes care of serving dynamically generated HTML to use as partials?

One way of doing it would be to make that context-root generate simple HTML strings that we can pass to a JavaScript manager, which converts them into DOM and injects them into the 'chrome'.

Since we can isolate cookies to specific context-roots, we can keep the statefulness of the session on MDN and have a clear separation between what’s dynamic and what’s not.
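
As a small sketch of the cookie isolation part, using Python’s standard library: the "/context/" path, the cookie value, and the helper function are all hypothetical, and the path check is a simplified version of what browsers do.

```python
from http.cookies import SimpleCookie

# "/context/" is a hypothetical context-root for the dynamic partials.
cookie = SimpleCookie()
cookie["sessionid"] = "abc123"
cookie["sessionid"]["path"] = "/context/"
cookie["sessionid"]["httponly"] = True

print(cookie.output())   # the Set-Cookie header the backend would send


def browser_sends_cookie(request_path, cookie_path="/context/"):
    """Simplified cookie path match, valid when cookie_path ends with '/'."""
    return request_path.startswith(cookie_path)


print(browser_sends_cookie("/en-US/docs/Web/HTML"))   # False
print(browser_sends_cookie("/context/user-menu"))     # True
```

With the session cookie scoped this way, requests for documentation pages would carry no session cookie and could be served from cache, while only the partials under the context-root would see the session.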

I thought that the MDN dev team would be interested to hear about my idea.

[1]: <https://renoirboulanger.com/wp-content/uploads/2015/08/2015-08-12-What-makes-a-page-unique-1024x921.png>

Renoir Boulanger

https://renoirboulanger.com/ ✪ @renoirb