I spent Friday re-acquainting myself with the performance properties of Persona, in an attempt to figure out an action plan for the group of folks focusing on scaling the service in the short term and long term. I have some conclusions and proposed steps forward.
High-level hypotheses:
1. On current hardware (pending application optimizations) we should be able to support 40-60M "active daily users".
2. Upon the launch of "BigTent" for 60% of users, we should theoretically be able to roughly double that capacity without any additional hardware.
3. If we offload just the computationally expensive operations to new hardware, we should be able to approximately double capacity again.
Methodology and Current state:
To get an understanding of how our application looks under load, jrgm and I ran a load test up to 3M simulated ADU, pushing our current staging environment until it broke. At maximal load, here is roughly how we looked:
Note: see http://lloyd.io/persona-architectural-changes for some fairly current diagrams of how hardware is laid out in production and how processes are mapped to hardware.
"Webheads" - 4.41 load (4.41 processors out of 24 utilized)
browserid process: 100% CPU utilization
router process: 75% CPU utilization
verifier: 37% CPU utilization
remaining load is distributed between processes performing computation, mostly bcrypt compute processes
"Secure Webheads" - 1.38 load (24 cores available)
dbwriter: 70% CPU utilization
remaining load is again, distributed among processes performing computation
"Keysigner": 0.17 load (16 cores available)
Key learning 0: we're compute bound.
Key learning 1: request processing is not free. NodeJS processes are performing a significant amount of work - and in order to leverage all processors we'll need to distribute request processing across multiple instances. (Side note: 30% of the computation used by browserid, router, and dbwriter is logging.)
Key learning 2: a disproportionate amount of work is being performed on "webhead" machines. If we really want to leverage all available compute resources while maintaining the same deployment configuration, we're going to need to offload work from webheads to keysigners and dbwriters, and do so in a way that preserves the security benefits of the current configuration.
Key learning 3: how we break: when one of the request-processing processes (dbwriter, keysigner, browserid, or router) hits 100%, we see lag in handling work on the event queue. When this lag approaches 3 seconds or so, we stop responding to health checks fast enough, and load balancers take processes out of rotation, shifting the brunt of the load onto the remaining nodes, which brings everything to a grinding halt. tl;dr: 100% usage in these processes is bad.
Action plan:
Phase 1 - "Other People's Efforts"
tl;dr: upgrade to node 0.8.x
One can meaningfully simulate the "breaking point" by throwing the whole thing onto an 8-core EC2 instance and turning down the bcrypt work factor. Reducing the cost of bcrypt compute work on a box with fewer cores reproduces the bottleneck we see in our present deployment.
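To make the knob concrete, here's a rough sketch of how the work factor maps to cost (this assumes the npm "bcrypt" module and is not the exact code we run; each +1 on the work factor roughly doubles the hash time):

    var bcrypt = require('bcrypt');

    // time a single synchronous hash at a given work factor
    function timeHash(workFactor) {
      var salt = bcrypt.genSaltSync(workFactor);
      var start = Date.now();
      bcrypt.hashSync('correct horse battery staple', salt);
      return Date.now() - start;
    }

    // on an 8-core EC2 box, dialing the work factor down shifts the
    // bottleneck onto request processing, which is what we want to observe
    [8, 10, 12].forEach(function (wf) {
      console.log('work factor ' + wf + ': ' + timeHash(wf) + 'ms per hash');
    });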
Running the simulation this way works well and displays load characteristics similar to what we see in actual deployment. The first thing that bites us, however, is refused connections from router to browserid, from browserid to keysigner, or from router to dbwriter. These manifest as ECONNREFUSED responses. After some digging, my current theory is that the TCP accept backlog of 128 connections (hardcoded in node 0.6.x) is being exhausted on the receiving end of these requests. This is just a theory, but it fits the data well.
Now, simulated load is significantly different than actual load, especially because of latency characteristics, so I cannot really say how prominent this issue would be under real world load… but...
Because this is configurable in node 0.8.x, because 0.8 has settled (now at .12), and because we've already been running our tests on 0.8.x for months now, I see no reason not to optimistically upgrade. I suggest doing so as early as the current train (to be cut Friday) if there are no objections.
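For the record, here's a sketch of what the 0.8.x upgrade buys us on this specific point (the backlog value below is a placeholder, not a tuned number):

    var http = require('http');

    var server = http.createServer(function (req, res) {
      res.writeHead(200);
      res.end('ok');
    });

    // node 0.8.x accepts an explicit TCP accept backlog argument to listen();
    // 0.6.x hardcodes it to 128, which is what we appear to be hitting.
    server.listen(3000, '127.0.0.1', 4096, function () {
      console.log('listening with a deeper accept backlog');
    });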
Phase 2 - "Transparency"
The current method of getting a snapshot of the staging environment under load is to open up three terminals on the three most pertinent types of nodes in our deployment and watch top. We should instead use statsd, distilling things down to a handful of metrics that are important and give us a meaningful view of what is going on (perhaps API requests per second, perhaps request-processor CPU usage, perhaps others).
In this phase we look at the system holistically, and move from ad hoc CPU usage percentages to metrics that are more meaningful and more independent of the software architecture (so that as we refine the architecture, the metrics give us an idea of relative performance).
Nice side benefit: we can actually compare these numbers to production, which lets us both vet our synthetic load-generation tools in terms of request mix and more meaningfully understand what percentage of capacity we're using in production.
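To make the statsd idea concrete, here's a minimal sketch of emitting a couple of metrics (the metric names and statsd host are made up; in practice we'd likely use a statsd client library rather than raw UDP):

    var dgram = require('dgram');
    var sock = dgram.createSocket('udp4');
    var STATSD_HOST = 'statsd.example.internal';  // placeholder
    var STATSD_PORT = 8125;

    // statsd counters are plain UDP datagrams of the form "name:1|c"
    function countRequest(name) {
      var buf = new Buffer(name + ':1|c');
      sock.send(buf, 0, buf.length, STATSD_PORT, STATSD_HOST);
    }

    // timers are "name:<ms>|ms"
    function recordTiming(name, ms) {
      var buf = new Buffer(name + ':' + ms + '|ms');
      sock.send(buf, 0, buf.length, STATSD_PORT, STATSD_HOST);
    }

    // e.g., around request handling in browserid/router:
    //   countRequest('browserid.api_request');
    //   recordTiming('browserid.request_ms', elapsed);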
Phase 3 - "Webheads are lazy"
As you saw above, the application breaks at a piddling 4.4 load on machines with 24 cores. The first bit of actual re-architecting is to make it so the application breaks at a load of 24 on these machines.
This will involve using more CPU for request processing. Multiple approaches exist, but it basically amounts to running more request-processing processes, either via node's cluster module or by just starting more of them on different ports and balancing across them with our load balancers.
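A minimal sketch of the cluster flavor of this (process names and port are illustrative, not our actual layout):

    var cluster = require('cluster');
    var http = require('http');
    var os = require('os');

    if (cluster.isMaster) {
      // one request-processing worker per core, rather than one per box
      os.cpus().forEach(function () { cluster.fork(); });

      // respawn workers that die so we stay at full capacity
      cluster.on('exit', function (worker) {
        console.log('worker ' + worker.process.pid + ' died, respawning');
        cluster.fork();
      });
    } else {
      // each worker handles requests independently; they share the listen socket
      http.createServer(function (req, res) {
        res.writeHead(200);
        res.end('ok');
      }).listen(3000);
    }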
Phase 4 - "First grace period"
Now that we have made a 5x improvement in capacity, we should revisit application grace. The application should not fail by having processes ripped out of rotation; instead the software should understand when load is at unacceptable levels and strategically return "server too busy" errors. This probably happens in the router process, by giving it an understanding of real-time server utilization.
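One possible shape for this, just to make it concrete (thresholds and names are illustrative, not decisions): track event-loop lag in the router and start shedding load with 503s well before the ~3 second point where health checks time out.

    var http = require('http');

    var CHECK_MS = 500;      // how often we sample event-loop lag
    var MAX_LAG_MS = 1500;   // shed load well under the ~3s health-check cutoff
    var lagMs = 0;
    var last = Date.now();

    // if the event loop is backed up, this timer fires late; the lateness is our lag
    setInterval(function () {
      var now = Date.now();
      lagMs = (now - last) - CHECK_MS;
      last = now;
    }, CHECK_MS);

    http.createServer(function (req, res) {
      if (lagMs > MAX_LAG_MS) {
        res.writeHead(503, { 'Retry-After': '30' });
        return res.end('server too busy');
      }
      // ... normal request handling / proxying ...
      res.writeHead(200);
      res.end('ok');
    }).listen(3000);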
Lots of ways to go here.
Phase 5 - "Use all the processors"
At this point we've got webheads fully utilized, but we're still leaving some cores on the table in dbwriter and (especially) keysigner. What if we were to run bcrypt operations on these two nodes as well? Lots of ways to get there; we want something simple and robust that preserves the grace characteristics that bcrypt distribution has today. More thoughts on this specific issue here:
https://id.etherpad.mozilla.org/scaling-use-all-the-cpus
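One very rough sketch of what "offloading bcrypt" could look like (the endpoint, port, and use of the npm "bcrypt" module are all assumptions on my part; the real design discussion belongs in the etherpad above): a tiny compare service running on keysigner/dbwriter boxes that webheads can farm work out to.

    var http = require('http');
    var bcrypt = require('bcrypt');   // assumes the npm "bcrypt" module

    http.createServer(function (req, res) {
      var body = '';
      req.on('data', function (chunk) { body += chunk; });
      req.on('end', function () {
        // expects a JSON body like { "pass": "...", "hash": "..." }
        var args = JSON.parse(body);
        bcrypt.compare(args.pass, args.hash, function (err, ok) {
          res.writeHead(err ? 500 : 200, { 'Content-Type': 'application/json' });
          res.end(JSON.stringify({ ok: !!ok }));
        });
      });
    }).listen(8080);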
Phase 6 - "Second grace period"
Again, this is a significant architectural update, so we must verify that we still fail beautifully when we fail.
wooo! monster email! If this analysis seems reasonable and the phases seem ok, then the next step would be to map the work onto trains and get 'er done.
lloyd