I spent Friday re-acquainting myself with the performance properties of Persona, in an attempt to figure out an action plan for the group of folks focusing on scaling the service in the short term and long term. I have some conclusions and proposed steps forward.
High-level hypotheses:
1. On current hardware (pending application optimizations) we should be able to support 40-60M "active daily users".
2. Upon the launch of "BigTent" for 60% of users, we should theoretically be able to roughly double that capacity without any additional hardware.
3. If we offload just the computationally expensive operations to new hardware, we should be able to approximately double capacity again.
Methodology and Current state:
To get an understanding of how our application looks under load, jrgm and I ran a load test up to 3M simulated ADU, pushing our current staging environment until it broke. At maximal load, here is roughly how we looked:
Note: see http://lloyd.io/persona-architectural-changes for some fairly current diagrams of how hardware is laid out in production and how processes are mapped to hardware.
"Webheads" - 4.41 load (4.41 processors out of 24 utilized)
browserid process: 100% CPU utilization
router process: 75% CPU utilization
verifier: 37% CPU utilization
remaining load is distributed between processes performing computation, mostly bcrypt compute processes
"Secure Webheads" - 1.38 load (24 cores available)
dbwriter: 70% CPU utilization
remaining load is again, distributed among processes performing computation
"Keysigner": 0.17 load (16 cores available)
Key learning 0: we're compute bound.
Key learning 1: request processing is not free. NodeJS processes are performing a significant amount of work - and in order to leverage all processors we'll need to distribute request processing across multiple instances. (Side note: 30% of the computation used by browserid, router, and dbwriter is logging.)
Key learning 2: a disproportionate amount of work is being performed on "webhead" machines. If we really want to leverage all available compute resources while maintaining the same deployment configuration, we're going to need to offload work from webheads to keysigners and dbwriters, and do so in a way that preserves the security benefits of the current configuration.
Key learning 3: how we break: when one of the request-processing processes (dbwriter, keysigner, browserid, or router) hits 100%, we see lag in handling work on the event queue. When this lag approaches 3 seconds or so, we stop responding to health checks fast enough, and load balancers take processes out of rotation, shifting the brunt of the load onto the remaining nodes, which brings everything to a grinding halt. tl;dr: 100% usage in these processes is bad.
Action plan:
Phase 1 - "Other People's Efforts"
tl;dr: upgrade to node 0.8.x
One can meaningfully simulate the "breaking point" by throwing the whole thing onto an 8-core EC2 instance and turning down the bcrypt work factor. Reducing the cost of bcrypt compute work on a box with fewer cores reproduces the bottleneck we see in our present deployment.
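To make the knob concrete, here's a rough sketch of how the work factor maps to cost (this assumes the npm "bcrypt" module and is not the exact code we run; each +1 on the work factor roughly doubles the hash time):

    var bcrypt = require('bcrypt');

    // time a single synchronous hash at a given work factor
    function timeHash(workFactor) {
      var salt = bcrypt.genSaltSync(workFactor);
      var start = Date.now();
      bcrypt.hashSync('correct horse battery staple', salt);
      return Date.now() - start;
    }

    // on an 8-core EC2 box, dialing the work factor down shifts the
    // bottleneck onto request processing, which is what we want to observe
    [8, 10, 12].forEach(function (wf) {
      console.log('work factor ' + wf + ': ' + timeHash(wf) + 'ms per hash');
    });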
Running the simulation this way works well and displays load characteristics similar to what we see in actual deployment. The first thing that bites us, however, is refused connections from router to browserid, from browserid to keysigner, or from router to dbwriter. These manifest as ECONNREFUSED responses. After some digging, my current theory is that the TCP accept backlog of 128 connections (hardcoded in node 0.6.x) is being exhausted on the receiving end of these requests. This is just a theory, but it fits the data well.
Now, simulated load is significantly different than actual load, especially because of latency characteristics, so I cannot really say how prominent this issue would be under real world load… but...
Because this is configurable in node 0.8.x, because 0.8 has settled (now at .12), and because we've already been running our tests on 0.8.x for months now, I see no reason not to optimistically upgrade. I suggest doing so as early as the current train (to be cut Friday) if there are no objections.
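For the record, here's a sketch of what the 0.8.x upgrade buys us on this specific point (the backlog value below is a placeholder, not a tuned number):

    var http = require('http');

    var server = http.createServer(function (req, res) {
      res.writeHead(200);
      res.end('ok');
    });

    // node 0.8.x accepts an explicit TCP accept backlog argument to listen();
    // 0.6.x hardcodes it to 128, which is what we appear to be hitting.
    server.listen(3000, '127.0.0.1', 4096, function () {
      console.log('listening with a deeper accept backlog');
    });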
Phase 2 - "Transparency"
The current method of getting a snapshot of the staging environment under load is to open up three terminals on the three most pertinent types of nodes in our deployment and watch top. We should instead use statsd, distilling things down to a handful of metrics that are important and give us a meaningful view of what is going on (perhaps API requests per second, perhaps request-processor CPU usage, perhaps others).
In this phase we look at the system holistically, and move from ad hoc CPU usage percentages to metrics that are more meaningful and more independent of the software architecture (so that as we refine the architecture, the metrics give us an idea of relative performance).
Nice side benefit: we can actually compare these numbers to production, which lets us both vet our synthetic load-generation tools in terms of request mix and more meaningfully understand what percentage of capacity we're using in production.
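To make the statsd idea concrete, here's a minimal sketch of emitting a couple of metrics (the metric names and statsd host are made up; in practice we'd likely use a statsd client library rather than raw UDP):

    var dgram = require('dgram');
    var sock = dgram.createSocket('udp4');
    var STATSD_HOST = 'statsd.example.internal';  // placeholder
    var STATSD_PORT = 8125;

    // statsd counters are plain UDP datagrams of the form "name:1|c"
    function countRequest(name) {
      var buf = new Buffer(name + ':1|c');
      sock.send(buf, 0, buf.length, STATSD_PORT, STATSD_HOST);
    }

    // timers are "name:<ms>|ms"
    function recordTiming(name, ms) {
      var buf = new Buffer(name + ':' + ms + '|ms');
      sock.send(buf, 0, buf.length, STATSD_PORT, STATSD_HOST);
    }

    // e.g., around request handling in browserid/router:
    //   countRequest('browserid.api_request');
    //   recordTiming('browserid.request_ms', elapsed);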
Phase 3 - "Webheads are lazy"
As you saw above, the application breaks at a piddling 4.4 load on machines with 24 cores. The first bit of actual re-architecting is to make it so the application breaks at a load of 24 on these machines.
This will involve using more CPU for request processing. Multiple approaches exist, but it basically amounts to running more request-processing processes, either via node's cluster module or by just starting more of them on different ports and balancing across them with our load balancers.
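A minimal sketch of the cluster flavor of this (process names and port are illustrative, not our actual layout):

    var cluster = require('cluster');
    var http = require('http');
    var os = require('os');

    if (cluster.isMaster) {
      // one request-processing worker per core, rather than one per box
      os.cpus().forEach(function () { cluster.fork(); });

      // respawn workers that die so we stay at full capacity
      cluster.on('exit', function (worker) {
        console.log('worker ' + worker.process.pid + ' died, respawning');
        cluster.fork();
      });
    } else {
      // each worker handles requests independently; they share the listen socket
      http.createServer(function (req, res) {
        res.writeHead(200);
        res.end('ok');
      }).listen(3000);
    }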
Phase 4 - "First grace period"
Now that we have made a 5x improvement in capacity, we should revisit application grace. The application should not fail by having processes ripped out of rotation; instead the software should understand when load is at unacceptable levels and strategically return "server too busy" errors. This probably happens in the router process, by giving it an understanding of real-time server utilization.
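One possible shape for this, just to make it concrete (thresholds and names are illustrative, not decisions): track event-loop lag in the router and start shedding load with 503s well before the ~3 second point where health checks time out.

    var http = require('http');

    var CHECK_MS = 500;      // how often we sample event-loop lag
    var MAX_LAG_MS = 1500;   // shed load well under the ~3s health-check cutoff
    var lagMs = 0;
    var last = Date.now();

    // if the event loop is backed up, this timer fires late; the lateness is our lag
    setInterval(function () {
      var now = Date.now();
      lagMs = (now - last) - CHECK_MS;
      last = now;
    }, CHECK_MS);

    http.createServer(function (req, res) {
      if (lagMs > MAX_LAG_MS) {
        res.writeHead(503, { 'Retry-After': '30' });
        return res.end('server too busy');
      }
      // ... normal request handling / proxying ...
      res.writeHead(200);
      res.end('ok');
    }).listen(3000);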
Lots of ways to go here.
Phase 5 - "Use all the processors"
At this point we've got webheads fully utilized, but we're still leaving some cores on the table in dbwriter and (especially) keysigner. What if we were to run bcrypt operations on these two nodes as well? Lots of ways to get there; we want something simple and robust that preserves the grace characteristics that bcrypt distribution has today. More thoughts on this specific issue here:
https://id.etherpad.mozilla.org/scaling-use-all-the-cpus
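One very rough sketch of what "offloading bcrypt" could look like (the endpoint, port, and use of the npm "bcrypt" module are all assumptions on my part; the real design discussion belongs in the etherpad above): a tiny compare service running on keysigner/dbwriter boxes that webheads can farm work out to.

    var http = require('http');
    var bcrypt = require('bcrypt');   // assumes the npm "bcrypt" module

    http.createServer(function (req, res) {
      var body = '';
      req.on('data', function (chunk) { body += chunk; });
      req.on('end', function () {
        // expects a JSON body like { "pass": "...", "hash": "..." }
        var args = JSON.parse(body);
        bcrypt.compare(args.pass, args.hash, function (err, ok) {
          res.writeHead(err ? 500 : 200, { 'Content-Type': 'application/json' });
          res.end(JSON.stringify({ ok: !!ok }));
        });
      });
    }).listen(8080);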
Phase 6 - "Second grace period"
Again, this is a significant architectural update, so we must verify that we still fail beautifully when we fail.
wooo! monster email! If this analysis seems reasonable and the phases seem ok, then the next step would be to map the work onto trains and get 'er done.
lloyd