
Scaling BrowserID


Lloyd Hilaiel

Aug 15, 2011, 9:12:23 AM
to dev-id...@lists.mozilla.org
We want to scale browserid up and have been playing with the initial goal of supporting "a million active users" by the end of the quarter. Inline below is the approach I hope to apply to assess capacity and hit this goal.

Thoughts welcome,
lloyd

--------------------------------------------------------------------

## A million BrowserID Users

By the end of Q3 2011 we want browserid to support 1M active
users. This is a quick and dirty execution plan of how we'll
do that.

At a very high level, this is a proposal of several steps that will
take a few man days of engineering effort and will result in a custom
load generation tool that can be pointed at a browserid server to
assess its capacity. The tool will be able to apply a continuous
load so that the system can be inspected while under strain.

1. Defining an Active User

In order to understand what kind of load an "active user" imparts on
browserid servers, we must know precisely what an average active user
is.

For the sake of this discussion, an active user uses 4 sites that use
BrowserID and visits them 10 times each day. These activities are
split across 2 different devices. Further, the average user has two
different email addresses that they use equally, and forgets their
password about every four weeks.

The final assumption is growth rate: what percentage of active
users in a unit of time are using browserid for the first time. This
is interesting because different types of requests (with different costs)
are made during initial user signup. We start by assuming a 20/80 split
of new to returning users.

The next bit of guesswork required is to characterize the behavior of the
sites (RPs) that a user visits. We assume the average RP sets authentication
cookies with a 6-hour duration, so a user must re-authenticate using
browserid at least every six hours.

2. Defining high-level user activities

Given these parameters we can now concretely derive the number of
high-level user activities we must support per second to support 1M
active users. This will manifest as a rate for each of the
following distinct activities:

* *new user signup* - someone who has never used browserid goes through the in-dialog "sign up" flow.
* *password recovery* - a user of browserid goes through the "I forgot my password" flow.
* *email addition* - a user of browserid adds a new email address to their existing account.
* *re-authentication* - a user of browserid re-authenticates to browserid (they have an expired session, or are using a new device for the first time).
* *authenticated user sign-in* - a user of browserid with authentication material already on their device, and an active session to browserid, logs into a site.
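As a sanity check, the step-1 assumptions can be turned into rough per-second rates for a couple of these activities. The sketch below is my own back-of-envelope reading; in particular, the cookie arithmetic and the per-site visit count are interpretations of the prose above, not numbers from it:

```python
# Back-of-envelope activity rates for 1M active users, using the
# step-1 assumptions. Every constant below is a guess from the text.

SECONDS_PER_DAY = 86_400

active_users = 1_000_000
sites_per_user = 4            # RPs each user visits
visits_per_site_per_day = 10  # visits to each RP per day
cookie_lifetime_h = 6         # RP auth cookie duration
recovery_every_days = 28      # forgets password ~every four weeks

# With 6-hour cookies, at most 24/6 = 4 visits per site per day can
# require a fresh browserid sign-in; the rest reuse the RP session.
signins_per_user_per_day = sites_per_user * min(
    visits_per_site_per_day, 24 / cookie_lifetime_h)

def per_second(events_per_user_per_day):
    return events_per_user_per_day * active_users / SECONDS_PER_DAY

signin_rate = per_second(signins_per_user_per_day)   # ~185/s
recovery_rate = per_second(1 / recovery_every_days)  # ~0.4/s
print(f"sign-ins/s: {signin_rate:.0f}, recoveries/s: {recovery_rate:.2f}")
```

The same `per_second` helper works for the other activities once their per-user daily frequencies are pinned down.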

3. From activities to HTTP requests

Each activity defined above corresponds to some number of network
requests. In step three, we'll break each activity down into its
constituent network transactions in a holistic manner, including all
resources loaded from browserid servers (note: we could account for
browser caching here with another factor).

The result of this is a description of each activity in terms of network
transactions that can be expressed in code.
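For illustration, the result of step 3 might look something like the sketch below, expressed as Python data. The endpoint paths and request counts are placeholders I made up, not browserid's actual API; the real lists would come from tracing the dialog:

```python
# Each activity from step 2 mapped to the HTTP transactions it implies.
# Paths and counts are illustrative placeholders only.
ACTIVITY_REQUESTS = {
    "new user signup": [
        ("GET",  "/sign_in"),            # dialog page
        ("GET",  "/dialog.js"),          # static resources; a browser-cache
        ("GET",  "/dialog.css"),         #   hit-rate factor could thin these
        ("POST", "/wsapi/stage_user"),   # kick off email verification
    ] + [("GET", "/wsapi/registration_status")] * 6,  # polls (see 6b)
    "authenticated user sign-in": [
        ("GET",  "/sign_in"),
        ("POST", "/wsapi/sign_in"),      # mint an assertion for the RP
    ],
}

requests_per_activity = {k: len(v) for k, v in ACTIVITY_REQUESTS.items()}
```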

4. The stressful tool

A tool will be created that simultaneously runs an appropriate
proportion of the activities described above in a continuous manner.
It will continuously start new "activities", with the type of each
chosen to match the probability distribution determined by the rough
estimation above, and will maintain some target number of outstanding,
unserviced requests at all times.

Overall the tool will strive to immediately saturate the server and keep
it saturated for the duration of the run.

The tool will output, and regularly update, a synthesized number: the
"number of active users" that the applied load represents.
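The core loop of such a tool might look like the sketch below. The activity weights here are placeholder guesses; the real distribution falls out of the step-1/step-2 estimation, and the per-activity request sequences come from step 3:

```python
import random

# Placeholder probability distribution over the step-2 activities.
WEIGHTS = {
    "authenticated sign-in": 0.70,
    "re-authentication":     0.20,
    "new user signup":       0.06,
    "email addition":        0.03,
    "password recovery":     0.01,
}

def pick_activity(rng=random):
    """Sample the next activity to start, matching WEIGHTS."""
    r, acc = rng.random(), 0.0
    for name, weight in WEIGHTS.items():
        acc += weight
        if r < acc:
            return name
    return name  # guard against floating-point rounding

# The real tool would keep N activities in flight concurrently,
# starting a new one whenever one finishes, so the server stays
# saturated for the duration of the run.
```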

5. Target steady state load

Depending on how conservative we want to be, we should ensure that our
deployment can handle bursts. I propose we handle this by multiplying
the number of active users by a factor of 10 for capacity-planning
purposes. This buffer will cause us to buy too much hardware, but as
we gain experience we can improve the guesses above and minimize waste.

6. Challenges

6a. email

browserid sends a lot of email, and actually sending it during
load testing adds complexity with minimal benefit. The current theory
is that we can add a small hook to the server, used only in load
testing, that short-circuits the sending of email while still
realistically exercising all potentially interesting database queries.
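A minimal sketch of what that hook could look like, assuming a hypothetical `stage_email` database step and an injected SMTP sender (neither name is from the real codebase):

```python
import secrets

LOAD_TEST_MODE = True  # flipped on only in the load-testing deployment

class FakeDB:
    """Stand-in store; the point is that its queries always run."""
    def __init__(self):
        self.staged = {}

    def stage_email(self, address):
        token = secrets.token_hex(8)
        self.staged[token] = address   # same write path in both modes
        return token

def send_verification_email(db, address, smtp_send=None):
    token = db.stage_email(address)    # database work is never skipped
    if LOAD_TEST_MODE:
        return token                   # short-circuit the actual send
    smtp_send(address, "https://example.invalid/verify?token=" + token)
    return token
```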

6b. user delays

A specific place where there have been performance concerns is in
polling from the dialog while waiting for a user's click-through of the
verification email. To properly include this in load estimations we
should guess at the average number of seconds the user will take to click
the verification email and include the appropriate number of polling
transactions in this activity.
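For example, if we guess the average user takes 90 seconds to click through and the dialog polls every 3 seconds (both numbers invented here for illustration), each signup activity carries 30 extra polling requests:

```python
# Both inputs are guesses to be refined from real dialog behavior.
avg_click_delay_s = 90   # time until the user clicks the email link
poll_interval_s = 3      # how often the dialog polls for completion

polls_per_signup = avg_click_delay_s // poll_interval_s
print(polls_per_signup)  # extra requests folded into each signup
```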

6c. maintenance

This tool will be built based on the behavior of the dialog, which is
prone to change (especially in the near future). Keeping the load
generation tool up to date is an ongoing maintenance cost. Other than
structuring the implementation of the tool reasonably, the only
mitigation I can see is considerably dumbing down the load generation
approach (like curl and a shell script or two).

6d. database scalability

In order to test the effectiveness of horizontal database scalability
we would ideally run the tool against a load balancer in front of
varied numbers of nodes. More generally, we can use this approach to
determine the appropriate number and type of nodes for optimal
performance/cost.

6e. limits and testing tool efficiency

Given the types of database transactions occurring, it's bloody
unlikely that we'll hit a network throughput wall in the testing
environment (we've got gigabit), but it's worth noting here.

Similarly, it's possible to hit a wall if the testing tool exhausts local
compute or memory resources. So we should make sure that doesn't happen.

7. Benefits and Potential future ideas

Ultimately this approach gets us a pretty good guess, probably
accurate to an order of magnitude, about the number of active users we
can support in practice. Having it be trivial to point at any
browserid deployment can also yield the following benefits:

* meaningful before/after testing of the impact of expected performance improvements
* meaningful comparison of performance of browserid servers with different hardware
* periodic execution to chart code performance over time (on reference hardware)

8. Next steps

0. air out the approach
1. if tomatoes aren't thrown as a result of #0, write the tool
2. figure out a good place to run the tool (testing can happen offline, but to get a baseline number, is MPT ok? should we bother setting up a reference environment at this early date?)
3. publish results and analyze
4. acquire hardware
5. shout "we're ready for a million, baby!"

Mike Hanson

Aug 15, 2011, 12:22:49 PM
to Lloyd Hilaiel, dev-id...@lists.mozilla.org
On Aug 15, 2011, at 6:12 AM, Lloyd Hilaiel wrote:

> 6. Challenges
> 6a. email
>
> browserid sends a lot of email, and actually sending emails during
> load testing adds complexity with minimal benefit. The current theory
> is that we can add a little hook to the server that will only be used
> in load testing that short-circuits the sending of email but realistically
> exercises all potentially interesting database queries.


Correctly sending as much email as we imagine is a scalability challenge in itself. I would pull this piece out and figure out whether we should buy-or-build on this point. There are plenty of services that will help us get this part right (I'm thinking about getting rDNS, DKIM, etc. right - all the stuff you need to do if you want to be a good SMTP citizen).

m

Ben Adida

Aug 15, 2011, 12:26:28 PM
to dev-id...@lists.mozilla.org
On 8/15/11 9:22 AM, Mike Hanson wrote:
>
> Correctly sending as much email as we imagine is a scalability
> challenge in itself.

Agreed. We're already seeing some pain points with some email servers on
the receiving end delaying delivery, I'm guessing because of
insufficient signals that this isn't spam.

-Ben

Paul Osman

Aug 15, 2011, 12:48:33 PM
to Ben Adida, dev-id...@lists.mozilla.org

FYI, a number of webdev projects use socketlabs [1] for this, so you
could ask folks from groups like AMO -- who I imagine send out quite a
lot of email -- about their experiences.

[1] http://socketlabs.com/

-P

Mike Hanson

Aug 15, 2011, 12:50:53 PM
to Paul Osman, Ben Adida, dev-id...@lists.mozilla.org

Perfect - we should use the relationships we already have if we're happy with the vendors.

m
