Proposed new crew: The DB Crew

Lloyd Hilaiel

Sep 19, 2012, 12:55:40 PM

to dev-id...@lists.mozilla.org, Austin King, Richard Soderberg, Gene Wood, Sheeri Cabral

While participating in the deployment of train-2012.08.17 yesterday, it became clear to everyone involved that our MySQL deployment in production is not properly optimized. In my opinion, the following concrete actions can drastically improve our capacity and reduce deployment time:

1. Audit configuration - We are running with 128 GB of RAM, yet the MySQL configuration doesn't leverage this.
2. Audit MySQL version - We can likely upgrade MySQL safely, which will address many bugs and put new features at our disposal.
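
A minimal sketch of what the first audit step could check, assuming the tables are InnoDB and that `innodb_buffer_pool_size` is the setting of interest (the thread does not say which settings are off). The helper names and the 4G value in the usage note are hypothetical, not taken from the production config:

```python
import re

# MySQL option files accept K/M/G suffixes as powers of 1024.
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(value: str) -> int:
    """Convert a my.cnf style size such as '512M' or '96G' to bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*([KMG]?)\s*", value.upper())
    if match is None:
        raise ValueError(f"unrecognized size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


def buffer_pool_fraction(my_cnf_text: str, ram_bytes: int) -> float:
    """Return innodb_buffer_pool_size as a fraction of physical RAM."""
    for line in my_cnf_text.splitlines():
        key, _, value = line.partition("=")
        # Option names may be written with dashes or underscores.
        if key.strip().replace("-", "_") == "innodb_buffer_pool_size":
            return parse_size(value) / ram_bytes
    # Assumption: a missing option means the 128 MB default of MySQL 5.5+.
    return parse_size("128M") / ram_bytes
```

With a config that sets a hypothetical `innodb_buffer_pool_size = 4G`, `buffer_pool_fraction(text, 128 * 1024**3)` returns 0.03125, i.e. about 3% of the machine's memory. What fraction is appropriate is exactly what the audit would need to decide.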

I suggest a new crew, the database crew. I suggest the following process:

Generate a sample data set (in mysqldump form), and 10 or so representative queries against that set. These queries wholly represent the interaction of our web app against mysql and also cover things like altering tables.
This sample data set will have 10 million users.
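
A rough sketch of how such a data set could be generated without touching real user data. The table name `sample_user`, its three columns, and the SHA-256 placeholder are assumptions for illustration only; the real schema and password hashing are not described in this thread:

```python
import hashlib
from typing import Iterator, List, Tuple

BATCH_SIZE = 1000  # rows per multi-row INSERT, like mysqldump's extended inserts


def synthetic_rows(count: int) -> Iterator[Tuple[int, str, str]]:
    """Yield (id, email, password_hash) rows containing no real user data."""
    for user_id in range(1, count + 1):
        email = f"user{user_id}@example.com"
        # Placeholder digest only; not how the real application hashes passwords.
        digest = hashlib.sha256(email.encode()).hexdigest()
        yield user_id, email, digest


def dump_statements(count: int, table: str = "sample_user") -> Iterator[str]:
    """Yield multi-row INSERT statements covering `count` synthetic users."""
    batch: List[str] = []
    for user_id, email, digest in synthetic_rows(count):
        batch.append(f"({user_id},'{email}','{digest}')")
        if len(batch) == BATCH_SIZE:
            yield f"INSERT INTO `{table}` VALUES {','.join(batch)};"
            batch = []
    if batch:  # flush the final partial batch
        yield f"INSERT INTO `{table}` VALUES {','.join(batch)};"
```

Because both functions are generators, writing `dump_statements(10_000_000)` line by line to a file keeps memory use flat regardless of the row count.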

Success criteria:

1. mysql updated to the ideal version in production
2. configuration tuned to perform predictably and optimally against the sample data set
3. Deep understanding of table modification time requirements to inform dev and prod for future changes to schema

This work will also require that we hone our master failover process, as we would need to exercise this process to achieve the criteria above.

I suggest Sheeri audit and advise this crew cause she's awesome and knows mysql.

Thoughts?
lloyd

Ben Adida

Sep 19, 2012, 1:33:41 PM

to Lloyd Hilaiel, Austin King, Richard Soderberg, Sheeri Cabral, Gene Wood, dev-id...@lists.mozilla.org

Love it. The only issue is: someone has to lead it. Are you proposing to do that?

I'm happy to help however I can. I used to do quite a bit of DB stuff back in the day (oh, I'm getting old, aren't I?)

--
Ben Adida

> _______________________________________________
> dev-identity mailing list
> dev-id...@lists.mozilla.org
> https://lists.mozilla.org/listinfo/dev-identity

Jared Hirsch

Sep 19, 2012, 1:46:27 PM

to Lloyd Hilaiel, Austin King, Richard Soderberg, Sheeri Cabral, Gene Wood, dev-id...@lists.mozilla.org

Sounds great! Couple of questions/comments:

- I've heard recent talk of evaluating alternative persistence tools. Does the DB crew's existence imply that we're sticking with MySQL in the longer term?

- This falls a bit into implementation detail, but when you start building a test environment, it would be awesome to see the DB crew leverage and extend ozten's prodlike VM cluster <https://github.com/ozten/browserid-devops>.

Jared


Lloyd Hilaiel

Sep 19, 2012, 1:49:07 PM

to Ben Adida, Austin King, Richard Soderberg, Sheeri Cabral, Gene Wood, dev-id...@lists.mozilla.org

On Sep 19, 2012, at 11:33 AM, Ben Adida <b...@adida.net> wrote:
> Love it. The only issue is: someone has to lead it. Are you proposing to do that?

I would like to participate and would be happy to lead (though that would break the rule against leading two things).

My biggest interest is simply in a phased approach that is low impact and tactical. I don't want this to turn into the "shall we switch data stores" conversation - that will be had but is much less urgent. I think there is a very small amount of work we can do right now that will drastically improve our situation.

If we had performed this work before yesterday, deployment would have taken 4 hours instead of 12 imo.

lloyd

Lloyd Hilaiel

Sep 19, 2012, 1:53:48 PM

to Jared Hirsch, Austin King, Richard Soderberg, Sheeri Cabral, Gene Wood, dev-id...@lists.mozilla.org

On Sep 19, 2012, at 11:46 AM, Jared Hirsch <jhi...@mozilla.com> wrote:

> Sounds great! Couple of questions/comments:
>
> - I've heard recent talk of evaluating alternative persistence tools. Does the DB crew's existence imply that we're sticking with MySQL in the longer term?

No, they are parallel tracks. We can drastically improve capacity right now with a minimal investment of work, so let's do it. Once we un-hobble our MySQL deployment, *then* let's analyze other data stores and compare them on meaningful grounds.

> - This falls a bit into implementation detail, but when you start building a test environment, it would be awesome to see the DB crew leverage and extend ozten's prodlike VM cluster.

Agree, that's an implementation detail. To dive in as well - I think the crew should do this if it saves time.

If it were me, I would probably build a test environment with the 10 million row database on an EC2 cluster, then let anyone in the crew instantly deploy a VM with the database and actively fiddle with the queries we're running and the configuration to see if we can improve things.

This means you're about 2 minutes away from live experimentation and empirical data - and I like those things! :)

lloyd

Ben Adida

Sep 19, 2012, 1:55:19 PM

to Lloyd Hilaiel, Austin King, Richard Soderberg, Sheeri Cabral, Gene Wood, dev-id...@lists.mozilla.org

On Wednesday, September 19, 2012 at 10:49 AM, Lloyd Hilaiel wrote:

> On Sep 19, 2012, at 11:33 AM, Ben Adida <b...@adida.net> wrote:
> > Love it. The only issue is: someone has to lead it. Are you proposing to do that?
>
> I would like to participate and would be happy to lead (though that would break the rule against leading two things).

You mean because of stabilization? Since that is winding down, I think it's okay for you to lead DB.

> My biggest interest is simply in a phased approach that is low impact and tactical. I don't want this to turn into the "shall we switch data stores" conversation - that will be had but is much less urgent. I think there is a very small amount of work we can do right now that will drastically improve our situation.

I like this approach very much. This is not a "switch datastore" discussion, it's a "how good can we get our current datastore" discussion.

-Ben

Sheeri Cabral

Sep 19, 2012, 2:14:28 PM

to Lloyd Hilaiel, Austin King, Richard Soderberg, Gene Wood, dev-id...@lists.mozilla.org

I think it's a great idea! My nitpicks inline...

----- Original Message -----
From: "Lloyd Hilaiel" <ll...@mozilla.com>
To: dev-id...@lists.mozilla.org
Cc: "Richard Soderberg" <at...@mozilla.com>, "Sheeri Cabral" <sca...@mozilla.com>, "Gene Wood" <ge...@mozilla.com>, "Austin King" <oz...@mozilla.com>
Sent: Wednesday, September 19, 2012 12:55:40 PM
Subject: Proposed new crew: The DB Crew

While participating in the deployment of train-2012.08.17 yesterday, it became clear to everyone involved that our MySQL deployment in production is not properly optimized. In my opinion, the following concrete actions can drastically improve our capacity and reduce deployment time:

1. Audit configuration - We are running with 128 GB of RAM, yet the MySQL configuration doesn't leverage this.
2. Audit MySQL version - We can likely upgrade MySQL safely, which will address many bugs and put new features at our disposal.

I suggest a new crew, the database crew. I suggest the following process:

Generate a sample data set (in mysqldump form), and 10 or so representative queries against that set. These queries wholly represent the interaction of our web app against mysql and also cover things like altering tables.
This sample data set will have 10 million users.

-------------

Why sample and not real data? The db as it is now isn't that large; we could do the whole thing. My guess is there are security issues around that?

Also, a static sample is not like real production...for one, it wouldn't have the fragmentation we had yesterday.

On another note, there are tools that can digest the slow query logs, and we can set the slow query logs to log every query (say, for 24 hours and then analyze). We can build up a meta database of types of queries that are actually run (and frequencies too, I believe), so that we can routinely test regular actual load, as opposed to what we think load is.
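
As a rough illustration of the "types of queries and frequencies" idea, here is a minimal sketch that groups captured statements by shape. It assumes the SQL text has already been extracted from the log (parsing the slow query log format itself is not shown), and the function names are made up for this example. Existing log-digest tools do this far more thoroughly:

```python
import re
from collections import Counter
from typing import Iterable

_SINGLE_QUOTED = re.compile(r"'(?:[^'\\]|\\.)*'")
_DOUBLE_QUOTED = re.compile(r'"(?:[^"\\]|\\.)*"')
_NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")
_WHITESPACE = re.compile(r"\s+")


def fingerprint(query: str) -> str:
    """Reduce a SQL statement to its shape by replacing literals with '?'."""
    # Strings go first so that digits inside them are not rewritten separately.
    shape = _SINGLE_QUOTED.sub("?", query)
    shape = _DOUBLE_QUOTED.sub("?", shape)
    shape = _NUMBER.sub("?", shape)
    return _WHITESPACE.sub(" ", shape).strip().lower()


def query_mix(queries: Iterable[str]) -> Counter:
    """Count how often each query shape occurs in the captured traffic."""
    return Counter(fingerprint(q) for q in queries)
```

Calling `query_mix(...).most_common()` on a day of captured statements would give the shapes ordered by frequency, which is the kind of meta database that could drive a replayable test load.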

---------------

Success criteria:

1. mysql updated to the ideal version in production
2. configuration tuned to perform predictably and optimally against the sample data set
3. Deep understanding of table modification time requirements to inform dev and prod for future changes to schema

This work will also require that we hone our master failover process, as we would need to exercise this process to achieve the criteria above.

I suggest Sheeri audit and advise this crew cause she's awesome and knows mysql.

-------------
I could audit/advise, and even lead, but I'd want to know more about the "crew" structure. That's probably a 5-10 minute call...

-Sheeri

Lloyd Hilaiel

Sep 19, 2012, 2:40:00 PM

to Sheeri Cabral, Austin King, Richard Soderberg, Gene Wood, dev-id...@lists.mozilla.org

On Sep 19, 2012, at 12:14 PM, Sheeri Cabral <sca...@mozilla.com> wrote:

> Why sample and not real data? The db as it is now isn't that large; we could do the whole thing. My guess is there are security issues around that?

Yeah, real data has people's email addresses and hashed passwords. We cannot move that data out from under lock and key.

Also, real data is too small. We want to be able to create and prepare for a much larger data set than what we're currently supporting.

> Also, a static sample is not like real production...for one, it wouldn't have the fragmentation we had yesterday.

No, but in a different environment we were able to reproduce the behavior we saw in production (a data set of size n, fragmented about 400%) just by using a data set of size 4n.

So in yesterday's case, we were able to account for fragmentation in production simply by increasing the data set size. This is not perfect, but I cannot think of a better approach. Can you?

> On another note, there are tools that can digest the slow query logs, and we can set the slow query logs to log every query (say, for 24 hours and then analyze). We can build up a meta database of types of queries that are actually run (and frequencies too, I believe), so that we can routinely test regular actual load, as opposed to what we think load is.

I love this approach, however it might be difficult to execute this in production for privacy and user data safety reasons.

The spirit of this approach though, using empirical data to prioritize optimization, can be done using the current mix of api requests in production combined with some math.
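
The "some math" could look roughly like the sketch below: multiply the observed rate of each API request by the number of times that request issues each query. The function name and any endpoint or query names passed to it are hypothetical; the real request mix and per-request query counts would have to come from production metrics and the application code:

```python
from collections import defaultdict
from typing import Dict, Mapping


def estimated_query_rates(
    requests_per_sec: Mapping[str, float],
    queries_per_request: Mapping[str, Mapping[str, float]],
) -> Dict[str, float]:
    """Estimate queries/sec per query type from the API request mix."""
    rates: Dict[str, float] = defaultdict(float)
    for endpoint, rps in requests_per_sec.items():
        # Endpoints with no known queries contribute nothing.
        for query, count in queries_per_request.get(endpoint, {}).items():
            rates[query] += rps * count
    return dict(rates)
```

This needs only aggregate request counts, so no query text or user data has to leave production.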

> -------------
> I could audit/advise, and even lead, but I'd want to know more about the "crew" structure. That's probably a 5-10 minute call…

Ping me on IRC if you'd like to talk. Otherwise, the crew structure is simple. It's a group of people working together for some amount of time to attain a specific set of goals.

Crew goals are simply to:
1. get a stable yet modern version of mysql in production
2. tune the database consciously to maximally leverage available resources
3. Stretch: develop understanding and guidelines for how to change the database structure (approach, cost, etc)

lloyd

Sheeri Cabral

Sep 19, 2012, 3:45:29 PM

to Lloyd Hilaiel, Austin King, Richard Soderberg, Gene Wood, dev-id...@lists.mozilla.org

----- Original Message -----
From: "Lloyd Hilaiel" <ll...@mozilla.com>

To: "Sheeri Cabral" <sca...@mozilla.com>
Cc: "Richard Soderberg" <at...@mozilla.com>, "Gene Wood" <ge...@mozilla.com>, "Austin King" <oz...@mozilla.com>, dev-id...@lists.mozilla.org
Sent: Wednesday, September 19, 2012 2:40:00 PM
Subject: Re: Proposed new crew: The DB Crew

Yeah, real data has people's email addresses and hashed passwords. We cannot move that data out from under lock and key.

----
figured as much :D

-----

Also, real data is too small. We want to be able to create and prepare for a much larger data set than what we're currently supporting.

----

true!

----

> Also, a static sample is not like real production...for one, it wouldn't have the fragmentation we had yesterday.

No, but in a different environment we were able to reproduce the behavior we saw in production (a data set of size n, fragmented about 400%) just by using a data set of size 4n.

So in yesterday's case, we were able to account for fragmentation in production simply by increasing the data set size. This is not perfect, but I cannot think of a better approach. Can you?

-----

Interesting! It would increase the end size, too, but that's OK for getting a worst case.

------

> On another note, there are tools that can digest the slow query logs, and we can set the slow query logs to log every query (say, for 24 hours and then analyze). We can build up a meta database of types of queries that are actually run (and frequencies too, I believe), so that we can routinely test regular actual load, as opposed to what we think load is.

I love this approach, however it might be difficult to execute this in production for privacy and user data safety reasons.

The spirit of this approach though, using empirical data to prioritize optimization, can be done using the current mix of api requests in production combined with some math.

------

Cool! I like math!

--------

Ping me on IRC if you'd like to talk. Otherwise, the crew structure is simple. It's a group of people working together for some amount of time to attain a specific set of goals.

Crew goals are simply to:
1. get a stable yet modern version of mysql in production
2. tune the database consciously to maximally leverage available resources
3. Stretch: develop understanding and guidelines for how to change the database structure (approach, cost, etc)

-----

works for me!

-Sheeri