I've posted notes on our current thinking about scaling up our services
at https://wiki.mozilla.org/Labs/Weave/Service/Scaling
Please feel free to poke around and add stuff as appropriate.
I will be working next on creating a set of action items based on our
discussions from last week. That said, I think our immediate
priorities, and the people currently working on them, are as follows:
Engineering:
* Better backoff (mconnor)
* Eliminate known problems with race conditions (dan)
* Brainstorm other ideas for graceful degradation of service (all)
Operational:
* Update load profile/numbers to more accurately reflect our current
load (toby/zandr)
* Figure out how much extra capacity we might have and can get in
place for 0.7 (zandr)
* Operations runbook (toby/zandr)
* Figure out backup for toby (ragavan)
If you think of other things that need to happen right away, post them here.
Thanks,
Ragavan
> I will be working next on creating a set of action items based on our
> discussions from last week. That said, I think our immediate
> priorities, and the people currently working on them, are as follows:
I have started this page to track stuff we need to do:
https://wiki.mozilla.org/Labs/Weave/Service/Scaling/Projects/Aragog
Add stuff here as appropriate. Next step will be to add owners and dates.
Ragavan
I updated the wiki with IT/Operations-specific notes.
To clarify,
In IT land we treat Weave as a production service. Weave's operations
are managed by Mozilla IT and there shouldn't be any tasks that rely
solely on an individual IT member. I've updated the wiki to reflect
that and Mozilla IT's downtime/content push schedules.
On Tue, Sep 8, 2009 at 11:48 AM, matthew zeier<m...@mozilla.com> wrote:
>
> I updated the wiki with IT/Operations-specific notes.
>
> To clarify,
>
> In IT land we treat Weave as a production service. Weave's operations
> are managed by Mozilla IT and there shouldn't be any tasks that rely
> solely on an individual IT member. I've updated the wiki to reflect
> that and Mozilla IT's downtime/content push schedules.
Thanks for the updates.
A couple of clarifications:
While it is true that Weave is treated as a production service, we
don't have an operations runbook for it (at least based on chats I've
had with zandr). There've been at least a few occasions in the recent
past where we had to rely on a person or guesswork to figure out stuff
(LDAP configs and an /etc/hosts file hack come to mind).
Totally agreed on trying to figure out webdev as a backup for Toby.
I'll talk to morgamic to see how to get things going there.
And thanks also for pointing out our regular downtime windows - we
should totally follow those for Weave as well.
Regards,
Ragavan