Re: Update - [CrisisCommons] CrisisCommons wiki dead

dsha...@comcast.net

Mar 14, 2011, 5:05:40 PM

to crisis...@googlegroups.com, crisiscom...@googlegroups.com

Thanks everyone!!

Note that we are in the process of planning a move of our 'core' CrisisCommons infrastructure services to the Open Source Labs (OSL) soon. With OSL as our hosting provider, we are defining a SOW (statement of work) covering support, recovery, and availability that aligns with their capabilities as a 'provider' in their 'data center'. We are 'testing' some processes now to understand the gaps in both support and technology capability, and intentionally pushing the envelope to learn whether we will need compensating processes developed around those gaps (which is why the move has been as much about 'observing and documenting it' as about 'doing it'). For instance, they host their DBs on separate hardware from the app servers, so our risk from a single 'server failure' is reduced somewhat.

Agreed: the only way to achieve true high availability is an active-active redundant solution, which traditionally comes with a very steep cost, so it is important to manage expectations here. We will look to get creative with active-passive or other scenarios, trading off recovery time objectives (RTO: how long it takes to come back up) against recovery point objectives (RPO: how often you take recovery points, i.e. backups, and therefore how much data you can afford to lose, say 30 minutes vs. 1 day). Creative is the operative word. There are many approaches and strategies here (and NFRs, non-functional requirements, may be higher during a disaster than in a steady state) that we can evaluate once we understand OSL's gaps and their capabilities as system administrators (and we may reach back out to y'all).
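
To make the RPO side of that tradeoff concrete: assuming backups run on a fixed schedule and each one completes instantly and successfully (a simplification), the data lost in an outage is whatever was written since the last backup. The helper below is a minimal sketch. `data_lost_at` is a hypothetical name, not part of any CrisisCommons tooling. It says nothing about RTO, which depends on how fast service can be restored.

```python
from datetime import datetime, timedelta


def data_lost_at(failure: datetime, first_backup: datetime,
                 interval: timedelta) -> timedelta:
    """Return how much recent data is lost if the system fails at `failure`.

    Assumes backups start at `first_backup`, repeat every `interval`, and
    each one completes instantly and successfully (a simplification).
    """
    if failure < first_backup:
        raise ValueError("failure happened before the first backup")
    # timedelta // timedelta gives the number of whole intervals elapsed.
    intervals_elapsed = (failure - first_backup) // interval
    last_backup = first_backup + intervals_elapsed * interval
    # Everything written after the last backup is gone.
    return failure - last_backup


if __name__ == "__main__":
    start = datetime(2011, 3, 14, 0, 0)
    failure = datetime(2011, 3, 14, 13, 47)
    for interval in (timedelta(minutes=30), timedelta(days=1)):
        print(interval, "->", data_lost_at(failure, start, interval))
```

Suppose backups start at midnight and a failure hits at 13:47. A 30-minute schedule loses 17 minutes of edits, while a daily schedule loses 13 hours 47 minutes. The loss is always less than one interval, so the interval itself is the worst case.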

The main difference in philosophy I see here, compared with your average website, is that as a disaster-response organization of sorts ourselves, we should at least attempt to follow traditional IT disaster-recovery practices as a measure of our own disaster preparedness. That means some active-passive or other recovery scenario with geographic redundancy (we talk a lot about disaster preparedness in general, so it's worth thinking through). So we are spending a lot of time on recovery and on evaluating both technology AND human single points of failure (i.e., eliminating the 'one guy has the keys' environment), in a way that works with a heterogeneous mix of technologies.

Like this sort of stuff? Join our infrastructure working group (smile):

http://groups.google.com/group/crisiscommons_cciwg/

Thanks!!

Deborah Shaddon

CrisisCommons Infrastructure Working Group Lead

----- Original Message -----
From: "Tim Schwartz" <tim.c.s...@gmail.com>
To: crisis...@googlegroups.com
Sent: Monday, March 14, 2011 3:18:50 PM
Subject: Re: Update - [CrisisCommons] CrisisCommons wiki dead

+1

This stuff happens all the time; the best you can do is have a backup
plan. If I were managing the tech hosting for this, I would have a
secondary VPS (virtual private server) hosted somewhere completely
different from DreamHost and have the DB + files rsynced (transferred
and synced) every 30 minutes or so. That means you always have a backup
that lags by at most 30 minutes and is ready to rock and roll. This
would cost about $50 a month or so.
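
Below is a minimal sketch of that 30-minute sync. It assumes a MySQL-backed wiki, SSH key access from the primary to the standby VPS, and database credentials in `~/.my.cnf`. The database name, paths, and the `standby.example.org` host are placeholders, not the real CrisisCommons setup. The script dumps the database with `mysqldump`, then pushes the dump and the site files with `rsync`.

```python
import shlex
import subprocess
from pathlib import Path

# Hypothetical values: the thread does not name the real hosts, paths, or DB.
DB_NAME = "wikidb"
DUMP_FILE = Path("/var/backups/wikidb.sql")
FILES_DIR = "/var/www/wiki/"  # trailing slash: sync the directory's contents
BACKUP_TARGET = "backup@standby.example.org:/srv/wiki-mirror/"


def build_commands(db_name, dump_file, files_dir, target):
    """Return the command lines for one sync run, without executing them."""
    # --single-transaction takes a consistent snapshot of InnoDB tables
    # without locking the live wiki during the dump.
    dump = ["mysqldump", "--single-transaction",
            f"--result-file={dump_file}", db_name]
    # -a preserves permissions/timestamps, -z compresses in transit,
    # --delete removes standby files that were deleted on the primary.
    sync_files = ["rsync", "-az", "--delete", files_dir, target + "files/"]
    sync_dump = ["rsync", "-az", str(dump_file), target + "db/"]
    return [dump, sync_files, sync_dump]


def run_once():
    """Run one dump-and-sync cycle, stopping at the first failing command."""
    for cmd in build_commands(DB_NAME, DUMP_FILE, FILES_DIR, BACKUP_TARGET):
        print("running:", shlex.join(cmd))
        subprocess.run(cmd, check=True)
```

For the 30-minute cadence, a crontab entry with the schedule `*/30 * * * *` could invoke a small wrapper that calls `run_once()`. Dumping before syncing matters: copying live MySQL data files mid-write can leave the standby with an inconsistent database, whereas a dump gives it a consistent snapshot to restore from.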

As well, make sure that domain hosting is not at the primary hosting
site (dreamhost). That way the main domain can always be pointed to
the backup should anything happen to the primary hosting company.
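
Keeping the domain's DNS away from the primary host is what makes repointing possible, but the switch is not instantaneous. The outage first has to be detected and the record updated, and resolvers may keep serving the old address from cache for up to the record's TTL. Below is a back-of-the-envelope sketch; the function name and all numbers are hypothetical.

```python
def worst_case_stale_dns_seconds(check_interval_s: float,
                                 failures_before_switch: int,
                                 dns_ttl_s: float,
                                 switch_delay_s: float = 0.0) -> float:
    """Upper bound on how long clients may still be sent to a dead primary.

    detection: up to `failures_before_switch` check intervals pass between
    the last successful check and the failed check that triggers the switch;
    switch_delay_s: time for a person (or script) to update the DNS record;
    dns_ttl_s: resolvers may keep serving the old address from cache this long.
    """
    detection_s = check_interval_s * failures_before_switch
    return detection_s + switch_delay_s + dns_ttl_s


if __name__ == "__main__":
    for ttl in (300, 3600):
        print(ttl, worst_case_stale_dns_seconds(60, 3, ttl))
```

Take checks every 60 seconds and three failures required before switching. With a 300-second TTL, users could be sent to the dead primary for up to 480 seconds. With a one-hour TTL, that grows to 3,780 seconds, which is one reason to keep the TTL low on records you may need to fail over.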

This can all be accomplished in a day's worth of work.

2cents,
-tim

On Mon, Mar 14, 2011 at 1:10 PM, Andrew Turner
<ajtu...@highearthorbit.com> wrote:
> Yep - sometimes servers & networks just have problems. Silly Internet Tubes.
>
> I agree that "hotswap" to alternate hosting would be ideal, but as
> mentioned that is a good deal of effort to have database and file
> replication setup and maintained and usually only exists for
> "Enterprise" (for some level of E in enterprise) or with very large
> IT/scaling budgets.
>
> OSL is going to be a great resource - but we also have to manage
> expectations that even they may have problems (power outages, network
> downtime, server failures, etc).
>
>
> Andrew
>
> On Mon, Mar 14, 2011 at 3:04 PM, Heather Blanchard
> <hea...@crisiscommons.org> wrote:
>> In fact, we agree and do have a solution: Open Source Labs. We are currently planning the migration of our platforms to OSL, in fact this week. This incident just came at a time when we were in between infrastructures. And it's not just us; there are thousands of folks in the same position. It's not ideal. We have a plan for mirroring data around the world as well.
>>
>> Andrew - would be great to have you and anyone else interested in infrastructure to join our infrastructure team. They would love to have your help!
>>
>> Heather
>>
>> On Mar 14, 2011, at 2:54 PM, Andrew Lih wrote:
>>
>>> Dreamhost? Get it off there as quickly as possible. I'd suggest a more
>>> dedicated solution, à la Slicehost or SoftLayer, depending on how much
>>> sysadmin resources you have.
>>>
>>>
>>> On Mon, Mar 14, 2011 at 9:57 AM, CrisisCommons <crisis...@gmail.com> wrote:
>>>> Not so dead, but disconnected for a few folks. Some people in Europe and on the East Coast can connect, but some out West can't. It's definitely a Dreamhost problem.
>>>>
>>>> Plan B is working: folks are continuing on via the Skype chat. Temporary Google Docs have been set up for those who can't access the wiki until Dreamhost core gets back to full production.
>>>>
>>>> Thanks Spike, Deborah and Andrew for being on top of it. We will be reporting back here and on the Skype Infrastructure chat with updates.
>>>>
>>>> Best,
>>>> Heather
>>>>
>>>> On Mar 14, 2011, at 12:22 PM, Chris Foote (Spike) wrote:
>>>>
>>>>> Dear all!
>>>>>
>>>>> Some of you may have noticed that the CrisisCommons wiki (and other sites) are currently having problems.
>>>>>
>>>>> This is being caused by the problems being experienced by our hosting company Dreamhost.
>>>>>
>>>>> We hope this will be resolved soon.
>>>>>
>>>>> Regards
>>>>> Spike
>>>>>
>>>>> BTW - it's not just us - http://bit.ly/gRIRpM
>>>>>
>>>>> CrisisCommons Infrastructure Working Group
>>>>>
>>>>
>>>>
>>
>>
>
>
>
> --
> Andrew Turner
> mobile: 248.982.3609
> and...@fortiusone.com
> http://highearthorbit.com
>
> http://geocommons.com Helping build the Geospatial Web
> Introduction to Neogeography - http://oreilly.com/catalog/neogeography
>

Andrew Turner

Mar 14, 2011, 6:00:50 PM

to crisiscom...@googlegroups.com, dsha...@comcast.net

Thanks for working with OSL and moving us to a more dedicated infrastructure, Deborah. It's definitely going to be a big help and boost.

Today was also a good indicator of some things to work on, re: replication etc. I think Tim Schwartz has some great, straightforward ideas that he's going to share with the list shortly.

What can we do to move forward on the transfer? I sent OSL a DB and Site dump a few weeks ago. It would be good to see if they've stood that up to ensure everything is in place.

Andrew

Jeff Sheltren

Mar 14, 2011, 6:03:26 PM

to crisiscom...@googlegroups.com

On Mon, Mar 14, 2011 at 3:00 PM, Andrew Turner <and...@crisiscommons.org> wrote:
> Thanks for working with OSL and moving us to a more dedicated infrastructure
> Deborah. Definitely going to be a big help and boost.
> Today was a good indicator of some things to work on as well re: replication
> etc. I think Tim Schwartz has some great and straight-forward ideas that
> he's going to share shortly with the list.
> What can we do to move forward on the transfer? I sent OSL a DB and Site
> dump a few weeks ago. It would be good to see if they've stood that up to
> ensure everything is in place.
> Andrew

Yep, it's up and I know at least Spike has been testing the instance here...

-Jeff

dsha...@comcast.net

Mar 14, 2011, 6:55:09 PM

to Andrew Turner, crisiscommons cciwg

Andrew:

Regarding the WIKI: it is ready to test with the broader community; I will send another email to this group.

Regarding CC.ORG: OSL is awaiting your response to ticket #18409.

Thanks,

Deb

Ted Han

Mar 15, 2011, 10:08:03 PM

to crisiscom...@googlegroups.com, Andrew Turner

Hey Gang,

The Crisis Commons wiki has dropped again. I'm able to ping the box, but it's no longer serving HTTP requests (I've been firing curl requests at it to no avail).

As I don't have access to the box, someone else will have to take a look at it and find out what's up.
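
A box that answers ping but not HTTP shows the difference between host reachability and service health. ICMP ping needs raw sockets, so the minimal sketch below approximates the layers with two checks: a TCP connect to the web port, then a real HTTP request. The function names are hypothetical.

```python
import http.client
import socket
import urllib.error
import urllib.request


def tcp_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """True if a TCP connection to host:port succeeds (is anything listening?)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, name lookup failed
        return False


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    # URLError and HTTPError (4xx/5xx) subclass OSError; malformed
    # responses raise http.client.HTTPException instead.
    except (OSError, http.client.HTTPException):
        return False


def diagnose(host: str, port: int, url: str) -> str:
    """Separate 'nothing answers on the port' from 'port open, HTTP broken'."""
    if not tcp_port_open(host, port):
        return "no TCP listener reachable on port %d" % port
    if not http_ok(url):
        return "port open but HTTP requests failing"
    return "healthy"
```

Suppose a host answers ping but `diagnose` reports no TCP listener. Then the web server process, or a firewall, is the likely suspect. If the port is open but HTTP fails, the application or something behind it (for example, the database) is more likely at fault.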

Cheers,

-Ted

P.S. I'll find out whether Dreamhost is having trouble again.

CrisisCommons

Mar 15, 2011, 10:24:27 PM

to crisiscom...@googlegroups.com, Andrew Turner

Andrew and Deborah have been pinged.

Ted Han

Mar 15, 2011, 10:26:36 PM

to crisiscom...@googlegroups.com, Andrew Turner

Yep, it's DreamHost again.

Yay.

https://twitter.com/dreamhost/status/47341995672342529

I would recommend moving off of DH as quickly as possible :)

Cheers guys,

-Ted

Vivek Lakshmanan

Mar 15, 2011, 10:34:24 PM

to crisiscom...@googlegroups.com, CrisisCommons, Andrew Turner

Hi All,

On Tue, Mar 15, 2011 at 10:24 PM, CrisisCommons <crisis...@gmail.com> wrote:
> Andrew and Deborah have been pinged
>
> On Mar 15, 2011, at 10:08 PM, Ted Han wrote:
>>
>> The Crisis Commons wiki has dropped again. I'm able to ping the box, but it's no longer serving HTTP requests (i've been firing curl's at it to no avail).
>>
>> As i don't have access to the box, someone else will have to take a look at it and find out what's up.

FWIW, DreamHost itself has not reported a problem at
http://dreamhoststatus.com, and all of the sites I host on DreamHost
are up. The wiki does indeed seem to be down, though.

I was watching the discussion on moving to dedicated hosting with OSL
and replicating to secondary hosting providers for redundancy. That
all sounds perfectly reasonable. Another obvious request/suggestion to
throw on the list would be to register key properties like the wiki
with a monitoring service like Pingdom. It should help detect
downtime and promptly notify admins. It might even help detect partial
unavailability, for instance the wiki not being accessible from
certain parts of the world, as was the case during the recent DreamHost
outage.
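
Below is a minimal sketch of the core logic a monitoring service such as Pingdom provides. It polls a check and alerts only after several consecutive failures, so one dropped request does not page anyone. When the site comes back, it sends a single recovery notice. The class and function names are hypothetical. `check` and `notify` are left to the caller, for example an HTTP check and an email or chat message. Running this from one location will not catch a region-specific outage like the one described above; hosted services probe from many locations for that reason.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DowntimeAlerter:
    """Turn a stream of up/down check results into DOWN/RECOVERED events."""
    threshold: int = 3  # consecutive failures required before alerting
    consecutive_failures: int = 0
    alerted: bool = False

    def record(self, ok: bool) -> Optional[str]:
        if ok:
            self.consecutive_failures = 0
            if self.alerted:  # we previously reported an outage
                self.alerted = False
                return "RECOVERED"
            return None
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold and not self.alerted:
            self.alerted = True  # alert once per outage, not on every failure
            return "DOWN"
        return None


def watch(check: Callable[[], bool], notify: Callable[[str], None],
          interval_s: float = 60.0, threshold: int = 3) -> None:
    """Poll `check` forever, calling `notify` only when the state changes."""
    alerter = DowntimeAlerter(threshold=threshold)
    while True:
        event = alerter.record(check())
        if event is not None:
            notify(event)
        time.sleep(interval_s)
```

Keeping the alerting state separate from the polling loop makes the threshold behavior easy to test without waiting on real checks.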
- Vivek

Deborah Shaddon

Mar 15, 2011, 10:44:48 PM

to crisiscom...@googlegroups.com, crisiscom...@googlegroups.com, CrisisCommons, Andrew Turner

Thanks everyone. We do have monitoring today, I sent a copy of that as FYI in another post.

In the future, OSL, as a fully operational data center, does have monitoring. I know Lance from OSL was setting that up on our test servers for availability and performance monitoring, as well as usage metrics. A few key people on our side will be notified, but more importantly, it will be integrated directly with their help/support desk. They host lots of open source projects, like Apache and the Linux kernel, and other CC partners like Sahana.