Devops Workout: 1 and 2 and 3 oohrah: Quantitative vs. Qualitative

Scott McCarty

Jul 6, 2010, 10:50:57 AM

to devops-t...@googlegroups.com

So, this is a workout for all of you. This is a very complex build that we just started deploying at my place of employment (www.eyemg.com). Everything said and done, with documentation, I probably have 60-ish hours in its development, and it is not automated. I would estimate that each build of one of these clusters would take about 4-6 hours, and it would probably take 120 hours of senior sysadmin/devops time to automate this deployment. I believe it would then take an extra 30-40 hours per year, as Red Hat (read: any distro) always makes small changes, even within a major release (I have had the syntax of wget, rsync, etc. change even between minor revisions, RHEL 4.0 and 4.6 for example). If my estimates are correct, and they are probably grossly underestimated, I would have to build between 27 and 40 of these clusters the first year (160 hours / 4-6 hours per build) and 8-10 per year after the first to make automating this deployment cost effective.

Now, let's say that I am grossly underestimating. It has almost unanimously been my experience (and that of almost every software vendor on the planet) that when coding something for which I have no experience, I am easily wrong by a factor of 2 to 4. In this case, I would have to build 130-160 of these clusters in the first year to make it worth automating the build, and that doesn't even take into account the cost difference of having a senior sysadmin/coder (devops) person work on the scripting of this install. It might even double again.
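
The break-even arithmetic above can be sketched in a few lines of Python. This is only a rough sketch, not anything from the original post: the name `break_even_builds` and the `error_factor` parameter are made up for illustration, and it assumes an automated build costs no hands-on time at all, which flatters automation.

```python
import math


def break_even_builds(automation_hours, manual_hours_per_build, error_factor=1.0):
    """Return how many builds it takes before automation pays for itself.

    Assumption (hypothetical): an automated build needs zero hands-on hours,
    so every build saves the full manual build time.
    """
    if manual_hours_per_build <= 0:
        raise ValueError("manual_hours_per_build must be positive")
    # Scale the automation estimate up to allow for underestimation.
    total_hours = automation_hours * error_factor
    return math.ceil(total_hours / manual_hours_per_build)


# Figures from the post: roughly 160 hours of automation and upkeep in the
# first year, against 4 to 6 hours per manual build.
print(break_even_builds(160, 6), "to", break_even_builds(160, 4), "builds")
```

Passing `error_factor=4` models the case where the automation estimate is off by a factor of four.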

Clearly, my point is that the cost-benefit of automation still has to stay in our field of vision, especially when we are building the underlying cloud infrastructure. I really want to hear opinions on this, especially from the developer side.

Tutorial: how to build a shared-nothing HA cluster for KVM/DRBD/GFS2. Could this build be automated? http://ow.ly/27xPN

Leigh Heyman

Jul 6, 2010, 11:39:08 AM

to devops-t...@googlegroups.com

There's one aspect here I'm not quite sure how to quantify. Maybe it's some kind of amortization, I suppose, but one question you need to ask is: after the initial build-out, how often do you anticipate building more? I.e., is the 8-10 per year after the first year high or low? If that's a high estimate, and in fact you expect to build only one of these every few months, there is a risk-benefit equation that's not getting captured by your straight cost-benefit number. Namely, risk increases as the frequency with which you perform a repeated task decreases.

I went through a similar quantification exercise a few years ago and discovered the same (flawed) result-- that it seemed more cost effective to stick with a manual process because we set these systems up so rarely. Problem was, because we did it so infrequently, the docs on the manual process weren't kept up to date as the systems in production evolved. So every time we set up a new one, we'd miss something we wouldn't find until we brought the system online, often causing at least partial downtime and burning up people's hours trying to find the problem and fix it (and update the docs and so on). In that context, the cost of building out an automated system suddenly made a lot more sense.

Also, how often do you expect to modify the configurations of the systems once they're in production? How does it change the equation when you're maintaining these systems with the same tools you've created to build them out?

-L

Scott McCarty

Jul 6, 2010, 11:57:26 AM

to devops-t...@googlegroups.com

I agree. So here is the magic, though, and I am partially playing devil's advocate here: I had a very similar experience with a Red Hat Cluster installation tool that I built. First, I had it in a human script form, as I like to call it, in a wiki. This worked great and we built a couple of them a year. I agree, it was a pain; we did them so rarely that we almost always had minor glitches, sometimes major, but the wiki worked way better than building them from memory (no joke, that's what they did before). Well, eventually, I got cocky and said, I can automate this extremely complex system install. Well, I succeeded after about 140 hours of coding, testing, etc. Here is the rub: now, every time stuff broke because either Red Hat changed something, something changed in our infrastructure, or we had to add a feature, it multiplied the cost because I had to update the automation, and the automation was just as prone to destroying things or having undetectable failures.

Definitely, there is some art here at some point. I stuck this out here, because I would love to see what other people have used to quantify this problem. In security the risk analysis is mushy, but still gives a wag. I think we need something like that. The other thing is, why do modern operating systems not have clean APIs for everything? I mean everything, from aligning partitions on boundaries to generating configuration files. Why is monitoring and performance acquisition not standard in all pieces of software? These problems will continue to plague us until we have some standards for what a "complete" piece of software is. I think cloud is helping drive this.

I hate to admit this, but it has taken ten years of me doing this kind of work to finally come to these conclusions and feel confident enough about my knowledge of this stuff to finally have a strong opinion. For newcomers to this field it is a long, arduous path. I think anyone that calls themselves a dev-op after three years of experience is kind of stretching it. Maybe an infrastructure programmer, or a sysadmin, but the whole shebang takes a lot of business acumen too.

Scott M

Scott Smith

Jul 6, 2010, 12:05:49 PM

to devops-t...@googlegroups.com

Aren't you using Puppet or Chef?

Jonathan Hitchcock

Jul 6, 2010, 12:10:53 PM

to devops-t...@googlegroups.com

On Tue, Jul 6, 2010 at 08:57, Scott McCarty <scott....@gmail.com> wrote:

> Here is the rub: now, every time stuff broke because either Red Hat changed something, something changed in our infrastructure, or we had to add a feature, it multiplied the cost because I had to update the automation, and the automation was just as prone to destroying things or having undetectable failures.

A hand-crafted script will probably break like this, yes. However, using Puppet or Chef means that you are often using plugins which the community has written, which will stay more up to date than the script (or at least have better error detection and recovery).

> I think anyone that calls themselves a dev-op after three years of experience is kind of stretching it. Maybe an infrastructure programmer, or a sysadmin, but the whole shebang takes a lot of business acumen too.

We don't call ourselves dev-ops, we call ourselves sysadmins. Devops is the way we do things while we're being sysadmins :)

Leigh Heyman

Jul 6, 2010, 12:20:07 PM

to devops-t...@googlegroups.com

On Tue, Jul 6, 2010 at 11:57 AM, Scott McCarty <scott....@gmail.com> wrote:

> I agree. So here is the magic, though, and I am partially playing devil's advocate here: I had a very similar experience with a Red Hat Cluster installation tool that I built. First, I had it in a human script form, as I like to call it, in a wiki. This worked great and we built a couple of them a year. I agree, it was a pain; we did them so rarely that we almost always had minor glitches, sometimes major, but the wiki worked way better than building them from memory (no joke, that's what they did before). Well, eventually, I got cocky and said, I can automate this extremely complex system install. Well, I succeeded after about 140 hours of coding, testing, etc. Here is the rub: now, every time stuff broke because either Red Hat changed something, something changed in our infrastructure, or we had to add a feature, it multiplied the cost because I had to update the automation, and the automation was just as prone to destroying things or having undetectable failures.

Sounds like we've traveled a lot of the same ground... brain ---> wiki ---> automated build. The next step though (perhaps you're there already), and here's where I think the long-term payoff is the highest, is to build and use the SAME tools for both building and maintaining the systems. This more or less solves the issue of having to keep the build system up to date with feature and infrastructural changes, because the very act of pushing the new change will also update your build. But then, I'm probably simplifying things in the name of generalization.

> Definitely, there is some art here at some point. I stuck this out here, because I would love to see what other people have used to quantify this problem. In security the risk analysis is mushy, but still gives a wag. I think we need something like that.

Agreed. Actually I don't know if you were there, but John Allspaw gave a great talk at Velocity that covered this very topic. By tracking (and graphing) incident frequency and duration, and cross-referencing e.g. time-to-resolve numbers against the "size" of releases etc. or other waypoints, you can actually quantify, to some degree, the risk of certain actions-- way beyond just the anecdotal or "it feels."

Slides here:
http://www.slideshare.net/jallspaw/ops-metametrics-the-currency-you-pay-for-change
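
As a rough illustration of that kind of cross-referencing (not taken from the talk or the slides), here is a small Python sketch that groups incidents by the size of the change behind them and averages the time to resolve. The function name, the size labels and the sample numbers are all invented for the example.

```python
from collections import defaultdict
from statistics import mean


def mean_time_to_resolve(incidents):
    """Average minutes-to-resolve per change size.

    incidents is an iterable of (change_size, minutes_to_resolve) pairs.
    """
    by_size = defaultdict(list)
    for change_size, minutes in incidents:
        by_size[change_size].append(minutes)
    return {size: mean(times) for size, times in by_size.items()}


# Made-up sample data, for illustration only.
sample = [("small", 10), ("small", 20), ("large", 90), ("large", 150)]
for size, minutes in mean_time_to_resolve(sample).items():
    print(size, "changes: mean time to resolve", minutes, "minutes")
```

With real incident records in place of the sample list, the same grouping could be done per release, per team, or per any other waypoint.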

-L

Scott McCarty

Jul 6, 2010, 12:21:48 PM

to devops-t...@googlegroups.com

Agreed, but Puppet and Chef can't align partition boundaries and honestly, neither can kickstart yet. There are always problems outside the boundaries of our tools. What about baselining the performance of every system before they get deployed to the production pool in a server farm, etc.?

I guess what I am saying is I see a lot of people finding problems that fit the tool, which is great, and also common when you are emotionally attached, but when I have problems outside the bounds of a tool (every day), I watch people turn into piles of mush because they don't know what to do. I still haven't quite been able to put my finger on it, but reproducibility and automation are not the same thing. The business requires one, not the other.

Scott M

Scott McCarty

Jul 6, 2010, 12:27:43 PM

to devops-t...@googlegroups.com

On Tue, Jul 6, 2010 at 12:20 PM, Leigh Heyman <leigh....@gmail.com> wrote:

> On Tue, Jul 6, 2010 at 11:57 AM, Scott McCarty <scott....@gmail.com> wrote:

>> I agree. So here is the magic, though, and I am partially playing devil's advocate here: I had a very similar experience with a Red Hat Cluster installation tool that I built. First, I had it in a human script form, as I like to call it, in a wiki. This worked great and we built a couple of them a year. I agree, it was a pain; we did them so rarely that we almost always had minor glitches, sometimes major, but the wiki worked way better than building them from memory (no joke, that's what they did before). Well, eventually, I got cocky and said, I can automate this extremely complex system install. Well, I succeeded after about 140 hours of coding, testing, etc. Here is the rub: now, every time stuff broke because either Red Hat changed something, something changed in our infrastructure, or we had to add a feature, it multiplied the cost because I had to update the automation, and the automation was just as prone to destroying things or having undetectable failures.

> Sounds like we've traveled a lot of the same ground... brain ---> wiki ---> automated build. The next step though (perhaps you're there already), and here's where I think the long-term payoff is the highest, is to build and use the SAME tools for both building and maintaining the systems. This more or less solves the issue of having to keep the build system up to date with feature and infrastructural changes, because the very act of pushing the new change will also update your build. But then, I'm probably simplifying things in the name of generalization.

Not sure; I am not convinced of that just yet. I still think baselining is more like extending the operating system, or infrastructure, and deployment cycles are something different. I won't even dig into it all, but yeah, I think I am there; I just haven't had enough time to implement my brainchild, called Ranchero, in code yet. I really think the magic is in the qualitative/quantitative hand-off, which in my opinion is the place where you let human beings make decision-based input to the system. Currently, I don't think any tool has the hand-off right, and I think they are missing it because there are not APIs for every service...yet. But, oh well, this is like politics, lol.

>> Definitely, there is some art here at some point. I stuck this out here, because I would love to see what other people have used to quantify this problem. In security the risk analysis is mushy, but still gives a wag. I think we need something like that.

> Agreed. Actually I don't know if you were there, but John Allspaw gave a great talk at Velocity that covered this very topic. By tracking (and graphing) incident frequency and duration, and cross-referencing e.g. time-to-resolve numbers against the "size" of releases etc. or other waypoints, you can actually quantify, to some degree, the risk of certain actions-- way beyond just the anecdotal or "it feels."

> Slides here:
> http://www.slideshare.net/jallspaw/ops-metametrics-the-currency-you-pay-for-change

Now this is good stuff

> -L

Gerardo López-Fernández

Jul 6, 2010, 12:32:41 PM

to devops-t...@googlegroups.com

On Jul 6, 2010, at 5:57 PM, Scott McCarty wrote:

> I hate to admit this, but it has taken ten years of me doing this kind of work to finally come to these conclusions and feel confident enough about my knowledge of this stuff to finally have a strong opinion. For newcomers to this field it is a long, arduous path. I think anyone that calls themselves a dev-op after three years of experience is kind of stretching it. Maybe an infrastructure programmer, or a sysadmin, but the whole shebang takes a lot of business acumen too.

+1

This should be one of the quotes heading a chapter in the devops-toolchain book. There are countless blog posts out there about the joys of DevOps Nirvana, a lot of which claim that "we picked up this devops stuff and it's great; you should try it too". Very few, if any, address the _migration_ to the DevOps model from whatever current (and possibly broken) model you are running under, which is probably the hardest problem to overcome when you ponder embracing [any new] model. That does not even take into account whatever cultural and political issues you run into whenever you utter the word "change" in any organization. Change takes time, costs money, and requires shuffling organizations in ways people may or may not be ready to accept.

-Gerir

Kelly joyner

Jul 6, 2010, 5:27:11 PM

to devops-t...@googlegroups.com

I agree that some processes aren't worth automating and that you should do the math before embarking on an expensive automation project. However, one thing I see missing from this discussion is the marginal reliability improvement gained from automating deployment. These gains are small when you do smaller/fewer deployments, but become huge when you do many/large deployments.

The trick is to take the ultimate scaling of the system into account when you design the solution: if you end up having to go back and automate the deployment steps after everybody and everything is hooked into the manual process, then you'll incur much higher expenses over the lifetime of the system than if you'd automated it to begin with.

Most outages are caused by change, and no small part of those failed changes are caused by deployment errors. One point of the automation is to reduce the number of chances that engineers have to get things wrong. If humans make errors ~10% of the time, then every step you automate takes away one of those ~10% chances of an error.
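
To put a number on how those chances add up, here is a small Python sketch. It assumes, purely for illustration, that every manual step fails independently and at the same rate; real deployment steps are not that tidy. The function name `chance_of_any_error` is invented for the example.

```python
def chance_of_any_error(per_step_error_rate, manual_steps):
    """Probability that at least one manual step goes wrong.

    Assumption (hypothetical): steps fail independently at the same rate.
    """
    if not 0.0 <= per_step_error_rate <= 1.0:
        raise ValueError("per_step_error_rate must be between 0 and 1")
    if manual_steps < 0:
        raise ValueError("manual_steps must not be negative")
    # Chance that every step succeeds, subtracted from certainty.
    return 1.0 - (1.0 - per_step_error_rate) ** manual_steps


# Compare a long manual procedure with progressively more automated ones.
for steps in (20, 10, 5):
    print(steps, "manual steps:", round(chance_of_any_error(0.10, steps), 3))
```

Under that assumption, each manual step removed lowers the chance that a deployment contains at least one mistake.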

This change is quantifiable, too, although not necessarily for solutions you haven't created yet.

For example: For an existing solution, scrub your pre-automation change-related outages, and determine how many were related to operator error. Divide that by the number of changes. Then scrub your post-automation outages, and determine how many were related to operator error. Also divide that by the number of changes. The difference between the two figures is moderately close to the marginal gain you got from automation.
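
The calculation just described can be sketched in Python as follows. The function names and the counts are hypothetical, chosen only to show the arithmetic.

```python
def operator_error_rate(operator_error_outages, total_changes):
    """Fraction of changes that led to an operator-error outage."""
    if total_changes <= 0:
        raise ValueError("total_changes must be positive")
    return operator_error_outages / total_changes


def marginal_gain(pre_outages, pre_changes, post_outages, post_changes):
    """Drop in operator-error rate from before to after automation."""
    before = operator_error_rate(pre_outages, pre_changes)
    after = operator_error_rate(post_outages, post_changes)
    return before - after


# Hypothetical counts: 12 operator-error outages in 100 changes before
# automation, 3 in 150 changes after.
gain = marginal_gain(12, 100, 3, 150)
print(f"Operator-error rate dropped by {gain:.1%} of changes")
```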

Oh, there are other factors: familiarity with a deployment system tends to decrease the error rate as time goes on, and people often don't make the same mistake twice, which also tends to decrease the error rate as time goes on. If you've got a small difference between those two numbers, or a small sample size (e.g., only did three or four deployments), then any signal here may get lost in the noise. However, if there's a big difference between the numbers, and especially if the change centers around the time the deployment was automated, then that is very informative.

Scott raises an issue with automated deployment systems that I disagree with, another one that depends on the particular project, and one that I think is partially relevant:

a) Engineers who've been using an automated deployment system aren't as familiar with the process and don't know how to react when things go awry.

This is akin to owning a fencing company and saying "I'm going to have my workers dig post-holes manually, because otherwise they won't know how to dig a post-hole when the digging machine breaks." That may be true, but the company digging their holes automatically is going to eat your lunch whenever their machine is working.

If you have engineers that don't understand the system enough to troubleshoot it, that, to me, is a separate issue that should be addressed in a testing or staging environment.

b) Changes in the underlying components can break an automated deployment as easily as--or more easily than--a manual deployment.

True, but with an automated deployment system you can more easily test the deployment system with new components, and even put it under CI, so that a complete end-to-end test deployment is done automatically whenever any of the components changes.

If you do a small number of deployments into an environment where things rarely change, then the cost of automation plus the costs of the test may not be worth it; as you move toward a larger number of smaller deployments--toward the continuous deployment end of the spectrum, in other words--then that cost becomes much easier to amortize.

c) Some components of the deployment process are not correctly automatable.

True. However, I think this is getting to be a smaller and smaller set of corner cases. If the scale is there, I think it's generally best to automate what you can, because you still get the benefit of fewer errors for the parts you can automate.

As an aside, in my experience, if a human does a part of a process, and it's possible to automatically verify that the human did it right, then that can be a significant value right there.

Thanks,
-k. -_-


James Bailey

Jul 7, 2010, 1:53:20 PM

to devops-t...@googlegroups.com

I have to agree with this: I am a system administrator, not a dev-op. I
called myself a junior systems administrator for over 5 years; only
now, after 10 years, do I call myself a senior system administrator,
something I am very proud of. Devops is a useful skill set but it
doesn't encompass everything system administration encompasses. It does
encompass network topology; it doesn't encompass sane design of the
underlying platforms. It doesn't encompass troubleshooting
techniques. It doesn't encompass data centre design, cooling and power
strategies. System administration does.

Why is there this embarrassment to be a system administrator? Paul
Nasrat was blogging about this 2 years ago, saying that system
administrator as a term should die.[0]

One of the reasons I became interested in Devops, continuous build and
deployment processes and tools like Cobbler, Puppet and ControlTier,
was that they helped me deal with the constant stream of poorly coded
releases from contractors and in-house developers and the 30+ tweaks
to the system that were needed to release this particular version of
the software.

Being a system administrator is not about releasing code, sorry if
that comes as a shock to the dev-ops people in the room. Releasing
code and data is something we do as part of our job; our job is
administering the system.

[0] http://nasrat.livejournal.com/52943.html

Jim :)

Scott Smith

Jul 7, 2010, 2:06:11 PM

to devops-t...@googlegroups.com

I hate labels.

Scott McCarty

Jul 7, 2010, 2:15:30 PM

to devops-t...@googlegroups.com

On Wed, Jul 7, 2010 at 1:53 PM, James Bailey <parado...@googlemail.com> wrote:

> On 6 July 2010 17:10, Jonathan Hitchcock <jonathan....@gmail.com> wrote:

>> On Tue, Jul 6, 2010 at 08:57, Scott McCarty <scott....@gmail.com> wrote:
>>>
>>> I think anyone that calls themselves a dev-op after three years of
>>> experience is kind of stretching it. Maybe an infrastructure programmer, or
>>> a sysadmin, but the whole shebang takes a lot of business acumen too.
>>

>> We don't call ourselves dev-ops, we call ourselves sysadmins. Devops is the
>> way we do things while we're being sysadmins :)

> I have to agree with this: I am a system administrator, not a dev-op. I
> called myself a junior systems administrator for over 5 years; only
> now, after 10 years, do I call myself a senior system administrator,
> something I am very proud of. Devops is a useful skill set but it
> doesn't encompass everything system administration encompasses. It does
> encompass network topology; it doesn't encompass sane design of the
> underlying platforms. It doesn't encompass troubleshooting
> techniques. It doesn't encompass data centre design, cooling and power
> strategies. System administration does.

Agree with all

> Why is there this embarrassment to be a system administrator? Paul
> Nasrat was blogging about this 2 years ago, saying that system
> administrator as a term should die.[0]

I think there is a terrible misunderstanding going on here. I meant embarrassment that it takes ten years to really be a senior systems administrator. I want to help junior admins with my code/documentation/blogging.

> One of the reasons I became interested in Devops, continuous build and
> deployment processes and tools like Cobbler, Puppet and ControlTier,
> was that they helped me deal with the constant stream of poorly coded
> releases from contractors and in-house developers and the 30+ tweaks
> to the system that were needed to release this particular version of
> the software.

Agreed, but they never seem to solve the 22K Apache redirects that I have to test and implement when Business Intelligence determines that they need them to increase traffic/sales, etc. Kickstart/Cobbler and build/compile/deploy tools still hurt to use during the development phase. There needs to be more granularity, but I agree that they are the best that we have right now.

> Being a system administrator is not about releasing code, sorry if
> that comes as a shock to the dev-ops people in the room. Releasing
> code and data is something we do as part of our job; our job is
> administering the system.

Agreed, for a systems administrator only, but that is part of the magic of the dev-ops concept. Until I started releasing some open source software (all systems administration related), I didn't realize some of the pieces I was missing about the deployment phase. I built a crap load of scripts to auto-deploy RPMs to my local systems, while at the same time building tarballs and debs to be deployed on the public website, while interacting with the version control repository, etc. I had to write tools to do all of this stuff like a developer. The cool part is, I didn't have to ask anybody for hardware/software resources (OS install/servers) because I am a sys admin; I just deployed all of this crap, lol.

There is definitely something I missed the first five to seven years of being a sys admin. Now I can configure a firewall, manage HVAC/Electrical Contractors, design a network, develop a piece of software for public consumption from scratch and write everything to deploy it sanely and not annoy other sys admins. I have a bit more of a "can do" attitude with regard to code and deployment and change because I understand the requirements a little tiny bit better now that I have had to release code.

I think the concept of devops is sort of romanticized by some of us (myself included), and that has been mentioned as bad by many systems administrators before me, but you really do have to "know" and "understand" so much "stuff" to classify your work as devops. I think there is something good to aspire to there.

Kind Regards
Scott M

Scott Smith

Jul 7, 2010, 2:25:26 PM

to devops-t...@googlegroups.com

On Jul 7, 2010, at 11:15 AM, Scott McCarty wrote:
> I think there is a terrible misunderstanding going on here. I meant embarrassment that it takes ten years to really be a senior systems administrator. I want to help junior admins with my code/documentation/blogging.
>

Until legit four-year degrees in Systems Administration become commonplace, I think ten years will remain a standard baseline, in terms of straight experience in the field. This profession is *way* more complex than software development. Unless one is working a gig that continually focuses on one specialty, there's no way they can learn enough to be truly senior in less than six years.

With software development, someone might spend ~4 years (counting time spent programming and learning on their own) in university, and maybe three years into their actual career they could be considered senior. And again, that's a field which is much less complex.

I think the SAGE job descriptions are a good reference for sysadmins trying to set goals, but they aren't up to date with today's skill requirements.

> I think the concept of devops is sort of romanticized by some of us (myself included), and that has been mentioned as bad by many systems administrators before me, but you really do have to "know" and "understand" so much "stuff" to classify your work as devops. I think there is something good to aspire to there.

I hate labels.

-scott

Ryan Dooley

Jul 7, 2010, 3:15:29 PM

to devops-t...@googlegroups.com

On 7/6/2010 9:21 AM, Scott McCarty wrote:
> Agreed, but Puppet and Chef can't align partition boundaries and
> honestly, neither can kickstart yet. There are always problems outside
> the boundaries of our tools. What about baselining the performance of
> every system before they get deployed to the production pool in a
> server farm, etc.?

Puppet, Chef, CfEngine, etc. all can't, for sure, but a simple (ruby,
python, bash) script wrapping parted can :)

FWIW, I do this for Powerset, building software RAIDs on installation.
My installer calculates boundaries for Linux MD devices based on the
smallest disk in the partition.

You could easily do a Puppet class that requires an
Exec["partition-and-format-raid"] to be valid before installing support
software and dumping configs and services on it.
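
For illustration only, here is a minimal Python sketch of what such a wrapper might compute. It is not the Powerset installer: the 1 MiB alignment, the 512-byte sector size, the `first_usable` default and all function names are assumptions, and it only builds the parted command lines without running them.

```python
SECTOR_BYTES = 512      # assumed logical sector size
ALIGN_SECTORS = 2048    # assumed 1 MiB alignment boundary (2048 * 512 bytes)


def align_up(sector, boundary=ALIGN_SECTORS):
    """Round a sector number up to the next alignment boundary."""
    return -(-sector // boundary) * boundary


def align_down(sector, boundary=ALIGN_SECTORS):
    """Round a sector number down to the previous alignment boundary."""
    return (sector // boundary) * boundary


def raid_member_commands(disk_sectors, first_usable=63):
    """Build (but do not run) parted commands for identical RAID members.

    disk_sectors maps a device path to its size in sectors. Every disk gets
    one aligned partition, sized to fit the smallest disk.
    """
    start = align_up(first_usable)
    # End just before a boundary so the partition spans whole alignment units.
    end = align_down(min(disk_sectors.values())) - 1
    if end <= start:
        raise ValueError("smallest disk is too small to partition")
    commands = []
    for device in sorted(disk_sectors):
        commands.append(["parted", "--script", device, "mklabel", "msdos"])
        commands.append(["parted", "--script", device,
                         "mkpart", "primary", f"{start}s", f"{end}s"])
    return commands


# Hypothetical disk sizes; print the commands instead of executing them.
for cmd in raid_member_commands({"/dev/sda": 1000000, "/dev/sdb": 976773168}):
    print(" ".join(cmd))
```

Keeping the command construction separate from execution makes it easy to review the output on unfamiliar hardware before anything touches a disk.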

Cheers,
Ryan

Scott McCarty

Jul 7, 2010, 3:19:12 PM

to devops-t...@googlegroups.com

This is good news. Show us some code brutha, I don't want to have to write that boundary checking stuff. Can parted do this for the root drive too?

Regards
Scott M

Ryan Dooley

Jul 7, 2010, 3:28:07 PM

to devops-t...@googlegroups.com

On 7/7/2010 12:19 PM, Scott McCarty wrote:
>
> This is good news. Show us some code brutha, I don't want to have to
> write that boundary checking stuff. Can parted do this for the root
> drive too?
>

Yup. At Powerset on our older machines (when we didn't have many) we
used software RAID-1s for boot, root and swap so if we lost a disk it
wouldn't stop the services on that host. Not the most optimal setup
but, hey, you do what you can when money is tight :)

I could put together a small proof-of-concept util for this sort of
thing, though I have no cciss devices on which to experiment. I guess
that's the catch, right: it depends on the hardware. All of our disk
controllers exported disks as /dev/sdX.

Cheers,
Ryan

Scott McCarty

Jul 7, 2010, 3:38:10 PM

to devops-t...@googlegroups.com

If you post it, I can test it on a cciss device. It doesn't have to be the whole shebang, just enough to run during a kickstart, I suppose.

Scott M

Joshua Timberman

Jul 8, 2010, 2:19:58 AM

to devops-t...@googlegroups.com

Hello!

On Jul 6, 2010, at 8:50 AM, Scott McCarty wrote:

> Tutorial: how to build a shared-nothing HA cluster for KVM/DRBD/GFS2. Could this build be automated? http://ow.ly/27xPN

Yes, it can be automated.

Anything you can do with Unix commands you can do with a tool like Chef or Puppet. The variables here are the disk sizing calculations, and that can be done with a script (shell, perl, ruby, pick your flavor), or in the case of Chef, in the recipe itself[0]. The Unix commands used are going to set up / configure system resources in some way, and either Chef or Puppet have the capability to manage a wide variety of resources.

The amount of time taken to automate this is going to vary with the individual sysadmin's or specific team's familiarity with the tools and the procedures involved. This particular document is very detailed, and there is a lot of domain knowledge here. There's a sunk cost (time) fallacy here, in that the procedure is so complex and its steps were written out in a way that is designed to be run by hand. This breaks down at scale (~30-some deployments for this alone), and is prone to human error.

I've seen many a "copy and paste" wiki document for procedurally setting up some service or kit manually, often because automating the tasks was deemed too difficult, or would be ineffective time/cost-wise. But invariably at some point, someone will miss a step, even the 15 year veteran expert that has done it dozens of times and/or wrote the procedure.

This doesn't even take into account troubleshooting and fixing the system if it broke somewhere in the middle of installation, especially under a deadline or during an outage. I'm sure we all have our HA software outage war stories.

Perhaps I have a strange perspective. After 3 years of building fully automated infrastructures with a variety of tools, I start with automating in mind. My friends thought it silly that I wrote a Chef cookbook to install teamspeak, since it's an "install once and forget it" kind of thing on my home server, but it has proven a time saver since the latest version is under active development[1].

I'd love to take up the challenge, but I have my own time poverty right now.

[0]: Chef recipes are a Ruby DSL, so you can write Ruby code to do the math.
[1]: Teamspeak 3, which is in beta, has sporadic new releases, and it's a snap for me to upgrade :).

--
Joshua Timberman
http://twitter.com/jtimberman

Scott McCarty

Jul 8, 2010, 8:27:06 AM

to devops-t...@googlegroups.com

In my small environment (~100 servers), there ARE things that cannot be automated efficiently (RT installation, is a singlton), period. This is not religion and we are not developers arguing about a new favorite tool of the month like screen or zsh.

I think you may have missed an underlying point. I pressed/asked for the partition alignment example/code, partially to spark conversation, but also because it should be used for guest virtual machine installations too, which really do need to be automated in my environment. I wanted to demonstrate how easy it is to get down the path of religion and loose business acumen.

Beyond that, remember that some of these questions depend on staff training level too. We spent our time automating DNS entry building and Apache virtual server building with very robust tools because we do that all the time (10 times a week or more). However, we installed Request Tracker manually, probably 8 hours of farting around, but it has literally saved us hundreds of hours over the 3.5 years it has been installed. I don't plan on touching it for another 1.5 years. I will have gotten five years of usage out of 8 hours of work, saving hundreds. Now you might argue there are better tools in 2010; I don't care. It's not good business sense to even look at them seriously until its life cycle ends. Whether software is free or not does not affect the business underpinnings.

I don't automate because of philosophy, I do it when it makes sense. On the flip side, I am not scared of it either and at times have had to help others in my team when there was something a bit difficult to automate.

Tools like Puppet/Chef help in automation and may help slide the scale of when it makes sense to automate a little, but there will always be times that you cannot automate, always. This is especially true in small diverse environments like wild west hosting, which still has demand in 2010.
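
To make the "when it makes sense" test concrete, here is a minimal sketch of the break-even arithmetic, not a definitive model. The function name and its parameters are hypothetical. The 8 hours is the Request Tracker figure from above; the 40 hours to automate and the 1 hour per automated run are made-up numbers for illustration only.

```python
import math


def break_even_runs(automation_hours, manual_hours_per_run, automated_hours_per_run=0.0):
    """Return how many runs are needed before automating pays for itself.

    Returns None when automation never pays off, i.e. when an automated
    run saves no time over a manual one.
    """
    saved_per_run = manual_hours_per_run - automated_hours_per_run
    if saved_per_run <= 0:
        return None
    return math.ceil(automation_hours / saved_per_run)


# Hypothetical one-off install: 8 hours by hand (the RT figure above),
# an assumed 40 hours to automate, and an assumed 1 hour per automated run.
runs_needed = break_even_runs(automation_hours=40, manual_hours_per_run=8,
                              automated_hours_per_run=1)
print(runs_needed)  # 40 / 7 = 5.71..., rounded up to 6
```

For a singleton that gets installed once every five years, six runs never arrive. For a task done 10 times a week, the same arithmetic tips the other way quickly.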

http://crunchtools.com/the-qq-workflow/

Kind Regards
Scott M

Scott Smith

Jul 8, 2010, 11:52:20 AM

to devops-t...@googlegroups.com

On Jul 8, 2010, at 5:27 AM, Scott McCarty wrote:

> In my small environment (~100 servers), there ARE things that cannot be automated efficiently (the RT installation is a singleton), period. This is not religion, and we are not developers arguing about a new favorite tool of the month like screen or zsh.
>
> I think you may have missed an underlying point. I pressed/asked for the partition alignment example/code, partially to spark conversation, but also because it should be used for guest virtual machine installations too, which really do need to be automated in my environment. I wanted to demonstrate how easy it is to go down the path of religion and lose business acumen.
>

I'm confused. You seemed fairly explicit in the requirement that such work be automated in your environment, but maybe I misread. At any rate, I think most everyone would agree that time spent at work on a project should reflect the needs and values of the business.

At least, I haven't seen anyone arguing otherwise on here.

> Now you might argue there are better tools in 2010; I don't care. It's not good business sense to even look at them seriously until its life cycle ends. Whether software is free or not does not affect the business underpinnings.

I think that depends on the quality (stability, performance, and more importantly: USABILITY, REPEATABILITY) of your current automation / config management infrastructure vs. what's coming out these days.

Sometimes it isn't, sometimes it is. Sometimes it's better to pay newbie admins to do it.

> I don't automate because of philosophy, I do it when it makes sense. On the flip side, I am not scared of it either and at times have had to help others in my team when there was something a bit difficult to automate.
>
> Tools like Puppet/Chef help in automation and may help slide the scale of when it makes sense to automate a little, but there will always be times that you cannot automate, always.

I think the word "automate" is probably not the best to use. It doesn't reflect business value. I much prefer saying "time saving mechanism(s)."

> This is especially true in small diverse environments like wild west hosting, which still has demand in 2010.

Sure. That's the beauty of our jobs. You just use/do what works best for your environment.

-scott

Scott McCarty

Jul 8, 2010, 12:32:07 PM

to devops-t...@googlegroups.com

On Thu, Jul 8, 2010 at 11:52 AM, Scott Smith <sc...@ohlol.net> wrote:

> On Jul 8, 2010, at 5:27 AM, Scott McCarty wrote:
>
>> In my small environment (~100 servers), there ARE things that cannot be automated efficiently (the RT installation is a singleton), period. This is not religion, and we are not developers arguing about a new favorite tool of the month like screen or zsh.
>>
>> I think you may have missed an underlying point. I pressed/asked for the partition alignment example/code, partially to spark conversation, but also because it should be used for guest virtual machine installations too, which really do need to be automated in my environment. I wanted to demonstrate how easy it is to go down the path of religion and lose business acumen.
>
> I'm confused. You seemed fairly explicit in the requirement that such work be automated in your environment, but maybe I misread. At any rate, I think most everyone would agree that time spent at work on a project should reflect the needs and values of the business.
>
> At least, I haven't seen anyone arguing otherwise on here.

Alright, that is probably my fault because I have been a bit coy. I think in my environment it makes sense to automate the virtual guests, especially for new deployments, but not the underlying infrastructure, which is probably in a ratio of 20/1.

Here's why: my environment is highly diverse, as follows.

* 100 servers, with 88 different kinds. Biggest cluster/pool: 6
* mostly RHEL all the way back to 2.1 up to 5.4 and everything in between
* Finally removed last solaris, but could get more
* Some windows 5 servers, don't like doing it but pays bills
* Some of the apps are 10 years old and still used
* Customers won't pay to upgrade apps (50K price tag)
* Customers will pay to host (<1K/mo)

This environment sounds nothing like those of the people I hear talking about cloud. Now mind you, before I worked here, I worked for American Greetings running the largest e-card site in the world, and also one of the highest-traffic sites in the world on Valentine's Day, so I understand scale. I understand running commands across web pools of 50, 60, 150, and 1500. Amazon's EC2 basically exists because they had to meet a similar business problem. I understand deployment of highly homogeneous systems. This is like preaching to the choir.

What I have now is way, way harder to deal with because it is all one-off servers. This does not mean that web ops, devops, and automation principles aren't useful to us, but it means that we have to pick and choose what makes sense. Currently, we have laid off our other systems administrator, so I am by myself. Surely, I now have to be wise with my time.

I was posing the question in my very first post: when does it make sense to automate? And I do think people are arguing to think about "automating first". I feel a frustration similar to Bob Plankers over at the Lone Sys Admin http://lonesysadmin.net because most of the focus is on the easy problem: large homogeneous environments like Yahoo, Google, Twitter, Facebook, American Greetings, etc. At some point in my career, I will probably work for a company like that again, but right now I am trying to solve a different problem, and I really think the businesses which are on this list are missing an opportunity to help the small and medium-sized organizations with lots of legacy stuff.

>> Now you might argue there are better tools in 2010; I don't care. It's not good business sense to even look at them seriously until its life cycle ends. Whether software is free or not does not affect the business underpinnings.
>
> I think that depends on the quality (stability, performance, and more importantly: USABILITY, REPEATABILITY) of your current automation / config management infrastructure vs. what's coming out these days.

I am talking about individual apps/deployments not the infrastructure.

> Sometimes it isn't, sometimes it is. Sometimes it's better to pay newbie admins to do it.
>
>> I don't automate because of philosophy, I do it when it makes sense. On the flip side, I am not scared of it either and at times have had to help others in my team when there was something a bit difficult to automate.
>>
>> Tools like Puppet/Chef help in automation and may help slide the scale of when it makes sense to automate a little, but there will always be times that you cannot automate, always.
>
> I think the word "automate" is probably not the best to use. It doesn't reflect business value. I much prefer saying "time saving mechanism(s)."
>
>> This is especially true in small diverse environments like wild west hosting, which still has demand in 2010.
>
> Sure. That's the beauty of our jobs. You just use/do what works best for your environment.

But the small, not sexy, problems get ignored, while the large, High Scalability style problems get all of the press and focus. I think there are a lot of small sysadmins that are drowning out there because it is a huge leap to use something like puppet/chef/cfengine, but they are missing getting a baseline from kickstart or something similar. Some automation is better than no automation for repetitive tasks, but they aren't going to be able to automate their entire environment. If I were starting a new project today, I would focus on reproducibility, but if I am supporting legacy applications that might not even be possible to rebuild, I am going to pick and choose what makes sense.

Essentially, this is a devops toolchain mailing list, and I would like to focus on identifying and creating toolchains that can be useful to the small guys and medium guys as well as the large guys, and of course the large guys will write their own.

-scott

Luke Kanies

Jul 8, 2010, 12:46:34 PM

to devops-t...@googlegroups.com

On Jul 8, 2010, at 9:32 AM, Scott McCarty wrote:

[...]

> Alright, that is probably my fault because I have been a bit coy. I think in my environment it makes sense to automate the virtual guests, especially for new deployments, but not the underlying infrastructure, which is probably in a ratio of 20/1.
>
> Here's why: my environment is highly diverse, as follows.
>
> * 100 servers, with 88 different kinds. Biggest cluster/pool: 6
> * mostly RHEL all the way back to 2.1 up to 5.4 and everything in between
> * Finally removed last solaris, but could get more
> * Some windows 5 servers, don't like doing it but pays bills
> * Some of the apps are 10 years old and still used
> * Customers won't pay to upgrade apps (50K price tag)
> * Customers will pay to host (<1K/mo)
>
> This environment sounds nothing like those of the people I hear talking about cloud. Now mind you, before I worked here, I worked for American Greetings running the largest e-card site in the world, and also one of the highest-traffic sites in the world on Valentine's Day, so I understand scale. I understand running commands across web pools of 50, 60, 150, and 1500. Amazon's EC2 basically exists because they had to meet a similar business problem. I understand deployment of highly homogeneous systems. This is like preaching to the choir.
>
> What I have now is way, way harder to deal with because it is all one-off servers. This does not mean that web ops, devops, and automation principles aren't useful to us, but it means that we have to pick and choose what makes sense. Currently, we have laid off our other systems administrator, so I am by myself. Surely, I now have to be wise with my time.
>
> I was posing the question in my very first post: when does it make sense to automate? And I do think people are arguing to think about "automating first". I feel a frustration similar to Bob Plankers over at the Lone Sys Admin http://lonesysadmin.net because most of the focus is on the easy problem: large homogeneous environments like Yahoo, Google, Twitter, Facebook, American Greetings, etc. At some point in my career, I will probably work for a company like that again, but right now I am trying to solve a different problem, and I really think the businesses which are on this list are missing an opportunity to help the small and medium-sized organizations with lots of legacy stuff.

This is almost exactly like the environments that I did most of my automation in before I started working on Puppet, and my recommendation in this situation is, at least at first, to just start with the most painful bits. Automate away the things that suck to do, the things that get you paged in the middle of the night, the things that have the highest effort to reward ratio. These can usually be done with not much more effort than the work itself takes, and once automated they're just gravy from then on.

Only when the really painful bits are done should you look at automating "everything", and even then, "everything" is just "the work you actually do".

Green field is great, but it's also wicked rare.

[...]

--

You can't build a reputation on what you are going to do.
-- Henry Ford

---------------------------------------------------------------------

Luke Kanies | http://puppetlabs.com | http://madstop.com

Scott McCarty

Jul 8, 2010, 12:59:23 PM

to devops-t...@googlegroups.com

Done, I did that 2 years ago, along with redesigning our network, firewalling, and BGP infrastructure.

But the automation is home grown: because Cobbler was too much to set up, I wrote two dead-easy-to-use build/baselining tools, which are next on my list to release as open source.

Because Puppet/Chef were too much, again two years ago, I just started using svn for critical configs. It is not dead simple, but it is as close as possible, and with some baby bash scripts wrapped around it, it works pretty damn well.
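
For readers who have not seen this pattern, a minimal sketch of such a wrapper follows. It is an illustration only and not the scripts described above: it is written in Python rather than bash, the function names are hypothetical, and it assumes the directory being tracked is already an svn working copy. It relies only on the standard `svn status` and `svn commit -m` subcommands.

```python
import subprocess


def svn_status_command(path):
    """Build the command that lists uncommitted changes under path."""
    return ["svn", "status", path]


def svn_commit_command(path, message):
    """Build the command that commits changes under path with a log message."""
    return ["svn", "commit", "-m", message, path]


def commit_config(path, message, runner=subprocess.run):
    """Commit path if `svn status` reports anything; return True if a commit ran.

    Assumes path is already an svn working copy. The runner argument exists
    only so the sketch can be exercised without svn installed.
    """
    status = runner(svn_status_command(path), capture_output=True, text=True, check=True)
    if not status.stdout.strip():
        return False  # nothing changed, nothing to commit
    runner(svn_commit_command(path, message), check=True)
    return True
```

A call might look like `commit_config("/etc/httpd", "vhost change for customer site")`, where the path is an assumed example. Unversioned files also show up in `svn status` (marked `?`), so a real wrapper would `svn add` them first.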

Honestly, I am impressed with all of the network services that can be handled with Puppet now, and I do envision a day of complete virtualization where everything can be config driven, including the network, but I fear that is 10 to 15 years off, maybe longer for legacy stuff. I am 34 and hope to retire before then ;-), so until then I will focus on all kinds of pieces of the tool chain (logging and log analysis for now).

Thanks for the advice
Scott M

On Jul 8, 2010 12:46 PM, "Luke Kanies" <lu...@madstop.com> wrote:

> On Jul 8, 2010, at 9:32 AM, Scott McCarty wrote:
>
> [...]
>
>> Alright, that is probably my fault because I have been a bit coy. I think in my environment it ma...
>
> This is almost exactly like the environments that I did most of my automation in before I started working on Puppet, and my recommendation in this situation is, at least at first, to just start with the most painful bits. Automate away the things that suck to do, the things that get you paged in the middle of the night, the things that have the highest effort to reward ratio. These can usually be done with not much more effort than the work itself takes, and once automated they're just gravy from then on.
>
> Only when the really painful bits are done should you look at automating "everything", and even then, "everything" is just "the work you actually do".
>
> Green field is great, but it's also wicked rare.

Scott Smith

Jul 8, 2010, 1:11:08 PM

to devops-t...@googlegroups.com

On Jul 8, 2010, at 9:46 AM, Luke Kanies wrote:
>> I was posing the question in my very first post: when does it make sense to automate? And I do think people are arguing to think about "automating first". I feel a frustration similar to Bob Plankers over at the Lone Sys Admin http://lonesysadmin.net because most of the focus is on the easy problem: large homogeneous environments like Yahoo, Google, Twitter, Facebook, American Greetings, etc. At some point in my career, I will probably work for a company like that again, but right now I am trying to solve a different problem, and I really think the businesses which are on this list are missing an opportunity to help the small and medium-sized organizations with lots of legacy stuff.
>
> This is almost exactly like the environments that I did most of my automation in before I started working on Puppet, and my recommendation in this situation is, at least at first, to just start with the most painful bits. Automate away the things that suck to do, the things that get you paged in the middle of the night, the things that have the highest effort to reward ratio. These can usually be done with not much more effort than the work itself takes, and once automated they're just gravy from then on.
>
> Only when the really painful bits are done should you look at automating "everything", and even then, "everything" is just "the work you actually do".
>
> Green field is great, but it's also wicked rare.

+1 what Luke said.

Also... Scott, I'm not sure what makes you think large networks are easier ;) In my experience, they're just as hard due to complexity of services. Do you have experience otherwise?

-scott

Scott McCarty

Jul 8, 2010, 1:14:13 PM

to devops-t...@googlegroups.com

When I was at American Greetings, we had 6-7 systems administrators with a similar number of kinds of servers, let's say 80ish

Scott McCarty

Jul 8, 2010, 1:19:03 PM

to devops-t...@googlegroups.com

Oops, hit send on my phone before I was finished. Similar diversity, with the ease of managing the bigger pools, made automation a no-brainer cost-benefit analysis, but no, my job was not any easier ;-)

Currently, I have similar diversity but way smaller numbers, which honestly DOES cloud (no pun intended) the cost-benefit analysis, and if I mess up, it hurts a lot more at a small company. I just don't have time to make bad decisions; they can have huge impacts on the business, like going out of business.

Thanks everyone for the feedback.

Scott M

Scott Smith

Jul 8, 2010, 1:30:50 PM

to devops-t...@googlegroups.com

Ah, cool :)

Damon Edwards

Jul 9, 2010, 4:57:21 PM

to devops-t...@googlegroups.com

I just want to point out that there is another angle to the "cost of automating" that isn't getting much attention on this thread.

Most of the discussion has been about looking at the cost issue from a single experienced administrator's point of view. (i.e. How long would it take ME to automate this vs. how long would it take ME to continue to do it manually/semi-manually?)

You have to consider all of the costs and risks associated with not automating from the perspective of the business. For example...

-Disaster recovery? All of those "one off" servers are necessary for running some part of your business (if not, then you have a different problem and need to start shutting things off). Automation, backed up offsite, is the shortest path to recovery (even if you aren't using a fully abstracted framework and have to tweak your scripts for a new datacenter).

-Resource utilization? Should you even be the manual cog in the deployment chain or should you be the automation designer and let a lower cost or less critical person run it and maintain it?

-What if anyone gets hit by a bus? Automation is documentation in code form. It either works or it doesn't... no "oops I didn't write that part down" surprises for the new guy.

-Throughput? Is there business value in being able to turn the crank faster and having a shorter Dev -> QA -> Prod lifecycle?

-How do you scale up or down? No organization is static. How do you bring on new people to allow your more experienced people to focus on value adding tasks rather than deployment? If times get tough how do you reduce headcount or reassign people if you aren't fully leveraging automation? Scale up or scale down points are usually make or break points for businesses.

-Standardization? What's the current cost of not standardizing on one way to build machines and deploy applications across all environments (dev, qa, prod)? Can you even manage a separate QA environment without it being automated (or is QA "performed" on developers' boxes and on live production systems)? What's the long term cost of having multiple ways of doing things?

-Compliance? If your business has compliance issues, how do you provide auditability without pushing all work on production systems through an abstraction like an automation framework?

-Performance measurement? If everything is a one off, how do you know what is consuming too much time (apples to oranges to bananas makes metrics difficult)? What actions do/don't add value to the business?

-Switch providers? If it makes business sense to move to the cloud or move to a different datacenter... would it even be possible to do so with so many one off or manual efforts? (for example we have a client that is going to save millions of dollars by moving several dozen machines to a different country for tax purposes)

And specific to using an open source framework to do your automation...

-Hiring? Do I have a better chance of finding someone and bringing that new hire up to speed if things are automated using an off the shelf framework vs. fully custom scripting or not automated at all?

-Can I leverage the work of others? Can I pull in knowledge and updates (in the form of reusable automation code) from other organizations or do I have to compile and maintain it all myself?

These are all concerns that are usually above the normal day-to-day concerns of a systems administrator. However, to me DevOps is all about elevating our individual game so we see how our actions/choices impact others and the success of the business.

So next time you catch yourself thinking "It's just an XXXXX. It's a one off and not worth automating", try looking at it from the macro perspective.

Damon Edwards

[ DTO Solutions, Inc. | 1840 Gateway Drive - Suite #200, San Mateo CA 94404 | o: 1.650.292.9660 x705 | m: 1.415.830.5856 | f: 1.415.358.8435 ]

Scott McCarty

Jul 9, 2010, 5:32:52 PM

to devops-t...@googlegroups.com

No seriously, I don't want to come off sounding mean here, but I disagree strongly with much of this logic. Automation does not alleviate many of the risks that you bring up. In fact, it often makes the risk greater.

- Scripts still break, and when they do they often break horribly in ways that make the problem way harder to find. I don't care what abstraction you're using, it is an abstraction and often hides the root cause of problems.

- Scripts/Automation do NOT, NOT, NOT solve the problem of me getting hit by a bus. It is often much harder to dig into complex code, especially when you don't understand it. This is just a different set of risks.

- It can be much harder to hire someone who will be able to hit the ground running in an automated environment. For 55K you can hire somebody to solve most problems in my metro area. A person that can do "devops" can easily demand 80K-110K.

- When you need to scale quickly, automation will not always help you do it quicker. I have brought this example up before, but I have a Red Hat cluster build tool that I created which is very complex and was my first attempt at such a complex build tool. We had a deadline where I needed to build a cluster and I should have had it done in 20 minutes, but some infrastructure had changed since the last time I had used it, causing about a day and a half of delay. Luckily, I started about three days before the project needed to be completed, but the automation didn't help in building the system in time, that is for sure. I would also argue that it wasn't measurably more consistent either. We have clusters that were built by hand and by automation, and they are still flaky, finicky muckery.

I would argue that pieces of automation will help, kickstart for example, but not complete automation. Complete automation is a pipe dream; there are always human inputs. There are definitely times when it is inefficient to automate. When Toyota has a recall, they don't tool up a new factory and fix the accelerator pedals on Lexuses, running each one through the line; that would be ridiculous. Instead they manually fix each one at a dealership because it is cheaper. Guess what happens: some of them are still messed up, so they bring them back a third or fourth time to a dealership and let a mechanic fix them, eventually getting it right. No car company or manufacturer does it the other way. I rest my case.

On Fri, Jul 9, 2010 at 4:57 PM, Damon Edwards <da...@dtosolutions.com> wrote:

> I just want to point out that there is another angle to the "cost of automating" that isn't getting much attention on this thread.
>
> Most of the discussion has been about looking at the cost issue from a single experienced administrator's point of view. (i.e. How long would it take ME to automate this vs. how long would it take ME to continue to do it manually/semi-manually?)
>
> You have to consider all of the costs and risks associated with not automating from the perspective of the business. For example...
>
> -Disaster recovery? All of those "one off" servers are necessary for running some part of your business (if not, then you have a different problem and need to start shutting things off). Automation, backed up offsite, is the shortest path to recovery (even if you aren't using a fully abstracted framework and have to tweak your scripts for a new datacenter).

First, let's define business as anything that makes me money. If I make five dollars a month off of a server, it is still going to be there and I am not going to shut it off. You know why? Because the guy that owns it or colocates it with me might have a 50K project coming up in 4 weeks and he might up his hosting. These are internal sales opportunities; this is not Google or Yahoo. On the flip side, the guy that owns it doesn't expect DR for something that is so low cost.

Second, Windows servers, which are all built by hand and backed up with Backup Exec and bare metal restore, are very quick to restore, probably just as quick as anything out there, so I don't understand the argument. DR and automation are two separate things in this context.

> -Resource utilization? Should you even be the manual cog in the deployment chain or should you be the automation designer and let a lower cost or less critical person run it and maintain it?

Agreed, I brought this up.

> -What if anyone gets hit by a bus? Automation is documentation in code form. It either works or it doesn't... no "oops I didn't write that part down" surprises for the new guy.

Addressed in opening paragraph

> -Throughput? Is there business value in being able to turn the crank faster and having a shorter Dev -> QA -> Prod lifecycle?

Honestly, the operations side is rarely the bottleneck in any organization that I have ever worked for. Usually, dev runs way over budget and over time, while ops has the new servers in place months before go live.

> -How do you scale up or down? No organization is static. How do you bring on new people to allow your more experienced people to focus on value adding tasks rather than deployment? If times get tough how do you reduce headcount or reassign people if you aren't fully leveraging automation? Scale up or scale down points are usually make or break points for businesses.

As the question before.

> -Standardization? What's the current cost of not standardizing on one way to build machines and deploy applications across all environments (dev, qa, prod)? Can you even manage a separate QA environment without it being automated (or is QA "performed" on developers' boxes and on live production systems)? What's the long term cost of having multiple ways of doing things?

Very complex question. Standardization dates from circa 1910 Ford plants onward; everyone attempts this. Having something documented in the wiki and tracked with a ticket system is, imo, standardization. I do agree with baselining (standardizing) all new builds.

> -Compliance? If your business has compliance issues, how do you provide auditability without pushing all work on production systems through an abstraction like an automation framework?

As above, a wiki/ticket system solves this.

> -Performance measurement? If everything is a one off, how do you know what is consuming too much time (apples to oranges to bananas makes metrics difficult)? What actions do/don't add value to the business?

We are not talking about Google/Yahoo again; we are talking about small $100/$300 a month sites. They do not expect Google-level performance tuning, though, for the point of this argument, I have some of the most complex performance monitoring in place, often much more so than bigger environments, which usually just sample members of the web pools because it is too much data to acquire from all nodes.

> -Switch providers? If it makes business sense to move to the cloud or move to a different datacenter... would it even be possible to do so with so many one off or manual efforts? (for example we have a client that is going to save millions of dollars by moving several dozen machines to a different country for tax purposes)

No, it would not, but we are still in business, so clearly we fill a niche as a local service provider. I am paid above average for Akron, OH, and I am sure we are not unique.

> And specific to using an open source framework to do your automation...
>
> -Hiring? Do I have a better chance of finding someone and bringing that new hire up to speed if things are automated using an off the shelf framework vs. fully custom scripting or not automated at all?

Off the shelf is way better. No automation is even better; hate to say it, but most people don't understand it and can't do it very well. It is akin to writing your password down if it is too complex. I have seen people do all kinds of strange things to avoid using the automated system, especially when they don't understand it. Also, in Silicon Valley everyone knows this stuff; in smaller markets, you may have to hire a consultant at a fortune if your one automation specialist leaves. In fact, I have been privy to conversations where managers have planned what to do if the "automation" guy leaves. It usually involves ripping it out and starting over.

> -Can I leverage the work of others? Can I pull in knowledge and updates (in the form of reusable automation code) from other organizations or do I have to compile and maintain it all myself?

Of course, I would rather use Puppet with all of those modules. This is not an argument about open source vs. home grown; in my opinion, it is almost always better to start with something open source.

> These are all concerns that are usually above the normal day-to-day concerns of a systems administrator. However, to me DevOps is all about elevating our individual game so we see how our actions/choices impact others and the success of the business.

Yeah, I have thought about these problems a million times. I AM the operations at our small 6 person company. I am IT Manager, CTO, Systems Administrator, Systems Analyst, and Engineer all in one, and that is not uncommon in small environments; actually, it is part of the magic of small places.

Luke Kanies

Jul 9, 2010, 5:47:40 PM

to devops-t...@googlegroups.com

On Jul 9, 2010, at 2:32 PM, Scott McCarty wrote:

> No seriously, I don't want to come off sounding mean here, but I disagree strongly with much of this logic. Automation does not alleviate many of the risks that you bring up. In fact, it often makes the risk greater.
>
> - Scripts still break, and when they do they often break horribly in ways that make the problem way harder to find. I don't care what abstraction you're using, it is an abstraction and often hides the root cause of problems.

Meh. I'm pretty happy not knowing that my hard drive is actually multiple platters with data stored concentrically in cylinders on each side. That abstraction hasn't yet hurt me. I have no idea how many registers my processors have, nor how many chips my RAM is broken into - it just looks like 2gb of linear storage to me.

Obviously abstractions can be bad, but believe it or not there are just tons of abstractions that you don't even notice every day, and at some point in time, someone cried that there was no way they could survive if that critical detail was abstracted. In the switch to high level languages from assembly, programmers were convinced it wasn't possible to write usable software unless you managed every register manually, along with the heads of the hard drive, and just about everything else.

So yeah, we could have bad abstractions, we could provide poor debugging interfaces, we could provide poor logging, we could catch your computers on fire whenever there's an exception. But good software will fail well and provide transparency through the abstraction when you have problems or questions. Puppet will tell you every single command it ever runs, if you run it in debug mode.

In fact, the mark of a good abstraction vs. a crap one is that it provides transparency - it hides detail when you don't need it, but provides access to it when you need it. The goal isn't to forbid you from dealing with detail, but rather to just not require that you deal with it.

> - Scripts/Automation do NOT, NOT, NOT solve the problem of if I get hit by a bus; it is often much harder to dig into complex code, especially when you don't understand it. This is just a different set of risks.

Having had a coworker -- the sole owner of storage and month-end processing -- die in his sleep, I can't agree less. Trust me, it's far harder to extract information from a dead person's brain than it is from code, no matter how poorly written.

> - It can be much harder to hire someone who will be able to hit the ground running in an automated environment. For 55K you can hire somebody to solve most problems in my metro area. A person that can do "devops" can easily demand 80K-110K.

Sure, you just have to hire five of them and a team lead. People who get more done in less time always get paid more, and it's not because they have words like devops on their resumes, it's because they get more done in less time. It's up to your company to decide whether they'll pay that premium.

> - When you need to scale quickly automation will always help you do it quicker. I have brought this example up before, but I have a Red Hat Cluster build tool that I created which is very complex and was my first attempt at such a complex build tool. We had a deadline where I needed to build a cluster and I should have had it done in 20 minutes, but some infrastructure had changed since the last time I had used it, causing about a day and a half delay. Luckily, I started about three days before the project needed to be completed, but the automation didn't help in building the system in time, that is for sure. I would also argue that it wasn't measurably more consistent either. We have clusters that were built by hand and by automation and they are still flaky finicky muckery.

That seems like a success story to me. It sounds like it was easier to fix your broken automation than it would have been to build the cluster manually, right?

And really, the problem with your cluster automation is that it wasn't declarative - if you'd built it in Puppet, you would have been running it constantly, and you would have been changing it as your infrastructure changed. Then your old clusters and new clusters would be built the same.
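
The difference between a one-shot build script and a declarative tool that runs constantly can be sketched in a few lines. This is a toy illustration, not Puppet itself: the `plan` and `converge` functions and the cluster settings are hypothetical names invented for the example. The point it tries to show is that a declarative run compares desired state with actual state, so repeated runs are cheap and drift is corrected as soon as it appears.

```python
"""Toy illustration of declarative convergence (this is not Puppet)."""


def plan(desired, actual):
    """Return the actions needed to move `actual` to `desired`."""
    actions = []
    for name, value in sorted(desired.items()):
        if actual.get(name) != value:
            actions.append(("set", name, value))
    return actions


def converge(desired, actual):
    """Apply the plan in place and return the actions that were taken."""
    actions = plan(desired, actual)
    for _, name, value in actions:
        actual[name] = value
    return actions


# Hypothetical cluster settings, invented for this example.
desired = {"drbd_protocol": "C", "fence_agent": "ipmilan"}
node = {"drbd_protocol": "A"}

first = converge(desired, node)   # a drifted node gets corrected
second = converge(desired, node)  # a converged node needs no changes
print(first)
print(second)
```

Because the second run produces no actions, running it on a schedule costs little, and a change in the infrastructure shows up as a failed or non-empty run soon after it happens rather than months later at build time.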

> I would argue that pieces of automation will help, kickstart for example, but not complete automation. Complete automation is a pipe dream; there are always human inputs. There are definitely times when it is inefficient to automate. When Toyota has a recall, they don't tool up a new factory and fix the accelerator pedals on Lexuses by running each one through the line; that would be ridiculous. Instead they manually fix each one at a dealership because it is cheaper. Guess what happens: some of them are messed up still, so they bring them back a third or fourth time to a dealership and let a mechanic fix them, eventually getting it right. No car company or manufacturer does it the other way. I rest my case.

No argument that not everything can be automated, but is that really a reason to just give up on any automation ever?

--

Brand's Asymmetry:
The past can only be known, not changed. The future can only be
changed, not known.

Vladimir Vuksan

Jul 9, 2010, 5:54:32 PM

to devops-t...@googlegroups.com

I agree with most of the points in principle; however, what they leave out is the ongoing cost of automating and the relative value of data/services. One of the consequences of "infrastructure as code" is that, as with regular code, infrastructure code needs to be maintained, updated, tested, etc. This is something where I have often had disagreements with product management/sales in relation to implementing "simple/one-off" features. They may be simple; however, they are a serious long-term drain.

Secondly, some data/services are not as valuable as others. The question is really "can my business survive an extended outage of X?" For most of the one-off services the answer is yes. In the end it comes down to priorities, resources and communication, and what tradeoffs people are willing to make. That said, it is good to strive :-).

Vladimir

Scott McCarty

Jul 9, 2010, 5:55:03 PM

to devops-t...@googlegroups.com

After this comment I am done, this is getting ridiculous.

On Fri, Jul 9, 2010 at 5:47 PM, Luke Kanies <lu...@madstop.com> wrote:

> On Jul 9, 2010, at 2:32 PM, Scott McCarty wrote:
>
>> No seriously, I don't want to come off sounding mean here, but I disagree strongly with much of this logic. Automation does not alleviate many of the risks that you bring up. In fact, it often makes the risk greater.
>>
>> - Scripts still break, and when they do they often break horribly in ways that make the problem way harder to find. I don't care what abstraction you're using, it is an abstraction and often hides the root cause of problems.
>
> Meh. I'm pretty happy not knowing that my hard drive is actually multiple platters with data stored concentrically in cylinders on each side. That abstraction hasn't yet hurt me. I have no idea how many registers my processors have, nor how many chips my RAM is broken into - it just looks like 2 GB of linear storage to me.

Correct, our customers pay us to have this kind of abstraction for their application, just like you pay your hardware vendor to have this kind of abstraction with the hardware. The guys at Intel, Seagate, etc, still care about the pieces parts. This is a very bad analogy.

> Obviously abstractions can be bad, but believe it or not there are just tons of abstractions that you don't even notice every day, and at some point in time, someone cried that there was no way they could survive if that critical detail was abstracted. In the switch to high level languages from assembly, programmers were convinced it wasn't possible to write usable software unless you managed every register manually, along with the heads of the hard drive, and just about everything else.
>
> So yeah, we could have bad abstractions, we could provide poor debugging interfaces, we could provide poor logging, we could catch your computers on fire whenever there's an exception. But good software will fail well and provide transparency through the abstraction when you have problems or questions. Puppet will tell you every single command it ever runs, if you run it in debug mode.
>
> In fact, the mark of a good abstraction vs. a crap one is that it provides transparency - it hides detail when you don't need it, but provides access to it when you need it. The goal isn't to forbid you from dealing with detail, but rather to just not require that you deal with it.

>> - Scripts/Automation do NOT, NOT, NOT solve the problem of if I get hit by a bus; it is often much harder to dig into complex code, especially when you don't understand it. This is just a different set of risks.
>
> Having had a coworker -- the sole owner of storage and month-end processing -- die in his sleep, I can't agree less. Trust me, it's far harder to extract information from a dead person's brain than it is from code, no matter how poorly written.

I didn't say no documentation, I said wiki docs vs. code. Bad docs are bad documentation, and people that don't document aren't going to code any better. Maybe they will just go be trash men, I don't know.

>> - It can be much harder to hire someone who will be able to hit the ground running in an automated environment. For 55K you can hire somebody to solve most problems in my metro area. A person that can do "devops" can easily demand 80K-110K.
>
> Sure, you just have to hire five of them and a team lead. People who get more done in less time always get paid more, and it's not because they have words like devops on their resumes, it's because they get more done in less time. It's up to your company to decide whether they'll pay that premium.

They are also simply not available in every market.

>> - When you need to scale quickly automation will always help you do it quicker. I have brought this example up before, but I have a Red Hat Cluster build tool that I created which is very complex and was my first attempt at such a complex build tool. We had a deadline where I needed to build a cluster and I should have had it done in 20 minutes, but some infrastructure had changed since the last time I had used it, causing about a day and a half delay. Luckily, I started about three days before the project needed to be completed, but the automation didn't help in building the system in time, that is for sure. I would also argue that it wasn't measurably more consistent either. We have clusters that were built by hand and by automation and they are still flaky finicky muckery.
>
> That seems like a success story to me. It sounds like it was easier to fix your broken automation than it would have been to build the cluster manually, right?
>
> And really, the problem with your cluster automation is that it wasn't declarative - if you'd built it in Puppet, you would have been running it constantly, and you would have been changing it as your infrastructure changed. Then your old clusters and new clusters would be built the same.

#fail Incorrect, we build these things like two times a year. I am not talking about OS-level stuff, all of that ran beautifully; I am talking about the clustering pieces parts.

>> I would argue that pieces of automation will help, kickstart for example, but not complete automation. Complete automation is a pipe dream; there are always human inputs. There are definitely times when it is inefficient to automate. When Toyota has a recall, they don't tool up a new factory and fix the accelerator pedals on Lexuses by running each one through the line; that would be ridiculous. Instead they manually fix each one at a dealership because it is cheaper. Guess what happens: some of them are messed up still, so they bring them back a third or fourth time to a dealership and let a mechanic fix them, eventually getting it right. No car company or manufacturer does it the other way. I rest my case.
>
> No argument that not everything can be automated, but is that really a reason to just give up on any automation ever?

Did I ever say that? Did anyone here say that? This is ridiculous. Dude, this is not a religion. Again, there are things that are better not to automate, there are things that are better to automate. I am using cost and risk to justify "better/worse"

Damon Edwards

Jul 9, 2010, 6:07:01 PM

to devops-t...@googlegroups.com

Good point about sales / pm not understanding the full cost of a feature. I've found that it's critical to track all of those "non-functional requirements" in whatever requirements tracking/ticketing system/project management combo that you use. Until those non-functional requirements show up on a project estimate/plan as equal citizens to the functional requirements and everyone understands the full cost of adding that feature, the non-technical (and sometimes even development) parts of the organization will pretend or assume that those non-functional requirements and "surprise" costs don't exist.

It's one of the more common struggles in bridging an agile development group with its corresponding operations group (yes, the siloization in the first place is a whole different topic). The dev and biz folks roll happily along committing to how much work they can get done in a sprint... and forget to factor in what operations can commit to within that sprint. End of the sprint comes and Ops gets the blame for being the bottleneck. Fingerpointyness ensues.

Damon Edwards

Jul 9, 2010, 6:23:43 PM

to devops-t...@googlegroups.com

Same team! Same team! :)

We are all on the same team here! Let's not lose that DevOps groovy vibe. It's just my read on the thread but I don't think anyone is preaching religion. Just different experiences pointing out different points of view.

Scott, I'd ask you to think about one last thing before you go. Here is the ultimate business evaluation...

Assume another company popped up next door to yours tomorrow. Maybe they are a startup or maybe they are a bigger player taking notice of what you do. They have maximized their automation to the fullest extent possible and all other parts of your businesses (sales, marketing, product design, etc.) are equal. Could you compete? Would you be at a disadvantage? If so, then the choices you are making are calculated risks that you and the rest of your company have to decide on. If not, then carry on and do the smallest amount of work that gets you to the pub by 5. :)

Luke Kanies

Jul 9, 2010, 7:06:36 PM

to devops-t...@googlegroups.com

On Jul 9, 2010, at 3:23 PM, Damon Edwards wrote:

> Same team! Same team! :)
>
> We are all on the same team here! Let's not lose that DevOps groovy vibe. It's just my read on the thread but I don't think anyone is preaching religion. Just different experiences pointing out different points of view.

Yeah, I didn't mean to come off as preaching - just trying to provide my side of the argument. Sorry if I came off that way.

> Scott, I'd ask you to think about one last thing before you go. Here is the ultimate business evaluation...
>
> Assume another company popped up next door to yours tomorrow. Maybe they are a startup or maybe they are a bigger player taking notice of what you do. They have maximized their automation to the fullest extent possible and all other parts of your businesses (sales, marketing, product design, etc.) are equal. Could you compete? Would you be at a disadvantage? If so, then the choices you are making are calculated risks that you and the rest of your company have to decide on. If not, then carry on and do the smallest amount of work that gets you to the pub by 5. :)

Indeed and exactly. (That's a Jorge Castro quote, btw - the 'pub by 5' thing.)

--

The chief lesson I have learned in a long life is that the only way to
make a man trustworthy is to trust him; and the surest way to make him
untrustworthy is to distrust him and show your distrust.
-- Henry L. Stimson

Damon Edwards

Jul 9, 2010, 7:14:54 PM

to devops-t...@googlegroups.com

Funny, I've been telling people it was a Luke Kanies quote... now I know!

Jacob Rosenberg

Jul 9, 2010, 7:21:17 PM

to devops-t...@googlegroups.com

> Assume another company popped up next door to yours tomorrow. Maybe they are
> a startup or maybe they are a bigger player taking notice of what you do.
> They have maximized their automation to the fullest extent possible and all
> other parts of your businesses (sales, marketing, product design, etc.) are
> equal. Could you compete? Would you be at a disadvantage? If so, then the
> choices you are making are calculated risks that you and the rest of your
> company have to decide on.

The extreme opposite argument to much of the dev-ops comes from those
who would scoff at a $500k a year team of dev-ops people, and instead
hire a 25-to-30 head team somewhere offshore to work 3 shifts of doing
grunt work. While this isn't infinitely scalable, reconfiguring three
dozen employees to follow a new policy can be much quicker than
re-writing automation tools. This isn't necessarily a practical choice
for many options, but the "meat cloud" is a surprisingly competitive
alternative in everything but the most extreme case of scale.

Personally, I find the DevOps vibe a lot more interesting and fun, but
the world is full of diverse solutions to problems... and it's good to
know what people are going to be benchmarking you against.

Luke Kanies

Jul 9, 2010, 7:28:08 PM

to devops-t...@googlegroups.com

On Jul 9, 2010, at 4:21 PM, Jacob Rosenberg wrote:

>> Assume another company popped up next door to yours tomorrow. Maybe they
>> are a startup or maybe they are a bigger player taking notice of what you
>> do. They have maximized their automation to the fullest extent possible
>> and all other parts of your businesses (sales, marketing, product design,
>> etc.) are equal. Could you compete? Would you be at a disadvantage? If so,
>> then the choices you are making are calculated risks that you and the rest
>> of your company have to decide on.
>
> The extreme opposite argument to much of the dev-ops comes from those
> who would scoff at a $500k a year team of dev-ops people, and instead
> hire a 25-to-30 head team somewhere offshore to work 3 shifts of doing
> grunt work. While this isn't infinitely scalable, reconfiguring three
> dozen employees to follow a new policy can be much quicker than
> re-writing automation tools. This isn't necessarily a practical choice
> for many options, but the "meat cloud" is a surprisingly competitive
> alternative in everything but the most extreme case of scale.

This story plays out in the dev world, though, too, right? And
Mythical Man Month hit this right on - you just can't scale people up,
because the communication costs pretty quickly overwhelm the benefits
of additional people.

I work with a lot of those companies that just hire people instead of
tooling up, and I wouldn't describe what they do as successful - in
fact, that's a lot of why they call us in for help.

> Personally, I find the DevOps vibe a lot more interesting and fun, but
> the world is full of diverse solutions to problems... and it's good to
> know what people are going to be benchmarking you against.

Indeed. Darwin ftw!

--
I worry that the person who thought up Muzak may be thinking up
something else. -- Lily Tomlin

Mark J. Reed

Jul 9, 2010, 8:50:02 PM

to devops-t...@googlegroups.com

I wanted to reply to one of Scott's points, even though he dropped out
of the thread.

On Fri, Jul 9, 2010 at 5:55 PM, Scott McCarty <scott....@gmail.com> wrote:
> I didn't say no documentation, I said wiki docs vs. code. Bad docs is bad
> documentation and people that don't document aren't going to code any
> better. Maybe they will just go be trash men, I don't know.

Code is doc, but it's active doc - and compared with passive doc, it's
more likely to be tested, which is key. If you have an automated
configuration management system, you have no choice but to run the
code. The computer can't cheat by adding steps that aren't there, and
if it doesn't work, you'll notice. So even if it's "bad" code, it's
*working* code. Which is more than you may be able to say about your
passive documentation; it can lie dormant until it's needed because
the guy who wrote it is suddenly unavailable, and then you're in
trouble.

You can try to mitigate this problem by exercising the doc, as if it
were code - have someone, ideally a new person, run the procedure
every time, following the checklist exactly with nobody "helping".
But with humans instead of computers, you really have to nail down the
discipline. It's too easy for little extra "trivial" steps to get run
in between the formal steps of the procedure (and not even added
later).

If you have good discipline around creating and exercising
documentation, that's great - but it also means you'll seriously kick
ass at automation. :)

--
Mark J. Reed <mark...@gmail.com>

Scott McCarty

Jul 9, 2010, 9:24:43 PM

to devops-t...@googlegroups.com

Agreed, same team. I was getting fired up earlier, but it's because not one person on here has come out and said that there are times when it is less efficient to automate.

Now here is the funny part. In general (computers and manufacturing), automation helps you be really good at one thing but really bad at doing a bunch of unknown things, as in my recall example. If you automate too much you lose fundamental business agility. I am not talking about features, I am talking about new markets.

For example, we don't go after Windows/MS work. It just isn't efficient for my organization to undertake. All of our automation/scripting/knowledge is in Linux, so we turn that work away. Specialization, including automation, saps your agility but allows you to do one thing really well.

On your point about the second company popping up, I really like this point and it is honestly one of the only genuinely interesting/original points brought up.

To answer your question, I believe at my company, we have currently automated everything that is possible to automate. My company is 15 years old and always had two systems administrators. Currently our web operations is bigger than it has ever been and I am now running it, by myself, and not breaking a sweat. I mean, hell, I have enough time to contribute to this list ;-)

From another angle though, I don't think guys that automate first would ever go after a business niche like the one we go after; they are just not wired that way, so really I don't think they could compete. I bring this up only because it is interesting. Second, our owner/sales weasel is one of the best I have ever seen. He is really good at selling to people that need this kind of niche. Finally, I think a fully automated environment is probably more attractive work, which produces an intangible value for the employees that love this kind of work.

If we automate any more than we do now, given the current tools that exist, we would start to lose money and narrow the focus of what work we can undertake. Trust me, it is already a struggle in every organization with sales guys. They always want to sell on the edges of what the organization can do. It's a balance.

Oh and btw, yes, they pay me more than any sysadmin that has ever worked for them, but I bring a lot more than automation capability to our org. I do MRC/NRC calculations, product engineering/advice, etc, etc, etc.

Time for the pub

Scott M

Scott McCarty

Jul 9, 2010, 9:35:08 PM

to devops-t...@googlegroups.com

I want to address why this is faulty logic. Our wiki docs are NOT passive. The only stuff in my org that isn't automated is either run so rarely and so complex that automating it doesn't make sense, or very, very old and legacy.

We aren't building legacy stuff. The stuff that we do so rarely, or that is so complex, that I don't automate it, I can't remember how to do (as a side note, I have to do so much stuff I can barely remember how to do anything). In these scenarios, the wiki docs are by no means passive; they are the only way to get it done without a lot of suffering to figure it out again. This creates natural pain-based sanctions, thereby making the wiki authoritative.

Tangentially, I think this is how you know you have enough automation. If your docs are passive, you had better keep automating; when they stop being passive and you're spending more time fixing automation than building new infrastructure, you might have too much. It really is a bit of an art.

So, in short, I think your argument has been made since 1980 and is at this point a straw man. Everyone knows too much process sucks, and I probably err on the side of a hair too much automation myself: 1. Because it is more fun. 2. Because I think it is a business advantage (for my own career) to be on the cutting edge of what is possible with technology and engineering.

Alright, you guys turned me into a liar about dropping out, oh well, I love this stuff ;-)

Scott M

Barry Allard

Jul 9, 2010, 10:04:05 PM

to devops-t...@googlegroups.com

Luke nailed the mark precisely.

To paraphrase the proto-agile title 'Rapid Development,' a follow-up to another great text, 'Code Complete': "Adding people late in a project is like adding gasoline to a fire."

Say you're the technical lead of a project when the dreaded fail whale beaches itself at your door, and on the Friday you are en route to your honeymoon in the Bahamas. Being the only one who can fix the email provisioning issue, you soon realize that recalling the former network provisioning lead will also be necessary to share the joy.

Food for Thought:
1a) Is there enough trust with management to rapidly tap the right people?
1b) Are there sufficient boring-but-important exercises to keep response and collaboration skills up?
2a) For projects, are managers generally able to partition responsibility rather than everyone-does-everything Tragedy of the Commons?
2b) Also, can they promote just the best ideas and spread how to hit <delete> on most Hindenburg skycars before going IPO?

To all a good weekend.

Barry Allard

Scott Smith

Jul 9, 2010, 10:13:34 PM

to devops-t...@googlegroups.com

On Jul 9, 2010, at 6:35 PM, Scott McCarty wrote:

> I want to address why this is faulty logic. Our wiki docs are NOT passive. The only stuff in my org that isn't automated is either run so rarely and so complex that automating it doesn't make sense, or very, very old and legacy.
>
> We aren't building legacy stuff. The stuff that we do so rarely, or that is so complex, that I don't automate it, I can't remember how to do (as a side note, I have to do so much stuff I can barely remember how to do anything). In these scenarios, the wiki docs are by no means passive; they are the only way to get it done without a lot of suffering to figure it out again. This creates natural pain-based sanctions, thereby making the wiki authoritative.
>
> Tangentially, I think this is how you know you have enough automation. If your docs are passive, you had better keep automating; when they stop being passive and you're spending more time fixing automation than building new infrastructure, you might have too much. It really is a bit of an art.
>
> So, in short, I think your argument has been made since 1980 and is at this point a straw man. Everyone knows too much process sucks, and I probably err on the side of a hair too much automation myself: 1. Because it is more fun. 2. Because I think it is a business advantage (for my own career) to be on the cutting edge of what is possible with technology and engineering.

Unfortunately, as great as wikis are, they easily become "passive" and very stale in large organizations. It takes whole teams of people to curate wikis and prevent stagnation.

Once you hit about ten people maintaining docs lasting years, the size of your department's wiki easily grows to thousands of pages. At that point duplicated work can be a serious problem. Not to mention old documentation that is outdated.

This requires a lot of work in large orgs. When faced with spending money on personnel to curate and maintain or self-document through code and automated tasks, which are taken care of inline with the Sysadmin's work itself, the choice is pretty clear.

-scott

Scott McCarty

Jul 9, 2010, 10:24:45 PM

to devops-t...@googlegroups.com

The exact same thing is true of operations code and automation. Exact same problem: unused code is duplicated, etc, etc. Ever look at a piece of code you didn't like and get that urge to write it from scratch, or maybe you don't understand it all the way? Either way, code gets duplicated/outdated (broken because of infrastructure changes) just like docs, and maybe more so!

I still hold that there will be stuff that should be in the wiki, not automated, and it will be so complex/rarely used that it will by its very nature be authoritative. This is healthy and efficient!

This is another strawman, sorry

Scott M

Scott Smith

Jul 9, 2010, 10:29:24 PM

to devops-t...@googlegroups.com

On Jul 9, 2010, at 7:24 PM, Scott McCarty wrote:

> The exact same thing is true of operations code and automation. Exact same problem: unused code is duplicated, etc, etc. Ever look at a piece of code you didn't like and get that urge to write it from scratch, or maybe you don't understand it all the way? Either way, code gets duplicated/outdated (broken because of infrastructure changes) just like docs, and maybe more so!

My real life experience says otherwise: If it's unused then it's not doing anything to your system. If it's used and unmodified it's working as intended.

If it's NOT working as intended, you've got other fish to fry.

-scott

Mark J. Reed

Jul 10, 2010, 9:14:13 AM

to devops-t...@googlegroups.com

Wiki docs are intrinsically "passive"; calling them so wasn't a value
judgement. Code is "active" documentation because it can be run; it
does things on its own once triggered. Anything that requires a human
to read and interpret and take action on is passive documentation.
Whether it's well-maintained, etc. or not.

I'm also not arguing that everything can or should be automated; you
definitely have to look at the cost/benefit (Tom Limoncelli had some
good guidelines at LISA last year). I'm saying that automation has
inherent advantages over plain documentation: other things being
equal, automation wins. The fact that other things often *aren't*
equal absolutely has to go into your decision, but it doesn't affect
my point.

In particular, "good enough" automation can be far more useful
day-to-day than great documentation. Of course, you also have to be
careful - bad automation can wreak havoc across your infrastructure
much faster than people manually following terrible documentation.
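
The cost/benefit question can be put into numbers with the figures from the opening post of this thread: roughly 160 hours of first-year automation effort against 4 to 6 hours per manual build. The sketch below is only arithmetic under stated assumptions; the function name `break_even_builds` is made up for the example, and it assumes by default that an automated build costs no operator time, which is optimistic.

```python
import math


def break_even_builds(automation_hours, manual_hours_per_build,
                      automated_hours_per_build=0.0):
    """Number of builds needed before automation pays for itself.

    Assumption: every build saves the same number of hours, and
    maintenance is already folded into `automation_hours`.
    """
    saved = manual_hours_per_build - automated_hours_per_build
    if saved <= 0:
        raise ValueError("automation saves no time per build")
    return math.ceil(automation_hours / saved)


# Figures from the opening post: about 160 hours to automate and
# maintain in year one, 4 to 6 hours per manual build.
slow_manual = break_even_builds(160, 6)
fast_manual = break_even_builds(160, 4)
print(slow_manual, fast_manual)  # 27 40
```

Doubling or quadrupling `automation_hours`, as the opening post suggests for a first attempt, scales the break-even count by the same factor, which is why builds done only a couple of times a year rarely clear the bar on time savings alone.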

Scott McCarty

Jul 10, 2010, 2:30:44 PM

to devops-t...@googlegroups.com

I can live with everything you just said, I knew we could find common ground! Seriously, sorry for getting so huffy about this stuff.

I think Limoncelli may have some of this in his book The Practice of System and Network Administration too, but I don't remember it very well. Sadly, it is sitting on my desk, but I haven't cracked it open in a couple of years.

Scott M

Greg Retkowski

Jul 10, 2010, 6:50:25 PM

to devops-t...@googlegroups.com

I've got my own rule of thumb about when to automate - and it has served
me pretty well...

The first time you do something, figure it out and just do it.
The second time you do something, make it into a checklist.
The third time you do something, automate it.

This has helped me not waste a bunch of work on a configuration I
never use again, but ensures that if I find myself doing it with any
frequency that I gain the advantages of documenting and automating it.
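
The three-step rule above is simple enough to express as a counter. This is a minimal sketch; the `TaskLog` class and the task name are hypothetical, made up to illustrate the rule.

```python
from collections import Counter


class TaskLog:
    """Counts how often each task is done and suggests the next step."""

    def __init__(self):
        self._counts = Counter()

    def record(self, task):
        """Record one more run of `task` and return the rule's advice."""
        self._counts[task] += 1
        times_done = self._counts[task]
        if times_done == 1:
            return "just do it"
        if times_done == 2:
            return "write a checklist"
        return "automate it"


log = TaskLog()
advice = [log.record("build HA cluster") for _ in range(3)]
print(advice)
```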

-- Greg

Scott McCarty wrote:
> Agreed, same team. I was getting fired up earlier, but it's because not
> one person on here has come out and said that there are times when it is
> less efficient to automate.

--
Greg Retkowski / I.T. Infrastructure Consultant | RAGE
gr...@rage.net http://www.rage.net/~greg/ C:408-455-3913 | .NET

Packets routed through the Bay Area, CA, USA

Ernest Mueller

Jul 12, 2010, 5:23:40 PM7/12/10

to devops-t...@googlegroups.com

I was on vacation and missed some great discussion... Scott and Damon
especially make great points.

Here's what we've done and what has worked/not worked for us.

In my previous environment, which was traditional hardware Web systems, we
picked and chose what to automate. We wrote code to do frequently repeated
work, and the top ones for us were:
1. Pushing new static content to the Web servers (take stuff out of SCM
based on control documents and rsync it out to the multiple targets and
invalidate it in the various caching levels) - eventually just turned it
into an interface for the content people to use directly.
2. Pushing new Java applications to the app servers (again, pulling out of
SCM based on control docs); devs use it to push to dev and ops uses it to
push to test/production. The dev basically provides a template full of
Perl variables of sources and targets and we "do the right thing" with it.
3. Apache redirect management, so that the business people eager for
dozens of redirects a week can do that themselves safely using a Web
interface (why there's not a real app that does this yet I'll never know)
4. Automatic actions (often restarts) on servers based on monitor results,
a typical SSH framework kind of thing. We have hundreds of apps and due to
priority conflict, when one had a major production problem there wasn't
always a dev doing a same day fix, so we'd need to NOT page a hapless
sysadmin 200 times a week about it. If all the sysadmin was going to do is
stare balefully at it and take stack traces and then restart - we have
computers for that.
5. Other more commodity/packaged stuff, like splunk for log management,
security scans, monitoring. We tried to expose as much of this directly to
the developers as we could.
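As an illustration of item 4, the decision "restart it rather than page someone" can be sketched as a small function. Everything here is hypothetical: the function name, the action strings, and especially the cap on restarts before escalating, which is an assumption and not something the setup above is described as having.

```python
def decide_action(check_ok, restarts_last_hour, max_restarts=3):
    """Pick a response to one monitor result.

    check_ok: whether the monitor check passed.
    restarts_last_hour: automatic restarts already done recently.
    max_restarts: assumed cap before a human gets paged.
    """
    if check_ok:
        return "none"
    if restarts_last_hour < max_restarts:
        # What the sysadmin would have done by hand anyway.
        return "capture_stack_and_restart"
    # Restarting is not helping, so escalate to a person.
    return "page_oncall"


print(decide_action(check_ok=False, restarts_last_hour=0))
```

Keeping the decision separate from the SSH plumbing that carries it out makes the policy easy to test without touching a real server.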

There were a number of things we couldn't/didn't automate that made us sad.
1. Some kinds of code deploys. Annoyingly, some products we used
(Vignette v7, FAST Search) did not have full automation; you actually had
to click around in their GUIs to do some deploys and changes. We do
monthly "Web releases" where we take the whole site down late Friday night
and do 60+ app deploys. The Java stuff, once automated, was never on the
critical path, so the GUI-driven stuff usually was (although our DBAs often
were, running random long jobs and whatnot). This means that to this day we have
~8 hours of downtime once a month to push code. We also had the problem of
very few servers being "the same" - at most, small clusters were.
2. We managed system configs and builds using "the wiki method." But as
many have noted, we got thousands of docs, doc duplication, and when you'd
pawn off a system build on a new guy - you'd think they should be able to
do steps 1-30 as written down, but for some reason they always messed it
up.
Our justification to not automate it more using Chef or Puppet or whatever
was that we didn't do system builds that often, and it would take work and
maintenance, and there was always loads of other work to do. Plus, we
didn't have complete control over our boxes - we were part of a big
fragmented IT Infrastructure department where OS changes had to go through
the "UNIX team" or whatnot, we could only control/touch our app layer
stuff. (The UNIX team used cfengine but had zero interest in using it to
automate our stuff. Ah, OpsOps.)
As it took weeks to get a PO and buy a server and have the network team
rack and jack it and the UNIX team build it - it also didn't really matter
that our builds took a while to do "off the wiki" as well. What's a couple
days on top of 6 weeks?
3. Tools require continuous upkeep. Our DBAs learned that; they bought an
expensive Quest DB APM tool and didn't put enough time into keeping it up;
eventually it fell apart and they just uninstalled it and didn't re-up
maintenance. But we needed to spend more and more time maintaining the
tools (third party stuff like splunk and monitoring, but also our own
code).
4. Every time you did something like add a new system there would be gaps.
Maybe it didn't get added to monitoring. Maybe it didn't get added to the
server tracking db. But eventually someone would find the problem and fix
it manually, like the "Toyota accelerator solution" mentioned earlier.
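The gaps in item 4 are the kind of thing a periodic cross-check can catch before someone trips over them. A minimal sketch, assuming the host lists can be exported from provisioning, monitoring, and the server tracking db (the function name and result keys are made up for illustration):

```python
def find_gaps(provisioned, monitored, tracked):
    """Report provisioned hosts missing from monitoring or the tracking db."""
    provisioned = set(provisioned)
    return {
        "missing_from_monitoring": sorted(provisioned - set(monitored)),
        "missing_from_tracking_db": sorted(provisioned - set(tracked)),
    }


gaps = find_gaps(
    provisioned=["web01", "web02", "app01"],
    monitored=["web01", "app01"],
    tracked=["web01", "web02"],
)
print(gaps)
```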

So we got to a steady state. Things worked. The biggest time sinks were
filled. But agility was poor. It wasn't all lack of tooling - it was part
culture, part organizational, part un-automation, part using physical
hardware. But I saw more and more the business or devs not even trying
things because they had a tight timeframe and "knew the sysadmins couldn't
deliver that on a tight timeframe." Sure, there were DR concerns and
consistency/quality issues and all - but those didn't ever "cost justify"
more automation or more fundamental change. As team manager I enforced
documentation and crosstraining and other "hit by a bus" mitigations. We
were lightly automated waterfall with some dev collaboration (limited by a
high dev to op ratio).

But eventually, this didn't keep up with our needs. We stopped just
running our e-comm Web site and decided to offer actual SaaS products.
Everyone - business, devs, ops - realized "hey we have to be able to do
things faster. More code releases with less risk. Faster provisioning of
environments." It was clear that doing it our existing way was a
non-starter. We need entire new dev and/or production environments to
appear largely on demand, not in weeks. We can't queue up all our app
changes for one big release a month (that incurs downtime).

The solution was a mix of all this newfangled stuff. Yes, just taking the
existing people, process, and systems and putting more automation on them
doesn't get you all that much. But it's part of it. We greenfielded a
team of joint devs and ops for collaboration, started using cloud computing
to avoid hardware procurement cycles, collapsed ops responsibility for all
remaining components into that team to avoid the OpsOps fragmentation, and
automated provisioning, control, and monitoring. We are architecting the
systems differently, no longer cramming various services into every box to
maximize CPU usage, and making the boxes more commoditized. We're just finishing
version 1 of our automation framework and it has certainly taken a good
amount of time and effort.

Is it cost justified? Well, "hard ROI" is largely a myth. But the dev
managers and business decisionmakers we laid this vision out to sure as
hell think it is. Frankly, we wouldn't be able to turn back from a
more fully automated approach now; they wouldn't stand for it. Once they
realized what continuous deployment, pushbutton environment provisioning,
etc. get them in terms of flexibility and agility, they are more eager for
it than we are.

If I went back to our Web systems team, with a more than 10:1 dev to op
ratio, and multiple stovepipe infra teams, and physical hardware, and all
that, would full automation be at the top of my list? No, probably not.
There were bigger fish to fry than automating a once-per-month new server
build. But if that team needed more agility, it would need a series of
changes, where fuller automation is really one of the later stages.
People>process>tools, after all...

Ernest
______________________
UN-altered REPRODUCTION and DISSEMINATION of
this IMPORTANT information is ENCOURAGED.

From: Damon Edwards <da...@dtosolutions.com>

To: devops-t...@googlegroups.com

Date: 07/09/2010 03:57 PM

Subject: Re: Devops Workout: 1 and 2 and 3 oohrah: Quantitative vs. Qualitative

Sent by: devops-t...@googlegroups.com

Adrian Cole

Jul 12, 2010, 7:40:22 PM7/12/10

to devops-t...@googlegroups.com

Wow, what a great story!

Something related that I think is worth underscoring is morale. I don't know about you, but in scenarios where one department "gets to" automate, collaborate, be more agile, etc., I've found morale relatively higher for those doing "operations" work. As everyone who can legally participate participates, fewer jobs suck. The former pattern of "random ops project given to Bob so he won't quit" is a non-issue.

Good stuff guys!
-Adrian
