I agree. So here is the magic though, and I am partially playing devil's advocate here, I had a very similar experience with a Red Hat Cluster installation tool that I built. First, I had it in a human script form, as I like to call it, in a wiki. This worked great and we built a couple of them a year. I agree, it was a pain, we did them so rarely that we almost always had minor glitches, sometimes major, but the wiki worked way better than building them from memory, no joke, that's what they did before. Well, eventually, I got cocky and said, I can automate this extremely complex system install. Well, I succeeded after about 140 hours of coding, testing, etc. Here is the rub, now every time stuff broke because either Red Hat changed something, something changed in our infrastructure, or we had to add a feature, it multiplied the cost because I had to update the automation, and the automation was just as prone to destroying things or having undetectable failures.
Definitely, there is some art here at some point. I stuck this out here, because I would love to see what other people have used to quantify this problem. In security the risk analysis is mushy, but still gives a wag. I think we need something like that.
On Tue, Jul 6, 2010 at 11:57 AM, Scott McCarty <scott....@gmail.com> wrote:
> I agree. So here is the magic though, and I am partially playing devil's advocate here, I had a very similar experience with a Red Hat Cluster installation tool that I built. First, I had it in a human script form, as I like to call it, in a wiki. This worked great and we built a couple of them a year. I agree, it was a pain, we did them so rarely that we almost always had minor glitches, sometimes major, but the wiki worked way better than building them from memory, no joke, that's what they did before. Well, eventually, I got cocky and said, I can automate this extremely complex system install. Well, I succeeded after about 140 hours of coding, testing, etc. Here is the rub, now every time stuff broke because either Red Hat changed something, something changed in our infrastructure, or we had to add a feature, it multiplied the cost because I had to update the automation, and the automation was just as prone to destroying things or having undetectable failures.
Sounds like we've traveled a lot of the same ground... brain ---> wiki ---> automated build. The next step though (perhaps you're there already), and here's where I think the long-term payoff is the highest, is to build and use the SAME tools for both building and maintaining the systems. This more or less solves the issue of having to keep the build system up to date with feature and infrastructural changes, because the very act of pushing the new change will also update your build. But then, I'm probably simplifying things in the name of generalization.
> Definitely, there is some art here at some point. I stuck this out here, because I would love to see what other people have used to quantify this problem. In security the risk analysis is mushy, but still gives a wag. I think we need something like that.
Agreed. Actually I don't know if you were there, but John Allspaw gave a great talk at Velocity that covered this very topic. By tracking (and graphing) incident frequency and duration, and cross-referencing e.g. time-to-resolve numbers against the "size" of releases etc. or other waypoints, you can actually gain some degree of quantifying the risk of certain actions-- way beyond just the anecdotal or "it feels."
Slides here:
http://www.slideshare.net/jallspaw/ops-metametrics-the-currency-you-pay-for-change
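As a rough illustration of the cross-referencing described above (this is not taken from the talk; the incident data, the 500-line size threshold, and the function names are all made up for the example), you could bucket incidents by release size and compare counts and mean time-to-resolve:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical incident log: (release size in changed lines, minutes to resolve).
incidents = [
    (40, 12), (55, 20), (900, 95), (1200, 140), (60, 15), (1500, 180),
]

def bucket(size, threshold=500):
    """Label a release 'small' or 'large' using an arbitrary size threshold."""
    return "small" if size < threshold else "large"

def summarize(incidents, threshold=500):
    """Return {bucket: (incident_count, mean_minutes_to_resolve)}."""
    grouped = defaultdict(list)
    for size, minutes in incidents:
        grouped[bucket(size, threshold)].append(minutes)
    return {name: (len(times), mean(times)) for name, times in grouped.items()}

summary = summarize(incidents)
for name, (count, mttr) in sorted(summary.items()):
    print(f"{name}: {count} incidents, mean time-to-resolve {mttr:.1f} min")
```

With real numbers you would want far more data points than this before reading anything into the difference between buckets.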
-L
> I hate to admit this, but it has taken ten years of me doing this kind of work to finally come to these conclusions and feel confident enough about my knowledge of this stuff to finally have a strong opinion. For newcomers to this field it is a long arduous path. I think anyone that calls themselves a dev-op after three years of experience is kind of stretching it. Maybe, an infrastructure programmer, or a sysadmin, but the whole she-bang takes a lot of business acumen too.
+1
This should be one of the quotes heading a chapter in the devops-toolchain book. There are countless blog posts out there about the joys of DevOps Nirvana, a lot of which claim that "we picked up this devops stuff and it's great; you should try it too". Very few, if any, address the _migration_ to the DevOps model from whatever current (and possibly broken) model you are running under, which is probably the hardest problem to overcome when you ponder embracing [any new] model. That does not even take into account whatever cultural and political issues you run into whenever you utter the word "change" in any organization. Change takes time, costs money, and requires shuffling organizations in ways people may or may not be ready to accept.
-Gerir
The trick is to take the ultimate scaling of the system into account when you design the solution: if you end up having to go back and automate the deployment steps after everybody and everything is hooked into the manual process, then you'll incur much higher expenses over the lifetime of the system than if you'd automated it to begin with.
Most outages are caused by change, and no small part of those failed changes are caused by deployment errors. One point of the automation is to reduce the number of chances that engineers have to get things wrong. If humans make errors ~10% of the time, then by reducing the number of things they have to do, you get 10% fewer errors for every step you automate.
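One way to read the ~10% figure, sketched in Python. The assumptions here are mine, not the thread's: the 10% applies per manual step, steps fail independently, and an automated step never fails (which is optimistic).

```python
def expected_errors(manual_steps, error_rate=0.10):
    """Average number of human errors per deployment."""
    return manual_steps * error_rate

def chance_of_any_error(manual_steps, error_rate=0.10):
    """Probability that at least one manual step goes wrong."""
    return 1 - (1 - error_rate) ** manual_steps

# Automating steps shrinks the number a human still performs by hand.
for steps in (10, 5, 1):
    print(
        f"{steps} manual steps: "
        f"{expected_errors(steps):.1f} expected errors, "
        f"{chance_of_any_error(steps):.0%} chance of at least one"
    )
```

Under these assumptions each automated step removes 0.1 expected errors per deployment, while the chance of a completely clean run improves faster than linearly as the manual step count drops.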
This change is quantifiable, too, although not necessarily for solutions you haven't created yet.
For example: For an existing solution, scrub your pre-automation change-related outages, and determine how many were related to operator error. Divide that by the number of changes. Then scrub your post-automation outages, and determine how many were related to operator error. Also divide that by the number of changes. The difference between the two figures is moderately close to the marginal gain you got from automation.
Oh, there are other factors: familiarity with a deployment system tends to decrease the error rate as time goes on, and people often don't make the same mistake twice, which also tends to decrease the error rate as time goes on. If you've got a small difference between those two numbers, or a small sample size (e.g., only did three or four deployments), then any signal here may get lost in the noise. However, if there's a big difference between the numbers, and especially if the change centers around the time the deployment was automated, then that is very informative.
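The arithmetic above can be sketched as follows. The outage and change counts are invented for illustration, and `operator_error_rate` is just a name picked for this example:

```python
def operator_error_rate(operator_error_outages, total_changes):
    """Operator-error outages per change for one period."""
    if total_changes <= 0:
        raise ValueError("need at least one change to compute a rate")
    return operator_error_outages / total_changes

# Hypothetical figures: 12 operator-error outages in 80 manual changes,
# then 3 operator-error outages in 120 automated changes.
before = operator_error_rate(12, 80)
after = operator_error_rate(3, 120)
marginal_gain = before - after

print(f"before automation: {before:.1%} of changes hit an operator error")
print(f"after automation:  {after:.1%} of changes hit an operator error")
print(f"approximate marginal gain: {marginal_gain:.1%} of changes")
```

As noted above, with only a handful of deployments on either side the difference is mostly noise, so treat the result as a rough indicator rather than a measurement.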
Scott raises an issue with automated deployment systems that I disagree with, another one that depends on the particular project, and one that I think is partially relevant:
a) Engineers who've been using an automated deployment system aren't as familiar with the process and don't know how to react when things go awry.
This is akin to owning a fencing company and saying "I'm going to have my workers dig post-holes manually, because otherwise they won't know how to dig a post-hole when the digging machine breaks." That may be true, but the company digging their holes automatically is going to eat your lunch whenever their machine is working.
If you have engineers that don't understand the system enough to troubleshoot it, that, to me, is a separate issue that should be addressed in a testing or staging environment.
b) Changes in the underlying components can break an automated deployment as easily as--or more easily than--a manual deployment.
True, but with an automated deployment system you can more easily test the deployment system with new components, and even put it under CI, so that a complete end-to-end test deployment is done automatically whenever any of the components changes.
If you do a small number of deployments into an environment where things rarely change, then the cost of automation plus the costs of the test may not be worth it; as you move toward a larger number of smaller deployments--toward the continuous deployment end of the spectrum, in other words--then that cost becomes much easier to amortize.
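A back-of-the-envelope way to see the amortization, as a sketch only. The 140 hours is the figure Scott quoted for his cluster tool; every other number and the function name are assumptions made up for the example:

```python
import math

def break_even_deployments(automation_hours, manual_hours_per_deploy,
                           automated_hours_per_deploy,
                           upkeep_hours_per_deploy=0.0):
    """Deployments needed before automation pays for itself.

    Returns None when automation never pays off, i.e. when upkeep eats
    the whole per-deployment saving.
    """
    saving = (manual_hours_per_deploy
              - automated_hours_per_deploy
              - upkeep_hours_per_deploy)
    if saving <= 0:
        return None
    return math.ceil(automation_hours / saving)

# 140 hours to build the tool; assume 8 hours per manual build,
# 0.5 hours per automated build, and 2 hours of tool upkeep per build.
print(break_even_deployments(140, 8, 0.5, 2.0))

# If a manual build only takes 2 hours, the upkeep alone wipes out the gain.
print(break_even_deployments(140, 2, 0.5, 2.0))
```

At a couple of builds a year the first case takes over a decade to break even; at ten builds a week it takes under a month, which is the whole point about frequency.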
c) Some components of the deployment process are not correctly automatable.
True. However, I think this is getting to be a smaller and smaller set of corner cases. If the scale is there, I think it's generally best to automate what you can, because you still get the benefit of fewer errors for the parts you can automate.
As an aside, in my experience, if a human does a part of a process, and it's possible to automatically verify that the human did it right, then that can be a significant value right there.
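A minimal sketch of what that kind of verification might look like: the human does the step, and a small checklist runner confirms the result. The paths and check names below are hypothetical placeholders, not from any real runbook:

```python
from pathlib import Path

def file_exists(path):
    """Build a check that passes when `path` is a regular file."""
    def check():
        return Path(path).is_file()
    return check

def file_contains(path, needle):
    """Build a check that passes when `path` exists and contains `needle`."""
    def check():
        p = Path(path)
        return p.is_file() and needle in p.read_text()
    return check

def verify(checks):
    """Run each (name, check) pair and return the names that failed."""
    return [name for name, check in checks if not check()]

# Hypothetical post-install checklist for a step a human performed by hand.
checklist = [
    ("config file present", file_exists("/etc/example-app/app.conf")),
    ("listen port set", file_contains("/etc/example-app/app.conf", "listen 8080")),
]

failed = verify(checklist)
print("all checks passed" if not failed else f"failed checks: {failed}")
```

The same runner works for any check that can be phrased as a function returning True or False, so it can grow alongside the wiki procedure without automating the procedure itself.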
Thanks,
-k. -_-
--- On Tue, 7/6/10, Gerardo López-Fernández <gerir...@gmail.com> wrote:
On 6 July 2010 17:10, Jonathan Hitchcock <jonathan....@gmail.com> wrote:
> On Tue, Jul 6, 2010 at 08:57, Scott McCarty <scott....@gmail.com> wrote:
>>
>> I think anyone that calls themselves a dev-op after three years of
>> experience is kind of stretching it. Maybe, an infrastructure programmer, or
>> a sysadmin, but the whole she-bang takes a lot of business acumen too.
>
> We don't call ourselves dev-ops, we call ourselves sysadmins. Devops is the
> way we do things while we're being sysadmins :)
I have to agree with this: I am a system administrator, not a dev-op. I
called myself a junior systems administrator for over 5 years; only
now, after 10 years, do I call myself a senior system administrator,
something I am very proud of. Devops is a useful skill set, but it
doesn't encompass everything system administration encompasses. It does
encompass network topology, but it doesn't encompass sane design of the
underlying platforms. It doesn't encompass troubleshooting
techniques. It doesn't encompass data centre design, cooling and power
strategies. System administration does.
Why is there this embarrassment to be a system administrator? Paul
Nasrat was blogging about this 2 years ago, saying that system
administrator as a term should die.[0]
One of the reasons I became interested in Devops, continuous build and
deployment processes and tools like Cobbler, Puppet and Controltier,
was that they helped me deal with the constant stream of poorly coded
releases from contractors and in-house developers and the 30+ tweaks
to the system that were needed to release this particular version of
the software.
Being a system administrator is not about releasing code, sorry if
that comes as a shock to the dev-ops people in the room. Releasing
code and data is something we do as part of our job; our job is
administering the system.
[0] http://nasrat.livejournal.com/52943.html
Jim :)
Until legit four-year degrees in Systems Administration become commonplace, I think ten years will remain a standard baseline, in terms of straight experience in the field. This profession is *way* more complex than software development. Unless one is working a gig that continually focuses on one specialty, there's no way they can learn enough to be truly senior in less than six years.
With software development, someone might spend ~4 years in university (counting time spent programming and learning on their own), and maybe three years into their actual career they could be considered senior. And again, that's a field which is much less complex.
I think the SAGE job descriptions are a good reference for sysadmins trying to set goals, but they aren't up to date with today's skill requirements.
> I think the concept of devops is sort of romanticized by some of us (myself included), and that has been mentioned as bad by many systems administrators before me, but you really do have to "know" and "understand" so much "stuff" to classify your work as devops. I think there is something good to aspire to there.
I hate labels.
-scott
Puppet, Chef, CfEngine, etc. all can't for sure but a simple (ruby,
python, bash) script wrapping parted can :)
FWIW, I do this for Powerset building software raids on installation.
My installer calculates boundaries for Linux MD devices based on the
smallest disk in the partition.
You could easily do a puppet class that requires an
Exec["partition-and-format-raid"] to be valid before installing support
software and dumping configs and services on it.
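For anyone following along, here is a rough sketch of the boundary idea in Python. To be clear, this is not the Powerset installer (that code isn't shown in this thread); the disk sizes, the layout, the 1 MiB alignment, and the function names are all assumptions. It only prints parted commands, it does not run them:

```python
ALIGN_MIB = 1  # assumed alignment: start partitions on 1 MiB boundaries

def plan_partitions(disk_sizes_mib, layout):
    """Size partitions to fit the smallest disk so every RAID member matches.

    layout is a list of (name, size_mib); a size of None means
    "use whatever is left" and only makes sense for the last entry.
    Returns a list of (name, start_mib, end_mib).
    """
    usable_end = min(disk_sizes_mib.values()) - ALIGN_MIB  # leave a small tail
    start = ALIGN_MIB
    plan = []
    for name, size in layout:
        end = usable_end if size is None else start + size
        if end > usable_end or end <= start:
            raise ValueError(f"partition {name!r} does not fit on the smallest disk")
        plan.append((name, start, end))
        start = end
    return plan

def parted_commands(device, plan):
    """Render the plan as parted invocations (dry run: strings only)."""
    commands = [f"parted -s {device} mklabel msdos"]
    for number, (_name, start, end) in enumerate(plan, start=1):
        commands.append(f"parted -s {device} mkpart primary {start}MiB {end}MiB")
        commands.append(f"parted -s {device} set {number} raid on")
    return commands

# Hypothetical pair of slightly mismatched disks, sizes in MiB.
disks = {"/dev/sda": 476940, "/dev/sdb": 476000}
layout = [("boot", 512), ("swap", 4096), ("root", None)]

plan = plan_partitions(disks, layout)
for device in disks:
    for command in parted_commands(device, plan):
        print(command)
```

Device naming and controller quirks are exactly the hardware-dependent part, so anything like this would need testing against the actual disks before being wrapped in an Exec.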
Cheers,
Ryan
This is good news. Show us some code brutha, I don't want to have to write that boundary checking stuff. Can parted do this for the root drive too?
Regards
Scott M
Yup. At Powerset on our older machines (when we didn't have many) we
used software raid-1's for boot, root and swap so if we lost a disk it
wouldn't stop the services on that host. Not the most optimal setup
but, hey, you do what you can when money is tight :)
I could put together a small proof-of-concept util for this sort of
thing though I have no cciss devices on which to experiment. I guess
that's the catch, right, depends on the hardware. All of our disk
controllers exported disks as /dev/sdX.
Cheers,
Ryan
On Jul 6, 2010, at 8:50 AM, Scott McCarty wrote:
> Tutorial: how to build a shared nothing HA cluster for KVM/DRBD/GFS2. Could this build be automated? http://ow.ly/27xPN
Yes, it can be automated.
Anything you can do with Unix commands you can do with a tool like Chef or Puppet. The variables here are the disk sizing calculations, and that can be done with a script (shell, perl, ruby, pick your flavor), or in the case of Chef, in the recipe itself[0]. The Unix commands used are going to set up / configure system resources in some way, and either Chef or Puppet have the capability to manage a wide variety of resources.
The amount of time taken to automate this is going to vary on the individual sysadmin or specific team's familiarity with the tools and the procedures involved. This particular document is very detailed, and there is a lot of domain knowledge here. There's a sunk cost (time) fallacy here, in that the approach used was that the procedure is so complex, and the steps were written out in a way that is designed to be run by hand. This breaks down at scale (~30some deployments for this alone), and is prone to human error.
I've seen many a "copy and paste" wiki document for procedurally setting up some service or kit manually, often because automating the tasks was deemed too difficult, or would be ineffective time/cost-wise. But invariably at some point, someone will miss a step, even the 15 year veteran expert that has done it dozens of times and/or wrote the procedure.
This doesn't even take into account troubleshooting and fixing the system if it broke somewhere in the middle of installation, especially under a deadline or during an outage. I'm sure we all have our HA software outage war stories.
Perhaps I have a strange perspective. After 3 years of building fully automated infrastructures with a variety of tools, I start with automating in mind. My friends thought it silly that I wrote a Chef cookbook to install teamspeak, since it's an "install once and forget it" kind of thing on my home server, but it has proven a time saver since the latest version is under active development[1].
I'd love to take up the challenge, but I have my own time poverty right now.
[0]: Chef recipes are a Ruby DSL, so you can write Ruby code to do math.
[1]: Teamspeak 3, which is in beta, has sporadic new releases, and it's a snap for me to upgrade :).
--
Joshua Timberman
http://twitter.com/jtimberman
In my small environment (~100 servers), there ARE things that cannot be automated efficiently (RT installation is a singleton), period. This is not religion and we are not developers arguing about a new favorite tool of the month like screen or zsh.
I think you may have missed an underlying point. I pressed/asked for the partition alignment example/code, partially to spark conversation, but also because it should be used for guest virtual machine installations too, which really do need to be automated in my environment. I wanted to demonstrate how easy it is to get down the path of religion and lose business acumen.
Beyond that, remember that some of these questions depend on staff training level too. We spent our time automating DNS entry building and Apache virtual server building with very robust tools because we do that all the time (10 times a week or more). However, we installed Request Tracker manually, probably 8 hours of farting around, but it has literally saved us 100s of hours over the 3.5 years it has been installed. I don't plan on touching it for another 1.5 years. I will have gotten five years of usage out of 8 hours of work, saving hundreds. Now you might argue there are better tools in 2010, I don't care. It's not good business sense to even look at them seriously until its life cycle ends. Whether software is free or not does not affect the business underpinnings.
I don't automate because of philosophy, I do it when it makes sense. On the flip side, I am not scared of it either and at times have had to help others in my team when there was something a bit difficult to automate.
Tools like puppet/chef help in automation and may help slide the scale of when it makes sense to automate a little, but there will always be times that you cannot automate, always. This is especially true in small diverse environments like wild west hosting, which still has demand in 2010.
http://crunchtools.com/the-qq-workflow/
Kind Regards
Scott M
On Jul 8, 2010, at 5:27 AM, Scott McCarty wrote:
> In my small environment (~100 servers), there ARE things that cannot be automated efficiently (RT installation is a singleton), period. This is not religion and we are not developers arguing about a new favorite tool of the month like screen or zsh.
>
> I think you may have missed an underlying point. I pressed/asked for the partition alignment example/code, partially to spark conversation, but also because it should be used for guest virtual machine installations too, which really do need to be automated in my environment. I wanted to demonstrate how easy it is to get down the path of religion and lose business acumen.
>
I'm confused. You seemed fairly explicit in the requirement that such work be automated in your environment, but maybe I misread. At any rate, I think most everyone would agree that time spent at work on a project should reflect the needs and values of the business.
At least, I haven't seen anyone arguing otherwise on here.
> Now you might argue there are better tools in 2010, I don't care. It's not good business sense to even look at them seriously until its life cycle ends. Whether software is free or not does not affect the business underpinnings.
I think that depends on the quality (stability, performance, and more importantly: USABILITY, REPEATABILITY) of your current automation / config management infrastructure vs. what's coming out these days.
Sometimes it isn't, sometimes it is. Sometimes it's better to pay newbie admins to do it.
> I don't automate because of philosophy, I do it when it makes sense. On the flip side, I am not scared of it either and at times have had to help others in my team when there was something a bit difficult to automate.
>
> Tools like puppet/chef help in automation and may help slide the scale of when it makes sense to automate a little, but there will always be times that you cannot automate, always.
I think the word "automate" is probably not the best to use. It doesn't reflect business value. I much prefer saying "time saving mechanism(s)."
> This is especially true in small diverse environments like wild west hosting, which still has demand in 2010.
Sure. That's the beauty of our jobs. You just use/do what works best for your environment.
-scott
[...]
Alright, that is probably my fault because I have been a bit coy. I think in my environment it makes sense to automate the virtual guests, especially for new deployments, but not the underlying infrastructure, which is probably in a ratio of 20/1.
Here's why: my environment is highly diverse, as follows.
* 100 servers, with 88 different kinds. Biggest cluster/pool: 6
* mostly RHEL all the way back to 2.1 up to 5.4 and everything in between
* Finally removed last solaris, but could get more
* Some windows 5 servers, don't like doing it but pays bills
* Some of the apps are 10 years old and still used
* Customers won't pay to upgrade apps (50K price tag)
* Customers will pay to host (<1K/mo)
This environment sounds nothing like those of the people I hear talking about cloud. Now mind you, before I worked here, I worked for American Greetings running the largest e-card site in the world, and also one of the highest traffic sites in the world on Valentine's Day, so I understand scale; I understand running commands across web pools of 50, 60, 150, and 1500. Amazon's EC2 basically exists because they had to meet a similar business problem. I understand deployment of highly homogeneous systems. This is like preaching to the choir.
What I have now is way, way harder to deal with because it is all one-off servers. This does not mean that web ops, devops, automation principles aren't useful to us, but it means that we have to pick and choose what makes sense. Currently, we laid off our other systems administrator so I am by myself. Surely, I now have to be wise with my time.
I was posing the question in my very first post: when does it make sense to automate? And I do think people are arguing to think about "automating first". I feel a frustration similar to Bob Plankers over at the Lone Sys Admin http://lonesysadmin.net because most of the focus is on the easy problem, large homogeneous environments like yahoo, google, twitter, facebook, American Greetings, etc. At some point in my career, I will probably work for a company like that again, but right now I am trying to solve a different problem and I really think the businesses which are on this list are missing an opportunity to help the small and medium sized organizations with lots of legacy stuff.
[...]
--
Luke Kanies | http://puppetlabs.com | http://madstop.com
Done, I did that 2 years ago, along with redesigning our network, firewalling, and BGP infrastructure.
But, the automation is home grown because cobbler was too much to set up, I wrote two dead easy to use build/baselining tools which are next on my list to release open source.
Because puppet/chef were too much, again two years ago, I just started using svn for critical configs, not dead simple, but as close as possible with some baby bash scripts wrapped around it works pretty damn well.
Honestly, I am impressed with all of the network services that can be handled with puppet now and I do envision a day of complete virtualization where everything can be config driven including the network, but I fear that is 10 to 15 years off, maybe longer for legacy stuff. I am 34 and hope to retire before then ;-), so until then I will focus on all kinds of pieces of the tool chain (logging, log analysis for now).
Thanks for the advice
Scott M
On Jul 8, 2010 12:46 PM, "Luke Kanies" <lu...@madstop.com> wrote:
[...]
> Alright, that is probably my fault because I have been a bit coy. I think in my environment it ma...
+1 what Luke said.
Also... Scott, I'm not sure what makes you think large networks are easier ;) In my experience, they're just as hard due to complexity of services. Do you have experience otherwise?
-scott
When I was at american greetings we had 6-7 systems administrators with a similar number of kinds of servers, let's say 80ish
Oops, hit send on my phone before I was finished. Similar diversity with ease of managing the bigger pools made automation a no brainer cost benefit analysis, but no my job was not any easier ;-)
Currently, I have similar diversity but way smaller numbers, which honestly DOES cloud (no pun intended) the cost benefit analysis, and if I mess up, it hurts a lot more at a small company. I just don't have time to make bad decisions, they can have huge impacts on the business, like go out of business.
Thanks everyone for the feedback.
Scott M
I just want to point out that there is another angle to the "cost of automating" that isn't getting much attention on this thread.
Most of the discussion has been about looking at the cost issue from a single experienced administrator's point of view (i.e., how long would it take ME to automate this vs. how long would it take ME to continue to do it manually/semi-manually?).
You have to consider all of the costs and risks associated with not automating from the perspective of the business. For example...
-Disaster recovery? All of those "one off" servers are necessary for running some part of your business (if not, then you have a different problem and need to start shutting things off). Automation, backed-up offsite, is the shortest path to recovery (even if you aren't using a fully abstracted framework and have to tweak your scripts for a new datacenter).
-Resource utilization? Should you even be the manual cog in the deployment chain or should you be the automation designer and let a lower cost or less critical person run it and maintain it?
-What if anyone gets hit by a bus? Automation is documentation in code form. It either works or it doesn't... no "oops I didn't write that part down" surprises for the new guy.
-Throughput? Is there business value in being able to turn the crank faster and having a shorter Dev -> QA -> Prod lifecycle?
-How do you scale up or down? No organization is static. How do you bring on new people to allow your more experienced people to focus on value adding tasks rather than deployment? If times get tough how do you reduce headcount or reassign people if you aren't fully leveraging automation? Scale up or scale down points are usually make or break points for businesses.
-Standardization? What's the current cost of not standardizing on one way to build machines and deploy applications across all environments (dev, qa, prod)? Can you even manage a separate QA environment without it being automated (or is QA "performed" on developers' boxes and on live production systems)? What's the long term cost of having multiple ways of doing things?
-Compliance? If your business has compliance issues, how do you provide auditability without pushing all work on production systems through an abstraction like an automation framework?
-Performance measurement? If everything is a one off how do you know what is consuming too much time (apples to oranges to bananas makes metrics difficult) ? What actions do/don't add value to the business?
-Switch providers? If it makes business sense to move to the cloud or move to a different datacenter... would it even be possible to do so with so many one off or manual efforts? (for example we have a client that is going to save millions of dollars by moving several dozen machines to a different country for tax purposes)
And specific to using an open source framework to do your automation...
-Hiring? Do I have a better chance of finding someone and bringing that new hire up to speed if things are automated using an off the shelf framework vs. fully custom scripting or not automated at all?
-Can I leverage the work of others? Can I pull in knowledge and updates (in the form of reusable automation code) from other organizations or do I have to compile and maintain it all myself?
These are all concerns that are usually above the normal day-to-day concerns of a systems administrator. However, to me DevOps is all about elevating our individual game so we see how our actions/choices impact others and the success of the business.
No seriously, I don't want to come off sounding mean here, but I disagree strongly with much of this logic. Automation does not alleviate many of the risks that you bring up. In fact, it often makes the risk greater.
- Scripts still break, and when they do they often break horribly in ways that make the problem way harder to find. I don't care what abstraction you're using, it is an abstraction and often hides the root cause of problems.
- Scripts/Automation do NOT, NOT, NOT solve the problem of if I get hit by a bus; it is often much harder to dig into complex code, especially when you don't understand it. This is just a different set of risks.
- It can be much harder to hire someone who will be able to hit the ground running in an automated environment. For 55K you can hire somebody to solve most problems in my metro area. A person that can do "devops" can easily demand 80K-110K.
- When you need to scale quickly automation will always help you do it quicker. I have brought this example up before, but I have a Redhat Cluster Build tool that I created which is very complex and was my first attempt at such a complex build tool. We had a deadline where I needed to build a cluster and I should have had it done in 20 minutes, but some infrastructure had changed since the last time I had used it causing about a day and a half delay. Luckily, I started about three days before the project needed completed, but the automation didn't help in building the system in time, that is for sure. I would also argue that it wasn't measurably more consistent either. We have clusters that were built by hand and by automation and they are still flaky finicky muckery.
I would argue that pieces of automation will help, kickstart for example, but not complete automation. Complete automation is a pipe dream, there are always human inputs. There is definitely times when it is inefficient to automate. When Toyota has a recall, they don't tool up a new factory and fix the accelerator pedals on Lexus', running each one through the line, that would be ridiculous. Instead they manually fix each one at a dealership because it is cheaper. Guess what happens, some of them are messed up still, so they bring back a third or fourth time to a dealership and let a mechanic fix them, eventually getting it right. No car company or manufacturer does it the other way. I rest my case.
--
On Jul 9, 2010, at 2:32 PM, Scott McCarty wrote:
> No seriously, I don't want to come off sounding mean here, but I disagree strongly with much of this logic. Automation does not alleviate many of the risks that you bring up. In fact, it often makes the risk greater.
> - Scripts still break, and when they do they often break horribly in ways that make the problem way harder to find. I don't care what abstraction you're using, it is an abstraction and often hides the root cause of problems.
Meh. I'm pretty happy not knowing that my hard drive is actually multiple platters with data stored concentrically in cylinders on each side. That abstraction hasn't yet hurt me. I have no idea how many registers my processors have, nor how many chips my RAM is broken into - it just looks like 2gb of linear storage to me.
Obviously abstractions can be bad, but believe it or not there are just tons of abstractions that you don't even notice every day, and at some point in time, someone cried that there was no way they could survive if that critical detail was abstracted. In the switch to high level languages from assembly, programmers were convinced it wasn't possible to write usable software unless you managed every register manually, along with the heads of the hard drive, and just about everything else.
So yeah, we could have bad abstractions, we could provide poor debugging interfaces, we could provide poor logging, we could catch your computers on fire whenever there's an exception. But good software will fail well and provide transparency through the abstraction when you have problems or questions. Puppet will tell you every single command it ever runs, if you run it in debug mode.
In fact, the mark of a good abstraction vs. a crap one is that it provides transparency - it hides detail when you don't need it, but provides access to it when you need it. The goal isn't to forbid you from dealing with detail, but rather to just not require that you deal with it.
> - Scripts/Automation do NOT, NOT, NOT solve the problem of "if I get hit by a bus"; it is often much harder to dig into complex code, especially when you don't understand it. This is just a different set of risks.
Having had a coworker -- the sole owner of storage and month-end processing -- die in his sleep, I can't agree less. Trust me, it's far harder to extract information from a dead person's brain than it is from code, no matter how poorly written.
> - It can be much harder to hire someone who will be able to hit the ground running in an automated environment. For 55K you can hire somebody to solve most problems in my metro area. A person that can do "devops" can easily demand 80K-110K.
Sure, you just have to hire five of them and a team lead. People who get more done in less time always get paid more, and it's not because they have words like devops on their resumes, it's because they get more done in less time. It's up to your company to decide whether they'll pay that premium.
> - When you need to scale quickly, automation will always help you do it quicker. I have brought this example up before, but I have a Red Hat Cluster build tool that I created which is very complex and was my first attempt at such a complex build tool. We had a deadline where I needed to build a cluster and I should have had it done in 20 minutes, but some infrastructure had changed since the last time I had used it, causing about a day and a half of delay. Luckily, I started about three days before the project needed to be completed, but the automation didn't help in building the system in time, that is for sure. I would also argue that it wasn't measurably more consistent either. We have clusters that were built by hand and by automation and they are still flaky finicky muckery.
That seems like a success story to me. It sounds like it was easier to fix your broken automation than it would have been to build the cluster manually, right?
And really, the problem with your cluster automation is that it wasn't declarative - if you'd built it in Puppet, you would have been running it constantly, and you would have been changing it as your infrastructure changed. Then your old clusters and new clusters would be built the same.
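To illustrate the declarative point in general terms (this is not Puppet, only a minimal Python sketch), the idea is to describe the desired state and let a converge step apply just the differences, so running it constantly is harmless. The names `converge`, `desired`, and `current`, and the example settings, are hypothetical.

```python
def converge(desired: dict, current: dict) -> list:
    """Bring `current` in line with `desired`; return the changes made.

    Both dicts map a setting name to its value. They stand in for real
    resources (packages, files, services), which is an assumption made
    purely for illustration.
    """
    changes = []
    for key, want in sorted(desired.items()):
        have = current.get(key)
        if have != want:
            # Only touch what actually differs from the desired state.
            changes.append((key, have, want))
            current[key] = want
    return changes


# Hypothetical settings for a two-node cluster.
desired = {"cluster_nodes": 2, "fence_device": "ipmi"}
current = {"cluster_nodes": 2}

first_run = converge(desired, current)   # applies the one missing setting
second_run = converge(desired, current)  # nothing left to do
```

Because the second run finds nothing to change, the same description can be applied to old and new machines alike, which is the property being argued for here.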
> I would argue that pieces of automation will help, kickstart for example, but not complete automation. Complete automation is a pipe dream; there are always human inputs. There are definitely times when it is inefficient to automate. When Toyota has a recall, they don't tool up a new factory and fix the accelerator pedals on Lexuses by running each one through the line; that would be ridiculous. Instead they manually fix each one at a dealership because it is cheaper. Guess what happens: some of them are still messed up, so they bring them back a third or fourth time to a dealership and let a mechanic fix them, eventually getting it right. No car company or manufacturer does it the other way. I rest my case.
No argument that not everything can be automated, but is that really a reason to just give up on any automation ever?
Same team! Same team! :)
We are all on the same team here! Let's not lose that DevOps groovy vibe. It's just my read on the thread but I don't think anyone is preaching religion. Just different experiences pointing out different points of view.
Scott, I'd ask you to think about one last thing before you go. Here is the ultimate business evaluation...
Assume another company popped up next door to yours tomorrow. Maybe they are a startup or maybe they are a bigger player taking notice of what you do. They have maximized their automation to the fullest extent possible and all other parts of your businesses (sales, marketing, product design, etc.) are equal. Could you compete? Would you be at a disadvantage? If so, then the choices you are making are calculated risks that you and the rest of your company have to decide on. If not, then carry on and do the smallest amount of work that gets you to the pub by 5. :)
--
The extreme opposite argument to much of the dev-ops comes from those
who would scoff at a $500k a year team of dev-ops people, and instead
hire a 25-to-30 head team somewhere offshore to work 3 shifts of doing
grunt work. While this isn't infinitely scalable, reconfiguring three
dozen employees to follow a new policy can be much quicker than
re-writing automation tools. This isn't necessarily a practical choice
for many options, but the "meat cloud" is a surprisingly competitive
alternative in everything but the most extreme case of scale.
Personally, I find the DevOps vibe a lot more interesting and fun, but
the world is full of diverse solutions to problems... and it's good to
know what people are going to be benchmarking you against.
>> Assume another company popped up next door to yours tomorrow. Maybe they are
>> a startup or maybe they are a bigger player taking notice of what you do.
>> They have maximized their automation to the fullest extent possible and all
>> other parts of your businesses (sales, marketing, product design, etc.) are
>> equal. Could you compete? Would you be at a disadvantage? If so, then the
>> choices you are making are calculated risks that you and the rest of your
>> company have to decide on.
>
> The extreme opposite argument to much of the dev-ops comes from those
> who would scoff at a $500k a year team of dev-ops people, and instead
> hire a 25-to-30 head team somewhere offshore to work 3 shifts of doing
> grunt work. While this isn't infinitely scalable, reconfiguring three
> dozen employees to follow a new policy can be much quicker than
> re-writing automation tools. This isn't necessarily a practical choice
> for many options, but the "meat cloud" is a surprisingly competitive
> alternative in everything but the most extreme case of scale.
This story plays out in the dev world, though, too, right? And
Mythical Man Month hit this right on - you just can't scale people up,
because the communication costs pretty quickly overwhelm the benefits
of additional people.
I work with a lot of those companies that just hire people instead of
tooling up, and I wouldn't describe what they do as successful - in
fact, that's a lot of why they call us in for help.
> Personally, I find the DevOps vibe a lot more interesting and fun, but
> the world is full of diverse solutions to problems... and it's good to
> know what people are going to be benchmarking you against.
Indeed. Darwin ftw!
--
I worry that the person who thought up Muzak may be thinking up
something else. -- Lily Tomlin
On Fri, Jul 9, 2010 at 5:55 PM, Scott McCarty <scott....@gmail.com> wrote:
> I didn't say no documentation, I said wiki docs vs. code. Bad docs is bad
> documentation and people that don't document aren't going to code any
> better. Maybe they will just go be trash men, I don't know.
Code is doc, but it's active doc - and compared with passive doc, it's
more likely to be tested, which is key. If you have an automated
configuration management system, you have no choice but to run the
code. The computer can't cheat by adding steps that aren't there, and
if it doesn't work, you'll notice. So even if it's "bad" code, it's
*working* code. Which is more than you may be able to say about your
passive documentation; it can lie dormant until it's needed because
the guy who wrote it is suddenly unavailable, and then you're in
trouble.
You can try to mitigate this problem by exercising the doc, as if it
were code - have someone, ideally a new person, run the procedure
every time, following the checklist exactly with nobody "helping".
But with humans instead of computers, you really have to nail down the
discipline. It's too easy for little extra "trivial" steps to get run
in between the formal steps of the procedure (and not even added
later).
If you have good discipline around creating and exercising
documentation, that's great - but it also means you'll seriously kick
ass at automation. :)
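A minimal sketch of what "active doc" might look like, assuming a procedure can be written as steps that each carry their own verification. The `Step` and `run_checklist` names and the two example steps are hypothetical, and the "host" is just an in-memory dict.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    description: str                # what the wiki page would say
    action: Callable[[dict], None]  # what actually gets run
    check: Callable[[dict], bool]   # how we know the step worked


def run_checklist(steps: List[Step], state: dict) -> List[str]:
    """Run every step in order; stop at the first step whose check fails."""
    log = []
    for number, step in enumerate(steps, start=1):
        step.action(state)
        if not step.check(state):
            log.append(f"{number}. FAILED: {step.description}")
            break
        log.append(f"{number}. ok: {step.description}")
    return log


# Hypothetical two-step procedure operating on an in-memory "host".
steps = [
    Step("create app user",
         lambda s: s.setdefault("users", []).append("app"),
         lambda s: "app" in s["users"]),
    Step("open port 8080",
         lambda s: s.setdefault("ports", []).append(8080),
         lambda s: 8080 in s["ports"]),
]
log = run_checklist(steps, {})
```

Nothing outside the listed steps can run here, so an unwritten "trivial" extra step shows up as a failed check instead of staying hidden.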
--
Mark J. Reed <mark...@gmail.com>
Agreed, same team. I was getting fired up earlier, but it's because not one person on here has come out and said that there are times when it is less efficient to automate.
Now here is the funny part. In general (computers and manufacturing), automation helps you be really good at one thing but really bad at doing a bunch of unknown things, as in my recall example. If you automate too much you lose fundamental business agility. I am not talking about features, I am talking about new markets.
For example, we don't go after Windows/MS work. It just isn't efficient for my organization to undertake. All of our automation/scripting/knowledge is in Linux, so we turn that work away. Specialization, including automation, saps your agility, but allows you to do one thing really well.
On your point about the second company popping up, I really like this point and it is honestly one of the only genuinely interesting/original points brought up.
To answer your question, I believe at my company, we have currently automated everything that is possible to automate. My company is 15 years old and always had two systems administrators. Currently our web operations is bigger than it has ever been and I am now running it, by myself, and not breaking a sweat. I mean, hell, I have enough time to contribute to this list ;-)
From another angle though, I don't think guys that automate first would ever go after a business niche like the one we go after; they are just not wired that way, so really I don't think they could compete. I bring this up only because it is interesting. Second, our owner/sales weasel is one of the best I have ever seen. He is really good at selling to people that need this kind of niche. Finally, I think a fully automated environment is probably more attractive work, which produces an intangible value for the employees that love this kind of work.
If we automate any more than we do now, given the current tools that exist, we would start to lose money and narrow the focus of what work we can undertake. Trust me, it is already a struggle in every organization with sales guys. They always want to sell on the edges of what the organization can do. It's a balance.
Oh and btw, yes, they pay me more than any sysadmin that has ever worked for them, but I bring a lot more than automation capability to our org. I do mrc/nrc calculations, product engineering/advice, etc., etc., etc.
Time for the pub
Scott M
I want to address why this is faulty logic. Our wiki docs are NOT passive. The only stuff in my org that isn't automated is either run so rarely and so complex that automating it doesn't make sense, or very, very old and legacy.
We aren't building legacy stuff. The stuff that is so rare/complex that I don't automate it, I can't remember how to do. (As a side note, I have to do so much stuff I can barely remember how to do anything.) In these scenarios, the wiki docs are by no means passive; they are the only way to get it done without a lot of suffering to figure it out again. This creates natural pain-based sanctions, thereby making the wiki authoritative.
Tangentially, I think this is how you know you have enough automation. If your docs are passive, you'd better keep automating; when they stop being passive and you're spending more time fixing automation than building new infrastructure, you might have too much. It really is a bit of an art.
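One hedged way to put a number on "enough automation" is a simple break-even count: how many runs does it take before building and maintaining the tool costs less than doing the job by hand? The sketch below is only an assumed model; the function name and parameters are hypothetical, and apart from the roughly 140 build hours mentioned earlier in the thread, the figures are made up for illustration.

```python
import math
from typing import Optional


def break_even_runs(build_hours: float,
                    manual_hours_per_run: float,
                    automated_hours_per_run: float,
                    upkeep_hours_per_run: float) -> Optional[int]:
    """Smallest number of runs at which automating becomes cheaper in total.

    Returns None when automation never pays off, i.e. when running plus
    maintaining it costs at least as much per run as doing the work by hand.
    """
    saved_per_run = manual_hours_per_run - (
        automated_hours_per_run + upkeep_hours_per_run)
    if saved_per_run <= 0:
        return None
    # Need build_hours < runs * saved_per_run, so take the next whole run.
    return math.floor(build_hours / saved_per_run) + 1


# 140 build hours comes from the thread; the other figures are assumptions.
runs_needed = break_even_runs(140, manual_hours_per_run=16,
                              automated_hours_per_run=0.5,
                              upkeep_hours_per_run=8)
never = break_even_runs(140, manual_hours_per_run=16,
                        automated_hours_per_run=0.5,
                        upkeep_hours_per_run=16)
```

With these assumed figures the tool pays for itself only after 19 builds, which at a couple of builds a year is a long wait; if upkeep per run reaches the cost of a manual build, it never does.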
So, in short, I think your argument has been made since 1980 and is at this point a straw man. Everyone knows too much process sucks, and I probably err on the side of a hair too much automation myself: 1. Because it is more fun. 2. Because I think it is a business advantage (for my own career) to be on the cutting edge of what is possible with technology and engineering.
Alright, you guys turned me into a liar about dropping out, oh well, I love this stuff ;-)
Scott M
> I want to address why this is faulty logic. Our wiki docs are NOT passive. The only stuff in my org that isn't automated is either run so rarely and so complex that automating it doesn't make sense, or very, very old and legacy.
>
> We aren't building legacy stuff. The stuff that is so rare/complex that I don't automate it, I can't remember how to do. (As a side note, I have to do so much stuff I can barely remember how to do anything.) In these scenarios, the wiki docs are by no means passive; they are the only way to get it done without a lot of suffering to figure it out again. This creates natural pain-based sanctions, thereby making the wiki authoritative.
>
> Tangentially, I think this is how you know you have enough automation. If your docs are passive, you'd better keep automating; when they stop being passive and you're spending more time fixing automation than building new infrastructure, you might have too much. It really is a bit of an art.
>
> So, in short, I think your argument has been made since 1980 and is at this point a straw man. Everyone knows too much process sucks, and I probably err on the side of a hair too much automation myself: 1. Because it is more fun. 2. Because I think it is a business advantage (for my own career) to be on the cutting edge of what is possible with technology and engineering.
Unfortunately, as great as wikis are, they easily become "passive" and very stale in large organizations. It takes whole teams of people to curate wikis and prevent stagnation.
Once you hit about ten people maintaining docs lasting years, the size of your department's wiki easily grows to thousands of pages. At that point duplicated work can be a serious problem. Not to mention old documentation that is outdated.
This requires a lot of work in large orgs. When faced with spending money on personnel to curate and maintain or self-document through code and automated tasks, which are taken care of inline with the Sysadmin's work itself, the choice is pretty clear.
-scott
The exact same thing is true of operations code and automation. Exact same problem: unused code is duplicated, etc., etc. Ever look at a piece of code you didn't like and get that urge to write it from scratch, or maybe you don't understand it all the way? Either way, code gets duplicated/outdated (broken because of infrastructure changes) just like docs, and maybe more so!
I still hold that there will be stuff that should be in the wiki, not automated, and it will be so complex/rarely used that it will by its very nature be authoritative. This is healthy and efficient!
This is another straw man, sorry.
Scott M
> The exact same thing is true of operations code and automation. Exact same problem: unused code is duplicated, etc., etc. Ever look at a piece of code you didn't like and get that urge to write it from scratch, or maybe you don't understand it all the way? Either way, code gets duplicated/outdated (broken because of infrastructure changes) just like docs, and maybe more so!
My real life experience says otherwise: If it's unused then it's not doing anything to your system. If it's used and unmodified it's working as intended.
If it's NOT working as intended, you've got other fish to fry.
-scott
I'm also not arguing that everything can or should be automated; you
definitely have to look at the cost/benefit (Tom Limoncelli had some
good guidelines at LISA last year). I'm saying that automation has
inherent advantages over plain documentation: other things being
equal, automation wins. The fact that other things often *aren't*
equal absolutely has to go into your decision, but it doesn't affect
my point.
In particular, "good enough" automation can be far more useful
day-to-day than great documentation. Of course, you also have to be
careful - bad automation can wreak havoc across your infrastructure
much faster than people manually following terrible documentation.
The first time you do something, figure it out and just do it.
The second time you do something, make it into a checklist.
The third time you do something, automate it.
This has kept me from sinking a bunch of work into a configuration I
never use again, while ensuring that if I find myself doing it with any
frequency I gain the advantages of documenting and automating it.
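The three-step rule above is simple enough to write down as a counter. This is just an illustrative sketch; the `RuleOfThree` class and the task name are hypothetical.

```python
from collections import Counter


class RuleOfThree:
    """Track how often each task comes up and suggest how to handle it."""

    def __init__(self) -> None:
        self._seen = Counter()

    def record(self, task: str) -> str:
        """Count one more occurrence of `task` and return the advice for it."""
        self._seen[task] += 1
        count = self._seen[task]
        if count == 1:
            return "figure it out and just do it"
        if count == 2:
            return "make it into a checklist"
        return "automate it"


tracker = RuleOfThree()
advice = [tracker.record("build cluster") for _ in range(3)]
```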
-- Greg
Scott McCarty wrote:
> Agreed, same team. I was getting fired up earlier, but it's because not
> one person on here has come out and said that there are times when it is
> less efficient to automate.
--
Greg Retkowski / I.T. Infrastructure Consultant | RAGE
gr...@rage.net http://www.rage.net/~greg/ C:408-455-3913 | .NET
Packets routed through the Bay Area, CA, USA
Here's what we've done and what has worked/not worked for us.
In my previous environment, which was traditional hardware Web systems, we
picked and chose what to automate. We wrote code to do frequently repeated
work, and the top ones for us were:
1. Pushing new static content to the Web servers (take stuff out of SCM
based on control documents and rsync it out to the multiple targets and
invalidate it in the various caching levels) - eventually just turned it
into an interface for the content people to use directly.
2. Pushing new Java applications to the app servers (again, pulling out of
SCM based on control docs); devs use it to push to dev and ops uses it to
push to test/production. The dev basically provides a template full of
Perl variables of sources and targets and we "do the right thing" with it.
3. Apache redirect management, so that the business people eager for
dozens of redirects a week can do that themselves safely using a Web
interface (why there's not a real app that does this yet I'll never know)
4. Automatic actions (often restarts) on servers based on monitor results,
a typical SSH framework kind of thing. We have hundreds of apps and due to
priority conflict, when one had a major production problem there wasn't
always a dev doing a same day fix, so we'd need to NOT page a hapless
sysadmin 200 times a week about it. If all the sysadmin was going to do is
stare balefully at it and take stack traces and then restart - we have
computers for that.
5. Other more commodity/packaged stuff, like splunk for log management,
security scans, monitoring. We tried to expose as much of this directly to
the developers as we could.
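As a rough sketch of item 4 above (automatic restarts driven by monitor results), the handler below restarts a failing app on its own and pages a human only when restarts within a time window stop helping. It is an assumed design, not the SSH framework described: the `AutoRestarter` class, its thresholds, and the injected `restart` and `page` callables are all hypothetical.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional


class AutoRestarter:
    """Restart a failing app automatically; page a human only if that stops helping."""

    def __init__(self, restart: Callable[[], None], page: Callable[[str], None],
                 max_restarts: int = 3, window_seconds: float = 3600.0) -> None:
        self.restart = restart
        self.page = page
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._recent: Deque[float] = deque()  # timestamps of recent restarts

    def handle(self, healthy: bool, now: Optional[float] = None) -> str:
        """Process one monitor result and return the action taken."""
        if healthy:
            return "ok"
        now = time.time() if now is None else now
        # Forget restarts that have fallen out of the sliding window.
        while self._recent and now - self._recent[0] > self.window_seconds:
            self._recent.popleft()
        if len(self._recent) >= self.max_restarts:
            self.page("restarts are not fixing the app")
            return "paged"
        self._recent.append(now)
        self.restart()
        return "restarted"


# Stand-ins for the real restart command and pager (assumptions).
restarts, pages = [], []
handler = AutoRestarter(lambda: restarts.append("restart"), pages.append,
                        max_restarts=2, window_seconds=60)
actions = [handler.handle(False, now=t) for t in (0, 10, 20)]
```

Passing `restart` and `page` in as callables keeps the policy testable without touching a real server; capturing stack traces before the restart would slot in at the same place.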
There were a number of things we couldn't/didn't automate that made us sad.
1. Some kinds of code deploys. Annoyingly, some products we used
(Vignette v7, FAST Search) did not have full automation, you actually had
to click around in their GUIs to do some deploys and changes. We do
monthly "Web releases" where we take the whole site down late Friday night
and do 60+ app deploys. The Java stuff once automated was never on the
critical path, so this stuff usually was (although our DBAs often were,
running random long jobs and whatnot). This means that to this day we have
~8 hours of downtime once a month to push code. We also had the problem of
very few servers being "the same" - at most, small clusters were.
2. We managed system configs and builds using "the wiki method." But as
many have noted, we got thousands of docs, doc duplication, and when you'd
pawn off a system build on a new guy - you'd think they should be able to
do steps 1-30 as written down, but for some reason they always messed it
up.
Our justification to not automate it more using Chef or Puppet or whatever
was that we didn't do system builds that often, and it would take work and
maintenance, and there was always loads of other work to do. Plus, we
didn't have complete control over our boxes - we were part of a big
fragmented IT Infrastructure department where OS changes had to go through
the "UNIX team" or whatnot, we could only control/touch our app layer
stuff. (The UNIX team used cfengine but had zero interest in using it to
automate our stuff. Ah, OpsOps.)
As it took weeks to get a PO and buy a server and have the network team
rack and jack it and the UNIX team build it - it also didn't really matter
that our builds took a while to do "off the wiki" as well. What's a couple
days on top of 6 weeks?
3. Tools require continuous upkeep. Our DBAs learned that; they bought an
expensive Quest DB APM tool and didn't put enough time into keeping it up;
eventually it fell apart and they just uninstalled it and didn't re-up
maintenance. But we needed to spend more and more time maintaining the
tools (third party stuff like splunk and monitoring, but also our own
code).
4. Every time you did something like add a new system there would be gaps.
Maybe it didn't get added to monitoring. Maybe it didn't get added to the
server tracking db. But eventually someone would find the problem and fix
it manually, like the "Toyota accelerator solution" mentioned earlier.
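The gaps described in item 4 above can at least be detected mechanically by comparing what was provisioned against what each supporting system knows about. A minimal sketch, assuming each system can export its list of hosts; the function name, registry names, and hostnames are hypothetical.

```python
def find_gaps(provisioned: set, registries: dict) -> dict:
    """Return, per registry, the provisioned hosts it does not know about."""
    return {name: sorted(provisioned - known)
            for name, known in registries.items()
            if provisioned - known}


# Hypothetical hosts and registries.
provisioned = {"web01", "web02", "app01"}
registries = {
    "monitoring": {"web01", "app01"},
    "server_tracking_db": {"web01", "web02", "app01"},
}
gaps = find_gaps(provisioned, registries)
```

Run on a schedule, a check like this turns "eventually someone would find the problem" into a report, even if the fix itself stays manual.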
So we got to a steady state. Things worked. The biggest time sinks were
filled. But agility was poor. It wasn't all lack of tooling - it was part
culture, part organizational, part un-automation, part using physical
hardware. But I saw more and more the business or devs not even trying
things because they had a tight timeframe and "knew the sysadmins couldn't
deliver that on a tight timeframe." Sure, there were DR concerns and
consistency/quality issues and all - but those didn't ever "cost justify"
more automation or more fundamental change. As team manager I enforced
documentation and crosstraining and other "hit by a bus" mitigations. We
were lightly automated waterfall with some dev collaboration (limited by a
high dev to op ratio).
But eventually, this didn't keep up with our needs. We stopped just
running our e-comm Web site and decided to offer actual SaaS products.
Everyone - business, devs, ops - realized "hey we have to be able to do
things faster. More code releases with less risk. Faster provisioning of
environments." It was clear that doing it our existing way was a
non-starter. We need entire new dev and/or production environments to
appear largely on demand, not in weeks. We can't queue up all our app
changes for one big release a month (that incurs downtime).
The solution was a mix of all this newfangled stuff. Yes, just taking the
existing people, process, and systems and putting more automation on them
doesn't get you all that much. But it's part of it. We greenfielded a
team of joint devs and ops for collaboration, started using cloud computing
to avoid hardware procurement cycles, collapsed ops responsibility for all
remaining components into that team to avoid the OpsOps fragmentation, and
automated provisioning, control, and monitoring. We are architecting the
systems differently, no longer cramming various services into every box to
maximize CPU usage and make them more commoditized. We're just finishing
version 1 of our automation framework and it has certainly taken a good
amount of time and effort.
Is it cost justified? Well, "hard ROI" is largely a myth. But when we
laid out this vision to the dev managers and business decisionmakers, they
sure as hell think it is. Frankly we wouldn't be able to turn back from a
more fully automated approach now, they wouldn't stand for it. Once they
realized what continuous deployment, pushbutton environment provisioning,
etc. get them in terms of flexibility and agility, they are more eager for
it than we are.
If I went back to our Web systems team, with a more than 10:1 dev to op
ratio, and multiple stovepipe infra teams, and physical hardware, and all
that, would full automation be at the top of my list? No, probably not.
There were bigger fish to fry than automating a once-per-month new server
build. But if that team needed more agility, it would need a series of
changes, where fuller automation is really one of the later stages.
People>process>tools, after all...
Ernest
______________________
UN-altered REPRODUCTION and DISSEMINATION of
this IMPORTANT information is ENCOURAGED.
From: Damon Edwards <da...@dtosolutions.com>
To: devops-t...@googlegroups.com
Date: 07/09/2010 03:57 PM
Subject: Re: Devops Workout: 1 and 2 and 3 oohrah: Quantitative vs. Qualitative
Sent by: devops-t...@googlegroups.com
Wow, what a great story!
Something related I think is worth underscoring is morale. I don't know about you, but in scenarios where one department "gets to" automate, collaborate, be more agile, etc., I've found morale relatively higher for those doing "operations" work. As everyone who can legally participate participates, fewer jobs suck. The former pattern of "random ops project given to Bob so he won't quit" is a non-issue.
Good stuff guys!
-Adrian