Questions to ask Yourself


Dan Nemec

Aug 24, 2011, 10:13:04 AM
to devops-t...@googlegroups.com
As I was driving home from work, after a co-worker and I had been debugging a Chef recipe and I had just been talking to some other co-workers about contributing to our growing body of Chef code, I came up with these thoughts to help us write better code. I'm sure someone has posted something like this before, but I think it warrants being said again.

When you write a block of code in your automation tool ask yourself these questions and answer them before you run your new code:

1) What will happen if this function runs on a brand-new system? (What are the prerequisites?)
2) What will happen if this function runs on an existing system but has never been run before? (different prerequisites?)
3) What will happen if this function's behavior is changed from the last run?
4) What will happen if this function is re-run on the same system with no other changes? (idempotence)
5) What will happen if the prior run failed? How will the function recover from a failed or partial run?

What are some other questions to ask yourself? What are some details pertaining to each question?

Dan

Luke Kanies

Aug 24, 2011, 10:15:36 AM
to devops-t...@googlegroups.com
I guess my follow-on would be: what can the tools you're using do to make it easy to encode the answers in your solutions, so that you don't have to work as hard to ask those questions every time?


--
The most overlooked advantage to owning a computer is that if they foul
up there's no law against wacking them around a little. -- Joe Martin
---------------------------------------------------------------------
Luke Kanies | http://puppetlabs.com | http://about.me/lak
Join us in PDX for PuppetConf: http://bit.ly/puppetconfsig


Dan Nemec

Aug 24, 2011, 11:39:25 AM
to devops-t...@googlegroups.com
Great question. Do you have any specific examples where Puppet (or any tool) helps?

Now that you mention it, the questions primarily apply to the initial release of a "function". Maintenance over time doesn't require as much question asking.

The situation that brought this to mind was: we were writing a new recipe to upgrade Control Tier, and I had it working for the case where an upgrade needed to occur. But when we ran it on a system where an upgrade didn't need to occur, we found I had forgotten to put "not_if" in a couple of places (I had it in some). So, to get the recipe releasable, I had to go through all the questions. If I later do some refactoring, I probably don't need to spend as much time thinking through all the use-cases.

Chef has nice features like "not_if" for its Resources, but you still have to remember to use them.
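
For example, the kind of guard I mean looks something like this (the paths and the version-marker file are invented for illustration):

    execute "upgrade-controltier" do
      command "/opt/controltier/bin/upgrade.sh"
      # skip the upgrade if the marker left by a prior upgrade exists
      not_if { ::File.exist?("/opt/controltier/.upgraded-to-3.2") }
    end

Without the not_if, that upgrade script fires on every chef-client run, whether it's needed or not.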

Dan

Sean OMeara

Aug 24, 2011, 11:47:31 AM
to devops-t...@googlegroups.com
Hi Dan,

Chef provides an easy way to create custom resources. If you ever find
yourself doing something twice, you can embed your not_ifs and
only_ifs in them.

Check this page out here:
http://wiki.opscode.com/display/chef/Lightweight+Resources+and+Providers+(LWRP)
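
The shape of it is roughly this (all names invented here - the wiki page has the real details). In resources/app_upgrade.rb you declare the interface:

    actions :run
    attribute :version, :kind_of => String, :required => true

and in providers/app_upgrade.rb you implement it, guards included:

    action :run do
      execute "upgrade app to #{new_resource.version}" do
        command "/opt/app/bin/upgrade.sh #{new_resource.version}"
        # guard lives in the provider, so callers can't forget it
        not_if { ::File.exist?("/opt/app/.version-#{new_resource.version}") }
      end
    end

Every cookbook that uses the resource then gets the not_if for free.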

-s

Brian Henerey

Aug 24, 2011, 3:53:58 PM
to devops-t...@googlegroups.com
Hi all,
I've had a similar situation recently, and I like those questions. Much of my use of Chef in the last 10 months or so has involved running chef-client on EC2 nodes that I don't use for a long period of time. I terminate them often and spin up new ones, so the recipes applied are good at configuring them how I intend from a vanilla AMI. I've recently started to want to make changes to longer-running systems, and I don't have as much confidence that the recipes will work as well. Using not_if is helpful, but you have to think through the recipes a bit more deeply.

I wonder if somehow Devstructure's blueprint and blueprint-diff could be used: if you had a patched/updated system, you could compare its blueprint to that of a fresh install.

I haven't had time to investigate those. We've spent more time trying to shorten the feedback loop when recipes quit working as intended. We'd like to write more of them test-first with cucumber-chef, and plug those into a CI server, but alas, haven't had a lot of time for that lately either.


-Brian

Luke Kanies

Aug 24, 2011, 3:54:11 PM
to devops-t...@googlegroups.com
On Aug 24, 2011, at 8:39 AM, Dan Nemec wrote:

> Great question. Do you have any specific examples where Puppet (or any tool) helps?

We obviously haven't targeted this list of questions per se, but this kind of problem is definitely a big part of what we worry about. I'll answer what I can.

Note, of course, that my world view is heinously colored by my development of Puppet - we think about the world very differently than the other tools, and my guess is others wouldn't just give different answers, they'd give entirely different kinds of answers.

Providing the ability to answer this kind of question, and whatever else you wanted to throw at it, was a big driver in Puppet's development - Puppet compiles everything into a catalog, and then applies that catalog, and nearly everything you want to know comes from that catalog.  The only other tool I know of that has anything like Puppet's catalog is bcfg.

> 1) What will happen if this function runs on a brand-new system? (What are the prerequisites?)

Puppet can give you a complete list of what it manages, the order in which it will be applied, and then a complete list of what of that list actually needs to be done (i.e., a simulation/noop mode).  So, Puppet covers this question very well, and most related tools do ok.  AFAIK, Puppet is the only tool that can tell you what you're managing and the order in which it would run without actually just running it (although cfengine, at least, also supports a dry-run mode).
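
In concrete terms, the simulation is just a flag - something like:

    puppet agent --test --noop

(the exact invocation varies a bit by version), which prints everything Puppet would have changed without touching the system.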

What none of us can tell you is whether the thing will actually work - that is, if we do what you want, will it result in a functional service?  That's a Turing problem, so you're kind of stuck.

> 2) What will happen if this function runs on an existing system but has never been run before? (different prerequisites?)

This is largely a superset of the previous question.  It's pretty easy (today it requires some programming, but straightforward programming) to compare what the "new" puppet configuration will do vs. the old one, so you can see exactly what your new function will do and how it is different from what was there before.

If you weren't already using Puppet, then the question largely collapses into the previous question, other than the fact that you're already running services.  People are doing things like running configuration subsets for some cases like this, since their "new system" configuration is much larger and doesn't need to run as often.  We're working on supporting this directly in the tool, but it's pretty darn simple to hack up yourself.

> 3) What will happen if this function's behavior is changed from the last run?

Again, this is an area where Puppet's catalog makes a huge difference - you can easily see how this run's catalog is different from the past.

In fact, we had a customer who needed to upgrade 30k machines running Puppet, and their problem was they wanted to ensure the tool upgrade didn't have any effect on the configuration. We mechanically compiled the catalogs for all 30k machines using 0.24.8 and 2.6 and then diffed the results to confirm there was no difference. Based on the confidence they got from that comparison, we were able to start upgrading 10k machines a day, and it all went swimmingly.
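
The mechanics of that comparison are simple enough to sketch (hostname invented, and the exact command differs between those two versions):

    puppet master --compile web01.example.com > web01.catalog

Compile each node's catalog under both versions, then diff the two files; any difference shows up before you've touched a single machine.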

For those tools that support a dry-run, you could do the whole upgrade and run in dry-run mode until you're confident, but that's obviously a much harder problem requiring a lot more babysitting, and if it goes poorly you are in a bad situation until it's all fixed (e.g., you can't deploy new changes during that window, you might have to downgrade).

Note also things like the rspec-puppet project (https://github.com/rodjek/rspec-puppet), which allows you to write complete unit tests for your Puppet code, so you can have a ton of confidence before you even commit the code, and then Puppet's data gives you more checkpoints from commit through affecting the system.
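
A trivial example of what those specs look like (class and resources invented, and it assumes a spec_helper set up per the rspec-puppet README):

    require 'spec_helper'

    describe 'ntp' do
      # fail if the catalog ever stops managing these resources
      it { should contain_package('ntp') }
      it { should contain_service('ntpd').with_ensure('running') }
    end

You get a failing test the moment a change to your manifests drops or alters one of those resources.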

> 4) What will happen if this function is re-run on the same system with no other changes? (idempotence)

Puppet and Cfengine guarantee idempotence to the extent possible (which basically means for everything other than 'exec').  You should be able to achieve idempotence with Chef, but you're sending ruby scripts to all of your machines and running them as root, so you obviously can't guarantee idempotence.
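
(Even exec can usually be made idempotent by giving it a test - e.g. something like this, with the script and marker file invented for illustration:

    exec { 'load-schema':
      command => '/usr/local/bin/load_schema.sh',
      # only run if the marker file does not yet exist
      creates => '/var/lib/mydb/.schema_loaded',
    }

The creates/unless/onlyif parameters are the test; without one, the command fires on every run.)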

> 5) What will happen if the prior run failed? How will the function recover from a failed or partial run?

Hmm.  I guess this is kind of a question of statefulness - how stateful the system as a whole is.  And I think all of the major tools have low-level primitives and assumed idempotence that make them relatively stateless.  For us, our primitives (resources like file, package, etc.) can only rarely be in a kind of failure state, and Puppet handles those failures appropriately (e.g., failing in a useful way or fixing), AFAIK.

> Now that you mention it, the questions primarily apply to the initial release of a "function". Maintenance over time doesn't require as much question asking.

> The situation that brought this to mind was: we were writing a new recipe to upgrade Control Tier, and I had it working for the case where an upgrade needed to occur. But when we ran it on a system where an upgrade didn't need to occur, we found I had forgotten to put "not_if" in a couple of places (I had it in some). So, to get the recipe releasable, I had to go through all the questions. If I later do some refactoring, I probably don't need to spend as much time thinking through all the use-cases.

> Chef has nice features like "not_if" for its Resources, but you still have to remember to use them.

Ah.  Yeah, this basically couldn't happen in Puppet unless you used an exec without a test state, and in that case, that exec would fire every time so you'd figure it out pretty quickly.

Anyway, like I said, this whole thing is very colored by how Puppet works, but it was absolutely built to make answering this kind of question as simple as possible.


--
We all have strength enough to endure the misfortunes of others.
-- Francois de La Rochefoucauld

Aleksey Tsalolikhin

Aug 24, 2011, 5:06:10 PM
to devops-t...@googlegroups.com
On Wed, Aug 24, 2011 at 12:54 PM, Luke Kanies <lu...@puppetlabs.com> wrote:
> AFAIK, Puppet is the only
> tool that can tell you what you're managing and the order in which it would
> run without actually just running it (although cfengine, at least, also
> supports a dry-run mode).

Also, bcfg2 has interactive mode, where it prompts you "y/n?" before
making each change.
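
If memory serves, that's just the client's -I flag:

    bcfg2 -I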

Best,
-at

Noah Campbell

Aug 24, 2011, 6:19:31 PM
to devops-t...@googlegroups.com
Not to be too glib, but if any change requires a new system image, the way Netflix rolls, you only need to consider one question:  What will happen if this function runs on a brand-new system?

Now… this isn't a panacea. Other questions start cropping up, like: how do you manage per-instance/per-environment configuration? How do you handle data between upgrades? How do you accommodate downtime as one instance spins up? But those questions may need answering regardless.

-Noah




Nathaniel Eliot

Aug 25, 2011, 12:02:27 PM
to devops-t...@googlegroups.com
+1

Rebuilding from scratch and importing old data is generally less error-prone than upgrading existing systems. This is especially true if the systems have lived for a long time (or in a high-velocity development shop, where changes come in all the time). You get to do dry-runs of every step, which you *cannot* do reliably when altering existing systems.

It can't solve every problem, but it can make many of them more manageable.

--
Nathaniel Eliot
T9 Productions

Luke Kanies

Aug 25, 2011, 12:44:56 PM
to devops-t...@googlegroups.com
It's great to say that no one should make attempts to manage brownfield systems and should just blow the whole thing away and start over every time they change management practices, but that's just not practical in the vast majority of cases.

It's a very high cost to pay for just not having decent tools.  Puppet works fantastically for managing teeny tiny bits of your systems, and for bringing completely unmanaged systems into somewhat managed or nearly entirely managed states.  Yes, it also works well for green field, but even green field is usually relying on recent history and tools to know what to do.


--
The most overlooked advantage to owning a computer is that if they foul
up there's no law against wacking them around a little. -- Joe Martin

Nathaniel Eliot

Aug 25, 2011, 1:49:36 PM
to devops-t...@googlegroups.com
I'm not saying don't manage brownfield systems. I'm saying design new systems so they don't become brownfield systems, and you'll have a much simpler admin loop. Migrating old brownfield setups into greenfield implementations can often be done piecemeal, and produces similar advantages.

We use Chef. Neither it nor Puppet seems a panacea against cruft, and neither is build-from-scratch: recipes and manifests can get just as crufty as system config itself. When done right, though, scratch-building constrains cruft to known sources of truth, instead of the many reasons a long-running system might have a given configuration (such as "employee who did something manually, and left the company without documenting it"). Which makes it all a bit more manageable.

--
Nathaniel Eliot
T9 Productions


Luke Kanies

Aug 25, 2011, 1:51:14 PM
to devops-t...@googlegroups.com
I definitely agree building from scratch is preferable when available.  I just wish it were available more often. :/


--
Every great advance in natural knowledge has involved the absolute
rejection of authority. --Thomas H. Huxley

John Vincent

Aug 25, 2011, 1:59:53 PM
to devops-t...@googlegroups.com
All it takes to successfully manage a brownfield system is one free
system. If you're virtualized, more's the better. I would HIGHLY
advise against trying to migrate an existing system to a new model.
It's like the little puzzle games where you have one space free on the
board to move pieces around - annoying but it works. Just rebuild the
simplest possible component and work your way up freeing up resources
as you go along.

Mind you, that's my opinion but I honestly think trying to retrofit is
an exercise in frustration. For the first few months here, I was
trying to work around the concept of a "legacy" tag on systems before
I finally said, screw it. I was getting deep in conditional hell and
not making any real progress.

--
John E. Vincent
http://about.me/lusis

Noah Campbell

Aug 25, 2011, 2:30:38 PM
to devops-t...@googlegroups.com
This discussion sounds like the 4 steps described in Visible Ops.

One of the fundamentals in the book is a repeatable build library. Having one means that once a system gets dorked, it's quickly thrown away and a new one brought up in its place. The book was written before AWS was commonplace or Puppet and Chef were on the scene, meaning it was all done with old-school Unix power tools.

-Noah

Noah Campbell
415-513-3545
noahca...@gmail.com

Eric Shamow

Aug 25, 2011, 2:33:22 PM
to devops-t...@googlegroups.com
+1 on this. It's not always possible initially, but my strategy has always been to start with ONE server I trust, and then expand that zone of trust by rebuilding servers and moving them inside that zone, one at a time.

Not always an option for everything immediately, true, but where it can be done it should be done... it's like dealing with a compromised system. You can clean it as much as you like, but the only way to trust it is to rebuild.

-Eric

--
Eric Shamow


Eric Shamow

Aug 25, 2011, 2:34:44 PM
to devops-t...@googlegroups.com
For those unfamiliar with Visible Ops, this is a surprisingly comprehensive summary I often use:

http://www.wikisummaries.org/Visible_Ops

It's a *great* and very short book, especially valuable if you're taking over a server environment that's a total mess and don't know where to start. I found it just after I needed it most, but it confirmed for me that I was on the right track and helped me formulate a roadmap forward.

-Eric

--
Eric Shamow


Spike Morelli

Aug 25, 2011, 2:35:10 PM
to devops-t...@googlegroups.com
Hey Luke,

what would you say are the reasons why this is not more often available?

Luke Kanies

Aug 25, 2011, 3:19:51 PM
to devops-t...@googlegroups.com
Basically, I think the cost of a full rebuild is often too high relative to the reward.

E.g., one of the very common brownfield use cases we see, and something we recommend very strongly, is people have a small problem that causes a lot of pain on a lot of machines, such as maintaining the ldap configurations for user authentication.  The value of fixing that problem is moderate to small - at worst, it costs them 5 hours a week or whatever.  So, if you go to them and say, hey, you could fix that and a bunch more by beginning a gradual rebuild of your network, and in 6-18 months the whole thing will be done....  You're basically saying they should spend 18 months working on a problem that only costs them 5 hours a week.

Yes, they get a lot more from it than that 5 hours, but the cost is a lot higher, too.

If you instead say, hey, spend 20 hours this week automating that one piece of your infrastructure, then they get full ROI in 5 weeks, and they can start spending that 5 extra hours on other automation and management.

Solving small problems shouldn't require big investments, and they shouldn't have to pile up to the point where a big investment is the only solution.  I always prefer attacking the most painful part of the system, automating away the pain, and then picking the next problem that shows up in the pez dispenser.  Even if all this does is get you the breathing space you need to invest in a full rebuild, it's a far preferable way to work toward that than just saying, no, you can't have that 5 hours a week back until we've done a full rebuild.

This is essentially how we sell against the big commercial products like BladeLogic and OpsWare - they want to sell you a million dollar contract and promise to automate your whole infrastructure, whereas we're happy to sell you a $10k contract and promise to make your life better within a week. Low investment, fast ROI, and most importantly, you can always size your investment based on the expected value of the result.


--
The Internet, of course, is more than just a place to find pictures of
people having sex with dogs. -- Time Magazine, 3 July 1995

James White

Aug 25, 2011, 4:37:50 PM
to devops-t...@googlegroups.com
It also doesn't hurt that using BladeLogic made me want to kill myself. </justsayin'>

Matthew Macdonald-Wallace

Aug 26, 2011, 2:13:08 AM
to devops-t...@googlegroups.com
I completely agree with this approach.

When I started at $dayjob we had about 10 clusters performing
different tasks (SMTP/Web/IMAP etc) all of which had been built by
hand following documentation which included helpful hints such as:

"Copy the configuration file for this service from an existing cluster
node. If there aren't any other nodes, find someone who can write the
configuration file for you"

My first step was to introduce puppet to the team and gather their
thoughts. One member of the team liked the concept so much he
installed it on one of our clusters and configured it so that it
pushed *everything* to the servers that required it - including
pre-compiled binaries of things like PHP to /usr/local/bin etc.

I quickly pointed out that this was not the best use of puppet and we
set about writing our manifests for *one* of the clusters as we needed
to add a new node.

It used to take us approx. 10 working hours to build a server and even
then we had absolutely no guarantee that it would work when we put it
live.

We took those ten hours (and possibly a few more) and invested them in
creating a hierarchical set of puppet manifests similar to the
following:

* "base" - configures things that are relevant to all systems either
directly or through including other modeuls (nrpe/mcollective etc).
* service specific class - smtp for example
* cluster specific class - shares the name with the cluster, includes
the relevant service class(es) and the base class
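
In rough puppet terms, the shape is something like this (class and
cluster names invented here, obviously):

    class base {
      include nrpe
      include mcollective
    }

    class smtp {
      package { 'postfix': ensure => installed }
      # plus the service, config templates, etc.
    }

    class cluster_mail01 {
      include base
      include smtp
    }

Each node then just gets its cluster class and picks up everything
else through the includes.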

Within a very short time, we could easily recreate a virtual version
of this particular cluster using cobbler etc.[0]

Our build times went from c. 10 hrs to around 15 minutes - *and* we
could be sure that it would work.

We built the one node, and about six months later (we were paranoid
about finding a bug in the configuration!) we rebuilt the entire
cluster using *exactly the same manifests*.

We now have a general rule: "if it's new (where "new" includes hardware
refresh, new features/services etc.) it gets built from puppet." Sure,
this means that we still have some clusters which are not using
puppet; we just don't have the time to rebuild them all.

We know that at some point in the future, as we move away from the
systems I inherited, all of our kit will be managed by puppet, and
this makes us happy... :)

One more thing - if you store your puppet manifests in some form of
SCM (git/svn etc) then you also get a "free" change management system.
If you do a puppet run and something breaks, just check the git/svn
log to see who worked on that class/template/manifest last and see
what changed from the diffs - this has been a life-saver on a number
of occasions.
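
In git terms that's just something like (path invented - whichever
manifest broke):

    git log -p manifests/site.pp

and the most recent commits touching that file usually point straight
at the culprit.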

If anyone is interested in knowing more about this, please feel free
to contact me off-list and I'll respond as soon as I can with as
much information as I can (commercial confidentiality being maintained,
of course!)

Cheers,

Matt

[0] Shameless plug: You can read about configuring some of the things
mentioned in this email on my blog at www.threedrunkensysadsonthe.net
