Application deployment vs system configuration

gareth rushgrove

May 17, 2011, 3:07:37 PM
to devops-t...@googlegroups.com
Something that interests me (in that I change my mind about the
answer) and was discussed a little at the last devopsdays Europe is
where application deployment and configuration management collide.

Mentioning now because Adam Jacob said (hopefully not out of context?)
on the Windows automation thread:

"If you have the luxury of starting from scratch (or nearly scratch)
the way you want to model app-deployment is as a part of system
configuration"

Ignoring the practicalities/limitations of existing systems, is this
where everyone else would start?

Are there any reasons to continue with push tools like capistrano or
well-loved bash|fabric|ant scripts for deployment except for inertia?

G

--
Gareth Rushgrove
Web Geek

morethanseven.net
garethrushgrove.com

Grig Gheorghiu

May 17, 2011, 3:16:15 PM
to devops-t...@googlegroups.com
I still think it makes sense to use push-based tools for application
deployment, unless you have hundreds of servers that need to be
updated. For a manageable number of servers, I like the luxury of
knowing exactly which servers are targeted during the rollout of a new
version of the application.

In an ideal world, I would probably use the pull/async model, but I
would still split the target servers into clusters and roll out to one
cluster at a time, with a way to roll back in case of errors. This all
is much easier said than done, but I would love to find out how people
do it, if they do it.
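
As a rough illustration of the cluster-at-a-time idea above (a sketch only, not anyone's actual tooling from this thread), the loop might look like the following in Python. The `deploy` and `healthy` callables and the host names are hypothetical stand-ins for whatever push tool and health check you already have.

```python
from typing import Callable, Dict, List


def rolling_deploy(
    clusters: Dict[str, List[str]],
    new_version: str,
    old_version: str,
    deploy: Callable[[str, str], None],
    healthy: Callable[[str], bool],
) -> bool:
    """Deploy one cluster at a time; roll back every touched host on failure."""
    touched: List[str] = []
    for _name, hosts in clusters.items():
        for host in hosts:
            deploy(host, new_version)
            touched.append(host)
        if not all(healthy(h) for h in hosts):
            # Stop here and restore the previous version on every host we changed.
            for host in reversed(touched):
                deploy(host, old_version)
            return False
    return True


if __name__ == "__main__":
    installed: Dict[str, str] = {}

    def fake_deploy(host: str, version: str) -> None:
        installed[host] = version  # stand-in for an ssh/fabric/capistrano call

    def fake_healthy(host: str) -> bool:
        return host != "web3"  # pretend web3 fails its health check

    ok = rolling_deploy(
        {"cluster-a": ["web1", "web2"], "cluster-b": ["web3", "web4"]},
        "2.0", "1.0", fake_deploy, fake_healthy,
    )
    print(ok, installed)
```

Later clusters are never touched once an earlier one fails its check, which is what keeps the blast radius to a single cluster plus the ones already verified.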

Grig

Ian Chilton

May 17, 2011, 3:21:04 PM
to devops-t...@googlegroups.com
Hi,

I don't have any sound reasoning, but my feeling is to keep deployment separate to configuration management.

I'd rather have the system managed with configuration management and then have application specific deployment tools.

I also prefer push deployment and I don't want publicly accessible (production) servers to have access to the source control or a source control loaded working directory.

Ian

gareth rushgrove

May 17, 2011, 3:29:58 PM
to devops-t...@googlegroups.com
(Having kicked this off I'll probably play devil's advocate for a bit)

On 17 May 2011 20:16, Grig Gheorghiu <grig.gh...@gmail.com> wrote:
> I still think it makes sense to use push-based tools for application
> deployment, unless you have hundreds of servers that need to be
> updated.

Is there a line in the sand here? > 100 servers do x, < 100 servers do
y. Or is it never that clear cut? OK, so everyone thinks their
application is unique, but how do you determine that number? Is it just
a feeling as you grow, or is it based on some quantifiable properties?
Throughput, database read/write ratio, chosen
webserver/framework/language/platform?

G

Joshua Timberman

May 17, 2011, 3:46:46 PM
to devops-t...@googlegroups.com
Hello!

On Tuesday, May 17, 2011 at 1:07 PM, gareth rushgrove wrote:
> Something that interests me (in that I change my mind about the
> answer) and was discussed a little at the last devopsdays Europe is
> where application deployment and configuration management collide.
>
> Mentioning now because Adam Jacob said (hopefully not out of context?)
> on the Windows automation thread:
>
> "If you have the luxury of starting from scratch (or nearly scratch)
> the way you want to model app-deployment is as a part of system
> configuration"
>
> Ignoring the practicalities/limitations of existing systems, is this
> where everyone else would start?

It's more than just system configuration. The configuration is there to facilitate a system state that you've defined as policy for your infrastructure. When you're looking at the entire infrastructure itself, every application component needs to be integrated with everything else. This means treating the application itself as a managed resource that should be in a particular state, such as "deployed".

When scaling from a small environment to a large one, I've found it is much easier to reason about application deployment when it is included in the rest of system "configuration." It also means you have one canonical location in your "infrastructure as code" to look for how applications are deployed. Push-based application deployment tends to be prone to failure at scale in my experience.


--
Joshua Timberman, jos...@housepub.org
http://twitter.com/jtimberman | http://jtimberman.posterous.com

Jordan Sissel

May 17, 2011, 3:50:53 PM
to devops-t...@googlegroups.com
On Tue, May 17, 2011 at 12:07 PM, gareth rushgrove <gareth.r...@gmail.com> wrote:
> Something that interests me (in that I change my mind about the
> answer) and was discussed a little at the last devopsdays Europe is
> where application deployment and configuration management collide.
>
> Mentioning now because Adam Jacob said (hopefully not out of context?)
> on the Windows automation thread:
>
> "If you have the luxury of starting from scratch (or nearly scratch)
> the way you want to model app-deployment is as a part of system
> configuration"


I just view the system as another "application" in the instance you are describing above; granted, "application" is likely the wrong word, and a term that encompasses both an 'application' and a 'system config' would be better. I might have a frontend that depends on a mysql server, but that frontend also depends on the system. You can use one tool to manage the frontend, database, and system, or you can use 3 separate tools to manage each individually. Scale usually factors in, right? You might have 1000 systems, but a much smaller number of each 'application' type - 30 frontends, 3 databases, 500 hadoop, whatever.

At my current job, I use puppet to do everything (apps and systems). Why? There's no resource constraint that makes puppet behave poorly, so I haven't wanted to make or use another tool for apps or systems. At my previous job, I used puppet for systems and hudson jobs (and custom scripts) for deployment - why not puppet? We wanted more highly-orchestrated deployments (iterate across servers serially, health checking in between, halting on failures), and bugs/features in rpm/yum prevented me from changing package versions as arbitrarily as needed.

That said, I start by treating them all the same and then making modifications based on scale/business needs.

 -Jordan

gareth rushgrove

May 17, 2011, 3:54:30 PM
to devops-t...@googlegroups.com
On 17 May 2011 20:46, Joshua Timberman <gro...@housepub.org> wrote:
> Hello!
>
> On Tuesday, May 17, 2011 at 1:07 PM, gareth rushgrove wrote:
>> Ignoring the practicalities/limitations of existing systems, is this
>> where everyone else would start?
>
> When scaling from a small environment to a large one, I've found it is much easier to reason about application deployment when it is included in the rest of system "configuration." It also means you have one canonical location in your "infrastructure as code" to look for how applications are deployed. Push-based application deployment tends to be prone to failure at scale in my experience.
>

So what's the definition of small and large in this scenario?

And what causes the "failure at scale"?

Is it easier/better to go with a push based mechanism below that
threshold? Or is it better to always go for a pull/integrated
approach?

G

--

John Vincent

May 17, 2011, 3:54:46 PM
to devops-t...@googlegroups.com
I'm currently working around this issue right now. While I would
totally agree with Adam to an extent, I think that in many cases using
your CM tool for application deployment isn't always an option. In our
case, here are the limitations (as a Java shop):

- Two repositories to manage configuration.
- Integration into the development ecosystem
- Multi-step deployment process

Having two repositories to manage configuration is almost a
non-starter. Let's assume we have the following classes of
configuration "atoms":

- Volatile
- Environment-specific
- Fairly static

In most cases, the static stuff we can ignore. It can be managed as
part of the normal codebase and most likely will be packaged as part
of a jar somewhere. However, when we get to the environment specific
stuff, we now need to resort to ERB templates that chef can populate.
Those are stored in a different repo than the code base. We can't use
any fancy submodule magic because the main code base is SVN and the
chef repo is git. So now, when we add a new configuration atom that
needs to be environment aware, we have to manage it in TWO places. We
need a local copy in the SVN repo that developers can use for local
development and a templated copy in Chef. This will almost always fall
down when a configuration setting doesn't get added to the template.

However assuming we get that worked out, we now have these highly
volatile configurations that, if we're lucky, aren't environment
specific. If they are, we've just created additional overhead. The
"solution" is probably to use something like Vagrant but that's a
mindset change in the midst of another mindset change.

Then we have the issue of integration into the development ecosystem.
How do we manage those artifacts in Maven when they're needed in Chef?
Do we define dependencies in two places, once in Maven and once in a
Data Bag? Or do we simply build with Maven and generate system
packages to install? Right now we package our configurations as Maven
artifacts because that's the ecosystem the developers use and know.
The java application cookbook in Chef is awesome but it's fairly
simplistic. For any moderately complex java application, it's not just
a single war. There are various settings that live outside the war file -
log4j settings, spring property files. If we package those configs in
the war, we're no farther along than we were. So is the solution to
deploy using the maven artifacts (config tarball + war file) and then
overwrite the configs with Chef? That feels like an antipattern.

Finally, the multi-step deployment process. This is something we're
struggling with now. We, like many people, don't run in daemon mode.
Updating the data bag and then logging on to a given server to stage
the rollout doesn't make sense. Especially at any amount of scale.
Knife is a great tool but it's not a tool I want to use to manage
client runs across 100s or 1000s of machines.

In my perfect world, code commits trigger action and everything from
that point on is automated. Even if we were starting from scratch, I
don't think the tight integration makes sense for the Java world. I'm
pretty bullish (as everyone knows) on differentiating certain types of
configurations. Our solution so far has been to do things like using
haproxy on local machines to load balance across various service
clusters (so essentially every app server that needs to talk to FOO
services talks to localhost:someport). This allows us to manage that
environment aware and volatile configuration with Chef without
impacting the developer workflow. The rest of the stuff we're moving
into a system like Noah.

Hope my rambling made some sense ;)

--
John E. Vincent
http://about.me/lusis

patrick.debois

May 17, 2011, 4:02:43 PM
to devops-t...@googlegroups.com
I concur with Jordan's and Vincent's experiences.

The main issue I had was that development was more used to capistrano in the beginning.
I think it is key that you can use the same process/tools during dev/test and prod work.

Example approaches I've used:
1) use vagrant and chef recipes so they can test the deployment
2) or drive capistrano from chef

Our biggest problems were in who handles the config files and where you put them. Do you put them in the source code, or in the recipes/manifests?
We usually define one source project now that has both the code and the manifests as submodules, and everybody checks out all codebases.

P.S. I think this is unrelated to the push or pull model.

Patrick

Grig Gheorghiu

May 17, 2011, 4:07:42 PM
to devops-t...@googlegroups.com
Then it would be great if people advocating 'do everything with your
config mgmt tool of choice' (Joshua, Jordan and Patrick so far) would
go into some detail about the exact workflow that they apply during an
application rollout.

Grig

peco

May 17, 2011, 4:10:39 PM
to devops-toolchain
Just to chime in an opinion here :)

I think the push vs pull topic is probably a separate discussion than
deploy vs configuration mgmt (I assume the context here is system
configuration).

Deploy vs configuration management
Deploy I would define as moving code and making changes to a system.
It could be deploying code or the newest mysql rpm, or activating a
new configuration. Configuration management to me is more the art and
science of making sure the right bits are in the right place, which
mostly entails knowing what are the right bits and what are the right
places.

Push vs pull
Push and pull are really runtime architectures. I think you can design
a "push" system to scale just as well as a "pull" system (I know there
are religious wars waged on this front and I am not trying to stir the
pot). When I think of push, I think in terms of attended / interactive
operation - I run a command and get feedback on its execution as the
environment is changing. Pull I would qualify as unattended/automatic
operation - the management system updates the environment
automatically based on certain conditions.

So, should pull or push be used for deployment or config management, I
am not sure I could generalize. I would look at how specific systems
like chef, puppet, capistrano match your devops workflows and how they
behave in your environment.

My 2 c :)


On May 17, 2:29 pm, gareth rushgrove <gareth.rushgr...@gmail.com>
wrote:
> (Having kicked this off I'll probably play devil's advocate for a bit)
>

Jordan Sissel

May 17, 2011, 4:21:35 PM
to devops-t...@googlegroups.com
On Tue, May 17, 2011 at 1:07 PM, Grig Gheorghiu <grig.gh...@gmail.com> wrote:
> Then it would be great if people advocating 'do everything with your
> config mgmt tool of choice' (Joshua, Jordan and Patrick so far) would
> go into some detail about the exact workflow that they apply during an
> application rollout.
>
> Grig


compile code, test, deploy. Right?

We build .deb packages for apps, so when it's time to deploy to staging, prod, or elsewhere, the steps are generally:

* 'make package'
* publish myapp.deb
* edit "deployment info" to update the package version that should be installed.
* build "deployment.deb" (contains deployment info)
* wait for next puppet run

The 'deployment info' is a directory like this:
/deployments/ENVIRONMENT/config.csv

Prod would be, for example, /deployments/prod/config.csv - It uses puppet's extlookup. It's a csv file that looks like this:

package/loggly-frontend,20110512184053.trunk
package/loggly-monitoring-solrserver,latest
config/throttle_enable,true
config/stats_enable,true
config/infrastructure/iptables-management,true

etc. Basically each key is something like 'package/loggly-frontend', and puppet uses the value to select which version gets installed. There are also other keys like 'config/foo' that turn feature flags on or off (for systems and apps).
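
To make the key/value layout concrete, here is a small Python sketch that reads rows in the format shown above and separates package versions from feature flags. It only illustrates the file layout; it is not how puppet's extlookup is implemented, and the function name `load_deployment_info` is made up for this example.

```python
import csv
import io
from typing import Dict, Tuple

SAMPLE = """\
package/loggly-frontend,20110512184053.trunk
package/loggly-monitoring-solrserver,latest
config/throttle_enable,true
config/stats_enable,true
config/infrastructure/iptables-management,true
"""


def load_deployment_info(text: str) -> Tuple[Dict[str, str], Dict[str, bool]]:
    """Split key,value rows into package versions and boolean feature flags."""
    packages: Dict[str, str] = {}
    flags: Dict[str, bool] = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        key, value = row[0].strip(), row[1].strip()
        kind, _, name = key.partition("/")
        if kind == "package":
            packages[name] = value
        elif kind == "config":
            flags[name] = value.lower() == "true"
    return packages, flags


if __name__ == "__main__":
    packages, flags = load_deployment_info(SAMPLE)
    for name, version in packages.items():
        print(f"ensure {name} is at version {version}")
    print(flags)
```

Changing one line of the CSV (say, the version next to `package/loggly-frontend`) is then the whole "deploy" action from the operator's point of view.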

Every puppet run installs the latest 'deployment' deb. So two builds + published packages and a change to the deployment config == new application version deployed.

Puppet runs are staggered slightly so upgrades happen randomly. In the future, we'll probably move to more orchestrated (but still automated) deployments.

-Jordan

Grig Gheorghiu

May 17, 2011, 4:25:47 PM
to devops-t...@googlegroups.com

Awesome, that's exactly the level of detail I was hoping for. Thanks
for sharing!

Grig

Joshua Timberman

May 17, 2011, 4:27:31 PM
to devops-t...@googlegroups.com
On Tuesday, May 17, 2011 at 1:54 PM, gareth rushgrove wrote:
> So what's the definition of small and large in this scenario?

"It depends". The most generalized answer I have is a question, "How many can you reason about in your head?"

There's a number of reasons for this, most of them having to do with psychology and how many things humans can track in their heads at any given point in time.

> And what causes the "failure at scale"?
>
> Is it easier/better to go with a push based mechanism below that
> threshold? Or is it better to always go for a pull/integrated
> approach?

For all three of these questions, see push vs pull at infrastructures.org

http://www.infrastructures.org/bootstrap/pushpull.shtml

The short answer is, because sometimes hosts are down or otherwise unreachable, especially in a large distributed infrastructure.
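
A toy Python sketch of that failure mode, under the assumption that a push visits each host exactly once: any host that is unreachable at that moment simply keeps its old version until someone pushes again. The function and host names are hypothetical.

```python
from typing import Dict, Set


def push_once(desired: str, installed: Dict[str, str], unreachable: Set[str]) -> Set[str]:
    """Push `desired` to each host once; return the hosts the push could not reach."""
    missed: Set[str] = set()
    for host in installed:
        if host in unreachable:
            missed.add(host)  # the push has no way to update this host right now
        else:
            installed[host] = desired
    return missed


if __name__ == "__main__":
    fleet = {"web1": "1.0", "web2": "1.0", "web3": "1.0"}
    missed = push_once("1.1", fleet, unreachable={"web2"})
    print("missed:", sorted(missed))
    print("fleet:", fleet)
```

In a pull model the missed host would pick up the desired version on its own the next time it checks in, so nobody has to track the `missed` set by hand.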

John Vincent

May 17, 2011, 4:29:39 PM
to devops-t...@googlegroups.com
For the record, I would *LOVE* to do everything in Chef or Puppet but
I don't know that it can work in all ecosystems.

Here's a good example. One of the reasons I was hired for my current
company was to "kaching-ify" the environment. However retrofitting
that onto the existing environment is next to impossible. It's not a
bad environment. It just wasn't designed with automatic scaling in
mind. I eventually decided that the best bet, and the one that
everyone is cool with, is to replace all of our existing EC2 instances
for a given component at a time. I went through the gamut of tagging
machines as legacy vs. non-legacy and trying to write cookbooks that
took that into account. It wasn't happening.

So in essence I *AM* starting from scratch. I'm moving configurations
around per-application as I described. I'm using Jenkins for
self-service builds. However there's still the disconnect around
managing application configuration and the overlap with system
configuration (what I'm calling environment-aware configuration) that
makes sense at scale. As I said, the ultimate goal is code deploys are
triggered by commit message and scaling is ENTIRELY automatic.

--

Adam Jacob

May 17, 2011, 4:30:01 PM
to devops-t...@googlegroups.com
On Tue, May 17, 2011 at 12:07 PM, gareth rushgrove
<gareth.r...@gmail.com> wrote:

To clarify my actual position: I think the goal here is a fully
automated infrastructure, soup to nuts. That means if I have new
systems to roll out, the entire process is automated - in the places
where it isn't, it's a documented thing and part of the business
process.

Secondary to this is the desire to be able to re-build, from scratch,
the entire business from nothing but source code, an application data
backup, and bare metal hardware. You can do that with multiple tools -
I've found that if I start with that goal in mind, I rarely need a
second step.

Push/Pull is, in my mind, a red herring here - you can push if you
want to with tools like Chef and Puppet, just like you can with
fabric/func/capistrano. Which you choose is largely a function of the
application itself, and the workflow needs you have around
orchestrating the various state transitions.

For 99% of the people I've talked to, they think about the common
transitions inherent in things like Application Deployment as gates
and phases: you have a phase that happens on some set of systems, that
completes a gate (think a guard statement, like "all the servers have
copied the war file") that moves them on to the next phase ("all the
servers reload tomcat"). You have lots of variation around what
happens with failure, and how much (and when) something needs to be
guarded.
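
The gates-and-phases model can be sketched in a few lines of Python. This is a minimal illustration under simple assumptions (every phase runs on every host, and the gate is "all hosts succeeded"); `run_phases` and the phase names are invented for the example and do not correspond to any Chef or Puppet API.

```python
from typing import Callable, List, Sequence, Tuple

# A phase is a name plus a per-host action that returns True on success.
Phase = Tuple[str, Callable[[str], bool]]


def run_phases(hosts: Sequence[str], phases: Sequence[Phase]) -> List[str]:
    """Run each phase on every host; the gate opens only if all hosts succeed."""
    completed: List[str] = []
    for name, action in phases:
        results = [action(host) for host in hosts]
        if not all(results):
            # Gate failed: no host moves on to the next phase.
            break
        completed.append(name)
    return completed


if __name__ == "__main__":
    def copy_war(host: str) -> bool:
        print(f"{host}: copy war file")
        return True

    def reload_tomcat(host: str) -> bool:
        print(f"{host}: reload tomcat")
        return True

    done = run_phases(
        ["app1", "app2"],
        [("copy war", copy_war), ("reload tomcat", reload_tomcat)],
    )
    print("completed phases:", done)
```

The variation mentioned above lives in the gate: whether a single failure blocks everyone, whether failed hosts are dropped and the rest continue, and which phases need a gate at all.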

As an example, we roll out updates to Opscode entirely with Chef, but
we trigger the event - essentially we push. We built the application
to be free of the need for very many gates - each individual server
can update itself without having to worry about orchestrating with
others. Now, we gained that ability intentionally - the ability to
operate our application was in our minds when we built it.

In the real world application deployment is all about the application
- the devil is entirely in the details. If you get to start from
scratch, I would recommend you put your effort in to making sure the
application can be deployed with as little orchestration as possible.
If you don't get to start from scratch, then I would do what fits the
existing workflow best first, and work towards the world where you can
remove orchestration. :)

Adam

--
Opscode, Inc.
Adam Jacob, Chief Product Officer
T: (206) 619-7151 E: ad...@opscode.com

Dan Sully

May 17, 2011, 4:30:07 PM
to devops-t...@googlegroups.com
* Jordan Sissel shaped the electrons to say...

>Every puppet run installs the latest 'deployment' deb. So two builds +
>published packages and a change to the deployment config == new application
>version deployed.

How do you handle rollbacks?

Using system level packaging, you can't do this atomically.

And in your case, it sounds like upgrades are eventually consistent by design.

--dan

--------------------------------------------------------------
<dsully> please describe web 2.0 to me in 2 sentences or less.
<jwb> you make all the content. they keep all the revenue.

John Vincent

May 17, 2011, 4:35:23 PM
to devops-t...@googlegroups.com
Not answering for Jordan, but this is where the development side of
the house has to step up. Tightly wound dependencies between disparate
applications and lack of backwards compatibility in code bases is a
killer. If you're starting from scratch, you can usually change
people's mindset early on to stop doing stupid shit like using RMI and
move to a web service approach between components.

--

Adam Jacob

May 17, 2011, 4:43:51 PM
to devops-t...@googlegroups.com
On Tue, May 17, 2011 at 1:29 PM, John Vincent <lusi...@gmail.com> wrote:
> For the record, I would *LOVE* to do everything in Chef or Puppet but
> I don't know that it can work in all ecosystems.
>
> Here's a good example. One of the reasons I was hired for my current
> company was to "kaching-ify" the environment. However retrofitting
> that onto the existing environment is next to impossible. It's not a
> bad environment. It just wasn't designed with automatic scaling in
> mind. I eventually decided that the best bet, and the one that
> everyone is cool with, is to replace all of our existing EC2 instances
> for a given component at a time.  I went through the gamut of tagging
> machines as legacy vs. non-legacy and trying to write cookbooks that
> took that into account. It wasn't happening.
>
> So in essence I *AM* starting from scratch. I'm moving configurations
> around per-application as I described. I'm using Jenkins for
> self-service builds. However there's still the disconnect around
> managing application configuration and the overlap with system
> configuration (what I'm calling environment-aware configuration) that
> makes sense at scale. As I said, the ultimate goal is code deploys are
> triggered by commit message and scaling is ENTIRELY automatic.

We've talked a bit about this before, but not in public. :)

There are lots of different kinds of state - active state, desired
state, etc etc.

Nothing inherent in Chef, or any other tool, makes it so you couldn't
use it to solve the problems you describe. With Chef, you could have
builds triggered by Jenkins update the desired state of your
application through the API, so that when they get deployed they are
tightly coupled. You could be doing it on a per-version basis. You
could model it as baselines with differentials. All sorts of ways.

The bare bones reality of it is that you have to deal with the world
the way you find it, and you must leave it a better (read more
efficient and pleasant) place than you found it. I love the explosion
of tooling in the space - it's a reflection of the reality that
everyone's environment is different, and no two applications are alike.

The best thing you can do is step back from the problem (read
implementation) for a minute, and think about the actual business
case. You're doing that clearly: I want to go from commit to
deployment with no steps in between, including scale. I would argue
that whether you're doing it all with Chef or not, you're getting the
same net effect - one inflection point that kicks off a series of
relatively autonomous actions that bring themselves as close to fully
functioning as possible, and that any steps that are left are *also*
happening, and it's just a matter of time before everyone agrees on
the new world order.

gareth rushgrove

May 17, 2011, 4:44:44 PM
to devops-t...@googlegroups.com
On 17 May 2011 21:27, Joshua Timberman <gro...@housepub.org> wrote:
> On Tuesday, May 17, 2011 at 1:54 PM, gareth rushgrove wrote:
> So what's the definition of small and large in this scenario?
> "It depends". The most generalized answer I have is a question, "How many can you reason about in your head?"
>
> There's a number of reasons for this, most of them having to do with psychology and how many things humans can track in their heads at any given point in time.

As someone with a blog titled after a psychology paper from the 1950s
about this exact issue
(http://en.wikipedia.org/wiki/The_Magical_Number_Seven,_Plus_or_Minus_Two)
I like your answer :)

The question becomes how do you work out this number for your given
team/architecture? Also, that number should have a cool sounding name.

G

>> And what causes the "failure at scale"?
>>
>> Is it easier/better to go with a push based mechanism below that
>> threshold? Or is it better to always go for a pull/integrated
>> approach?
>
> For all three of these questions, see push vs pull at infrastructures.org
>
> http://www.infrastructures.org/bootstrap/pushpull.shtml
>
> The short answer is, because sometimes hosts are down or otherwise unreachable, especially in a large distributed infrastructure.
>
> --
> Joshua Timberman, jos...@housepub.org
> http://twitter.com/jtimberman | http://jtimberman.posterous.com
>

--

Jordan Sissel

May 17, 2011, 4:46:05 PM
to devops-t...@googlegroups.com
On Tue, May 17, 2011 at 1:30 PM, Dan Sully <dan...@electricrain.com> wrote:
> * Jordan Sissel shaped the electrons to say...
>
>> Every puppet run installs the latest 'deployment' deb. So two builds +
>> published packages and a change to the deployment config == new application
>> version deployed.
>
> How do you handle rollbacks?

apt-get handles downgrades just fine (via puppet's package provider).

> Using system level packaging, you can't do this atomically.

See above :)

> And in your case, it sounds like upgrades are eventually consistent by design.

I wouldn't say "design" because that sort of implies a requirement that selected this behavior. It was more of an "I'll start with puppet, and if it does well enough, I'll leave it alone" kind of thing.

-Jordan

Darrin Eden

May 17, 2011, 4:52:43 PM
to devops-toolchain
Interesting thread!

To toss in another data point:

I prefer to think of apps as deployed constantly, at random, and only
rolling forward. For instance, chef-client polling every thirty minutes
works just fine in my case.

However, many of the developers I work with want more control. I'm
fine with that too. As such I started hacking on a work flow catering
to a "push" point-of-view.

https://github.com/dje/jellyfish

Thanks!

Luke Kanies

May 17, 2011, 4:57:12 PM
to devops-t...@googlegroups.com

I've always thought the push vs. pull is a red herring in terms of the critical part of the conversation, partially because it's so easy to switch from one to the other - e.g., Puppet supports both just fine, using http, mcollective, capistrano, or whatever you want.

What I've become more interested in recently is focusing on the decision - who makes it, how it propagates, etc. For system stuff, we seem to be pretty comfortable having code deployed automatically, within half an hour or so, but as Darrin says, app developers generally prefer to be able to control when and how an app gets deployed.

Even doing that deployment could be done relatively easily in a pull model - e.g., we're working with one customer to have their Puppet clients check in every 60 seconds instead of every half hour, and they update based on db state, so it's all pull but centrally controlled and fits perfectly into a developer workflow without needing to use parallel ssh or whatever.
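
A rough Python sketch of "pull, but centrally controlled": each client checks a central record of the desired version and acts only when it differs from what is installed. The `desired_state` table, the `check_in` function and the use of SQLite are assumptions made for this illustration, not a description of how Puppet or that customer's setup works.

```python
import sqlite3
from typing import Optional


def desired_version(conn: sqlite3.Connection, app: str) -> Optional[str]:
    """Look up the centrally recorded version for an app, if any."""
    row = conn.execute(
        "SELECT version FROM desired_state WHERE app = ?", (app,)
    ).fetchone()
    return row[0] if row else None


def check_in(conn: sqlite3.Connection, app: str, installed: str) -> Optional[str]:
    """Return the version to install, or None if the host is already converged."""
    wanted = desired_version(conn, app)
    if wanted is None or wanted == installed:
        return None
    return wanted


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE desired_state (app TEXT PRIMARY KEY, version TEXT)")
    conn.execute("INSERT INTO desired_state VALUES ('frontend', '1.0')")
    installed = "1.0"
    print(check_in(conn, "frontend", installed))
    # A developer "deploys" by changing the central record...
    conn.execute("UPDATE desired_state SET version = '1.1' WHERE app = 'frontend'")
    # ...and the next poll picks the change up.
    print(check_in(conn, "frontend", installed))
```

The decision of when to deploy stays with whoever updates the central record; the short polling interval just bounds how long the clients take to notice.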

--
Be wary of the man who urges an action in which he himself incurs no
risk. -- Joaquin Setanti
---------------------------------------------------------------------
Luke Kanies -|- http://puppetlabs.com -|- http://about.me/lak


gareth rushgrove

May 17, 2011, 4:57:46 PM
to devops-t...@googlegroups.com

Maybe it's the use of the words new and from scratch here, but
incremental deployments (maybe even just a few lines or at least a few
commits) are more common. Both these scenarios sound like it's just
for new servers or for initial rollouts of brand new systems?

>
> Push/Pull is, in my mind, a red herring here - you can push if you
> want to with tools like Chef and Puppet, just like you can with
> fabric/func/capistrano. Which you choose is largely a function of the
> application itself, and the workflow needs you have around
> orchestrating the various state transitions.
>

Glad someone said that. I've recently been using fabric to trigger
chef/puppet runs on demand on relevant machines with some success.

> For 99% of the people I've talked to, they think about the common
> transitions inherent in things like Application Deployment as gates
> and phases: you have a phase that happens on some set of systems, that
> completes a gate (think a guard statement, like "all the servers have
> copied the war file") that moves them on to the next phase ("all the
> servers reload tomcat"). You have lots of variation around what
> happens with failure, and how much (and when) something needs to be
> guarded.
>
> As an example, we roll out updates to Opscode entirely with Chef, but
> we trigger the event - essentially we push. We built the application
> to be free of the need for very many gates - each individual server
> can update itself without having to worry about orchestrating with
> others. Now, we gained that ability intentionally - the ability to
> operate our application was in our minds when we built it.
>
> In the real world application deployment is all about the application
> - the devil is entirely in the details. If you get to start from
> scratch, I would recommend you put your effort in to making sure the
> application can be deployed with as little orchestration as possible.
> If you don't get to start from scratch, then I would do what fits the
> existing workflow best first, and work towards the world where you can
> remove orchestration. :)

So, sometimes the real world is annoying. I'm keen on ignoring that in
the spirit of idealism.

In the back of my mind when I posted this thread was whether anyone
fancies putting together some "best practice" examples of deployment?
Taking a really simple application (let's say a single file wsgi python
application, php file or sinatra app) what would the 'perfect'
deployment mechanism look like?

>
> Adam
>
> --
> Opscode, Inc.
> Adam Jacob, Chief Product Officer
> T: (206) 619-7151 E: ad...@opscode.com
>

--

Dan Sully

May 17, 2011, 5:23:31 PM
to devops-t...@googlegroups.com
* Jordan Sissel shaped the electrons to say...

>> How do you handle rollbacks?


>
>apt-get handles downgrades just fine (via puppet's package provider).
>>
>> Using system level packaging, you can't do this atomically.
>>
>See above :)

Right - apt can't atomically switch. You need to remove & re-install.

Not as easy or as fast as changing a symlink & restarting your process.
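
For comparison, the symlink approach mentioned here can be sketched as follows in Python, assuming a POSIX filesystem and a layout with one directory per release plus a `current` link (both the layout and the `activate_release` name are assumptions for this example). Restarting the process afterwards is left out.

```python
import os
import tempfile


def activate_release(release_dir: str, current_link: str) -> None:
    """Point current_link at release_dir by renaming a temporary symlink over it."""
    tmp_link = current_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # clear a leftover from an interrupted run
    os.symlink(release_dir, tmp_link)
    # The rename replaces the old link in one step on POSIX filesystems,
    # so readers see either the old target or the new one, never neither.
    os.replace(tmp_link, current_link)


if __name__ == "__main__":
    base = tempfile.mkdtemp()
    old = os.path.join(base, "releases", "20110512")
    new = os.path.join(base, "releases", "20110517")
    os.makedirs(old)
    os.makedirs(new)
    current = os.path.join(base, "current")

    activate_release(old, current)
    activate_release(new, current)   # roll forward
    print(os.readlink(current))
    activate_release(old, current)   # roll back is the same operation
    print(os.readlink(current))
```

Rolling back is just another call with the previous release directory, which is why this style is quick compared with removing and reinstalling a package.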

What happens when engineering screws up (will happen), and the released code
isn't working right? Do you roll back, or do you fail forward? What sort of
downtime do you incur?

Does your eventual update model allow this type of symptom to be shown on
only the hosts it's been deployed to, and then you make your rollback/forward
choice?

Adam Jacob

May 17, 2011, 5:28:41 PM5/17/11
to devops-t...@googlegroups.com
On Tue, May 17, 2011 at 1:57 PM, gareth rushgrove
<gareth.r...@gmail.com> wrote:
> Maybe it's the use of the words new and from scratch here, but
> incremental deployments (maybe even just a few lines or at least a few
> commits) are more common. Both these scenarios sound like it's just
> for new servers or for initial rollouts of brand new systems?

Nah - I mean more if the project starts from scratch, not if the
servers do. When you talk about application deployment and server
configuration, it's all about what the application needs. If you're
managing one that was already built, the fastest path will be to
automate what is already in place: which means automating the existing
(perhaps inferior) workflow. If you have the luxury of defining that
workflow, my experience has been that the workflow that has as few
points of orchestration as possible kicks ass, from an operator point
of view. :)

> So, sometimes the real world is annoying. I'm keen on ignoring that in
> the spirit of idealism.

Not ignoring it is part of my idealism. :)

> In the back of my mind when I posted this thread was whether anyone
> fancies putting together some "best practice" examples of deployment?
> Taking a really simple application (lets say a single file wsgi python
> application, php file or sinatra app) what would the 'perfect'
> deployment mechanism look like?

The simpler you make it, the less useful it is. And the more
complex you make it, the more people will argue that your solution
is baked in with too many complexities. You can't win for losing on
this road. We have good examples of deploying all of the above in
multi-server environments that scale well, and we have people who have
looked at those same solutions and found them either too complex or too
simplistic.

The reason capistrano/fabric/func are ubiquitous in this space is that
they provide really easy access to the common gates/phases primitives.
The perfect tool for me would wrap the value of Chef's desired state
management with the trivial gates/phases and active state inspection
you want. I think Chef+Noah actually goes a very long way down this
road.

Best,

Jordan Sissel

May 17, 2011, 5:52:41 PM5/17/11
to devops-t...@googlegroups.com
On Tue, May 17, 2011 at 2:23 PM, Dan Sully <dan...@electricrain.com> wrote:
> * Jordan Sissel shaped the electrons to say...
>
>>> How do you handle rollbacks?
>>
>> apt-get handles downgrades just fine (via puppet's package provider).
>>
>>> Using system level packaging, you can't do this atomically.
>>
>> See above :)
>
> Right - apt can't atomically switch. You need to remove & re-install.
>
> Not as easy or as fast as changing a symlink & restarting your process.

I did mention previously that I use puppet now for everything because it works well enough. If speed becomes a bottleneck, I'll change to address that bottleneck - just like I'd do for any resource constraint (deployment-related or not). 

There's an easy workaround for your "remove and reinstall" question. You just need to build the package in a non-conflicting way.
For example, you can include the version in the package name, and have two of those installed (different versions) at the same time.

Plenty of distros already abuse this to install multiple versions of the same thing. See 'libxml2' (vs libxml) or 'ruby1.8' (vs ruby) or 'python30' (vs python) package names. All the major distros seem to do this (debian, ubuntu, centos, rhel, fedora, etc).
 

> What happens when engineering screws up (will happen), and the released code
> isn't working right? Do you roll back, or do you fail forward? What sort of
> downtime do you incur?

As described previously, many of the requirements you mention depend on the situation, tools, and agility of your team. At Loggly, I also write code for the app-side; other engineers (than myself) maintain puppet manifests, etc. If there's a bug, we generally "roll forward" if the bug is fixed quickly. Otherwise, rolling back is just as easy (assuming there are no awkward database schema changes to roll back, but that's unrelated to package rollbacks, imo). Do we have automated rollback? No. Is it hard to manually roll back? Not for us right now. If I had thousands of servers would I invest energy into building stronger automated deployment tools? Certainly.

The following are sort of rhetorical questions: Define downtime? Does the bug impact everything? Someone? Some section of the app? One feature? Bugs vary in impact on the business.

I treat bugs during rollout the same as I'd treat them any other time - a bug is a bug: identify it, prioritize it, fix it, move on. If the fastest time to recovery involves rolling back, then roll back. If the fastest time to recovery involves a bug fix and a new version, then do that.

Very situation dependent.
 

> Does your eventual update model allow this type of symptom to be shown on
> only the hosts it's been deployed to, and then you make your rollback/forward
> choice?

There's no internal tracking of the roll out once it's going, mostly because we don't need it. If our monitoring sees a problem or a customer files a ticket, and it's related to the new rollout, it's easy to pivot and rollback or "fail forward" as you say :)

Mainly, deployment scenarios vary wildly by business unit, business, team, and application.

-Jordan

gareth rushgrove

May 17, 2011, 6:10:28 PM5/17/11
to devops-t...@googlegroups.com

So how do you segment people when it comes to what they are looking
for? I'll come back to number of servers because it's been mentioned,
but anything else?

I think when it comes to examples the problem is more trying not to
introduce constraints and so ending up too generic. An example that is
tailored for <10 servers for a Rails environment is probably going to
get different folks interested and different feedback than a 100
server Python setup or a 1000 server Java setup.

So more constraints in this area are a big win in my view. I'm not
looking for one example to rule them all, more lots of examples that
we, as a community, think are best, given different constraints. The
question to me is: is the number of permutations manageable for a set
of examples?

Also, some feedback on this topic is always going to be negative
because it might involve change (also generalisms). I'm happy to
ignore some of that. As mentioned, this thread is wholly idealistic :)


>
> The reason capistrano/fabric/func are ubiquitous in this space is that
> they provide really easy access to the common gates/phases primitives.
> The perfect tool for me would wrap the value of Chef's desired state
> management with the trivial gates/phases and active state inspection
> you want. I think Chef+Noah actually goes a very long way down this
> road.

Anyone have an example of this working? Or disagree that this is the ideal?
(obviously substituting Chef for Chef/Puppet and Noah for
Noah/ZooKeeper for technology agnostic watchers)

G


Ernest Mueller

May 17, 2011, 6:23:31 PM5/17/11
to devops-t...@googlegroups.com

devops-t...@googlegroups.com wrote on 05/17/2011 04:28:41 PM:

> From: Adam Jacob <ad...@opscode.com>



> Nah - I mean more if the project starts from scratch, not if the
> servers do. When you talk about application deployment and server
> configuration, it's all about what the application needs. If you're
> managing one that was already built, the fastest path will be to
> automate what is already in place: which means automating the existing
> (perhaps inferior) workflow. If you have the luxury of defining that
> workflow, my experience has been that the workflow that has as few
> points of orchestration as possible kicks ass, from an operator point
> of view. :)

Yeah, I think that in highly complex environments you just end up needing orchestration. Sure, if I'm Google or Facebook and I've built some huge pool of stateless apps, it's great... But us enterprise guys have big ass hairballs of junk we have to deal with a lot of the time, and tight orchestration is often just the only tenable way to do something. Including "human workflow steps", to use a BPELy term, where oh say a DBA who hates automation because they hate people has to manually do something... Also I'm glad everyone is always so sure of their code that they figure they can dribble it out into production at random, but eek!

We normally resort to a 'rolling deploy' where we put the code to X of Y boxes, test it, and redeploy previous code if it fails, or keep rolling if it doesn't. Not DevOps nirvana but you go to war with the army you have.
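
The rolling deploy described above can be sketched roughly as follows. This is only an illustration under assumptions: `deploy` and `healthy` are hypothetical callables standing in for whatever push tool and test step a site actually uses, and rolling back every touched host on the first failure is one policy among several.

```python
def rolling_deploy(hosts, batch_size, deploy, healthy, new_version, old_version):
    """Deploy `new_version` to `hosts` in batches of `batch_size`.

    After each batch, every host in it is checked with `healthy`. On the
    first failure, every host touched so far is redeployed with
    `old_version` and the rollout stops. Returns True if all batches passed.
    """
    touched = []
    for start in range(0, len(hosts), batch_size):
        batch = hosts[start:start + batch_size]
        for host in batch:
            deploy(host, new_version)
            touched.append(host)
        if not all(healthy(host) for host in batch):
            # Put the previous code back on everything we changed.
            for host in touched:
                deploy(host, old_version)
            return False
    return True
```

The "X of Y" in the description maps to `batch_size` out of `len(hosts)`; a human approval gate between batches would slot in where the health check is.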

Ernest


Adam Jacob

May 17, 2011, 6:30:28 PM5/17/11
to devops-t...@googlegroups.com
On Tue, May 17, 2011 at 3:23 PM, Ernest Mueller <ernest....@ni.com> wrote:
> Yeah, I think that in highly complex environments you just end up needing
> orchestration. Sure, if I'm Google or Facebook and I've built some huge pool
> of stateless apps, it's great... But us enterprise guys have big ass
> hairballs of junk we have to deal with a lot of the time, and tight
> orchestration is often just the only tenable way to do something. Including
> "human workflow steps", to use a BPELy term, where oh say a DBA who hates
> automation because they hate people has to manually do something... Also I'm
> glad everyone is always so sure of their code that they figure they can
> dribble it out into production at random, but eek!
>
> We normally resort to a 'rolling deploy' where we put the code to X of Y
> boxes, test it, and redeploy previous code if it fails, or keep rolling if
> it doesn't. Not DevOps nirvana but you go to war with the army you have.

Right - it's about complexity of the process/environment as you found
it, not necessarily application complexity.

I worked for a company that had a 20+ step minimum deployment process,
with several of the steps tied directly to specific compliance
line-items. Any attempt to alter the 20 steps would have meant a
ridiculous push that would probably have died on the vine. Instead, we
automated the 20 that already existed, including the existing
compliance components, so that we had the victory we needed to change
things: once we owned it, dropping down to 5 steps under the hood was
easy.

Adrian Cole

May 17, 2011, 6:46:14 PM5/17/11
to devops-t...@googlegroups.com
On Tue, May 17, 2011 at 3:10 PM, gareth rushgrove

Generally, I think people are looking for congruence between the
complexity they perceive in their request and the complexity of the
implementation of it. Personality, culture, and experience settle
what this balance ends up looking like.

More concretely, I believe those who are used to dragging and dropping
images of applications into an eclipse server view will have a certain
set of expectations on what deployment looks and feels like vs the
thousand node condor job vs autosys vs a big stinky n-tier vs...
They'll have their most common deployment experience as a measuring
stick and anything more complex will be too much and less would be too
naive. This is the opportunity of PaaS.

The service part of PaaS, IMHO, is marrying up expectations of the
user, with the right level of detail, in an API form, possibly
multiple API views per culture. Some cultures will care about reusing
tech they already bought into, or auditability above all. Offerings
needn't use the same technology for different deployment use cases and
scale, but it is certainly simpler on the backend to have pliable
technologies that cover vast areas. As I think Luke mentioned, I
agree the user is less concerned with mechanics of push/pull vs being
in control, and having reliable state transitions. Many technologies can
work either way, and the point of raising abstraction is to help give a
better view to the user and more options to the implementor.

So, anyway, the perfect deployment matches the requester's expectations
with relevant execution, and to me this is also a goal of PaaS.

thanks for the fun thread.
-A

Scott Zahn

May 17, 2011, 10:47:49 PM5/17/11
to devops-t...@googlegroups.com, Jordan Sissel
Jordan,

I understand that your deploys are handled by puppet and that your deploys
are staggered during the puppet run window (default is 30 minutes. I
don't know how often yours runs.). Based on what you say below, am I
correct in inferring that your application is in an inconsistent state for
N minutes while each of your servers eventually runs the latest puppet
manifest? Can you elaborate on how this plays out practically during
upgrades that cause noticeable site change to end users and/or breakage to
the site (such as a deploy requiring a database schema change)?

I purposefully haven't automated my deploys via puppet in order to avoid
having an inconsistent application state during a puppet run window, so
I'm interested to see how you deal with such deployment scenarios in your
environment.

regards,
Scott

On Tue, 17 May 2011 16:21:35 -0400, Jordan Sissel <j...@semicomplete.com>
wrote:



Jordan Sissel

May 17, 2011, 11:22:39 PM5/17/11
to Scott Zahn, devops-t...@googlegroups.com
On Tue, May 17, 2011 at 7:47 PM, Scott Zahn <sc...@zahna.com> wrote:
> Jordan,
>
> I understand that your deploys are handled by puppet and that your deploys are staggered during the puppet run window (default is 30 minutes. I don't know how often yours runs.). Based on what you say below, am I correct in inferring that your application is in an inconsistent state for N minutes while each of your servers eventually runs the latest puppet manifest? Can you elaborate on how this plays out practically during upgrades that cause noticeable site change to end users and/or breakage to the site (such as a deploy requiring a database schema change)?

Our puppet runs are every 30 minutes with a 10-minute random sleep before the run. Originally, we did not stagger, but we got word from RightScale that our systems were hammering its API too hard, so I staggered runs to avoid receiving further complaints. Even so, some changes require two puppet runs (sigh) to achieve state - a frontend run could export a new nagios check that the monitoring server didn't pick up 30 seconds ago, so the next 'monitoring server' puppet run will pick it up in 30-ish minutes. Does it matter?
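
To put rough numbers on that window, here is a small Python sketch of the staggering arithmetic. The 30-minute interval and 10-minute random sleep come from the paragraph above; the function names are made up for illustration, and the bound ignores how long a run itself takes.

```python
import random

RUN_INTERVAL_SECONDS = 30 * 60   # a run is scheduled every 30 minutes
MAX_SPLAY_SECONDS = 10 * 60      # up to 10 minutes of random sleep first


def splay_delay(rng=random):
    """Return how long this host sleeps before starting its run."""
    return rng.uniform(0, MAX_SPLAY_SECONDS)


def worst_case_convergence(runs_needed=1):
    """Rough upper bound, in seconds, on how long a change takes to land.

    Worst case for one run: the change is published just after a host
    started its run, so it waits a full interval plus the maximum sleep.
    A change that needs N runs pays roughly N intervals.
    """
    return runs_needed * RUN_INTERVAL_SECONDS + MAX_SPLAY_SECONDS
```

With these figures a single-run change lands within about 40 minutes and a two-run change within about 70, which is the scale of inconsistency being discussed.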

So, yes, there is inconsistent state at times. However, anyone is able to run the same command the puppet cronjob runs to run puppet manually. To that end, if there is a change that needs tight control on timing or ordering, we can orchestrate it ourselves. We aren't targeting 100% uptime on the frontend simply because there has yet to be a business need that asks for that.

The situation you're describing (What do you do during noticeable site changes) is unrelated to puppet, imo. That's more a question of "How do I seamlessly deploy my web application without interrupting user sessions?" and depends greatly on the application and the situation, does it not? The answer to the previous question is certainly not as simple as "copy out code and update a symlink" since "current sessions" can still be interrupted: was the user doing something that's no longer supported? Did you restart your webserver without first quiescing it and removing it from the load balancer? Do you have in-memory state (some sessions, whatever) that needs to be saved before upgrading? All of these questions have nothing to do with puppet (a tool) and everything to do with how you designed your upgrade scenario.

Is our implementation perfect? No. Does it matter? Not right now. It does what we need. Until our needs change, what we have works peachy.

As mentioned previously, if there was a business need that pushed stricter requirements on our upgrade process, then I would make whatever changes were necessary to do that.


> I purposefully haven't automated my deploys via puppet in order to avoid having an inconsistent application state during a puppet run window, so I'm interested to see how you deal with such deployment scenarios in your environment.

Puppet is still mostly about "doing stuff on a host" vs "doing stuff in an infrastructure" - for now, anyway. So any cross-host synchrony will require another tool or a human to manage.

Doing perfectly-timed, zero-interruption upgrades is pretty hard, has wild variation in solution, and depends on the right design decisions being made through your entire stack.

-Jordan

Adam Jacob

May 17, 2011, 11:30:51 PM5/17/11
to devops-t...@googlegroups.com, Jordan Sissel
On Tue, May 17, 2011 at 7:47 PM, Scott Zahn <sc...@zahna.com> wrote:
> I understand that your deploys are handled by puppet and that your deploys
> are staggered during the puppet run window (default is 30 minutes.  I don't
> know how often yours runs.).  Based on what you say below, am I correct in
> inferring that your application is in an inconsistent state for N minutes
> while each of your servers eventually runs the latest puppet manifest?  Can
> you elaborate on how this plays out practically during upgrades that cause
> noticeable site change to end users and/or breakage to the site (such as an
> deploy requiring a database schema change)?
>
> I purposefully haven't automated my deploys via puppet in order to avoid
> having an inconsistent application state during a puppet run window, so I'm
> interested to see how you deal with such deployment scenarios in your
> environment.

This is another great use case for things like dark launching - you
put in feature flags, and a way to twiddle features on or off. That
means you can roll the deploys out at your leisure, and make a single
application change to launch the feature in a coordinated fashion. You
don't *have* to tie things up to the way you orchestrate the
deployment.
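
As a rough illustration of the feature-flag idea, here is a Python sketch. The `FeatureFlags` class, the `new_checkout` flag, and `render_checkout` are all hypothetical names; a real flag store would live somewhere every server can read, not in process memory.

```python
class FeatureFlags:
    """In-memory flag store; a real one would sit in a shared datastore."""

    def __init__(self):
        self._enabled = set()

    def enable(self, name):
        self._enabled.add(name)

    def disable(self, name):
        self._enabled.discard(name)

    def is_enabled(self, name):
        return name in self._enabled


def render_checkout(flags):
    # Both code paths ship in the same deploy; the flag picks one at runtime.
    if flags.is_enabled("new_checkout"):
        return "new checkout flow"
    return "old checkout flow"
```

The deploy can then reach servers at whatever pace the configuration runs allow, since nothing changes for users until the flag is flipped.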

With Chef doing deploys and the 0.10 release, we send a signal to the
running daemons, to trigger the run. You still won't necessarily get
gates, but you can get much closer intervals.

Noah Campbell

May 17, 2011, 11:36:10 PM5/17/11
to devops-t...@googlegroups.com
Thank you Jordan. This is the ultimate devops answer assuming you can align the culture to rally behind a problem.

> Doing perfectly-timed, zero-interruption upgrades is pretty hard, has wild variation in solution, and depends on the right design decisions being made through your entire stack.


One additional point to the pull vs push discussion, along the line of what Jordan said.

I recently came off a project where pull deploy was used to distribute the application code to a cluster of 5000+ nodes. The process worked great except when it came to switching to the new version; then the out-of-sync problem really hit them. They were losing real money during these releases. The solution they were headed to: a pull model to get all the bits on the box and a very tight push model to signal the symlink change. The signaling was custom built and relied on established connections to the "signalling" agent.

The point is that their deployment process required this type of sophistication because there was a financial incentive for it, and the organization, at least the release engineering/system administration organizations, got behind finding a solution. The next step was to look at the app code and figure out how they could do it seamlessly, but all in due time. The organization was still getting *all* elements of the business units to line up.

-Noah

John E. Vincent

May 18, 2011, 11:58:05 AM5/18/11
to devops-toolchain
I've done a POC using Noah as a blocker for chef-client runs already:

https://github.com/lusis/lusis-cookbooks/blob/master/noahlite/README.md

The only reason I haven't fleshed out that cookbook further is because
I realized I needed to go ahead and write a Noah API client library
instead of duplicating all the work in the cookbook.

There should be plenty of other use cases when you treat Noah not
only as an arbiter of action but also as an external data source. Mind you
I don't consider Noah to primarily be the system of record for Chef
setups but it would work as an external node database for Puppet
installs. I've pestered James Turnbull for some better Puppet
integration use cases. I'm not at a Puppet shop right now so I can't
devote the time myself to real world use cases with Puppet.

On May 17, 6:10 pm, gareth rushgrove <gareth.rushgr...@gmail.com>
wrote:

Jeff Sussna

Apr 18, 2013, 3:27:12 PM4/18/13
to devops-t...@googlegroups.com
This thread is two years old. Wondering if the state of the art/knowledge has progressed since then, or any of the commenting parties have changed their thoughts or approach to it. Where do new tools like BOSH or Ansible fit in?



On Tuesday, May 17, 2011 2:07:37 PM UTC-5, garethr wrote:
> Something that interests me (in that I change my mind about the
> answer) and was discussed a little at the last devopsdays Europe is
> where application deployment and configuration management collide.
>
> Mentioning now because Adam Jacob said (hopefully not out of context?)
> on the Windows automation thread:
>
> "If you have the luxury of starting from scratch (or nearly scratch)
> the way you want to model app-deployment is as a part of system
> configuration"
>
> Ignoring the practicalities/limitations of existing systems, is this
> where everyone else would start?

Adam Jacob

Apr 18, 2013, 3:36:28 PM4/18/13
to devops-t...@googlegroups.com

For me, it’s actually become more entrenched. Regardless of the tooling you use, thinking systemically about deployment in the context of the entire lifecycle, from cradle to grave, is the best thing you can do for yourself. Some tools are easier to work with than others in that systemic sense.

 

Adam

 
 

Ryan Miller

Apr 18, 2013, 10:22:21 PM4/18/13
to devops-t...@googlegroups.com
Sorry to join this conversation so late, but for those who advocate packaging apps and using configuration management to deploy, I'm wondering how you handle stateful deployment interactions?  Things like taking servers out of loadbalancers while their caches rewarm, making sure that servers running the new version of an app are talking to database servers with the schema changes, etc.  

Do you force all app-schema dependencies into flags?  Do you let monitoring control the loadbalancers?  I can see how both of these (though requiring a lot of culture change) would work most of the time, but usually everybody has rollouts at least occasionally that can't manageably have their dependencies forced into flags--how do you work those?  

How does cluster health-monitoring on deployments work for you?  Does the config management system know how to pick random subsets of servers for initial deployment, and rollback if those servers begin to behave badly?

I'm curious because I love Puppet dearly and hate most/all deployment tools with a passion, but I haven't figured out how to get around these problems with moving everything to config management. And then I can begin trying to wean developers off of Maven.

Ryan

Steve Conover

Apr 18, 2013, 10:44:17 PM4/18/13
to devops-t...@googlegroups.com
Not to mention that the act of creating flags and writing backwards-compatible code is often burdensome/overkill. A deployment tool that strives to be as transactional as possible, where you have a short canarying period, and assume humans are actively watching graphs and such, covers a huge range of practical rollout use cases.

Mohit Chawla

Apr 19, 2013, 12:36:33 AM4/19/13
to devops-t...@googlegroups.com
Hello,

@Ryan, look at https://github.com/ripienaar/puppet-mcollective/tree/master/example/web_deploy for an implementation of some of these ideas.

Adam Jacob

Apr 19, 2013, 1:01:13 AM4/19/13
to devops-t...@googlegroups.com
A couple of things here, mostly philosophical.

The first rule of great orchestration is to attempt to limit the need for an orchestrator. In systems that are easy to reason about (and thereby to maintain and operate) the number of steps that must be observed and reacted against to accomplish a complex task is, ideally, kept as low as possible. This point of view comes at us from multiple places – human factors and system safety, and promise theory, being the two that are most directly related to most of our day-in-day-out lives.

Let's try answering your questions with our goal being to have as many of the systems involved as possible be responsible for correct behavior themselves (they are autonomous actors), and see what we come up with.

# Taking servers out of loadbalancers while their caches rewarm

In this case, we have the load balancers expecting a pool of resources to route traffic to. The load balancer is responsible for deciding whether or not traffic should be sent to that system, often with some kind of health check. They are monitoring the application servers – the question here is how reliable the response from them is, and what information it encodes. An essentially orchestrator-less approach to this problem would be for the application servers themselves to respond with a smart status code to the load balancer, rather than a stupid one. One example here would be a web app that has a /status endpoint, where one of the contributing factors to it returning a positive result is its own evaluation of the cache's status.
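
A minimal sketch of such an endpoint, written as a plain WSGI callable in Python. Everything named here is an assumption for illustration: the `AppState` holder, the 0.8 warm threshold, and the choice of 503 as the "not ready" answer are not from the thread.

```python
CACHE_WARM_THRESHOLD = 0.8  # assumed: fraction of hot keys that must be loaded


class AppState:
    """Holds whatever the app knows about its own readiness."""

    def __init__(self):
        self.cache_fill_ratio = 0.0


def make_status_app(state):
    """Build a WSGI app for /status that reports 503 until the cache is warm."""
    def status_app(environ, start_response):
        if state.cache_fill_ratio >= CACHE_WARM_THRESHOLD:
            status, body = "200 OK", b"ok\n"
        else:
            status, body = "503 Service Unavailable", b"cache warming\n"
        start_response(status, [("Content-Type", "text/plain"),
                                ("Content-Length", str(len(body)))])
        return [body]
    return status_app
```

A load balancer health check pointed at this endpoint would keep the server out of rotation while it rewarms and add it back on its own, with no orchestrator in the loop.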

# Making sure that servers running the new version of an app are talking to database servers with schema changes

Two problems here. One is how you potentially change the configuration of the application to point to new database servers with a different schema. The second is how we know we have the right schema on the database. The database needs a way to report the schema version to the application, and the application needs a mechanism for validating it. Using a similar approach to load balancing above, we can get to basically 0 orchestration here too: the application itself is aware of a minimum schema revision required to operate, and its status endpoint returns a negative result if it cannot be met. This could be as simple as an auto-incrementing schema_revision in your database.
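
The schema check could look something like this Python sketch, using sqlite3 only so that it is self-contained. The `schema_info` table, its `revision` column, and the minimum of 3 are hypothetical; the paragraph above only says the revision could be an auto-incrementing value in the database.

```python
import sqlite3

MIN_SCHEMA_REVISION = 3  # assumed: the lowest revision this build can run against


def current_schema_revision(conn):
    """Read the highest revision recorded in a `schema_info` table."""
    row = conn.execute("SELECT MAX(revision) FROM schema_info").fetchone()
    return row[0] if row and row[0] is not None else 0


def schema_ok(conn, minimum=MIN_SCHEMA_REVISION):
    """True if the database is at or beyond the revision the app requires."""
    return current_schema_revision(conn) >= minimum
```

The status endpoint would call `schema_ok` and return a negative result when it is False, so a server running new code against an old schema never receives traffic.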

Talking to the right app servers is easy in Chef, at least – you would just have a search query for the backend servers that are for that application and have the right schema number.  

# How does health-monitoring on deployments work for you?

I bet you can guess! :) Again, the application itself is responsible for its health checking.

# Does the config management system know how to pick random subsets of servers for initial deployment?

We have people who do this in Chef. We've seen many more use environments and promotion to handle this scenario.

# Rollback if they behave badly

Maybe. Nothing about using CM primitives for deployment makes this impossible – but the realities of complex application deployment often do.

Best,
Adam

Ryan Miller

Apr 19, 2013, 1:29:45 AM4/19/13
to devops-t...@googlegroups.com
Mohit:  

So this uses mcollective, which has some nice puppet integration (facter, etc) and will probably have more, but I don't see how the strategy is different than with any other deploy tool.  This looks like 'stateful deploy' to me, not 'autonomic config management'.

Am I missing something?

Ryan

Ranjib Dey

Apr 19, 2013, 1:52:03 AM4/19/13
to devops-t...@googlegroups.com
mcollective has an agent-based architecture; we have used it with chef as well as with puppet. Also, with chef + mcollective we were able to orchestrate across a mixed windows + linux environment. Some of our trigger points were the CI servers (feature deployments, canary releases etc), while others, like failover and scheduled maintenance (such as taking mongo nodes out of replica sets for indexing large collections), were triggered from sensu or cron.

One key learning was not to treat deployments as special. Treat them like any other change that you want to introduce, with its associated risks; depending on the risks, you decide how many (or how strict) checks they have to go through (I tend to use a mix of CI and monitoring systems to do this).

We don't have a silver bullet for this. Configuration management systems were not built for orchestration, though you can go for an infrastructure-wide, eventually consistent mode where you design cross-node changes to converge in no more than 2 runs (for example) and run the CM system every 15 mins (for example). This way you don't need an orchestrator, but your feedback time is higher. This greatly simplifies the system.

I have automated the exact scenario for a java/mysql/memcache/activemq stack, with blue-green deployments. A typical deployment starts with the CI server (like teamcity) publishing a green build, which Chef publishes in the artifact store. The app servers then consume and cache it and publish their state as 'blue', but do not do the actual deployment. Next the `lb` takes the app servers in the blue state out of rotation; after that each node does the deployment and a health check, and then the lb adds the server back. We had a custom knife plugin that does the exact same steps, which we use for hot deployments or deployments with additional steps.

One more thing I'd like to note is that you have to follow some application design practices to simplify the orchestration steps. Otherwise you'll be dealing with a lot of additional complexity. Conventions are another great thing that reduces orchestration complexity. For example, don't drop RDBMS columns in a single release: deprecate their usage in one, and drop them in the next. This simple trick will help you a lot in db migration work. Another example would be tagging commits that do not need any db migrations. You don't really have to solve the entire orchestration problem in one go; we can take an incremental approach, automate the steps that are most frequent and relatively simple, and then build on top of those.

 
      

Ryan Miller

Apr 19, 2013, 4:10:37 AM4/19/13
to devops-t...@googlegroups.com
Ranjib,

Sure, I understand how orchestration works, and at the big picture level I don't know that it matters whether the system is true push or message bus like mcollective.  And I agree that gradual progress is good, and your concrete suggestions are good ones (we do those).  

But all of those methods leave you with the question of which config items happen in the configurator and which happen in the deployer, and how to guarantee the interface between the two. Some smart people like Jordan Sissel apparently do things the all-configuration way, and that would obviously be nice, so I'm wondering how they do it.

Ryan

Schlomo Schapiro

Apr 19, 2013, 4:40:18 AM4/19/13
to devops-t...@googlegroups.com
Short answer is at http://yadt-project.org

On 18 April 2013 21:27, Jeff Sussna <j...@ingineering.it> wrote:

> Mentioning now because Adam Jacob said (hopefully not out of context?)
> on the Windows automation thread:
>
> "If you have the luxury of starting from scratch (or nearly scratch)
> the way you want to model app-deployment is as a part of system
> configuration"

That is why we package everything and use RPM to deliver all relevant files to our servers (or deliver an app that manages those files). The basic idea was to use what is there (to build the OS) for our own stuff. The result is a single tool to manage all file-related things, regardless of their origin. The trick is mostly how to package things. Our templating/overlaying for configuration files can be found at https://github.com/yadt/yadt-config-rpm-maker

On 19 April 2013 04:22, Ryan Miller <rmm...@gmail.com> wrote:
Sorry to join this conversation so late, but for those who advocate packaging apps and using configuration management to deploy, I'm wondering how you handle stateful deployment interactions?  Things like taking servers out of loadbalancers while their caches rewarm, making sure that servers running the new version of an app are talking to database servers with the schema changes, etc.  

For that we also could find nothing that would meet our needs and decided to write a new - very simple - orchestration software called yadtshell (https://github.com/yadt/yadtshell) that knows only two remote commands (sudo yum upgrade and sudo service start|stop|status) and also knows about the relationship between packages, services, servers, load balancers, nagios instances etc. The main objective of yadtshell is to keep everything up-to-date and running. If a server or a group of servers has outstanding updates it will use the information from this unified dependency tree to calculate an action plan that will keep the service active while rolling the update through the servers. Independent chunks of servers are then removed from the load balancer, disabled in monitoring and the services are shut down. Then it does the yum upgrade and brings everything up again.

You still have to help it with regard to your questions, e.g. having everything that needs a compatible version also receive the same updated yum channels. And I would always recommend to build your software resilient so that it would just refuse to work with the wrong DB schema. Or better yet, gracefully degrade to a reduced feature set. You will need this in any case to do live deployments without taking your world down...
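
To make the rolling update described above a bit more concrete, here is a rough sketch of how such an action plan could be computed. This is not yadtshell's actual algorithm or API; the function name, the step names and the simple fixed-size chunking are assumptions for illustration only.

```python
def plan_rolling_update(servers, chunk_size):
    """Return an ordered list of (action, hosts) steps for a rolling update.

    Servers are processed in fixed-size chunks so that the remaining
    servers keep the service active while one chunk is being upgraded.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    if len(servers) > 1 and chunk_size >= len(servers):
        raise ValueError("chunk would take every server out of service at once")
    steps = []
    for start in range(0, len(servers), chunk_size):
        chunk = tuple(servers[start:start + chunk_size])
        steps.extend([
            ("remove_from_load_balancer", chunk),
            ("disable_monitoring", chunk),
            ("stop_services", chunk),
            ("upgrade_packages", chunk),
            ("start_services", chunk),
            ("enable_monitoring", chunk),
            ("add_to_load_balancer", chunk),
        ])
    return steps
```

A real tool would derive the chunks from the dependency tree rather than from list position, and would verify each chunk before moving on to the next one.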

Coincidentally, the day before yesterday I gave a talk about this at http://www.netways.de/en/osdc/osdc2013/program/vortraege/configuration_management_and_linux_packages_eng/ but the video is not yet online. Slides can be found at http://www.slideshare.net/schlomo/osdc-2013-configuration-management-and-linux-packages

Jeff Sussna

unread,
Apr 19, 2013, 1:21:45 PM4/19/13
to devops-t...@googlegroups.com
Thinking out loud:

Reading about YaDT, it occurs to me the best approach may depend on how "21st-century" your application/toolset/culture are. If you've adopted a web-scale, loosely-coupled, stateless, highly resilient architecture and mindset, then a decentralized, pull, eventually consistent solution might make sense. If on the other hand, you still think about distributed applications as monolithic sets of tightly coupled components, then a formal orchestration solution might make sense.



Ryan Miller

unread,
Apr 19, 2013, 1:30:45 PM4/19/13
to devops-t...@googlegroups.com
Schlomo--that sounds like a really mature toolchain, and while a lot of people have built something similar (myself included) it's awesome that you've open sourced yours so we can look at it and/or build on it.  But in the end it's a stateful deploy tool.  Which is fine; it's just that some people (who I respect) in this thread have suggested they don't need/use one and I'm curious about how.

Jeff--but basically everybody has tight state somewhere.  I mean, Facebook and Google Docs run on MySQL.  What is true is that they've (or at least Facebook has, who seem to talk about it more) come up with really generic schemas so they don't need many schema changes, and are super-disciplined about feature-flagging.  What worries me is they say they still have occasional non-flaggable / tightly coupled releases, and what "continuous deployment" culture has taught us is that rare is scary.  The bigger you are, and the more advanced your culture, the more rare and dangerous those occasional super-stateful deploys are going to be.  So I'm curious about if/how people manage to get around them.
Ryan

Adam Jacob

unread,
Apr 19, 2013, 1:32:52 PM4/19/13
to devops-t...@googlegroups.com

I think you’ll find that, in general, they don’t have them at all. The way the system works is so deeply embedded, culturally and systemically, that routing around it as a requirement of a given piece of software isn’t really feasible. Essentially, they find another way.

 

Adam

 


Elliot Murphy

unread,
Apr 19, 2013, 1:43:48 PM4/19/13
to devops-t...@googlegroups.com
Launchpad.net is a very large python and postgresql web app that over time moved closer and closer to continuous deployment, starting from a place where it was only deployed/updated once a month with long downtimes bottlenecked on DB schema changes. Here are a few blog posts that go into a fair amount of detail about how they did it. The details on tactics around the DB may be helpful for other folks who are trying to transform an existing "legacy" software system to be easier and faster to deploy.

http://blog.launchpad.net/coming-features/no-more-monthly-90-minute-downtime
http://blog.launchpad.net/general/speeding-up-development
https://dev.launchpad.net/LEP/FastDowntime
https://dev.launchpad.net/PolicyAndProcess/DatabaseSchemaChangesProcess

--
Elliot Murphy

Ryan Miller

unread,
Apr 19, 2013, 2:17:28 PM4/19/13
to devops-t...@googlegroups.com
Adam,
 
Mark Callaghan had a couple of posts about occasional tightly-coupled releases, but since MySQL@facebook is a Facebook page it's hard to search through back posts, and therefore probably not worth fighting about.  But it is worth noting that we live in a world where doing a major Hadoop upgrade, say, can involve a whole bunch of semi-manual, pretty stateful work.

More to the point, though, since you're on the thread and obviously an opinionated expert on these kinds of things, what do you suggest to clients about managing state through rollouts in terms of loadbalancers, caches, database pools, etc.?  Does the server knowing its own state (via chef) allow it to know which Zookeeper/Noah key to read the relevant values off of?  How do you make sure that nagios/other monitoring configs know which servers are going through an upgrade so they don't alarm?  I mean, when you select a subset of servers to be the first in a deployment, do they first take themselves out, then nagios rebuilds its config, then they upgrade, then nagios puts them back in?  That would require some kind of intermediate cookbook version or server role between the old and new, right?  How do you handle the selection of what servers go first?  Some kind of randomizing fact?

Sorry for any terminology errors (I'm a puppet guy) but I'd love to learn from what chef is doing.

Ryan

Adam Jacob

unread,
Apr 19, 2013, 2:27:23 PM4/19/13
to devops-t...@googlegroups.com

Yeah – I’m not saying it never happens. Simply that it gets more and more rare as the systems themselves become more and more understood, and the cultural/engineering backlash for changes that require it gets larger over time.

 

Pragmatism is the order of the day here – if upgrading hadoop requires you to roll that way, man, roll that way. :)

 

The big suggestion to our customers is that they start thinking holistically about the system, not about the workflow. Take nagios as an example. The only reason to rebuild that config is because you aren’t monitoring the real status – you are looking at observable side effects. So imagine a world where your monitoring system is hitting a smarter status endpoint, which knows about things like “I’m about to upgrade”. Now nagios is removed from your orchestration flow – the status endpoint of the thing taking the action (the node being upgraded, in this case) makes it possible to observe its state. The nagios server assumes that promise (that the server is the source of truth) will be kept. So – first in a deployment means first to have its status changed, which means when nagios checks next, it’s “taken out” automatically. Work hard on eliminating the hand of external state manipulation in your infrastructure, and the benefits are legion.
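
As a rough illustration of the "smarter status endpoint" idea, the sketch below shows a node describing its own state and a monitoring-side check that only alerts when the node claims it should be running. The JSON field names, the state names and both function names are assumptions made for this example; nothing here is an actual nagios or chef interface.

```python
import json

# States the node announces itself; while in one of these, a failing
# application check is expected rather than alarming. (Assumed names.)
PLANNED_STATES = {"upgrading", "starting", "stopping"}


def status_document(state, app_healthy):
    """What the node's status endpoint might return (hypothetical format)."""
    return json.dumps({"state": state, "app_healthy": app_healthy}, sort_keys=True)


def evaluate(document):
    """Monitoring-side decision: 'ok', 'planned' or 'alert'."""
    status = json.loads(document)
    if status["state"] in PLANNED_STATES:
        return "planned"
    if status["state"] == "running" and status["app_healthy"]:
        return "ok"
    # Running but unhealthy, or a state we do not recognise.
    return "alert"
```

The monitoring config never has to be rebuilt for a deployment here: the node changes its own declared state and the next check picks it up.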

 

Note the first part, though: pragmatics win out. But you never get to the goal if you don’t know what it is, and in my opinion, the goal is an easy to reason about, easy to operate, well understood infrastructure. And at scale (both in terms of systems and humans) that’s almost always one like I’ve described. In the small, it might not be – pretty easy to grok what happens in that fabric/Capistrano script when you can hold the list of tasks, servers, and humans in your head.

 

Best,

Adam

 


Ryan Miller

unread,
Apr 19, 2013, 2:44:39 PM4/19/13
to devops-t...@googlegroups.com
Ok, so I see how that works for the appserver--you monitor the status page and the app itself, and only fire on app failure if the status page thinks the app should be working.  I like that.  But it seems like there are lots of times when it won't work--say you need a new java version, so you have to restart Tomcat, which takes some time.  (And just as much so for lower-level things, where you have to upgrade kernel, database, etc).  And there's obviously the problem of making sure that you only do so many servers at a time, check before doing more, etc, which I haven't figured out how to naturally do in the autonomic model. 

None of which is to say that culture isn't the most important thing, or that tooling integration isn't getting better.  But I would like to move more things to the autonomic model, because it's less painful in general. 
 
Ryan

Don O'Neill

unread,
Apr 19, 2013, 2:57:40 PM4/19/13
to devops-t...@googlegroups.com
My team has been calling this the "carwash model" - monitor that "cars" are coming out "clean", and it doesn't much matter what's happening "inside the carwash".


 

Adam Jacob

unread,
Apr 19, 2013, 3:13:35 PM4/19/13
to devops-t...@googlegroups.com

Let’s break them down.

 

First, the java upgrade and tomcat restart. The problem here is that the “infrastructure” upgrade (Java) is decoupled from the application that is (in theory) passively receiving it (Tomcat). So now we have a new thing we have to track – not just is the application available, but are we doing something outside that we know has an impact. Let’s move our application state thingy into a new bucket - one that runs outside the app server. Now we have the process that triggers our java update notify this system (which is local to it) that such an activity is occurring, and our status moves on nicely.

 

Only doing so many at a time is a function of how your system groups delivery of desired state descriptions. In chef, you do this through environments – in the future, there will likely be even cooler mechanisms. But using environments goes a real, real long way.

 

Let’s rock kernel upgrades, which require a reboot. The hard part here is getting your status to understand that the status endpoints themselves are going to go away for a bit, but that it’s a normally scheduled event. This one is harder to push to the edges, since we don’t have a thing running we can keep consistent – it’s really a thing that can only be observed by an external entity. So let’s solve this with one more piece of small code – a service that takes a notification from a server of a planned reboot, and that automatically suppresses notifications for a period of time (since, hey, let’s face it: we can’t suppress them forever, because we care). Pre-reboot, we notify this service of our impending demise.
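
The "one more piece of small code" for planned reboots could look roughly like the sketch below. The class name, the method names and the 15-minute default cap are assumptions for illustration; time is passed in explicitly so the behaviour is easy to reason about.

```python
class PlannedRebootRegistry:
    """Tracks hosts that announced a planned reboot (hypothetical service core)."""

    def __init__(self, max_window_seconds=900):
        # Upper bound on any suppression window: we never suppress forever.
        self.max_window_seconds = max_window_seconds
        self._expires_at = {}

    def notify(self, host, now, window_seconds):
        """Called by a host just before it reboots."""
        window = min(window_seconds, self.max_window_seconds)
        self._expires_at[host] = now + window

    def is_suppressed(self, host, now):
        """Asked by the alerting side before it sends a notification."""
        expires_at = self._expires_at.get(host)
        return expires_at is not None and now < expires_at
```

For example, a host calls `notify("web1", now, 3600)` before rebooting. With a cap of 600 seconds, alerts for `web1` are held back for ten minutes at most, after which a host that is still down pages someone as usual.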

 

Getting the hang of it?

 

Best,

Adam

 


John Vincent

unread,
Apr 19, 2013, 3:24:09 PM4/19/13
to devops-t...@googlegroups.com
I've backed off a BIT on how much orchestration I want to automate in the last year. I mentioned it a bit at ChefConf but the fully automated orchestration people want is....not what they want. When you start getting into the weeds, you can quickly end up with dependency loops, unintended changes due to transitive deps and other stuff.

Orchestration is a lot like mutable state - keep it minimal and only use it when you have to. When you do that, it makes everything else you're doing almost trivial.

Flipping an example I used in the past w.r.t Noah - load balancers. Your time is better spent getting your CM tool and cookbooks/modules to a point where you can leave it running in the background or via cron. Then you only have n minutes to wait before the change is picked up when you add a new backend. Conversely, do you really need to remove a system from the LB, or can you rely on the LB to do what it was DESIGNED to do, which is not send traffic to an offline backend? As Adam said, you can structure your application's health check URL to be informative to both HAProxy and Nagios. Availability isn't a binary thing. Up, Down, Upgrading, Starting, Stopping. These are all valid states that Nagios can handle specifically. HAProxy can be configured to only send to a backend with a health check return of "Running".
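
A health check URL that is informative to both the load balancer and monitoring could be backed by something like the sketch below. The state names follow the list above, but the function name and the choice of HTTP 200 versus 503 are assumptions; a load balancer doing HTTP health checks would typically treat the non-2xx answer as "do not send traffic here", while monitoring can read the state from the body.

```python
# Assumed set of states; "running" is the only one that should get traffic.
STATES = {"running", "down", "upgrading", "starting", "stopping"}


def health_response(state):
    """Map the application's current state to an (HTTP status, body) pair."""
    if state not in STATES:
        raise ValueError(f"unknown state: {state!r}")
    status_code = 200 if state == "running" else 503
    # The body carries the precise state so monitoring can tell a planned
    # "upgrading" apart from an unexpected "down".
    return status_code, state
```

With this, taking a node out for an upgrade is just the node switching its own state; nobody has to edit the load balancer config.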

I still see the value of orchestration among parallel client runs like I used with Noah. The value there for us was being able to duplicate environments quickly to get to a posture of environment swapping for major deploys. I also do see the value in using tools like Noah or ZK as a shared state system that reflects current state but even then I advise using it sparingly.

I love this topic and will gladly talk about it all day.


 
You received this message because you are subscribed to the Google Groups "devops-toolchain" group.
To unsubscribe from this group and stop receiving emails from it, send an email to devops-toolcha...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 



--
John E. Vincent
http://about.me/lusis

Adam Jacob

unread,
Apr 19, 2013, 3:27:44 PM4/19/13
to devops-t...@googlegroups.com

Right. End of the day, Orchestration is necessary.

 

Just remember the first rule of orchestration: first, try and eliminate the orchestrator. Use this question: “How can I change this workflow so that we no longer need an orchestrator?”

 

Adam

 


Stuart Charlton

unread,
Apr 20, 2013, 3:10:50 PM4/20/13
to devops-t...@googlegroups.com

it seems from what I've seen the tension is fundamental between

1.  investment in a legacy that "forgot" about (or didn't think about in the first place) aspects of the end to end system health, care and feeding .. pragmatically this can be solved by replacing the painful pieces (best if cost is justifiable), wrapping them in a decentralized manner (next best), and managing with an external orchestrator (boo), or usually some combination of the above

2. greenfield where you can "get things right this time" or have the freedom to adopt the more recent open source technology 



I keep wondering when there will be enough experience and economic incentive to get a choreography model working, instead of a central orchestrator, where the state transitions and state are specified centrally but aren't managed centrally, and are instead executed by the end points themselves. there have been stabs at this but in the end it's just easier to make do with the old approaches. but IMO this in particular would help those with a heavy legacy and no cost model (or political model) to justify replacing it...

Stu

Sent from my iPhone
--

JaimeGago

unread,
May 5, 2013, 2:10:17 AM5/5/13
to devops-t...@googlegroups.com
This is great reading, keep going, please!

@Adam IRT first rule and eliminating the orchestrator: could it be that what "we" need is something less "radical", say maybe "Ephemeral Orchestrator(s)"? Then depending on the environment one could play with the ephemerality settings, after all nothing is more relative than time, what does a "short time" mean anyway? Then eventually the "ephemerality variations" could be automated?
I'm just wondering how the binary approach to the orchestrator existence can deal with all 3 cases (i.e. divergence, convergence and congruence) described by Steve Traugott and Lance Brown in "Why Order Matters: Turing Equivalence in Automated Systems Administration". In particular no orchestrator and congruence, orchestrator and convergence.

What if we dreamed out loud? Would that take us to Configuration Mgmt not being computed via a tool external to the applications but through a framework almost inevitably embedded with a distributed application as a best practice? Wouldn't Zookeeper be a good starting point for such a CM framework? The example that comes to my mind is that of logging and Log4j. Wouldn't that then take the question of Command Orchestration for Application deployments off the table, since it would be embedded within the application through the config mgmt framework's built-in logic?


J.

Steve Conover

unread,
May 5, 2013, 4:18:52 AM5/5/13
to devops-t...@googlegroups.com
> The big suggestion to our customers is that they start thinking holistically
> about the system, not about the workflow. Take nagios as an example. The
> only reason to rebuild that config is because you aren’t monitoring the real
> status – you are looking at observable side effects.

Those observable side-effects give me potentially very important
information in an outage scenario - why wouldn't I want to monitor all
of the above?

> So imagine a world
> where your monitoring system is hitting a smarter status endpoint

The smarter it is, the more likely it will lie to me.

Jez Humble

unread,
May 5, 2013, 7:21:28 AM5/5/13
to devops-t...@googlegroups.com
> Orchestration is a lot like mutable state - keep it minimal and only use it when you have to. When you do that, it makes everything else you're doing almost trivial.

This is super.

I think complex orchestration (like lots of feature branching) is a sign of a crappy architecture. We should be architecting for deployability - which means (amongst other things) that:
  • developers should be able to deploy to their own machines (perhaps using Vagrant) without having to stand up a fully integrated environment
  • we should be able to find most of the bugs without a fully integrated environment
  • if you have runtime dependencies between components (including databases, event buses and external services) you should be able to redeploy any of them independently without touching the others.
This is my beef with "Application Release Automation" tools - it's a band-aid for a crappy architecture. Fix the architecture! (I know, it's hard, but if we wanted easy we shouldn't have become engineers. Quit whining.)

PS on mutable state in infrastructure, my colleague Ben Butler-Cole says "don't ever touch a box once it's up. If it needs changing, tear it down and replace it with a new one."

Stuart Charlton

unread,
May 5, 2013, 4:49:09 PM5/5/13
to devops-t...@googlegroups.com


On May 5, 2013, at 5:21 AM, Jez Humble <j...@jezhumble.net> wrote:

> This is my beef with "Application Release Automation" tools - it's a band-aid for a crappy architecture. Fix the architecture! (I know, it's hard, but if we wanted easy we shouldn't have become engineers. Quit whining.)

That presumes it is up to the engineers. Much whining and band aids (especially in enterprise) is because it is up to the politicians to get the investment to fix the crappy architecture. (Look at J2EE circa 2002).

Not likely to happen without a quit/rehire cycle.

Stu

Adam Jacob

unread,
May 5, 2013, 4:55:47 PM5/5/13
to <devops-toolchain@googlegroups.com>, devops-t...@googlegroups.com
This is an excuse for status quo hidden as pragmatic realism. Look: either you have a critical business problem (low shipping velocity causing poor customer experiences) or you do not. If you don't, great - make some incremental improvement to your arch and move on.

If you do, look it straight in the eye and make the real assessment. The result is you will/must look at the whole system. Using the same pattern and approach that got you to where you are does not result in order of magnitude different outcomes (in either direction.)

It's time we stop apologizing for the difficulties inherent in large organizations and start working together to solve them.

Adam

Sent from my iPhone

Stuart Charlton

unread,
May 5, 2013, 5:34:38 PM5/5/13
to devops-t...@googlegroups.com

On May 5, 2013, at 2:55 PM, Adam Jacob <ad...@opscode.com> wrote:

> This is an excuse for status quo hidden as pragmatic realism.

Hell no! What I am trying to understand is how one can cope with and improve a situation when the funding for an architecture rethink happens about once a decade.

The trend in this thread seems to be: don't bother until you can redo it all.

> Look: either you have a critical business problem (low shipping velocity causing poor customer experiences) or you do not. If you don't, great - make some incremental improvement to your arch and move on.

> If you do, look it straight in the eye an make the real assessment. The result is you will/must look at the whole system. Using the same pattern and approach that got you to where you are does not result in order of magnitude different outcomes (in either direction.)

I agree with looking at the end to end system.

Where I disagree is the notion that one cannot execute incrementally and apply appropriate compromise to build credibility and momentum among a skeptical management that could just offshore the lot and set things back another 10 years.

>
> It's time we stop apologizing for the difficulties inherent in large organizations and start working together to solve them.

Okay, I'm inside these large organizations and trying to work together to solve them. All I said was "quit whining" is a bit flip when we are trying to fix a situation today, where the constraints are all too real. "Rethink everything" is in many cases a $300 to $600 multi dollar price tag and nearly impossible to justify to a non technical management.

My view is that things are usually so bad there is ample room for improvement - even order of magnitude improvement - without a "burn the ships" approach. Release automation tools are a part of that, but mostly as a catalyst to fix the thinking and culture.

Stu

Adam Jacob

unread,
May 5, 2013, 7:37:37 PM5/5/13
to <devops-toolchain@googlegroups.com>, devops-t...@googlegroups.com
For the record, I super like and respect Stu :)

More below.

Sent from my iPhone

On May 5, 2013, at 2:34 PM, "Stuart Charlton" <stuartc...@gmail.com> wrote:

>
> On May 5, 2013, at 2:55 PM, Adam Jacob <ad...@opscode.com> wrote:
>
>> This is an excuse for status quo hidden as pragmatic realism.
>
> Hell no! What I am trying to understand is how one can cope with and improve a situation when the funding for an architecture rethink happens about once a decade.

I think incremental improvement is your only hope here - and incrementally inching towards CD isn't realistic. You'll get better - you won't be unrecognizable.

> The trend in this thread seems to be: don't bother until you can redo it all.

Honestly? Unless you are committed to a two-three year journey, at the end of which you are orders of magnitude better (on the axis of deployment/development velocity) ... Don't bother with CD. Do the less disruptive work that focuses on improving the lives of those whose culture/arch cannot transform like that. No mistake - this is important work (some of which you likely do anyhow, since its a journey, not a snap). It just isn't fundamentally transformative along the business velocity track.

>> Look: either you have a critical business problem (low shipping velocity causing poor customer experiences) or you do not. If you don't, great - make some incremental improvement to your arch and move on.
>
>> If you do, look it straight in the eye an make the real assessment. The result is you will/must look at the whole system. Using the same pattern and approach that got you to where you are does not result in order of magnitude different outcomes (in either direction.)
>
> I agree with looking at the end to end system.
>
> Where I disagree is the notion that one cannot execute incrementally and apply appropriate compromise to build credibility and momentum among a skeptical management that could just offshore the lot and set things back another 10 years.

You can't. If they don't see it now, they may in 2, 3 or 5 years. Pushing towards a goal they fundamentally don't agree with isn't doing a good job. Help them achieve the goals they do understand if you like, but don't kid yourself (or them) that having mediocre interim goals leads to magnificent future results: it does not. You know this in your bones.

If you do try this, your destiny is either mediocre outcomes or bureaucratic pillow death.

I'm not saying you start completely green field - I'm saying your fundamental starting position has to be that everything is on the table, if what you want is huge increases in business velocity.

>
>>
>> It's time we stop apologizing for the difficulties inherent in large organizations and start working together to solve them.
>
> Okay, I'm inside these large organizations and trying to work together to solve them. All I said was "quit whining" is a bit flip when we are trying to fix a situation today, where the constraints are all too real. "Rethink everything" is in many cases a $300 to $600 multi dollar price tag and nearly impossible to justify to a non technical management.

Then they either don't have a business velocity problem, or they can't see they have one. In both cases, rethink everything is wrong - either because you build wrong automation for the business, or because you are being wasteful.

>
> My view is that things are usually so bad there is ample room for improvement - even order of magnitude improvement - without a "burn the ships" approach. Release automation tools are a part of that, but mostly as a catalyst to fix the thinking and culture.


Absolutely! On many, many fronts, this is true. On business velocity? Not so much.

Best,
Adam

Jez Humble

unread,
May 5, 2013, 10:54:39 PM5/5/13
to devops-t...@googlegroups.com
> What I am trying to understand is how one can cope with and improve a situation when the funding for an architecture rethink happens about once a decade.

What if, whenever we implemented new requirements, we never touched legacy stuff but always implemented it in an architecture that _was_ deployable, testable, etc? If that required moving functionality over from existing legacy stuff, we'd find a way to port the smallest possible amount of functionality over that would let us achieve our goal. Occasionally we'd have to modify the legacy stuff to create a little API, but we're talking minimal invasiveness.

Over time, the old stuff would be doing less and less. A couple of years in, we could write little tools that instrumented it to discover which code paths were getting executed, and delete big chunks of code.

It turns out people actually do this - it's called the Strangler Application pattern. It's how Amazon moved from a big ball of mud to a service oriented API.
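
At the routing layer, the strangler approach often boils down to a small decision like the one sketched below: requests for functionality that has already been ported go to the new system, everything else still goes to the legacy one. The function name, the path-prefix scheme and the example paths are assumptions for illustration, not a description of how any particular company did it.

```python
def choose_backend(path, migrated_prefixes):
    """Return 'new' if the path belongs to already-ported functionality."""
    for prefix in migrated_prefixes:
        base = prefix.rstrip("/")
        # Match the prefix itself or anything below it, but not
        # unrelated paths that merely share the leading characters.
        if path == prefix or path.startswith(base + "/"):
            return "new"
    return "legacy"
```

Each time another slice of functionality is ported, its prefix is added to the list, and the legacy system ends up serving less and less traffic.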

My experience is that the Big Rewrite is almost always a total disaster, whereas incremental ways to get us from A to B are low risk. They take longer, but something that takes longer and actually works is usually better than something that is quicker but keeps getting delayed until it actually takes longer than the incremental way, but still doesn't work, and is eventually cancelled. We tend not to do it incrementally because at some point there will be funding for the big architecture rethink. When you get that funding, in general you'll wish you hadn't.

One company that did do the Big Rewrite successfully was HP with their LaserJet firmware (which doesn't mean you should try it). One of the things they did that made them successful (their words not mine) was architecting for continuous delivery. In order to do that, you have to do continuous integration, which means everyone has to check in all functionality to trunk regularly (a few times a week in their case). They were getting 10-15 good builds - 100-150 checkins, 75-100 kloc in changes - a day. To do that, they had to do everything incrementally. That changes your culture. Indeed I have the guy who was director of that division on camera saying in retrospect they probably could have done it incrementally (that interview will go up on my blog later this month). Here's what they say in the book:

After four years of a major re-architecture effort with significant changes, we can only think of two or three times when we made a conscious effort to hold other things off while we brought in a major change. Each time after we did it, we had several of our lead engineers pointing out that it wasn’t necessary. In fact, it got to being a point of pride where lead engineers would bring in a major change through our standard processes. It might require testing a few components together offline before committing, or a little extra testing up front to reduce the risk, but then major changes would come through the queue along with everything else.

The hard bit is changing the mindset of people in the organization to understand that the best way to get from A to B is incrementally, not through some big-bang project, and then teaching everybody how to do it - BEFORE the shit hits the fan and we're about to be outflanked by our competitors. Oh, and read Toyota Kata.

Thanks,

Jez.



Stu

--
You received this message because you are subscribed to the Google Groups "devops-toolchain" group.
To unsubscribe from this group and stop receiving emails from it, send an email to devops-toolcha...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


Jez Humble

May 5, 2013, 11:03:45 PM
to devops-t...@googlegroups.com
Interesting historical footnote - here's Ron Jeffries talking about how the C3 project (the first XP project, which notoriously ultimately failed) should have been done through strangler applications: 

The C3 project’s purpose was to replace the entire family of Chrysler payroll programs. It didn’t accomplish that. Many years later, the subsequent project to replace the entire family of payroll programs has also not accomplished it. We now realize what should have been done.
 
We should have replaced the broken bits, one at a time, most valuable first.

(His emphasis). Find this bit and then read the subsequent few paragraphs.

Adam Jacob

May 6, 2013, 12:04:22 AM
to <devops-toolchain@googlegroups.com>, devops-t...@googlegroups.com
Similarly, look at Rob Cummings from Nordstrom talking at Chefconf. They are doing exactly this - moving the "performance engine" into the new model over time. 

It is the "big rethink" - simply done in vertical, rather than horizontal, slices.


Adam

Sent from my iPhone

Stuart Charlton

May 6, 2013, 12:07:10 AM
to devops-t...@googlegroups.com
On Sun, May 5, 2013 at 5:37 PM, Adam Jacob <ad...@opscode.com> wrote:
For the record, I super like and respect Stu :)


Likewise, of course :)


I think incremental improvement is your only hope here - and incrementally inching towards CD isn't realistic. You'll get better - you won't be unrecognizable.

Help me understand this view.  Perhaps I can provide a naive scenario:

Say I'm in a bank, and I have a central legacy system that's developed in a mostly lean fashion - business-driven priorities and tasks, automated tests, demos every two-week iteration, etc. The technology platform is one of those typical (but modern) commercial middleware platforms that integrates the mainframe and open systems, performs and scales well for the use cases and includes a reasonable approach to high availability... but ultimately its administrative tooling isn't constructed for continuous delivery. Fortunately the vendor provides APIs and/or command line access to get at the elements. The architecture is also loosely coupled enough that different areas of functionality can be released independently of each other without severely impacting availability (though it's far from perfect).

Ultimately the middleware platform will not alone provide all the business agility they're seeking because their purchase of this platform was clearly a case of local optima - they thought their bottleneck to business velocity was integration & event handling.   But really, the end-to-end system is much broader than that, encompassing service transition and operations, not just how they integrate 

The primary obstacle is that there is a chasm between the delivery and operations cultures. Change releases are queued up into quarterly release windows, of which only two a year can be major releases. The management of the operations organization WANTS to change - it has been (for example - I'm simplifying)
 - providing more production metric & log visibility to delivery, 
 - embedding operations personnel within the project teams

Now they'd like to also:
 - version control everything in a release, including configuration
 - automate all technical aspects of a release through to production
 - manage the release checklist in a central collaboration app with the aim of eliminating the release checklist to the fullest extent possible
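
As a rough illustration of that last goal, here is a minimal sketch, assuming the release checklist is modelled as data where each item is either automated or still manual. The names (`ChecklistItem`, `run_release`, the example steps) are hypothetical and do not correspond to any particular release automation product.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ChecklistItem:
    name: str
    # None means the step is still done by hand.
    automation: Optional[Callable[[], bool]] = None


def run_release(items: List[ChecklistItem]) -> dict:
    """Run automated steps in order and stop at the first failure.

    Manual steps are not executed, only collected, so the report shows
    how much of the checklist is left to eliminate.
    """
    done: List[str] = []
    manual: List[str] = []
    for item in items:
        if item.automation is None:
            manual.append(item.name)
            continue
        if not item.automation():
            return {"ok": False, "failed": item.name, "done": done, "manual": manual}
        done.append(item.name)
    return {"ok": True, "failed": None, "done": done, "manual": manual}


# Hypothetical checklist; the lambdas stand in for real checks and deploy steps.
checklist = [
    ChecklistItem("config is in version control", lambda: True),
    ChecklistItem("deploy to staging", lambda: True),
    ChecklistItem("sign-off from change board"),  # still manual
]
report = run_release(checklist)
```

Shrinking the `manual` list over successive releases is one way to measure progress toward eliminating the checklist.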

They're looking to a combination of cross-team (delivery, operations) collaboration and application release automation tooling to achieve the above.    

We estimate that we can reduce the time for a major release from QA to production from the current 6-8 weeks (not kidding) to around a day.
 
Now, from a business velocity perspective - I would agree this is not the end point.   It's too IT focused, not necessarily business focused.   The delivery team is actually not all that lean.    The architecture of their platform is not nirvana.  Etc.  We're going to solve one area and promote two or three currently smaller ones.   But in the meantime we've streamlined a lot of waste out of the system and helped to transform the culture to see new possibilities.

One of the sayings in lean thinking is "Learning To See".   To many of those coming from an ITIL v2 background, this new scenario is almost impossible to fathom.   

They will eventually have to rethink their entire architecture... but they can't see (and are paid not to see) that today. They also have a lot of monied interests among the large IT vendors ensuring they never see that.

The above is a two year journey - 8 months for the pilot, and another year for the broader rollout.   

Is it clearly a fool's errand?   Is it going to fall apart and provide no major benefit?    What am I missing?


> My view is that things are usually so bad there is ample room for improvement - even order of magnitude improvement - without a "burn the ships" approach.   Release automation tools are a part of that, but mostly as a catalyst to fix the thinking and culture.


Absolutely! On many, many fronts, this is true. On business velocity? Not so much.

Perhaps the scenario organizations I am dealing with (telecom, bank) don't really care about velocity all that much (yet).   They want to get better but won't take the risk to throw out the mainframe, for example.    Those that have tried have often had $1-billion price tags associated with it, and lots of turnover carnage at all levels.  (On the other hand, as I'll agree in my reply to Jez, it's arguably because they approached it wrong).

Stu

Adam Jacob

May 6, 2013, 12:07:03 AM
to <devops-toolchain@googlegroups.com>, devops-t...@googlegroups.com
The thing that is not incremental is your goal and design. The implementation nearly always is, to some degree.

Best
Adam

Sent from my iPhone

Stuart Charlton

May 6, 2013, 12:31:07 AM
to devops-t...@googlegroups.com

Hi Jez,


On Sun, May 5, 2013 at 8:54 PM, Jez Humble <j...@jezhumble.net> wrote:
What I am trying to understand is how one can cope with and improve a situation when the funding for an architecture rethink happens about once a decade.

What if, whenever we implemented new requirements, we never touched legacy stuff but always implemented it in an architecture that _was_ deployable, testable, etc? If that required moving functionality over from existing legacy stuff, we'd find a way to port the smallest possible amount of functionality over that would let us achieve our goal. Occasionally we'd have to modify the legacy stuff to create a little API, but we're talking minimal invasiveness.

Over time, the old stuff would be doing less and less. A couple of years in, we could write little tools that instrumented it to discover which code paths were getting executed, and delete big chunks of code.

It turns out people actually do this - it's called the Strangler Application pattern. It's how Amazon moved from a big ball of mud to a service oriented API.

My experience is that the Big Rewrite is almost always a total disaster, whereas incremental ways to get us from A to B are low risk.

So, one of the reasons I am no longer the head of IT Operations at a certain large transportation & logistics company is that I proposed this exact pattern as a recipe for transforming our primary "over the road" and yard management systems.    We had demonstrated major success in a lean, continuous delivery approach on our bulk commodity demand systems, and saved another doomed offshore project with this approach. 

But the alternative, a massive $300m+ rewrite on SAP, won - and was viewed as much less risky by (the new)  executive management and board of directors.   The numbers (in terms of release timelines, budgets, quality, etc.) on the pilot areas were too good to be believed, and thus needed to be buried. 

On the bright side, we did a few major improvements that stuck.   I would need beer to explain more.   :)

I'm not grousing - this kind of fight comes with the territory. Perhaps I'm just saying (in this thread, and with Adam) that it's sometimes necessary to do the wrong thing to defeat an even wronger thing, to live another day and do the right thing.

 
The hard bit is changing the mindset of people in the organization to understand that the best way to get from A to B is incrementally, not through some big-bang project, and then teaching everybody how to do it - BEFORE the shit hits the fan and we're about to be outflanked by our competitors. Oh, and read Toyota Kata.

I worked in an organization that lived and breathed Lean on the operations side, but refused to see how it applied to IT.   The Phoenix Project was a bit close to home - unfortunately I had no Yerba Maté quaffing board member named Erik to help guide me ;)

Stu

Stuart Charlton

May 6, 2013, 12:46:41 AM
to devops-t...@googlegroups.com
One other quick note...


On Sun, May 5, 2013 at 5:37 PM, Adam Jacob <ad...@opscode.com> wrote:
 
> Where I disagree is the notion that one cannot execute incrementally and apply appropriate compromise to build credibility and momentum among a skeptical management that could just offshore the lot and set things back another 10 years.

You can't. If they don't see it now, they may in 2, 3 or 5 years. Pushing towards a goal they fundamentally don't agree with isn't doing a good job. Help them achieve the goals they do understand if you like, but don't kid yourself (or them) that having mediocre interim goals leads to magnificent future results: it does not. You know this in your bones.


Just to be clear, I have no illusions about this.   

However, what I've tended to see is that there are three ways to accomplish progress in a large organization:
a) quickly, before anyone notices and can stop you or dismantle what you've done
b) with enough air cover and credibility that you've built over time by accumulating enough mediocre but significant results
c) in a crisis, when you have the ear of god

(c) doesn't come too often. I will admit that (a) is much easier and less stressful than (b), but I've observed (b) in action, and it tends to lead to much longer term impact, assuming you survive.

These days I've come to the conclusion I don't have the endurance to spend the 5-7+ years to build up such a power base, so I tend to just help others on their path, assuming we have a shared vision.

Stu

Jez Humble

May 6, 2013, 1:04:44 AM
to devops-t...@googlegroups.com
Thanks a lot for sharing Stuart. I imagine we agree vigorously, but I can't stop myself from making the following points:

We had demonstrated major success in a lean, continuous delivery approach on our bulk commodity demand systems, and saved another doomed offshore project with this approach. 

It's sometimes hard to believe, but if you don't fuck it up, software development can actually work pretty well.

But the alternative, a massive $300m+ rewrite on SAP, won - and was viewed as much less risky by (the new)  executive management and board of directors.
  
I'm all for using COTS if you don't have to do any customization. If you have to customize a system that is not designed for customization (on purpose, so consultants can make a ton of money) you are entering a world of pain. I like to point people who think otherwise at this case study (for Telstra, an enormous telco): http://www.zdnet.com/keep-it-simple-stupid-telstraclear-1339307482/. I am guessing your management failed to grasp this.

The thing is, if you're not going to customize the COTS, that means that the business processes you're running on it are vanilla. And if that's true, what's your company's competitive advantage?

The reason execs don't like dealing with IT is because they don't understand it and they don't think it's a core part of their business. That's already not true, and it's about to become a lot more not truer. Fortunately some people are realizing this. GM pioneered outsourcing, and now they've realized it's not a good idea:

GM’s reasons for doing this may well apply to many other firms too. “IT has become more pervasive in our business and we now consider it a big source of competitive advantage,” says Randy Mott, GM’s chief information officer, who has been responsible for the reversal of the outsourcing strategy. While the work was being done by outsiders, he said most of the resources that GM was devoting to IT were spent on keeping things going as they were rather than on thinking up new ways of doing them. The company reckons that having its IT work done mostly in-house and nearby will give it more flexibility and speed and encourage more innovation. [1]

Thanks,

Jez.

[1] The Economist, “Special Report: Outsourcing and Offshoring”, p18, Vol. 406 No. 8819, 19 January 2013.



Stuart Charlton

May 6, 2013, 1:13:15 AM
to devops-t...@googlegroups.com
On Sun, May 5, 2013 at 11:04 PM, Jez Humble <j...@jezhumble.net> wrote:
 
But the alternative, a massive $300m+ rewrite on SAP, won - and was viewed as much less risky by (the new)  executive management and board of directors.
  
I'm all for using COTS if you don't have to do any customization. If you have to customize a system that is not designed for customization (on purpose, so consultants can make a ton of money) you are entering a world of pain. I like to point people who think otherwise at this case study (for Telstra, an enormous telco): http://www.zdnet.com/keep-it-simple-stupid-telstraclear-1339307482/. I am guessing your management failed to grasp this.


Indeed.   Or they felt they were more vanilla than they actually were, because SAP was proposing a co-development situation for certain new module versions.  (run far, far away, IMO).

The reason execs don't like dealing with IT is because they don't understand it and they don't think it's a core part of their business. That's already not true, and it's about to become a lot more not truer. Fortunately some people are realizing this. GM pioneered outsourcing, and now they've realized it's not a good idea:

GM’s reasons for doing this may well apply to many other firms too. “IT has become more pervasive in our business and we now consider it a big source of competitive advantage,” says Randy Mott, GM’s chief information officer, who has been responsible for the reversal of the outsourcing strategy. While the work was being done by outsiders, he said most of the resources that GM was devoting to IT were spent on keeping things going as they were rather than on thinking up new ways of doing them. The company reckons that having its IT work done mostly in-house and nearby will give it more flexibility and speed and encourage more innovation. [1]

 
One of the aspects that did stick in my efforts with our former CIO was exactly the above - we got board approval to turf our strategic outsourcers for an insourcing strategy (with selective out tasking).    That took a year of dodging many attempts from the incumbents at firing us through their political clout, but we prevailed.  One of my prouder moments.  

Randy's move at GM is inspiring though I have a dose of skepticism given the negative reviews that have been leaking out of HP.  But people and circumstances differ from place to place, so who knows.

Cheers
Stu

Jez Humble

May 6, 2013, 1:17:30 AM
to devops-t...@googlegroups.com
I have a dose of skepticism given the negative reviews that have been leaking out of HP

Well that's the problem, it's so easy to fuck it up :(

BTW how is it insourcing if HP are doing it?!



Jez Humble

May 6, 2013, 2:00:27 AM
to devops-t...@googlegroups.com
I think there's also (and guess what my preference is)

d) team up with a courageous executive who actually wants to get things done.

Sometimes these people are brought in and only have a short amount of time to prove themselves. Sometimes they get kicked out again after a year or two, but in that time they manage to make lasting course changes (I have at least one example of this happening at a very large company). Sometimes you can use a) to achieve d).

"Shadow IT" is basically d) in action. The trick is leveraging it to transform the rest of IT.





Stuart Charlton

May 6, 2013, 9:22:52 AM
to devops-t...@googlegroups.com


On Sunday, May 5, 2013, Jez Humble wrote:
I have a dose of skepticism given the negative reviews that have been leaking out of HP

Well that's the problem, it's so easy to fuck it up :(

BTW how is it insourcing if HP are doing it?!

Oh, I meant feedback about Randy's tenure at HP while under Mark Hurd. IT cleanup was a major part of Hurd's cost reduction program. There have been stories here and there of this being mostly sacrificing tomorrow on the altar of today. Lots of pent up frustration and then Apotheker didn't get along with him. But, given the chaos there, like I said, who knows the real truth...

Stu

Stuart Charlton

May 6, 2013, 9:31:25 AM
to devops-t...@googlegroups.com


On Monday, May 6, 2013, Jez Humble wrote:
I think there's also (and guess what my preference is)

d) team up with a courageous executive who actually wants to get things done.

Oh, yeah - that's what I meant by (a).  :-).  
 
 Sometimes these people are brought in and only have a short amount of time to prove themselves. Sometimes they get kicked out again after a year or two, but in that time they manage to make lasting course changes (I have at least one example of this happening at a very large company). Sometimes you can use a) to achieve d).

Agreed.  

And in our case we were mostly too late to save the company from itself, so my courageous executive left -- shortly followed by a board coup, complete firing of the management staff, etc.  

The incoming team had the right ideas for the operational business, thankfully, but their approach to IT ... Not my cup of tea.

"Shadow IT" is basically d) in action. The trick is leveraging it to transform the rest of IT.

Also agreed.  

Or it winds up turning into a cargo cult after you leave, because they don't understand what you've set up but don't dare touch it.

Cheers
Stu

Jeff Sussna

May 6, 2013, 10:26:38 AM
to devops-t...@googlegroups.com
I strongly agree with the StranglerApplication approach. I also agree with the general goal of minimizing the need for orchestration-level complexity. Many companies, however, are just getting started with automation. The first step is to manage what they have. I want to propose the idea that orchestrational complexity can serve a canary function. As you build in orchestration, and either find complexity as you go, or else foresee it when considering future enhancements, that can be a good sign you need to consider incremental re-architecture. It also points to where you need to do that re-architecture first. In the meantime your environment is more managed than it was yesterday, which is a good thing.