reference implementation for continuous deployment system

yonatan maman

Nov 17, 2010, 2:33:38 PM
to iltec...@googlegroups.com
Continuous Deployment is a hot topic and very popular: many people want to hear about it, read about it, and talk about it, but I guess that only a few are really doing it, like me :-)

I think that one of the barriers that prevent it from being more popular in the "field" is that Continuous Deployment is an idea, and quite an abstract one.
I guess that most of us have viewed and enjoyed Eishay's presentation about Continuous Deployment, but even this presentation is too high level, and there is a large gap one needs to bridge in order to adopt this idea.

Try to put yourself in the place of a technical leader who wants to follow the idea of Continuous Deployment: you read some blogs, watch presentations, go to the relevant conferences, and talk to the right people, but when you want to start implementing the "Immune System" (for example), you don't have a good starting point.

I think that having a reference implementation for a Continuous Deployment system could give it a real boost. You can look at it as a "cheat sheet" - it could be a set of scripts, services, dashboards...

I'm not sure if it is feasible to have such an implementation; maybe there is no Continuous Deployment system that fits all sizes (because different shops have different needs).

What do you think?

-- Yonatan

Tor

Nov 17, 2010, 3:20:08 PM
to iltec...@googlegroups.com
Yonatan - strong words.

I think you pinned down the problem.
I also believe that the really hard-to-understand parts are the outer circles of Continuous Deployment (the Immune System and Continuous Deployment itself), since Continuous Integration, automated testing and TDD, along with static analysis, are widespread in the industry.
And mostly, as you say, how it all fits together.

An out of the box simplified system would do the job :)

Tor Ivry

Ori Lahav

Nov 17, 2010, 3:56:33 PM
to iltec...@googlegroups.com
Good subject, Yonatan.

At Outbrain we are doing Continuous Deployment!
Do we have a bulletproof Immune System? No, but we have a good starting point with something we developed.
Do we have continuous integration that runs things automatically from commit to production? No. It's in the works.
Do we have full unit test coverage of the code? No.

But we do a few code changes and production changes a day. When we have code ready, it is pushed.
It was a big step to get there, and I admit it wasn't easy to decide.
However, once we decided this is where we want to go, we started immediately, building the tools while the process was already running.

As you mentioned, CD is a concept, and you need to take the essence of it and build your own tools to implement it for your system's needs.
It is fine to look at solutions other people built before and then implement them for your system. Actually, this is where the fun begins, isn't it?

Eishay Smith

Nov 17, 2010, 4:06:12 PM
to iltec...@googlegroups.com
Congrats!

Guy Nirpaz

Nov 17, 2010, 4:19:52 PM
to iltec...@googlegroups.com
Much like other lean practices, I don't think you can actually bake CD based on a recipe.

Eishay showed us that this can actually work for real. Now it's up to us to implement it and understand what it means.
Much like Outbrain, but on a much smaller scale (for now ;), we're also practicing CD at SaaSPulse. It's not perfect, and is even far from it, but it's a mindset, and we'll get there over time as we learn more about our needs and from our implementation mistakes.

I experienced a similar process while I was head of R&D at GigaSpaces and we started to implement Scrum. When I joined, a regression cycle took literally 30 days; when I left we were running 60 cycles per day and releasing enterprise software every couple of weeks. We faced a lot of skepticism from all over, both internal and external; however, being persistent about reaching the goal made it work. I think the same approach is the way to get to CD.

My 2 cents,
Guy

Ori Lahav

Nov 17, 2010, 5:13:39 PM
to iltec...@googlegroups.com
Guy,
I agree 100%

I think that, as in every major cultural change, it is first and foremost a leadership challenge.

Leadership with a clear and persistent message about this gets all the skepticism and fears out of the way.

Ori

Eishay Smith

Nov 17, 2010, 5:41:05 PM
to iltec...@googlegroups.com
Shameless plug of how companies would need to change to reach that goal: http://bit.ly/9TIajh
--es

Ori Lahav

Nov 18, 2010, 2:38:37 AM
to iltec...@googlegroups.com
Cool,
This interview actually puts the spotlight on the most important thing in Continuous Deployment - CULTURE!!!

- Developers that are more testing oriented.
- Developers that are more production-operations oriented.
- Planning ahead the flow of deployments and code changes.

This is something no tool will do for you - you just need to change the culture.

Changing the culture depends very much on leadership. Once the culture has changed, developers will develop the right tools to make their lives easier and safer (i.e. automated testing).

Maybe it's my personal style, but I usually like to manage processes in a more extreme way.
If a change is needed, do it. Things will probably break and we will fix them on the go, but that way your learning curve is much faster than with planning ahead and getting prepared for some unknown reality.

I'm getting too psychological...

yonatan maman

Nov 18, 2010, 4:51:22 AM
to iltec...@googlegroups.com
Guys,
No doubt that the major challenges in adopting CD are leadership, culture, fighting the pessimistic forces in the organisation, and building the confidence that it is feasible (or in other words, "begone, darkness; away, blackness").

Having said that, do you know of more concrete guidelines or technologies that can give a better starting point to someone who has passed the first leadership/cultural challenge and has already jumped into the water?

Ori, Guy - I guess that you are the best candidates to tell whether you could benefit from such a "reference implementation".
-- Yonatan

Ori Lahav

Nov 18, 2010, 6:53:32 AM
to iltec...@googlegroups.com
A few tips:

  1. Re immune system: use the continuous integration server to run service tests on your production environment periodically. There are a few advantages to that:
    1. It is meant to run tests.
    2. It gives you a UI and statistics for successes and failures.
    3. It sends alerts when things fail.
    4. Developers who are used to writing unit tests use the same method to write service tests that run on production.
  2. Use IRC/Yammer/... for notifying on production changes. We use the hashtag #prodchange for everything (code, hardware configuration) that changes the production environment. If it follows a change request, we attach its number to the posting so we have better tracking of who made the change, when, and what it was.
I guess there's more to come.

Eishay Smith

Nov 19, 2010, 12:46:04 AM
to iltec...@googlegroups.com
Good points. 
Our in-house philosophy is to automate our way into the culture and process we would like to have. We would rather have a process baked into our tooling than written in some wiki file and slowly ignored.

For example: 

We use our commit logs in many ways. The commit logs are an important way we communicate with each other and with the system. For that we use pre- and post-commit hooks (one of them verifies you actually write something meaningful in the commit message).

If a test is broken, a pre-commit hook will not allow any commit to the code repository unless it has the hashtag #buildfix in it. It enforces the culture and process of "tests are #1 and without them we stop the world"; it also sends a message to visual and audio devices in the office to let people know tests are broken.

It gives us two things:
* Process: engineers are forced not by a manager but by the tools they wrote themselves. 
* Culture: peer pressure. If the build is broken for a while, other developers will ask the one who broke it when it will get fixed, since they are blocked. They might feel more eager to help the person if he's a junior and needs help figuring out the problem. It also creates a Pavlov effect: even if you're not blocked on a failing test, you dislike the connotation of the spinning lights, as you remember what happened the last time it happened to you. In practice, new engineers tend to break the tests more, and without anyone telling them to, they very quickly learn to reduce these events.
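
The gate described above fits in a few lines. The following is a sketch and not Wealthfront's actual hook: the function name, the minimum message length and the way the build status is passed in are all assumptions made for illustration.

```python
BUILDFIX_TAG = "#buildfix"
MIN_MESSAGE_LENGTH = 10  # assumed threshold for a "meaningful" message


def check_commit(message, build_is_green):
    """Return (allowed, reason) for a proposed commit.

    build_is_green is whatever your CI server reports; how it is
    fetched is left out of this sketch.
    """
    text = message.strip()
    if len(text) < MIN_MESSAGE_LENGTH:
        return False, "commit message is too short to be meaningful"
    if not build_is_green and BUILDFIX_TAG not in text:
        return False, "build is broken: only %s commits are accepted" % BUILDFIX_TAG
    return True, "ok"
```

In a Subversion pre-commit hook the message would typically come from `svnlook log -t TXN REPOS`, and the script would exit with a non-zero status to reject the commit.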

When doing commits we also analyze the patches (simple text parsing, really). If someone adds a business logic file and does not add a test file in the same patch, a nice email is sent to the whole organization, letting them know that "I (Eishay) have just committed code without an appropriate test. I'm very sorry about it and I'll explain why I did it". The email includes the code that got committed. Once you need to explain the mis-commit to the team, the lesson is learned very fast. This way we build a culture instead of forcing it, and most of the time the developers are writing their own tools that ensure quality.
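
A rough sketch of that patch check, under assumptions that are not from the description above: a hypothetical layout where tests live under src/test/ and every other .java file counts as business logic. For brevity the email body here lists the offending files rather than the committed code.

```python
def added_without_tests(added_paths):
    """Return added business-logic files when the patch adds no test file.

    Assumed layout (hypothetical): tests live under src/test/ and all
    other .java files count as business logic.
    """
    java_files = [p for p in added_paths if p.endswith(".java")]
    tests = [p for p in java_files if p.startswith("src/test/")]
    logic = [p for p in java_files if not p.startswith("src/test/")]
    return logic if logic and not tests else []


def apology_email(author, offending_files):
    """Build the body of the notification sent to the organization."""
    lines = [
        "I (%s) have just committed code without an appropriate test." % author,
        "I'm very sorry about it and I'll explain why I did it.",
        "",
        "Files added without a test:",
    ]
    lines.extend("  " + path for path in offending_files)
    return "\n".join(lines)
```

The list of added paths would come from whatever the version control system reports for the commit; sending the email is left to the caller.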

We have many more examples like that (restricted APIs, coding conventions and more) that reduce the need for culture training and preservation, so we can focus on stuff that matters.

Other, non-automated culture measures we take are starting weekly eng meetings with test statistics (run times, slow tests, broken windows etc.) and doing post mortems all the time. Managers first, and then the rest, openly talk about things they f**ked up; it keeps the team learning all the time and not intimidated when they need to discuss their faults.

--es

Ori Lahav

Nov 19, 2010, 5:42:16 AM
to iltec...@googlegroups.com
Eishay
What I like about you guys is that you are evolving and improving all the time.
I must admit that I'm trying to keep up with all the stuff you guys do and find it hard :)

Back to what Yonatan asked. It's different.
You guys made a great decision when you started the company to go down the CD path. You developed your tools and culture straight for this path and created something which is a role model for the entire industry.
I think Guy and the SaaSPulse guys are following you and also took this decision at a very early stage.

I guess what Yonatan is asking is: how do you make the transition? How do you move an organization from the traditional way of managing development and deployment to something more like CD?

At Outbrain we ran on traditional methods for a long time. We spilled the bucket every other month. The flow of business needs and product enhancements was too slow and the whole thing was very much... not "outbrainy". It is Kent Beck and you who showed us that there is another way. However, we don't fool ourselves, and we acknowledge it's a process. "Even Wealthfront was not built in one day". So we had 2 options:
  1. Start building tools and automation for a long time, until we are comfortable enough to start continuously deploying.
  2. Start continuously deploying, see what needs to be automated and do it.
As a mindset at Outbrain, engineering is a tool for the business. Nothing is more important than business success (actually, happy content readers are more important :)). There is one time in the year when we stop the world and invest all our time in pure engineering stuff: the last 2 weeks of every year (Xmas time). Last year we called it Cost Reduction Week, and we put a big engineering effort into reducing serving costs. This year it's called "Quality Time", to increase code quality and take a big automation step for CD.

It's not that we don't invest in cost reduction and quality during the year. This year we did great and very significant projects in these areas, but we did not stop the world for them. (I know you meant something else by "stop the world" and I used it differently.)

This is why option #2 is much more "outbrainy". I guess you can tell better, but maybe it's a more "Israeli" way to do it.

I think every company that enters such a process should see how to do it in a way that reflects its culture.

Ran Tavory

Nov 19, 2010, 8:05:44 AM
to iltec...@googlegroups.com
Eishay - Pavlov effect - nice, Clockwork Orange stuff... I like the way you treat developers as if they were humans, or sort of, although Pavlov was experimenting with dogs ;)

Yonatan, I realize your question might be interpreted in a few different ways. One of them is the cultural aspect, and it's a profound one, but if what you're looking for is a recipe or a design pattern for CD then I'll do my best to share what I've learnt.
As Ori has mentioned, at Outbrain we've made a decision to go that route and are working on the culture and tooling as we go, so the points I mention below are in many cases things I've only learned from others or run as thought experiments; only a few have already been implemented by me. That was the disclaimer.
What I can share is not a design pattern (granted, things are different at every company) but rather a list of things I believe should be thought about and solved, or at least the things we think about at Outbrain. Some of them are applicable to CD; however, I discovered that many of them apply in the more general sense of just writing good, maintainable software. It's well known that testable code is also reusable code etc., so I think we may also establish that continuously deployed code is also reusable, testable etc.; it's almost a corollary.
I'll copy and paste bits of pages from our internal wiki that I wrote to bootstrap the process; come to think of it, some of them were written long before I had the pleasure of learning about the exemplary work done by kaching.

  1. Deployment Manager. First and foremost we need a deployment manager. Available tools on the market are Puppet, Chef, ControlTier and more. We need to find a tool that suits our needs and, if there is none, develop or extend one in house. The tool needs to satisfy the following requirements:
    1. Configuration management. E.g. configure each host according to its specific roles in a specific rack in a specific DC. For example, configure ServiceX on ServerY, or configure ServiceX in DataCenterZ. The configuration spans the application-level configuration (e.g. java properties) as well as JVM attributes (e.g. Xmx) and system properties (e.g. mkdir or such).
    2. Deployment management. Be able to copy binaries, stop/start tomcat, stop/start cassandra etc.
    3. Can run SQL patches in a controlled manner.
    4. Has an API or other automated means of control. Must be automation friendly.
    5. Built-in rollback ability.
    6. Self-service. Anyone can release anything; you don't need ops folks to help you with your release. The system needs to be sustainable enough so that idiots can't do too much harm, but the first step to that is to hire smart ppl ;)
    7. Nice to have - integration with version control, for example similar to what kaching has: if the log message contains #release:ServiceX, then if all tests pass ServiceX will be released from this revision.
    8. Nice to have - UI.
    9. Nice to have - reporting, revision control.
    10. Nice to have - integration with monitoring systems. Either it has a built-in monitoring system or, more realistically, the tool should be able to deploy monitoring configuration (OpenNMS) the same way other configurations are deployed, so that with each release monitoring is a streamlined part of the automated process.
  2. Monitoring. The current level of monitoring isn't sufficient to determine whether the newly installed app or the newly deployed configuration is in good shape or not. We've started putting a really simple layer of self-tests in place and we have some level of test automation, but we need to continue this work. We have to be able to automatically determine for each service, at any given time, whether it's functioning well or not. This means several things:
    1. Expand the self-test so it doesn't only test that all backends are in place, but also actually performs application logic and determines whether the result is acceptable or not. We need to have two types of self-test: small, which only checks that the backends are available (as is today), and extended, which runs system-wide tests and determines that the service is ready to start serving.
    2. KPI monitoring. (This part was too outbrainy so I removed the example part.)
    3. Performance and resource monitoring. Monitor number of threads, memory used, avg response time, 99th percentile response time etc.
  3. Rollbacks. We need to have an automated way of rolling back if we find that the new deployment has degraded any of the measured metrics listed above.
  4. Trunk stable. The dev and ops teams should stop working in branches and work only in trunk. Trunk must be stable. Always. All tests green, and new features, if they are not ready for prime time, need to be hidden by feature flags or enabled only for internal IP addresses.
    1. Trunk-stable also means that all forthcoming DB changes need to be backwards and forwards compatible.
  5. Testing. This subject was discussed enough so I won't expand. But we all understand it's the main pillar.
  6. Servicization or Componentization. Break down the services into smaller parts. By breaking down the system into smaller parts you get:
    1. Finer control over what's being released.
    2. Lets you test and monitor the single-task services more closely. Less to test => more confident.
    3. Easier rollbacks.
    4. Lets us optimize things that can only be optimized per service (not related to CD but still a nice side effect).
  7. Service and configuration publishing. To manage each process's backends we can implement an online configuration manager (just for the backends, not for all application properties). Kaching has done that by using ZooKeeper.
  8. Continuous Testing. What I mean by that is that we continuously test our production systems. Yes, production. (It's not related to JUnit Max, which looks cool but is unrelated.) We send HTTP requests to production systems and monitor their behavior. Of course, don't go crazy on that and load your production servers too much; stay sane (good advice for life ;). We do that at several levels: service level, UI level (Selenium).
  9. Monitor your logs continuously. Have a system that looks at the logs and spits out all errors and stack traces so anyone can see what's happening in real time in production. I think this one really helps the cultural side. If there's a stack trace on a monitor near you, you can't ignore it.
  10. Service Instrumentation. Add hooks to your services, such as self-tests or anything else that makes sense for your app, so you can monitor it in real time. At Google we had /rpcstats or something like that, which lists statistics about how long each RPC took, how many of them ended with errors, how many of them were above the desired 99th percentile etc. I can think of a few simple and common ones such as /logs (how many log lines, how many warnings, how many errors), /rpcstats, /perf etc. This is useful from both the automation perspective and for manual inspection.
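
As an illustration of the last item, the numbers behind something like an /rpcstats page could be computed as follows. This is a sketch only: the function names, the input shape (a list of duration and success pairs) and the nearest-rank percentile method are assumptions, not a description of what Google's endpoint did.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def rpc_stats(calls, slow_threshold_ms):
    """Summarize RPC calls given as (duration_ms, succeeded) tuples."""
    durations = sorted(duration for duration, _ in calls)
    return {
        "count": len(calls),
        "errors": sum(1 for _, ok in calls if not ok),
        "p99_ms": percentile(durations, 99) if durations else None,
        "over_threshold": sum(1 for d in durations if d > slow_threshold_ms),
    }
```

A service would keep a rolling window of recent calls and expose this dictionary over HTTP, where both people and the deployment tooling can read it.
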
As noted above, many of the points apply to general software best practices and not specifically to CD, which is nice since your effort is rewarded twice.
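
To make the rollback point (item 3) concrete, here is one possible shape for the decision, assuming metrics where lower is better and a made-up 10% tolerance. The function name and the metric names in the usage note are hypothetical.

```python
def degraded_metrics(baseline, current, tolerance=0.10):
    """Return the metrics that got worse than baseline by more than tolerance.

    Both arguments map metric name -> value where lower is better
    (for example error rate or p99 latency). An empty result means
    the new deployment can stay.
    """
    degraded = []
    for name, before in baseline.items():
        after = current.get(name)
        if after is None:
            degraded.append(name)  # a missing metric is treated as a failure
        elif after > before * (1 + tolerance):
            degraded.append(name)
    return degraded
```

For example, with a baseline of `{"error_rate": 0.01, "p99_ms": 200}` measured before the release, a deployment tool could compare against the same metrics a few minutes after the release and trigger the rollback if the returned list is not empty.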
--
/Ran

Ran Tavory

Nov 19, 2010, 10:04:39 AM
to iltec...@googlegroups.com
Another thing I forgot to mention is lean deployments. If you release your service once a month, then reduced functionality once a month, or even, god forbid, downtime at 2am, may be acceptable, but 50 times a day it isn't. This means a few things. First, a deployment needs to be rolling, with no downtime and minimal notice from the user's POV; you get that by running multiple instances of the service with a load balancer in front of them (which is a good idea anyway). Second, you need to keep your binaries slim. If you have a 70M war file, then copying it over to dozens of servers once a month is acceptable, but doing it 50 times a day is not negligible.
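
A minimal sketch of the rolling part, assuming the caller supplies four hooks into the load balancer and the deployment tool. The hook names and the batch approach are illustrative assumptions, not a description of any specific product.

```python
def rolling_deploy(hosts, batch_size, drain, install, health_check, restore):
    """Upgrade hosts in batches so most of the pool keeps serving traffic.

    drain/install/health_check/restore are callables supplied by the
    caller (hypothetical hooks into your load balancer and deployment
    tool). Returns the hosts upgraded so far; stops at the first batch
    that fails its health check.
    """
    upgraded = []
    for start in range(0, len(hosts), batch_size):
        batch = hosts[start:start + batch_size]
        for host in batch:
            drain(host)      # take the host out of the load balancer
        for host in batch:
            install(host)    # copy binaries and restart the service
        if not all(health_check(host) for host in batch):
            return upgraded  # stop; the failed batch stays out of rotation
        for host in batch:
            restore(host)    # put the host back behind the balancer
        upgraded.extend(batch)
    return upgraded
```

On the size point: 70 MB copied to, say, 24 servers (an assumed stand-in for "dozens") 50 times a day is 70 * 24 * 50 = 84,000 MB of transfer per day, which is why slim binaries matter.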
--
/Ran

Eishay Smith

Nov 19, 2010, 3:10:31 PM
to iltec...@googlegroups.com
Ori,

We do have some fancy stuff in house, but I deliberately listed two extremely simple automation actions. I'm sure it would take a few hours to get them rolling, and they will have a profound impact on an eng culture.
The pre-commit hook is less than twenty lines of code. Let me save you some time :-) http://eng.wealthfront.com/2010/09/keeping-trunk-stable.html
Log analysis could be a simple cron job with some Python script polling svn log, persisting the last checked revision in a file, parsing the patch for a pattern and sending an appropriate email.
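
The bookkeeping part of such a cron job might look like the sketch below. It is an assumption-laden outline: the function names are invented, and fetching revisions from `svn log` and analyzing each patch are left to callables that the caller provides.

```python
import os


def read_last_revision(state_path):
    """Return the last revision already checked, or 0 on the first run."""
    if not os.path.exists(state_path):
        return 0
    with open(state_path) as state_file:
        return int(state_file.read().strip() or 0)


def process_new_revisions(state_path, fetch_revisions, handle_revision):
    """Run handle_revision on every revision newer than the saved one.

    fetch_revisions(after) must return an ascending list of revision
    numbers greater than `after` (for example by parsing `svn log`
    output); both callables are placeholders supplied by the caller.
    """
    last = read_last_revision(state_path)
    for revision in fetch_revisions(last):
        handle_revision(revision)
        last = revision
        with open(state_path, "w") as state_file:
            state_file.write(str(last))  # persist progress after each revision
    return last
```

Writing the state file after every revision means a crash in the middle of a run does not cause the already handled revisions to be emailed about twice.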

Re: "stop the world", hope you'll find the following related to our discussion.
There are a few garbage collection algorithms out there. A garbage collector cleans junk from the system. The most common is "stop the world", which usually runs when the system is out of resources and must clean the stable to operate. Though it may be efficient, online systems serving clients with high SLA requirements find it prohibitive. A common alternative is a concurrent algorithm (CMS), which is implemented by dedicating some percentage of system resources to GC on a continuous basis, with some brief pauses (aka FixitDays). Even a CMS-like system gets overwhelmed from time to time and must do a stop-the-world to set the record straight, though in most cases it means the system is not tuned right and will soon be thrown into a spiral of death.
Granted, garbage collection systems are not eng orgs, and one must put a stake in the ground and start somewhere. Some organizations find throughput more important, and some (such as lean startups) tend to go with low latency: "i don't care about a perfect product in two months, i want a working product by tomorrow!".

As for Ran's point: hiring the best is indeed the key, and we do abuse our engineers. Our next addition to our build system is a gun that shoots the engineer who broke the build. We're starting to invest in a system mapping engineers to coordinates in the office and nuking them off their chairs.

Cheers, Eishay

Ori Lahav

Nov 19, 2010, 4:05:51 PM
to iltec...@googlegroups.com
Eishay - this discussion is just getting better :)

I believe we agree on a few things that must be stated (for the benefit of people who ask questions like Yonatan's).
1. This game is for top-tier engineers. From my perspective, only engineers whose judgment and self-criticism I can count on, who understand software systems well and who have the ability to produce and improve their own immune systems are in this league. If you don't believe your team is up to this, don't go down this path... or replace the team. If you have that, all the procedures, automated culture-enforcing systems, meetings and guns are a matter of organizational culture. For example, Outbrain is a relatively meeting-less company; we have 2 meeting rooms and their lights are off most of the time. So an "I f**ked up" meeting is not right for us, but people will freely Yammer their f**k-ups and move on to fix them.

2. I super agree on this: "i don't care about a perfect product in two months, i want a working product by tomorrow!". Just wanted to remind you that my product is "interesting content links" and not Continuous Deployment automation. I'm doing CD because it helps me build a better product on time, and automation will make it even better. But if I stop everything and do only automation... I miss the point of continuously improving the product. Right?

Eishay Smith

Nov 19, 2010, 5:05:05 PM
to iltec...@googlegroups.com
IMO organizational learning is something you must invest in. I'm not talking about "tech talks", "all hands strategy" and "our values"; it's about the day-to-day operation of the org. It's not about "MEETINGS", it's "micro meetings": (1) what happened, (2) why x 5, (3) some action item, and (4) let's take the rest offline.
It helps contain people's egos and DRY up mistakes, which *should* happen if you go fast enough.

Agreed on the second point. CD is a tool similar to source control or a backup system; the investment should be proportional to the impact of it not existing or lacking functionality. Unfortunately, most organizations don't know what they can do with CD, thus they assume they don't need it -> things not broken -> why fix?

Ori Lahav

Nov 19, 2010, 6:07:41 PM
to iltec...@googlegroups.com
Eishay,
With your experience in pitching this, do you get a lot of "things not broken -> why fix?"?
When I talk to people they usually say: "WOW, X deployments a day? We wish we could do that BUT..." and then all their secret fears are hidden behind that BUT.

At least the people I talk to (maybe because they are messy Israelis :)) already feel the pain of long cycles and would like to shorten them. They seem like the guy standing on the edge of the pool, waiting for someone to give him a kick in the "BUT" and throw him into the water.

Eishay Smith

Nov 19, 2010, 6:26:52 PM
to iltec...@googlegroups.com
I don't get too many "things not broken"; it's more of "well, it's part of doing business in this domain, everybody knows that - you think I'm new to this game?".
Businesses like these that don't iterate fast enough are being run down by ones that do and that accept small failures and recoveries as part of doing business. With internet products taking a bigger share of the software market, soon enough we'll see "oh, I can't afford *not* to do CD, all my competitors do. We have a feature-perfect product no one needs and they have a crappy product that the customers must have". Unfortunately for them, Wealthfront's competitors did not use CD and are doing pretty badly.