When drawing an analogy between software development and system administration, I often hear people compare software testing with monitoring in the sysadmin world.
I agree that monitoring is a crucial part of being on top of the situation: typically we would have
/Unit tests/
* thresholds on CPU, disk, network usage
* check for processes
* check validity of config files
* check logfiles for warnings
* monitor security ports, checksum of files
/Functional tests/
* send a test mail
* call a webpage and check the HTTP status
* send a query to the database
/Integration tests/
* login, post a test entry, ...
* login, send a test mail and see if it arrives
(do these categories seem right as a mapping?)
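As an illustration, one of the threshold checks above could be written as a small assertion-style function. This is only a sketch: the name `disk_usage_ok` and the 90% default are my own assumptions, not something an existing monitoring tool prescribes.

```python
import shutil


def disk_usage_ok(path, max_used_fraction=0.9):
    """Return True while the filesystem holding path is below the threshold.

    max_used_fraction is a hypothetical default; pick whatever fits the system.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.used / usage.total <= max_used_fraction


if __name__ == "__main__":
    # A monitoring wrapper would page or send a mail instead of printing.
    print("disk ok" if disk_usage_ok("/") else "disk above threshold")
```

The same shape (a function that returns pass or fail) works for the process, config and logfile checks, which is what makes them feel like unit tests.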
In case of alarm they would page, send a mail, or send an SNMP trap to tell us what is happening. In production we can often only do read-only tests and scenarios that are non-destructive. If this is the only thing you do as a sysadmin, it is a rather fatalistic approach.
Therefore IMHO an agile sysadmin should go beyond this 'monitor' approach: within the test environment they should actively develop scenarios and test them.
Let's say you enable a raid file system:
* Behavior testing: As a user I want to read and write files from /data
  o Scenario: given a disk has crashed, I want to have no problem writing to /data
    + A fixture would be: dd if=/dev/null of=/dev/hda1
  o Scenario: given a disk is full, I don't want to lose my data
    + A fixture would be: write a large file until the disk is full
  o Scenario: given a heavy load, I don't want to lose data
    + A fixture could be: a lot of fork and exec simulating heavy load
  o Scenario: if the network connection is interrupted, no data must be lost in the application
    + A mock could be: using iptables to simulate the network failure
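The disk-full scenario above could be automated without really filling a disk. A minimal sketch in Python, under some loud assumptions: `safe_write` is a hypothetical helper (write to a temp file, then rename), and the full disk is simulated by making `os.fsync` raise ENOSPC through a mock rather than by writing a large file.

```python
import errno
import os
import tempfile
from unittest import mock


def safe_write(path, data):
    """Hypothetical helper: write to a temp file, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)
    except OSError:
        # Remove the partial temp file; the original file is left untouched.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def scenario_disk_full_keeps_old_data():
    """Given the disk is full, when I write, then the existing data survives."""
    with tempfile.TemporaryDirectory() as data_dir:
        target = os.path.join(data_dir, "report.txt")
        safe_write(target, b"old contents")

        # Fixture: simulate "No space left on device" instead of filling a disk.
        disk_full = OSError(errno.ENOSPC, "No space left on device")
        with mock.patch("os.fsync", side_effect=disk_full):
            try:
                safe_write(target, b"new contents")
            except OSError as exc:
                write_failed = exc.errno == errno.ENOSPC
            else:
                write_failed = False

        with open(target, "rb") as handle:
            old_data_survived = handle.read() == b"old contents"
        return write_failed and old_data_survived
```

In a real test environment the fixture would be the large file from the scenario; the mock only makes the scenario cheap enough to run on every change.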
These scenarios tell a better truth than df -k or an SNMP trap, don't they?
What are you people testing before you are confident that a system is 'under control'? Or is this still ad hoc testing after an automated install?
> While making an analogy between the software development and system
> administration, I often hear people compare software testing with
> monitoring in the sysadmin world.
> I agree that monitoring is a crucial part of being on top of the
> situation: typically we would have
> /Unit tests/
> * thresholds on cpu, disk , network usage
> * check for processes
> * check validity of config files
> * check logfiles for warnings
> * monitor security ports, checksum of files
> /Functional tests/
> * send a test mail
> * call a webpage and check the http status
> * send a query to the database
> /Integration tests/
> * login, post a test entry, ...
> * login, send a test mail and see if it arrives
> (do these categories seem right as a mapping?)
Yep for the integration test.
I'm not sure of the granularity for the unit tests vs. functional tests. Your implied rule seems to be system level vs. application level?
Some examples are definitely in the grey area...
> In case of alarm they would page, send a mail, send an SNMP trap, to > tell us what is happening. > In production we can often only do readonly tests and scenario's that > are non destructive.
Agreed, but stay aware that read-only != non-destructive.
You can have scenarios that make changes without doing damage (say, a transaction on a website for a dummy customer).
> If this is the only thing you would do as a sysadmin , this is a rather > fatalistic approach.
> Therefore IMHO an agile sysadmin should go beyond this 'monitor' approach, > within the test environment he should actively develop scenario's and > test them.
> Let's say you enable a raid file system:
> * Behavior testing: As a user I want to read and write files from /data > o Scenario: given a disk has crashed, i want to have no > problem writing it to /data > + A Fixture would be: dd if=/dev/null of=/dev/hda1
I'm not sure what you mean in that section, but I think the example should be dd if=/dev/zero of=/dev/hda1 or dd if=/dev/random of=/dev/hda1 (reading from /dev/null returns end-of-file immediately, so nothing would be written).
> o Scenario: given a disk is full, i don't want to loose my data > + A Fixture would be: write a large diskfile to disk is full > o Scenario: given a heavy load , i don't want to loose data > + A Fixture could be: a lot of fork and exec simulating > heavy load > o Scenario: if network connection is interrupted no data must > be lost in the application > + A mocking could be: using iptables to simulate the > network failure
> These scenario's tell a better truth then df -k or an snmp trap not?
> What are you people testing before you are confident that a system is > 'under control'? > Or is this still adhoc testing after an automated install?
I would prefer doing automated testing.
On new hardware, I would run subsystem stability tests: continuous bonnie++ to stress I/O, cpuburn to stress the CPU, memtest to stress memory, smartctl to test hard drives and so on, to allow early detection of hardware issues.
On new production software I'd prefer, in a perfect world:
- load testing with either simple or advanced test scenarios
- optionally using fuzzing tools to test robustness.
I like your ideas of automating failures with iptables for instance if you want to test configurations with failovers, but I think it's not easy to configure so it can be used "unattended".
> I agree that monitoring is a crucial part of being on top of the situation:
> typically we would have
> Unit tests
> * thresholds on cpu, disk , network usage
> * check for processes
> * check validity of config files
> * check logfiles for warnings
> * monitor security ports, checksum of files
> Functional tests
> * send a test mail
> * call a webpage and check the http status
> * send a query to the database
> Integration tests
> * login, post a test entry, ...
> * login, send a test mail and see if it arrives
> (do these categories seem right as a mapping?)
I quite like thinking about tests in terms of the quality and the confidence they give you. From a development perspective there is one set of definitions and a nice diagram of quality/feedback here:
I've seen a lot of people struggle with the naming of what Nat & Steve call Integration Tests and Acceptance Tests (Functional Tests, Customer Tests, System Tests).
To a development team Integration Tests would be tests that exercise the integration with other systems, which map to your Functional Test category and your Integration Tests would map to Acceptance Tests.
It's important to think about what the different test levels give you and how that maps to operations/systems engineering. Also we're hitting the fact you illustrated that testing and monitoring are conflated.
In development, the internal quality of code driven by unit tests enables ease of understanding and ease of change.
Your BDD-style example captures intent much more and would allow you to potentially change the implementation. E.g. I've seen cases where things such as kernel parameters for tuning, say, NFS get baked into a system and they calcify: no one remembers the exact criteria for success, and worse, they may vary between systems. Having a set of tests focussed on that component allows you to say things like "I expect a throughput of X so that ...". Then if you change your backend storage to a filer or another distributed filesystem you can ensure you meet the requirements, and remove legacy configuration settings.
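To make the "I expect a throughput of X" idea concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration: the function names, the 8 MiB sample size, and the use of a plain sequential write as the measurement. A real acceptance test would use whatever workload the story actually cares about.

```python
import os
import tempfile
import time


def write_throughput_mb_s(directory, size_mb=8):
    """Write size_mb MiB to a temp file in directory and return MiB/s."""
    block = b"\0" * (1024 * 1024)
    start = time.perf_counter()
    with tempfile.NamedTemporaryFile(dir=directory) as handle:
        for _ in range(size_mb):
            handle.write(block)
        handle.flush()
        os.fsync(handle.fileno())  # make sure the data really hit the storage
    elapsed = time.perf_counter() - start
    return size_mb / elapsed


def meets_requirement(measured_mb_s, required_mb_s):
    """The 'X' from the story becomes an explicit, testable number."""
    return measured_mb_s >= required_mb_s
```

Because the test only talks to a directory, it stays valid when the storage behind that directory changes from local disk to a filer or a distributed filesystem.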
However in a lot of these cases we'll be testing around the interfaces/edges of systems, and I'm not convinced they are unit tests. Really they are the acceptance tests of a particular story around system performance which will usually have some business value attached as in your scenarios. To me we'd also probably be writing unit tests at the implementation level, asserting that the right configuration is there.
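An implementation-level check of that kind could look like the sketch below, which parses sysctl.conf-style text and asserts on a setting. The parser, its name, and the example key and value are assumptions for illustration only, not recommended tuning.

```python
def parse_sysctl(text):
    """Parse sysctl.conf-style text into a dict, skipping comments and blanks."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, separator, value = line.partition("=")
        if separator:
            settings[key.strip()] = value.strip()
    return settings


def config_has(text, key, expected):
    """Unit-level assertion: the right configuration is there."""
    return parse_sysctl(text).get(key) == expected


if __name__ == "__main__":
    example = "# example value only\nnet.core.rmem_max = 262144\n"
    print(config_has(example, "net.core.rmem_max", "262144"))
```

This says nothing about whether the setting is still useful; that is the job of the acceptance-level test around the story.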
One thing that strikes me is that unlike developing in OO it isn't always obvious what the owner for the behaviour is and also how to link the test to the implementation language (which will often end up being config). If you're writing systems tests in this fashion how are you finding refactoring or dealing with changing requirements?
> I'm not sure what you mean in that section, but I think the example > should be > dd if=/dev/zero of=/dev/hda1
Spot on!
>> I would prefer doing automated testing.
>> ...
>> I like your ideas of automating failures with iptables for instance if >> you want to test configurations with failovers, but I think it's not >> easy to configure so it can be used "unattended"
Of course you can't test everything, and you have to strike a good balance between the effort you spend and the benefit you get.
Unattended requires a good deal of automation, I agree. Virtualization can help to get control over this, but it does not have to restrict you. Rebooting or other stuff might be done via the Lights Out Management interface, and network interfaces by scripting the config of a Cisco router ...
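For the iptables case quoted above, most of the "unattended" worry is about cleanup: the rule has to go away even when the scenario fails. A minimal sketch in Python, where `drop_rule` and `network_down` are hypothetical names and the host address is a documentation-range placeholder. Actually running it needs root on a Linux box with iptables.

```python
import contextlib
import subprocess


def drop_rule(action, host):
    """Build an iptables command that adds (-A) or deletes (-D) a DROP rule."""
    if action not in ("-A", "-D"):
        raise ValueError("action must be -A or -D")
    return ["iptables", action, "OUTPUT", "-d", host, "-j", "DROP"]


@contextlib.contextmanager
def network_down(host, run=subprocess.check_call):
    """Block outgoing traffic to host for the duration of the with-block."""
    run(drop_rule("-A", host))
    try:
        yield
    finally:
        # Always remove the rule, even if the scenario under test fails.
        run(drop_rule("-D", host))


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    with network_down("192.0.2.10", run=print):
        print("run the failover scenario here")
```

The `run` parameter is there so the sequence can be checked without touching the firewall; pass nothing to use the real `subprocess.check_call`.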
Did you have a specific scenario in mind of a difficult to configure scenario? I like a challenge ;-)
Patrick Debois wrote: >>> I would prefer doing automated testing.
>>> ...
>>> I like your ideas of automating failures with iptables for instance if >>> you want to test configurations with failovers, but I think it's not >>> easy to configure so it can be used "unattended" > Off course you can't test everything, and you have to take a good > balance between the effort you spend and the benefit you get.
> Unattended requires a good deal of automation, I agree. Virtualization > can help to get control over this , but it does not have to restrict you. > Rebooting or other stuff might be done via the Lights Out Management > Interface. Network interfaces by scripting the config of a cisco router ...
> Did you have a specific scenario in mind of a difficult to configure > scenario? I like a challenge ;-)
Sure, you can probably automate everything, but my point was that some things might cost you so much it's not worth it.
For instance, if you want to test all the possible failure cases that might happen in a highly available distributed solution, you can spend an awful lot of time. (Of course, the more complex it is the more you have to test, hence we're back to the KISS paradigm.)
I think the low-hanging fruit comes from the automated integration tests: checking that after the deployment of a new software version, it works as intended. There, high-level tests such as playing a set of scenarios once, then in a load-testing fashion, are probably your best bet.
Hence we're back to the idea that continuous integration tests are the next step for Ops if you want to gain in maturity, when considering an environment where you integrate something that is developed in-house. This is quite a restricted field of application...
Stuff like validation of operating systems, hardware and so on can probably benefit from a subset of this (think performance tweaking of an OS for instance, as was mentioned earlier: checking that it's still valid after an SP/new version of this OS, or with the new SAN/network switches etc. is probably worth it) but is imho more often a one-shot manual operation. Checking the performance of your SAN storage when in degraded rebuild mode makes sense of course, but it's far from being easy to automate.
And I noticed that even in the limited context of continuous integration, you haven't mentioned the data migration part yet... ;)
> Sure, you can probably automate everything, but my point was that some > things might cost you so much it's not worth it.
> For instance if you want to test all the possible failure cases that > might happen in an highly-available distributed solution, you can spend > an awful lot of time. (Of course, the more complex it is the more you > have to test, hence we're back to the KISS paradigm.)
Yep, I've seen clustering being removed and replaced by rapid re-deployment.
> I think the low hanging fruit come from the automated integration tests: > checking that after the deployment of a new software version, it works > as intended. There, high level tests such as playing a set of scenarii > once, then in a load testing fashion is probably your best bet.
> Hence we're back to the idea that continuous integration tests are the > next step for Ops if you want to gain in maturity, when considering an > environment where you integrate something that is developed in-house. > This is quite a restricted field of application...
I don't think it matters whether it is developed in-house or out-of-house.
> Stuff like validation of Operating Systems, hardware and so on can > probably benefit from a subset of this (think performance tweaking of an > OS for instance as was mentioned earlier, checking that it's still valid > after a SP/new version of this OS, or with the new SAN/network > switches /etc is probably worth it) but are imho more often a one shot > manual operation.
You might see it as a one-shot. Still, when you want to keep up with security patches for the OS, database, middleware and framework, I would most certainly want to have more testing available, in-house or out-of-house.
I feel one of the reasons we don't patch that often is that we are afraid of the impact of the patches. It used to be like that in development with deployments too: don't deploy often because you break things. They have succeeded in deploying often now because they have been investing in tests. So maybe sysadmins should do the same?
> Checking the performances of your SAN storage when in > degraded rebuild mode make sense of course, but it's far from being easy > to automate.
> And I noticed that even in the limited context of continuous > integration, you haven't mentioned the data migration part yet... ;)
Thanks for the pointer, it really helped me visualize the difference between these tests, and that got me thinking:
Why did I say that df -k and CPU are more about internal quality than external quality?
Well, in our case we have a multiserver setup with load balancers in front of it, so high CPU and disk usage on one server might not have an impact on the actual users. So I figure it has more to do with internal quality.
I agree that all changes should ideally happen based upon a user story. But as you mentioned, who is our user?
Is it the application (requiring a place to be run) , is it the end user (having an infrastructure that he can reach), is it the developers (who want to deploy their app)?
User stories don't always come from project mode; they might also be tickets coming in from end users or security, or bugfix patches coming in from vendors.
On your question on refactoring, would you consider the following cases refactoring?
During Project Mode:
Say you want to start a new application, and developers start right away. Often you only have a small piece they can use to develop against, while the actual hardware is being ordered.
I have often seen the Big Design Up Front syndrome within the sysadmin group: we don't release the environment to development unless it is completely finished.
This is totally wrong! Both sysadmin and developer can learn by doing their first deployment in a not-so-finished environment.
So you start with one server (doing web, db, dns, firewall, ldap, ..., backup, ...) and step by step, when the hardware becomes available, you do the migrations.
You switch from host files to DNS, you set up a dedicated firewall instead of iptables on the box, a dedicated router instead of VLANs on the Linux box.
Two network interfaces instead of one, for bonding or teaming.
So you change your environment iteratively instead of incrementally.
During Production Mode:
If you have a lot of shared services such as mail, imap, dns, firewall, proxy, chances are that you have to refactor your environment to accommodate changes.
Changing IP ranges, host names, routing and patches all introduce changes. Let's say you have to move a website to a bigger server, or newer hardware. You are actually changing the environment. And similar to TDD, the important thing is that things keep on working after the change.
>> Hence we're back to the idea that continuous integration tests are the >> next step for Ops if you want to gain in maturity, when considering an >> environment where you integrate something that is developed in-house. >> This is quite a restricted field of application...
> I don't think it has to do with developed in our out-house.
Most of my experience has been supporting bespoke/in house development teams. Sadly a lot of operations teams I've seen don't even use source control for configs/scripts let alone config management. Simple disciplined practices can help but a lot of teams feel very swamped by the day to day firefighting.
>> Stuff like validation of Operating Systems, hardware and so on can >> probably benefit from a subset of this (think performance tweaking of an >> OS for instance as was mentioned earlier, checking that it's still valid >> after a SP/new version of this OS, or with the new SAN/network >> switches /etc is probably worth it) but are imho more often a one shot >> manual operation. > You might see is as a one shot, still when you want to keep with > security patches for the OS, database, middleware, framework. > I would most certainly want to have more testing available, in or out-house.
In some ways a lot of the fear and walls that get put up between operations and development teams and operations and vendor patches is due to being burnt in the past. If instead we embrace failure and volatility into our processes and try and figure out smart practices and principles that let us deal with the inherent complexity of systems I think that's going to drive out a set of good practices we can apply.
On a side note do most people have a test environment for operations, or a development one for that matter?
> I feel one of the reasons we don't patch that often because we are > afraid the impact of the patches. > It used to be like that in development with deployments too. Don't > deploy often because you break things.
Which really freezes the business's ability to get to market quickly.
Obviously there are business risks and costs/benefits involved in patching systems. If verifying that a system works correctly is an expensive, time-consuming process (e.g. a bank certifying a particular stack), then you don't want to do it often. If we can somehow continually measure and adjust a system at low cost, whilst keeping it operational, then that concern goes away.
There are a lot of interesting solutions like degrading applications gracefully (as John Allspaw mentions in his book and here http://highscalability.com/how-succeed-capacity-planning-without-real...), A/B testing, incremental test deploy, etc. Is it feasible to use these on our supporting systems? I think for some organisations operations is a "secret sauce" but in others that might not be a suitable model.
> And they have succeed using the deploy often now because they have been > investing in tests.
Indeed, if something is risky do it more :)
The key here is fail fast (and at the appropriate place) and have a good feedback cycle. If you don't adopt that then you risk failures too late which cost more (production is a hard place to fix things).
>> And I noticed that even in the limited context of continuous >> integration, you haven't mentioned the data migration part yet... ;)
> Ah, you got me here ;-)
When we think about deployment of development applications, we also have configuration and data going alongside that. There are some strategies for handling this, using things like dbdeploy or django evolution to manage schema changes for an application. I'd like to see data be more of a first-class citizen in a fully agile operations team: from logfiles to application data, there is a startling amount of information about systems that gets ignored, and that's wasteful.
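The general idea behind schema-change tools of this kind is numbered delta scripts plus a record of which ones have been applied. The sketch below shows that idea with Python's sqlite3 module; it is an illustration of the technique only, not how dbdeploy or django evolution are actually implemented, and the `changelog` table and `apply_migrations` function are hypothetical names.

```python
import sqlite3


def apply_migrations(conn, migrations):
    """Apply numbered SQL deltas not yet recorded in the changelog table.

    migrations maps a version number to one SQL statement.
    Returns the list of versions applied by this call.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS changelog (version INTEGER PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT version FROM changelog")}
    newly_applied = []
    for version in sorted(migrations):
        if version in applied:
            continue
        # Record the version right after its delta runs;
        # `with conn` commits when the block succeeds.
        with conn:
            conn.execute(migrations[version])
            conn.execute("INSERT INTO changelog (version) VALUES (?)", (version,))
        newly_applied.append(version)
    return newly_applied
```

Running the same set of deltas twice is a no-op the second time, which is what lets a deployment script call it unconditionally.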
Some problems are harder to test, but using strategies such as mocks we can certainly test out exotic failure cases that are much harder to set up normally.
>> Sure, you can probably automate everything, but my point was that some >> things might cost you so much it's not worth it.
>> For instance if you want to test all the possible failure cases that >> might happen in an highly-available distributed solution, you can spend >> an awful lot of time. (Of course, the more complex it is the more you >> have to test, hence we're back to the KISS paradigm.)
> Yep, I've seen clustered being removed and being replaced by rapid > re-deployment.
You're lucky :) This is what I've been preaching for a while now, but I never succeeded in seeing it actually implemented...
>> I think the low hanging fruit come from the automated integration tests: >> checking that after the deployment of a new software version, it works >> as intended. There, high level tests such as playing a set of scenarii >> once, then in a load testing fashion is probably your best bet.
>> Hence we're back to the idea that continuous integration tests are the >> next step for Ops if you want to gain in maturity, when considering an >> environment where you integrate something that is developed in-house. >> This is quite a restricted field of application...
> I don't think it has to do with developed in our out-house. >> Stuff like validation of Operating Systems, hardware and so on can >> probably benefit from a subset of this (think performance tweaking of an >> OS for instance as was mentioned earlier, checking that it's still valid >> after a SP/new version of this OS, or with the new SAN/network >> switches /etc is probably worth it) but are imho more often a one shot >> manual operation. > You might see is as a one shot, still when you want to keep with > security patches for the OS, database, middleware, framework. > I would most certainly want to have more testing available, in or out-house.
> I feel one of the reasons we don't patch that often because we are > afraid the impact of the patches. > It used to be like that in development with deployments too. Don't > deploy often because you break things. > And they have succeed using the deploy often now because they have been > investing in tests. > So maybe sysadmins should do the same?
I agree there. Still, I am saying this is still sci-fi :) Think incremental! Not a single place I've worked at was even monitoring all their systems, for a start...
There's a lot to do, and I think there's a need for a big paradigm shift in the industry before we can get hardware/software that is properly tooled to help you build a higher level of maturity.
>> Checking the performances of your SAN storage when in >> degraded rebuild mode make sense of course, but it's far from being easy >> to automate.
>> And I noticed that even in the limited context of continuous >> integration, you haven't mentioned the data migration part yet... ;)
> Ah, you got me here ;-)
Maybe this is because Ops alone is not the right level there, and this implies synergy between Ops and Dev?
> thanks for the pointer, it really helped me visualize the difference > between these tests and that got me thinking:
Not a problem, it'd be nice to get a wiki or something for this group so that we can throw up some pictures, more persistent examples, etc.
For kicks we could even try to do it as an exercise in collaboratively setting up a service in an agile way :)
> Why did i say that df -k , cpu are more on internal quality then > external quality? > Well in our case we have a multiserver setup with loadbalancers in front > of it, so having a high CPU and disk usage on one server > might not have an impact on the actual users, so i figure it has to do > with internal quality more.
This is interesting, as just by choosing to think about these qualities like this there is a requirement around user availability. In some ways that story leads us in the direction of horizontal scalability as a design decision (or even shared nothing) as we apply that criterion to growing capacity.
Trying to think about the why of each of these things really helps me think about the system in different ways.
> I agree that all changes should ideally happen based upon a user story. > But as you mentioned, who is our user? > Is it the application (requiring a place to be run) , is it the end user > (having an infrastructure that he can reach), is it the developers (who > want to deploy their app)?
I think we have stories from all those users/stakeholders (and also we generate stories for development too). I'd probably not say the application directly as there are usually more human stakeholders who gain benefit from it running.
> User stories don't always come from the project mode, but this might be > tickets coming in from the endusers or security, bugfixes patches coming > in from vendors.
This is something I'm struggling with - trying to think of a better way of managing the interrupt driven work. I need to look at some of the Lean approaches to this some more.
> On your question on refactoring, would you consider the following cases > refactoring?
> During Project Mode:
> Say you want to start a new application, developers start right a way. > Often you only have a small piece they can use to develop against, while > the actual hardware is being ordered.
This is one of the key things that differentiates systems/operations work from development: large lead times outside your control. Obviously there are strategies to make provisioning easier, such as virtualisation, cloud computing, etc.
Again if we put the focus on failing fast and lowering the cost of change we can and should do something pragmatic, but it might also require saying there are some risky areas that we want to prioritise first so that there is a greater chance of success.
> I often have seen the Big Design Upfront syndrome within the sysadmin > group: we don't release the environment to development unless it is > completely finished.
Yes that's not uncommon.
> This is totally wrong ! Both sysadmin and developer can learn by doing > their first deployment in a not so finished environment. > So you start with one server (doing web, db, dns, firewall, ldap, ..., > backup, ..) and step by step when the hardware becomes available, you do > the migrations. > You switch from host files to DNS, you setup a dedicated firewall > instead of iptables on the box, a dedicated router instead of vlans on > the linux box. > Two network interfaces instead of one for bonding, teaming. > So you change you environment interatively instead of incrementaly.
To zoom out to your question: yes, any of these could be refactoring, but without concrete practical details it's hard to say if they are. In theory you could have tests for something and only change the internals without changing functionality.
I'm growing more and more fond of the red-green-refactor process as a set of continual small steps to make the system better. I was talking with a colleague about this yesterday in the context of development, but I think that we should in some way be constantly striving for clean, understandable systems.
> During Production Mode:
> If you have a lot of shared services such as mail, imap, dns, firewall, > proxy, changes are that you have to refactor your environment to > accommodate changes. > Changing Ip ranges, host names, Routings, patches all introduce changes. > Let's say you have to move a website to a bigger server, or newer > hardware. You are actually changing the environment. And similar to TDD, > the important thing is that things after the change keep on working again.
There are probably some common patterns here (migrate service, rename system, etc.). Possibly we often try to do too much in one step, rather than a set of small changes. Given your example of moving to a new server, we can probably break this down into small steps that each bring us closer to that without changing how the system behaves. There probably needs to be some thinking about good common techniques for this that can be shared among systems teams, à la Fowler's Refactoring.
To me you can't be confident in making these changes without having some way of testing that part of the system is operating.
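A rough sketch of what "small steps, each one tested" might look like as code, in Python. The function name and the shape of `steps` and `checks` (lists of name/callable pairs) are assumptions for illustration; the steps themselves would be whatever the migration needs.

```python
def run_migration(steps, checks):
    """Run steps in order; after each one, run every check and stop on failure.

    steps:  list of (name, callable) pairs that change the environment.
    checks: list of (name, callable) pairs that return True when all is well.
    Returns (completed step names, failing step name or None, failed check names).
    """
    completed = []
    for name, step in steps:
        step()
        failed = [check_name for check_name, check in checks if not check()]
        if failed:
            # Stop right here: the last small step is the one to look at.
            return completed, name, failed
        completed.append(name)
    return completed, None, []
```

The smaller each step is, the less there is to investigate (or roll back) when a check goes red.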
Paul Nasrat wrote: >>> Hence we're back to the idea that continuous integration tests are the >>> next step for Ops if you want to gain in maturity, when considering an >>> environment where you integrate something that is developed in-house. >>> This is quite a restricted field of application...
>> I don't think it has to do with developed in our out-house.
> Most of my experience has been supporting bespoke/in house development > teams. Sadly a lot of operations teams I've seen don't even use source > control for configs/scripts let alone config management. Simple > disciplined practices can help but a lot of teams feel very swamped by > the day to day firefighting.
>>> Stuff like validation of Operating Systems, hardware and so on can >>> probably benefit from a subset of this (think performance tweaking of an >>> OS for instance as was mentioned earlier, checking that it's still valid >>> after a SP/new version of this OS, or with the new SAN/network >>> switches /etc is probably worth it) but are imho more often a one shot >>> manual operation. >> You might see is as a one shot, still when you want to keep with >> security patches for the OS, database, middleware, framework. >> I would most certainly want to have more testing available, in or out-house.
> In some ways a lot of the fear and walls that get put up between > operations and development teams and operations and vendor patches is > due to being burnt in the past. If instead we embrace failure and > volatility into our processes and try and figure out smart practices > and principles that let us deal with the inherent complexity of > systems I think that's going to drive out a set of good practices we > can apply.
> On a side note do most people have a test environment for operations,
> or a development one for that matter?
>> I feel one of the reasons we don't patch that often is because we are
>> afraid of the impact of the patches.
>> It used to be like that in development with deployments too. Don't
>> deploy often because you break things.
> Which really freezes the business's ability to get to market quickly.
> Obviously there is a business risk and costs/benefits involved with
> patching systems. If understanding whether a system is working correctly
> is an expensive, time-consuming process (e.g. a bank certifying a
> particular stack) then you don't want to do it often. If we can
> somehow continually measure and adjust a system at low cost, whilst
> keeping it operational, then that concern goes away.
> There are a lot of interesting solutions like degrading applications
> gracefully (as John Allspaw mentions in his book and here
> http://highscalability.com/how-succeed-capacity-planning-without-real...),
> A/B testing, incremental test deploy, etc. Is it feasible to use these
> on our supporting systems? I think for some organisations operations
> is a "secret sauce" but in others that might not be a suitable model.
>> And they have succeeded with deploying often now because they have been
>> investing in tests.
> Indeed, if something is risky do it more :)
> The key here is fail fast (and at the appropriate place) and have a
> good feedback cycle. If you don't adopt that then you risk failures
> too late which cost more (production is a hard place to fix things).
I so very much agree with you there!
I'm so glad I can meet yet another person that shares my point of view on those matters.
>>> And I noticed that even in the limited context of continuous
>>> integration, you haven't mentioned the data migration part yet... ;)
>> Ah, you got me here ;-)
> When we think about deployment of development applications, we also
> have configuration and data going alongside that. There are some
> strategies for handling this - using things like dbdeploy, django
> evolution, to manage schema changes for an application. I'd like to
> see data be more of a first class citizen in a fully agile operations
> team, from logfiles, to application data, there is a startling amount
> of information about systems that gets ignored and that's wasteful.
My gut feeling there is that data migration needs to be considered very early in the application's conception; this is why I mentioned that I feel it is out of the scope of the Ops team alone.
I haven't thought of the problem of the logfiles and so on but you're right again, that's also a need. Maybe we can think of a typology of data sets and describe basic migration schemes for each type?
> Some problems are harder to test, but using strategies such as mocks
> we can certainly test out exotic failure cases that are much harder to
> set up normally.
> Paul
OK, but how would you decide that something is definitely not worth testing (i.e. the cost of testing is way too high when considering the risk)?
>> The key here is fail fast (and at the appropriate place) and have a
>> good feedback cycle. If you don't adopt that then you risk failures
>> too late which cost more (production is a hard place to fix things).
> I so very much agree with you there!
> I'm so glad I can meet yet another person that shares my point of view
> on those matters.
I'm glad. One of the things I've found hard is finding people to discuss the ideas surrounding this. It feels like we're getting to a place where communities can form around this.
> My gut feeling there is that data migration needs to be considered
> very early in the application's conception; this is why I mentioned that I
> feel it is out of the scope of the Ops team alone.
That's possibly true, although, as with a lot of things, having a collaborative, cross-functional team that includes operations working on the issues will be important.
> I haven't thought of the problem of the logfiles and so on but you're
> right again, that's also a need. Maybe we can think of a typology of
> data sets and describe basic migration schemes for each type?
>> Some problems are harder to test, but using strategies such as mocks
>> we can certainly test out exotic failure cases that are much harder to
>> set up normally.
> OK, but how would you decide that something is definitely not worth
> testing (i.e. the cost of testing is way too high when considering the risk)?
Not wishing to sound flippant, but I think this will end up being a pragmatic decision by the team.
This happens in developer testing too, particularly around integration and acceptance testing. I often have conversations with pairs about a particularly gnarly integration test, and sometimes it turns out not to be needed: if you have a known interface (say, SMTP) and a way of testing against a stub, then a full end-to-end test may not be necessary.
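As a rough illustration of the stub idea (a minimal sketch only: `StubSMTP`, `send_alert`, `check_alert_with_stub` and the example.test addresses are hypothetical names, not anything from a real setup), the code under test can be handed a stand-in that records messages instead of talking to a mail server:

```python
from email.message import EmailMessage


class StubSMTP:
    """Hypothetical stand-in for an SMTP client: records mail instead of sending it."""

    def __init__(self, host, port=25):
        self.host = host
        self.port = port
        self.sent = []

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        return False  # do not swallow exceptions

    def send_message(self, msg):
        self.sent.append(msg)
        return {}


def send_alert(smtp_factory, host, sender, recipient, subject, body):
    """Build an alert mail and pass it to whatever client the factory returns."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    with smtp_factory(host) as client:
        client.send_message(msg)
    return msg


def check_alert_with_stub():
    """Exercise send_alert against the stub and return the stub for inspection."""
    created = []

    def factory(host):
        stub = StubSMTP(host)
        created.append(stub)
        return stub

    send_alert(factory, "mail.example.test", "ops@example.test",
               "oncall@example.test", "disk full on /data",
               "df reports 100% on /data")
    return created[0]


if __name__ == "__main__":
    stub = check_alert_with_stub()
    print(len(stub.sent), stub.sent[0]["Subject"])
```

In real use the factory would presumably be `smtplib.SMTP`, which offers the same `send_message` call and context-manager behaviour in Python 3, so only the actual delivery is left to a separate (and rarer) end-to-end check.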
Discussing the value of a test and making a call should be part of the process. A lot of these approaches applied to operations are just emerging, or at least just starting to consolidate. As such, you'll probably find teams that adopt these practices over-testing at first (cf. getter/setter testing when you are learning TDD as a developer); this is natural and part of the learning process. If you're running a team this way you'd probably start out with a set of rules, but part of agile processes is figuring out what will work for you. I've just been reading Pragmatic Thinking & Learning, which is probably influencing this answer, but you should find that over time people's intuition for that call will improve.