I suppose it's just about time for me to jump in here.
First, my disclaimer in this case is that Ansible is in fact the only
tool in the CM-space that I haven't used in a serious way, among the
"CAPS" group as David put it above. There isn't any specific reason I
haven't used it except that I haven't had the need personally. I don't
mean to judge the tool in any way based on this paragraph.
So now Salt has cloud provisioning, and Chef and Puppet are starting
to as well. I'll share my reasons for creating Terraform rather than
augmenting those tools, or working with them, to make them good cloud
provisioning tools.
Given my disclaimer above, I don't know if Ansible does any of these
things. Here we go.
1.) For Puppet and Chef, I felt their agent-based model was just wrong
for this. Puppet/Chef are built on the original assumption that their
agent runs on the host that is also being changed. They've both made
strides to support _both_ models now (puppet apply, chef zero?), but
it is still awkward and doesn't feel first class yet. I know Ansible
doesn't do this.
2.) Parallelism in all these tools is really subpar. Provisioning
cloud instances takes a good amount of time, and the parallelism
didn't exist when I started TF. Some do it now, but it's still pretty
elementary. Terraform builds a dependency graph and walks it from the
leaves in parallel, extracting as much parallelism as the dependencies
allow.
3.) Gluing together multiple resources. With Terraform you can take
_any_ attribute of _any_ resource/module and use it as an input to
another resource/module. For example, you can say: take the address
of that load balancer and configure a DNS provider with it. This basic
functionality is surprisingly difficult with existing tools.
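As a sketch in Terraform's configuration language (the resource names
are made up and most arguments are abbreviated), the load balancer
example above looks like this:

```hcl
# Hypothetical load balancer.
resource "aws_elb" "web" {
  name               = "web-elb"
  availability_zones = ["us-east-1a"]

  listener {
    instance_port     = 80
    instance_protocol = "http"
    lb_port           = 80
    lb_protocol       = "http"
  }
}

# The DNS record reads the ELB's generated DNS name as its value,
# which also tells Terraform the record depends on the ELB.
resource "cloudflare_record" "www" {
  domain = "example.com"
  name   = "www"
  type   = "CNAME"
  value  = "${aws_elb.web.dns_name}"
}
```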
4.) Planning. Terraform has the industry's best plan feature, hands
down. Puppet/Salt have a "no-op" mode, but it is different: it shows
you what the tool would do if you executed it in THAT moment. When you
execute those tools for real, the state of the world may have changed,
and they may do something different. Terraform generates a plan, which
can be saved, and you can tell Terraform to execute only that plan.
This is critical for infrastructure, more so than for a single server,
because you want to see the full rollout effect of a single change:
will this change require a DNS change in the middle of the day, when
TTLs are too long? Or will this change be done in place?
5.) The core features of a CM tool don't align with those of a cloud
provisioning tool. I'll give a couple of examples.
5a.) Every CM I know of "refreshes" the state of all its managed
resources on every run. It has to do this to check whether it needs to
make any changes. On non-trivial systems, this already takes a
significant amount of CPU/time (dozens of seconds). For cloud systems,
these are all API calls over the WAN, and even at 100 servers this
takes many dozens of seconds. As infrastructure scales, apply times
grow O(n) even when nothing has changed.
With Terraform, we haven't exposed it yet, but we've designed the core
to support partial refresh. In this mode, Terraform will only refresh
the highest-priority resources (config changed from cached state, new
resources, deleted resources): the things that are _likely to change_.
The remaining resources it will either not refresh at all, or refresh
X% at a time (you choose X).
If you then run Terraform periodically, this will amortize the cost of
refreshes at the cost of some accuracy of state. But realistically
we've found that servers don't change that often.
5b.) CMs don't support more complicated lifecycle management. As an
example, Terraform already has "create before destroy", which says
that if a resource needs to be destroyed, Terraform should create the
new one before destroying the old one. This is a really basic way to
minimize downtime. More lifecycle options are coming in Terraform:
rolling deploys, etc.
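In configuration, this is a per-resource lifecycle setting. A sketch
(the AMI and instance type are placeholders):

```hcl
# Hypothetical instance. If a change forces replacement (say, a new
# AMI), Terraform creates the replacement first, then destroys the
# old instance, instead of the default destroy-then-create order.
resource "aws_instance" "web" {
  ami           = "ami-abc123"
  instance_type = "m3.medium"

  lifecycle {
    create_before_destroy = true
  }
}
```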
5c.) Incorrect destroy ordering. To be fair, Terraform still has some
issues with this, but they're edge cases. In tools such as Puppet, if
you `ensure => absent` (to delete) resources, Puppet routinely
destroys them in an incorrect order, and cloud providers sometimes
don't allow that. For example, to delete a VPC, you need to make sure
everything using that VPC (subnets, route tables, instances, EIPs,
etc.) is gone _first_. Terraform does this; other CMs don't, because
they've never had to think about destroy ordering before.
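The ordering falls out of the same dependency graph used for creates.
A minimal sketch (made-up names, abbreviated arguments):

```hcl
# The subnet references the VPC's id, so Terraform knows the subnet
# depends on the VPC: it creates the VPC first, and on destroy it
# removes the subnet _before_ the VPC, which is the order the cloud
# provider requires.
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "a" {
  vpc_id     = "${aws_vpc.main.id}"
  cidr_block = "10.0.1.0/24"
}
```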
5d.) Multi-team parallelism on subresources. This is coming in
Terraform in the future, but it's another thing we thought about early
on that CMs don't do at all and that people need. With Terraform, you
want to allow multiple teams to modify the same infrastructure at the
same time, _safely_. To do this, Terraform will grab a "sub-tree lock"
on the resources that a plan says it will touch. If another team
member on another machine runs Terraform and the sub-trees overlap, it
will error. But if they don't overlap, parallel infrastructure changes
can happen.
6.) Atlas integration. I mention this because Ansible has Ansible
Tower, a pure commercial offering, that they use as a value add to
their cloud provisioning. Likewise, we have Atlas, in its early
stages, which does the same thing.
6a.) Auto-scaling. Ansible does this with Tower; we do this with
Atlas (shipping in 4 to 6 weeks; as I said, Atlas is in its infancy,
but this was always planned).
6b.) ACLs on sub-resources. Due to Terraform's plan, Atlas will allow
you to define ACLs on resources and their attributes. So you can say
something like this: operator "John" can modify infrastructure only
between 5 PM and 9 AM, and can never modify the root DNS entries of
any of our domains in CloudFlare. "John" can also never destroy
resources X, Y, Z. And Atlas will verify this by checking the plan.
I hope this helps.
Best,
Mitchell