Drift management at Scale

Ben Spencer

Aug 30, 2018, 3:06:08 PM

to AWX Project

We are trying to prove out/confirm that Ansible (via AWX/Tower) is able to replace our current drift management system. Technically it seems very likely, as the functionality is there. One of the hurdles we are struggling with is scalability and the time a run takes. We see comments/posts where people "manage 1000's of machines" (or 10's of 1000's of network devices). The simple word "manage" can mean different things to different people, though, and the posts we have seen so far do not explain the details of what "manage" meant to them.

The question is: is anyone using AWX/Ansible for drift management in the manner that we are attempting to, or have we fallen off our rocker? If you are, a follow-up would be: what did it take to get it to work?

Where we are coming from:

The full policy is run every 15 minutes on every server
There are about 6000 servers
There are 170 "rules" in the current policy

The rules are sometimes very specific (ex: "maintain this configuration line/item")
The rules may be broad (ex: "maintain these 10 configuration aspects which somehow are loosely coupled to SSH")

Going forward with Ansible:

Frequency could be backed off to once an hour
The 6000 servers still all apply
Because the current rules include multiple items/aspects, we expect the number of "rules" to double

Some/many of the "rules" will likely require multiple tasks in Ansible
Not all rules apply to every server. There are many conditions (server OS, server OS major version, server tier, server location, etc.)
With all of this, I'd expect 400-500 tasks when we are finally done
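
For a rough sense of scale, the figures above can be turned into a task-execution rate. This is only a back-of-envelope sketch: it assumes every task applies to every server (the conditions listed above mean the real number would be lower) and takes 450 as the midpoint of the 400-500 estimate.

```python
# Back-of-envelope load estimate from the figures in this post.
# Assumption (not from the post): every task runs on every server,
# so this is an upper bound; per-server conditions would lower it.
SERVERS = 6000
TASKS = 450            # midpoint of the expected 400-500 tasks
WINDOW_SECONDS = 3600  # one full policy run per hour


def required_rate(servers: int, tasks: int, window_seconds: int) -> float:
    """Task executions per second needed to finish within the window."""
    return servers * tasks / window_seconds


executions = SERVERS * TASKS
rate = required_rate(SERVERS, TASKS, WINDOW_SECONDS)
print(f"{executions:,} task executions per run, {rate:.0f} per second")
```

Under those assumptions the hourly run is 2,700,000 task executions, or about 750 per second sustained.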

We are starting with testing the concept using AWX. TBD if the actual implementation would be AWX or Tower

The contention: it simply takes a while to complete a single task across all of the servers. We know that a high level of parallelism is going to be needed, but that doesn't seem to be enough by itself.

Christopher Meyers

Aug 30, 2018, 3:37:07 PM

to AWX Project

Does Ansible, without AWX, meet your timing requirements? Tower is going to have the runtime of Ansible + some overhead (< 2x).

There is an exception, where AWX/Tower clustering can help you scale horizontally. The clustering feature + workflows can be used to split an inventory across multiple jobs. You can then use workflows to easily run those jobs in parallel across multiple servers.
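
As a rough illustration of what splitting an inventory across multiple jobs means, here is a minimal Python sketch. It is not the AWX implementation; the function name `split_inventory` and the host names are hypothetical, and each resulting group would still have to be submitted as its own job.

```python
from typing import List


def split_inventory(hosts: List[str], slices: int) -> List[List[str]]:
    """Deal hosts round-robin into `slices` roughly equal groups."""
    if slices < 1:
        raise ValueError("slices must be at least 1")
    return [hosts[i::slices] for i in range(slices)]


# Hypothetical host names, for illustration only.
hosts = [f"server{n:04d}" for n in range(6000)]
shards = split_inventory(hosts, 4)
print([len(shard) for shard in shards])
```

Round-robin dealing keeps the groups within one host of each other in size, and every host lands in exactly one group.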

Matthew Jones

Aug 30, 2018, 3:45:49 PM

to Christopher Meyers, AWX Project

Here's the WIP PR where we're tracking the feature that Chris Meyers mentioned: https://github.com/ansible/awx/pull/2174

--
You received this message because you are subscribed to the Google Groups "AWX Project" group.
To unsubscribe from this group and stop receiving emails from it, send an email to awx-project...@googlegroups.com.
To post to this group, send email to awx-p...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/awx-project/757f8949-b43d-4764-b17f-5514e8b997b4%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

Matt Jones

Principal Software Engineer

Ansible Tower

Ben Spencer

Aug 30, 2018, 9:27:44 PM

to AWX Project

Thanks Chris & Matthew

The splitting/sharding concept is one we talked about a little (a playbook would use the API to create the sub-inventories and submit jobs against them), but it is kludgy, of course, and I am not sure how reusable it would have been. This feature hits it on the head, though. The forks + task methodology makes it hard without something like this (slow hosts kill the entire fork group; in 6K hosts, not all of them are super fast).

We saw a little talk about clustering AWX, and outside of OpenShift it didn't look like it was supported/existed. We've run into a few gotchas when trying to do it, though some of the pointers in the one HA thread here look to be helpful. Is AWX clustering official or only functional within OpenShift?

I don't think Ansible itself could do it either. Most of our testing has been within AWX itself (interface, reporting, history of job execution - these are soft requirements). I set up a simple test to get a feeling for it:

Playbook: turns off gather_facts and contains 2 tasks:

task 1: uname -n & store in a variable

task 2: display the variable

Sample size: 50 servers

Forks: 16

Ansible runtime*: 27 seconds

AWX runtime: 32 seconds (18.5% slower)

* Ansible was run on a physical server with 40 threads; forks was set to 16 still

Calculating this out to 6K servers comes to about 22.5 minutes for Ansible. That doesn't leave much time for the other 400+ tasks which would probably exist. I'll perform some more tests (more tasks, bigger server sample) over the next few days and see how they play out. Overall we are pretty new at Ansible and still learning it as well.
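
The arithmetic behind the test figures can be sketched as follows. The overhead calculation uses only the two measured runtimes. The extrapolation assumes runtime scales strictly linearly with host count at a fixed 16 forks, which is an assumption of this sketch and not something the test established; on that assumption it comes out at roughly 54 minutes, so the 22.5-minute figure quoted above presumably rests on a different calculation.

```python
# Figures from the 50-server test above.
SAMPLE_HOSTS = 50
ANSIBLE_SECONDS = 27
AWX_SECONDS = 32


def overhead_percent(base: float, other: float) -> float:
    """How much slower `other` is than `base`, as a percentage."""
    return (other - base) / base * 100


def linear_extrapolation_minutes(sample_seconds: float,
                                 sample_hosts: int,
                                 target_hosts: int) -> float:
    """Scale a sample runtime to more hosts, assuming strictly linear growth."""
    return sample_seconds * (target_hosts / sample_hosts) / 60


awx_overhead = overhead_percent(ANSIBLE_SECONDS, AWX_SECONDS)
ansible_6k = linear_extrapolation_minutes(ANSIBLE_SECONDS, SAMPLE_HOSTS, 6000)
print(f"AWX overhead: {awx_overhead:.1f}%")
print(f"Ansible at 6000 hosts (linear assumption): {ansible_6k:.1f} min")
```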

Frank Dias

Aug 30, 2018, 10:37:50 PM

to Ben Spencer, AWX Project

Ben,

Take a look at this plugin,

https://sweetness.hmmz.org/2018-08-27-fork-in-the-road.html

We added the plugin to one of our test servers and noticed a substantial speed increase in job completion.

Michael T. Wiley

Aug 30, 2018, 10:44:03 PM

to Frank Dias, Ben Spencer, AWX Project

Clustering can be done on Kubernetes without OpenShift; I just set it up in our lab as a proof of concept.

For speeding up large jobs across numerous hosts, you might look into the ansible-pull model and/or running your jobs/tasks asynchronously.

d...@botanicus.net

Aug 30, 2018, 11:13:38 PM

to AWX Project

Hi Frank & Ben,

There should hopefully be a branch of Mitogen _vastly_ more suited to larger runs publicly available soon. Naturally, due to the nature of the target, it's difficult to find real scenarios where good results can be collected; there is a limit to the reliability of data generated by profiling against a test cluster of VMs under no load except for Ansible. If you are running Ansible in a very-many-target environment and are interested in performance, please feel free to drop me a message offlist.

Thanks,

David

d...@botanicus.net

Ben Spencer

Aug 31, 2018, 8:24:15 PM

to AWX Project

Thanks. Good to know that too.

I'm mentally struggling with how setting up a completely unrelated infrastructure to run AWX on provides clustering when AWX itself does not, yet needs to be aware of the cluster members. Doesn't the (AWX) cluster itself need to know about the other members in the cluster for workload scheduling, and don't they all still need to use the same MQ instance(s)? It is not out of the question to set up an infrastructure to run the AWX infrastructure, but it is a pretty heavy curve: learning, setting up, and selling the concept of supporting two new infrastructures.

Ben Spencer

Aug 31, 2018, 8:37:15 PM

to AWX Project

David & Frank

Mitogen looks and sounds interesting and promising as well. It is something we could play with now to see the impact/benefit, in the hope that it carries over if/when the existing ruleset is rewritten.

Michael T. Wiley

Aug 31, 2018, 9:04:35 PM

to Ben Spencer, AWX Project

Ben,

Each AWX instance in Kubernetes is a pod, and each pod consists of the following containers: awx-web, awx-task, memcached and rabbitmq. They can definitely share a backend postgres database, but seeing as each pod is typically isolated and self-contained, I think they each have separate instances of rabbitmq and memcached. I can see multiple instances listed within the AWX UI when I increase the number of nodes, though, so I'm assuming each instance makes some sort of record in the database, and that those records are then maybe used to broker who takes what job? I'm still a little fuzzy, but I will be exploring this further in the next week or so and can let you know what I find out. One thing to note is that they all appear to be a part of the first tower instance group. So there may be some built-in mechanism for splitting jobs amongst instance group members at the database level?

Ben Spencer

Aug 31, 2018, 9:20:03 PM

to AWX Project

Michael

I need to talk to my AWX guru more about this next week. I think one difference may be the separate (not shared) rabbitmq (and memcached?) instances, as I don't think he had those. I also seem to recall him mentioning that when he added additional instances, the capacity of the first dropped, eventually reaching zero, which basically defeated the purpose ("capacity" is not the term he used, but I don't recall what he called it; units of work maybe?). Interested in hearing more as you learn. Thanks for sharing.

Ben Spencer

Sep 4, 2018, 8:06:52 PM

to AWX Project

I looked at Mitogen with someone who is more familiar than I am and it looks promising. Nice work. We are hopeful that this will help bring us closer to what we need for drift management (and activities which impact the entire environment). While I didn't see anything which would be of concern, it will be interesting to see how the various OSes we have are handled (I ran into the AIX SSH bug/pseudo-tty allocation issue when testing over the weekend).

Our AWX environment is currently sick, so I have not been able to perform proper side-by-side tests with it yet, but here are some numbers I gathered using straight Ansible:

(2 tasks; no facts; ~7K servers; 16 forks)

Baseline: 1 hour 45 minutes

strategy=free;serial=100: 37 minutes

(7 tasks; no facts; ~7K servers; 16 forks)

Baseline: 6 hours 21 minutes

strategy=free;serial=100: 2 hours 5 minutes

(serial values are basically a random choice; strategy=free with the default serial took 3 hours 15 minutes)

(not all 7K hosts exist/were accessible but I didn't count how many were not)
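
To put the timings above on a common footing, the speedup over baseline can be computed directly from the reported runtimes. This is a small sketch using only the numbers in this post; the dictionary labels are made up for readability.

```python
def minutes(hours: int, mins: int) -> int:
    """Convert an hours/minutes runtime to total minutes."""
    return hours * 60 + mins


# Runtimes as reported above (~7K servers, 16 forks, no facts).
runs = {
    "2 tasks": {
        "baseline": minutes(1, 45),
        "free, serial=100": minutes(0, 37),
    },
    "7 tasks": {
        "baseline": minutes(6, 21),
        "free, serial=100": minutes(2, 5),
        "free, default serial": minutes(3, 15),
    },
}


def speedups(timings: dict) -> dict:
    """Speedup factor of each variant relative to the baseline run."""
    base = timings["baseline"]
    return {name: round(base / t, 2)
            for name, t in timings.items() if name != "baseline"}


for label, timings in runs.items():
    print(label, speedups(timings))
```

On these numbers, strategy=free with serial=100 is roughly 2.8x to 3x faster than baseline, and strategy=free with the default serial is a little under 2x faster.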
