Bad performance: in Ansible 2.1, playbooks take 5.6 times longer than in 1.9.x


Tobias Wolf

May 10, 2016, 4:28:15 AM
to Ansible Development
Salvete all,

we have been clinging to 1.9.x ever since 2.0 came out, but we've been planning to move to 2.1 once it is released.

We had already figured out that Ansible 2.x takes over 20 seconds at startup just to parse our dynamic JSON inventory (7 MB of text, almost 9000 groups, excluding '_meta' hostvars), versus 5 seconds on 1.9. (Please see my PR #13957 on that.)

Now we've started running stable-2.1 with --check on some of our playbooks with the incompatible changes cleaned up (backslash escapes, etc).

Ansible branch stable-2.1 takes more than 5 times longer to complete than 1.9.4!

The test playbook rolls stuff out to about 150 hosts out of a couple of thousand in the inventory. There are a lot of tasks per host, and it is already visually apparent that something is very different compared to before. Even skips tick by at a snail's pace (up to 0.5 s per skip). The main ansible-playbook process holds 500 MB of resident memory.

On ansible 1.9.4 this completes in 23 minutes. On 2.1 in 2 hours, 9 minutes.

This is ansible.cfg:
[defaults]
forks = 55
force_color = 1
roles_path = roles
hostfile = inventory
#library = library
retry_files_enabled = False
[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=61s -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=yes -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey,keyboard-interactive
pipelining=True

Since people were suggesting it, I tried fact caching with Redis and also the free strategy. Neither changed the runtime significantly (as expected, since the time is not spent gathering facts; it seems to be spent forking the main process or something along those lines):

ansible 2.1.0.0 (stable-2.1 be28443943)
15125.40user 6640.76system 2:08:29elapsed 282%CPU (0avgtext+0avgdata 3262800maxresident)k
304inputs+1046376outputs (5major+613591630minor)pagefaults 0swaps

ansible 2.1.0.0 (stable-2.1 be28443943) -- fact_caching = redis
14469.46user 6497.10system 2:04:38elapsed 280%CPU (0avgtext+0avgdata 3052820maxresident)k
728inputs+1028496outputs (18major+606686383minor)pagefaults 0swaps

ansible 2.1.0.0 (stable-2.1 be28443943) -- strategy = free
16394.78user 8026.57system 2:19:29elapsed 291%CPU (0avgtext+0avgdata 1231688maxresident)k
80inputs+1084288outputs (0major+626770671minor)pagefaults 0swaps


We really want to run 2.x, but at this point we are considering maintaining 1.9.x ourselves with backported modules. I wanted to bring this up on the mailing list first, to see whether anyone on the dev team has a grip on what is going on with 2.x at all. What is happening that throttles playbook execution so much? Is this going to improve anytime soon?

Apparently this has not come up before, otherwise, I assume, 2.0 would not have been released like this. So I wonder what is different in our setup that slows Ansible 2.x to a crawl. I suspect it is again our large inventory, but I'd like some help figuring it out. If I wanted to explore this further, what would be the appropriate tools to profile/benchmark/analyze the problem with our workload?
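The best I have come up with on my own so far is simply wrapping a whole run in cProfile from the standard library; a rough sketch (untested against our wrapper scripts; PlaybookCLI is the entry point bin/ansible-playbook uses in 2.x, and the output file name is made up), but maybe there is something better suited:

import cProfile

from ansible.cli.playbook import PlaybookCLI

# Run the playbook exactly like bin/ansible-playbook would, but under the profiler,
# and dump the raw stats for later comparison with a 1.9.x run.
cli = PlaybookCLI(['ansible-playbook', 'playbooks/foo/bar/baz_quux.yml', '--check'])
cli.parse()

profiler = cProfile.Profile()
profiler.runcall(cli.run)
profiler.dump_stats('ansible-2.1.prof')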

cheers,
Tobias

jhawkesworth

May 10, 2016, 7:25:02 AM
to Ansible Development
Do you have a lot of vaulted vars?
I installed cryptography from pip (after some messing around meeting cryptography's dependencies) and that improved speed for me a lot (although I only have a few groups and a fairly small inventory), as mentioned here: http://docs.ansible.com/ansible/playbooks_vault.html#speeding-up-vault-operations

Which Python version are you using?

Tobias Wolf

May 10, 2016, 8:11:35 AM
to Ansible Development
On Tuesday, May 10, 2016 at 1:25:02 PM UTC+2, jhawkesworth wrote:
Do you have a lot of vaulted vars?

We do not use Vault at all. For secret vars we use locally stored persistent facts and a FUSE-mounted sshfs.

Which Python version are you using?

2.7.11


So, I've collected some metrics about our setup. I got starved for memory and triggered the OOM killer once while trying to count our hostvars.

$ nice -n 19 ansible all -m debug -a var=hostvars | pv | wc -l
9.84GiB 0:56:07 [2.99MiB/s] [          <=>    ]
291874566

And hostvars relevant to the test playbook only:

$ ansible "foo_type=bar_baz:&host.os=Ubuntu_14.04" -m debug -a var=hostvars | wc -l
11247516

Number of groups:

$ inventory/inventory.foo.pl | jq '. | length'
9089

Total size of the inventory in hosts (~140 of which match the play):

$ ansible all --list-hosts | wc -l 
3843

Number of tasks in the test playbook

$ ansible-playbook playbooks/foo/bar/baz_quux.yml --list-tasks | grep "    [a-Z]"  | wc -l
561

And this takes 129 minutes on 2.1, versus 23 minutes on 1.9.4.

Will Thames

May 10, 2016, 8:45:54 PM
to Ansible Development

Might be worth using the profile_tasks callback to see where the time is being spent. Is every task just taking 6 times longer, or is it specific tasks?

https://github.com/jlafon/ansible-profile (it's built into v2.0+ but that page has details of how to set it up)

In the above example of just running debug, it's not clear whether it's the inventory processing that is slow, or the fact gathering, or both.

Once you have profile_tasks set up, using a playbook as simple as:

- hosts: "{{hosts}}"

with

time ansible-playbook playbook.yml -e hosts="foo_type=bar_baz:&host.os=Ubuntu_14.04"

will help see where time is spent.

Tobias Wolf

May 11, 2016, 4:06:16 AM
to Ansible Development
Might be worth using the profile_tasks callback to see where the time is being spent. Is every task just taking 6 times longer, or is it specific tasks?

I don't think that is going in the right direction, considering that everything is slower, skipped tasks most prominently. It seems to be a core issue in the execution logic.

Here's another test result concerning inventory size:


               | full inventory   minimal inventory   full/minimal
               | (9000 groups)    (1 group + _meta)
---------------+--------------------------------------------------
 ansible 2.1   | 12m45s           4m59s               2.6x
 ansible 1.9.4 |  3m34s           2m26s               1.5x

             1 play, 148 hosts, 22 tasks

Inventory size clearly affects performance, but much more so on Ansible 2.x.

I've tried deactivating the dynamic logic that seems to be the main "feature" of 2.x, but adding these lines to ansible.cfg had no measurable effect:

task_includes_static = yes
handler_includes_static = yes



Tobias Wolf

May 26, 2016, 8:24:34 AM
to Ansible Development
Congratulations on releasing Ansible 2.1 today. To celebrate I want to share this slightly amended xkcd:

[image: slightly amended xkcd comic]
I've looked at cProfile stats for the aforementioned test playbook.

The first line of the stats says it all: completing the playbook took about 600k Python function calls in Ansible 1.9.4, and about 50 million in Ansible 2.1.
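Roughly, something like this is enough to see it once you have the two dumps of otherwise identical --check runs (file names are made up); the header line printed by print_stats() is where the total call count comes from:

import pstats

# Load and print the dumps of the two runs; the header line shows the total
# number of function calls, the rest shows where the cumulative time went.
for dump in ('ansible-1.9.4.prof', 'ansible-2.1.prof'):
    stats = pstats.Stats(dump)
    stats.sort_stats('cumulative')
    stats.print_stats(10)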

Hope this will be addressed at some point.

--Tobias

jhawkesworth

May 26, 2016, 9:01:31 AM
to Ansible Development
Hey, thanks for this.

Seeing this thread again has reminded me that I was curious about the number of groups you have in relation to the number of hosts. If I read the above right, you have nearly 9000 groups but only 'a couple of thousand' hosts, so hosts must be members of multiple groups.

I'm interested to know what problems organising groups and hosts in this way solves, or perhaps it's the only way to deal with whatever you are dealing with. Partly I'm just curious, as you are clearly operating at a scale that I'm nowhere near, and partly I want to see whether the collective brains round here might have another way to organise things that could speed stuff up for you.

Jon

Tobias Wolf

May 26, 2016, 12:32:00 PM
to Ansible Development
On Thursday, May 26, 2016 at 3:01:31 PM UTC+2, jhawkesworth wrote:
Hey, thanks for this.

Seeing this thread again has reminded me that I was curious about the number of groups you have in relation to the number of hosts. If I read the above right, you have nearly 9000 groups but only 'a couple of thousand' hosts, so hosts must be members of multiple groups.

That's right. Instead of simple membership in a group like "westcoast", we encode a lot of meta-information in the inventory in a sort of key-value notation. I described some of this in my issue #13956.

Any particular host is a member of quite a few groups. "host.os=FreeBSD" is a group name, as is "status=productive". Then there are groups related to services, several of which apply to any particular host: "service.type=foo", "service.country.name=bar_baz_quux", and so on.

Lots of groups only contain one host, some contain thousands.
 
I'm interested to know what problems organising groups and hosts in this way solves, or perhaps it's the only way to deal with whatever you are dealing with. Partly I'm just curious, as you are clearly operating at a scale that I'm nowhere near, and partly I want to see whether the collective brains round here might have another way to organise things that could speed stuff up for you.

The thing is that even with the inventory stripped to the bare minimum, Ansible 2.x is CPU-bound much of the time here, whereas in 1.9.x most of the runtime was spent in time.sleep, waiting for tasks to complete on the remote hosts. In 2.x it calls 85 times as many functions as before to do the same job.

Tobias Wolf

Jun 3, 2016, 1:02:02 PM
to Ansible Development
More crazy profiling results.

Ansible devel, 143 matching hosts, some 15 tasks. Inventory:

$ inventory/inventory.xxx.pl | jq '._meta.hostvars | length '
3845 # hosts
$ inventory/inventory.xxx.pl | jq '._meta.hostvars[] | length ' | numsum
88285 # sum of hostvars
$ inventory/inventory.xxx.pl | jq '. | length'
8994 # groups
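In case anyone wants to reproduce the measurement below: kernprof's line-by-line mode (-l) only profiles functions carrying the @profile decorator ('profile' is injected as a builtin by kernprof, so no import is needed), so the function in question gets a temporary decorator right above its definition:

# temporary instrumentation in lib/ansible/vars/__init__.py
@profile
def _get_magic_variables(self, loader, play, host, task, include_hostvars, include_delegate_to):
    '''body unchanged -- see the line-by-line output below'''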



$ kernprof -l -o ~/2.2.0.kern.out /home/towolf/src/local/ansible/bin/ansible-playbook o-test.yml --check
$ python -m line_profiler 2.2.0.kern.out
Timer unit: 1e-06 s

Total time: 485.37 s
File: /home/towolf/src/local/ansible/lib/ansible/vars/__init__.py
Function: _get_magic_variables at line 374

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   374                                               @profile
   375                                               def _get_magic_variables(self, loader, play, host, task, include_hostvars, include_delegate_to):
   376                                                   '''
   377                                                   Returns a dictionary of so-called "magic" variables in Ansible,
   378                                                   which are special variables we set internally for use.
   379                                                   '''
   380                                           
   381      3010        10132      3.4      0.0          variables = dict()
   382      3010        51137     17.0      0.0          variables['playbook_dir'] = loader.get_basedir()
   383                                           
   384      3010         5181      1.7      0.0          if host:
   385     47808      1080418     22.6      0.2              variables['group_names'] = sorted([group.name for group in host.get_groups() if group.name != 'all'])
   386                                           
   387      2574         5605      2.2      0.0              if self._inventory is not None:
   388      2574         6046      2.3      0.0                  variables['groups']  = dict()
   389  23155704     46359530      2.0      9.6                  for (group_name, group) in iteritems(self._inventory.groups):
   390 171155556    435879828      2.5     89.8                      variables['groups'][group_name] = [h.name for h in group.get_hosts()]
   391      3010         5512      1.8      0.0          if play:
   392      6016        77879     12.9      0.0              variables['role_names'] = [r._role_name for r in play.roles]
   393                                           
   394      3010         5422      1.8      0.0          if task:
   395      3006         6596      2.2      0.0              if task._role:
   396      2863        11964      4.2      0.0                  variables['role_name'] = task._role.get_name()
   397      2863        11069      3.9      0.0                  variables['role_path'] = task._role._role_path
   398      2863        51752     18.1      0.0                  variables['role_uuid'] = text_type(task._role._uuid)
   399                                           
   400      3010         7532      2.5      0.0          if self._inventory is not None:
   401      3010       351708    116.8      0.1              variables['inventory_dir'] = self._inventory.basedir()
   402      3010        75596     25.1      0.0              variables['inventory_file'] = self._inventory.src()
   403      3010         5813      1.9      0.0              if play:
   404                                                           # add the list of hosts in the play, as adjusted for limit/filters
   405                                                           # DEPRECATED: play_hosts should be deprecated in favor of ansible_play_hosts,
   406                                                           #             however this would take work in the templating engine, so for now
   407                                                           #             we'll add both so we can give users something transitional to use
   408    451950      1159038      2.6      0.2                  host_list = [x.name for x in self._inventory.get_hosts()]
   409      3010         5423      1.8      0.0                  variables['play_hosts'] = host_list
   410      3010         5010      1.7      0.0                  variables['ansible_play_hosts'] = host_list
   411                                           
   412                                                   # the 'omit' value alows params to be left out if the variable they are based on is undefined
   413      3010        10783      3.6      0.0          variables['omit'] = self._omit_token
   414      3010       130188     43.3      0.0          variables['ansible_version'] = CLI.version_info(gitinfo=False)
   415                                                   # Set options vars
   416      6020        28844      4.8      0.0          for option, option_value in iteritems(self._options_vars):
   417      3010         5882      2.0      0.0              variables[option] = option_value
   418                                           
   419      3010         6659      2.2      0.0          if self._hostvars is not None and include_hostvars:
   420      3004         5735      1.9      0.0              variables['hostvars'] = self._hostvars
   421                                           
   422      3010         4190      1.4      0.0          return variables


It's crazy how many hits there are on get_hosts(). I'm really wondering whether we are the only team working with a large inventory of any kind. Since Red Hat states that there is no commercial support available for Ansible Core at this time, I wonder where to get answers about this, if not here on this list.
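Back-of-the-envelope, the hit counts above mean the whole inventory is re-walked for every single task/host combination (all numbers copied from the line_profiler output; 2574 happens to be 143 hosts times 18 tasks):

# all numbers copied from the line_profiler output above
magic_var_calls  = 2574          # times the 'groups' magic variable was rebuilt
group_loop_hits  = 23155704      # hits on the iteritems(self._inventory.groups) loop
name_extractions = 171155556     # hits on the inner [h.name for h in group.get_hosts()]

print(group_loop_hits / float(magic_var_calls))    # ~9000 -> every group, every single time
print(name_extractions / float(magic_var_calls))   # ~66500 -> every group membership, every single time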


For reference, here's how Ansible 1.9.4 behaves, i.e. mostly waiting for the workers to return:

[image: Ansible 1.9.4 profile map, mostly time spent waiting for the workers]

Matt Martz

Jun 3, 2016, 1:11:16 PM
to Tobias Wolf, Ansible Development
Some work was recently started looking into some performance issues around inventory that may provide improvements.  See https://github.com/ansible/ansible/compare/fix_inv_vars_loading




--
Matt Martz
@sivel
sivel.net

Tobias Wolf

Jun 4, 2016, 7:41:37 AM
to Ansible Development, tow...@gmail.com
On Friday, June 3, 2016 at 7:11:16 PM UTC+2, Matt Martz wrote:
Some work was recently started looking into some performance issues around inventory that may provide improvements.  See https://github.com/ansible/ansible/compare/fix_inv_vars_loading

That looks interesting and seems like it expands on my earlier work-around for reducing excessive repetitive work (including hammering the file system) at the inventory parsing stage. I'm going to try this on Monday.

However, there seem to be many more places where stuff is packed, unpacked, and merged over and over again needlessly.

When the inventory is first read, the groups and their members are evaluated and every member host is packed into a Host object. Later, for every host, get_vars() iterates over the global list of groups again to generate the 'groups' magic variable. So each group is unpacked, its Host objects are looked at, and their names are extracted in that list comprehension. And we do this over and over again, for a hash that should be global and unique!
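To make that concrete, something along these lines would compute the name mapping once per inventory state instead of once per task per host (only a sketch of the idea, with made-up names, not the actual patch I am preparing):

# Sketch: build the group-name -> host-names mapping once and reuse it until the
# inventory actually changes (add_host, group_by, ...), instead of re-walking every
# Group's get_hosts() for every single task/host combination.
class CachedGroupNames(object):

    def __init__(self, inventory):
        self._inventory = inventory
        self._cache = None

    def invalidate(self):
        # to be called from the few places that mutate the inventory
        self._cache = None

    def as_dict(self):
        if self._cache is None:
            self._cache = dict(
                (name, [h.name for h in group.get_hosts()])
                for name, group in self._inventory.groups.items()
            )
        return self._cache

# _get_magic_variables() could then replace its nested loop with something like:
#     variables['groups'] = self._group_name_cache.as_dict()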

And on top of that, every hostvar is templated, even though it is merely static key-value information coming straight from the inventory.

Tobias Wolf

Jun 9, 2016, 10:12:21 AM
to Ansible Development, tow...@gmail.com
On Saturday, June 4, 2016 at 1:41:37 PM UTC+2, Tobias Wolf wrote:
When the inventory is first read, the groups and their members are evaluated and every member host is packed into a Host object. Later, for every host, get_vars() iterates over the global list of groups again to generate the 'groups' magic variable. So each group is unpacked, its Host objects are looked at, and their names are extracted in that list comprehension. And we do this over and over again, for a hash that should be global and unique!


I've submitted a pull request for this issue. Please review: https://github.com/ansible/ansible/pull/16204

According to line_profiler, the time spent inside that function went from 480 seconds to 15 seconds.

The new profile map looks like the following; VariableManager._get_magic_variables has shrunk into a corner.

[image: new profile map after the patch]
Also, the cost of posix.fork went from 3 ms to 12 ms per call, and it is executed about 3 times more often. This means we now spend 33 seconds in posix.fork, compared to 3.5 seconds before in 1.9.4.

Next I'm going to submit pull requests for the cases where a set makes more sense than a list, i.e., repeated membership lookups with few additions and unique contents.
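The pattern in question is basically repeated membership tests against big lists; a trivial, Ansible-agnostic illustration of why a set helps (made-up sizes):

import timeit

setup = "hosts_list = ['host%05d' % i for i in range(4000)]; hosts_set = set(hosts_list)"

# The same worst-case membership test, repeated the way it effectively happens
# once per task/host in the 2.x code paths.
print(timeit.timeit("'host03999' in hosts_list", setup=setup, number=100000))  # O(n) scan every time
print(timeit.timeit("'host03999' in hosts_set",  setup=setup, number=100000))  # O(1) hash lookup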

Igor Cicimov

Jul 15, 2016, 10:08:44 PM
to Ansible Development
Hi Tobias,

Have you noticed any improvement on this front? I'm still on 1.9.4 because of these issues.

Thanks,
Igor

Tobias Wolf

Aug 31, 2016, 1:49:19 PM
to Ansible Development

On Saturday, July 16, 2016 at 4:08:44 AM UTC+2, Igor Cicimov wrote:
Hi Tobias,

Have you noticed any improvement on this front? I'm still on 1.9.4 because of these issues.


I ran a new, longer benchmark today, one that is more representative of our daily work.

An "apply common stuff" playbook with --check over 164 hosts.

Ansible 1.9.4: 9 Minutes
Ansible 2.1-stable branch: 51 Minutes !
Ansible devel: stuck somewhere in the middle, no progress two hours later, but using CPU nonetheless:


So, we are all still on 1.9.x. The pull request I mentioned, which improves devel drastically for us, is still open. But even with it, devel is still more than 2 times slower.

Igor Cicimov

Sep 1, 2016, 8:06:32 PM
to Ansible Development

Bloody RedHell .... I mean RedHat :-)
 

Tobias Wolf

Sep 2, 2016, 4:59:41 AM
to Ansible Development
Just to give closure to this chapter: after merging my PR we are back in more manageable territory, runtime-wise.

An "apply common stuff" playbook with --check over 164 hosts.
Ansible 1.9.4: 9 Minutes
Ansible 2.1-stable branch: 51 Minutes !
Ansible devel: stuck somewhere in the middle, no progress two hours later, but using CPU nonetheless:

Ansible devel as of yesterday: 42 minutes 
Ansible devel as of today: 16 minutes 

Igor Cicimov

Sep 2, 2016, 9:44:18 PM
to Ansible Development

Tobias, thanks for all your work. I can't believe such a serious issue has been ignored for so long. As a configuration manager, Ansible is aimed at managing hundreds of servers at the same time, and such poor performance is just unacceptable. Taking an hour to apply changes to 100 servers is a joke.

It's good to keep this thread near the surface, though, so people are aware of the issues that were introduced after version 1.9.4 and maybe, just maybe, push for a faster resolution in the production release.

Finally, if you could leave a note here when you move to version 2.x, it would be very much appreciated.

Thanks,
Igor
