Bad performance: in Ansible 2.1, playbooks take 5.6 times longer than in 1.9.x


Tobias Wolf

May 10, 2016, 4:28:15 AM
to Ansible Development
Salvete all,

we have been clinging to 1.9.x ever since 2.0 came out, but we've been planning to move to 2.1 once it is released.

We had already figured out that Ansible 2.x takes over 20 seconds at startup just to parse our dynamic JSON inventory (7 MB of text, almost 9000 groups, excluding '_meta' hostvars), versus 5 seconds on 1.9. (Please see my PR #13957 on that.)

Now we've started running stable-2.1 with --check on some of our playbooks with the incompatible changes cleaned up (backslash escapes, etc).

Ansible branch stable-2.1 takes more than 5 times longer to complete than 1.9.4!

The test playbook rolls stuff out to about 150 hosts out of a couple of thousand in the inventory. There are a lot of tasks per host, and it is already visually apparent that something is very different compared to before. Even skips tick by at a snail's pace (up to 0.5 s per skip). The main ansible-playbook process holds 500 MB of resident memory.

On ansible 1.9.4 this completes in 23 minutes. On 2.1 in 2 hours, 9 minutes.

This is ansible.cfg:
[defaults]
forks = 55
force_color = 1
roles_path = roles
hostfile = inventory
#library = library
retry_files_enabled = False
[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=61s -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=yes -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey,keyboard-interactive
pipelining=True

Since people were suggesting it, I tried fact caching with Redis and also the free strategy. Neither changed the runtime significantly (as expected, since the time is not spent gathering facts; it seems to be spent forking the main process or something along those lines):

ansible 2.1.0.0 (stable-2.1 be28443943)
15125.40user 6640.76system 2:08:29elapsed 282%CPU (0avgtext+0avgdata 3262800maxresident)k
304inputs+1046376outputs (5major+613591630minor)pagefaults 0swaps

ansible 2.1.0.0 (stable-2.1 be28443943) -- fact_caching = redis
14469.46user 6497.10system 2:04:38elapsed 280%CPU (0avgtext+0avgdata 3052820maxresident)k
728inputs+1028496outputs (18major+606686383minor)pagefaults 0swaps

ansible 2.1.0.0 (stable-2.1 be28443943) -- strategy = free
16394.78user 8026.57system 2:19:29elapsed 291%CPU (0avgtext+0avgdata 1231688maxresident)k
80inputs+1084288outputs (0major+626770671minor)pagefaults 0swaps


We really want to run 2.x, but at this point we are considering maintaining 1.9.x ourselves with backported modules. I wanted to bring this up on the mailing list first, to see whether anyone on the dev team has a grip on what is going on with 2.x at all. What is happening that throttles playbook execution so much? Is this going to improve anytime soon?

Apparently this has not come up before, otherwise, I assume, 2.0 would not have been released like this. So I wonder what is different in our setup that slows Ansible 2.x to a crawl. I suspect it is again our large inventory, but I'd like some help figuring it out. If I wanted to explore this further, what would be the appropriate tools to profile/benchmark/analyze the problem with our workload?
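The best I have come up with on my own so far is simply wrapping a whole run in cProfile from the standard library; a rough sketch (untested against our wrapper scripts; PlaybookCLI is the entry point bin/ansible-playbook uses in 2.x, and the output file name is made up), but maybe there is something better suited:

import cProfile

from ansible.cli.playbook import PlaybookCLI

# Run the playbook exactly like bin/ansible-playbook would, but under the profiler,
# and dump the raw stats for later comparison with a 1.9.x run.
cli = PlaybookCLI(['ansible-playbook', 'playbooks/foo/bar/baz_quux.yml', '--check'])
cli.parse()

profiler = cProfile.Profile()
profiler.runcall(cli.run)
profiler.dump_stats('ansible-2.1.prof')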

cheers,
Tobias

jhawkesworth

May 10, 2016, 7:25:02 AM
to Ansible Development
Do you have a lot of vaulted vars?
I installed cryptography from pip (after some messing around meeting cryptography's dependencies) and that improved speed for me a lot (although I only have a few groups and a fairly small inventory), as mentioned here: http://docs.ansible.com/ansible/playbooks_vault.html#speeding-up-vault-operations

Which Python version are you using?

Tobias Wolf

May 10, 2016, 8:11:35 AM
to Ansible Development
On Tuesday, May 10, 2016 at 1:25:02 PM UTC+2, jhawkesworth wrote:
Do you have a lot of vaulted vars?

We do not use Vault at all. For secret vars we use locally stored persistent facts and a FUSE-mounted sshfs.

Which Python version are you using?

2.7.11


So, I've collected some metrics about our setup. I got starved for memory and triggered the OOM killer once while trying to count our hostvars.

$ nice -n 19 ansible all -m debug -a var=hostvars | pv | wc -l
9.84GiB 0:56:07 [2.99MiB/s] [          <=>    ]
291874566

And hostvars relevant to the test playbook only:

$ ansible "foo_type=bar_baz:&host.os=Ubuntu_14.04" -m debug -a var=hostvars | wc -l
11247516

Number of groups:

$ inventory/inventory.foo.pl | jq '. | length'
9089

Total size of the inventory in hosts (~140 of which match the play):

$ ansible all --list-hosts | wc -l 
3843

Number of tasks in the test playbook

$ ansible-playbook playbooks/foo/bar/baz_quux.yml --list-tasks | grep "    [a-Z]"  | wc -l
561

And this takes 129 minutes on 2.1, versus 23 minutes on 1.9.4.

Will Thames

May 10, 2016, 8:45:54 PM
to Ansible Development

Might be worth using the profile_tasks callback to see where the time is being spent. Is every task just taking 6 times longer, or is it specific tasks?

https://github.com/jlafon/ansible-profile (it's built into v2.0+ but that page has details of how to set it up)

In the above example of just running debug, it's not clear whether it's the inventory processing that is slow, or the fact gathering, or both.

Once you have profile_tasks set up, using a playbook as simple as:

- hosts: "{{hosts}}"

with

time ansible-playbook playbook.yml -e hosts="foo_type=bar_baz:&host.os=Ubuntu_14.04"

will help see where time is spent.

Tobias Wolf

May 11, 2016, 4:06:16 AM
to Ansible Development
Might be worth using the profile_tasks callback to see where the time is being spent. Is every task just taking 6 times longer, or is it specific tasks?

I don't think that is going in the right direction, considering that everything is slower, skipped tasks most prominently. It seems to be a core issue in the execution logic.

Here's another test result concerning inventory size:


               | full inventory   minimal inventory   full/minimal
               | (9000 groups)    (1 group + _meta)
---------------+--------------------------------------------------
 ansible 2.1   | 12m45s           4m59s               2.6x
 ansible 1.9.4 |  3m34s           2m26s               1.5x

             1 play, 148 hosts, 22 tasks

Inventory size clearly affects performance, but much more so on Ansible 2.x.

I've tried deactivating the dynamic logic that seems to be the main "feature" of 2.x, but adding these lines to ansible.cfg had no measurable effect:

task_includes_static = yes
handler_includes_static = yes



Tobias Wolf

May 26, 2016, 8:24:34 AM
to Ansible Development
Congratulations on releasing Ansible 2.1 today. To celebrate I want to share this slightly amended xkcd:

[image: slightly amended xkcd comic]
I've looked at cProfile stats for the aforementioned test playbook.

The first line of the stats says it all: completing the playbook took about 600k Python function calls in Ansible 1.9.4, and about 50 million in Ansible 2.1.
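Roughly, something like this is enough to see it once you have the two dumps of otherwise identical --check runs (file names are made up); the header line printed by print_stats() is where the total call count comes from:

import pstats

# Load and print the dumps of the two runs; the header line shows the total
# number of function calls, the rest shows where the cumulative time went.
for dump in ('ansible-1.9.4.prof', 'ansible-2.1.prof'):
    stats = pstats.Stats(dump)
    stats.sort_stats('cumulative')
    stats.print_stats(10)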

Hope this will be addressed at some point.

--Tobias

jhawkesworth

May 26, 2016, 9:01:31 AM
to Ansible Development
Hey, thanks for this.

Seeing this thread again has reminded me that I was curious about the number of groups you have in relation to the number of hosts. If I read the above right, you have nearly 9000 groups but only 'a couple of thousand' hosts, so hosts must be members of multiple groups.

I'm interested to know what problems organising groups and hosts in this way solves, or perhaps it's the only way to deal with whatever you are dealing with. Partly I'm just curious, as you are clearly operating at a scale that I'm nowhere near, and partly I want to see whether the collective brains round here might have another way to organise things that could speed stuff up for you.

Jon

Tobias Wolf

May 26, 2016, 12:32:00 PM
to Ansible Development
On Thursday, May 26, 2016 at 3:01:31 PM UTC+2, jhawkesworth wrote:
Hey, thanks for this.

Seeing this thread again has reminded me that I was curious about the number of groups you have in relation to the number of hosts. If I read the above right, you have nearly 9000 groups but only 'a couple of thousand' hosts, so hosts must be members of multiple groups.

That's right. Instead of simple membership in a group like "westcoast", we encode a lot of meta-information in the inventory in a sort of key-value notation. I described some of this in my issue #13956.

Any particular host is a member of quite a few groups. "host.os=FreeBSD" is a group name, as is "status=productive". Then there are groups related to services, several of which apply to any particular host: "service.type=foo", "service.country.name=bar_baz_quux", and so on.

Lots of groups only contain one host, some contain thousands.
 
I'm interested to know what problems organising groups and hosts in this way solves, or perhaps it's the only way to deal with whatever you are dealing with. Partly I'm just curious, as you are clearly operating at a scale that I'm nowhere near, and partly I want to see whether the collective brains round here might have another way to organise things that could speed stuff up for you.

The thing is that even with the inventory stripped to the bare minimum, Ansible 2.x is CPU-bound much of the time here, whereas in 1.9.x most of the runtime was spent in time.sleep, waiting for tasks to complete on the remote hosts. In 2.x it calls 85 times as many functions as before to do the same job.

Tobias Wolf

Jun 3, 2016, 1:02:02 PM
to Ansible Development
More crazy profiling results.

Ansible devel, 143 matching hosts, some 15 tasks. Inventory:

$ inventory/inventory.xxx.pl | jq '._meta.hostvars | length '
3845 # hosts
$ inventory/inventory.xxx.pl | jq '._meta.hostvars[] | length ' | numsum
88285 # sum of hostvars
$ inventory/inventory.xxx.pl | jq '. | length'
8994 # groups
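In case anyone wants to reproduce the measurement below: kernprof's line-by-line mode (-l) only profiles functions carrying the @profile decorator ('profile' is injected as a builtin by kernprof, so no import is needed), so the function in question gets a temporary decorator right above its definition:

# temporary instrumentation in lib/ansible/vars/__init__.py
@profile
def _get_magic_variables(self, loader, play, host, task, include_hostvars, include_delegate_to):
    '''body unchanged -- see the line-by-line output below'''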



$ kernprof -l -o ~/2.2.0.kern.out /home/towolf/src/local/ansible/bin/ansible-playbook o-test.yml --check
$ python -m line_profiler 2.2.0.kern.out
Timer unit: 1e-06 s

Total time: 485.37 s
File: /home/towolf/src/local/ansible/lib/ansible/vars/__init__.py
Function: _get_magic_variables at line 374

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   374                                               @profile
   375                                               def _get_magic_variables(self, loader, play, host, task, include_hostvars, include_delegate_to):
   376                                                   '''
   377                                                   Returns a dictionary of so-called "magic" variables in Ansible,
   378                                                   which are special variables we set internally for use.
   379                                                   '''
   380                                           
   381      3010        10132      3.4      0.0          variables = dict()
   382      3010        51137     17.0      0.0          variables['playbook_dir'] = loader.get_basedir()
   383                                           
   384      3010         5181      1.7      0.0          if host:
   385     47808      1080418     22.6      0.2              variables['group_names'] = sorted([group.name for group in host.get_groups() if group.name != 'all'])
   386                                           
   387      2574         5605      2.2      0.0              if self._inventory is not None:
   388      2574         6046      2.3      0.0                  variables['groups']  = dict()
   389  23155704     46359530      2.0      9.6                  for (group_name, group) in iteritems(self._inventory.groups):
   390 171155556    435879828      2.5     89.8                      variables['groups'][group_name] = [h.name for h in group.get_hosts()]
   391      3010         5512      1.8      0.0          if play:
   392      6016        77879     12.9      0.0              variables['role_names'] = [r._role_name for r in play.roles]
   393                                           
   394      3010         5422      1.8      0.0          if task:
   395      3006         6596      2.2      0.0              if task._role:
   396      2863        11964      4.2      0.0                  variables['role_name'] = task._role.get_name()
   397      2863        11069      3.9      0.0                  variables['role_path'] = task._role._role_path
   398      2863        51752     18.1      0.0                  variables['role_uuid'] = text_type(task._role._uuid)
   399                                           
   400      3010         7532      2.5      0.0          if self._inventory is not None:
   401      3010       351708    116.8      0.1              variables['inventory_dir'] = self._inventory.basedir()
   402      3010        75596     25.1      0.0              variables['inventory_file'] = self._inventory.src()
   403      3010         5813      1.9      0.0              if play:
   404                                                           # add the list of hosts in the play, as adjusted for limit/filters
   405                                                           # DEPRECATED: play_hosts should be deprecated in favor of ansible_play_hosts,
   406                                                           #             however this would take work in the templating engine, so for now
   407                                                           #             we'll add both so we can give users something transitional to use
   408    451950      1159038      2.6      0.2                  host_list = [x.name for x in self._inventory.get_hosts()]
   409      3010         5423      1.8      0.0                  variables['play_hosts'] = host_list
   410      3010         5010      1.7      0.0                  variables['ansible_play_hosts'] = host_list
   411                                           
   412                                                   # the 'omit' value alows params to be left out if the variable they are based on is undefined
   413      3010        10783      3.6      0.0          variables['omit'] = self._omit_token
   414      3010       130188     43.3      0.0          variables['ansible_version'] = CLI.version_info(gitinfo=False)
   415                                                   # Set options vars
   416      6020        28844      4.8      0.0          for option, option_value in iteritems(self._options_vars):
   417      3010         5882      2.0      0.0              variables[option] = option_value
   418                                           
   419      3010         6659      2.2      0.0          if self._hostvars is not None and include_hostvars:
   420      3004         5735      1.9      0.0              variables['hostvars'] = self._hostvars
   421                                           
   422      3010         4190      1.4      0.0          return variables


It's crazy how many hits there are on get_hosts(). I'm really wondering whether we are the only team working with a large inventory of any kind. Since Red Hat states that there is no commercial support available for Ansible Core at this time, I wonder where to get answers about this, if not here on this list.
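Back-of-the-envelope, the hit counts above mean the whole inventory is re-walked for every single task/host combination (all numbers copied from the line_profiler output; 2574 happens to be 143 hosts times 18 tasks):

# all numbers copied from the line_profiler output above
magic_var_calls  = 2574          # times the 'groups' magic variable was rebuilt
group_loop_hits  = 23155704      # hits on the iteritems(self._inventory.groups) loop
name_extractions = 171155556     # hits on the inner [h.name for h in group.get_hosts()]

print(group_loop_hits / float(magic_var_calls))    # ~9000 -> every group, every single time
print(name_extractions / float(magic_var_calls))   # ~66500 -> every group membership, every single time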


For reference, here's how Ansible 1.9.4 behaves, i.e. mostly waiting for the workers to return:

[image: Ansible 1.9.4 profile map, mostly time spent waiting for the workers]

Matt Martz

Jun 3, 2016, 1:11:16 PM
to Tobias Wolf, Ansible Development
Some work was recently started looking into some performance issues around inventory that may provide improvements.  See https://github.com/ansible/ansible/compare/fix_inv_vars_loading




--
Matt Martz
@sivel
sivel.net

Tobias Wolf

Jun 4, 2016, 7:41:37 AM
to Ansible Development, tow...@gmail.com
On Friday, June 3, 2016 at 7:11:16 PM UTC+2, Matt Martz wrote:
Some work was recently started looking into some performance issues around inventory that may provide improvements.  See https://github.com/ansible/ansible/compare/fix_inv_vars_loading

That looks interesting and seems like it expands on my earlier work-around for reducing excessive repetitive work (including hammering the file system) at the inventory parsing stage. I'm going to try this on Monday.

However, there seem to be many more places where stuff is packed, unpacked, and merged over and over again needlessly.

When the inventory is first read, the groups and their members are evaluated and every member host is packed into a Host object. Later, for every host, get_vars() iterates over the global list of groups again to generate the 'groups' magic variable. So each group is unpacked, its Host objects are looked at, and their names are extracted in that list comprehension. And we do this over and over again, for a hash that should be global and unique!
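To make that concrete, something along these lines would compute the name mapping once per inventory state instead of once per task per host (only a sketch of the idea, with made-up names, not the actual patch I am preparing):

# Sketch: build the group-name -> host-names mapping once and reuse it until the
# inventory actually changes (add_host, group_by, ...), instead of re-walking every
# Group's get_hosts() for every single task/host combination.
class CachedGroupNames(object):

    def __init__(self, inventory):
        self._inventory = inventory
        self._cache = None

    def invalidate(self):
        # to be called from the few places that mutate the inventory
        self._cache = None

    def as_dict(self):
        if self._cache is None:
            self._cache = dict(
                (name, [h.name for h in group.get_hosts()])
                for name, group in self._inventory.groups.items()
            )
        return self._cache

# _get_magic_variables() could then replace its nested loop with something like:
#     variables['groups'] = self._group_name_cache.as_dict()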

And on top of that, every hostvar is templated, even though it is merely static key-value information coming straight from the inventory.

Tobias Wolf

Jun 9, 2016, 10:12:21 AM
to Ansible Development, tow...@gmail.com
On Saturday, June 4, 2016 at 1:41:37 PM UTC+2, Tobias Wolf wrote:
When the inventory is first read, the groups and their members are evaluated and every member host is packed into a Host object. Later, for every host, get_vars() iterates over the global list of groups again to generate the 'groups' magic variable. So each group is unpacked, its Host objects are looked at, and their names are extracted in that list comprehension. And we do this over and over again, for a hash that should be global and unique!


I've submitted a pull request for this issue. Please review: https://github.com/ansible/ansible/pull/16204

According to line_profiler, the time spent inside that function went from 480 seconds to 15 seconds.

The new profile map looks like the following; VariableManager._get_magic_variables has shrunk into a corner.

[image: new profile map after the patch]
Also, the cost of posix.fork went from 3 ms to 12 ms per call, and it is executed about 3 times more often. This means we now spend 33 seconds in posix.fork, compared to 3.5 seconds before in 1.9.4.

Next I'm going to submit pull requests for the cases where a set makes more sense than a list, i.e., repeated membership lookups with few additions and unique contents.
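The pattern in question is basically repeated membership tests against big lists; a trivial, Ansible-agnostic illustration of why a set helps (made-up sizes):

import timeit

setup = "hosts_list = ['host%05d' % i for i in range(4000)]; hosts_set = set(hosts_list)"

# The same worst-case membership test, repeated the way it effectively happens
# once per task/host in the 2.x code paths.
print(timeit.timeit("'host03999' in hosts_list", setup=setup, number=100000))  # O(n) scan every time
print(timeit.timeit("'host03999' in hosts_set",  setup=setup, number=100000))  # O(1) hash lookup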

Igor Cicimov

Jul 15, 2016, 10:08:44 PM
to Ansible Development
Hi Tobias,

Have you noticed any improvement on this front? I'm still on 1.9.4 because of these issues.

Thanks,
Igor

Tobias Wolf

Aug 31, 2016, 1:49:19 PM
to Ansible Development

On Saturday, July 16, 2016 at 4:08:44 AM UTC+2, Igor Cicimov wrote:
Hi Tobias,

Have you noticed any improvement on this front? I'm still on 1.9.4 because of these issues.


I ran a new, longer benchmark today, one that is more representative of our daily work.

An "apply common stuff" playbook with --check over 164 hosts.

Ansible 1.9.4: 9 Minutes
Ansible 2.1-stable branch: 51 Minutes !
Ansible devel: stuck somewhere in the middle, no progress two hours later, but using CPU nonetheless:


So, we are all still on 1.9.x. The pull request I mentioned, which improves devel drastically for us, is still open. But even with it, devel is still more than 2 times slower.

Igor Cicimov

Sep 1, 2016, 8:06:32 PM
to Ansible Development

Bloody RedHell .... I mean RedHat :-)
 

Tobias Wolf

Sep 2, 2016, 4:59:41 AM
to Ansible Development
Just to give closure to this chapter: after merging my PR we are back in more manageable territory, runtime-wise.

An "apply common stuff" playbook with --check over 164 hosts.
Ansible 1.9.4: 9 Minutes
Ansible 2.1-stable branch: 51 Minutes !
Ansible devel: stuck somewhere in the middle, no progress two hours later, but using CPU nonetheless:

Ansible devel as of yesterday: 42 minutes 
Ansible devel as of today: 16 minutes 

Igor Cicimov

Sep 2, 2016, 9:44:18 PM
to Ansible Development

Tobias, thanks for all your work. I can't believe such a serious issue has been ignored for so long. As a configuration manager, Ansible is aimed at managing hundreds of servers at the same time, and such poor performance is just unacceptable. Taking an hour to apply changes to 100 servers is a joke.

It's good to keep this thread near the surface, though, so people are aware of the issues that were introduced after version 1.9.4 and maybe, just maybe, push for a faster resolution in the production release.

Finally, if you could leave a note here when you move to version 2.x, it would be very much appreciated.

Thanks,
Igor
