Vagrant ansible provisioner multi VM race

Darragh Bailey

Nov 13, 2015, 12:38:58 PM
to Vagrant
Hi,


I've previously spent some time digging into the ansible provisioner and some issues we have at work with a development environment where we spin up multiple VMs in parallel and then provision them to a base configuration using ansible.

There was a pull request accepted from one of my colleagues, https://github.com/mitchellh/vagrant/pull/5551; however, we've found that some issues still remain.

The crux of the issue appears to be that the ansible provisioner assumes that, when it is executed, all active machines will be ready for SSH access; otherwise they are considered broken and should be recreated. This doesn't hold true when using `vagrant up --parallel`, as depending on the provider one or more of the VMs may still be booting.

In turn, this means the common generated inventory file may subsequently be regenerated as the other machines reach the ansible provisioner phase, truncating the file while it is being read by the ansible process spun up by other threads.

So there are two problems here:

1) inventory file may be truncated
2) inventory may be incomplete or incorrect because the thread didn't allow enough time for SSH to be available


I suggest reading the comments on the pull request for more details.


Item one is trivial to fix: just switch to an inventory file per machine, so that each subprocess gets its own file. That way none will overwrite or remove an inventory file already being read.
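As a rough sketch of what I mean (the `per_machine_inventory_path` helper is hypothetical, not an existing Vagrant API), each machine's provisioning thread would derive its own inventory path instead of sharing one file:

```ruby
require "tmpdir"

# Hypothetical helper: derive a per-machine inventory path so that each
# provisioning thread writes and reads its own file, instead of a shared
# inventory that another thread may truncate mid-read.
def per_machine_inventory_path(base_dir, machine_name)
  File.join(base_dir, "vagrant_ansible_inventory_#{machine_name}")
end

base = Dir.mktmpdir
web_inventory = per_machine_inventory_path(base, "web")
db_inventory  = per_machine_inventory_path(base, "db")

# Concurrent writers no longer clobber each other's inventory file.
File.write(web_inventory, "web ansible_ssh_host=192.168.50.10\n")
File.write(db_inventory,  "db ansible_ssh_host=192.168.50.11\n")
```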

Number two appears to be a little trickier. Looking at the code at https://github.com/mitchellh/vagrant/blob/a3c077cbe0b27339bb14c7bcd404ea64fefd16d4/plugins/provisioners/ansible/provisioner/host.rb#L121-L142, I tried adding something like the following:

---------------------
        @machine.env.active_machines.each do |am|
          begin
            m = @machine.env.machine(*am)
            retryable(on: Vagrant::Errors::SSHNotReady,
                      tries: config.inventory_retries.to_i, sleep: 2) do
              m_ssh_info = m.ssh_info
              if !m_ssh_info.nil?
                forced_ssh_user = ""
                if config.force_remote_user
                  forced_ssh_user = "ansible_ssh_user='#{m_ssh_info[:username]}' "
                end
                machines += "#{m.name} ansible_ssh_host=#{m_ssh_info[:host]} " <<
                  "ansible_ssh_port=#{m_ssh_info[:port]} " <<
                  "#{forced_ssh_user}ansible_ssh_private_key_file=" <<
                  "'#{m_ssh_info[:private_key_path][0]}'\n"

                inventory_machines[m.name] = m
              else
                # there may be more valid states to retry on, but not for libvirt
                if [:preparing, :running].include?(m.state.id)
                  @logger.info("Machine '#{am[0]} (#{am[1]})' not yet responding to ssh.")
                  raise Vagrant::Errors::SSHNotReady
                else
                  @logger.info("Machine '#{am[0]} (#{am[1]})' not running, ignoring.")
                  machines += "# NOT RUNNING: ignoring non-running machine '#{m.name}'.\n"
                end
              end
            end
          rescue Vagrant::Errors::SSHNotReady => e
            @logger.error("Auto-generated inventory: Impossible to get SSH information for " <<
              "machine '#{m.name} (#{m.provider_name})'. This machine should be recreated.")
            # Leave a note about this missing machine in the inventory
            machines += "# MISSING: '#{m.name}' machine was probably removed without using " <<
              "Vagrant. This machine should be recreated.\n"
          rescue Vagrant::Errors::MachineNotFound => e
            @logger.info("Auto-generated inventory: Skip machine '#{am[0]} (#{am[1]})', " <<
              "which is not configured for this Vagrant environment.")
          end
        end

---------------------

Basically I was looking to retry in cases where there is a chance of the machine eventually providing SSH but it isn't ready yet. In doing this I discovered that accessing m.state.id can trigger a "machine is locked" exception, because querying the machine state triggers an update of the machine index's cached state, which requires the machine state method to explicitly unlock the returned entry.
see https://github.com/mitchellh/vagrant/blob/a3c077cbe0b27339bb14c7bcd404ea64fefd16d4/lib/vagrant/machine.rb#L485-L503

Of course, with the ansible provisioner for each machine in a multi-machine configuration all calling this method on every machine from different threads, there is a high likelihood of two threads trying to modify the same entry's state at the same time.

This also seems a little flaky anyway, since it depends on the state being reported by the provider, and each provider defines its states differently.

It seems like Vagrant multi-machine environments would benefit from some standard mechanism to establish whether a machine has been created by a provider and has finished booting, and should therefore have valid ssh_info.
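To illustrate the kind of standard mechanism I have in mind (nothing like `wait_for_ssh_info` exists in Vagrant today; `FakeMachine` below just stands in for a `Vagrant::Machine` that is still booting):

```ruby
# Hypothetical sketch: poll a machine until it reports usable ssh_info,
# or give up after a fixed number of tries, rather than treating a nil
# ssh_info as "broken, recreate me".
def wait_for_ssh_info(machine, tries: 10, sleep_s: 0)
  tries.times do
    info = machine.ssh_info
    return info unless info.nil?
    sleep(sleep_s)
  end
  nil
end

# Stand-in for a VM that only becomes reachable on the third poll.
class FakeMachine
  def initialize
    @calls = 0
  end

  def ssh_info
    @calls += 1
    @calls >= 3 ? { host: "192.168.50.10", port: 22 } : nil
  end
end

info = wait_for_ssh_info(FakeMachine.new)
```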

At the moment this appears it would only benefit the ansible provisioner, but I'm sure it would be useful elsewhere to know that a machine is available and should be able to respond, rather than trying to infer this from whether ssh_info is nil or not.

Btw, using communicate.ready? has its own issues with the ssh communicator, in how it synchronizes around the insecure ssh private key switch. It still also doesn't


Desired Enhancement
--------------------------------

To call it out explicitly: it would be useful to be able to independently and consistently determine certain machine/environment states, including (but not limited to) whether a machine is still being brought up by the provider, is ready for anything that needs it booted, or has failed.


I'm more than happy to implement the needed changes, and plan to upload a pull request for some of the changes to achieve the above once I've synced with the latest head.

I'm not sure if there is something already in there I can use to infer the needed state information consistently, or where would be best to add such information.
Perhaps lib/vagrant/action/builtin/provision.rb, by setting some env flags around the app.call(env) call at https://github.com/mitchellh/vagrant/blob/a3c077cbe0b27339bb14c7bcd404ea64fefd16d4/lib/vagrant/action/builtin/provision.rb#L79-L80?
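For example, a minimal sketch of what I mean by "env flags" in the middleware stack (the `MarkBooted` class and the `:machine_booted` key are both hypothetical; the lambda below stands in for the rest of the middleware chain):

```ruby
# Hypothetical middleware sketch: set a flag in the env hash before
# calling down the stack, so later middlewares (or the ansible
# provisioner) can check env[:machine_booted] instead of inferring
# readiness from ssh_info being non-nil.
class MarkBooted
  def initialize(app, env)
    @app = app
  end

  def call(env)
    env[:machine_booted] = true
    @app.call(env)
  end
end

# Stand-in for the rest of the middleware chain: records that it ran
# after the flag was already set.
terminal = ->(env) { env[:seen_flag] = env[:machine_booted] }

env = {}
MarkBooted.new(terminal, env).call(env)
```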


--
Regards,
Darragh Bailey

Darragh Bailey

Nov 17, 2015, 1:08:45 PM
to Vagrant

Forgot to complete this:

If you use communicate.ready? on a machine from a different thread, the second thread can begin trying to communicate with the machine over the existing connection, and then have that connection closed out from under it as the first thread reaches https://github.com/mitchellh/vagrant/blob/c1508cd893e4b3bd5282f1dde3399140955b80e4/plugins/communicators/ssh/communicator.rb#L183-L184 while the second is still in the middle of a connect block using the connection: https://github.com/mitchellh/vagrant/blob/c1508cd893e4b3bd5282f1dde3399140955b80e4/plugins/communicators/ssh/communicator.rb#L312

Basically the communicator isn't thread safe.
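One way to paper over this from the caller's side (not a real fix, and nothing Vagrant provides) would be to serialise access to a shared communicator with a Mutex. `FakeCommunicator` here just stands in for the real SSH communicator so the sketch is self-contained:

```ruby
# Stand-in for the SSH communicator: records every command it runs.
class FakeCommunicator
  attr_reader :log

  def initialize
    @log = []
  end

  def execute(cmd)
    @log << cmd
  end
end

# Hypothetical wrapper: serialise all access to the underlying
# communicator so two provisioning threads can't race on the same
# SSH connection.
class SynchronizedCommunicator
  def initialize(comm)
    @comm = comm
    @lock = Mutex.new
  end

  def execute(cmd)
    @lock.synchronize { @comm.execute(cmd) }
  end
end

fake = FakeCommunicator.new
comm = SynchronizedCommunicator.new(fake)

# Four threads each issue 25 commands through the shared wrapper.
threads = 4.times.map do |i|
  Thread.new { 25.times { comm.execute("echo #{i}") } }
end
threads.each(&:join)
```

This only guards against concurrent use; it does nothing about the insecure-key reconnection logic itself, which would still need fixing inside the communicator.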

 
I've opened https://github.com/mitchellh/vagrant/issues/6526 to cover this, with some details added about where I'm looking to fix it. Hopefully feedback on the issue, combined with the discussion here, will direct my efforts towards a fix in the preferred manner.

--
Darragh Bailey