Ansible ec2_facts returns false data (when NAT is configured on the system level; it is OK if You route through the AWS route table to the NAT interface or gateway)


sirkubax

Jul 14, 2015, 8:16:49 AM
to ansible...@googlegroups.com
THE PROBLEM:
I've just realised why my playbook sometimes fills the template with false data.

This happens when the instance is in my VPC subnet (with an internet gateway) but a NAT route is configured in the routing table on the system level. Requests to the internet then go through the NAT instance, and the AWS response is masked:
the NAT instance's facts are returned, NOT the current instance's facts.


THE DEBUGGING:

If You look into the code, the ec2_facts module sends a series of requests to the instance metadata service.

For example, it reports:
172.16.0.200

while the real address is:
eth0: ***
    inet 172.16.0.110/24 brd 172.16.0.255 scope global eth0


THE INSTANCE CONFIGURATION:

$ ip r
default via 172.16.0.200 dev eth0 
172.16.0.0/24 dev eth0  proto kernel  scope link  src 172.16.0.110 
172.16.0.0/16 via 172.16.0.1 dev eth0 

$ ip a
eth0: ***
    inet 172.16.0.110/24 brd 172.16.0.255 scope global eth0


If You keep the remote files, You can check it Yourself:
export ANSIBLE_KEEP_REMOTE_FILES=1
and then running
python /home/ubuntu/.ansible/tmp/ansible-tmp-1436872330.49-72199016469620/ec2_facts
will return, as one of the facts:
            "ansible_ec2_local_ipv4": "172.16.0.200",
(or run a curl request instead)
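
The same check can be scripted. Below is a minimal Python sketch, not part of ec2_facts itself. It assumes the metadata service answers plain, unauthenticated requests at http://169.254.169.254/latest/meta-data/local-ipv4; the function names and the probe address are hypothetical and chosen only for illustration. It compares the address reported by the metadata service with the source address the kernel would actually use.

```python
import socket
import urllib.request

# Assumption: the metadata service answers plain GET requests at this URL.
METADATA_URL = "http://169.254.169.254/latest/meta-data/local-ipv4"


def fetch_metadata_ipv4(url=METADATA_URL, timeout=2.0):
    """Return local-ipv4 as reported by whichever host answers the metadata request."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("ascii").strip()


def outbound_ipv4(probe_addr="172.16.0.1"):
    """Return the source address the kernel would use to reach probe_addr.

    Connecting a UDP socket sends no packets; it only makes the kernel
    pick a route and a source address.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.connect((probe_addr, 53))
        return sock.getsockname()[0]
    finally:
        sock.close()


def metadata_matches(metadata_ip, interface_ip):
    """True when the metadata answer describes this instance's own address."""
    return metadata_ip.strip() == interface_ip.strip()


# On the affected instance one would run:
#   metadata_matches(fetch_metadata_ipv4(), outbound_ipv4())
# With the values from this post the comparison fails:
#   metadata_matches("172.16.0.200", "172.16.0.110")  -> False
```

A mismatch suggests that the metadata request was answered on behalf of another host, such as the NAT instance.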

THE CURRENT WORKAROUND:
  1. Do NOT use (in roles or tasks)
    1. - action: ec2_facts
    2. DRAWBACKS:
      1. Some variables will not be available (ansible_ec2_* will be missing)
      2. You will have only the ec2_* facts from Your LOCAL inventory cache (ec2.py, if I remember correctly)
      3. If You set "gather_facts: True" in the playbook, You can also use the ansible_* facts gathered by the setup module
        1. so instead of ansible_ec2_local_ipv4 You can use ansible_eth0['ipv4']['address']
      4. BUT this can cause problems when a role expects some variable (example: ansible_hostname) while system fact gathering is disabled in the playbook ("gather_facts: False"), so You will have to be careful
      5. OR when You would like to access some AWS variable independently of Your LOCAL cache
  2. Configure Your VPC routing tables to point to the NAT instance interface, rather than to an IP address
      1. 0.0.0.0/0  eni-xxx / i-xxx
    1. instead of:
      1. 0.0.0.0/0  igw-zzzzz  + system routing tables
    2. Then You do not have to override the routing table on the system level
    3. You rely on the AWS router
    4. DRAWBACKS
      1. You will have to change the routing table in the VPC to point to another physical interface when Your NAT instance shuts down
        1. vs
      2. If You keep the system routing table, You can launch a new NAT instance with the "old IP address" attached
QUESTIONS / CONCLUSION:
  1. Be aware of the ec2_facts limitation
  2. If possible, rely on the Amazon routing table
    1. How do You prevent a SPOF in Your VPC subnets?
    2. What is Your best practice for configuring VPC subnets (private and public) so that they have outbound internet access (for github, apt) and are still safe, without the NAT instance being a SPOF?



Igor Cicimov

Jul 14, 2015, 8:21:38 PM
to ansible...@googlegroups.com
I'm using Ansible with AWS VPCs, most of which have public and private subnets, and have never had the problem you are seeing. This is definitely a misconfiguration on your side and has nothing to do with Ansible. ec2_facts is doing the right thing: there is no other way of collecting this data except querying the meta-data repository, which is what the AWS CLI tools do anyway, meaning you will get wrong data using the AWS CLI as well. Don't forget you are in the cloud, and your networking is configured at the hypervisor/SDN level and NOT at the instance level. You can create as many network interfaces as you want at the instance level and set IPs on them, but none of them will work, since you have bypassed the SDN and there is no record of them in the meta-data repository. Which finally means that collecting facts locally on the instance really means nothing if those values don't match what is in the meta-data repository.

Now that we have that cleared up, let's move to your problem, which looks to me to be the AWS routing tables, or more specifically the lack of them. For an instance to be in a private subnet, it needs a routing table separate from the VPC's default one (which has the IGW created for you when the VPC was created), one that has the NAT instance as IGW (internet gateway). And that is all you need; you don't have to set any routing tables on the system level, the SDN will route the traffic for you.

Hope this makes sense. Since you haven't provided any info about your subnets, routing tables, ACLs etc., this is more of a guess at what's going on, so please correct my assumptions if needed.

Thanks,
Igor

Igor Cicimov

Jul 14, 2015, 8:52:01 PM
to ansible...@googlegroups.com
Have to correct myself, you do provide the subnet information. So in answer to your questions/conclusions, the way I do it is:

- Use a private routing table for the private subnets, pointing to the NAT as IGW
- Use 2 x NAT instances and a NAT takeover script that modifies the private subnets' routing table and points the IGW to itself in case the other NAT instance has failed
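
The takeover idea can be sketched roughly as follows. This is not Igor's actual script, which is not shown in this thread; it is a minimal Python illustration under stated assumptions: the function names, the failure threshold, and the route table / instance IDs passed in are all hypothetical, and the route change assumes the boto3 EC2 client and its `replace_route` call are available on the NAT instance.

```python
def should_take_over(probe_results, failures_required=3):
    """Return True when the last `failures_required` health probes of the
    peer NAT instance all failed (each probe result is True for success)."""
    recent = probe_results[-failures_required:]
    return len(recent) == failures_required and not any(recent)


def point_default_route_at(route_table_id, instance_id, region):
    """Point 0.0.0.0/0 in the private subnets' route table at this NAT instance.

    boto3 is imported lazily so the decision logic above stays usable
    on machines where boto3 is not installed.
    """
    import boto3  # assumption: boto3 is installed and credentials are configured

    ec2 = boto3.client("ec2", region_name=region)
    ec2.replace_route(
        RouteTableId=route_table_id,
        DestinationCidrBlock="0.0.0.0/0",
        InstanceId=instance_id,
    )


# Example decision with a hypothetical probe history (True = peer answered):
#   should_take_over([True, False, False, False])  -> True
#   should_take_over([False, True, False])         -> False
```

Requiring several consecutive failures before switching is a design choice made here to avoid flapping on a single lost probe; the right threshold depends on the setup.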

Jakub Muszynski

Jul 15, 2015, 7:14:27 AM
to ansible...@googlegroups.com
Thanks Igor.

You are right, it is not an Ansible "bug" but a configuration "feature", though it is a "bad one", since it silently provides false data. I had to dig into the source code to track it down.
There could be a warning in ec2_facts that detects the default route, but it would take some work :/

---------------------
To sum up my situation: I've worked out a solution that is almost the same as the one You have provided :)
I will describe it in my own words:

I did not provide enough data about my subnets.
I have a public subnet and a private one. The faulty instances were in the public subnet, with their system-local routing table containing "default via 172.16.0.200 dev eth0". I have moved those instances to the private subnet and set its routing table so that the default traffic goes via the NAT instance in the public subnet:

Destination   Target                 Status   Propagated
(not shown)   local                  Active   No
(not shown)   eni-ezzzzb / i-2xxxx   Active   No

So that's exactly what You stated :)

To fix the issue in the public subnet (with "default via 172.16.0.200 dev eth0"), it would be enough to add 

ip r a 169.254.169.254 via 172.16.0.1
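
The effect of that extra route can be illustrated with a longest-prefix lookup over the routing table shown earlier in this thread. This is a minimal Python sketch, not code from ec2_facts; the helper names are hypothetical, and the parser only understands simple `ip r` lines of the kind quoted in this post.

```python
import ipaddress


def parse_routes(ip_route_output):
    """Parse simple `ip r` lines into (network, gateway) pairs.

    gateway is None for on-link routes (lines without "via").
    """
    routes = []
    for line in ip_route_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        dest = "0.0.0.0/0" if fields[0] == "default" else fields[0]
        network = ipaddress.ip_network(dest, strict=False)
        gateway = fields[fields.index("via") + 1] if "via" in fields else None
        routes.append((network, gateway))
    return routes


def next_hop(routes, address):
    """Return the gateway of the most specific route that matches address."""
    addr = ipaddress.ip_address(address)
    matches = [(net, gw) for net, gw in routes if addr in net]
    if not matches:
        raise LookupError("no route to %s" % address)
    # Longest prefix wins, as in the kernel's route selection.
    return max(matches, key=lambda route: route[0].prefixlen)[1]


ROUTES_BEFORE = """\
default via 172.16.0.200 dev eth0
172.16.0.0/24 dev eth0  proto kernel  scope link  src 172.16.0.110
172.16.0.0/16 via 172.16.0.1 dev eth0
"""
ROUTES_AFTER = ROUTES_BEFORE + "169.254.169.254 via 172.16.0.1\n"

print(next_hop(parse_routes(ROUTES_BEFORE), "169.254.169.254"))  # 172.16.0.200
print(next_hop(parse_routes(ROUTES_AFTER), "169.254.169.254"))   # 172.16.0.1
```

Before the fix, the metadata address is only covered by the default route, so the request leaves via the NAT instance at 172.16.0.200; with the host route in place it goes to 172.16.0.1 instead. A lookup like this could also serve as the basis for the warning suggested above.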

verification:

since the
modules/core/cloud/amazon/ec2_facts.py
defines the query parameter as:


So I'll have to add 2xNAT and I'll be happy :)



