DNS resolution issue in deploying CF: multiple nameservers in resolv.conf with wrong order


Guillaume Berche

Jan 24, 2014, 11:40:15 AM
to bosh-...@cloudfoundry.org
I'm facing an issue when trying to deploy CF on vCloud: DNS resolution on the job VMs seems to be failing. Looking at /etc/resolv.conf, the nameservers seem to be in the wrong order: the bosh micro IP hosting powerDNS is the last entry in the list, while the entries specified in the manifest come first.

I'm following the vSphere sample manifest [2], [3], which does specify dns entries in the network part of the CF manifest. In bosh_cloudfoundry, the vSphere template specifies one [4], but the AWS and OpenStack templates don't [5]. Is it just a documentation issue, and should I leave an empty dns section in networks and instead configure the bosh micro deployment to use the powerDNS recursor? In the vSphere example, should dns_ip_address resolve to the bosh node holding the powerDNS entries?

Extract of my manifest follows:

networks:
- name: default
  subnets:
  - range: 10.114.60.0/23
    # expecting bosh to use 10.114.60.151 to 10.114.60.200
    # reserved are the ones that bosh should not be using
    reserved:
    - 10.114.60.1 - 10.114.60.150
    - 10.114.60.201 - 10.114.61.253
    #static would be used by router
    static:
    - 10.114.60.151 - 10.114.60.151
    gateway: 10.114.61.254
    dns:
    - 10.171.x.y
    - 10.171.x.z
    cloud_properties:
      name: mynetwork



[1] states that:
              The algorithm used is to try a name server, and if the  query  times
              out, try the next, until out of name servers, then repeat trying
              all the name servers until  a  maximum  number  of  retries  are
              made.
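
To make the consequence of that rule concrete, here is a minimal Python sketch of the failover behaviour as quoted from [1]. It is a simulation, not the real resolver: the `resolve` function, the server tables, the IP addresses and the host name are all hypothetical. It assumes that any answer, including "no such name", ends the lookup, and that only a timeout moves on to the next nameserver.

```python
# Simulation of the resolv.conf failover rule quoted above; not the real resolver.
TIMEOUT = object()  # marker: this server does not answer in time


def resolve(name, nameservers, servers, retries=2):
    """Return (server_ip, answer); answer is None when the name is unknown.

    `servers` maps a nameserver IP to a dict of records, or to TIMEOUT.
    """
    for _ in range(retries):
        for ip in nameservers:
            records = servers[ip]
            if records is TIMEOUT:
                continue  # only a timeout moves on to the next server
            # Any answer, including "not found", ends the lookup.
            return ip, records.get(name)
    return None, None


# Hypothetical addresses and host name, for illustration only.
name = "0.nats.default.cf.microbosh"
servers = {
    "10.171.0.1": {},                          # manifest DNS: no bosh records
    "10.114.60.10": {name: "10.114.60.152"},   # bosh micro running powerDNS
}

# Manifest DNS first: it answers "unknown", so powerDNS is never asked.
print(resolve(name, ["10.171.0.1", "10.114.60.10"], servers))
# powerDNS first: the record is found.
print(resolve(name, ["10.114.60.10", "10.171.0.1"], servers))
```

With the manifest DNS listed first the lookup returns no address even though the record exists in powerDNS, which matches the symptom described here.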

As a result, I see jobs failing to resolve powerDNS-hosted entries:

Preparing configuration
  binding configuration (00:00:01)
Done                    1/1 00:00:01
  Started updating job data: data/0 (canary)
     Done updating job data: data/0 (canary)
  Started updating job core: core/0 (canary)
     Done updating job core: core/0 (canary)
Error 400007: `core/0' is not running after update

/var/vcap/packages/health_manager_next/health_manager_next/vendor/bundle/ruby/1.9.1/gems/eventmachine-1.0.3/lib/eventmachine.rb:664:in `connect_server': unable to resolve server address (EventMachine::ConnectionError)
        from /var/vcap/packages/health_manager_next/health_manager_next/vendor/bundle/ruby/1.9.1/gems/eventmachine-1.0.3/lib/eventmachine.rb:664:in `bind_connect'
        from /var/vcap/packages/health_manager_next/health_manager_next/vendor/bundle/ruby/1.9.1/gems/eventmachine-1.0.3/lib/eventmachine.rb:640:in `connect'
        from /var/vcap/packages/health_manager_next/health_manager_next/vendor/bundle/ruby/1.9.1/gems/nats-0.4.26/lib/nats/client.rb:115:in `connect

When I manually change the order of the entries in /etc/resolv.conf, the jobs start properly.
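
The manual reordering amounts to something like the following Python sketch. The helper name `promote_nameserver` and the addresses are made up for illustration; it works on the file's text only and leaves reading and writing /etc/resolv.conf to the caller.

```python
def promote_nameserver(text, preferred_ip):
    """Return resolv.conf text with `preferred_ip` as the first nameserver entry.

    Other lines (search, options, comments) keep their positions.
    """
    lines = text.splitlines()
    slots = [i for i, line in enumerate(lines) if line.split()[:1] == ["nameserver"]]
    entries = [lines[i] for i in slots]
    # Stable sort: the preferred entry moves first, the rest keep their order.
    entries.sort(key=lambda line: line.split()[1:2] != [preferred_ip])
    for i, entry in zip(slots, entries):
        lines[i] = entry
    return "\n".join(lines) + "\n"


# Hypothetical addresses: two manifest DNS servers, then the powerDNS host.
before = (
    "nameserver 10.171.0.1\n"
    "nameserver 10.171.0.2\n"
    "nameserver 10.114.60.10\n"
)
print(promote_nameserver(before, "10.114.60.10"), end="")
```

This is only useful to confirm the diagnosis on a single VM; it does not change what the director sends to the agent.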

The bosh micro has dns enabled, and I can see the powerDNS records being populated in the db.

When the director asks the agent to update a job, the environment hash already includes this list in the wrong order.

I tried to find in the code what would have caused the wrong ordering here, but without luck. I also think I saw posts on the mailing list from people who had the powerDNS entry first.

[1] http://manpages.ubuntu.com/manpages/lucid/man5/resolver.5.html
[2] http://docs.cloudfoundry.com/docs/running/deploying-cf/vsphere/cloud-foundry-example-manifest.html
[3] https://github.com/cloudfoundry/cf-docs/blob/master/source/docs/running/deploying-cf/vcloud/wordpress-vcloud.yml#L32
[4] https://github.com/cloudfoundry-community/bosh-cloudfoundry/blob/vsphere/templates/v134/vsphere/medium/deployment_file.yml.erb#L63
[5] https://github.com/cloudfoundry-community/bosh-cloudfoundry/blob/master/templates/v149/openstack/medium/deployment_file.yml.erb#L50

Thanks in advance,

Guillaume.

James Bayer

Jan 24, 2014, 11:32:07 PM
to bosh-users
i've forwarded this to the team working on the vcloud CPI at VMware.



--
Thank you,

James Bayer

Guillaume Berche

Jan 27, 2014, 9:02:55 AM
to bosh-...@cloudfoundry.org
Thanks James. I'm not sure the issue is specific to the vcloud CPI.

Removing the dns entries from the CF manifest fixed the issue (or at least the symptoms). I submitted PR https://github.com/cloudfoundry/cf-docs/pull/286 to help other CF admins avoid falling into the same trap.

Why isn't bosh_director adding the powerDNS IP at the head of the dns list instead of at the end? Moreover, since multiple DNS entries in resolv.conf are only supposed to be contacted on timeouts, I wonder whether it even makes sense to mix the bosh powerDNS IP with user-provided DNS entries from the network section of the manifest: they seem to me to have different roles, and hence cannot act as failovers for one another.

When faced with an inconsistency between the bosh config (with DNS enabled) and a release manifest specifying dns entries for a network, shouldn't the bosh director either:
a- choose the manifest-provided dns entries,
b- error, or
c- choose the single powerDNS IP?
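
To spell out the three options, here is a small Python sketch. It is purely illustrative and is not bosh director code: the function `choose_dns`, its parameters and the policy names are all hypothetical, as are the addresses in the example.

```python
def choose_dns(powerdns_ip, manifest_dns, policy):
    """Hypothetical sketch of options a, b and c; not bosh director code.

    powerdns_ip:  IP of the director-managed powerDNS, or None if DNS is disabled.
    manifest_dns: list of dns entries from the network section of the manifest.
    """
    if powerdns_ip is None:
        return list(manifest_dns)      # no conflict: only the manifest has entries
    if not manifest_dns:
        return [powerdns_ip]           # no conflict: only powerDNS is configured
    if policy == "manifest":           # option a
        return list(manifest_dns)
    if policy == "error":              # option b
        raise ValueError("network dns entries conflict with director-managed powerDNS")
    if policy == "powerdns":           # option c
        return [powerdns_ip]
    raise ValueError(f"unknown policy: {policy}")


# Hypothetical addresses, for illustration only.
print(choose_dns("10.114.60.10", ["10.171.0.1", "10.171.0.2"], "powerdns"))
```

In none of the three cases does the resulting list mix the two kinds of server, which is the point of the question.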

Guillaume.

James Bayer

Jan 28, 2014, 2:03:05 AM
to bosh-users
thanks for the PR, i'm glad it got merged already.

> In face of an inconsistency between bosh config (with DNS enabled) and a release manifest specifying dns entries for a network, and shouldn't bosh director either
> a- choose manifest provided dns entries
> b- error,
> c- choose the powerDns single IP


it appears that you have valid points. tammer/greg are PMs for bosh right now. i'll ask them about this. at pivotal, we don't encourage bosh release authors to use powerdns and therefore it's not too surprising to me that there are rough edges here. just today we were talking about what it would take to have a more robust dns solution like skydns [1].

Tammer Saleh

Jan 29, 2014, 3:51:28 PM
to James Bayer, bosh-users
I agree that bosh should error in this situation.  I've created a story in Greg's icebox for this, here: https://www.pivotaltracker.com/story/show/64774796

Cheers,
Tammer Saleh

Guillaume Berche

Jan 29, 2014, 4:12:01 PM
to bosh-...@cloudfoundry.org
Thanks Tammer!

Guillaume.

David Williams

Mar 19, 2014, 11:19:16 AM
to bosh-...@cloudfoundry.org
Guillaume, did this ever get resolved for you? I'm experiencing the same issue with a vSphere deployment of CF v161. For the jobs that are failing, I can bosh ssh into those VMs and I see in /etc/resolv.conf the DNS entries from the manifest ahead of PowerDNS. When I check the logs for the jobs, I see errors like this:

DNS resolution failed for 192.1681.151.204

Note that in this case the .204 address is the nfs_server, and I confirmed that the record is in PowerDNS.

Guillaume Berche

Mar 19, 2014, 4:40:09 PM
to bosh-...@cloudfoundry.org
I solved it by moving the dns entries into the recursor property.

I had fixed the documentation in
https://github.com/cloudfoundry/cf-docs/commit/5272912bd315c2e3b69f448ba2d648cc031a71e9, but it seems that the documentation bug was reintroduced when migrating to docs-deploying-cf: https://github.com/cloudfoundry/docs-deploying-cf/blame/master/vsphere/bosh-example-manifest.md#L26 :-(

Guillaume.

David Williams

Mar 19, 2014, 11:28:42 PM
to bosh-...@cloudfoundry.org
Thanks for the tip. I actually had the recursor property already defined in my BOSH manifest, and have tested that recursive DNS resolution works. However, I was still including my DNS servers in the network section of the CF deployment manifest, which causes them to be added to /etc/resolv.conf on the CF VMs ahead of my BOSH PowerDNS instance. I tried leaving the network.dns property empty, but I get the following error immediately when attempting to deploy CF (it fails the manifest validation step):

Error 40000: Property `dns' (value nil) did not match the required type `Array'

...implying that I must add values for DNS, which causes those values to be placed at the top of /etc/resolv.conf. The only 2 jobs that appear to be affected by this (and are failing due to DNS resolution issues) are Cloud Controller and Loggregator Traffic Controller. Everything else appears to have deployed fine.

Anyone have any idea on how to work around this, or what I'm missing?

James Bayer

Mar 20, 2014, 10:33:08 AM
to bosh-...@cloudfoundry.org
Reported this to docs team

Sent from my mobile

Guillaume Berche

Mar 20, 2014, 1:03:13 PM
to bosh-...@cloudfoundry.org
David, I have no dns section at all in the networks section of my CF manifest (running vcloud), i.e. not even an empty one, and I was not getting the Error 40000 that you have, with release v149. I'm currently testing with v161 and will report back if I run into the same error as you did.

Thanks James. I had also submitted issue 6.

Guillaume.

David Williams

Mar 22, 2014, 1:05:21 AM
to bosh-...@cloudfoundry.org
Forgot to report back. Leaving out the DNS section altogether did the trick. CF v161 deployed just fine after that.  Can't believe that little gotcha cost me a couple of days. :)
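
In case it helps the next person, here is a minimal Python sketch of that change, applied to a manifest that has already been parsed into a dict (for example with a YAML library). The function name `without_network_dns` is made up and the addresses are placeholders; it assumes dns sits under networks -> subnets, as in the manifest extract earlier in the thread.

```python
import copy


def without_network_dns(manifest):
    """Return a copy of a parsed manifest with every subnet-level `dns` key removed."""
    cleaned = copy.deepcopy(manifest)
    for network in cleaned.get("networks", []):
        for subnet in network.get("subnets", []):
            # Drop the key entirely; leaving it empty failed validation (Error 40000).
            subnet.pop("dns", None)
    return cleaned


manifest = {
    "networks": [{
        "name": "default",
        "subnets": [{
            "range": "10.114.60.0/23",
            "gateway": "10.114.61.254",
            "dns": ["10.171.0.1", "10.171.0.2"],  # placeholder addresses
            "cloud_properties": {"name": "mynetwork"},
        }],
    }],
}

cleaned = without_network_dns(manifest)
print("dns" in cleaned["networks"][0]["subnets"][0])   # False
print("dns" in manifest["networks"][0]["subnets"][0])  # True: the input is untouched
```

External resolution then has to come from somewhere else, which is what the recursor property mentioned above was used for.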

Thanks for the help!