Invalid Service Address with Consul


Onomojaku

Aug 19, 2019, 12:55:15 PM
to Nomad
I'm hoping that I'm missing something simple here, and that I'll just feel foolish shortly, but we're having a significant issue that I could use some help with.  We are running an integrated Nomad/Consul cluster: 28 nodes, 25 client-only and 3 combined server/client (I know, not recommended; we don't have spare processing hardware).
  • Nomad version: 0.9.4
  • Consul version: 1.5.3
  • Docker version: 18.09.5
Our problem is this: none of the client-only nodes are able to pass their service health checks to Consul.  The only health checks that get passed along are the Serf Health Status ones.  We can schedule jobs and allocations are assigned without issue; they start and run without problems.  This WAS working in a previous version - but I can't say for certain how long ago.  Aggressive log rotation is aggressive.

We see errors along these lines in our Nomad logs:
[ERROR] consul.sync: failed to update services in Consul: failures=10 error="Unexpected response code: 400 (Invalid Service Address)"

Additionally, when checking the ports, Nomad's TCP port 4646 shows up on the clients - but the RPC port of 4647 doesn't.  4647 does show up on the servers.  The "Ports Used" documentation suggests that RPC is used for internal communications between client agents and servers - thus I would suspect it should be up.  I honestly can't remember if 4647 was up on my client nodes before.

nomad node status -self -verbose shows me no issues, nor does nomad node status -verbose <nodeid>

Anyone have any insights?  If you have to see config files, it'll take us a day or two to get them migrated over from our private network.  We'll do it, if it'll help.


Rod

Shantanu Gadgil

Aug 19, 2019, 3:08:35 PM
to Nomad
I have a few points, not necessarily in any order, just thoughts:
* Typically I just allow all the Consul ports and Nomad ports for BOTH TCP and UDP among the machines. I know I am being lazy, but it is better than allowing all-to-all.

* I don't understand what you mean by "pass along the health check"; the HC runs on the client, and the state info is shared by all nodes, that's all.

* I suspect it is either a cloud-specific security group issue or an OS-based firewall issue that could be preventing a client from reaching the server.

I would run "nc -zv server:port" for ALL Consul and Nomad ports and verify connectivity from any one client-only machine to the server.
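A loop version of that sweep might look like this (a sketch; "nomad-server" is a placeholder for one of your server nodes, and the port list assumes the defaults, so adjust both):

```shell
#!/usr/bin/env bash
# Report whether host:port accepts a TCP connection. Uses bash's
# built-in /dev/tcp so it works even where nc isn't installed.
check_port() {
  local host=$1 port=$2
  if timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo open
  else
    echo closed
  fi
}

# Placeholder host; default Nomad (4646-4648) and Consul
# (8300-8302, 8500, 8600) ports.
for port in 4646 4647 4648 8300 8301 8302 8500 8600; do
  printf '%-5s %s\n' "$port" "$(check_port nomad-server "$port")"
done
```

Run it from one of the client-only machines; anything reported closed that a server-side netstat shows listening points at a firewall or security group in between.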

(So far this is what comes to mind)

Rod

Aug 19, 2019, 4:36:44 PM
to Shantanu Gadgil, Nomad
Replying in your order. :)

- ports are open on both UDP and TCP; those that I listed were what’s actually listening on which ports/protocols
- the HCs from the client-only hosts, other than Nomad's own node Serf check, are not registering with Consul
- we aren't running these in the cloud; these are on-premises bare-metal Red Hat 7.5 nodes
- I can run 'nmap' from the clients to the servers and from the servers to the clients and confirm the visible ports; they match the local output of 'netstat -ntl'

Locally inside the client nodes, a 'netstat -ntl | grep 46' only shows port 4646 (the HTTP API port). Port 4647 never comes online on client-only nodes.

So a real question... should, as the documentation suggests, port 4647 be active on client-only nodes?


Rod

Sent from my iPad
> --
> This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html. Behavior in violation of those guidelines may result in your removal from this mailing list.
>
> GitHub Issues: https://github.com/hashicorp/nomad/issues
> IRC: #nomad-tool on Freenode
> ---
> You received this message because you are subscribed to the Google Groups "Nomad" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to nomad-tool+...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/nomad-tool/ff98c1a0-bf9c-4286-bf90-30d83ad6a940%40googlegroups.com.

Chris Baker

Aug 20, 2019, 4:11:59 PM
to Nomad
> So a real question... should, as the documentation suggests, port 4647 be active on client-only nodes?

The client nodes do not listen on the RPC port (default: 4647). Instead, each client opens a long-lived connection to a randomly chosen server on its advertised RPC port; this connection is used for client-server communication, including the tunneling of any client-specific RPC requests.
The clients only listen on the HTTP port (default: 4646).
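For reference, these are the documented default ports, sketched as a Nomad ports stanza (the values shown are the defaults, so you'd only write this out to change them):

```hcl
ports {
  http = 4646  # every agent, client and server, serves the HTTP API here
  rpc  = 4647  # listened on by servers only; clients dial out to it
  serf = 4648  # servers only, TCP and UDP, for gossip
}
```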

My first concern was a regression between these two versions of Nomad/Consul. However, I ran a small Nomad cluster and Consul with these versions, ran the example job, and the service registration was fine.

So I'd like to understand what Nomad is sending to earn a 400 from Consul. The error in Consul suggests that it doesn't like the address coming in with the service registration. Honestly, I wouldn't mind seeing a tcpdump from the Nomad Client to the Consul Agent during service registration. I'm trying to figure out what we could see from the Nomad API to give a hint as to what is wrong. My working theory is that Nomad is sending a bad address. Can you check out the following for one of the allocations that isn't being registered? 
nomad alloc status -json 9a89234c | jq '.TaskResources | .. | .Networks? // empty | .[].IP'

I'd like to see that the result is an IPv4 or an IPv6 address.

Onomojaku

Aug 20, 2019, 4:38:57 PM
to Chris Baker, Nomad
Thanks for the reply!

The networking on these systems is a little complicated, but the main public ("gateway") address should be the one bound to.  I'll check and verify.  It should all be IPv4, and we leave the default of '0.0.0.0' without specifying "address" for rpc.  I do specify "advertise" as an external-to-us address, but I want the client HTTP to be available on localhost (127.0.0.1) and the external IP.
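For context, a sketch of the shape of config being described here (reconstructed from the description, not the actual file; the IP is a placeholder):

```hcl
# Defaults left in place: bind to all interfaces, no explicit
# "address" or "advertise" entries for http/rpc.
bind_addr = "0.0.0.0"

advertise {
  # serf was the only address being advertised explicitly
  serf = "198.51.100.10"  # placeholder for the external-to-us IP
}
```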

I'm at home right now, so I'll dig into that and try to get you that output tomorrow.  I'll do my own tcpdump and see what I can come up with.


Rod



Shantanu Gadgil

Aug 21, 2019, 1:14:08 AM
to Nomad
For me, the netstat outputs on the various agents look like this (if that helps):

[root@nomad-server ~]# netstat -pentl | grep nomad
tcp6       0      0 :::4646                 :::*                    LISTEN      0          17963      830/nomad
tcp6       0      0 :::4647                 :::*                    LISTEN      0          17901      830/nomad
tcp6       0      0 :::4648                 :::*                    LISTEN      0          17947      830/nomad

[root@nomad-server ~]# netstat -penul | grep nomad
udp6       0      0 :::4648                 :::*                                0          17948      830/nomad


[root@nomad-client ~]# netstat -pentl | grep nomad
tcp        0      0 10.x.x.x:4646         0.0.0.0:*               LISTEN      0          17823      837/nomad
[root@nomad-client ~]# netstat -penul | grep nomad
--- no output ---

HTH,
Shantanu Gadgil

Rod Dreher

Aug 21, 2019, 3:38:56 PM
to Nomad
Noted on the clients not needing active port 4647 - that's a relief.  Port 4646 is visible on all clients and servers.

So the vast majority of our allocations have 'null' - they use the default bridge network and don't do separate networking of their own.  Ones on our combined Nomad/Consul server hosts register their healthchecks fine; ones on the client-only hosts do not.  The ones that use Docker's host network show the host's primary external IP.  Likewise, those on the server hosts register and those on the clients do not.

One additional question - how do the healthchecks register?  Nomad client to Consul client, Nomad client to Consul server, or Nomad client to Nomad server to Consul server?  How do they get from the Nomad client running the allocation into Consul?  Because based on the error message it seems very much like the registering of the healthcheck isn't happening, but then the results try to run and feed to Consul, and Consul answers with a 'what check do you mean?'

I'll set up a separate nomad/consul client and do a tcpdump as well as a full strace review tomorrow and see if I can figure out what's happening from those.


Rod


Chris Baker

Aug 21, 2019, 4:21:27 PM
to Rod Dreher, Nomad
The healthchecks register from the Nomad client to its configured Consul agent. That Consul agent is typically local.


Rod

Aug 21, 2019, 5:16:43 PM
to Chris Baker, Nomad
Yep - the agent is local.  And Nomad clients (agents) use the local Consul to detect/register with the Nomad servers, so at least a part of that communication is working.

I’ll dig deeper tomorrow and try to figure out where my disconnect is happening.


Rod

Sent from my iPad

Rod Dreher

Aug 22, 2019, 9:22:17 AM
to Nomad
I have resolved this - thanks for all your help in figuring out where to look.

TL;DR: I needed to specify the public IP for http in the "advertise" stanza.  0.0.0.0 apparently shares the wrong IP with Consul, which rejects the connection/registration.
advertise {
  http = "{{ GetPublicIP }}"
  rpc  = "{{ GetPublicIP }}"
  serf = "{{ GetPublicIP }}"
}


Slightly longer:
I've been leaving the default bind_addr of 0.0.0.0 and using "advertise" only for serf.  That's the only one that needed it in the past.  I _think_ this behavior may have changed slightly in either 0.9.2 or 0.9.4 (we never got around to running 0.9.3 to confirm which version).

My new config contains:
advertise {
  http = "{{ GetPublicIP }}"
  rpc  = "{{ GetPublicIP }}"
  serf = "{{ GetPublicIP }}"
}
addresses {
  http = "0.0.0.0"
  rpc  = "0.0.0.0"
  serf = "{{ GetPublicIP }}"
}
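As a rough illustration of where the 400 came from: Consul refuses to register a service whose address is the unspecified address. The sketch below mimics that style of validation; it is not Consul's actual code.

```shell
# Toy stand-in for the kind of check Consul applies to a service
# registration: the address must be a concrete IP, so the unspecified
# addresses would earn a 400 "Invalid Service Address".
service_address_ok() {
  case "$1" in
    ""|0.0.0.0|::) echo rejected ;;
    *)             echo accepted ;;
  esac
}

service_address_ok 0.0.0.0       # -> rejected
service_address_ok 198.51.100.10 # -> accepted (placeholder IP)
```

Which matches the fix: once "advertise" carries a concrete public IP, Consul has an address it will accept.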


Thanks!

Rod