I've read past issues/remarks re: NAT/WAN connections and hoped simply using
advertise would take care of my simple case, but it's apparently not enough.
I've got
- a Windows 2008 Client agent, publicly routable IP only: let's call it 22.33.44.55.
- a Windows 2012 Server agent, self-electing (i.e. for testing), single-homed internally and reachable from outside only via NAT. Its internal IP is 192.168.87.99; let's call the external one 66.55.44.99.
(These boxes can communicate fine over other protocols -- HTTP, etc. -- and the firewall ACLs are correct, allowing TCP 4647 inbound to the Server agent.)
I get the same results with both vanilla and special agent configs:
- an initially successful connection, Client agent shows in the Server UI as ready
- then the first heartbeat after that shows the Client trying to dial 192.168.87.99:4647 and of course failing to connect ("RPC failed to server 192.168.87.99:4647: rpc error: failed to get conn: dial tcp 192.168.87.99:4647: i/o timeout")
- Client agent goes down
- Server agent of course agrees ("nomad.heartbeat: node 'e36c8737-11f3-718c-7d88-65353fae3f8c' TTL expired")
So my first special config on the Server agent was:

    advertise {
      rpc = "66.55.44.99"
    }

but there was no change in the behavior.
I then tried a hostname that specifically would resolve to 192.168.87.99 on the internal network, but to 66.55.44.99 from outside.
I've also tried both of the above with the port hard-coded (:4647), but no change.
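To spell out the variants, here is a rough sketch of the `advertise` blocks I tried, one at a time (never together in the same file). The hostname shown is a placeholder for illustration, not the real name; everything else is as described above.

```hcl
# Variant 1: external IP with the port hard-coded
advertise {
  rpc = "66.55.44.99:4647"
}

# Variant 2 (used instead of variant 1, not alongside it):
# split-horizon hostname, resolves to 192.168.87.99 internally
# and to 66.55.44.99 from outside.
# "nomad-server.example.com" is a placeholder name.
advertise {
  rpc = "nomad-server.example.com:4647"
}
```

Each variant was also tested without the `:4647` suffix, with the same result.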
Always the same: the initial connection is fine (I even ran a job once, before it disconnected), then the Client starts looking for 192.168.87.99:4647 and of course fails. Something is clearly "advertising" the internal IP to the remote client, but it's outside the control of the advertise directive. Not knowing Raft (like, at all), I don't know what else it's up to.
Perhaps I should note that I'm not using Consul (on either side). I'm also deleting everything in the local data_dir before each retest.
-- Sandy