(Raft) heartbeat lost via NAT after initial connection


Sanford Whiteman

Jan 29, 2018, 3:41:50 PM
to Nomad
I've read past issues/remarks re: NAT/WAN connections and hoped simply using advertise would take care of my simple case, but it's apparently not enough.

I've got 

  • a Windows 2008 Client agent, publicly routable IP only: let's call it 22.33.44.55.
  • a Windows 2012 Server agent, self-electing (i.e. a single server, for testing), single-homed internally and only reachable from outside via NAT: internal IP 192.168.87.99, external NAT IP let's call it 66.55.44.99.
(These boxes can communicate fine over other protocols -- HTTP, etc. -- and the firewall ACLs are correct, allowing TCP 4647 inbound to the Server agent.)

I get the same results with both vanilla and special agent configs:
  • an initially successful connection: the Client agent shows in the Server UI as ready
  • then on the first heartbeat after that, the Client tries to dial 192.168.87.99:4647 and of course fails to connect ("RPC failed to server 192.168.87.99:4647: rpc error: failed to get conn: dial tcp 192.168.87.99:4647: i/o timeout")
  • the Client agent goes down
  • the Server agent of course agrees ("nomad.heartbeat: node 'e36c8737-11f3-718c-7d88-65353fae3f8c' TTL expired")
So my first special config was 

advertise {
  rpc = "66.55.44.99"
}

but there was no change in the behavior.
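
(For context, the rest of the Server agent config is roughly the following; it's reconstructed from memory, so treat the data_dir path and exact layout as a sketch. The advertise block above was simply appended to it.)

# Server agent, 192.168.87.99 behind NAT at 66.55.44.99 -- sketch, not the exact file
data_dir  = "C:\\nomad\\data"   # placeholder path
bind_addr = "192.168.87.99"     # the only local interface

server {
  enabled          = true
  bootstrap_expect = 1          # single self-electing server, for testing
}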

I then tried a hostname that specifically would resolve to 192.168.87.99 on the internal network, but to 66.55.44.99 from outside.

advertise {
  rpc = "publicandprivatehostname.example.com"
}

I've also tried both of the above with the port :4647 hard-coded onto the address (shown below), but no change.
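
That is, one or the other of (again a sketch, same placeholder addresses as above):

# variant 1
advertise {
  rpc = "66.55.44.99:4647"
}

# variant 2
advertise {
  rpc = "publicandprivatehostname.example.com:4647"
}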

Always the same: the initial connection is fine (I even ran a job once before it disconnected), then it starts looking for 192.168.87.99:4647 and of course fails. Something is clearly "advertising" the internal IP to the remote Client, but it's beyond the advertise directive. Not knowing Raft (like, at all), I don't know what else it's up to.

Perhaps I should note I'm not using Consul (on either side). I'm also deleting the local data_dir before each retest.
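
For completeness, the Client agent config is essentially just this (again a sketch from memory; the data_dir path is a placeholder):

# Client agent, 22.33.44.55 -- sketch
data_dir  = "C:\\nomad\\data"   # placeholder
bind_addr = "22.33.44.55"

client {
  enabled = true
  servers = ["66.55.44.99:4647"]   # dial the Server via its public NAT address
}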

-- Sandy






Alex Dadgar

Jan 29, 2018, 9:00:39 PM
to Sanford Whiteman, Nomad
Hey Sanford,

Thanks for the great write-up. It let me reproduce the issue and come up with the fix. You are not doing anything wrong; it was an actual bug: when a client joined, the servers responded with the list of all known servers, but used the bind address instead of preferring the advertised RPC address.

I have the PR with the fix here and have attached some test binaries if you want to use them to confirm/play around till 0.8: https://github.com/hashicorp/nomad/pull/3811

Thanks,
Alex Dadgar




--
Thanks,
Alex

Sanford Whiteman

Jan 29, 2018, 9:30:06 PM
to Nomad
Thanks for working on it so quickly, Alex!

Unfortunately the 0.8.0-dev binaries don't fix it for me.

It seems like the problem actually starts with trying to advertise a non-local RPC address (i.e. the public NAT address). That causes an infinite leadership acquired/lost cycle. I assume the underlying cause is that, even as a self-electing server, it tries to reach itself via the advertised IP.

2018/01/29 21:17:02 [INFO] raft: Node at 69.28.242.149:4647 [Leader] entering Leader state
2018/01/29 21:17:02 [INFO] raft: Added peer 192.168.87.99:4647, starting replication
2018/01/29 21:17:02.546550 [INFO] nomad: cluster leadership acquired
2018/01/29 21:17:02 [INFO] raft: Node at 69.28.242.149:4647 [Follower] entering Follower state (Leader: "")
2018/01/29 21:17:02.555550 [ERR] nomad: failed to wait for barrier: leadership lost while committing log
2018/01/29 21:17:02.555550 [DEBUG] nomad: shutting down leader loop
2018/01/29 21:17:02.557551 [INFO] nomad: cluster leadership lost
2018/01/29 21:17:03 [DEBUG] raft-net: 69.28.242.149:4647 accepted connection from: 192.168.87.99:63894
2018/01/29 21:17:04 [WARN] raft: Heartbeat timeout from "" reached, starting election
2018/01/29 21:17:04 [INFO] raft: Node at 69.28.242.149:4647 [Candidate] entering Candidate state in term 22
2018/01/29 21:17:04 [DEBUG] raft: Votes needed: 1
2018/01/29 21:17:04 [DEBUG] raft: Vote granted from 192.168.87.99:4647 in term 22. Tally: 1
2018/01/29 21:17:04 [INFO] raft: Election won. Tally: 1
2018/01/29 21:17:04 [INFO] raft: Node at 69.28.242.149:4647 [Leader] entering Leader state
2018/01/29 21:17:04 [INFO] raft: Added peer 192.168.87.99:4647, starting replication
2018/01/29 21:17:04.260648 [INFO] nomad: cluster leadership acquired
2018/01/29 21:17:04 [INFO] raft: Node at 69.28.242.149:4647 [Follower] entering Follower state (Leader: "")
2018/01/29 21:17:04.262648 [DEBUG] nomad: shutting down leader loop
2018/01/29 21:17:04.263648 [ERR] nomad: failed to wait for barrier: node is not the leader
2018/01/29 21:17:04.264648 [INFO] nomad: cluster leadership lost

I had seen this part of the bug/behavior before, but was apparently just masking it by using my dynamic DNS hostname in the advertise block. Using the hostname allows the Server agent to start and stay leader. But the server doesn't advertise the hostname itself to the client; it advertises the IP address it resolves the hostname to (locally). So advertising publicandprivatehostname.example.com still ends up advertising the private IP, meaning the Client agent connects at first and then drops.

In sum: advertising the public IP stops the Server agent from coming up as a stable leader, even before we get to the Client agent connectivity question.




Alex Dadgar

Jan 30, 2018, 12:28:58 AM
to Sanford Whiteman, Nomad
Hey Sanford,

Hmm, interesting. Would you mind filing an issue with the details you provided? Can you also post the servers' configuration files and explain your network topology? It would also be interesting to get the output of `nomad server-members` and `nomad operator raft list-peers`.




--
Thanks,
Alex

Sanford Whiteman

Jan 30, 2018, 2:36:36 AM
to Nomad
> Hmm interesting. Would you mind filing an issue with the details you provided. Can you also post up the servers configuration files and explain your network topology. It would also be interesting to get the output of `nomad server-members` and `nomad operator raft list-peers`.
