Nomad - agent-to-agent (server) setup


Tarpan pathak

May 12, 2016, 6:06:50 PM
to Nomad
Hi,
I am a new Nomad user. After reading the docs at https://www.nomadproject.io/docs/index.html and going through the sample configurations, I am trying to set up a two-server cluster (one local agent and one remote agent). After preparing the hosts (OS, firewalls, access, etc.) and installing Nomad:

This is the command I ran to set up the first node: 
- nomad agent -server -bootstrap-expect 1 -data-dir /opt/nomad/data/

This is what the above command produces: 
==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
    No configuration files loaded
==> Starting Nomad agent...
==> Nomad agent configuration:

                 Atlas: <disabled>
                Client: false
             Log Level: INFO
                Region: global (DC: dc1)
                Server: true

==> Nomad agent started! Log data will stream in below:

    2016/05/12 21:55:10 [INFO] serf: EventMemberJoin: ip-addr.global 127.0.0.1
    2016/05/12 21:55:10 [INFO] nomad: starting 1 scheduling worker(s) for [service batch system _core]
    2016/05/12 21:55:10 [INFO] raft: Node at 127.0.0.1:4647 [Follower] entering Follower state
    2016/05/12 21:55:10 [WARN] serf: Failed to re-join any previously known node
    2016/05/12 21:55:10 [INFO] nomad: adding server ip-addr.global (Addr: 127.0.0.1:4647) (DC: dc1)
    2016/05/12 21:55:11 [WARN] raft: Heartbeat timeout reached, starting election
    2016/05/12 21:55:11 [INFO] raft: Node at 127.0.0.1:4647 [Candidate] entering Candidate state
    2016/05/12 21:55:11 [INFO] raft: Election won. Tally: 1
    2016/05/12 21:55:11 [INFO] raft: Node at 127.0.0.1:4647 [Leader] entering Leader state
    2016/05/12 21:55:11 [INFO] nomad: cluster leadership acquired
    2016/05/12 21:55:11 [INFO] raft: Disabling EnableSingleNode (bootstrap)

To join a second node to the previously initialized cluster, these are the commands I ran:
- nomad server-join ip-addr
- nomad server-join ip-addr:port 

This is the error message the second server produces when trying to join a node to the cluster: 
Error joining: failed joining: Put http://127.0.0.1:4646/v1/agent/join?address=ip-addr: dial tcp 127.0.0.1:4646: getsockopt: connection refused

Based on the commands and responses above, I have a couple of questions: 
1. What is causing the second server to fail from joining the cluster? 
2. Must a config file (e.g. server.hcl) be used to initialize and join one/more nodes to the cluster? 
3. Which ports should the initial server be listening on once it's initialized? Assuming network connectivity is already established, should I be able to telnet to this port from a node over a LAN/WAN?

Please let me know if more details are required and I will be happy to post them.   


Diptanu Choudhury

May 12, 2016, 7:02:14 PM
to Tarpan pathak, Nomad
Hi,

It looks like the servers are listening on the loopback interface. Can you make the servers listen on a routable interface?

You could tell Nomad which interface to listen on by using the addresses configuration block.

addresses {
  http = "<routable_ip>"
  rpc  = "<routable_ip>"
  serf = "<routable_ip>"
}

Or you could bind on all interfaces and advertise a publicly routable address:

bind_addr = "0.0.0.0"
advertise {
  rpc = "10.10.11.3:4647"
  // Add the serf and http addresses too
}
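Putting the two suggestions together, a complete minimal server config might look like this (the IP address is an assumption; substitute your host's routable address). Note the default ports: 4646 for HTTP, 4647 for RPC, 4648 for Serf.

```hcl
# server.hcl -- sketch, assuming 10.10.11.3 is the host's routable IP
data_dir  = "/opt/nomad/data"
bind_addr = "0.0.0.0"

advertise {
  http = "10.10.11.3:4646"
  rpc  = "10.10.11.3:4647"
  serf = "10.10.11.3:4648"
}

server {
  enabled          = true
  bootstrap_expect = 1
}
```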

Hope this helps.

--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html. Behavior in violation of those guidelines may result in your removal from this mailing list.
 
GitHub Issues: https://github.com/hashicorp/nomad/issues
IRC: #nomad-tool on Freenode
---
You received this message because you are subscribed to the Google Groups "Nomad" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nomad-tool+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/nomad-tool/176728e8-8be0-4f42-8a3d-595b740f553f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Thanks,
Diptanu Choudhury

Gmail

May 12, 2016, 8:07:44 PM
to Diptanu Choudhury, Nomad
Hi Diptanu,
Thanks for the response. Upon reviewing your suggestions, I created a “server.hcl” config file with the following contents: 

log_level = "DEBUG"
bind_addr = "10.10.10.1" # or "0.0.0.0"
advertise {
  http = "10.10.10.1:4646"
  rpc  = "10.10.10.1:4647"
  serf = "10.10.10.1:4648"
}
data_dir = "/opt/nomad/data"
server {
  enabled = true
  bootstrap_expect = 1
}

Upon running this config file on the "leader", here is the Nomad log:

==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
    Loaded configuration from server.hcl
==> Starting Nomad agent...
==> Nomad agent configuration:

                 Atlas: <disabled>
                Client: false
             Log Level: DEBUG
                Region: global (DC: dc1)
                Server: true

==> Nomad agent started! Log data will stream in below:

    2016/05/13 00:02:48 [INFO] serf: EventMemberJoin: ip-10-10-10-1.global 10.10.10.1
    2016/05/13 00:02:48 [INFO] nomad: starting 1 scheduling worker(s) for [batch system service _core]
    2016/05/13 00:02:48 [INFO] raft: Node at 10.10.10.1:4647 [Follower] entering Follower state
    2016/05/13 00:02:48 [WARN] serf: Failed to re-join any previously known node
    2016/05/13 00:02:48 [INFO] nomad: adding server ip-10-10-10-1.global (Addr: 10.10.10.1:4647) (DC: dc1)
    2016/05/13 00:02:50 [WARN] raft: Heartbeat timeout reached, starting election
    2016/05/13 00:02:50 [INFO] raft: Node at 10.10.10.1:4647 [Candidate] entering Candidate state
    2016/05/13 00:02:50 [ERR] raft: Failed to make RequestVote RPC to 127.0.0.1:4647: dial tcp 127.0.0.1:4647: getsockopt: connection refused
    2016/05/13 00:02:50 [DEBUG] raft: Votes needed: 2
    2016/05/13 00:02:50 [DEBUG] raft: Vote granted from 10.10.10.1:4647. Tally: 1
    2016/05/13 00:02:51 [WARN] raft: Election timeout reached, restarting election
    2016/05/13 00:02:51 [INFO] raft: Node at 10.10.10.1:4647 [Candidate] entering Candidate state
    2016/05/13 00:02:51 [ERR] raft: Failed to make RequestVote RPC to 127.0.0.1:4647: dial tcp 127.0.0.1:4647: getsockopt: connection refused

I now get a response when running telnet 10.10.10.1 4646 (likewise 4647 and 4648), but when attempting to join a second node to the cluster, this is the error I get back:
nomad server-join 10.10.10.1
Error joining: failed joining: Put http://127.0.0.1:4646/v1/agent/join?address=10.10.10.1: dial tcp 127.0.0.1:4646: connectex: No connection could be made because the target machine actively refused it.

Perhaps, I am still missing something? 

Cheers,
Tarpan


Diptanu Choudhury

May 12, 2016, 8:17:18 PM
to Gmail, Nomad
Now it is failing because the Nomad CLI is trying to talk to the Nomad HTTP endpoint at 127.0.0.1:4646. Since you are binding on 10.10.10.1, you will have to set the environment variable NOMAD_ADDR=http://10.10.10.1:4646 in the shell where you are running the Nomad CLI.
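For example (the address here is an assumption; use your server's routable IP, and note the CLI expects a full URL including the scheme):

```shell
# Point the Nomad CLI at the server's routable HTTP endpoint.
export NOMAD_ADDR=http://10.10.10.1:4646

# Subsequent nomad commands in this shell now target that agent, e.g.:
#   nomad server-join 10.10.10.2

echo "$NOMAD_ADDR"   # prints http://10.10.10.1:4646
```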

Gmail

May 12, 2016, 9:42:03 PM
to Diptanu Choudhury, Nomad
Added the environment variable by running: 
export NOMAD_ADDR=10.10.10.1:4646

Confirmed that the variable is set by running:
echo $NOMAD_ADDR

Re-ran the agent (server) using: 
nomad agent -config server.hcl

Here is the nomad log: 
==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
    Loaded configuration from server.hcl
==> Starting Nomad agent...
==> Nomad agent configuration:

                 Atlas: <disabled>
                Client: false
             Log Level: DEBUG
                Region: global (DC: dc1)
                Server: true

==> Nomad agent started! Log data will stream in below:

    2016/05/13 00:26:55 [INFO] serf: EventMemberJoin: ip-10-10-10-1.global 10.10.10.1
    2016/05/13 00:26:55 [INFO] nomad: starting 1 scheduling worker(s) for [service batch system _core]
    2016/05/13 00:26:55 [INFO] raft: Node at 10.10.10.1:4647 [Follower] entering Follower state
    2016/05/13 00:26:55 [WARN] serf: Failed to re-join any previously known node
    2016/05/13 00:26:55 [INFO] nomad: adding server ip-10-10-10-1.global (Addr: 10.10.10.1:4647) (DC: dc1)
    2016/05/13 00:26:57 [WARN] raft: Heartbeat timeout reached, starting election
    2016/05/13 00:26:57 [INFO] raft: Node at 10.10.10.1:4647 [Candidate] entering Candidate state
    2016/05/13 00:26:57 [ERR] raft: Failed to make RequestVote RPC to 127.0.0.1:4647: dial tcp 127.0.0.1:4647: getsockopt: connection refused

Something is still missing, yes? Must the environment variable be set for the RPC and Serf addresses as well?

Cheers,
Tarpan


Gmail

May 13, 2016, 12:36:26 PM
to Diptanu Choudhury, Nomad
In addition to the above, I'm not certain whether a config file is required for all nodes (primary/secondary, etc.) in the cluster. I am able to successfully start the initial node but still receive an error when joining other server agents to the cluster. Could you please confirm how additional (server) agents should be joined to an existing cluster?

Cheers,
Tarpan


Gmail

May 17, 2016, 5:22:48 PM
to Diptanu Choudhury, Nomad
Hi again,
I now have a server and a client agent that I believe can talk to each other. I ran the node-status command on the server agent to confirm that the client shows up. Here is the output of the node-status command from the server and the client:

From server-agent 
nomad node-status -allocs:
Error querying node status: Get http://127.0.0.1:4646/v1/nodes: dial tcp 127.0.0.1:4646: getsockopt: connection refused

From client-agent 
nomad node-status -self:
Error querying agent info: failed querying self endpoint: Get http://127.0.0.1:4646/v1/agent/self: dial tcp 127.0.0.1:4646: connectex: No connection could be made because the target machine actively refused it.

Both commands are still referencing the loopback IP/interface when querying the nodes. Am I running these commands correctly to display the node status? If not, what can I do to make the client agent show up under the node-status command?

Cheers,
Tarpan


Diptanu Choudhury

May 17, 2016, 10:11:07 PM
to Gmail, Nomad
Hi Tarpan,

Can you run the command netstat -tulpn | grep nomad and tell us what you see in the output? This should tell us which interfaces/ports the Nomad server process is listening on. After that, it's a matter of setting NOMAD_ADDR in the shell where you are running the nomad commands.

Gmail

May 18, 2016, 12:44:44 AM
to Diptanu Choudhury, Nomad
Hi Diptanu,
I figured out the issue. Running the following command (on both server and client) gives me the expected result:

nomad node-status -address=http://ip-address:port

Please let me know if you would like me to run netstat. I do have an additional question: I'm using a Windows client to process batch jobs. Here is a code snippet from my job definition file:

task "example" {
  driver = "raw_exec"
  config {
    command = "path-to-executable"
    args = ["binary params/args"]
  }
}
Does the "args" statement make sense? Just want to make sure I am on the right path.
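For reference, a complete job file wrapping a task like this might look as follows (job name, group name, paths, and datacenter are placeholders):

```hcl
job "example-batch" {
  datacenters = ["dc1"]
  type        = "batch"

  group "work" {
    task "example" {
      driver = "raw_exec"
      config {
        # "command" must be a single value -- the path to the binary;
        # everything else goes into "args".
        command = "path-to-executable"
        args    = ["arg1", "arg2"]
      }
    }
  }
}
```

(Depending on the Nomad version, the raw_exec driver may also need to be explicitly enabled in the client configuration.)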

Cheers,
Tarpan


Alex Dadgar

May 18, 2016, 12:52:20 PM
to Gmail, Diptanu Choudhury, Nomad
Hey Tarpan,

That looks correct to me!

Thanks,
Alex 

Gmail

May 18, 2016, 5:42:53 PM
to Alex Dadgar, Diptanu Choudhury, Nomad
Hey Alex,
Thanks for your response. I am still debugging an issue with the Windows raw-exec driver. Here is the error I receive after submitting a job: 

2016/05/18 09:58:30 [ERR] client: failed to start task 'task-name' for alloc 'alloc-id': command contained more than one input. Use "args" field to pass arguments

I am using the job definition posted earlier in the thread. What do you think is throwing the above error?

Cheers,
Tarpan


Alex Dadgar

May 19, 2016, 2:51:14 PM
to Gmail, Diptanu Choudhury, Nomad
Are there any spaces in your command path? The command must be a single value.
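In other words, the command string should contain only the path to the binary, with all arguments moved into "args" (the Windows path below is hypothetical):

```hcl
config {
  # Wrong: command = "worker.exe -input data.csv"  -- multiple inputs
  command = "C:\\nomad\\worker.exe"   # a single value, no extra tokens
  args    = ["-input", "data.csv"]    # pass the arguments here instead
}
```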

Gmail

May 19, 2016, 3:00:43 PM
to Alex Dadgar, Diptanu Choudhury, Nomad
Hi Alex,
Completely understood on using the command as a single value. Once again, I figured out the issue. The solution (specific to Windows) was to add the binary's directory to the system's PATH environment variable. After doing so and running the corresponding command "*.exe", Nomad submits the job successfully. Please let me know if I can provide more details to help explain this fix.

Cheers,
Tarpan

