Nomad - agent-to-agent (server) setup


Tarpan pathak

May 12, 2016, 6:06:50 PM
to Nomad
Hi,
I am a new Nomad user. After reading the docs at https://www.nomadproject.io/docs/index.html and going through the sample configurations, I am trying to set up a two-server cluster (one local agent and one remote agent). After preparing the hosts (OS, firewalls, access, etc.) and installing Nomad:

This is the command I ran to set up the first node: 
- nomad agent -server -bootstrap-expect 1 -data-dir /opt/nomad/data/

This is what the above command produces: 
==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
    No configuration files loaded
==> Starting Nomad agent...
==> Nomad agent configuration:

                 Atlas: <disabled>
                Client: false
             Log Level: INFO
                Region: global (DC: dc1)
                Server: true

==> Nomad agent started! Log data will stream in below:

    2016/05/12 21:55:10 [INFO] serf: EventMemberJoin: ip-addr.global 127.0.0.1
    2016/05/12 21:55:10 [INFO] nomad: starting 1 scheduling worker(s) for [service batch system _core]
    2016/05/12 21:55:10 [INFO] raft: Node at 127.0.0.1:4647 [Follower] entering Follower state
    2016/05/12 21:55:10 [WARN] serf: Failed to re-join any previously known node
    2016/05/12 21:55:10 [INFO] nomad: adding server ip-addr.global (Addr: 127.0.0.1:4647) (DC: dc1)
    2016/05/12 21:55:11 [WARN] raft: Heartbeat timeout reached, starting election
    2016/05/12 21:55:11 [INFO] raft: Node at 127.0.0.1:4647 [Candidate] entering Candidate state
    2016/05/12 21:55:11 [INFO] raft: Election won. Tally: 1
    2016/05/12 21:55:11 [INFO] raft: Node at 127.0.0.1:4647 [Leader] entering Leader state
    2016/05/12 21:55:11 [INFO] nomad: cluster leadership acquired
    2016/05/12 21:55:11 [INFO] raft: Disabling EnableSingleNode (bootstrap)

To join a second node to the previously initialized cluster, these are the commands I ran:
- nomad server-join ip-addr
- nomad server-join ip-addr:port 

This is the error message the second server produces when trying to join a node to the cluster: 
Error joining: failed joining: Put http://127.0.0.1:4646/v1/agent/join?address=ip-addr: dial tcp 127.0.0.1:4646: getsockopt: connection refused

Based on the commands and responses above, I have a couple of questions: 
1. What is causing the second server to fail from joining the cluster? 
2. Must a config file (e.g. server.hcl) be used to initialize and join one/more nodes to the cluster? 
3. Which ports should the initial server be listening on once it's initialized? Assuming network connectivity is already established, should I be able to telnet to this port from a node over a LAN/WAN?

Please let me know if more details are required and I will be happy to post them.   


Diptanu Choudhury

May 12, 2016, 7:02:14 PM
to Tarpan pathak, Nomad
Hi,

It looks like the servers are listening on the loopback interface. Can you make the servers listen on a routable interface?

You could tell Nomad which interface to listen on by using the addresses configuration block.

addresses {
  http = "<routable_ip>"
  rpc  = "<routable_ip>"
  serf = "<routable_ip>"
}

Or you could bind on all interfaces and advertise a publicly routable address:

bind_addr = "0.0.0.0"
advertise {
  rpc = "10.10.11.3:4647"
  // Add the serf and http addresses too
}
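Putting the two suggestions together, a complete minimal server config might look like this (the IP address is an assumption; substitute your host's routable address). Note the default ports: 4646 for HTTP, 4647 for RPC, 4648 for Serf.

```hcl
# server.hcl -- sketch, assuming 10.10.11.3 is the host's routable IP
data_dir  = "/opt/nomad/data"
bind_addr = "0.0.0.0"

advertise {
  http = "10.10.11.3:4646"
  rpc  = "10.10.11.3:4647"
  serf = "10.10.11.3:4648"
}

server {
  enabled          = true
  bootstrap_expect = 1
}
```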

Hope this helps.

--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html. Behavior in violation of those guidelines may result in your removal from this mailing list.
 
GitHub Issues: https://github.com/hashicorp/nomad/issues
IRC: #nomad-tool on Freenode
---
You received this message because you are subscribed to the Google Groups "Nomad" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nomad-tool+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/nomad-tool/176728e8-8be0-4f42-8a3d-595b740f553f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Thanks,
Diptanu Choudhury

Gmail

May 12, 2016, 8:07:44 PM
to Diptanu Choudhury, Nomad
Hi Diptanu,
Thanks for the response. Upon reviewing your suggestions, I created a “server.hcl” config file with the following contents: 

log_level = "DEBUG"
bind_addr = "10.10.10.1" # or "0.0.0.0"
advertise {
  http = "10.10.10.1:4646"
  rpc  = "10.10.10.1:4647"
  serf = "10.10.10.1:4648"
}
data_dir = "/opt/nomad/data"
server {
  enabled = true
  bootstrap_expect = 1
}

Upon running this config file on the "leader", here is the Nomad log:

==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
    Loaded configuration from server.hcl
==> Starting Nomad agent...
==> Nomad agent configuration:

                 Atlas: <disabled>
                Client: false
             Log Level: DEBUG
                Region: global (DC: dc1)
                Server: true

==> Nomad agent started! Log data will stream in below:

    2016/05/13 00:02:48 [INFO] serf: EventMemberJoin: ip-10-10-10-1.global 10.10.10.1
    2016/05/13 00:02:48 [INFO] nomad: starting 1 scheduling worker(s) for [batch system service _core]
    2016/05/13 00:02:48 [INFO] raft: Node at 10.10.10.1:4647 [Follower] entering Follower state
    2016/05/13 00:02:48 [WARN] serf: Failed to re-join any previously known node
    2016/05/13 00:02:48 [INFO] nomad: adding server ip-10-10-10-1.global (Addr: 10.10.10.1:4647) (DC: dc1)
    2016/05/13 00:02:50 [WARN] raft: Heartbeat timeout reached, starting election
    2016/05/13 00:02:50 [INFO] raft: Node at 10.10.10.1:4647 [Candidate] entering Candidate state
    2016/05/13 00:02:50 [ERR] raft: Failed to make RequestVote RPC to 127.0.0.1:4647: dial tcp 127.0.0.1:4647: getsockopt: connection refused
    2016/05/13 00:02:50 [DEBUG] raft: Votes needed: 2
    2016/05/13 00:02:50 [DEBUG] raft: Vote granted from 10.10.10.1:4647. Tally: 1
    2016/05/13 00:02:51 [WARN] raft: Election timeout reached, restarting election
    2016/05/13 00:02:51 [INFO] raft: Node at 10.10.10.1:4647 [Candidate] entering Candidate state
    2016/05/13 00:02:51 [ERR] raft: Failed to make RequestVote RPC to 127.0.0.1:4647: dial tcp 127.0.0.1:4647: getsockopt: connection refused

I now get a response when running telnet 10.10.10.1 4646 (likewise 4647 and 4648), but when attempting to join a second node to the cluster, this is the error I get back:
nomad server-join 10.10.10.1
Error joining: failed joining: Put http://127.0.0.1:4646/v1/agent/join?address=10.10.10.1: dial tcp 127.0.0.1:4646: connectex: No connection could be made because the target machine actively refused it.

Perhaps, I am still missing something? 

Cheers,
Tarpan


Diptanu Choudhury

May 12, 2016, 8:17:18 PM
to Gmail, Nomad
Now it is failing because the Nomad CLI is trying to talk to the Nomad HTTP endpoint at 127.0.0.1:4646. Since you are binding on 10.10.10.1, you will have to set the environment variable NOMAD_ADDR=http://10.10.10.1:4646 in the shell where you are running the Nomad CLI.
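For example (the address here is an assumption; use your server's routable IP, and note the CLI expects a full URL including the scheme):

```shell
# Point the Nomad CLI at the server's routable HTTP endpoint.
export NOMAD_ADDR=http://10.10.10.1:4646

# Subsequent nomad commands in this shell now target that agent, e.g.:
#   nomad server-join 10.10.10.2

echo "$NOMAD_ADDR"   # prints http://10.10.10.1:4646
```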

Gmail

May 12, 2016, 9:42:03 PM
to Diptanu Choudhury, Nomad
Added the environment variable by running: 
export NOMAD_ADDR=10.10.10.1:4646

Confirmed that the variable is set by running:
echo $NOMAD_ADDR

Re-ran the agent (server) using: 
nomad agent -config server.hcl

Here is the nomad log: 
==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
    Loaded configuration from server.hcl
==> Starting Nomad agent...
==> Nomad agent configuration:

                 Atlas: <disabled>
                Client: false
             Log Level: DEBUG
                Region: global (DC: dc1)
                Server: true

==> Nomad agent started! Log data will stream in below:

    2016/05/13 00:26:55 [INFO] serf: EventMemberJoin: ip-10-10-10-1.global 10.10.10.1
    2016/05/13 00:26:55 [INFO] nomad: starting 1 scheduling worker(s) for [service batch system _core]
    2016/05/13 00:26:55 [INFO] raft: Node at 10.10.10.1:4647 [Follower] entering Follower state
    2016/05/13 00:26:55 [WARN] serf: Failed to re-join any previously known node
    2016/05/13 00:26:55 [INFO] nomad: adding server ip-10-10-10-1.global (Addr: 10.10.10.1:4647) (DC: dc1)
    2016/05/13 00:26:57 [WARN] raft: Heartbeat timeout reached, starting election
    2016/05/13 00:26:57 [INFO] raft: Node at 10.10.10.1:4647 [Candidate] entering Candidate state
    2016/05/13 00:26:57 [ERR] raft: Failed to make RequestVote RPC to 127.0.0.1:4647: dial tcp 127.0.0.1:4647: getsockopt: connection refused

Something is still missing, yes? Must the environment variable be set for the RPC and Serf addresses as well?

Cheers,
Tarpan


Gmail

May 13, 2016, 12:36:26 PM
to Diptanu Choudhury, Nomad
In addition to the above, I'm not certain whether a config file is required for all nodes (primary/secondary, etc.) in the cluster. I am able to successfully start the initial node but still receive an error when joining other server agents to the cluster. Could you please confirm how additional (server) agents should be joined to an existing cluster?

Cheers,
Tarpan


Gmail

May 17, 2016, 5:22:48 PM
to Diptanu Choudhury, Nomad
Hi again,
I now have a server and a client agent that I believe can talk to each other. I ran the node-status command on the server agent to confirm that the client shows up. Here is the output of the node-status command from the server and the client:

From server-agent 
nomad node-status -allocs:
Error querying node status: Get http://127.0.0.1:4646/v1/nodes: dial tcp 127.0.0.1:4646: getsockopt: connection refused

From client-agent 
nomad node-status -self:
Error querying agent info: failed querying self endpoint: Get http://127.0.0.1:4646/v1/agent/self: dial tcp 127.0.0.1:4646: connectex: No connection could be made because the target machine actively refused it.

Both commands are still referencing the loopback IP/interface when querying the nodes. Am I running these commands correctly to display the node status? If not, what can I do to make the client agent show up under the node-status command?

Cheers,
Tarpan


Diptanu Choudhury

May 17, 2016, 10:11:07 PM
to Gmail, Nomad
Hi Tarpan,

Can you run the command netstat -tulpn | grep nomad and tell us what you see in the output? This should tell us which interfaces/ports the Nomad server process is listening on. After that, it's a matter of setting NOMAD_ADDR in the shell where you are running the nomad commands.

Gmail

May 18, 2016, 12:44:44 AM
to Diptanu Choudhury, Nomad
Hi Diptanu,
I figured out the issue. Running the following command (on both server and client) gives me the expected result:

nomad node-status -address=http://ip-address:port

Please let me know if you would like me to run netstat. I do have an additional question: I'm using a Windows client to process batch jobs. Here is a code snippet from my job definition file:

task "example" {
  driver = "raw_exec"
  config {
    command = "path-to-executable"
    args = ["binary params/args"]
  }
}
Does the "args" statement make sense? Just want to make sure I am on the right path.
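For reference, a complete job file wrapping a task like this might look as follows (job name, group name, paths, and datacenter are placeholders):

```hcl
job "example-batch" {
  datacenters = ["dc1"]
  type        = "batch"

  group "work" {
    task "example" {
      driver = "raw_exec"
      config {
        # "command" must be a single value -- the path to the binary;
        # everything else goes into "args".
        command = "path-to-executable"
        args    = ["arg1", "arg2"]
      }
    }
  }
}
```

(Depending on the Nomad version, the raw_exec driver may also need to be explicitly enabled in the client configuration.)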

Cheers,
Tarpan


Alex Dadgar

May 18, 2016, 12:52:20 PM
to Gmail, Diptanu Choudhury, Nomad
Hey Tarpan,

That looks correct to me!

Thanks,
Alex 

Gmail

May 18, 2016, 5:42:53 PM
to Alex Dadgar, Diptanu Choudhury, Nomad
Hey Alex,
Thanks for your response. I am still debugging an issue with the Windows raw-exec driver. Here is the error I receive after submitting a job: 

2016/05/18 09:58:30 [ERR] client: failed to start task 'task-name' for alloc 'alloc-id': command contained more than one input. Use "args" field to pass arguments

I am using the job definition posted earlier in the thread. What do you think is throwing the above error?

Cheers,
Tarpan


Alex Dadgar

May 19, 2016, 2:51:14 PM
to Gmail, Diptanu Choudhury, Nomad
Are there any spaces in your command path? The command must be a single value.
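In other words, the command string should contain only the path to the binary, with all arguments moved into "args" (the Windows path below is hypothetical):

```hcl
config {
  # Wrong: command = "worker.exe -input data.csv"  -- multiple inputs
  command = "C:\\nomad\\worker.exe"   # a single value, no extra tokens
  args    = ["-input", "data.csv"]    # pass the arguments here instead
}
```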

Gmail

May 19, 2016, 3:00:43 PM
to Alex Dadgar, Diptanu Choudhury, Nomad
Hi Alex,
Completely understood on using the command as a single value. Once again, I figured out the issue. The solution (specific to Windows) was to add the binary's directory to the system's PATH environment variable. After doing so and running the corresponding command "*.exe", Nomad submits the job successfully. Please let me know if I can provide more details to help explain this fix.

Cheers,
Tarpan

