Consul issue after a powercut


SurferL

Jun 14, 2018, 9:40:15 AM
to Consul
Hi,

I have 3 Consul servers (v1.0.6) running in HA mode (with Vault using Consul as the backend). Two of the servers are on Mac minis (OSX 10.13.3) and the third is on an Ubuntu 16.04 LTS machine.

Yesterday, there was a power cut which took out both Mac minis.

The configuration is roughly the same across all servers:

{
    "datacenter": "center1",
    "data_dir": "/var/consul",
    "node_name": "MachineName",
    "server": true,

    "bootstrap_expect": 3,
    "bind_addr": "<redacted-ip-address>",

    "client_addr": "0.0.0.0",

    "enable_script_checks": true,

    "ui": true,

    "leave_on_terminate": false,
    "skip_leave_on_interrupt": true,
    "rejoin_after_leave": true,
    "retry_join": [
        "<redacted-ip-address>:8301",
        "<redacted-ip-address>:8301"
    ]
}

On OSX, the Consul agent is auto-started by the LaunchDaemons consul.plist shown below:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
        <dict>
                <key>Label</key>
                <string>consul</string>
                <key>ProgramArguments</key>
                <array>
                        <string>/usr/local/bin/consul</string>
                        <string>agent</string>
                        <string>-config-dir</string>
                        <string>/etc/consul.d/client</string>
                </array>
                <key>RunAtLoad</key><true/>
                <key>KeepAlive</key><true/>

                <key>StandardOutPath</key>
                <string>/var/log/consul.log</string>
                <key>StandardErrorPath</key>
                <string>/var/log/consul_err.log</string>
        </dict>
</plist>


Prior to the power cut, the Consul servers were running just fine. When the power cut happened, the Mac minis restarted automatically and Consul was restarted with them. However, Consul is now in a state that I am unable to decipher, and any help getting it out of its current state would be appreciated.

Main parts in consul_err.log:

Mac 1 server:
-------------------

bootstrap_expect > 0: expecting 3 servers

==> Error starting agent: Failed to start Consul server: Failed to start Raft: mmap too large



Mac 2 server:
-------------------

goroutine 1 [running]:
github.com/hashicorp/consul/vendor/github.com/boltdb/bolt.(*node).spill(0xc4201d81c0, 0x10, 0x10)
/private/tmp/consul-20180209-96398-sbh1rs/src/github.com/hashicorp/consul/vendor/github.com/boltdb/bolt/node.go:375 +0x715
github.com/hashicorp/consul/vendor/github.com/boltdb/bolt.(*Bucket).spill(0xc42003a0f8, 0x19c88fe, 0x2b272e0)
/private/tmp/consul-20180209-96398-sbh1rs/src/github.com/hashicorp/consul/vendor/github.com/boltdb/bolt/bucket.go:570 +0x4d3
github.com/hashicorp/consul/vendor/github.com/boltdb/bolt.(*Tx).Commit(0xc42003a0e0, 0x29ba09c, 0x4)
/private/tmp/consul-20180209-96398-sbh1rs/src/github.com/hashicorp/consul/vendor/github.com/boltdb/bolt/tx.go:163 +0x129
github.com/hashicorp/consul/vendor/github.com/hashicorp/raft-boltdb.(*BoltStore).initialize(0xc42050d760, 0x0, 0x0)
/private/tmp/consul-20180209-96398-sbh1rs/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/raft-boltdb/bolt_store.go:75 +0x134
github.com/hashicorp/consul/vendor/github.com/hashicorp/raft-boltdb.NewBoltStore(0xc420504240, 0x18, 0x2, 0xc420504240, 0x18)
/private/tmp/consul-20180209-96398-sbh1rs/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/raft-boltdb/bolt_store.go:51 +0xc4
github.com/hashicorp/consul/agent/consul.(*Server).setupRaft(0xc4203c2f00, 0x0, 0x0)
/private/tmp/consul-20180209-96398-sbh1rs/src/github.com/hashicorp/consul/agent/consul/server.go:488 +0xaa5
github.com/hashicorp/consul/agent/consul.NewServerLogger(0xc42008b180, 0xc420461860, 0xc4203aaf60, 0x0, 0xc420461860, 0xc4203b2c00)
/private/tmp/consul-20180209-96398-sbh1rs/src/github.com/hashicorp/consul/agent/consul/server.go:333 +0xcb9
github.com/hashicorp/consul/agent.(*Agent).Start(0xc420342a80, 0xc420342a80, 0x0)
/private/tmp/consul-20180209-96398-sbh1rs/src/github.com/hashicorp/consul/agent/agent.go:292 +0x2ed
github.com/hashicorp/consul/command/agent.(*cmd).run(0xc4202ab000, 0xc42000e1a0, 0x2, 0x2, 0x0)
/private/tmp/consul-20180209-96398-sbh1rs/src/github.com/hashicorp/consul/command/agent/agent.go:337 +0x414
github.com/hashicorp/consul/command/agent.(*cmd).Run(0xc4202ab000, 0xc42000e1a0, 0x2, 0x2, 0xc4202adee0)
/private/tmp/consul-20180209-96398-sbh1rs/src/github.com/hashicorp/consul/command/agent/agent.go:77 +0x50
github.com/hashicorp/consul/vendor/github.com/mitchellh/cli.(*CLI).Run(0xc4201c8ea0, 0xc4201c8ea0, 0x40, 0xc4202b6340)
/private/tmp/consul-20180209-96398-sbh1rs/src/github.com/hashicorp/consul/vendor/github.com/mitchellh/cli/cli.go:242 +0x1eb
main.realMain(0x1c48687)
/private/tmp/consul-20180209-96398-sbh1rs/src/github.com/hashicorp/consul/main.go:52 +0x416
main.main()
/private/tmp/consul-20180209-96398-sbh1rs/src/github.com/hashicorp/consul/main.go:19 +0x22

bootstrap_expect > 0: expecting 3 servers

panic: pgid (206158430208) above high water mark (8390)



Ubuntu server:
--------------------

consul[1375]:     2018/06/14 14:00:08 [ERR] agent: failed to sync changes: No cluster leader
consul[1375]:     2018/06/14 14:00:09 [ERR] http: Request GET /v1/kv/vault/core/lock, error: No cluster leader from=127.0.0.1:45572
consul[1375]:     2018/06/14 14:00:14 [INFO] serf: attempting reconnect to <MachineName> <IP-ADDRESS>:8302
consul[1375]:     2018/06/14 14:00:14 [WARN] raft: Election timeout reached, restarting election
consul[1375]:     2018/06/14 14:00:14 [INFO] raft: Node at <IP-ADDRESS>:8300 [Candidate] entering Candidate state in term 24259
consul[1375]: 2018/06/14 14:00:14 [WARN] Unable to get address for server id 39677f93-a2e8-8281-2fcb-2323c113037d, using fallback address <IP-ADDRESS>:8300: Could not find address for server id 39677f93-a2e8-8281-2fcb-2323c113037d
consul[1375]: 2018/06/14 14:00:14 [WARN] Unable to get address for server id 74054daa-8408-4479-4082-b5ead4f8f5ed, using fallback address <IP-ADDRESS>:8300: Could not find address for server id 74054daa-8408-4479-4082-b5ead4f8f5ed
consul[1375]:     2018/06/14 14:00:14 [ERR] raft: Failed to make RequestVote RPC to {Voter 39677f93-a2e8-8281-2fcb-2323c113037d <IP-ADDRESS>:8300}: dial tcp <IP-ADDRESS>:0-><IP-ADDRESS>:8300: getsockopt: connection refused
consul[1375]:     2018/06/14 14:00:14 [ERR] raft: Failed to make RequestVote RPC to {Voter 74054daa-8408-4479-4082-b5ead4f8f5ed <IP-ADDRESS>:8300}: dial tcp <IP-ADDRESS>:0-><IP-ADDRESS>:8300: getsockopt: connection refused
consul[1375]:     2018/06/14 14:00:14 [ERR] http: Request GET /v1/kv/vault/core/lock, error: No cluster leader from=127.0.0.1:45586
consul[1375]:     2018/06/14 14:00:15 [ERR] http: Request PUT /v1/session/create, error: No cluster leader from=127.0.0.1:45514
consul[1375]:     2018/06/14 14:00:15 [WARN] agent: Check "vault:<IP-ADDRESS>:8200:vault-sealed-check" missed TTL, is now critical
consul[1375]:     2018/06/14 14:00:17 [WARN] agent: Syncing check "cpuUsage" failed. No cluster leader
consul[1375]:     2018/06/14 14:00:17 [ERR] agent: failed to sync changes: No cluster leader
consul[1375]:     2018/06/14 14:00:18 [ERR] agent: failed to sync remote state: No cluster leader


I believe that if I stop consul.service on all machines, wipe /var/consul, and restart the servers, everything would be okay, but I would lose information (such as the Vault data?).

However, if there is a "correct" way of dealing with this scenario before I lose any data, I'd like to know it. It would also be very helpful to hear whether I should have set up Consul differently to avoid this scenario.

Thank you in advance :) 




Justin DynamicD

Jun 15, 2018, 5:02:39 PM
to Consul
First off, yes: if you wipe the directory you will lose all your passwords in Vault. So you probably shouldn't do that.

That said, you've got a ton of IP address errors ("cannot find address") and timeouts. Start with the basics: do those minis have static IPs, and did they retain them? Have you tried a simple ping or curl to the ports? Does netstat -plnt show everything listening where you expect it to?

It looks like you have some more fundamental network issues, which is why the machines can't rediscover each other.
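
The port checks suggested above can also be scripted. The following is a minimal sketch, assuming the agents use Consul's default TCP ports (8300 server RPC, 8301 Serf LAN, 8302 Serf WAN, 8500 HTTP API); the function names are made up for illustration, and it only tests TCP, not the UDP side of Serf.

```python
import socket

# Assumed defaults; adjust if the agent config overrides any port.
CONSUL_TCP_PORTS = (8300, 8301, 8302, 8500)


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_consul_ports(host, ports=CONSUL_TCP_PORTS, timeout=2.0):
    """Map each port to whether it accepted a TCP connection."""
    return {port: tcp_port_open(host, port, timeout) for port in ports}
```

Calling `check_consul_ports("<ip-address-mac1>")` from each of the other servers shows at a glance which ports accept connections on that machine.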

SurferL

Jun 18, 2018, 5:14:32 AM
to Consul
Hi,

Yes, both Mac minis' static IPs are still correct, as I can still connect to them via SSH and Screen Sharing.

Looking further, on Mac Server 1 the raft.db file reports a size of 562.95 TB, but when I look at the file's details it is only 34.4 MB on disk. I suppose that explains the error "Failed to start Raft: mmap too large", although apart from deleting the file I'm not sure what else to do.
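
A mismatch like this can be confirmed from a script rather than the file browser. The sketch below is only an illustration: it compares the length recorded in the file's metadata with the space the filesystem has actually allocated. It assumes a Unix-like system where `st_blocks` is available and counted in 512-byte units, and the path in the usage note assumes the Raft store sits under the `data_dir` shown in the config.

```python
import os


def size_report(path):
    """Return (apparent_bytes, allocated_bytes) for a file.

    apparent_bytes is the length recorded in the file's metadata;
    allocated_bytes is what the filesystem has actually reserved.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512


def looks_oversized(path, factor=1024):
    """True when the apparent size exceeds the allocated size by `factor`."""
    apparent, allocated = size_report(path)
    return apparent > max(allocated, 1) * factor
```

For example, `size_report("/var/consul/raft/raft.db")` (path assumed, adjust to your layout) returns both numbers side by side. It only reads metadata, so it is safe to run on a damaged file.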


SurferL

Jun 22, 2018, 10:34:06 AM
to Consul
I don't suppose anyone else has any insights? This has happened before, and I'd rather not have my only solution be to wipe /var/consul every time this happens...

Preetha Appan

Jun 23, 2018, 11:16:09 AM
to Consul
Hi 

From those logs, it looks like the servers' IP addresses changed when they came back up after the power reset. This line in the logs indicates that:

 raft: Failed to make RequestVote RPC to {Voter 39677f93-a2e8-8281-2fcb-2323c113037d <IP-ADDRESS>:8300}: dial tcp <IP-ADDRESS>:0-><IP-ADDRESS>:8300: getsockopt: connection refused


You could try running consul join <ip-2> and consul join <ip-3> from the first server. That should make all the servers aware of each other's IP addresses, after which they will elect a leader. Make sure to point your CONSUL_HTTP_ADDR environment variable at the IP address and HTTP port of the first server when you do this.
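
For reference, the join step can also be driven through the agent's HTTP API, which is what the CLI calls under the hood (the `/v1/agent/join/<address>` endpoint appears in the CLI's own error output). This is a rough sketch, not an official client: the helper names are invented, the scheme handling for CONSUL_HTTP_ADDR is a simplifying assumption, and ACL tokens and TLS are not handled.

```python
import os
import urllib.request


def agent_base_url(default="http://127.0.0.1:8500"):
    """Resolve the local agent's HTTP address from CONSUL_HTTP_ADDR."""
    addr = os.environ.get("CONSUL_HTTP_ADDR", default)
    if "://" not in addr:
        # Assumption: a bare host:port means plain HTTP.
        addr = "http://" + addr
    return addr.rstrip("/")


def join_url(base_url, peer):
    """Build the agent join endpoint URL for one peer address."""
    return f"{base_url}/v1/agent/join/{peer}"


def join(peer, timeout=5.0):
    """Ask the local agent to join `peer`; returns the HTTP status code."""
    req = urllib.request.Request(join_url(agent_base_url(), peer), method="PUT")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

Note that `join` can only succeed if the agent it talks to is running and listening on its HTTP port; a refused connection at this step means the local agent itself is down.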

Hope this helps!
Preetha

SurferL

Jun 26, 2018, 9:08:13 AM
to Consul
Hi Preetha,

Thank you for your response! 

I've tried running `consul join <ip-address>` from all the servers; however, they all returned errors...

Mac Server 1 & Mac Server 2:
---------------------------------------

consul join <ip-address-ubuntu>

Error joining address '<ip-address-ubuntu>': Put http://127.0.0.1:8500/v1/agent/join/10.1.10.196: dial tcp 127.0.0.1:8500: getsockopt: connection refused

Failed to join any nodes.


Ubuntu Server
------------------
  consul join <ip-address-mac1>
  Error joining address '<ip-address-mac1>': Unexpected response code: 500 (1 error(s) occurred:

  * Failed to join <ip-address-mac1>: dial tcp 10.1.10.147:8301: getsockopt: connection refused)
  Failed to join any nodes.

The other connections failed with the same error messages.

Also, commands such as `consul members` fail on the Macs with errors like:

Error retrieving members: Get http://127.0.0.1:8500/v1/agent/members?segment=_all: dial tcp 127.0.0.1:8500: getsockopt: connection refused


Any other ideas?

Thanks in advance :)

Kasper Grubbe

Jun 26, 2018, 10:20:32 AM
to consu...@googlegroups.com
Maybe I am stating the obvious, so I am sorry for that.

Could it be firewall issues? Maybe the old configuration that was active before the reboot isn't automatically loaded after a reboot.

--
Kasper Grubbe

Phone: (+45) 42 42 42 74
Skype: kasper.grubbe
Mail: kasper...@gmail.com

--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html. Behavior in violation of those guidelines may result in your removal from this mailing list.
 
GitHub Issues: https://github.com/hashicorp/consul/issues
Community chat: https://gitter.im/hashicorp-consul/Lobby
---
You received this message because you are subscribed to the Google Groups "Consul" group.
To unsubscribe from this group and stop receiving emails from it, send an email to consul-tool+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/consul-tool/821013ff-5dfc-4f9e-b8fe-c906baa60482%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

SurferL

Jun 26, 2018, 10:25:24 AM
to Consul
Hey Kasper, no worries about stating the obvious - I'm very new to all this so any help is appreciated! 

I don't believe it is a firewall issue, because both Macs have the firewall turned off, and the Ubuntu server wasn't affected by the power cut (unless I'm missing something else?).

Kasper Grubbe

Jun 26, 2018, 10:33:26 AM
to consu...@googlegroups.com
To rule out any consul internals (because it looks like a system-level error) I would try to connect through lower level tools like telnet:

$ telnet cluster.kasper.co 8500
Connected to cluster.kasper.co.
Escape character is '^]'.

If you have any SSL-encryption enabled between your servers, this makes it a bit more difficult.


SurferL

Jun 26, 2018, 10:43:21 AM
to Consul
Testing on the three servers (if I'm using the command correctly) returns:

mac1 $ telnet <ip-address-mac2> 8500
Trying <ip-address-mac2>...
telnet: connect to address <ip-address-mac2>: Connection refused
telnet: Unable to connect to remote host



And this result happens across all servers (with the different IPs).


Kasper Grubbe

Jun 26, 2018, 11:51:54 AM
to consu...@googlegroups.com
So there is something fundamentally wrong with your network setup that blocks your servers from communicating.

Are you sure that the servers are listening on 8500?

You can test that locally by pointing the server to itself: telnet localhost 8500.

I am not familiar with the firewall on OSX, but on Ubuntu you should be able to use ufw to open the ports. As root, run:

ufw allow from 192.168.0.4 to any port 8300
ufw allow from 192.168.0.4 to any port 8301
ufw allow from 192.168.0.4 to any port 8302
ufw allow from 192.168.0.4 to any port 8400
ufw allow from 192.168.0.4 to any port 8500
ufw allow from 192.168.0.4 to any port 8600
ufw enable

(Be aware that enabling ufw might block other services running on the server.)

There are some more techniques described here on how to debug "connection refused" on Linux and OSX machines: https://serverfault.com/a/725263
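
When debugging this kind of error it helps to tell a refused connection apart from a timeout. A refusal usually means the host answered but nothing is accepting connections on that port (or a firewall actively rejected it), while a firewall that silently drops packets tends to show up as a timeout. The sketch below is one way to make that distinction; `probe` is a made-up helper name, and it only covers IPv4 TCP.

```python
import socket


def probe(host, port, timeout=2.0):
    """Classify a TCP connection attempt as 'open', 'refused', 'timeout' or 'error'."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
        return "open"
    except ConnectionRefusedError:
        # The host replied, but no process is listening on this port.
        return "refused"
    except socket.timeout:
        # No reply at all: packets dropped, or the host is unreachable.
        return "timeout"
    except OSError:
        # Anything else, e.g. name resolution failure or no route to host.
        return "error"
    finally:
        sock.close()
```

Running `probe("127.0.0.1", 8500)` on each server is the scripted equivalent of the `telnet localhost 8500` test.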


SurferL

Jun 26, 2018, 12:19:26 PM
to Consul
Thank you for your continued help! 

On the Ubuntu server it works fine:
telnet localhost 8500
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.

However, on both Macs it fails (firewalls off):

telnet localhost 8500
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
telnet: connect to address 127.0.0.1: Connection refused
telnet: Unable to connect to remote host


I'm getting slightly lost, as isn't this the expected outcome? Consul is currently failing to start on both Mac minis, and I'm unable to run any consul commands on them. The errors logged in consul_err.log are:


Mac 1

--------

bootstrap_expect > 0: expecting 3 servers

==> Error starting agent: Failed to start Consul server: Failed to start Raft: mmap too large



--> The raft.db file reports a size of 562.95 TB, but when I look at the file's details it is only 34.4 MB on disk. I suppose that explains the error "Failed to start Raft: mmap too large", although apart from deleting the file I'm not sure what else to do with it.


Mac 2
