Vault server unresponsive


Vladimir Volkov

May 31, 2017, 8:45:20 AM
to Vault
Hello.

I have a cluster of two Vault 0.7.0 nodes with Postgres primary backend and Consul HA backend.
Both nodes stayed sealed for a while, and then I tried to unseal them. After that, both nodes became unresponsive:
# vault status
Error checking seal status: Get http://localhost:9200/v1/sys/seal-status: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I can confirm that the port is open and the client connects, but it seems to get no response.

Last messages in the log are:
[INFO ] core: vault is unsealed
[INFO ] core: entering standby mode
[INFO ] core: acquired lock, enabling active operation
[INFO ] core: post-unseal setup starting
So I assume the unseal process has not completed.

In Consul, the "vault-sealed-check" reports: "TTL expired (last output before timeout follows): Vault Unsealed".
Consul cluster itself seems healthy.


When I restarted one of the nodes, it started and unsealed without any problems.
However, it hasn't become active:
High-Availability Enabled: true
        Mode: standby
        Leader: <another node, still unresponsive>

 
Any ideas on what could be the cause of the problem, and how to prevent it from happening again?


Regards,
Vladimir

Jeff Mitchell

May 31, 2017, 10:52:03 AM
to Vault
Hi Vladimir,

Please provide full logs with trace enabled (-log-level=trace when starting).

Best,
Jeff

--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html. Behavior in violation of those guidelines may result in your removal from this mailing list.
 
GitHub Issues: https://github.com/hashicorp/vault/issues
IRC: #vault-tool on Freenode
---
You received this message because you are subscribed to the Google Groups "Vault" group.
To unsubscribe from this group and stop receiving emails from it, send an email to vault-tool+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/vault-tool/5b70d49c-2323-4c5e-a6a6-6cee09d5c520%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Vladimir Volkov

May 31, 2017, 11:08:05 AM
to Vault
Hi Jeff,

I'll do my best to obtain the logs, but at the moment I don't have a way to reliably reproduce the problem - as I mentioned, after a restart it works fine.

Vladimir Volkov

Jun 9, 2017, 3:22:36 PM
to Vault
Hi,

It happened again. This time both nodes had been unsealed for a few days, then became unresponsive at some point.
Here is what I found in the logs that seems interesting.

HOST_2 (was a standby node when I last saw it working):
    2017/06/08 04:32:46 [INFO] agent: Synced check 'vault-dev:<HOST_2>:8200:vault-sealed-check'
    2017/06/08 04:34:23 [INFO] agent: Synced check 'vault-dev:<HOST_2>:8200:vault-sealed-check'
    2017/06/08 04:34:47 [ERR] memberlist: Failed fallback ping: write tcp [<HOST_2_IP>]:56078->[<HOST_1_IP>]:18301: i/o timeout
    2017/06/08 04:34:47 [ERR] memberlist: Push/Pull with <HOST_1> failed: dial tcp [<HOST_1_IP>]:8301: getsockopt: connection timed out
    2017/06/08 04:34:49 [ERR] yamux: keepalive failed: i/o deadline reached
    2017/06/08 04:34:49 [ERR] consul: RPC failed to server [<HOST_1_IP>]:8300: rpc error: EOF
    2017/06/08 04:34:49 [ERR] agent: coordinate update error: rpc error: EOF
    2017/06/08 04:34:49 [ERR] consul: RPC failed to server [<HOST_1_IP>]:8300: rpc error: EOF
    2017/06/08 04:34:49 [ERR] http: Request GET /v1/kv/vault-dev/core/lock?index=36&wait=15000ms, error: rpc error: EOF from=[::1]:60134
2017/06/08 04:34:49.499295 [ERROR] core: failed to acquire lock: error=failed to read lock: Unexpected response code: 500
    2017/06/08 04:35:47 [INFO] agent: Synced check 'vault-dev:<HOST_2>:8200:vault-sealed-check'
    2017/06/08 04:37:17 [INFO] agent: Synced check 'vault-dev:<HOST_2>:8200:vault-sealed-check'
...
    2017/06/08 05:39:29 [INFO] agent: Synced check 'vault-dev:<HOST_2>:8200:vault-sealed-check'
    2017/06/08 05:41:16 [INFO] agent: Synced check 'vault-dev:<HOST_2>:8200:vault-sealed-check'
2017/06/08 05:41:53.516894 [INFO ] core: acquired lock, enabling active operation
    2017/06/08 05:41:57 [WARN] agent: Check 'vault-dev:<HOST_2>:8200:vault-sealed-check' missed TTL, is now critical
    2017/06/08 05:41:57 [INFO] agent: Synced check 'vault-dev:<HOST_2>:8200:vault-sealed-check'
    2017/06/08 05:42:59 [INFO] agent: Synced check 'vault-dev:<HOST_2>:8200:vault-sealed-check'

HOST_1 (was the active node):
    2017/06/08 05:40:23 [INFO] agent: Synced check 'vault-dev:<HOST_1>:8200:vault-sealed-check'
2017/06/08 05:40:58.140489 [TRACE] rollback: attempting rollback: path=auth/userpass/
2017/06/08 05:40:58.140625 [TRACE] rollback: attempting rollback: path=secret/
2017/06/08 05:40:58.140650 [TRACE] rollback: attempting rollback: path=sys/
2017/06/08 05:40:58.140669 [TRACE] rollback: attempting rollback: path=pki/
2017/06/08 05:40:58.140688 [TRACE] rollback: attempting rollback: path=transit/
2017/06/08 05:40:58.140719 [TRACE] rollback: attempting rollback: path=cubbyhole/
2017/06/08 05:40:58.140740 [TRACE] rollback: attempting rollback: path=auth/token/
    2017/06/08 05:41:33 [ERR] yamux: keepalive failed: i/o deadline reached
    2017/06/08 05:41:33 [ERR] consul: RPC failed to server [<HOST_3_IP>]:8300: rpc error: EOF
    2017/06/08 05:41:33 [ERR] agent: coordinate update error: rpc error: EOF
    2017/06/08 05:41:33 [ERR] consul: RPC failed to server [<HOST_3_IP>]:8300: rpc error: EOF
    2017/06/08 05:41:33 [ERR] http: Request GET /v1/kv/vault-dev/core/lock?consistent=&index=36, error: rpc error: EOF from=[::1]:53919
    2017/06/08 05:41:33 [ERR] consul: RPC failed to server [<HOST_3_IP>]:8300: rpc error: EOF
    2017/06/08 05:41:33 [ERR] http: Request PUT /v1/session/renew/4013fd9f-b071-8e86-4068-2e776701731c, error: rpc error: EOF from=[::1]:40184

2017/06/08 05:41:38.272299 [WARN ] core: leadership lost, stopping active operation
2017/06/08 05:41:38.433096 [INFO ] core: pre-seal teardown starting
2017/06/08 05:41:38.434781 [INFO ] core: stopping cluster listeners
2017/06/08 05:41:38.435938 [TRACE] core: waiting for success notification while stopping cluster listeners
2017/06/08 05:41:38.436138 [INFO ] core: shutting down forwarding rpc listeners
2017/06/08 05:41:38.436312 [INFO ] core: forwarding rpc listeners stopped
2017/06/08 05:41:38.484187 [INFO ] core: rpc listeners successfully shut down
2017/06/08 05:41:38.484232 [INFO ] core: cluster listeners successfully shut down
2017/06/08 05:41:38.485002 [DEBUG] audit: removing reload function: path=file/
2017/06/08 05:41:38.485350 [INFO ] rollback: stopping rollback manager
    2017/06/08 05:41:42 [INFO] agent: Synced check 'vault-dev:<HOST_1>:8200:vault-sealed-check'
    2017/06/08 05:41:42 [WARN] agent: Check 'vault-dev:<HOST_1>:8200:vault-sealed-check' missed TTL, is now critical
    2017/06/08 05:41:42 [INFO] agent: Synced check 'vault-dev:<HOST_1>:8200:vault-sealed-check'

Full logs are at https://www.dropbox.com/s/2yaj8nvisuggbs0/vault_logs.zip?dl=0 (I've just scrubbed hostnames/IPs for privacy).


So it seems one node loses the lock - probably due to a networking issue - and the other one acquires it (the lock actually stayed held by HOST_2), but then fails to start active operation for some reason.
And if I drop the lock manually now, neither node acquires it again.

Jeff Mitchell

Jun 9, 2017, 4:45:07 PM
to Vault
Hi Vladimir,

2017/06/08 05:40:58.140688 [TRACE] rollback: attempting rollback: path=transit/
2017/06/08 05:40:58.140719 [TRACE] rollback: attempting rollback: path=cubbyhole/
2017/06/08 05:40:58.140740 [TRACE] rollback: attempting rollback: path=auth/token/
    2017/06/08 05:41:33 [ERR] yamux: keepalive failed: i/o deadline reached
    2017/06/08 05:41:33 [ERR] consul: RPC failed to server [<HOST_3_IP>]:8300: rpc error: EOF
    2017/06/08 05:41:33 [ERR] agent: coordinate update error: rpc error: EOF
    2017/06/08 05:41:33 [ERR] consul: RPC failed to server [<HOST_3_IP>]:8300: rpc error: EOF
    2017/06/08 05:41:33 [ERR] http: Request GET /v1/kv/vault-dev/core/lock?consistent=&index=36, error: rpc error: EOF from=[::1]:53919
    2017/06/08 05:41:33 [ERR] consul: RPC failed to server [<HOST_3_IP>]:8300: rpc error: EOF
    2017/06/08 05:41:33 [ERR] http: Request PUT /v1/session/renew/4013fd9f-b071-8e86-4068-2e776701731c, error: rpc error: EOF from=[::1]:40184

2017/06/08 05:41:38.272299 [WARN ] core: leadership lost, stopping active operation
2017/06/08 05:41:38.433096 [INFO ] core: pre-seal teardown starting

This looks very much like an issue with Consul (or with Vault's connection to Consul), and as a result Vault cannot function normally. You may want to examine your Consul logs in more detail.

Best,
Jeff


Vladimir Volkov

Jun 9, 2017, 5:03:28 PM
to vault...@googlegroups.com

Hi Jeff,

As far as I can see, the Consul cluster lost connectivity briefly, and then recovered successfully.

I have a Consul agent running on each Vault node (actually, the errors in those logs seem to be coming from the Consul agent, not from Vault itself - unfortunately, the logs are mixed together here), and two Consul servers running separately.
Vault is configured to talk only to its local Consul agent. I expected this setup to handle network outages and other disturbances well, since Consul itself seems to recover without much trouble.

-- 
Best regards, Vladimir

Jeff Mitchell

Jun 9, 2017, 5:06:14 PM
to Vault
Hi Vladimir,

There's a difference between Consul recovering its own connectivity after an outage and doing the right thing for all of its API clients. If the last thing in the log is the "pre-seal teardown starting" entry, it suggests that some Consul I/O Vault attempted never actually returned (with an error or otherwise), leaving Vault stuck waiting for a response.

Best,
Jeff



Vladimir Volkov

Jun 9, 2017, 5:26:13 PM
to vault...@googlegroups.com

The last thing is "stopping rollback manager", here is the relevant fragment:

2017/06/08 05:41:38.272299 [WARN ] core: leadership lost, stopping active operation
2017/06/08 05:41:38.433096 [INFO ] core: pre-seal teardown starting
2017/06/08 05:41:38.434781 [INFO ] core: stopping cluster listeners
2017/06/08 05:41:38.435938 [TRACE] core: waiting for success notification while stopping cluster listeners
2017/06/08 05:41:38.436138 [INFO ] core: shutting down forwarding rpc listeners
2017/06/08 05:41:38.436312 [INFO ] core: forwarding rpc listeners stopped
2017/06/08 05:41:38.484187 [INFO ] core: rpc listeners successfully shut down
2017/06/08 05:41:38.484232 [INFO ] core: cluster listeners successfully shut down
2017/06/08 05:41:38.485002 [DEBUG] audit: removing reload function: path=file/
2017/06/08 05:41:38.485350 [INFO ] rollback: stopping rollback manager

And if Consul I/O doesn't return, isn't a call supposed to fail on timeout?


The other node's behaviour seems stranger to me.
Its last message is:


2017/06/08 05:41:53.516894 [INFO ] core: acquired lock, enabling active operation

Shouldn't it be doing some more initialization immediately after that?

As a side note, last time it hung during post-unseal setup - perhaps it got stuck on the same operation this time?


-- 
Best regards, Vladimir

Jeff Mitchell

Jun 9, 2017, 5:29:34 PM
to Vault
On Fri, Jun 9, 2017 at 5:26 PM, Vladimir Volkov <vlad....@gmail.com> wrote:

And if Consul I/O doesn't return, isn't a call supposed to fail on timeout?

I'm not aware of a way to configure this for Consul's API, although perhaps something has been added.

The other node's behaviour seems more strange to me.
Its last message is:
2017/06/08 05:41:53.516894 [INFO ] core: acquired lock, enabling active operation

Shouldn't it be doing some more initialization immediately after that?

Yes.

Best,
Jeff 