cassandra-database-plugin/gocql panic and strange vault behavior

Léo FERLIN SUTTON

Sep 20, 2019, 12:19:44 PM

to Vault

First of all, we are using:

Cassandra 3.11.4

Vault 1.2.2

We noticed that a Vault node was sealed. When we looked into the reason, we saw that it had recently crashed and restarted:

```
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x16f90ed]

goroutine 72362587 [running]:
github.com/hashicorp/vault/vendor/github.com/gocql/gocql.(*controlConn).HandleError(0xc0024bddc0, 0xc006e09a40, 0x374e100, 0xc003d2fdb0, 0x1)
	/gopath/src/github.com/hashicorp/vault/vendor/github.com/gocql/gocql/control.go:392 +0x6d
github.com/hashicorp/vault/vendor/github.com/gocql/gocql.(*Conn).closeWithError(0xc006e09a40, 0x374e100, 0xc003d2fdb0)
	/gopath/src/github.com/hashicorp/vault/vendor/github.com/gocql/gocql/conn.go:491 +0x27a
github.com/hashicorp/vault/vendor/github.com/gocql/gocql.(*Conn).serve(0xc006e09a40)
	/gopath/src/github.com/hashicorp/vault/vendor/github.com/gocql/gocql/conn.go:515 +0x58
created by github.com/hashicorp/vault/vendor/github.com/gocql/gocql.(*Session).dialWithoutObserver
	/gopath/src/github.com/hashicorp/vault/vendor/github.com/gocql/gocql/conn.go:285 +0x6ff
```

We are using the database secret engine mounted on /cassandra, with the `cassandra-database-plugin`.

We had two Cassandra clusters configured as follows:

```
# First cluster
Key                                   Value
---                                   -----
allowed_roles                         [00-rw-cassandra 00-ro-cassandra]
connection_details                    map[hosts:00cassandra.service.consul protocol_version:4 username:vault_superuser0]
plugin_name                           cassandra-database-plugin
root_credentials_rotate_statements    []

# Second cluster (on another DC from the vault servers, us-central)
Key                                   Value
---                                   -----
allowed_roles                         [01-rw-cassandra 01-ro-cassandra]
connection_details                    map[hosts:01cassandra.service.consul protocol_version:4 username:vault_superuser2]
plugin_name                           cassandra-database-plugin
root_credentials_rotate_statements    []
```

Since the panic, the 01cassandra cluster has been completely overloaded. Each node has a system load above 80 at all times, and all of that load comes from the Vault servers.

If I block the Vault servers' IP addresses with iptables, the load drops below 5 very quickly; when I remove the iptables rules, the load goes back up.

With tcpdump we saw thousands of connections per minute, using even more bandwidth than Cassandra uses for its inter-node traffic.

Now we cannot do anything on the Vault server relating to the 01cassandra cluster. When I try to make a configuration change, I get this kind of message:

```
Error writing data to cassandra/config/01cassandra: Error making API request.

URL: PUT https://127.0.0.1:8200/v1/cassandra/config/01cassandra
Code: 400. Errors:

* error creating database object: error verifying connection: error creating session: gocql: unable to create session: control: unable to connect to initial hosts: gocql: no response to connection startup within timeout
```

(In the above example we tried to add a large timeout)

When we tried to run `vault secrets disable cassandra/`, we got this message:

```

* failed to revoke "cassandra/creds/01-rw-cassandra/11To3PhJNL2YZ9sLdrXXXRuJ5" (1 / 12): failed to revoke entry: resp: (*logical.Response)(nil) err: error verifying connection: error creating session: gocql: unable to create session: control: unable to connect to initial hosts: gocql: no response to connection startup within timeout

```

In the meantime, the applications can still connect to Cassandra without issues (only more slowly because of the load). A rolling restart of the Cassandra cluster does not seem to change anything, and neither does bringing the cluster down entirely and back up.

The other cluster, 00cassandra, has the exact same configuration and the exact same client applications, just in another DC, and it has no issues at all (it probably handles a lot more load).

Now the weirdest thing: we have the same setup in preproduction, and the preproduction 01cassandra also failed in the same way. After a few days the issue went away without any configuration change on either Vault or the Cassandra cluster.

The latency between the two DCs is about 100 ms, and I think the default timeout for the cassandra-database-plugin is 5 seconds.

Also, this setup had been untouched and working for a few months with no issues until now.

Any idea what might be going on?

Any ideas on how to completely erase the database secret engine we created on /cassandra? We really don't want another Vault panic causing nodes to become sealed again.

Regards,

Leo

Matthew Tice

Sep 21, 2019, 5:40:26 PM

to vault...@googlegroups.com

What happens if you try to revoke the leases? My guess is that you'll continue with the high load as vault tries to connect to the cluster to drop the roles. If that's the case, what about a force revoke so that they're removed from vault? You may need to do some cleanup on the cassandra cluster (removing the dynamic roles, etc.).
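For reference, a force revoke removes the leases from Vault even when the backend cannot be reached. On the CLI this is `vault lease revoke -force -prefix <prefix>`, which maps to the `sys/leases/revoke-force/<prefix>` HTTP endpoint. The Go sketch below only builds that request using the standard library; the helper name and the token value are placeholders, and the address and prefix are taken from the error messages earlier in the thread.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// newForceRevokeRequest is a hypothetical helper that builds a PUT request
// to Vault's sys/leases/revoke-force/<prefix> endpoint.
func newForceRevokeRequest(vaultAddr, token, prefix string) (*http.Request, error) {
	url := strings.TrimRight(vaultAddr, "/") +
		"/v1/sys/leases/revoke-force/" + strings.Trim(prefix, "/")
	req, err := http.NewRequest(http.MethodPut, url, nil)
	if err != nil {
		return nil, err
	}
	// Vault reads the client token from this header.
	req.Header.Set("X-Vault-Token", token)
	return req, nil
}

func main() {
	// The token below is a placeholder, not a real credential.
	req, err := newForceRevokeRequest(
		"https://127.0.0.1:8200",
		"placeholder-token",
		"cassandra/creds/01-rw-cassandra",
	)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
	// To actually send it: resp, err := http.DefaultClient.Do(req)
}
```

Because a force revoke ignores backend errors, any dynamic roles already created in Cassandra stay there and need to be cleaned up by hand.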

--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html. Behavior in violation of those guidelines may result in your removal from this mailing list.

GitHub Issues: https://github.com/hashicorp/vault/issues
IRC: #vault-tool on Freenode
---
You received this message because you are subscribed to the Google Groups "Vault" group.
To unsubscribe from this group and stop receiving emails from it, send an email to vault-tool+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/vault-tool/183d2c96-7a99-41ac-880f-13739a00d10b%40googlegroups.com.

Léo FERLIN SUTTON

Sep 23, 2019, 5:54:42 AM

to vault...@googlegroups.com

> What happens if you try to revoke the leases?
> My guess is that you'll continue with the high load as vault tries to connect to the cluster to drop the roles.
> If that's the case, what about a force revoke so that they're removed from vault?

I did the force revoke. It allowed me to disable the secret engine. Thank you.

The initial vault panic was surrounded by lease revocations in the log files, so it might all have started with a lease revocation failure.

Anyway, right now I no longer have any configuration in Vault relating to the Cassandra cluster (since I disabled the whole database engine), but I am still getting thousands of connections to Cassandra from the Vault cluster. Does Vault have a queue or a cache where some relic of the configuration might still be present?

Regards.

Leo
