First of all, we are using:
Cassandra 3.11.4
Vault 1.2.2
We noticed that one of our Vault nodes was sealed. When we looked for the reason, we saw that it had recently crashed and restarted:
```
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x16f90ed]
goroutine 72362587 [running]:
0, 0x1)
```
We are using the database secrets engine mounted on /cassandra, with the `cassandra-database-plugin`.
We configured two Cassandra clusters as follows:
```
# First cluster
Key                                   Value
---                                   -----
allowed_roles                         [00-rw-cassandra 00-ro-cassandra]
connection_details                    map[hosts:00cassandra.service.consul protocol_version:4 username:vault_superuser0]
plugin_name                           cassandra-database-plugin
root_credentials_rotate_statements    []

# Second cluster (in a different DC from the Vault servers, us-central)
Key                                   Value
---                                   -----
allowed_roles                         [01-rw-cassandra 01-ro-cassandra]
connection_details                    map[hosts:01cassandra.service.consul protocol_version:4 username:vault_superuser2]
plugin_name                           cassandra-database-plugin
root_credentials_rotate_statements    []
```
Since the panic happened, the 01cassandra cluster has been completely overloaded. Each node has a system load above 80 at all times, and all of that load comes from the Vault servers.
If I block the Vault servers' IP addresses with iptables, the load drops below 5 very quickly. As soon as I remove the iptables rules, the load goes back up.
With tcpdump we saw thousands of connections per minute from Vault, using even more bandwidth than Cassandra uses for its inter-node traffic.
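For anyone who wants to reproduce the connection count, here is a rough sketch of how the rate can be derived from tcpdump's text output. It is not the exact tooling we used. It assumes the capture was taken with `tcpdump -n` (numeric addresses, default line format) and that the clients connect to the default Cassandra native port 9042, which is an assumption about our setup rather than something shown above. The function name `countSYNsPerMinute` is made up for this example.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"sort"
	"strings"
)

// countSYNsPerMinute counts new connection attempts (TCP SYN without ACK)
// towards the given destination port, grouped by the "HH:MM" prefix of the
// tcpdump timestamp. It expects lines in tcpdump's default -n text format:
//   14:02:31.123456 IP 10.1.2.3.51234 > 10.9.8.7.9042: Flags [S], seq ...
func countSYNsPerMinute(lines []string, port string) map[string]int {
	counts := make(map[string]int)
	for _, line := range lines {
		fields := strings.Fields(line)
		if len(fields) < 7 || fields[1] != "IP" || fields[3] != ">" {
			continue // not an IPv4 packet line in the expected layout
		}
		// "[S]," is a bare SYN; a SYN-ACK reply is printed as "[S.],".
		if fields[5] != "Flags" || fields[6] != "[S]," {
			continue
		}
		dst := strings.TrimSuffix(fields[4], ":")
		if !strings.HasSuffix(dst, "."+port) {
			continue // SYN to some other port
		}
		if len(fields[0]) < 5 {
			continue // malformed timestamp
		}
		counts[fields[0][:5]]++
	}
	return counts
}

func main() {
	// Read tcpdump text output from stdin.
	var lines []string
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		lines = append(lines, sc.Text())
	}

	counts := countSYNsPerMinute(lines, "9042") // 9042: assumed native port
	minutes := make([]string, 0, len(counts))
	for m := range counts {
		minutes = append(minutes, m)
	}
	sort.Strings(minutes)
	for _, m := range minutes {
		fmt.Printf("%s %d\n", m, counts[m])
	}
}
```

It reads the capture from stdin and prints one line per minute with the number of connection attempts seen in that minute.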
Now we cannot do anything on the Vault server that involves the 01cassandra cluster. When I try to make a configuration change, I get this kind of message:
```
Error writing data to cassandra/config/01cassandra: Error making API request.
Code: 400. Errors:
* error creating database object: error verifying connection: error creating session: gocql: unable to create session: control: unable to connect to initial hosts: gocql: no response to connection startup within timeout
```
(In the above example we were trying to set a larger timeout.)
When we tried to run `vault secrets disable cassandra/`, we got this message:
```
* failed to revoke "cassandra/creds/01-rw-cassandra/11To3PhJNL2YZ9sLdrXXXRuJ5" (1 / 12): failed to revoke entry: resp: (*logical.Response)(nil) err: error verifying connection: error creating session: gocql: unable to create session: control: unable to connect to initial hosts: gocql: no response to connection startup within timeout
```
In the meantime, the applications can still connect to Cassandra without issues (just more slowly because of the load). A rolling restart of the Cassandra cluster doesn't seem to change anything, and neither does bringing the whole cluster down and back up.
The other cluster, 00cassandra, has the exact same configuration and the exact same client applications, just in another DC, and it has zero issues (it probably handles a lot more load, too).
Now the weirdest thing is that we have the same setup in preproduction, and the preproduction 01cassandra also failed in the same way. After a few days the issue went away without any configuration change on either Vault or the Cassandra cluster.
The latency between the two DCs is about 100 ms. I think the default timeout for the cassandra-database-plugin is 5 seconds.
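To put a number on that, something like the following Go sketch can be run from a Vault host to time plain TCP connects to a Cassandra node. It is only a rough probe under a few assumptions: the address is passed on the command line (for us it would be the 01cassandra host on port 9042, the default native port, which I have not shown above), `assumedTimeout` is the 5 seconds I believe the plugin uses, and `dialLatency` is a name made up for this example. It only measures the TCP handshake, not the CQL startup exchange that gocql performs afterwards, so it gives a lower bound on connection setup time.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

// assumedTimeout mirrors the 5 s default I believe the plugin uses.
// This value is unverified.
const assumedTimeout = 5 * time.Second

// dialLatency opens one TCP connection to addr and returns how long the
// attempt took. It does not speak CQL, so the time excludes the protocol
// startup handshake that a real driver performs after connecting.
func dialLatency(addr string, timeout time.Duration) (time.Duration, error) {
	start := time.Now()
	conn, err := net.DialTimeout("tcp", addr, timeout)
	elapsed := time.Since(start)
	if err != nil {
		return elapsed, err
	}
	conn.Close()
	return elapsed, nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: probe host:port")
		return
	}
	addr := os.Args[1]

	// A handful of attempts is enough to see whether connect times are
	// anywhere near the timeout.
	for i := 1; i <= 5; i++ {
		d, err := dialLatency(addr, assumedTimeout)
		if err != nil {
			fmt.Printf("attempt %d: failed after %v: %v\n", i, d, err)
			continue
		}
		fmt.Printf("attempt %d: connected in %v\n", i, d)
	}
}
```

With about 100 ms between the DCs I would expect connect times far below the timeout, which is why I doubt that latency alone explains the errors.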
Also, this setup had been untouched and working for a few months, with absolutely no issues until now.
Any idea what might be going on?
Also, any idea how to completely remove the database secrets engine we created on /cassandra? We really don't want another Vault panic that leaves nodes sealed again.
Regards,
Leo