I committed a cardinal sin. In trying to re-create an index and the keyspace it was based on, I inadvertently dropped the keyspace without having first deleted the index.
Then, when we tried to delete the index, it failed. When we went to re-create the index, that failed too, and we'd see an org.elasticsearch.ResourceAlreadyExistsException in the logs.
We've tried stopping the search nodes, deleting the index directory in each node under /var/lib/cassandra/elasticsearch.data/nodes/0/indices/, and then starting the search nodes again. No luck.
We've tried doing the same in combination with hacking the row in the elastic_admin.metadata_log table. Where the source column was:
'create-index [foo_index], cause [api]'
We changed it to:
'create-index [not_foo_index], cause [api]'
No luck either.
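For reference, that hack amounted to something like the following. This is a sketch only: the v value here is a placeholder for the clustering value of the row we actually edited.
-- Sketch of the metadata_log edit we tried; the v value is a placeholder.
UPDATE elastic_admin.metadata_log SET source = 'create-index [not_foo_index], cause [api]' WHERE cluster_name = 'Foo Elassandra 6.8.4.7 cluster-QA' AND v = 10;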
We're at a loss at this point as to:
A. Where the remaining knowledge of the index is.
B. How to re-create this single index without having to take all the nodes in our search datacenter, convert them to non-search nodes, and then try to start from a clean slate from there.
What are we missing?
Thanks for any help.
OK, we're planning out how to "take all the nodes in our search datacenter, convert them to non-search nodes, and then try to start from a clean slate from there."
For us the broad outline appears to be:
- Stop all the search nodes.
- One by one, start the search nodes back up as straight Cassandra nodes. For us that means changing CASSANDRA_DAEMON in /etc/cassandra/sysconfig from org.apache.cassandra.service.ElassandraDaemon to org.apache.cassandra.service.CassandraDaemon.
- Remove every trace of Elassandra from each of the search nodes.
- Repair the indexed tables.
- Restart each of the search nodes configured back as Elassandra nodes.
- Re-create our indexes (see the sketch after this list).
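On that last re-create step, the CQL form of creating one Elasticsearch-backed secondary index by hand looks roughly like the statement below. Treat it as a sketch: the keyspace, table, and index names are hypothetical, and the class name and empty column list are as we understand them from the Elassandra docs, so double-check against your version (normally the index is created for you when the Elasticsearch mapping is PUT).
-- Sketch only; foo_keyspace, foo_table, and elastic_foo_table_idx are hypothetical names.
CREATE CUSTOM INDEX IF NOT EXISTS elastic_foo_table_idx ON foo_keyspace.foo_table () USING 'org.elasticsearch.cassandra.index.ExtendedElasticSecondaryIndex';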
The step that troubles me is the "Remove every trace of Elassandra from each of the search nodes" one. I'm fearful that I'll either miss something or do something I shouldn't. Here are the things I believe I should do, in no particular order:
- Drop the elastic_admin keyspace.
- Use DROP INDEX to drop every one of the secondary indexes (see the CQL sketch after this list).
- Delete the /var/lib/cassandra/elasticsearch.data/ directory on each of the search nodes.
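For the two CQL pieces of that cleanup, a minimal sketch, assuming a hypothetical indexed table foo_keyspace.foo_table whose Elasticsearch-backed index follows the elastic_<table>_idx naming; your index names may differ, and the elasticsearch.data directory removal is a plain filesystem delete, so it isn't shown here.
-- Sketch only; the keyspace, table, and index names are hypothetical.
DROP INDEX IF EXISTS foo_keyspace.elastic_foo_table_idx;
-- Repeat for each remaining Elasticsearch-backed secondary index, then:
DROP KEYSPACE IF EXISTS elastic_admin;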
Will these steps safely return our Elassandra nodes to a clean slate from which Elassandra can be re-enabled?
Thanks.
Update: we managed to get Elassandra back into a valid state without having to do the reset I was describing. However, if anyone has input on that plan, I'd still be happy to hear it.
Let me at least share how we fixed this.
While we were trying and failing to drop or create indexes, we kept seeing repeated messages like these in system.log:
2021-03-26 03:59:29,663 WARN [elasticsearch[10.6.23.98][masterService#updateTask][T#1]] ClusterService.java:1300 commitMetaData PAXOS Failed to update metadata source=delete-index [[an_index/wsVQmJBiT8-8fOPd3rXVfA]] prevMetadata=9f35db4a-b9cf-43f9-9fa5-2ab7c5aa5a74/9 nextMetaData=9f35db4a-b9cf-43f9-9fa5-2ab7c5aa5a74/10
2021-03-26 21:18:05,196 WARN [elasticsearch[10.6.23.183][masterService#updateTask][T#1]] CassandraDiscovery.java:1103 publishAsCoordinator PAXOS concurrent update, source=put-mapping[another_index] metadata=4b6b7513-ebd0-4135-9f42-c8c08869dc55/10, resubmit task on next metadata change
Note the /10. In the elastic_admin.metadata_log table, the version column was stuck at 10. As one would expect, all 11 rows had 10 for version, and their v clustering column values ran from 0 through 10. But we couldn't add another row.
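For anyone hitting the same thing, this is roughly the query we were using to look at that state (the column names are the ones from our elastic_admin.metadata_log table):
-- Inspect the metadata_log rows for the cluster; in the stuck state, version is 10 on every row and v runs from 0 through 10.
SELECT v, version, source FROM elastic_admin.metadata_log WHERE cluster_name = 'Foo Elassandra 6.8.4.7 cluster-QA';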
The fix was to delete the last row in the metadata_log table:
DELETE FROM elastic_admin.metadata_log WHERE cluster_name = 'Foo Elassandra 6.8.4.7 cluster-QA' AND v = 10;
And then to set the version on all the remaining rows to reflect that there were now only 10 rows, with v running from 0 through 9:
UPDATE elastic_admin.metadata_log SET version = 9 WHERE cluster_name = 'Foo Elassandra 6.8.4.7 cluster-QA';
Then we restarted the search nodes.
That got the PAXOS transactions working again, and we were able to make index changes. At last check, there are 27 rows and the version on all is at 26.
This solution is adapted from the thread for issue #367, "Failed to create index - PAXOS Failed to update metadata source=create-index".