We are planning to migrate from Cassandra to DataStax Astra. Over the last six months we have been running a staging instance of our NodeJS application. This is a containerised NodeJS app running under Azure App Service, connecting to the DataStax Astra on Azure (Australia East region) PAYG service.
As is best practice, we have configured a long-running connection pool for the NodeJS C* driver.
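For context, the long-lived client is created once at startup and shared for the application's lifetime. A minimal sketch of that setup, assuming the driver's Astra `cloud` connection option (the bundle path, keyspace name, and environment variable names below are placeholders, not our real values):

```javascript
// Sketch: one long-lived cassandra-driver Client for Astra, created at
// startup and reused everywhere. Paths and credentials are placeholders.
const cassandra = require('cassandra-driver');

const client = new cassandra.Client({
  cloud: { secureConnectBundle: '/path/to/secure-connect-bundle.zip' },
  credentials: {
    username: process.env.ASTRA_CLIENT_ID,     // placeholder env var
    password: process.env.ASTRA_CLIENT_SECRET, // placeholder env var
  },
  keyspace: 'my_keyspace', // placeholder
});

// Shared pool; per the FAQ, client.shutdown() is called only at app exit.
module.exports = client;
```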
Periodically we encounter this exception:
```
{
  labels: [ 'ERROR' ],
  message: 'NoHostAvailableError: All host(s) tried for query failed.
    at PrepareHandler.prepareWithQueryPlan (/ob3/node_modules/cassandra-driver/lib/prepare-handler.js:133:15)
    at PrepareHandler.prepare (...modules/cassandra-driver/lib/prepare-handler.js:107:25)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async PrepareHandler.getPrepared (.../node_modules/cassandra-driver/lib/prepare-handler.js:62:12)
    at async Client._execute (.../node_modules/cassandra-driver/lib/client.js:1012:31)'
}
```
In each case the Astra service is available, but for some reason the driver has started reporting no hosts. Service is normally restored by restarting our app, which shuts down and restarts the Cassandra NodeJS driver connection to Astra.
In testing, this issue occurs on average at least once a month in our Astra staging environment.
In our production environment we have long had a best practice of gracefully restarting our NodeJS processes daily using pm2 graceful restart.
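For reference, the daily restart can also be scheduled by pm2 itself via its `cron_restart` option. A sketch of an ecosystem file, assuming pm2's config format (the app name, script path, and schedule are placeholders):

```javascript
// ecosystem.config.js -- placeholder app name/script. cron_restart tells
// pm2 to restart the process on a cron schedule (daily at 04:00 here).
module.exports = {
  apps: [{
    name: 'astra-app',         // placeholder
    script: './server.js',     // placeholder
    cron_restart: '0 4 * * *', // daily restart at 04:00
    kill_timeout: 10000,       // give in-flight queries time to drain
  }],
};
```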
The FAQ https://docs.datastax.com/en/developer/nodejs-driver/4.6/faq/ says the connection should be long running ("only call client.shutdown() once in your application's lifetime, normally when you shutdown your application.").
I note several other JIRA tickets related to possible causes of NoHostAvailableError. I think it would be good to provide best-practice guidance, and an FAQ entry advising users how to configure the driver for long-running apps.
The practical result of the above issue is that, if not handled, the application becomes unresponsive / unable to reach the database (for hours in many cases) until NodeJS and the driver are restarted.
Should I implement code to restart the Cassandra driver when this exception is encountered?
Can anyone here provide examples of the best way to do this?
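To make the question concrete, here is the kind of wrapper I have in mind: a minimal sketch, not a tested implementation, that recycles the driver client after repeated NoHostAvailableError failures. The `createClient` factory is an assumption you supply (it would normally return a configured `cassandra.Client` for Astra), and the error check by `err.name` is a simplification; in a real app you might match the driver's error class instead.

```javascript
// Sketch: recycle the cassandra-driver Client when NoHostAvailableError
// is seen repeatedly. createClient is a caller-supplied factory returning
// a client object with execute() and shutdown() (e.g. a cassandra.Client).
class RecyclingClient {
  constructor(createClient, maxFailures = 3) {
    this.createClient = createClient;
    this.maxFailures = maxFailures;
    this.failures = 0;
    this.client = createClient();
  }

  async execute(query, params, options) {
    try {
      const result = await this.client.execute(query, params, options);
      this.failures = 0; // a success means the pool is healthy again
      return result;
    } catch (err) {
      if (err && err.name === 'NoHostAvailableError') {
        this.failures += 1;
        if (this.failures >= this.maxFailures) {
          await this.recycle(); // stale pool: shut down and recreate
        }
      }
      throw err; // let the caller decide whether to retry
    }
  }

  async recycle() {
    this.failures = 0;
    try {
      await this.client.shutdown(); // release the stale connection pool
    } catch (_) {
      // ignore shutdown errors on an already-broken client
    }
    this.client = this.createClient(); // fresh pool and control connection
  }
}

module.exports = { RecyclingClient };
```

The idea is that the rest of the app calls `execute()` on the wrapper instead of the raw client, so the "restart the driver" step no longer requires restarting the whole NodeJS process.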
I think the driver FAQ could be improved with a recommendation on exception handling, for example how to detect this error and close/recreate a long-running connection pool when it occurs.