To add more detail on the problem:
We have 500+ queries/APIs built on a schema with 200+ vertex types. All the queries were written to use the right indexes, so that under normal conditions their response time is under 50 ms. Most queries respond in less than 20 ms. You can refer to the screenshot shared before.
All of a sudden, one of these queries freezes at the database indefinitely, and all subsequent queries fired from the application also start to freeze indefinitely. Concurrent connections to the database then pile up, with none of the queries responding, until the database reaches its maximum connection limit and stops accepting new connections. Looking at the database, CPU and memory remain stable; there is only a very slight increase in CPU (due to the high number of concurrent connections). This suggests the queries are not being executed in the database and are instead waiting for a resource or lock.
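One way to check the resource/lock hypothesis would be a thread dump of the server JVM taken while it is frozen, for example with the JDK's `jstack` tool. The same information is available programmatically through the standard `ThreadMXBean` API. The following is only a minimal sketch: the class name `BlockedThreadReport` is hypothetical, and it inspects the JVM it runs in, so it assumes it can be executed inside the database process (for example in an embedded setup) or adapted to a remote JMX connection.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of OrientDB: lists threads of the current JVM
// that are BLOCKED or WAITING, and the lock each one is waiting on.
public class BlockedThreadReport {

    public static String report() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();

        // findDeadlockedThreads() returns null when no deadlock cycle is found.
        long[] deadlocked = mx.findDeadlockedThreads();
        sb.append("deadlocked threads: ")
          .append(deadlocked == null ? 0 : deadlocked.length)
          .append('\n');

        // Dump every thread, including the monitors and synchronizers it holds.
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            Thread.State state = info.getThreadState();
            if (state == Thread.State.BLOCKED || state == Thread.State.WAITING) {
                sb.append(info.getThreadName())
                  .append(" is ").append(state)
                  .append(" on ").append(info.getLockName())
                  .append(", held by ").append(info.getLockOwnerName())
                  .append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(report());
    }
}
```

If many request threads turn out to be waiting on the same lock, the thread named as its holder would be the first place to look.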
To bring the server back to normal, we have to stop the database (thus killing the connections) and start it again before we can access it. This happens very frequently, and sometimes the index crashes during the restart, so we have to restore the database from backup.
We log every query that is executed. After bouncing the server, we re-ran the frozen queries (same query with the same parameters); they executed normally and responded with the usual latency (10 - 20 ms). We tried the first frozen query as well as a random sample from the whole set of frozen queries, and all of them executed as expected.
When the database goes into this frozen state, even a simple query that is supposed to pick a single record by primary Id also freezes. We have no clue why the database freezes all of a sudden.
We have been using OrientDB for the last 5 years and have never faced such a situation.
We tried passing a timeout argument with all the read queries (with the timeout set to 5000 ms), and we lowered record.locktimeout, various network-level timeouts, the session timeout, the connection timeout, etc. None of this helped. The queries do not time out. The connection breaks and the application gets a SocketTimeoutException, but the connection/query seems to stay in a frozen/locked state on the database side and keeps new connections from being accepted.
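Since the server-side timeouts are not releasing the stuck connections, one stopgap on the application side would be to cap the number of queries in flight, so that a freeze does not also exhaust the database's connection limit. The following is a minimal sketch using only the JDK; `QueryGate`, `maxInFlight` and `waitMillis` are hypothetical names, and the `Supplier` stands in for whatever code actually runs the query. It does not address the root cause and would not unfreeze the database.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical client-side guard: at most maxInFlight queries run at once.
// Callers that cannot get a permit within waitMillis are rejected instead of
// opening yet another connection to a database that is not responding.
public class QueryGate {
    private final Semaphore permits;
    private final long waitMillis;

    public QueryGate(int maxInFlight, long waitMillis) {
        this.permits = new Semaphore(maxInFlight);
        this.waitMillis = waitMillis;
    }

    public <T> T run(Supplier<T> query) {
        boolean acquired;
        try {
            acquired = permits.tryAcquire(waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("interrupted while waiting for a query permit", e);
        }
        if (!acquired) {
            throw new IllegalStateException("too many queries in flight, rejecting this one");
        }
        try {
            return query.get();
        } finally {
            permits.release(); // always hand the permit back
        }
    }

    public int available() {
        return permits.availablePermits();
    }
}
```

With such a gate, once `maxInFlight` queries are stuck, further requests fail fast in the application, which at least keeps some connections free for administrative access.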
We tried to kill the connections using the "Kill" and "interrupt" commands. Both failed: the command just hangs, waiting for a response from the server for the first connections.
We are currently rebuilding the indexes for the entire database in one go as a last resort.
We are a startup and built the entire product on OrientDB. Because of this issue, our service has been down for the last 5 days, we are losing our customers' trust, and we are facing a major crisis.
Please help us identify the root cause and overcome the issue.
Regards,
Ram