Happened again today: I received a "111 connection refused" error. So I fired up tcptrack on my database server to look for TCP packets on port 8529 (tcptrack -i eth1 port 8529).
There was not a single connection waiting to be closed! Instead, connections were popping up and closing constantly with a 3-second timeout, and it really wasn't any different from any other day.
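For reference, this is roughly what I was running to watch the port and to count sockets per TCP state (a sketch; eth1 is the interface the app traffic arrives on, adjust as needed):

    # live view of TCP connections on ArangoDB's default port
    tcptrack -i eth1 port 8529

    # count sockets per state on 8529 (TIME_WAIT, ESTABLISHED, ...)
    ss -tan '( sport = :8529 or dport = :8529 )' | awk 'NR>1 {print $1}' | sort | uniq -c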
I tried raising fs.file-max to 1000000 and even giving the arangodb user a hard limit of 10024 and a soft limit of 4096 open files, but it didn't help.
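Concretely, the limit changes looked roughly like this (a sketch of what I applied; verify the exact paths against your own distro's conventions):

    # system-wide limit on open file handles
    sysctl -w fs.file-max=1000000

    # per-user limits for the arangodb user, added to /etc/security/limits.conf:
    #   arangodb hard nofile 10024
    #   arangodb soft nofile 4096

    # verify what the running arangod process actually got
    grep 'open files' /proc/$(pidof -s arangod)/limits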
I then tried changing the connection persistence to "close" in order to force the app to open a new connection on every page refresh. Still nothing.
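In HTTP terms that just means sending the Connection: close header on each request; something like this with curl (a sketch; the endpoint and credentials are placeholders):

    # force the server to close the TCP connection after each request
    curl -H 'Connection: close' -u root:password \
         http://127.0.0.1:8529/_api/version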
All operations are reads and updates; there is not a single delete (unless one is performed from the web UI).
Memory consumption and CPU usage are by no means excessive (4-core CPU, 8 GB RAM), and we unload any unused collections at regular intervals to save RAM, so I can't really tell which resource is being depleted.
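The periodic unloading goes through the standard collection API, roughly like this (a sketch; "mycollection" and the credentials are placeholders):

    # ask the server to unload a collection from RAM
    curl -X PUT -u root:password \
         http://127.0.0.1:8529/_db/_system/_api/collection/mycollection/unload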
The only thing that keeps doing the trick is restarting the service, which is very dangerous: after a "111" error, ArangoDB will almost always hit a segmentation fault on a WAL file during the restart, so my only option is to delete the file or ignore it, resulting in complete data loss.
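For completeness, the restart-and-check cycle looks roughly like this (the service name may be arangodb or arangodb3 depending on version, and journalctl assumes a systemd box):

    # restart the database service
    sudo systemctl restart arangodb3

    # then check the startup logs for the WAL segfault
    journalctl -u arangodb3 --since '10 min ago' | grep -iE 'segmentation|wal'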
Also, judging from our use case, there is a high probability that these errors appear after multiple update operations.
I have set all new collections to use waitForSync in order to make sure that data is actually written to disk, and I am still losing data.
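For the record, this is how the sync flag was set (a sketch; the collection name and credentials are placeholders):

    # create a new collection that syncs to disk on every write
    curl -X POST -u root:password \
         -d '{"name": "mycollection", "waitForSync": true}' \
         http://127.0.0.1:8529/_api/collection

    # or enable it on an existing collection
    curl -X PUT -u root:password \
         -d '{"waitForSync": true}' \
         http://127.0.0.1:8529/_api/collection/mycollection/properties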
Right now it feels like a domino of disasters: more update operations lead to refused connections, which lead to data loss, which then requires even more update operations on our side.
My only hope now is to try the TIME_WAIT solution.
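That is, letting the kernel reuse sockets stuck in TIME_WAIT, roughly along these lines (a sketch; tcp_tw_reuse is generally considered safe for outgoing connections, whereas tcp_tw_recycle is known to be dangerous and was removed from newer kernels):

    # allow reuse of TIME_WAIT sockets for new outgoing connections
    sysctl -w net.ipv4.tcp_tw_reuse=1

    # optionally shorten how long closed sockets linger in FIN-WAIT-2
    sysctl -w net.ipv4.tcp_fin_timeout=30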