How to debug refused connections?

Georgios Kafataridis

unread,

Apr 7, 2017, 3:14:21 AM4/7/17

to ArangoDB

I am getting "Connect Exception 111 cannot connect to endpoint 'tcp://0.0.0.0:8529': Connection refused" errors once or twice per week. For the duration of aranagodb's unavailability I can't even visit the webUI.

What would be the best approach to find what's causing this ? Are there any conditions that trigger it apart from hardware limits? Too many requests maybe ?

Server is running:
Debian 8

Arangodb 3.1.16

Wilfried Gösgens

unread,

Apr 7, 2017, 4:29:11 AM4/7/17

to ArangoDB

Hi,
you can't connect 0.0.0.0 - you can only bind it.
So for some reason your application picks up the wrong connection IP, you need to put in either 127.0.0.1 (if its on the same machine) or the public IP address of the machine.

In general you can use the output of

netstat -alpnt

to see the connections current in use; The number is limited.
Its a common failure if you don't use connection keepalive that closed connections linger around to pick up eventually stray tcp packets.

In the Linux kernel you can either configure to re-use ports that are in this state, but the better way is to use connection keepalive since establishing new connections needs time and consumes resources on client and server side.

Cheers,
Willi

Georgios Kafataridis

unread,

Apr 7, 2017, 7:58:30 AM4/7/17

to ArangoDB

Thank you.

OK, isn't the keepalive settings (net.ipv4.tcp_keepalive_time = 7200, net.ipv4.tcp_keepalive_intvl = 75, net.ipv4.tcp_keepalive_probes = 9) enabled by default ?

Should I increase the keepalive time, and reduce to intvl and probes numbers to prevent refused connections in the future ?

References:
http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html
http://www.ehowstuff.com/configure-linux-tcp-keepalive-setting/

Wilfried Gösgens

unread,

Apr 7, 2017, 9:54:25 AM4/7/17

to ArangoDB

Hi,
I was talking about http connection keepalive - thus use the same connection for subsequent HTTP requests.
This should be an option in your driver, and a question about how you use your connectionobject; You have to look this up in the documentation of your driver.
However, errors happen, so to be sure you should also have a look at netstat and CLOSE_WAIT connections.

Cheers,
Willi

Georgios Kafataridis

unread,

Apr 7, 2017, 12:21:37 PM4/7/17

to ArangoDB

OK, I use the php driver and I was already using it the keep-alive flag.

    'class' => 'app\components\arangodb\Connection',
    'endpoint' => 'tcp://10.69.198.2:8529', //internal database server
    'username' => '',
    'password' => '',
    'database' => '',
    'persistence' => app\components\arangodb\Connection::PERSISTENCE_KEEPALIVE,

If this happens again I should look for hanged connections ?
So it's not a problem of arangodb but problem of the driiver?

Restarting arangodb3 service seems to close all connections, but doing so has left me with corrupted wal journals.

Wilfried Gösgens

unread,

Apr 10, 2017, 4:16:26 AM4/10/17

to ArangoDB

Hi,
Its most probably a situation where your system runs out of resources.

One thing to watch may be to use `db._explain(query)` in arangosh or the webinterface and find out whether all of your queries properly use indices.

Another thing that may is that i.e. you're targeted by a botnet, so traffic increases above the expected limits.

One thing to fine tune could be the maximum number of open files:
https://www.cyberciti.biz/faq/linux-increase-the-maximum-number-of-open-files/

In doubt i'd put in a rater limit to the frontend like described here: https://debian-administration.org/article/187/Using_iptables_to_rate-limit_incoming_connections

But yes, having a look at the number of open connections is definitely one thing to do once this reoccurs.

Regarding your WAL-log issues, you can force arangodb to ignore them by using `--wal.ignore-recovery-errors` but you should dump & re-import your data afterwards.

Cheers,
Willi

Georgios Kafataridis

unread,

Apr 10, 2017, 7:05:49 AM4/10/17

to ArangoDB

Thank you for your suggestions. I will look into that.

I was planning to use lsof and tcptrack the next time this happens to monitor any non closed connections.

I have no doubts that high traffic could be causing this taking into account other factors.

Now about closing any open tcp connections, can that cause any data loss though?

Wilfried Gösgens

unread,

Apr 10, 2017, 7:25:20 AM4/10/17

to ArangoDB

Hi,
The kernel keeps connection in TIME_WAIT for a grace period, so stray packages comming late can be sorted into these connections.
You can enable recent linux kernels to re-use these sockets:

https://vincent.bernat.im/en/blog/2014-tcp-time-wait-state-linux

(though he tells why not to do so ;-)

Cheers,
Willi

Georgios Kafataridis

unread,

Apr 10, 2017, 12:55:07 PM4/10/17

to ArangoDB

Happened again today. I received a "111 connection refused" error. So I fired up tcptrack on my database server to look for tcp packets on 8529. (tcptrack -i eth1 port 8529).

There was not a single connection waiting to be closed! Instead connections were popping up and closing constantly with a 3 second timeout and it really wasn't any different than any other day.

I tried updating the file-max to 1000000 and even giving to arangodb user a limit of hard 10024 and a soft limit to 4096, but it didn't help.

I then tried to change the connection's persistence to "close" in order to force the app to use a new connection on page refresh. Still nothing.

All operations are reads and updates, not a single delete (unless it is performed from the webUI)

Memory consumption and cpu usage are not by any means excessive (4 Core CPU, 8GB RAM) and we try to unload any unused collections to save ram at regular intervals, so I can't really understand what resources are depleted.

The only thing that keeps doing the trick is restarting the service, which is very dangerous as arangodb can stop, but after a "111" error

almost always a WAL file will throw out segmentation fault, so my only option is to delete it or ignore it resulting in complete data loss.

Also, from our use case, there is a high possibility that these errors are caused after multiple update operations.

I have updated all new collections to use sync in order to make sure that data are being written to disk and I am still losing data.

Right now it seems like a domino of disasters. More update operations lead to refused connections which lead to data loss which then require more update operations from our side.

My only hope now is to check the "time wait" solution.

Wilfried Gösgens

unread,

Apr 11, 2017, 4:03:36 AM4/11/17

to ArangoDB

Hi,
I have been able to reproduce your issue.
Your current usage pattern of restoring the same dump into several databases is triggering this odd behaviour.
If you want to be more stable for now, please start several instances on several ports instead of using several databases from the same arangorestore.
We're working on a fix for this.

Cheers,
Willi

Georgios Kafataridis

unread,

Apr 11, 2017, 4:45:11 AM4/11/17

to ArangoDB

Ok, thank you.

On configuration level that means I will have to separate the database directories in order to contain one database in each one?

Ref: http://stackoverflow.com/questions/35231530/multiple-instances-of-arangodb-on-same-server

Wilfried Gösgens

unread,

Apr 11, 2017, 5:32:34 AM4/11/17

to ArangoDB

Hi,
yes,
--database.directory, --javascript.app-path, --server.endpoint

Cheers,
Willi

Wilfried Gösgens

unread,

Apr 19, 2017, 3:56:21 AM4/19/17

to ArangoDB

Just a short note,
We're still working on this issue and won't be able to present a fix with the next (3.1.18) release.

Cheers,
Willi

Georgios Kafataridis

unread,

Apr 19, 2017, 4:18:56 AM4/19/17

to ArangoDB

Thanks for the update, I appreciate it.

Just to clear things up, the same bug somehow affects the issue here https://groups.google.com/forum/#!topic/arangodb/H1WcP1l2B8U ?

Is there a git issue to track both? Googlegroups is ok I guess, but not so transparent.

Wilfried Gösgens

unread,

Apr 19, 2017, 4:32:43 AM4/19/17

to ArangoDB

Yes,

its most probably the same issue - same collection ID used in several databases - so another workaround would be to change the collectionids to another uniq value.

And yes, a github issue would be better to track this and give you a mail once a version with a fix becomes available.

Reply all

Reply to author

Forward