Hitting TCP socket limit


Dana Powers

May 30, 2015, 2:43:18 PM
to consul-tool
I ran into this recently and thought I would share, to see if others have dealt with anything similar or have tips on how to avoid it:

A developer writes an application that registers services, pings TTL health checks, and queries for available services. All interaction is over localhost:8500 with a consul agent in client mode.

The dev application gets into a state where it does not close its TCP connections to the consul HTTP API (unclear whether from the health check pings or the service queries). The open connections eventually exhaust the max fds available to consul (we use a 1024 soft limit).
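For illustration, if the application were written in Go, a classic way to leak connections like this is to never drain and close HTTP response bodies. A minimal sketch of a TTL check ping with that bug called out (the check ID `service:my-app` is made up; the endpoint is Consul's agent check API):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

// pingTTLCheck marks a TTL check as passing via the local agent's HTTP API.
// "service:my-app" is a hypothetical check ID.
func pingTTLCheck() error {
	req, err := http.NewRequest("PUT",
		"http://localhost:8500/v1/agent/check/pass/service:my-app", nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	// Without these two lines, the TCP connection is never returned to
	// the pool, and every ping holds an fd open on the agent.
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body)

	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("check ping failed: %s", resp.Status)
	}
	return nil
}

func main() {
	if err := pingTTLCheck(); err != nil {
		fmt.Println(err)
	}
}
```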

With no more fds available, consul cannot accept new incoming TCP connections. It appears that the random Serf TCP syncs start failing, each logging `[ERR] memberlist: Error accepting TCP connection: accept tcp 0.0.0.0:8301: too many open files` to a local log file. But the node still appears healthy, which I assume is because the gossip probes over UDP still work (they don't need new fds for connections). So the log entries come fast and furious, and the logfile quickly takes up all remaining space on disk...

So anyway, a single buggy application managed to DoS the local consul agent and push the host's disk usage to 100%.

Fixing the dev application should be easy, as would be moving the log file to a separate partition.

But I am also thinking about how to address this from the consul side. It seems like we should have host-level health checks on used-vs-max fds, and certainly on available disk (a disk alert is what notified us originally). Apart from that, is it possible to time out TCP connections to the HTTP API? I haven't dug too deeply into Linux kernel settings, but perhaps some tuning of net.ipv4.tcp_* would help? Or any other ideas on how to prevent another buggy application from abusing consul like this?
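For the fd check, a minimal sketch of a script check in Go (the pid argument and thresholds are arbitrary assumptions; Consul script checks treat exit 0 as passing, 1 as warning, and any other exit code as critical):

```go
// fdcheck: a sketch of a host-level script check for fd usage.
// Usage: fdcheck <pid>  (e.g. the consul agent's pid)
package main

import (
	"fmt"
	"os"
)

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: fdcheck <pid>")
		os.Exit(2)
	}

	// Every entry in /proc/<pid>/fd is one open file descriptor.
	entries, err := os.ReadDir("/proc/" + os.Args[1] + "/fd")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read fd table:", err)
		os.Exit(2)
	}
	used := len(entries)

	const softLimit = 1024 // the soft limit from the original post
	switch {
	case used >= softLimit*9/10:
		fmt.Printf("CRITICAL: %d/%d fds in use\n", used, softLimit)
		os.Exit(2) // exit >1 reads as critical to Consul
	case used >= softLimit*3/4:
		fmt.Printf("WARNING: %d/%d fds in use\n", used, softLimit)
		os.Exit(1) // exit 1 reads as warning
	default:
		fmt.Printf("OK: %d/%d fds in use\n", used, softLimit)
	}
}
```

Registered with an interval against the agent's pid, something like this would flip the node to warning well before the limit is actually hit.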


-Dana

Michael Fischer

Jun 1, 2015, 1:44:34 PM
to Dana Powers, consul-tool
Is there some reason you are imposing a 1024-fd limit?

No matter how high you set it, a leaky application is eventually going
to exhaust the server's connection limits, so at the end of the day
you have to place the blame squarely where it lies, i.e., on the
client. Monitoring is of course helpful for detection, but that's
something you should write (and maybe share!).

On the Consul side, I'd suggest that a keepalive timeout would be
useful for the HTTP service, assuming one does not already exist -
Armon et al?
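Not Consul's actual code, but as a sketch of the idea in Go's net/http, an idle/keep-alive timeout on the API server would look something like this (the durations are made up):

```go
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	srv := &http.Server{
		Addr: "127.0.0.1:8500", // the agent's HTTP address from the post
		// Close keep-alive connections that sit idle between requests,
		// so a client that opens sockets and never reuses or closes
		// them cannot pin fds forever.
		IdleTimeout: 60 * time.Second,
		// Bound how long reading a request may take; this does not cut
		// off long-running responses.
		ReadTimeout: 30 * time.Second,
		Handler:     http.NotFoundHandler(), // stand-in handler
	}
	log.Fatal(srv.ListenAndServe())
}
```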

--Michael

Armon Dadgar

Jun 2, 2015, 9:06:52 AM
to Michael Fischer, Dana Powers, consul-tool
Hey,

In a case like this, there isn’t too much Consul can realistically do. We can’t time out TCP connections,
because we depend on very long-lived connections for the blocking queries. A client may be “waiting”
for an HTTP response for 10 minutes if there is no change to the data it is querying.
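To make the long-lived connection concrete, here is a sketch of the blocking-query pattern against the KV HTTP endpoint (the key name is hypothetical; the index/wait parameters and the X-Consul-Index header are the actual blocking mechanism):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	lastIndex := "0"
	for {
		// The request parks server-side for up to 10 minutes, holding
		// the TCP connection open, until the watched data changes.
		url := fmt.Sprintf(
			"http://localhost:8500/v1/kv/my-app/config?index=%s&wait=10m",
			lastIndex)
		resp, err := http.Get(url)
		if err != nil {
			panic(err)
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		lastIndex = resp.Header.Get("X-Consul-Index")
		fmt.Printf("index=%s body=%s\n", lastIndex, body)
	}
}
```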

We do know that Consul handles file-handle exhaustion very badly: it spins in several places and
also logs very rapidly. That is behavior we can improve and have been working on, but ultimately it
just doesn’t handle that situation well.

One of the simplest approaches is just to set the limit very high, as Michael suggests.
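For example, a process can raise its own soft limit up to the hard limit at startup (a Go sketch; in practice you would usually do this with ulimit -n or the init system before launching consul):

```go
package main

import (
	"fmt"
	"syscall"
)

func main() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		panic(err)
	}
	rl.Cur = rl.Max // raise the soft limit to the hard limit
	if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		panic(err)
	}
	fmt.Printf("fd limit: soft=%d hard=%d\n", rl.Cur, rl.Max)
}
```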

Best Regards,
Armon Dadgar

Michael Fischer

Jun 2, 2015, 11:25:32 AM
to Armon Dadgar, Dana Powers, consul-tool
On Tue, Jun 2, 2015 at 6:05 AM, Armon Dadgar <armon....@gmail.com> wrote:

> In a case like this, there isn’t too much Consul can realistically do.
> We can’t time out TCP connections, because we depend on very long-lived
> connections for the blocking queries. A client may be “waiting” for an
> HTTP response for 10 minutes if there is no change to the data it is
> querying.

Indeed, but the server can determine from the query string whether the
client wants to block, and for how long (if not indefinitely).
Statistically speaking, queries are more often than not non-blocking
anyway.

With that in mind, I'd suggest that a keepalive timeout is appropriate
for non-blocking queries, and even for blocking queries that carry a
request-specified timeout.
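Not an existing Consul feature, but sketched as net/http middleware, the idea might look like this (the parameter names follow Consul's blocking-query API; the default durations are arbitrary):

```go
package main

import (
	"net/http"
	"time"
)

// withQueryAwareTimeout gives non-blocking requests a short deadline and
// blocking requests (those carrying an index param) a deadline derived
// from their own wait parameter, plus slack for writing the response.
func withQueryAwareTimeout(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		q := r.URL.Query()
		timeout := 15 * time.Second // default for non-blocking queries
		if q.Get("index") != "" {
			wait := 10 * time.Minute // Consul's default maximum wait
			if d, err := time.ParseDuration(q.Get("wait")); err == nil {
				wait = d
			}
			timeout = wait + 30*time.Second
		}
		http.TimeoutHandler(next, timeout, "request timed out").ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/v1/kv/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok\n")) // stand-in for the real API handler
	})
	http.ListenAndServe("127.0.0.1:8500", withQueryAwareTimeout(mux))
}
```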

--Michael
