HELP! "Too many open files" errors fill logs and crash server


Mike Sparr
Sep 14, 2016, 2:19:07 PM
to Confluent Platform
I have a 5-node Kafka cluster sharing hardware with Cassandra 2.2.x: 64GB RAM, 960GB SSD, 8 cores, bare metal, on Ubuntu 14.04.x LTS.

Running Kafka 0.10.0.1 (the latest with the Confluent Platform 3.0.1 update), and only a single node fails, consistently, when I start an application on the app servers. The app is a typical ETL pipeline that consumes real estate data from various servers and stores the raw content (converted to JSON objects) in a topic per source. Python stream processors built on confluent-kafka-python (latest librdkafka) then consume messages from those topics, manipulate the data, and publish to other topics along the chain. One processor compares each raw record to any existing record in Cassandra, reduces it to the diffs, and publishes the diffs for further processing. The end topic is consumed by Logstash -> Elasticsearch.
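Each processor follows roughly this shape (a minimal sketch: the broker, topic, and group names are placeholders, and transform() stands in for the actual per-processor manipulation):

from confluent_kafka import Consumer, KafkaError, Producer

def transform(raw):
    # placeholder for the application-specific manipulation
    return raw

consumer = Consumer({
    'bootstrap.servers': 'queue1:9092,queue2:9092',  # placeholder brokers
    'group.id': 'example-processor',                 # placeholder group
    'default.topic.config': {'auto.offset.reset': 'smallest'},
})
consumer.subscribe(['source.topic'])                 # placeholder topic

producer = Producer({'bootstrap.servers': 'queue1:9092,queue2:9092'})

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None:
        continue
    if msg.error():
        if msg.error().code() == KafkaError._PARTITION_EOF:
            continue  # reached end of partition; keep polling
        raise SystemExit(msg.error())
    producer.produce('next.topic', transform(msg.value()))  # placeholder topic
    producer.poll(0)  # serve delivery callbacks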

Given issues with consumers missing some messages (I reported the librdkafka issue to Magnus), I had to limit each topic to 1 partition. Some topics hold 1-3 million messages, and one holds 10 million+. Each partition has 2 replicas, and several topics use key compaction.
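For reference, each topic was created along these lines (the topic name here is just an example):

/usr/bin/kafka-topics --zookeeper queue1:2181 --create --topic rets.example.Source --partitions 1 --replication-factor 2 --config cleanup.policy=compact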

For several days I've been struggling with logs filling up with the following messages, on a single node only:

[2016-09-14 11:46:26,674] ERROR Error while accepting connection (kafka.network.Acceptor)
java.io.IOException: Too many open files
        at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
        at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
        at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
        at kafka.network.Acceptor.accept(SocketServer.scala:323)
        at kafka.network.Acceptor.run(SocketServer.scala:268)
        at java.lang.Thread.run(Thread.java:745)



I've used netstat and other tools to check whether this is socket-related, and so far everything appears far below the 100,000 open-file limit set for BOTH root and * in /etc/security/limits.conf.
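One caveat I'm aware of: limits.conf is only applied to PAM login sessions, so a daemon launched from init (e.g. by Supervisor) may never see those values. The effective limit of the running broker can be read from /proc, something like:

grep 'open files' /proc/$(pgrep -f kafka.Kafka)/limits

(assuming a single broker JVM on the box, whose command line contains kafka.Kafka).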


I've dumped my lsof output and sorted it; the "java" process is the biggest offender, but it's hard to tell whether that's Cassandra or Kafka. I even removed that Cassandra node from the cluster and left it disabled, and Kafka still failed on that node, so I believe I've ruled Cassandra out. It's unclear how to diagnose or find the cause, and this is delaying our go-live with the new Kafka-based stream pipeline architecture (a complete rewrite of our back-end architecture/apps).
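To attribute descriptors to a specific PID rather than a command name, something like this over the dump may help:

lsof -n | awk '{print $2, $1}' | sort | uniq -c | sort -rn | head

and the top PIDs can then be matched against the Kafka and Cassandra processes.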


Here are my lsof dumps:


http://hive.goomzee.com/lsof_logs.tar.gz



-------------

These errors occur when starting the Python-based processors on the app server, which use confluent-kafka-python. Each of the approximately 30 scripts instantiates a Consumer that polls a topic for new messages and a Producer that sends the edited message to the next topic in the chain. The Consumer/Producer classes were built with the confluent-kafka-python library, which wraps librdkafka.
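Back-of-the-envelope: librdkafka keeps a TCP connection to each broker it talks to, so roughly 30 scripts x 2 clients x up to 5 brokers is on the order of 300 connections per app server, each of which is also an open descriptor on the broker side. A rough check on a broker:

netstat -tan | grep ':9092' | grep -c ESTABLISHED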

Configs attached to this ticket.


-------------

I have set the following in /etc/security/limits.conf:

* soft nofile 100000
* hard nofile 100000
root soft nofile 100000
root hard nofile 100000
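(A fresh login shell should now report those values:

ulimit -Hn; ulimit -Sn

though that only proves the PAM path, not what a daemon started from init actually gets.)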

----------------

I start the Kafka server (and ZooKeeper, etc.) using Supervisor with the following configuration (a template for an Ansible playbook with Jinja variables):



[program:kafka-server]
command=/usr/bin/kafka-server-start {{ kafka_config_dir }}/server.properties
stdout_logfile=/var/log/supervisor/kafka.log
redirect_stderr=true
autostart=true
;directory=
stopsignal=QUIT
;user=kafka
environment=KAFKA_LOG4J_OPTS="-Dlog4j.configuration=file:{{ kafka_config_dir }}/log4j.properties"
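One thing I haven't verified: supervisord itself is started by init rather than through a PAM session, so it may never pick up limits.conf, and its children inherit supervisord's own descriptor limit. supervisord does have a minfds option in the [supervisord] section of supervisord.conf that raises that limit, e.g.:

[supervisord]
minfds=100000

It may be worth checking whether that's needed here.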



[Attached: Screen Shot 2016-09-13 at 2.24.40 PM.png, Screen Shot 2016-09-13 at 2.24.27 PM.png]

Mike Sparr
Sep 15, 2016, 11:12:09 AM
to Confluent Platform
UPDATE 2:

The errors have now moved to another node, so singling out the original node was a false positive. Oddly, I see the errors and then the server appears to restart (I'm unsure whether it exits and Supervisor restarts it, or whether the restart is built in). Any ideas how to find the culprit beyond what I've already done?
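In the meantime I'm watching descriptor growth on the broker, to tell a slow leak from a sudden spike when the app servers start:

watch -n 5 'ls /proc/$(pgrep -f kafka.Kafka)/fd | wc -l'

(again assuming a single broker JVM on the box).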

--


[2016-09-15 09:07:25,873] ERROR Error while accepting connection (kafka.network.Acceptor)
java.io.IOException: Too many open files
        at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
        at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
        at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
        at kafka.network.Acceptor.accept(SocketServer.scala:323)
        at kafka.network.Acceptor.run(SocketServer.scala:268)
        at java.lang.Thread.run(Thread.java:745)
[2016-09-15 09:07:27,887] INFO KafkaConfig values:
        advertised.host.name = null
        metric.reporters = []
        quota.producer.default = 9223372036854775807
        offsets.topic.num.partitions = 50
        log.flush.interval.messages = 9223372036854775807
        auto.create.topics.enable = true
        controller.socket.timeout.ms = 30000
        log.flush.interval.ms = null
        principal.builder.class = class org.apache.kafka.common.security.auth.DefaultPrincipalBuilder
        replica.socket.receive.buffer.bytes = 65536
        min.insync.replicas = 1
        replica.fetch.wait.max.ms = 500
        num.recovery.threads.per.data.dir = 1
        ssl.keystore.type = JKS
        sasl.mechanism.inter.broker.protocol = GSSAPI
        default.replication.factor = 1
        ssl.truststore.password = null
        log.preallocate = false
        sasl.kerberos.principal.to.local.rules = [DEFAULT]
        fetch.purgatory.purge.interval.requests = 1000
        ssl.endpoint.identification.algorithm = null
        replica.socket.timeout.ms = 30000
        message.max.bytes = 1000012
        num.io.threads = 8
        offsets.commit.required.acks = -1
        log.flush.offset.checkpoint.interval.ms = 60000
        delete.topic.enable = true
        quota.window.size.seconds = 1
        ssl.truststore.type = JKS
        offsets.commit.timeout.ms = 5000
        quota.window.num = 11
        zookeeper.connect = queue1:2181,queue2:2181,queue3:2181,queue4:2181
        authorizer.class.name = 
        num.replica.fetchers = 1
        log.retention.ms = null
        log.roll.jitter.hours = 0
        log.cleaner.enable = true
        offsets.load.buffer.size = 5242880
        log.cleaner.delete.retention.ms = 86400000
        ssl.client.auth = none
        controlled.shutdown.max.retries = 3
        queued.max.requests = 500
        offsets.topic.replication.factor = 3
        log.cleaner.threads = 1
        sasl.kerberos.service.name = null
        sasl.kerberos.ticket.renew.jitter = 0.05
        socket.request.max.bytes = 104857600
        ssl.trustmanager.algorithm = PKIX
        zookeeper.session.timeout.ms = 6000
        log.retention.bytes = -1
        log.message.timestamp.type = CreateTime
        sasl.kerberos.min.time.before.relogin = 60000
        zookeeper.set.acl = false
        connections.max.idle.ms = 600000
        offsets.retention.minutes = 1440
        replica.fetch.backoff.ms = 1000
        inter.broker.protocol.version = 0.10.0-IV1
        log.retention.hours = 336
        num.partitions = 1
        broker.id.generation.enable = true
        listeners = PLAINTEXT://:9092
        ssl.provider = null
        ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
        log.roll.ms = null
        log.flush.scheduler.interval.ms = 9223372036854775807
        ssl.cipher.suites = null
        log.index.size.max.bytes = 10485760
        ssl.keymanager.algorithm = SunX509
        security.inter.broker.protocol = PLAINTEXT
        replica.fetch.max.bytes = 1048576
        advertised.port = null
        log.cleaner.dedupe.buffer.size = 134217728
        replica.high.watermark.checkpoint.interval.ms = 5000
        log.cleaner.io.buffer.size = 524288
        sasl.kerberos.ticket.renew.window.factor = 0.8
        zookeeper.connection.timeout.ms = 6000
        controlled.shutdown.retry.backoff.ms = 5000
        log.roll.hours = 168
        log.cleanup.policy = delete
        host.name = db4
        log.roll.jitter.ms = null
        max.connections.per.ip = 2147483647
        offsets.topic.segment.bytes = 104857600
        background.threads = 10
        quota.consumer.default = 9223372036854775807
        request.timeout.ms = 30000
        log.message.format.version = 0.10.0-IV1
        log.index.interval.bytes = 4096
        log.dir = /tmp/kafka-logs
        log.segment.bytes = 1073741824
        log.cleaner.backoff.ms = 15000
        offset.metadata.max.bytes = 4096
        ssl.truststore.location = null
        group.max.session.timeout.ms = 300000
        ssl.keystore.password = null
        zookeeper.sync.time.ms = 2000
        port = 9092
        log.retention.minutes = null
        log.segment.delete.delay.ms = 60000
        log.dirs = /var/lib/kafka
        controlled.shutdown.enable = true
        compression.type = producer
        max.connections.per.ip.overrides = 
        log.message.timestamp.difference.max.ms = 9223372036854775807
        sasl.kerberos.kinit.cmd = /usr/bin/kinit
        log.cleaner.io.max.bytes.per.second = 1.7976931348623157E308
        auto.leader.rebalance.enable = true
        leader.imbalance.check.interval.seconds = 300
        log.cleaner.min.cleanable.ratio = 0.5
        replica.lag.time.max.ms = 10000
        num.network.threads = 3
        ssl.key.password = null
        reserved.broker.max.id = 1000
        metrics.num.samples = 2
        socket.send.buffer.bytes = 102400
        ssl.protocol = TLS
        socket.receive.buffer.bytes = 102400
        ssl.keystore.location = null
        replica.fetch.min.bytes = 1
        broker.rack = null
        unclean.leader.election.enable = true
        sasl.enabled.mechanisms = [GSSAPI]
        group.min.session.timeout.ms = 6000
        log.cleaner.io.buffer.load.factor = 0.9
        offsets.retention.check.interval.ms = 600000
        producer.purgatory.purge.interval.requests = 1000
        metrics.sample.window.ms = 30000
        broker.id = 4
        offsets.topic.compression.codec = 0
        log.retention.check.interval.ms = 300000
        advertised.listeners = null
        leader.imbalance.per.broker.percentage = 10
 (kafka.server.KafkaConfig)

[2016-09-15 09:07:27,932] WARN The support metrics collection feature ("Metrics") of Proactive Support is disabled. (io.confluent.support.metrics.SupportedServerStartable)
[2016-09-15 09:07:27,932] INFO starting (kafka.server.KafkaServer)
[2016-09-15 09:07:27,938] INFO Connecting to zookeeper on queue1:2181,queue2:2181,queue3:2181,queue4:2181 (kafka.server.KafkaServer)
[2016-09-15 09:07:27,947] INFO Starting ZkClient event thread. (org.I0Itec.zkclient.ZkEventThread)
[2016-09-15 09:07:27,952] INFO Client environment:zookeeper.version=3.4.6-1569965, built on 02/20/2014 09:09 GMT (org.apache.zookeeper.ZooKeeper)
[2016-09-15 09:07:27,952] INFO Client environment:host.name=queue4 (org.apache.zookeeper.ZooKeeper)
[2016-09-15 09:07:27,952] INFO Client environment:java.version=1.8.0_101 (org.apache.zookeeper.ZooKeeper)
[2016-09-15 09:07:27,952] INFO Client environment:java.vendor=Oracle Corporation (org.apache.zookeeper.ZooKeeper)
[2016-09-15 09:07:27,952] INFO Client environment:java.home=/usr/lib/jvm/java-8-oracle/jre (org.apache.zookeeper.ZooKeeper)

[2016-09-15 09:07:27,952] INFO Client environment:java.class.path=:/usr/bin/../share/java/kafka/jackson-databind-2.6.3.jar:/usr/bin/../share/java/kafka/jetty-security-9.2.15.v20160210.jar:/usr/bin/../share/java/kafka/commons-digester-1.8.1.jar:/usr/bin/../share/java/kafka/zkclient-0.8.jar:/usr/bin/../share/java/kafka/jersey-container-servlet-core-2.22.2.jar:/usr/bin/../share/java/kafka/jackson-annotations-2.6.0.jar:/usr/bin/../share/java/kafka/support-metrics-client-3.0.1.jar:/usr/bin/../share/java/kafka/lz4-1.3.0.jar:/usr/bin/../share/java/kafka/jersey-guava-2.22.2.jar:/usr/bin/../share/java/kafka/aopalliance-repackaged-2.4.0-b34.jar:/usr/bin/../share/java/kafka/jetty-servlet-9.2.15.v20160210.jar:/usr/bin/../share/java/kafka/jetty-io-9.2.15.v20160210.jar:/usr/bin/../share/java/kafka/kafka.jar:/usr/bin/../share/java/kafka/commons-collections-3.2.1.jar:/usr/bin/../share/java/kafka/kafka_2.11-0.10.0.1-cp1-test-sources.jar:/usr/bin/../share/java/kafka/jetty-servlets-9.2.15.v20160210.jar:/usr/bin/../share/java/kafka/slf4j-api-1.7.21.jar:/usr/bin/../share/java/kafka/scala-library-2.11.8.jar:/usr/bin/../share/java/kafka/commons-logging-1.2.jar:/usr/bin/../share/java/kafka/jersey-container-servlet-2.22.2.jar:/usr/bin/../share/java/kafka/commons-beanutils-1.8.3.jar:/usr/bin/../share/java/kafka/commons-codec-1.9.jar:/usr/bin/../share/java/kafka/javax.inject-1.jar:/usr/bin/../share/java/kafka/jersey-common-2.22.2.jar:/usr/bin/../share/java/kafka/httpcore-4.4.3.jar:/usr/bin/../share/java/kafka/osgi-resource-locator-1.0.1.jar:/usr/bin/../share/java/kafka/connect-runtime-0.10.0.1-cp1.jar:/usr/bin/../share/java/kafka/jackson-mapper-asl-1.9.13.jar:/usr/bin/../share/java/kafka/kafka-streams-examples-0.10.0.1-cp1.jar:/usr/bin/../share/java/kafka/avro-1.7.7.jar:/usr/bin/../share/java/kafka/jackson-module-jaxb-annotations-2.6.3.jar:/usr/bin/../share/java/kafka/kafka-streams-0.10.0.1-cp1.jar:/usr/bin/../share/java/kafka/jackson-core-2.6.3.jar:/usr/bin/../share/java/kafka/commons-validator-1.4.1.jar:/usr/bin/../share/java/kafka/xz-1.0.jar:/usr/bin/../share/java/kafka/jersey-media-jaxb-2.22.2.jar:/usr/bin/../share/java/kafka/jopt-simple-4.9.jar:/usr/bin/../share/java/kafka/connect-api-0.10.0.1-cp1.jar:/usr/bin/../share/java/kafka/jetty-server-9.2.15.v20160210.jar:/usr/bin/../share/java/kafka/hk2-utils-2.4.0-b34.jar:/usr/bin/../share/java/kafka/paranamer-2.3.jar:/usr/bin/../share/java/kafka/jetty-http-9.2.15.v20160210.jar:/usr/bin/../share/java/kafka/kafka_2.11-0.10.0.1-cp1-scaladoc.jar:/usr/bin/../share/java/kafka/javax.annotation-api-1.2.jar:/usr/bin/../share/java/kafka/validation-api-1.1.0.Final.jar:/usr/bin/../share/java/kafka/kafka_2.11-0.10.0.1-cp1-javadoc.jar:/usr/bin/../share/java/kafka/rocksdbjni-4.8.0.jar:/usr/bin/../share/java/kafka/jersey-server-2.22.2.jar:/usr/bin/../share/java/kafka/slf4j-log4j12-1.7.21.jar:/usr/bin/../share/java/kafka/connect-file-0.10.0.1-cp1.jar:/usr/bin/../share/java/kafka/commons-lang3-3.1.jar:/usr/bin/../share/java/kafka/snappy-java-1.1.2.6.jar:/usr/bin/../share/java/kafka/jetty-util-9.2.15.v20160210.jar:/usr/bin/../share/java/kafka/jackson-core-asl-1.9.13.jar:/usr/bin/../share/java/kafka/scala-parser-combinators_2.11-1.0.4.jar:/usr/bin/../share/java/kafka/kafka_2.11-0.10.0.1-cp1.jar:/usr/bin/../share/java/kafka/kafka-clients-0.10.0.1-cp1.jar:/usr/bin/../share/java/kafka/jackson-jaxrs-json-provider-2.6.3.jar:/usr/bin/../share/java/kafka/httpmime-4.5.1.jar:/usr/bin/../share/java/kafka/guava-18.0.jar:/usr/bin/../share/java/kafka/kafka_2.11-0.10.0.1-cp1-test.jar:/usr/bin/../share
/java/kafka/jetty-continuation-9.2.15.v20160210.jar:/usr/bin/../share/java/kafka/argparse4j-0.5.0.jar:/usr/bin/../share/java/kafka/connect-json-0.10.0.1-cp1.jar:/usr/bin/../share/java/kafka/log4j-1.2.17.jar:/usr/bin/../share/java/kafka/javassist-3.18.2-GA.jar:/usr/bin/../share/java/kafka/jackson-jaxrs-base-2.6.3.jar:/usr/bin/../share/java/kafka/javax.inject-2.4.0-b34.jar:/usr/bin/../share/java/kafka/metrics-core-2.2.0.jar:/usr/bin/../share/java/kafka/kafka_2.11-0.10.0.1-cp1-sources.jar:/usr/bin/../share/java/kafka/jersey-client-2.22.2.jar:/usr/bin/../share/java/kafka/reflections-0.9.10.jar:/usr/bin/../share/java/kafka/hk2-locator-2.4.0-b34.jar:/usr/bin/../share/java/kafka/httpclient-4.5.1.jar:/usr/bin/../share/java/kafka/javax.servlet-api-3.1.0.jar:/usr/bin/../share/java/kafka/support-metrics-common-3.0.1.jar:/usr/bin/../share/java/kafka/commons-compress-1.4.1.jar:/usr/bin/../share/java/kafka/kafka-tools-0.10.0.1-cp1.jar:/usr/bin/../share/java/kafka/javax.ws.rs-api-2.0.1.jar:/usr/bin/../share/java/kafka/kafka-log4j-appender-0.10.0.1-cp1.jar:/usr/bin/../share/java/kafka/hk2-api-2.4.0-b34.jar:/usr/bin/../share/java/kafka/zookeeper-3.4.6.jar:/usr/bin/../share/java/confluent-support-metrics/support-metrics-fullcollector-3.0.1.jar:/usr/share/java/confluent-support-metrics/support-metrics-fullcollector-3.0.1.jar (org.apache.zookeeper.ZooKeeper)

[2016-09-15 09:07:27,952] INFO Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib (org.apache.zookeeper.ZooKeeper)
[2016-09-15 09:07:27,952] INFO Client environment:java.io.tmpdir=/tmp (org.apache.zookeeper.ZooKeeper)
[2016-09-15 09:07:27,952] INFO Client environment:java.compiler=<NA> (org.apache.zookeeper.ZooKeeper)
[2016-09-15 09:07:27,952] INFO Client environment:os.name=Linux (org.apache.zookeeper.ZooKeeper)
[2016-09-15 09:07:27,952] INFO Client environment:os.arch=amd64 (org.apache.zookeeper.ZooKeeper)
[2016-09-15 09:07:27,952] INFO Client environment:os.version=3.19.0-68-generic (org.apache.zookeeper.ZooKeeper)
[2016-09-15 09:07:27,952] INFO Client environment:user.name=root (org.apache.zookeeper.ZooKeeper)
[2016-09-15 09:07:27,952] INFO Client environment:user.home=/root (org.apache.zookeeper.ZooKeeper)
[2016-09-15 09:07:27,953] INFO Client environment:user.dir=/ (org.apache.zookeeper.ZooKeeper)
[2016-09-15 09:07:27,953] INFO Initiating client connection, connectString=queue1:2181,queue2:2181,queue3:2181,queue4:2181 sessionTimeout=6000 watcher=org.I0Itec.zkclient.ZkClient@5c30a9b0 (org.apache.zookeeper.ZooKeeper)
[2016-09-15 09:07:27,965] INFO Waiting for keeper state SyncConnected (org.I0Itec.zkclient.ZkClient)
[2016-09-15 09:07:27,968] INFO Opening socket connection to server queue4/10.0.1.24:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2016-09-15 09:07:28,040] INFO Socket connection established to queue4/10.0.1.24:2181, initiating session (org.apache.zookeeper.ClientCnxn)
[2016-09-15 09:07:28,047] INFO Session establishment complete on server queue4/10.0.1.24:2181, sessionid = 0x4572ae031840006, negotiated timeout = 6000 (org.apache.zookeeper.ClientCnxn)
[2016-09-15 09:07:28,049] INFO zookeeper state changed (SyncConnected) (org.I0Itec.zkclient.ZkClient)
[2016-09-15 09:07:28,611] INFO Loading logs. (kafka.log.LogManager)
[2016-09-15 09:07:28,640] WARN Found a corrupted index file, /var/lib/kafka/social.activitystream.UserEvents-2/00000000000000000000.index, deleting and rebuilding index... (kafka.log.Log)
[2016-09-15 09:07:28,646] INFO Recovering unflushed segment 0 in log social.activitystream.UserEvents-2. (kafka.log.Log)
[2016-09-15 09:07:28,648] INFO Completed load of log social.activitystream.UserEvents-2 with log end offset 0 (kafka.log.Log)
[2016-09-15 09:07:28,653] WARN Found a corrupted index file, /var/lib/kafka/social.activitystream.UserEvents-1/00000000000000000000.index, deleting and rebuilding index... (kafka.log.Log)
[2016-09-15 09:07:28,653] INFO Recovering unflushed segment 0 in log social.activitystream.UserEvents-1. (kafka.log.Log)
[2016-09-15 09:07:28,654] INFO Completed load of log social.activitystream.UserEvents-1 with log end offset 0 (kafka.log.Log)
[2016-09-15 09:07:28,661] WARN Found a corrupted index file, /var/lib/kafka/rets.gacar.OpenHouse-0/00000000000000003653.index, deleting and rebuilding index... (kafka.log.Log)




It will run for a while, and then the log starts filling up with "Too many open files" errors and we crash. Any ideas?

Noam Berman
Nov 8, 2016, 4:51:34 PM
to Confluent Platform
Hi, 
I have the same issue on a 5-node cluster running Kafka 0.10.1.0.
Have you found a solution to this problem yet?

Thanks,
Noam

Michael Sparr
Nov 8, 2016, 4:58:28 PM
to confluent...@googlegroups.com
We reduced to a 3-node cluster and the problem went away. We had 5 nodes, but one wasn't online (so we were really running on 4). I haven't confirmed it yet, but my suspicion is that ZooKeeper, which runs on the same boxes, had an even number of nodes (4) and may have been the culprit; since reducing to 3 nodes we've had no issues for 45+ days.

At some point I plan to move ZooKeeper to the app server cluster and then add back 2 more data nodes for Kafka, but for now we've left it at 3 nodes.

I hope that helps. I'd reduce ZooKeeper to 3 nodes max if you haven't already, and make sure you run an odd number (ZooKeeper needs a strict majority to form a quorum, so a 4-node ensemble tolerates no more failures than a 3-node one). Then play with configs if necessary, one at a time.






Noam Berman
Nov 8, 2016, 5:29:44 PM
to Confluent Platform
Thanks! 
