kafka connect mysteriously shutting down w/o any errors

2,435 views
Skip to first unread message

Andrew Xue

unread,
Dec 2, 2016, 8:06:00 PM12/2/16
to Confluent Platform
trying to run a connect cluster w/ a command like 

/confluent/confluent-3.0.0/bin/connect-distributed connect-json-distributed.properties

where connect-json-distributed.properties is the properties in etc/schema-registry/connect-avro-distributed.properties

in the logs i see this -- which seems like everything is doing ok? and then suddenly everything starts shutting down -- how would we be able to debug this further?

[2016-12-03 00:41:41,108] INFO Kafka version : 0.10.0.0-cp1 (org.apache.kafka.common.utils.AppInfoParser:83)
[2016-12-03 00:41:41,108] INFO Kafka commitId : 7aeb2e89dbc741f6 (org.apache.kafka.common.utils.AppInfoParser:84)
[2016-12-03 00:41:41,119] INFO Discovered coordinator analytics-kafka03.lumoslabs.com:9092 (id: 2147483645 rack: null) for group avro-connect-cluster. (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:505)
[2016-12-03 00:41:41,219] INFO Finished reading KafkaBasedLog for topic avro-connect-configs (org.apache.kafka.connect.util.KafkaBasedLog:148)
[2016-12-03 00:41:41,219] INFO Started KafkaBasedLog for topic avro-connect-configs (org.apache.kafka.connect.util.KafkaBasedLog:150)
[2016-12-03 00:41:41,220] INFO Started KafkaConfigBackingStore (org.apache.kafka.connect.storage.KafkaConfigBackingStore:261)
[2016-12-03 00:41:41,221] INFO Herder started (org.apache.kafka.connect.runtime.distributed.DistributedHerder:173)
[2016-12-03 00:41:41,323] INFO Discovered coordinator analytics-kafka03.lumoslabs.com:9092 (id: 2147483645 rack: null) for group avro-connect-cluster. (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:505)
[2016-12-03 00:41:41,324] INFO (Re-)joining group avro-connect-cluster (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:326)
[2016-12-03 00:41:41,344] INFO Successfully joined group avro-connect-cluster with generation 9 (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:434)
[2016-12-03 00:41:41,344] INFO Joined group and got assignment: Assignment{error=0, leader='connect-1-8f1081c8-a088-4c55-ae74-8decf73af1d5', leaderUrl='http://10.10.41.6:8083/', offset=-1, connectorIds=[], taskIds=[]} (org.apache.kafka.connect.runtime.distributed.DistributedHerder:1023)
[2016-12-03 00:41:41,345] INFO Starting connectors and tasks using config offset -1 (org.apache.kafka.connect.runtime.distributed.DistributedHerder:752)
[2016-12-03 00:41:41,345] INFO Finished starting connectors and tasks (org.apache.kafka.connect.runtime.distributed.DistributedHerder:769)
[2016-12-03 00:42:17,136] INFO Kafka Connect stopping (org.apache.kafka.connect.runtime.Connect:68)
[2016-12-03 00:42:17,137] INFO Stopping REST server (org.apache.kafka.connect.runtime.rest.RestServer:154)
[2016-12-03 00:42:17,139] INFO Stopped ServerConnector@4384b579{HTTP/1.1}{0.0.0.0:8083} (org.eclipse.jetty.server.ServerConnector:306)
[2016-12-03 00:42:17,144] INFO Stopped o.e.j.s.ServletContextHandler@3c8bdd5b{/,null,UNAVAILABLE} (org.eclipse.jetty.server.handler.ContextHandler:865)
[2016-12-03 00:42:17,147] INFO REST server stopped (org.apache.kafka.connect.runtime.rest.RestServer:165)
[2016-12-03 00:42:17,147] INFO Herder stopping (org.apache.kafka.connect.runtime.distributed.DistributedHerder:362)
[2016-12-03 00:42:17,147] INFO Stopping connectors and tasks that are still assigned to this worker. (org.apache.kafka.connect.runtime.distributed.DistributedHerder:334)
[2016-12-03 00:42:17,150] INFO Stopping KafkaBasedLog for topic avro-connect-statuses (org.apache.kafka.connect.util.KafkaBasedLog:154)
[2016-12-03 00:42:17,150] INFO Closing the Kafka producer with timeoutMillis = 9223372036854775807 ms. (org.apache.kafka.clients.producer.KafkaProducer:658)
[2016-12-03 00:42:17,152] INFO Stopped KafkaBasedLog for topic avro-connect-statuses (org.apache.kafka.connect.util.KafkaBasedLog:180)
[2016-12-03 00:42:17,152] INFO Closing KafkaConfigBackingStore (org.apache.kafka.connect.storage.KafkaConfigBackingStore:266)
[2016-12-03 00:42:17,153] INFO Stopping KafkaBasedLog for topic avro-connect-configs (org.apache.kafka.connect.util.KafkaBasedLog:154)
[2016-12-03 00:42:17,153] INFO Closing the Kafka producer with timeoutMillis = 9223372036854775807 ms. (org.apache.kafka.clients.producer.KafkaProducer:658)
[2016-12-03 00:42:17,155] INFO Stopped KafkaBasedLog for topic avro-connect-configs (org.apache.kafka.connect.util.KafkaBasedLog:180)
[2016-12-03 00:42:17,156] INFO Closed KafkaConfigBackingStore (org.apache.kafka.connect.storage.KafkaConfigBackingStore:268)
[2016-12-03 00:42:17,156] INFO Worker stopping (org.apache.kafka.connect.runtime.Worker:128)
[2016-12-03 00:42:17,156] WARN Shutting down tasks [] uncleanly; herder should have shut down tasks before the Worker is stopped. (org.apache.kafka.connect.runtime.Worker:141)
[2016-12-03 00:42:17,156] INFO Stopping KafkaOffsetBackingStore (org.apache.kafka.connect.storage.KafkaOffsetBackingStore:92)
[2016-12-03 00:42:17,156] INFO Stopping KafkaBasedLog for topic avro-connect-offsets (org.apache.kafka.connect.util.KafkaBasedLog:154)
[2016-12-03 00:42:17,157] INFO Closing the Kafka producer with timeoutMillis = 9223372036854775807 ms. (org.apache.kafka.clients.producer.KafkaProducer:658)
[2016-12-03 00:42:17,163] INFO Stopped KafkaBasedLog for topic avro-connect-offsets (org.apache.kafka.connect.util.KafkaBasedLog:180)
[2016-12-03 00:42:17,163] INFO Stopped KafkaOffsetBackingStore (org.apache.kafka.connect.storage.KafkaOffsetBackingStore:94)
[2016-12-03 00:42:17,163] INFO Worker stopped (org.apache.kafka.connect.runtime.Worker:151)

-- log ends --

Andrew Xue

unread,
Dec 2, 2016, 8:09:54 PM12/2/16
to Confluent Platform
more context: we are running this in a docker container as an EC2 service with AWS ALB in front of it. 

Ewen Cheslack-Postava

unread,
Dec 4, 2016, 11:43:01 PM12/4/16
to Confluent Platform
There aren't that many paths that call the method that logs this:

[2016-12-03 00:42:17,136] INFO Kafka Connect stopping (org.apache.kafka.connect.runtime.Connect:68)

You appear to be running distributed mode, so the only two paths are 1) through ConnectDistributed failing to startup properly (which would have logged another message starting with "Failed to start Connect") and 2) through a shutdown hook, i.e. because something is sending a SIGINT interrupt. I see you're on 0.10.0.0-cp1, so I'm pretty sure none of this has changed.

Is it possible you're getting a SIGINT to the process somehow that would trigger this shutdown? (We should probably improve logging here so the cause of the shutdown is crystal clear.)

-Ewen

--
You received this message because you are subscribed to the Google Groups "Confluent Platform" group.
To unsubscribe from this group and stop receiving emails from it, send an email to confluent-platform+unsub...@googlegroups.com.
To post to this group, send email to confluent-platform@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/confluent-platform/6250648f-82db-4d79-ab6c-8f325a7eb1fe%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.



--
Thanks,
Ewen

Andrew Xue

unread,
Dec 5, 2016, 7:34:57 PM12/5/16
to Confluent Platform
yea this turned out to be a mistake in the configs. we had the container port set to 8081 and forgot to set the rest.port correctly (you can see in the logs it still has the default 8083). and that caused the ALB to kill the rest service or something? not totally sure on exactly what happened. 

i fixed the port mistake but its still dying silently. i assume there must be some other rest configurations that are wrong on my side. 
To post to this group, send email to confluent...@googlegroups.com.



--
Thanks,
Ewen

Andrew Xue

unread,
Dec 5, 2016, 8:50:31 PM12/5/16
to Confluent Platform
ok -- so the issue was this. we had a health check which relied on a good response from /connectors ... which kept failing. i changed the health check to simply look at /

however, now i guess i have a separate issue, which is while the cluster is up and responsive to /

* Adding handle: conn: 0x7fef44003000
* Adding handle: send: 0
* Adding handle: recv: 0
* Curl_addHandleToPipeline: length: 1
* - Conn 0 (0x7fef44003000) send_pipe: 1, recv_pipe: 0
* About to connect() to kafka-connect-avro.lumoslabs.com port 80 (#0)
*   Trying 52.21.151.172...
* Connected to kafka-connect-avro.lumoslabs.com (52.21.151.172) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.30.0
> Accept: */*
>
< HTTP/1.1 200 OK
* Server nginx/1.11.6 is not blacklisted
< Server: nginx/1.11.6
< Date: Tue, 06 Dec 2016 01:47:28 GMT
< Content-Type: application/json
< Content-Length: 54
< Connection: keep-alive
<
* Connection #0 to host kafka-connect-avro.lumoslabs.com left intact
{"version":"0.10.0.0-cp1","commit":"7aeb2e89dbc741f6"}

it is unresponsive to any other rest calls:

* Adding handle: conn: 0x7f9f63803000
* Adding handle: send: 0
* Adding handle: recv: 0
* Curl_addHandleToPipeline: length: 1
* - Conn 0 (0x7f9f63803000) send_pipe: 1, recv_pipe: 0
* About to connect() to kafka-connect-avro.lumoslabs.com port 80 (#0)
*   Trying 52.21.151.172...
* Connected to kafka-connect-avro.lumoslabs.com (52.21.151.172) port 80 (#0)
> GET /connectors/ HTTP/1.1
> User-Agent: curl/7.30.0
> Accept: */*
>
< HTTP/1.1 504 Gateway Time-out
* Server nginx/1.11.6 is not blacklisted
< Server: nginx/1.11.6
< Date: Tue, 06 Dec 2016 01:46:24 GMT
< Content-Type: text/html
< Content-Length: 181
< Connection: keep-alive
<
<html>
<head><title>504 Gateway Time-out</title></head>
<body bgcolor="white">
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>awselb/2.0</center>
</body>
</html>
* Connection #0 to host kafka-connect-avro.lumoslabs.com left intact

FWIW, I should have 1 connect in there (from logs):

[2016-12-06 01:36:17,429] INFO Finished reading to end of log and updated config snapshot, new config log offset: 1 (org.apache.kafka.connect.runtime.distributed.DistributedHerder:734)
[2016-12-06 01:36:17,432] INFO Starting connectors and tasks using config offset 1 (org.apache.kafka.connect.runtime.distributed.DistributedHerder:752)
[2016-12-06 01:36:17,433] INFO Starting connector eaf-insights-dynamodb-sink (org.apache.kafka.connect.runtime.distributed.DistributedHerder:783)
[2016-12-06 01:36:17,434] INFO ConnectorConfig values:
connector.class = dynamok.sink.DynamoDbSinkConnector
tasks.max = 1
name = eaf-insights-dynamodb-sink
 (org.apache.kafka.connect.runtime.ConnectorConfig:178)

Andrew Xue

unread,
Dec 6, 2016, 6:00:40 PM12/6/16
to Confluent Platform
ok -- i think i resolved this issue. i was installing a connect plugin incorrectly which was causing this strange behavior. 
Reply all
Reply to author
Forward
0 new messages