Lots of errors with kafka-connect REST API and reconfiguration


Barry Kaplan

Jun 15, 2016, 10:04:43 PM
to Confluent Platform
I am seeing lots of errors when trying to GET/PUT/POST configurations. About 50% of the time I get a connection refused error.

e.g., from Ansible:

failed: [10.0.136.98] => {"failed": true}
msg: Socket error: [Errno 111] Connection refused to http://connect-elasticsearch-indexer.service.consul:31099/connectors/elasticsearch-indexer/config
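
For reference, this is roughly the call the playbook is making -- a minimal Python sketch (assuming the plain requests library) against the same Consul URL from the error above:

import requests

# Same endpoint the Ansible task hits; a connection refused shows up here
# as a ConnectionError before any HTTP response comes back.
url = ("http://connect-elasticsearch-indexer.service.consul:31099"
       "/connectors/elasticsearch-indexer/config")

try:
    resp = requests.get(url, timeout=5)
    print(resp.status_code, resp.json())
except requests.exceptions.ConnectionError as e:
    print("connection refused / unreachable:", e)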

And when I can successfully GET and then POST a new configuration, errors like these appear in the logs:

ERROR Unexpected error during connector task reconfiguration:
ERROR Failed to reconfigure connector's tasks, retrying after backoff:
ERROR Request to leader to reconfigure connector tasks failed
ERROR Task reconfiguration for elasticsearch-indexer failed unexpectedly, this connector will not be properly reconfigured unless manually triggered.
ERROR IO error forwarding REST request:

In this scenario I am running two processes (via Marathon/Mesos) on two different slaves.

Other strange behaviors:

If, after the above, I manually GET the configuration via /connectors/elasticsearch-indexer/config, the result shows all the topics I PUT above -- this is true for both processes. But when I look at the tasks via /connectors/elasticsearch-indexer/tasks, each task is still using the previous set of topics, without reporting any errors.
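
Roughly how I'm comparing the two endpoints (a Python sketch with requests; I'm assuming each entry returned by /tasks carries its own config map with a topics field, which is what I see in the responses here):

import requests

base = "http://connect-elasticsearch-indexer.service.consul:31099"
name = "elasticsearch-indexer"

connector_cfg = requests.get(base + "/connectors/" + name + "/config").json()
tasks = requests.get(base + "/connectors/" + name + "/tasks").json()

print("connector config topics:", connector_cfg.get("topics"))
for task in tasks:
    # each task entry has an "id" and its own "config" map
    print("task", task["id"], "topics:", task["config"].get("topics"))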

After restarting the process, the tasks have the full set of topics.

Logging is very flaky. If there is any kind of load, or if any error is ever emitted, logs no longer get written. The tasks will continue to do work, and log statements will continue to be called, but there is no more output. It's hard to imagine what could cause this, as slf4j/logback is usually rock solid. This seems to only happen with a FileAppender; when I use a console appender I don't see it. The kafka-connect processes are running in Docker containers, writing the logs to a mounted volume. But this is how all our processes work, and we've never seen logging just stop.


Barry Kaplan

Jun 15, 2016, 10:09:33 PM
to Confluent Platform
I also notice that even after restarting all processes (3 now) only one uses any CPU -- the other two are idle. The configuration has about 100 topics.

Barry Kaplan

Jun 16, 2016, 4:47:26 AM
to Confluent Platform
It seems that only one of my processes actually does work, and only the same process will accept REST commands. 

I can PUT the configuration to the process that is working and then some other process seems to take over. At that point only that new process that is now doing work will accept any more REST commands. 

Ewen Cheslack-Postava

Jun 16, 2016, 10:04:12 PM
to Confluent Platform
This definitely sounds odd. You should be able to hit the REST API to get the status of connectors and tasks to find out where they are running. Are all the workers listed at any given time when you scan through all connectors & tasks?
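
Something along these lines should show it (a rough sketch with Python/requests; the /connectors/<name>/status endpoint reports a worker_id for the connector and for each task):

import requests

base = "http://localhost:8083"  # any worker should answer GETs

for name in requests.get(base + "/connectors").json():
    status = requests.get(base + "/connectors/" + name + "/status").json()
    print(name, "connector is", status["connector"]["state"],
          "on", status["connector"]["worker_id"])
    for task in status["tasks"]:
        print("  task", task["id"], task["state"], "on", task["worker_id"])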

Perhaps sharing some configs would help? I haven't seen any sort of connection refusal problem before. I'd check with netstat that it is consistently listening on the expected port. It looks like you might have Consul DNS doing load balancing? Is it possible some subset of the DNS entries are wrong, causing the intermittent connection failures?
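
A quick way to check both at once is to resolve every address behind the Consul name and try the port on each -- a sketch using Python's standard socket module, with the host/port from your error:

import socket

host, port = "connect-elasticsearch-indexer.service.consul", 31099

# one entry per A record Consul is currently handing out
addrs = {info[4][0] for info in
         socket.getaddrinfo(host, port, socket.AF_INET, socket.SOCK_STREAM)}

for addr in sorted(addrs):
    try:
        socket.create_connection((addr, port), timeout=3).close()
        print(addr, "is listening")
    except OSError as e:
        print(addr, "refused/unreachable:", e)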

Re: logging, we're not doing anything special here -- just normal log4j. If the FileAppender doesn't work well in your environment, or you have some easy way to collect output from a simple console appender, you can always swap in a different appender by adjusting the connect-log4j.properties file.
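
For instance, the stock connect-log4j.properties is just a console appender, roughly:

log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=[%d] %p %m (%c:%L)%n

and you can point the root logger at whatever appender your environment collects most easily.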

-Ewen

On Thu, Jun 16, 2016 at 1:47 AM, Barry Kaplan <mem...@gmail.com> wrote:
It seems that only one of my processes actually does work, and only the same process will accept REST commands. 

I can PUT the configuration to the process that is working and then some other process seems to take over. At that point only that new process that is now doing work will accept any more REST commands. 




--
Thanks,
Ewen

Barry Kaplan

Jun 17, 2016, 1:07:20 AM
to Confluent Platform
Logging: that was fully user error (as expected). I had configured the logstash encoder (https://github.com/logstash/logstash-logback-encoder) but was passing a JSONObject as a keyValue to the log. The encoder did not emit any errors but clearly did not like this. Once I removed it and just logged the record's string value, logging has been solid.

I don't see how the above could have any effect on the REST API, but after fixing the logging I never again saw a rejection. Before that, though, I did check that Consul had the correct IPs (I was testing at that point using Postman). I also refactored the Ansible playbook (which is the normal way we use the REST API) to get the IP/port directly from Marathon, but still had rejections. Again, no more now though. I'll post back if I learn the real reason I was having problems.

thanks Ewen!!

Barry Kaplan

Jun 17, 2016, 5:01:27 AM
to Confluent Platform
Well, the REST problems are not resolved. But it's a bit more subtle than I first thought. And maybe this is related to another problem I am having.

First the other problem: I have three processes deployed, with a configuration of about 100 topics. But only one process ever gets messages put to it.

The one process that is actually doing work will handle REST requests. But the other processes seem to only support GET. If I try to PUT a new config to any process other than the one that is getting messages, I receive a 500 response.

This is on our staging Mesos cluster. We'll try to deploy the same configuration locally and see if we can't figure out what is going on.

Barry Kaplan

Jun 17, 2016, 6:09:04 AM
to Confluent Platform
The errors in the log (where it seems the processes try to talk to each other -- is this new for 0.10?) came from us not setting the advertised host/port. That is now set correctly to the host/port assigned by Mesos.
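
For anyone else hitting this, the settings in question are the rest.advertised ones in the worker properties -- roughly like this (the values below are placeholders for whatever Mesos assigns):

# connect-distributed.properties -- placeholder values
rest.port=31099
rest.advertised.host.name=10.0.136.98
rest.advertised.port=31099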

But still only one of the processes ever gets a non-empty collection of messages in put().