Monitor HDFS connector


Guillaume

Nov 25, 2016, 9:16:26 AM
to Confluent Platform
Hello,

I have a small Kafka cluster (3 nodes) with about 20 topics and one HDFS connector per topic per server.

I can see that some topics are properly written to HDFS (Hive), but some are not. There is no obvious configuration difference between the working and non-working connectors (as expected, since the configuration is auto-generated).

Here is the configuration of one connector that is not writing anything:
{
  "name": "sent-connector",
  "config": {
    "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
    "topics.dir": "/kafka-connect/topics",
    "hadoop.conf.dir": "/etc/hadoop/conf",
    "schema.compatibility": "FULL",
    "flush.size": "10000",
    "timezone": "UTC",
    "tasks.max": "1",
    "topics": "sent",
    "hive.home": "/usr/hdp/current/hive-client",
    "hdfs.url": "hdfs://aws_server:8020",
    "hive.database": "events",
    "rotate.interval.ms": "60000",
    "hive.metastore.uris": "thrift://ambari:9083",
    "locale": "C",
    "hadoop.home": "/usr/hdp/current/hadoop-client",
    "logs.dir": "/kafka-connect/wal",
    "hive.integration": "true",
    "partitioner.class": "io.confluent.connect.hdfs.partitioner.HourlyPartitioner",
    "name": "sent-connector",
    "hive.conf.dir": "/etc/hive/conf",
    "path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/"
  },
  "tasks": [
    {
      "connector": "sent-connector",
      "task": 0
    }
  ]
}

One status example is:
{
  "name": "sent-connector",
  "connector": {
    "state": "RUNNING",
    "worker_id": "192.168.1.1:8083"
  },
  "tasks": [
    {
      "state": "RUNNING",
      "id": 0,
      "worker_id": "192.168.1.2:8083"
    }
  ]
}
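(For reference, dumps like the two above can be pulled from the Connect REST API on any worker; "connect-host" below is a placeholder for one of the workers:)

```shell
# Fetch the live config and status of a single connector from the
# Connect REST API (default port 8083). "connect-host" is a placeholder.
curl -s http://connect-host:8083/connectors/sent-connector | python3 -m json.tool
curl -s http://connect-host:8083/connectors/sent-connector/status | python3 -m json.tool
```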



Of course, there is nothing in the logs either, apart from a lot of "INFO Starting commit and rotation for topic partition..." messages.

My question is thus: how can I monitor what a connector is doing, and how can I find out why it is not doing anything?

Some side questions:
- I suppose it would make sense to increase the number of tasks per connector?
- I did not find any reference anywhere to compression for the files written to HDFS. Are they compressed, and if not, how can I enable compression?

Thanks,


Gwen Shapira

Nov 26, 2016, 11:49:57 AM
to confluent...@googlegroups.com
A few suggestions:

1. You can actually specify multiple topics in one connector, so
perhaps take a connector that does work and add a few more topics to it
to see what happens?

2. It sounds like you are using Connect in distributed mode? Perhaps
try a stand-alone worker with one of the "bad" connectors and see if
that gives you any more information (I find stand-alone easier to
troubleshoot; it also lets me switch logs to DEBUG and TRACE
without drowning in output).
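(A minimal sketch of running one of the "bad" connectors in a stand-alone worker; the property files are illustrative equivalents of the JSON config shown earlier:)

```shell
# Run a single connector in a stand-alone Connect worker.
# worker.properties points at the Kafka brokers and converters;
# sent-connector.properties holds the connector config from the JSON above,
# one key=value per line (e.g. connector.class=..., topics=sent).
# To get DEBUG/TRACE output, raise the root logger level in
# etc/kafka/connect-log4j.properties before starting.
connect-standalone worker.properties sent-connector.properties
```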

In terms of monitoring: the sink is a consumer, so you should be able
to attach jconsole (or jstatd) and look at the consumer metrics,
things like whether any events were consumed, errors, etc.
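(A quick way to check the consumer side without JMX, assuming the default group naming, where a sink connector's consumer group is connect-&lt;connector name&gt;; the broker address is a placeholder:)

```shell
# Show the committed offsets and lag of the sink connector's consumer group.
# Sink connectors use the group "connect-<connector name>" by default.
# "broker:9092" is a placeholder for one of the Kafka brokers.
kafka-consumer-groups --bootstrap-server broker:9092 \
  --describe --group connect-sent-connector
```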

Gwen



--
Gwen Shapira
Product Manager | Confluent
650.450.2760 | @gwenshap
Follow us: Twitter | blog

Guillaume Roger | webpower

Dec 22, 2016, 8:29:59 AM
to confluent...@googlegroups.com
Thanks for your answer. 

It took me a while to finally follow your suggestion, but here we are.

I tried using standalone connectors instead, but it makes no difference in the result. What I did manage to see, though, are these two things:
- Sometimes the lease is acquired, and I can see the timestamp of the ${logs.dir}/${topic}/${partition}/log file in HDFS being updated. No actual data comes into HDFS, though. In the log output, I have a lot of lines like these:

[2016-12-22 13:55:43,292] INFO Successfully acquired lease for hdfs://ip-10-0-0-239.eu-west-1.compute.internal:8020//kafka-connect/wal/sent/0/log (io.confluent.connect.hdfs.wal.FSWAL:75)
[2016-12-22 13:55:43,292] INFO Successfully acquired lease for hdfs://ip-10-0-0-239.eu-west-1.compute.internal:8020//kafka-connect/wal/sent/2/log (io.confluent.connect.hdfs.wal.FSWAL:75)
[2016-12-22 13:55:43,292] INFO Successfully acquired lease for hdfs://ip-10-0-0-239.eu-west-1.compute.internal:8020//kafka-connect/wal/sent/1/log (io.confluent.connect.hdfs.wal.FSWAL:75)
[2016-12-22 13:56:33,080] INFO WorkerSinkTask{id=sent-connector-standalone-0} Committing offsets (org.apache.kafka.connect.runtime.WorkerSinkTask:262)
[2016-12-22 13:56:33,206] INFO WorkerSinkTask{id=sent-connector-standalone-1} Committing offsets (org.apache.kafka.connect.runtime.WorkerSinkTask:262)
[2016-12-22 13:56:33,225] INFO WorkerSinkTask{id=sent-connector-standalone-2} Committing offsets (org.apache.kafka.connect.runtime.WorkerSinkTask:262)
[2016-12-22 13:56:43,225] INFO Starting commit and rotation for topic partition sent-1 with start offsets {} and end offsets {} (io.confluent.connect.hdfs.TopicPartitionWriter:297)
[2016-12-22 13:56:43,237] INFO Starting commit and rotation for topic partition sent-2 with start offsets {} and end offsets {} (io.confluent.connect.hdfs.TopicPartitionWriter:297)
[2016-12-22 13:56:43,316] INFO Starting commit and rotation for topic partition sent-0 with start offsets {} and end offsets {} (io.confluent.connect.hdfs.TopicPartitionWriter:297)

The empty offsets look wrong to me. Indeed, with the distributed connector, if the offsets are empty I have no data; if they are not empty I do have data. I know data is arriving in Kafka in the meantime, so it should be picked up by the connector.

- After a few start/^C cycles, I now get this message:

[2016-12-22 14:26:14,077] INFO Cannot acquire lease on WAL hdfs://ip-10-0-0-239.eu-west-1.compute.internal:8020//kafka-connect/wal/sent/2/log (io.confluent.connect.hdfs.wal.FSWAL:80)
[2016-12-22 14:26:21,835] ERROR Recovery failed at state RECOVERY_PARTITION_PAUSED (io.confluent.connect.hdfs.TopicPartitionWriter:229)
org.apache.kafka.connect.errors.ConnectException: Cannot acquire lease after timeout, will retry.
at io.confluent.connect.hdfs.wal.FSWAL.acquireLease(FSWAL.java:95)


What could be preventing the lease acquisition? Is there a lock somewhere?

I would like to give more info but that's all I have for the moment.  
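(One way to double-check that data really is arriving on the topic while the connector reports empty offsets; the broker address is a placeholder:)

```shell
# Print the latest (end) offset of every partition of the "sent" topic.
# Run it twice, a minute apart: growing offsets mean data is arriving.
# "broker:9092" is a placeholder for one of the Kafka brokers.
kafka-run-class kafka.tools.GetOffsetShell \
  --broker-list broker:9092 --topic sent --time -1
```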

Cheers,

--
Guillaume Roger
Senior big data engineer & architect @ Webpower
Koolhovenstraat 1k, 3772 MT, Barneveld, Nederland
guillaume.roger@webpower.nl | +31 (0) 342 423 262

Ewen Cheslack-Postava

Dec 24, 2016, 11:51:36 AM
to Confluent Platform
There are built-in lease timeouts in HDFS, so it may just be that it can't reacquire within the connector's timeout because the namenode still has the lease assigned to the previous instance (i.e. the dead Connect process).
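(If the namenode is indeed still holding the lease for the dead process, one thing to try is asking HDFS to recover it explicitly; the path below is taken from the log lines above:)

```shell
# Ask the namenode to forcibly recover the lease on the WAL file that
# the dead Connect instance still holds (available since Hadoop 2.7).
hdfs debug recoverLease -path /kafka-connect/wal/sent/0/log -retries 3
```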

The empty {} does look suspicious, since data has to be added before that message can be logged; it should only appear once at least one message has been written. For the partitions that are delivering data, are you seeing the offsets properly included?

-Ewen


Swathi Mocharla

Mar 1, 2019, 3:46:35 AM
to Confluent Platform
I am running into a similar issue with Confluent HDFS Connect 5.0.0. A restart of the datanode causes a problem with acquiring the lease for the next write, and adding a recoverLease call doesn't seem to help. Recovering the lease manually from HDFS still shows the old process holding the lease, even though the lease recovery reported success. Is there any way to recover this lease from Connect's side instead of waiting for the lease timeout from HDFS? This is very much needed.