HDFS connect error


Guillaume

Jan 27, 2017, 10:03:42 AM
to Confluent Platform
Hello,

I have some log lines I do not understand in hdfs connect:

[2017-01-27 15:42:37,256] ERROR Recovery failed at state RECOVERY_PARTITION_PAUSED (io.confluent.connect.hdfs.TopicPartitionWriter:229)
org.apache.kafka.connect.errors.ConnectException: java.io.EOFException
at io.confluent.connect.hdfs.wal.FSWAL.apply(FSWAL.java:131)
at io.confluent.connect.hdfs.TopicPartitionWriter.applyWAL(TopicPartitionWriter.java:484)
at io.confluent.connect.hdfs.TopicPartitionWriter.recover(TopicPartitionWriter.java:212)
at io.confluent.connect.hdfs.TopicPartitionWriter.write(TopicPartitionWriter.java:256)
at io.confluent.connect.hdfs.DataWriter.write(DataWriter.java:234)
at io.confluent.connect.hdfs.HdfsSinkTask.put(HdfsSinkTask.java:103)
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:384)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:240)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:172)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:143)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at io.confluent.connect.hdfs.wal.WALFile$Reader.init(WALFile.java:584)
at io.confluent.connect.hdfs.wal.WALFile$Reader.initialize(WALFile.java:552)
at io.confluent.connect.hdfs.wal.WALFile$Reader.<init>(WALFile.java:529)
at io.confluent.connect.hdfs.wal.FSWAL.apply(FSWAL.java:107)
... 16 more


I have hit this exception many times. The log line immediately before it varies, e.g.:

INFO Committed hdfs:/xxxx:8020//kafka-connect/topics/tstperf/year=2017/month=01/day=27/hour=14//tstperf+1+0011992516+0011992520.avro for tstperf-1 (io.confluent.connect.hdfs.TopicPartitionWriter:625)
 
or

INFO Starting commit and rotation for topic partition open-0 with start offsets {year=2017/month=01/day=27/hour=14/=58337131} and end offsets {year=2017/month=01/day=27/hour=14/=58338674, year=2017/month=01/day=27/hour=13/=58274188} (io.confluent.connect.hdfs.TopicPartitionWriter:297)

When the preceding line is of the first kind, the file it points to does not exist in HDFS.
There is no other common pattern I could find.
I am using the distributed connector on 3 servers, and only one of them shows this issue.
Data still arrives in HDFS, probably from the 2 other workers.
The issue started after a restart of all connectors (they were stopped during HDFS maintenance).

There are no weird empty files, nothing about 'not an avro data file' in the logs...

I have 2 questions related to that:
- What is happening, and how can it be fixed?
- Can I be confident that all data is still being written to HDFS thanks to the other instances of the connector, or am I losing 1/3 of it?

Thanks,

Ewen Cheslack-Postava

Jan 31, 2017, 12:27:56 AM
to Confluent Platform
It looks like that one task is getting stuck because it cannot finish applying the WAL. The WAL just contains a list of files that need to be committed: each entry maps from tempfile -> final destination. The WAL contains the record of what commits are planned so that if one process dies while committing files, the process that picks up the work later can finish committing the files before starting to process data again.
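To make the commit protocol concrete, here is a simplified sketch of the idea: a batch of tempfile -> final-destination entries is written with a leading count, and the reader only applies the batch once it has read the complete set. This is NOT the actual `io.confluent.connect.hdfs.wal.WALFile` on-disk format, just an illustration of the write-then-apply pattern described above.

```java
import java.io.*;
import java.util.*;

// Simplified WAL sketch: write the full set of planned commits first,
// then apply them only after the whole set is durably on disk.
public class WalSketch {
    // Write a batch of tempFile -> finalFile entries, preceded by a count
    // so the reader knows when it has seen the complete set.
    static byte[] writeBatch(Map<String, String> entries) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(entries.size());
        for (Map.Entry<String, String> e : entries.entrySet()) {
            out.writeUTF(e.getKey());   // temp file path
            out.writeUTF(e.getValue()); // final destination path
        }
        out.flush();
        return buf.toByteArray();
    }

    // Read the batch back; this only succeeds once the full set is present,
    // otherwise readUTF() hits end-of-stream and throws EOFException.
    static Map<String, String> readBatch(byte[] wal) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(wal));
        int n = in.readInt();
        Map<String, String> entries = new LinkedHashMap<>();
        for (int i = 0; i < n; i++) {
            entries.put(in.readUTF(), in.readUTF());
        }
        return entries;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> batch = new LinkedHashMap<>();
        batch.put("/topics/+tmp/tstperf/a.avro", "/topics/tstperf/a.avro");
        System.out.println(readBatch(writeBatch(batch)));
    }
}
```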

In this case it looks like some data was not successfully written to the WAL file, so it is hitting EOF unexpectedly when reading the WAL. This is an unusual edge case (it requires failing at a very specific moment) and it looks like we may not be handling it gracefully. The way the WAL file works is that we write all the entries we need to "commit" a block of data from Kafka -> HDFS. Then we make sure that data is synced into HDFS, and then apply it. From where the error is happening it looks like only m out of the total N entries were successfully written (and possibly a fractional record). We wait until we see the full set before applying anything, but it looks like you ended up with a partial set.
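The failure mode can be reproduced in miniature: if the writer dies after m of N entries, the reader's `readFully`/`readUTF` runs off the end of the file and throws `EOFException`, exactly as in the stack trace above. The hypothetical recovery below (not the connector's actual code) catches that and signals that the partial batch can be discarded, since an incomplete batch was never applied.

```java
import java.io.*;

// Demonstrates reading a WAL-style batch that was cut short mid-write.
public class PartialWalRead {
    // Returns the decoded entries, or "TRUNCATE" when the batch is partial
    // and can safely be discarded (nothing from it was committed).
    static String readOrTruncate(byte[] wal) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(wal))) {
            int n = in.readInt();
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < n; i++) {
                sb.append(in.readUTF()).append(" -> ").append(in.readUTF()).append('\n');
            }
            return sb.toString();
        } catch (EOFException e) {
            // Partial batch: the full set was never written, so none of it applied.
            return "TRUNCATE";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(2);                 // batch claims 2 entries...
        out.writeUTF("/+tmp/a.avro");
        out.writeUTF("/topics/a.avro");  // ...but the second entry never made it to disk
        System.out.println(readOrTruncate(buf.toByteArray())); // prints TRUNCATE
    }
}
```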

The good news is that due to the way we handle committing data, you should be able to resolve this just by deleting the WAL file. Since we didn't have the full set of files to be committed written to the WAL file, none will have been committed and deleting it will just result in reprocessing the data. On the HDFS connector side, we should handle this specific EOF error more gracefully when applying the WAL. In that case it should be safe to truncate the WAL (equivalent to deleting it) and restart processing from the last known safe point.
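For reference, deleting the WAL for the stuck partition might look something like the following. The exact path is an assumption here: it depends on your `logs.dir` setting (the connector keeps one WAL per topic partition under that directory), so verify the location in your own deployment before removing anything.

```shell
# Hypothetical example -- adjust to your logs.dir config and the stuck
# topic/partition (here: topic tstperf, partition 1).
hdfs dfs -rm /logs/tstperf/1/log

# Then restart the connector/task; it recovers from the files already
# committed in HDFS and reprocesses the uncommitted offsets from Kafka.
```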

-Ewen

To view this discussion on the web visit https://groups.google.com/d/msgid/confluent-platform/d50c2122-ac8e-49f9-a375-9ad56a7d34e7%40googlegroups.com.
