Significance of partition.duration.ms property


Mohammad Tariq

Mar 16, 2016, 8:47:44 AM
to confluent...@googlegroups.com
Hi group,

Could someone please explain the usage of the partition.duration.ms property used with the TimeBasedPartitioner in kafka-connect-hdfs? I looked into the docs but didn't quite get it, so an example would be really helpful.

Ewen Cheslack-Postava

Mar 22, 2016, 1:47:24 AM
to Confluent Platform
Hi Mohammad,

The partition.duration.ms property sets the interval of time each file should cover when using the TimeBasedPartitioner. For example, if you want each file to cover one hour of data, set partition.duration.ms=3600000.
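The bucketing Ewen describes can be sketched as a timestamp truncation. This is a minimal illustration of the idea (assuming UTC timestamps, as with timezone=GMT), not the connector's actual source:

```python
# Minimal sketch (assumed behavior, UTC): the TimeBasedPartitioner buckets
# each record by truncating its timestamp to the start of the current
# partition.duration.ms window.
def partition_start(timestamp_ms: int, partition_duration_ms: int) -> int:
    return timestamp_ms - (timestamp_ms % partition_duration_ms)

HOUR_MS = 3600000  # partition.duration.ms=3600000 -> hourly buckets

# Two records one second apart within the same hour land in the same bucket,
# and therefore in the same encoded partition directory:
t = 1458119264000
assert partition_start(t, HOUR_MS) == partition_start(t + 1000, HOUR_MS)
```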

-Ewen

--
You received this message because you are subscribed to the Google Groups "Confluent Platform" group.
To unsubscribe from this group and stop receiving emails from it, send an email to confluent-platf...@googlegroups.com.
To post to this group, send email to confluent...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/confluent-platform/CAMVC6RMuHnk2%2BbjPj-CrWWcG2k1Wy_qppFrupgz7VDOToiO6ig%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.



--
Thanks,
Ewen

Tariq Mohammad

Mar 28, 2016, 6:56:44 AM
to Confluent Platform
Hi Ewen,

Apologies for the late response; somehow I am not getting notifications from the group.

Thank you so much for your response. I was looking at the source code and was able to figure that out, and your response confirms it. However, I am still not getting the result I want: no matter what value I use, all the data ends up in a single partition. This is what my properties file looks like:

name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=test
hdfs.url=hdfs://localhost:9000
flush.size=3
#partitioner.class=io.confluent.connect.hdfs.partitioner.HourlyPartitioner
partitioner.class=io.confluent.connect.hdfs.partitioner.TimeBasedPartitioner
#path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/'minute'=MM/'second'=ss/
path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/'minute'=MM/
locale=en
timezone=GMT
logs.dir=/kafka-connect/logs
topics.dir=/kafka-connect/topics
hive.integration=true
hive.metastore.uris=thrift://localhost:9083
schema.compatibility=BACKWARD
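[Editor's note, hedged: one detail not raised in the thread is that path.format is interpreted with Joda-Time-style date patterns, in which MM means month-of-year while mm means minute-of-hour, so 'minute'=MM would render the month a second time. Assuming a one-minute partition window was intended, a corrected fragment might look like:]

```properties
# Hypothetical corrected fragment: mm (minute), not MM (month)
path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/'minute'=mm/
partition.duration.ms=60000
```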

With 60000 as the value for both partition.duration.ms and rotate.interval.ms, I was expecting the connector to create a new HDFS directory (and the corresponding Hive partition) every minute and copy that minute's worth of data into it. After one minute, a new directory (partition) would be created to hold the next minute's data.

The connector copies data from Kafka into HDFS successfully, but it creates a new file under the same directory every time it writes, instead of creating a new directory. I have attached a snapshot below in case that helps:



I feel I am still missing some piece, or I have got it completely wrong.

Thanks again!


Александр Сосновских

Jan 24, 2018, 3:58:05 AM
to Confluent Platform
Hi!
I noticed that you have flush.size=3. Look at the endings of the file names written to HDFS: *003.avro, *006.avro, *009.avro. Those are the offsets of the messages taken from Kafka. So you are writing every 3 messages to a new file because of flush.size=3. Try changing it to a bigger value.
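The effect described above can be sketched as follows. The file-naming and batching here are a simplified illustration (the real connector names files by topic, Kafka partition, and offset range), assuming only that a file is committed after every flush.size records:

```python
# Sketch (assumption: one file is committed per flush.size records,
# identified here by its start and end offsets).
def files_written(offsets, flush_size):
    files = []
    batch = []
    for off in offsets:
        batch.append(off)
        if len(batch) == flush_size:
            files.append((batch[0], batch[-1]))  # commit the batch as one file
            batch = []
    return files

# Nine records with offsets 0..8 and flush.size=3 yield three files
# in the same partition directory:
print(files_written(range(9), 3))  # [(0, 2), (3, 5), (6, 8)]
```

A larger flush.size simply means more records accumulate before each file is committed, which is why raising it reduces the number of files per directory.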

On Monday, 28 March 2016 at 13:56:44 UTC+3, Tariq Mohammad wrote: