Significance of partition.duration.ms property


Mohammad Tariq

Mar 16, 2016, 8:47:44 AM
to confluent...@googlegroups.com
Hi group,

Could someone please explain the usage of the partition.duration.ms property used with the TimeBasedPartitioner in kafka-connect-hdfs? I looked into the docs but didn't quite get it, so an example would be really helpful.

Ewen Cheslack-Postava

Mar 22, 2016, 1:47:24 AM
to Confluent Platform
Hi Mohammad,

The partition.duration.ms property sets the interval of time each file should cover when using the TimeBasedPartitioner. For example, if you want each file to cover one hour of data, set partition.duration.ms=3600000.
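The bucketing Ewen describes can be sketched as a timestamp truncation. This is a minimal illustration of the idea (assuming UTC timestamps, as with timezone=GMT), not the connector's actual source:

```python
# Minimal sketch (assumed behavior, UTC): the TimeBasedPartitioner buckets
# each record by truncating its timestamp to the start of the current
# partition.duration.ms window.
def partition_start(timestamp_ms: int, partition_duration_ms: int) -> int:
    return timestamp_ms - (timestamp_ms % partition_duration_ms)

HOUR_MS = 3600000  # partition.duration.ms=3600000 -> hourly buckets

# Two records one second apart within the same hour land in the same bucket,
# and therefore in the same encoded partition directory:
t = 1458119264000
assert partition_start(t, HOUR_MS) == partition_start(t + 1000, HOUR_MS)
```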

-Ewen

--
You received this message because you are subscribed to the Google Groups "Confluent Platform" group.
To unsubscribe from this group and stop receiving emails from it, send an email to confluent-platf...@googlegroups.com.
To post to this group, send email to confluent...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/confluent-platform/CAMVC6RMuHnk2%2BbjPj-CrWWcG2k1Wy_qppFrupgz7VDOToiO6ig%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.



--
Thanks,
Ewen

Tariq Mohammad

Mar 28, 2016, 6:56:44 AM
to Confluent Platform
Hi Ewen,

Apologies for the late response; somehow I am not getting notifications from the group.

Thank you so much for your response. I was looking at the source code and was able to figure that out, and your response confirms it. However, I am still not getting the result I want: no matter what value I use, all the data ends up in a single partition. This is what my properties file looks like:

name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=test
hdfs.url=hdfs://localhost:9000
flush.size=3
#partitioner.class=io.confluent.connect.hdfs.partitioner.HourlyPartitioner
partitioner.class=io.confluent.connect.hdfs.partitioner.TimeBasedPartitioner
#path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/'minute'=MM/'second'=ss/
path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/'minute'=MM/
locale=en
timezone=GMT
logs.dir=/kafka-connect/logs
topics.dir=/kafka-connect/topics
hive.integration=true
hive.metastore.uris=thrift://localhost:9083
schema.compatibility=BACKWARD
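[Editor's note, hedged: one detail not raised in the thread is that path.format is interpreted with Joda-Time-style date patterns, in which MM means month-of-year while mm means minute-of-hour, so 'minute'=MM would render the month a second time. Assuming a one-minute partition window was intended, a corrected fragment might look like:]

```properties
# Hypothetical corrected fragment: mm (minute), not MM (month)
path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/'minute'=mm/
partition.duration.ms=60000
```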

With 60000 as the value for both partition.duration.ms and rotate.interval.ms, I was expecting the connector to create a new HDFS directory (and the corresponding Hive partition) every minute and copy that minute's worth of data into it. After one minute, a new directory (partition) would be created to hold the next minute's data.

The connector copies data from Kafka into HDFS successfully, but it creates a new file under the same directory every time it writes, instead of creating a new directory. I have attached a snapshot below in case that helps:



I feel I am still missing some piece, or I have got it completely wrong.

Thanks again!


Александр Сосновских

Jan 24, 2018, 3:58:05 AM
to Confluent Platform
Hi!
I noticed that you have flush.size=3. Look at the endings of the file names written to HDFS: *003.avro, *006.avro, *009.avro. Those are the offsets of the messages taken from Kafka. So you are writing every 3 messages to a new file because of flush.size=3. Try changing it to a bigger value.
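The effect described above can be sketched as follows. The file-naming and batching here are a simplified illustration (the real connector names files by topic, Kafka partition, and offset range), assuming only that a file is committed after every flush.size records:

```python
# Sketch (assumption: one file is committed per flush.size records,
# identified here by its start and end offsets).
def files_written(offsets, flush_size):
    files = []
    batch = []
    for off in offsets:
        batch.append(off)
        if len(batch) == flush_size:
            files.append((batch[0], batch[-1]))  # commit the batch as one file
            batch = []
    return files

# Nine records with offsets 0..8 and flush.size=3 yield three files
# in the same partition directory:
print(files_written(range(9), 3))  # [(0, 2), (3, 5), (6, 8)]
```

A larger flush.size simply means more records accumulate before each file is committed, which is why raising it reduces the number of files per directory.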

On Monday, 28 March 2016 at 13:56:44 UTC+3, Tariq Mohammad wrote: