Pluggable FieldPartitioners


Helmut Zechmann

Apr 27, 2015, 12:03:49 PM
to cdk...@cloudera.org
Dear list members,

the (absolutely great) Kite SDK ships with a set of built-in FieldPartitioners that should be suitable for many setups. In our project we would like to use additional, project-specific FieldPartitioners. Would it be possible to implement pluggable FieldPartitioner support? I am thinking of something similar to the support for Hive-based datasets, which is loaded when the required jars are present. This might be a useful extension point for many projects. Do you think this is feasible, or are there Kite SDK internals that rule it out completely?
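
For illustration, the kind of extension we have in mind is essentially a function from a source field value to a partition value. A rough sketch (the class and method names below are ours, purely for illustration; Kite's actual SPI, org.kitesdk.data.spi.FieldPartitioner, may look different):

    // Hypothetical sketch only, not Kite's actual SPI.
    public class RegionFieldPartitioner {
      private final String sourceName;  // entity field, e.g. "countryCode"
      private final String name;        // partition field, e.g. "region"

      public RegionFieldPartitioner(String sourceName, String name) {
        this.sourceName = sourceName;
        this.name = name;
      }

      // Project-specific mapping from a source value to a partition value.
      public String apply(String countryCode) {
        switch (countryCode) {
          case "DE": case "AT": case "CH": return "dach";
          default: return "other";
        }
      }
    }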

Kind Regards,

Helmut

Joey Echeverria

Apr 27, 2015, 1:24:41 PM
to Helmut Zechmann, cdk...@cloudera.org
Hi Helmut,

We don't have APIs to make this extensible today. Are you thinking of FieldPartitioners that would only be useful to your team, or do you think some of what you need would be generally useful? I ask because we'd be happy to help you contribute additional FieldPartitioners directly to the project.

For your Hive example, have you looked at the ProvidedPartitioner? It was designed for this use case, where you have a partitioned dataset whose partition fields are provided rather than derived, and are linked to fields in your entity schema. It was built for users who may have been using Hive, where this is the default partition model.
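
For example, declaring a provided partition field looks roughly like this (this is the Builder call as I remember it, so double-check the PartitionStrategy docs for your Kite version):

    import org.kitesdk.data.PartitionStrategy;

    public class ProvidedStrategyExample {
      public static void main(String[] args) {
        // One provided partition field: its value is supplied at write
        // time instead of being derived from an entity field.
        PartitionStrategy strategy = new PartitionStrategy.Builder()
            .provided("version", "int")
            .build();
        System.out.println(strategy);
      }
    }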

-Joey

--
Joey Echeverria
Senior Infrastructure Engineer

Ryan Blue

Apr 27, 2015, 3:49:02 PM
to Joey Echeverria, Helmut Zechmann, cdk...@cloudera.org
I agree with Joey: in general we'd prefer to have partition functions
added to the core project itself so everyone benefits and we have
consistency in how data is stored. But I'm happy to add a way to
register partitioners if that's what you need for your use case. If so,
could you tell us a little more?

Thanks for using Kite; I'm glad you like it.

rb


--
Ryan Blue
Software Engineer
Cloudera, Inc.

Helmut Zechmann

Apr 28, 2015, 4:09:35 AM
to cdk...@cloudera.org, helmut....@gmail.com, jo...@scalingdata.com, stefan.c...@gfk.com
Hi Ryan and Joey,

we have the following two use cases that we currently cannot cover natively with the Kite framework:

1) We want to partition our data based on user-local time. We therefore store timestamps in user-local time as strings in ISO 8601 format, including a timezone offset, for example "2014-07-21T16:43:43.049+02:00". We would like to use those timestamps as input for the Kite FieldPartitioners (see the sketch after this list). For Kite compatibility we currently add an extra column to our data that holds the local date as a Unix timestamp. This works, but:
    a) Unix timestamps have no timezone support; they represent (milli)seconds since 1970-01-01 in UTC, so this representation feels somewhat dirty.
    b) we already have many date fields in our data, and adding another one (in a different representation) might confuse data consumers.
2) For some datasets we would like to partition our data by quarter and half-year. Currently we solve this using provided fields; a field-based partitioner would give us more flexibility here.
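
To make use case 1 concrete, this is roughly what we would like a partitioner to compute from such a string (plain Java 8 time API, purely for illustration):

    import java.time.LocalDate;
    import java.time.OffsetDateTime;

    public class LocalTimePartitionValues {
      public static void main(String[] args) {
        // Our timestamps keep the user's local wall-clock time plus offset.
        OffsetDateTime ts = OffsetDateTime.parse("2014-07-21T16:43:43.049+02:00");

        // We want partition values derived from the *local* date,
        // not from the UTC instant.
        LocalDate local = ts.toLocalDate();
        System.out.printf("year=%d month=%d day=%d%n",
            local.getYear(), local.getMonthValue(), local.getDayOfMonth());
        // prints: year=2014 month=7 day=21
      }
    }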

To tell you a little bit more about our use of the Kite SDK:

We have Hive-based datasets, partitioned by user-local time. Based on domain-specific requirements, the datasets have varying partition durations such as daily, monthly, quarterly, and so on.
For automated processing, all datasets have
1) a property that gives the duration of the partitions within a dataset as an ISO 8601 duration (such as P1D or P1M), and
2) the partition column slice_start, which contains the first day of a partition: the slice date itself for daily partitions, the first of the month for monthly partitions, and so on (see the sketch below).

In addition to that, we add partition fields based on the partition duration of a given dataset, such as "year", "month", and "day". Those fields are added to facilitate data access and data management.
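
The slice_start derivation looks roughly like this (a simplified, illustrative sketch using the Java 8 time API; the real version handles more durations):

    import java.time.LocalDate;
    import java.time.Period;
    import java.time.temporal.TemporalAdjusters;

    public class SliceStart {
      // Derive slice_start (the first day of the partition) from a local
      // date and the dataset's partition duration property.
      static LocalDate sliceStart(LocalDate date, Period duration) {
        if (duration.equals(Period.ofDays(1))) {
          return date;                                           // P1D
        } else if (duration.equals(Period.ofMonths(1))) {
          return date.with(TemporalAdjusters.firstDayOfMonth()); // P1M
        } else if (duration.equals(Period.ofMonths(3))) {
          int firstMonthOfQuarter = ((date.getMonthValue() - 1) / 3) * 3 + 1;
          return LocalDate.of(date.getYear(), firstMonthOfQuarter, 1); // P3M
        }
        throw new IllegalArgumentException("unsupported duration: " + duration);
      }

      public static void main(String[] args) {
        // e.g. a quarterly dataset: 2014-07-21 falls into slice 2014-07-01
        System.out.println(sliceStart(LocalDate.parse("2014-07-21"),
            Period.parse("P3M")));
      }
    }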

Just in case you are really interested :-), the following link points to a talk about our project; data management is covered starting at minute 13:00: https://vimeo.com/125738693.


I hope that helps you understand our use of the Kite SDK.

Kind regards,

Helmut

Helmut Zechmann

May 8, 2015, 11:29:15 AM
to cdk...@cloudera.org, helmut....@gmail.com, jo...@scalingdata.com
Hi Ryan,

do you have an opinion on my last post about our usage of the Kite SDK?

Kind regards,

Helmut

Ryan Blue

May 11, 2015, 1:20:56 PM
to Helmut Zechmann, cdk...@cloudera.org, jo...@scalingdata.com
Hi Helmut,

Sorry for the late reply; I was at a conference all last week.

For your date format, I think the current method is correct. I've worked
in both the Parquet and Avro communities to define the specs for storing
timestamps (and dates/times) in those formats, and the consensus is that
a binary field that stores a value in UTC is the best practice. If you
want information about where that value came from, like the original
zone, then the best practice is to keep that in a different column. I
recommend replacing your string timestamp field with the long timestamp
and another field that is just the original zone.
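
Concretely, that split would look something like this (an illustrative Java 8 sketch; the column names are just examples, not a Kite API):

    import java.time.OffsetDateTime;

    public class SplitTimestamp {
      public static void main(String[] args) {
        OffsetDateTime ts = OffsetDateTime.parse("2014-07-21T16:43:43.049+02:00");

        // Column 1: the instant in UTC, stored as epoch milliseconds (a long).
        long utcMillis = ts.toInstant().toEpochMilli();

        // Column 2: the original zone/offset, kept in a separate column.
        String originalOffset = ts.getOffset().getId();  // "+02:00"

        System.out.println(utcMillis + " " + originalOffset);
      }
    }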

For the other partitioners, I'm all for adding those granularities to
Kite. It sounds like those would be useful for other users as well. The
only concern I have is that it might make using Hive or Impala slightly
harder, if there isn't a function to generate those values (e.g.,
quarter) built into those SQL engines.

rb

Helmut Zechmann

May 19, 2015, 8:14:01 AM
to cdk...@cloudera.org, jo...@scalingdata.com, helmut....@gmail.com
Hi Ryan,

thank you very much for your advice!
I would really love to have the additional granularities. What do you mean by "the only concern I have is that it might make using Hive or Impala slightly harder"?
Are you talking about using Hive as an implementation for the data partitioning process?

Kind Regards,

Helmut

Ryan Blue

May 19, 2015, 12:23:27 PM
to Helmut Zechmann, cdk...@cloudera.org, jo...@scalingdata.com
On 05/19/2015 05:14 AM, Helmut Zechmann wrote:
> Hi Ryan,
>
> thank you very much for your advice!
> I would really love to have the additional granularities. What do you
> mean by "the only concern I have is that it might make using Hive or
> Impala slightly harder"?
> Are you talking about using Hive as an implementation for the data
> partitioning process?
>
> Kind Regards,
>
> Helmut

The current partition methods, with the exception of hash, are
compatible with Hive and Impala. I'm just not sure whether Hive has a
quarter(Date) function. If it does, then we need to make sure this is
compatible so it can be used to select partitions in Hive.
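
To make the compatibility point concrete: whatever value a quarter partitioner writes has to match what users can compute in a Hive or Impala query to prune partitions. An illustrative sketch of the derivation (a hypothetical helper, not an existing Kite partitioner):

    import java.time.LocalDate;

    public class QuarterCompat {
      // Whatever a "quarter" partitioner writes (1-4) must match what a
      // Hive/Impala query can compute to select partitions.
      static int quarter(LocalDate date) {
        return (date.getMonthValue() + 2) / 3;  // months 1-3 -> 1, ..., 10-12 -> 4
      }

      public static void main(String[] args) {
        System.out.println(quarter(LocalDate.parse("2014-07-21")));  // 3
      }
    }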