Removing time-based partitions

Micah Whitacre

Mar 11, 2015, 4:50:33 PM
to cdk...@cloudera.org
If I have a partition strategy based on a time field in my data, is there a way to delete all partitions older than a given time?

As an example I have a PartitionStrategy that looks like the following:

PartitionStrategy.Builder builder = new PartitionStrategy.Builder();
builder.dateFormat("extractionTime", "extractionTime_partition","yyyyMMddHH");

So I will be writing data into essentially hourly partitions.  Obviously hourly partitions might lose their value over time, so I'll want to roll them up into daily/monthly partitions by copying the data into a different dataset with a different partitioning strategy.  However, I would then want to delete all the hourly partitions older than two weeks.  I thought I should be able to do something like this:

dataset.toBefore("extractionTime", <millis for 2 weeks ago>).deleteAll();
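
Spelled out a little more fully, this is what I'm attempting (a sketch; the dataset URI and the two-week computation are just illustrative placeholders for my real values):

    import java.util.concurrent.TimeUnit;

    import org.apache.avro.generic.GenericRecord;
    import org.kitesdk.data.Dataset;
    import org.kitesdk.data.Datasets;

    // Cutoff in epoch millis, matching the long "extractionTime" field.
    long twoWeeksAgo = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(14);

    // Load the hourly dataset (illustrative URI) and delete everything
    // in partitions strictly before the cutoff.
    Dataset<GenericRecord> dataset = Datasets.load(
        "dataset:hive:default/storage_hourly", GenericRecord.class);
    dataset.toBefore("extractionTime", twoWeeksAgo).deleteAll();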

However, I am not able to do this because of the following exception:

java.lang.UnsupportedOperationException: Cannot cleanly delete view: FileSystemView...
at org.kitesdk.data.spi.filesystem.FileSystemView.deleteAll(FileSystemView.java:108)

In this case the underlying issue is that the DateFormatPartitioner's projectStrict[1] doesn't support Range predicates like the one I am using, so the constraints never align with the partition boundaries.

I tried switching around the PartitionStrategy to be something like this:

        builder.year("extractionTime", "extractionTimeYear_partition");
        builder.day("extractionTime", "extractionTimeDay_partition");
        builder.hour("extractionTime", "extractionHourDay_partition");

This should be functionally equivalent, but I still can't delete data because of the same exception.  In this case it seems that when it evaluates strict.equals(permissive)[2], the TimePredicateImpl[3] fails the equivalence check: even though strict and permissive both have the same upper[] = {2015}, they are judged not equivalent precisely because they don't differ.  I'm still trying to wrap my head around that logic, but it seems odd at first glance.

Is there another way I might go about deleting older time-based partitions, since deleteAll() doesn't seem to work?  Also, in my real situation I have a few other partitioners in my strategy, but the leaf partitioners will be time-based.  Will that cause a problem for me?

Thanks,
Micah

[1] https://github.com/kite-sdk/kite/blob/a0867cc685dd31d9e09cc76b76472ed165a884ae/kite-data/kite-data-core/src/main/java/org/kitesdk/data/spi/partition/DateFormatPartitioner.java#L112
[2] https://github.com/kite-sdk/kite/blob/a0867cc685dd31d9e09cc76b76472ed165a884ae/kite-data/kite-data-core/src/main/java/org/kitesdk/data/spi/Constraints.java#L317
[3] https://github.com/kite-sdk/kite/blob/a0867cc685dd31d9e09cc76b76472ed165a884ae/kite-data/kite-data-core/src/main/java/org/kitesdk/data/spi/TimeDomain.java#L284

Ryan Blue

Mar 11, 2015, 5:50:28 PM
to Micah Whitacre, cdk...@cloudera.org
On 03/11/2015 01:50 PM, Micah Whitacre wrote:
> If I have a partition strategy based on time field in my data, is there
> a way to delete all partitions older than a given time?
>
> As an example I have a PartitionStrategy that looks like the following:
>
> PartitionStrategy.Builder builder = new PartitionStrategy.Builder();
> builder.dateFormat("extractionTime",
> "extractionTime_partition","yyyyMMddHH");
>
> So I will be writing data into essentially hourly partitions. Obviously
> hourly partitions might lose their value over time so I'll want to roll
> those partitions up into daily/monthly by copying the data into a
> different dataset with a different partitioning strategy. However I
> would want to delete all the hourly partitions older than 2 weeks. I
> thought I should be able to do something like this:
>
> dataset.toBefore("extractionTime", <millis for 2 weeks ago>).deleteAll();
>
> However, I am not able to do this because of the following exception:
>
> java.lang.UnsupportedOperationException: Cannot cleanly delete view:
> FileSystemView...
> at
> org.kitesdk.data.spi.filesystem.FileSystemView.deleteAll(FileSystemView.java:108)
>
> In this case the underlying issue is the DateFormatPartitioner's
> projectStrict[1] doesn't support Range predicates like I am using so the
> constraints never align with the boundaries.

You're right about the cause here. The date format partitioner doesn't
yet implement the projection logic that the other time-based
partitioners handle. It wouldn't be too difficult to add.

We originally didn't include it because we would have to detect the
partitioning order (e.g., year before month) in the format string. But
we currently assume that the partition order is correct for the
separate time-based partitioners, so it wouldn't be too bad to make the
same assumption here.

> I tried switching around the PartitionStrategy to be something like this:
>
> builder.year("extractionTime", "extractionTimeYear_partition");
> builder.month("extractionTime", "extractionTimeMonth_partition");
> builder.day("extractionTime", "extractionTimeDay_partition");
> builder.hour("extractionTime", "extractionTimeHour_partition");
>
> This should be functionally equivalent, but I still can't delete data
> because of the same exception. In this case it seems that when it
> evaluates strict.equals(permissive)[2], the TimePredicateImpl[3] fails
> the equivalence check: even though strict and permissive both have the
> same upper[] = {2015}, they are judged not equivalent precisely
> because they don't differ. I'm still trying to wrap my head around
> that logic, but it seems odd at first glance.

This is something we should fix immediately. Could you open a bug report
for it? I doubt it is a really difficult fix.

> Is there another way I might go about accomplishing deleting older time
> based partitions other than the deleteAll() since that doesn't seem to
> work? Also in my real situation I have a few other partitioners in my
> strategy but the leaf strategies will be time based. Will that cause a
> problem for me?
>
> Thanks,
> Micah
>
>
> [1]
> - https://github.com/kite-sdk/kite/blob/a0867cc685dd31d9e09cc76b76472ed165a884ae/kite-data/kite-data-core/src/main/java/org/kitesdk/data/spi/partition/DateFormatPartitioner.java#L112
> [2]
> - https://github.com/kite-sdk/kite/blob/a0867cc685dd31d9e09cc76b76472ed165a884ae/kite-data/kite-data-core/src/main/java/org/kitesdk/data/spi/Constraints.java#L317
> [3]
> - https://github.com/kite-sdk/kite/blob/a0867cc685dd31d9e09cc76b76472ed165a884ae/kite-data/kite-data-core/src/main/java/org/kitesdk/data/spi/TimeDomain.java#L284

Thanks for letting us know about these! If you have the time, opening
issues for both would be great.

rb

--
Ryan Blue
Software Engineer
Cloudera, Inc.

Micah Whitacre

Mar 11, 2015, 6:34:30 PM
to cdk...@cloudera.org, mkw...@gmail.com
While waiting for the 1.1.0 version of Kite, is there another workaround I could use to accomplish the same task of deleting partitions older than a given point in time (other than dropping down to the raw HDFS folders myself)?

Ryan Blue

Mar 11, 2015, 6:59:03 PM
to Micah Whitacre, cdk...@cloudera.org
On 03/11/2015 03:34 PM, Micah Whitacre wrote:
> While waiting for the 1.1.0 version of Kite, is there another work
> around I could use to accomplish the same task of deleting older
> partitions after a point? (with the exception of dropping down to the
> raw HDFS folders myself)

Yes, you can use views created with partition data instead of record
data. To delete all of 2014, use:

kite-dataset delete view:hive:table?year=2014

If you're using the date format partitioner, this wouldn't work, because
there is only one level of partitioning (a single yyyyMMddHH field, so
there is no year field to match on).
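
For completeness, the programmatic equivalent through the Java API would
look something like this (a sketch, using the same illustrative view URI
as the command above):

    import org.apache.avro.generic.GenericRecord;
    import org.kitesdk.data.Datasets;
    import org.kitesdk.data.View;

    // Load a view constrained to the year=2014 partition by its
    // partition field value rather than by a record field.
    View<GenericRecord> year2014 = Datasets.load(
        "view:hive:table?year=2014", GenericRecord.class);

    // Deleting the view drops the selected partition directories.
    year2014.deleteAll();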

Micah Whitacre

Mar 12, 2015, 10:08:24 PM
to cdk...@cloudera.org, mkw...@gmail.com
Bringing the conversation back from CDK-961, since this is more of a general discussion than something specific to that issue...

So if I'm understanding this correctly: to be able to delete partitions, I can't specify a timestamp with any granularity finer than the leaf-most time-based partitioner, right?  So if I'm partitioned hourly, I need to clear the minutes, seconds, and milliseconds.  Assuming I detect a dataset and know it is partitioned by time, is there a way to easily detect the leaf field partitioner?  There used to be a way to get the FieldPartitioners from the PartitionStrategy, but that was removed.  The use case I'm thinking of is that I might have a namespace with the following datasets:

storage_hourly
storage_daily
storage_monthly
storage_yearly

If I had an Oozie coordinator for each dataset that ran regularly, clearing out any data older than some interval, could I examine the dataset to determine its partitioning granularity?  If not, my current approach is going to be to parse the dataset name and translate its suffix with custom code, as sketched below.
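
For reference, the custom code I have in mind is just a small mapping from the dataset name suffix to a time granularity (a hypothetical helper, nothing Kite-specific):

    import java.util.Calendar;

    // Hypothetical helper: translate a dataset name suffix into the
    // Calendar field matching its leaf time-based partitioner.
    static int granularityFor(String datasetName) {
      if (datasetName.endsWith("_hourly"))  return Calendar.HOUR_OF_DAY;
      if (datasetName.endsWith("_daily"))   return Calendar.DAY_OF_MONTH;
      if (datasetName.endsWith("_monthly")) return Calendar.MONTH;
      if (datasetName.endsWith("_yearly"))  return Calendar.YEAR;
      throw new IllegalArgumentException("Unknown granularity: " + datasetName);
    }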

Ryan Blue

Mar 13, 2015, 1:02:36 AM
to Micah Whitacre, cdk...@cloudera.org
On 03/12/2015 07:08 PM, Micah Whitacre wrote:
> Bringing the conversation back from CDK-961 since this is more of a
> discussion than about that issue...
>
> So if I'm understanding this correctly: to be able to delete
> partitions, I can't specify a timestamp with any granularity finer
> than the leaf-most time-based partitioner, right? So if I'm
> partitioned hourly, I need to clear the minutes, seconds, and
> milliseconds. Assuming I detect a dataset and know it is partitioned
> by time, is there a way to easily detect the leaf field partitioner?
> There used to be a way to get the FieldPartitioners from the
> PartitionStrategy, but that was removed. The use case I'm thinking of
> is that I might have a namespace with the following datasets:
>
> storage_hourly
> storage_daily
> storage_monthly
> storage_yearly
>
> If I had an Oozie coordinator for each dataset that ran regularly,
> clearing out any data older than some interval, could I examine the
> dataset to determine its partitioning granularity? If not, my current
> approach is going to be to parse the dataset name and translate its
> suffix with custom code.

What I recommend here is to use the granularity at which you want to run
the job. If you're going to clear data out daily, use the day (or the
hour at which you plan to run it). With the hierarchy of partition
directories (rather than a flat structure), a boundary at a coarser
level is also a boundary at every finer level: a day boundary is also an
hour boundary, so it doesn't matter whether a dataset is partitioned by
day or by hour when you delete yesterday and all data older than that.
The same goes for hours and minutes: a boundary on the hour is also a
minute boundary.
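
Concretely, a sketch of that truncation for a daily cleanup job (Kite's
built-in time partitioners use UTC; the dataset URI is illustrative):

    import java.util.Calendar;
    import java.util.TimeZone;

    import org.apache.avro.generic.GenericRecord;
    import org.kitesdk.data.Datasets;
    import org.kitesdk.data.RefinableView;

    // Truncate "two weeks ago" down to midnight UTC so the cutoff falls
    // on a day boundary, which is also an hour and a minute boundary.
    Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
    cal.add(Calendar.DAY_OF_MONTH, -14);
    cal.set(Calendar.HOUR_OF_DAY, 0);
    cal.set(Calendar.MINUTE, 0);
    cal.set(Calendar.SECOND, 0);
    cal.set(Calendar.MILLISECOND, 0);
    long cutoff = cal.getTimeInMillis();

    RefinableView<GenericRecord> hourly = Datasets.load(
        "dataset:hive:default/storage_hourly", GenericRecord.class);
    hourly.toBefore("extractionTime", cutoff).deleteAll();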

Of course, I also think there is value in being able to inspect the
partition strategy, and we should probably add those partitioners back.
If you (or anyone on this list) have thoughts on this, please let us
know what you're thinking and what these could be used for.

Thanks!

Micah Whitacre

Mar 13, 2015, 1:56:20 PM
to cdk...@cloudera.org, mkw...@gmail.com
Do you know the context/motivation for originally removing access to the FieldPartitioners?  Simplification of the API?  An encapsulation leak?

>> What I recommend here is to use the granularity at which you want to run the job.

Yeah, my thought was that when rolling up the daily dataset into a monthly one, the job would run at the start of each month (at some point on that day, but not necessarily at midnight).  During that run it would also delete the data that is now older than a month and has just been rolled up.  The current plan is to use the Oozie nominal time with a date format only as precise as the dataset being cleared.  I was just thinking it would be nice to be able to programmatically verify, inside the command, that the granularity was appropriate.

Ryan Blue

Mar 13, 2015, 2:01:07 PM
to Micah Whitacre, cdk...@cloudera.org
On 03/13/2015 10:56 AM, Micah Whitacre wrote:
> Do you know the context/motivation for originally removing access to
> the FieldPartitioners? Simplification of the API? An encapsulation
> leak?

There were methods we didn't think we needed to provide on
FieldPartitioner. I think it makes sense to expose the type and possibly
the partition function itself, but we were also exposing references to
the internal Predicate classes through the projection methods. That went
too far, so we'll have to come up with a way to expose some, but not
all, of the instance methods. Probably just an interface that exposes
the function type, apply(Object), the source field, and the name.
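
Something along these lines, as a rough sketch (hypothetical; this
interface is not part of the Kite API today):

    // Read-only view of one field partitioner, exposing only what
    // callers need to reason about partition granularity.
    public interface PartitionField {
      String getSourceName();            // source field, e.g. "extractionTime"
      String getName();                  // partition field name
      Class<?> getType();                // type of the partition values
      Object apply(Object sourceValue);  // the partition function itself
    }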

> >> What I recommend here is to use the granularity at which you want to
> run the job.
>
> Yeah my thoughts were that when rolling up the daily dataset into a
> monthly it would run at the start of each month (some point on that day
> but not necessarily at midnight). During that run it would also delete
> data that is now older than a month that just got rolled up. Current
> plan is to use Oozie nominal time with a date format only as precise as
> the dataset being cleared. Was just thinking it'd be nice to be able to
> programmatically ensure inside the command that the granularity was
> appropriate.

Let's fix that, but I think as long as you choose a granularity larger
than the minimum you should be okay.

Micah Whitacre

Jul 8, 2015, 9:54:33 AM
to cdk...@cloudera.org, mkw...@gmail.com
Reviving this old discussion: I logged an issue to expose the partition strategy values again.