+1
This is exactly how I think it should work. Partitions are just views
that are related to physical storage.
On 03/26/2015 12:14 PM, Ryan Brush wrote:
> I'm spawning off a separate thread from the "latest version" Chris
> started
> <https://groups.google.com/a/cloudera.org/forum/#!topic/cdk-dev/DPh7Pusp-aU>.
> I'd like to enumerate the things we're doing with partitions today and
> talk through how we'd achieve them in Kite. I think the key is that a
> /partition/ is really just a View with some special properties. For
> instance, these properties might be: deleteAll works, it can be uniquely
> identified with a set of keys (e.g. {year = 2015, month = 3, day = 26}),
> and can be efficiently moved or archived. So if we can enumerate the set
> of views that meet these properties (reified views? physical views?),
> then we could probably perform management functions without needing to
> expose the concept of partitions.
>
> Anyway, here are some things we do with partition-based data that we'd
> need to shift into Kite if we wanted to adopt it broadly:
>
> * The latest version of a dataset (as Chris described, which appears
> solvable by setting a property on the dataset itself)
For now, let's go with that. I want to get a VersionedDataset put
together though, which could potentially keep a Hive view or table
up-to-date with the latest version.
> * Oozie integration -- which may be addressed by the
> isReady()/signalReady() proposal as the latest idea floated over on
> CDK-409.
+1 to the signalReady solution.
> * Efficient archival/removal of old data. For instance, I want to move
> old datasets into an archive location and more aggressively compress
> and combine them. We iterate through partitions to do this, but I
> think if we could enumerate views that are movable that would work
> as well.
+1
> * Simple processing validation, e.g. ensuring we have updates for
> every hour -- even if that load contained no data
> o Note that checking view.isEmpty() isn't sufficient here...it's
> possible that the data for the 10:00 update was loaded but it just
> didn't contain anything.
Does the isReady/signalReady work help here? It seems cleaner to
validate processing directly rather than by observing side-effects,
which is really what we're doing when we "validate" a run based on
whether a directory exists.
The signalReady solution can serve as confirmation of something you know
happened, rather than creating the output directory only to fail during
a merge. Merge the output first, and then signal that it's done.
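Roughly what I have in mind, as a sketch only. The ReadySignal interface,
InMemoryReadyView, and load() are hypothetical names for illustration, not
existing Kite API, and the final isReady()/signalReady() signatures may well
differ:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical interface; the real isReady()/signalReady() API may differ.
interface ReadySignal {
  boolean isReady();
  void signalReady();
}

// In-memory stand-in for a view that tracks readiness explicitly.
class InMemoryReadyView implements ReadySignal {
  private final List<String> records = new ArrayList<>();
  private final AtomicBoolean ready = new AtomicBoolean(false);

  // Merge new output into the view; throws if the merge cannot complete.
  void merge(List<String> output) {
    if (output == null) {
      throw new IllegalStateException("merge failed: no output to merge");
    }
    records.addAll(output);
  }

  @Override public boolean isReady() { return ready.get(); }
  @Override public void signalReady() { ready.set(true); }

  boolean isEmpty() { return records.isEmpty(); }
}

class ReadySignalSketch {
  // Merge first, signal second: a failed merge never leaves a "ready" view.
  static void load(InMemoryReadyView view, List<String> output) {
    view.merge(output);
    view.signalReady();
  }
}
```

With this ordering, an hourly load that contained no records still ends up
ready (isReady() is true while isEmpty() is also true), and a load whose merge
throws is never marked ready.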
I really like this approach. The dataset implementation should determine
what makes sense for partitions. I'm for a Partitioned interface that can
make these guarantees and provide the getPartitionViews() (or similar)
method, too. (Since we always call these partitions, I like
PartitionView the best.)
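To make that concrete, here is a minimal sketch of how such an interface
might look. Everything in it is an assumption for discussion: the
PartitionView and Partitioned shapes, the keys() accessor, the in-memory
classes, and the dropBefore() helper are hypothetical and not part of Kite
today:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical: a view tied to one unit of physical storage.
interface PartitionView {
  Map<String, Integer> keys();   // e.g. {year=2015, month=3, day=26}
  boolean deleteAll();           // guaranteed to work; true if data was removed
}

// Hypothetical: implemented by datasets that can make partition guarantees.
interface Partitioned {
  Iterable<PartitionView> getPartitionViews();
}

// Toy partition that only remembers whether it has been deleted.
class SimplePartitionView implements PartitionView {
  private final Map<String, Integer> keys;
  private boolean deleted = false;

  SimplePartitionView(int year, int month, int day) {
    Map<String, Integer> k = new LinkedHashMap<>();
    k.put("year", year);
    k.put("month", month);
    k.put("day", day);
    this.keys = Collections.unmodifiableMap(k);
  }

  @Override public Map<String, Integer> keys() { return keys; }

  @Override public boolean deleteAll() {
    boolean changed = !deleted;
    deleted = true;
    return changed;
  }
}

class InMemoryPartitionedDataset implements Partitioned {
  private final List<PartitionView> partitions = new ArrayList<>();

  void addPartition(int year, int month, int day) {
    partitions.add(new SimplePartitionView(year, month, day));
  }

  @Override public Iterable<PartitionView> getPartitionViews() {
    return Collections.unmodifiableList(partitions);
  }
}

class PartitionTools {
  // Delete every partition from before the given year (assumes a "year" key);
  // returns how many partitions actually had data removed.
  static int dropBefore(Partitioned dataset, int year) {
    int dropped = 0;
    for (PartitionView p : dataset.getPartitionViews()) {
      if (p.keys().get("year") < year && p.deleteAll()) {
        dropped++;
      }
    }
    return dropped;
  }
}
```

The point of the sketch is that management tooling like dropBefore() only
depends on the enumeration method and the deleteAll() guarantee, so the
dataset implementation stays free to decide what a partition actually is.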
Another one I've been kicking around is the idea of the same thing for
Locality. Basically, this would allow us to expose block and locality
information without delegating to InputFormats and InputSplit. Removing
the MR dependency from core is a win.
> Additional tooling could be built around physical views as well, such as
> the ability to move it from one dataset to another (compatible) dataset,
> but those could be separate and added later.
>
> Whether we go down this route or not, I think we need some way to
> enumerate physical views that can be archived and tracked for tooling
> purposes. This should be expressive enough to obviate our use of partitions.
I think what you propose is great and shouldn't be too much work. We
need a View subclass that takes a path, then we need to add the
interface to the API or SPI to provide access to the
`getPartitionedView` method. PartitionView should also guarantee that
calls to deleteAll() return without throwing UnsupportedOperationException.
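As a rough illustration of the path-backed piece, assuming a local
filesystem via java.nio.file as a stand-in (a real implementation would
presumably go through the dataset's own storage layer), and with
PathPartitionView and getLocation() as hypothetical names:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical path-backed view; not the actual Kite class.
class PathPartitionView {
  private final Path location;

  PathPartitionView(Path location) {
    this.location = location;
  }

  Path getLocation() {
    return location;
  }

  // Never throws UnsupportedOperationException: the view maps to a directory,
  // so removing everything is always a well-defined operation.
  // Returns true if anything was deleted, false if it was already gone.
  boolean deleteAll() {
    if (!Files.exists(location)) {
      return false;
    }
    try (Stream<Path> walk = Files.walk(location)) {
      // Reverse order visits children before their parent directories.
      List<Path> deepestFirst =
          walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
      for (Path p : deepestFirst) {
        Files.delete(p);
      }
      return true;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Because the view is defined by its path, the same location could also be
handed to tooling that moves or archives the partition instead of deleting it.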
rb
--
Ryan Blue
Software Engineer
Cloudera, Inc.