Replacing Partitions with Views

25 views
Skip to first unread message

Ryan Brush

unread,
Mar 26, 2015, 3:14:58 PM3/26/15
to cdk...@cloudera.org
I'm spawning off a separate thread from the "latest version" Chris started. I'd like to enumerate the things we're doing with partitions today and talk through how we'd achieve them in Kite. I think the key is that a partition is really just a View with some special properties. For instance, these properties might be: deleteAll works, it can be uniquely identified with a set of keys (e.g. {year = 2015, month = 3, day = 26}), and can be efficiently moved or archived. So if we can enumerate the set of views that meet these properties (reified views? physical views?), then we could probably perform management functions without needing to expose the concept of partitions.

Anyway, here are some things we do with partition-based data that we'd need to shift into Kite if we wanted to adopt it broadly:
  • The latest version of a dataset (as Chris described, which appears solvable by setting a property on the dataset itself)
  • Oozie integration -- which may be addressed by the isReady()/signalReady() proposal as the latest idea floated over on CDK-409.
  • Efficient archival/removal of old data. For instance, I want to move old datasets into an archive location and more aggressively compress and combine them. We iterate through partitions to do this, but I think if we could enumerate views that are movable that would work as well.
  • Simple processing validation, e.g. ensuring we have updates for every hour -- even if that load contained no data
    • Note that checking view.isEmpty() isn't sufficient here...it's possibly that data for the 10:00 updates were loaded but it just didn't contain anything.
I think all of these could be solved by tooling built on the ability to enumerate views with these properties and access the keys that define the view (e.g., see the {year = 215, month = 3, day = 26} so I know that the load took place, even if it contains no data). We could further add some other operations that only make sense on such a view, such as move() to another dataset. 

Don't read too much into this next part -- it's more to stir conversation rather than a proposal -- but perhaps something along these lines:

public interface (Physical/Reified/Partition/)View extends View {
 
  // Returns the key uniquely identifying the physical view in the dataset. This way we can
  // ensure all of the physical views expected to be loaded indeed were, or choose which content
  // to move to an archive.
  // Or we could just offer a separate function to get the arguments from the view's URI, and drop the need for a PhysicalView interface.
  Map<String,Object> getKey();

  // Perhaps we should move this here from the View interface, since it would always be supported
  // on a physical view?
  boolean deleteAll();
}

Then we might add something like this to View:

List<PhysicalView> view.getCoveringViews(); // (Ryan Blue had kicked around something similar to this.)

Note that a view may have no PhysicalViews if the Dataset implementation doesn't support it, in which case the above method returns an empty list.

Additional tooling could be built around physical views as well, such as the ability to move it from one dataset to another (compatible) dataset, but those could be separate and added later.

Whether we go down this route or not, I think we need some way to enumerate physical views that can be archived and tracked for tooling purposes. This should be expressive enough to obviate our use of partitions.

Ryan Blue

unread,
Mar 26, 2015, 4:12:29 PM3/26/15
to Ryan Brush, cdk...@cloudera.org
+1

This is exactly how I think it should work. Partitions are just views
that are related to physical storage.

On 03/26/2015 12:14 PM, Ryan Brush wrote:
> I'm spawning off a separate thread from the "latest version" Chris
> started
> <https://groups.google.com/a/cloudera.org/forum/#!topic/cdk-dev/DPh7Pusp-aU>.
> I'd like to enumerate the things we're doing with partitions today and
> talk through how we'd achieve them in Kite. I think the key is that a
> /partition/ is really just a View with some special properties. For
> instance, these properties might be: deleteAll works, it can be uniquely
> identified with a set of keys (e.g. {year = 2015, month = 3, day = 26}),
> and can be efficiently moved or archived. So if we can enumerate the set
> of views that meet these properties (reified views? physical views?),
> then we could probably perform management functions without needing to
> expose the concept of partitions.
>
> Anyway, here are some things we do with partition-based data that we'd
> need to shift into Kite if we wanted to adopt it broadly:
>
> * The latest version of a dataset (as Chris described, which appears
> solvable by setting a property on the dataset itself)

For now, let's go with that. I want to get a VersionedDataset put
together though, which could potentially keep a Hive view or table
up-to-date with the latest version.

> * Oozie integration -- which may be addressed by the
> isReady()/signalReady() proposal as the latest idea floated over on
> CDK-409.

+1 to the signalReady solution.

> * Efficient archival/removal of old data. For instance, I want to move
> old datasets into an archive location and more aggressively compress
> and combine them. We iterate through partitions to do this, but I
> think if we could enumerate views that are movable that would work
> as well.

+1

> * Simple processing validation, e.g. ensuring we have updates for
> every hour -- even if that load contained no data
> o Note that checking view.isEmpty() isn't sufficient here...it's
> possibly that data for the 10:00 updates were loaded but it just
> didn't contain anything.

Does the isReady/signalReady work help here? It seems cleaner to
validate processing directly rather than by observing side-effects,
which is really what we're doing when we "validate" a run based on
whether a directory exists.

The signalReady solution can be used as a confirmation that you know
happened, rather than creating the output directory only to fail during
a merge. Merge the output and then signal that it's done.
I really like this approach. The dataset implementation should determine
what makes sense for partitions. I'm for a Partitoned interface that can
make these guarantees and provide the getPartitionViews() (or similar)
method, too. (Since we always call these partitions, I like
PartitionView the best.)

Another one I've been kicking around is the idea of the same thing for
Locality. Basically, this would allow us to expose block and locality
information without delegating to InputFormats and InputSplit. Removing
the MR dependency from core is a win.

> Additional tooling could be built around physical views as well, such as
> the ability to move it from one dataset to another (compatible) dataset,
> but those could be separate and added later.
>
> Whether we go down this route or not, I think we need some way to
> enumerate physical views that can be archived and tracked for tooling
> purposes. This should be expressive enough to obviate our use of partitions.

I think what you propose is great and shouldn't be too much work. We
need a View subclass that takes a path, then we need to add the
interface to the API or SPI to provide access to the
`getPartitionedView` method. PartitionView should also guarantee that
calls to deleteAll() return without throwing UnsupportedOperationException.

rb


--
Ryan Blue
Software Engineer
Cloudera, Inc.

Ryan Brush

unread,
Mar 27, 2015, 6:15:18 PM3/27/15
to cdk...@cloudera.org, rbr...@gmail.com
Great! I logged https://issues.cloudera.org/browse/CDK-972 to track this. We'll need to nail down the exact semantics of this, but I feel pretty good about the approach. 

Interesting idea on pushing data locality into this...I hadn't thought of that.
Reply all
Reply to author
Forward
0 new messages