Regarding Kite Views

Buntu Dev

Mar 12, 2015, 3:32:49 PM
to cdk...@cloudera.org
I have a pageviews dataset partitioned by year/month/day, but I would like to add another partition level at the top, say pagename/year/month/day. I'm worried this will easily create 10k partitions over a couple of years.

I'm looking at the Kite Views API and wanted to know whether defining a view:hive:<table>?pagename=xyz (where <table> is partitioned by year/month/day) is equivalent to adding the 'pagename' partition to <table>, or is it just a convenience for working with the data of interest? Basically, does the view help with query performance?


Thanks!

Ryan Blue

Mar 12, 2015, 3:57:03 PM
to Buntu Dev, cdk...@cloudera.org
Good question!

Partitioning in Kite builds your main "index" into the data. For file
system datasets, that's the directory structure that you can use to
select files that contain the right data. This is the on-disk structure
of your dataset.

Views are how you express constraints to select data that you want. This
is a logical view of the data, which doesn't need to correspond to the
on-disk structure. All constraints, whether or not they can be used to
ignore unneeded partitions, are applied to the data before records are
passed to the reader.

Using views to pass constraints to Kite allows Kite to handle selecting
which partitions should be used, bridging the gap between logical (what
data you want) and on-disk structure (the details). It also allows Kite
to pass those constraints on further for better performance. For
example, Kite could pass your constraints to the Parquet format, saving
I/O bandwidth and processor cycles (this is on the roadmap).

While there may be ways to speed up your query by passing the
constraints to Kite, the main one that is implemented today is to ignore
unnecessary partitions. Other constraints aren't used to speed up the
read, though there are ways they could be.
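
To make the distinction concrete, here is a minimal conceptual sketch
in Python. It is not Kite's implementation or API: the function
read_view, the PARTITION_FIELDS tuple, and the in-memory DATA layout
are all hypothetical stand-ins for a dataset partitioned by
year/month/day. Constraints on partition fields skip whole partitions;
every constraint is still checked against each record that is read.

```python
from typing import Dict, List, Tuple

# Hypothetical on-disk layout: the dataset is partitioned by these fields only.
PARTITION_FIELDS = ("year", "month", "day")

Partition = Tuple[int, int, int]
Record = Dict[str, object]


def read_view(
    partitions: Dict[Partition, List[Record]],
    constraints: Dict[str, object],
) -> Tuple[List[Record], int]:
    """Return (matching records, number of partitions actually scanned)."""
    scanned = 0
    matched: List[Record] = []
    for key, records in partitions.items():
        part_values = dict(zip(PARTITION_FIELDS, key))
        # Pruning: a constraint on a partition field that this partition
        # cannot satisfy lets us skip the partition without reading it.
        if any(
            field in constraints and constraints[field] != value
            for field, value in part_values.items()
        ):
            continue
        scanned += 1
        # Residual filtering: all constraints are applied record by record,
        # whether or not they helped with pruning.
        for record in records:
            row = {**part_values, **record}
            if all(row.get(f) == v for f, v in constraints.items()):
                matched.append(record)
    return matched, scanned


# Made-up sample data for illustration only.
DATA: Dict[Partition, List[Record]] = {
    (2015, 3, 11): [{"pagename": "xyz"}, {"pagename": "abc"}],
    (2015, 3, 12): [{"pagename": "xyz"}],
}
```

In this model, read_view(DATA, {"pagename": "xyz"}) returns only the
"xyz" records but still scans both partitions, because pagename is not
part of the layout. Adding a constraint on day is what reduces the
number of partitions read.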

This is in contrast to Hive, which allows you to express logical
constraints (time between X and Y) but requires that you also provide
constraints for partition columns (date between '2015-02-01' and
'2015-02-28'). The two don't necessarily align, and you have to pay
attention to ensure the partitions that get selected actually contain
all of the data for your logical constraints. (Kite would take your time
constraint and figure out the partition details automatically, so
there's less chance for error.)
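
As a rough illustration of deriving partition details from a logical
time constraint, the sketch below lists the day partitions that could
contain records in a time range. It is only a model of the idea, not
Kite code: partitions_for_range is a hypothetical helper, and it
assumes a half-open range (start <= time < end) and that year/month/day
are derived from the same time field.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def partitions_for_range(start: datetime, end: datetime) -> List[Tuple[int, int, int]]:
    """List the (year, month, day) partitions that may hold start <= time < end."""
    if end <= start:
        return []
    day = start.date()
    # The range is half-open, so the last instant covered is just before `end`.
    last = (end - timedelta(microseconds=1)).date()
    result = []
    while day <= last:
        result.append((day.year, day.month, day.day))
        day += timedelta(days=1)
    return result
```

For a range from 2015-02-28 23:00 to 2015-03-01 01:00 this yields two
partitions, one in each month. A hand-written partition predicate that
covers only February would silently miss the March records, which is
the kind of mismatch described above.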

Thanks for asking!

rb

--
Ryan Blue
Software Engineer
Cloudera, Inc.

Buntu Dev

Mar 12, 2015, 6:46:40 PM
to Ryan Blue, cdk...@cloudera.org
Thanks Ryan for all the info.