Issues/questions regarding creating partition using identity


Mats Naslund

Apr 25, 2016, 2:28:47 PM
to CDK Development

I'm trying to create a partition using an identity partitioner, and I'm wondering whether I'm doing it correctly. When I create the descriptor below and view the data in Hive, I have two columns, A.B.C.D and D, containing the same data. Am I missing something?

String partitionOn = "[{\"type\": \"identity\", \"source\": \"A.B.C.D\", \"name\": \"D\"}]";

DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
    .schema(schema)
    .partitionStrategy(PartitionStrategyParser.parse(partitionOn))
    .build();

Thank you

Joey Echeverria

Apr 25, 2016, 4:21:59 PM
to Mats Naslund, CDK Development
You're not missing anything; this is a by-product of how Hive
implements partitioning [1]. The D column isn't stored in the
underlying Avro or Parquet files; it is instead stored in the Hive
Metastore database. When querying with Hive or Impala, you need to
include a filter on the D column for partition pruning to work
correctly. This isn't required with Kite because we maintain a
connection between the name of the partition field (D) and its
source field (A.B.C.D). With the Kite APIs, you can filter on
either A.B.C.D or D and it will have the same effect.
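
To make the source/name relationship concrete, here is a minimal,
self-contained sketch in plain Java. It is not Kite or Hive code:
the class IdentityPartitionSketch and all of its methods are
hypothetical names made up for illustration, and records are
modeled as nested maps rather than Avro objects. It only shows the
idea that an identity partition copies the value of the source
field (A.B.C.D) into a partition named D, laid out Hive-style as a
"D=value" directory, and that a filter on either name can be
resolved to the same partition.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of an identity partition; not part of Kite or Hive. */
final class IdentityPartitionSketch {
    private final String source; // dotted path of the source field, e.g. "A.B.C.D"
    private final String name;   // partition (column) name, e.g. "D"

    IdentityPartitionSketch(String source, String name) {
        this.source = source;
        this.name = name;
    }

    /** Builds a nested-map record holding one leaf value at the dotted path. */
    static Map<String, Object> nestedRecord(String dottedPath, Object leaf) {
        String[] parts = dottedPath.split("\\.");
        Object current = leaf;
        // Wrap the leaf from the innermost field outwards.
        for (int i = parts.length - 1; i >= 0; i--) {
            Map<String, Object> wrapper = new HashMap<>();
            wrapper.put(parts[i], current);
            current = wrapper;
        }
        @SuppressWarnings("unchecked")
        Map<String, Object> result = (Map<String, Object>) current;
        return result;
    }

    /** Walks the nested maps along the source path and returns the leaf value. */
    Object sourceValue(Map<String, Object> record) {
        Object current = record;
        for (String part : source.split("\\.")) {
            current = ((Map<?, ?>) current).get(part);
        }
        return current;
    }

    /** Identity partitioning: the partition value is the source value, unchanged. */
    String partitionDirectory(Map<String, Object> record) {
        return name + "=" + sourceValue(record);
    }

    /**
     * Resolves an equality filter on either the source field or the partition
     * name to the same partition directory. Returns null when the field is
     * neither, meaning no partition can be pruned from that filter alone.
     */
    String directoryForFilter(String field, Object value) {
        if (field.equals(source) || field.equals(name)) {
            return name + "=" + value;
        }
        return null;
    }
}
```

In this sketch, directoryForFilter("A.B.C.D", v) and
directoryForFilter("D", v) return the same directory, which is the
behavior described above for Kite. A query engine that only knows
about the partition column would correspond to handling the "D"
case alone.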

I hope that helps!

-Joey

[1] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-PartitionedTables