Ryan Blue
Feb 23, 2015, 1:25:03 PM
to Nithin Asokan, cdk...@cloudera.org
On 02/23/2015 09:36 AM, Nithin Asokan wrote:
> Can data managed by Kite take up more space on disk? I have a basic
> MapReduce job that reads input from HDFS and writes it through a Crunch
> target to an HDFS dataset. I find it interesting that the output stored
> on disk takes nearly twice the storage space of my input. I would like to
> understand whether Kite adds any metadata to the output that could
> contribute to the additional space. Is compression enabled by default? I
> have seen this behavior with both Avro and Parquet files.
>
> Here are some stats
>
> Avro:
> hadoop dfs -du -h -s /tmp/avro/input
> 2.1 G /tmp/avro/input
>
> hadoop dfs -du -h -s /tmp/avro/output
> 4.4 G /tmp/avro/output
>
> Parquet:
> hadoop dfs -du -h -s /tmp/parquet/input
> 2.1 G /tmp/parquet/input
>
> hadoop dfs -du -h -s /tmp/parquet/output
> 4.9 G /tmp/parquet/output
Hi Nithin,
Kite uses Snappy compression by default, but you can configure it to use
other codecs. I don't think compression is the source of this problem,
though.
What is the partition strategy for this data? Because it is happening for
both Parquet and Avro, the most likely cause is the partitioning splitting
the data into many more files.
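To illustrate why splitting the same data across many more files can inflate the on-disk footprint, here is a toy Python sketch. This is not Kite code, and it uses zlib rather than Snappy, but the effect is the same: each file compresses its records independently, so redundancy shared across the whole dataset has to be rediscovered in every small file.

```python
import zlib

# Simulate a dataset of repetitive records (compresses very well).
data = b"user=1234,event=click,ts=1424709903\n" * 50_000

# Compress as one big file vs. as 1,000 small "partition" files.
whole = len(zlib.compress(data))
chunk_size = len(data) // 1000
parts = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
split = sum(len(zlib.compress(p)) for p in parts)

# The per-file totals are larger because every small file restarts
# compression from scratch and pays its own header overhead.
print(f"one file: {whole} bytes; 1000 files: {split} bytes total")
```

The same logic is why a heavily partitioned dataset can be noticeably larger than the unpartitioned input even with the same codec, and why Parquet in particular (which also relies on large row groups for its encodings) is sensitive to many small output files.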
rb
--
Ryan Blue
Software Engineer
Cloudera, Inc.