Re: Disk usage for Avro files


Ryan Blue

Feb 23, 2015, 1:25:03 PM
to Nithin Asokan, cdk...@cloudera.org
On 02/23/2015 09:36 AM, Nithin Asokan wrote:
> Can data managed by Kite take up more space on disk? I have a basic
> MapReduce job that reads input from HDFS and writes it through a
> Crunch target to an HDFS dataset. I find it interesting that the
> output stored on disk takes nearly twice the space of my input. I
> would like to understand whether Kite adds any metadata to the output
> that could account for the extra space. Is compression enabled by
> default? I have seen this behavior with both Avro and Parquet files.
>
> Here are some stats
>
> Avro:
> hadoop dfs -du -h -s /tmp/avro/input
> 2.1 G /tmp/avro/input
>
> hadoop dfs -du -h -s /tmp/avro/output
> 4.4 G /tmp/avro/output
>
> Parquet:
> hadoop dfs -du -h -s /tmp/parquet/input
> 2.1 G /tmp/parquet/input
>
> hadoop dfs -du -h -s /tmp/parquet/output
> 4.9 G /tmp/parquet/output

Hi Nithin,

Kite uses snappy compression by default, but you can configure it to
use another codec. I don't think compression is the source of this
problem, though.
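If you do want to change the codec, here's a minimal sketch, assuming a
Kite release that includes DatasetDescriptor.Builder#compressionType
(the one-field schema is a placeholder):

    import org.apache.avro.SchemaBuilder;
    import org.kitesdk.data.CompressionType;
    import org.kitesdk.data.DatasetDescriptor;

    public class DeflateDescriptor {
      public static void main(String[] args) {
        // Sketch: build a descriptor that uses deflate instead of the
        // default snappy codec. The schema here is a placeholder.
        DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
            .schema(SchemaBuilder.record("Rec").fields()
                .requiredString("id").endRecord())
            .compressionType(CompressionType.Deflate)
            .build();
        System.out.println(descriptor.getCompressionType());
      }
    }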

What is the partition strategy for this data? Because this is happening
for both Parquet and Avro, the most likely cause is that partitioning is
splitting the data into more, smaller files.
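For reference, a partition strategy along these lines (a hypothetical
example; "timestamp" is an assumed field name) writes each day's records
into its own directory, so a few large input files can fan out into many
smaller output files, which also compress less effectively:

    import org.kitesdk.data.PartitionStrategy;

    public class DailyPartitions {
      public static void main(String[] args) {
        // Hypothetical: partition records by year/month/day derived
        // from a "timestamp" field; each day gets its own directory.
        PartitionStrategy strategy = new PartitionStrategy.Builder()
            .year("timestamp")
            .month("timestamp")
            .day("timestamp")
            .build();
        System.out.println(strategy);
      }
    }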

rb


--
Ryan Blue
Software Engineer
Cloudera, Inc.

Nithin Asokan

Feb 24, 2015, 3:47:56 PM
to cdk...@cloudera.org, anit...@gmail.com
Hi Ryan,
I used a non-partitioned dataset. On further testing, I found that I was
using different compression schemes for the input and the output. When I
changed my dataset to use 'deflate' compression instead of snappy, the
disk usage is the same for the input and output folders. Thanks for
helping.
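For anyone who wants to verify which codec a file actually carries, the
Avro container file header records it; a minimal sketch (the part file
name is a placeholder, and the file must be readable locally):

    import java.io.File;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    public class PrintCodec {
      public static void main(String[] args) throws Exception {
        // Sketch: read the codec recorded in an Avro container file's
        // header; prints e.g. "deflate" or "snappy".
        DataFileReader<GenericRecord> reader = new DataFileReader<>(
            new File("part-m-00000.avro"),
            new GenericDatumReader<GenericRecord>());
        System.out.println(reader.getMetaString("avro.codec"));
        reader.close();
      }
    }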