Disk usage for avro files


Nithin Asokan

Feb 23, 2015, 12:41:09 PM2/23/15
to cdk...@cloudera.org
Can data managed by Kite use more disk space than its input? I have a basic MapReduce job that reads input from HDFS and writes it as a Crunch target to an HDFS dataset. I find it interesting that the output stored on disk takes nearly twice the storage space of my input. I would like to understand whether Kite adds any metadata to the output that could account for the additional space. Is compression enabled by default?

Here are some stats

Avro
hadoop dfs -du -h -s /tmp/avro/input
2.1 G /tmp/avro/input

hadoop dfs -du -h -s /tmp/avro/output
4.4 G /tmp/avro/output

Ryan Brush

Feb 23, 2015, 12:55:35 PM2/23/15
to cdk...@cloudera.org
Are there many more files in your output than in your input? If you're doing a basic map job it will create a separate output Avro file for each map task, and each output file includes the Avro schema. You can get around this using something like org.kitesdk.data.crunch.CrunchDatasets.partition, which will create one output file for each partition the dataset is targeting. (It does introduce a reduce phase into your job, but this can actually improve performance since it reduces the number of writers and files being created in HDFS.)
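The per-file overhead described above can be sketched in a few lines. This is a hypothetical illustration, not Avro itself: the schema and record below are made up, and zlib's deflate stands in for whatever codec the files actually use. The point is that many small files pay the schema header repeatedly and compress each chunk independently, so the total grows.

```python
import zlib

# Made-up stand-ins: a writer schema that every Avro file embeds in its
# header, and a repetitive record payload.
schema = b'{"type":"record","name":"Event","fields":[{"name":"id","type":"long"},{"name":"msg","type":"string"}]}'
record = b'{"id": 42, "msg": "hello avro"}\n'
records = record * 50000  # ~1.6 MB of repetitive data

# One output file: one schema header + one compressed stream.
one_file = len(schema) + len(zlib.compress(records))

# 200 map tasks -> 200 files: 200 schema headers + 200 independently
# compressed streams, each restarting with an empty compression history.
n_files = 200
chunk = len(records) // n_files
many_files = sum(
    len(schema) + len(zlib.compress(records[i * chunk:(i + 1) * chunk]))
    for i in range(n_files)
)

print(one_file, many_files)  # many_files comes out noticeably larger
```

The gap widens with more files and more repetitive data, which is why funneling output through a reduce phase (fewer, larger files) can shrink on-disk size as well as file counts.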

The other possibility is the output isn't being compressed as efficiently as your input (although I believe Kite uses Snappy by default), so it's worth poking around your compression settings.
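One quick way to poke at those settings is to look at what codec the output files actually declare. Below is a stdlib-only sketch of reading the "avro.codec" entry from an Avro container file header, following the layout in the Avro spec (4-byte magic "Obj\x01", a map<bytes> of metadata, a 16-byte sync marker); in practice `avro-tools getmeta <file>` prints the same information. The in-memory header built at the bottom is fake, just to exercise the parser.

```python
import io

def _read_long(f):
    # Avro longs are zigzag-encoded varints.
    shift, acc = 0, 0
    while True:
        b = f.read(1)[0]
        acc |= (b & 0x7F) << shift
        shift += 7
        if not (b & 0x80):
            break
    return (acc >> 1) ^ -(acc & 1)

def _write_long(buf, n):
    z = (n << 1) ^ (n >> 63)  # zigzag encode (n >= 0 here)
    while z > 0x7F:
        buf.write(bytes([(z & 0x7F) | 0x80]))
        z >>= 7
    buf.write(bytes([z]))

def avro_codec(f):
    assert f.read(4) == b"Obj\x01", "not an Avro container file"
    meta = {}
    while True:
        count = _read_long(f)
        if count == 0:
            break
        if count < 0:  # negative count: a block byte size follows
            _read_long(f)
            count = -count
        for _ in range(count):
            key = f.read(_read_long(f)).decode()
            meta[key] = f.read(_read_long(f))
    return meta.get("avro.codec", b"null").decode()

# Build a minimal fake header in memory to exercise the parser.
header = io.BytesIO()
header.write(b"Obj\x01")
_write_long(header, 2)  # metadata map: one block of two entries
for k, v in [(b"avro.schema", b'"bytes"'), (b"avro.codec", b"deflate")]:
    _write_long(header, len(k)); header.write(k)
    _write_long(header, len(v)); header.write(v)
_write_long(header, 0)      # end of map
header.write(b"\x00" * 16)  # sync marker
header.seek(0)
codec = avro_codec(header)
print(codec)  # deflate
```

If the input and output files report different codecs, that mismatch is the first thing to chase.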

Nithin Asokan

Feb 24, 2015, 3:49:01 PM2/24/15
to cdk...@cloudera.org
I found that I was using a different compression scheme on my input and output. When I changed my dataset to use 'deflate' compression instead of Snappy, the disk usage for the input and output folders is now the same. Thanks for helping.
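The size difference between codecs is easy to reproduce. Snappy isn't in the Python stdlib, so in this rough sketch zlib at level 1 stands in for a fast, lighter codec and level 9 for thorough deflate; the exact ratio depends entirely on the data, but the direction of the trade-off (CPU for size) is the same one observed above.

```python
import zlib

# Synthetic payload: a cycling byte pattern plus a repeated log line.
payload = bytes(range(256)) * 4000 + b"status=OK code=200 path=/tmp/avro\n" * 20000

fast = zlib.compress(payload, 1)      # stand-in for a fast codec
thorough = zlib.compress(payload, 9)  # thorough deflate

print(len(payload), len(fast), len(thorough))
```

With two codecs of different strength on each side of a job, input and output sizes can diverge even when the records are byte-for-byte identical.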