Disabling crc file generation


Micah Whitacre

Jun 11, 2014, 2:49:01 PM
to cdk...@cloudera.org
While playing around with Kite and writing to the filesystem as either Avro or Parquet, I noticed that every file written gets an accompanying crc file.

-rw-r--r--  1 user  staff    24 Jun 11 13:29 .67c4dc18-cf64-486c-b755-8b01da10e288.avro.crc
-rw-r--r--  1 user  staff    24 Jun 11 13:30 .b9b0eb83-b75f-425b-9c2a-eef4c8d9d526.avro.crc
-rwxr-xr-x  1 user  staff  1782 Jun 11 13:29 67c4dc18-cf64-486c-b755-8b01da10e288.avro
-rwxr-xr-x  1 user  staff  1782 Jun 11 13:30 b9b0eb83-b75f-425b-9c2a-eef4c8d9d526.avro

What is the long-term plan/purpose for those files? I'm curious because we'd obviously want to avoid flooding HDFS with lots of tiny files. Is there a way to turn off the generation of those files?

Thanks,
Micah


Ryan Blue

Jun 11, 2014, 3:25:52 PM
to Micah Whitacre, cdk...@cloudera.org
Tom has a good write-up about the crc files here:

https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-4/data-integrity

Basically, they are used to make sure data hasn't been corrupted and in
some cases replace corrupt copies with good ones. The overhead is fairly
minimal for the utility you get, so I don't think it's a good idea to
add an option to turn it off. The main concern over a bunch of tiny
files is MR performance, but these are not used when calculating splits.

Are you running into a specific problem?

rb

--
Ryan Blue
Software Engineer
Cloudera, Inc.

Micah Whitacre

Jun 11, 2014, 5:50:03 PM
to cdk...@cloudera.org, mkw...@gmail.com, bl...@cloudera.com
Thanks for the link, I'll give it a read. I guess I was thinking more of the first problem Tom documents[1], where the overhead of managing small files outweighs the content. Obviously it's a tradeoff for various benefits. I haven't hit a problem yet, as my Kite usage is still a prototype, but I wanted to make sure I didn't ignore a side effect that might bite me later.

Patrick Angeles

Jun 11, 2014, 9:25:28 PM
to Micah Whitacre, cdk...@cloudera.org, bl...@cloudera.com
Micah,

Those crc files don't add any overhead to the NN namespace. They're not HDFS data files; they are meta files in the data directories. You will see them in your local filesystem if you use the "file:///" URI.





Vikash Pareek

Oct 9, 2017, 3:40:31 AM
to CDK Development, mkw...@gmail.com, bl...@cloudera.com, pat...@cloudera.com
Hi Patrick,

I completely agree with you that the .crc file is good for data integrity and that it does not add any overhead on the NN.

Still, there are a few cases where we need to avoid .crc files. For example, in my case I have mounted S3 using S3FS and am saving data from an RDD to the mount point.
This creates lots of .crc files on S3 that we don't need. To work around this we have to write an extra utility to filter out all the .crc files, which degrades our performance.
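A filtering utility of the kind described above could look roughly like the following. This is only a sketch: it assumes the checksum sidecars follow the hidden `.<name>.crc` naming seen in the directory listing earlier in the thread, and the function names `is_crc_sidecar` and `data_files` are hypothetical, not part of Kite, Spark, or Hadoop.

```python
import os


def is_crc_sidecar(filename):
    """Return True for hidden checksum sidecars such as '.part-00000.crc'.

    Assumption: sidecars are named '.<data file name>.crc', as in the
    listing shown earlier in this thread.
    """
    return filename.startswith(".") and filename.endswith(".crc")


def data_files(root):
    """Return the paths of all files under root, skipping crc sidecars."""
    kept = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not is_crc_sidecar(name):
                kept.append(os.path.join(dirpath, name))
    return sorted(kept)
```

Skipping the sidecars while listing avoids a separate cleanup pass, but the walk still has to visit every entry, so on a mounted object store the listing cost itself remains.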

An interesting observation is that there is a .crc file for the _SUCCESS file too, and that .crc file is 8 bytes in size while the _SUCCESS file is 0 bytes.
If we have 1000 million part files, then we are using an extra 1000M*12 bytes.
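The sizes mentioned in this thread (8 bytes for the empty _SUCCESS file, 12 bytes per small part file, 24 bytes for the 1782-byte Avro files) are consistent with a sidecar made of an 8-byte header plus one 4-byte CRC per 512-byte chunk of data. That layout and the 512-byte chunk size are assumptions here, not something stated in the thread, and the function name is hypothetical. Under those assumptions the arithmetic can be sketched as:

```python
HEADER_BYTES = 8          # assumed fixed header size of a .crc sidecar
CRC_BYTES = 4             # one CRC32 value per chunk
BYTES_PER_CHECKSUM = 512  # assumed default chunk size


def crc_sidecar_size(data_len, bytes_per_checksum=BYTES_PER_CHECKSUM):
    """Estimate the size in bytes of the .crc sidecar for a file of data_len bytes."""
    chunks = -(-data_len // bytes_per_checksum)  # ceiling division
    return HEADER_BYTES + chunks * CRC_BYTES


def total_sidecar_bytes(file_count, data_len):
    """Estimate the total sidecar bytes for file_count files of equal size."""
    return file_count * crc_sidecar_size(data_len)
```

With these assumptions, 1000 million part files of at most 512 bytes each would carry 12 bytes of sidecar apiece, about 12 GB in total; larger part files would carry proportionally more.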

Best Regards,
Vikash Pareek