Data truncated at ~5MB in the arc files???


Lee Graber

Feb 20, 2013, 5:15:01 PM
to common...@googlegroups.com
Hi,
   I have only recently started using the Common Crawl data set and seem to be hitting some issues. I have been playing around with a modified version of the examples project from GitHub, and I am finding that the documents stored in the ARC files are not always complete. The ones that seem to be failing all have content lengths of ~524278. To give an example:

segment:  s3n://aws-publicdatasets/common-crawl/parse-output/segment/1341690163490/1341782443295_1551.arc.gz
file: big-lie1.pdf
content-length: 524276 (I checked and the payload matched this value)

When I did a wget of the actual URL, I found that the file is actually 680215 bytes. That makes the content in the ARC file pretty much useless. I hit this a number of times for other PDFs in the same segment. I spot-checked a couple of them and they all had content lengths of about the same size. Is there some type of crawl limit that I am not aware of, or some special handling that has to happen when the file exceeds ~5MB? I can see that the example code skips these files, citing a desire to avoid running out of memory (which doesn't make much sense to me). Any insight would be appreciated.

Thanks
Lee

Lee Graber

Feb 20, 2013, 5:21:49 PM
to common...@googlegroups.com
Sorry about my bad math. That is not 5MB but ~500K; I was looking at a different number. The question is still the same, though: why is it being truncated?

Ken Krugler

Feb 20, 2013, 8:05:17 PM
to common...@googlegroups.com
On Feb 20, 2013, at 2:21pm, Lee Graber wrote:

Sorry about my bad math. That is not 5MB but ~500K; I was looking at a different number. The question is still the same, though: why is it being truncated?

Most web crawls will truncate fetched content at some pre-defined limit, as otherwise a bad server can cause the fetch process to fail with an out-of-memory exception.

Looks like maybe 512K was the limit for PDFs, but Ahad would need to confirm.

If so, then that's generally too short… we use 2MB as a reasonable upper bound for PDFs when crawling.

And when truncation happens with binary (non-HTML) files, you definitely want to have it flagged as such, since parsing will (almost) always fail. This isn't the case for HTML, where TagSoup/NekoHTML can clean up the broken HTML that results from truncation.
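
The crawler-side logic is roughly like the sketch below. This is only an illustration, not the actual CC fetcher; the 2MB cap and the truncated flag are stand-ins for whatever the real crawler uses.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

// Minimal sketch of capping fetched content at a fixed limit.
// Not the actual Common Crawl fetcher; the 2MB cap is illustrative.
public class CappedFetch {

    private static final int MAX_CONTENT_BYTES = 2 * 1024 * 1024;

    public static class Result {
        public final byte[] content;
        public final boolean truncated;  // callers should skip parsing binary content when true

        public Result(byte[] content, boolean truncated) {
            this.content = content;
            this.truncated = truncated;
        }
    }

    public static Result fetch(String url) throws IOException {
        InputStream in = new URL(url).openStream();
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                if (buf.size() + n > MAX_CONTENT_BYTES) {
                    // keep only up to the cap and flag the record as truncated
                    buf.write(chunk, 0, MAX_CONTENT_BYTES - buf.size());
                    return new Result(buf.toByteArray(), true);
                }
                buf.write(chunk, 0, n);
            }
            return new Result(buf.toByteArray(), false);
        } finally {
            in.close();
        }
    }
}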

-- Ken




--------------------------
Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr





Ahad Rana

Feb 20, 2013, 8:24:52 PM
to Common Crawl
Hi Lee/Ken,

The limit should be 2MB for all content, but I believe the original limit at some point was 512K. You can confirm that the CC crawler truncated the content by looking for the x-commoncrawl-ContentTruncated HTTP header. In the future we will probably bump this to 5MB for non-HTML content.
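
In a map function that check looks roughly like the sketch below. It is only an illustration and assumes you can get at the record's HTTP headers as a Map<String, String>; the exact accessor depends on which version of the ARC reader you are using.

import java.util.Map;

// Sketch only: assumes the record's HTTP headers are already available
// as a Map<String, String>; how you obtain them depends on your ARC reader.
public class TruncationCheck {

    // Header name mentioned above; compare case-insensitively to be safe.
    private static final String TRUNCATED_HEADER = "x-commoncrawl-ContentTruncated";

    public static boolean wasTruncated(Map<String, String> httpHeaders) {
        for (String name : httpHeaders.keySet()) {
            if (name.equalsIgnoreCase(TRUNCATED_HEADER)) {
                return true;
            }
        }
        return false;
    }
}

If the header is present, skip the record (or count it separately) rather than handing the truncated bytes to a PDF parser.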

BTW, 1341690163490 is an invalid segment. Please refer to the s3://aws-publicdatasets/common-crawl/parse-output/valid_segments.txt file for the list of valid segment IDs. There were some issues with ArcFile generation that necessitated regenerating the ARC segments a while back, but I don't think the truncation issue will be addressed between runs :-(
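
Something like the following will turn a local copy of valid_segments.txt into segment input prefixes for a job. Again, this is just a sketch; it assumes the file contains one segment id per line.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Sketch: assumes valid_segments.txt has been copied locally (one segment id
// per line) and builds the corresponding s3n:// prefixes for job input.
public class ValidSegments {

    public static List<String> segmentPaths(String localValidSegmentsFile) throws IOException {
        List<String> paths = new ArrayList<String>();
        BufferedReader reader = new BufferedReader(new FileReader(localValidSegmentsFile));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (!line.isEmpty()) {
                    paths.add("s3n://aws-publicdatasets/common-crawl/parse-output/segment/" + line + "/");
                }
            }
        } finally {
            reader.close();
        }
        return paths;
    }
}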

Ahad.