Sorry about my bad math. That's not 5MB but ~500K; I was looking at a different number. The question is still the same, though: why is it being truncated?
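In case it helps anyone reproduce the check outside the Hadoop examples, here is a rough standalone sketch (not my actual job code, which is a modified copy of the examples project). It assumes you have a local copy of one of the segment's .arc.gz files; the file name and the ~512 KiB threshold below are just placeholders. It walks the ARC v1 records, pulls the Content-Length out of each embedded HTTP response, and flags payloads sitting right at that boundary:

import gzip

ARC_PATH = "1341782443295_1551.arc.gz"  # placeholder: local copy of a segment file
TRUNCATION_SUSPECT = 512 * 1024          # 524288 bytes, where the bad records cluster

with gzip.open(ARC_PATH, "rb") as f:
    while True:
        # Each ARC v1 record starts with a single header line:
        #   <url> <ip> <archive-date> <content-type> <length>
        header = f.readline()
        if not header:
            break                        # end of archive
        if not header.strip():
            continue                     # blank separator line between records
        fields = header.strip().split(b" ")
        url, record_len = fields[0], int(fields[-1])
        body = f.read(record_len)        # embedded HTTP response: headers + payload

        if url.startswith(b"filedesc:"):
            continue                     # first record is archive metadata, skip it

        # Split the HTTP response into headers and payload.
        head, _, payload = body.partition(b"\r\n\r\n")
        declared = None
        for line in head.split(b"\r\n"):
            if line.lower().startswith(b"content-length:"):
                declared = int(line.split(b":", 1)[1])
                break

        if declared is not None and abs(declared - TRUNCATION_SUSPECT) <= 4096:
            print("%s: declared %d, payload %d -- possibly truncated"
                  % (url.decode("utf-8", "replace"), declared, len(payload)))

I used plain gzip plus manual header parsing here just so it doesn't depend on the Hadoop input format; it only flags suspects, it doesn't prove why they were cut off.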
On Wednesday, February 20, 2013 2:15:01 PM UTC-8, Lee Graber wrote:

Hi,

I have only recently started using the Common Crawl data set and seem to be hitting some issues. I have been playing around with a modified version of the examples project from GitHub, and I am finding that the documents stored in the ARC files are not always complete. The ones that seem to be failing all have content lengths of ~524278. To give an example:

segment: s3n://aws-publicdatasets/common-crawl/parse-output/segment/1341690163490/1341782443295_1551.arc.gz
file: big-lie1.pdf
content-length: 524276 (I checked, and the payload matched this value)

When I did a wget of the actual URL, I found that the file is actually 680215 bytes, which makes the content in the ARC file pretty much useless. I hit this a number of times for other PDFs in the same segment; I spot-checked a couple of them, and they all had content lengths of about the same size. Is there some type of crawl limit that I am not aware of, or some special handling that has to happen when the file exceeds ~5MB? I can see that the example code skips these files, but it indicates that is to avoid running out of memory (which doesn't make much sense). Any insight would be appreciated.

Thanks
Lee