Re: Early crawls (2008/2009)


Sebastian Nagel

Dec 27, 2016, 1:54:52 PM
to common...@googlegroups.com
Hi Mo,

some information about the format of the pre-2013 crawls is available on the old Common Crawl wiki
[1]. Those crawls were produced by different crawler software, which is why the output is organized
differently and uses a different format than today's crawls.

The date indicated by the folder hierarchy is the date the contained ARC files were written.
It should be close to the date the content was crawled (a few days or weeks earlier).

As far as I can see, the assignment of content to ARC files is random or determined by a hash
function. Taking a sample set of ARC files should therefore be sufficient to get a representative
subsample for studying the crawl.
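
For illustration, here is a rough sketch of how such a sample could be drawn with boto3. The
.arc.gz suffix, unsigned (anonymous) access, and the sample size are assumptions on my part, not
something verified in this thread:

# Sketch: draw a random sample of ARC files below the crawl-001 prefix.
# Assumptions: pre-2013 files end in ".arc.gz" and the public commoncrawl
# bucket can be listed with unsigned requests.
import random

import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# Collect the keys of all ARC files below the crawl-001 prefix.
keys = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="commoncrawl", Prefix="crawl-001/"):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith(".arc.gz"):
            keys.append(obj["Key"])

# Pick, say, 100 ARC files uniformly at random as the study sample.
sample = random.sample(keys, min(100, len(keys)))
for key in sample:
    print("s3://commoncrawl/" + key)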

> In particular, are all the files in crawl-001 supposed to be taken together to constitute
> 1 crawl? Or do they overlap and should I only consider some of the files?

Good question. Ahad Rana mentions that the content is deduplicated [2]; I would assume that this
also applies to re-crawled documents. That would not exclude, however, that the same URL appears in
multiple ARC files with changed content. I'll try to get an answer on this detail.
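
Once a sample of ARC files has been downloaded, one way to check this empirically is to count how
often each URL occurs across records. A rough sketch, assuming warcio (which can read ARC input)
and a hypothetical local directory sample/ holding the downloaded files:

# Sketch: count how often each URL appears across the sampled ARC files.
import glob
from collections import Counter

from warcio.archiveiterator import ArchiveIterator

url_counts = Counter()

for path in glob.glob("sample/*.arc.gz"):
    with open(path, "rb") as stream:
        # arc2warc=True presents ARC records through a WARC-like interface.
        for record in ArchiveIterator(stream, arc2warc=True):
            if record.rec_type == "response":
                url = record.rec_headers.get_header("WARC-Target-URI")
                url_counts[url] += 1

# URLs captured more than once (possibly re-crawls with changed content).
duplicates = {url: n for url, n in url_counts.items() if n > 1}
print(len(duplicates), "URLs appear in more than one record")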

Thanks,
Sebastian

[1] https://commoncrawl.atlassian.net/wiki/display/CRWL/About+the+Data+Set
[2] http://www.slideshare.net/hadoopusergroup/building-a-scalable-web-crawler-with-hadoop


On 12/23/2016 10:54 PM, Mohit Tiwari wrote:
> Hi,
>
> My name is Mo Tiwari and I'm currently a researcher at Stanford. I'm trying to analyze the early
> crawls, in particular Crawl #1, which covers 2008-2009. I see, however, that there are many
> sub-folders in the S3 bucket.
>
> These sub-folders seem to be labeled by the year, month, day, and hour of the crawl: for
> example, s3://commoncrawl/crawl-001/2008/06/27/10. Do I need to get all the ARC files from each of
> these subdirectories in order to understand the whole crawl? In particular, are all the files in
> crawl-001 supposed to be taken together to constitute 1 crawl? Or do they overlap and should I only
> consider some of the files? I'm a bit confused as to why this crawl has lots of subdirectories,
> whereas the later crawls are all stored in a flat directory with a bunch of WET files.
>
> Thank you and happy holidays!
> Mo
