Does a single file in a single segment provide a reasonably random cross section of crawl data?


David Parks

Apr 13, 2014, 12:51:10 PM
to common...@googlegroups.com
I'd like to download a small subset of the data to train a classifier. I'm wondering whether a single file from a single segment would give me a sufficiently random cross section of the crawl data, or whether that one file might be skewed by data from a single large site, or from just a few sites.

Take this file for example:

s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-10/segments/1394678705117/wet/CC-MAIN-20140313024505-00099-ip-10-183-142-35.ec2.internal.warc.wet.gz

How random can I expect the data in that one file to be?
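For what it's worth, here's a minimal sketch of how one could measure the skew empirically, assuming the file above has been downloaded locally (Python, standard library only; the WARC-Target-URI header comes from the WET record format, the file name is just the example above):

import gzip
from collections import Counter
from urllib.parse import urlparse

# Tally hostnames across all records in one WET file.
counts = Counter()
with gzip.open("CC-MAIN-20140313024505-00099-ip-10-183-142-35.ec2.internal.warc.wet.gz",
               "rt", encoding="utf-8", errors="replace") as f:
    for line in f:
        if line.startswith("WARC-Target-URI:"):
            uri = line.split(":", 1)[1].strip()
            counts[urlparse(uri).netloc] += 1

# Print the 20 biggest hosts and their share of the file.
total = sum(counts.values())
for host, n in counts.most_common(20):
    print(f"{n:7d}  {n / total:6.2%}  {host}")

If the top few hosts account for a large fraction of the records, the file isn't a usable random sample on its own.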

Guy Dumais

Apr 22, 2014, 2:02:46 PM
to common...@googlegroups.com
I was also hoping for an answer to this question. My little experiment with Common Crawl showed that a single large website can take up to one third of a segment. Clearly, such segments cannot be used as a representative sample of the Common Crawl corpus as a whole.

jor...@commoncrawl.org

Apr 22, 2014, 11:05:38 PM
to common...@googlegroups.com
It won't be random. Each file is, if I recall correctly, the output of a single crawler, which groups pages by domain. I can work on generating a truly random sample if people would like.
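For concreteness, a rough sketch of one way to do it (not a promise of the final approach): a single-pass reservoir sample over WET records spread across many files, so that no one file's domain grouping dominates.

import gzip
import random

def wet_records(path):
    """Yield one WET record (headers + text payload) at a time.

    Records in a WET file are delimited by "WARC/1.0" header lines.
    """
    record = []
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            if line.startswith("WARC/1.0") and record:
                yield "".join(record)
                record = []
            record.append(line)
    if record:
        yield "".join(record)

def reservoir_sample(paths, k, seed=0):
    """Uniform sample of k records across all files, in one pass (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    seen = 0
    for path in paths:
        for rec in wet_records(path):
            seen += 1
            if len(sample) < k:
                sample.append(rec)
            else:
                # Replace an existing entry with probability k / seen.
                j = rng.randrange(seen)
                if j < k:
                    sample[j] = rec
    return sample

The more files you feed it, the less any one crawler's domain grouping matters.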


Jordan

shlomi...@gmail.com

Apr 26, 2014, 6:40:49 PM
to common...@googlegroups.com
... I can work on generating a truly random sample if people would like.

That would be awesome! 

Shlomi