Does a single file in a single segment provide a reasonably random cross section of crawl data?


David Parks

Apr 13, 2014, 12:51:10 PM
to common...@googlegroups.com
I'd like to download a small subset of the data to train a classifier. I'm wondering whether a single file from a single segment would give me a sufficiently random cross section of the crawl data, or whether that one file might be skewed by data from a single large site, or from just a few sites.

Take this file for example:

s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-10/segments/1394678705117/wet/CC-MAIN-20140313024505-00099-ip-10-183-142-35.ec2.internal.warc.wet.gz

How random can I expect the data in that one file to be?
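For what it's worth, here's a minimal sketch of how one could measure the skew empirically, assuming the file above has been downloaded locally (Python, standard library only; the WARC-Target-URI header comes from the WET record format, the file name is just the example above):

import gzip
from collections import Counter
from urllib.parse import urlparse

# Tally hostnames across all records in one WET file.
counts = Counter()
with gzip.open("CC-MAIN-20140313024505-00099-ip-10-183-142-35.ec2.internal.warc.wet.gz",
               "rt", encoding="utf-8", errors="replace") as f:
    for line in f:
        if line.startswith("WARC-Target-URI:"):
            uri = line.split(":", 1)[1].strip()
            counts[urlparse(uri).netloc] += 1

# Print the 20 biggest hosts and their share of the file.
total = sum(counts.values())
for host, n in counts.most_common(20):
    print(f"{n:7d}  {n / total:6.2%}  {host}")

If the top few hosts account for a large fraction of the records, the file isn't a usable random sample on its own.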

Guy Dumais

Apr 22, 2014, 2:02:46 PM
to common...@googlegroups.com
I was also hoping for an answer to this question. My little experiment with Common Crawl showed that a single large website can take up to one third of a segment. Clearly, such segments cannot be used as a representative sample of the Common Crawl corpus as a whole.

jor...@commoncrawl.org

Apr 22, 2014, 11:05:38 PM
to common...@googlegroups.com
It won't be random. Each file is, if I recall correctly, the output of a single crawler, which groups pages by domain. I can work on generating a truly random sample if people would like.
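For concreteness, a rough sketch of one way to do it (not a promise of the final approach): a single-pass reservoir sample over WET records spread across many files, so that no one file's domain grouping dominates.

import gzip
import random

def wet_records(path):
    """Yield one WET record (headers + text payload) at a time.

    Records in a WET file are delimited by "WARC/1.0" header lines.
    """
    record = []
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            if line.startswith("WARC/1.0") and record:
                yield "".join(record)
                record = []
            record.append(line)
    if record:
        yield "".join(record)

def reservoir_sample(paths, k, seed=0):
    """Uniform sample of k records across all files, in one pass (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    seen = 0
    for path in paths:
        for rec in wet_records(path):
            seen += 1
            if len(sample) < k:
                sample.append(rec)
            else:
                # Replace an existing entry with probability k / seen.
                j = rng.randrange(seen)
                if j < k:
                    sample[j] = rec
    return sample

The more files you feed it, the less any one crawler's domain grouping matters.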


Jordan

shlomi...@gmail.com

Apr 26, 2014, 6:40:49 PM
to common...@googlegroups.com
... I can work on generating a truly random sample if people would like.

That would be awesome! 

Shlomi