I'd like to download a small subset of data to work on training a classifier. I'm wondering whether a single file from a single segment would give me a sufficiently random cross-section of crawl data, or whether that one file might be skewed by data from a single large site, or from just a few sites.
Take this file for example:
s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-10/segments/1394678705117/wet/CC-MAIN-20140313024505-00099-ip-10-183-142-35.ec2.internal.warc.wet.gz
How random can I expect the data in that one file to be?
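One way to answer this empirically is to download the file and tally how many records come from each hostname, using the WARC-Target-URI header that precedes each record in a WET file. This is a rough sketch, assuming the file has been fetched to a local path first; the header parsing is minimal and doesn't handle every edge case in the WARC format:

```python
import gzip
from collections import Counter
from urllib.parse import urlparse

def domain_counts(lines):
    """Count hostnames seen in WARC-Target-URI headers of a WET file."""
    counts = Counter()
    for line in lines:
        if line.startswith("WARC-Target-URI:"):
            uri = line.split(":", 1)[1].strip()
            host = urlparse(uri).netloc
            if host:
                counts[host] += 1
    return counts

def skew_report(path, top_n=10):
    """Open a local .warc.wet.gz file and report host concentration."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        counts = domain_counts(f)
    total = sum(counts.values())
    print(f"{total} records across {len(counts)} distinct hosts")
    for host, n in counts.most_common(top_n):
        print(f"{n:6d}  {n / total:6.1%}  {host}")
```

If the top few hosts account for a large fraction of the records, the file is skewed and you'd want to sample records across many files (or segments) instead of taking one file whole.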