Re: Need a sampling of 10,000 URLs


Tom Morris

May 13, 2015, 3:05:27 PM
to common...@googlegroups.com
On Wed, May 13, 2015 at 12:16 PM, Andriy Drozdyuk <dro...@gmail.com> wrote:

> How can I get a sample of 10,000 URLs from the dataset? It doesn't matter how random it is, but I'd like to avoid cases where I get thousands of links from the same TLD.

URLs crawled or the contents of those pages? If you just want the URLs, it'd be pretty simple to tweak the Python program that I posted a few weeks ago to do the sampling.
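
For a concrete flavour of that (this is a sketch, not the script from the earlier post), a one-pass reservoir sample keeps the pick uniform without knowing the list length up front:

  import random
  import sys

  def reservoir_sample(stream, k):
      # Keep the first k items outright, then replace a random slot with
      # decreasing probability so every item is equally likely to survive.
      sample = []
      for i, item in enumerate(stream):
          if i < k:
              sample.append(item)
          else:
              j = random.randrange(i + 1)
              if j < k:
                  sample[j] = item
      return sample

  if __name__ == '__main__':
      # e.g.  zcat url-list.gz | python sample_urls.py > sample.txt
      for url in reservoir_sample(sys.stdin, 10000):
          sys.stdout.write(url)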


If you need the pages' contents, you could either sample the index and then fetch the contents from the segments, or you could just use one or more segments and rely on whatever inherent randomness there is in the crawl process.
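
The useful property of the index route is that each index entry records which WARC file a page lives in plus a byte offset and length, and the records are gzipped individually, so a single HTTP Range request pulls one page. A rough sketch along those lines (the field names here are assumptions; check them against the actual index format):

  import gzip
  import io
  import requests

  S3 = 'https://aws-publicdatasets.s3.amazonaws.com/'

  def fetch_page(entry):
      # entry: one parsed index record, with assumed 'filename',
      # 'offset' and 'length' fields locating the WARC record.
      start = int(entry['offset'])
      end = start + int(entry['length']) - 1
      resp = requests.get(S3 + entry['filename'],
                          headers={'Range': 'bytes=%d-%d' % (start, end)})
      # Each record is its own gzip member, so the fetched slice
      # decompresses independently of the rest of the file.
      return gzip.GzipFile(fileobj=io.BytesIO(resp.content)).read()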

For example, this segment, chosen at random:

  https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2015-14/segments/1427131292567.7/warc/CC-MAIN-20150323172132-00108-ip-10-168-14-71.ec2.internal.warc.gz 

contains over 50K unique pages from over 20K hosts with a reasonable distribution.  Here are the counts for the top 10 hosts:


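A tally like that takes only a few lines to reproduce yourself; here's a sketch, assuming the warcio library and a local copy of the segment file:

  from collections import Counter
  from urllib.parse import urlparse

  from warcio.archiveiterator import ArchiveIterator

  hosts = Counter()
  with open('CC-MAIN-20150323172132-00108-ip-10-168-14-71'
            '.ec2.internal.warc.gz', 'rb') as f:
      for record in ArchiveIterator(f):
          if record.rec_type == 'response':
              url = record.rec_headers.get_header('WARC-Target-URI')
              hosts[urlparse(url).netloc] += 1

  # Print the ten busiest hosts, largest first.
  for host, n in hosts.most_common(10):
      print('%6d  %s' % (n, host))
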
You'd only need the first 200MB or so to get your 10K pages.
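
In that spirit, here's a sketch of pulling just the head of the segment with a Range request and stopping once 10K pages are in hand (again assuming warcio; the ranged download cuts the final record short, hence the catch at the end):

  import requests
  from warcio.archiveiterator import ArchiveIterator
  from warcio.exceptions import ArchiveLoadFailed

  SEGMENT = ('https://aws-publicdatasets.s3.amazonaws.com/common-crawl/'
             'crawl-data/CC-MAIN-2015-14/segments/1427131292567.7/warc/'
             'CC-MAIN-20150323172132-00108-ip-10-168-14-71.ec2.internal.warc.gz')

  def first_pages(n=10000, head=200 * 1024 * 1024):
      # Stream roughly the first 200MB and yield (url, raw_html) pairs.
      resp = requests.get(SEGMENT, stream=True,
                          headers={'Range': 'bytes=0-%d' % (head - 1)})
      seen = 0
      try:
          for record in ArchiveIterator(resp.raw):
              if record.rec_type != 'response':
                  continue
              yield (record.rec_headers.get_header('WARC-Target-URI'),
                     record.content_stream().read())
              seen += 1
              if seen >= n:
                  return
      except ArchiveLoadFailed:
          pass  # the truncated tail of the download won't parse cleanly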

Tom