Re: Need a sampling of 10,000 URLs


Tom Morris

May 13, 2015, 3:05:27 PM
to common...@googlegroups.com
On Wed, May 13, 2015 at 12:16 PM, Andriy Drozdyuk <dro...@gmail.com> wrote:

> How can I get a sample of 10,000 URLs from the dataset? It doesn't matter how random it is, but I'd like to avoid cases where I get thousands of links from the same TLD.

URLs crawled or the contents of those pages? If you just want the URLs, it'd be pretty simple to tweak the Python program that I posted a few weeks ago to do the sampling.
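
For a concrete flavour of that (this is a sketch, not the script from the earlier post), a one-pass reservoir sample keeps the pick uniform without knowing the list length up front:

  import random
  import sys

  def reservoir_sample(stream, k):
      # Keep the first k items outright, then replace a random slot with
      # decreasing probability so every item is equally likely to survive.
      sample = []
      for i, item in enumerate(stream):
          if i < k:
              sample.append(item)
          else:
              j = random.randrange(i + 1)
              if j < k:
                  sample[j] = item
      return sample

  if __name__ == '__main__':
      # e.g.  zcat url-list.gz | python sample_urls.py > sample.txt
      for url in reservoir_sample(sys.stdin, 10000):
          sys.stdout.write(url)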


If you need the pages' contents, you could either sample the index and then fetch the contents from the segments, or you could just use one or more segments and rely on whatever inherent randomness there is in the crawl process.
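
The useful property of the index route is that each index entry records which WARC file a page lives in plus a byte offset and length, and the records are gzipped individually, so a single HTTP Range request pulls one page. A rough sketch along those lines (the field names here are assumptions; check them against the actual index format):

  import gzip
  import io
  import requests

  S3 = 'https://aws-publicdatasets.s3.amazonaws.com/'

  def fetch_page(entry):
      # entry: one parsed index record, with assumed 'filename',
      # 'offset' and 'length' fields locating the WARC record.
      start = int(entry['offset'])
      end = start + int(entry['length']) - 1
      resp = requests.get(S3 + entry['filename'],
                          headers={'Range': 'bytes=%d-%d' % (start, end)})
      # Each record is its own gzip member, so the fetched slice
      # decompresses independently of the rest of the file.
      return gzip.GzipFile(fileobj=io.BytesIO(resp.content)).read()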

For example, this segment, chosen at random:

  https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2015-14/segments/1427131292567.7/warc/CC-MAIN-20150323172132-00108-ip-10-168-14-71.ec2.internal.warc.gz 

contains over 50K unique pages from over 20K hosts with a reasonable distribution.  Here are the counts for the top 10 hosts:


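A tally like that takes only a few lines to reproduce yourself; here's a sketch, assuming the warcio library and a local copy of the segment file:

  from collections import Counter
  from urllib.parse import urlparse

  from warcio.archiveiterator import ArchiveIterator

  hosts = Counter()
  with open('CC-MAIN-20150323172132-00108-ip-10-168-14-71'
            '.ec2.internal.warc.gz', 'rb') as f:
      for record in ArchiveIterator(f):
          if record.rec_type == 'response':
              url = record.rec_headers.get_header('WARC-Target-URI')
              hosts[urlparse(url).netloc] += 1

  # Print the ten busiest hosts, largest first.
  for host, n in hosts.most_common(10):
      print('%6d  %s' % (n, host))
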
You'd only need the first 200MB or so to get your 10K pages.
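
In that spirit, here's a sketch of pulling just the head of the segment with a Range request and stopping once 10K pages are in hand (again assuming warcio; the ranged download cuts the final record short, hence the catch at the end):

  import requests
  from warcio.archiveiterator import ArchiveIterator
  from warcio.exceptions import ArchiveLoadFailed

  SEGMENT = ('https://aws-publicdatasets.s3.amazonaws.com/common-crawl/'
             'crawl-data/CC-MAIN-2015-14/segments/1427131292567.7/warc/'
             'CC-MAIN-20150323172132-00108-ip-10-168-14-71.ec2.internal.warc.gz')

  def first_pages(n=10000, head=200 * 1024 * 1024):
      # Stream roughly the first 200MB and yield (url, raw_html) pairs.
      resp = requests.get(SEGMENT, stream=True,
                          headers={'Range': 'bytes=0-%d' % (head - 1)})
      seen = 0
      try:
          for record in ArchiveIterator(resp.raw):
              if record.rec_type != 'response':
                  continue
              yield (record.rec_headers.get_header('WARC-Target-URI'),
                     record.content_stream().read())
              seen += 1
              if seen >= n:
                  return
      except ArchiveLoadFailed:
          pass  # the truncated tail of the download won't parse cleanly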

Tom