to common...@googlegroups.com
On Wed, May 13, 2015 at 12:16 PM, Andriy Drozdyuk <dro...@gmail.com> wrote:
How can I get a sample of 10,000 urls from the dataset? It doesn't matter how random it is, but I'd like to avoid cases where I get 1000's of links from the same TLD.
Do you want the URLs that were crawled or the contents of those pages? If you just want the URLs, it'd be pretty simple to tweak the Python program that I posted a few weeks ago to do the sampling.
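For illustration only, here is a rough sketch of what URL sampling with a per-host cap could look like. This is not the program mentioned above; the function name `sample_urls` and its parameters (`k`, `max_per_host`, `rate`) are made up for this example, and it assumes you already have the URLs as an iterable of strings, e.g. one per line from an index file.

```python
import random
from collections import Counter
from urllib.parse import urlsplit


def sample_urls(urls, k=10000, max_per_host=10, rate=0.01, seed=None):
    """Return up to k URLs, keeping at most max_per_host per hostname.

    Hypothetical helper: a single streaming pass, not a uniform sample.
    """
    rng = random.Random(seed)
    per_host = Counter()  # how many URLs we have kept for each hostname
    sample = []
    for url in urls:
        if len(sample) >= k:
            break
        # Keep each URL with probability `rate` so the picks are spread
        # over the input instead of all coming from the first few lines.
        if rng.random() >= rate:
            continue
        try:
            host = urlsplit(url.strip()).hostname
        except ValueError:
            continue  # skip URLs that cannot be parsed
        if host is None or per_host[host] >= max_per_host:
            continue
        per_host[host] += 1
        sample.append(url.strip())
    return sample
```

You would call it with something like `sample_urls(open("urls.txt"), k=10000, max_per_host=5, rate=0.001)`, where `urls.txt` is a placeholder file name. The cap here is per hostname; if you really want to limit by TLD, you would key the counter on the last label of the hostname instead. If the input runs out before `k` URLs are collected, raise `rate`.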
If you need the page contents, you could either sample the index and then fetch the contents from the segments, or you could just use one or more segments and rely on whatever inherent randomness there is in the crawl process.