Information needed about the crawling parameters of the CC-MAIN-2013-48 (November) dataset

Zahid Adeel

Mar 12, 2014, 3:01:20 AM
to common...@googlegroups.com
Hi everyone,
I am working on the CC-MAIN-2013-48 dataset, so I need some information about its crawling parameters, i.e. the crawl depth and the seed list.
If anyone knows, kindly share this info.

Thanks in advance!

Jordan Mendelson

Mar 12, 2014, 5:25:29 PM
to common...@googlegroups.com
Our 2013 crawls use URL lists donated by Blekko, the same lists they use to populate their search engine, so it is a fairly comprehensive set of pages that one could build a decent search engine on. We don't go quite as deep as Blekko because our politeness policy means it might take a month to fetch all the pages. For our later 2014 crawls, though, we'll be recycling old unfetched URLs back into the crawl so that eventually we pick them all up.


Jordan
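
To illustrate the recycling idea described above, here is a minimal Python sketch. It is not Common Crawl's actual pipeline: the function name, its parameters, and the example URLs are all assumptions made up for this example. It only shows the general shape of carrying URLs that were scheduled but never fetched into the next crawl's seed list, ahead of a freshly donated list.

```python
def next_crawl_seeds(donated_urls, previously_scheduled, fetched):
    """Build a seed list for the next crawl (hypothetical helper).

    URLs that were scheduled in an earlier crawl but never fetched are
    recycled first, followed by the freshly donated URLs. Duplicates are
    dropped while keeping first-seen order.
    """
    fetched_set = set(fetched)
    # Leftovers: scheduled earlier, but the crawl ended before fetching them.
    unfetched = [url for url in previously_scheduled if url not in fetched_set]
    # dict.fromkeys removes duplicates and preserves insertion order.
    return list(dict.fromkeys(unfetched + list(donated_urls)))


# Made-up example data, not real crawl lists.
seeds = next_crawl_seeds(
    donated_urls=["http://a.example/", "http://c.example/"],
    previously_scheduled=["http://a.example/", "http://b.example/"],
    fetched=["http://a.example/"],
)
print(seeds)
```

Repeating this step from one crawl to the next is one way that every donated URL could eventually be picked up, even if a single crawl cannot reach them all under a politeness limit.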
