difference between crawls

Hadar Rottenberg

May 18, 2016, 4:15:39 AM
to Common Crawl
Do the crawls contain different URLs, or is there some intersection between them?
Is there some kind of logic behind the set of URLs chosen for each crawl?

Thanks
Hadar

Sebastian Nagel

May 18, 2016, 6:46:20 AM
to Common Crawl
Hi Hadar,

that's an interesting question. For now, the rough answer is: yes, there is a large overlap
because the seed list hasn't been changed for the last few crawls. Of course, there are
also smaller differences:
- documents (temporarily) disappear
- fetch lists are shuffled when crawling and cut off after some time
  (differences are expected to be bigger for hosts with long fetch lists
   or slowly responding hosts)

I'm not aware of any comprehensive analysis (maybe I've missed it!?).
I hope to have exact numbers available next week: overlaps (URLs and domains)
and other metrics for at least the last 5 crawls. The analysis will be done
on the Common Crawl index. If you want to know the overlap for only a few domains,
you can get the URL lists from http://index.commoncrawl.org/ and calculate
the overlaps.
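
A rough sketch of that per-domain comparison in Python, under some assumptions: the index server is queried per crawl with the `url` and `output=json` parameters and returns one JSON record per line with a `url` field. The function names and the crawl identifiers in the usage note are only illustrative, and paging for large domains is ignored.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: the index server answers per-crawl queries at
# <server>/<crawl id>-index?url=<pattern>&output=json
INDEX_SERVER = "http://index.commoncrawl.org"


def parse_urls(lines):
    """Collect the distinct 'url' values from JSON-lines index records."""
    urls = set()
    for line in lines:
        line = line.strip()
        if line:
            urls.add(json.loads(line)["url"])
    return urls


def fetch_urls(crawl_id, url_pattern):
    """Query the index of one crawl and return the set of URLs it lists.

    Needs network access; results for large domains may be split across
    pages, which this sketch does not handle.
    """
    query = "{}/{}-index?url={}&output=json".format(
        INDEX_SERVER, crawl_id, quote(url_pattern, safe="")
    )
    with urlopen(query) as response:
        return parse_urls(response.read().decode("utf-8").splitlines())


def overlap(first, second):
    """Summarize how two URL sets intersect."""
    shared = first & second
    union = first | second
    return {
        "shared": len(shared),
        "only_first": len(first - second),
        "only_second": len(second - first),
        # Jaccard similarity: shared URLs divided by all distinct URLs
        "jaccard": len(shared) / len(union) if union else 0.0,
    }
```

For example, `overlap(fetch_urls("CC-MAIN-2016-07", "example.com/*"), fetch_urls("CC-MAIN-2016-18", "example.com/*"))` would compare one domain across two crawls; the domain and crawl identifiers here are placeholders to be replaced with the ones of interest.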

Regards,
Sebastian

Tom Morris

May 18, 2016, 11:19:17 AM
to common...@googlegroups.com
On Wed, May 18, 2016 at 6:46 AM, 'Sebastian Nagel' via Common Crawl <common...@googlegroups.com> wrote:
> Hi Hadar,
>
> that's an interesting question.  By now, the rough answer is: yes, there is large overlap
> because the seed list for the last crawls hasn't been changed. Of course, there are
> also smaller differences:
> - documents (temporarily) disappear
> - fetch lists are shuffled when crawling and cut off after some time
>   (differences are expected to be bigger for hosts with long fetch lists
>    or slowly responding hosts)
>
> I'm not aware of any comprehensive analysis (maybe I've missed it!?)

Christian Buck posted a cursory analysis last month. His spreadsheet is here:


Recent crawls have each been adding only about 50M new unique URLs that have never been crawled before.
 
> I hope to have exact numbers available next week - overlaps (URLs and domains)
> and also other metrics, at least, for the last 5 crawls. The analysis will be done
> on the Common Crawl index. If you want to know the overlap only for few domains,
> it's possible to get the URL lists from http://index.commoncrawl.org/ and calculate
> the overlaps.

I'll look forward to your analysis.

Tom 