.co.uk Domain Coverage

Alexander Mitchell

Sep 30, 2015, 7:23:09 AM

to Common Crawl

Hi all

Does anyone have a baseline for the coverage of .co.uk domains in the Common Crawl? The index server (great resource, BTW) suggests there are only about 446K, but other sources suggest there are about 10M .co.uk domains in total.

The indexes linked to on http://index.commoncrawl.org/ aren't deltas, are they?

Cheers

Alex

Tom Morris

Sep 30, 2015, 1:59:18 PM

to common...@googlegroups.com

On Wed, Sep 30, 2015 at 7:23 AM, Alexander Mitchell <alexander...@gmail.com> wrote:

> Does anyone have a baseline for the coverage of .co.uk domains in the Common Crawl? The index server (great resource, BTW) suggests there are only about 446K, but other sources suggest there are about 10M .co.uk domains in total.
>
> The indexes linked to on http://index.commoncrawl.org/ aren't deltas, are they?

No, they are complete crawls, but I haven't seen any published reports of the degree of overlap among the various crawls. I count 508K .co.uk domains in the 2015-14 crawl, so your number sounds like the right ballpark.

As we've seen in other cases, there are some anomalies like the #2 site by page count:

366171 uk,co,schooluniformshop

and these very small, simple sites in the top 20 (crawler stuck chasing calendar entries perhaps?):

192701 uk,co,southnorfolkguesthouse

190032 uk,co,hotelanacapri

172984 uk,co,foulsykefarmhouse
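
For anyone who wants to reproduce counts like these, here is a rough sketch of how pages per domain could be tallied from SURT-style keys (the `uk,co,name` form shown above). The helper names and the sample keys are made up for illustration, and the code naively assumes the registered domain is the first three labels, which only holds for two-label suffixes such as .co.uk.

```python
from collections import Counter


def registered_domain_key(urlkey: str) -> str:
    """Reduce a SURT-style key like 'uk,co,example,shop)/path' to 'uk,co,example'.

    Assumption: the registered domain is the first three labels, which is
    only correct for two-label public suffixes such as .co.uk.
    """
    host_part = urlkey.split(")", 1)[0]   # drop the path portion, if any
    labels = host_part.split(",")
    return ",".join(labels[:3])


def count_pages_per_domain(urlkeys):
    """Return a Counter mapping each registered-domain key to its page count."""
    return Counter(registered_domain_key(key) for key in urlkeys)


# Hypothetical keys, not taken from any real crawl.
sample_keys = [
    "uk,co,example,shop)/",
    "uk,co,example,shop)/about",
    "uk,co,example)/",
    "uk,co,another)/index.html",
]

counts = count_pages_per_domain(sample_keys)
for domain, pages in counts.most_common():
    print(pages, domain)
```

The number of distinct keys in `counts` is the domain count, and `most_common()` gives the by-page-count ranking where outliers like the ones above would show up.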

Tom

Greg Lindahl

Sep 30, 2015, 4:34:05 PM

to common...@googlegroups.com

On Wed, Sep 30, 2015 at 01:59:16PM -0400, Tom Morris wrote:

> As we've seen in other cases, there are some anomalies like the #2 site by
> page count:
>
> 366171 uk,co,schooluniformshop
>
> and these very small, simple sites in the top 20 (crawler stuck chasing
> calendar entries perhaps?):
>
> 192701 uk,co,southnorfolkguesthouse
> 190032 uk,co,hotelanacapri
> 172984 uk,co,foulsykefarmhouse

I would not be surprised if these are "crawler traps" in the Blekko
metadata -- that's not a large enough pagecount for us to have noticed
the trap. Millions, yes, we would have noticed.

The latter 3 examples are in our curated list of hotel websites, so we
were willing to crawl them deeper than usual.

Here's the list of curated sites:
https://raw.githubusercontent.com/wumpus/slashtag-data/master/slastag.json

-- greg

Tom Morris

Sep 30, 2015, 6:11:58 PM

to common...@googlegroups.com

On Wed, Sep 30, 2015 at 4:34 PM, Greg Lindahl <lin...@pbm.com> wrote:

> On Wed, Sep 30, 2015 at 01:59:16PM -0400, Tom Morris wrote:
>
>> As we've seen in other cases, there are some anomalies like the #2 site by
>> page count:
>>
>> 366171 uk,co,schooluniformshop
>>
>> and these very small, simple sites in the top 20 (crawler stuck chasing
>> calendar entries perhaps?):
>>
>> 192701 uk,co,southnorfolkguesthouse
>> 190032 uk,co,hotelanacapri
>> 172984 uk,co,foulsykefarmhouse
>
> I would not be surprised if these are "crawler traps" in the Blekko
> metadata -- that's not a large enough pagecount for us to have noticed
> the trap. Millions, yes, we would have noticed.
>
> The latter 3 examples are in our curated list of hotel websites, so we
> were willing to crawl them deeper than usual.

Even with a curated list, I think the crawler should probably be a little more defensive. The site only has 10 pages!

I took a look at the first 5 pages of results for southnorfolkguesthouse:

http://index.commoncrawl.org/CC-MAIN-2015-32-index?url=http%3A%2F%2Fwww.southnorfolkguesthouse.co.uk%2F*&output=json&page=0

and of the 73,386 pages fetched, 72,771 of them appear to be empty pages fetched from the southnorfolkguesthouse.co.uk domain. The 615 pages which actually returned content were all fetched from www.southnorfolkguesthouse.co.uk (note the leading www.)
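
In case it helps anyone repeat this kind of check, below is a minimal sketch of building the index query shown above and tallying the returned records by hostname. It assumes the server returns one JSON object per line and that each record carries a `url` field; the function names and the sample records are made up, and the actual network fetch is left out.

```python
import json
from collections import Counter
from urllib.parse import urlencode, urlsplit

INDEX_ENDPOINT = "http://index.commoncrawl.org/CC-MAIN-2015-32-index"


def build_query_url(url_pattern: str, page: int = 0) -> str:
    """Build an index query using the url/output/page parameters seen above."""
    params = {"url": url_pattern, "output": "json", "page": page}
    # safe="*" keeps the trailing wildcard unescaped, as in the query above.
    return INDEX_ENDPOINT + "?" + urlencode(params, safe="*")


def tally_by_host(json_lines: str) -> Counter:
    """Count records per hostname (assumes one JSON object per line with a 'url' field)."""
    hosts = Counter()
    for line in json_lines.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        hosts[urlsplit(record["url"]).hostname] += 1
    return hosts


# Hypothetical response body, for illustration only.
sample_response = "\n".join([
    '{"url": "http://example.co.uk/rooms"}',
    '{"url": "http://example.co.uk/rooms?page=2"}',
    '{"url": "http://www.example.co.uk/rooms"}',
])

print(build_query_url("http://www.southnorfolkguesthouse.co.uk/*"))
print(tally_by_host(sample_response))
```

Splitting the tally by hostname is what separates the bare domain from the www. one; deciding which records count as empty would need whatever size or status fields the real output provides.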

Pages like: http://www.southnorfolkguesthouse.co.uk/evening-meals/Default.aspx?shortname=OakbrookNR172HE&industrytype=1&startdate=2013-01-04&nights=1&windowsearch=0&page=3&location&adults1=1

were fetched over a hundred times each in this small sample despite the fact that the query string is ignored and the page content is always the same.
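
To get a feel for how much of a fetch list is this kind of repetition, one option is to collapse URLs that differ only in their query string and count how many land on each page. This is only a sketch with made-up helper names and sample URLs, and dropping the query string is only valid for a site where you already know the server ignores it.

```python
from collections import Counter
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Return the URL with its query string and fragment removed."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def count_duplicates(urls):
    """Count how many fetched URLs collapse onto each query-less URL."""
    return Counter(strip_query(url) for url in urls)


# Hypothetical fetch list, for illustration only.
fetched = [
    "http://www.example.co.uk/evening-meals/Default.aspx?startdate=2013-01-04&page=3",
    "http://www.example.co.uk/evening-meals/Default.aspx?startdate=2013-01-05&page=3",
    "http://www.example.co.uk/evening-meals/Default.aspx?startdate=2013-01-06&page=1",
    "http://www.example.co.uk/contact/Default.aspx",
]

for page, fetches in count_duplicates(fetched).most_common():
    print(fetches, page)
```

A page with a high count here is a candidate for the behaviour described above, though confirming it would still mean comparing the fetched content itself.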

The other two hotels look like they're running the same software (https://eviivo.com/), so they're probably afflicted by the same problems.

Tom

Anderson Smith

Oct 1, 2015, 6:07:27 AM

to Common Crawl

I think the crawler is working properly here. I also have a question about the crawler: does it only crawl sites with the .co.uk extension and not .com? Would this website, http://www.antiviruscontactsupport.com/, be crawled?

smith

Alexander Mitchell

Oct 1, 2015, 9:22:21 AM

to Common Crawl

Thanks Tom - much appreciated
