On Wed, Sep 30, 2015 at 01:59:16PM -0400, Tom Morris wrote:
> As we've seen in other cases, there are some anomalies like the #2 site by
> page count:
>
> 366171 uk,co,schooluniformshop
>
> and these very small, simple sites in the top 20 (crawler stuck chasing
> calendar entries perhaps?):
>
> 192701 uk,co,southnorfolkguesthouse
> 190032 uk,co,hotelanacapri
> 172984 uk,co,foulsykefarmhouse

I would not be surprised if these are "crawler traps" in the Blekko
metadata -- that's not a large enough page count for us to have noticed
the trap. Millions, yes, we would have noticed.

The last three examples are in our curated list of hotel websites, so we
were willing to crawl them deeper than usual.

Here's the list of curated sites:
https://raw.githubusercontent.com/wumpus/slashtag-data/master/slastag.json
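
To illustrate the distinction drawn above, here is a minimal Python
sketch that separates "deep crawl of a curated site" from "possible
crawler trap" using the page counts quoted earlier. It is only a sketch
under stated assumptions: the 100,000-page cutoff, the function names,
and the hard-coded curated set are all hypothetical, and nothing here
reflects how Blekko actually detected traps or how the JSON file above
is laid out.

```python
# Sketch only: the threshold and the curated set are illustrative
# assumptions, not values taken from Blekko or from the curated-list file.
SUSPECT_THRESHOLD = 100_000  # hypothetical cutoff for "suspiciously many pages"


def host_from_key(key):
    """Turn a reversed-domain key like 'uk,co,example' into 'example.co.uk'."""
    return ".".join(reversed(key.split(",")))


def parse_counts(lines):
    """Parse '<page count> <reversed-domain key>' lines into (count, host) pairs."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        count, key = line.split(None, 1)
        rows.append((int(count), host_from_key(key)))
    return rows


def classify(rows, curated, threshold=SUSPECT_THRESHOLD):
    """Label each host as 'ok', 'curated', or 'possible trap'."""
    labels = {}
    for count, host in rows:
        if count < threshold:
            labels[host] = "ok"
        elif host in curated:
            labels[host] = "curated"  # deliberately crawled deeper than usual
        else:
            labels[host] = "possible trap"
    return labels


# Page counts as quoted in this thread.
sample = """
366171 uk,co,schooluniformshop
192701 uk,co,southnorfolkguesthouse
190032 uk,co,hotelanacapri
172984 uk,co,foulsykefarmhouse
"""

# Assumption: the three hotel sites are written out by hand here rather
# than loaded from the curated-list file, whose format is not shown.
curated = {
    "southnorfolkguesthouse.co.uk",
    "hotelanacapri.co.uk",
    "foulsykefarmhouse.co.uk",
}

labels = classify(parse_counts(sample.splitlines()), curated)
for host, label in labels.items():
    print(host, label)
```

A fixed cutoff is crude; it is used here only because the thread talks
in terms of raw page counts per site.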

-- greg