Hi Greg,
On Fri, May 22, 2015 at 1:42 PM, Greg Lindahl <
lin...@pbm.com> wrote:
> On Tue, May 19, 2015 at 03:43:18PM -0400, Tom Morris wrote:
>
>> The URL list comes from Blekko, not Alexa, and I don't think they've
>> disclosed how it's generated, so it's not too surprising that it doesn't
>> match up.
>
> There's not much to disclose -- Blekko, as a search engine, has quite
> different opinions about websites and pages than Alexa's
> toolbar-generated stats. Alexa users visit lots of websites that
> blekko thinks are "bad". SEO that fools Google but not blekko results
> in a lot of sites being in Alexa's top million, but not Blekko's crawl
> frontier. On the flip side, there are probably plenty of sites whose
> SEO fooled Blekko and not Google.
Thanks. That makes sense. I actually think the Alexa (and thus
HTTPArchive) list has more problems than just the collection
methodology, but I'll post the results of my investigation in a
separate thread.
Do you mind expanding a little bit on the interaction between the
blekko processes and the Common Crawl? Some questions which come to
mind:
- is the URL list updated for each crawl?
- does it represent a seed list for the crawlers to establish a new
frontier, or is it used as a fixed list with no new discovery done (this
might be a question for CCers)?
- will blekko continue to provide a URL list now that they've been
acquired by IBM (congratulations BTW!)?
- if the list is updated, how are its contents balanced/biased between
crawling fresh URLs vs re-crawling high ranking URLs?
In general, how do the two halves of the operation fit together since
they're done by different organizations with different goals?
Tom