Hi Greg,
On Fri, May 22, 2015 at 1:42 PM, Greg Lindahl <
lin...@pbm.com> wrote:
> On Tue, May 19, 2015 at 03:43:18PM -0400, Tom Morris wrote:
>
>> The URL list comes from Blekko, not Alexa, and I don't think they've
>> disclosed how it's generated, so it's not too surprising that it doesn't
>> match up.
>
> There's not much to disclose -- Blekko, as a search engine, has quite
> different opinions about websites and pages than Alexa's
> toolbar-generated stats. Alexa users visit lots of websites that
> blekko thinks are "bad". SEO that fools Google but not blekko results
> in a lot of sites being in Alexa's top million, but not Blekko's crawl
> frontier. On the flip side, there are probably plenty of sites whose
> SEO fooled Blekko and not Google.
Thanks. That makes sense. I actually think the Alexa (and thus
HTTPArchive) list has more problems than just the collection
methodology, but I'll post the results of my investigation in a
separate thread.
Do you mind expanding a little bit on the interaction between the
blekko processes and the Common Crawl? Some questions which come to
mind:
- is the URL list updated for each crawl?
- does it represent a seed list for the crawlers to establish a new
frontier, or is it used as a fixed list with no new discovery done (this
might be a question for CCers)?
- will blekko continue to provide a URL list now that they've been
acquired by IBM (congratulations BTW!)?
- if the list is updated, how are its contents balanced/biased between
crawling fresh URLs vs re-crawling high ranking URLs?
In general, how do the two halves of the operation fit together since
they're done by different organizations with different goals?
Tom