Now that Blekko is no more...

Jeremy Wilson

Apr 21, 2015, 5:07:48 PM
to common...@googlegroups.com
Since Blekko is no more and has been taken over by IBM, what will the impact be on Common Crawl, since they seem to have provided the bulk of the index for each crawl?

Stephen Merity

Apr 22, 2015, 1:39:31 AM
to common...@googlegroups.com
Hi Jeremy,

Good question! We are immensely thankful to the team at blekko for the data and expertise they've provided us in the past, particularly to blekko's founder and CTO Greg Lindahl who has provided his insight on many decisions. We wish them all the best as they begin their work with IBM Watson which is certainly one of the most interesting projects in AI and NLP in recent memory!

Common Crawl are indeed prepared for a transition away from the blekko URL dataset. In the coming weeks we'll be announcing our work regarding URL discovery and ranking for producing future Common Crawl URL datasets.

As a sneak peek, rather than performing link discovery as a traditional web crawler might (by simply adding links from the pages the crawler visits), we'll be using knowledge from previous crawls to better identify the most relevant URLs to include in subsequent crawls. The initial implementation of this process involves performing PageRank at both a page and domain level, which is one of the reasons we've had so many fascinating guest posts discussing and evaluating graph computation systems for big graph datasets. This ranking process will continue to evolve over time, allowing the community to contribute and adding a new dataset to the ones Common Crawl already produces.
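
For readers who want a concrete picture of ranking at both levels, here is a minimal sketch in Python. It is an illustration only, not Common Crawl's implementation: the function names `pagerank` and `domain_edges`, the damping factor of 0.85, the fixed iteration count, and the use of the URL hostname as a stand-in for the "domain" are all assumptions made for this example. A real crawl-scale graph would need a distributed graph system rather than in-memory dictionaries.

```python
from urllib.parse import urlsplit


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) edges.

    The damping factor and iteration count are illustrative defaults.
    """
    nodes = sorted({node for edge in edges for node in edge})
    if not nodes:
        return {}
    out_links = {node: set() for node in nodes}
    for src, dst in edges:
        if src != dst:  # ignore self-links
            out_links[src].add(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src, targets in out_links.items():
            if not targets:
                continue
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


def domain_edges(page_edges):
    """Collapse page-level links into host-level links.

    Assumption: the hostname stands in for the domain. A production system
    would more likely group by registered domain using a public suffix list.
    Links between pages on the same host are dropped.
    """
    edges = set()
    for src, dst in page_edges:
        src_host = urlsplit(src).hostname
        dst_host = urlsplit(dst).hostname
        if src_host and dst_host and src_host != dst_host:
            edges.add((src_host, dst_host))
    return sorted(edges)
```

Calling `pagerank(page_edges)` gives page-level scores, and `pagerank(domain_edges(page_edges))` gives host-level scores from the same link list. Running the same routine on the collapsed graph keeps the two rankings comparable, at the cost of ignoring how many individual pages link between two hosts.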

--
Regards,
Stephen Merity
Data Scientist @ Common Crawl

Jeremy Wilson

Apr 22, 2015, 9:33:09 AM
to common...@googlegroups.com
Thanks for the detailed response. Looking forward to seeing what you guys have in the works!

Ken Krugler

Apr 22, 2015, 10:22:21 AM
to common...@googlegroups.com
Hi Stephen,

One quick comment on using PageRank at the domain level…

We'd done this in the past, using a 500M page crawl with about 30B links.

In general the results were reasonable, except that we wound up with about 30K domains that had a PR10 :)

There were also a large number of domains with PR9…after that it settled down.

Comparing domain PR to domain traffic stats from various sources (Quantcast, Alexa, Compete) was useful - we flagged domains that had a PR which was much higher than their relative "traffic rank".

Manual inspection of the results showed these were all link farms trying to game Google's ratings by artificially inflating their computed PR scores.
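
The comparison described above could be sketched roughly as follows. This is a hypothetical illustration, not the code used in that project: the function name `flag_suspect_domains`, the input shapes, and the `min_ratio` threshold of 10 are assumptions, and the traffic ranks would have to come from an external source such as the ones named above.

```python
def flag_suspect_domains(pagerank_scores, traffic_rank, min_ratio=10.0):
    """Flag domains whose PageRank position is far better than their traffic rank.

    pagerank_scores: dict of domain -> PageRank score (higher is better).
    traffic_rank:    dict of domain -> traffic rank (1 = most visited).
    min_ratio:       hypothetical threshold; a domain is flagged when its
                     traffic rank is at least min_ratio times its PageRank
                     position.

    Domains with no traffic data are skipped rather than flagged.
    Returns a list of (domain, pagerank_position, traffic_rank) tuples.
    """
    ordered = sorted(pagerank_scores, key=pagerank_scores.get, reverse=True)
    flagged = []
    for position, domain in enumerate(ordered, start=1):
        rank = traffic_rank.get(domain)
        if rank is None:
            continue
        if rank / position >= min_ratio:
            flagged.append((domain, position, rank))
    return flagged
```

The output is only a candidate list. As the message notes, the flagged domains were still checked by hand before being treated as link farms.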

-- Ken



--------------------------
Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr






Tom Morris

Apr 22, 2015, 10:54:07 AM
to common...@googlegroups.com
On Wed, Apr 22, 2015 at 10:22 AM, Ken Krugler <k...@bixolabs.com> wrote:

Manual inspection of the results showed these were all sites that were link farms, who were trying to game Google's ratings by artificially inflating their computed PR scores.

Content farms were going to be my first question, since I know Blekko took specific measures to combat them.

Tom 