Host-level WebGraph & PageRank datasets from the June 2016 crawl

Sylvain Zimmer

Aug 2, 2016, 6:13:58 PM
to Common Crawl
Hello,

Common Search just published two datasets extracted from the June 2016 crawl: a host-level WebGraph and host-level PageRank.

I'd love to have your feedback on them!

Our goal is to provide a new URL seed list to Common Crawl, based on the same code, before September 15, so this is your chance to influence what will be included in the next crawls :-)

There are definitely lots of interesting things to explore in there, including spam, coverage of the web, potential biases, correlation with other metrics, ... I'll be happy to help anyone who wants to dig in!

Best,