> because the underlying graph is only sparsely connected
It would be interesting to know whether this is still the case for recent crawls.
It might also be worth building the web graph incrementally from multiple
monthly crawl archives; some of the gaps would then disappear.
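For illustration, the incremental merge could look like the sketch below, which unions host-level edge lists from several monthly snapshots. All names and the file-free toy format are assumptions for the example; the actual graph construction is of course more involved.

```python
from collections import Counter

def merge_edge_lists(snapshots):
    """Union the edge sets of several monthly snapshots.

    Each snapshot is an iterable of (source, target) host pairs.
    Returns a Counter mapping each edge to the number of snapshots
    in which it was observed.
    """
    edge_counts = Counter()
    for snapshot in snapshots:
        for edge in set(snapshot):  # dedupe within one snapshot
            edge_counts[edge] += 1
    return edge_counts

# Toy example: a link missed in one month's crawl shows up in the next,
# so the merged graph contains both edges.
may = [("a.example", "b.example")]
june = [("a.example", "b.example"), ("b.example", "c.example")]
merged = merge_edge_lists([may, june])
```

The per-edge counts could also serve as a crude confidence signal (an edge seen in many snapshots is more likely to be stable).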
> The resulting ranking files would not reflect the reality.
The discussion of whether the bow-tie structure of the web graph is partially
an artifact of the crawling strategy is as old as the discovery of this
structure itself.
But of course, any crawling strategy that
- either does not follow certain links, to avoid spam and duplicates,
- or adds URLs from "external" sources (seed donations, sitemaps)
produces a web graph that is markedly different from that of
a breadth-first crawl.
As the operator of the Common Crawl crawler, I would rather
stay optimistic: any rankings from recent data are better than
what we currently have, namely mixed rankings from previous
seed donations, or even no scores at all for a large number of URLs.
We rely on rankings to "steer" the crawler, i.e., to select a representative
sample of URLs for the next crawl. That's why we would be really interested
in updates to the webdatacommons web graph and are also willing to invest
resources.
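Just to make the "steering" idea concrete: rank-weighted selection of the next crawl's URLs might be sketched as below. The weighted sampling scheme (Efraimidis-Spirakis style exponential keys) and all names are illustrative assumptions, not a description of our actual pipeline.

```python
import random

def sample_frontier(url_scores, k, seed=None):
    """Pick k URLs for the next crawl, weighted by ranking score.

    url_scores: dict mapping URL -> non-negative score (e.g. harmonic
    centrality). Weighted sampling without replacement: each URL gets
    the key random()**(1/weight); higher weights yield larger keys on
    average, and the k largest keys win.
    """
    rng = random.Random(seed)

    def key(url):
        w = max(url_scores[url], 1e-9)  # guard against zero scores
        return rng.random() ** (1.0 / w)

    return sorted(url_scores, key=key, reverse=True)[:k]
```

A sample (rather than a plain top-k cut) keeps lower-ranked hosts represented, which matters if the goal is a representative crawl rather than a popularity-only one.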
Best and thanks,
Sebastian