> because the underlying graph is only sparsely connected
It would be interesting to know whether this is still the case for recent crawls.
It might also be worth building the web graph incrementally from multiple
monthly crawl archives; some of the gaps would then disappear.
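For illustration, the incremental merge could look like the sketch below, which unions host-level edge lists from several monthly snapshots. All names and the file-free toy format are assumptions for the example; the actual graph construction is of course more involved.

```python
from collections import Counter

def merge_edge_lists(snapshots):
    """Union the edge sets of several monthly snapshots.

    Each snapshot is an iterable of (source, target) host pairs.
    Returns a Counter mapping each edge to the number of snapshots
    in which it was observed.
    """
    edge_counts = Counter()
    for snapshot in snapshots:
        for edge in set(snapshot):  # dedupe within one snapshot
            edge_counts[edge] += 1
    return edge_counts

# Toy example: a link missed in one month's crawl shows up in the next,
# so the merged graph contains both edges.
may = [("a.example", "b.example")]
june = [("a.example", "b.example"), ("b.example", "c.example")]
merged = merge_edge_lists([may, june])
```

The per-edge counts could also serve as a crude confidence signal (an edge seen in many snapshots is more likely to be stable).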
> The resulting ranking files would not reflect the reality.
The discussion of whether the bow-tie structure of the web graph is partially
an artifact of the crawling strategy is as old as the discovery of this
structure itself.
But of course, any crawling strategy that
- either does not follow certain links, to avoid spam and duplicates,
- or adds URLs from "external" sources (seed donations, sitemaps)
produces a web graph that is markedly different from that of
a breadth-first crawl.
As the operator of the Common Crawl crawler, I would rather
stay optimistic: any rankings from recent data are better than
what we currently have, namely mixed rankings from previous
seed donations, or even no scores at all for a large number of URLs.
We rely on rankings to "steer" the crawler, i.e., to select a representative
sample of URLs for the next crawl. That's why we would be really interested
in updates to the webdatacommons web graph and are also willing to invest
resources.
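Just to make the "steering" idea concrete: rank-weighted selection of the next crawl's URLs might be sketched as below. The weighted sampling scheme (Efraimidis-Spirakis style exponential keys) and all names are illustrative assumptions, not a description of our actual pipeline.

```python
import random

def sample_frontier(url_scores, k, seed=None):
    """Pick k URLs for the next crawl, weighted by ranking score.

    url_scores: dict mapping URL -> non-negative score (e.g. harmonic
    centrality). Weighted sampling without replacement: each URL gets
    the key random()**(1/weight); higher weights yield larger keys on
    average, and the k largest keys win.
    """
    rng = random.Random(seed)

    def key(url):
        w = max(url_scores[url], 1e-9)  # guard against zero scores
        return rng.random() ** (1.0 / w)

    return sorted(url_scores, key=key, reverse=True)[:k]
```

A sample (rather than a plain top-k cut) keeps lower-ranked hosts represented, which matters if the goal is a representative crawl rather than a popularity-only one.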
Best and thanks,
Sebastian