Web graph question


Phil Creston

Dec 14, 2022, 7:12:49 PM
to Common Crawl
Hi Sebastian, 
Quick question on the web graph: since each graph is produced from only a few months of crawl data, and each crawl has a different set of source pages, would it be correct to assume that the most comprehensive graph would be produced from the union of multiple host/domain-level graph dumps?

Are you aware of any research to see what the difference is in ranking/present links/coverage across each?

Thanks!


Sebastian Nagel

Dec 15, 2022, 7:32:24 AM
to common...@googlegroups.com
Hi Phil,

yes, if you combine multiple graphs (or build a graph from more
"monthly" crawls) the graph is expected to include more nodes and
also more edges between nodes.
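
A minimal sketch of such a union, assuming each graph comes as a vertex
list of (id, name) pairs plus an edge list of (from_id, to_id) pairs, and
that ids are only meaningful within a single graph, so the merge has to key
on node names. The function name and the sample hosts are made up for
illustration:

```python
from typing import Dict, Iterable, List, Set, Tuple

Vertex = Tuple[int, str]  # (graph-local id, node name, e.g. a host name)
Edge = Tuple[int, int]    # (from id, to id), both graph-local
Graph = Tuple[List[Vertex], List[Edge]]


def union_graphs(graphs: Iterable[Graph]) -> Tuple[Dict[str, int], Set[Edge]]:
    """Merge graphs by node name, assigning fresh ids in the merged graph."""
    name_to_id: Dict[str, int] = {}
    merged_edges: Set[Edge] = set()
    for vertices, edges in graphs:
        # Map this graph's local ids to ids in the merged graph.
        local_to_merged: Dict[int, int] = {}
        for local_id, name in vertices:
            merged_id = name_to_id.setdefault(name, len(name_to_id))
            local_to_merged[local_id] = merged_id
        # A set drops edges that occur in more than one input graph.
        for src, dst in edges:
            merged_edges.add((local_to_merged[src], local_to_merged[dst]))
    return name_to_id, merged_edges


# Hypothetical sample data: two small graphs sharing one node.
graph_a: Graph = ([(0, "com.example"), (1, "org.example")], [(0, 1)])
graph_b: Graph = ([(0, "org.example"), (1, "net.example")], [(0, 1), (1, 0)])

nodes, edges = union_graphs([graph_a, graph_b])
print(len(nodes), "nodes,", len(edges), "edges")
```

For real dumps the name-to-id dictionary would not fit comfortably in
memory, so a sort-and-merge over the files is probably the better route.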

Because of a bug, a single-month graph was once released:

https://commoncrawl.org/2018/02/webgraphs-nov-dec-2017-jan-2018/#webgraph-2018-jan
You could compare the 1-month graph with the 3-month graph to see the
effects. Just look at the properties and statistics files.
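
If you want more than the summary counts, a rough sketch for comparing the
node coverage of two graphs could look like this. It assumes you have
already extracted the node names of each graph into an iterable; the
function name and the sample hosts are hypothetical:

```python
from typing import Dict, Iterable, Union


def coverage_overlap(
    names_a: Iterable[str], names_b: Iterable[str]
) -> Dict[str, Union[int, float]]:
    """Count shared and exclusive node names of two graphs."""
    a, b = set(names_a), set(names_b)
    shared = a & b
    union = a | b
    return {
        "only_a": len(a - b),
        "only_b": len(b - a),
        "shared": len(shared),
        # Jaccard similarity: shared nodes over all nodes seen in either graph.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


# Hypothetical sample data standing in for a 1-month and a 3-month graph.
one_month = ["com.example", "org.example"]
three_month = ["org.example", "net.example", "io.example"]

stats = coverage_overlap(one_month, three_month)
print(stats)
```

The same function works for edges if you pass pairs of node names instead
of single names.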

> Are you aware of any research to see what the difference is in
> ranking/present links/coverage across each?

There's some work by the authors of the WebGraph framework.
See especially the time-aware graphs:
https://law.di.unimi.it/webdata/uk-union-2006-06-2007-05/
This is a wish from our users:
https://groups.google.com/g/common-crawl/c/6960gQ5c-cE/m/pKlI937zAAAJ
It's a great idea, but there has been no time yet to bring it forward.

Maybe one more pointer, potentially of interest:
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0249993

Best,
Sebastian