Time aware web-graphs

Animesh Baranawal

Jun 1, 2021, 4:28:47 AM
to Common Crawl

WebUK has a time-aware union graph created by merging the individual 12 monthly snapshots (http://law.di.unimi.it/webdata/uk-union-2006-06-2007-05/).

Is there any similar time-aware union graph available from the temporal aggregation of individual host-level (or domain-level) web graphs?

Thanks and regards,
Animesh

Sebastian Nagel

Jun 1, 2021, 5:55:57 AM
to common...@googlegroups.com
Hi Animesh,

interesting idea. Thanks!

There is no aggregated graph available, apart from the aggregations that each cover 3 monthly crawls.
But these do not record in which of the 3 crawls a node/edge was seen.

If you (or anybody else) are interested in creating an aggregated graph: we kept the intermediate data
used to create the 3-month aggregations, i.e. the input of hostlinks_to_graph.py [1,2]:
- about 30 GiB per monthly crawl
- in multiple splits (so there are duplicate edges if splits are combined)
- Parquet files with two columns: <source_host, target_host>
- host names in reverse domain name notation
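
To illustrate the last two points, here is a minimal sketch in plain Python of the reverse domain name notation and of why edges need to be deduplicated when splits are combined. The function names and the sample edges are made up for illustration; the real data is in Parquet files and far too large to be held in a Python set.

```python
def reverse_host(host: str) -> str:
    """Reverse the dot-separated labels: 'www.example.com' <-> 'com.example.www'."""
    return ".".join(reversed(host.split(".")))


def dedup_edges(splits):
    """Combine several splits of (source_host, target_host) pairs, dropping duplicates.

    Hypothetical helper: it only shows the logic, not how to do this at scale.
    """
    seen = set()
    for split in splits:
        seen.update(split)
    return sorted(seen)


# Two made-up splits of one crawl; the first edge occurs in both.
splits = [
    [("com.example.www", "org.example"), ("com.example.www", "net.example")],
    [("com.example.www", "org.example")],
]
edges = dedup_edges(splits)
print(reverse_host("www.example.com"))
print(len(edges), "unique edges")
```

Applying reverse_host twice gives back the original host name, so the same function converts in both directions.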

The data is kept in a private bucket, but we are happy to share it in case you are interested
in building the aggregated host-level graph over 12 months (or even a longer interval).

The task would be to add a bit vector marking in which crawl each node/edge was seen.
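
One possible way to represent this marking, as a rough sketch: keep one integer bitmask per edge, with bit i set if the edge was seen in the i-th crawl of the aggregation. The function names and the sample data are assumptions for illustration. The node marking is derived here from the edges only, which assumes that every node of interest appears in at least one edge.

```python
from collections import defaultdict


def aggregate_edges(crawls):
    """Map each edge to a bitmask; bit i is set if the edge occurs in crawls[i].

    `crawls` is a list of iterables of (source_host, target_host) pairs,
    one per monthly crawl, in chronological order.
    """
    masks = defaultdict(int)
    for i, edges in enumerate(crawls):
        for edge in edges:
            masks[edge] |= 1 << i
    return dict(masks)


def node_masks(edge_masks):
    """Derive per-node bitmasks: a node was seen wherever one of its edges was seen."""
    nodes = defaultdict(int)
    for (src, dst), mask in edge_masks.items():
        nodes[src] |= mask
        nodes[dst] |= mask
    return dict(nodes)


# Three made-up monthly crawls.
crawls = [
    [("com.a", "com.b"), ("com.a", "com.c")],
    [("com.a", "com.c")],
    [("com.a", "com.b")],
]
edge_masks = aggregate_edges(crawls)
for edge, mask in sorted(edge_masks.items()):
    # Bit 0 (the first crawl) is the leftmost character in the printed string.
    print(edge, format(mask, "03b")[::-1])
```

For 12 monthly crawls the mask fits into 12 bits per edge, so a 16-bit integer column would be enough in a columnar format.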

I'd recommend building the domain-level graph from the host-level graph. We use the public
section of the public suffix list [3] to get the registered domain for a host name. The list
changes over time, so it's better to pick a version from [4] that fits the time span of the
aggregation.
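
A simplified sketch of the host-to-domain mapping, working directly on host names in reverse domain name notation. PUBLIC_SUFFIXES is a tiny hypothetical subset standing in for the public section of the list. The sketch handles only plain rules, not the wildcard and exception rules of the real list, so it is not a replacement for a proper public suffix list parser. Dropping links within the same domain is also just an assumption of this example.

```python
# Hypothetical, tiny subset of public suffixes (normal notation), for illustration only.
PUBLIC_SUFFIXES = {"com", "org", "uk", "co.uk"}


def registered_domain(rev_host, suffixes=PUBLIC_SUFFIXES):
    """Return the registered domain of a reversed host name, also in reversed notation.

    Finds the longest matching public suffix and keeps one more label.
    Returns None if no suffix matches or the host is itself a public suffix.
    """
    labels = rev_host.split(".")
    best = 0
    for n in range(1, len(labels) + 1):
        # The first n reversed labels form a candidate suffix, e.g. ['uk', 'co'] -> 'co.uk'.
        if ".".join(reversed(labels[:n])) in suffixes:
            best = n
    if best == 0 or best == len(labels):
        return None
    return ".".join(labels[: best + 1])


def domain_edges(host_edges):
    """Project host-level edges to unique domain-level edges, skipping unmapped hosts."""
    result = set()
    for src, dst in host_edges:
        s, d = registered_domain(src), registered_domain(dst)
        if s is not None and d is not None and s != d:
            result.add((s, d))
    return sorted(result)


host_edges = [
    ("uk.co.example.www", "com.example.blog"),
    ("uk.co.example.shop", "com.example.www"),
    ("uk.co.example.www", "uk.co.example.shop"),
]
print(domain_edges(host_edges))
```

Because the host names are reversed, the public suffix is a prefix of the label sequence, and all hosts of one registered domain share a common prefix in sorted order.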

Recently, the webgraph libraries have been integrated into JGraphT [5], but unfortunately not
yet into the JGraphT Python bindings [6]. This might simplify the construction of webgraphs
in the future.


Best,
Sebastian

[1] https://github.com/commoncrawl/cc-pyspark/blob/master/hostlinks_to_graph.py
[2] https://github.com/commoncrawl/cc-webgraph/blob/777dc5ca9406f488e801a95f5300417d0d6f8a74/src/script/hostgraph/build_hostgraph.sh#L237
[3] https://publicsuffix.org/
[4] https://github.com/publicsuffix/list/
[5] https://jgrapht.org/
[6] https://pypi.org/project/jgrapht/