Hi Animesh,
interesting idea. Thanks!
There is no aggregated graph available, except for the aggregations covering 3 monthly crawls.
But these do not mark in which of the 3 crawls a node/edge was seen.
If you (or anybody else) are interested in creating an aggregated graph - we kept the intermediate data
used to create the 3-monthly aggregations, i.e. the input of hostlinks_to_graph.py [1,2]:
- about 30 GiB per monthly crawl
- in multiple splits (so there are duplicate edges if splits are combined)
- Parquet files with two columns: <source_host,target_host>
- host names in reverse domain name notation
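To illustrate the reverse domain name notation mentioned above, here is a minimal sketch. The function name `reverse_host` is only an illustration and not taken from the Common Crawl code base.

```python
def reverse_host(host: str) -> str:
    """Reverse the order of the dot-separated labels of a host name.

    'www.example.com' becomes 'com.example.www'. Applying the function
    to an already reversed (lower-case) name restores the original.
    """
    return ".".join(reversed(host.strip().lower().split(".")))


if __name__ == "__main__":
    print(reverse_host("www.example.com"))
```

A practical benefit of this notation is that sorting the names groups all hosts of one domain next to each other.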
The data is kept in a private bucket, but we are happy to share it in case you are interested
in building the aggregated host-level graph over 12 months (or an even longer interval).
The task would be to add a bit vector marking in which crawl every node/edge was seen.
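A rough sketch of how such a bit vector could be built, in plain Python and in memory only. It assumes that each crawl is given as an iterable of (source_host, target_host) pairs, that bit i stands for the i-th crawl in the input order, and that a node counts as seen in a crawl if it is an endpoint of at least one edge of that crawl. The name `aggregate_crawls` is hypothetical, and for the real data volume this would need to be done in Spark or a similar framework.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[str, str]  # (source_host, target_host)


def aggregate_crawls(
    crawls: List[Iterable[Edge]],
) -> Tuple[Dict[str, int], Dict[Edge, int]]:
    """Merge the edge lists of several crawls.

    Returns two dicts mapping every node and every edge to an integer
    used as a bit vector: bit i is set if the node/edge was seen in
    crawl i. Duplicate edges (e.g. from combined splits) are harmless
    because setting a bit twice does not change the result.
    """
    node_bits: Dict[str, int] = defaultdict(int)
    edge_bits: Dict[Edge, int] = defaultdict(int)
    for i, edges in enumerate(crawls):
        bit = 1 << i
        for src, dst in edges:
            edge_bits[(src, dst)] |= bit
            node_bits[src] |= bit
            node_bits[dst] |= bit
    return dict(node_bits), dict(edge_bits)


if __name__ == "__main__":
    # Made-up example data, three "crawls" with one duplicated edge.
    crawls = [
        [("com.a", "com.b"), ("com.a", "com.b")],
        [("com.a", "com.c")],
        [("com.a", "com.b")],
    ]
    nodes, edges = aggregate_crawls(crawls)
    for edge, bits in sorted(edges.items()):
        print(edge, format(bits, "03b"))
```

With 12 monthly crawls the bit vector fits into a 16-bit integer, so the extra storage per node/edge stays small.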
I'd recommend building the domain-level graph from the host-level graph - we use the public
section of the public suffix list [3] to get the registered domain for a host name. The list
changes over time, so it's better to pick a version from [4] which fits the time span of the
aggregation.
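As a simplified illustration of the host-to-domain mapping: the registered domain is the longest matching public suffix plus one more label. The sketch below uses a tiny hard-coded suffix set (`SAMPLE_SUFFIXES`, a made-up excerpt) instead of the real list, and it ignores the wildcard and exception rules as well as the default rule for unlisted top-level domains. It is not the code used for the Common Crawl graphs. In practice the suffixes would be loaded from the public section of a pinned version of the list.

```python
from typing import Optional, Set

# Hypothetical excerpt for illustration only, not the real list.
SAMPLE_SUFFIXES: Set[str] = {"com", "org", "uk", "co.uk"}


def registered_domain(
    host: str, suffixes: Set[str] = SAMPLE_SUFFIXES
) -> Optional[str]:
    """Return the registered domain of a host name (normal, not reversed,
    notation), or None if the host is itself a public suffix or no
    suffix in the given set matches."""
    labels = host.lower().rstrip(".").split(".")
    # Candidates are tried from longest to shortest, so the first hit
    # is the longest matching suffix.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            if i == 0:
                return None  # the host is a public suffix itself
            return ".".join(labels[i - 1:])
    return None


if __name__ == "__main__":
    for name in ("www.example.co.uk", "blog.example.com", "co.uk"):
        print(name, "->", registered_domain(name))
```

Mapping both endpoints of every host-level edge this way and dropping duplicates would give the domain-level edges.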
Recently, the webgraph libraries have been integrated into JGraphT [5], but unfortunately not
yet into the JGraphT Python bindings [6]. This might simplify webgraph construction
in the future.
Best,
Sebastian
[1] https://github.com/commoncrawl/cc-pyspark/blob/master/hostlinks_to_graph.py
[2] https://github.com/commoncrawl/cc-webgraph/blob/777dc5ca9406f488e801a95f5300417d0d6f8a74/src/script/hostgraph/build_hostgraph.sh#L237
[3] https://publicsuffix.org/
[4] https://github.com/publicsuffix/list/
[5] https://jgrapht.org/
[6] https://pypi.org/project/jgrapht/