Interesting idea. Thanks!
There is no aggregated graph available, except for the aggregations covering 3 monthly crawls.
But these do not mark in which of the 3 crawls a node/edge was seen.
If you (or anybody else) are interested in creating an aggregated graph: we kept the intermediate data
used to create the 3-monthly aggregations, i.e. the input of hostlinks_to_graph.py [1,2]:
- about 30 GiB per monthly crawl
- in multiple splits (so there are duplicate edges if splits are combined)
- Parquet files with two columns
- host names in reverse domain name notation
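Since the splits may contain duplicate edges, combining them requires a deduplication step. A minimal sketch in Python, assuming the edges have already been read from the Parquet files (e.g. via pyarrow) into lists of (source, target) host pairs - the host names below are illustrative:

```python
# Sketch: union the edge lists of multiple splits, dropping duplicate edges.
# In practice each split would be a Parquet file with two host-name columns;
# plain lists of (source, target) pairs stand in for them here.

def merge_splits(splits):
    """Return the sorted, deduplicated union of all edge lists."""
    edges = set()
    for split in splits:
        edges.update(split)
    return sorted(edges)

split_a = [("org.example", "com.example"), ("org.example", "net.example")]
split_b = [("org.example", "com.example")]  # duplicate edge across splits
merged = merge_splits([split_a, split_b])
print(merged)  # two unique edges remain
```

For the real data volumes (~30 GiB per crawl) this would of course be done with a distributed or out-of-core tool rather than an in-memory set, but the logic is the same.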
The data is kept in a private bucket, but we are happy to share it in case you are interested
in building the aggregated host-level graph over 12 months (or an even longer interval).
The task would then be to add the bit vector marking in which crawl every node/edge was seen.
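The bit-vector marking could be sketched as follows - a hypothetical, in-memory illustration (function name and data layout are mine, not part of the actual pipeline): bit i of an edge's mask is set if the edge appears in crawl i.

```python
# Sketch: per-edge bit vector over N monthly crawls.
# Bit i is set if the edge was observed in crawl i (time order).

def crawl_bitvectors(crawls):
    """crawls: list of edge sets, one per monthly crawl, oldest first.
    Returns {edge: bitmask} marking in which crawls each edge was seen."""
    seen = {}
    for i, edges in enumerate(crawls):
        for edge in edges:
            seen[edge] = seen.get(edge, 0) | (1 << i)
    return seen

crawl_1 = {("org.example", "com.example")}
crawl_2 = {("org.example", "com.example"), ("org.example", "net.example")}
marks = crawl_bitvectors([crawl_1, crawl_2])
print(format(marks[("org.example", "com.example")], "b"))  # → 11 (seen in both)
```

For 12 monthly crawls the mask fits in 12 bits, so it can be stored as a small integer column alongside each node/edge.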
I'd recommend building the domain-level graph from the host-level graph - we use the public
section of the public suffix list to get the registered domain for a host name. The list
changes over time, so it's better to pick a version which fits the time span of the crawls.
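The host-to-domain mapping could look roughly like this. A real implementation would load the actual public suffix list (e.g. via a PSL library); the tiny suffix set below is only a stand-in, and note that both host names and suffixes are kept in reverse domain name notation to match the data:

```python
# Sketch: map a host name in reverse domain name notation
# ("org.example.www" for www.example.org) to its registered domain.
# SUFFIXES is a toy stand-in for the public suffix list, also reversed.

def registered_domain(rev_host, suffixes):
    labels = rev_host.split(".")
    # find the longest leading label sequence that is a public suffix
    best = 0
    for i in range(1, len(labels)):
        if ".".join(labels[:i]) in suffixes:
            best = i
    if best == 0 or best >= len(labels):
        return None  # no matching suffix, or the host is itself a suffix
    # registered domain = public suffix plus one more label
    return ".".join(labels[:best + 1])

SUFFIXES = {"com", "org", "uk.co"}  # reversed toy subset of the PSL
print(registered_domain("org.example.www", SUFFIXES))  # → org.example
```

Working on the reversed names has the nice property that the registered domain is a prefix of the host name, so hosts sharing a domain sort next to each other.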
Recently, the webgraph libraries have been integrated into JGraphT, but unfortunately not
yet into the JGraphT Python bindings. This might simplify the process of webgraph
construction in the future.
> You received this message because you are subscribed to the Google Groups "Common Crawl" group.