Hi all,
the Web Data Commons team is happy to announce the publication of a new large hyperlink graph.
The graph has been extracted from the Common Crawl 2012 web corpus and covers 3.5 billion web pages and 128 billion hyperlinks between these pages. To the best of our knowledge, the graph is the largest hyperlink graph that is available to the public.
The graph can be downloaded in various formats from http://webdatacommons.org/hyperlinkgraph
We provide initial statistics about the topology of the graph at http://webdatacommons.org/hyperlinkgraph/topology.html
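As a rough illustration of the scale, the page and hyperlink counts quoted above imply an average number of links per page. The sketch below is only back-of-the-envelope arithmetic on those two figures. The function and constant names are made up for this example and are not part of any WDC tooling.

```python
# Figures quoted in the announcement for the Common Crawl 2012 graph.
NUM_PAGES = 3_500_000_000      # 3.5 billion web pages (nodes)
NUM_LINKS = 128_000_000_000    # 128 billion hyperlinks (arcs)


def average_out_degree(num_links: int, num_nodes: int) -> float:
    """Mean number of outgoing links per node in a directed graph."""
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    return num_links / num_nodes


# Hypothetical helper name; plain division of the two quoted totals.
avg_degree = average_out_degree(NUM_LINKS, NUM_PAGES)
print(f"average links per page: {avg_degree:.1f}")
```

This only gives the mean. The actual degree distribution is what the topology page linked above describes.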
We hope that the graph will be useful for researchers who develop
We want to thank the Common Crawl project for providing their great web crawl and thus enabling the creation of the WDC Hyperlink Graph.
The creation of the WDC Hyperlink Graph was supported by the EU research project PlanetData and by Amazon Web Services. We thank our sponsors a lot.
Best Regards,
Chris, Oliver & Robert
But for the WebGraph files, the total size is 56GB: 52GB (network.graph) + 4GB (network.offsets) + 1.5MB (network.properties). The sizes are significantly different. Is that normal?

2. Hyperlink Graph 2014: The Index/Arc files total 20GB. The WebGraph files total 22.1GB: 20GB (webgraph.graph) and 2.1GB (webgraph.offsets).

This raises two questions:
(1) In the 2014 data, the two formats are almost the same size. Why do they differ so much in the 2012 data?
(2) Why is the 2014 data so much smaller than the 2012 data, when the 2014 graph has about half as many hyperlinks as the 2012 graph?