ANN: Large Hyperlink Graph from April 2014 Web Crawl available for download

Robert Meusel

Aug 13, 2014, 9:37:33 AM
to web-data...@googlegroups.com
Hi all,

the Web Data Commons team is happy to announce the publication of the second large hyperlink graph.

The graph has been extracted from the April 2014 Common Crawl web corpus and covers 1.7 billion web pages and 64 billion hyperlinks between these pages. 

The graph can be downloaded in various formats from: http://webdatacommons.org/hyperlinkgraph/2014-04/download.html

We provide initial statistics about the topology of the graph at: http://webdatacommons.org/hyperlinkgraph/2014-04/topology.html

We hope that the graph will be useful for:
  • Researchers who develop search algorithms that rank results based on the hyperlinks between pages.
  • Researchers who develop spam detection methods that identify networks of web pages published in order to trick search engines.
  • Developers of graph analysis algorithms, who can use the hyperlink graph to test the scalability and performance of their tools.
  • Web Science researchers who want to analyze the linking patterns within specific topical domains in order to identify the social mechanisms that govern these domains.

We want to thank the Common Crawl project for providing their great web crawl and thus enabling the creation of the WDC Hyperlink Graph.

The creation of the WDC Hyperlink Graph was supported by the EU research project PlanetData and by Amazon Web Services. We thank our sponsors very much.

Best Regards,

Chris, Oliver, Sebastiano & Robert