ANN: Web Table Corpus containing 233 million tables released

Robert Meusel
Nov 19, 2015, 4:54:13 AM
to Common Crawl
The DWS group is happy to announce the release of the WebDataCommons Web Table Corpus 2015.

The corpus has been extracted from the July 2015 version of the Common Crawl, which contains 1.78 billion HTML pages originating from 15 million pay-level domains.

The WDC Web Tables Corpus 2015 consists of 233 million HTML tables, which are classified into three categories: relational, entity, and matrix. In addition to the actual tables, the corpus also contains table metadata such as table orientation, header rows, and key columns, as well as table context information such as the text on the HTML page before and after the table, the page title, and timestamp information from the page.
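
To give a rough idea of what one table plus its metadata and context could look like once loaded, here is a minimal Python sketch. Please note that all class and field names below (WebTable, header_row_index, key_column_index, text_before, and so on) are hypothetical and chosen only for illustration, and the sample values are made up. This is not the corpus's actual file format or schema, so check the corpus documentation for the real layout.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class TableType(Enum):
    # The three categories named in the announcement.
    RELATIONAL = "relational"
    ENTITY = "entity"
    MATRIX = "matrix"


@dataclass
class WebTable:
    """Hypothetical in-memory record for one extracted table."""
    rows: List[List[str]]                   # cell contents, row by row
    table_type: TableType
    orientation: str                        # free-form label (assumption)
    header_row_index: Optional[int] = None  # None if no header row was found
    key_column_index: Optional[int] = None  # None if no key column was found
    page_title: str = ""
    text_before: str = ""                   # page text preceding the table
    text_after: str = ""                    # page text following the table
    timestamp: Optional[str] = None         # timestamp information, if any

    def header(self) -> List[str]:
        """Return the header cells, or an empty list if there is no header."""
        if self.header_row_index is None:
            return []
        return self.rows[self.header_row_index]

    def key_values(self) -> List[str]:
        """Return the key-column cells of all non-header rows."""
        if self.key_column_index is None:
            return []
        return [
            row[self.key_column_index]
            for i, row in enumerate(self.rows)
            if i != self.header_row_index
        ]


# Made-up sample data, only to show how the record would be used.
table = WebTable(
    rows=[["country", "capital"], ["Germany", "Berlin"], ["France", "Paris"]],
    table_type=TableType.RELATIONAL,
    orientation="horizontal",
    header_row_index=0,
    key_column_index=0,
    page_title="Capitals of Europe",
)
print(table.header())
print(table.key_values())
```

Keeping the header row and key column as optional indices rather than copies of the cells means a table without a detected header or key can still be represented.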

Detailed statistics about the corpus, information about its application domains, as well as instructions on how to download the corpus are found at


We want to thank the Common Crawl Foundation for gathering their great web corpora and thus enabling the creation of the WDC Web Tables Corpus. We also want to thank Amazon Web Services for supporting the Web Data Commons project by allowing us to use their cloud infrastructure. Many thanks also to the Dresden Web Table Corpus team for extending the WDC framework, which we further extended and used for this extraction.

Enjoy the new corpus!

Dominique Ritze, Oliver Lehmberg, Robert Meusel, Sanikumar Zope, and Christian Bizer