The corpus has been extracted from the
July 2015 version of the Common Crawl, which contains 1.78 billion HTML pages originating from 15 million pay-level domains.
The WDC Web Tables Corpus 2015 consists of 233 million HTML tables, which are classified into three categories: relational, entity, and matrix. In addition to the tables themselves, the corpus contains table metadata such as table orientation, header rows, and key columns, as well as table context information such as the text on the HTML page before and after the table, the page title, and timestamp information from the page.
Detailed statistics about the corpus, information about its application domains, as well as instructions on how to download the corpus are found at
We want to thank the
Common Crawl Foundation for gathering their great web corpora and thus enabling the creation of the WDC Web Tables Corpus. We also want to thank
Amazon Web Services for supporting the Web Data Commons project by allowing us to use their cloud infrastructure. Many thanks also go to the
Dresden Web Table Corpus team for extending the
WDC framework, which we further extended and used for this extraction.
Enjoy the new corpus!
Dominique Ritze, Oliver Lehmberg, Robert Meusel, Sanikumar Zope, and Christian Bizer