ANN: WebDataCommons releases 17.2 billion quads of RDFa, Microdata, and Microformat data originating from 1.7 million pay-level-domains

Robert Meusel

Apr 5, 2014, 7:23:01 AM
to web-data...@googlegroups.com
Hi all,

we are happy to announce a new release of the WebDataCommons RDFa,
Microdata, and Microformat data sets.

The data sets have been extracted from the November 2013 version of the
Common Crawl covering 2.24 billion HTML pages which originate from 12.8
million websites (pay-level-domains).

Altogether we discovered structured data within 585 million HTML pages out
of the 2.24 billion pages contained in the crawl (26%). These pages
originate from 1.7 million different pay-level-domains out of the 12.8
million pay-level-domains covered by the crawl (13%).
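As a quick sanity check, the two percentages above can be recomputed from the rounded figures quoted in this message. This is only a minimal sketch; the constant and function names are made up for illustration and are not part of any WebDataCommons tooling.

```python
# Rounded figures as quoted in the announcement (November 2013 Common Crawl).
PAGES_TOTAL = 2.24e9       # HTML pages in the crawl
PAGES_WITH_DATA = 585e6    # pages containing structured data
PLDS_TOTAL = 12.8e6        # pay-level-domains in the crawl
PLDS_WITH_DATA = 1.7e6     # pay-level-domains with structured data


def share_percent(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


page_share = share_percent(PAGES_WITH_DATA, PAGES_TOTAL)
pld_share = share_percent(PLDS_WITH_DATA, PLDS_TOTAL)

print(f"pages with structured data: {page_share:.1f}%")
print(f"PLDs with structured data:  {pld_share:.1f}%")
```

Both values round to the 26% and 13% stated above.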

Approximately 471 thousand of these websites use RDFa, while 463 thousand
websites use Microdata. Microformats are used on 1 million websites within
the crawl.

Data Set Statistics:

Basic statistics about the November 2013 RDFa, Microdata, and Microformat
data sets as well as the vocabularies that are used together with each
markup format are found at:

http://webdatacommons.org/structureddata/2013-11/stats/stats.html

Comparing the statistics to the statistics about the August 2012 release of
the data sets

http://webdatacommons.org/structureddata/2012-08/stats/stats.html

we see that the adoption of the Microdata markup syntax has strongly
increased (463 thousand websites in 2013 compared to 140 thousand in 2012,
even though the 2013 version of the Common Crawl covers significantly
fewer websites than the 2012 version).
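To put a number on "strongly increased", the growth factor can be worked out from the two website counts quoted above. A minimal sketch, using only those rounded counts (the variable names are illustrative):

```python
# Websites (pay-level-domains) using Microdata, as quoted in the announcement.
MICRODATA_PLDS_2012 = 140_000   # August 2012 release
MICRODATA_PLDS_2013 = 463_000   # November 2013 release

growth_factor = MICRODATA_PLDS_2013 / MICRODATA_PLDS_2012
print(f"Microdata adoption grew by a factor of about {growth_factor:.1f}")
```

This compares absolute counts only. Since the two crawls cover different numbers of websites, the change in relative adoption would be larger still.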

Looking at the adoption of different vocabularies, we see that webmasters
mostly follow the recommendation by Google, Microsoft, Yahoo, and Yandex to
use the schema.org vocabularies as well as their predecessors in the context
of Microdata. In the context of RDFa, the most widely used vocabulary is the
Open Graph Protocol recommended by Facebook.

Looking at the most frequently used classes, we see that besides
navigational, blog, and CMS-related meta-information, many websites mark up
e-commerce-related data (products, offers, and reviews) as well as contact
information (LocalBusiness, Organization, PostalAddress).


Download:

The overall size of the November 2013 RDFa, Microdata, and Microformat data
sets is 17.2 billion RDF quads. For download, we split the data into 3,398
files with a total size of 332 GB.

http://webdatacommons.org/structureddata/2013-11/stats/how_to_get_the_data.html
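For planning a download, the totals above translate into rough per-file averages. This is a back-of-the-envelope sketch only: it assumes 1 GB = 1000 MB and that the files are of roughly equal size, neither of which is stated in this message.

```python
# Totals as quoted in the announcement.
TOTAL_QUADS = 17.2e9     # RDF quads across all three data sets
NUM_FILES = 3398         # number of download files
TOTAL_SIZE_GB = 332      # total download size

# Assumption: 1 GB = 1000 MB, and files are of roughly equal size.
quads_per_file = TOTAL_QUADS / NUM_FILES
mb_per_file = TOTAL_SIZE_GB * 1000 / NUM_FILES

print(f"average quads per file: {quads_per_file / 1e6:.1f} million")
print(f"average file size:      {mb_per_file:.0f} MB")
```

So each file holds on the order of five million quads and is a little under 100 MB on average; individual files may differ.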


Many thanks to:

+ the Common Crawl project for providing their great web crawl and thus
enabling the Web Data Commons project.
+ the Any23 project for providing their great library of structured data
parsers.
+ the LOD2 and PlanetData research projects as well as Amazon Web Services
for supporting WebDataCommons.


Have fun with the new dataset.

Cheers, Christian Bizer, Petar Petrovski, and Robert Meusel

Aaron Bradley

Apr 9, 2014, 12:14:36 PM
to web-data...@googlegroups.com
Great release - thanks! A quick note about the "Detailed Statistics for the November 2013 corpus" page:

The title tag is misleading, doubtless a copy/paste oversight:
<title>Web Data Commons Extraction Report - August 2012 Corpus</title>