ANN: WebDataCommons releases 24.4 billion quads of RDFa, Microdata, Embedded JSON-LD and Microformat data originating from 2.7 million pay-level domains


Robert Meusel

Apr 25, 2016, 9:19:33 AM
to Common Crawl

Hi All,


we are happy to announce a new release of the Web Data Commons RDFa, 
Microdata, Embedded JSON-LD and Microformat data corpus. 

The data corpus has been extracted from the November 2015 version of the 
Common Crawl covering 1.77 billion HTML pages which originate from 14.4 
million websites (pay-level domains). 

Altogether we discovered structured data within 541 million HTML pages out 
of the 1.77 billion pages contained in the crawl (30%). These pages 
originate from 2.7 million different pay-level-domains out of the 14.4 
million pay-level domains covered by the crawl (19%). 
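
The two shares above follow directly from the counts quoted in this mail. A minimal sketch that recomputes them; the variable names are illustrative, and the inputs are the rounded figures from the announcement, so the results are only approximate:

```python
# Rounded figures as quoted in the announcement (November 2015 crawl).
pages_total = 1.77e9       # HTML pages in the crawl
pages_with_data = 541e6    # pages containing structured data
plds_total = 14.4e6        # pay-level domains in the crawl
plds_with_data = 2.7e6     # pay-level domains with structured data


def share(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


page_share = share(pages_with_data, pages_total)
pld_share = share(plds_with_data, plds_total)

print(f"pages with structured data: {page_share:.1f}%")
print(f"pay-level domains with structured data: {pld_share:.1f}%")
```

Both values should land close to the 30% and 19% given above.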

Approximately 521 thousand of these websites use RDFa, while 1.1 million 
websites use Microdata. Microformats are also used by over 1 million 
websites within the crawl. For the first time, we have also extracted 
embedded JSON-LD, which we can report to be used by more 
than 596 thousand websites.



Background: 

More and more websites embed structured data describing for instance 
products, people, organizations, places, events, reviews, and cooking 
recipes into their HTML pages using markup formats such as RDFa, Microdata 
and Microformats. 

The WebDataCommons project extracts all Microformat, Microdata and RDFa 
data, and since 2015 also embedded JSON-LD data from the Common Crawl 
web corpus, the largest and most up-to-date web corpus that is 
available to the public, and provides the extracted data for download. 
In addition, we publish statistics about the adoption of the different 
markup formats as well as the vocabularies that are used together 
with each format. 

Besides the data extracted from the named markup syntaxes, the 
WebDataCommons project also provides one of the largest publicly 
accessible corpora of WebTables extracted from web crawls, as well 
as a collection of hypernyms extracted from billions of web pages, for download. 


General information about the WebDataCommons project is found at 

http://webdatacommons.org/ 


Data Set Statistics: 

Basic statistics about the November 2015 RDFa, Microdata, Embedded JSON-LD 
and Microformat data sets, as well as the vocabularies that are used together 
with each markup format, are found at: 

http://webdatacommons.org/structureddata/2015-11/stats/stats.html


Comparing the statistics to the statistics about the December 2014 
release of the data sets


http://webdatacommons.org/structureddata/2014-12/stats/stats.html


we see that the adoption of the Microdata markup syntax has again 
increased (1.1 million websites in 2015 compared to 819 thousand in 
2014, where both crawls cover a comparable number of websites). 
The deployment of RDFa and Microformats, in contrast, is more or less stable.

As already observed in the previous year, the schema.org vocabulary, 
recommended by Google, Microsoft, Yahoo!, and Yandex, is the one most 
frequently used by webmasters in the context of Microdata. 
We observe a decreasing deployment of its predecessor, the Data Vocabulary (data-vocabulary.org).  
In the context of RDFa, we still find the Open Graph Protocol 
recommended by Facebook to be the most widely used vocabulary.

Topic-wise, the trends identified in the former extractions continue. 
We see that, besides navigational, blog, and CMS-related 
meta-information, many websites annotate e-commerce-related data 
(Products, Offers, and Reviews) as well as contact information 
(LocalBusiness, Organization, PostalAddress).

For the first time, we have also extracted information marked up 
using embedded JSON-LD. Over 99% of all webmasters using 
this syntax use it to mark up search boxes on their 
webpages (http://schema.org/SearchAction). Only a small fraction of the 
websites also use embedded JSON-LD to annotate other 
information, e.g. about organizations (92 thousand websites) 
or persons (18 thousand websites).
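
To illustrate what such search-box markup looks like, here is a minimal sketch that pulls embedded JSON-LD out of an HTML page and checks for a schema.org SearchAction. It uses only the Python standard library and is not the Any23-based extraction used for the release; the sample page, the class name JsonLdExtractor and the helper has_search_action are made up for this example.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed content of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._chunks)))
            except json.JSONDecodeError:
                pass  # skip blocks that are not valid JSON


def has_search_action(doc):
    """True if a JSON-LD object declares a SearchAction as potentialAction."""
    actions = doc.get("potentialAction", []) if isinstance(doc, dict) else []
    if isinstance(actions, dict):
        actions = [actions]
    return any(isinstance(a, dict) and a.get("@type") == "SearchAction"
               for a in actions)


# Hypothetical page carrying typical search-box markup.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "WebSite",
 "url": "http://www.example.com/",
 "potentialAction": {"@type": "SearchAction",
   "target": "http://www.example.com/search?q={search_term_string}",
   "query-input": "required name=search_term_string"}}
</script>
</head><body></body></html>
"""

extractor = JsonLdExtractor()
extractor.feed(SAMPLE_HTML)
search_box_found = any(has_search_action(d) for d in extractor.documents)
print(len(extractor.documents), search_box_found)
```

The sketch only looks at top-level objects; JSON-LD that wraps its content in a list or in "@graph" would need extra handling.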

 


Download: 

The overall size of the November 2015 RDFa, Microdata, Embedded 
JSON-LD and Microformat data sets is 24.4 billion RDF quads. 
For download, we split the data into 3,961 files with a total size of 404 GB. 


http://webdatacommons.org/structureddata/2015-11/stats/how_to_get_the_data.html
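
For a first look at a downloaded file, something along the following lines may help. It is a rough sketch under two assumptions that should be checked against the download page: that the files are gzip-compressed N-Quads, and that the fourth element of each quad is the URL of the page the data was extracted from. The regular expression is a simplification, not a full N-Quads parser, and the grouping is by host name, which is not the same as a pay-level domain (that would need the Public Suffix List).

```python
import gzip
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumed line shape: subject predicate object <graph> .
# The object may be a literal containing spaces, so it is matched greedily.
QUAD_RE = re.compile(r'^(\S+)\s+(\S+)\s+(.*\S)\s+<([^>]*)>\s*\.\s*$')


def parse_quad(line):
    """Return (subject, predicate, object, graph IRI) or None if not a quad."""
    match = QUAD_RE.match(line)
    return match.groups() if match else None


def count_quads_per_host(lines):
    """Count quads per host name of the graph IRI (the source page URL)."""
    counts = Counter()
    for line in lines:
        quad = parse_quad(line)
        if quad is None:
            continue  # skip blank or malformed lines
        counts[urlsplit(quad[3]).hostname or ""] += 1
    return counts


def read_lines(path):
    """Stream lines from a gzip-compressed text file."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        yield from handle


# Made-up sample quads; a real run would use something like
# count_quads_per_host(read_lines("some_file.nq.gz"))  # hypothetical file name
SAMPLE = [
    '_:n1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
    '<http://schema.org/Product> <http://shop.example.com/item/1> .',
    '_:n1 <http://schema.org/name> "Blue \\"retro\\" lamp"@en '
    '<http://shop.example.com/item/1> .',
    '_:n2 <http://schema.org/name> "Example Org" '
    '<http://www.example.org/about> .',
]

counts = count_quads_per_host(SAMPLE)
for host, n in counts.most_common():
    print(host, n)
```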

In addition, we have created separate files for over 50 different schema.org classes, 
each including all quads from pages that deploy the specific class at least once. 


http://webdatacommons.org/structureddata/2015-11/stats/schema_org_subsets.html
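
The selection rule for these subsets (keep every quad of a page as soon as the page uses the class once) can be sketched as below. This only illustrates the rule on a handful of made-up quads; it is not the code used to build the published files, it assumes N-Quads lines whose fourth element is the page URL, and the function name class_subset is hypothetical.

```python
import re

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
# Simplified quad pattern: subject predicate object <graph> .
QUAD_RE = re.compile(r'^(\S+)\s+(\S+)\s+(.*\S)\s+(<[^>]*>)\s*\.\s*$')


def class_subset(lines, class_iri):
    """Return all quads from pages that use class_iri at least once."""
    target = f"<{class_iri}>"
    # Holds everything in memory: fine for a sketch, not for the full corpus.
    parsed = [(line, QUAD_RE.match(line)) for line in lines]
    # First pass: pages (graph IRIs) with at least one rdf:type quad for the class.
    pages = {
        m.group(4)
        for _, m in parsed
        if m and m.group(2) == RDF_TYPE and m.group(3) == target
    }
    # Second pass: keep every quad that belongs to one of those pages.
    return [line for line, m in parsed if m and m.group(4) in pages]


SAMPLE_QUADS = [
    f'_:a {RDF_TYPE} <http://schema.org/Product> <http://shop.example.com/p1> .',
    '_:a <http://schema.org/name> "Lamp" <http://shop.example.com/p1> .',
    f'_:b {RDF_TYPE} <http://schema.org/Person> <http://blog.example.org/me> .',
]

product_subset = class_subset(SAMPLE_QUADS, "http://schema.org/Product")
for quad in product_subset:
    print(quad)
```

Note that the name quad is kept even though it does not mention the class itself, because it comes from a page that does.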



Lots of thanks to: 

+ the Common Crawl project for providing their great web crawl and thus 
enabling the WebDataCommons project. 
+ the Any23 project for providing their great library of structured data 
parsers. 
+ the Amazon Web Services in Education Grant for supporting WebDataCommons. 


Have fun with the new data set. 

Cheers, 
Robert and Chris
