ANN: WebDataCommons releases 24.4 billion quads of RDFa, Microdata, Embedded JSON-LD and Microformat data originating from 2.7 million pay-level domains


Robert Meusel

Apr 25, 2016, 9:19:03 AM
to Web Data Commons

Hi All,


we are happy to announce a new release of the Web Data Commons RDFa, 
Microdata, Embedded JSON-LD and Microformat data corpus. 

The data corpus has been extracted from the November 2015 version of the 
Common Crawl covering 1.77 billion HTML pages which originate from 14.4 
million websites (pay-level domains). 

Altogether we discovered structured data within 541 million HTML pages out 
of the 1.77 billion pages contained in the crawl (30%). These pages 
originate from 2.7 million different pay-level-domains out of the 14.4 
million pay-level domains covered by the crawl (19%). 

Approximately 521 thousand of these websites use RDFa, while 1.1 million 
websites use Microdata. Microformats are also used by over 1 million
websites within the crawl. For the first time, we have also extracted
embedded JSON-LD, which we can report to be used by more
than 596 thousand websites.



Background: 

More and more websites embed structured data describing for instance 
products, people, organizations, places, events, reviews, and cooking 
recipes into their HTML pages using markup formats such as RDFa, Microdata 
and Microformats. 

The WebDataCommons project extracts all Microformat, Microdata and RDFa 
data, and since 2015 also embedded JSON-LD data, from the Common Crawl
web corpus, the largest and most up-to-date web corpus that is
available to the public, and provides the extracted data for download.
In addition, we publish statistics about the adoption of the different
markup formats as well as the vocabularies that are used together
with each format. 

Besides the data extracted from the named markup syntaxes, the
WebDataCommons project also provides one of the largest publicly
accessible corpora of WebTables extracted from web crawls as well
as a collection of hypernyms extracted from billions of web pages for download.


General information about the WebDataCommons project is found at 

http://webdatacommons.org/ 


Data Set Statistics: 

Basic statistics about the November 2015 RDFa, Microdata, Embedded JSON-LD
and Microformat data sets as well as the vocabularies that are used together
with each markup format are found at: 

http://webdatacommons.org/structureddata/2015-11/stats/stats.html


Comparing these statistics to those of the December 2014
release of the data sets


http://webdatacommons.org/structureddata/2014-12/stats/stats.html


we see that the adoption of the Microdata markup syntax has again
increased (1.1 million websites in 2015 compared to 819 thousand in
2014, where both crawls cover a comparable number of websites),
whereas the deployment of RDFa and Microformats is more or less stable.

As already observed in the former year, the vocabulary schema.org,
recommended by Google, Microsoft, Yahoo!, and Yandex, is the one most
frequently used by webmasters in the context of Microdata.
We observe a decreasing deployment of its predecessor, data-vocabulary.org.
In the context of RDFa, we still find the Open Graph Protocol
recommended by Facebook to be the most widely used vocabulary.

Topic-wise, the trends identified in the former extractions continue.
Besides navigational, blog and CMS related meta-information, many
websites annotate e-commerce related data (Products, Offers, and Reviews)
as well as contact information (LocalBusiness, Organization, PostalAddress).

For the first time, we have also extracted information marked up
using embedded JSON-LD. Over 99% of all webmasters using
this syntax use it to mark up search boxes on their
webpages (http://schema.org/SearchAction). Only a small share of the
websites also use embedded JSON-LD to annotate other
information, e.g. about organizations (92 thousand websites)
or persons (18 thousand websites).

 


Download: 

The overall size of the November 2015 RDFa, Microdata, Embedded
JSON-LD and Microformat data sets is 24.4 billion RDF quads.
For download, we split the data into 3,961 files with a total size of 404 GB. 


http://webdatacommons.org/structureddata/2015-11/stats/how_to_get_the_data.html

In addition, we have created separate files for over 50 different schema.org
classes, each including all quads from pages that deploy the specific class at least once. 


http://webdatacommons.org/structureddata/2015-11/stats/schema_org_subsets.html
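
To give an idea of how the downloaded quads can be processed, here is a minimal Python sketch that tallies predicates line by line. It assumes the files are gzip-compressed N-Quads with one quad per line. The regular expression is a simplification and not a full N-Quads parser, and the name `tally_predicates` as well as the file name in the usage comment are illustrative only.

```python
import gzip
import re
from collections import Counter

# Simplified pattern for one N-Quads line (an assumption, not a full parser):
# subject, <predicate>, object, <graph>, terminating dot.
QUAD = re.compile(
    r'^(?P<subject>\S+)\s+<(?P<predicate>[^>]+)>\s+'
    r'(?P<object>.+)\s+<(?P<graph>[^>]+)>\s*\.\s*$'
)

def tally_predicates(lines):
    """Count predicates over an iterable of text lines.

    Returns (counts, skipped), where skipped is the number of
    non-empty lines that did not look like a complete quad.
    """
    counts = Counter()
    skipped = 0
    for line in lines:
        match = QUAD.match(line)
        if match:
            counts[match.group("predicate")] += 1
        elif line.strip():
            skipped += 1
    return counts, skipped

# Usage with a hypothetical local file name:
# with gzip.open("subset.nq.gz", "rt", encoding="utf-8", errors="replace") as fh:
#     counts, skipped = tally_predicates(fh)
#     for predicate, n in counts.most_common(10):
#         print(n, predicate)
```

Because the file is streamed, memory use depends on the number of distinct predicates rather than on the file size.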



Lots of thanks to: 

+ the Common Crawl project for providing their great web crawl and thus 
enabling the WebDataCommons project. 
+ the Any23 project for providing their great library of structured data 
parsers. 
+ Amazon Web Services in Education Grant for supporting WebDataCommons. 


Have fun with the new data set. 

Cheers, 
Robert and Chris

Boran Taylan BALCI

Jul 4, 2017, 9:28:47 AM
to Web Data Commons
Hi,
First of all, I appreciate the work you have done. I would like to do research based on this data, but I am encountering problems decompressing the 2015 files. I eventually got it to work using 7zip, but the files seem to be corrupted. If you try it yourself, you will see an error related to the end of the file. Decompression tools other than 7zip abort when they encounter this error. 7zip extracts as much as it can, and the last line seems to be corrupted somehow. For example:

dataset : RiverBodyOfWater
lastline : _:node3b37fb64dbddd4b7868b29d5c42

dataset : LakeBodyOfWater
lastline : _:node73fbf0af534c3e4461d3754a55e56f88 <http://schema.org/LakeBodyOfWater/description> "\n            \n        \n          \n            Lake Naivasha is a beautiful freshwater lake, fringed by thick papyrus. The lake is almost 13kms across, but its waters are shallow with an average depth of five metres. \u00A0\n\nLake area varies greatly according to rainfall, with an average range between 114 and 991 sq kms. At the beginning of the 20th Century, Naivasha completely dried up and effectively disappeared.\u00A0\n\u00A0\nThe resulting open land was farmed, until heavy rains a few years later caused the lake to r

BR,
Boran Taylan BALCI
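
A file that stops in the middle of a line, as in the two examples above, can usually still be read up to its last complete line. The following is a minimal Python sketch of that idea and not the project's own tooling. It assumes the subset files are gzip-compressed, line-oriented text; the function name `salvage_gzip` and the file names in the usage comment are made up for illustration.

```python
import zlib

GZIP_WBITS = 16 + zlib.MAX_WBITS  # tells zlib to expect a gzip header

def salvage_gzip(src_path, dst_path, chunk_size=1 << 20):
    """Copy all newline-terminated lines of a possibly truncated gzip file.

    Returns (lines_written, ended_cleanly, fragment), where fragment is
    the text after the last newline; it is NOT written to dst_path.
    """
    decomp = zlib.decompressobj(GZIP_WBITS)
    pending = b""          # decompressed bytes after the last newline
    lines_written = 0
    ended_cleanly = False  # True only if the data stops at a member end
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            while chunk:
                try:
                    data = decomp.decompress(chunk)
                except zlib.error:
                    # Corrupt bytes: keep what was recovered so far.
                    return lines_written, False, pending
                if decomp.eof:
                    # A gzip file may contain several members back to back.
                    ended_cleanly = True
                    chunk = decomp.unused_data
                    decomp = zlib.decompressobj(GZIP_WBITS)
                else:
                    ended_cleanly = False
                    chunk = b""
                pending += data
                cut = pending.rfind(b"\n")
                if cut != -1:
                    dst.write(pending[:cut + 1])
                    lines_written += pending.count(b"\n", 0, cut + 1)
                    pending = pending[cut + 1:]
    return lines_written, ended_cleanly, pending

# Usage with hypothetical file names:
# n, ok, fragment = salvage_gzip("subset.nq.gz", "subset.nq")
# print(n, "complete lines;", "clean end" if ok else "truncated input")
```

Dropping the trailing fragment keeps the output free of half-written quads, at the cost of losing whatever followed the truncation point.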
 

Anna Primpeli

Jul 4, 2017, 10:35:36 AM
to web-data...@googlegroups.com
Hello Boran,

thank you for your message.
I was indeed able to reproduce the problem; however, I think it only affects the class-specific files of 2015.

I will try to fix the problem as soon as possible and let you know. Until then, maybe you could experiment with the new dataset (October 2016).

Best regards,
Anna


Anna Primpeli

Jul 11, 2017, 9:13:19 AM
to Web Data Commons

Hello,

the updated schemaOrg subset files are now online.
Thank you once again for your feedback! Please let us know in case you face any further problems.

Best,
Anna