ANN: WebDataCommons releases 20.4 billion quads of RDFa, Microdata, and Microformat data originating from 2.7 million pay-level domains

Robert Meusel

Apr 13, 2015, 4:31:25 AM
to web-data...@googlegroups.com
Hi all,

we are happy to announce a new release of the WebDataCommons RDFa,
Microdata, and Microformat data sets.

The data sets have been extracted from the December 2014 version of the
Common Crawl covering 2.01 billion HTML pages which originate from 15.7
million websites (pay-level domains).

Altogether, we discovered structured data within 620 million HTML pages out
of the 2.01 billion pages contained in the crawl (30%). These pages
originate from 2.7 million different pay-level domains out of the 15.7
million pay-level domains covered by the crawl (17%).
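
The two shares above can be re-derived from the rounded counts quoted in this announcement. The following is only a small Python sketch: the variable names are illustrative, and because the inputs are rounded figures rather than the exact crawl totals, the results are approximate (roughly 30% and 17%).

```python
# Rounded counts quoted in the announcement (not the exact crawl totals).
pages_total = 2.01e9       # HTML pages in the December 2014 crawl
pages_with_data = 620e6    # pages containing structured data
plds_total = 15.7e6        # pay-level domains covered by the crawl
plds_with_data = 2.7e6     # pay-level domains with structured data


def share(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


page_share = share(pages_with_data, pages_total)
pld_share = share(plds_with_data, plds_total)

print(f"pages with structured data: {page_share:.1f}%")
print(f"pay-level domains with structured data: {pld_share:.1f}%")
```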

Approximately 571 thousand of these websites use RDFa, while 819 thousand
websites use Microdata. Microformats are used by over 1 million websites within
the crawl.


Background:

More and more websites embed structured data describing, for instance,
products, people, organizations, places, events, reviews, and cooking
recipes into their HTML pages using markup formats such as RDFa, Microdata,
and Microformats.

The WebDataCommons project extracts all Microformat, Microdata, and RDFa
data from the Common Crawl web corpus, the largest and most up-to-date web
corpus that is available to the public, and provides the extracted data for
download. In addition, we publish statistics about the adoption of the
different markup formats as well as the vocabularies that are used together
with each format.

General information about the WebDataCommons project is found at

http://webdatacommons.org/


Data Set Statistics:

Basic statistics about the December 2014 RDFa, Microdata, and Microformat
data sets as well as the vocabularies that are used together with each
markup format are found at:

http://webdatacommons.org/structureddata/2014-12/stats/stats.html

Comparing these statistics with those of the November 2013 release of the data sets

http://webdatacommons.org/structureddata/2013-11/stats/stats.html

we see that the adoption of the Microdata markup syntax has again
increased (819 thousand websites in 2014 compared to 463 thousand in 2013,
where both crawls cover a comparable number of websites), whereas the
deployment of RDFa and Microformats is more or less stable.

Looking at the adoption of different vocabularies, we see that webmasters
mostly follow the recommendation by Google, Microsoft, Yahoo!, and Yandex to
use the schema.org vocabularies as well as their predecessors in the context
of Microdata. In the context of RDFa, the most widely used vocabulary is the
Open Graph Protocol recommended by Facebook.

Topic-wise, the trend already identified from 2012 to 2013 continues.
Besides navigational, blog, and CMS-related meta-information, many
websites mark up e-commerce-related data (Products, Offers, and Reviews)
as well as contact information (LocalBusiness, Organization, PostalAddress).


Download:

The overall size of the December 2014 RDFa, Microdata, and Microformat data
sets is 20.4 billion RDF quads. For download, we split the data into 3,533
files with a total size of 357 GB.

http://webdatacommons.org/structureddata/2014-12/stats/how_to_get_the_data.html
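
For readers who want a first look at the downloaded files, here is a minimal Python sketch that counts quads per source host in one gzipped N-Quads file. It rests on assumptions that are not stated in this announcement: that every line carries four terms, and that the fourth term (the graph) is the URL of the page the quad was extracted from. The function names are made up for illustration, and the sketch groups by host name, not by pay-level domain, which would additionally require a public-suffix list.

```python
import gzip
from collections import Counter
from urllib.parse import urlsplit


def graph_iri(line):
    """Return the graph IRI (last term) of an N-Quads line, or None.

    Simplified: assumes the line has four terms and ends with '.'.
    IRIs cannot contain spaces, so the last whitespace-separated
    token before the final '.' is taken as the graph term.
    """
    line = line.strip()
    if not line or line.startswith("#") or not line.endswith("."):
        return None
    parts = line[:-1].rstrip().rsplit(None, 1)
    if not parts:
        return None
    last = parts[-1]
    if last.startswith("<") and last.endswith(">"):
        return last[1:-1]
    return None


def count_quads_per_host(lines):
    """Count quads per host name of the graph IRI."""
    counts = Counter()
    for line in lines:
        iri = graph_iri(line)
        if iri is None:
            continue
        try:
            host = urlsplit(iri).hostname or ""
        except ValueError:
            # Malformed URL in the data; skip the quad.
            continue
        counts[host] += 1
    return counts


def count_file(path):
    """Stream one gzipped N-Quads file without loading it into memory."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        return count_quads_per_host(fh)
```

Calling `count_file` on one of the downloaded `.gz` files returns a `Counter`, so `most_common(10)` on the result lists the ten hosts contributing the most quads in that file.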

In addition, we have created separate files for over 50 different schema.org classes, each including all quads from pages that deploy the specific class at least once.

http://webdatacommons.org/structureddata/2014-12/stats/schema_org_subsets.html
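
The selection rule for these class-specific files (keep all quads of a page if the page uses the class at least once) can be illustrated with a short Python sketch. This is not the code used to build the subsets. It assumes that class usage shows up as an rdf:type quad and that the graph term identifies the page; `parse_quad` and `class_subset` are hypothetical helper names. It also holds all quads in memory, so for the real files a two-pass approach over the data would be more appropriate.

```python
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def parse_quad(line):
    """Split one N-Quads line into raw (subject, predicate, object, graph) terms.

    Simplified: assumes four terms per line; the object may contain
    spaces (literals), the other terms may not.
    """
    body = line.strip()
    if not body.endswith("."):
        return None
    body = body[:-1].rstrip()
    head = body.split(None, 2)
    if len(head) < 3:
        return None
    subject, predicate, rest = head
    tail = rest.rsplit(None, 1)
    if len(tail) < 2:
        return None
    obj, graph = tail
    return subject, predicate, obj, graph


def class_subset(lines, class_iri):
    """Keep every quad from pages that use class_iri at least once."""
    quads = [q for q in map(parse_quad, lines) if q is not None]
    wanted = f"<{class_iri}>"
    # Pages (graphs) with at least one rdf:type quad for the class.
    pages = {g for s, p, o, g in quads if p == RDF_TYPE and o == wanted}
    return [q for q in quads if q[3] in pages]
```

For example, `class_subset(lines, "http://schema.org/Product")` returns the quads of all pages that declare at least one schema.org Product, including quads on those pages that describe other things.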


Lots of thanks to:

+ the Common Crawl project for providing their great web crawl and thus
enabling the Web Data Commons project.
+ the Any23 project for providing their great library of structured data
parsers.
+ Amazon Web Services in Education Grant for supporting WebDataCommons.


Have fun with the new data set.

Cheers,
Robert Meusel, Anna Primpeli, and Christian Bizer 

Jon Clement

May 17, 2015, 3:16:07 PM
to web-data...@googlegroups.com
Amazing resource. I've been using the free FlashGet program to download the .gz files on my Windows machine. It is multi-threaded, resumable, and imports the file lists without problems.
As for other tools, I will likely start a new thread, but it would be interesting to know which RDF databases people are using.

Jon.

Mauro Dragoni

Jun 19, 2015, 3:38:57 AM
to web-data...@googlegroups.com
Dear Robert and all the Group,
thanks for this amazing resource.

I just want to inform you that it seems there is an error in the html-mf-hlisting.list file.
A backslash is missing in all the rows.
Below, you may find the fixed content:


Cheers,
Mauro.

Robert Meusel

Jun 19, 2015, 3:46:36 AM
to web-data...@googlegroups.com
Thanks a lot Mauro. I have applied the fixes to the page.

Mauro Dragoni

Jun 19, 2015, 8:23:15 AM
to web-data...@googlegroups.com
Thanks Robert!
I may have found another problem on the T2D Gold Standard page: it seems that the dbpedia_subset.tar.gz file cannot be downloaded from the download section.

Cheers,
Mauro.

Robert Meusel

Jul 10, 2015, 3:16:32 AM
to web-data...@googlegroups.com
Hi Mauro,

Sorry for the delay. The link should work now.

Cheers,
Robert