Hi All,
we are happy to announce a
new release of the Web Data Commons RDFa,
Microdata, Embedded JSON-LD and Microformat data
corpus.
The data corpus has been extracted from the
November 2015 version of the
Common Crawl covering 1.77 billion HTML pages
which originate from 14.4
million websites (pay-level domains).
Altogether we discovered structured data within
541 million HTML pages out
of the 1.77 billion pages contained in the crawl
(30%). These pages
originate from 2.7 million different
pay-level-domains out of the 14.4
million pay-level domains covered by the crawl
(19%).
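The two shares above can be re-derived from the rounded counts quoted in
this mail. The following is only a small sketch; the helper name `share`
is made up for illustration, and because the inputs are rounded, the
results only approximate the percentages stated above.

```python
def share(part: float, total: float) -> float:
    """Return part/total expressed as a percentage."""
    return 100.0 * part / total


# Rounded counts as quoted in this announcement.
pages_with_data = 541e6   # HTML pages containing structured data
pages_total = 1.77e9      # HTML pages in the crawl
plds_with_data = 2.7e6    # pay-level domains with structured data
plds_total = 14.4e6       # pay-level domains in the crawl

page_share = share(pages_with_data, pages_total)
pld_share = share(plds_with_data, plds_total)

print(f"pages with structured data: {page_share:.1f}%")
print(f"pay-level domains with structured data: {pld_share:.1f}%")
```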
Approximately 521 thousand of these websites use
RDFa, while 1.1 million
websites use Microdata. Microformats are also used
by over 1 million
websites within the crawl. For the first time, we have also extracted
embedded JSON-LD, which we can report to be used by more
than 596 thousand websites.
Background:
More and more websites embed structured data
describing for instance
products, people, organizations, places, events,
reviews, and cooking
recipes into their HTML pages using markup
formats such as RDFa, Microdata
and Microformats.
The WebDataCommons project extracts all
Microformat, Microdata and RDFa
data, and since 2015 also embedded JSON-LD data
from the Common Crawl
web corpus, the largest and most up-to-date web corpus that is
available to the public, and provides
the extracted data for download.
In addition, we publish statistics about
the adoption of the different
markup formats as well as the vocabularies
that are used together
with each format.
Besides the data extracted from the named markup syntaxes, the
WebDataCommons project also provides one of the largest publicly
accessible corpora of WebTables extracted from web crawls as well
as a collection of hypernyms extracted from billions of web pages for download.
General information about the WebDataCommons
project is found at
http://webdatacommons.org/
Data Set Statistics:
Basic statistics about the November 2015 RDFa,
Microdata, Embedded JSON-LD
and Microformat data sets as well
as the vocabularies that are used together with each
markup format are found at:
http://webdatacommons.org/structureddata/2015-11/stats/stats.html
Comparing the statistics to
the statistics about the December 2014
release of the data sets
http://webdatacommons.org/structureddata/2014-12/stats/stats.html
we see that the adoption of
the Microdata markup syntax has again
increased (1.1 million websites in 2015 compared to 819 thousand in
2014, where both crawls cover a comparable number of websites),
whereas the deployment of RDFa and
Microformats is more or less stable.
As already observed in the
previous year, the vocabulary schema.org,
recommended by Google, Microsoft, Yahoo!, and Yandex, is the one most
frequently used by webmasters in the context of Microdata.
We observe a decreasing deployment of its predecessor, the data vocabulary.
In the context of RDFa, we still find the Open Graph Protocol
recommended by Facebook to be the most widely used vocabulary.
Topic-wise, the trends
identified in the previous extractions continue.
We see that, besides navigational, blog and CMS related
meta-information, many websites annotate e-commerce related data
(Products, Offers, and Reviews) as well as
contact information
(LocalBusiness, Organization, PostalAddress).
For the first time, we have also extracted
information marked up
using embedded JSON-LD. Over 99% of all webmasters using
this syntax use it to mark up search boxes on their
webpages (http://schema.org/SearchAction).
Only a small share of these
websites also use embedded JSON-LD to annotate other
information, e.g. about organizations (92 thousand websites)
or persons (18 thousand websites).
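To illustrate what such search box markup looks like, here is a minimal
sketch that pulls embedded JSON-LD out of an HTML page and checks for a
schema.org SearchAction. It is not the extractor used by the project;
the class `JsonLdExtractor`, the helper `has_search_action` and the
example.com page are assumptions made up for this illustration, and only
the Python standard library is used.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect parsed contents of <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed JSON-LD instead of failing


def has_search_action(doc) -> bool:
    """True if the document declares a SearchAction as potentialAction."""
    action = doc.get("potentialAction", {})
    return isinstance(action, dict) and action.get("@type") == "SearchAction"


# Hypothetical page with a typical search box annotation.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "WebSite",
  "url": "http://www.example.com/",
  "potentialAction": {
    "@type": "SearchAction",
    "target": "http://www.example.com/search?q={search_term_string}",
    "query-input": "required name=search_term_string"
  }
}
</script>
</head><body></body></html>
"""

extractor = JsonLdExtractor()
extractor.feed(SAMPLE_HTML)
search_boxes = [
    d for d in extractor.documents
    if isinstance(d, dict) and has_search_action(d)
]
print(len(search_boxes), "page-level search box annotation(s) found")
```

The sketch only looks at a top-level `potentialAction` object; JSON-LD
that wraps several items in an array or in `@graph` would need extra
handling.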
Download:
The overall size of the November 2015 RDFa,
Microdata, Embedded
JSON-LD and Microformat data sets
is 24.4 billion RDF quads.
For download, we split the data into 3,961 files
with a total size of 404 GB.
http://webdatacommons.org/structureddata/2015-11/stats/how_to_get_the_data.html
In addition, we have created
separate files for over 50 different schema.org classes.
Each file includes all quads from the pages that deploy the
specific class at least once.
http://webdatacommons.org/structureddata/2015-11/stats/schema_org_subsets.html
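The selection rule for these class-specific files can be sketched as
follows. This is an illustration only, not the code used to build the
subsets. It assumes that the quads are serialized as N-Quads lines whose
fourth element is the URL of the page they were extracted from; the
function names and the sample quads are made up for the example.

```python
from collections import defaultdict

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def parse_quad(line):
    """Split one N-Quads line into (subject, predicate, object, graph).

    Simplified: assumes a well-formed line that carries a graph label.
    """
    body = line.strip()
    if not body.endswith("."):
        raise ValueError(f"not a quad: {line!r}")
    body = body[:-1].rstrip()
    # Subject and predicate contain no whitespace; the object may
    # (literals), so the graph label is taken from the right-hand end.
    subject, predicate, rest = body.split(None, 2)
    obj, graph = rest.rsplit(None, 1)
    return subject, predicate, obj, graph


def class_subset(lines, class_iri):
    """Return all quads from pages that use class_iri at least once."""
    quads_by_page = defaultdict(list)
    pages_with_class = set()
    target = f"<{class_iri}>"
    for line in lines:
        if not line.strip():
            continue
        _, predicate, obj, page = parse_quad(line)
        quads_by_page[page].append(line.strip())
        if predicate == RDF_TYPE and obj == target:
            pages_with_class.add(page)
    return [
        quad
        for page, quads in quads_by_page.items()
        if page in pages_with_class
        for quad in quads
    ]


# Hypothetical quads from two pages; only the first uses schema.org/Product.
SAMPLE = [
    '_:b1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
    '<http://schema.org/Product> <http://shop.example/item1> .',
    '_:b1 <http://schema.org/name> "Blue Kettle"@en <http://shop.example/item1> .',
    '_:b2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
    '<http://schema.org/Person> <http://blog.example/about> .',
]

subset = class_subset(SAMPLE, "http://schema.org/Product")
for quad in subset:
    print(quad)
```

The sketch keeps all quads in memory, which is fine for a small sample
but not for files of the size described above; there, a streaming
approach that processes one page at a time would be needed.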