[ANN] WebDataCommons releases 44.2 billion quads Microdata, Embedded JSON-LD, RDFa, and Microformat data originating from 11.9 million websites


Anna Primpeli
Jan 13, 2020, 4:34:07 AM
to Web Data Commons

Hi all,


We are happy to announce the new release of the WebDataCommons Microdata, JSON-LD, RDFa and Microformat data corpus.

The data has been extracted from the November 2019 version of the Common Crawl covering 2.4 billion HTML pages which originate from 32 million websites (pay-level domains).

In summary, we found structured data within 934 million HTML pages out of the 2.4 billion pages contained in the crawl (37.9%). These pages originate from 11.9 million different pay-level domains out of the 32 million pay-level-domains covered by the crawl (37.2%).

Approximately 6.3 million of these websites use Microdata, 5.1 million websites use JSON-LD, and 1 million websites make use of RDFa. Microformats are used by more than 4 million websites within the crawl.

 


Background: 

More and more websites annotate data describing, for instance, products, people, organizations, places, events, reviews, and cooking recipes within their HTML pages using markup formats such as Microdata, embedded JSON-LD, RDFa, and Microformats.

The WebDataCommons project extracts all Microdata, JSON-LD, RDFa, and Microformat data from the Common Crawl web corpus, the largest web corpus that is available to the public, and provides the extracted data for download. In addition, we publish statistics about the adoption of the different markup formats as well as the vocabularies that are used together with each format. We have been running yearly extractions since 2012 and provide the dataset series as well as the related statistics at:

http://webdatacommons.org/structureddata/
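
For readers less familiar with the markup: embedded JSON-LD, for example, is simply a <script type="application/ld+json"> block inside the HTML page. The extraction itself is based on the Any23 parsers (see the acknowledgements below); purely as an illustration, here is a minimal Python sketch, using only the standard library and a made-up sample page, that pulls such blocks out of the HTML:

    import json
    from html.parser import HTMLParser

    class JsonLdExtractor(HTMLParser):
        """Collects the contents of <script type="application/ld+json"> blocks."""

        def __init__(self):
            super().__init__()
            self._in_jsonld = False
            self._buffer = []
            self.items = []

        def handle_starttag(self, tag, attrs):
            if tag == "script" and dict(attrs).get("type") == "application/ld+json":
                self._in_jsonld = True

        def handle_data(self, data):
            if self._in_jsonld:
                self._buffer.append(data)

        def handle_endtag(self, tag):
            if tag == "script" and self._in_jsonld:
                text = "".join(self._buffer).strip()
                if text:
                    self.items.append(json.loads(text))
                self._buffer = []
                self._in_jsonld = False

    # made-up example page with one embedded JSON-LD block
    html_page = """<html><head>
    <script type="application/ld+json">
    {"@context": "https://schema.org", "@type": "Product",
     "name": "Example Widget", "offers": {"@type": "Offer", "price": "9.99"}}
    </script>
    </head><body>...</body></html>"""

    extractor = JsonLdExtractor()
    extractor.feed(html_page)
    for item in extractor.items:
        print(item["@type"], item.get("name"))  # -> Product Example Widget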

 


Statistics about the November 2019 Release:


Basic statistics about the November 2019 Microdata, JSON-LD, RDFa, and Microformat data sets as well as the vocabularies that are used together with each markup format are found at: 

http://webdatacommons.org/structureddata/2019-12/stats/stats.html

 


Markup Format Adoption


The page below provides an overview of the increase in the adoption of the different markup formats as well as widely used schema.org classes from 2012 to 2019:

http://webdatacommons.org/structureddata/#toc3

Comparing the statistics from the new 2019 release to the statistics about the November 2018 release of the data sets

http://webdatacommons.org/structureddata/2018-12/stats/stats.html

we can observe that, although the size of the November 2018 crawl is similar to that of November 2019, the relative number of PLDs using structured data increased significantly, from 29.3% to 37.2%. However, differences in the crawling strategies used for the two crawls make it difficult to compare absolute numbers directly. Even though Microdata and embedded JSON-LD clearly dominate in terms of the number of PLDs, the distribution over the number of extracted entities looks different, with the Microformat hCard dominating. This is a result of deeper crawling of blogging domains, such as blogspot and wordpress, which extensively use the hCard Microformat to annotate post-related data.

 


Vocabulary Adoption


Concerning vocabulary adoption, schema.org, the vocabulary recommended by Google, Microsoft, Yahoo!, and Yandex, continues to be the most dominant vocabulary in the context of Microdata, with 73% of the webmasters using it, compared to its predecessor, data-vocabulary, which is used by only 11% of the websites containing Microdata. In the context of RDFa, the Open Graph Protocol recommended by Facebook remains the most widely used vocabulary.

 


Download


The overall size of the November 2019 RDFa, Microdata, Embedded JSON-LD and Microformat data sets is 44.2 billion RDF quads. For download, we split the data into 9,925 files with a total size of 1.01 TB.

http://webdatacommons.org/structureddata/2019-12/stats/how_to_get_the_data.html
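
The extracted data is published as RDF quads; assuming the usual N-Quads serialization and one of the gzip-compressed part files downloaded locally (the file name below is just a placeholder), a minimal sketch for streaming a file and counting how often each predicate occurs looks like this:

    import gzip
    from collections import Counter

    predicates = Counter()
    # placeholder name; the real file list is linked from the page above
    with gzip.open("part-of-the-corpus.nq.gz", "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            # N-Quads: <subject> <predicate> <object> <graph> .
            # Splitting on the first two spaces isolates the predicate; objects
            # may be literals that themselves contain spaces.
            try:
                _subject, predicate, _rest = line.split(" ", 2)
            except ValueError:
                continue  # skip malformed lines instead of failing
            predicates[predicate] += 1

    for pred, count in predicates.most_common(10):
        print(f"{count:14,d}  {pred}")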

In addition, we have created separate files for over 43 different schema.org classes, each containing all quads extracted from pages that use the specific schema.org class.

http://webdatacommons.org/structureddata/2019-12/stats/schema_org_subsets.html
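
The class-specific subsets above are precomputed, but deriving a similar subset from the general files yourself is straightforward. As a sketch (placeholder file name again; both the http and https spellings of schema.org are checked, since both occur in practice), the following collects the URLs of pages that contain at least one entity typed as schema.org Product:

    import gzip

    RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
    TARGET_CLASSES = {
        "<http://schema.org/Product>",
        "<https://schema.org/Product>",
    }

    product_pages = set()
    with gzip.open("part-of-the-corpus.nq.gz", "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            # rdf:type quads have IRIs in all four positions, so a plain split is safe here
            parts = line.rstrip("\n").rstrip(" .").split(" ")
            if len(parts) == 4 and parts[1] == RDF_TYPE and parts[2] in TARGET_CLASSES:
                product_pages.add(parts[3])  # the graph IRI is the URL of the source page

    print(len(product_pages), "pages with schema.org Product markup")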

 


Lots of thanks to: 

+ the Common Crawl project for providing their great web crawl and thus enabling the WebDataCommons project. 
+ the Any23 project for providing their great library of structured data parsers. 
+ Amazon Web Services in Education Grant for supporting WebDataCommons. 

General Information about the WebDataCommons Project


The WebDataCommons project extracts structured data from the Common Crawl, the largest web corpus available to the public, and provides the extracted data for public download in order to support researchers and companies in exploiting the wealth of information that is available on the Web. Besides the yearly extractions of semantic annotations from web pages, the WebDataCommons project also provides large hyperlink graphs, the largest public corpus of web tables, two corpora of product data, as well as a collection of hypernyms extracted from billions of web pages for public download. General information about the WebDataCommons project can be found at

http://webdatacommons.org/


Have fun with the new data set. 

Cheers, 


Anna Primpeli and Chris Bizer

Sebastian Nagel
Jan 14, 2020, 9:26:21 AM
to Web Data Commons
Hi Anna, hi Chris,

thanks for the extraction and for the detailed statistics about what has been extracted!

> we can observe that, although the size of the November 2018 crawl is similar to that of November 2019, the relative number of PLDs using structured data increased significantly, from 29.3% to 37.2%. However, differences in the crawling strategies used for the two crawls make it difficult to compare absolute numbers directly. Even though Microdata and embedded JSON-LD clearly dominate in terms of the number of PLDs, the distribution over the number of extracted entities looks different, with the Microformat hCard dominating. This is a result of deeper crawling of blogging domains, such as blogspot and wordpress, which extensively use the hCard Microformat to annotate post-related data.

Well, I'm not sure whether the differences - mainly the 44 billion extracted quads in 2019 vs. 32 billion in 2018 - can be traced back to blogspot and wordpress alone, or to other blogging domains. It's correct that since Feb 2019 [1] subdomains of blogspot.com and wordpress.com have been crawled somewhat deeper, because a limit (500k) is applied to the maximum number of subdomains per domain, which in turn forces the crawler to fetch more pages per subdomain. However, this can hardly explain the difference:

 Nov 2019 (CC-MAIN-2019-47)
       #pages  #subdomains  #urls_with_triples  #extr_triples  domain
   29,702,041      527,193          19,409,126    411,891,887  blogspot.com
   26,327,223      427,289          18,255,371    961,381,004  wordpress.com

 Nov 2018 (CC-MAIN-2018-47)
   47,231,818    2,029,105          37,543,780    441,766,940  blogspot.com
   14,618,652    1,025,367           8,410,578    319,760,823  wordpress.com
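
To put the "more pages per subdomain" effect in numbers, a quick back-of-the-envelope calculation from the table above (simply dividing pages by subdomains):

    crawls = {
        "Nov 2019 blogspot.com":  (29_702_041,   527_193),
        "Nov 2019 wordpress.com": (26_327_223,   427_289),
        "Nov 2018 blogspot.com":  (47_231_818, 2_029_105),
        "Nov 2018 wordpress.com": (14_618_652, 1_025_367),
    }
    for name, (pages, subdomains) in crawls.items():
        print(f"{name}: {pages / subdomains:5.1f} pages per subdomain")
    # roughly 56 (blogspot) and 62 (wordpress) pages per subdomain in 2019,
    # versus about 23 and 14 in 2018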

Especially for html-mf-hcard, which alone contributes 6 billion more triples/quads in 2019, it seems that quite a lot of domains are involved in the increase: in 2018 the domain at position 1000 [2] served 300k triples, while in 2019 [3] it is 1.3 million triples. Interestingly, only a few properties are linked in the vast amount of new hCard triples: n, fn, given-name, family-name, url and photo.

Anyway, I also do not have a good explanation for why hCard Microformats and JSON-LD (an increase of 4 billion triples) have jumped up.

Best,
Sebastian
