[ANN] WebDataCommons releases 97.7 billion quads of Microdata, Embedded JSON-LD, RDFa, and Microformat data originating from 14.6 million websites

Alexander Brinkmann

Feb 6, 2024, 11:07:01 AM
to web-data...@googlegroups.com

Hi all,

we are happy to announce the new release of the WebDataCommons Microdata, JSON-LD, RDFa and Microformat data corpus.

The data has been extracted from the September/October 2023 version of the Common Crawl covering 3.35 billion HTML pages which originate from 34.1 million websites (pay-level domains).

In summary, we found structured data within 1.7 billion HTML pages out of the 3.35 billion pages in the crawl (50.60%). These pages originate from 14.6 million different pay-level domains out of the 34.1 million pay-level domains covered by the crawl (42.89%). Altogether, the extracted data sets consist of 97.7 billion RDF quads.

Approximately 9.5 million websites provide structured data using the JSON-LD syntax, 7.4 million websites use the Microdata markup format to annotate structured data within their pages, and half a million websites were found to use the RDFa markup format.

 

Statistics about the October 2023 Release:

Basic statistics about the October 2023 Microdata, JSON-LD, RDFa, and Microformat data sets as well as the vocabularies that are used along with each markup format are found at:

https://webdatacommons.org/structureddata/2023-12/stats/stats.html

 

Adoption of the Different Markup Formats

The WebDataCommons project has been extracting structured data from the CommonCrawl yearly since 2010. The October 2023 release signifies 12 years of monitoring the adoption of structured data on the Web. This allows us to spot trends concerning the adoption of different markup formats as well as the usage of specific classes and properties, a short overview of which is provided on the page below:

https://webdatacommons.org/structureddata/#results

The first WDC release in 2010 revealed that only 5.7% of the examined webpages contained structured data. In 2023, we found structured data within 50.6% of the examined webpages, indicating a huge growth in adoption over the last decade. The two markup formats that saw the largest increase in adoption are JSON-LD and Microdata.

By 2023, JSON-LD and Microdata dominate over RDFa and Microformats. JSON-LD is the most widely adopted markup format for structured data annotation, used by 66% of websites that annotate structured data. In comparison, Microdata is used by 48% of websites, while RDFa and Microformats (hCard) are used by 5% and 17% of websites, respectively.

The analysis of the richness of Microdata and JSON-LD annotations, measured by the average number of triples per webpage, shows an upward trend over the years. In 2010, an average of 21 Microdata triples were extracted from each webpage. By 2023, this number had increased to 36. JSON-LD annotations provide even more detailed information than Microdata annotations, with the average number of triples per webpage continuously increasing from 10 in 2015 to 55 in 2023.

Adoption of the Schema.org Vocabulary

The schema.org vocabulary remains the most popular in the context of Microdata and JSON-LD. It is used for annotating navigation elements within webpages, using classes such as BreadcrumbList, WebPage and SiteNavigationElement, as well as the main content of a page, using classes like Product, LocalBusiness, and JobPosting. We observe a rapidly increasing adoption of several content classes: Over the past four years, the number of websites providing Product annotations rose from 594K to 2.82M (a factor of about 4.75), the number of websites annotating LocalBusiness entities rose from 386K to 1.3M (a factor of about 3.37), and the adoption of the JobPosting class rose from 7K to 59K websites (a factor of about 8.43).
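The ratio between two website counts and the corresponding percentage increase are different quantities. As a minimal sketch, the following recomputes both from the rounded counts quoted above; the helper name `growth` is only illustrative and the results inherit the rounding of the inputs.

```python
def growth(before, after):
    """Return (ratio, percent_increase) between two website counts."""
    ratio = after / before
    percent_increase = (after - before) / before * 100
    return ratio, percent_increase


# Rounded website counts taken from the paragraph above.
counts = {
    "Product": (594_000, 2_820_000),
    "LocalBusiness": (386_000, 1_300_000),
    "JobPosting": (7_000, 59_000),
}

for schema_class, (before, after) in counts.items():
    ratio, percent_increase = growth(before, after)
    print(f"{schema_class}: x{ratio:.2f}, +{percent_increase:.0f}%")
```

A ratio of 4.75 means 4.75 times as many websites, which is an increase of about 375% over the earlier count.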

Finally, we observe that an increasing number of websites explicitly annotate entity identifiers, such as product identifiers, as well as other identifying attributes such as telephone numbers or geo coordinates for local businesses. Schema.org provides different terms for annotating different types of product identifiers, with schema:Product/sku being the most popular among them. Over the past five years, the relative adoption of the schema:Product/sku property has increased from 21% to 60%. The relative adoption of the schema:LocalBusiness/telephone property has also grown over the same period, from 64% to 77%. This supports our earlier observation that the annotations are becoming richer.

Download all Data (N-QUADS)

The overall size of the October 2023 RDFa, Microdata, Embedded JSON-LD and Microformat data sets is 97.7 billion RDF quads. For download, we split the data into 17,416 files with a total size of 1.8 TB.

http://webdatacommons.org/structureddata/2023-12/stats/how_to_get_the_data.html
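For readers who want to inspect a downloaded file without an RDF library, here is a minimal sketch of splitting N-Quads lines and counting quads per host. It rests on assumptions not stated in this announcement: that the files are gzip-compressed, that every line carries all four terms, and that the fourth term (the graph label) is the URL of the page the quad was extracted from. The function names and the file name are hypothetical, and a full N-Quads parser such as the one in rdflib would be more robust.

```python
import gzip
from collections import Counter
from urllib.parse import urlparse


def parse_quad(line):
    """Split one N-Quads line into (subject, predicate, object, graph).

    Simplified: assumes the graph term is present and contains no spaces.
    Returns None for blank lines, comments, and lines that do not fit.
    """
    text = line.strip()
    if not text or text.startswith("#"):
        return None
    if text.endswith("."):
        text = text[:-1].rstrip()  # drop the statement terminator
    parts = text.split(None, 2)  # subject and predicate never contain spaces
    if len(parts) < 3:
        return None
    subject, predicate, rest = parts
    obj_and_graph = rest.rsplit(None, 1)  # literals may contain spaces
    if len(obj_and_graph) < 2:
        return None
    obj, graph = obj_and_graph
    return subject, predicate, obj, graph.strip("<>")


def count_quads_per_host(lines):
    """Count quads by the host name of the graph (page) URL."""
    counts = Counter()
    for line in lines:
        quad = parse_quad(line)
        if quad is not None:
            counts[urlparse(quad[3]).netloc] += 1
    return counts


def count_file(path):
    """Stream a gzip-compressed N-Quads file (path is hypothetical)."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return count_quads_per_host(handle)
```

Usage would look like `count_file("some_downloaded_part.gz").most_common(10)`, where the file name stands in for one of the downloaded parts. Streaming line by line keeps memory use flat, which matters given the 1.8 TB total size.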

Download Schema.org Subset (N-QUADS)

We have also created class-specific subsets for 48 popular schema.org classes such as product, local business, event, and job posting in order to support the focused download of specific types of data.

http://webdatacommons.org/structureddata/2023-12/stats/schema_org_subsets.html

Download Schema.org Table Corpus (JSON)

We have also converted the data from the schema.org subsets into relational tables by grouping the data by class and website (host) and removing duplicates as well as sparse entities that were extracted from list pages. The resulting table corpus contains 5 million relational tables which are provided in a JSON format that can be directly read by the pandas Python library. The overall download size of the table corpus is 71 GB.

https://webdatacommons.org/structureddata/schemaorgtables/2023/
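As a rough sketch of what "directly read by pandas" could look like, the snippet below loads one table with `pandas.read_json`. It assumes that each table is a gzip-compressed file with one JSON object per row (JSON Lines); the announcement does not spell out the layout, so check the corpus page before relying on it. The demo file, its columns, and the helper names are made up so that the example runs without a download.

```python
import gzip
import json
import os
import tempfile

import pandas as pd


def load_table(path):
    """Read one table file into a DataFrame (assumes one JSON object per line)."""
    return pd.read_json(path, lines=True, compression="infer")


def write_demo_table(path):
    """Write a tiny made-up table in the assumed layout."""
    rows = [
        {"row_id": 0, "name": "Blue Widget", "sku": "BW-1"},
        {"row_id": 1, "name": "Red Widget", "sku": "RW-7"},
    ]
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")


# Hypothetical file name; a real table from the corpus would be used instead.
demo_path = os.path.join(tempfile.mkdtemp(), "Product_demo.json.gz")
write_demo_table(demo_path)
table = load_table(demo_path)
print(table)
```

With `compression="infer"`, pandas picks gzip from the `.gz` extension, so the same call also works on uncompressed files.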


Lots of thanks to:

+ The Common Crawl project for providing their great web crawl and thus enabling the WebDataCommons project.

+ The Any23 project for providing and maintaining their great library of structured data parsers.

 

Have fun with the new data.

 

Cheers,

Alexander Brinkmann, Ralph Peeters and Chris Bizer

