Hi all,
We are happy to announce the release of the Web Data Commons Product Data Corpus V.2020. The corpus is extracted from the December 2020 WDC schema.org Product Microdata and JSON-LD subsets. Compared to our previously published Product Data Corpus, which contains data from 2017, the current corpus is four times larger and covers up-to-date products from 2020.
The current version of the WDC Product Data Corpus consists of more than 98 million product offers originating from 603 thousand websites. Grouping the offers based on the co-occurrence of their annotated product identifier values, such as GTINs and MPNs, results in more than 7.1 million clusters of size two or larger. 1.5 million of these clusters have a size larger than three while 670 thousand have a size larger than five.
Motivation:
Many e-shops have started to mark up offers within HTML pages using schema.org annotations. In recent years, many of these e-shops have also started to annotate product identifiers within their pages, such as schema.org/Product/sku, gtin8, gtin13, gtin14, and mpn. These identifiers allow offers for the same product from different e-shops to be grouped into clusters and can thus be used as supervision for training matching methods [1, 2].
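To illustrate the grouping idea, here is a minimal sketch in Python. It is not the WDC pipeline, which also includes cleansing steps not shown here. It assumes each offer is given as a pair of a hypothetical offer id and a set of identifier strings, and it treats any two offers that share an identifier value as belonging to the same cluster (connected components via union-find). The function name cluster_offers and the input layout are assumptions made for this example only.

```python
from collections import defaultdict


def cluster_offers(offers):
    """Group offer ids that are linked by shared identifier values.

    offers: list of (offer_id, set_of_identifier_strings) pairs (hypothetical layout).
    Returns a list of sets, keeping only clusters with two or more offers.
    """
    # Every offer starts out as its own cluster root.
    parent = {offer_id: offer_id for offer_id, _ in offers}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Remember the first offer seen for each identifier value and
    # merge every later offer carrying the same value into its cluster.
    first_seen = {}
    for offer_id, identifiers in offers:
        for value in identifiers:
            if value in first_seen:
                union(first_seen[value], offer_id)
            else:
                first_seen[value] = offer_id

    clusters = defaultdict(set)
    for offer_id, _ in offers:
        clusters[find(offer_id)].add(offer_id)
    return [members for members in clusters.values() if len(members) >= 2]


# Made-up example data: three offers linked through a GTIN and an MPN,
# plus one offer with an identifier nobody else uses.
example_offers = [
    ("shop1/offer-a", {"gtin13:0000000000017"}),
    ("shop2/offer-b", {"gtin13:0000000000017", "mpn:X-100"}),
    ("shop3/offer-c", {"mpn:X-100"}),
    ("shop4/offer-d", {"gtin13:0000000000024"}),
]
example_clusters = cluster_offers(example_offers)
```

Note that this sketch merges transitively: offer-a and offer-c end up in the same cluster although they share no identifier directly. Noisy identifier annotations can therefore merge unrelated products, which is one reason a cleansing step matters in practice.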
In our previous work, we exploited this source of supervision and published the largest publicly available corpus for entity matching, which was extracted from the WDC 2017 schema.org Microdata Product Corpus [3]. Given the considerable increase in the adoption of product-related annotations, we have improved our cleansing workflow and are publishing a new version of the WDC Product Corpus, which is extracted from the WDC 2020 schema.org Microdata and JSON-LD Product corpora. Additionally, we map a subset of the offers of the WDC Product Data Corpus V.2020 to the Schema.org table corpus in order to group table rows that refer to the same real-world product entities.
More Information:
More information about the curation and statistics of the WDC Product Data Corpus V.2020 can be found on the WDC website, which also offers the corpus and the mapping to the schema.org table corpus for public download:
Acknowledgements:
Special thanks to our student Marc Becker for his contribution to the improvement of the extraction pipeline and the curation of the Product Data Corpus V.2020.
Have fun with the new Product Data Corpus!
Cheers,
Anna Primpeli & Christian Bizer
[1] Peeters, R., et al.: Using schema.org annotations for training and maintaining product matchers. Proceedings of the 10th International Conference on Web Intelligence, Mining and Semantics. (2020).
[2] Peeters, R., et al.: Intermediate Training of BERT for Product Matching. DI2KG@VLDB. (2020).
[3] Primpeli, A., Peeters, R., & Bizer, C.: The WDC training dataset and gold standard for large-scale product matching. Companion Proceedings of the 2019 World Wide Web Conference. (2019).