[ANN] WDC Product Data Corpus V.2020 released

Anna Primpeli

Sep 10, 2021, 10:26:48 AM

to web-data...@googlegroups.com

Hi all,

We are happy to announce the release of the Web Data Commons Product Data Corpus V.2020. The corpus is extracted from the December 2020 WDC schema.org Product Microdata and JSON-LD subsets.
In comparison to our previously published Product Data Corpus, which contains data from 2017, the current corpus is four times larger and covers up-to-date products from 2020.

The current version of the WDC Product Data Corpus consists of more than 98 million product offers originating from 603 thousand websites. Grouping the offers based on the co-occurrence of their annotated product identifier values, such as GTINs and MPNs, results in more than 7.1 million clusters of size two or larger. 1.5 million of these clusters have a size larger than three while 670 thousand have a size larger than five.
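To illustrate what grouping by co-occurring identifier values means, here is a minimal Python sketch. It is not the WDC extraction pipeline: the function name, the offer IDs, and the "type:value" identifier strings are made up for illustration, and the sketch simply treats offers that share any identifier value (directly or transitively) as one cluster, using a union-find structure.

```python
from collections import defaultdict


def cluster_offers(offers):
    """Group offer IDs that share at least one identifier value, transitively.

    `offers` maps a (hypothetical) offer ID to a set of identifier strings.
    Returns only clusters of size two or larger.
    """
    parent = {}

    def find(x):
        # Find the representative of x, shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Remember the first offer seen for each identifier value and
    # merge every later offer carrying the same value into its group.
    first_offer_for_value = {}
    for offer_id, identifiers in offers.items():
        find(offer_id)
        for value in identifiers:
            if value in first_offer_for_value:
                union(first_offer_for_value[value], offer_id)
            else:
                first_offer_for_value[value] = offer_id

    clusters = defaultdict(set)
    for offer_id in offers:
        clusters[find(offer_id)].add(offer_id)
    return [members for members in clusters.values() if len(members) >= 2]


# Illustrative data only; these are not offers from the corpus.
offers = {
    "shop-a/1": {"gtin13:4006381333931"},
    "shop-b/7": {"gtin13:4006381333931", "mpn:AB-100"},
    "shop-c/3": {"mpn:AB-100"},
    "shop-d/9": {"mpn:ZZ-9"},
}
clusters = cluster_offers(offers)
```

In this toy input, the first three offers end up in one cluster because the second offer links the GTIN and the MPN, while the fourth offer stays unclustered and is dropped.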

Motivation:

Many e-shops have started to mark up offers within HTML pages using schema.org annotations. In recent years, many of these e-shops have also started to annotate product identifiers within their pages, such as schema.org/Product/sku, gtin8, gtin13, gtin14, and mpn. These identifiers allow offers for the same product from different e-shops to be grouped into clusters and can thus be used as supervision for training matching methods [1, 2].

In our previous work, we exploited this source of supervision and published the largest publicly available corpus for entity matching, which was extracted from the WDC 2017 schema.org Microdata Product Corpus [3]. Given the considerable increase in the adoption of product-related annotations, we have improved our cleansing workflow and published a new version of the WDC Product Corpus, which is extracted from the WDC 2020 schema.org Microdata and JSON-LD Product corpora. Additionally, we map a subset of the offers of the WDC Product Data Corpus V.2020 to the Schema.org table corpus for grouping table rows that refer to the same real-world product entities.

More Information:

More information about the curation and statistics of the WDC Product Data Corpus V.2020 can be found on the WDC website, which also offers the corpus and the mapping to the schema.org table corpus for public download:

http://webdatacommons.org/largescaleproductcorpus/v2020/index.html

Acknowledgements:

Special thanks to our student Marc Becker for his contribution to the improvement of the extraction pipeline and the curation of the Product Data Corpus V2020.

Have fun with the new Product Data Corpus!

Cheers,

Anna Primpeli & Christian Bizer

[1] Peeters, Ralph, et al. "Using schema.org annotations for training and maintaining product matchers." Proceedings of the 10th International Conference on Web Intelligence, Mining and Semantics. (2020).

[2] Peeters, Ralph, et al. "Intermediate Training of BERT for Product Matching." DI2KG@VLDB. (2020).

[3] Primpeli, A., Peeters, R., & Bizer, C. "The WDC training dataset and gold standard for large-scale product matching." Companion Proceedings of the 2019 World Wide Web Conference. (2019).

Kelly Murphy

Jan 27, 2022, 3:44:23 PM

to Web Data Commons

Hi Anna,

I am trying to work on the product corpus (V.2020 referenced in this post) with the matched products in clusters.

For the cluster_ID or the product instance mapped to the cluster, is there a category? I only want to work on a subset of categories, and I can see no way to identify which category a cluster belongs to.

Do you have any schemas or docs that you could point me to that show this?

I have looked at your other data sets (Gold Standard for Product Categorization), which have categories. Is there something that links these to the V.2020 clusters?

I appreciate your help.

Thanks

Kelly

Anna Primpeli

Jan 31, 2022, 3:20:48 AM

to web-data...@googlegroups.com

Hello Kelly,

Thank you for your e-mail and interest in the corpus!

Unfortunately, we haven't performed any categorization for the V2020 corpus. The only category information available, once you have mapped the offers to the table corpus (as described in Section 6. Download), is the value annotated with schema.org/Product/category. Of course, in that case the category values are the ones assigned by the individual e-shops and are therefore not normalized to a consistent schema.
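As a rough sketch of how such un-normalized category values could still be used to pick a subset of clusters, the Python snippet below matches keywords against the raw category strings. It assumes the mapped rows have already been loaded into dictionaries; the keys "cluster_id" and "category", the function name, and the sample rows are hypothetical and do not reflect the actual file layout of the corpus or the table corpus.

```python
from collections import defaultdict


def clusters_matching_categories(rows, keywords):
    """Return {cluster_id: set of matched keywords} for rows whose raw
    category string contains one of the keywords (case-insensitive).

    `rows` is an iterable of dicts with the hypothetical keys
    "cluster_id" and "category"; missing categories are skipped.
    """
    wanted = [keyword.lower() for keyword in keywords]
    hits = defaultdict(set)
    for row in rows:
        # E-shops assign categories freely, so only light cleanup is done.
        category = (row.get("category") or "").strip().lower()
        for keyword in wanted:
            if keyword in category:
                hits[row["cluster_id"]].add(keyword)
    return dict(hits)


# Illustrative rows only; not taken from the corpus.
rows = [
    {"cluster_id": 1, "category": "Electronics > Headphones"},
    {"cluster_id": 1, "category": None},
    {"cluster_id": 2, "category": "Shoes"},
    {"cluster_id": 3, "category": " HEADPHONES & audio "},
]
selected = clusters_matching_categories(rows, ["headphones"])
```

A cluster is selected as soon as one of its offers carries a matching category, so offers without a category annotation inherit the selection through their cluster. Simple substring matching will miss synonyms and other languages, so the keyword list would need tuning for any real use.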

However, we have categorized the offers of the English subset of the 2017 Product Corpus. You can find details on how we performed the categorization, evaluation results and the result files for download here:

http://webdatacommons.org/categorization/index.html

I hope that helps! In case you have any further questions, please let me know!

Best,

Anna


--

Anna Primpeli

apri...@gmail.com

an...@informatik.uni-mannheim.de
