How to enrich the WDC Product Data Corpus with product image data?

Егор Еремеев

Jul 6, 2020, 1:03:26 PM

to Web Data Commons

Hello,

I'm interested in building a large-scale product image dataset based on the WDC Product Data Corpus.

I see that version 1 (http://webdatacommons.org/largescaleproductcorpus/index.html) at least includes an "/image" property in the "schema.org_properties" section of the corpus's JSON files.

Version 2 (http://webdatacommons.org/largescaleproductcorpus/v2/index.html) changes the format significantly, and its corpus files no longer contain a "schema.org_properties" section.

This confuses me, because for version 2 you wrote in section 2.2, "Schema of the Corpus":

The offers in version 2.0 of the product data corpus is described by the attributes listed below. Please note that compared to version 1.0 of the corpus, the identification schema was simplified while the actual content stayed exactly the same.

However, I do not see the same structure or content in the v1 corpus files, such as Training Corpus (English), and the v2 corpus files, such as Product Data Corpus (English).

Maybe I am missing something. Could you advise, please?

Also, if I work with v1 of the Product Data Corpus and use the "/image" property in the "schema.org_properties" section, how can I match a particular product data item with its Common Crawl source file?

Many Thanks

Egor

Ralph Peeters

Jul 9, 2020, 8:19:36 AM

to Web Data Commons

Hello Egor,

You are correct, the image links have indeed been removed from v2 because we wanted to focus on textual entity resolution and do not plan to host an image corpus due to copyright reasons. The sentence you cited is actually misleading as it refers to the token level of the schema.org properties used to build the new attributes. I will upload a file to the page of version 2 which allows you to link offers from v1 to v2.

For your other question:

I do not think the Common Crawl actually stores image files. Just for matching the offer to the corresponding crawled page of the CC, the URL is enough. If you have some resource which does store image files, you could do the following, taking an example from v1:

{"url":"https://www.glamour.com.br/polo-aleatory-combo-branco-289076/p","nodeID":"_:node1d4f6dc87bdd49d84dd1955a5551f1d","cluster_id":"15678134","identifiers":[{"/gtin8":"[78930642]"}],"schema.org_properties":[{"/image":"[\"https://glamour.vteximg.com.br/arquivos/ids/778630-398-398/Polo-Aleatory-Combo-Branco.jpg?v=636087734090100000\"@pt-br]"},{"/name":"[ polo aleatory combo branco ]"},{"/description":"[ null ]"}],"parent_NodeID":"https://www.glamour.com.br/polo-aleatory-combo-branco-289076/p","relationToParent":"http://www.w3.org/1999/xhtml/microdata#item","parent_schema.org_properties":[{"/title":"[ polo aleatory combo branco glamour ]"}]}

1. Take the URL of the offer: "https://www.glamour.com.br/polo-aleatory-combo-branco-289076/p"

2. Search the WARC files of the November 2017 Common Crawl (CC) for the timestamp at which this page was crawled.

3. Use your data source to find the exact page content at the retrieved timestamp and get the image with the URL https://glamour.vteximg.com.br/arquivos/ids/778630-398-398/Polo-Aleatory-Combo-Branco.jpg?v=636087734090100000
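Steps 1 and 2 could be sketched in Python roughly as follows. This is only an illustration under a few assumptions: the record is the v1 example above shortened to the fields used here, the helper names image_urls and index_query are made up for this sketch, and the crawl identifier CC-MAIN-2017-47 is my assumption for the November 2017 crawl, so please check it against the Common Crawl index listing before relying on it. Instead of scanning the WARC files directly, the sketch builds a query for the Common Crawl URL index, which should return the capture timestamp for a given page URL.

```python
import json
import re
from typing import List
from urllib.parse import urlencode

# The v1 example record from above, shortened to the fields used in this sketch.
RECORD = r'''{"url":"https://www.glamour.com.br/polo-aleatory-combo-branco-289076/p","schema.org_properties":[{"/image":"[\"https://glamour.vteximg.com.br/arquivos/ids/778630-398-398/Polo-Aleatory-Combo-Branco.jpg?v=636087734090100000\"@pt-br]"},{"/name":"[ polo aleatory combo branco ]"}]}'''

# Assumption: identifier of the November 2017 crawl; verify before use.
CC_INDEX = "https://index.commoncrawl.org/CC-MAIN-2017-47-index"


def image_urls(offer: dict) -> List[str]:
    """Collect the quoted http(s) URLs from every "/image" property value.

    The values look like '["<url>"@pt-br]', so the URL is whatever sits
    between the double quotes.
    """
    urls = []
    for prop in offer.get("schema.org_properties", []):
        value = prop.get("/image")
        if value:
            urls.extend(re.findall(r'"(https?://[^"]+)"', value))
    return urls


def index_query(page_url: str) -> str:
    """Build a URL-index query for one page, asking for JSON output."""
    return CC_INDEX + "?" + urlencode({"url": page_url, "output": "json"})


offer = json.loads(RECORD)
print(offer["url"])              # step 1: the offer URL
print(image_urls(offer))         # the image link(s) to look up later
print(index_query(offer["url"])) # step 2: fetch this to find the capture
```

Fetching the printed index query should give one JSON object per capture of the page, with fields such as the timestamp and the WARC filename. Step 3 then depends entirely on whichever archive you use for the image files.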

You can also just try retrieving the image from the web as it is now, but a lot of the links will very likely not work anymore.
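If you go the direct route, a small helper that tolerates dead links keeps a bulk download from aborting on the first failure. This is a minimal sketch using only the Python standard library; the function name try_fetch and the 10-second default timeout are arbitrary choices for illustration.

```python
import urllib.error
import urllib.request
from typing import Optional


def try_fetch(url: str, timeout: float = 10.0) -> Optional[bytes]:
    """Return the response body, or None if the link is dead or unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, OSError, ValueError):
        # URLError covers HTTP errors and failed connections, OSError covers
        # timeouts, and ValueError covers strings that are not valid URLs.
        return None


# Usage (not executed here because it needs network access):
# data = try_fetch("https://glamour.vteximg.com.br/arquivos/ids/778630-398-398/Polo-Aleatory-Combo-Branco.jpg?v=636087734090100000")
# if data is None:
#     ...  # fall back to an archive lookup for this offer
```

A None result only tells you the image is not reachable right now, and a successful download may differ from what the page showed at crawl time.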

I hope this helps!

Cheers,

Ralph

Егор Еремеев

Jul 11, 2020, 9:52:04 AM

to Web Data Commons

Hello, Ralph,

Thank you for the explanation and the suggested approach. They make things clear to me, and your answer helps me move forward.

Regards,

Egor
