What is VOC?

Vladimir Alexiev

Feb 5, 2017, 5:23:34 AM

to Web Data Commons

The stats spreadsheets
http://webdatacommons.org/structureddata/2016-10/stats/html-md.xlsx
http://webdatacommons.org/structureddata/2016-10/stats/html-embedded-jsonld.xlsx
http://webdatacommons.org/structureddata/2016-10/stats/html-rdfa.xlsx
include the columns Class, Prop and Voc.

I understand the first two, but what is Voc? I know of the @vocab keyword in JSON-LD.
Inspection of some values shows a lot of mistakes:

http://www.w3.org/1999/xhtml/microdata# : "You do not have sufficient privileges to access the page that you requested."
http://purl.org/dc/terms/ : valid
http://schema.org/ : valid
https://schema.org/ : invalid, that's not the canonical URL
http://data-vocabulary.org/ : appears to be obsolete version of
https://schema.org/Product/ : invalid. This is a class URL but shouldn't end in slash
https://schema.org/Offer/ : invalid. This is a class URL but shouldn't end in slash
http://data-vocabulary.org/Organization/ : invalid. This is a class URL but shouldn't end in slash
http://data-vocabulary.org/Rating/ : invalid. This is a class URL but shouldn't end in slash
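
To make the checks above repeatable, here is a minimal Python sketch that flags the two kinds of problems I listed. The function name check_voc and both heuristics are my own assumptions (in particular, treating a capitalized last path segment as a sign of a class URL, and treating http as the canonical scheme for schema.org as stated above); it is not part of any WDC tooling.

```python
from urllib.parse import urlparse


def check_voc(voc: str) -> str:
    """Classify a Voc value using the informal criteria from the list above.

    Hypothetical helper: the heuristics are assumptions, not WDC rules.
    """
    parsed = urlparse(voc)
    # Non-empty path segments, e.g. ["Product"] for https://schema.org/Product/
    segments = [s for s in parsed.path.split("/") if s]
    # Assumption: a capitalized last segment followed by "/" is a class URL
    # that was mistaken for a vocabulary namespace.
    if segments and parsed.path.endswith("/") and segments[-1][0].isupper():
        return "looks like a class URL with a trailing slash"
    # Assumption taken from the list above: http is the canonical scheme.
    if parsed.netloc == "schema.org" and parsed.scheme != "http":
        return "non-canonical scheme"
    return "no issue detected"


for voc in ("http://schema.org/", "https://schema.org/", "https://schema.org/Product/"):
    print(voc, ":", check_voc(voc))
```

Values such as http://purl.org/dc/terms/ pass both checks because the last path segment is lowercase; a real validator would need an actual list of known vocabularies rather than these guesses.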

Primpeli Anna

Feb 6, 2017, 11:11:56 AM

to Web Data Commons

Hello Vladimir,

By VOC we mean the vocabulary of the properties used in the HTML pages of the Common Crawl (CC) corpus, from which we extract structured data. You are right that some of the values are invalid. Given the size of the corpus, it is to be expected that some entities are annotated incorrectly. In our extraction we want to keep track of these errors, since they could be investigated further as a research topic [1]. Another possible cause of an invalid vocabulary is that something went wrong during parsing.
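
For illustration only: a common way to obtain a vocabulary from a property URI is to cut the URI at its last '#' or '/'. Whether the WDC extraction does exactly this is an assumption, not something stated in this thread, and vocabulary_of is a hypothetical helper name. The sketch shows how a faulty annotation could end up as a class-like Voc value.

```python
def vocabulary_of(property_uri: str) -> str:
    """Return the URI up to and including its last '#' or '/'.

    Hypothetical helper; the splitting rule is an assumption.
    """
    cut = max(property_uri.rfind("#"), property_uri.rfind("/"))
    # If neither separator is present, return the input unchanged.
    return property_uri[: cut + 1] if cut >= 0 else property_uri


# A well-formed property URI yields the expected namespace.
print(vocabulary_of("http://schema.org/name"))           # http://schema.org/
# A property URI built on top of a class URL yields a class-like "vocabulary".
print(vocabulary_of("https://schema.org/Product/name"))  # https://schema.org/Product/
```

Under this assumption, a page that wrongly uses a class URL as the base for its properties would produce exactly the kind of value ending in /Product/ or /Offer/ that appears in the spreadsheets.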

I hope this answers your question.

Best Regards,

Anna

[1]: Meusel, R., & Paulheim, H. (2015). Heuristics for fixing common errors in deployed schema.org microdata. In European Semantic Web Conference (pp. 152-168). Springer International Publishing.
