microformats2?

Ryan B

Mar 21, 2017, 10:15:44 AM

to Web Data Commons

hi all! first, thank you so much for web data commons. it's a really powerful resource for developers and researchers working on web structured data. we owe you all and common crawl a debt of gratitude!

i work with microformats2 a lot - http://microformats.org/wiki/microformats2 , notably h-entry and related formats - and i'd love to see an mf2 extract from the common crawl. have you all considered adding that to your current microformats 1 extract? i'd happily help any way i can.

thanks in advance!
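
For readers unfamiliar with mf2, here is a minimal sketch of what an extractor looks for. microformats2 marks root items with class names prefixed `h-` (such as `h-entry`) and properties with `p-`, `u-`, `dt-`, or `e-`. The `Mf2ClassScanner` class and the sample markup are hypothetical, written only for illustration; this is a prefix check using the Python standard library, not a conforming mf2 parser and not how Any23 or Web Data Commons does it.

```python
from html.parser import HTMLParser

# mf2 class-name prefixes: "h-" marks a root item,
# "p-", "u-", "dt-", "e-" mark properties of that item.
PROPERTY_PREFIXES = ("p-", "u-", "dt-", "e-")


class Mf2ClassScanner(HTMLParser):
    """Hypothetical helper: collects mf2-looking class names by prefix only."""

    def __init__(self):
        super().__init__()
        self.roots = []
        self.properties = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name != "class" or not value:
                continue
            for cls in value.split():
                if cls.startswith("h-"):
                    self.roots.append(cls)
                elif cls.startswith(PROPERTY_PREFIXES):
                    self.properties.append(cls)


# Made-up sample post, not taken from any real site.
SAMPLE = """
<article class="h-entry">
  <h1 class="p-name">Hello mf2</h1>
  <a class="u-url" href="https://example.com/hello">permalink</a>
  <time class="dt-published" datetime="2017-03-21">Mar 21</time>
  <div class="e-content"><p>Body text</p></div>
</article>
"""

scanner = Mf2ClassScanner()
scanner.feed(SAMPLE)
print("roots:", scanner.roots)
print("properties:", scanner.properties)
```

A real extractor would also nest properties under their root item and apply the value-parsing rules from the mf2 specification, which this sketch skips.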

Primpeli Anna

Mar 22, 2017, 6:52:01 AM

to Web Data Commons

Hello Ryan,

thank you for your feedback!

Indeed, we could add the mf2 formats in our next extraction. The relevant extractors are already offered by the Apache Any23 library, so it should be trivial to do so.

Best,

Anna

Lewis John Mcgibbney

Mar 28, 2017, 5:09:06 PM

to Web Data Commons

ACK, if you upgrade WDC to Any23 2.0 then this will come out of the box folks.
Thanks

Ben Roberts

Jan 17, 2018, 11:13:48 AM

to Web Data Commons

Any23 seems to break on quite a few sites:

./bin/any23 rover https://ben.thatmustbe.me/note/2017/12/28/1
...

[Fatal Error] :170:3: The element type "input" must be terminated by the matching end-tag "</input>".

./bin/any23 rover https://aaronparecki.com/2018/01/11/5/
...
[Fatal Error] :1:3: The markup in the document preceding the root element must be well-formed.

this is on version 2.1 of any23

Lewis John Mcgibbney

Jan 17, 2018, 11:42:15 AM

to web-data...@googlegroups.com

Hi Ben,

This is partially fixed in the Any23 master branch, which you can find at https://github.com/apache/any23.git

The underlying issue is that the Semargl parser we use to parse and extract RDFa 1.1 is VERY strict. It therefore does not parse messy HTML well at all.
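
As a rough analogy for why a strict parser trips over real-world pages, the sketch below feeds the same fragment to a strict XML parser and to a lenient HTML parser. It uses only the Python standard library; it is not Any23 or Semargl code (both are Java), and `MESSY_HTML`, `strict_parse_error`, and `TagCollector` are names made up for this illustration. The unclosed `<input>` mirrors the kind of markup behind the "must be terminated by the matching end-tag" error reported above.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Valid HTML, but not well-formed XML: <input> is never closed.
MESSY_HTML = '<form><input type="text" name="q"></form>'


def strict_parse_error(markup):
    """Return the XML parser's error message, or None if the markup is well-formed."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as exc:
        return str(exc)
    return None


class TagCollector(HTMLParser):
    """Lenient parser: records every start tag it sees, closed or not."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


error = strict_parse_error(MESSY_HTML)
collector = TagCollector()
collector.feed(MESSY_HTML)
collector.close()

print("strict XML parser error:", error)
print("lenient HTML parser saw:", collector.tags)
```

The strict parser rejects the whole document at the first mismatch, while the lenient one still yields the tags an extractor could work with.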

And for the second URL:

# OUTPUT FORMAT: Turtle (mimeTypes=text/turtle, application/x-turtle; ext=ttl)
# BEGIN: ExtractionContext(urn:x-any23:html-head-meta:root-extraction-result-id:https://aaronparecki.com/2018/01/11/5/)
@prefix sindice: <http://vocab.sindice.net/> .

<https://aaronparecki.com/2018/01/11/5/> <http://vocab.sindice.net/any23#viewport> "width=device-width, initial-scale=1" .
# BEGIN: ExtractionContext(urn:x-any23:html-head-title:root-extraction-result-id:https://aaronparecki.com/2018/01/11/5/)
@prefix dcterms: <http://purl.org/dc/terms/> .

<https://aaronparecki.com/2018/01/11/5/> dcterms:title "Test post for Superfeedr to see if it will find my #indieweb ... • Aaron Parecki" .
# BEGIN: ExtractionContext(urn:x-any23:html-mf-license:root-extraction-result-id:https://aaronparecki.com/2018/01/11/5/)
@prefix xhtml: <http://www.w3.org/1999/xhtml/vocab#> .

<https://aaronparecki.com/2018/01/11/5/> xhtml:license <http://creativecommons.org/licenses/by/3.0/> .
# END: ExtractionContext(urn:x-any23:html-head-meta:root-extraction-result-id:https://aaronparecki.com/2018/01/11/5/)
# END: ExtractionContext(urn:x-any23:html-mf-license:root-extraction-result-id:https://aaronparecki.com/2018/01/11/5/)
# END: ExtractionContext(urn:x-any23:html-head-title:root-extraction-result-id:https://aaronparecki.com/2018/01/11/5/)

There are issues with the extractions and you can see these if you activate the extraction report.

Hope this helps you. We will be pushing a further fix for the RDFa 1.1 issue and then an Any23 release pretty shortly, so I will announce it here when we do.

Thanks

Lewis

Dr. Lewis J. McGibbney Ph.D, B.Sc

Skype: lewis.john.mcgibbney
