microformats2?


Ryan B

Mar 21, 2017, 10:15:44 AM
to Web Data Commons
hi all! first, thank you so much for web data commons. it's a really powerful resource for developers and researchers working on web structured data. we owe you all and common crawl a debt of gratitude!

i work with microformats2 a lot - http://microformats.org/wiki/microformats2 , notably h-entry and related formats - and i'd love to see an mf2 extract from the common crawl. have you all considered adding that to your current microformats 1 extract? i'd happily help any way i can.

thanks in advance!

Primpeli Anna

Mar 22, 2017, 6:52:01 AM
to Web Data Commons
 
Hello Ryan,

thank you for your feedback!
Indeed, we could add the mf2 formats in our next extraction. The relevant extractors are already offered by the Apache Any23 library, so it should be trivial to do so.

Best,
Anna

Lewis John Mcgibbney

Mar 28, 2017, 5:09:06 PM
to Web Data Commons
ACK. If you upgrade WDC to Any23 2.0, this will come out of the box, folks.
Thanks

Ben Roberts

Jan 17, 2018, 11:13:48 AM
to Web Data Commons
Any23 seems to break on quite a few sites

./bin/any23 rover https://ben.thatmustbe.me/note/2017/12/28/1
...

[Fatal Error] :170:3: The element type "input" must be terminated by the matching end-tag "</input>".



 ./bin/any23 rover https://aaronparecki.com/2018/01/11/5/
...
[Fatal Error] :1:3: The markup in the document preceding the root element must be well-formed.

this is on version 2.1 of any23

Lewis John Mcgibbney

Jan 17, 2018, 11:42:15 AM
to web-data...@googlegroups.com
Hi Ben,
This is partially fixed in the Any23 master branch, which you can find at https://github.com/apache/any23.git
The underlying issue is that the Semargl parser we use to parse and extract RDFa 1.1 is VERY strict, so it does not parse messy HTML well at all.
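
To illustrate the general point about strict parsers and messy HTML, here is a minimal sketch using only the Python standard library. It is not Any23 or Semargl code, and the markup in MESSY_HTML is a made-up snippet, not taken from either site. It only shows why an unclosed <input>, which is fine in HTML, trips up a parser that insists on well-formed XML.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical snippet: the unclosed <input> is valid HTML but not well-formed XML.
MESSY_HTML = '<html><body><form><input type="text" name="q"></form></body></html>'


def parses_as_strict_xml(markup):
    """Return True only if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Lenient HTML parser that records every start tag it encounters."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def collect_tags_leniently(markup):
    """Feed the markup to the lenient parser and return the start tags seen."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


if __name__ == "__main__":
    print("well-formed XML:", parses_as_strict_xml(MESSY_HTML))
    print("tags seen by lenient parser:", collect_tags_leniently(MESSY_HTML))
```

The strict path rejects the whole document at the first unclosed element, while the lenient path still sees every tag. That is roughly the trade-off an extractor faces when its underlying parser expects XML-style input.
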
As I said, we have partially fixed this: you can now run extractions using the master branch, and you can also test it on the public service at any23.org. From the first URL you provided, Any23 extracts the following annotated structured data.

# OUTPUT FORMAT: Turtle (mimeTypes=text/turtle, application/x-turtle; ext=ttl)
# BEGIN: ExtractionContext(urn:x-any23:html-mf-xfn:root-extraction-result-id:https://ben.thatmustbe.me/note/2017/12/28/1)
@prefix xfn: <http://microformats.org/wiki/xfn/> .
@prefix rdf: <https://www.w3.org/TR/REC-rdf-syntax#> .
@prefix foaf: <http://xmlns.com/foaf/spec/> .

_:node1c3bnm00nx4596 <http://vocab.sindice.com/xfn#mePage> <http://www.facebook.com/dissolve333> .

<https://ben.thatmustbe.me/note/2017/12/28/1> <http://vocab.sindice.com/xfn#me-hyperlink> <http://www.facebook.com/dissolve333> .

_:node1c3bnm00nx4596 <http://vocab.sindice.com/xfn#mePage> <http://twitter.com/dissolve333> .

<https://ben.thatmustbe.me/note/2017/12/28/1> <http://vocab.sindice.com/xfn#me-hyperlink> <http://twitter.com/dissolve333> .

_:node1c3bnm00nx4596 <http://vocab.sindice.com/xfn#mePage> <https://github.com/dissolve/> .

<https://ben.thatmustbe.me/note/2017/12/28/1> <http://vocab.sindice.com/xfn#me-hyperlink> <https://github.com/dissolve/> .

_:node1c3bnm00nx4596 a <http://xmlns.com/foaf/0.1/Person> ;
    <http://vocab.sindice.com/xfn#mePage> <https://ben.thatmustbe.me/note/2017/12/28/1> .
# BEGIN: ExtractionContext(urn:x-any23:html-rdfa11:root-extraction-result-id:https://ben.thatmustbe.me/note/2017/12/28/1)

<https://ben.thatmustbe.me/note/2017/12/28/1#headBanner> <http://www.w3.org/1999/xhtml/vocab#role> <http://www.w3.org/1999/xhtml/vocab#banner> .

<https://ben.thatmustbe.me/note/2017/12/28/1#content> <http://www.w3.org/1999/xhtml/vocab#role> <http://www.w3.org/1999/xhtml/vocab#main> .

<https://ben.thatmustbe.me/note/2017/12/28/1#secondary> <http://www.w3.org/1999/xhtml/vocab#role> <http://www.w3.org/1999/xhtml/vocab#complementary> .
# END: ExtractionContext(urn:x-any23:html-rdfa11:root-extraction-result-id:https://ben.thatmustbe.me/note/2017/12/28/1)
# END: ExtractionContext(urn:x-any23:html-mf-xfn:root-extraction-result-id:https://ben.thatmustbe.me/note/2017/12/28/1)


and from the second URL:

# OUTPUT FORMAT: Turtle (mimeTypes=text/turtle, application/x-turtle; ext=ttl)
# BEGIN: ExtractionContext(urn:x-any23:html-head-meta:root-extraction-result-id:https://aaronparecki.com/2018/01/11/5/)
@prefix sindice: <http://vocab.sindice.net/> .

<https://aaronparecki.com/2018/01/11/5/> <http://vocab.sindice.net/any23#viewport> "width=device-width, initial-scale=1" .
# BEGIN: ExtractionContext(urn:x-any23:html-head-title:root-extraction-result-id:https://aaronparecki.com/2018/01/11/5/)
@prefix dcterms: <http://purl.org/dc/terms/> .

<https://aaronparecki.com/2018/01/11/5/> dcterms:title "Test post for Superfeedr to see if it will find my #indieweb ... • Aaron Parecki" .
# BEGIN: ExtractionContext(urn:x-any23:html-mf-license:root-extraction-result-id:https://aaronparecki.com/2018/01/11/5/)
@prefix xhtml: <http://www.w3.org/1999/xhtml/vocab#> .

<https://aaronparecki.com/2018/01/11/5/> xhtml:license <http://creativecommons.org/licenses/by/3.0/> .
# END: ExtractionContext(urn:x-any23:html-head-meta:root-extraction-result-id:https://aaronparecki.com/2018/01/11/5/)
# END: ExtractionContext(urn:x-any23:html-mf-license:root-extraction-result-id:https://aaronparecki.com/2018/01/11/5/)
# END: ExtractionContext(urn:x-any23:html-head-title:root-extraction-result-id:https://aaronparecki.com/2018/01/11/5/)
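
A quick way to see which extractors produced output is to pull the extractor names out of the "# BEGIN: ExtractionContext(...)" comment lines. The sketch below is plain text processing in Python, not an Any23 API. The regular expression and the function name list_extractors are my own assumptions, based only on the comment format visible in the output above.

```python
import re

# Assumed comment format, inferred from the Turtle output shown in this thread.
BEGIN_PATTERN = re.compile(
    r"^# BEGIN: ExtractionContext\(urn:x-any23:([^:]+):root-extraction-result-id:(.+)\)$"
)

# The three BEGIN lines from the second URL's output, copied from above.
SAMPLE_OUTPUT = """\
# BEGIN: ExtractionContext(urn:x-any23:html-head-meta:root-extraction-result-id:https://aaronparecki.com/2018/01/11/5/)
# BEGIN: ExtractionContext(urn:x-any23:html-head-title:root-extraction-result-id:https://aaronparecki.com/2018/01/11/5/)
# BEGIN: ExtractionContext(urn:x-any23:html-mf-license:root-extraction-result-id:https://aaronparecki.com/2018/01/11/5/)
"""


def list_extractors(turtle_text):
    """Return the extractor names found in BEGIN comment lines, in order."""
    names = []
    for line in turtle_text.splitlines():
        match = BEGIN_PATTERN.match(line.strip())
        if match:
            names.append(match.group(1))
    return names


if __name__ == "__main__":
    for name in list_extractors(SAMPLE_OUTPUT):
        print(name)
```

Run against the two outputs in this message, this lists only html-mf-xfn, html-rdfa11, html-head-meta, html-head-title and html-mf-license.
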

There are issues with the extractions, and you can see these if you activate the extraction report.
Hope this helps. We will be pushing a further fix for the RDFa 1.1 issue and then an Any23 release pretty shortly, so I will announce it here when we do.
Thanks
Lewis





--
Lewis
Dr. Lewis J. McGibbney Ph.D, B.Sc
Skype: lewis.john.mcgibbney


