extraction broken on page with both RDFa and microdata

Ruben

Aug 2, 2013, 3:47:16 PM

to sindi...@googlegroups.com

Dear Sindice developers,

I recently added RDFa data in addition to the existing HTML5 microdata on my pages.

For instance: http://ruben.verborgh.org/publications/vandersande_ldow_2013/

It is parsed by the pyRdfa parser:

http://www.w3.org/2012/pyRdfa/extract?uri=http%3A%2F%2Fruben.verborgh.org%2Fpublications%2Fvandersande_ldow_2013%2F

However, on Sindice it fails with

Could not extract metadata from url [http://ruben.verborgh.org/publications/verborgh_umap_2013/] Cause:com.sindice.inspector.MetadataError@6039d1ee in 95 ms

It did work before the RDFa was added.

The Google Structured Data Tool is able to parse both markups:

http://www.google.com/webmasters/tools/richsnippets?q=http%3A%2F%2Fruben.verborgh.org%2Fpublications%2Fvandersande_ldow_2013%2F
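
For anyone who wants to reproduce these checks on another page: the checker links above all pass the page address as a percent-encoded query parameter. A minimal Python sketch of building them follows; the helper name `checker_url` is made up for illustration, while the endpoints and parameter names are taken from the links in this message.

```python
from urllib.parse import urlencode


def checker_url(endpoint: str, param: str, page: str) -> str:
    """Return `endpoint` with `page` attached as a percent-encoded query parameter."""
    # urlencode escapes ':' and '/' so the page URL survives inside the query string.
    return endpoint + "?" + urlencode({param: page})


page = "http://ruben.verborgh.org/publications/vandersande_ldow_2013/"
print(checker_url("http://www.w3.org/2012/pyRdfa/extract", "uri", page))
print(checker_url("http://www.google.com/webmasters/tools/richsnippets", "q", page))
```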

It would be great if you could look into this.

Thanks,

Ruben

Giovanni Tummarello

Aug 6, 2013, 6:50:34 AM

to sindice-dev, Szymon Danielczyk

Ruben hi,

Sorry for the delay, it was a bank holiday here. We will replace the any23 library with the latest version, and this should fix the issues.

Will let you know when done.

Cheers

Gio

--
--
You received this message because you are subscribed to the Google
Groups "Sindice Developers" group.
To post to this group, send email to sindi...@googlegroups.com
To unsubscribe from this group, send email to
sindice-dev...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/sindice-dev?hl=en

http://sindice.com http://sig.ma http://www.deri.ie


szydan

Aug 9, 2013, 11:27:11 AM

to sindi...@googlegroups.com

Hi Ruben,

Thanks for reporting this, and sorry for the delay.

I've checked your page

http://ruben.verborgh.org/publications/verborgh_umap_2013/

against any23 v0.6.1, which the inspector currently uses to parse documents, and also against the newer v0.7.0-incubating, which is deployed at http://any23.org/

In both cases it failed to extract any triples from your page, with the same error:

java.lang.IllegalArgumentException: Invalid content ''
    at org.apache.any23.extractor.microdata.ItemPropValue.<init>(ItemPropValue.java:89)
    at org.apache.any23.extractor.microdata.MicrodataParser.getPropertyValue(MicrodataParser.java:341)
    at org.apache.any23.extractor.microdata.MicrodataParser.getItemProps(MicrodataParser.java:394)
    at org.apache.any23.extractor.microdata.MicrodataParser.getItemScope(MicrodataParser.java:471)
    at org.apache.any23.extractor.microdata.MicrodataParser.getMicrodata(MicrodataParser.java:186)
    at org.apache.any23.extractor.microdata.MicrodataParser.getMicrodata(MicrodataParser.java:203)
    at org.apache.any23.extractor.microdata.MicrodataExtractor.run(MicrodataExtractor.java:100)
    at org.apache.any23.extractor.microdata.MicrodataExtractor.run(MicrodataExtractor.java:62)
    at org.apache.any23.extractor.SingleDocumentExtraction.runExtractor(SingleDocumentExtraction.java:477)
    at org.apache.any23.extractor.SingleDocumentExtraction.run(SingleDocumentExtraction.java:260)
    at org.apache.any23.Any23.extract(Any23.java:294)
    at org.apache.any23.Any23.extract(Any23.java:446)
    at org.apache.any23.servlet.WebResponder.runExtraction(WebResponder.java:113)
    at org.apache.any23.servlet.Servlet.doPost(Servlet.java:108)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:637)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at com.googlecode.psiprobe.Tomcat60AgentValve.invoke(Tomcat60AgentValve.java:30)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:602)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
    at java.lang.Thread.run(Thread.java:662)

With the newest version, 0.8.0:

2013-08-09 16:06:52,597 [qtp0-2] INFO c.s.i.s.RDFExtractorServlet - contentType:text/html
2013-08-09 16:06:52,597 [qtp0-2] INFO c.s.i.s.RDFExtractorServlet - contentEncoding:UTF-8
2013-08-09 16:06:52,597 [qtp0-2] INFO c.s.i.RDFExtractor - extracting data from url [http://ruben.verborgh.org/publications/verborgh_umap_2013/]
2013-08-09 16:06:52,625 [qtp0-2] INFO c.s.i.s.RdfextractorHTTPClient - Content-Type [text/html; charset=utf-8] header found in the response
2013-08-09 16:06:52,656 [qtp0-2] INFO o.a.a.e.SingleDocumentExtraction - Processing http://ruben.verborgh.org/publications/verborgh_umap_2013/
2013-08-09 16:06:52,693 [qtp0-2] ERROR c.s.i.RDFExtractor - An error occurred extracting metadata using any23
java.lang.IllegalArgumentException: Invalid content ''
    at org.apache.any23.extractor.microdata.ItemPropValue.<init>(ItemPropValue.java:89) [apache-any23-core-0.8.0.jar:0.8.0]
    at org.apache.any23.extractor.microdata.MicrodataParser.getPropertyValue(MicrodataParser.java:341) [apache-any23-core-0.8.0.jar:0.8.0]
    at org.apache.any23.extractor.microdata.MicrodataParser.getItemProps(MicrodataParser.java:394) [apache-any23-core-0.8.0.jar:0.8.0]
    at org.apache.any23.extractor.microdata.MicrodataParser.getItemScope(MicrodataParser.java:471) [apache-any23-core-0.8.0.jar:0.8.0]
    at org.apache.any23.extractor.microdata.MicrodataParser.getMicrodata(MicrodataParser.java:186) [apache-any23-core-0.8.0.jar:0.8.0]
    at org.apache.any23.extractor.microdata.MicrodataParser.getMicrodata(MicrodataParser.java:203) [apache-any23-core-0.8.0.jar:0.8.0]
    at org.apache.any23.extractor.microdata.MicrodataExtractor.run(MicrodataExtractor.java:89) [apache-any23-core-0.8.0.jar:0.8.0]
    at org.apache.any23.extractor.microdata.MicrodataExtractor.run(MicrodataExtractor.java:58) [apache-any23-core-0.8.0.jar:0.8.0]
    at org.apache.any23.extractor.SingleDocumentExtraction.runExtractor(SingleDocumentExtraction.java:471) [apache-any23-core-0.8.0.jar:0.8.0]
    at org.apache.any23.extractor.SingleDocumentExtraction.run(SingleDocumentExtraction.java:254) [apache-any23-core-0.8.0.jar:0.8.0]
    at org.apache.any23.Any23.extract(Any23.java:295) [apache-any23-core-0.8.0.jar:0.8.0]
    at org.apache.any23.Any23.extract(Any23.java:447) [apache-any23-core-0.8.0.jar:0.8.0]
    at org.apache.any23.Any23.extract(Any23.java:377) [apache-any23-core-0.8.0.jar:0.8.0]
    at com.sindice.inspector.RDFExtractor.extract(RDFExtractor.java:187) [classes/:na]
    at com.sindice.inspector.RDFExtractor.extract(RDFExtractor.java:152) [classes/:na]
    at com.sindice.inspector.servlet.RDFExtractorServlet.doRDFExtract(RDFExtractorServlet.java:910) [classes/:na]
    at com.sindice.inspector.servlet.RDFExtractorServlet.extractTriplesFromDocumentPart(RDFExtractorServlet.java:519) [classes/:na]
    at com.sindice.inspector.servlet.RDFExtractorServlet.doRDFExtract(RDFExtractorServlet.java:374) [classes/:na]
    at com.sindice.inspector.servlet.RDFExtractorServlet.doPost(RDFExtractorServlet.java:166) [classes/:na]
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:713) [servlet-api-3.0.pre4.jar:na]
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:806) [servlet-api-3.0.pre4.jar:na]
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502) [jetty-7.0.0.pre5.jar:7.0.0.pre5]
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1121) [jetty-7.0.0.pre5.jar:7.0.0.pre5]
    at com.sindice.inspector.servlet.filter.SameIpBlocker.doFilter(SameIpBlocker.java:88) [classes/:na]
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1112) [jetty-7.0.0.pre5.jar:7.0.0.pre5]
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363) [jetty-7.0.0.pre5.jar:7.0.0.pre5]
    at org.mortbay.jetty.security.ConstraintsSecurityHandler.handle(ConstraintsSecurityHandler.java:220) [jetty-security-7.0.0.pre5.jar:7.0.0.pre5]
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181) [jetty-7.0.0.pre5.jar:7.0.0.pre5]

There might be a good reason for this if you look at the validation results:

http://inspector.sindice.com/inspect?url=http%3A%2F%2Fruben.verborgh.org%2Fpublications%2Fverborgh_umap_2013%2F&content=&contentType=auto&doTriplesValidation=1&doSyntaxValidation=1#COMMON-MISTAKES-VALIDATION

There are a couple of errors you might want to fix before trying again.

At the moment we are working on updating any23 to the newest version. The new version of the inspector is expected next week.

If you think that your document is 100% valid, you might want to notify the any23 team directly:

https://issues.apache.org/jira/browse/ANY23

All the Best

Szymon

Giovanni Tummarello

Aug 11, 2013, 4:05:12 PM

to sindice-dev

Szymon, it makes sense to post it to any23 as a bug. I mean, I don't
think any23 is supposed to explode when parsing some content :)
It makes sense that we update to the latest version anyway.
cheers
Gio

Ruben

Aug 13, 2013, 2:11:08 PM

to sindi...@googlegroups.com, giovanni....@deri.org

Hi Gio, Hi Szymon,

Thanks very much for your help; seeing the validation errors allows me to actually do something about it.

I have to agree with Gio that just blowing up when something like this happens is not ideal; they were not severe errors after all.
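
To illustrate the more forgiving behaviour I have in mind, here is a rough sketch in Python rather than any23's Java. It is not how any23 works internally; the class name `TolerantItempropCollector` is made up, and it only looks at `itemprop` elements that carry a `content` attribute. The point is just that an empty value is recorded as a warning while the rest of the document is still processed.

```python
from html.parser import HTMLParser


class TolerantItempropCollector(HTMLParser):
    """Hypothetical collector: keeps non-empty itemprop/content pairs,
    and turns empty values into warnings instead of raising."""

    def __init__(self):
        super().__init__()
        self.values = []    # (property name, value) pairs that were usable
        self.warnings = []  # notes about properties that were skipped

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("itemprop")
        # This sketch ignores properties whose value is the element's text.
        if name is None or "content" not in attrs:
            return
        value = (attrs["content"] or "").strip()
        if value:
            self.values.append((name, value))
        else:
            # The strict alternative would be to raise here and lose everything.
            self.warnings.append(f"skipped itemprop '{name}' on <{tag}>: empty content")


collector = TolerantItempropCollector()
collector.feed(
    '<div itemscope>'
    '<meta itemprop="name" content="Some paper">'
    '<meta itemprop="image" content="">'
    '</div>'
)
print(collector.values)
print(collector.warnings)
```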

The main problem is caused by Twitter markup such as:

However, this is the Twitter-recommended way of marking things up (even though property would be better here):

https://dev.twitter.com/docs/cards/types/summary-card

So I have the choice between following the Twitter guidelines for markup or being parsed by Sindice…

tough one ;-)

Thanks,

Ruben

Aug 13, 2013, 2:53:14 PM

to sindi...@googlegroups.com, giovanni....@deri.org

Update: seems Twitter is fine with me using property; although the official documentation doesn't mention it.
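
I don't know how Twitter's crawler does this internally, but a consumer that accepts both spellings is easy to imagine. A small Python sketch under that assumption follows; the class name `TwitterCardReader` and the sample tag values are made up for illustration.

```python
from html.parser import HTMLParser


class TwitterCardReader(HTMLParser):
    """Hypothetical reader: picks up twitter:* meta tags whether the key
    is given in the `name` attribute or in the `property` attribute."""

    def __init__(self):
        super().__init__()
        self.card = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        key = attrs.get("name") or attrs.get("property") or ""
        content = attrs.get("content")
        if key.startswith("twitter:") and content:
            # Keep the first value seen for each key.
            self.card.setdefault(key, content)


reader = TwitterCardReader()
reader.feed(
    '<meta name="twitter:card" content="summary">'
    '<meta property="twitter:title" content="Example title">'
)
print(reader.card)
```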

The other issue turned out to be "missing-opengraph-namespace-rule".

As you can see, the MissingOpenGraphNamespaceRule checks whether I have the xmlns:og value set on my document.

If not, it errors. Even though my document is HTML5 and I use <html prefix="og: http://ogp.me/ns#">.

I have added the xmlns:og thing, but now the parser complains "Attribute xmlns:og not allowed here".

Seems like I have to choose between

"Attribute xmlns:og not allowed here" and "Missing OpenGraph namespace declaration.".
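
What I would expect instead is a check that accepts either declaration style. A rough Python sketch follows; it is not the actual MissingOpenGraphNamespaceRule code, and the names `OgNamespaceCheck` and `declares_og` are made up. It treats the namespace as declared if the html element has either the xmlns:og attribute or an RDFa prefix attribute that maps og: to http://ogp.me/ns#.

```python
from html.parser import HTMLParser

OG_NS = "http://ogp.me/ns#"


class OgNamespaceCheck(HTMLParser):
    """Hypothetical rule: the Open Graph namespace counts as declared via
    xmlns:og (XHTML style) or via prefix="og: ..." (HTML5 + RDFa style)."""

    def __init__(self):
        super().__init__()
        self.declared = False

    def handle_starttag(self, tag, attrs):
        if tag != "html":
            return
        attrs = dict(attrs)
        if attrs.get("xmlns:og") == OG_NS:
            self.declared = True
        # The prefix attribute holds whitespace-separated "name: IRI" pairs.
        tokens = (attrs.get("prefix") or "").split()
        for name, iri in zip(tokens[0::2], tokens[1::2]):
            if name == "og:" and iri == OG_NS:
                self.declared = True


def declares_og(html: str) -> bool:
    check = OgNamespaceCheck()
    check.feed(html)
    return check.declared


print(declares_og('<html prefix="og: http://ogp.me/ns#">'))
print(declares_og('<html xmlns:og="http://ogp.me/ns#">'))
print(declares_og('<html lang="en">'))
```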

Please help me out ;-)

Cheers,

Ruben
