extraction broken on page with both RDFa and microdata

Ruben

Aug 2, 2013, 3:47:16 PM
to sindi...@googlegroups.com
Dear Sindice developers,

I recently added RDFa data in addition to the existing HTML5 microdata on my pages.

It is parsed by the pyRdfa parser:

However, on Sindice it fails with
 Could not extract metadata from url [http://ruben.verborgh.org/publications/verborgh_umap_2013/] Cause:com.sindice.inspector.MetadataError@6039d1ee in 95 ms 
It did work before the RDFa was added.

The Google Structured Data Tool is able to parse both markups:

It would be great if you could look into this.

Thanks,

Ruben

Giovanni Tummarello

Aug 6, 2013, 6:50:34 AM
to sindice-dev, Szymon Danielczyk
Ruben hi,

Sorry for the delay, it was a bank holiday here. We will replace the any23 library with the latest version, and this should fix the issues.
Will let you know when done.
Cheers,
Gio



szydan

Aug 9, 2013, 11:27:11 AM
to sindi...@googlegroups.com
Hi Ruben 

Thanks for reporting this, and sorry for the delay.
I've checked your website

against any23 v0.6.1, which the inspector currently uses to parse documents,
and also against the newer v0.7.0-incubating, which is deployed at http://any23.org/

In both cases it failed to extract any triples from your page, with the same error:

java.lang.IllegalArgumentException: Invalid content ''
at org.apache.any23.extractor.microdata.ItemPropValue.<init>(ItemPropValue.java:89)
at org.apache.any23.extractor.microdata.MicrodataParser.getPropertyValue(MicrodataParser.java:341)
at org.apache.any23.extractor.microdata.MicrodataParser.getItemProps(MicrodataParser.java:394)
at org.apache.any23.extractor.microdata.MicrodataParser.getItemScope(MicrodataParser.java:471)
at org.apache.any23.extractor.microdata.MicrodataParser.getMicrodata(MicrodataParser.java:186)
at org.apache.any23.extractor.microdata.MicrodataParser.getMicrodata(MicrodataParser.java:203)
at org.apache.any23.extractor.microdata.MicrodataExtractor.run(MicrodataExtractor.java:100)
at org.apache.any23.extractor.microdata.MicrodataExtractor.run(MicrodataExtractor.java:62)
at org.apache.any23.extractor.SingleDocumentExtraction.runExtractor(SingleDocumentExtraction.java:477)
at org.apache.any23.extractor.SingleDocumentExtraction.run(SingleDocumentExtraction.java:260)
at org.apache.any23.Any23.extract(Any23.java:294)
at org.apache.any23.Any23.extract(Any23.java:446)
at org.apache.any23.servlet.WebResponder.runExtraction(WebResponder.java:113)
at org.apache.any23.servlet.Servlet.doPost(Servlet.java:108)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:637)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at com.googlecode.psiprobe.Tomcat60AgentValve.invoke(Tomcat60AgentValve.java:30)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:602)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
at java.lang.Thread.run(Thread.java:662)

The newest version, 0.8.0, fails in the same way:

2013-08-09 16:06:52,597 [qtp0-2] INFO  c.s.i.s.RDFExtractorServlet -    contentType:text/html
2013-08-09 16:06:52,597 [qtp0-2] INFO  c.s.i.s.RDFExtractorServlet -    contentEncoding:UTF-8
2013-08-09 16:06:52,597 [qtp0-2] INFO  c.s.i.RDFExtractor - extracting data from url [http://ruben.verborgh.org/publications/verborgh_umap_2013/]
2013-08-09 16:06:52,625 [qtp0-2] INFO  c.s.i.s.RdfextractorHTTPClient - Content-Type [text/html; charset=utf-8] header found in the response
2013-08-09 16:06:52,656 [qtp0-2] INFO  o.a.a.e.SingleDocumentExtraction - Processing http://ruben.verborgh.org/publications/verborgh_umap_2013/
2013-08-09 16:06:52,693 [qtp0-2] ERROR c.s.i.RDFExtractor - An error occurred extracting metadata using any23 
java.lang.IllegalArgumentException: Invalid content ''
        at org.apache.any23.extractor.microdata.ItemPropValue.<init>(ItemPropValue.java:89) [apache-any23-core-0.8.0.jar:0.8.0]
        at org.apache.any23.extractor.microdata.MicrodataParser.getPropertyValue(MicrodataParser.java:341) [apache-any23-core-0.8.0.jar:0.8.0]
        at org.apache.any23.extractor.microdata.MicrodataParser.getItemProps(MicrodataParser.java:394) [apache-any23-core-0.8.0.jar:0.8.0]
        at org.apache.any23.extractor.microdata.MicrodataParser.getItemScope(MicrodataParser.java:471) [apache-any23-core-0.8.0.jar:0.8.0]
        at org.apache.any23.extractor.microdata.MicrodataParser.getMicrodata(MicrodataParser.java:186) [apache-any23-core-0.8.0.jar:0.8.0]
        at org.apache.any23.extractor.microdata.MicrodataParser.getMicrodata(MicrodataParser.java:203) [apache-any23-core-0.8.0.jar:0.8.0]
        at org.apache.any23.extractor.microdata.MicrodataExtractor.run(MicrodataExtractor.java:89) [apache-any23-core-0.8.0.jar:0.8.0]
        at org.apache.any23.extractor.microdata.MicrodataExtractor.run(MicrodataExtractor.java:58) [apache-any23-core-0.8.0.jar:0.8.0]
        at org.apache.any23.extractor.SingleDocumentExtraction.runExtractor(SingleDocumentExtraction.java:471) [apache-any23-core-0.8.0.jar:0.8.0]
        at org.apache.any23.extractor.SingleDocumentExtraction.run(SingleDocumentExtraction.java:254) [apache-any23-core-0.8.0.jar:0.8.0]
        at org.apache.any23.Any23.extract(Any23.java:295) [apache-any23-core-0.8.0.jar:0.8.0]
        at org.apache.any23.Any23.extract(Any23.java:447) [apache-any23-core-0.8.0.jar:0.8.0]
        at org.apache.any23.Any23.extract(Any23.java:377) [apache-any23-core-0.8.0.jar:0.8.0]
        at com.sindice.inspector.RDFExtractor.extract(RDFExtractor.java:187) [classes/:na]
        at com.sindice.inspector.RDFExtractor.extract(RDFExtractor.java:152) [classes/:na]
        at com.sindice.inspector.servlet.RDFExtractorServlet.doRDFExtract(RDFExtractorServlet.java:910) [classes/:na]
        at com.sindice.inspector.servlet.RDFExtractorServlet.extractTriplesFromDocumentPart(RDFExtractorServlet.java:519) [classes/:na]
        at com.sindice.inspector.servlet.RDFExtractorServlet.doRDFExtract(RDFExtractorServlet.java:374) [classes/:na]
        at com.sindice.inspector.servlet.RDFExtractorServlet.doPost(RDFExtractorServlet.java:166) [classes/:na]
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:713) [servlet-api-3.0.pre4.jar:na]
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:806) [servlet-api-3.0.pre4.jar:na]
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502) [jetty-7.0.0.pre5.jar:7.0.0.pre5]
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1121) [jetty-7.0.0.pre5.jar:7.0.0.pre5]
        at com.sindice.inspector.servlet.filter.SameIpBlocker.doFilter(SameIpBlocker.java:88) [classes/:na]
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1112) [jetty-7.0.0.pre5.jar:7.0.0.pre5]
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363) [jetty-7.0.0.pre5.jar:7.0.0.pre5]
        at org.mortbay.jetty.security.ConstraintsSecurityHandler.handle(ConstraintsSecurityHandler.java:220) [jetty-security-7.0.0.pre5.jar:7.0.0.pre5]
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181) [jetty-7.0.0.pre5.jar:7.0.0.pre5]


There might be a good reason for this if you look at the validation results:

There are a couple of errors you might want to fix before trying again.

At the moment we are working on updating any23 to the newest version.
The new version of the inspector is expected next week.

If you think that your document is 100% valid, you might want to
notify the any23 team directly.


All the Best
Szymon  

Giovanni Tummarello

Aug 11, 2013, 4:05:12 PM
to sindice-dev
Szymon, it makes sense to post it to any23 as a bug. I mean, I don't
think any23 is supposed to explode when parsing some content :)
It also makes sense that we update to the latest version anyway.
cheers
Gio

Ruben

Aug 13, 2013, 2:11:08 PM
to sindi...@googlegroups.com, giovanni....@deri.org
Hi Gio, Hi Szymon,

Thanks very much for your help; seeing the validation errors allows me to actually do something about it.
I have to agree with Gio that the parser should not just blow up when something like this happens; they were not severe errors after all.

The main problem is caused by Twitter markup such as:
<meta name="twitter:card" content="summary">
<meta name="twitter:site" content="@nytimes">

However, this is the Twitter-recommended way of marking things up (even though property would be better here):

So I have the choice between following the Twitter guidelines for markup or being parsed by Sindice…
tough one ;-)

Thanks,

Ruben

Ruben

Aug 13, 2013, 2:53:14 PM
to sindi...@googlegroups.com, giovanni....@deri.org
Update: seems Twitter is fine with me using property; although the official documentation doesn't mention it.
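
For anyone hitting the same thing, here is a rough sketch of the change in Python. It only shows the rewrite from name to property on twitter:* meta tags; the function name twitter_name_to_property is made up for illustration, and the regex assumes double-quoted attribute values, so treat it as a starting point rather than a robust HTML rewriter.

```python
import re

# Matches name="twitter:..." inside a <meta ...> tag.
# Assumption: attribute values are double-quoted, as in the snippets above.
_TWITTER_NAME = re.compile(
    r'(<meta\b[^>]*?\s)name=("twitter:[^"]*")',
    re.IGNORECASE,
)


def twitter_name_to_property(html: str) -> str:
    """Rewrite <meta name="twitter:..."> to <meta property="twitter:...">.

    Other meta tags (for example name="description") are left untouched.
    """
    return _TWITTER_NAME.sub(r'\1property=\2', html)


if __name__ == "__main__":
    before = '<meta name="twitter:card" content="summary">'
    print(twitter_name_to_property(before))
```

Whether this also satisfies the microdata extractor in any23 is something to verify against the inspector; the sketch only covers the markup change itself.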

The other issue turned out to be "missing-opengraph-namespace-rule".
As you can see, the MissingOpenGraphNamespaceRule checks whether I have the xmlns:og value set on my document.
If not, it errors. Even though my document is HTML5 and I use <html prefix="og: http://ogp.me/ns#">.
I have added the xmlns:og thing, but now the parser complains "Attribute xmlns:og not allowed here".

Seems like I have to choose between
"Attribute xmlns:og not allowed here" and "Missing OpenGraph namespace declaration.".
Please help me out ;-)
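
To make the request concrete, below is a sketch in Python of a check that would accept either declaration style. This is not any23's MissingOpenGraphNamespaceRule; the names declares_opengraph and OG_NS are hypothetical, and I am assuming the rule only needs to know whether the og prefix is bound to http://ogp.me/ns# on the html element.

```python
from html.parser import HTMLParser

# Assumption: this is the only namespace IRI we want to accept.
OG_NS = "http://ogp.me/ns#"


class _HtmlTagAttrs(HTMLParser):
    """Collects the attributes of the first <html> start tag."""

    def __init__(self):
        super().__init__()
        self.attrs = None

    def handle_starttag(self, tag, attrs):
        if tag == "html" and self.attrs is None:
            self.attrs = dict(attrs)


def declares_opengraph(html: str) -> bool:
    """True if <html> binds og via xmlns:og or via the RDFa prefix attribute."""
    parser = _HtmlTagAttrs()
    parser.feed(html)
    attrs = parser.attrs or {}

    # XHTML-style declaration: <html xmlns:og="http://ogp.me/ns#">
    if attrs.get("xmlns:og") == OG_NS:
        return True

    # HTML5 / RDFa 1.1 style: <html prefix="og: http://ogp.me/ns#">
    # The prefix attribute is a whitespace-separated list of "name: IRI" pairs.
    tokens = (attrs.get("prefix") or "").split()
    pairs = zip(tokens[0::2], tokens[1::2])
    return any(name == "og:" and iri == OG_NS for name, iri in pairs)


if __name__ == "__main__":
    print(declares_opengraph('<html prefix="og: http://ogp.me/ns#"><head></head></html>'))
```

A check along these lines would let an HTML5 document pass without adding the xmlns:og attribute that the validator rejects.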

Cheers,

Ruben
