LinkedDataSail is too persnickety (a bad triple fails the page load)

Dan Brickley

Nov 5, 2012, 3:39:49 PM

to gremli...@googlegroups.com, Joshua Shinavier

Hi folks

I'm revisiting Gremlin, with the idea of trying it out on some SPARQL
stores like OWLIM.

I figured I'd first get back to where I was last - i.e.
http://danbri.org/words/2011/05/10/675 - so tried to re-run a script
that was fine a few months ago, using a freshly-built-today new build:

g = new LinkedDataSailGraph(new MemoryStoreSailGraph())
fry = g.v('http://dbpedia.org/resource/Stephen_Fry')
g.addNamespace('wp', 'http://dbpedia.org/ontology/')
m = [:]
fry.in('wp:starring').out('wp:starring').groupCount(m).loop(3){it.loops <2}
m2 = m.sort{ a,b -> b.value <=> a.value }

It fails to find anything. My ripple.log has the following three
entries, which suggest that the WebClosure page fetch is failing hard
because of a single suspect triple. Since the Web is always going to have
poor data in it, perhaps Gremlin/Ripple could be more tolerant of errors in
this mode, and just issue a warning and try to continue? I don't
believe DBpedia has changed its design/architecture, so I don't
think the MIME type is likely to be the issue, though that is also a
possibility. W3C's RDF validator seemed happy with the URL.

2012-11-05 19:56:41,865 [main] INFO WebClosure - Dereferencing URI
<http://dbpedia.org/resource/Stephen_Fry>
2012-11-05 19:56:50,797 [main] ERROR RippleException -
'1981-01-01T00:00:00+02:00' is not a valid value for datatype
http://www.w3.org/2001/XMLSchema#gYear [line 251, column 153]
2012-11-05 19:56:50,798 [main] INFO WebClosure - Failed to dereference
URI <http://dbpedia.org/resource/Stephen_Fry>: ParseError (perhaps
application/rdf+xml is not the correct media type for this data)

Thanks for any suggestions,

Dan

Joshua Shinavier

Nov 5, 2012, 8:48:34 PM

to gremli...@googlegroups.com

Hi Dan,

This is indeed annoying. The good news is that it's a known issue:

https://github.com/tinkerpop/blueprints/issues/244

The bad news is that it's not going to be fixed for a little while (i.e. not in the pre-release snapshots).

The good news is that I have just committed a change to LinkedDataSail which resolves the issue as far as we're concerned. It allows you to specify the desired behavior of the parser w.r.t. invalid data-typed literals, and makes a behavior of "ignore" the default. No warn-and-continue functionality currently, although Sesame does offer some API bits which might be useful there. See the blurb on datatypeHandlingPolicy near the bottom of this page:

https://github.com/joshsh/ripple/wiki/Ripple-configuration-properties
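
To make the idea of a datatype handling policy concrete, here is a minimal Python sketch. It is not the Ripple or Sesame API: the function `load_literals`, the policy names "fail" and "warn", and the rough gYear pattern are all assumptions made for illustration. Only "ignore" is a behavior named in this thread, and it is assumed here to mean that the literal is accepted without checking.

```python
import re
import warnings

XSD_GYEAR = "http://www.w3.org/2001/XMLSchema#gYear"

# Rough lexical check for xsd:gYear (assumption: good enough for this sketch).
GYEAR = re.compile(r"-?\d{4,}(Z|[+-]\d{2}:\d{2})?")


def load_literals(literals, policy="ignore"):
    """Hypothetical loader: keep (lexical, datatype) pairs according to policy.

    "ignore": accept every literal without checking its datatype.
    "warn":   keep an invalid literal, but emit a warning.
    "fail":   raise on the first invalid literal, aborting the whole load.
    """
    kept = []
    for lexical, datatype in literals:
        if policy != "ignore" and datatype == XSD_GYEAR:
            if GYEAR.fullmatch(lexical) is None:
                message = f"{lexical!r} is not a valid value for {datatype}"
                if policy == "fail":
                    raise ValueError(message)
                warnings.warn(message)
        kept.append((lexical, datatype))
    return kept


data = [
    ("1957", XSD_GYEAR),
    ("1981-01-01T00:00:00+02:00", XSD_GYEAR),
]
print(len(load_literals(data, policy="ignore")))
```

Under "fail", one malformed literal discards the whole document, which is the behavior reported at the top of this thread; "ignore" lets the rest of the page through.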

This will be available in the next TinkerPop release after the next Ripple release.

Thanks for the nudge.

Josh

Dan Brickley

Nov 6, 2012, 4:26:30 AM

to Gremlin-users

Thanks for the quick update! "Ignore" mode sounds fine. I look forward
to the next release. In the meantime I'll bug DBpedia, assuming
'1981-01-01T00:00:00+02:00' is indeed wrong; I haven't yet found a datetime
validator that can tell me exactly what's up with it.

cheers,

Dan

p.s. this Sail would be really cool for Microdata and RDFa Lite (I
work on schema.org...); is that on the agenda for someday?

Joshua Shinavier

Nov 6, 2012, 6:01:07 PM

to gremli...@googlegroups.com

On Tue, Nov 6, 2012 at 4:26 AM, Dan Brickley <dan...@danbri.org> wrote:

[...]

> Thanks for the quick update! "Ignore" mode sounds fine. I look forward
> to the next release. In the meantime I'll bug dbpedia, assuming
> '1981-01-01T00:00:00+02:00' is indeed wrong; I didn't find a datetime
> validator yet that can tell me exactly what's up with it.

As an xsd:dateTime value, it's just fine. The problem is that DBpedia is using xsd:gYear instead. Evidently something like this is intended:

dbr:Stephen_Fry dbpediaowl:activeYearsStartYear "1981"^^xsd:gYear .

dbr:Stephen_Fry dbpediaowl:birthYear "1957"^^xsd:gYear .

But this is what is actually being published:

dbr:Stephen_Fry dbpediaowl:activeYearsStartYear "1981-01-01T00:00:00+02:00"^^xsd:gYear .

dbr:Stephen_Fry dbpediaowl:birthYear "1957-01-01T00:00:00+02:00"^^xsd:gYear .
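
The mismatch can be checked mechanically. The following Python sketch tests a lexical value against simplified patterns for xsd:gYear and xsd:dateTime. The patterns are adapted from the XML Schema datatypes specification but are an assumption of this sketch: they check only the lexical form, so an impossible date such as February 30 would still pass. The function names are hypothetical.

```python
import re

# Optional timezone suffix: "Z" or an offset between -14:00 and +14:00.
_TZ = r"(?:Z|[+-](?:(?:0[0-9]|1[0-3]):[0-5][0-9]|14:00))?"
# Year with at least four digits, optionally negative.
_YEAR = r"-?(?:[1-9][0-9]{3,}|0[0-9]{3})"

GYEAR = re.compile(_YEAR + _TZ)
DATETIME = re.compile(
    _YEAR
    + r"-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12][0-9]|3[01])"
    + r"T(?:[01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9](?:\.[0-9]+)?"
    + _TZ
)


def is_gyear(lexical):
    """True if the whole string has the lexical form of an xsd:gYear."""
    return GYEAR.fullmatch(lexical) is not None


def is_datetime(lexical):
    """True if the whole string has the lexical form of an xsd:dateTime."""
    return DATETIME.fullmatch(lexical) is not None


published = "1981-01-01T00:00:00+02:00"
print(is_gyear(published), is_datetime(published))
print(is_gyear("1981"), is_datetime("1981"))
```

The published value passes the dateTime check and fails the gYear check, while "1981" does the opposite, which matches the diagnosis above.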

> cheers,
>
> Dan
>
> p.s. this Sail would be really cool for Microdata and RDFa Lite (I
> work on schema.org...);

I'm aware :-)

> is that on the agenda for someday?

As soon as I know of a Sesame-compatible parser for those formats, I can add support for them to LinkedDataSail. It seems to me I looked into RDFa support at one point, but came to a dead end. In any case, RDFa Lite and Microdata would have to be handled a little differently from the other formats (for which a format-specific parser is chosen based on the MIME type of the retrieved document), in that rich snippets are embedded in web pages, so you would need to parse the web page first in order to extract the snippets. But... where there is a parser, there is a way.
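
The difference between the two cases can be sketched as a small dispatch table in Python. This is not LinkedDataSail's actual code: the function `choose_handler`, the parser labels, and the set of media types listed are assumptions chosen for illustration.

```python
# Hypothetical registry: media type -> label of a format-specific RDF parser.
RDF_PARSERS = {
    "application/rdf+xml": "rdfxml",
    "text/turtle": "turtle",
    "text/n3": "n3",
}

# Media types of web pages that may carry embedded RDFa Lite or Microdata.
EMBEDDING_TYPES = {"text/html", "application/xhtml+xml"}


def choose_handler(content_type):
    """Pick a handling strategy from the Content-Type of a retrieved document."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime in RDF_PARSERS:
        # The document itself is RDF: hand it straight to the matching parser.
        return ("parse", RDF_PARSERS[mime])
    if mime in EMBEDDING_TYPES:
        # The document is a web page: it must be parsed as HTML first,
        # and the embedded snippets extracted before any RDF is produced.
        return ("extract-then-parse", "html")
    return ("unsupported", None)


print(choose_handler("application/rdf+xml; charset=utf-8"))
print(choose_handler("text/html"))
```

The extra extraction step is what makes embedded formats different: the media type identifies the host page, not the data inside it.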

Best,

Josh
