rdflib N3/Turtle parser problems


Osma Suominen

Nov 15, 2012, 4:51:59 AM
to rdfli...@googlegroups.com
Hi all,

I'm trying to use rdflib to parse some Turtle files but I'm getting an
exception like this:

rdflib.plugins.parsers.notation3.BadSyntax: at line 76894 of <>:
Bad syntax (expected '.' or '}' or ']' at end of statement) at ^ in:

This is happening with Turtle versions of the NY Times Locations dataset,
which I downloaded from http://data.nytimes.com. I think there's some
literal (or URI) in the original data that triggers an exception in the
rdflib N3/Turtle parser. The original published data file is in RDF/XML
which can be parsed just fine by rdflib, but when I convert the data into
Turtle using any of three different tools (rdflib, Jena or rapper) the
resulting Turtle files cannot be parsed by rdflib.

I've put up the original data as well as the different Turtle versions
here: http://www.seco.tkk.fi/u/oisuomin/rdflib-syntax/

The full tracebacks I get parsing the different Turtle versions are at the
end of this message, as well as in the rdflib-script.txt file in the above
directory.

Unfortunately the data is pretty big (170k triples, about 10MB as Turtle)
and the exceptions didn't help me locate the problematic part of the data.
The data file contains Unicode literals in various non-Western scripts
which may or may not be related to the problem.

Any ideas how to fix this?

Best regards,
Osma Suominen



$ python
Python 2.7.3 (default, Aug 1 2012, 05:14:39)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from rdflib import *
>>> g = Graph()
>>> g.parse('locations-rdflib.ttl', format='n3')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/graph.py", line 918, in parse
    parser.parse(source, self, **args)
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 2393, in parse
    TurtleParser.parse(self,source,conj_graph,encoding)
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 2373, in parse
    p.loadStream(source.getByteStream())
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 937, in loadStream
    return self.loadBuf(stream.read()) # Not ideal
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 943, in loadBuf
    self.feed(buf)
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 969, in feed
    i = self.directiveOrStatement(s, j)
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 987, in directiveOrStatement
    return self.checkDot(argstr, j)
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 1558, in checkDot
    argstr, j, "expected '.' or '}' or ']' at end of statement")
rdflib.plugins.parsers.notation3.BadSyntax: at line 76894 of <>:
Bad syntax (expected '.' or '}' or ']' at end of statement) at ^ in:
"...lat>,
<http://hu.wikipedia.org/wiki/Eilat>,
^<http://id.wikipedia.org/wiki/Eilat>,
<http://it.wik..."
>>> g = Graph()
>>> g.parse('locations-jena.ttl', format='n3')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/graph.py", line 918, in parse
    parser.parse(source, self, **args)
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 2393, in parse
    TurtleParser.parse(self,source,conj_graph,encoding)
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 2373, in parse
    p.loadStream(source.getByteStream())
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 937, in loadStream
    return self.loadBuf(stream.read()) # Not ideal
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 943, in loadBuf
    self.feed(buf)
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 969, in feed
    i = self.directiveOrStatement(s, j)
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 987, in directiveOrStatement
    return self.checkDot(argstr, j)
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 1558, in checkDot
    argstr, j, "expected '.' or '}' or ']' at end of statement")
rdflib.plugins.parsers.notation3.BadSyntax: at line 3211 of <>:
Bad syntax (expected '.' or '}' or ']' at end of statement) at ^ in:
"...622290083051> ;
cc:license <http://creativecommons.org^/licenses/by/3.0/us/> ;
nyt:mapping_strategy
..."
>>> g = Graph()
>>> g.parse('locations-rapper.ttl', format='n3')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/graph.py", line 918, in parse
    parser.parse(source, self, **args)
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 2393, in parse
    TurtleParser.parse(self,source,conj_graph,encoding)
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 2373, in parse
    p.loadStream(source.getByteStream())
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 937, in loadStream
    return self.loadBuf(stream.read()) # Not ideal
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 943, in loadBuf
    self.feed(buf)
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 969, in feed
    i = self.directiveOrStatement(s, j)
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 987, in directiveOrStatement
    return self.checkDot(argstr, j)
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 1558, in checkDot
    argstr, j, "expected '.' or '}' or ']' at end of statement")
rdflib.plugins.parsers.notation3.BadSyntax: at line 52803 of <>:
Bad syntax (expected '.' or '}' or ']' at end of statement) at ^ in:
"...ms.wikipedia.org/wiki/Berlin>, <http://pt.wikipedia.org/wiki^/Berlim>,
<http://qu.wikipedia.org/wiki/Berlin>, <http://ro...."
>>>


--
Osma Suominen | Osma.S...@aalto.fi | +358 40 5255 882
Aalto University, Department of Media Technology, Semantic Computing Research Group
Room 2541, Otaniementie 17, Espoo, Finland; P.O. Box 15500, FI-00076 Aalto, Finland

Gunnar Aastrand Grimnes

Nov 15, 2012, 5:20:07 AM
to rdfli...@googlegroups.com
The problem is the triple:

<http://sws.geonames.org/2969679/> geonames:alternateName
"Berceau-de-la-Liberté"@fr_1793 .

The language tag here is illegal: language tags may only use '-' as a
separator, so fr-1793 would be OK but fr_1793 is not.

see http://www.w3.org/TR/turtle/#grammar-production-LANGTAG
and http://tools.ietf.org/html/bcp47#section-2.2.9
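
Since the parser exception does not point at the offending literal, a small scan for malformed language tags can help locate it. The following is only a rough sketch, assuming Python 3 and a line-based heuristic (a closing double quote followed by '@'); the function name find_bad_langtags and the sample text are made up for illustration, and the heuristic can be fooled by escaped quotes inside string literals.

```python
import re

# Turtle LANGTAG production without the leading '@':
#   [a-zA-Z]+ ('-' [a-zA-Z0-9]+)*
LANGTAG = re.compile(r"[a-zA-Z]+(?:-[a-zA-Z0-9]+)*")

# Heuristic candidate: a closing quote, '@', then everything up to
# whitespace or a Turtle delimiter.
CANDIDATE = re.compile(r'"@([^\s;,.\]\)]+)')


def find_bad_langtags(text):
    """Return (line_number, tag) pairs for tags that do not match LANGTAG."""
    bad = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in CANDIDATE.finditer(line):
            tag = match.group(1)
            if not LANGTAG.fullmatch(tag):
                bad.append((lineno, tag))
    return bad


# Illustrative sample, not taken from the actual data files.
sample = '''<http://sws.geonames.org/2969679/> geonames:alternateName
    "Berceau-de-la-Liberté"@fr_1793 .
<urn:a> rdfs:label "cake"@en .
'''

print(find_bad_langtags(sample))  # [(2, 'fr_1793')]
```

On a real file, the text would be read with an explicit UTF-8 encoding before being passed to the function.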

This is an error in the original data; clearly the RDF/XML parser is
less strict. I don't really want to fix this in RDFLib :)

You could pipe the original input through sed or something similar and
replace xml:lang="fr_1793" with xml:lang="fr-1793".
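
For anyone who prefers Python to sed, the same substitution might look roughly like this. It is a sketch under assumptions: Python 3, a regex that only touches xml:lang attribute values, and a made-up sample element (the exact element name in the NYT file is not shown in this thread). The helper name fix_xml_lang is hypothetical.

```python
import re

# Matches xml:lang="..." or xml:lang='...'; group 2 is the quote character,
# group 3 is the attribute value.
XML_LANG = re.compile(r"""(xml:lang\s*=\s*)(["'])([^"']*)\2""")


def fix_xml_lang(rdfxml):
    """Replace '_' with '-' inside xml:lang attribute values only."""
    def repl(match):
        prefix, quote, value = match.group(1), match.group(2), match.group(3)
        return prefix + quote + value.replace("_", "-") + quote
    return XML_LANG.sub(repl, rdfxml)


# Hypothetical input line; the element name is an assumption.
line = ('<geonames:alternateName xml:lang="fr_1793">'
        'Berceau-de-la-Liberté</geonames:alternateName>')
print(fix_xml_lang(line))
```

Restricting the replacement to the attribute value leaves underscores elsewhere in the document (URIs, literal text) untouched, which a blanket search-and-replace would not.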

Cheers,
- Gunnar



--
http://gromgull.net

Osma Suominen

Nov 15, 2012, 5:37:04 AM
to rdfli...@googlegroups.com
Hi Gunnar,

many thanks for the very quick reply and for spotting the problem! I can
fix this with sed or something, just as you suggested.

-Osma


Ivan Herman

Nov 15, 2012, 6:39:22 AM
to rdfli...@googlegroups.com, rdfli...@googlegroups.com
We should also notify the NYT about this error.

Thx

Ivan

---
Ivan Herman
Tel:+31 641044153
http://www.ivan-herman.net

(Written on mobile, sorry for brevity and misspellings...)

Gunnar Aastrand Grimnes

Nov 15, 2012, 7:17:26 AM
to rdfli...@googlegroups.com
There is perhaps also an input validation issue here. We recently
changed the Graph interface to make sure the terms of added triples
are of type rdflib.term.Node.

But there is no language tag validation, so if you are feeling evil you can do:

In [70]: import rdflib
In [71]: g=rdflib.Graph()

In [73]: g.add((rdflib.URIRef("urn:a"), rdflib.RDFS.label,
rdflib.Literal('cake', lang='en ; rdfs:comment "hello!"' )))

In [74]: list(g)
Out[74]:
[(rdflib.term.URIRef(u'urn:a'),
rdflib.term.URIRef(u'http://www.w3.org/2000/01/rdf-schema#label'),
rdflib.term.Literal(u'cake', lang='en ; rdfs:comment "hello!"'))]

# one triple

In [76]: print g.serialize(format='n3')
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<urn:a> rdfs:label "cake"@en ; rdfs:comment "hello!" .

And if you roundtrip this you get TWO triples :)

I can't think of an immediate EVIL application of this - but it
doesn't look very pretty :)
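
One way such a check could look, purely as a sketch: validate the tag against the Turtle LANGTAG production before a literal is built or serialized. This is not what rdflib does; checked_lang and turtle_literal are hypothetical stand-ins written for Python 3, and the serializer below skips string escaping entirely.

```python
import re

# Turtle LANGTAG production without the leading '@'.
LANGTAG = re.compile(r"[a-zA-Z]+(?:-[a-zA-Z0-9]+)*")


def checked_lang(lang):
    """Return lang unchanged if it is None or well-formed, else raise."""
    if lang is not None and not LANGTAG.fullmatch(lang):
        raise ValueError("malformed language tag: %r" % (lang,))
    return lang


def turtle_literal(value, lang=None):
    """Naive Turtle rendering of a plain literal (no escaping; illustration only)."""
    lang = checked_lang(lang)
    return '"%s"' % value + ("@" + lang if lang else "")


print(turtle_literal("cake", lang="en"))

try:
    turtle_literal("cake", lang='en ; rdfs:comment "hello!"')
except ValueError as exc:
    print("rejected:", exc)
```

A wrapper along these lines could call checked_lang(lang) before handing the value to rdflib.Literal(value, lang=lang), so the injected "; rdfs:comment" text never reaches the serializer.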

Also, the RDF Concepts doc says that language tags MUST be
normalized to lower case:

http://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal

This caused some of my SPARQL tests to fail as well. We could validate
and normalize language tags at literal construction time. Some people
may rely on the casing of the language tag being preserved when
roundtripping, though (just as people expect the lexical representation
of their datatypes to be preserved).
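
To illustrate the trade-off, here is one possible compromise, sketched in Python 3: keep the tag exactly as written for serialization, but compare and hash on the lower-cased form. The LangTag class is hypothetical and not part of rdflib.

```python
class LangTag(object):
    """Hypothetical wrapper: preserves the original spelling of a language
    tag but treats tags that differ only in case as equal."""

    def __init__(self, tag):
        self.original = tag            # what roundtripping would write back out
        self.normalized = tag.lower()  # what comparisons use

    def __eq__(self, other):
        if isinstance(other, LangTag):
            return self.normalized == other.normalized
        return NotImplemented

    def __hash__(self):
        # Must agree with __eq__, so hash the normalized form.
        return hash(self.normalized)

    def __str__(self):
        return self.original


a, b = LangTag("en-GB"), LangTag("en-gb")
print(a == b)   # True
print(str(a))   # en-GB
```

With this shape, matching in queries would be case-insensitive while serialized output keeps whatever casing the user supplied.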

I'll make a ticket :)

- Gunnar