rdflib N3/Turtle parser problems
From: Osma Suominen <osma.suomi...@aalto.fi>
Date: Thu, 15 Nov 2012 11:51:59 +0200 (EET)
Local: Thurs, Nov 15 2012 4:51 am
Subject: rdflib N3/Turtle parser problems
Hi all,

I'm trying to use rdflib to parse some Turtle files but I'm getting an
exception like this:

rdflib.plugins.parsers.notation3.BadSyntax: at line 76894 of <>:
Bad syntax (expected '.' or '}' or ']' at end of statement) at ^ in:

This is happening with Turtle versions of the NY Times Locations dataset,
which I downloaded from http://data.nytimes.com. I think there's some
literal (or URI) in the original data that triggers an exception in the
rdflib N3/Turtle parser. The original published data file is in RDF/XML
which can be parsed just fine by rdflib, but when I convert the data into
Turtle using any of three different tools (rdflib, Jena or rapper) the
resulting Turtle files cannot be parsed by rdflib.

I've put up the original data as well as the different Turtle versions
here: http://www.seco.tkk.fi/u/oisuomin/rdflib-syntax/

The full tracebacks I get parsing the different Turtle versions are at the
end of this message, as well as in the rdflib-script.txt file in the above
directory.

Unfortunately the data is pretty big (170k triples, about 10MB as Turtle)
and the exceptions didn't help me locate the problematic part of the data.
The data file contains Unicode literals in various non-Western scripts
which may or may not be related to the problem.

Any ideas how to fix this?

Best regards,
Osma Suominen

$ python
Python 2.7.3 (default, Aug  1 2012, 05:14:39)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> from rdflib import *
>>> g = Graph()
>>> g.parse('locations-rdflib.ttl', format='n3')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/graph.py", line 918, in parse
    parser.parse(source, self, **args)
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 2393, in parse
    TurtleParser.parse(self,source,conj_graph,encoding)
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 2373, in parse
    p.loadStream(source.getByteStream())
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 937, in loadStream
    return self.loadBuf(stream.read())    # Not ideal
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 943, in loadBuf
    self.feed(buf)
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 969, in feed
    i = self.directiveOrStatement(s, j)
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 987, in directiveOrStatement
    return self.checkDot(argstr, j)
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 1558, in checkDot
    argstr, j, "expected '.' or '}' or ']' at end of statement")
rdflib.plugins.parsers.notation3.BadSyntax: at line 76894 of <>:
Bad syntax (expected '.' or '}' or ']' at end of statement) at ^ in:
"...lat>,
         <http://hu.wikipedia.org/wiki/Eilat>,
         ^<http://id.wikipedia.org/wiki/Eilat>,
         <http://it.wik..."
>>> g = Graph()
>>> g.parse('locations-jena.ttl', format='n3')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/graph.py", line 918, in parse
    parser.parse(source, self, **args)
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 2393, in parse
    TurtleParser.parse(self,source,conj_graph,encoding)
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 2373, in parse
    p.loadStream(source.getByteStream())
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 937, in loadStream
    return self.loadBuf(stream.read())    # Not ideal
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 943, in loadBuf
    self.feed(buf)
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 969, in feed
    i = self.directiveOrStatement(s, j)
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 987, in directiveOrStatement
    return self.checkDot(argstr, j)
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 1558, in checkDot
    argstr, j, "expected '.' or '}' or ']' at end of statement")
rdflib.plugins.parsers.notation3.BadSyntax: at line 3211 of <>:
Bad syntax (expected '.' or '}' or ']' at end of statement) at ^ in:
"...622290083051> ;
       cc:license <http://creativecommons.org^/licenses/by/3.0/us/> ;
       nyt:mapping_strategy
          ..."
>>> g = Graph()
>>> g.parse('locations-rapper.ttl', format='n3')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/graph.py", line 918, in parse
    parser.parse(source, self, **args)
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 2393, in parse
    TurtleParser.parse(self,source,conj_graph,encoding)
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 2373, in parse
    p.loadStream(source.getByteStream())
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 937, in loadStream
    return self.loadBuf(stream.read())    # Not ideal
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 943, in loadBuf
    self.feed(buf)
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 969, in feed
    i = self.directiveOrStatement(s, j)
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 987, in directiveOrStatement
    return self.checkDot(argstr, j)
  File "/usr/local/lib/python2.7/dist-packages/rdflib-3.2.3-py2.7.egg/rdflib/plugins/parsers/notation3.py", line 1558, in checkDot
    argstr, j, "expected '.' or '}' or ']' at end of statement")
rdflib.plugins.parsers.notation3.BadSyntax: at line 52803 of <>:
Bad syntax (expected '.' or '}' or ']' at end of statement) at ^ in:
"...ms.wikipedia.org/wiki/Berlin>, <http://pt.wikipedia.org/wiki^/Berlim>,
<http://qu.wikipedia.org/wiki/Berlin>, <http://ro...."


--
Osma Suominen | Osma.Suomi...@aalto.fi | +358 40 5255 882
Aalto University, Department of Media Technology, Semantic Computing Research Group
Room 2541, Otaniementie 17, Espoo, Finland; P.O. Box 15500, FI-00076 Aalto, Finland

 
From: Gunnar Aastrand Grimnes <gromg...@gmail.com>
Date: Thu, 15 Nov 2012 11:20:07 +0100
Local: Thurs, Nov 15 2012 5:20 am
Subject: Re: rdflib N3/Turtle parser problems
The problem is the triple:

<http://sws.geonames.org/2969679/> geonames:alternateName
"Berceau-de-la-Liberté"@fr_1793 .

The language tag here is illegal: the only separator allowed in a language
tag is a hyphen, so fr-1793 would be OK but fr_1793 is not.

see http://www.w3.org/TR/turtle/#grammar-production-LANGTAG
and http://tools.ietf.org/html/bcp47#section-2.2.9
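
As a rough illustration of the rule above, the check can be sketched in a few lines of Python. The pattern is transcribed from the Turtle LANGTAG production (without the leading '@'); the names LANGTAG_RE and is_valid_langtag are made up for this example and are not part of rdflib.

```python
import re

# Turtle LANGTAG production, minus the leading '@':
#   [a-zA-Z]+ ('-' [a-zA-Z0-9]+)*
# \Z anchors the match at the very end of the string.
LANGTAG_RE = re.compile(r"[a-zA-Z]+(?:-[a-zA-Z0-9]+)*\Z")


def is_valid_langtag(tag):
    """Return True if tag matches the Turtle LANGTAG grammar (hypothetical helper)."""
    return LANGTAG_RE.match(tag) is not None
```

With this sketch, "fr-1793" passes and "fr_1793" is rejected because of the underscore.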

This is an error in the original data - clearly the rdf/xml parser is
less strict. I don't really want to fix this in RDFLib :)

You can pipe the original input through sed or something and replace
xml:lang="fr_1793" with xml:lang="fr-1793".
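
If sed is not at hand, the same replacement can be sketched in Python. This is a regex-based text rewrite rather than a real XML transformation, and it assumes the attribute is written literally as xml:lang with a quoted value. The function name fix_xml_lang_underscores is hypothetical. It replaces every underscore inside an xml:lang value with a hyphen, so it would cover fr_1793 as well as any similar tags elsewhere in the file.

```python
import re

# Matches xml:lang="..." or xml:lang='...'; group 3 is the tag value.
XML_LANG_RE = re.compile(r'''(xml:lang\s*=\s*)(["'])([^"']*)\2''')


def fix_xml_lang_underscores(text):
    """Replace '_' with '-' inside xml:lang attribute values only."""
    def repl(match):
        prefix, quote, tag = match.group(1), match.group(2), match.group(3)
        return prefix + quote + tag.replace("_", "-") + quote
    return XML_LANG_RE.sub(repl, text)
```

Underscores outside xml:lang values (for example in literal text or URIs) are left alone.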

Cheers,
- Gunnar

--
http://gromgull.net

 
From: Osma Suominen <osma.suomi...@aalto.fi>
Date: Thu, 15 Nov 2012 12:37:04 +0200 (EET)
Local: Thurs, Nov 15 2012 5:37 am
Subject: Re: rdflib N3/Turtle parser problems

Hi Gunnar,

many thanks for the very quick reply and for spotting the problem! I can
fix this with sed or something, just as you suggested.

-Osma

--
Osma Suominen | Osma.Suomi...@aalto.fi | +358 40 5255 882
Aalto University, Department of Media Technology, Semantic Computing Research Group
Room 2541, Otaniementie 17, Espoo, Finland; P.O. Box 15500, FI-00076 Aalto, Finland

 
From: Ivan herman <ivan.her...@gmail.com>
Date: Thu, 15 Nov 2012 06:39:22 -0500
Local: Thurs, Nov 15 2012 6:39 am
Subject: Re: rdflib N3/Turtle parser problems
We should also notify the NYT about this error.

Thx

Ivan

---
Ivan Herman
Tel:+31 641044153
http://www.ivan-herman.net

(Written on mobile, sorry for brevity and misspellings...)

From: Gunnar Aastrand Grimnes <gromg...@gmail.com>
Date: Thu, 15 Nov 2012 13:17:26 +0100
Local: Thurs, Nov 15 2012 7:17 am
Subject: Re: rdflib N3/Turtle parser problems
There is perhaps also an input validation issue here. We recently
changed the Graph interface to make sure the terms of added triples
are of type rdflib.term.Node.

But we do not validate language tags, so if you are feeling evil you can do:

In [70]: import rdflib
In [71]: g=rdflib.Graph()

In [73]: g.add((rdflib.URIRef("urn:a"), rdflib.RDFS.label,
rdflib.Literal('cake', lang='en ; rdfs:comment "hello!"' )))

In [74]: list(g)
Out[74]:
[(rdflib.term.URIRef(u'urn:a'),
  rdflib.term.URIRef(u'http://www.w3.org/2000/01/rdf-schema#label'),
  rdflib.term.Literal(u'cake', lang='en ; rdfs:comment "hello!"'))]

# one triple

In [76]: print g.serialize(format='n3')
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<urn:a> rdfs:label "cake"@en ; rdfs:comment "hello!" .

And if you roundtrip this you get TWO triples :)

I can't think of an immediate EVIL application of this - but it
doesn't look very pretty :)

Also, the RDF Concepts document says that language tags MUST be
normalized to lower case:

http://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal

This caused some of my SPARQL tests to fail as well. We could validate
and normalize language tags at literal construction time. Some people
may rely on the casing of the language tag being preserved when
roundtripping, though (just as people expect the lexical representation
of their datatypes to be preserved).
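
A minimal sketch of what such validation and normalization might look like, written as a standalone function rather than as rdflib code. The name normalize_langtag and the lowercase flag are assumptions for illustration. The flag is there only to show one way of letting callers keep the original casing when roundtripping.

```python
import re

# Turtle LANGTAG production, minus the leading '@'.
_LANGTAG_RE = re.compile(r"[a-zA-Z]+(?:-[a-zA-Z0-9]+)*\Z")


def normalize_langtag(tag, lowercase=True):
    """Validate a language tag and optionally lower-case it (hypothetical helper).

    Raises ValueError for anything that is not a well-formed tag, which
    would reject injection attempts like the one shown above.
    """
    if not _LANGTAG_RE.match(tag):
        raise ValueError("invalid language tag: %r" % (tag,))
    return tag.lower() if lowercase else tag
```

Called with 'en ; rdfs:comment "hello!"', this raises ValueError instead of letting the extra triple through to the serializer.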

I'll make a ticket :)

- Gunnar
