Issue 211 in rdflib: N3 parser should preserve the lexical form of literals

rdf...@googlecode.com

Feb 15, 2012, 3:10:42 AM

to rdfli...@googlegroups.com

Status: New
Owner: ----

New issue 211 by pierre.a...@gmail.com: N3 parser should preserve the
lexical form of literals
http://code.google.com/p/rdflib/issues/detail?id=211

* What steps will reproduce the problem?

import rdflib

g = rdflib.Graph()
# Parse a single N3 triple whose object is the decimal literal 2.50
g.parse(data="<> <#prop> 2.50 .", format="n3")
lit = next(g.objects())  # the only object in the graph
print(str(lit))

* What is the expected output? What do you see instead?

I get '2.5'
while it should be '2.50'
according to the serialization that has been parsed.

* What version of the product are you using? On what operating system?

rdflib 3.1.0

* Please provide any additional information below.

Semantically, the two literals ("2.5"^^xsd:decimal and "2.50"^^xsd:decimal)
are equivalent (and they are considered equal by the == operator, which I
can accept).
But syntactically, they are not, and I expect rdflib to preserve the
*abstract syntax* of a graph.

Furthermore, 'rdflib.compare.graph_diff' considers the two literals as
different (and I think it should, as once again, I think it should focus on
the abstract syntax and not on the semantics). I use it a lot to check that
my RDF/XML and my N3 are in sync, but it does not work because the RDF/XML
parser preserves the lexical form of literals while the N3 parser does not.

Again, I consider it a bug in the N3 parser rather than in the
graph_diff algorithm, which appropriately compares graphs at the abstract
syntax level.
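
The distinction between the value and the lexical form can be illustrated without rdflib. The sketch below uses Python's standard decimal module as a stand-in; that choice is an assumption made for illustration only and says nothing about how rdflib parses N3.

```python
from decimal import Decimal

# Two lexical forms of the same decimal value.
a = Decimal("2.5")
b = Decimal("2.50")

print(a == b)            # value comparison: the two are equal
print(str(a) == str(b))  # lexical comparison: "2.5" differs from "2.50"
```

Here == plays the role of the semantic comparison, while comparing the strings plays the role of the abstract-syntax comparison.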

rdf...@googlecode.com

Feb 15, 2012, 3:46:57 AM

to rdfli...@googlegroups.com

Updates:
Labels: Component-rdf Type-Defect OpSys-All

Comment #1 on issue 211 by gromgull: N3 parser should preserve the lexical
form of literals

This is a tricky issue - somehow related to Issue 6 as well.

A quick look at the compare.graph_diff code suggests that your desired
behaviour from graph_diff is accidental rather than planned :)

The "accident" happens because the graph compare is based on the
set-theoretic operations on graphs, which in turn call __contains__, which
gets pushed to the store. In the case of the memory store, this looks the
term up in some dicts which use the hash code. Now, although literal.__eq__
does the data-type specific comparison, we just use the string's hash code.
This is a bug, as __eq__ and __hash__ do not correspond:

In [6]: l1=rdflib.Literal(2.5)

In [7]: l2=rdflib.Literal("2.50",datatype=rdflib.term.URIRef(u'http://www.w3.org/2001/XMLSchema#float'))

In [8]: l1
Out[8]: rdflib.term.Literal(u'2.5',
datatype=rdflib.term.URIRef(u'http://www.w3.org/2001/XMLSchema#float'))

In [9]: l2
Out[9]: rdflib.term.Literal(u'2.50',
datatype=rdflib.term.URIRef(u'http://www.w3.org/2001/XMLSchema#float'))

In [10]: l1==l2
Out[10]: True

In [11]: hash(l1)
Out[11]: 2018417764

In [12]: hash(l2)
Out[12]: 1880954841
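
The effect of this mismatch on dict- and set-based lookups can be reproduced with a toy class. This is only a sketch: ToyLiteral is a hypothetical stand-in, not rdflib's Literal, and it assumes every lexical form is a float.

```python
class ToyLiteral(str):
    """Hypothetical stand-in: equality by float value, hash by string."""

    def __eq__(self, other):
        return float(self) == float(other)

    def __ne__(self, other):
        return not self == other

    __hash__ = str.__hash__  # hash of the lexical form, as described above


l1 = ToyLiteral("2.5")
l2 = ToyLiteral("2.50")

print(l1 == l2)              # True: compared by value
print(hash(l1) == hash(l2))  # normally False: hashed by lexical form
print(len({l1, l2}))         # normally 2, although the two compare equal
```

Because hash-based containers check the hash before calling __eq__, the two "equal" literals are treated as distinct members, which is the accidental behaviour described above.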

Now - I would be in favour of canonicalising all literals at construction
time.
This is the opposite of what the issue submitter wants :)

This is an almost philosophical issue - in my book the abstract syntax does
not really exist - or only briefly at import/export time, and we should
only care about the true graph...

The alternative solution is to make == actually compare the lexical form,
but this breaks many things (SPARQL etc). I believe Jena keeps the lexical
form separate, but stores the canonical form in stores?

A perhaps related issue is the 2.5 vs. 2.500000001 for comparing floats. I
will worry about this another day.

rdf...@googlecode.com

Feb 15, 2012, 3:50:59 AM

to rdfli...@googlegroups.com

Comment #2 on issue 211 by gromgull: N3 parser should preserve the lexical
form of literals

(It would be interesting to try some other stores, and see which may
have a different compare.graph_diff behaviour)

rdf...@googlecode.com

Feb 15, 2012, 10:09:53 AM

to rdfli...@googlegroups.com

Comment #3 on issue 211 by pierre.a...@gmail.com: N3 parser should preserve
the lexical form of literals

> A quick look at the compare.graph_diff code suggests that your desired
> behaviour from graph_diff is accidental rather than planned :)

I was slightly fearing that (probably why I didn't look at the code :)

> This is an almost philosophical issue

Well, at least it should be an informed design decision, and not the
product of implementation accidents :)
So it is worth discussing.

> in my book the abstract syntax does not really exist

Disagree!! Abstract syntax is clearly defined by
http://www.w3.org/TR/rdf-concepts/ ...

> and we should only care about the true graph...

... while this notion of "true graph" is *not* defined in rdf-concepts :-P

Tampering with lexical form is about *interpreting* the graph, as per
http://www.w3.org/TR/2004/REC-rdf-mt-20040210/#dtype_interp .

Again, I could understand (even if it is not my preference) that
rdflib.Graph is representing an interpretation model rather than an
abstract-syntax-graph, but then this should be clearly documented (and
consistently implemented :).

> The alternative solution is to make == actually compare the lexical
> form, but this breaks many things (SPARQL etc).

Note that SPARQL has the = operator, which embeds some form of datatype
interpretation (as in 2 = 2.0), but it also has the sameTerm function [1]
which compares nodes at a strictly syntactical level (where 2, 2.0 and 2.00
are all different).

So the best way to go while preserving backward compatibility would be to
add a sameTerm method on Node that would mimic SPARQL's sameTerm, while ==
would still mimic SPARQL's = .

My 2¢

[1] http://www.w3.org/TR/rdf-sparql-query/#func-sameTerm
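
A minimal sketch of the two comparisons side by side. The Term tuple and the functions same_term and value_equal are hypothetical names invented for this illustration, and the value comparison assumes numeric lexical forms only.

```python
from collections import namedtuple
from decimal import Decimal

# Hypothetical minimal term: a lexical form plus a datatype name.
Term = namedtuple("Term", ["lexical", "datatype"])


def same_term(a, b):
    """Strictly syntactic: lexical form and datatype must both match."""
    return a.lexical == b.lexical and a.datatype == b.datatype


def value_equal(a, b):
    """Value comparison, here restricted to numeric lexical forms."""
    return Decimal(a.lexical) == Decimal(b.lexical)


two = Term("2", "xsd:integer")
two_point_zero = Term("2.0", "xsd:decimal")

print(value_equal(two, two_point_zero))  # True, like SPARQL's 2 = 2.0
print(same_term(two, two_point_zero))    # False, like sameTerm(2, 2.0)
```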

rdf...@googlecode.com

Feb 15, 2012, 10:29:14 AM

to rdfli...@googlegroups.com

Comment #4 on issue 211 by gromgull: N3 parser should preserve the lexical
form of literals

> So the best way to go while preserving backward compatibility would be to
> add a sameTerm method on Node that would mimic SPARQL's sameTerm, while ==
> would still mimic SPARQL's = .

This sounds good to me. We have to rework the Literal class a bit to keep
track of the original lexical representation, as well as the "value". Since
we sub-class unicode, currently the "value" is always stored as the string
- I guess using this to store the lexical form and having a new
literal.value to store the value as a python object is nicer.

Then we just need to fix the __hash__ method of Literal to use the value
and we're sorted. (and implement sameTermAs)

(and maybe document it all :)
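
A rough sketch of that rework, assuming Python 3 and assuming every literal is a decimal number. SketchLiteral, its value attribute and same_term_as are hypothetical names for illustration; this is not rdflib's Literal.

```python
from decimal import Decimal


class SketchLiteral(str):
    """Hypothetical literal: the str part holds the lexical form."""

    def __new__(cls, lexical):
        self = super().__new__(cls, lexical)
        self.value = Decimal(lexical)  # assumed: every literal is a decimal
        return self

    def __eq__(self, other):
        if isinstance(other, SketchLiteral):
            return self.value == other.value
        return NotImplemented

    def __ne__(self, other):
        result = self.__eq__(other)
        return result if result is NotImplemented else not result

    def __hash__(self):
        return hash(self.value)  # consistent with the value-based __eq__

    def same_term_as(self, other):
        return str(self) == str(other)  # strictly lexical comparison


a = SketchLiteral("2.5")
b = SketchLiteral("2.50")

print(a == b)               # True: equal values
print(hash(a) == hash(b))   # True: hash follows the value
print(len({a, b}))          # 1: sets and dicts now agree with ==
print(a.same_term_as(b))    # False: the lexical forms differ
```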

rdf...@googlecode.com

Feb 15, 2012, 10:38:19 AM

to rdfli...@googlegroups.com

Comment #5 on issue 211 by gromgull: N3 parser should preserve the lexical
form of literals

For reference, Jena does exactly the opposite - Literal.equals does a
lexical comparison only, and they have a method Literal.sameValueAs which
does interpretation.

rdf...@googlecode.com

Feb 16, 2012, 5:38:51 AM

to rdfli...@googlegroups.com

Comment #6 on issue 211 by gromgull: N3 parser should preserve the lexical
form of literals

I had a go at changing the Literal implementation both ways, i.e. trying to
do the comparisons in value space, and in lexical space.

http://code.google.com/p/rdflib/source/list?name=interpreted_literals
http://code.google.com/p/rdflib/source/list?name=lexical_literals

See commit messages in re82b4ac46ea3 and ra1ea98d84bd2

I believe the only sane way is to do it the way Jena does it, stick with
http://www.w3.org/TR/rdf-concepts/#section-Literal-Equality
and implement hash, equality and > < cmp functions in lexical space.
This means losing some nice rdflib<->python things, but at least it's
consistent.

rdf...@googlecode.com

Feb 16, 2012, 6:34:10 AM

to rdfli...@googlegroups.com

Comment #7 on issue 211 by tak...@gmail.com: N3 parser should preserve the
lexical form of literals

Does this mean that 2.5 will compare not equal to 2.50? I think that could
lead to horrible confusion.

Bearing in mind that in Python 1 (integer) == 1.0 (float), I'm strongly in
favour of == doing value comparison, not string comparison. Semantically, a
decimal value 2.50 is the same thing as 2.5 - if you want to consider them
as different, compare them as strings.

rdf...@googlecode.com

Feb 16, 2012, 7:39:36 AM

to rdfli...@googlegroups.com

Comment #8 on issue 211 by pierre.a...@gmail.com: N3 parser should preserve
the lexical form of literals

Well, confusing a literal (basically, a string with an optional datatype)
with its value *is* a confusion in the first place.

That being said, I can live with the == operator applying to values, as
long as there is a way to compare the literals themselves (like the
sameTermAs() method suggested above).

Gromgull's proposal above is more radical, and probably cleaner. However,
it would break backward compatibility quite badly... Maybe better to wait
for rdflib 4.0 for such a change?

rdf...@googlecode.com

Feb 16, 2012, 7:49:38 AM

to rdfli...@googlegroups.com

Comment #9 on issue 211 by tak...@gmail.com: N3 parser should preserve the
lexical form of literals

Perhaps the way to look at this is in terms of accessing objects in the
graph: at present, we return a Literal object, which may correspond to a
Python object (e.g. a number or a datetime). Maybe there should be an
option to 'unbox' literals when you retrieve them from the graph, so the
caller can just handle Python objects.

Of course, that would also be a major change, and it doesn't solve the
question of what the default should be. And it could simply be a terrible
idea, it's just what came into my head.

rdf...@googlecode.com

Feb 16, 2012, 8:28:48 AM

to rdfli...@googlegroups.com

Comment #10 on issue 211 by gromgull: N3 parser should preserve the lexical
form of literals

I guess there is a sliding scale of possibilities here - the important
thing to me is that equals, hashcodes and ordering comparators remain
consistent, i.e.
* if a == b, then hash(a)==hash(b) -- this is REQUIRED by python
* if a > b, then b < a and (not a<b) -- this is actually not required, but
it would be madness not to do this
* if (not a>b) and (not a<b), then a==b -- again, not required, but weird

* At the far "interpret" side: interpret everything and ignore data-types,
i.e. Literal(1) == Literal("001") == Literal(1.0) == 1.0. Keep the sensible
comparisons of literals and python objects. This is probably tricky to do
"right" - XSD defines a bunch of types, like integer, int, byte, short,
unsignedShort, etc.

* A more moderate approach is to at least respect datatypes, i.e.
Literal(1.0) != Literal(1), but do value comparisons if the datatypes
match, i.e. Literal("1", datatype=XSD.int) == Literal("01",
datatype=XSD.int). This is slightly awkward to program, if you want to keep
the lexical representation intact, as Pierre requested in the original
issue. I also see no easy way to keep the comparisons with python objects,
i.e. Literal(1) > 2.0 ? It would be weird if (not Literal(1) >
Literal(0.0)), but (Literal(1) > 0.0)

* An easier solution to implement is to normalise the lexical
representation of literals with datatypes at creation time, i.e.
Literal("01", datatype=XSD.int) => Literal(u'1', datatype=XSD.int) - this
would mean round-tripping may break, as the literals are normalised on the
way. We could introduce a config flag (rdflib.NORMALIZE_LITERALS) that
toggles this behaviour, with a warning that your comparisons may be
nonsensical if you turn it off.

* And far on the other side - keep lexical representations only - do no
interpretation.
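
The normalise-at-creation option with a toggle could look roughly like this. It is a sketch for integer-typed literals only; NORMALIZE_LITERALS here is a local stand-in for the flag proposed above, and make_int_literal is a hypothetical helper.

```python
NORMALIZE_LITERALS = True  # hypothetical module-level switch


def make_int_literal(lexical, normalize=None):
    """Return the lexical form to store for an integer-typed literal."""
    if normalize is None:
        normalize = NORMALIZE_LITERALS
    if normalize:
        return str(int(lexical))  # canonical form: "01" becomes "1"
    return lexical  # keep the form exactly as written


print(make_int_literal("01"))                   # 1
print(make_int_literal("01", normalize=False))  # 01
```

With normalisation on, plain string comparison and hashing behave like value comparison, at the cost of losing the original "01" on round-trip.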

Both extremes could offer a sameTermAs-like method that does either lexical
or interpreted comparison. We could even offer a method to use as "key" with
sort and related methods.

rdf...@googlecode.com

Feb 16, 2012, 5:26:09 PM

to rdfli...@googlegroups.com

Comment #11 on issue 211 by gerhar...@gmail.com: N3 parser should
preserve the lexical form of literals

I believe the only safe and consistent way is strict lexical comparison for
the python operators (respecting datatype of course). Normalising can
become inconsistent again. E.g. how would you normalise / interpret
myns:CustomType (which may be derived from xsd:int but makes
trailing/leading 0s significant)?

I can see that normalising might be useful, but it probably shouldn't be
enabled by default, and the normalising and to-Python conversion functions
should ideally be pluggable / extensible (in the future).
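
One way to picture the pluggable part is a registry keyed by datatype, where unknown datatypes are left untouched. The registry and function names below are hypothetical, not an existing rdflib API.

```python
# Hypothetical registry: datatype name -> function canonicalising a lexical form.
NORMALIZERS = {
    "xsd:int": lambda lexical: str(int(lexical)),
}


def normalize(lexical, datatype):
    """Canonicalise only when a normaliser is registered for the datatype."""
    normalizer = NORMALIZERS.get(datatype)
    return normalizer(lexical) if normalizer else lexical


print(normalize("007", "xsd:int"))          # 7
print(normalize("007", "myns:CustomType"))  # 007, left untouched
```

A custom type whose leading zeros are significant simply registers nothing, or registers its own function.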

rdf...@googlecode.com

Feb 17, 2012, 6:30:24 AM

to rdfli...@googlegroups.com

Comment #12 on issue 211 by joernhees: N3 parser should preserve the lexical
form of literals

Isn't this just a problem of two worlds being mixed up here?
We can do == in a syntactic and in a semantic way, but IMHO < only makes
sense when we do it semantically.

If we did < syntactically (something like string comparison on
concat(str(datatype), str(lang), str(value)) ) then unexpected things would
happen: Literal(10.0) < Literal(2.0) < Literal(1) (the first one because
"10.0" < "2.0" as strings, the last one because "float" < "int").
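
The surprise can be reproduced with plain strings. In this sketch the literals are modelled as (lexical form, datatype, language) tuples and sorted by the concatenation described above; the tuple layout and syntactic_key are assumptions for illustration.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"

# (lexical form, datatype, language) tuples standing in for literals.
literals = [
    ("2.0", XSD + "float", ""),
    ("10.0", XSD + "float", ""),
    ("1", XSD + "int", ""),
]


def syntactic_key(lit):
    """Concatenate datatype, language and lexical form into one sort key."""
    lexical, datatype, lang = lit
    return datatype + lang + lexical


ordered = [lit[0] for lit in sorted(literals, key=syntactic_key)]
print(ordered)  # ['10.0', '2.0', '1']
```

The numerically smallest value ends up last because of its datatype, and under this key the two floats are themselves ordered as strings.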

At the same time I think == should be syntactic. It is defined at
http://www.w3.org/TR/rdf-concepts/#section-Literal-Equality and there is no
such thing as the javascript === in python :( Also, a semantic == leads
to violations of "if a == b, then hash(a) == hash(b)" (one more here):
>>> Literal(1.0) == Literal(1)
True
>>> hash(Literal(1.0)) == hash(Literal(1))
False

Having < semantic and == syntactic leads to the already mentioned weirdness
where not (a<b or a==b or a>b).

To solve these problems one good (not necessarily cool) solution would be:
make it a bit more explicit whether you're using syntactic or semantic
comparisons:
1. make == syntactic,
2. provide a Literal.semanticComp(a,b) which returns (all ops semantic
here) -1 if a>b, 0 if a==b, 1 if a<b, raising a TypeError if you compare
incompatible / unknown types
2.1 provide utility methods like semanticEquals, semanticLess... which
return booleans
3. leave < semantic, make use of Literal.semanticComp, but log a warning
on first use which points out the problems.

2 could be extensible so you can provide further mappings from Literals to
python datatypes.
3 could be configurable so you can switch it off if you know what you're
doing. An alternative to 3 is to completely remove the overloading of <
operators.
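
Points 2 and 2.1 might be sketched as follows, assuming numeric lexical forms and using snake_case stand-ins for the proposed names. The sign follows Python 2's cmp() convention (-1 when a < b), which is an assumption and the reverse of the signs written in point 2.

```python
from decimal import Decimal, InvalidOperation


def semantic_comp(a, b):
    """Compare two lexical forms by numeric value.

    Returns -1, 0 or 1 following the cmp() convention and raises
    TypeError when either form cannot be interpreted as a number.
    """
    try:
        va, vb = Decimal(a), Decimal(b)
    except InvalidOperation:
        raise TypeError("cannot compare %r and %r semantically" % (a, b))
    return (va > vb) - (va < vb)


def semantic_equals(a, b):
    return semantic_comp(a, b) == 0


def semantic_less(a, b):
    return semantic_comp(a, b) < 0


print(semantic_equals("2.50", "2.5"))  # True
print(semantic_less("2", "10"))        # True, unlike string comparison
```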

rdf...@googlecode.com

Feb 27, 2012, 4:34:58 PM

to rdfli...@googlegroups.com

Comment #13 on issue 211 by pierre.a...@gmail.com: N3 parser should preserve
the lexical form of literals

If we are considering breaking backward compatibility, I would rather

* make == and != syntactic
* deprecate all other comparison operators
* rely on Literal.toPython() for semantic comparison

That one is already extensible, IIRC, and if I want to compare two literals
semantically, I can do