New issue 211 by pierre.a...@gmail.com: N3 parser should preserve the
lexical form of literals
http://code.google.com/p/rdflib/issues/detail?id=211
* What steps will reproduce the problem?
import rdflib
g = rdflib.Graph()
g.parse(data="<> <#prop> 2.50 .", format="n3")
lit = g.objects().next()
print str(lit)
* What is the expected output? What do you see instead?
I get '2.5'
while it should be '2.50'
according to the serialization that has been parsed.
* What version of the product are you using? On what operating system?
rdflib 3.1.0
* Please provide any additional information below.
Semantically, the two literals ("2.5"^^xsd:decimal and "2.50"^^xsd:decimal)
are equivalent (and they are considered equal by the == operator, which I
can accept).
But syntactically, they are not, and I expect rdflib to preserve the
*abstract syntax* of a graph.
Furthermore, 'rdflib.compare.graph_diff' considers the two literals as
different (and I think it should, as once again, I think it should focus on
the abstract syntax and not on the semantics). I use it a lot to check that
my RDF/XML and my N3 are in sync, but it does not work because the RDF/XML
parser preserves the lexical form of literals while the N3 parser does not.
Again, I consider it a bug in the N3 parser rather than in the
graph_diff algorithm, which appropriately compares graphs at the abstract
syntax level.
Comment #1 on issue 211 by gromgull: N3 parser should preserve the lexical
form of literals
http://code.google.com/p/rdflib/issues/detail?id=211
This is a tricky issue - somehow related to Issue 6 as well.
A quick look at the compare.graph_diff code suggests that your desired
behaviour from graph_diff is accidental rather than planned :)
The "accident" happens because the graph comparison is based on the
set-theoretic operations on graphs, which in turn call __contains__, which
gets pushed to the store. In the case of the memory store, this looks the
triple up in some dicts, which use the hash code. Now, although
Literal.__eq__ does the datatype-specific comparison, we just use the
string's hash code. This is a bug, as __eq__ and __hash__ do not correspond:
In [6]: l1=rdflib.Literal(2.5)
In [7]: l2=rdflib.Literal("2.50",datatype=rdflib.term.URIRef(u'http://www.w3.org/2001/XMLSchema#float'))
In [8]: l1
Out[8]: rdflib.term.Literal(u'2.5', datatype=rdflib.term.URIRef(u'http://www.w3.org/2001/XMLSchema#float'))
In [9]: l2
Out[9]: rdflib.term.Literal(u'2.50', datatype=rdflib.term.URIRef(u'http://www.w3.org/2001/XMLSchema#float'))
In [10]: l1==l2
Out[10]: True
In [11]: hash(l1)
Out[11]: 2018417764
In [12]: hash(l2)
Out[12]: 1880954841
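To see why this mismatch matters, here is a minimal pure-Python sketch. It is not rdflib code: the Lit class is a hypothetical stand-in whose __eq__ compares by value while its __hash__ uses the string, as described above. Hash-based containers can then miss an element that == reports as equal:

```python
class Lit(str):
    """Hypothetical stand-in for a literal: a string with value-based equality."""

    def __eq__(self, other):
        # Compare in value space: "2.5" and "2.50" denote the same float.
        if isinstance(other, Lit):
            return float(self) == float(other)
        return NotImplemented

    # Hash in lexical space (the bug): equal objects may hash differently.
    __hash__ = str.__hash__


l1, l2 = Lit("2.5"), Lit("2.50")
print(l1 == l2)              # True
print(hash(l1) == hash(l2))  # False, barring a hash collision
print(l2 in {l1})            # False, barring a hash collision
```

This is the same effect that makes a dict-backed store treat the two literals as different triples even though == says they are equal.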
Now - I would be in favour of canonicalising all literals at construction
time.
This is the opposite of what the issue submitter wants :)
This is an almost philosophical issue - in my book the abstract syntax does
not really exist - or only briefly at import/export time, and we should
only care about the true graph...
The alternative solution is to make == actually compare the lexical form,
but this breaks many things (SPARQL etc). I believe Jena keeps the lexical
form separate, but stores the canonical form in stores?
A perhaps related issue is the 2.5 vs. 2.500000001 for comparing floats. I
will worry about this another day.
(It would be interesting to try some other stores, and see which may
have a different compare.graph_diff behaviour)
> So the best way to go while preserving backward compatibility would be to
> add a sameTerm method on Node that would mimic SPARQL's sameTerm, while ==
> would still mimic SPARQL's = .
This sounds good to me. We have to rework the Literal class a bit to keep
track of the original lexical representation, as well as the "value". Since
we sub-class unicode, currently the "value" is always stored as the string
- I guess using this to store the lexical form and having a new
literal.value to store the value as a python object is nicer.
Then we just need to fix the __hash__ method of Literal to use the value
and we're sorted. (and implement sameTermAs)
(and maybe document it all :)
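A rough sketch of that rework, to make the idea concrete. Everything here is hypothetical (the SketchLiteral class, the same_term name, and interpreting only xsd:decimal); it is not the rdflib implementation:

```python
from decimal import Decimal

XSD_DECIMAL = "http://www.w3.org/2001/XMLSchema#decimal"


class SketchLiteral(str):
    """Hypothetical literal: the str content is the lexical form."""

    def __new__(cls, lexical, datatype=None):
        self = super().__new__(cls, lexical)
        self.datatype = datatype
        # Only xsd:decimal is interpreted in this sketch; anything else
        # falls back to the lexical form itself.
        self.value = Decimal(lexical) if datatype == XSD_DECIMAL else str(lexical)
        return self

    def __eq__(self, other):
        # Value-space comparison, like SPARQL's "=".
        if isinstance(other, SketchLiteral):
            return self.datatype == other.datatype and self.value == other.value
        return NotImplemented

    def __hash__(self):
        # Hash what __eq__ compares, so equal literals hash equally.
        return hash((self.datatype, self.value))

    def same_term(self, other):
        # Term-level comparison, like SPARQL's sameTerm.
        return (isinstance(other, SketchLiteral)
                and str(self) == str(other)
                and self.datatype == other.datatype)


a = SketchLiteral("2.5", XSD_DECIMAL)
b = SketchLiteral("2.50", XSD_DECIMAL)
print(a == b, hash(a) == hash(b))  # True True
print(a.same_term(b))              # False
print(str(b))                      # 2.50
```

The lexical form survives a round trip, == and hash agree, and same_term gives the stricter comparison the submitter asked for.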
For reference, Jena does exactly the opposite - the Literal.equals does a
lexical comparison only, and they have a method Literal.sameValueAs which
does interpretation.
I had a go at changing the Literal implementation both ways, i.e. trying to
do the comparisons in value space, and in lexical space.
http://code.google.com/p/rdflib/source/list?name=interpreted_literals
http://code.google.com/p/rdflib/source/list?name=lexical_literals
See commit messages in re82b4ac46ea3 and ra1ea98d84bd2
I believe the only sane way is to do it the way Jena does it, stick with
http://www.w3.org/TR/rdf-concepts/#section-Literal-Equality
and implement hash, equality and > < cmp functions in lexical space.
This means losing some nice rdflib<->python things, but at least it's
consistent.
Does this mean that 2.5 will compare not equal to 2.50? I think that could
lead to horrible confusion.
Bearing in mind that in Python 1 (integer) == 1.0 (float), I'm strongly in
favour of == doing value comparison, not string comparison. Semantically, a
decimal value 2.50 is the same thing as 2.5 - if you want to consider them
as different, compare them as strings.
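Python's own decimal module behaves this way, which may be a useful reference point (this uses only the standard library, not rdflib): equality and hashing work on the value, while the string form keeps the trailing zero.

```python
from decimal import Decimal

a, b = Decimal("2.5"), Decimal("2.50")
print(a == b)                          # True
print(hash(a) == hash(b))              # True
print(str(a), str(b))                  # 2.5 2.50
print(str(a) == str(b))                # False
print(1 == 1.0, hash(1) == hash(1.0))  # True True
```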
Well, confusing a literal (basically, a string with an optional datatype)
with its value *is* a confusion in the first place.
That being said, I can live with the == operator applying to values, as
long as there is a way to compare the literals themselves (like the
sameTermAs() method suggested above).
Gromgull's proposal above is more radical, and probably cleaner. However,
it would break backward compatibility quite badly... Maybe better to wait
for rdflib 4.0 for such a change?
Perhaps the way to look at this is in terms of accessing objects in the
graph: at present, we return a Literal object, which may correspond to a
Python object (e.g. a number or a datetime). Maybe there should be an
option to 'unbox' literals when you retrieve them from the graph, so the
caller can just handle Python objects.
Of course, that would also be a major change, and it doesn't solve the
question of what the default should be. And it could simply be a terrible
idea, it's just what came into my head.
I guess there is a sliding scale of possibilities here - the important
thing to me is that equals, hashcodes and ordering comparators remain
consistent, i.e.
* if a == b, then hash(a)==hash(b) -- this is REQUIRED by python
* if a > b, then b < a and (not a<b) -- this is actually not required, but
it would be madness not to do this
* if (not a>b) and (not a<b), then a==b -- again, not required, but weird
* At the far "interpret" side: interpret everything. Ignore data-types.
Keep the sensible comparisons of literals and python objects. This is
probably tricky to do "right" - XSD defines a bunch of types, like integer,
int, byte, short, unsignedShort, etc. - i.e. Literal(1) == Literal("001")
== Literal(1.0) == 1.0
* A more moderate approach is to at least respect datatypes, i.e.
Literal(1.0) != Literal(1), but do value comparisons if the datatypes
match, i.e. Literal("1", datatype=XSD.int) == Literal("01",
datatype=XSD.int). This is slightly awkward to program, if you want to keep
the lexical representation intact, as Pierre requested in the original
issue. I also see no easy way to keep the comparisons with python objects,
i.e. Literal(1) > 2.0 ? It would be weird if (not Literal(1) >
Literal(0.0)), but (Literal(1) > 0.0)
* An easier solution to implement is to normalise the lexical
representation of literals with datatypes on creation time. I.e.
Literal("01", datatype=XSD.int) => Literal(u'1', datatype=XSD.int) - this
would mean round-tripping may break, as the literals are normalised on the
way. We could introduce a config flag (rdflib.NORMALIZE_LITERALS) that
toggles this behaviour, with a warning that your comparisons may be
nonsensical if you do.
* And far on the other side - keep lexical representations only - do no
interpretation.
Both extremes could offer a sameTermAs-like method that does either lexical
or interpreted comparison. We could even offer a method to use as "key" with
sort and related methods.
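Whichever point on the scale is chosen, the three consistency rules listed at the start of this comment can be checked mechanically. A minimal sketch, using a hypothetical check_consistency helper and plain Python objects rather than rdflib literals:

```python
import itertools


def check_consistency(items):
    """Return the violated rules for every ordered pair in items."""
    problems = []
    for a, b in itertools.permutations(items, 2):
        # Rule 1: equal objects must have equal hashes.
        if a == b and hash(a) != hash(b):
            problems.append(("eq/hash", a, b))
        # Rule 2: a > b implies b < a and not a < b.
        if a > b and not (b < a and not a < b):
            problems.append(("ordering", a, b))
        # Rule 3: neither greater nor less implies equal.
        if not a > b and not a < b and not a == b:
            problems.append(("trichotomy", a, b))
    return problems


print(check_consistency([1, 1.0, 2, 2.5]))  # []
# Sets are only partially ordered, so rule 3 fails in both directions.
print(len(check_consistency([frozenset({1}), frozenset({2})])))  # 2
```

Running something like this over a handful of sample literals would be a cheap regression test for any of the designs above.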
I believe the only safe and consistent way is strict lexical comparison for
the python operators (respecting datatype of course). Normalising can
become inconsistent again. E.g. how would you normalise / interpret
myns:CustomType (which may be derived from xsd:int but makes
trailing/leading 0s significant)?
I can see that normalising might be useful, but it probably shouldn't be
enabled by default, and the normalising and to-Python conversion functions
should ideally be pluggable / extendable (in the future).
Isn't this just a problem of two worlds being mixed up here?
We can do == in a syntactic and in a semantic way, but IMHO < only makes
sense when we do it semantically.
If we did < syntactically (something like string comparison on
concat(str(datatype), str(lang), str(value)) ) then unexpected things would
happen: Literal(10.0) < Literal(2.0) < Literal(1) (the first one because
"10.0" < "2.0" as strings, the last one because "float" < "int").
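A quick pure-Python check of such a concatenation-based ordering. The syntactic_key function is a hypothetical stand-in, and literals are modelled as plain (lexical, datatype) tuples rather than rdflib objects; the exact order depends on how the strings are built, but numeric order is lost both within one datatype and across datatypes:

```python
XSD = "http://www.w3.org/2001/XMLSchema#"


def syntactic_key(lexical, datatype, lang=""):
    # Hypothetical sort key: plain string concatenation of the parts.
    return datatype + lang + lexical


literals = [("1", XSD + "int"), ("2.0", XSD + "float"), ("10.0", XSD + "float")]
ordered = sorted(literals, key=lambda lit: syntactic_key(*lit))
print([lexical for lexical, _ in ordered])  # ['10.0', '2.0', '1']
```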
At the same time I think == should be syntactic. It is defined at
http://www.w3.org/TR/rdf-concepts/#section-Literal-Equality and there is no
such thing as the JavaScript === in Python :( Also, a semantic == leads
to violations of "if a == b, then hash(a) == hash(b)" (one more here):
>>> Literal(1.0) == Literal(1)
True
>>> hash(Literal(1.0)) == hash(Literal(1))
False
Having < semantic and == syntactic leads to the already mentioned weirdness
where not (a<b or a==b or a>b).
To solve these problems one good (not necessarily cool) solution would be:
make it a bit more explicit whether you're using syntactic or semantic
comparisons:
1. make == syntactic,
2. provide a Literal.semanticComp(a,b) which returns (all ops semantic
here) -1 if a<b, 0 if a==b, 1 if a>b, raising a TypeError if you compare
incompatible / unknown types
2.1 provide utility methods like semanticEquals, semanticLess... which
return booleans
3. leave < semantic, make use of Literal.semanticComp, but log a warning
on first use which points out the problems.
2 could be extensible so you can provide further mappings from Literals to
python datatypes.
3 could be configurable so you can switch it off if you know what you're
doing. An alternative to 3 is to completely remove the overloading of <
operators.
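A minimal sketch of points 1, 2 and 2.1, under stated assumptions: literals are modelled as plain (lexical, datatype) tuples so that == is syntactic for free, the CONVERTERS table and all function names are hypothetical, and the result follows the sign convention of Python 2's cmp (negative when the first argument is smaller).

```python
from decimal import Decimal

XSD = "http://www.w3.org/2001/XMLSchema#"

# Hypothetical, extensible mapping from datatype URI to a Python converter.
CONVERTERS = {
    XSD + "integer": int,
    XSD + "decimal": Decimal,
}


def semantic_cmp(a, b):
    """Compare two (lexical, datatype) pairs in value space.

    Returns -1, 0 or 1; raises TypeError for unknown or mismatched datatypes.
    """
    (lex_a, dt_a), (lex_b, dt_b) = a, b
    if dt_a != dt_b or dt_a not in CONVERTERS:
        raise TypeError("cannot compare %r with %r semantically" % (a, b))
    va, vb = CONVERTERS[dt_a](lex_a), CONVERTERS[dt_b](lex_b)
    return (va > vb) - (va < vb)


def semantic_equals(a, b):
    return semantic_cmp(a, b) == 0


def semantic_less(a, b):
    return semantic_cmp(a, b) < 0


x = ("2.5", XSD + "decimal")
y = ("2.50", XSD + "decimal")
print(x == y)                 # False: syntactic comparison
print(semantic_equals(x, y))  # True: value comparison
print(semantic_less(("2", XSD + "integer"), ("10", XSD + "integer")))  # True
```

Registering another entry in CONVERTERS is all it would take to support a further datatype, which is the extensibility mentioned for point 2.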
If we are considering breaking backward compatibility, I would rather
* make == and != syntactic
* deprecate all other comparison operators
* rely on Literal.toPython() for semantic comparison
That one is already extensible, IIRC, and if I want to compare two literals
semantically, I can do
l1.toPython() < l2.toPython()
which seems clear enough to me.