Serialization of ntriples does not seem to do unicode escaping?

Etienne Posthumus

Aug 1, 2022, 8:23:01 AM
to rdflib-dev
The current N-Triples serialization in rdflib does not seem to do the Unicode escaping required by https://www.w3.org/TR/rdf-testcases/#ntrip_strings.

As a test, doing: 

from rdflib import Graph, Literal, URIRef

g = Graph()
g.add((URIRef("urn:aap"), URIRef("urn:noot"), Literal("miës")))
print(g.serialize(format="nt"))

Expected:
<urn:aap> <urn:noot> "mi\u00EBs" .
But it produces:
<urn:aap> <urn:noot> "miës" .

There seems to be an ancient issue fixing encodings:
But I could not track down how to read what patch it referred to.

This seems like such a basic fundamental thing, hopefully I am just missing something obvious?
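
In the meantime, one possible workaround is to escape the serialized text after the fact. A minimal sketch, assuming the input is already valid N-Triples text; `nt_escape` is a hypothetical helper name, not part of rdflib:

```python
def nt_escape(text: str) -> str:
    """Replace every non-ASCII character with an N-Triples \\u / \\U escape."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)               # plain US-ASCII passes through
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04X}")   # 4 hex digits up to U+FFFF
        else:
            out.append(f"\\U{cp:08X}")   # 8 hex digits from U+10000 onwards
    return "".join(out)


line = '<urn:aap> <urn:noot> "miës" .'
escaped = nt_escape(line)
print(escaped)
```

Applied to the line above, this yields the "Expected" form with `\u00EB` in place of the ë.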

Nicholas Car

Aug 1, 2022, 8:42:33 AM
to rdfli...@googlegroups.com
Wow, that is an old issue you point to! Closed in 2021 due to a fix in 2009! Graham Higgins closed it and he's active in RDFLib dev now so perhaps he knows?

------- Original Message -------
--
http://github.com/RDFLib
---
You received this message because you are subscribed to the Google Groups "rdflib-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rdflib-dev+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/rdflib-dev/15d99d10-0fab-4be3-ad17-a2c77e22bb6en%40googlegroups.com.

Graham Higgins

Aug 1, 2022, 10:14:41 AM
to rdflib-dev
I just copied the issues over, eikeon actually implemented the fix.

However, the nt serializer code has been through many subsequent changes, I doubt any of the 2009-vintage code remains.

fwiw, it roundtrips just fine and RDFLib's serialization seems consonant with Jena's:

$ cat ttest.nt
<urn:aap> <urn:noot> "miës" .
$ riot ttest.nt

<urn:aap> <urn:noot> "miës" .
$ cat ttest2.nt
<urn:aap> <urn:noot> "mi\u00EBs" .
$ riot ttest2.nt

<urn:aap> <urn:noot> "miës" .
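
The round trip works because both spellings denote the same Unicode string once the escapes are resolved. A small stdlib-only sketch of that decoding step; `nt_unescape` is a hypothetical helper that resolves only the numeric \u and \U escapes (not \n, \" and the like), and it is not how riot or rdflib implement parsing:

```python
import re

# An escaped backslash, \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
_ESC = re.compile(r"\\\\|\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def _resolve(match):
    hexdigits = match.group(1) or match.group(2)
    if hexdigits is None:
        return match.group(0)  # an escaped backslash: leave it untouched
    return chr(int(hexdigits, 16))


def nt_unescape(text: str) -> str:
    """Resolve numeric N-Triples escapes to the characters they denote."""
    return _ESC.sub(_resolve, text)


raw = '<urn:aap> <urn:noot> "miës" .'
escaped = '<urn:aap> <urn:noot> "mi\\u00EBs" .'
same = nt_unescape(escaped) == raw
print(same)
```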

Graham Higgins

Aug 1, 2022, 12:57:39 PM
to rdflib-dev
On Monday, August 1, 2022 at 2:14:41 PM UTC Graham Higgins wrote:
> However, the nt serializer code has been through many subsequent changes, I doubt any of the 2009-vintage code remains.

After some forensic work in the commit history, I can confirm that it is intentional that RDFLib's N-Triples serialization doesn't conform to the W3C N-Triples serialization standard ¹.

It *used to* conform, up until Dec 2021 when the encoding was changed from ASCII to UTF-8 ², obviating the need for \-escaping.

The way it used to work was (with ascii encoding) the presence of a non-ascii character in the input would trigger an XML processing exception that deferred to an error-correcting handler which replaced the non-ascii character with the \-escaped correlate. ³
 
¹ “N-Triples strings are sequences of US-ASCII character productions encoding [UNICODE] character strings. The characters outside the US-ASCII range and some other specific characters are made available by \-escape sequences”
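
In plain Python terms, encoding a string to ASCII raises a UnicodeEncodeError for any non-ASCII character unless an error handler is named, and a custom handler can substitute the escape instead. A sketch of that general mechanism using only the standard library; the function `nt_escape_errors` and the handler name `demo_nt_escape` are invented for illustration, and this is not rdflib's actual code:

```python
import codecs


def nt_escape_errors(exc):
    """Error handler: replace unencodable characters with \\uXXXX / \\UXXXXXXXX."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    replacement = "".join(
        f"\\u{ord(c):04X}" if ord(c) <= 0xFFFF else f"\\U{ord(c):08X}"
        for c in bad
    )
    # Return the replacement text and the position at which to resume encoding.
    return replacement, exc.end


# "demo_nt_escape" is a made-up handler name for this sketch.
codecs.register_error("demo_nt_escape", nt_escape_errors)

line = '<urn:aap> <urn:noot> "miës" .'
utf8_bytes = line.encode()                                # Python 3 default: UTF-8
ascii_bytes = line.encode("ascii", "demo_nt_escape")      # escaped US-ASCII
builtin_bytes = line.encode("ascii", "backslashreplace")  # built-in handler, for comparison
```

Note that Python's built-in `backslashreplace` handler is not a substitute here: for ë it writes `\xeb`, which is not an N-Triples escape.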

Cheers,
Graham

Etienne Posthumus

Aug 1, 2022, 1:43:54 PM
to rdfli...@googlegroups.com
Thanks for the excellent spelunking Graham.

Is it common practice nowadays for most serializers to just do UTF-8 and not do \-escape sequences anymore? I guess if this has been the behaviour in rdflib for years now and no-one complains too much, we can just assume it is OK and keep on doing it.
Maybe it is a good idea for us to add a line in the docs that the rdflib serializer intentionally deviates from the spec.

Graham Higgins

Aug 1, 2022, 3:13:34 PM
to rdflib-dev
On Monday, August 1, 2022 at 5:43:54 PM UTC Etienne Posthumus wrote:
> Thanks for the excellent spelunking Graham.

Happy to help, thanks for the kind words.
 
> Is it common practice nowadays for most serializers to just do UTF-8 and not do \-escape sequences anymore? I guess if this has been the behaviour in rdflib for years now and no-one complains too much, we can just assume it is OK and keep on doing it.

I don't know about "common practice" but I treat Jena's behaviour as a useful ad hoc yardstick: if it passes muster with Andy Seaborne then it's probably the right way to go.
 
> Maybe it is a good idea for us to add a line in the docs that the rdflib serializer intentionally deviates from the spec.

Yes, either document the difference or, given that known-working code still exists, perhaps just enabling strictness by setting an *args flag might be a viable solution ... something along the lines of:

diff --git a/rdflib/plugins/serializers/nt.py b/rdflib/plugins/serializers/nt.py
index 913dbedf..b73f223f 100644
--- a/rdflib/plugins/serializers/nt.py
+++ b/rdflib/plugins/serializers/nt.py
@@ -38,7 +38,11 @@ class NTSerializer(Serializer):
             )
 
         for triple in self.store:
-            stream.write(_nt_row(triple).encode())
+            stream.write(
+                _nt_row(triple).encode("ascii", "_rdflib_nt_escape")
+                if "w3c" in args
+                else _nt_row(triple).encode()
+            )
 
 
 class NT11Serializer(NTSerializer):

Which, on casual testing, behaves as desired, producing “<urn:aap> <urn:noot> "mi\u00EBs" .” with the flag set and “<urn:aap> <urn:noot> "miës" .” when not set.

What does the team think?

Cheers,
Graham

Nicholas Car

Aug 1, 2022, 6:48:37 PM
to rdfli...@googlegroups.com
Of course, this is a Python 2 -> 3 change thing. Yes, Python can just handle the fancy chars "better" now, i.e. not needing to encode everything. The W3C spec was simply written for less capable encoding systems.

So, unless there's a strong practical reason to revert, I suggest we keep this behaviour.

Etienne, would you be interested in creating a small PR for either the flag or the documentation?

Etienne Posthumus

Aug 3, 2022, 10:18:36 AM
to rdfli...@googlegroups.com
Nicholas, I don't mind taking a shot at adding a pull request.

As a first step, I made a fork and ran the tests, without changing anything, and then I see:
1 failed, 6742 passed, 72 skipped, 376 xfailed, 471 warnings, 2 errors

What is the general policy with regard to fails/skips/warnings etc. on the main branch? For a project like this, with a rich and long history, I would expect some of this, but before I make assumptions: should these be fixed first? Before even starting on a pull request for this encoding thing, is it worth trying to understand what the current fails and warnings are about and try to remove them, or is that an impossible task in your opinion?




Iwan Aucamp

Aug 15, 2022, 1:42:33 PM
to rdflib-dev
On Wednesday, 3 August 2022 at 16:18:36 UTC+2 epost...@gmail.com wrote:
> Nicholas, I don't mind taking a shot at adding a pull request.
>
> As a first step, I made a fork and ran the tests, without changing anything, and then I see:
> 1 failed, 6742 passed, 72 skipped, 376 xfailed, 471 warnings, 2 errors
>
> What is the general policy with regard to fails/skips/warnings etc. on the main branch? For a project like this, with a rich and long history, I would expect some of this, but before I make assumptions: should these be fixed first? Before even starting on a pull request for this encoding thing, is it worth trying to understand what the current fails and warnings are about and try to remove them, or is that an impossible task in your opinion?

Our CI must complete for all PRs, and it currently does run on the master branch, as can be seen here. It runs all tests on Windows, macOS and Linux for Python 3.7 to 3.11; it also runs mypy, flake8 (with some baselining), isort and black.

The skips you are seeing are likely because you don't have Jena running and because you don't have some extras installed. In the past skips were used where xfails would be more appropriate, but most skips just indicate that the test requires something that is not available in your current environment.

The xfails are used to designate known issues, which is more or less what they are intended for [ref]:
> An xfail means that you expect a test to fail for some reason. A common example is a test for a feature not yet implemented, or a bug not yet fixed. 

They indicate that we know a test fails, but also that it should pass. Once the underlying issue is fixed, the test is reported as xpass and we remove the xfail marker, usually in the same PR that fixes the issue, as can be seen in this PR. They are basically a more useful form of bug report: they make it clear how to reproduce an issue and make it very easy to see that the issue was fixed.

The warnings indicate something suboptimal but not fatal. They should be eliminated, but in most cases doing so is hard and quite costly, and there are plenty of other issues we will likely address first.

The errors indicate something serious, and would prevent a PR from being merged if they occurred in a PR, but as our test suite currently passes on master I think this is either an issue on your system or possibly a problem with a dependency.

We are open to all PRs that address any problems in RDFLib, so if you want to fix the warnings or xfails we would be very happy, but it would be best to address the problems one at a time rather than all in one PR. Many problems are not that simple to fix, though. A lot of the xfails (maybe half) are actually because RDFLib is too lax in parsing, and the W3C test suites, which run as part of our test suite, require stricter behaviour. This is not a simple problem to address, as we should provide parsers that can be both strict and lax depending on user preference. It is also not a high-priority problem: if we just made our parsers more strict, most users would be unhappy.

Some other xfails are because of legitimate problems with our parsers, which again are not that simple to fix, as we should likely be moving to something like Lark because the current hand-crafted parsers are hard to maintain.

We will be happy with any attempts to fix any of these problems though. 

Regards
Iwan Aucamp 

jerven Bolleman

Aug 15, 2022, 2:12:47 PM
to rdfli...@googlegroups.com
Hi Graham, All,

I suspect that the spec has changed with regard to N-Triples over the
years, specifically when Turtle became a W3C standard.

For example the spec for 1.1 N-Triples says [1]


Encoding considerations:
The syntax of N-Triples is expressed over code points in Unicode
[UNICODE]. The encoding is always UTF-8 [UTF-8].
Unicode code points may also be expressed using an \uXXXX (U+0 to
U+FFFF) or \UXXXXXXXX syntax (for U+10000 onwards) where X is a
hexadecimal digit [0-9A-F]

And also the note here

6.1 Other Media Types

N-Triples has been historically provided with other media types.
N-Triples may also be provided as text/plain. When used in this way
N-Triples MUST use the escaped form of any character outside US-ASCII.
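
Read together, these two passages suggest that the right byte-level output depends on the media type the data is served as. A hedged sketch of that rule; `encode_ntriples` and `_uchar` are hypothetical helpers, and `application/n-triples` is the media type registered by the 1.1 spec:

```python
def _uchar(ch: str) -> str:
    """Return ch unchanged if it is US-ASCII, otherwise its \\u / \\U escape."""
    cp = ord(ch)
    if cp < 0x80:
        return ch
    return f"\\u{cp:04X}" if cp <= 0xFFFF else f"\\U{cp:08X}"


def encode_ntriples(line: str, media_type: str = "application/n-triples") -> bytes:
    """Encode one N-Triples line according to the media type it is served as."""
    if media_type == "text/plain":
        # Legacy media type: everything outside US-ASCII must be escaped.
        return "".join(_uchar(ch) for ch in line).encode("ascii")
    # N-Triples 1.1: the encoding is always UTF-8; escapes are optional.
    return line.encode("utf-8")


line = '<urn:aap> <urn:noot> "miës" .'
as_ntriples = encode_ntriples(line)
as_text_plain = encode_ntriples(line, "text/plain")
```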

Hope that helps in deciding whether it is a bug or not.

Regards,
Jerven

[1] https://www.w3.org/TR/2014/REC-n-triples-20140225/

--

*Jerven Tjalling Bolleman*
Principal Software Developer
*SIB | Swiss Institute of Bioinformatics*
1, rue Michel Servet - CH 1211 Geneva 4 - Switzerland
t +41 22 379 58 85
Jerven....@sib.swiss - www.sib.swiss

Etienne Posthumus

Aug 16, 2022, 2:55:11 AM
to rdfli...@googlegroups.com
Thanks for the detailed answer Iwan.
I certainly would not try to fix the warnings or xfails yet, not being familiar with the codebase. It was more out of caution: I did not want to add anything with unintended consequences elsewhere that I could not test.

So the test suite is very comprehensive, but needs a more elaborate environment than a simple test run, understood. Will have a look at the GitHub Actions to try and figure out how to duplicate that locally.

Iwan Aucamp

Aug 20, 2022, 8:01:25 AM
to rdflib-dev
On Tuesday, 16 August 2022 at 08:55:11 UTC+2 epost...@gmail.com wrote:
> So the test suite is very comprehensive, but needs a more elaborate environment than a simple test run, understood. Will have a look at the GitHub Actions to try and figure out how to duplicate that locally.

We run the test suite with tox in CI, and this should work fine locally; the instructions for using it are in our developers guide.

There are also various other options for running validation, including using a venv via go-task from the Taskfile.yml we provide as described here, or using the devcontainer as described here.

I personally use go-task with a venv most of the time when developing locally.