Encoding issues when dereferencing "formats:" URIs

Aidan Hogan

Apr 25, 2012, 7:17:36 AM
to Ivan Herman, pedant...@googlegroups.com
Hi Ivan,

There was some discussion on the Pedantic Web list about annotating
documents with the content types they use.

A few alternatives were discussed including:

... dct:format "application/rdf+xml" .
... dct:format [
a dct:IMT;
rdf:value "application/rdf+xml"
].

However, using URIs instead of blank-nodes or literals would seem to be
much more beneficial here since these resources are likely to reappear
very often across different datasets.

Keith Alexander pointed out this page:

http://www.w3.org/ns/formats/

This looks (to me) like the perfect solution. One could do something like:

... dct:format <http://www.w3.org/ns/formats/RDF_XML> .

Unfortunately, some of the dereferenced documents for the URIs contained
within have syntax errors in RDF/XML.

For example, when checking the URI:

http://www.w3.org/ns/formats/data/RDF_XML

The RDF/XML validator gives:

http://www.w3.org/RDF/Validator/ARPServlet?URI=http%3A%2F%2Fwww.w3.org%2Fns%2Fformats%2Fdata%2FRDF_XML&PARSE=Parse+URI%3A+&TRIPLES_AND_GRAPH=PRINT_TRIPLES&FORMAT=PNG_EMBED

Would it be possible to fix these documents, or maybe forward as
appropriate?

Cheers,
Aidan

Ivan Herman

Apr 25, 2012, 7:54:04 AM
to Aidan Hogan, pedant...@googlegroups.com
Aidan,

I have no idea what is going on. If you take any of those RDF files, and copy the text into the text box for the same validator, it checks all right. When using tabulator in Firefox, it reads it. When I use Firefox to directly display the XML file, it does not experience any problem (though Firefox has a built-in XML parser).

I will have to ask the maintainers of the service for some help here.

Ivan


----
Ivan Herman, W3C Semantic Web Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
FOAF: http://www.ivan-herman.net/foaf.rdf

Richard Cyganiak

Apr 25, 2012, 7:57:47 AM
to pedant...@googlegroups.com, Ivan Herman
Aidan,

Ivan's file is (pedantically speaking) fine. The error reported by the RDF Validator is a symptom of a Jena bug. The issue is triggered by the presence of a Byte Order Mark at the beginning of the file:

http://en.wikipedia.org/wiki/Byte_order_mark

See here for a nice explanation from Rob Vesse, and his related bug report:

http://www.dotnetrdf.org/blogitem.asp?blogID=37
https://issues.apache.org/jira/browse/JENA-12

In fairness, the simplest fix would be for Ivan to edit the RDF files and remove the initial byte order mark. Googling for “remove byte order mark” shows various ways of doing that.

Best,
Richard

Ivan Herman

Apr 25, 2012, 8:05:26 AM
to Richard Cyganiak, pedant...@googlegroups.com
Ah. I was wondering about something like that. Thanks.

The problem is that the RDF/XML files are generated and not edited by hand; the 'real' meat is in the HTML files which are RDFa, and I generate the Turtle and RDF/XML files through a Makefile before uploading them. I.e., it is probably done by the RDF/XML serializer of RDFLib, and that is pretty difficult to dig into...

Ivan

Richard Cyganiak

Apr 25, 2012, 8:07:28 AM
to Ivan Herman, pedant...@googlegroups.com
On 25 Apr 2012, at 13:05, Ivan Herman wrote:
> Ah. I was wondering about something like that. Thanks.
>
> The problem is that the RDF/XML files are generated and not edited by hand; the 'real' meat is in the HTML files which are RDFa, and I generate the Turtle and RDF/XML files through a Makefile before uploading them. I.e., it is probably done by the RDF/XML serializer of RDFLib, and that is pretty difficult to dig into...

An option might be to add a step to the makefile that strips the BOM. This can probably be done with a line of perl or awk or whatever.

But yeah the Right Thing to do would be to get the validator fixed.

Richard
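
A minimal sketch of the BOM-stripping step Richard mentions. He suggests perl or awk, which would be shorter; Java is used here only to match the servlet code quoted in this thread. The class name `StripBom` and the method `stripUtf8Bom` are hypothetical, and the sketch handles only the UTF-8 BOM (bytes EF BB BF).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper: removes a leading UTF-8 BOM from generated files.
final class StripBom {

    // Returns the input without a leading UTF-8 BOM (EF BB BF).
    // If no BOM is present, the same array is returned unchanged.
    static byte[] stripUtf8Bom(byte[] data) {
        boolean hasBom = data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
        return hasBom ? Arrays.copyOfRange(data, 3, data.length) : data;
    }

    // Rewrites each file named on the command line, but only if it had a BOM.
    public static void main(String[] args) throws IOException {
        for (String name : args) {
            Path path = Paths.get(name);
            byte[] original = Files.readAllBytes(path);
            byte[] stripped = stripUtf8Bom(original);
            if (stripped.length != original.length) {
                Files.write(path, stripped);
            }
        }
    }
}
```

A Makefile rule could then run something like `java StripBom.java RDF_XML.rdf` (single-file launch, Java 11 or later) after the serializer step; the file name here is just an example.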

Aidan Hogan

Apr 25, 2012, 8:26:03 AM
to pedant...@googlegroups.com, Ivan Herman
Ah yep. I have vague memories of this BOM issue, but hadn't connected it
with the (what I always thought to be infallible) RDF/XML validator.

Reminds me of the Douglas Adams quote:

"""The major difference between a thing that might go wrong and a thing
that cannot possibly go wrong is that when a thing that cannot possibly
go wrong goes wrong it usually turns out to be impossible to get at or
repair."""

Cheers,
Aidan


Andreas Radinger

Apr 25, 2012, 10:49:23 AM
to pedant...@googlegroups.com, Richard Cyganiak, Ivan Herman
Hi,

I don't think any of these files (neither .rdf nor .ttl) have a BOM at
the beginning of the file.
http://people.w3.org/rishida/utils/bomtester/index.php?filename=http%3A%2F%2Fwww.w3.org%2Fns%2Fformats%2Fdata%2FRDF_XML.rdf

The W3C RDF Validator also has no bug in dealing with RDF/XML files that
have a BOM.

Any other ideas what's going on with
http://www.w3.org/ns/formats/data/RDF_XML.rdf ?

Best,
Andreas



Damian Steer

Apr 25, 2012, 10:28:19 AM
to pedant...@googlegroups.com
On 25/04/12 12:57, Richard Cyganiak wrote:
> Aidan,
>
> Ivan's file is (pedantically speaking) fine. The error reported by
> the RDF Validator is a symptom of a Jena bug. The issue is triggered
> by the presence of a Byte Order Mark at the beginning of the file:

It's not a Jena issue (the JIRA issue concerns newer Turtle and related
parsers). Having poked around a bit I think it's an issue with the
validator servlet, which does its own character decoding, but I find the
code a bit impenetrable. [1]

Damian

[1]
<http://dev.w3.org/cvsweb/2006/RDFValidator/WEB-INF/src/org/w3c/rdfvalidator/ARPServlet.java?rev=1.6>

Aidan Hogan

Apr 25, 2012, 12:24:14 PM
to pedant...@googlegroups.com, Andreas Radinger, Richard Cyganiak, Ivan Herman
On 25/04/2012 15:49, Andreas Radinger wrote:
> I don't think any of these files (neither .rdf nor .ttl) have a BOM at
> the beginning of the file.
> http://people.w3.org/rishida/utils/bomtester/index.php?filename=http%3A%2F%2Fwww.w3.org%2Fns%2Fformats%2Fdata%2FRDF_XML.rdf

I checked the file in a Hex editor and there's an FFFE code at the start
before the content, which suggests UTF-16 little-endian [1].

@Andreas, Perhaps the tool you link only looks for the UTF-8 BOM?

And possibly the RDF/XML validator is not incorrect in this case?

Cheers,
Aidan

[1]
http://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding
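
The hex-editor check described above can be sketched in a few lines of Java. The names `BomSniffer` and `detectBom` are hypothetical. The sketch only distinguishes the UTF-8 and UTF-16 marks; a UTF-32 little-endian BOM (FF FE 00 00) would be reported as UTF-16LE, which is a known simplification.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper: reports which byte order mark, if any, starts a byte sequence.
final class BomSniffer {

    // Returns "UTF-8", "UTF-16LE", "UTF-16BE", or "none".
    static String detectBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        return "none";
    }

    // Compares the first bytes of data (as unsigned values) against the prefix.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    // Prints the BOM type of the file named as the first argument.
    public static void main(String[] args) throws IOException {
        byte[] content = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(args[0] + ": " + detectBom(content));
    }
}
```

Running this against both the editor-saved copy and a copy fetched with curl -O would show whether the mark comes from the server or from the local tooling.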

Richard Cyganiak

Apr 25, 2012, 12:40:07 PM
to Aidan Hogan, pedant...@googlegroups.com, Andreas Radinger, Ivan Herman
On 25 Apr 2012, at 17:24, Aidan Hogan wrote:
> On 25/04/2012 15:49, Andreas Radinger wrote:
>> I don't think any of these files (neither .rdf nor .ttl) have a BOM at
>> the beginning of the file.
>> http://people.w3.org/rishida/utils/bomtester/index.php?filename=http%3A%2F%2Fwww.w3.org%2Fns%2Fformats%2Fdata%2FRDF_XML.rdf
>
> I checked the file in a Hex editor and there's an FFFE code at the start before the content, which suggests UTF-16 little-endian [1].

Well, the UTF-16 BOM might just be the result of you saving the file to your Windows machine. If I download the file with curl -O, there doesn't seem to be any BOM.

Either way I can't see anything wrong with the file. Current Jena versions parse it just fine. So I'm pretty sure the Validator is broken.

Best,
Richard

Aidan Hogan

Apr 25, 2012, 1:27:11 PM
to Richard Cyganiak, pedant...@googlegroups.com, Andreas Radinger, Ivan Herman
On 25/04/2012 17:40, Richard Cyganiak wrote:
> On 25 Apr 2012, at 17:24, Aidan Hogan wrote:
>> On 25/04/2012 15:49, Andreas Radinger wrote:
>>> I don't think any of these files (neither .rdf nor .ttl) have a BOM at
>>> the beginning of the file.
>>> http://people.w3.org/rishida/utils/bomtester/index.php?filename=http%3A%2F%2Fwww.w3.org%2Fns%2Fformats%2Fdata%2FRDF_XML.rdf
>>
>> I checked the file in a Hex editor and there's an FFFE code at the start before the content, which suggests UTF-16 little-endian [1].
>
> Well, the UTF-16 BOM might just be the result of you saving the file to your Windows machine. If I download the file with curl -O, there doesn't seem to be any BOM.

It is. Apologies.

> Either way I can't see anything wrong with the file. Current Jena versions parse it just fine. So I'm pretty sure the Validator is broken.

It also validates fine for a direct (curl) copy onto another server.

http://www.w3.org/RDF/Validator/ARPServlet?URI=http%3A%2F%2Fsw.deri.org%2F~aidanh%2FRDF_XML.rdf&PARSE=Parse+URI%3A+&TRIPLES_AND_GRAPH=PRINT_TRIPLES&FORMAT=PNG_EMBED

So I guess it's the validator and my text editor that need fixing.

Cheers,
Aidan


Damian Steer

Apr 25, 2012, 11:07:07 AM
to pedant...@googlegroups.com
On 25/04/12 15:49, Andreas Radinger wrote:
> Hi,
>
> I don't think any of these files (neither .rdf nor .ttl) have a BOM at
> the beginning of the file.
> http://people.w3.org/rishida/utils/bomtester/index.php?filename=http%3A%2F%2Fwww.w3.org%2Fns%2Fformats%2Fdata%2FRDF_XML.rdf
>
> The W3C RDF Validator also has no bug in dealing with RDF/XML files that
> have a BOM.

+1.

I tried another file under ns/:

<http://www.w3.org/ns/ma-ont.rdf>

=> "Undecodable data when reading URI at byte 24574 using encoding 'UTF-8'."

And then the rdf namespace:

=> "... byte 0 ..."

But <http://people.w3.org/simon/foaf.rdf> was fine.

Hypothesis: validating RDF under the www.w3.org domain is broken.

It may be unrelated to encoding. The error is triggered by any
IOException reading characters from an input stream reader.

Damian

Damian Steer

Apr 26, 2012, 11:33:07 AM
to www-rdf-...@w3.org, pedant...@googlegroups.com
Forwarded from the pedantic-web list.

Initially this was (erroneously) reported as an issue with ARP and UTF-8 BOMs, but there's no BOM involved and ARP has never had an issue with BOMs.

It seems that validating (all?) RDF files under www.w3.org results in errors of the form:

"An attempt to load the RDF from URI 'http://www.w3.org/ns/formats/data/RDF_XML' failed. (Undecodable data when reading URI at byte 0 using encoding 'UTF-8'. Please check encoding and encoding declaration of your document.)"

But the byte value may vary, e.g. 24574 for http://www.w3.org/ns/ma-ont.rdf.

I understand that the same file (RDF_XML) validated without issue when copied to a remote server.

The code in question is presumably:

    try {// read whole file as characters
        int c;
        while ((c = isr.read()) != -1) {
            sb.append((char)c);
            bytenum++;
        }
    }
    catch (IOException e){
        throw new getRDFException("Undecodable data when reading URI at byte "+bytenum+" using encoding '"+finalCharset+"'."+" Please check encoding and encoding declaration of your document.");
    }

<http://dev.w3.org/cvsweb/2006/RDFValidator/WEB-INF/src/org/w3c/rdfvalidator/ARPServlet.java?rev=1.6>

So the issue may not be encoding, the same message being reported for any IO exception.
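
One way the two failure modes could be told apart is sketched below. This is not the validator's code: the class `StrictReader`, the method `readAll`, and the messages are hypothetical. It relies on two facts about the standard Java API: an `InputStreamReader` built from a charset name replaces malformed bytes instead of throwing, so a decoder configured with `CodingErrorAction.REPORT` is needed to see real decoding errors; and `CharacterCodingException` is a subclass of `IOException`, so it can be caught separately. Note also that the loop counts characters, not bytes, and the reported position is only approximate.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: separates "undecodable data" from other I/O failures.
final class StrictReader {

    static String readAll(InputStream in, Charset charset) throws IOException {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        StringBuilder sb = new StringBuilder();
        long charsRead = 0; // characters, not bytes; approximate at the failure point
        try (Reader reader = new InputStreamReader(in, decoder)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
                charsRead++;
            }
        } catch (CharacterCodingException e) {
            // A genuine encoding problem in the document.
            throw new IOException("Undecodable data after about " + charsRead
                    + " characters using encoding '" + charset.name() + "'", e);
        } catch (IOException e) {
            // Anything else (network reset, truncated stream, ...): keep the cause.
            throw new IOException("I/O failure after about " + charsRead
                    + " characters: " + e, e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bad = {'<', 'a', '>', (byte) 0xFF}; // 0xFF is never valid in UTF-8
        try {
            readAll(new ByteArrayInputStream(bad), StandardCharsets.UTF_8);
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Keeping the original exception as the cause means a log entry would show whether the servlet hit a decoding error or a transport-level problem.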

Thanks for your help,

Damian Steer

Begin forwarded message:

Richard Cyganiak

Apr 26, 2012, 4:11:05 PM
to www-rdf-...@w3.org, pedant...@googlegroups.com
Dear RDF Validator team,

The issue that Damian reports looks like some bizarre caching/networking/load-balancer thing in the W3C infrastructure.

1. It has nothing to do with RDF, encoding, or BOMs. It affects all URLs (incl. non-RDF files) from certain domains. The reported byte value depends on the size of the file and is always somewhere within the last couple of kilobytes.

2. The domains that don't work are www.w3.org, dev.w3.org, and all the various aliases for www.w3.org such as web4.w3.org, web5.w3.org, www-mit.w3.org, hans.w3.org, and ipv4.w3.org. I couldn't find any non-w3.org domains that exhibit the problem, but that doesn't mean they don't exist. Other w3.org subdomains like people.w3.org work fine.

3. Curiously, for all the alias subdomains listed above, I was able to validate *the first* URL successfully. After that, re-validating the same URL, or any other URL from the same domain, results in the usual error.

4. Furthermore, there is *one* URL on www.w3.org that can always be successfully validated, and that's the URL of the validator servlet itself: http://www.w3.org/RDF/Validator/ARPServlet . This is probably related to the fact that the ARPServlet doesn't run on the main W3C webserver(s), but on a separate machine at http://smithers.w3.org/servlet/ARPServlet that seems to be reverse-proxied into the www.w3.org URL space.

5. I ran the troublesome function ARPServlet.getRDFFromURI() locally on my own machine, and it can read from all the affected URLs just fine. (Assuming the link that Damian posted is the right version of ARPServlet.) There's nothing suspicious in the code; it all looks very normal and just uses standard Java APIs, with no potentially buggy libraries or anything. So I doubt the answer is in there. (It should have better error reporting for the IOException, as Damian pointed out; seeing what the actual exception is *might* reveal a clue.)

Now I hope someone at W3C can make sense of this!

Best,
Richard