UTF-8 encoding error in umbel.n3


Ian Dickinson

Sep 22, 2008, 7:18:07 AM
to UMBEL
Hi,
In Umbel release 0.71 the file umbel.n3 has an encoding error that
makes the resulting text invalid UTF-8, and thus unparseable. The
problem is this literal:

skos:note """Subject concepts are a special kind of concept: namely,
ones that are concrete, subject-related and non-abstract. Note in
other systems or ontologies, similar constructs may alternatively be
called topics, subjects, concepts or perhaps interests. UMBEL has
adopted the term subject concept to distinguish from these uses, which
have different nuances of meaning and use, as well as to highlight the
subject or topic nature of UMBELís concrete concepts.


Note the accented i between UMBEL and the letter s. As stored in the
file, the bytes for this character are not a legal UTF-8 sequence, and
this causes the parser to fail in a way that is hard to report well
(so I'm told). So two suggestions: first, this file should be
corrected; second, you might want to consider adding some sort of
charset validation step to your release cycle.

Regards,
Ian

Ian Dickinson
HPLabs, Bristol, UK

Ian Dickinson

Sep 22, 2008, 8:41:11 AM
to UMBEL
Likewise, the concept Function_Denotational in
umbel_abstract_concepts.n3 has a whole load of ill-formed characters
in the skos:definition.

Ian

Mike Bergman

Sep 22, 2008, 9:13:38 AM
to umbel-o...@googlegroups.com
Hi Ian,
I heartily agree that this check should be applied as best
practice. Any suggestions for utilities or services that do just
that?

Thanks, Mike

Frederick Giasson

Sep 22, 2008, 10:23:50 AM
to umbel-o...@googlegroups.com
Hi Ian,


Sorry about that. Fixed now.


Take care,


Fred

Frederick Giasson

Sep 22, 2008, 10:26:52 AM
to umbel-o...@googlegroups.com
Hi Ian,

> Likewise, the concept Function_Denotational in
> umbel_abstract_concepts.n3 has a whole load of ill-formed characters
> in the skos:definition.
>

I made sure that the procedures that create these files convert non-UTF-8
characters to UTF-8 (the default charset used for UMBEL is UTF-8).

Sorry about these encoding issues; since much of the information comes from
different places, charsets quickly become mixed. The goal, now and in
the future, is to make sure everything is converted to UTF-8 first. It is
possible that such errors exist elsewhere, so please report any
future issues and I will quickly fix them (both the files and the procedures
that create them).
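
A conversion step of this kind could look roughly like the following Python sketch. It is not the procedure actually used for UMBEL: the function name to_utf8 and the choice of Windows-1252 as the fallback charset are assumptions made only for illustration. Lines that already decode as UTF-8 pass through untouched; any other line is reinterpreted in the fallback charset and re-encoded.

```python
def to_utf8(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Return `raw` with every non-UTF-8 line re-encoded as UTF-8.

    Hypothetical helper: `fallback` is a guess at the legacy charset of the
    source data (Windows-1252 here), not something stated in this thread.
    """
    out = []
    for line in raw.splitlines(keepends=True):
        try:
            line.decode("utf-8")  # already valid UTF-8: keep the bytes as-is
            out.append(line)
        except UnicodeDecodeError:
            # Reinterpret the whole line in the fallback charset; bytes that
            # are undefined there become U+FFFD instead of raising.
            text = line.decode(fallback, errors="replace")
            out.append(text.encode("utf-8"))
    return b"".join(out)
```

Working line by line keeps one bad line from forcing a reinterpretation of the whole file, which matters when charsets are mixed. The limitation is that a line containing both valid multi-byte UTF-8 and a stray legacy byte will have its valid characters mangled, so the output is still worth a manual check.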

Thanks!


Take care,


Fred

Ian Dickinson

Sep 22, 2008, 10:28:53 AM
to UMBEL
Hi Mike,
I'm still looking for a good utility. I've tried:

http://bolek.techno.cz/UTF8-Validator/

but it only seems to report a binary valid/invalid verdict per file,
whereas what is actually needed is some diagnostics of what to
fix! It also runs on MS Windows only. There's a validation service at
W3C, but that seems to be down at the moment. I'm going to ask
around for more suggestions.
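
For per-position diagnostics of this kind, a short Python script is one portable option. The following is a minimal sketch, not a tool mentioned in this thread: the function name find_invalid_utf8 and the sample bytes (a stray 0x92 standing in for the bad character) are assumptions for illustration. It records the line, byte column, and raw bytes of every sequence that fails to decode as UTF-8.

```python
def find_invalid_utf8(data: bytes):
    """Return (line, column, bad_bytes) for each invalid UTF-8 sequence.

    Lines and columns are 1-based; the column counts bytes, not characters.
    """
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the data is valid
        except UnicodeDecodeError as err:
            # err.start / err.end are relative to the slice we decoded
            start = pos + err.start
            end = pos + err.end
            line = data.count(b"\n", 0, start) + 1
            line_start = data.rfind(b"\n", 0, start) + 1
            problems.append((line, start - line_start + 1, data[start:end]))
            pos = end  # resume scanning just after the bad bytes
    return problems


# Hypothetical sample: 0x92 is an assumed stand-in for the offending byte.
sample = b"first line\nUMBEL\x92s concrete concepts.\n"
print(find_invalid_utf8(sample))  # prints [(2, 6, b'\x92')]
```

Re-slicing after every error keeps the code short but is slow on a large file with many errors; for a release check that normally expects zero problems, that trade-off seems acceptable.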

Regards,
Ian

Ian Dickinson

Sep 22, 2008, 12:28:19 PM
to umbel-o...@googlegroups.com
Well, I asked around and got a good answer: use GNU iconv. For
example, on Linux, I did:

[ontologies] $ iconv umbel_abstract_concepts.n3 -o /dev/null
iconv: illegal input sequence at position 630884

To find the line number of that location, I used:

[ontologies] $ head -c 630884 umbel_abstract_concepts.n3 | wc -l

Not sure if there's a version of iconv available for other platforms
(I would hope so, but a quick google for a cygwin version didn't
reveal an obvious answer).
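
Where iconv is not available, the offset-to-line step can be done in a few lines of Python. This is a sketch under two assumptions: the position iconv reports is a byte offset counted from the start of the file, and the helper name locate_offset is made up for illustration. Note that counting newlines before the offset, as wc -l does, gives a number one less than the line that holds the bad byte, so the function adds one.

```python
def locate_offset(path, offset):
    """Return (line_number, raw_line) for the byte at `offset` in `path`.

    Hypothetical helper; `offset` is assumed to be a 0-based byte offset
    such as the position printed by iconv. Line numbers are 1-based.
    """
    with open(path, "rb") as f:
        data = f.read()
    # Newlines strictly before the offset, plus one, is the line number.
    line_number = data.count(b"\n", 0, offset) + 1
    line_start = data.rfind(b"\n", 0, offset) + 1
    line_end = data.find(b"\n", offset)
    if line_end == -1:  # last line has no trailing newline
        line_end = len(data)
    return line_number, data[line_start:line_end]


# Example call (file name and offset taken from the iconv run above):
#   locate_offset("umbel_abstract_concepts.n3", 630884)
```

Returning the raw bytes of the line, rather than decoded text, avoids tripping over the very sequence being hunted for.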

Regards,
Ian

Ian Dickinson

Sep 22, 2008, 12:32:05 PM
to umbel-o...@googlegroups.com
On Mon, Sep 22, 2008 at 3:26 PM, Frederick Giasson <fr...@fgiasson.com> wrote:
> I made sure that the procedures that create these files convert non-UTF-8
> characters to UTF-8 (the default charset used for UMBEL is UTF-8).

Thanks for the quick response, Fred! I think UTF-8 is also the standard
charset for N3/Turtle files.

> Sorry about these encoding issues; since much of the information comes from
> different places, charsets quickly become mixed.

No problem, I understand. And, as you've already shown, once you know
about the issue it's not so hard to fix.

> Thanks!
You're welcome.

Ian

Mike Bergman

Sep 22, 2008, 12:32:36 PM
to umbel-o...@googlegroups.com
Great, Ian, thanks.

I'm sure Fred will add this to his bag of tricks. If we find any
others, we will post a notice here, too.

Thanks, Mike
