LEXVO - Invalid XML character (Unicode: 0xd800)


LarsB

Oct 13, 2011, 2:18:01 PM
to Pedantic Web Group
Hey Gerard,

Opening lexvo_2011-05-12.rdf in Protégé 4.1 produces the following
fatal error:

“org.xml.sax.SAXParseException: An invalid XML character (Unicode:
0xd800) was found in the comment.”

U+D800 is a high-surrogate code point. Surrogates (U+D800 to U+DFFF)
are not legal characters in XML documents.

Reference: http://xmlconf.sourceforge.net/xml/reports/report-xerces-jnv.html
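For illustration, a minimal sketch of the character ranges that the XML 1.0 "Char" production allows, which is roughly the kind of check a parser applies here. The class and method names are hypothetical and are not taken from Xerces or Protégé.

```java
// Sketch only: XmlCharCheck and isXmlChar are hypothetical names.
class XmlCharCheck {

    /** Returns true if the code point is allowed by the XML 1.0 "Char" production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD    // tab, line feed, carriage return
            || (cp >= 0x20 && cp <= 0xD7FF)           // BMP below the surrogate block
            || (cp >= 0xE000 && cp <= 0xFFFD)         // BMP above the surrogate block
            || (cp >= 0x10000 && cp <= 0x10FFFF);     // supplementary planes
    }

    public static void main(String[] args) {
        System.out.println(isXmlChar(0xD800));  // false: high surrogate
        System.out.println(isXmlChar(0x10330)); // true: a Gothic letter in the SMP
    }
}
```

Note that a supplementary character such as U+10330 is legal, while the surrogate code points used to encode it in UTF-16 are not legal on their own.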

Regards,
LarsB

Gerard de Melo

Nov 11, 2011, 5:00:55 AM
to Pedantic Web Group
Hi Lars,

Thanks for reporting this. It would be great if you could CC me in
future posts related to Lexvo.org, as I don't normally follow this
group.

> “org.xml.sax.SAXParseException: An invalid XML character (Unicode:
> 0xd800) was found in the comment.”

The issue you are facing might be a bug in whatever XML parser Protégé
is relying on. Due to the multilingual nature of Lexvo.org, the RDF
dump also includes Unicode characters in the Supplementary
Multilingual Plane (e.g. for Gothic). If the RDF file is decoded
properly, U+D800 should not occur as a character.

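Unfortunately, dealing with such strings in Java is non-trivial, because
a Java String is a sequence of UTF-16 code units and an SMP character
occupies two of them (a surrogate pair). For instance,
new String(Character.toChars(0x10330)).matches(".*\ud800.*")
evaluates to false, as Java's regular expressions match by code point
and so treat the surrogate pair as a single SMP character.
However,
new String(Character.toChars(0x10330)).indexOf("\ud800") != -1
evaluates to true, as indexOf compares individual UTF-16 code units and
finds the high surrogate of the pair.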
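
To make the distinction concrete, here is a minimal sketch that puts the two expressions side by side and adds a check for surrogates that are not part of a valid pair, which is what would actually make a decoded string invalid. The class name SurrogateDemo and the helper hasUnpairedSurrogate are hypothetical and only serve this illustration.

```java
// Sketch only: SurrogateDemo and hasUnpairedSurrogate are hypothetical names.
class SurrogateDemo {

    /** Returns true if s contains a high or low surrogate outside a valid pair. */
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                // A high surrogate must be followed immediately by a low surrogate.
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low half of the valid pair
                } else {
                    return true;
                }
            } else if (Character.isLowSurrogate(c)) {
                // A low surrogate reached here has no preceding high surrogate.
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // U+10330 (Gothic letter) is stored as two UTF-16 code units.
        String gothic = new String(Character.toChars(0x10330));

        System.out.println(gothic.length());                           // 2 code units
        System.out.println(gothic.codePointCount(0, gothic.length())); // 1 code point

        // Code-unit search: finds the high half of the pair.
        System.out.println(gothic.indexOf("\ud800") != -1);            // true

        // Regex search by code point, as in the expression quoted above.
        System.out.println(gothic.matches(".*\ud800.*"));

        // The properly paired string is fine; a lone surrogate is not.
        System.out.println(hasUnpairedSurrogate(gothic));              // false
        System.out.println(hasUnpairedSurrogate("\ud800"));            // true
    }
}
```

A check along these lines treats the pair as one unit, so it only flags a string when a surrogate really does stand alone.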

Best regards,
Gerard

--
Gerard de Melo [dem...@icsi.berkeley.edu]
http://www.icsi.berkeley.edu/~demelo/