Hi Lars,
Thanks for reporting this. It would be great if you could CC me on
future posts related to Lexvo.org, as I don't normally follow this
group.
> “org.xml.sax.SAXParseException: An invalid XML character (Unicode:
> 0xd800) was found in the comment.”
The issue you are facing might be a bug in whatever XML parser Protégé
relies on. Due to the multilingual nature of Lexvo.org, the RDF
dump also includes Unicode characters in the Supplementary
Multilingual Plane (e.g. for Gothic). In UTF-16, each such character is
stored as a surrogate pair, and U+D800 is only the first (high) half of
such a pair. If the RDF file is decoded properly, U+D800 should not
occur as a character on its own.
Unfortunately, dealing with such strings in Java is non-trivial. For
instance,

    new String(Character.toChars(0x10330)).matches(".*\ud800.*")

evaluates to false, as Java's regular expressions operate on code points
and therefore handle SMP characters properly.
However,

    new String(Character.toChars(0x10330)).indexOf("\ud800") != -1

evaluates to true, because String.indexOf compares individual UTF-16
code units (Java chars) rather than code points.
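
To make the difference concrete, here is a minimal sketch that puts both
checks side by side and adds a code-point-based test for unpaired
surrogates. The class name SurrogateDemo and the helper
hasUnpairedSurrogate are my own illustrative names, not part of Protégé
or of any XML parser, and the lone surrogate is built with a cast rather
than a string escape, which should be equivalent.

```java
class SurrogateDemo {
    // Hypothetical helper: true if s contains a surrogate code unit that is
    // not part of a well-formed high+low pair. String.codePointAt returns the
    // surrogate value itself when the code unit at that index is unpaired.
    public static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                return true;
            }
            i += Character.charCount(cp); // advance by 1 or 2 code units
        }
        return false;
    }

    public static void main(String[] args) {
        // U+10330 (Gothic letter ahsa), stored as the pair D800 DF30 in UTF-16.
        String gothic = new String(Character.toChars(0x10330));
        // A string holding only the high surrogate U+D800.
        String high = String.valueOf((char) 0xD800);

        System.out.println(gothic.length());                           // 2
        System.out.println(gothic.codePointCount(0, gothic.length())); // 1

        // Regex matching works on code points: the pair is not split.
        System.out.println(gothic.matches(".*" + high + ".*"));        // false

        // indexOf works on UTF-16 code units: it finds the first half.
        System.out.println(gothic.indexOf(high) != -1);                // true

        System.out.println(hasUnpairedSurrogate(gothic));                  // false
        System.out.println(hasUnpairedSurrogate(gothic.substring(0, 1)));  // true
    }
}
```

A check along these lines on the decoded strings might help narrow down
whether the surrogate pair is already broken in the input or only gets
split somewhere inside the parser.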
Best regards,
Gerard
--
Gerard de Melo [dem...@icsi.berkeley.edu]
http://www.icsi.berkeley.edu/~demelo/