Umlaut and other special characters in XML

Bernhard Messerer

Dec 17, 2000, 6:39:55 AM
Hi all!

I have a problem with XML: AFAIK XML allows umlauts and other special
characters such as sharp s (&szlig; &ouml; etc.). However, if I try to write
something like this into the XML document, I get this exception:

org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x1addac)
was found in the element content of the document.

at MyXML.deserialize(MyXML.java:204)

at TestClass.main(TestClass.java:80)

Although the exception occurs when deserializing the document, the error
is in the serializing part: IE5 and NS6 cannot read the file either, and
both show the umlaut as "?", even though the file, viewed as plain text,
correctly contains the umlaut string.
I don't know why this happens. I'm using XML4J 3.1.0 (IBM alphaWorks), which
is based on Apache Xerces. The encoding of the file is "UTF-8" (the default;
UTF-16 doesn't seem to be supported). Serialization is done by this code,
where d is the document:

OutputFormat format = new OutputFormat(d);
format.setLineSeparator("\n");
Writer out = new StringWriter();
XMLSerializer serial = new XMLSerializer(out, format);
try {
    serial.asDOMSerializer();
    serial.serialize(d);
} catch (IOException ex) {
    throw new XMLException("An error occurred while serializing the document, "
        + "original exception was:\n" + ex.toString());
}
String s = out.toString();

Please help asap, any comments appreciated and thanks in advance

Messi

Bernhard Messerer

Dec 17, 2000, 9:43:28 AM
Ah, just an addendum: the "if I write something like this..." doesn't mean
&szlig; etc. but the umlaut itself. I just wrote szlig etc. because I do not
know which encoding you have installed and whether ß, ä, ö, ü will show correctly.

Greetings

Messi


Henrik Motakef

Dec 17, 2000, 10:49:45 AM
Bernhard Messerer <bmes...@wsop.at> wrote:
> encoding of the file is "UTF-8"

It should work with "ISO-8859-1".

hth
Henrik

--
What can be said at all can be said clearly;
and whereof one cannot speak, thereof one must be silent.
-- L. Wittgenstein

Bjoern Hoehrmann

Dec 17, 2000, 3:46:16 PM
* Bernhard Messerer wrote in comp.text.xml:

>I have a problem with XML: AFAIK XML allows umlaut and other special
>characters as sharp s (&szlig; &ouml; etc.). However, if I try to write
>something like this into the XML-document I get this exception
>
>org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x1addac)
>was found in the element content of the document.

>Although the exception occurs when deserializing the document, the error
>is in the serializing part, because e.g. IE5 and NS6 cannot read the file,
>and the umlaut is replaced by "?" when viewing in NS6 and IE5, the plain
>text file correctly contains the umlaut-string.

ISO-8859-1: ö -> 0xF6
UTF-8: ö -> 0xC3 0xB6

Use <?xml version='1.0' encoding="iso-8859-1"?> and see it works. Your
files are not properly UTF-8 encoded.
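
The byte values above can be checked with a few lines of Java. This is only a minimal sketch; the class name UmlautBytes and the helper hex are made up for illustration. It prints the bytes that ö and ü turn into under each encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class UmlautBytes {
    // Hypothetical helper: the bytes of `text` in `charset`, as space-separated hex.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Unicode escapes are used so the encoding of this source file does not matter.
        String[] samples = { "\u00F6", "\u00FC" }; // ö, ü
        for (String s : samples) {
            System.out.println(s + " ISO-8859-1: " + hex(s, StandardCharsets.ISO_8859_1));
            System.out.println(s + " UTF-8:      " + hex(s, StandardCharsets.UTF_8));
        }
    }
}
```

A file that declares UTF-8 but contains the single byte 0xF6 for ö is not valid UTF-8, so a parser that trusts the declaration cannot decode it.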
--
Björn Höhrmann ^ mailto:bjo...@hoehrmann.de ^ http://www.bjoernsworld.de
am Badedeich 7 ° Telefon: +49(0)4667/981ASK ° http://bjoern.hoehrmann.de
25899 Dagebüll # PGP Pub. KeyID: 0xA4357E78 # http://learn.to/quote [!]e
"It may be those who do most, dream most." -- Stephen Leacock

Bernhard Messerer

Dec 18, 2000, 4:23:42 AM
Well, I just don't like using ISO-8859-1... because I read somewhere that
it is "discouraged"...
It should be possible with UTF-8 too, shouldn't it? After all, that's Unicode!

cu

Messi

"Henrik Motakef" <henrik....@ruhr-uni-bochum.de> wrote in message
news:pani19...@adorno.iaw.ruhr-uni-bochum.de...

Henrik Motakef

Dec 18, 2000, 5:43:54 AM
Bernhard Messerer <bmes...@wsop.at> wrote:
> Well, I just don't like using ISO-8859-1... because I read somewhere that
> it is "discouraged"...

Where did you read that?

> It should be possible with UTF-8 too, shouldn't it? After all, that's Unicode!

If your document is encoded in UTF-8, you can declare UTF-8 as the
encoding. For an ISO-8859-1-encoded document, ISO-8859-1 is the
better choice.
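
The point that the declared encoding has to match the bytes actually written can be sketched in Java. This sketch does not use the XML4J/Xerces XMLSerializer from the original post; it swaps in the standard javax.xml.transform API so that it runs on a plain JDK. The class name EncodingRoundTrip, the element name greeting, and the sample text are made up for illustration. Note that the original code serializes into a StringWriter, and a String has no byte encoding yet; how that string was later written to the file is not shown in the thread, so whether that step caused the mismatch is only an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class EncodingRoundTrip {
    // Serialize to a byte stream, so the XML declaration and the bytes
    // are produced by the same component with the same encoding.
    static byte[] serialize(Document doc, String encoding) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    // Parse from bytes; the parser picks the decoder from the XML declaration.
    static Document parse(byte[] xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new ByteArrayInputStream(xml));
    }

    // Build a one-element document, serialize it, parse it back, return the text.
    static String roundTrip(String text, String encoding) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("greeting"); // hypothetical element name
        root.setTextContent(text);
        doc.appendChild(root);
        return parse(serialize(doc, encoding)).getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String text = "Gr\u00FC\u00DFe, sch\u00F6n"; // "Grüße, schön"
        System.out.println("UTF-8 ok:      " + text.equals(roundTrip(text, "UTF-8")));
        System.out.println("ISO-8859-1 ok: " + text.equals(roundTrip(text, "ISO-8859-1")));
    }
}
```

Either encoding works as long as the serializer itself writes the bytes. If the serialized text has to pass through a String first, the Writer that finally writes it to disk needs to be created with the same encoding that the declaration names.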

Andreas Popper

Dec 19, 2000, 5:06:18 PM

Try this header:
<?xml version="1.0" encoding="ISO-8859-1"?>

geronimo
