Java: Character conversion error: Parser doesn't recognize XML's encoding attribute

Michael Bellinger

Jan 2, 2003, 12:51:52 PM

Hi, I'm getting the following exception:

org.xml.sax.SAXParseException: Character conversion error: "UTF-8 encoding of character 0x001a7b29 can't be converted to Unicode." (line number may be too low)
    at org.apache.crimson.parser.InputEntity.fatal(InputEntity.java:1100)
    at org.apache.crimson.parser.InputEntity.fillbuf(InputEntity.java:1072)
    at org.apache.crimson.parser.InputEntity.isXmlDeclOrTextDeclPrefix(InputEntity.java:914)
    at org.apache.crimson.parser.Parser2.maybeXmlDecl(Parser2.java:1009)
    at org.apache.crimson.parser.Parser2.parseInternal(Parser2.java:486)
    at org.apache.crimson.parser.Parser2.parse(Parser2.java:305)
    at org.apache.crimson.parser.XMLReaderImpl.parse(XMLReaderImpl.java:442)
    at org.apache.crimson.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:185)
    at Test.main(Test.java:51)

when trying to invoke the following code:

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(true);
factory.setIgnoringComments(true);
factory.setIgnoringElementContentWhitespace(true);
DocumentBuilder builder = factory.newDocumentBuilder();
// Only a system ID (the request URI) is passed, so the parser fetches the bytes itself.
Document document = builder.parse(new InputSource("request URI here"));
/* I tried several alternatives here. None worked except
   InputStream -> BufferedReader -> StringBuffer -> StringReader -> InputSource,
   which somehow seems to do the conversion correctly internally. */
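
The Reader-based workaround mentioned in the comment can be sketched roughly as follows. This is a minimal sketch, not the code from the original post: the class name ExplicitCharsetParse and the inline sample document are made up for illustration, and it assumes the caller already knows the real byte encoding. When an InputSource carries a character stream, SAX parsers read that stream directly and disregard the encoding declaration, which is probably why this path worked.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: decode the bytes ourselves, then hand characters to the parser.
class ExplicitCharsetParse {

    static Document parse(InputStream in, Charset charset) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // The Reader fixes the byte-to-character conversion before the parser sees anything.
        Reader reader = new InputStreamReader(in, charset);
        return builder.parse(new InputSource(reader));
    }

    public static void main(String[] args) throws Exception {
        // Sample bytes in ISO-8859-1: the umlaut is the single byte 0xF6.
        byte[] bytes = "<query><annotations>Nicht m\u00f6glich!!</annotations></query>"
                .getBytes(StandardCharsets.ISO_8859_1);
        Document doc = parse(new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1);
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

The trade-off is that the caller, not the document, now decides the encoding, so a wrong guess silently produces wrong characters instead of a parse error.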

I'm trying to parse an XML document (Crimson parser, Tomcat 4.06) which
contains special characters.

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE query SYSTEM "http://bla.test.com/query.dtd">
<query><annotations>Nicht möglich!!
\n</annotations></query>

The file to be read with Java is generated using ASP on an IIS/NT4 box. The
response's content type and charset are set properly there, as the HTTP
header shows:

HTTP/1.1 200 OK
Server: Microsoft-IIS/4.0
Date: Thu, 02 Jan 2003 17:12:40 GMT
Content-Type: text/xml; Charset=ISO-8859-1
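
For anyone who wants to look at the charset parameter on the Java side instead of leaving it to the parser, here is a small sketch of pulling it out of a Content-Type value. The class and method names are made up for illustration. Parameter names in HTTP headers are case-insensitive, so the "Charset=" spelling sent by IIS above is treated the same as "charset=".

```java
import java.util.Locale;

// Hypothetical helper: extract the charset parameter from a Content-Type header value.
class ContentTypeCharset {

    /** Returns the charset parameter, or null if the header has none. */
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0) {
                continue;
            }
            String name = param.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            if (name.equals("charset")) {
                String value = param.substring(eq + 1).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // prints ISO-8859-1
        System.out.println(charsetOf("text/xml; Charset=ISO-8859-1"));
    }
}
```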

The Crimson parser does not seem to recognize the document's correct encoding
(ISO-8859-1); it apparently assumes UTF-8. When it tries to convert 0x001a7b29
(who the hell knows which char that is) into Unicode, the parser complains.
Any hints?
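
The value 0x001a7b29 can be reproduced by hand. Take the four ISO-8859-1 bytes of "ögli" from "möglich" (F6 67 6C 69) and combine them with the 4-byte UTF-8 bit layout, without checking that the last three bytes are valid continuation bytes. That the parser skipped this check is an assumption, not something confirmed from the Crimson sources, but the arithmetic matches exactly. The result lies above U+10FFFF, so it cannot be a Unicode character. The class name below is made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: misreading ISO-8859-1 bytes as one 4-byte UTF-8 sequence.
class Utf8Misdecode {

    // Applies the 4-byte UTF-8 layout (3 + 6 + 6 + 6 payload bits)
    // WITHOUT validating that b1..b3 are continuation bytes (10xxxxxx).
    static int decodeFourBytesUnchecked(int b0, int b1, int b2, int b3) {
        return ((b0 & 0x07) << 18)
             | ((b1 & 0x3F) << 12)
             | ((b2 & 0x3F) << 6)
             |  (b3 & 0x3F);
    }

    public static void main(String[] args) {
        // "ögli" in ISO-8859-1 is the byte sequence F6 67 6C 69.
        byte[] latin1 = "\u00f6gli".getBytes(StandardCharsets.ISO_8859_1);
        int value = decodeFourBytesUnchecked(latin1[0], latin1[1], latin1[2], latin1[3]);
        // prints 0x001a7b29
        System.out.printf("0x%08x%n", value);
        // prints true: the value is beyond the Unicode range
        System.out.println(value > Character.MAX_CODE_POINT);
    }
}
```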

Thanks a lot!

Michael

alexj

Jan 2, 2003, 1:07:22 PM

This is a well-known problem caused by the accented characters you use.
XSLT parsers recognize only UTF-8 and UTF-16.
Try to save your documents in these formats.

Cheers.
--
Alexandre Jaquet

Julian F. Reschke

Jan 2, 2003, 6:16:02 PM

"alexj" <al...@freesurf.ch> wrote in message
news:3e147fe2$0$332$5402...@news.sunrise.ch...

> This is a well-known problem caused by the accented characters you use.
> XSLT parsers recognize only UTF-8 and UTF-16.

1) There is no such thing as an "XSLT parser". There are XML parsers and
XSLT processors. XSLT processors consume the data provided by an XML parser.

2) XML parsers MUST understand UTF-8 and UTF-16 and MAY understand other
encodings. If they don't understand the declared encoding, they should
reject the document, not proceed using the wrong encoding.

3) I'm confident that Crimson *does* support ISO-8859-1.
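
Point 2 can be checked with a small experiment: the same ISO-8859-1 bytes, once with and once without an encoding declaration. This is only a sketch against whatever JAXP parser the running JDK provides, not against Crimson, and the class name is made up. Without a declaration (and without a byte order mark) the parser has to assume UTF-8, where the lone byte F6 followed by ASCII is not a valid sequence.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXException;

// Hypothetical demo: same bytes, with and without an encoding declaration.
class DeclaredEncodingDemo {

    // Returns true if the bytes parse, false if the parser rejects them.
    static boolean parses(byte[] xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            builder.parse(new ByteArrayInputStream(xml));
            return true;
        } catch (SAXException | IOException e) {
            // Depending on the parser, a bad byte sequence surfaces as either type.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String body = "<query>Nicht m\u00f6glich</query>";
        byte[] declared = ("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>" + body)
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] undeclared = body.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("declared ISO-8859-1: " + parses(declared));
        System.out.println("no declaration:      " + parses(undeclared));
    }
}
```

The default error handler may also print a fatal-error line to stderr for the rejected input.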

> Try to save your documents in these formats.

If you re-read the original posting, you'll see that the problem is in
parsing a document accessed using HTTP GET.

Julian

Julian F. Reschke

Jan 2, 2003, 6:17:13 PM

Interesting.

Any chance that the problem is in fact in the DTD that gets referenced?

"Michael Bellinger" <michael....@dresdner-bank.com> wrote in message
news:av1t4i$86c...@news-1.bank.dresdner.net...

alexj

Jan 2, 2003, 8:29:44 PM

"Julian F. Reschke" <res...@muenster.de> wrote in message news:
av2ha7$b9e13$1...@ID-98527.news.dfncis.de...

> "alexj" <al...@freesurf.ch> wrote in message
> news:3e147fe2$0$332$5402...@news.sunrise.ch...
> > This is a well-known problem caused by the accented characters you use.
> > XSLT parsers recognize only UTF-8 and UTF-16.
>
> 1) There is no such thing as an "XSLT parser". There are XML parsers and
> XSLT processors. XSLT processors consume the data provided by an XML parser.

"org.apache.crimson.parser"

> 2) XML parsers MUST understand UTF-8 and UTF-16 and MAY understand other
> encodings. If they don't understand the declared encoding, they should
> reject the document, not proceed using the wrong encoding.

"encoding of character 0x001a7b29 can't be converted to Unicode"

Do you really think this is an HTTP error? It doesn't look like one.

--
Alexandre Jaquet

alexj

Jan 2, 2003, 8:39:43 PM

I'm sure this is ö.

"How special characters are handled: if the document includes a special
character that the XSLT processor cannot represent in the output, that
character has to be written as a character reference, or the XSLT
processor has to report an error."
--
Alexandre Jaquet

"alexj" <al...@freesurf.ch> wrote in message news:
3e14e791$0$337$5402...@news.sunrise.ch...

Julian F. Reschke

Jan 3, 2003, 4:14:28 AM

But the message comes from the XML parser, not the XSLT processor, right?

"alexj" <al...@freesurf.ch> wrote in message
news:3e14e9e8$0$331$5402...@news.sunrise.ch...

Michael Bellinger

Jan 3, 2003, 6:44:58 AM

Hi,

Crimson and Xerces behave the same way. Previously I suppressed the parser's
warnings. Now I see that the parser says (message translated from German):

"org.xml.sax.SAXParseException: Declared encoding "ISO-8859-1" does not
match the actual encoding "UTF-8"; possibly not an error."

That means that even though the document contains the correct encoding
attribute (ISO-8859-1), the parser insists that the document is UTF-8.
Reading the document via HTTP GET at byte level and saving it to a file
confirms that there are no multi-byte characters or anything else that would
suggest UTF-8. Maybe it really is a property of the HTTP response (e.g. the
content type) or the referenced DTD that leads the parser to use UTF-8 for
the conversion.
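
A byte-level check like the one described above can also be done in code rather than by eye. The sketch below asks a strict UTF-8 decoder whether the saved bytes are well-formed UTF-8. The class name and the sample strings are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: test whether a byte array is well-formed UTF-8.
class Utf8Check {

    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // Fail on bad input instead of substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = "Nicht m\u00f6glich".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = "Nicht m\u00f6glich".getBytes(StandardCharsets.UTF_8);
        System.out.println(isValidUtf8(latin1)); // prints false (lone byte 0xF6)
        System.out.println(isValidUtf8(utf8));   // prints true  (0xC3 0xB6)
    }
}
```

Pure ASCII passes this check under both encodings, so it only tells the two apart when the data contains bytes above 0x7F.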

I'm going to download the Xerces sources and try to debug it.

Michael

"Julian F. Reschke" <res...@muenster.de> wrote in message
news:av2ha7$b9e13$1...@ID-98527.news.dfncis.de...

Michael Bellinger

Jan 3, 2003, 8:37:58 AM

Hi,

we finally found the solution: the problem was on the ASP side, where the
document was generated using MSDOM. Somehow the document was tagged as UTF-8,
but special characters weren't properly encoded as UTF-8 at the byte level.
For example, the German umlaut ö was written as the single byte hex F6. We
then changed the encoding attribute in the generated XML from ISO-8859-1 to
UTF-8 and did some other codepage-related stuff on the IIS/NT box, and the
umlaut was finally written as the UTF-8 compliant two-byte sequence hex C3 B6
and the conversion worked. Setting the HTTP response's properties (MIME type
and charset) on the ASP side had no impact on the solution.
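
The two byte patterns mentioned above are easy to reproduce in Java. The class name is made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: the umlaut ö in ISO-8859-1 versus UTF-8.
class UmlautBytes {

    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative byte values format as two digits.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String umlaut = "\u00f6"; // ö
        System.out.println(hex(umlaut.getBytes(StandardCharsets.ISO_8859_1))); // prints F6
        System.out.println(hex(umlaut.getBytes(StandardCharsets.UTF_8)));      // prints C3 B6
    }
}
```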

Bill sucks - Java rules!

Thanks,

Michael

"Michael Bellinger" <michael....@dresdner-bank.com> wrote in message
news:av3s0i$87s...@news-1.bank.dresdner.net...