SAX & UTF-8 problem


Chris

Jul 8, 2004, 4:11:25 PM
My SAX parser is choking on UTF-8 encoded files (a "document root element is
missing" error). The problem is three bytes that appear at the beginning of
each file:

0xEF 0xBB 0xBF

If I delete the bytes the problem goes away.

I'm accessing the file by using a FileInputStream and then wrapping it in a
SAX InputSource. My guess is that the InputSource is converting bytes to
chars using the platform's default encoding, rather than UTF-8.

Is there any existing InputSource class or Reader class that will
automatically detect UTF-8 and decode chars correctly? Or do I have to write
my own Reader class to do it?


Roedy Green

Jul 8, 2004, 4:36:36 PM
On Thu, 8 Jul 2004 15:11:25 -0500, "Chris" <nos...@nospam.com> wrote
or quoted :

>Is there any existing InputSource class or Reader class that will
>automatically detect UTF-8 and decode chars correctly? Or do I have to write
>my own Reader class to do it?

The approach most likely to work is a Reader with an explicit
UTF-8 encoding.

Hopefully it will just discard the signature.
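
A minimal sketch of that idea, with one caveat: Java's own UTF-8 decoder does not discard the signature. It hands it to the caller as the character U+FEFF, so the Reader has to drop it. The class name Utf8ReaderSketch and the helper openUtf8 are made up for illustration, and StandardCharsets needs Java 7 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class Utf8ReaderSketch {

    /** Opens a Reader that decodes UTF-8 and drops a leading U+FEFF if present. */
    public static Reader openUtf8(InputStream in) throws IOException {
        PushbackReader reader =
                new PushbackReader(new InputStreamReader(in, StandardCharsets.UTF_8), 1);
        int first = reader.read();
        // The decoder passes the signature through as the single character U+FEFF.
        if (first != -1 && first != 0xFEFF) {
            reader.unread(first); // not a BOM, so put the character back
        }
        return reader;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};

        // A plain UTF-8 Reader keeps the signature as its first character.
        Reader plain = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8);
        System.out.println(Integer.toHexString(plain.read())); // feff

        // The wrapped Reader starts at the first real character.
        Reader skipping = openUtf8(new ByteArrayInputStream(data));
        System.out.println((char) skipping.read()); // <
    }
}
```

The returned Reader can be handed to SAX with new InputSource(reader). Because the parser then receives characters rather than bytes, it no longer has to guess the encoding.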

--
Canadian Mind Products, Roedy Green.
Coaching, problem solving, economical contract programming.
See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.

Van Ly

Jul 9, 2004, 4:24:50 AM
"Chris" <nos...@nospam.com> wrote in news:40edabe3$1...@news.mcleodusa.net:

> My SAX parser is choking on UTF-8 encoded files (a "document root
> element is missing" error). The problem is three bytes that appear at
> the beginning of each file:
>
> 0xEF 0xBB 0xBF
>
> If I delete the bytes the problem goes away.
>

I'm getting the same problem with Javadoc. When a UTF-8 file starts with
the three bytes above (the byte order mark, or BOM), Javadoc chokes because
it treats them as illegal characters. Of course, removing the three bytes
gets Javadoc going again.

I've tried the "-encoding UTF8" and "-encoding UTF-8" options of Javadoc.
But it still bombs. Can anyone reading this help?

Thanks,
Van

Thomas Weidenfeller

Jul 9, 2004, 4:47:49 AM
Van Ly wrote:
> I've tried the "-encoding UTF8" and "-encoding UTF-8" options of Javadoc.
> But it still bombs. Can anyone reading this help?

Write a preprocessor to filter the BOM out.
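
A minimal sketch of such a preprocessor, assuming the files are small enough to read into memory and that rewriting them in place is acceptable. The class name StripBom is made up for illustration, and the java.nio.file calls need Java 7 or later.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

public class StripBom {

    /** Returns the content without a leading EF BB BF, or the same array if there is none. */
    public static byte[] strip(byte[] content) {
        boolean hasBom = content.length >= 3
                && (content[0] & 0xFF) == 0xEF
                && (content[1] & 0xFF) == 0xBB
                && (content[2] & 0xFF) == 0xBF;
        return hasBom ? Arrays.copyOfRange(content, 3, content.length) : content;
    }

    public static void main(String[] args) throws IOException {
        // Each command-line argument is treated as a file to clean.
        for (String name : args) {
            Path file = Paths.get(name);
            byte[] original = Files.readAllBytes(file);
            byte[] stripped = strip(original);
            // strip() returns the same array when no BOM was found, so a
            // reference comparison is enough to detect a change.
            if (stripped != original) {
                Files.write(file, stripped); // overwrites the file in place
                System.out.println("Removed BOM from " + file);
            }
        }
    }
}
```

It would be run over the source files before Javadoc, for example "java StripBom Foo.java Bar.java" (hypothetical file names). Since it overwrites its input, it is safer to run it on a copy of the source tree.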

If you have control over the source code (I know, you don't have it for
JavaDoc), see

http://groups.google.com/groups?selm=Xns93AE681211BEidNoMailid%40192.89.123.233

/Thomas

Thomas Weidenfeller

Jul 9, 2004, 4:49:31 AM
Chris wrote:
> Is there any existing InputSource class or Reader class that will
> automatically detect UTF-8 and decode chars correctly? Or do I have to write
> my own Reader class to do it?

We wrote one a year or so ago:

http://groups.google.com/groups?selm=Xns93AE681211BEidNoMailid%40192.89.123.233
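
In case the link above is unreachable, here is a minimal sketch of one way to do this at the byte level. It is not the class from the linked post. The names BomSkippingSax and skipUtf8Bom are made up for illustration, and it handles only the UTF-8 signature, not the UTF-16 or UTF-32 ones.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class BomSkippingSax {

    /** Wraps the stream so a leading EF BB BF is consumed and anything else is left intact. */
    public static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until
        // three bytes have arrived or the stream ends.
        while (n < 3) {
            int r = pushback.read(head, n, 3 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean isBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!isBom && n > 0) {
            pushback.unread(head, 0, n); // not a BOM: give the bytes back
        }
        return pushback;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a FileInputStream on a file that starts with the signature.
        byte[] xml = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'r', 'o', 'o', 't', '/', '>'};
        InputSource source = new InputSource(skipUtf8Bom(new ByteArrayInputStream(xml)));
        source.setEncoding("UTF-8"); // state the encoding explicitly
        SAXParserFactory.newInstance().newSAXParser().parse(source, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes attributes) {
                System.out.println("element: " + qName);
            }
        });
    }
}
```

Working on bytes rather than chars means the parser still does its own decoding, so an encoding declaration inside the XML file keeps working for files that have no BOM.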

/Thomas

Chris

Jul 10, 2004, 2:29:37 PM

Thanks. That worked perfectly.

