My SAX parser is choking on UTF-8 encoded files (a "document root
element is missing" error). The problem is three bytes that appear at
the beginning of each file:

0xEF 0xBB 0xBF
If I delete the bytes the problem goes away.
I'm opening the file with a FileInputStream and then wrapping it in a
SAX InputSource. My guess is that the InputSource is converting bytes to
chars using the platform's default encoding rather than UTF-8.
Is there any existing InputSource class or Reader class that will
automatically detect UTF-8 and decode chars correctly? Or do I have to write
my own Reader class to do it?
>Is there any existing InputSource class or Reader class that will
>automatically detect UTF-8 and decode chars correctly? Or do I have to write
>my own Reader class to do it?
The approach with the best chance of success is a Reader with an
explicit UTF-8 encoding.
With luck it will simply discard the signature.
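
A minimal sketch of that approach (the BomSkippingReader class name is
made up for illustration). Note that a plain InputStreamReader tends to
hand the BOM through as the character U+FEFF rather than discarding it,
so the sketch skips a leading U+FEFF explicitly before handing the
Reader to SAX:

import java.io.*;
import org.xml.sax.InputSource;

public class BomSkippingReader {
    // Wraps a byte stream in a UTF-8 Reader and consumes a leading
    // BOM (U+FEFF) if one is present, so the SAX parser never sees it.
    public static Reader open(InputStream in) throws IOException {
        PushbackReader reader =
            new PushbackReader(new InputStreamReader(in, "UTF-8"), 1);
        int first = reader.read();
        if (first != -1 && first != '\uFEFF') {
            reader.unread(first); // not a BOM, so put it back
        }
        return reader;
    }

    public static void main(String[] args) throws Exception {
        InputSource source =
            new InputSource(open(new FileInputStream(args[0])));
        // hand 'source' to the SAX parser as usual
    }
}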
--
Canadian Mind Products, Roedy Green.
Coaching, problem solving, economical contract programming.
See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
> My SAX parser is choking on UTF-8 encoded files (a "document root
> element is missing" error). The problem is three bytes that appear at
> the beginning of each file:
>
> 0xEF 0xBB 0xBF
>
> If I delete the bytes the problem goes away.
>
I'm getting the same problem with Javadoc. With the three bytes above
(the BOM, or byte order mark) at the start of a UTF-8 file, Javadoc
chokes because it considers the three bytes illegal characters. Of
course, removing the three bytes gets Javadoc going again.
I've tried Javadoc's "-encoding UTF8" and "-encoding UTF-8" options,
but it still bombs. Can anyone help?
Thanks,
Van
Write a pre-processor to filter the BOM out.
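
A minimal sketch of such a pre-processor (the StripBom class name and
the copy-to-a-second-file design are made up for illustration); it
copies a file byte for byte, dropping a leading 0xEF 0xBB 0xBF if
present:

import java.io.*;

public class StripBom {
    public static void main(String[] args) throws IOException {
        InputStream in =
            new BufferedInputStream(new FileInputStream(args[0]));
        OutputStream out =
            new BufferedOutputStream(new FileOutputStream(args[1]));

        // Read up to the first three bytes.
        byte[] head = new byte[3];
        int n = 0;
        int b;
        while (n < 3 && (b = in.read()) != -1) {
            head[n++] = (byte) b;
        }

        // Pass them through unless they are exactly the UTF-8 BOM.
        boolean isBom = n == 3
            && (head[0] & 0xFF) == 0xEF
            && (head[1] & 0xFF) == 0xBB
            && (head[2] & 0xFF) == 0xBF;
        if (!isBom && n > 0) {
            out.write(head, 0, n);
        }

        // Copy the rest of the file unchanged.
        while ((b = in.read()) != -1) {
            out.write(b);
        }
        in.close();
        out.close();
    }
}

Run it over each source file before invoking Javadoc.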
If you have control over the source code (I know you don't for
Javadoc), see
http://groups.google.com/groups?selm=Xns93AE681211BEidNoMailid%40192.89.123.233
/Thomas
We wrote one a year or so ago:
http://groups.google.com/groups?selm=Xns93AE681211BEidNoMailid%40192.89.123.233
/Thomas
Thanks. That worked perfectly.