May I know if there are any possible solutions to detect the encoding or
character set (charset) of a file automatically? Second, how do I
convert a particular encoding to Unicode once the file encoding is
detected?
Thanks in advance.
--
regards,
Simon
> Hi all,
>
> May I know if there are any possible solutions to detect the encoding or
> character set (charset) of a file automatically? Second, how do I
> convert a particular encoding to Unicode once the file encoding is
> detected?
AFAIK, Unicode is the only commonly used encoding with a "signature" (the
byte-order mark, "BOM"). Detecting other encodings can be done
heuristically, but I'm not aware of any specific support within Java to do
so, and it wouldn't be 100% reliable anyway.
As far as converting once you know the encoding, see the InputStreamReader
and Charset classes for reading the file's bytes using a specific
encoding. Then you can use OutputStreamWriter to generate a new file
using whatever encoding you want, including any of the Unicode formats you
might want.
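A minimal sketch of that recipe might look like this (the file names and
the source charset here are just placeholders; substitute whatever applies
to your data):

import java.io.*;
import java.nio.charset.Charset;

public class Recode {
    public static void main(String[] args) throws IOException {
        // Read the bytes using the encoding you believe the file is in
        Reader in = new InputStreamReader(
                new FileInputStream("input.txt"), Charset.forName("windows-1252"));
        // Write them back out in a Unicode encoding, UTF-8 in this case
        Writer out = new OutputStreamWriter(
                new FileOutputStream("output.txt"), Charset.forName("UTF-8"));
        char[] buf = new char[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        out.close();
        in.close();
    }
}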
Pete
I suspect Pete is simplifying for clarity, but it is worth remembering
that Unicode is not an encoding.
AFAIK the byte-order mark (BOM) is optional for UTF-8/16/32 encodings.
Note: for the UTF-8 encoding, there are no byte-order issues and so, if
a BOM is included, it is only as a marker to signify a UTF encoding.
Files written on Unix systems typically do not include a BOM, as it would
interfere with other important file-type markers (such as a script's #! line).
--
RGB
The short answer: there's no easy way to detect charset automatically.
The long answer:
Typically, no filesystem stores metadata that one can associate with a
file encoding. All of your ISO 8859-* codes differ only in what the
code points in the 0x80 - 0xFF range look like, be it standard accented
characters (like à), Greek characters (α), or some other language.
Pragmatically differentiating between these single-byte encodings forces
you to resort to either heuristics or getting help from the user (if you
notice, all major browsers allow you to select a web page's encoding for
this very reason).
There is another class of encodings-- variable-length encodings like
UTF-8 or Shift-JIS. One can sometimes rule these encodings out, if the
file contains byte sequences that are invalid in them. For example,
0xA4 0xF4 is invalid UTF-8, so a file containing it is probably in an
ISO 8859-* encoding instead.
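For what it's worth, Java can do that kind of ruling-out for you: a
CharsetDecoder configured to report malformed input will throw on bytes
that are not legal UTF-8. A sketch (a "true" answer is still only a hint,
since plain ASCII is valid UTF-8 and valid ISO 8859-* at the same time):

import java.nio.ByteBuffer;
import java.nio.charset.*;

public class Utf8Check {
    // Returns false if the bytes cannot possibly be UTF-8.
    static boolean couldBeUtf8(byte[] bytes) {
        CharsetDecoder dec = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            dec.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}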
Context is also helpful. You may recall coming across documents that
have unusual character pairings, like Ã© or something (if your
newsreader sucks at i18n, you'll probably be seeing those in this
message as well). That is pretty much a dead giveaway that the message
is UTF-8 but someone is treating it as ISO 8859-1 (or its very close
sibling, Windows-1252). If you're seeing multiple high-byte characters
in a row, it's more likely UTF-8 than it is ISO 8859-1, although some
other languages may have these cases routinely (like Greek).
The final way to guess at the encoding is to look at what the platform's
default is. Western European-localized products will tend to be in
either Cp1252 (which is pretty much ISO 8859-1) or UTF-8; Japanese-localized
ones are probably either Shift-JIS or UTF-8. I believe Java's conversion
methods will default to the platform encoding for you anyway, so that may be a
safer bet for you. The other alternative is to just assume everyone uses
the same charset and not think about it.
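If you want to see what that platform default actually is on a given
machine, something like this will tell you:

import java.nio.charset.Charset;

public class DefaultCharset {
    public static void main(String[] args) {
        // The charset Java falls back to when you don't specify one,
        // e.g. in new InputStreamReader(stream) or String.getBytes()
        System.out.println(Charset.defaultCharset());
        System.out.println(System.getProperty("file.encoding"));
    }
}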
--
Beware of bugs in the above code; I have only proved it correct, not
tried it. -- Donald E. Knuth
>May I know if there are any possible solutions to detect the encoding or
>character set (charset) of a file automatically? Second, how do I
>convert a particular encoding to Unicode once the file encoding is
>detected?
I wrote a utility to manually assist the process. You could do it
automatically if you know the vocabulary of the file. Search for byte
patterns of encoded words.
see http://mindprod.com/jgloss/encoding.html
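Something along these lines would be a starting point for the vocabulary
trick (the probe word and the candidate charsets are made-up examples, not
what the utility above does):

import java.nio.charset.Charset;

public class VocabularyProbe {
    // Encode a word we expect to occur in the file with each candidate
    // charset and look for that byte pattern; return the first that matches.
    static String guess(byte[] fileBytes, String knownWord, String... candidates) {
        for (String name : candidates) {
            byte[] pattern = knownWord.getBytes(Charset.forName(name));
            if (indexOf(fileBytes, pattern) >= 0) return name;
        }
        return null;  // no candidate matched
    }

    // Naive byte-array search; fine for a one-off guess.
    static int indexOf(byte[] haystack, byte[] needle) {
        outer:
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            for (int j = 0; j < needle.length; j++)
                if (haystack[i + j] != needle[j]) continue outer;
            return i;
        }
        return -1;
    }
}

For example, guess(bytes, "résumé", "UTF-8", "windows-1252"), which is
obviously only useful if the probe word really does occur in the file.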
The fact you can't tell is as sloppy as dirty coffee cups and pizza boxes on the
floor. I can't imagine that happening if someone like Martha Stewart
were in charge.
--
Roedy Green Canadian Mind Products
http://mindprod.com
"Everybody�s worried about stopping terrorism. Well, there�s a really easy way: stop participating in it."
~ Noam Chomsky
>
> Of course, one can take advantage of the fact that certain
> octet values and octet sequence values are absolutely forbidden
> in certain encodings so as to exclude those encodings.
The biggest clue is the country source of the file. Check the
national encodings first.
I've often thought an elegant solution would be to define more than
one BOM (byte order mark) in Unicode. They could allocate enough
BOMs to have a different one for each encoding.
--
Wayne
>I've often thought an elegant solution would be to define more than
>one BOM (byte order mark) in Unicode. They could allocate enough
>BOMs to have a different one for each encoding.
There are hundreds of encodings. You could add it now with:
BOM BOM name-of-encoding BOM.
That way you don't have to reserve any new characters.
While we are at it, we should encode the MIME type and create an
extensible scheme to add other meta-information.
Right, but there should be a simple way to deal with plain text
files too.
Turns out there is one! I've been reading the HTML5 draft spec
and came across this:
2.7.3 Content-Type sniffing: text or binary
1. The user agent may wait for 512 or more bytes of the resource
to be available.
2. Let n be the smaller of either 512 or the number of bytes
already available.
3. If n is 4 or more, and the first bytes of the resource match
one of the following byte sets:
Bytes in Hexadecimal   Description
FE FF                  UTF-16BE BOM
FF FE                  UTF-16LE BOM
EF BB BF               UTF-8 BOM
So multiple BOMs are already defined, including one
for UTF-8. (I knew it was a good idea! :-)
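For what it's worth, sniffing those BOMs in Java is straightforward; a
sketch (I've added the UTF-32 BOMs as well, which have to be checked before
the UTF-16 ones since FF FE is also how the UTF-32LE BOM starts; and a
missing BOM of course proves nothing):

import java.io.*;

public class BomSniffer {
    // Returns the charset name suggested by a leading BOM, or null if none.
    static String charsetFromBom(File f) throws IOException {
        byte[] b = new byte[4];
        InputStream in = new FileInputStream(f);
        int n = in.read(b);
        in.close();
        if (n >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF)
            return "UTF-8";
        if (n >= 4 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE && b[2] == 0 && b[3] == 0)
            return "UTF-32LE";
        if (n >= 4 && b[0] == 0 && b[1] == 0 && (b[2] & 0xFF) == 0xFE && (b[3] & 0xFF) == 0xFF)
            return "UTF-32BE";
        if (n >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF)
            return "UTF-16BE";
        if (n >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE)
            return "UTF-16LE";
        return null;
    }
}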
--
Wayne
>FE FF UTF-16BE BOM
>FF FE UTF-16LE BOM
>EF BB BF UTF-8 BOM
>
>So multiple BOMs are already defined, including one
>for UTF-8. (I knew it was a good idea! :-)
I suppose we could try to get rid of all the old 8-bit encodings and
use Unicode/UTF rather than try to patch all those text files out
there with some scheme to mark the encoding.
--
Roedy Green Canadian Mind Products
http://mindprod.com
Never discourage anyone... who continually makes progress, no matter how slow.
~ Plato 428 BC died: 348 BC at age: 80
I wouldn't say that "multiple" BOMs are already defined. The idea of the
BOM is to insert a zero-width no-break space character, whose code point
is U+FEFF, at the start of the file.
Since this character will be encoded differently by different encodings,
it makes it possible to distinguish between UTF-16BE, UTF-16LE, UTF-8 and other
Unicode encodings.
It is also a somewhat acceptable way to indicate a file is UTF-8 rather
than Latin-1 or something, since it seems unlikely that a plain text
file would start with the characters that the BOM's bytes represent in
non-Unicode encodings.
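A quick way to convince yourself that it is one character coming out as
different byte sequences:

import java.nio.charset.Charset;

public class BomBytes {
    public static void main(String[] args) {
        // The single character U+FEFF turns into different byte sequences
        // depending on which encoding is used to write it.
        for (String name : new String[] {"UTF-8", "UTF-16BE", "UTF-16LE"}) {
            byte[] b = "\uFEFF".getBytes(Charset.forName(name));
            StringBuilder line = new StringBuilder(name + ":");
            for (byte x : b) line.append(String.format(" %02X", x & 0xFF));
            System.out.println(line);  // EF BB BF, FE FF, FF FE respectively
        }
    }
}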
Bottom line: the BOM is a zero-width no-break space. It is unique; there
are no multiple BOMs.
Or, if there are any that I don't know of, that would be another standard
the table given above doesn't conform with.
--
Mayeul