May I know if there are any possible solutions to detect the encoding or
character set (charset) of a file automatically? Second, how do I
convert a particular encoding to Unicode once the file encoding is
detected?
Thanks in advance.
--
regards,
Simon
> Hi all,
>
> May I know if there are any possible solutions to detect the encoding or
> character set (charset) of a file automatically? Second, how do I
> convert a particular encoding to Unicode once the file encoding is
> detected?
AFAIK, Unicode is the only commonly used encoding with a "signature" (the
byte-order mark, "BOM"). Detecting other encodings can be done
heuristically, but I'm not aware of any specific support within Java to do
so, and it wouldn't be 100% reliable anyway.
As far as converting once you know the encoding, see the InputStreamReader
and Charset classes for reading the file's bytes using a specific
encoding. Then you can use OutputStreamWriter to generate a new file
using whatever encoding you want, including any of the Unicode formats you
might want.
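A minimal sketch of that recipe might look like this (the file names and
the source charset here are just placeholders; substitute whatever applies
to your data):

import java.io.*;
import java.nio.charset.Charset;

public class Recode {
    public static void main(String[] args) throws IOException {
        // Read the bytes using the encoding you believe the file is in
        Reader in = new InputStreamReader(
                new FileInputStream("input.txt"), Charset.forName("windows-1252"));
        // Write them back out in a Unicode encoding, UTF-8 in this case
        Writer out = new OutputStreamWriter(
                new FileOutputStream("output.txt"), Charset.forName("UTF-8"));
        char[] buf = new char[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        out.close();
        in.close();
    }
}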
Pete
I suspect Pete is simplifying for clarity, but it is worth remembering
that Unicode is not an encoding.
AFAIK the byte-order mark (BOM) is optional for UTF-8/16/32 encodings.
Note: for the UTF-8 encoding, there are no byte-order issues and so, if
a BOM is included, it is only as a marker to signify a UTF encoding.
Files written on Unix systems typically do not include a BOM, as it would
interfere with other important file-type markers (such as a script's #! line).
--
RGB
The short answer: there's no easy way to detect charset automatically.
The long answer:
Typically, no filesystem stores metadata that one can associate with a
file encoding. All of your ISO 8859-* codes differ only in what the
code points in the 0x80 - 0xFF range look like, be it standard accented
characters (like à), Greek characters (α), or some other language.
Pragmatically differentiating between these single-byte encodings forces
you to resort to either heuristics or getting help from the user (if you
notice, all major browsers allow you to select a web page's encoding for
this very reason).
There is another class of encodings-- variable-length encodings like
UTF-8 or Shift-JIS. One can sometimes rule these encodings out, if the
file contains byte sequences that are invalid in them. For example,
0xA4 0xF4 is invalid UTF-8, so a file containing it is probably in an
ISO 8859-* encoding instead.
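For what it's worth, Java can do that kind of ruling-out for you: a
CharsetDecoder configured to report malformed input will throw on bytes
that are not legal UTF-8. A sketch (a "true" answer is still only a hint,
since plain ASCII is valid UTF-8 and valid ISO 8859-* at the same time):

import java.nio.ByteBuffer;
import java.nio.charset.*;

public class Utf8Check {
    // Returns false if the bytes cannot possibly be UTF-8.
    static boolean couldBeUtf8(byte[] bytes) {
        CharsetDecoder dec = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            dec.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}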
Context is also helpful. You may recall coming across documents that
have unusual character pairings, like Ã© or something (if your
newsreader sucks at i18n, you'll probably be seeing those in this
message as well). That is pretty much a dead giveaway that the message
is UTF-8 but someone is treating it as ISO 8859-1 (or its very close
sibling, Windows-1252). If you're seeing multiple high-byte characters
in a row, it's more likely UTF-8 than it is ISO 8859-1, although some
other languages may have these cases routinely (like Greek).
The final way to guess at the encoding is to look at what the platform's
default is. Western European-localized products will tend to be in
either Cp1252 (which is pretty much ISO 8859-1) or UTF-8; Japanese-localized
ones are probably either Shift-JIS or UTF-8. I believe Java's conversion
methods will default to the platform encoding for you anyway, so that may be a
safer bet for you. The other alternative is to just assume everyone uses
the same charset and not think about it.
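If you want to see what that platform default actually is on a given
machine, something like this will tell you:

import java.nio.charset.Charset;

public class DefaultCharset {
    public static void main(String[] args) {
        // The charset Java falls back to when you don't specify one,
        // e.g. in new InputStreamReader(stream) or String.getBytes()
        System.out.println(Charset.defaultCharset());
        System.out.println(System.getProperty("file.encoding"));
    }
}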
--
Beware of bugs in the above code; I have only proved it correct, not
tried it. -- Donald E. Knuth
>May I know if there are any possible solutions to detect the encoding or
>character set (charset) of a file automatically? Second, how do I
>convert a particular encoding to Unicode once the file encoding is
>detected?
I wrote a utility to manually assist the process. You could do it
automatically if you know the vocabulary of the file. Search for byte
patterns of encoded words.
see http://mindprod.com/jgloss/encoding.html
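Something along these lines would be a starting point for the vocabulary
trick (the probe word and the candidate charsets are made-up examples, not
what the utility above does):

import java.nio.charset.Charset;

public class VocabularyProbe {
    // Encode a word we expect to occur in the file with each candidate
    // charset and look for that byte pattern; return the first that matches.
    static String guess(byte[] fileBytes, String knownWord, String... candidates) {
        for (String name : candidates) {
            byte[] pattern = knownWord.getBytes(Charset.forName(name));
            if (indexOf(fileBytes, pattern) >= 0) return name;
        }
        return null;  // no candidate matched
    }

    // Naive byte-array search; fine for a one-off guess.
    static int indexOf(byte[] haystack, byte[] needle) {
        outer:
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            for (int j = 0; j < needle.length; j++)
                if (haystack[i + j] != needle[j]) continue outer;
            return i;
        }
        return -1;
    }
}

For example, guess(bytes, "résumé", "UTF-8", "windows-1252"), which is
obviously only useful if the probe word really does occur in the file.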
The fact you can't tell is as sloppy as dirty coffee cups and pizza boxes on the
floor. I can't imagine that happening if someone like Martha Stewart
were in charge.
--
Roedy Green Canadian Mind Products
http://mindprod.com
"Everybody�s worried about stopping terrorism. Well, there�s a really easy way: stop participating in it."
~ Noam Chomsky
>
> Of course, one can take advantage of the fact that certain
> octet values and octet sequence values are absolutely forbidden
> in certain encodings so as to exclude those encodings.
The biggest clue is the country source of the file. Check the
national encodings first.
I've often thought an elegant solution would be to define more than
one BOM (byte order mark) in Unicode. They could allocate enough
BOMs to have a different one for each encoding.
--
Wayne
>I've often thought an elegant solution would be to define more than
>one BOM (byte order mark) in Unicode. They could allocate enough
>BOMs to have a different one for each encoding.
There are hundreds of encodings. You could add it now with:
BOM BOM name-of-encoding BOM.
That way you don't have to reserve any new characters.
While we are at it, we should encode the MIME type and create an
extensible scheme to add other meta-information.
Right, but there should be a simple way to deal with plain text
files too.
Turns out there is one! I've been reading the HTML5 draft spec
and came across this:
2.7.3 Content-Type sniffing: text or binary
1. The user agent may wait for 512 or more bytes of the resource
to be available.
2. Let n be the smaller of either 512 or the number of bytes
already available.
3. If n is 4 or more, and the first bytes of the resource match
one of the following byte sets:
Bytes in Hexadecimal   Description
FE FF                  UTF-16BE BOM
FF FE                  UTF-16LE BOM
EF BB BF               UTF-8 BOM
So multiple BOMs are already defined, including one
for UTF-8. (I knew it was a good idea! :-)
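For what it's worth, sniffing those BOMs in Java is straightforward; a
sketch (I've added the UTF-32 BOMs as well, which have to be checked before
the UTF-16 ones since FF FE is also how the UTF-32LE BOM starts; and a
missing BOM of course proves nothing):

import java.io.*;

public class BomSniffer {
    // Returns the charset name suggested by a leading BOM, or null if none.
    static String charsetFromBom(File f) throws IOException {
        byte[] b = new byte[4];
        InputStream in = new FileInputStream(f);
        int n = in.read(b);
        in.close();
        if (n >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF)
            return "UTF-8";
        if (n >= 4 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE && b[2] == 0 && b[3] == 0)
            return "UTF-32LE";
        if (n >= 4 && b[0] == 0 && b[1] == 0 && (b[2] & 0xFF) == 0xFE && (b[3] & 0xFF) == 0xFF)
            return "UTF-32BE";
        if (n >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF)
            return "UTF-16BE";
        if (n >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE)
            return "UTF-16LE";
        return null;
    }
}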
--
Wayne
>FE FF UTF-16BE BOM
>FF FE UTF-16LE BOM
>EF BB BF UTF-8 BOM
>
>So multiple BOMs are already defined, including one
>for UTF-8. (I knew it was a good idea! :-)
I suppose we could try to get rid of all the old 8-bit encodings and
use Unicode/UTF rather than try to patch all those text files out
there with some scheme to mark the encoding.
--
Roedy Green Canadian Mind Products
http://mindprod.com
Never discourage anyone... who continually makes progress, no matter how slow.
~ Plato 428 BC died: 348 BC at age: 80
I wouldn't say that "multiple" BOMs are already defined. The idea of the
BOM is to insert a zero-width no-break space character, whose code point
is U+FEFF, at the start of the file.
Since this character will be encoded differently by different encodings,
it makes it possible to distinguish between UTF-16BE, UTF-16LE, UTF-8 and other
Unicode encodings.
It is also a somewhat acceptable way to indicate a file is UTF-8 rather
than Latin-1 or something, since it seems unlikely that a plain text
file would start with the characters that the BOM's bytes represent in
non-Unicode encodings.
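A quick way to convince yourself that it is one character coming out as
different byte sequences:

import java.nio.charset.Charset;

public class BomBytes {
    public static void main(String[] args) {
        // The single character U+FEFF turns into different byte sequences
        // depending on which encoding is used to write it.
        for (String name : new String[] {"UTF-8", "UTF-16BE", "UTF-16LE"}) {
            byte[] b = "\uFEFF".getBytes(Charset.forName(name));
            StringBuilder line = new StringBuilder(name + ":");
            for (byte x : b) line.append(String.format(" %02X", x & 0xFF));
            System.out.println(line);  // EF BB BF, FE FF, FF FE respectively
        }
    }
}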
Bottom line: the BOM is a zero-width no-break space. It is unique; there
are no multiple BOMs.
Or, if there are any that I don't know of, that would be another standard
the table given above doesn't conform with.
--
Mayeul