converting from one charset encoding to another ...

Albretch Mueller

Nov 22, 2009, 10:02:36 PM

Some time ago I coded some methods for charset re-encoding. Say you get
files in Cyrillic (“KOI8-R”) and you want them as UTF-8.

What I did was basically open an InputStreamReader(FileInputStream
FIS, String aEncoding1) and an OutputStreamWriter(FOS, “UTF-8”), then
call InputStreamReader.read(char[] chrBffr) and
OutputStreamWriter.write(chrBffr, 0, iRdByts) in a while loop until it
hit EOF.
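
A minimal sketch of the loop described above, assuming Java 7 or later for try-with-resources. The class and method names (Reencode, reencode) are illustrative and are not the poster's actual code.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;

class Reencode {
    // Decodes src with srcCharset and writes the same characters to dst
    // encoded with dstCharset. Charset names are passed as strings.
    static void reencode(File src, String srcCharset, File dst, String dstCharset)
            throws IOException {
        try (Reader in = new InputStreamReader(new FileInputStream(src), srcCharset);
             Writer out = new OutputStreamWriter(new FileOutputStream(dst), dstCharset)) {
            char[] buf = new char[8192];
            int n;
            // read() returns -1 at end of stream
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }
}
```

A call would look like Reencode.reencode(new File("in.txt"), "KOI8-R", new File("out.txt"), "UTF-8"), where the file names are placeholders.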

That works just fine, yet I wonder if there are better/faster ways to
do that using channels/memory-mapped files.

Also, where can you get actual files in different encodings
to test these methods?

Thanks
lbrtchx
{comp.lang.java.programmer}

Mike Schilling

Nov 23, 2009, 12:54:51 AM

You can create them easily enough with a FileWriter that writes to an
OutputStreamWriter of the desired encoding.
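
A minimal sketch of this idea, assuming the OutputStreamWriter sits on top of a FileOutputStream (the encoding is chosen on the OutputStreamWriter). The names MakeSample and writeSample are hypothetical.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

class MakeSample {
    // Writes text to f using the named charset, producing a test file
    // in that encoding.
    static void writeSample(File f, String text, String charsetName) throws IOException {
        try (Writer w = new OutputStreamWriter(new FileOutputStream(f), charsetName)) {
            w.write(text);
        }
    }
}
```

Writing the same text once as "KOI8-R" and once as "UTF-8" gives two files with different byte contents to feed into a conversion routine.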

Albretch Mueller

Nov 23, 2009, 6:07:43 AM

On Nov 23, 5:54 am, "Mike Schilling" <mscottschill...@hotmail.com>
wrote:

~
After checking the API I don't see what the difference would be
between a plain reader and a FileOutputStream. What is it?

Thank you
lbrtchx

Lew

Nov 23, 2009, 9:06:25 AM

Albretch Mueller wrote:
> After checking the API I don't see what the difference would be
> between a plain reader and a FileOutputStream. What is it?

I'll assume you either meant a "plain writer" or a 'FileInputStream', but the
question remains what you mean by a "plain reader/writer".

'Reader's and 'Writer's deal with encoded 'char's. Streams deal with raw bytes.

--
Lew

Mike Schilling

Nov 23, 2009, 12:31:00 PM

Albretch Mueller wrote:
>>
>> You can create them easily enough with a FileWriter that writes to
>> an OutputStreamWriter of the desired encoding.
> ~
> After checking the API I don't see what the difference would be
> between a plain reader and a FileOutputStream. What is it?

A Writer converts from characters (Unicode) to whatever encoding it
was created with. An OutputStream just outputs bytes with no
conversion being done.
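
To make the distinction concrete, here is a small sketch that encodes the same character under two charsets and prints the resulting bytes. It assumes the runtime supports KOI8-R; EncodeDemo and its methods are illustrative names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodeDemo {
    // Encodes s into bytes using the given charset.
    static byte[] encode(String s, Charset cs) {
        return s.getBytes(cs);
    }

    // Formats bytes as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String zhe = "\u0416"; // CYRILLIC CAPITAL LETTER ZHE
        System.out.println("UTF-8:  " + hex(encode(zhe, StandardCharsets.UTF_8)));
        System.out.println("KOI8-R: " + hex(encode(zhe, Charset.forName("KOI8-R"))));
    }
}
```

The character is the same in both cases; only the bytes that reach the stream differ (two bytes in UTF-8, one in KOI8-R).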

Roedy Green

Nov 23, 2009, 1:34:18 PM

On Sun, 22 Nov 2009 19:02:36 -0800 (PST), Albretch Mueller
<lbr...@gmail.com> wrote, quoted or indirectly quoted someone who
said :

>
> That works just fine, yet I wonder if there are better/faster ways to
>do that using channels/memory mapped files

The thing I don't understand is that nio uses ordinary file I/O
underneath. So how could it be faster, provided you don't do something
stupid with ordinary file I/O, in a case where caching would not help?
--
Roedy Green Canadian Mind Products
http://mindprod.com
Finding a bug is a sign you were asleep at the switch when coding. Stop debugging, and go back over your code line by line.

Christian

Nov 23, 2009, 1:36:24 PM

Albretch Mueller wrote:

> Some time ago I coded some methods for charset re-encoding. Say you get
> files in Cyrillic (“KOI8-R”) and you want them as UTF-8.
>
> What I did was basically open an InputStreamReader(FileInputStream
> FIS, String aEncoding1) and an OutputStreamWriter(FOS, “UTF-8”), then
> call InputStreamReader.read(char[] chrBffr) and
> OutputStreamWriter.write(chrBffr, 0, iRdByts) in a while loop until it
> hit EOF.
>
> That works just fine, yet I wonder if there are better/faster ways to
> do that using channels/memory-mapped files.
>
> Also, where can you get actual files in different encodings
> to test these methods?
>
> Thanks
> lbrtchx
> {comp.lang.java.programmer}

You might get better speed if you write it on your own, i.e. not using
a Reader but acting directly on two FileChannels with a
CharsetDecoder/CharsetEncoder. If you want, you can try this with
memory-mapped files.
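
A rough sketch of what that could look like, assuming Java 7 or later and an input file small enough to decode in one pass (a single mapping is limited to 2 GB, and the decoded text is held in memory). ChannelReencode is a hypothetical name, and whether this is actually faster than the Reader/Writer loop would need measuring.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class ChannelReencode {
    // Maps src into memory, decodes it with 'from', encodes the characters
    // with 'to', and writes the result to dst through a FileChannel.
    static void reencode(Path src, Charset from, Path dst, Charset to) throws IOException {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst,
                     StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE,
                     StandardOpenOption.TRUNCATE_EXISTING)) {
            MappedByteBuffer mapped = in.map(FileChannel.MapMode.READ_ONLY, 0, in.size());
            // decode()/encode() throw CharacterCodingException (an IOException)
            // on malformed input or unmappable characters.
            CharBuffer chars = from.newDecoder().decode(mapped);
            ByteBuffer bytes = to.newEncoder().encode(chars);
            // A single write() may not drain the buffer, so loop.
            while (bytes.hasRemaining()) {
                out.write(bytes);
            }
        }
    }
}
```

For large files the decoder and encoder would have to be driven chunk by chunk with their lower-level decode/encode methods, which this sketch deliberately leaves out.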

Albretch Mueller

Nov 23, 2009, 2:27:56 PM

> I'll assume you either meant a "plain writer" or a 'FileInputStream'

~
;-)
~

> 'Reader's and 'Writer's deal with encoded 'char's. Streams deal with raw bytes.

~
But once you write to a file, as I am doing, it all becomes a stream of
bytes anyway, until you eventually reopen the file using a Reader and
specify the charset used to interpret chunks of bytes as they are being
read into an array of chars. And as specified by the API:
~
http://java.sun.com/javase/6/docs/api/java/lang/Character.html
~
"The Java 2 platform uses the UTF-16 representation in char arrays
and in the String and StringBuffer classes."
~
So I think there is nothing fancy about converting streams from one
charset to another, as long as your OS/Java supports both encodings; it
is by nature a serial process.
~
Thank you
lbrtchx

Lew

Nov 23, 2009, 8:45:13 PM

Lew wrote:
>> 'Reader's and 'Writer's deal with encoded 'char's. Streams deal with raw bytes.

Albretch Mueller wrote:
> But once you write to a file, as I am doing, it all becomes a stream of
> bytes anyway, until you eventually reopen the file using a Reader and
> specify the charset used to interpret chunks of bytes as they are being
> read into an array of chars. And as specified by the API:

The exact bytes written through a Writer depend on the encoding used. If you
use a Reader with a different encoding, you'll get garbage.

--
Lew

Albretch Mueller

Nov 25, 2009, 3:18:17 AM

OK, you have made me wonder about what to do when you don't know the
encoding of a file you got. As far as I know this is not taken care of
by Readers, even though some heuristics may be used.

So, what do you do in those situations?

Thank you
lbrtchx

Lew

Nov 25, 2009, 9:13:58 AM

Albretch Mueller wrote:

>> Lew wrote:
>> The exact bytes written through a Writer depend on the encoding used. If you
>> use a Reader with a different encoding, you'll get garbage.
>>
>> --
>> Lew

Don't quote sigs.

> OK, you have made me wonder about what to do when you don't know the
> encoding of a file you got. As far as I know this is not taken care of
> by Readers, even though some heuristics may be used.
>
> So, what do you do in those situations?

The editor in Rational Software Architect, an IDE built on Eclipse, simply
reports that the file is not in the specified encoding. I haven't looked at
its source, but I guess it notices illegal code points. Other editors just
display the wrong thing.
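
One way such a check could be done in plain Java, offered as a guess at the general technique and not as a description of that editor: decode with a CharsetDecoder configured to report errors. EncodingCheck and decodesCleanly are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

class EncodingCheck {
    // Returns true if data decodes under cs without malformed-input or
    // unmappable-character errors, false otherwise.
    static boolean decodesCleanly(byte[] data, Charset cs) {
        try {
            cs.newDecoder()
              .onMalformedInput(CodingErrorAction.REPORT)
              .onUnmappableCharacter(CodingErrorAction.REPORT)
              .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

A clean decode only shows that the bytes are legal in that charset, not that the charset is the right one: ISO-8859-1, for example, accepts every possible byte sequence.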

--
Lew
Don't quote sigs.

Mike Schilling

Nov 25, 2009, 11:39:16 AM

Albretch Mueller wrote:

>
> OK, you have made me wonder about what to do when you don't know the
> encoding of a file you got. As far as I know this is not taken care of
> by Readers, even though some heuristics may be used.

Readers assume that what you tell them is true. (If you don't create
a Reader with an explicit charset, it uses the platform's default.)

Roedy Green

Nov 25, 2009, 3:25:13 PM

On Wed, 25 Nov 2009 00:18:17 -0800 (PST), Albretch Mueller

<lbr...@gmail.com> wrote, quoted or indirectly quoted someone who
said :

> OK, you have made me wonder about what to do when you don't know the
> encoding of a file you got. As far as I know this is not taken care of
> by Readers, even though some heuristics may be used.

see http://mindprod.com/applet/encodingrecogniser.html

http://mindprod.com/project/encodingidentification.html

--
Roedy Green Canadian Mind Products
http://mindprod.com

I mean the word proof not in the sense of the lawyers, who set two half proofs equal to a whole one, but in the sense of a mathematician, where half proof = 0, and it is demanded for proof that every doubt becomes impossible.
~ Carl Friedrich Gauss

Arne Vajhøj

Nov 25, 2009, 3:26:54 PM

Albretch Mueller wrote:
> OK, you have made me wonder about what to do when you don't know the
> encoding of a file you got. As far as I know this is not taken care of
> by Readers, even though some heuristics may be used.
>
> So, what do you do in those situations?

Ask for a specification.

The same sequence of bytes can be several different sequences of
chars depending on encoding.
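
A small illustration of that point, assuming the runtime supports KOI8-R: the same two bytes are decoded under three charsets and each yields a different string. SameBytes and decode are illustrative names.

```java
import java.nio.charset.Charset;

class SameBytes {
    // Decodes data using the named charset.
    static String decode(byte[] data, String charsetName) {
        return new String(data, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // These two bytes are the UTF-8 encoding of U+0416 (Cyrillic Zhe).
        byte[] data = { (byte) 0xD0, (byte) 0x96 };
        for (String name : new String[] { "UTF-8", "ISO-8859-1", "KOI8-R" }) {
            String s = decode(data, name);
            StringBuilder codes = new StringBuilder();
            for (char c : s.toCharArray()) {
                codes.append(String.format("U+%04X ", (int) c));
            }
            System.out.println(name + ": " + codes.toString().trim());
        }
    }
}
```

Nothing in the bytes themselves says which of these readings was intended, which is why the encoding has to come from outside the file.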

A specification is necessary.

Arne