Some time ago I wrote some methods for charset re-encoding. Say you get
files in Cyrillic, "KOI8-R", and you want them as UTF-8.
What I did was basically open an InputStreamReader(FileInputStream
FIS, String aEncoding1) and an OutputStreamWriter(FOS, "UTF-8"), then
call InputStreamReader.read(char[] chrBffr) and
OutputStreamWriter.write(chrBffr, 0, iRdByts) in a while loop until it
hit EOF.
That works just fine, yet I wonder if there are better/faster ways to
do that using channels/memory-mapped files.
Also, where can you get actual files in different encodings
to test these methods?
Thanks
lbrtchx
{comp.lang.java.programmer}
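For reference, the Reader/Writer loop described in the post above might look roughly like the following. This is only a sketch of that description, not the poster's actual code: the class name StreamTranscoder, the method name toUtf8 and the 8192-char buffer size are assumptions made for illustration.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the names are illustrative only.
class StreamTranscoder {
    // Reads srcPath as text in srcCharset and rewrites it as UTF-8 at dstPath.
    static void toUtf8(String srcPath, Charset srcCharset, String dstPath) throws IOException {
        try (Reader in = new InputStreamReader(new FileInputStream(srcPath), srcCharset);
             Writer out = new OutputStreamWriter(new FileOutputStream(dstPath), StandardCharsets.UTF_8)) {
            char[] buf = new char[8192]; // buffer size is an arbitrary choice
            int n;
            // read() returns the number of chars read, or -1 at end of stream
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }
}
```

A call such as StreamTranscoder.toUtf8("in.txt", Charset.forName("KOI8-R"), "out.txt") would then do the conversion, assuming the JVM supports KOI8-R.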
You can create them easily enough with an OutputStreamWriter of the
desired encoding wrapped around a FileOutputStream.
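A minimal sketch of that suggestion, assuming you just want the same sample text written once per encoding. The class SampleFiles, its write method and the "sample-<charset>.txt" file naming are made up for this example; Charset.forName throws if the JVM does not support a requested name.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for generating test files; the names are illustrative only.
class SampleFiles {
    // Writes 'text' into 'dir' once per charset name and returns the created paths.
    static List<Path> write(Path dir, String text, String... charsetNames) throws IOException {
        List<Path> created = new ArrayList<>();
        for (String name : charsetNames) {
            Path p = dir.resolve("sample-" + name + ".txt");
            // The Writer converts the chars into bytes of the requested encoding.
            try (Writer w = new OutputStreamWriter(Files.newOutputStream(p), Charset.forName(name))) {
                w.write(text);
            }
            created.add(p);
        }
        return created;
    }
}
```

Calling SampleFiles.write(dir, someCyrillicText, "KOI8-R", "UTF-8") gives two files with the same text but different bytes on disk.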
Thank you
lbrtchx
I'll assume you either meant a "plain writer" or a 'FileInputStream', but the
question remains what you mean by a "plain reader/writer".
'Reader's and 'Writer's deal with encoded 'char's. Streams deal with raw bytes.
--
Lew
A Writer converts from characters (Unicode) to whatever encoding it
was created with. An OutputStream just outputs bytes with no
conversion being done.
>
> That works just fine, yet I wonder if there are better/faster ways to
>do that using channels/memory mapped files
The thing I don't understand is that NIO uses ordinary file I/O
underneath. So how is it faster, provided you don't do something stupid
with ordinary file I/O, in a case where caching would not help?
--
Roedy Green Canadian Mind Products
http://mindprod.com
Finding a bug is a sign you were asleep at the switch when coding. Stop debugging, and go back over your code line by line.
You might get better speed if you write it yourself, i.e. not using a
Reader but working directly on two FileChannels with a CharsetDecoder and
a CharsetEncoder. If you want, you can try this with memory-mapped files.
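One way the FileChannel plus CharsetDecoder/CharsetEncoder idea could be sketched is shown below. Whether it actually beats the Reader/Writer loop is not established in this thread and would need measuring. The class name ChannelTranscoder, the helper drain and the 64 KB buffer sizes are assumptions for illustration; both coders are set to REPORT so bad input raises an exception instead of being silently replaced.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch; names and buffer sizes are illustrative only.
class ChannelTranscoder {
    static void transcode(Path src, Charset from, Path dst, Charset to) throws IOException {
        CharsetDecoder dec = from.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharsetEncoder enc = to.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.allocate(64 * 1024);
        CharBuffer chars = CharBuffer.allocate(64 * 1024);
        ByteBuffer out = ByteBuffer.allocate(64 * 1024);

        try (FileChannel ic = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel oc = FileChannel.open(dst, StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            boolean eof = false;
            while (!eof) {
                eof = ic.read(in) == -1;      // -1 signals end of file
                in.flip();
                CoderResult dr;
                do {
                    dr = dec.decode(in, chars, eof);
                    if (dr.isError()) dr.throwException();
                    chars.flip();
                    drain(enc, chars, out, oc, false);
                    chars.compact();          // keep any char the encoder left behind
                } while (dr.isOverflow());    // char buffer was full: decode again
                in.compact();                 // keep bytes of an incomplete sequence
            }
            dec.flush(chars);                 // emit any state held by the decoder
            chars.flip();
            drain(enc, chars, out, oc, true);
            CoderResult fr;
            do {                              // emit any state held by the encoder
                fr = enc.flush(out);
                out.flip();
                while (out.hasRemaining()) oc.write(out);
                out.clear();
            } while (fr.isOverflow());
        }
    }

    // Encodes as much of 'chars' as possible, writing full output buffers to 'oc'.
    private static void drain(CharsetEncoder enc, CharBuffer chars, ByteBuffer out,
                              FileChannel oc, boolean endOfInput) throws IOException {
        while (true) {
            CoderResult er = enc.encode(chars, out, endOfInput);
            if (er.isError()) er.throwException();
            out.flip();
            while (out.hasRemaining()) oc.write(out);
            out.clear();
            if (er.isUnderflow()) return;     // all available chars consumed
        }
    }
}
```

For the memory-mapped variant, the input ByteBuffer could instead come from FileChannel.map(FileChannel.MapMode.READ_ONLY, 0, size) and be decoded in one pass with endOfInput set to true; that is left out here to keep the sketch short.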
Albretch Mueller wrote:
> but once you write to a file as I am doing it all becomes a stream of
> bytes anyway, till you eventually reopen the file using a Reader and
> specifying the charset to interpret chunks of bytes as they are being
> read into an array of chars, and as specified by the API:
The exact bytes written through a Writer depend on the encoding used. If you
use a Reader with a different encoding, you'll get garbage.
--
Lew
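To illustrate the point about mismatched encodings, here is a small sketch that encodes a string with one charset and decodes the resulting bytes with another. The class WrongCharsetDemo and the method roundTrip are hypothetical names, and the sample word is arbitrary.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo; the names are illustrative only.
class WrongCharsetDemo {
    // Encodes 'text' with 'written', then decodes the same bytes with 'read'.
    static String roundTrip(String text, Charset written, Charset read) {
        byte[] bytes = text.getBytes(written);
        return new String(bytes, read);
    }

    public static void main(String[] args) {
        String text = "\u041f\u0440\u0438\u0432\u0435\u0442"; // a Cyrillic word
        Charset koi8 = Charset.forName("KOI8-R");
        // Same charset both ways: the text survives.
        System.out.println(roundTrip(text, koi8, koi8));
        // Different charset on the way back: the bytes are read as other characters.
        System.out.println(roundTrip(text, koi8, StandardCharsets.ISO_8859_1));
    }
}
```

Note that the second call does not fail. ISO-8859-1 assigns a character to every byte value, so the mismatch only shows up as wrong text.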
OK, you have made me wonder about what to do when you don't know the
encoding of a file you got. As far as I know this is not taken care
of by Readers, even though some heuristics may be used.
So, what do you do in those situations?
Thank you
lbrtchx
Don't quote sigs.
> OK, you have made me wonder about what to do when you don't know the
> encoding of a file you got. As far as I know this is not taken care
> of by Readers, even though some heuristics may be used.
>
> So, what do you do in those situations?
The editor in Rational Software Architect, an IDE built on Eclipse, simply
reports that the file is not in the specified encoding. I haven't looked at
its source, but I guess it notices illegal code points. Other editors just
display the wrong thing.
--
Lew
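The "notice illegal input" behaviour guessed at above can be approximated in Java with a CharsetDecoder configured to report errors rather than substitute replacement characters. This is a sketch of that general technique, not of what any particular editor does. The class EncodingCheck and the method decodesCleanly are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper; the names are illustrative only.
class EncodingCheck {
    // Returns true if 'data' decodes under 'claimed' without any malformed
    // or unmappable byte sequence.
    static boolean decodesCleanly(byte[] data, Charset claimed) {
        CharsetDecoder dec = claimed.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            dec.decode(ByteBuffer.wrap(data)); // throws on the first bad sequence
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

A check like this is only useful for encodings with strict structure, such as UTF-8. Single-byte charsets such as KOI8-R or ISO-8859-1 accept most or all byte values, so a clean decode says little about whether the claimed encoding is the right one.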
Don't quote sigs.
>
> OK, you have made me wonder about what to do when you don't know the
> encoding of a file you got. As far as I know this is not taken care
> of by Readers, even though some heuristics may be used.
Readers assume that what you tell them is true. (If you don't create
a Reader with an explicit charset, it uses the platform's default.)
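A quick way to see which charset a Reader falls back to when none is given is sketched below. DefaultCharsetDemo and implicitCharset are hypothetical names. The result depends on the platform and JVM settings, which is exactly why passing the charset explicitly is usually the safer choice.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

// Hypothetical demo; the names are illustrative only.
class DefaultCharsetDemo {
    // Returns the charset an InputStreamReader uses when none is specified.
    static Charset implicitCharset() throws IOException {
        try (InputStreamReader r = new InputStreamReader(new ByteArrayInputStream(new byte[0]))) {
            // getEncoding() may return a historical name; forName resolves it.
            return Charset.forName(r.getEncoding());
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Reader without charset uses: " + implicitCharset());
        System.out.println("Charset.defaultCharset():    " + Charset.defaultCharset());
    }
}
```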
> OK, you have made me wonder about what to do when you don't know the
> encoding of a file you got. As far as I know this is not taken care
> of by Readers, even though some heuristics may be used.
see http://mindprod.com/applet/encodingrecogniser.html
http://mindprod.com/project/encodingidentification.html
--
Roedy Green Canadian Mind Products
http://mindprod.com
I mean the word proof not in the sense of the lawyers, who set two half proofs equal to a whole one, but in the sense of a mathematician, where half proof = 0, and it is demanded for proof that every doubt becomes impossible.
~ Carl Friedrich Gauss
Ask for a specification.
The same sequence of bytes can be several different sequences of
chars depending on encoding.
A specification is necessary.
Arne