
trying to parse lines of files with non-ASCII chars


lbr...@hotmail.com

Dec 22, 2006, 9:01:06 PM
I have some text data in a file that I need to parse.
.
The file contains non-ASCII characters such as accented letters, ñ, ...
.
If I run "cat file" I can see all the characters fine in the source file,
but after I parse the data and save it to another file using:
.
// - - - - - - - - - - - - - - - - - - - - - - - - - -
String aEnc = "UTF-8";
// __
FileOutputStream FOStrm = new FileOutputStream((new File(aOFlNm)));

OutputStreamWriter OStrmRdr = new OutputStreamWriter(FOStrm, aEnc);

BffrWrtr = new BufferedWriter(OStrmRdr);
// __
FileInputStream FIStrm = new FileInputStream(Fl);
InputStreamReader IStrmRdr = new InputStreamReader(FIStrm, aEnc);
BffrRdr = new BufferedReader(IStrmRdr);
// __
aRdLn = BffrRdr.readLine();
while(aRdLn != null){
// . . .
aRdLn = BffrRdr.readLine();
}
// __
BffrWrtr.flush(); BffrWrtr.close();
BffrRdr.close();
// - - - - - - - - - - - - - - - - - - - - - - - - - -
.
the non-ASCII characters do not come out right in the output file;
instead I see all kinds of weird characters
.
How can I fix this problem?
.
thanks
lbrtchx

hiwa

Dec 22, 2006, 9:53:15 PM

String aEnc = "UTF-8"; // !! use "UTF8" for java.io classes

FileOutputStream FOStrm = new FileOutputStream((new File(aOFlNm)));
OutputStreamWriter OStrmRdr = new OutputStreamWriter(FOStrm, aEnc);
BffrWrtr = new BufferedWriter(OStrmRdr);

FileInputStream FIStrm = new FileInputStream(Fl);
// !! your input file may not be UTF-8, actually ...


InputStreamReader IStrmRdr = new InputStreamReader(FIStrm, aEnc);
BffrRdr = new BufferedReader(IStrmRdr);

aRdLn = BffrRdr.readLine();
while(aRdLn != null){
aRdLn = BffrRdr.readLine(); // !! aRdLn is/are discarded ...
}
BffrWrtr.flush(); BffrWrtr.close();
BffrRdr.close();
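
For comparison, here is a minimal sketch of the same read/parse/write loop with both encodings made explicit and every line actually written to the output. It assumes Java 7 or later (try-with-resources, StandardCharsets, java.nio.file), which is newer than the JVM used in this thread. LineCopy and copyLines are hypothetical names, and the input charset is a parameter because the real encoding of the input file is not established here.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not the original poster's class.
final class LineCopy {
    // Decodes inPath with inCharset and writes each line to outPath as UTF-8.
    static void copyLines(Path inPath, Charset inCharset, Path outPath) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                 new InputStreamReader(Files.newInputStream(inPath), inCharset));
             BufferedWriter writer = new BufferedWriter(
                 new OutputStreamWriter(Files.newOutputStream(outPath), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // ... parse the line here ...
                writer.write(line);   // without this call the line is simply dropped
                writer.newLine();
            }
        } // both streams are flushed and closed here
    }
}
```

If the output still looks wrong with this shape, the likely suspects are the charset passed for the input or the encoding assumed by the terminal or viewer, rather than the loop itself.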

lbr...@hotmail.com

Dec 23, 2006, 6:23:31 AM
> // !! use "UTF8" for java.io classes
: Well, actually I had tried both "UTF8" and "UTF-8", and Java appears
to treat both as the same
.

> // !! your input file may not be UTF-8, actually ...
: This was the very first thing I checked, using KDE's Kate
.

> // !! aRdLn is/are discarded ...
: What do you mean? What I posted was just an extract from my actual code
.
The problem I am having might be related to the BOM ("byte order
mark") under Linux/Knoppix, but I am not sure about it
.
I see there is a much-despised Sun bug that was declared "Closed,
will not be fixed":
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058
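
If a BOM really is the cause, one common workaround is to drop a leading U+FEFF from the first line after decoding, since a UTF-8 reader that does not recognize the BOM hands it back as an ordinary character. A minimal sketch, where BomStrip and stripBom are hypothetical names and the only assumption is that the reader delivers the BOM as the first char:

```java
// Hypothetical helper; uses only String methods available on old JVMs too.
final class BomStrip {
    // Returns the line without a leading U+FEFF, or the line unchanged if there is none.
    static String stripBom(String firstLine) {
        if (firstLine != null && firstLine.length() > 0 && firstLine.charAt(0) == '\uFEFF') {
            return firstLine.substring(1);
        }
        return firstLine;
    }
}
```

This would be applied only to the first line returned by readLine(); later lines are left alone.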
.
// __ I am using the following JVM
sh-3.1# java -version
java version "1.4.2_11"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_11-b06)
Java HotSpot(TM) Client VM (build 1.4.2_11-b06, mixed mode)
.
// __ Default encoding is "ANSI_X3.4-1968"
String aDefEnc = System.getProperty("file.encoding");
System.out.println("// __ aDefEnc=" + aDefEnc);
// __ aDefEnc=ANSI_X3.4-1968
.
// __ if I use -Dfile.encoding=UTF-8 as JVM parameter
String aDefEnc = System.getProperty("file.encoding");
System.out.println("// __ aDefEnc=" + aDefEnc);
// __ aDefEnc=UTF-8
// __ OStrmRdr.getEncoding()=UTF8
.
// __ if I use aEnc="UTF-8";
sh-3.1# java k_killed08Test
// __ OStrmRdr.getEncoding()=UTF8
.
// __ if I use aEnc="UTF8";
sh-3.1# java k_killed08Test
// __ OStrmRdr.getEncoding()=UTF8
.
// __ if I use some nonsense string aEnc="8FTU";
java.io.UnsupportedEncodingException: 8FTU
at sun.io.Converters.getConverterClass(Converters.java:215)
at sun.io.Converters.newConverter(Converters.java:248)
at sun.io.CharToByteConverter.getConverter(CharToByteConverter.java:64)
at sun.nio.cs.StreamEncoder$ConverterSE.<init>(StreamEncoder.java:189)
at sun.nio.cs.StreamEncoder$ConverterSE.<init>(StreamEncoder.java:172)
at sun.nio.cs.StreamEncoder.forOutputStreamWriter(StreamEncoder.java:72)
at java.io.OutputStreamWriter.<init>(OutputStreamWriter.java:82)
at k_killed08Test.parse(k_killed08Test.java:54)
at k_killed08Test.main(k_killed08Test.java:26)
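
The three experiments above can also be reproduced through java.nio.charset.Charset (available since 1.4), which resolves aliases to a canonical name and rejects unknown labels. A small sketch, with CharsetAlias and canonicalName as hypothetical names:

```java
import java.nio.charset.Charset;

// Hypothetical helper for checking how a charset label is resolved.
final class CharsetAlias {
    // Returns the canonical name for a label or alias.
    // Throws UnsupportedCharsetException if the JVM does not know the label.
    static String canonicalName(String label) {
        return Charset.forName(label).name();
    }
}
```

Both "UTF8" and "UTF-8" resolve to the same charset here, which matches the observation that the two spellings behave identically; getEncoding() on the stream classes reports the historical java.io name instead.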
.
lbrtchx

hiwa

Dec 23, 2006, 10:24:54 PM

Does your document really have a BOM?
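
One way to answer that without trusting an editor is to look at the raw bytes: a UTF-8 BOM is the three bytes EF BB BF at the very start of the file. A minimal sketch, assuming Java 7 or later for try-with-resources and java.nio.file; BomCheck and hasUtf8Bom are hypothetical names:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper that inspects the first bytes of a file.
final class BomCheck {
    // Returns true if the file starts with the UTF-8 BOM bytes EF BB BF.
    static boolean hasUtf8Bom(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[3];
            int n = 0;
            // read() may return fewer bytes than requested, so loop until 3 or EOF
            while (n < 3) {
                int r = in.read(head, n, 3 - n);
                if (r < 0) {
                    break;
                }
                n += r;
            }
            return n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        }
    }
}
```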
