Problem: When a string containing a Norwegian character is read back
with readObject, I get a UTFDataFormatException. I would expect
readObject and writeObject to handle Strings with any content.
To run the program, install the package "Remote Method Invocation
and Object Serialization" from Sun.
With English strings, the program works all right:
D:\s\cafe\tstutf>java tstUTF an english string
string an english string saved.
D:\s\cafe\tstutf>java tstUTF read
an english string
D:\s\cafe\tstutf>
With a Norwegian character in the string, I get this:
D:\s\cafe\tstutf>java tstUTF Skål!
string Sk_l! saved.
D:\s\cafe\tstutf>java tstUTF read
java.io.UTFDataFormatException
at java.io.DataInputStream.readUTF(DataInputStream.java:326)
at java.io.DataInputStream.readUTF(DataInputStream.java:281)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:193)
at tstUTF.<init>(tstUTF.java:41)
at tstUTF.main(tstUTF.java:57)
D:\s\cafe\tstutf>
The program:
import java.io.*;
//import java.awt.*;
public class tstUTF
{
    void write( String[] argv ) throws java.io.IOException
    {
        FileOutputStream f = new FileOutputStream( "tstUTF.data" );
        ObjectOutput s = new ObjectOutputStream(f);
        String str = new String();
        String str2;
        int i = 0;
        while ( true )
        {
            try
            {
                str2 = argv[ i ];
                if ( str.length() > 0 )
                    str = str + " ";
                str = str + str2;
                i = i + 1;
            }
            catch (ArrayIndexOutOfBoundsException ex )
            {
                break;
            }
        }
        s.writeObject( str );
        f.flush();
        System.out.println( "string " + str + " saved." );
    }
    public tstUTF(String[] argv )
        throws java.io.IOException, java.lang.ClassNotFoundException
    {
        try
        {
            if ( argv[0].equals( "read" ))
            {
                FileInputStream in = new FileInputStream(
                    "tstUTF.data" );
                ObjectInputStream sin = new ObjectInputStream( in );
                String str2 = ( String ) sin.readObject();
                System.out.println( str2 );
            }
            else
                write( argv );
        }
        catch ( ArrayIndexOutOfBoundsException ex )
        {
            System.out.println( "parm read for read, anything for write" );
            return;
        }
    }
    public static void main ( String[] argv)
        throws java.io.IOException, java.lang.ClassNotFoundException
    {
        new tstUTF( argv );
        return;
    }
}
--
email: ri...@bgnett.no
Sorry, I didn't know about argv.length at the time.
Johan
--
email: ri...@bgnett.no
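For readers following along, here is a minimal sketch of how the argument-joining loop might look when bounded by argv.length instead of catching ArrayIndexOutOfBoundsException. The class name JoinArgs and the method join are hypothetical, and StringBuilder is a later addition to Java than the JDK discussed in this thread.

```java
// Hypothetical helper, not part of the original program.
class JoinArgs {
    // Joins the arguments with single spaces, using argv.length as the loop bound.
    static String join(String[] argv) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < argv.length; i++) {
            // Same rule as the original loop: separate only once something is stored.
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(argv[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("string " + join(args) + " saved.");
    }
}
```

An empty argument array simply yields an empty string, so no exception handling is needed for the loop itself.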
The strange thing is that someone typing the same character from Europe
(on a European keyboard) did not cause the problem.
John Neffenger
http://www.volano.com/
> try
> {
> str2 = argv[ i ];
> if ( str.length() > 0 )
> str = str + " ";
> str = str + str2;
> i = i + 1;
> }
> catch (ArrayIndexOutOfBoundsException ex )
> {
> break;
> }
> }
The TextField class produces a \uFFF8 character instead of a \u00F8
character when "ø" is typed in.
You are right, the problem relates to
java.io.DataInputStream.readUTF().
I have been able to narrow the problem down this far: when I type the
Norwegian character "ø" (which is 0xF8 in MS Windows and Latin-1) in a
TextField, the character in the String from TextField.getText() is
actually the Unicode character \uFFF8. This is stored correctly with
writeUTF as the three bytes EF BF B8 or
1110 1111 1011 1111 1011 1000
---- xxxx --xx xxxx --xx xxxx
where the bits marked x are the real character, and the bits marked -
(minus) are the UTF flag bits.
When this sequence is read in, the readUTF function throws the
UTFDataFormatException.
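The byte sequence described above can be reproduced with a small sketch like the following. It assumes a JDK whose readUTF handles three-byte sequences, so the round trip is expected to succeed there. The class name Utf8RoundTrip and its helper methods are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class: dumps what writeUTF stores for a given string.
class Utf8RoundTrip {
    // Returns the raw bytes written by writeUTF: a 2-byte length, then the encoded chars.
    static byte[] encode(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try {
            new DataOutputStream(buf).writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Reads the bytes back with readUTF.
    static String decode(byte[] bytes) {
        try {
            return new DataInputStream(new ByteArrayInputStream(bytes)).readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = encode("\uFFF8");
        System.out.println(hex(bytes));                    // prints "00 03 EF BF B8"
        System.out.println(decode(bytes).equals("\uFFF8")); // prints "true"
    }
}
```

The first two bytes are the length prefix that readUTF reads with readUnsignedShort; the remaining three are the EF BF B8 sequence from the post.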
The characters from \uFFF0 to \uFFFC are described as "Specials" in the
Unicode encoding, according to "Java in a Nutshell" (David Flanagan).
The same problem appears when I use the character "ø" in a parameter
to a program in a DOS box.
Now I have two mysteries:
1. Why does TextField (and args to main) produce Strings with \uFFF8
instead of \u00F8 when a Norwegian "ø" is typed?
2. Why doesn't readUTF handle the \uFFF8 character?
--
email: ri...@bgnett.no
I think I found a bug in the readUTF function. Correct me if I am
wrong: there should be a break statement as the last statement in the
case 14 branch. With the bug, every three-byte UTF character falls
through to the default branch and produces the exception, that is, all
Unicode characters from \u0800 to \uFFFF.
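To see which characters land in the three-byte branch, a sketch along these lines measures how many bytes writeUTF emits per character. The class name UtfLength and the method encodedLength are hypothetical; the boundary values are those of the writeUTF format.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class: measures the writeUTF size of a single char.
class UtfLength {
    // Number of bytes writeUTF uses for c, excluding the 2-byte length prefix.
    static int encodedLength(char c) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try {
            new DataOutputStream(buf).writeUTF(String.valueOf(c));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size() - 2;
    }

    public static void main(String[] args) {
        System.out.println(encodedLength('k'));      // prints 1 (plain ASCII)
        System.out.println(encodedLength('\u00F8')); // prints 2 (Latin-1 range)
        System.out.println(encodedLength('\u07FF')); // prints 2 (last two-byte char)
        System.out.println(encodedLength('\u0800')); // prints 3 (first three-byte char)
        System.out.println(encodedLength('\uFFF8')); // prints 3
    }
}
```

A correctly decoded "ø" therefore never reaches case 14; only the sign-extended value does.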
From the file ...Java\Src\java\io\DataInputStream.java in the Symantec
Cafe 1.51 distribution (--> in the margin marks where the bug is):
    public final static String readUTF(DataInput in) throws IOException {
        int utflen = in.readUnsignedShort();
        char str[] = new char[utflen];
        int count = 0;
        int strlen = 0;
        while (count < utflen) {
            int c = in.readUnsignedByte();
            int char2, char3;
            switch (c >> 4) {
            case 0: case 1: case 2: case 3: case 4: case 5: case 6: case 7:
                // 0xxxxxxx
                count++;
                str[strlen++] = (char)c;
                break;
            case 12: case 13:
                // 110x xxxx 10xx xxxx
                count += 2;
                if (count > utflen)
                    throw new UTFDataFormatException();
                char2 = in.readUnsignedByte();
                if ((char2 & 0xC0) != 0x80)
                    throw new UTFDataFormatException();
                str[strlen++] = (char)(((c & 0x1F) << 6) | (char2 & 0x3F));
                break;
            case 14:
                // 1110 xxxx 10xx xxxx 10xx xxxx
                count += 3;
                if (count > utflen)
                    throw new UTFDataFormatException();
                char2 = in.readUnsignedByte();
                char3 = in.readUnsignedByte();
                if (((char2 & 0xC0) != 0x80) || ((char3 & 0xC0) != 0x80))
                    throw new UTFDataFormatException();
                str[strlen++] = (char)(((c & 0x0F) << 12) |
                                       ((char2 & 0x3F) << 6) |
-->                                    ((char3 & 0x3F) << 0));
            default:
                // 10xx xxxx, 1111 xxxx
                throw new UTFDataFormatException();
            }
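For comparison, here is a minimal sketch of the same switch with the break added at the end of case 14. It is not the JDK source: it decodes from a byte array rather than a DataInput, omits the length prefix, and wraps the UTFDataFormatException in an unchecked exception. The class name FixedUtfDecoder and the helper malformed are hypothetical.

```java
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;

// Hypothetical stand-alone decoder modelled on the listing above.
class FixedUtfDecoder {
    private static UncheckedIOException malformed() {
        return new UncheckedIOException(new UTFDataFormatException());
    }

    // Decodes the encoded characters of a writeUTF record (length prefix already removed).
    static String decode(byte[] data) {
        char[] str = new char[data.length];
        int count = 0;
        int strlen = 0;
        while (count < data.length) {
            int c = data[count] & 0xFF;
            int char2, char3;
            switch (c >> 4) {
            case 0: case 1: case 2: case 3: case 4: case 5: case 6: case 7:
                // 0xxxxxxx
                count++;
                str[strlen++] = (char) c;
                break;
            case 12: case 13:
                // 110x xxxx 10xx xxxx
                if (count + 2 > data.length) throw malformed();
                char2 = data[count + 1] & 0xFF;
                if ((char2 & 0xC0) != 0x80) throw malformed();
                str[strlen++] = (char) (((c & 0x1F) << 6) | (char2 & 0x3F));
                count += 2;
                break;
            case 14:
                // 1110 xxxx 10xx xxxx 10xx xxxx
                if (count + 3 > data.length) throw malformed();
                char2 = data[count + 1] & 0xFF;
                char3 = data[count + 2] & 0xFF;
                if (((char2 & 0xC0) != 0x80) || ((char3 & 0xC0) != 0x80)) throw malformed();
                str[strlen++] = (char) (((c & 0x0F) << 12)
                        | ((char2 & 0x3F) << 6)
                        | (char3 & 0x3F));
                count += 3;
                break; // without this, control falls through to default
            default:
                // 10xx xxxx, 1111 xxxx
                throw malformed();
            }
        }
        return new String(str, 0, strlen);
    }

    public static void main(String[] args) {
        String s = decode(new byte[] {(byte) 0xEF, (byte) 0xBF, (byte) 0xB8});
        System.out.println(Integer.toHexString(s.charAt(0))); // prints "fff8"
    }
}
```

With the break in place, the three bytes EF BF B8 decode to a single char instead of reaching the default branch.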
--
email: ri...@bgnett.no
> Now I have two mysteries:
>
> 1. Why does TextField (and args to main) produce Strings with \uFFF8
> instead of \u00F8 when a Norwegian "ø" is typed?
Since char is signed in C, I would guess that the library that's
grabbing the chars from the TextField is sign-extending them
as it writes them into the TextField object. (I.e., instead of
masking them.) You'd probably see the same thing w/ \u0080-\u00FF.
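The sign-extension guess above can be illustrated in Java itself, where byte is always signed. This is only a sketch of the suspected mechanism, not the actual native library code; the class name SignExtension and its methods are hypothetical.

```java
// Hypothetical demo class: sign-extending versus masking a Latin-1 byte.
class SignExtension {
    // Widens the byte with sign extension first, as a signed C char would be.
    static char signExtended(byte b) {
        return (char) b;
    }

    // Masks to the low 8 bits before widening, which preserves the Latin-1 value.
    static char masked(byte b) {
        return (char) (b & 0xFF);
    }

    public static void main(String[] args) {
        byte oSlash = (byte) 0xF8; // "ø" in Latin-1
        System.out.println(Integer.toHexString(signExtended(oSlash))); // prints "fff8"
        System.out.println(Integer.toHexString(masked(oSlash)));       // prints "f8"
    }
}
```

Any byte from 0x80 to 0xFF shows the same effect, which matches the guess about \u0080-\u00FF.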
> 2. Why doesn't readUTF handle the \uFFF8 character?
It looks like a bug in src/java/io/DataInputStream.java's readUTF()!
Case 14 is missing the break at the end, falling thru to the
default case of a format error.
--
J. David Beutel  11011011  j...@pinn.net
"You're inhabited by the society you live in through your use of
language." McCorduck on Turkle on Lacan
> Since char is signed in C...
This varies between compilers. I've worked with more than one compiler
that treated char as unsigned and more than one which treated it as
signed. One of the many reasons not to use naked types in C.
Doug Bell
db...@shvn.com
I have reported both of these as bug reports to java...@java.sun.com.
--
email: ri...@bgnett.no