Object serialization and java.io.UTFDataFormatException

Johan Ur Riise

Oct 18, 1996, 3:00:00 AM

The program below writes the parameters as a string to a file, and
reads the same string back, using object serialization.

Problem: When a string containing a Norwegian character is loaded with
readObject, I get a UTFDataFormatException. I would expect readObject
and writeObject to handle Strings with any content.

To run the program, install the package "Remote Method Invocation
and Object Serialization" from Sun.

With English strings, the program works all right:

D:\s\cafe\tstutf>java tstUTF an english string
string an english string saved.

D:\s\cafe\tstutf>java tstUTF read
an english string

D:\s\cafe\tstutf>

With a Norwegian character in the string, I get this:

D:\s\cafe\tstutf>java tstUTF Skål!
string Sk_l! saved.

D:\s\cafe\tstutf>java tstUTF read
java.io.UTFDataFormatException
at java.io.DataInputStream.readUTF(DataInputStream.java:326)
at java.io.DataInputStream.readUTF(DataInputStream.java:281)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:193)
at tstUTF.<init>(tstUTF.java:41)
at tstUTF.main(tstUTF.java:57)

D:\s\cafe\tstutf>

The program:

import java.io.*;
//import java.awt.*;
public class tstUTF
{
    void write( String[] argv ) throws java.io.IOException
    {
        FileOutputStream f = new FileOutputStream( "tstUTF.data" );
        ObjectOutput s = new ObjectOutputStream(f);
        String str = new String();
        String str2;
        int i = 0;
        while ( true )
        {
            try
            {
                str2 = argv[ i ];
                if ( str.length() > 0 )
                    str = str + " ";
                str = str + str2;
                i = i + 1;
            }
            catch (ArrayIndexOutOfBoundsException ex )
            {
                break;
            }
        }
        s.writeObject( str );

        f.flush();
        System.out.println( "string " + str + " saved." );
    }
    public tstUTF(String[] argv )
        throws java.io.IOException, java.lang.ClassNotFoundException
    {
        try
        {
            if ( argv[0].equals( "read" ))
            {
                FileInputStream in = new FileInputStream( "tstUTF.data" );
                ObjectInputStream sin = new ObjectInputStream( in );
                String str2 = ( String ) sin.readObject();
                System.out.println( str2 );
            }
            else
                write( argv );
        }

        catch ( ArrayIndexOutOfBoundsException ex )
        {
            System.out.println( "parm read for read, anything for write" );
            return;
        }
    }
    public static void main ( String[] argv)
        throws java.io.IOException, java.lang.ClassNotFoundException
    {
        new tstUTF( argv );
        return;
    }
}

--
email: ri...@bgnett.no

Johan Ur Riise

Oct 18, 1996, 3:00:00 AM

ri...@bgnett.no (Johan Ur Riise) wrote:
> try
> {
> str2 = argv[ i ];
> if ( str.length() > 0 )
> str = str + " ";
> str = str + str2;
> i = i + 1;
> }
> catch (ArrayIndexOutOfBoundsException ex )
> {
> break;
> }

Sorry, I didn't know about argv.length at the time.
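
For comparison, a rough sketch of the same join bounded by argv.length
instead of the exception. The class and method names (JoinArgs, join)
are made up for illustration and are not part of the posted program.

```java
// Hypothetical helper, not part of tstUTF: joins the command-line
// arguments with single spaces, using argv.length as the loop bound.
class JoinArgs {
    static String join(String[] argv) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < argv.length; i++) {
            // Same rule as the posted loop: add a separator only once
            // something has already been collected.
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(argv[i]);
        }
        return sb.toString();
    }

    public static void main(String[] argv) {
        System.out.println("string " + join(argv) + " saved.");
    }
}
```
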
Johan
--
email: ri...@bgnett.no

John Neffenger

Oct 18, 1996, 3:00:00 AM

You don't need RMI. I just had the same problem today trying to decode
a string with a "ç" in it. (I think it's called a cedilla accent?) I
sent the string with writeUTF and attempted to read it with readUTF. The
readUTF call threw the exception.

The strange thing is that someone typing the same character from Europe
(on a European keyboard) did not cause the problem.

John Neffenger
http://www.volano.com/

> try
> {
> str2 = argv[ i ];
> if ( str.length() > 0 )
> str = str + " ";
> str = str + str2;
> i = i + 1;
> }
> catch (ArrayIndexOutOfBoundsException ex )
> {
> break;
> }
> }

Johan Ur Riise

Oct 20, 1996, 3:00:00 AM

[Johan Riise]

> Problem: When a string with norwegian character is loaded with
> readObject, I get a UTFDataFormatException. I would expect that
> readObject and writeObject handled Strings with any content.

[John Neffenger <jo...@volano.com>]

>You don't need RMI. I just had the same problem today trying to decode
>a string with a "ç" in it. (I think it's called a cedilla accent?) I
>sent the string with writeUTF and attempt to read it with readUTF. The
>readUTF got the exception.

The TextField class produces a \uFFF8 character instead of a \u00F8
character when "ø" is typed in.

You are right, the problem relates to
java.io.DataInputStream.readUTF().
I have been able to narrow the problem down this far: When I type the
Norwegian character "ø" (which is 0xF8 in MS Windows and Latin-1) in a
TextField, the character in the String from TextField.getText() is
actually the Unicode character \uFFF8. This is stored correctly with
writeUTF as the three bytes EF BF B8 or

1110 1111 1011 1111 1011 1000
---- xxxx --xx xxxx --xx xxxx

where the bits marked x are the real character, and the bits marked -
(minus) are the UTF flag bits.
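
The byte layout can be checked with a small sketch like the one below,
which writes a one-character string with writeUTF into a byte array and
dumps everything after the two-byte length prefix as hex. The class and
method names (ModifiedUtfBytes, encodedHex) are invented for this
example; only DataOutputStream.writeUTF comes from the library.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class: shows the bytes writeUTF produces for a string.
class ModifiedUtfBytes {
    // Returns the encoded bytes as upper-case hex, skipping the
    // two-byte length prefix that writeUTF puts in front.
    static String encodedHex(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] bytes = buf.toByteArray();
        StringBuilder hex = new StringBuilder();
        for (int i = 2; i < bytes.length; i++) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodedHex("\uFFF8")); // prints EF BF B8
        System.out.println(encodedHex("\u00F8")); // prints C3 B8
    }
}
```

The second line shows what the intended character would look like on
disk: \u00F8 needs only the two-byte form, so the three-byte form
appears only because the string already holds \uFFF8.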

When this sequence is read in, the readUTF function throws the
UTFDataFormatException.

The characters from \uFFF0 to \uFFFC are described as "Specials" in the
Unicode encoding, according to "Java in a Nutshell" (David Flanagan).

The same problem appears when I use the character "ø" in a parameter
to a program in a DOS box.

Now I have two mysteries:

1. Why does TextField (and args to main) produce Strings with \uFFF8
instead of \u00F8 when a Norwegian "ø" is typed?

2. Why does readUTF not handle the \uFFF8 character?
--
email: ri...@bgnett.no

Johan Ur Riise

Oct 20, 1996, 3:00:00 AM

>[Johan Riise]

>2. Why does not readUTF handle the \uFFF8 character?

I think I found a bug in the readUTF function. Correct me if I am
wrong: There should be a break statement as the last statement in the
case 14 branch. With the bug, every three-byte UTF character falls
through to the default branch and produces the exception, that is, all
Unicode characters from \u0800 to \uFFFF.

From the file ...Java\Src\java\io\DataInputStream.java in the Symantec
Cafe 1.51-distribution: (--> in the margin where the bug is).

    public final static String readUTF(DataInput in) throws IOException {
        int utflen = in.readUnsignedShort();
        char str[] = new char[utflen];
        int count = 0;
        int strlen = 0;
        while (count < utflen) {
            int c = in.readUnsignedByte();
            int char2, char3;
            switch (c >> 4) {
            case 0: case 1: case 2: case 3: case 4: case 5: case 6: case 7:
                // 0xxxxxxx
                count++;
                str[strlen++] = (char)c;
                break;
            case 12: case 13:
                // 110x xxxx 10xx xxxx
                count += 2;
                if (count > utflen)
                    throw new UTFDataFormatException();
                char2 = in.readUnsignedByte();
                if ((char2 & 0xC0) != 0x80)
                    throw new UTFDataFormatException();
                str[strlen++] = (char)(((c & 0x1F) << 6) | (char2 & 0x3F));
                break;
            case 14:
                // 1110 xxxx 10xx xxxx 10xx xxxx
                count += 3;
                if (count > utflen)
                    throw new UTFDataFormatException();
                char2 = in.readUnsignedByte();
                char3 = in.readUnsignedByte();
                if (((char2 & 0xC0) != 0x80) || ((char3 & 0xC0) != 0x80))
                    throw new UTFDataFormatException();
                str[strlen++] = (char)(((c & 0x0F) << 12) |
                                       ((char2 & 0x3F) << 6) |
-->                                    ((char3 & 0x3F) << 0));
            default:
                // 10xx xxxx, 1111 xxxx
                throw new UTFDataFormatException();
            }
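
To illustrate the fix I have in mind, here is a reduced sketch of the
same switch with the break added to case 14. It is not the library
source: it decodes a single character from a byte array, the names
(ThreeByteDecode, decodeFirst) are made up, and it assumes the array
holds one complete sequence.

```java
// Hypothetical reduced decoder, not the JDK code: decodes the first
// character of a byte array using the same three lead-byte classes.
class ThreeByteDecode {
    static char decodeFirst(byte[] in) {
        int c = in[0] & 0xFF;
        char result;
        switch (c >> 4) {
        case 0: case 1: case 2: case 3: case 4: case 5: case 6: case 7:
            // 0xxxxxxx
            result = (char) c;
            break;
        case 12: case 13: {
            // 110x xxxx 10xx xxxx
            int char2 = in[1] & 0xFF;
            if ((char2 & 0xC0) != 0x80)
                throw new IllegalArgumentException("bad continuation byte");
            result = (char) (((c & 0x1F) << 6) | (char2 & 0x3F));
            break;
        }
        case 14: {
            // 1110 xxxx 10xx xxxx 10xx xxxx
            int char2 = in[1] & 0xFF;
            int char3 = in[2] & 0xFF;
            if (((char2 & 0xC0) != 0x80) || ((char3 & 0xC0) != 0x80))
                throw new IllegalArgumentException("bad continuation byte");
            result = (char) (((c & 0x0F) << 12)
                    | ((char2 & 0x3F) << 6)
                    | (char3 & 0x3F));
            break; // without this, control falls into the default branch
        }
        default:
            // 10xx xxxx, 1111 xxxx
            throw new IllegalArgumentException("bad lead byte");
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] bytes = { (byte) 0xEF, (byte) 0xBF, (byte) 0xB8 };
        System.out.println(Integer.toHexString(decodeFirst(bytes))); // prints fff8
    }
}
```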

--
email: ri...@bgnett.no

J. David Beutel

Oct 21, 1996, 3:00:00 AM

to Johan Ur Riise, java...@java.sun.com

On Sun, 20 Oct 1996, Johan Ur Riise wrote:

> The TextField class produces a \uFFF8 character instead of a \u00F8
> character when "ø" is typed in.
>
> You are right, the problem relates to
> java.io.DataInputStream.readUTF().
> I have been able to narrow the problem down this far: When I type the
> Norwegian character "ø" (which is 0xF8 in MS Windows and Latin-1) in a
> TextField, the character in the String from TextField.getText() is
> actually the Unicode character \uFFF8. This is stored correctly with
> writeUTF as the three bytes EF BF B8 or
>
> 1110 1111 1011 1111 1011 1000
> ---- xxxx --xx xxxx --xx xxxx
>
> where the bits marked x are the real character, and the bits marked -
> (minus) are the UTF flag bits.
>
> When this sequence is read in, the readUTF function throws the
> UTFDataFormatException.
>
> The characters from \uFFF0 to \uFFFC are described as "Specials" in the
> Unicode encoding, according to "Java in a Nutshell" (David Flanagan).
>
> The same problem appears when I use the character "ø" in a parameter
> to a program in a DOS box.
>
> Now I have two mysteries:
>
> 1. Why does TextField (and args to main) produce Strings with \uFFF8
> instead of \u00F8 when a Norwegian "ø" is typed?

Since char is signed in C, I would guess that the library that's
grabbing the chars from the TextField is sign-extending them
as it writes them into the TextField object. (I.e., instead of
masking them.) You'd probably see the same thing w/ \u0080-\u00FF.
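
The effect is easy to reproduce in Java itself, where byte is signed.
The sketch below only illustrates the guess above; it says nothing
about what the native TextField code actually does, and the class and
method names are made up.

```java
// Hypothetical demo: widening a signed 8-bit value to a 16-bit char
// with and without masking off the sign extension.
class SignExtension {
    // byte -> char goes through int, so the sign bit is copied upward
    // before the value is cut back to 16 bits.
    static char signExtended(byte b) {
        return (char) b;
    }

    // Masking with 0xFF keeps only the low eight bits.
    static char masked(byte b) {
        return (char) (b & 0xFF);
    }

    public static void main(String[] args) {
        byte raw = (byte) 0xF8; // Latin-1 code for the Norwegian o-slash
        System.out.println(Integer.toHexString(signExtended(raw))); // prints fff8
        System.out.println(Integer.toHexString(masked(raw)));       // prints f8
    }
}
```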

> 2. Why does not readUTF handle the \uFFF8 character?

It looks like a bug in src/java/io/DataInputStream.java's readUTF()!
Case 14 is missing the break at the end, falling thru to the
default case of a format error.

--
J. David Beutel "You're inhabited by the society you live in through
11011011 j...@pinn.net your use of language." McCorduck on Turkle on Lacan

Doug Bell

Oct 21, 1996, 3:00:00 AM

"J. David Beutel" <jdb@localhost> wrote:

> Since char is signed in C...

This varies between compilers. I've worked with more than one compiler
that treated char as unsigned and more than one which treated it as
signed. One of the many reasons not to use naked types in C.

Doug Bell
db...@shvn.com

Johan Ur Riise

Oct 23, 1996, 3:00:00 AM

"J. David Beutel" <jdb@localhost> wrote:

I have reported both of these as bug reports to java...@java.sun.com.
--
email: ri...@bgnett.no