Converting String to byte[] and back

Dave Roberts

Mar 29, 1998, 3:00:00 AM

I'm trying to convert a string to a byte array of *ASCII* (or ISO
8859-1) characters. I need a way to do this reliably in JDK 1.1+.

I have found String.getBytes(int srcBegin, int srcEnd, byte[] dst, int
dstBegin), but this is deprecated.

I have found String.getBytes(String enc), only I can't seem to find
out what encoding names are valid. I tried "ASCII" and "USASCII".
Those threw an UnsupportedEncodingException. I then tried "US-ASCII"
and this throws:
java.lang.IllegalArgumentException: sun.io.CharToByteUS-ASCII
at
sun.io.CharToByteConverter.getConverter(CharToByteConverter.java:95)
at java.lang.String.getBytes(String.java:532)
at UserNameAttribute.getEncodedData(UserNameAttribute.java:24)
at RadiusPacket.getPacket(RadiusPacket.java:59)
at Test.main(Test.java:11)

So, question: What is everybody else doing to solve this? Just using
the deprecated routines that seem to work, or using the new routines
that don't seem to be documented worth spit?

Any help would be appreciated.

-- Dave Roberts
da...@droberts.com
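Since the accepted names are not documented, one workaround is to probe a list of candidate names and keep the first one the runtime accepts. The following is a minimal sketch under that assumption, not anything taken from the JDK docs. `EncodingProbe` and `firstSupported` are hypothetical names, and the catch of IllegalArgumentException is there only because the stack trace above shows that exception escaping from getBytes.

```java
import java.io.UnsupportedEncodingException;

// Hypothetical helper for illustration; not part of the JDK.
final class EncodingProbe {
    /** Returns the first candidate that String.getBytes(String) accepts, or null if none is. */
    static String firstSupported(String[] candidates) {
        for (int i = 0; i < candidates.length; i++) {
            try {
                "A".getBytes(candidates[i]);   // throws if the name is not recognized
                return candidates[i];
            } catch (UnsupportedEncodingException e) {
                // documented failure mode: try the next name
            } catch (IllegalArgumentException e) {
                // failure mode seen in the stack trace above: try the next name
            }
        }
        return null;
    }
}
```

Calling `EncodingProbe.firstSupported(new String[] { "US-ASCII", "ASCII", "8859_1" })` would then return whichever of those names the installed runtime knows about.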

Dave Roberts

Mar 29, 1998, 3:00:00 AM

On Sun, 29 Mar 1998 19:23:33 GMT, da...@droberts.com (Dave Roberts)
wrote:

For anybody else who is looking for this or just interested...

I found this after finally searching JDC for bugs on this subject. It
looks like this is a big issue. The names are not in the docs,
including the 1.2 docs. The only place I could find anything like the
names documented was the native2ascii converter utility documentation.
The names listed there are as follows. "8859_1" seems to work well. I
can't believe there isn't something for simple ASCII. Has that stopped
being a character set now? :-) Also, there seem to be a lot of bugs
filed saying that these encoding names aren't following the official
encoding names specified by IANA/IETF. For instance, ISO 8859-1 is
supposed to be "ISO-8859-1", not "8859_1".
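A minimal sketch of the round trip with an explicit encoding name. The name is passed in as a parameter because, as described above, which spelling a given runtime accepts is exactly the open question: "8859_1" is the one reported to work here, and "ISO-8859-1" is the IANA spelling. `Latin1RoundTrip` is a hypothetical class name.

```java
import java.io.UnsupportedEncodingException;

// Hypothetical helper for illustration; not part of the JDK.
final class Latin1RoundTrip {
    /** String to bytes, using the named encoding. */
    static byte[] encode(String s, String enc) throws UnsupportedEncodingException {
        return s.getBytes(enc);
    }

    /** Bytes back to String, using the named encoding. */
    static String decode(byte[] b, String enc) throws UnsupportedEncodingException {
        return new String(b, enc);
    }
}
```

For example, `Latin1RoundTrip.encode("caf\u00e9", "8859_1")` should give four bytes, the last being 0xE9. Characters that the target encoding cannot represent are replaced rather than rejected, so the conversion is only lossless for text that fits the encoding.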

I gotta say, this is pretty lame. I spent about 2 hours on this one.

-- Dave

8859_1 ISO 8859-1
8859_2 ISO 8859-2
8859_3 ISO 8859-3
8859_4 ISO 8859-4
8859_5 ISO 8859-5
8859_6 ISO 8859-6
8859_7 ISO 8859-7
8859_8 ISO 8859-8
8859_9 ISO 8859-9
Big5 Big5, Traditional Chinese
CNS11643 CNS 11643, Traditional Chinese
Cp037 USA, Canada(Bilingual, French), Netherlands,
Portugal, Brazil, Australia
Cp1006 IBM AIX Pakistan (Urdu)
Cp1025 IBM Multilingual Cyrillic: Bulgaria, Bosnia,
Herzegovina, Macedonia(FYR)
Cp1026 IBM Latin-5, Turkey
Cp1046 IBM Open Edition US EBCDIC
Cp1097 IBM Iran(Farsi)/Persian
Cp1098 IBM Iran(Farsi)/Persian (PC)
Cp1112 IBM Latvia, Lithuania
Cp1122 IBM Estonia
Cp1123 IBM Ukraine
Cp1124 IBM AIX Ukraine
Cp1125 IBM Ukraine (PC)
Cp1250 Windows Eastern European
Cp1251 Windows Cyrillic
Cp1252 Windows Latin-1
Cp1253 Windows Greek
Cp1254 Windows Turkish
Cp1255 Windows Hebrew
Cp1256 Windows Arabic
Cp1257 Windows Baltic
Cp1258 Windows Vietnamese
Cp1381 IBM OS/2, DOS People's Republic of China (PRC)
Cp1383 IBM AIX People's Republic of China (PRC)
Cp273 IBM Austria, Germany
Cp277 IBM Denmark, Norway
Cp278 IBM Finland, Sweden
Cp280 IBM Italy
Cp284 IBM Catalan/Spain, Spanish Latin America
Cp285 IBM United Kingdom, Ireland
Cp297 IBM France
Cp33722 IBM-eucJP - Japanese (superset of 5050)
Cp420 IBM Arabic
Cp424 IBM Hebrew
Cp437 MS-DOS United States, Australia, New Zealand,
South Africa
Cp500 EBCDIC 500V1
Cp737 PC Greek
Cp775 PC Baltic
Cp838 IBM Thailand extended SBCS
Cp850 MS-DOS Latin-1
Cp852 MS-DOS Latin-2
Cp855 IBM Cyrillic
Cp857 IBM Turkish
Cp860 MS-DOS Portuguese
Cp861 MS-DOS Icelandic
Cp862 PC Hebrew
Cp863 MS-DOS Canadian French
Cp864 PC Arabic
Cp865 MS-DOS Nordic
Cp866 MS-DOS Russian
Cp868 MS-DOS Pakistan
Cp869 IBM Modern Greek
Cp870 IBM Multilingual Latin-2
Cp871 IBM Iceland
Cp874 IBM Thai
Cp875 IBM Greek
Cp918 IBM Pakistan(Urdu)
Cp921 IBM Latvia, Lithuania (AIX, DOS)
Cp922 IBM Estonia (AIX, DOS)
Cp930 Japanese Katakana-Kanji mixed with 4370 UDC,
superset of 5026
Cp933 Korean Mixed with 1880 UDC, superset of 5029
Cp935 Simplified Chinese Host mixed with 1880 UDC,
superset of 5031
Cp937 Traditional Chinese Host mixed with 6204 UDC,
superset of 5033
Cp939 Japanese Latin Kanji mixed with 4370 UDC,
superset of 5035
Cp942 Japanese (OS/2) superset of 932
Cp948 OS/2 Chinese (Taiwan) superset of 938
Cp949 PC Korean
Cp950 PC Chinese (Hong Kong, Taiwan)
Cp964 AIX Chinese (Taiwan)
Cp970 AIX Korean
EUCJIS JIS, EUC Encoding, Japanese
GB2312 GB2312, EUC encoding, Simplified Chinese
GBK GBK, Simplified Chinese
ISO2022CN ISO 2022 CN, Chinese
ISO2022CN_CNS CNS 11643 in ISO-2022-CN form, T. Chinese
ISO2022CN_GB GB 2312 in ISO-2022-CN form, S. Chinese
ISO2022KR ISO 2022 KR, Korean
JIS JIS, Japanese
JIS0208 JIS 0208, Japanese
KOI8_R KOI8-R, Russian
KSC5601 KS C 5601, Korean
MS874 Windows Thai
MacArabic Macintosh Arabic
MacCentralEurope Macintosh Latin-2
MacCroatian Macintosh Croatian
MacCyrillic Macintosh Cyrillic
MacDingbat Macintosh Dingbat
MacGreek Macintosh Greek
MacHebrew Macintosh Hebrew
MacIceland Macintosh Iceland
MacRoman Macintosh Roman
MacRomania Macintosh Romania
MacSymbol Macintosh Symbol
MacThai Macintosh Thai
MacTurkish Macintosh Turkish
MacUkraine Macintosh Ukraine
SJIS Shift-JIS, Japanese
UTF8 UTF-8

PPAATT

Mar 30, 1998, 3:00:00 AM

> ASCII ... ISO 8859-1 ... String

I get the impression that everybody (except me) "just knows" that Unicode
agrees with ISO-8859-1 on the meaning of the first 256 codes? (Even I knew
about \u0020..\u007E - but the others?)
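The claim about the first 256 codes can be checked mechanically. Here is a small sketch that encodes and decodes each of U+0000 through U+00FF and verifies that the byte value equals the code point. `Latin1Check` is a hypothetical name, and the encoding name is a parameter because the spelling a runtime accepts may vary.

```java
import java.io.UnsupportedEncodingException;

// Hypothetical helper for illustration; not part of the JDK.
final class Latin1Check {
    /** True if the named encoding maps U+0000..U+00FF to bytes 0x00..0xFF and back. */
    static boolean matchesFirst256(String enc) throws UnsupportedEncodingException {
        for (int i = 0; i < 256; i++) {
            // char -> byte: expect a single byte with the same numeric value
            byte[] b = String.valueOf((char) i).getBytes(enc);
            if (b.length != 1 || (b[0] & 0xFF) != i) return false;
            // byte -> char: expect a single char with the same numeric value
            String back = new String(new byte[] { (byte) i }, enc);
            if (back.length() != 1 || back.charAt(0) != (char) i) return false;
        }
        return true;
    }
}
```

An ISO 8859-1 converter should pass this check, while a 7-bit ASCII converter should fail it at 0x80.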

> new routines ... don't seem ... documented worth spit?
> convert ... string ... byte array ... JDK 1.1+.

How about writing a string into an OutputStreamWriter constructed on top of a
ByteArrayOutputStream for the conversion one way ...

... and read a string from an InputStreamReader constructed on top of a
ByteArrayInputStream for the conversion the other way?

Note the ByteArrayOutputStream does, yes does, grow as needed to cope with
however much compression/inflation results from the translation.
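The two stream-based conversions described above could look roughly like this. It is a sketch only: `StreamCodec` and its method names are hypothetical, and the encoding name is whatever the runtime accepts.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;

// Hypothetical helper for illustration; not part of the JDK.
final class StreamCodec {
    /** String to bytes: write through an OutputStreamWriter into a growable byte buffer. */
    static byte[] encode(String s, String enc) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        Writer w = new OutputStreamWriter(bytes, enc);
        w.write(s);
        w.close();                        // close() flushes any buffered output
        return bytes.toByteArray();
    }

    /** Bytes to String: read through an InputStreamReader until end of stream. */
    static String decode(byte[] data, String enc) throws IOException {
        Reader r = new InputStreamReader(new ByteArrayInputStream(data), enc);
        StringBuffer sb = new StringBuffer();
        char[] chunk = new char[256];
        int n;
        while ((n = r.read(chunk)) != -1) {
            sb.append(chunk, 0, n);
        }
        r.close();
        return sb.toString();
    }
}
```

An unknown encoding name surfaces as UnsupportedEncodingException, which is a subclass of IOException, so callers need only one catch clause.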

> reliably ... Any help would be appreciated.

This might be cheap and good - though perhaps therefore none too fast? I
imagine one could lose plenty of time in re-instantiating the
ByteArrayInputStream class for each new byte array handled? (In particular,
ByteArrayInputStream lacks a symmetric analogue to the reset() method of
ByteArrayOutputStream.)

> can't seem to find out what encoding names are valid ...
> [ docs/tooldocs/win32/native2ascii.html ]
> JDC ... names aren't following ... IANA/IETF.
> For instance, ... "ISO-8859-1", not "8859_1".

java.io.InputStreamReader.getEncoding() will identify at least one valid
encoding to try?
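A sketch of that probe: construct an InputStreamReader without naming an encoding and ask it what it picked. `DefaultEncodingProbe` is a hypothetical name, and the value returned depends entirely on the installation.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;

// Hypothetical helper for illustration; not part of the JDK.
final class DefaultEncodingProbe {
    /** Returns the name the runtime reports for its default encoding. */
    static String defaultEncodingName() {
        // No encoding argument, so the reader falls back to the platform default.
        InputStreamReader r = new InputStreamReader(new ByteArrayInputStream(new byte[0]));
        return r.getEncoding();
    }
}
```

This yields one name that is known to be valid on the machine at hand, though not necessarily the encoding a wire protocol needs.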

Ppa...@aol.com

Stuart D. Gathman

Mar 30, 1998, 3:00:00 AM

On Sun, 29 Mar 1998 19:23:33 GMT, Dave Roberts <da...@droberts.com> wrote:
>I'm trying to convert a string to a byte array of *ASCII* (or ISO
>8859-1) characters. I need a way to do this reliably in JDK 1.1+.
>
>I have found String.getBytes(int srcBegin, int srcEnd, byte[] dst, int
>dstBegin), but this is deprecated.
>
>I have found String.getBytes(String enc), only I can't seem to find
>out what encoding names are valid. I tried "ASCII" and "USASCII".

The ones I use the most are "Cp437" (IBM-PC charset) and "Cp037" (EBCDIC).

Look at the contents of the sun.io package. The suffix of each CharToBytexxxx
class name is an encoding name you can use. You can even define your own -
the API is in the JDK1.1.1 internationalization documentation. The
CharToBytexxx classes were originally in java.io. Unfortunately, they
were removed from the docs and moved to sun.io because "a better API was
in the works". JDK 1.2 beta 3 is out and there is still no "better API"
even on the horizon!
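For readers on later runtimes: java.nio.charset.Charset, added in JDK 1.4, does offer a documented way to list and check encoding names. This thread predates it, so the sketch below assumes Java 5 or later and does not apply to JDK 1.1 or 1.2. `CharsetNames` is a hypothetical class name.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration; assumes Java 5 or later.
final class CharsetNames {
    /** Prints every charset the runtime supports, with its aliases. */
    static void printAll() {
        for (Charset cs : Charset.availableCharsets().values()) {
            System.out.println(cs.name() + " " + cs.aliases());
        }
    }

    /** True if the runtime recognizes the name; false for unknown or malformed names. */
    static boolean isKnown(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {   // malformed charset name
            return false;
        }
    }

    /** Maps an alias to the canonical name. Throws if the name is unknown. */
    static String canonicalName(String name) {
        return Charset.forName(name).name();
    }
}
```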

There is a serious performance problem if you do a lot of String to
byte[] conversions (I do). Their code to find the proper class from
the encoding name is ridiculously inefficient. To make matters worse,
the CharToByte and ByteToChar classes keep getting GC'ed and reloaded!

I finally made myself a class to access the sun.io API directly (I know,
they can go away) and sped up our apps by a factor of *4*!!!
--
Stuart D. Gathman <stu...@bmsi.com>
Business Management Systems Inc. Phone: 703 591-0911 Fax: 703 591-6154
"Microsoft is the QWERTY of Operating Systems" - SDG
"Confutatis maledictus, flamis acribus addictus" - background song for
a Microsoft sponsored "Where do you want to go from here?" commercial.
(HINT: Find translation for the "Confutatis" movement of the Mozart Requiem).

Bill Wilkinson

Mar 30, 1998, 3:00:00 AM

to Dave Roberts

Dave Roberts wrote:
>
> I'm trying to convert a string to a byte array of *ASCII* (or ISO
> 8859-1) characters. I need a way to do this reliably in JDK 1.1+.
>
>

> I have found String.getBytes(String enc), only I can't seem to find
> out what encoding names are valid.

String.getBytes("ISO8859_1");
or maybe
String.getBytes("8859_1");
(This ends up invoking methods on classes
sun.io.ByteToChar8859_1 or sun.io.CharToByte8859_1.)

However, note that if you simply use

String.getBytes( );

the "default" encoding for the current installation will be used.
This *MAY* be preferable to explicitly giving an encoding.
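Whether the default is preferable depends on who reads the bytes. If they go into a protocol packet with a fixed character set, the default is only safe when it happens to agree with that character set. Here is a small sketch for checking this on a given installation; `DefaultVsExplicit` is a hypothetical name.

```java
import java.io.UnsupportedEncodingException;

// Hypothetical helper for illustration; not part of the JDK.
final class DefaultVsExplicit {
    /** True if the platform default encoding gives the same bytes for s as the named encoding. */
    static boolean sameAsDefault(String s, String enc) throws UnsupportedEncodingException {
        byte[] a = s.getBytes();      // platform default, varies by installation
        byte[] b = s.getBytes(enc);   // explicitly named encoding
        if (a.length != b.length) return false;
        for (int i = 0; i < a.length; i++) {
            if (a[i] != b[i]) return false;
        }
        return true;
    }
}
```

The answer can differ from one machine to the next for the same string, which is the argument for naming the encoding whenever the byte layout matters.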

The supported encodings are discussed in
http://www.javasoft.com/products/jdk/1.1/docs/guide/intl/intlTOC.doc.html
and particularly at
http://www.javasoft.com/products/jdk/1.1/docs/guide/intl/intl.doc.html#25303

Yury Voronov

Mar 31, 1998, 3:00:00 AM

Hello!

Stuart D. Gathman wrote:
[...]

> >I have found String.getBytes(String enc), only I can't seem to find

> >out what encoding names are valid. I tried "ASCII" and "USASCII".
>
> The ones I use the most are "Cp437" (IBM-PC charset) and "Cp037" (EBCDIC).
>
> Look at the contents of the sun.io package. All the CharToBytexxxx classes
> are the encoding names you can use. You can even define your own -
> the API is in the JDK1.1.1 internationalization documentation. The
> CharToBytexxx classes were originally in java.io. Unfortunately, they
> were removed from the docs and moved to sun.io because "a better API was
> in the works". JDK 1.2 beta 3 is out and there is still no "better API"
> even on the horizon!

You can also look at native2ascii utility documentation to find
available encoding names.

That's so. Yu.

Paul

Mar 31, 1998, 3:00:00 AM

Something else to look into might be to use ByteArrayOutputStream and
ByteArrayInputStream.

Paul

mY

Mar 31, 1998, 3:00:00 AM

On 30 Mar 1998 15:58:33 GMT, stu...@www.bmsi.com (Stuart D. Gathman)
wrote:

>There is a serious performance problem if you do a lot of String to
>byte[] conversions (I do). Their code to find the proper class from
>the encoding name is ridiculously inefficient. To make matters worse,
>the CharToByte and ByteToChar classes keep getting GC'ed and reloaded!

I do a lot of ASCII to String and back conversion. Right now I just
use the default encoding. But I just know that this is inefficient
because there is no conversion needed.
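If the data really is 7-bit ASCII, the conversion can be done by hand with no converter lookup at all. This is a minimal sketch under that assumption. `AsciiBytes` is a hypothetical name, and mapping anything outside ASCII to '?' is a choice made here for illustration.

```java
// Hypothetical helper for illustration; assumes the data is meant to be 7-bit ASCII.
final class AsciiBytes {
    /** String to ASCII bytes; characters above 0x7F become '?'. */
    static byte[] toBytes(String s) {
        byte[] out = new byte[s.length()];
        for (int i = 0; i < out.length; i++) {
            char c = s.charAt(i);
            out[i] = (byte) (c < 0x80 ? c : '?');
        }
        return out;
    }

    /** ASCII bytes to String; bytes with the high bit set become '?'. */
    static String fromBytes(byte[] b) {
        char[] c = new char[b.length];
        for (int i = 0; i < b.length; i++) {
            c[i] = (b[i] >= 0) ? (char) b[i] : '?';
        }
        return new String(c);
    }
}
```

For ISO 8859-1 the same loop works with the limit raised to 0x100 on the way out and `(char) (b[i] & 0xFF)` on the way back, since Latin-1 byte values equal the Unicode code points.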

>I finally made myself a class to access the sun.io API directly (I know,
>they can go away) and sped up our apps by a factor of *4*!!!

Can you show me what you did? Thanks!

Stuart D. Gathman

Apr 1, 1998, 3:00:00 AM

package bmsi.xopen;

import sun.io.ByteToCharConverter;
import sun.io.CharToByteConverter;
import sun.io.MalformedInputException;
import java.io.UnsupportedEncodingException;
import java.util.Hashtable;

/** This class is used to load and store values in Cisam/Xopen file format.
    Although we access sun.io.* classes directly for performance reasons, we
    can change this class to use the documented interface should those classes
    change or go away. The performance problem is that getConverter(String)
    is *really* slow. Plus the converter classes get GC'ed and reloaded a lot.
 */
public class FType {
    private final ByteToCharConverter tochar;
    private final CharToByteConverter tobyte;
    public final byte blank;
    private final String enc;
    private static final char[] cblank = { ' ' };
    /** Cache of FType instances, keyed by encoding name. */
    private static final Hashtable tbl = new Hashtable();

    /** Returns the cached FType for an encoding, creating it on first use. */
    public static FType getConverter(String enc)
        throws UnsupportedEncodingException
    {
        FType f = (FType)tbl.get(enc);
        if (f == null) {
            f = new FType(enc);
            tbl.put(enc,f);
        }
        return f;
    }

    private FType(String enc) throws UnsupportedEncodingException {
        ByteToCharConverter tochar = ByteToCharConverter.getConverter(enc);
        CharToByteConverter tobyte = CharToByteConverter.getConverter(enc);
        try {
            // Only encodings with a single-byte blank are supported.
            byte[] b = tobyte.convertAll(cblank);
            if (b.length != 1)
                throw new UnsupportedEncodingException(
                    "Multi-byte blank not supported");
            blank = b[0];
            this.tochar = tochar;
            this.tobyte = tobyte;
            this.enc = enc;
        }
        catch (MalformedInputException e) {
            throw new UnsupportedEncodingException("Can't convert blank");
        }
    }

    public String getEncoding() { return enc; }

    /** Converts a String to bytes; returns an empty array on malformed input. */
    public final byte[] getBytes(String s) {
        try {
            return tobyte.convertAll(s.toCharArray());
        }
        catch (MalformedInputException e) {
            return new byte[0];
        }
    }

    /** Stores s into buf at pos as a fixed-width field of len bytes,
        blank-padded on the right and truncated if too long. */
    public void ststr(String s,byte[] buf,int pos,int len) {
        byte[] b = getBytes(s);
        while (len > b.length)
            buf[pos + --len] = blank;
        System.arraycopy(b,0,buf,pos,len);
    }

    /** Converts bytes to a String; returns "" on malformed input. */
    public String getString(byte[] buf) {
        try {
            return new String(tochar.convertAll(buf));
        }
        catch (MalformedInputException e) {
            return "";
        }
    }

    /** Loads a fixed-width field of len bytes from buf at pos,
        trimming trailing blanks and control characters. */
    public String ldstr(byte[] buf,int pos,int len) {
        byte[] b = new byte[len];
        System.arraycopy(buf,pos,b,0,len);
        try {
            char[] c = tochar.convertAll(b);
            len = c.length;
            while (len > 0 && c[len-1] <= ' ') --len;
            return new String(c,0,len);
        }
        catch (MalformedInputException e) {
            return "";
        }
    }

    // Integer helpers: values are stored most significant byte first.

    public static void stshort(short i,byte[] buf,int pos) {
        buf[pos++] = (byte)(i >> 8);
        buf[pos] = (byte)i;
    }
    public static void stint(int i,byte[] buf,int pos) {
        buf[pos++] = (byte)(i >> 24);
        buf[pos++] = (byte)(i >> 16);
        buf[pos++] = (byte)(i >> 8);
        buf[pos] = (byte)i;
    }
    public static short ldshort(byte[] buf,int pos) {
        return (short)((buf[pos] << 8) | (buf[pos+1] & 255));
    }
    public static int ldushort(byte[] buf,int pos) {
        return ((buf[pos] & 255) << 8) | (buf[pos+1] & 255);
    }
    public static int ldint(byte[] buf,int pos) {
        return (ldshort(buf,pos) << 16) | ldushort(buf,pos+2);