
Re: converting unicode to UTF-8


Steve Horsley

Nov 19, 2004, 5:48:58 PM
peter10 wrote:
> Hi everybody,
>
> I would like to convert unicode text (coming from a swing JTextPane -
> I think that is unicode by default!?) to UTF-8. I tried the code
> underneath, but the xml-database I am using still complains about
> wrong characters (error message: "Invalid byte 2 of 3-byte UTF-8
> sequence").
>
> ByteArrayOutputStream out = new ByteArrayOutputStream();
> DataOutputStream dataOut = new DataOutputStream(out);
> dataOut.writeUTF(text_input);
> String text_output = out.toString("UTF-8");
>
> Can anybody tell me what the mistake is that I am making???
>
> Thanks a lot for your help!
>
> Peter

Look closely at the docs for writeUTF and you will find that it
also writes a 2-byte binary length indicator at the front. I guess
this is the problem. I suggest that you use an OutputStreamWriter
instead, like this:

ByteArrayOutputStream baos = new ByteArrayOutputStream();
OutputStreamWriter out = new OutputStreamWriter(baos);
out.write(text_input);

Steve

Boudewijn Dijkstra

Nov 20, 2004, 6:50:58 AM
"peter10" <peter.m...@gmx.de> wrote in message
news:5ccab056.04111...@posting.google.com...

> Hi everybody,
>
> I would like to convert unicode text (coming from a swing JTextPane -
> I think that is unicode by default!?) to UTF-8.

The U in UTF stands for 'Unicode', so you want to convert Unicode to Unicode.


Chris Uppal

Nov 20, 2004, 7:59:12 AM
peter10 wrote:

> ByteArrayOutputStream out = new ByteArrayOutputStream();
> DataOutputStream dataOut = new DataOutputStream(out);
> dataOut.writeUTF(text_input);

The first problem here is that writeUTF() does /NOT/ write UTF-8. It's an
incredibly, unbelievably, stupidly, misleadingly-named method. What it does is
write a two-byte length (the number of encoded bytes that follow, as Steve has
already mentioned) followed by some bytes that represent the string in a format
that is (conceptually) related to, but not interchangeable with, UTF-8.
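
To make the difference concrete, here is a minimal sketch that dumps the bytes produced by writeUTF() next to the bytes of a real UTF-8 encoding. The class and method names (WriteUtfDemo, viaWriteUTF, viaGetBytes) are made up for illustration, and it assumes Java 7 or later for StandardCharsets and try-with-resources.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the original thread.
class WriteUtfDemo {

    // Bytes as written by DataOutputStream.writeUTF:
    // a 2-byte length prefix followed by "modified UTF-8".
    static byte[] viaWriteUTF(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            // Not expected for an in-memory stream and a short string.
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Bytes of the standard UTF-8 encoding, with no prefix.
    static byte[] viaGetBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "h\u00e9"; // 'h' followed by e-acute

        // Two extra bytes (0, 3) appear in front of the encoded text.
        System.out.println(Arrays.toString(viaWriteUTF(text))); // [0, 3, 104, -61, -87]
        System.out.println(Arrays.toString(viaGetBytes(text))); // [104, -61, -87]

        // U+0000 is one place where the payload itself differs:
        // writeUTF emits the two bytes C0 80, UTF-8 emits a single 00.
        System.out.println(Arrays.toString(viaWriteUTF("\u0000"))); // [0, 2, -64, -128]
        System.out.println(Arrays.toString(viaGetBytes("\u0000"))); // [0]
    }
}
```

For ordinary text the payload after the prefix usually matches UTF-8, so the length prefix is the most likely cause of the "Invalid byte" complaint; the payload only diverges for U+0000 and for characters outside the Basic Multilingual Plane.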

UTF-8 is a way of taking a stream/string of Unicode characters (and Java
Strings can be viewed as such, although the correspondence is not as close as
it looks), and representing them as bytes in a binary stream or similar. In
Java that conversion is ultimately provided by a "charset", specifically the
one named "UTF-8". Probably the easiest way for you to use that would be
either to ask your String for its bytes with
aString.getBytes("UTF-8");
or to use an OutputStreamWriter constructed with a 'charsetname' of "UTF-8".
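
As a rough illustration of the first option, the sketch below encodes a string with getBytes and prints the result in hex. GetBytesDemo, toUtf8 and hex are invented names; it uses the StandardCharsets.UTF_8 constant (Java 7 or later) instead of the "UTF-8" string, which avoids the checked UnsupportedEncodingException that the String-named overload declares.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original thread.
class GetBytesDemo {

    // Encode the string's characters as UTF-8 bytes.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Render bytes as space-separated two-digit hex for inspection.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // The euro sign U+20AC takes three bytes in UTF-8.
        System.out.println(hex(toUtf8("\u20ac"))); // E2 82 AC
    }
}
```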

-- chris


Chris Smith

Nov 20, 2004, 9:43:09 AM
Steve Horsley <sh...@the.moon> wrote:
> Look closely at the docs for writeUTF and you will find that it
> also writes a 2-byte binary length indicator at the front. I guess
> this is the problem. I suggest that you use an OutputStreamWriter
> instead, like this:
>
> ByteArrayOutputStream baos = new ByteArrayOutputStream();
> OutputStreamWriter out = new OutputStreamWriter(baos);
> out.write(text_input);

Since UTF-8 was explicitly requested, that should be:

ByteArrayOutputStream baos = new ByteArrayOutputStream();
OutputStreamWriter out = new OutputStreamWriter(baos, "UTF-8");
out.write(text_input);
out.flush(); // the writer buffers; flush before reading bytes from baos
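
A self-contained version of this approach might look like the sketch below. WriterDemo and encode are made-up names, and it assumes Java 7 or later. The point it tries to show is that the writer has to be flushed or closed before the bytes are read back, because OutputStreamWriter buffers internally.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original thread.
class WriterDemo {

    // Encode text to UTF-8 by pushing it through an OutputStreamWriter.
    static byte[] encode(String text) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        // try-with-resources closes (and therefore flushes) the writer
        // before baos.toByteArray() is called below.
        try (Writer out = new OutputStreamWriter(baos, StandardCharsets.UTF_8)) {
            out.write(text);
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
        return baos.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = encode("h\u00e9");
        System.out.println(bytes.length); // 3: one byte for 'h', two for e-acute
    }
}
```

For a string that is already fully in memory, getBytes is shorter; the writer form is more useful when the text is produced piece by piece or goes straight to a file or socket.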

--
www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation

peter10

Nov 20, 2004, 12:43:19 PM
Hallo!

Thanks to your code-snippet and with the getEncoding()-method of the
OutputStreamWriter I found out that the encoding that is apparently
being used inside the JTextPane is "Cp1252".

ByteArrayOutputStream baos = new ByteArrayOutputStream();
OutputStreamWriter out = new OutputStreamWriter(baos);

out.write(input_string);
String encoding = out.getEncoding();

Now I have two - maybe stupid - questions:
1) How is that possible if the sun-documentation about Documents (used
in JTextPanes) reads as follows:

"To support internationalization, the Swing text model uses unicode
characters..." ???

2) how do I get a String out of the OutputStreamWriter as there is no
getText() method available?

Thanks for any help!

Peter

Boudewijn Dijkstra

Nov 20, 2004, 3:43:43 PM
"peter10" <peter.m...@gmx.de> wrote in message
news:5ccab056.04112...@posting.google.com...

> Hallo!
>
> Thanks to your code-snippet and with the getEncoding()-method of the
> OutputStreamWriter I found out that the encoding that is apparently
> being used inside the JTextPane is "Cp1252".
>
> ByteArrayOutputStream baos = new ByteArrayOutputStream();
> OutputStreamWriter out = new OutputStreamWriter(baos);
> out.write(input_string);
> String encoding = out.getEncoding();
>
> Now I have two - maybe stupid - questions:
> 1) How is that possible if the sun-documentation about Documents (used
> in JTextPanes) reads as follows:
>
> "To support internationalization, the Swing text model uses unicode
> characters..." ???

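OutputStreamWriter by default uses the *platform* default encoding, which has
nothing to do with Swing. getEncoding() tells you what bytes the writer will
produce, not how the JTextPane stores its text.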
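
A small sketch of that distinction, assuming Java 7 or later; DefaultEncodingDemo and its method names are invented for illustration. The first writer reports whatever the platform default happens to be on the machine running it, the second reports the charset that was passed in explicitly.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original thread.
class DefaultEncodingDemo {

    // Encoding name reported by a writer built without an explicit charset.
    static String defaultWriterEncoding() {
        return new OutputStreamWriter(new ByteArrayOutputStream()).getEncoding();
    }

    // Encoding name reported by a writer built with UTF-8 requested explicitly.
    static String explicitWriterEncoding() {
        return new OutputStreamWriter(
                new ByteArrayOutputStream(), StandardCharsets.UTF_8).getEncoding();
    }

    public static void main(String[] args) {
        // Platform dependent, so no particular value is promised here.
        System.out.println("platform default: " + Charset.defaultCharset());
        System.out.println("writer, no charset: " + defaultWriterEncoding());

        // getEncoding() may return a historical alias rather than "UTF-8".
        System.out.println("writer, explicit: " + explicitWriterEncoding());
    }
}
```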

> 2) how do I get a String out of the OutputStreamWriter as there is no
> getText() method available?

If you decoded the bytes back into a String, you would just get the original
input_string again. What you want to hand to the database is the bytes, so I
recommend you use input_string.getBytes("UTF-8") instead.
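
The round trip can be sketched as follows. RoundTripDemo and its methods are hypothetical names, Java 7 or later is assumed, and the equality only holds for well-formed text (no unpaired surrogates).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original thread.
class RoundTripDemo {

    // Encode to UTF-8 bytes, then decode those bytes back into a String.
    static String roundTrip(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String input = "Gr\u00fc\u00dfe"; // German "Gruesse" with u-umlaut and sharp s

        // Decoding gives back an equal String, so nothing is gained by it.
        System.out.println(roundTrip(input).equals(input)); // true

        // The encoded form is longer than the char count for non-ASCII text.
        System.out.println(input.length());    // 5
        System.out.println(utf8Length(input)); // 7
    }
}
```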


peter10

Nov 19, 2004, 5:03:14 PM
Hi everybody,

I would like to convert unicode text (coming from a swing JTextPane -
I think that is unicode by default!?) to UTF-8. I tried the code
underneath, but the xml-database I am using still complains about
wrong characters (error message: "Invalid byte 2 of 3-byte UTF-8
sequence").

ByteArrayOutputStream out = new ByteArrayOutputStream();
DataOutputStream dataOut = new DataOutputStream(out);
dataOut.writeUTF(text_input);
