character encoding issues

undertow

May 4, 2010, 5:11:54 PM
to Google Web Toolkit
Hello, I seem to be having issues with GWT and character encoding. I
have an Oracle database which stores strings in ISO-8859-1
encoding. GWT does NOT support Java's String.getBytes(), nor does it
support new String(byte[], encoding).

The question is: how do I get from the string bytes in the database blob
to a properly decoded string?

--
You received this message because you are subscribed to the Google Groups "Google Web Toolkit" group.
To post to this group, send email to google-we...@googlegroups.com.
To unsubscribe from this group, send email to google-web-tool...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/google-web-toolkit?hl=en.

David Given

May 5, 2010, 12:26:29 PM
to google-we...@googlegroups.com
On 04/05/10 22:11, undertow wrote:
> hello, i seem to be having issues with GWT and character encoding. I
> have an Oracle database which stores strings with iso-8859-1
> encoding. GWT does NOT support java's String.getBytes(), nor does it
> support new String(byte[], encoding).
>
> the question is, how do i get the string bytes from the database blob
> to the properly encoded string?

You need to convert the data from ISO-8859-1 to UTF-16 at the point
where your app touches Oracle --- that is, on the server.
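
A minimal sketch of that server-side conversion, assuming the blob has already been read into a byte array. The class and method names are hypothetical, and this is plain JVM code for the server, not something the GWT client can compile:

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical server-side helper; not part of GWT or of any Oracle API. */
final class Latin1Blob {
    private Latin1Blob() {}

    /** Decodes ISO-8859-1 bytes (as read from the blob) into a Java String. */
    static String toText(byte[] blobBytes) {
        return new String(blobBytes, StandardCharsets.ISO_8859_1);
    }

    /**
     * Encodes a String back to ISO-8859-1 bytes for storage.
     * Characters outside ISO-8859-1 are silently replaced with '?'.
     */
    static byte[] toBlobBytes(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }
}
```

The resulting String can then travel to the client as an ordinary String field; only the server ever sees the bytes.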

GWT supports standard Java strings, which are UTF-16 (i.e. sequences of
16-bit char values; note that a char is *not* the same thing as a Unicode
code point!), but as you've found it does not support transcoding. Character
conversion ideally only happens when you do I/O on the string, via a Reader
or a Writer, and as all I/O on GWT is supposed to either happen on the
server or else use native Unicode they haven't implemented it.

I have on occasion managed to force non-UTF-16 data into a string, but
strictly only as a hack, and it always causes problems. If Oracle's API
is giving you such a string, then They Are Doing It Wrong...

[As an aside: I have managed to port huge chunks of java.io and java.nio
to run on GWT client-side, and I've also got the basic framework of
java.nio.charset, so it *is* possible to do character encoding
translation on the client... but I haven't found any sensibly small
encoding codecs yet, so I'm having to write my own slow buggy ones ---
so you probably don't want this approach.]

--
┌─── dg@cowlark.com ───── http://www.cowlark.com ─────

│ life←{ ↑1 ⍵∨.^3 4=+/,¯1 0 1∘.⊖¯1 0 1∘.⌽⊂⍵ }
│ --- Conway's Game Of Life, in one line of APL

undertow

May 6, 2010, 11:08:34 AM
to Google Web Toolkit
Thank you for confirming what I had suspected I would need to do. So
the idea is: the user enters a bunch of text into a textarea, either by
typing it all in or by cutting and pasting from somewhere (like Word, ugh,
and its mangled characters). When the time comes to ship that text off to
the server, I would pluck the string out of the textarea and stick it in a
transfer object of sorts. (This is where I am a little fuzzy.) I would then
take the input string, do a getBytes() on it, and push that array
of bytes into a blob. Would I need to get the bytes with an encoding
argument, e.g. txt.getBytes("ISO-8859-1")? This method seems to work
OK, but if the user has pasted from MS Word into the text box, things still
come out mangled.

Sripathi Krishnan

May 6, 2010, 1:32:56 PM
to google-we...@googlegroups.com
Just a correction - GWT uses UTF-8 and not UTF-16. Also, you can do String.getBytes() and similar hacks to convert from ISO-8859-1 (oracle) to UTF-8 -- but in my opinion it is best to store data in UTF-8 in the database.

In general, you need to revisit all interfaces where data exchange happens, and ensure that a) both systems are using same encoding or b) One system re-encodes the data appropriately. (a) is always better than (b).

--Sri



David Given

May 6, 2010, 1:23:29 PM
to google-we...@googlegroups.com
On 06/05/10 16:08, undertow wrote:
> Thank you for confirming what i had suspected i would need to do. So
> the idea is, user enters a bunch of text into a textarea via typing it
> all in or cut and paste from somewhere (like Word, ugh and its mangled
> characters). when time comes to ship that text off to the server i
> would then pluck the string out of the textarea stick it in a transfer
> object of sorts. (this is where i am a little fuzzy) I would then
> take the input string do a getBytes() on it and then push that array
> of bytes into a blob. would i need to get the bytes with an encoding
> argument?

I believe so. GWT ought to get the string from the browser in UTF-16 ---
as that's what Strings are defined to be. You can then ship it back to
the server, as a String, and it should Just Work. Then you get to do the
charset conversion on the server.

> e.g. txt.getBytes("ISO-8859-1"). This method seems to work
> ok, but if user had pasted from ms word into the text box things still
> come out mangled.

I'm quite willing to believe that there are web browser bugs with all
this. It may be worth verifying that GWT is actually getting a valid
string from the browser (by going through and listing all the codepoints
in the string).

In addition, if Word is using all kinds of wacky non-ISO-8859-1
characters such as curly quotation marks and long dashes, then
getBytes() might be replacing them with ? signs. How is it being mangled?
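
Both checks can be sketched in a few lines of server-side Java. This is only an illustration: the class and method names are made up, and it assumes a Java 8+ JVM rather than the GWT client:

```java
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

/** Hypothetical diagnostic helper for inspecting a String received from the browser. */
final class StringInspector {
    private StringInspector() {}

    /** Lists every code point in the string as U+XXXX, separated by spaces. */
    static String listCodePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    /** True only if every character can be represented in ISO-8859-1. */
    static boolean fitsLatin1(String s) {
        // A fresh encoder per call, because CharsetEncoder is not thread-safe.
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(s);
    }
}
```

For a string pasted from Word such as "\u201Chi\u201D" (curly quotes around "hi"), listCodePoints gives "U+201C U+0068 U+0069 U+201D" and fitsLatin1 returns false, which is the case where getBytes("ISO-8859-1") substitutes ? for the quotes.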


David Given

May 6, 2010, 6:48:35 PM
to google-we...@googlegroups.com
On 06/05/10 18:32, Sripathi Krishnan wrote:
> Just a correction - GWT uses UTF-8 and not UTF-16. Also, you *can* do
> String.getBytes() and similar hacks to convert from ISO-8859-1 (oracle) to
> UTF-8 -- but in my opinion it is best to store data in UTF-8 in the
> database.

GWT *source code* is UTF-8 (if you know what's good for you!). GWT
*strings* are UTF-16, because the Java spec says so. String.charAt()
will return you a char, which is a single UTF-16 code unit.
It's vitally important to note that this is not the same as a Unicode
code point! Some code points get stored as pairs of chars (surrogate
pairs), so if you assume one char per code point your app will break on
some strings. You need String.codePointAt() to get a Unicode code point, but
I haven't checked to see whether that's supported on the client.
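
The difference is easy to see on a plain JVM. A small sketch with a hypothetical helper class (whether the same calls are available in GWT's client-side emulation is not verified here):

```java
/** Hypothetical helper contrasting UTF-16 code units with Unicode code points. */
final class CodePointDemo {
    private CodePointDemo() {}

    /** Number of Unicode code points, as opposed to String.length() (code units). */
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    /** The string as an array of code points, with surrogate pairs combined. */
    static int[] toCodePoints(String s) {
        return s.codePoints().toArray();
    }
}
```

For "a\uD83D\uDE00" (the letter a followed by U+1F600), length() is 3 but codePointLength is 2, and charAt(1) is only the high surrogate of the pair.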

String.getBytes() is not supported by GWT, and will only work on the
server (where it's running real Java).

--
┌─── dg@cowlark.com ───── http://www.cowlark.com ─────

│ "There is no Fermi Paradox. Any time space faring aliens make it to
│ Earth, the cows get them." --- Sam Starfall

Sripathi Krishnan

May 6, 2010, 10:30:21 PM
to google-we...@googlegroups.com
I meant GWT RPC explicitly sets character encoding to UTF-8 to transfer data from server to client and back. It does not matter how javascript internally stores characters; the problem here is that you are lying to javascript. You say "these raw bytes are UTF-8 encoded", but in fact they are encoded in ISO-8859-1.
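
What that mislabelling does to the text can be reproduced on the server. A rough sketch, with hypothetical names, that encodes a string in one charset and decodes it as if it were another:

```java
import java.nio.charset.Charset;

/** Hypothetical demo of decoding bytes under the wrong charset label. */
final class MislabelDemo {
    private MislabelDemo() {}

    /** Encodes text with the actual charset, then decodes it as the claimed one. */
    static String reinterpret(String text, Charset actual, Charset claimed) {
        byte[] bytes = text.getBytes(actual);
        return new String(bytes, claimed);
    }
}
```

With "é" (U+00E9), UTF-8 bytes read as ISO-8859-1 come out as the two characters "Ã©", while ISO-8859-1 bytes read as UTF-8 are malformed and decode to the replacement character U+FFFD.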

--Sri

undertow

May 7, 2010, 11:39:25 AM
to Google Web Toolkit
I believe I was able to solve the problem. The real problem was that
I couldn't use unsupported JRE functions in my transfer
object, so I had to pull the encoding/decoding process out of that
object and stick it in a utility function. From there I keep the
string as a String from the browser all the way to the server and to the
persistence layer. There I do a kind of "switcharoo": I take the
string, encode it to a byte array, and stuff it into the database blob. I
do the exact opposite on the way out of the database: gather the
bytes, run them through the decoding util function, and stuff the
string into the transfer object. One thing that still stands out to me is
that for the encoding/decoding process I am using the 'windows-1258' char
set. Using this to encode and decode preserves the mangled characters
when one pastes from a Word doc. Most, if not all, folks using this
webapp will be coming from Windows machines, so I think I am safe with
the 'hardcoded' (blasphemy, I know) windows-1258 encoding.


Thanks to everyone for their input, it helped me arrive at my
solution.
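
A utility along those lines might look roughly like the sketch below. The class name and its API are assumptions, not the code actually used in the webapp, and the charset is passed in by name rather than hardcoded. The usage example assumes windows-1252; whether a given Windows charset such as windows-1258 is available depends on the JRE.

```java
import java.nio.charset.Charset;

/** Hypothetical server-side codec between Strings and blob bytes. */
final class BlobCodec {
    private final Charset charset;

    /** @param charsetName e.g. "windows-1252"; throws if the JRE lacks the charset. */
    BlobCodec(String charsetName) {
        this.charset = Charset.forName(charsetName);
    }

    /** String to bytes, on the way into the database blob. */
    byte[] encode(String text) {
        return text.getBytes(charset);
    }

    /** Bytes to String, on the way out of the database blob. */
    String decode(byte[] blobBytes) {
        return new String(blobBytes, charset);
    }
}
```

With new BlobCodec("windows-1252"), a Word-style opening curly quote (U+201C) is stored as the single byte 0x93 and survives the round trip, which is why a Windows code page preserves characters that ISO-8859-1 turns into question marks. Keeping this class on the server leaves the transfer object free of JRE calls that GWT cannot compile.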

David Given

May 10, 2010, 4:45:01 PM
to google-we...@googlegroups.com
I have a rich app client that wants to be able to construct data
algorithmically and save the result; it also wants to be able to read a
local file, and process it locally.

[No, I do *not* want unrestricted access to the filesystem from the
client! I want to do all this via standard file load/save dialogues
mediated by the user, like a sane app does.]

I can save data by constructing a data: URL, invoking it, and the web
browser will pop up a save dialogue to the user; this is ideal.

Unfortunately the only way I've found in classic HTML of loading data is
to use a file upload field, but that doesn't let the client see the data
--- it sends it straight to the server. Apart from being a waste of time
and bandwidth, my app may not *have* a server (instead running entirely
locally).

Are there any new techniques I'm not aware of that will allow me to
prompt the user for a file and then be able to get the contents of the
file into a client-side structure so I can process it?

I'm willing to use Gears or HTML5 if necessary, but I'd prefer to use
stock HTML/JS if possible.


Sripathi Krishnan

May 10, 2010, 5:17:46 PM
to google-we...@googlegroups.com
> Are there any new techniques I'm not aware of that will allow me to prompt the user for a file and then be able to get the contents of the file into a client-side structure so I can process it?
If your clients have Flash Player 10, you can use Flex to read or write files, provided the action is initiated by a user click. See http://www.mikechambers.com/blog/2008/08/20/reading-and-writing-local-files-in-flash-player-10/

JavaScript and Flash can communicate, so theoretically you can write GWT + JSNI code, compile a simple Flex app that takes the data as parameters, intercepts the user click and pops up the file dialog. You'd have to mess around with GWT, JS and Flex, though, to get it working properly.

In practice, it's much easier to make a server round trip. You can deploy to Google App Engine; it's free for trivial uses.

--Sri

ben fenster

May 11, 2010, 11:05:02 AM
to Google Web Toolkit
You can use PHP to create a server-side script that will create a file
from data you send via a POST request, and return a link to the file.
This way you will create a save button that is a download link to a
file on the server side.
Loading should be even easier, since it's just a file upload.