Re: Wrong UTF-8 string parsing in GWT JSON

Philippe Lhoste

May 30, 2013, 11:36:23 AM
to Google-We...@googlegroups.com
On 30/05/2013 08:49, Tibor Szolnoki wrote:
> Hi, my stack is overflow :):):):) I can't find the solution...
>
> I'm developing a server-client application.
> My GWT client runs in the browser and communicates with my C++ server via:
> GWT-JSON -> lighttpd -> libfcgi -> cgicc -> libjson -> C++ application
>
>
> My problem:
> The server responds to the client's request with a JSON string. This response contains UTF-8
> strings. Accented characters are encoded correctly with "\uXXXX" in the response. For example: "Á"
> is encoded as "\u00C3\u0081".
> The client extracts the string from the JSON string, but the extracted string contains badly
> encoded characters. :(:(:(:(
>
> Luckily, I could narrow the problem down to GWT JSON. Here is code that demonstrates the problem,
> running on the client side only, in GWT:
>
> String response="{ \"test\" : \"\\u00C3\\u0081\\u00C3\\u0089\\u00C5\\u00B0\" }";
> //"ÁÉŰ" in UTF-8

No. That's not UTF-8, those are Unicode escapes. Each \uXXXX is one UTF-16 code unit, which is what Java strings are made of.

> JSONObject json=JSONParser.parseStrict(response).isObject();
> String s1=json.get("test").isString().stringValue();
> Window.alert(s1);
> byte[] b1=s1.getBytes();
>
> The results:
>
> Alert is: "Ã[U+0081]Ã[U+0089]Å°" instead of "ÁÉŰ"
> s1="Ã[U+0081]Ã[U+0089]Å°" instead of "ÁÉŰ"

That's the correct result: what you see is the UTF-8 byte sequence of your characters, with each byte turned into a separate character.

It is working as intended. Perhaps you should read a bit more about Unicode, UTF-8, UTF-16
and all this confusing stuff... :-)
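
To make the difference concrete, here is a minimal sketch in plain Java (run on a JVM; it has not been checked under the GWT compiler). The class and method names are made up for illustration. It compares the escapes from the original post with the single escape that names the character, and shows one possible way to repair a string whose chars are really UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;

final class EscapeVsBytes {
    // The escapes from the original post: one escape per UTF-8 byte of "Á".
    static final String BYTE_PER_ESCAPE = "\u00C3\u0081";

    // The escape that names the character itself: code point U+00C1.
    static final String CODE_POINT_ESCAPE = "\u00C1";

    // Hypothetical repair helper: maps every char back to one byte
    // (ISO-8859-1), then decodes those bytes as UTF-8.
    static String reinterpretAsUtf8(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(BYTE_PER_ESCAPE.length());   // 2 chars: U+00C3, U+0081
        System.out.println(CODE_POINT_ESCAPE.length()); // 1 char: U+00C1
        System.out.println(
            reinterpretAsUtf8(BYTE_PER_ESCAPE).equals(CODE_POINT_ESCAPE)); // true
    }
}
```

The repair helper is only a workaround for data that is already wrong; the cleaner fix is to have the server emit the code point escape in the first place.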

--
Philippe Lhoste
-- (near) Paris -- France
-- http://Phi.Lho.free.fr
-- -- -- -- -- -- -- -- -- -- -- -- -- --

Tibor Szolnoki

May 31, 2013, 1:54:34 AM
to google-we...@googlegroups.com, Google-We...@googlegroups.com
Dear Philippe,

Thank you for the post,



>> String response="{ \"test\" : \"\\u00C3\\u0081\\u00C3\\u0089\\u00C5\\u00B0\" }";
>> //"ÁÉŰ" in UTF-8
>
> No. That's not UTF-8, those are Unicode escapes. Each \uXXXX is one UTF-16 code unit, which is what Java strings are made of.

But why is "\u00C3\u0081" not UTF-8 encoding?

See: http://www.utf8-chartable.de/

"Á" (LATIN CAPITAL LETTER A WITH ACUTE): its hex code in UTF-8 is 0xC3 0x81.
"\u00xx" in the JSON string is an escaped hexadecimal representation according to RFC4627 (JSON).
See:
http://www.ietf.org/rfc/rfc4627.txt

But the character encoding remains UTF-8, I think.

Regards,
Tibor


Tibor Szolnoki

May 31, 2013, 4:07:23 AM
to google-we...@googlegroups.com
Dear Philippe,

You are right...
If I change the escaped ("\uXXXX") codes to UTF-16, for my example:
String response="{ \"test\" : \"\\u00c1\\u00c9\\u0170\" }"; //"ÁÉŰ" in UTF-16
All works correctly.


But I found a strange thing too:
If I disable the "\uxxxx" escaping in the JSON writer on the server side, all works as expected. But this is not a good idea according to RFC4627 :((((. In this mode, the JSON string transports the non-printable characters (0xc3, 0x81, 0xc3, 0x89, 0xc5, 0xb0) ("ÁÉŰ" in UTF-8) without any encoding.... :(:(:( The GWT JSON parser expands the UTF-8 characters correctly, and the alert displays the correct characters.

I think the JSON string transfers the characters in escaped UTF-16 encoding, but finally expands/stores them in the String in UTF-8. Therefore, if I skip the UTF-16 escaping, I can send the string in raw UTF-8 (without escaping). :(:(:(:(



Thomas Broyer

May 31, 2013, 4:32:40 AM
to google-we...@googlegroups.com


On Friday, May 31, 2013 10:07:23 AM UTC+2, Tibor Szolnoki wrote:
> Dear Philippe,
>
> You are right...
> If I change the escaped ("\uXXXX") codes to UTF-16, for my example:
> String response="{ \"test\" : \"\\u00c1\\u00c9\\u0170\" }"; //"ÁÉŰ" in UTF-16
> All works correctly.
>
>
> But I found a strange thing too:
> If I disable the "\uxxxx" escaping in the JSON writer on the server side, all works as expected. But this is not a good idea according to RFC4627 :((((.

I can't find where it says it's "not a good idea". It says all over the place that JSON "SHALL be encoded in Unicode", with a default to UTF-8, so why not just use UTF-8?
 
> In this mode, the JSON string transports the non-printable characters (0xc3, 0x81, 0xc3, 0x89, 0xc5, 0xb0) ("ÁÉŰ" in UTF-8) without any encoding....

These are bytes, not characters.
The encoding is determined by the first 4 bytes of the response (see RFC4627)
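
For reference, the detection described in RFC4627 section 3 can be sketched as follows. Since the first two characters of a JSON text are always ASCII, the pattern of zero octets among the first four identifies the encoding. The class and method names are hypothetical, and falling back to UTF-8 for inputs shorter than four octets is an assumption of this sketch, not something the thread discusses.

```java
final class JsonEncodingSniffer {
    // Returns the encoding implied by the zero-octet pattern of the first
    // four octets, following the table in RFC 4627 section 3.
    static String detect(byte[] b) {
        if (b.length < 4) {
            return "UTF-8"; // assumption: too short to sniff, use the default
        }
        boolean z0 = b[0] == 0;
        boolean z1 = b[1] == 0;
        boolean z2 = b[2] == 0;
        boolean z3 = b[3] == 0;
        if (z0 && z1 && z2 && !z3) return "UTF-32BE";  // 00 00 00 xx
        if (z0 && !z1 && z2 && !z3) return "UTF-16BE"; // 00 xx 00 xx
        if (!z0 && z1 && z2 && z3) return "UTF-32LE";  // xx 00 00 00
        if (!z0 && z1 && !z2 && z3) return "UTF-16LE"; // xx 00 xx 00
        return "UTF-8";                                // xx xx xx xx
    }
}
```

A response that starts with the octets 0x7B 0x22 (an opening brace and a quote) followed by more non-zero octets is therefore read as UTF-8, and any raw 0xC3 0x81 pair inside it is decoded as "Á".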

Tibor Szolnoki

May 31, 2013, 5:22:23 AM
to google-we...@googlegroups.com

In RFC4627, JSON "String" and "Text" are two different things.

Text: a sequence of JSON tokens, with brackets, strings, quotes etc...
(RFC4627 section 2.)

String: Is a JSON basic data type (single JSON data).
(RFC4627 section 2.5.)

According to RFC4627, text encoding and string encoding are different things.
As you write, Text is UTF-8 by default, determined by the first 4 octets. (section 3)
But not the String!

String is always Unicode; Unicode characters are escaped by "\uXXXX". (section 2.5)

My problem was only with the JSON string, not the whole JSON text.
(My text encoding is UTF-8, as default)


I found my solution: I have to use Unicode code points, not UTF-8 bytes, in the "\uXXXX" escapes of a JSON string. That's all...

As Philippe writes, GWT works as intended.
The GWT JSON parser interprets "\uXXXX" as a UTF-16 code unit.
And this is independent of the JSON text encoding, which is UTF-8.
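
The server in this thread is C++, but the fix can be sketched in Java to match the client code above: decode the UTF-8 bytes into characters first, then escape each non-ASCII UTF-16 code unit. The class and method names are hypothetical. The sketch handles only non-ASCII characters; a real JSON writer must also escape quotes, backslashes and control characters.

```java
import java.nio.charset.StandardCharsets;

final class JsonEscaper {
    // Decodes UTF-8 bytes to a Java string, then writes every non-ASCII
    // UTF-16 code unit as a backslash-u escape with four hex digits.
    static String escapeUtf8(byte[] utf8) {
        String decoded = new String(utf8, StandardCharsets.UTF_8);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < decoded.length(); i++) {
            char c = decoded.charAt(i);
            if (c < 0x80) {
                out.append(c); // plain ASCII passes through unchanged
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }
}
```

Fed the six bytes 0xc3, 0x81, 0xc3, 0x89, 0xc5, 0xb0, this yields the three escapes for U+00C1, U+00C9 and U+0170, which is the form that worked in the corrected example earlier in the thread. Characters outside the Basic Multilingual Plane come out as two escapes (a surrogate pair), which is what RFC4627 section 2.5 prescribes.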