How to handle Unicode characters in a JSON body?


javadevmtl

Jan 30, 2019, 6:55:21 PM
to vert.x
Hi, trying to send emojis inside the JSON.

Take for example the smiley face...

If we send

{
    "smiley":"\u1F601"
}

And then 

json.encodePrettily() we get
{
    "smiley":"ὠ1"
}

or

json.getString("smiley") we get

ὠ1

If we send the surrogate pair then it works...
{
    "smiley":"\uD83D\uDE01"
}

json.encodePrettily() we get
{
    "smiley":"😁"
}

and json.getString("smiley") we get
😁

So what would be the best way to handle emoji and other Unicode characters with JsonObject() using standard Unicode code points rather than surrogates?
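
The difference between the two payloads can be reproduced without Vert.x, because Java string literals follow the same rule as JSON here: a \u escape takes exactly four hex digits. A minimal sketch (the class name EscapeWidthDemo is made up for illustration):

```java
final class EscapeWidthDemo {
    // Five hex digits: parsed as the four-digit escape for U+1F60, then a literal '1'.
    static final String FOUR_DIGIT = "\u1F601";
    // Surrogate pair: two four-digit escapes that together encode U+1F601.
    static final String SURROGATES = "\uD83D\uDE01";

    public static void main(String[] args) {
        System.out.println(FOUR_DIGIT.length());                             // 2
        System.out.println(Integer.toHexString(FOUR_DIGIT.codePointAt(0)));  // 1f60
        System.out.println(Integer.toHexString(SURROGATES.codePointAt(0)));  // 1f601
        System.out.println(SURROGATES.codePointCount(0, SURROGATES.length())); // 1
    }
}
```

So the first payload is two characters (U+1F60 and the digit 1), which matches the output above, while the second is a single code point.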

javadevmtl

Jan 30, 2019, 7:27:56 PM
to vert.x
Or rather, if the client application sends

Content-Type: application/json

then are the JSON and its Unicode characters expected to be UTF-8 already?

Paulo Lopes

Jan 31, 2019, 3:23:57 AM
to vert.x
The ECMA-404 spec says:

A string is a sequence of Unicode code points wrapped with quotation marks (U+0022)

So I'd say JSON should be encoded in UTF-8, but the spec makes no official reference to character encoding at all, so the encoding should be agreed in advance between the parties. The safest option is to escape any code point that is not in the Basic Multilingual Plane: the character may be represented as a twelve-character sequence encoding the UTF-16 surrogate pair that corresponds to the code point. For example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E" (as you noticed).
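
As a rough illustration of that escaping rule, here is a minimal Java sketch. The class and method names (JsonSurrogateEscaper, escapeSupplementary) are hypothetical, and it only handles code points outside the Basic Multilingual Plane; it is not a complete JSON string encoder (quotes, backslashes and control characters are left alone).

```java
final class JsonSurrogateEscaper {
    /** Replaces each code point above U+FFFF with two escapes, one per UTF-16 surrogate. */
    static String escapeSupplementary(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                // Twelve characters in total: backslash, 'u', four hex digits, twice.
                out.append(String.format("\\u%04X\\u%04X",
                        (int) Character.highSurrogate(cp),
                        (int) Character.lowSurrogate(cp)));
            } else {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // advance by one or two UTF-16 units
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String gClef = new String(Character.toChars(0x1D11E));
        System.out.println(escapeSupplementary(gClef)); // the G clef as a surrogate-pair escape
    }
}
```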

Now, it could be that your client is specifying a custom character encoding and the JSON parser isn't respecting it.

Can you verify your client "Content-Type" header? Does it have:

Content-type: application/json; charset=utf-8

Or is the charset something else?
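
For reference, a server-side check of that header could look roughly like the sketch below. The names (ContentTypeCharset, charsetOf) are made up for illustration, and falling back to UTF-8 when no charset parameter is present is an assumption of this sketch, not something Vert.x is claimed to do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ContentTypeCharset {
    private static final String PARAM = "charset=";

    /** Returns the charset named in a Content-Type value, or UTF-8 if none is given (assumed default). */
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                // Parameter names are matched case-insensitively.
                if (p.regionMatches(true, 0, PARAM, 0, PARAM.length())) {
                    String name = p.substring(PARAM.length()).trim().replace("\"", "");
                    // Charset.forName throws for unknown or illegal names; callers may want to catch that.
                    return Charset.forName(name);
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("application/json; charset=utf-8")); // UTF-8
        System.out.println(charsetOf("application/json"));                // UTF-8 (fallback)
    }
}
```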

javadevmtl

Jan 31, 2019, 11:11:01 AM
to vert.x
OK, so it was JMeter. On the HTTP Sampler we need to set "Content encoding" to UTF-8. Funnily enough, this doesn't actually change anything in the HTTP headers, but the character now comes through correctly...

Key: Connection, Value: keep-alive
Key: Content-Type, Value: application/json
Key: Content-Length, Value: 254
Key: Host, Value: localhost:18081
Key: User-Agent, Value: Apache-HttpClient/4.5.3 (Java/1.8.0_131)
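
The headers stay the same because the setting only affects how the body text is turned into bytes. A small sketch of the effect, assuming the client previously fell back to a single-byte charset such as ISO-8859-1 (an assumption about JMeter's default, not something verified here) while the server decodes as UTF-8. The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BodyEncodingDemo {
    /** Encodes text with the client's charset, then decodes it as a UTF-8 server would. */
    static String roundTrip(String text, Charset clientCharset) {
        byte[] wire = text.getBytes(clientCharset);      // bytes the client puts on the wire
        return new String(wire, StandardCharsets.UTF_8); // what a UTF-8 reader sees
    }

    public static void main(String[] args) {
        String smiley = new String(Character.toChars(0x1F601));
        // Same header, different bytes: only the UTF-8 client survives the trip.
        System.out.println(roundTrip(smiley, StandardCharsets.UTF_8).equals(smiley));      // true
        System.out.println(roundTrip(smiley, StandardCharsets.ISO_8859_1).equals(smiley)); // false
    }
}
```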

Keeping in mind that the API should work with as many clients as possible... Should I tell clients that JSON strings are expected to be UTF-8, or should I tell them to escape to \u surrogates? The latter seems very Java-ish.

Paulo Lopes

Jan 31, 2019, 11:30:23 AM
to vert.x
Browsers do encode in UTF-8, so if your clients are browsers then there's no issue. If you want a fully portable solution, then escaping is the way to go, as stated in the spec (and it's not Java-ish, as it works everywhere :-)).

javadevmtl

Jan 31, 2019, 11:47:16 AM
to vert.x
When I say clients, I mean they could be other third-party APIs written in any language.

OK, so I should recommend that they send surrogate pairs when possible, right?

javadevmtl

Jan 31, 2019, 11:59:45 AM
to vert.x
I mean, you can't just use the standard four hex digits for all characters; the client needs to be aware that it must escape to surrogate pairs where necessary. Or are you saying that most JSON libs will encode to surrogate pairs anyway?

javadevmtl

Jan 31, 2019, 12:06:08 PM
to vert.x
OK, silly me, I get it... So according to the spec and what you indicated above, if a code point is above the Basic Multilingual Plane then JSON escapes have to use surrogate pairs.
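
For anyone landing here later, the arithmetic behind that is short. A sketch with hypothetical names (SurrogateMath, toSurrogates), using the standard UTF-16 formula:

```java
final class SurrogateMath {
    /** Returns the UTF-16 units for a code point: one unit inside the BMP, two (high, low) above it. */
    static int[] toSurrogates(int codePoint) {
        if (codePoint <= 0xFFFF) {
            return new int[] { codePoint };  // fits in a single four-digit escape
        }
        int offset = codePoint - 0x10000;    // 20-bit value to split in half
        int high = 0xD800 + (offset >> 10);  // top 10 bits
        int low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
        return new int[] { high, low };
    }

    public static void main(String[] args) {
        int[] units = toSurrogates(0x1F601);
        System.out.printf("%04X %04X%n", units[0], units[1]); // D83D DE01
    }
}
```

That gives D83D and DE01 for the smiley (U+1F601), the same pair that worked in the first post.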