Character encoding handling


ore...@orestis.gr

Jul 3, 2018, 7:52:52 AM
to pedestal-users
Hi all!

I've recently raised an issue [1] that relates to character encoding and Content-Length in Pedestal (and probably Ring in general). It could be that I'm very much mistaken or missing something, but I can't for the life of me figure out where Pedestal actually converts a String to a byte array via an explicit encoding.

The suggestion was that JSON encoding handles this for you, but it still gives you back a `String`, which is internally UTF-16. You can escape non-ASCII characters if you want ASCII-only output (e.g. `{:escape-non-ascii true}` in Cheshire), but you still get a String back. If you do choose to escape non-ASCII characters, the String will only contain characters in the ASCII set, so everything works fine. But you shouldn't need to, especially since JSON pretty much dictates that UTF-8 be used as the serialisation format.

The only possible way to get UTF-8 back from a Java String would be, *I think*, to get a raw `byte[]` using `(.getBytes s StandardCharsets/UTF_8)`. So if you want to *guarantee* that UTF-8 will be sent, you have to do it yourself right now, and pass the byte array to Pedestal so it gets copied to the Servlet response.
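A minimal sketch of the "do it yourself" approach described above. The helper names `utf8-bytes` and `utf8-body-response` are made up for illustration, and the response is assumed to be an ordinary Ring-style map; only `String.getBytes(Charset)` and `StandardCharsets/UTF_8` are real JDK API.

```clojure
(ns example.explicit-encoding
  (:import (java.nio.charset StandardCharsets)))

(defn utf8-bytes
  "Encode s as UTF-8, independent of the platform default charset."
  [^String s]
  (.getBytes s StandardCharsets/UTF_8))

(defn utf8-body-response
  "Hypothetical helper: a Ring-style response whose body is already a
  UTF-8 byte array, so no later layer has to pick an encoding."
  [^String s]
  {:status  200
   :headers {"Content-Type" "application/json; charset=utf-8"}
   :body    (utf8-bytes s)})
```

Because the body is a byte array, the Content-Type (including the charset) has to be stated explicitly, as the map above does.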

Am I mistaken? Is the default platform charset guaranteed to be UTF-8, so the OutputStreamWriter that Pedestal uses will implicitly convert with that? Am I just too paranoid about this?

Thanks and sorry for keeping on about this!

Orestis


[1]: https://github.com/pedestal/pedestal/issues/582

Paul deGrandis

Jul 3, 2018, 8:40:03 AM
to pedestal-users
Hi Orestis,

Character encoding gets enforced in a number of ways.

Java's internal String representation is UTF-16, but its default character encoding (when extracting bytes) is UTF-8, unless the default encoding has been overridden with a platform setting. It's often best practice to set the platform's encoding to UTF-8 to ensure that, no matter the code path, the default encoding will be UTF-8.
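To see which default a given JVM is actually running with, something like the following can be evaluated at the REPL. This is only a sketch: the function names are illustrative, while `Charset/defaultCharset` and the `file.encoding` system property are standard JDK features. The result depends on the JDK version and platform settings, so no particular output is assumed here.

```clojure
(ns example.default-charset
  (:import (java.nio.charset Charset StandardCharsets)))

(defn default-charset-name
  "Name of the charset used when no explicit charset is passed,
  e.g. by (.getBytes s) or a plain OutputStreamWriter."
  []
  (.name (Charset/defaultCharset)))

(defn utf8-default?
  "True when the JVM's default charset is UTF-8."
  []
  (= (Charset/defaultCharset) StandardCharsets/UTF_8))

(println "default charset:" (default-charset-name))
(println "file.encoding:  " (System/getProperty "file.encoding"))
```

The usual way to pin the platform setting is the JVM option `-Dfile.encoding=UTF-8`, which in a Leiningen project can go under `:jvm-opts`.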

The containers also have a default encoding.  Most default to UTF-8, unless overridden by the same platform setting.

Lastly, you can always take tight control over encoding by returning byte arrays as response bodies, but this is strongly discouraged.

Hope this answers your question!

Cheers,
Paul

ore...@orestis.gr

Jul 4, 2018, 5:18:18 AM
to pedestal-users
Hi Paul,

thanks for the answer -- I wasn't aware that it's safe to assume that the default encoding is going to be correct.

There is still an issue with Content-Length though: counting a string returns the number of UTF-16 code units (Java chars), not the number of bytes after conversion:

(count "Orestis") ;; => 7
(count "Ορέστης") ;; => 7
(count (.getBytes "Orestis")) ;; => 7
(count (.getBytes "Ορέστης")) ;; => 14

(Greek characters use 2 bytes in UTF-8)

So to be able to produce a correct Content-Length header, you do need to do the conversion operation and then count the resulting byte array. I realise that in many cases with huge responses you might not want to do this, of course, and let the conversion happen at some other layer that is aware of the network details etc.
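As a rough sketch of "convert first, then count": the hypothetical helper below (the name `with-utf8-content-length` is not from Pedestal or Ring) encodes a String body as UTF-8 and derives Content-Length from the resulting byte array. It assumes a Ring-style response map and leaves non-String bodies alone.

```clojure
(ns example.content-length
  (:import (java.nio.charset StandardCharsets)))

(defn with-utf8-content-length
  "Hypothetical helper: replaces a String :body with its UTF-8 bytes and
  sets Content-Length to the encoded byte count. Other body types
  (streams, files, byte arrays) are returned unchanged."
  [response]
  (let [body (:body response)]
    (if (string? body)
      (let [bs (.getBytes ^String body StandardCharsets/UTF_8)]
        (-> response
            (assoc :body bs)
            (assoc-in [:headers "Content-Length"] (str (count bs)))))
      response)))
```

Note that this holds the whole encoded body in memory, which is the trade-off mentioned above for very large responses.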

Does that sound correct? I'm kind of going against the grain here so I just wonder if I have some misconception or if there are alternative ways to do things. I currently have a hard requirement to include Content-Length headers, so I can't avoid messing with this :)

Best regards,
Orestis

Phill Wolf

Jul 4, 2018, 7:23:04 AM
to pedestal-users
Ring containers, in cooperation with the enclosing Servlet container or whatever, convert a String to a byte array and affix the correct Content-Length, or manage the chunk-length bytes of the chunks of a chunked response.  Try an experiment: I think I once observed (although not with Pedestal) that the conversion to byte array respects your Content-Type charset= header.

Paul deGrandis

Jul 4, 2018, 8:12:42 AM
to pedestal-users
Phill's advice is spot on. At some point the container has to put bytes on the wire, and most containers (can) automatically set Content-Length when that occurs. Furthermore, some containers inspect Content-Type for conversion and then fall back from there (as I previously described). Some containers only do this by default for asset/file results, but can be toggled to do it for all responses.

Setting Content-Length is probably best done at the container level and not at the service/app level. That said, there's no real issue doing it within an interceptor if that makes the most sense for your service -- just be mindful to always set the "Content-Type" header by hand (since the default content type for a byte array is an octet-stream).
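One possible shape for such an interceptor, written as a plain map with `:name` and `:leave` keys and assuming the response lives under `:response` in the context. The interceptor name and the `text/plain` fallback are assumptions made for this sketch, not Pedestal defaults.

```clojure
(ns example.interceptor
  (:import (java.nio.charset StandardCharsets)))

(def utf8-body
  "Illustrative interceptor map: on the way out, swaps a String body for
  its UTF-8 bytes, sets Content-Length from the byte count, and makes
  sure a Content-Type is present (an existing one is kept)."
  {:name  ::utf8-body
   :leave (fn [context]
            (let [body (get-in context [:response :body])]
              (if (string? body)
                (let [bs (.getBytes ^String body StandardCharsets/UTF_8)]
                  (update context :response
                          (fn [resp]
                            (-> resp
                                (assoc :body bs)
                                (update :headers
                                        #(merge {"Content-Type" "text/plain; charset=utf-8"}
                                                %
                                                {"Content-Length" (str (count bs))}))))))
                context)))})
```

The `:leave` function can be exercised on its own at the REPL, e.g. `((:leave utf8-body) {:response {:status 200 :body "Orestis"}})`, before it is added to a service's interceptor chain.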

Note that some containers also support NIO responses, enabling them to be non-blocking down to the wire.  I mention this only because you have to be mindful of your response body types, as it may trigger different behaviors at the container level.

Cheers,
Paul

ore...@orestis.gr

Jul 4, 2018, 8:44:14 AM
to pedestal-users
Thanks for the guidance!

I've done some experiments with Jetty, leaving everything else exactly as the Pedestal Lein template generates it.

a) Jetty definitely doesn't do anything automatically given a charset in the Content Type — changing from UTF-8 to UTF-16 makes no difference, you still get UTF-8 back when you use a String. I'd like to do some tests with the other supported containers.
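For repeating this kind of experiment against other containers, one way to tell which encoding came back is to compare the raw response bytes with what each candidate charset would produce for the known body. This is a sketch only: `matching-charsets` and the candidate list are made up for illustration, and obtaining the raw bytes from an HTTP client is left out.

```clojure
(ns example.which-charset
  (:import (java.nio.charset Charset StandardCharsets)
           (java.util Arrays)))

(def candidates
  "Charsets to test against; extend as needed."
  {"UTF-8"  StandardCharsets/UTF_8
   "UTF-16" StandardCharsets/UTF_16})

(defn matching-charsets
  "Names of the candidate charsets whose encoding of `expected`
  is byte-for-byte equal to `observed`."
  [^String expected ^bytes observed]
  (for [[label cs] candidates
        :when (Arrays/equals (.getBytes expected ^Charset cs) observed)]
    label))
```

The known body should contain at least one non-ASCII character; otherwise UTF-8 and many single-byte charsets produce identical bytes and the comparison says little.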

b) Not setting an explicit Content Type and passing either a byte array or a String switches on "Transfer-Encoding: chunked", which is arguably not only correct but also the better behaviour, I think?

I'd think it's worth documenting this behaviour somewhere in Pedestal — if someone points me at the relevant place I'm happy to write a few lines and submit a PR.

Thanks again.