UTF16 Surrogate Pairs in Fressian not encoded to utf8 correctly


Kyle Wilt

Nov 7, 2019, 7:51:40 AM
to Clojure
I posted an issue about this on the datomic/fressian GitHub page, but I don't know if anyone is monitoring it anymore.


I'm trying to find out whether this is intentional for some reason or a bug. Right now it encodes a UTF-16 surrogate pair as two 3-byte values (one per surrogate) rather than the single 4-byte UTF-8 value expected for a supplementary code point such as U+10FFFF.
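
To make the difference concrete, here is a minimal sketch comparing the JDK's UTF-8 output for U+10FFFF with a per-code-unit encoding. `hex-bytes` and `per-char-bytes` are hypothetical helpers written for this illustration; `per-char-bytes` is not Fressian's code, only an assumed model of the behavior described above (similar in shape to what CESU-8 does), where each UTF-16 code unit is encoded on its own.

```clojure
;; U+10FFFF as a Java string: the surrogate pair \uDBFF \uDFFF.
(def s (String. (Character/toChars 0x10FFFF)))

(defn hex-bytes
  "Two-digit hex strings for each byte of the byte array bs."
  [^bytes bs]
  (mapv #(format "%02x" (bit-and (int %) 0xff)) bs))

;; Standard UTF-8 from the JDK: one 4-byte sequence.
(hex-bytes (.getBytes s "UTF-8"))
;; => ["f4" "8f" "bf" "bf"]

;; Hypothetical per-code-unit encoder (an assumption, NOT Fressian's code):
;; every char is encoded independently, so each surrogate takes 3 bytes.
(defn per-char-bytes
  [^String s]
  (vec
   (mapcat (fn [ch]
             (let [c (int ch)]
               (cond
                 (<= c 0x7f)  [c]
                 (<= c 0x7ff) [(bit-or 0xc0 (bit-shift-right c 6))
                               (bit-or 0x80 (bit-and c 0x3f))]
                 :else        [(bit-or 0xe0 (bit-shift-right c 12))
                               (bit-or 0x80 (bit-and (bit-shift-right c 6) 0x3f))
                               (bit-or 0x80 (bit-and c 0x3f))])))
           s)))

(mapv #(format "%02x" %) (per-char-bytes s))
;; => ["ed" "af" "bf" "ed" "bf" "bf"]
```

A strict UTF-8 decoder would reject the second byte sequence, because encoded surrogates are not valid UTF-8.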

Francis Avila

Nov 7, 2019, 2:20:44 PM
to Clojure
Perhaps this is so that invalid character streams (e.g. mismatched or orphaned surrogates) can survive encoding and decoding (I haven't tested)? Strictly speaking, not every CharSequence can be validly encoded as UTF-8. Java just kind of hides this. For example, this is a reversed surrogate pair (or two orphaned surrogates, take your pick):

(mapv #(Integer/toHexString (int %)) (String. (.getBytes "\uDC00\uD800" "UTF-8") "UTF-8"))
=> ["3f" "3f"]

Note that Java's UTF-8 encoder translates each of these to "?" (0x3f), losing information about the original char value.
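
The silent substitution can be surfaced by asking the JDK encoder to report malformed input instead of replacing it. This is a minimal sketch using `java.nio.charset.CharsetEncoder`; `strict-utf8` is a hypothetical helper name introduced here, and it says nothing about how Fressian itself behaves.

```clojure
(import '(java.nio CharBuffer)
        '(java.nio.charset StandardCharsets CodingErrorAction
                           CharacterCodingException))

(defn strict-utf8
  "Encode s as UTF-8. Returns a byte array, or nil when s contains
  unpaired surrogates (instead of substituting \"?\" for them)."
  [^String s]
  (try
    (let [enc (doto (.newEncoder StandardCharsets/UTF_8)
                (.onMalformedInput CodingErrorAction/REPORT)
                (.onUnmappableCharacter CodingErrorAction/REPORT))
          buf (.encode enc (CharBuffer/wrap s))
          out (byte-array (.remaining buf))]
      (.get buf out)   ; copy the encoded bytes out of the ByteBuffer
      out)
    (catch CharacterCodingException _ nil)))

(strict-utf8 "\uDC00\uD800")       ; reversed pair: cannot be encoded
;; => nil
(count (strict-utf8 "\uD800\uDC00")) ; valid pair, U+10000
;; => 4
```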

That said, if this is the case, it makes more sense for Fressian to say "we have a custom encoding that is mostly UTF-8 except it preserves invalid UTF-16" than "this is UTF-8". I wonder if other Fressian implementations handle this the same way? JavaScript also shares Java's UTF-16 string type, but not every platform does.

Kyle Wilt

Nov 7, 2019, 4:21:50 PM
to Clojure
I'm currently working on an implementation for the CLR, which is why I'm looking at it. The CLR uses more or less the same approach as JS and Java for UTF-16 surrogate pairs. I'd be surprised if this was intentional, but since the code has no comments I can only speculate :-)