David,
On Tue, Mar 01, 2016 at 07:57:50PM +0300, Solar Designer wrote:
> The encoding code you wrote looks right at first glance.
... but not exactly right at second glance. ;-)
You escape these characters: '\n', '\t', '"', '\\'. You have a comment
that says:
// Others just put it in. Teh reader should be able to read binary unicode chars. (ie: not \uXXXX encoded)
(BTW, there's a typo in "Teh", and the line is over-long - that's one of
my complaints about onion coding style and comments in general.)
While in general I agree that we shouldn't unnecessarily escape too many
characters, the exact choice of what to expect doesn't look entirely
right to me. The RFC says:
https://tools.ietf.org/html/rfc7159#section-7
" [...] All Unicode characters may be placed within the
quotation marks, except for the characters that must be escaped:
quotation mark, reverse solidus, and the control characters (U+0000
through U+001F)."
So the RFC requires escaping all of the control characters in that
range, whereas you only escape some of them. You need to add '\b',
'\f', '\r' (because these have the specific shorter escapes) and then a
catch-all for the rest of the controls in that range, using the "\u00xx"
notation for them.
Thank you for implementing the tests. I recommend that you also add at
least the last one of the surrogate pairs listed at, {U+DBFF, U+DFFF}:
http://unicodebook.readthedocs.org/unicode_encodings.html#utf-16-surrogate-pairs
I actually found a bug in an RPC-JSON library the other day by testing
with these four pairs, where only the last one would fail test.
Apparently, it's a common error:
https://twitter.com/solardiz/status/704376496186662916
"When decoding UTF-16 surrogate pairs,
https://github.com/hmng/jsonrpc-c
ORs 0x10000 instead of adding it. A common error? Check other libs?"
(got a few replies to that tweet).
Also, test all of the short escapes. Right now, you appear to only test
"\\"", "\n", "\t".
Alexander