escapes in JSON strings

15 views
Skip to first unread message

Solar Designer

unread,
Feb 29, 2016, 3:06:27 PM2/29/16
to onio...@coralbits.com
Hi,

I suspected this could be the case, but I was nevertheless disappointed
to find out that libonion lacks escape character support in its JSON
implementation. This is a bug. A warning in some prominent place is
needed, so that people don't try to actually use this for anything where
the improper processing would break things (e.g., passwords). Things
would break especially bad on escaped double-quote characters. Luckily,
I had checking this on my to-do list.

Anyway, for my current server prototype where I now use libonion, I only
needed these escapes on the input side. The attached patch implements
that, and it is about as dirty as the rest of libonion is. Sorry, but
dealing with libonion internals puts me in bad mood. I even tried to
match the weird coding style, but I haven't done that consistently. %-)
If you choose to integrate this code, please consider it public domain
and too minor for copyright. ;-)

Of course, escaping would also need to be added to the JSON encoder.
I just didn't need this now. Hopefully, I'll replace libonion
altogether before I need that. And yes, after messing with the
internals a bit, I no longer consider libonion a viable option for use
in something I'd release for the general public to use. It's a hack.

Alexander
onion-json-hack.diff

David Moreno Montero

unread,
Mar 1, 2016, 3:08:59 AM3/1/16
to Solar Designer, onion-dev
Hi Alexander,

sorry to hear you dislike libonion internals. I will check the patch as soon as possible and merge (btw: I prefer github pull requests). Also I will add the tests and encoding fixes.

libonion is a community project,. If you find some internals you dont like, or feel like a hack, you are free to comment more in depth and all together can try to fix it. Or send pull requests.

Regards,
David.




--
You received this message because you are subscribed to the Google Groups "onion-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to onion-dev+...@coralbits.com.
To post to this group, send email to onio...@coralbits.com.
Visit this group at https://groups.google.com/a/coralbits.com/group/onion-dev/.
For more options, visit https://groups.google.com/a/coralbits.com/d/optout.



--

David Moreno Montero

unread,
Mar 1, 2016, 7:38:49 AM3/1/16
to Solar Designer, onion-dev
Hi,

just merged into master, and added also improved json writing and some tests.

Please feel free to comment and suggest improvements. After all we all just want a nice C library bug free and easy to use.

Regards,
David.

Solar Designer

unread,
Mar 1, 2016, 11:57:55 AM3/1/16
to David Moreno Montero, onion-dev
Hi David,

On Tue, Mar 01, 2016 at 01:38:29PM +0100, David Moreno Montero wrote:
> just merged into master, and added also improved json writing and some
> tests.

Thank you!

The encoding code you wrote looks right at first glance.

As to your changes to my decoding code, I think you shouldn't have
introduced valid_hex4. Clearly, my lack of source code comments caused
it, sorry about that. I deliberately chose to treat "\u0000" as an
error, because I think the onion API is returning C strings, so it is
unprepared to handle embedded NULs. With your changes, it will probably
be returning strings effectively truncated at first NUL, which is a
worse behavior than not handling them at all.

Am I possibly missing something?

I also found (yesterday) that apparently jansson doesn't handle embedded
NULs either. Here's relevant discussion:

https://groups.google.com/forum/#!topic/jansson-users/XYoMISbaZEQ

> Please feel free to comment and suggest improvements. After all we all just
> want a nice C library bug free and easy to use.

Thank you. I'll plan on taking notes as I come across things I don't
like, and hopefully will communicate them to you when I have time.

Alexander

David Moreno Montero

unread,
Mar 1, 2016, 12:14:20 PM3/1/16
to Solar Designer, onion-dev
Hi,

you are right, its better to give an error here than a wrong string later. I did not do that reasoning.

I will try to fix it later today.

From your message I see you are exploring using a  json library for the JSON parsing, which is great. In my last email I forgot to emphasize that the JSON support in Onion is minimal (no numbers for example), and that for real JSON support you can/should use a third party library. When you have some comments, maybe you can drop some lines with your experience, and we can add some lines of advice on the github wiki.

Regards,
David.

Solar Designer

unread,
Mar 2, 2016, 3:20:52 AM3/2/16
to David Moreno Montero, onion-dev
David,

On Tue, Mar 01, 2016 at 07:57:50PM +0300, Solar Designer wrote:
> The encoding code you wrote looks right at first glance.

... but not exactly right at second glance. ;-)

You escape these characters: '\n', '\t', '"', '\\'. You have a comment
that says:

// Others just put it in. Teh reader should be able to read binary unicode chars. (ie: not \uXXXX encoded)

(BTW, there's a typo in "Teh", and the line is over-long - that's one of
my complaints about onion coding style and comments in general.)

While in general I agree that we shouldn't unnecessarily escape too many
characters, the exact choice of what to expect doesn't look entirely
right to me. The RFC says:

https://tools.ietf.org/html/rfc7159#section-7

" [...] All Unicode characters may be placed within the
quotation marks, except for the characters that must be escaped:
quotation mark, reverse solidus, and the control characters (U+0000
through U+001F)."

So the RFC requires escaping all of the control characters in that
range, whereas you only escape some of them. You need to add '\b',
'\f', '\r' (because these have the specific shorter escapes) and then a
catch-all for the rest of the controls in that range, using the "\u00xx"
notation for them.

Thank you for implementing the tests. I recommend that you also add at
least the last one of the surrogate pairs listed at, {U+DBFF, U+DFFF}:

http://unicodebook.readthedocs.org/unicode_encodings.html#utf-16-surrogate-pairs

I actually found a bug in an RPC-JSON library the other day by testing
with these four pairs, where only the last one would fail test.
Apparently, it's a common error:

https://twitter.com/solardiz/status/704376496186662916

"When decoding UTF-16 surrogate pairs, https://github.com/hmng/jsonrpc-c
ORs 0x10000 instead of adding it. A common error? Check other libs?"

(got a few replies to that tweet).

Also, test all of the short escapes. Right now, you appear to only test
"\\"", "\n", "\t".

Alexander

David Moreno Montero

unread,
Mar 2, 2016, 3:35:21 AM3/2/16
to Solar Designer, onion-dev
Thanks for the review,

I will try to fix it today.

Regards,
David.

David Moreno Montero

unread,
Mar 2, 2016, 9:45:29 AM3/2/16
to Solar Designer, onion-dev
Hi Alexander,

I'm doing some new tests, and Im lost on the surrogate pairs, and would love some help here.

As far as I understand if I find "\uD800\uDC00" this refers to the symbol numbered 0x10000, which is important to do if you store the data as a vector of int32, or need to find how to draw it in some table. But for parsing and storing the utf data in memory, nothing has to change, just copy it straight. Is it like that?

In this case the test is just that decode("\\uD800\\uDC00)=="\xDC\x00\xDC\x00".

Right?

(I know I'm worng somehow, but need some help to see it)

I will anyway push current tests and changes (conding encoding into its own codecs.c functions) right now.

Regards,
David.

Solar Designer

unread,
Mar 2, 2016, 2:18:26 PM3/2/16
to David Moreno Montero, onion-dev
Hi David,

On Wed, Mar 02, 2016 at 03:45:08PM +0100, David Moreno Montero wrote:
> I'm doing some new tests, and Im lost on the surrogate pairs, and would
> love some help here.

I'm no expert in this, but I'll try:

> As far as I understand if I find "\uD800\uDC00" this refers to the symbol
> numbered 0x10000,

Correct so far.

> which is important to do if you store the data as a
> vector of int32, or need to find how to draw it in some table. But for
> parsing and storing the utf data in memory, nothing has to change, just
> copy it straight. Is it like that?
>
> In this case the test is just that
> decode("\\uD800\\uDC00)=="\xDC\x00\xDC\x00".
>
> Right?

For JSON, we need to decode UTF-16, including possible surrogate pairs
like the above, to UTF-8, and no, this isn't that simple - this is why
we need that code like in the patch I posted. You can experiment with
this online converter, it looked correct in my testing last week:

http://www.endmemo.com/unicode/unicodeconverter.php

Enter \uD800\uDC00 into the Escaped Unicode input box and press Convert.
It shows F0 90 80 80 in the UTF-8 Code box. It's these four octets that
should be put into the decoded string, because those strings are UTF-8.

In other words, UTF-16 and UTF-8 encode Unicode code points differently.
It isn't only about grouping the octets differently. And this
difference is seen without surrogate pairs as well, e.g. "\u0410" is
"\xd0\x90" in UTF-8 (Cyrillic letter "A"):

http://www.fileformat.info/info/unicode/char/0410/index.htm

$ cat>utf.c
#include <stdio.h>
int main(void) { puts("\u0410"); return 0; }
$ gcc utf.c -o utf -std=c99
$ ./utf | od -tx1
0000000 d0 90 0a

Alexander

Solar Designer

unread,
Mar 13, 2016, 5:48:01 PM3/13/16
to David Moreno Montero, onion-dev
David,

I noticed you're now escaping not only control codes below 32, but also
127. I think this is not needed, because it's not mandated. If we were
worried about terminal escapes, we'd need to also escape the C1 range,
but I think JSON's escaping is not about that - it is about avoiding
data corruption when transferring JSON over e.g. line-oriented protocols
where CR and LF are special, etc.

Alternatively, if you choose to protect those viewing JSON on terminals
(even though there's no such guarantee for JSON in general, so people
should be cautious anyway), then we do in fact need to escape the C1
controls as well, so not just 127, but 127 to 159.

2016-03-02 9:20 GMT+01:00 Solar Designer <so...@openwall.com>:
> https://tools.ietf.org/html/rfc7159#section-7
>
> " [...] All Unicode characters may be placed within the
> quotation marks, except for the characters that must be escaped:
> quotation mark, reverse solidus, and the control characters (U+0000
> through U+001F)."
>
> So the RFC requires escaping all of the control characters in that
> range, whereas you only escape some of them. You need to add '\b',
> '\f', '\r' (because these have the specific shorter escapes) and then a
> catch-all for the rest of the controls in that range, using the "\u00xx"
> notation for them.

Alexander
Reply all
Reply to author
Forward
0 new messages