Proposed patch to deal with \u0000 characters

Deron Meranda

Oct 6, 2011, 5:56:05 PM
to jansso...@googlegroups.com
Petri,

I've been thinking of ways to work around the jansson limitation of
not being able to deal with JSON texts that contain embedded "\u0000"
characters. I've attached a minimal patch in diff format against
2.2.1 that I think may be reasonable to consider.

Basically, when strings are in the C representation we've been assuming
they are UTF-8 encoded. However, by allowing a backwards-compatible
relaxation of UTF-8, called "Modified UTF-8", the zero byte is
represented by an overlong two-byte sequence 0xC0 0x80, instead of
the single byte 0x00. The advantage is that it is now possible to
store the codepoint U+0000 in a null-terminated C string, since the
zero-byte value does not occur.
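
To make the byte level concrete, here is a small standalone sketch
(independent of jansson and of the patch itself) showing U+0000 carried
in a null-terminated C string as the 0xC0 0x80 pair:

    #include <assert.h>
    #include <string.h>

    int main(void)
    {
        /* 'a', the overlong encoding of U+0000, then 'b' */
        const char *s = "a\xC0\x80" "b";

        /* No real 0x00 byte occurs, so strlen() sees the whole string. */
        assert(strlen(s) == 4);
        assert((unsigned char)s[1] == 0xC0 && (unsigned char)s[2] == 0x80);
        return 0;
    }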

My patch essentially changes the JSON decoding so that all "\u0000"
escapes get translated into the two-byte sequence "\xC0\x80" instead of
throwing a lexer error. And inversely, when dumping into JSON, all C
string occurrences of "\xC0\x80" get output as the JSON escape "\u0000".
All other overlong UTF-8 encodings are still rejected as invalid.
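
As an example of the intended effect, a round trip with the patch
applied might look like the sketch below (this shows the proposed
behavior, not what an unpatched jansson 2.2.1 does; the unpatched
library rejects the input text):

    #include <jansson.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        json_error_t error;

        /* With the patch, the \u0000 escape decodes instead of failing,
         * and the character shows up in C as the bytes 0xC0 0x80. */
        json_t *root = json_loads("[\"a\\u0000b\"]", 0, &error);
        if (!root)
            return 1;

        const char *s = json_string_value(json_array_get(root, 0));
        printf("contains C0 80: %s\n", strstr(s, "\xC0\x80") ? "yes" : "no");

        /* Dumping turns the C0 80 pair back into the \u0000 escape,
         * so the output should be ["a\u0000b"]. */
        char *out = json_dumps(root, 0);
        puts(out);

        free(out);
        json_decref(root);
        return 0;
    }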

Note that it is only the C-representation of strings that may use
Modified UTF-8. The JSON text is still always in strict UTF-8 encoding
(though the RFC does not allow a raw un-escaped zero-byte to occur in
a JSON text anyway).

I believe the behavior of jansson remains identical to before, except:

* Occurrences of \u0000 in JSON text no longer result in a parsing error.

* If (and only if) \u0000 is in a JSON text then C strings are not
strictly UTF-8, but are in Modified UTF-8.

* C strings containing the overlong UTF-8 sequence C0 80 are now
accepted and converted to JSON \u0000 escapes.

The only other caveat of this approach worth mentioning is that when
sorting object keys, any keys containing the character U+0000 will not
be sorted in the same lexicographical order as if they were encoded with
strict UTF-8. For example, a key containing U+0000 (encoded as 0xC0 0x80)
would compare greater than a key containing U+0001 (encoded as 0x01),
the opposite of the strict UTF-8 byte order.

From my testing this seems to work fine. I've not bothered patching
the test cases or the documentation yet, because I thought feedback
may be useful first.

BTW, for the record, I release my patch into the Public Domain.
Feel free to discuss.
--
Deron Meranda
http://deron.meranda.us/

escaped-zero-byte.diff

Petri Lehtinen

Oct 7, 2011, 2:16:28 PM
to jansso...@googlegroups.com
Deron Meranda wrote:
> Feel free to discuss.

This is an interesting idea. However, it would definitely require
functions to convert a string with embedded zero bytes to modified
UTF-8 and vice versa, and the users would need to grasp the idea of
getting back something that is not valid UTF-8 and needs to be converted
before using it.

And what about creating values with embedded zero bytes? Would
json_string_nocheck() and json_string_set_nocheck() be used?

-----

Now that we started to think about it, I believe that real support for
zero bytes inside strings wouldn't be that hard to do, at least if
they are restricted to values only. Object keys would have to be
zeroless. We'd need to add 5 new string functions:

* 2 for creating strings with zero bytes (check & nocheck)
* 2 for setting string's value to one with zero bytes (check & nocheck)
* 1 for getting string's length

The decoder would of course have to be adjusted too.

If support was added for object keys, 8 more functions would be
needed:

* get
* set & set_nocheck
* set_new & set_new_nocheck
* del
* iter_at
* iter_key_length

I actually started implementing this once, but was struck by the
amount of extra API required. The idea of restricting it to values only
just came to mind.
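
To make the size of the addition concrete, the value-only variant could
look roughly like this (names and signatures are only a sketch for
discussion, nothing is decided):

    /* Hypothetical additions: strings that may contain zero bytes carry
     * an explicit length instead of relying on null termination. */
    json_t *json_stringn(const char *value, size_t len);
    json_t *json_stringn_nocheck(const char *value, size_t len);
    int json_string_setn(json_t *string, const char *value, size_t len);
    int json_string_setn_nocheck(json_t *string, const char *value, size_t len);
    size_t json_string_length(const json_t *string);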

So what do you think?

Petri

Deron Meranda

Oct 7, 2011, 7:14:28 PM
to jansso...@googlegroups.com
> This is an interesting idea. However, it would definitely require
> functions to convert a string with embedded zero bytes to modified
> UTF-8 and vice versa,

I was hoping for a solution that was very minimal and almost invisible
-- a small amount of internal code changes and no changes to the API.

The whole idea of using Modified UTF-8 (which I believe has its
origins in Java), is that it is basically identical to strict UTF-8,
except it avoids having zero bytes. It is thus more friendly to
languages which use null-terminated strings, which is a perfect
solution for C.

Also I wanted it to be the case that the behavior of jansson would
remain absolutely unchanged for all non-error cases. It is only in the
few cases where jansson would have thrown an error on otherwise legal
JSON text, that the new patched version would now succeed.

So if a user never encounters JSON text that contains a U+0000
character (as an escaped string) - which I suspect is 99.99% of the
time - then jansson will behave identically, and you'll only ever see
strings in strict UTF-8 inside your C code.


> and the users would need to grasp the idea of
> getting back something that is not valid UTF-8 and needs to be converted
> before using it.

Well, it's important to keep in mind there are two string spaces: JSON
and C. The documentation would have to make this concept very clear.

We're not changing the JSON strings. They are always strict UTF-8.
Also, the spec already doesn't allow a raw null (0x00) byte; it must
always be represented by the \u0000 escape in a string literal. And we
would not accept the overlong 0xC0 0x80 inside JSON text either.

In the C string space though, it's best to continue using the C
convention of null-terminated strings, which means that there will
never be any zero bytes in C strings either. But since there's no
spec/RFC to adhere to, we can choose to use Modified UTF-8 rather than
strict UTF-8 inside C. The idea of this patch is that we will
additionally allow the two-byte sequence "\xC0\x80" --- which will
"represent" a zero byte, but it is not actually a zero byte.

It is up to the user to determine what, if anything, to do with a C080
sequence when it is encountered.
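
For instance, a caller that really wants the raw zero bytes back could
post-process a writable copy of the string with a tiny helper along
these lines (just a sketch of the idea, not part of the patch):

    #include <stddef.h>

    /* Rewrite each C0 80 pair as a real zero byte, in place, and return
     * the resulting length. The buffer may now contain embedded zero
     * bytes, so the returned length must be kept alongside it. */
    static size_t modified_utf8_to_raw(char *s)
    {
        size_t in = 0, out = 0;
        while (s[in]) {
            if ((unsigned char)s[in] == 0xC0 &&
                (unsigned char)s[in + 1] == 0x80) {
                s[out++] = '\0';
                in += 2;
            } else {
                s[out++] = s[in++];
            }
        }
        s[out] = '\0';
        return out;
    }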


> And what about creating values with embedded zero bytes? Would
> json_string_nocheck() and json_string_set_nocheck() be used?

Well, yes, all those functions would need to continue to behave with
the same intent as before. The nocheck() versions would just copy the
bytes without any encoding safety checks ... which means that not only
would C080 be allowed, but also completely invalid C081, etc.

The json_string() and json_string_set() functions though would now
have to check the string for being valid Modified UTF-8. I.e., they
do the same thing as before, only they also allow the C080 byte sequence.
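
Sketched as code, the proposed behavior would be (again, this is what
the patch aims for, not what current jansson does):

    #include <assert.h>
    #include <jansson.h>

    int main(void)
    {
        json_t *ok  = json_string("a\xC0\x80" "b");          /* overlong NUL only: accepted */
        json_t *bad = json_string("a\xC0\x81" "b");          /* other overlong forms: still rejected */
        json_t *raw = json_string_nocheck("a\xC0\x81" "b");  /* nocheck copies the bytes unchecked */

        assert(ok != NULL);
        assert(bad == NULL);
        assert(raw != NULL);

        json_decref(ok);
        json_decref(raw);
        return 0;
    }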


> Now that we started to think about it, I believe that real support for
> zero bytes inside strings wouldn't be that hard to do, at least if
> they are restricted to values only. Object keys would have to be
> zeroless. We'd need to add 5 new string functions:

Do we really need new functions? Or just alter the behavior of the
existing ones?

Also I think it's important (and probably easy) to allow U+0000
characters inside object keys as well as string values.

I think my previously-attached patch already does most of it, but I
may have missed a couple of places, such as json_string_set(). I'll
need to look over it again. The only thing which makes the patch a
bit tricky is that the internal utf8_check_first() function rejects
the two-byte sequence C080 as invalid, so that needs to be worked
around without also making it accept C081 (which should always fail).
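
One way to handle that (a standalone sketch of the idea, not the actual
jansson internals) is to special-case the pair before falling through to
the ordinary UTF-8 validation:

    #include <stddef.h>

    /* Accept C0 80, and only C0 80, as a two-byte sequence. Any other
     * C0 xx or C1 xx lead byte is still an overlong encoding and keeps
     * being rejected by the normal UTF-8 check. */
    static int is_modified_utf8_nul(const unsigned char *p, size_t remaining)
    {
        return remaining >= 2 && p[0] == 0xC0 && p[1] == 0x80;
    }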

Jonathan Landis

Oct 7, 2011, 9:07:40 PM
to jansso...@googlegroups.com
On Fri, 2011-10-07 at 19:14 -0400, Deron Meranda wrote:
> Also I wanted it to be the case that the behavior of jansson would
> remain absolutely unchanged for all non-error cases. It is only in the
> few cases where jansson would have thrown an error on otherwise legal
> JSON text, that the new patched version would now succeed.

If this feature is added, please provide a flag for it, and keep the
existing behavior as the default.

JKL


Deron Meranda

Oct 7, 2011, 9:38:58 PM
to jansso...@googlegroups.com

The existing behavior is to return an error (on otherwise valid JSON).

Unless you have code that depends on jansson failing when it encounters a valid \u0000, is a flag really necessary?

If you do have a need to check that a JSON text does not contain a reference to the Unicode character U+0000, then you could always do a strstr(s,"\xC0\x80") as easily as trapping a json_error_t and testing whether its text is equal to "\u0000 is not allowed".

Maybe a flag could be useful? But I'd rather not add complexity to the API unnecessarily. This would be 100% compatible, except in the error/failure cases.

Am I overlooking a need?

Petri Lehtinen

Oct 8, 2011, 6:11:57 AM
to jansso...@googlegroups.com
On Fri, 7 Oct 2011 19:14:28 -0400, Deron Meranda wrote:
> > This is an interesting idea. However, it would definitely require
> > functions to convert a string with embedded zero bytes to modified
> > UTF-8 and vice versa,
>
> I was hoping for a solution that was very minimal and almost invisible
> -- a small amount of internal code changes and no changes to the API.

This is impossible. At the bare minimum, we would need a decoding flag
to allow decoding \u0000 (as Jonathan already mentioned), so that users
would have to explicitly indicate that they are willing to receive
Modified UTF-8. Changing behavior without adding a flag might cause
subtle problems (even security issues) with existing programs that
assume valid UTF-8.
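
For example (the flag name here is purely hypothetical, nothing like it
exists in jansson 2.2.1), the opt-in could look like this from the
caller's side:

    #include <jansson.h>

    /* Hypothetical flag value, shown only to illustrate the shape of an
     * explicit opt-in; it has no effect in the real library. */
    #define JSON_DECODE_MODIFIED_UTF8 0x0100

    json_t *decode_allowing_nul(const char *text, json_error_t *error)
    {
        return json_loads(text, JSON_DECODE_MODIFIED_UTF8, error);
    }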

> The whole idea of using Modified UTF-8 (which I believe has its
> origins in Java), is that it is basically identical to strict UTF-8,
> except it avoids having zero bytes. It is thus more friendly to
> languages which use null-terminated strings, which is a perfect
> solution for C.

[snip]

> In the C string space though, it's best to continue using the C
> convention of null-terminated strings, which means that there will
> never be any zero bytes in C strings either. But since there's no
> spec/RFC to adhere to, we can choose to use Modified UTF-8 rather than
> strict UTF-8 inside C. The idea of this patch is that we will
> additionally allow the two-byte sequence "\xC0\x80" --- which will
> "represent" a zero byte, but it is not actually a zero byte.

I don't buy this argument. There's lots of C code out there that can
handle buffers with embedded zero bytes without any problems. The zero
byte doesn't represent a valid character in any encoding, so using
\u0000 inside JSON strings means that binary data is being passed
around. And in this case, you'll have to deal with string/buffer
lengths, too, because it's C.

> It is up to the user to determine what, if anything, to do with a C080
> sequence when it is encountered.
>
>
> > And what about creating values with embedded zero bytes? Would
> > json_string_nocheck() and json_string_set_nocheck() be used?
>
> Well, yes, all those functions would need to continue to behave with
> the same intent as before. The nocheck() versions would just copy the
> bytes without any encoding safety checks ... which means that not only
> would C080 be allowed, but also completely invalid C081, etc.
>
> The json_string() and json_string_set() functions though would now
> have to check the string for being valid Modified UTF-8. I.e., they
> do the same thing as before, only they also allow the C080 byte sequence.

I believe this is a bad idea, too. I'd like to have separate functions
that take modified UTF-8 as input, to avoid problems with backwards
compatibility, because Jansson's UTF-8 codec currently treats \xc0\x80
as invalid.

> > Now that we started to think about it, I believe that real support for
> > zero bytes inside strings wouldn't be that hard to do, at least if
> > they are restricted to values only. Object keys would have to be
> > zeroless. We'd need to add 5 new string functions:
>
> Do we really need new functions? Or just alter the behavior of the
> existing ones?
>
> Also I think it's important (and probably easy) to allow U+0000
> characters inside object keys as well as string values.

I can think of only one way to avoid adding new functions: Having a
global set of flags that alters the behavior of the library as a
whole. Not a very good idea, though.

On Fri, 7 Oct 2011 21:38:58 -0400, Deron Meranda wrote:
> The existing behavior is to return an error (on otherwise valid
> JSON).
>
> Unless you have code that depends on jansson failing when it
> encounters a valid \u0000, is a flag really necessary?

I doubt there's code that depends on failure on \u0000, but there most
probably is code that depends on Jansson always giving valid UTF-8.
There might also be code that relies on Jansson's ability to check the
validity of UTF-8 that is passed to it. Silently starting to
convert invalid UTF-8 to "\u0000" might introduce unexpected behavior.

> If you do have a need to check that a JSON text does not contain a
> reference to the Unicode character U+0000, then you could always do
> a strstr(s,"\xC0\x80") as easily as trapping a json_error_t and
> testing whether its text is equal to "\u0000 is not allowed".

But existing programs would break.


Petri
