Documentation on limitations and RFC conformance


Deron Meranda

Mar 17, 2010, 4:11:31 PM
to Jansson users
In general jansson appears to implement RFC 4627 (JSON) quite well!
Thanks.

However, like any language-specific binding, there are going to be
limitations and other issues. Would it be possible to add this to the
documentation? This is especially important to people using JSON to
transfer data between disparate languages. These are the things I
think I have determined so far.

1. Strings.

Generally, JSON strings are mapped to C-style null-terminated
character (byte) arrays using the UTF-8 encoding.

Strings may not contain null characters (U+0000), which generally
means that jansson cannot handle direct binary data encoded in JSON
strings. The JSON standard allows this, but jansson follows the
established C language convention of null-terminated strings, so the
null character must be excluded. (Note this could be allowed if
jansson introduced a second type of string value (a binary string?)
represented as a length/pointer pair rather than a C-style string ...
or perhaps, with C++ wrappers, by using the std::string class.)

All other Unicode codepoints U+0001 through U+10FFFF, excluding the
surrogate range U+D800 to U+DFFF, are allowed.
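
To make the null-character limitation concrete, here is a minimal
sketch, assuming the current json_loads(input, &error) signature;
the escaped null below is valid JSON but should fail to parse:

#include <stdio.h>
#include <jansson.h>

int main(void)
{
    json_error_t error;

    /* \u0000 is legal JSON, but jansson cannot store it in a
       null-terminated C string, so the parse should fail. */
    json_t *root = json_loads("[\"a\\u0000b\"]", &error);
    if (!root)
        printf("parse failed as expected: %s (line %d)\n",
               error.text, error.line);
    else
        json_decref(root);

    /* Any other codepoint round-trips fine as UTF-8. */
    json_t *s = json_string("caf\xc3\xa9");  /* "cafe" + U+00E9 */
    printf("%s\n", json_string_value(s));
    json_decref(s);
    return 0;
}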

2. Numbers.

* JSON integer numbers are mapped to C "int" values.
* JSON real numbers are mapped to C "double" values.

* Real vs. Integer. JSON makes no distinction between real and
integer numbers; C does. Thus a JSON number is considered to be a
real number if its lexical representation includes one of "e", "E",
or ".", regardless of whether its actual numeric value is a true
integer. (E.g., all of "1E6", "3.0", "400E-2", and "3.14E3" are
mathematical integers, but will be treated as real values and thus
mapped to the C double type.) When converted to JSON (dumping), real
values are always represented with a fractional part; e.g., the C
double value 3.0 will be represented in JSON as "3.0", not "3". (See
the sketch at the end of this section.)

* Numbers that are too large to fit in one type are not promoted to
another; very large integers may result in an overflow rather than
conversion to a double, regardless of whether such a conversion would
be lossless or lossy. Thus, depending on platform, the JSON number
1000000000000000 may result in an overflow (see below) rather than
being converted to the mathematically equivalent 1.0E+15.

* No support is provided in jansson for any C numeric types other
than 'int' and 'double'. This excludes things such as unsigned types,
long, long long, long double, etc. Obviously, shorter types like
'short' and 'float' are implicitly handled via the ordinary C type
coercion rules (subject to overflow semantics). Also, no support or
hooks are provided for any supplemental "bignum" type add-on packages.

* Signed zeros. JSON makes no statement about what a number means;
however Javascript (ECMAscript) does state that +0.0 and -0.0 must be
treated as being distinct values, i.e., -0.0 != 0.0. The jansson
implementation relies on the underlying floating point library in the
C environment in which it is compiled. Therefore it is platform-
dependent whether 0.0 and -0.0 will be distinct values. Most
platforms that use the IEEE 754 floating-point standard will support
signed zeros. Note this only applies to floating-point; neither JSON,
C, or IEEE support the concept of signed integers.

* Underflow. JSON real numbers that are too small to be represented
in a C double will result in an underflow error (a JSON decoding
error). The number is not rounded to zero, approximated with DBL_MIN,
nor is an IEEE precision underflow indicator set. Thus, depending on
platform, JSON numbers such as 1E-999 will result in a parsing error,
not 0.0.

* Overflow (real). JSON real numbers whose absolute values are too
large to be represented in a C 'double' type will result in an
overflow error (a JSON decoding error). The number is not
approximated with DBL_MAX or the IEEE inf (or -inf) special values,
nor is it coerced to a potentially larger floating-point type like
'long double'. Thus, depending on platform, JSON numbers like 1E+999
will result in a parsing error.

* Overflow (integer). JSON integer numbers whose absolute values
are too large to be represented in a C 'int' type will result in an
overflow error (a JSON decoding error). They are not approximated
with INT_MAX or INT_MIN, nor are they coerced to potentially larger
types (long long) or to floating-point numbers.

* Precision. Parsing JSON real numbers may result in a loss of
precision. As long as underflow or overflow does not occur (i.e. a
total loss of precision), then the rounded approximate value is
silently used. Thus the JSON number 1.000000000000000005 may,
depending on platform, result in the C double 1.0.
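
The real-vs-integer distinction and the dumping behavior described
above can be checked with a short program (again assuming the
current json_loads(input, &error) signature):

#include <stdio.h>
#include <stdlib.h>
#include <jansson.h>

int main(void)
{
    json_error_t error;

    /* "1E6" is a mathematical integer, but its lexical form makes
       it a JSON real, so it decodes to a C double. */
    json_t *arr = json_loads("[1E6, 3]", &error);
    printf("1E6 real? %d\n", json_is_real(json_array_get(arr, 0)));
    printf("3 integer? %d\n", json_is_integer(json_array_get(arr, 1)));

    /* Dumping keeps the distinction: the C double 3.0 is emitted
       with a fractional part, not as "3". */
    json_t *out = json_array();
    json_array_append_new(out, json_real(3.0));
    char *text = json_dumps(out, 0);
    printf("%s\n", text);  /* expect: [3.0] */

    free(text);
    json_decref(out);
    json_decref(arr);
    return 0;
}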

3. Encoding.

Jansson only supports UTF-8 encoded JSON texts (or, by implication,
the US-ASCII subset). It does not support or auto-detect any of the
other encodings mentioned in the RFC, namely UTF-16LE, UTF-16BE,
UTF-32LE or UTF-32BE. (The RFC doesn't mandate that an implementation
must support those other encodings, but does suggest they could be
used and that parsers support autodetection.)


Please feel free to correct or add to anything I've said.
Deron Meranda

Petri Lehtinen

Mar 18, 2010, 2:25:25 AM
to jansso...@googlegroups.com
Deron Meranda wrote:
> In general jansson appears to implement RFC 4627 (JSON) quite well!
> Thanks.
>
> However, like any language-specific binding, there are going to be
> limitations and other issues. Would it be possible to add this to the
> documentation? This is especially important to people using JSON to
> transfer data between disparate languages. These are the things I
> think I have determined so far.
>

Thanks. You have come up with quite an impressive list, and this is a
very good addition to the documentation. See my comments below.

Can you propose how these bignum hooks could be implemented in
Jansson?

I have plans on changing the underlying type of integer to long. This
is a backwards incompatible change, though, so it has to wait until
version 2.0.

>
> * Signed zeros. JSON makes no statement about what a number means;
> however Javascript (ECMAscript) does state that +0.0 and -0.0 must be
> treated as being distinct values, i.e., -0.0 != 0.0. The jansson
> implementation relies on the underlying floating point library in the
> C environment in which it is compiled. Therefore it is platform-
> dependent whether 0.0 and -0.0 will be distinct values. Most
> platforms that use the IEEE 754 floating-point standard will support
> signed zeros. Note this only applies to floating-point; neither JSON,
> C, or IEEE support the concept of signed integers.

"signed integer zeroes" I assume.

>
> * Underflow. JSON real numbers that are too small to be represented
> in a C double will result in an underflow error (a JSON decoding
> error). The number is not rounded to zero, approximated with DBL_MIN,
> nor is an IEEE precision underflow indicator set. Thus, depending on
> platform, JSON numbers such as 1E-999 will result in a parsing error,
> not 0.0.
>
> * Overflow (real). JSON real numbers whose absolute values are too
> large to be represented in a C 'double' type will result in an
> overflow error (a JSON decoding error). The number is not
> approximated with DBL_MAX or the IEEE inf (or -inf) special values,
> nor is it coerced to a potentially larger floating-point type like
> 'long double'. Thus, depending on platform, JSON numbers like 1E+999
> will result in a parsing error.
>
> * Overflow (integer). JSON integer numbers whose absolute values
> are too large to be represented in a C 'int' type will result in an
> overflow error (a JSON decoding error). They are not approximated
> with INT_MAX or INT_MIN, nor are they coerced to potentially larger
> types (long long) or to floating-point numbers.

Would it be useful to have a decoding flag to approximate underflows
with zero and overflows with DBL_MAX? This would require a new set of
decoding functions, though, to add the flag parameter.
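
To illustrate, the clamping itself is cheap; a rough sketch of just
the approximation using plain strtod (nothing jansson-specific, and
the function name is made up):

#include <errno.h>
#include <float.h>
#include <math.h>
#include <stdlib.h>

/* strtod reports both overflow and underflow via ERANGE: it
   returns +/-HUGE_VAL on overflow and a value near zero on
   underflow.  Clamp instead of raising a decoding error. */
static double parse_real_clamped(const char *token)
{
    errno = 0;
    double v = strtod(token, NULL);
    if (errno == ERANGE) {
        if (fabs(v) >= HUGE_VAL)          /* overflow */
            return v > 0 ? DBL_MAX : -DBL_MAX;
        return 0.0;                       /* underflow */
    }
    return v;
}

So parse_real_clamped("1E-999") would yield 0.0 and
parse_real_clamped("1E+999") would yield DBL_MAX.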

>
> * Precision. Parsing JSON real numbers may result in a loss of
> precision. As long as underflow or overflow does not occur (i.e. a
> total loss of precision), then the rounded approximate value is
> silently used. Thus the JSON number 1.000000000000000005 may,
> depending on platform, result in the C double 1.0.
>
> 3. Encoding.
>
> Jansson only supports UTF-8 encoded JSON texts (or, by implication,
> the US-ASCII subset). It does not support or auto-detect any of the
> other encodings mentioned in the RFC, namely UTF-16LE, UTF-16BE,
> UTF-32LE or UTF-32BE. (The RFC doesn't mandate that an implementation
> must support those other encodings, but does suggest they could be
> used and that parsers support autodetection)

I have thought about supporting UTF-16 and UTF-32, but I don't know if
it's worth the trouble.

Deron Meranda

Mar 18, 2010, 1:58:01 PM
to jansso...@googlegroups.com
On Thu, Mar 18, 2010 at 1:25 AM, Petri Lehtinen <pe...@digip.org> wrote:
>>   * No support is provided in jansson for any C numeric types other
>> than 'int' and 'double'. [...]  Also, no support or

>> hooks are provided for any supplemental "bignum" type add-on
>> packages.
>
> Can you propose how these bignum hooks could be implemented in
> Jansson?

Hmm. I certainly don't want to force unnecessary complexity in your
library. But allowing extensions could be useful too.

I guess the best thing would be to just provide a basic hook
structure, and let the user provide the logic to tie it together with
whatever bignum package they want to use, if any.

This probably means that both the encoding and decoding
functions need to allow additional arguments. Probably a
combination of new user-defined json_types, along with
callback functions that can handle the parsing or encoding
when needed.

It may also be reasonable to do this in a generic way,
not just for integers. Maybe something like adding a new
json_type:

typedef enum {
    ...
    JSON_EXTERNAL
} json_type;

and then define an opaque "container" for these external types:

typedef struct {
    json_t json;            /* common JSON value header */
    int external_subtype;   /* user-defined subtype tag */
    void *external_value;   /* opaque pointer to the external value */
} json_external_t;


These user-defined types could then be handled by a
set of callback functions that the user provides.


> I have plans on changing the underlying type of integer to long. This
> is a backwards incompatible change, though, so it has to wait until
> version 2.0.

That would be welcome, though I already use some JSON where
long long would be nice too (that is admittedly already beyond
the capabilities of Javascript, but not of, for example, Python).


>>   * Signed zeros.   ...


>> Note this only applies to floating-point; neither JSON,
>> C, or IEEE support the concept of signed integers.
>
> "signed integer zeroes" I assume.

Yes, of course :)


> Would it be useful to have a decoding flag to approximate underflows
> with zero and overflows with DBL_MAX? This would require a new set of
> decoding functions, though, to add the flag parameter.

Approximating underflows with 0.0 I think makes a lot of
sense. Overflows though are, I would assume, much more
controversial. Again this may be a case where having "fixup"
hooks that allow user-defined behavior to be expressed
would be better.

Or perhaps even thinking about all this on a bigger picture,
it may be nicer to just have a single user-supplied callback
that could be used to fix up all values. With a signature
something like:

json_t *(*fixup_callback)(
    const char *text,           /* original lexical text of the token */
    json_t *value,              /* built-in parse result, or NULL */
    json_error_t *error,        /* error from the built-in conversion */
    void *callback_private);    /* opaque user state, passed through */

Then the parsing functions would accept two additional
parameters: the callback function pointer (to the above),
and an opaque void* for passing through user-defined
private data to the callback. Either of those could be NULL.

This callback, if not NULL, would be called immediately after
parsing every fundamental JSON value, and could optionally
either let the normally-computed value go through unchanged,
or replace it, or raise an error. The callback would even be
called for values that were syntactically correct, but for which
a value could not be computed (e.g., numeric overflows).

The 'text' field would contain the original lexical string that made
up the original JSON literal. The 'value' field would contain
the computed value using the existing built-in rules, or NULL
if it could not be computed (e.g., overflow).

The 'error' field would contain a pointer to the error as raised
by the built-in value conversion; and could also be used
by the callback to replace the error message with something
else by overwriting its text member.

Finally, the 'callback_private' pointer would just be passed
through unchanged. It would allow the user-provided
callbacks to maintain some sort of state, if desired.

The callback would just return 'value' if it didn't wish to
make any changes. Or it could synthesize a new value
(perhaps of type JSON_EXTERNAL) and return that
instead. Or return NULL if it wants to raise an error.
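
As a sketch of how a user might fill that in (purely hypothetical,
since no such hook exists in jansson today): a callback that catches
overflowed integer tokens and preserves the raw digits as a string,
for a bignum package to consume after parsing.

#include <jansson.h>

static json_t *bignum_fixup(const char *text, json_t *value,
                            json_error_t *error, void *callback_private)
{
    (void)error;
    (void)callback_private;

    /* Built-in conversion failed (e.g. integer overflow) but the
       token was syntactically valid: keep the raw digits. */
    if (value == NULL && text != NULL)
        return json_string(text);

    return value;   /* let everything else through unchanged */
}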


So not only would this give the caller an opportunity to
handle overflowed numbers, they could also manipulate
values which have no inherent problems, for example
giving the user a hook to allow for interning of string
values. For some types of repetitive JSON input, this
could result in huge memory use reductions.


The reverse, creating JSON, would need some other
mechanism. Something perhaps that allowed direct
insertion of characters into the JSON output buffers. Of
course that would depend on the user-supplied functions
to produce syntactically valid JSON, but so be it.


I don't know, those are just some very rough ideas.
I'm sure it would need a lot of refinement, if done.
Or perhaps a different approach entirely?


> I have thought about supporting UTF-16 and UTF-32,
> but I don't know if it's worth the trouble.

It's probably not worth it, especially as NULL bytes will
need to be allowed in strings for the other encodings,
and that would be a big change. UTF-8 seems to be
pretty much dominant; and the user can easily translate
or re-encode between all the encodings if desired.

However it may be worthwhile to at least put in a bit
of encoding auto-detection logic at the start of
the various loads() functions; and to raise an error
if it detects something other than UTF-8. But then
again, if NULL bytes aren't allowed, you'll already
bomb out from parsing within the first couple bytes
anyway.
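
For what it's worth, the detection trick from section 3 of the RFC
is tiny. A rough sketch of just the test (a JSON text starts with
two ASCII characters, so null octets in the first four bytes give
the encoding away):

#include <stddef.h>

/* Returns nonzero if the first bytes look like UTF-8 (or the
   buffer is too short to tell); zero for a UTF-16/32 variant. */
static int looks_like_utf8(const unsigned char *buf, size_t len)
{
    if (len < 4)
        return 1;
    return buf[0] != 0 && buf[1] != 0 && buf[2] != 0 && buf[3] != 0;
}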

Perhaps it is best just to document it.
--
Deron Meranda

Petri Lehtinen

Mar 19, 2010, 2:28:35 AM
to jansso...@googlegroups.com

This would probably make the library too complex. A SAX-style library
(e.g. yajl) would probably be more suitable. It would be an
interesting project to make a library that uses yajl for parsing and
Jansson for data representation. This combination would probably allow
these hooks you suggest.

It probably wouldn't solve the issue for bignums, though. They would
really need to be able to hook in at the point where an input token
has been read but hasn't yet been converted to the C type. And this
solution would not be very generic; I only see it being useful for
integers.

>
>
> The reverse, creating JSON, would need some other
> mechanism. Something perhaps that allowed direct
> insertion of characters into the JSON output buffers. Of
> course that would depend on the user-supplied functions
> to produce syntactically valid JSON, but so be it.
>
>
> I don't know, those are just some very rough ideas.
> I'm sure it would need a lot of refinement, if done.
> Or perhaps a different approach entirely?
>
>
> > I have thought about supporting UTF-16 and UTF-32,
> > but I don't know if it's worth the trouble.
>
> It's probably not worth it, especially as NULL bytes will
> need to be allowed in strings for the other encodings,
> and that would be a big change. UTF-8 seems to be
> pretty much dominant; and the user can easily translate
> or re-encode between all the encodings if desired.

Jansson uses UTF-8 as the internal string representation, so my idea
of supporting other encodings was that the input could be converted to
UTF-8 in the parser. This way there's no need to support null bytes
inside strings.
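
Roughly, the parser front end could transcode first, e.g. with POSIX
iconv. A sketch, with error handling abbreviated:

#include <iconv.h>
#include <stdlib.h>

/* Transcode a UTF-16 buffer to a NUL-terminated UTF-8 string
   that can be handed to json_loads().  Returns NULL on failure. */
static char *utf16_to_utf8(char *in, size_t inlen)
{
    iconv_t cd = iconv_open("UTF-8", "UTF-16");
    if (cd == (iconv_t)-1)
        return NULL;

    size_t outlen = 2 * inlen + 1;   /* UTF-8 needs at most 1.5x */
    char *out = malloc(outlen);
    char *inp = in, *outp = out;
    size_t inleft = inlen, outleft = outlen - 1;

    if (out == NULL ||
        iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
        free(out);
        out = NULL;
    } else {
        *outp = '\0';
    }
    iconv_close(cd);
    return out;
}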

>
> However it may be worthwhile to at least put in a bit
> of encoding auto-detection logic at the start of
> the various loads() functions; and to raise an error
> if it detects something other than UTF-8. But then
> again, if NULL bytes aren't allowed, you'll already
> bomb out from parsing within the first couple bytes
> anyway.
>
> Perhaps it is best just to document it.

Yes :)

Petri

Deron Meranda

Mar 19, 2010, 11:58:07 AM
to jansso...@googlegroups.com
On Fri, Mar 19, 2010 at 1:28 AM, Petri Lehtinen <pe...@digip.org> wrote:
> This would probably make the library too complex.

Yep it would, so that's your call.

Your code is at least licensed such that (and clean enough that)
users could hack these changes in if they really needed them.


> It probably wouldn't solve the issue for bignums, though. They would
> really need to be able to hook in at the point where an input token
> has been read but hasn't yet been converted to the C type. And this
> solution would not be very generic; I only see it being useful for
> integers.

Handling of integers is the obvious case where extensibility
could make sense. However there are so many different bignum
packages that I don't see how they could be accommodated
without some sort of callback mechanism and opaque (void*)
object types.

Just going up to a native C integer type larger than an 'int'
('long' or 'long long') would be beneficial. That, and treating
floating-point underflows as 0.0 (overflows should continue
raising errors).

Or perhaps, a new type JSON_UNPARSABLE, which would
contain the raw json text in all cases that were deemed syntactically
correct but where you couldn't convert to a value, and let the user do
with them what they want post-parsing. That could also be used for
strings that contained \u0000. You'd also want a symmetric capability
for creating json, where the user could provide a raw token which
they claimed was already in json format. You'd only need this for
primitive types/tokens; compound structures like objects and arrays
don't really need any extensibility.
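
Roughly, in the same spirit as the json_external_t sketch earlier
(again purely hypothetical):

typedef struct {
    json_t json;      /* common value header */
    char *raw_text;   /* the original token, e.g. "1E+999" */
} json_unparsable_t;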

Or of course, do nothing. I suspect few users need anything more.


> Jansson uses UTF-8 as the internal string representation, so my idea
> of supporting other encodings was that the input could be converted to
> UTF-8 in the parser. This way there's no need to support null bytes
> inside strings.

Oh. That may work well. Though I don't see a huge need for
it, so it's just whether you want to do the work.
--
Deron Meranda
