reading several JSON values with json_loadf?


Basile Starynkevitch

May 17, 2011, 2:36:01 PM
to jansso...@googlegroups.com
Hello All,

Is it permitted to read several JSON values from the same FILE*, e.g.

json_error_t err;
FILE* f = fopen("somefile.txt","r");
json_t* v1 = NULL; json_t *v2 = NULL; json_t *v3 = NULL;
v1 = json_load(f, 0, &err);
dosomethingwith(v1);
v2 = json_load(f, 0, &err);
dosomethingwith(v2);
v3 = json_load(f, 0, &err);
dosomethingwith(v3);

Of course, I could try, but I prefer a more serious opinion.

Likewise, is it permitted to write several JSON values, e.g.
json_dumpf(v1, f, 0);
json_dumpf(v2, f, 0);

The reason to do that is when one doesn't want to put the values in an array, because there might be thousands of them, and keeping a single json_t array containing all of them is not realistic.

Regards.

--
Basile STARYNKEVITCH http://starynkevitch.net/Basile/
email: basile<at>starynkevitch<dot>net mobile: +33 6 8501 2359
8, rue de la Faiencerie, 92340 Bourg La Reine, France
*** opinions {are only mine, sont seulement les miennes} ***

Ladar Levison

May 17, 2011, 3:00:58 PM
to jansso...@googlegroups.com, Basile Starynkevitch
Everything passed to the loader must be syntactically valid JSON. That
means all of the entries need to be wrapped in an object or an array:

http://www.digip.org/jansson/doc/2.0/apiref.html#decoding

If you wanted to process each object individually, you could read them
from the file yourself and then use the json_loads() function.

Basile Starynkevitch

May 17, 2011, 3:34:35 PM
to jansso...@googlegroups.com, Ladar Levison
On Tue, 17 May 2011 14:00:58 -0500
Ladar Levison <la...@lavabit.com> wrote:

> Everything passed to the loader must be syntactically valid JSON. That
> means all of the entries need to be wrapped using an object, or an array:
>
> http://www.digip.org/jansson/doc/2.0/apiref.html#decoding
>
> If you wanted to process each object individually, you could read them
> from the file yourself and then use the json_loads() function.


According to the documentation:

json_t *json_loadf(FILE *input, size_t flags, json_error_t *error)
Return value: New reference.

Decodes the JSON text in stream input and returns the array or object
it contains, or NULL on error, in which case error is filled with
information about the error. flags is currently unused, and should be
set to 0.


So I mean to call several times json_loadf on the same open FILE*.


> > json_error_t err;
> > FILE* f = fopen("somefile.txt","r");
> > json_t* v1 = NULL; json_t *v2 = NULL; json_t *v3 = NULL;

> > v1 = json_loadf(f, 0,&err);
> > dosomethingwith(v1);
> > v2 = json_loadf(f, 0,&err);
> > dosomethingwith(v2);
> > v3 = json_loadf(f, 0,&err);


> > dosomethingwith(v3);
> >
> > Of course, I could try, but I prefer a more serious opinion.


My previous email had a typo (I wrote json_load instead of json_loadf).
What I am reading is a sequence of several valid JSON texts. For what
reason would json_loadf only load a single one? What is its purpose if it
cannot be used on an already opened FILE*? (Imagine a specialized HTTP
or SMTP server accepting and replying only in JSON; the header information
has already been parsed, and you'll have an opened FILE*.)

By the way, using json_loadf may also help with a limited case of //
comments. One could imagine parsing the // comment by user code, and
then passing the opened FILE* to json_loadf.

Glancing at the source code, it seems that using json_loadf as I
suggest might be possible, at least when the various JSON values are
separated by blank lines.

I do know that the concatenation of several JSON texts is not itself a
valid JSON text; however, I also find it extremely useful to be able to
read (and write) several JSON chunks in one FILE*.

Cheers.

Ladar Levison

May 17, 2011, 4:34:11 PM
to jansso...@googlegroups.com

From RFC 4627:

A JSON value MUST be an object, array, number, or string, or one of
the following three literal names:

false null true

Since only objects and arrays can encapsulate children, they need to be
at the root of any valid structure holding more than one data item.

The json_loadf() function was written with the assumption that the
entire file is a single valid JSON object, so if all the items aren't
encapsulated the load will fail.

Like I said before, you could easily read each line out of the file
yourself, and then pass the data to json_loads() instead.


Jonathan Landis

May 17, 2011, 4:53:42 PM
to jansso...@googlegroups.com
On Tue, 2011-05-17 at 15:34 -0500, Ladar Levison wrote:
> Like I said before , you could easily read each line out of the file
> yourself, and then pass the data to json_loads() instead.

Maybe you could try parsing using json_loadf, and then detect the
location of the first error from the error object. Presumably that is
where the next object starts. Then you know where the object begins and
ends, and you can copy it to a memory buffer and parse it from there
using json_loads.

JKL

Ladar Levison

May 17, 2011, 5:01:29 PM
to jansso...@googlegroups.com
Just sending along 3 patches. The optional param patch adds the 'S'
token to the pack/unpack functions. Optional params are allowed to be
strings, or NULL values, without causing an error. For example:

"{s:S}", "hello", "world" = {"hello": "world"}
"{s:S}", "hello", NULL = {"hello": null}

Note this only applies to object values. Object keys and array entries
are still required, so the 'S' token won't work there.

I'm also including a patch that allows me to poll the library version at
runtime, and a patch to convert some of the important macros into
compiled functions. I personally needed the compiled functions so I
could load and use the library via dlopen.

Wondering what's next for Jansson? Personally I'd like to see support
added for arbitrary (aka binary) strings. I'm a little nervous about
passing around untrusted data as NULL terminated strings. I'd prefer to
hand off buffers with a length parameter, and pick them up with a length
value. Presumably Jansson would take care of the encoding/decoding of
any binary data it found.

Anyone tried encoding/escaping binary data and then using the
*_nocheck() functions to send/read it?


jansson-inlines.patch
jansson-optional-params.patch
jansson-version.patch

Graeme Smecher

May 17, 2011, 5:11:29 PM
to jansso...@googlegroups.com
Hi Ladar,

(I see Jansson is now available in Debian. Whoever did that -- thanks,
and congrats!)

On 17/05/11 02:01 PM, Ladar Levison wrote:
> Just sending along 3 patches. The optional param patch adds the 'S'
> token to the pack/unpack functions. Optional params are allowed to be
> strings, or NULL values, without causing an error. For example:
>
> "{s:S}", "hello", "world" = {"hello": "world"}
> "{s:S}", "hello", NULL = {"hello": null}
>
> Note this only applies to object values. Object keys and array
> entries are still required, so the 'S' token won't work there.

Why not array entries? I think '["foo", "bar", null, "baz"]' is valid
JSON. Also, I'd imagine these patches will be easier to integrate if
they include test cases.

> I'm also including a patch that allows me to poll the library version
> at runtime, and a patch to convert some of the important macros into
> compiled functions. I personally needed the compiled functions so I
> could load and use the library via dlopen.
>
> Wondering what's next for Jansson? Personally I'd like to see support
> added for arbitrary (aka binary) strings. I'm a little nervous about
> passing around untrusted data as NULL terminated strings. I'd prefer
> to hand off buffers with a length parameter, and pick them up with a
> length value. Presumably Jansson would take care of the
> encoding/decoding of any binary data it found.

I have a couple of itches to scratch. I will probably not have the
chance to tackle these unless they become more than an itch, but here
they are:

* restartable parsing, i.e. for parsing a JSON object available as a
sequence of character arrays
* disabling unicode support, i.e. while running on very limited CPUs

cheers,
Graeme Smecher

Ladar Levison

May 17, 2011, 6:11:59 PM
to jansso...@googlegroups.com
On 5/17/2011 4:11 PM, Graeme Smecher wrote:
> Why not array entries? I think '["foo", "bar", null, "baz"]' is valid
> JSON. Also, I'd imagine these patches will be easier to integrate if
> they include test cases.

Wasn't sure optional NULL support would be a good idea for arrays. I'm
still working out the best ways to interface with JSON, but I didn't
think it would be good to encourage "[S,S,S,S]", &a, &b, &c, &d, since
arrays can have an arbitrary number of entries and the values can be in
any order.

That said, looking at the code, the patch may already offer S for the
pack methods. And if you add 'S' to the unpack_value_starters string, it
might work for unpacking too.

P.S. I agree on the test cases. I have a few for my app that test this
functionality, but didn't write any for the library itself. If only
there was more caffeine.


Basile Starynkevitch

May 18, 2011, 1:35:50 AM
to jansso...@googlegroups.com, Ladar Levison

Yes, and json_loadf even tests that the JSON value is followed by end of file.

The following program demonstrates it:
#######################################################################
/* file testmanyj.c */
#include <stdio.h>
#include <string.h>
#include <jansson.h>

int
main (void)
{
  FILE *f = NULL;
  json_t *j1 = NULL;
  json_t *j2 = NULL;
  json_t *j3 = NULL;
  json_error_t err;
#define MYFILENAME "testfile.txt"
  f = fopen (MYFILENAME, "w");
  j1 = json_pack ("{s:s,s:i,s:[ss]}",
                  "name", "Doe",
                  "number", 1234, "address", "Bar Street", "There");
  json_dumpf (j1, f, 0);
  fputc ('\n', f);
  json_decref (j1), j1 = NULL;
  j2 = json_pack ("[sii]", "Yep", 33, 44);
  json_dumpf (j2, f, 0);
  fputc ('\n', f);
  json_decref (j2), j2 = NULL;
  j3 = json_pack ("s", "String");
  json_dumpf (j3, f, 0);
  json_decref (j3), j3 = NULL;
  fputc ('\n', f);
  fclose (f);
  memset (&err, 0, sizeof (err));
  f = fopen (MYFILENAME, "r");
  /* first json value */
  j1 = json_loadf (f, 0, &err);
  if (j1)
    {
      printf ("loaded j1 @ %p: ", (void *) j1);
      json_dumpf (j1, stdout, 0);
      putchar ('\n');
      json_decref (j1), j1 = NULL;
    }
  else
    printf ("failed to load j1: %s @ %s:%d:%d\n", err.text, err.source,
            err.line, err.column);
  /* second json value */
  j2 = json_loadf (f, 0, &err);
  if (j2)
    {
      printf ("loaded j2 @ %p: ", (void *) j2);
      json_dumpf (j2, stdout, 0);
      putchar ('\n');
      json_decref (j2), j2 = NULL;
    }
  else
    printf ("failed to load j2: %s @ %s:%d:%d\n", err.text, err.source,
            err.line, err.column);
  /* third json value */
  j3 = json_loadf (f, 0, &err);
  if (j3)
    {
      printf ("loaded j3 @ %p: ", (void *) j3);
      json_dumpf (j3, stdout, 0);
      putchar ('\n');
      json_decref (j3), j3 = NULL;
    }
  else
    printf ("failed to load j3: %s @ %s:%d:%d\n", err.text, err.source,
            err.line, err.column);
  fclose (f);
  return 0;
}
######################################################
When running, it gives:


% ./testmanyj
failed to load j1: end of file expected near '[' @ <stream>:2:1
failed to load j2: '[' or '{' expected near '"Yep"' @ <stream>:1:5
failed to load j3: '[' or '{' expected near ',' @ <stream>:1:1
% cat -n testfile.txt
1 {"address": ["Bar Street", "There"], "name": "Doe", "number": 1234}
2 ["Yep", 33, 44]
3
######################################################
Strangely, j3 has not been dumped.

Regards.

Petri Lehtinen

May 18, 2011, 1:46:41 AM
to jansso...@googlegroups.com
Basile Starynkevitch wrote:
> Hello All,
>
> Is it permitted to read several JSON values from the same FILE*, e.g.
>
> json_error_t err;
> FILE* f = fopen("somefile.txt","r");
> json_t* v1 = NULL; json_t *v2 = NULL; json_t *v3 = NULL;
> v1 = json_load(f, 0, &err);
> dosomethingwith(v1);
> v2 = json_load(f, 0, &err);
> dosomethingwith(v2);
> v3 = json_load(f, 0, &err);
> dosomethingwith(v3);
>
> Of course, I could try, but I prefer a more serious opinion.

Currently this doesn't work. As Ladar said, all json_load* functions
expect the _whole_ input to be a single, valid JSON text.

The decoder first skips all whitespace, then decodes an array or
object, then skips all whitespace again, and then returns success if
EOF was reached. Anything other than EOF is considered garbage and
will make the decoding fail.

Particularly, seeking the file yourself before calling json_loadf()
works as you expect.

Making it possible to decode many JSON texts from a single input
stream has been requested before. At that time, there was no flag
parameter to the decoding functions yet, but now that we have it, this
would only be a matter of adding a flag.

With the flag on, the decoder would stop and return success after
successfully decoding an array or object, leaving the file position
pointing to the first character after the '}' or ']'.

There is an issue, though: When one of the values in your long stream
of arrays and/or objects has an error, there's no way to
"re-synchronize" the stream, i.e. to move the file position to the end
of the erroneous array or object.

> Likewise, is it permitted to write several JSON values, e.g.
> json_dumpf(v1, f, 0);
> json_dumpf(v2, f, 0);

This works as you would expect. json_dumpf() only writes the JSON text
at the current file position and leaves the file open after writing.

Petri

Petri Lehtinen

May 18, 2011, 1:51:22 AM
to jansso...@googlegroups.com, Ladar Levison

It's a string, and json_dump* functions will only encode objects and
arrays, as per the RFC.

Petri

Basile Starynkevitch

May 18, 2011, 2:01:08 AM
to jansso...@googlegroups.com, Petri Lehtinen
On Wed, 18 May 2011 08:46:41 +0300
Petri Lehtinen <pe...@digip.org> wrote:
[....]
> Currently this doesn't work. As Ladar said, all json_load* functions
> expect the _whole_ input to be a single, valid JSON text.

Yes, and they expect an EOF after that single JSON value. In practical
terms, this means that, when dumping JSON values, jansson favors lots
of (sometimes very) small text files (one per JSON value). This might
put a burden on some filesystems.

I believe that fact should be reflected in the documentation. The
current one reads:
#######


json_t *json_loadf(FILE *input, size_t flags, json_error_t *error)
Return value: New reference.

Decodes the JSON text in stream input and returns the array or object
it contains, or NULL on error, in which case error is filled with
information about the error. flags is currently unused, and should be
set to 0.

######

I suggest stating explicitly instead:
######


Decodes the JSON text in stream input and returns the array or object
it contains, or NULL on error, in which case error is filled with

information about the error. The JSON text should be followed by the
end of file, unless flags contains JSON_LOAD_UNENDED.
######

[....]


> To make it possible to decode many JSON texts from a single input
> stream has been requested before. At that time, there was no flag
> parameter to the decoding functions yet, but now as we have it, this
> would only be a matter of adding a flag.
>
> With the flag on, the decoder would stop and return success after
> successfully decoding an array or object, leaving the file position
> pointing to the first character after the '}' or ']'.

Yes, such a flag would be very nice.


>
> There is an issue, though: When one of the values in your long stream
> of arrays and/or objects has an error, there's no way to
> "re-synchronize" the stream, i.e. to move the file position to the end
> of the erroneous array or object.

I am not sure that is a serious issue. If there is an error, there is
not much to do except abort the processing. And it is even the same
when reading a single JSON value: if a comma has been transmuted into a
dot (e.g. by some improbable hardware failure), the JSON parsing would
have failed, and jansson (rightly) gives no way to recover cleverly
from that.

Regards

Petri Lehtinen

May 18, 2011, 2:23:02 AM
to jansso...@googlegroups.com
Ladar Levison wrote:
> Just sending along 3 patches. The optional param patch adds the 'S'
> token to the pack/unpack functions. Optional params are allowed to be
> strings, or NULL values, without causing an error. For example:
>
> "{s:S}", "hello", "world" = {"hello": "world"}
> "{s:S}", "hello", NULL = {"hello": null}
>
> Note this only applies to object values. Object keys and array
> entries are still required, so the 'S' token won't work there.
>
> I'm also including a patch that allows me to poll the library
> version at runtime, and a patch to convert some of the important
> macros into compiled functions. I personally needed the compiled
> functions so I could load and use the library via dlopen.

I'm reluctant to un-inline these functions, especially json_incref()
and json_decref(), because of the performance hit.

What prevents you from using the macros even if you otherwise use
dlopen()? Just #include the header and you have all the macros
available. AFAICS, you'll have to #include it anyway to make the
json_t struct available...

You can shadow the global function declarations in a function scope
with your own variables, or add your own prefix in the variable names
if you want to bind the function pointers in the global scope.

> Wondering what's next for Jansson? Personally I'd like to see
> support added for arbitrary (aka binary) strings. I'm a little
> nervous about passing around untrusted data as NULL terminated
> strings. I'd prefer to hand off buffers with a length parameter, and
> pick them up with a length value. Presumably Jansson would take care
> of the encoding/decoding of any binary data it found.
>
> Anyone tried encoding/escaping binary data and then using the
> *_nocheck() functions to send/read it?

I'm planning to release v2.1 before the end of May, except if I get
enough interesting features to be added before that :) There are some
new features and a minor bugfix already.

As for using binary data, I'm reluctant to support it, as the JSON
specification explicitly states that everything shall be Unicode
(although Mr. Crockford has a quite liberal conception of Unicode).

Adding support for zero bytes inside strings still wouldn't make it
possible to use binary data directly with Jansson. However, it would
make it possible to use binary data by first encoding it as UTF-8, as
Jansson (in line with Mr. Crockford's opinions) allows all code
points. And AFAIK there are no invalid code points in the range
U+0000...U+00FF anyway, so the binary data would map into this range.
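That byte-to-code-point mapping (each byte becomes the code point U+0000...U+00FF) is easy to sketch. bytes_to_utf8() is a hypothetical helper, not a jansson function, and note that a 0x00 byte still encodes to a zero byte, so this alone does not remove the need for null-byte support:

```c
#include <stdlib.h>

/* Encode raw bytes as UTF-8 by treating each byte as the code point
   U+0000..U+00FF. Bytes < 0x80 encode as themselves; bytes >= 0x80
   encode as a two-byte sequence. Returns a malloc'd, NUL-terminated
   buffer (length also stored in *out_len), or NULL on allocation
   failure. */
static char *bytes_to_utf8(const unsigned char *buf, size_t len,
                           size_t *out_len)
{
    size_t i, n = 0;
    char *out, *p;

    for (i = 0; i < len; i++)
        n += (buf[i] < 0x80) ? 1 : 2;

    out = malloc(n + 1);
    if (!out)
        return NULL;

    p = out;
    for (i = 0; i < len; i++) {
        if (buf[i] < 0x80) {
            *p++ = (char) buf[i];
        } else {
            *p++ = (char) (0xC0 | (buf[i] >> 6));   /* leading byte */
            *p++ = (char) (0x80 | (buf[i] & 0x3F)); /* continuation byte */
        }
    }
    *p = '\0';
    if (out_len)
        *out_len = n;
    return out;
}
```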

Petri

Petri Lehtinen

May 18, 2011, 2:29:21 AM
to jansso...@googlegroups.com
Ladar Levison wrote:
> On 5/17/2011 4:11 PM, Graeme Smecher wrote:
> >Why not array entries? I think '["foo", "bar", null, "baz"]' is
> >valid JSON. Also, I'd imagine these patches will be easier to
> >integrate if they include test cases.
>
> Wasn't sure optional NULL support would be a good idea for arrays.
> I'm still working out the best ways to interface with JSON, but I
> didn't think it would be good to encourage "[S,S,S,S]", &a, &b, &c,
> &d. Since arrays can have an arbitrary number of results, and the
> values can be in any order.

The S format sounds like a good idea to me. I agree with Graeme
that it should be available for array packing and all unpacking, too.

> That said, looking at the code, the patch may already offer S for
> the pack methods. And if you add 'S' to the unpack_value_starters
> string, it might work for unpacking too.

Correct. Looking at the code myself now, I don't know/remember why the
unpack_value_starters string is even there. The only reason seems to
be to catch an invalid format character before catching an array index
out of range error.

> P.S. I agree on the test cases. I have a few for my app that test
> this functionality, but didn't write any for the library itself. If
> only there was more caffeine.

I'll go on and apply your patch when I have time. Don't worry about
the test cases, I can write those if you have to sleep :)

Petri

Petri Lehtinen

May 18, 2011, 2:53:04 AM
to jansso...@googlegroups.com
Graeme Smecher wrote:
> I have a couple of itches to scratch. I will probably not have the
> chance to tackle these unless they become more than an itch, but
> here they are:
>
> * restartable parsing, i.e. for parsing a JSON object available
> as a sequence of character arrays

I started working on this feature last October, but after doing it for
a week I realized that it's really a lot of work.

It makes the decoder much more complicated code-wise, as you basically
have to check for end of data after reading every single character,
and also be able to restart the parsing from any point. It probably
also hurts decoding performance because of the extra checks, but
I'm not really concerned about that.

I still have the code around if you want to have a look. IIRC, I got
to the point where the tokenizer works, or should work, but didn't
touch the parser yet.

> * disabling unicode support, i.e. while running on very limited CPUs

This is on the roadmap, too, and is also quite a bit of work. I thought
it would be reasonable to have a ./configure flag (and a
jansson_config.h option) to disable all UTF-8 related code and replace
it with much simpler ASCII logic.

IIRC, the reason I didn't start implementing this a year ago already
was that I couldn't figure out how to adapt the test suite. And
nobody has requested it since :)

Petri

Ladar Levison

May 19, 2011, 4:36:06 AM
to jansso...@googlegroups.com
On 5/18/2011 1:23 AM, Petri Lehtinen wrote:
> What prevents you from using the macros even if you otherwise use
> dlopen()? Just #include the header and you have all the macros
> available. AFAICS, you'll have to #include it anyway to make the
> json_t struct available...
>
> You can shadow the global function declarations in a function scope
> with your own variables, or add your own prefix in the variable names
> if you want to bind the function pointers in the global scope.
>

I try to avoid using library macros in my program because if the library
is updated, the macros aren't, so it's impossible to update the shared
libraries without recompiling the consumer code. The macros also make
use of the original function names, which don't correspond to the
variable names assigned by dlsym(). (Not sure what you mean by
shadowing the global declarations?)

Might I suggest making them functions, and then using macros to enable
the always_inline function attribute? It should accomplish the same
thing. http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html
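A minimal sketch of that pattern, under stated assumptions: every name here (the macro, the stand-in type, the helper) is hypothetical and not part of jansson. The attribute rides on an ordinary static inline function, with a fallback for compilers that lack it:

```c
#include <stddef.h>

/* Hypothetical macro: expands to GCC's always_inline attribute where
   available, and to a plain static inline elsewhere. */
#if defined(__GNUC__)
#  define MY_ALWAYS_INLINE static inline __attribute__((always_inline))
#else
#  define MY_ALWAYS_INLINE static inline
#endif

/* Stand-in for json_t, just to keep the sketch self-contained. */
typedef struct { size_t refcount; } fake_json_t;

/* Sketch of an incref-style helper that the compiler is asked to
   force inline at every call site. */
MY_ALWAYS_INLINE fake_json_t *fake_incref(fake_json_t *json)
{
    if (json)
        json->refcount++;
    return json;
}
```

A real library taking this route would pair the inline definition in the header with an exported, non-inline function of the same behavior, so that dlsym() users still have a symbol to look up.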

You could also create two versions of the function. The original
traditional public function, and the macro alternatives using the _fast
suffix. Internally the library could use the macros, but external
programs could go through the traditional interface...


> Adding support for zero bytes inside strings wouldn't still make it
> possible to directly use binary data with Jansson. However, it would
> make it possible to use binary data by first encoding it as UTF-8, as
> Jansson (in line with mr. Crockford's opinions) allows all code
> points. And AFAIK there are no invalid code points in the range
> U+0000...U+00FF anyway, and the binary data would map into this range.

That's what I figured. But are you saying that if I pass the data to
Jansson already encoded it will work? Or not? I'm working with email
messages, so I'm not quite sure how I should deal with the situation
where a message contains a NULL character, and would rather not have to
worry about that scenario at all...


Petri Lehtinen

May 19, 2011, 5:30:57 AM
to jansso...@googlegroups.com
Ladar Levison wrote:
> On 5/18/2011 1:23 AM, Petri Lehtinen wrote:
> >What prevents you from using the macros even if you otherwise use
> >dlopen()? Just #include the header and you have all the macros
> >available. AFAICS, you'll have to #include it anyway to make the
> >json_t struct available...
> >
> >You can shadow the global function declarations in a function scope
> >with your own variables, or add your own prefix in the variable names
> >if you want to bind the function pointers in the global scope.
> >
>
> I try to avoid using library macros in my program because if the
> library is updated, the macros aren't, so it's impossible to update
> the shared libraries without recompiling the consumer code.

Jansson 2.x is guaranteed to be backwards compatible to the distant
future for both API and ABI, so you shouldn't have to be concerned
about compatibility.

> The
> macros also make use of the original function names, which don't
> correspond to the variable names assigned to by dlsym(). (Not sure
> what you mean by shadow the global decelerations?)

I mean that you can do this:

#include <dlfcn.h>
#include <jansson.h>
/* declares void json_delete(json_t *json) */
/* defines the json_decref(x) macro that uses json_delete() */

static void my_function(void)
{
    void (*json_delete)(json_t *);
    void *handle;
    json_t *value;

    handle = dlopen("/path/to/libjansson.so", RTLD_NOW);
    json_delete = dlsym(handle, "json_delete");

    /* ... */

    json_decref(value);
}

This works as the function pointer declared in the function's scope
shadows the global json_delete symbol (which is unresolved if you use
dlopen()). As json_decref() is just a macro, it expands to code that
uses json_delete, which in turn resolves to the json_delete function
pointer declared in your function.

The functions other than json_delete() may be more problematic,
though, as they're not wrapped by a macro but by a static inline
function. May I ask why you're using dlopen() in the first place?

> Might I suggest making them functions, and then using macros to
> enable the always_inline function attribute? It should accomplish
> the same thing.
> http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html

Doesn't work, as the body of the functions would still have to be in
the header file for inlining to be possible. And this is exactly why
json_object_set() and friends don't work for your use case.

> You could also create two versions of the function. The original
> traditional public function, and the macro alternatives using the
> _fast suffix. Internally the library could use the macros, but
> external programs could go through the traditional interface...

This would be doable, but I'd rather name the function versions _slow
and keep the macros for normal use.

> >Adding support for zero bytes inside strings wouldn't still make it
> >possible to directly use binary data with Jansson. However, it would
> >make it possible to use binary data by first encoding it as UTF-8, as
> >Jansson (in line with mr. Crockford's opinions) allows all code
> >points. And AFAIK there are no invalid code points in the range
> >U+0000...U+00FF anyway, and the binary data would map into this range.
>
> That's what I figured. But are you saying if I pass the data to
> Jansson already encoded it will work? Or no? I'm working with email
> messages, so I'm not quite sure how I should deal with the situation
> where a message contains a NULL character, and would rather not have
> to worry that scenario at all...

Currently, null bytes in strings don't work. The functions that take
strings as input treat a null byte as a string terminator. If you're
decoding a file or buffer (using json_loadf(), json_load_file() or
json_loadb()), a null byte in the input causes a decoding error,
just because null bytes are not allowed. If you're decoding a string
with json_loads(), you get an unexpected EOF error, as the null byte is
treated as the end of the input string.

What I meant to say was that if/when the support for null bytes inside
strings is implemented some day, then you could encode your binary
data as UTF-8 and pass it to a new imaginary function:

json_t *my_string = json_string_from_buffer(buffer, buffer_length);

Supporting null bytes inside strings would also mean that no decoding
function would bail out with an error for a JSON input that contains a
null byte inside a string.

But as of now, if you want to use binary data, you need to encode it
as UTF-8 AND escape all null bytes somehow. The escaping scheme can be
just about anything; for example, the string "ab" could mean a null byte
and "aa" the character 'a'. Implementing such an escaping function is
not hard, but in any case you have to do it in your own code.

When you have an escaping function and a UTF-8 encoder available, you
can store binary data in strings like this:

json_t *create_binary_string(unsigned char *buffer, size_t buffer_length)
{
    /* buffer contains the binary data */

    char *escaped;
    char *utf8;
    json_t *result;

    /* escapes null bytes in buffer, returns a null-terminated string */
    escaped = escape_null_bytes(buffer, buffer_length);

    /* encodes a null-terminated byte string as UTF-8 */
    utf8 = encode_to_utf8(escaped);

    result = json_string(utf8);

    free(escaped);
    free(utf8);

    return result;
}
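For completeness, here is one possible shape of the escape_null_bytes() helper, a hypothetical function (not part of jansson) following exactly the scheme sketched above: 'a' becomes "aa", a null byte becomes "ab", and every other byte passes through unchanged.

```c
#include <stdlib.h>

/* Escape null bytes so the result is a valid C string:
   'a'  -> "aa"
   '\0' -> "ab"
   anything else is copied as-is.
   Returns a malloc'd, NUL-terminated string, or NULL on allocation
   failure. The inverse mapping is unambiguous because every escape
   sequence starts with 'a'. */
static char *escape_null_bytes(const unsigned char *buf, size_t len)
{
    size_t i, n = 0;
    char *out, *p;

    /* first pass: compute the escaped length */
    for (i = 0; i < len; i++)
        n += (buf[i] == 'a' || buf[i] == '\0') ? 2 : 1;

    out = malloc(n + 1);
    if (!out)
        return NULL;

    /* second pass: emit the escaped bytes */
    p = out;
    for (i = 0; i < len; i++) {
        if (buf[i] == 'a') {
            *p++ = 'a'; *p++ = 'a';
        } else if (buf[i] == '\0') {
            *p++ = 'a'; *p++ = 'b';
        } else {
            *p++ = (char) buf[i];
        }
    }
    *p = '\0';
    return out;
}
```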

Petri

Ladar Levison

May 19, 2011, 6:42:59 AM
to jansso...@googlegroups.com
On 5/19/2011 4:30 AM, Petri Lehtinen wrote:
>
> Jansson 2.x is guaranteed to be backwards compatible to the distant
> future for both API and ABI, so you shouldn't have to be concerned
> about compatibility.
>
I've been burned in the past by macros that changed between versions,
and it resulted in a few difficult-to-find bugs. Although most of the
big issues only affect my legacy/production code base. I spent a lot of
time improving/automating how I handle libraries in my dev tree so that
updates would be easier going forward. For example, I now check that a
function signature hasn't changed; a change would require me to manually
update the corresponding symbol in my codebase.

>
> I mean that you can do this:
>
> #include<jansson.h>
> /* declares void json_delete(json_t *json) */
> /* defines json_decref(x) macro that uses json_delete() */
>
> static void my_function()
> {
> void (*json_delete)(json_t *);
> void *handle;
> json_t *value;
>
> handle = dlopen("/path/to/libjansson.so");
> json_delete = dlsym(handle, "json_delete");
>
> /* ... */
>
> json_decref(value);
> }
>
> This works as the function pointer declared in the function's scope
> shadows the global json_delete symbol (that is unresolved if you use
> dlopen()). As json_decref() is just a macro, it expands to code that
> uses json_delete, which in turn uses the json_delete function pointer
> declared in your function.
>
> The other functions than json_delete() may be more problematic,
> though, as they're note used by a macro but a static inline function.
> May I ask why you're using dlopen() in the first place?
>

Using function-scoped variable names would be quite the pain. Instead I
have a function that loads all of the external library symbols and
assigns them to global variables identical to the original function
names, just with a "_d" at the end. I call the loader on startup and in
the event of a HUP signal. I thought you knew of a way to create global
variables identical to the symbols in the header. If I create variables
using identical names I get this error:

./providers/symbols.c:16: error: 'json_delete' redeclared as different
kind of symbol
/home/ladar/Lavabit/sources/jansson/src/jansson.h:92: note: previous
declaration of 'json_delete' was here

As for why I use dlopen()? I have a compatibility layer between my code
and all the external f/oss libraries I use. The layer allows me to swap
libraries out for platform or licensing reasons without breaking the
interface my codebase uses.


>
> Doesn't work, as the body of the functions would still have to be in
> the header file for inlining to be possible. And this is exactly why
> json_object_set() and friends don't work for your use case.
>

See the link below, but I don't believe that is the case. You could
create the inlined function normally, and then declare it using the
inline keyword in the header file. I believe that causes the linker to
replace references to that function with the inlined code. At the very
least, I can say from experience that I was able to write inlined code
this way, but I took it on faith that the resulting binaries were indeed
inlined. See this article for more:

http://gcc.gnu.org/onlinedocs/gcc/Inline.html#Inline

The case against this method is that it would require you to check for
platform/compiler support and disable inlining if those checks failed.
I've seen libraries accomplish that using macros, but never attempted it
myself. Check out the MySQL header my_global.h for the macros
STATIC_INLINE, ALWAYS_INLINE, NEVER_INLINE as possible examples.

>> You could also create two versions of the function. The original
>> traditional public function, and the macro alternatives using the
>> _fast suffix. Internally the library could use the macros, but
>> external programs could go through the traditional interface...
> This would be doable, but I'd rather name the function versions _slow
> and keep the macros for normal use.
>

I'm perfectly happy with this approach. Might I also suggest that if you
go this route, you add a comment to the header files above each macro
letting people like me know there is a function equivalent and what it's
called? And if you're really feeling generous, include an indicator of
some sort (an asterisk or color, perhaps) in the documentation to show
which functions are implemented as macros and which are true functions...

> Currently, null bytes in strings don't work. The functions that take
> strings as input treat a null byte as a string terminator. If you're
> decoding a file or buffer (using json_loadf(), json_load_file() or
> json_loadb()), the null byte in the input causes a decoding error,
> just because null bytes are not allowed. If you're decoding a string
> with json_loads(), you get an unexpected EOF error, as the null byte
> is treated as the end of the input string.
>
> What I meant to say was that if/when the support for null bytes inside
> strings is implemented some day, then you could encode your binary
> data as UTF-8 and pass it to a new imaginary function:
>
> json_t *my_string = json_string_from_buffer(buffer, buffer_length);
>
> Supporting null bytes inside strings would also mean that no decoding
> function would bail out with an error for a JSON input that contains a
> null byte inside a string.

I thought Jansson automatically encoded anything it was passed as valid
UTF-8, aside from the null chars, because they are used for termination?
And then conversely did the decoding during an unpack? But not having much
experience with Unicode, I wasn't quite sure what that specifically
involved. At least for me, I chose to use a library like Jansson
specifically so I wouldn't have to learn.

I've been kicking around the idea of adding a json_binary_t type to
Jansson that would be identical to the json_string_t, except that you
could pass the buffer length to the constructor and could similarly call
a length function on unpacked data. Then add a new symbol like e/E (alas
b was taken) which would require a length be passed as a size_t followed
by the actual data pointer to the pack/unpack functions. I figured the
binary type would encode the data using UTF-8, or possibly UTF-16 BE.

Not knowing what this would involve, or just how I should be encoding
the data has kept me from attempting this so far, although that could
change if my implementation effort requires me to tackle the issue. I'm
also kicking around the idea of base64 encoding binary or untrusted
data, but the ugliness of that solution gives me chills.


Petri Lehtinen

unread,
May 19, 2011, 7:19:16 AM5/19/11
to jansso...@googlegroups.com
Ladar Levison wrote:
> Using function scoped variable names would be quite the pain.
> Instead I have a function that loads all of the external library
> symbols and assigns them to global variables identical to the
> original function name, just with a "_d" at the end. I call the
> loader on startup and in the event of a HUP signal. I thought you
> knew of a way to create global variables identical to the symbols in
> the header. If I create variables using identical names I get this
> error:
>
> ./providers/symbols.c:16: error: ‘json_delete’ redeclared as

> different kind of symbol
> /home/ladar/Lavabit/sources/jansson/src/jansson.h:92: note: previous
> declaration of ‘json_delete’ was here

Yes, I thought you must have something like this.

> As for why I use dlopen()? I have a compatibility layer between my
> code and all the external f/oss libraries I use. The layer allows me
> to swap libraries out for platform or licensing reasons without
> breaking the interface my codebase uses.
> >
> >Doesn't work, as the body of the functions would still have to be in
> >the header file for inlining to be possible. And this is exactly why
> >json_object_set() and friends don't work for your use case.
> >
>
> See the link below, but I don't believe that is the case. You could
> create the inlined function normally, and then decorate it using the
> inline keyword in the header file. I believe that causes the linker
> to replace references to that function with the inlined code. At the
> very least I can say from experience I was able to write inlined
> code this way, but I took it on faith that the resulting binaries
> were indeed inlined. See this article for more:
>
> http://gcc.gnu.org/onlinedocs/gcc/Inline.html#Inline
>
> The case against this method is that it would require you to check
> for platform/compiler support and disable inlining if those checks
> failed. I've seen libraries accomplish that using macros, but never
> attempted it myself. Checkout the MySQL header my_global.h for the
> macros STATIC_INLINE, ALWAYS_INLINE, NEVER_INLINE as possible
> examples.

Ah, sorry. I was under the impression that json_incref() and
json_decref() are actually macros and not functions, but now that I've
looked at jansson.h, I see that they are static inline functions.
Never trust your memory when talking about your own code! :D

So my last example was invalid, the one with a function pointer,
dlsym() and a json_decref() call. It doesn't work that way.

But I still don't think using inline could solve what you are trying
to achieve. If I make the functions inline (as they already are),
they need to be in the header file. And in this case they also need to
be static so that we don't end up with multiple definitions of a
symbol when jansson.h is included in multiple modules. And when they
are static, they won't end up in the shared library file, so you cannot
use them with dlopen().

No, Jansson doesn't automatically do any manipulation. It expects that
all strings you pass to it are valid UTF-8. Note that ASCII is a
subset of UTF-8, so as long as your JSON strings are plain old 7-bit
ASCII, you're safe. But once you need to use all the 8 bits, things
get somewhat complicated.

UTF-8 is a multi-byte encoding. It can encode integers (code points in
Unicode jargon) in the range 0...0x10FFFF. It does this by using one
byte for integers 0...0x7F (the ASCII range), two bytes for integers
0x80...0x7FF, and so on up to 4 bytes. But this encoding scheme has
the consequence that not all byte combinations are valid! For example,
using any of the bytes in the range 0x80...0xBF as the first byte
of a string is invalid UTF-8.

This makes it impossible to use binary data directly in Jansson. But
if you take your binary data and consider all the individual bytes as
integers (code points), and encode it as UTF-8, you get a
representation that passes bytes 0...0x7F through unmodified, and
transforms bytes 0x80...0xFF to two bytes. You can think of this the same
way as base64, but this encoding requires less space and is less
readable :)

And because Jansson itself as a library is limited and doesn't allow
zero bytes, you'd need to incorporate your own encoder to escape the
zero bytes as well.

> I've been kicking around the idea of adding a json_binary_t type to
> Jansson that would be identical to the json_string_t, except that
> you could pass the buffer length to the constructor and could
> similarly call a length function on unpacked data. Then add a new
> symbol like e/E (alas b was taken) which would require a length be
> passed as a size_t followed by the actual data pointer to the
> pack/unpack functions. I figured the binary type would encode the
> data using UTF-8, or possibly UTF-16 BE.

I opened an issue about the zero byte problem on GitHub:

https://github.com/akheron/jansson/issues/26

What comes to an additional type for binary data, that's not going to
happen because that's not JSON. If you want to use JSON and Jansson
for storing or transmitting binary data, you'll have to live with the
fact that JSON is not meant for that, or choose another data format.

Petri

Ladar Levison

unread,
May 19, 2011, 11:16:00 AM5/19/11
to jansso...@googlegroups.com
On 5/19/2011 6:19 AM, Petri Lehtinen wrote:
>
> What comes to an additional type for binary data, that's not going to
> happen because that's not JSON. If you want to use JSON and Jansson
> for storing or transmitting binary data, you'll have to live with the
> fact that JSON is not meant for that, or choose another data format.

The json_binary_t type isn't supposed to be a new JSON data type,
just a new Jansson data type. I was thinking it could be used to
interface with JSON when working with binary data. Its creation would
mean the current json_string_t type/functions wouldn't need to be
modified.


Petri Lehtinen

unread,
May 23, 2011, 3:36:32 PM5/23/11
to jansso...@googlegroups.com
Petri Lehtinen wrote:
> > You could also create two versions of the function. The original
> > traditional public function, and the macro alternatives using the
> > _fast suffix. Internally the library could use the macros, but
> > external programs could go through the traditional interface...
>
> This would be doable, but I'd rather name the function versions _slow
> and keep the macros for normal use.

I decided to not implement this either. Jansson is a shared library,
not a plugin. If you want it to behave like a plugin, you can easily
make a plugin by wrapping the static inline functions as regular
functions yourself.

Petri Lehtinen

unread,
May 23, 2011, 3:40:22 PM5/23/11
to jansso...@googlegroups.com
Petri Lehtinen wrote:

> Ladar Levison wrote:
> The S format sounds like a good idea to me. I agree with Graeme
> that it should be available for array packing and all unpacking, too.
>
> > That said, in looking at the code, the patch may already offer S for
> > the pack methods. And if you add 'S' to the unpack_value_starters
> > string, it might work for unpacking too.

After a while, I've started to think that the "S" format may not be
such a good idea after all. If we allow strings to be NULL, why not
other values too? And isn't this an easy way to create annoying bugs, as
NULL string pointers end up as null in the JSON output instead of
causing an error that's easier to detect?

Petri Lehtinen

unread,
May 29, 2011, 2:34:11 PM5/29/11
to jansso...@googlegroups.com
Basile Starynkevitch wrote:
> > To make it possible to decode many JSON texts from a single input
> > stream has been requested before. At that time, there was no flag
> > parameter to the decoding functions yet, but now as we have it, this
> > would only be a matter of adding a flag.
> >
> > With the flag on, the decoder would stop and return success after
> > successfully decoding an array or object, leaving the file position
> > pointing to the first character after the '}' or ']'.
>
> Yes, such a flag would be very nice.

The flag has now been implemented [1]. It's called JSON_DISABLE_EOF_CHECK
and will be available in the upcoming 2.1 release.

Petri

[1] https://github.com/akheron/jansson/commit/a76ba52f
