rson data model

Baptiste Lepilleur

unread,

Mar 12, 2010, 3:01:54 PM3/12/10

to rson-d...@googlegroups.com

Hi,

I just skimmed throught the manual and I have a question:

The introduction says:
"- Dictionary keys do not have to be strings. "

Does that mean that rson allow object's key to be number, null, array, and object ?

As a consequence, this would mean that it may not be possible to convert/serialize an rson document in json format, isn't it?

Baptiste.

Patrick Maupin

unread,

Mar 12, 2010, 4:47:31 PM3/12/10

to rson-d...@googlegroups.com

On Fri, Mar 12, 2010 at 2:01 PM, Baptiste Lepilleur
<baptiste....@gmail.com> wrote:
> Hi,
>
> I just skimmed throught the manual and I have a question:
>
> The introduction says:
> "- Dictionary keys do not have to be strings. "
>
> Does that mean that rson allow object's key to be number, null, array, and
> object ?

Yes. If the key is an array, it will be converted to a tuple. If it is a dict,
it will be converted to a sorted tuple of tuples.

> As a consequence, this would mean that it may not be possible to
> convert/serialize an rson document in json format, isn't it?

It is possible to create a RSON document that is not JSON compatible.
However, the function that creates the dicts can be easily replaced with
a user-specific one to disallow this from happening. Also, I could easily
make an option to disallow this.

I'm trying very hard to separate syntax from semantics, and there is no
reason in the indented mode syntax to disallow this, but, of course, there
may well be valid semantic reasons to do so.

I have made some updates to the manual and some minor updates to the
code (related to '=' handling). I will be uploading 0.03 very shortly.

Thanks,
Pat

Baptiste Lepilleur

unread,

Mar 16, 2010, 5:04:58 AM3/16/10

to rson-discuss

2010/3/12 Patrick Maupin <pma...@gmail.com>

On Fri, Mar 12, 2010 at 2:01 PM, Baptiste Lepilleur

<baptiste....@gmail.com> wrote:

> Hi,
>
> I just skimmed throught the manual and I have a question:
>
> The introduction says:
> "- Dictionary keys do not have to be strings. "
>
> Does that mean that rson allow object's key to be number, null, array, and
> object ?

Yes. If the key is an array, it will be converted to a tuple. If it is a dict,
it will be converted to a sorted tuple of tuples.

IMHO this is a feature that could hurt the adoption of rson as it would no longer be just an alternative json format, but also a distinct data model. Having a distinct data model means that you can not reuse the existing json "eco-system": representation, validation, serialization...

It would also likely be not trivial to implement in language other than python (I certainly wouldn't want to try to hack this into JsonCpp while keeping the API robust and easy to use).

Is there strong use case for this feature? I can not see it as making document easier to read. Manipulating such data would be tricky even in Python (without even getting into nested dict key).

> As a consequence, this would mean that it may not be possible to
> convert/serialize an rson document in json format, isn't it?

It is possible to create a RSON document that is not JSON compatible.
However, the function that creates the dicts can be easily replaced with
a user-specific one to disallow this from happening. Also, I could easily
make an option to disallow this.

I'm trying very hard to separate syntax from semantics, and there is no
reason in the indented mode syntax to disallow this, but, of course, there
may well be valid semantic reasons to do so.

I'm not sure what you mean by separating syntax from semantic. For me the semantic can only be provided by the code that will interpret the data graph once parsed.

From what I've seen of the parser code, it already allows infinite look-ahead as the whole document is tokenized before parsing. While this is certainly very handy when exploring the grammar, one of the likely untold success of JSON is that writing a parser is simple because you only need to know the type of the next token, which can often be obtained just from the next non-space character.

Baptiste.

Patrick Maupin

unread,

Mar 16, 2010, 11:49:56 AM3/16/10

to rson-d...@googlegroups.com

On Tue, Mar 16, 2010 at 4:04 AM, Baptiste Lepilleur
<baptiste....@gmail.com> wrote:
> IMHO this is a feature that could hurt the adoption of rson as it would no
> longer be just an alternative json format, but also a distinct data model.
> Having a distinct data model means that you can not reuse the existing json
> "eco-system": representation, validation, serialization...

When I started the project, I asked for feedback in the Python group,
and obviously went about it completely the wrong way...

So, I just tried to code something flexible that would at least do
JSON. I have options to restrict it to JSON. I also feel that the
ability to have a tuple or dict key is of limited utility in most
cases (and certainly in most configuration file use cases), so I have
no problem with removing that capability.

> It would also likely be not trivial to implement in language other than
> python (I certainly wouldn't want to try to hack this into JsonCpp while
> keeping the API robust and easy to use).

That's interesting feedback.

> Is there strong use case for this feature? I can not see it as making
> document easier to read. Manipulating such data would be tricky even in
> Python (without even getting into nested dict key).

No, I do not believe that there is. But, I don't know if you have
looked at the merge capabilities or not, or how difficult that is for
other systems. I think there is a really good use case for the merge
capability:

deeply:nested:key1
id1 = value1
id2 = value2

deeply:nested:key2
id3 = value3
id4 = value4

creates:

{u'deeply': {u'nested': {u'key2': {u'id4': u'value4', u'id3':
u'value3'}, u'key1': {u'id2': u'value2', u'id1': u'value1'}}}}

>> I'm trying very hard to separate syntax from semantics, and there is no
>> reason in the indented mode syntax to disallow this, but, of course, there
>> may well be valid semantic reasons to do so.
>
> I'm not sure what you mean by separating syntax from semantic. For me the
> semantic can only be provided by the code that will interpret the data graph
> once parsed.

Agreed, except that by replacing parts of the parser, the client code
can actually influence the building of the graph. I do not know if
you had a chance to look at the XML example I built with the parser.
The merging of dict key:value pairs and list item value-only data in
the same structure is something that definitely affects the semantics,
but does not really affect the syntax too much. But I may have
overstated my case. What I am trying to say is that since (until now)
I received very little useful feedback on the project, I tried to make
a very flexible parser/tokenizer that could be reused for similar
grammars. (Reused by subclassing and/or passing specialized parse
functions to loads, not by copying the code and making modifications
to it.) This flexibility is also very useful because it lets me code
up something that runs the bulk of the parser against the simplejson
regression suite.

> From what I've seen of the parser code, it already allows infinite
> look-ahead as the whole document is tokenized before parsing. While this is
> certainly very handy when exploring the grammar, one of the likely untold
> success of JSON is that writing a parser is simple because you only need to
> know the type of the next token, which can often be obtained just from the
> next non-space character.

Yes, as you surmise, this was done for exploratory purposes. Once the
grammar is solidified, this can probably go, and that will make
implementation easier and speed things up a little. However, it is
always interesting to me how well Python's re.split() performs in
comparison to doing iterated matches. The prototype parser is, in my
real-world usage, more than competitive against Python 2.6's built-in
json parser (even with the C tokenizer speedups on json). But, it
falls behind when comparing against current subversion simplejson, and
gets completely blown away by the simplejson C speedups.

So, my current plan is to try to solidify the grammar known as RSON,
but keep some options in the parser for people to tweak it according
to their needs (looking at the simplejson issue tracker, obviously
people often want to do things like use the Decimal class instead of
floats).

I think that a valid RSON document should have a canonical JSON
representation (which implies the restriction on key values you bring
up), but that the RSON grammar should allow the key merging I showed
above. Does that make sense?

Thanks for taking the time to look at this and give feedback.

Best regards,
Pat

Baptiste Lepilleur

unread,

Mar 17, 2010, 4:53:10 AM3/17/10

to rson-d...@googlegroups.com

2010/3/16 Patrick Maupin <pma...@gmail.com>

On Tue, Mar 16, 2010 at 4:04 AM, Baptiste Lepilleur

<baptiste....@gmail.com> wrote:

> IMHO this is a feature that could hurt the adoption of rson as it would no
> longer be just an alternative json format, but also a distinct data model.
> Having a distinct data model means that you can not reuse the existing json
> "eco-system": representation, validation, serialization...

When I started the project, I asked for feedback in the Python group,
and obviously went about it completely the wrong way...

Well, I'm mostly interested in the format definition. Some of your proposals are a given:

optional quoting object keys: according to Douglas Crockford, this is not supported because javascript reserves keyword regardless of the position in the source. If I understood well, the next specification of javascript should fix this.

comment: javascript comments used to be supported (and are still supported by JsonCpp). My guess is that it was removed for increased portability (python & javascript comments are not compatible).

> Is there strong use case for this feature? I can not see it as making
> document easier to read. Manipulating such data would be tricky even in
> Python (without even getting into nested dict key).

No, I do not believe that there is. But, I don't know if you have
looked at the merge capabilities or not, or how difficult that is for
other systems. I think there is a really good use case for the merge
capability:

deeply:nested:key1
id1 = value1
id2 = value2

deeply:nested:key2
id3 = value3
id4 = value4

creates:

{u'deeply': {u'nested': {u'key2': {u'id4': u'value4', u'id3':
u'value3'}, u'key1': {u'id2': u'value2', u'id1': u'value1'}}}}

I would recommend quoting properly indented json, this makes comparison fairer and allow the json document to easily understand the structure of the json document (4 levels nesting as a one liner is a bit much...):

{ "deeply": { "nested": {
    "key2": {
        "id3": "value3",
        "id4": "value4"
    },
    "key1": {
        "id1": "value1",
        "id2": "value2"
    }
} } }

I don't believe merging to be an issue for any json parser, it is basically equivalent to being able to handle:

{
    "deeply" : {"nested": {"key1": 1234 } },
    "deeply" : {"nested": {"key2": 1235 } }
}

and being sure that you don't reset the values associated to "deeply" and "nested" to an empty dictionary the second time you hit them (I'm not sure what the standard say for json concerning this test case).

The "hard" part is more likely to be the parsing itself as it likely requires look-ahead for interpreting the string following the ':'.

>> I'm trying very hard to separate syntax from semantics, and there is no
>> reason in the indented mode syntax to disallow this, but, of course, there
>> may well be valid semantic reasons to do so.
>
> I'm not sure what you mean by separating syntax from semantic. For me the
> semantic can only be provided by the code that will interpret the data graph
> once parsed.

Agreed, except that by replacing parts of the parser, the client code
can actually influence the building of the graph. I do not know if
you had a chance to look at the XML example I built with the parser.
The merging of dict key:value pairs and list item value-only data in
the same structure is something that definitely affects the semantics,
but does not really affect the syntax too much. But I may have
overstated my case. What I am trying to say is that since (until now)
I received very little useful feedback on the project, I tried to make
a very flexible parser/tokenizer that could be reused for similar
grammars. (Reused by subclassing and/or passing specialized parse
functions to loads, not by copying the code and making modifications
to it.) This flexibility is also very useful because it lets me code
up something that runs the bulk of the parser against the simplejson
regression suite.

I haven't looked at the XML example, but I understand now what you mean by syntax vs semantic.

> From what I've seen of the parser code, it already allows infinite
> look-ahead as the whole document is tokenized before parsing. While this is
> certainly very handy when exploring the grammar, one of the likely untold
> success of JSON is that writing a parser is simple because you only need to
> know the type of the next token, which can often be obtained just from the
> next non-space character.

Yes, as you surmise, this was done for exploratory purposes. Once the
grammar is solidified, this can probably go, and that will make
implementation easier and speed things up a little. However, it is
always interesting to me how well Python's re.split() performs in
comparison to doing iterated matches. The prototype parser is, in my
real-world usage, more than competitive against Python 2.6's built-in
json parser (even with the C tokenizer speedups on json). But, it
falls behind when comparing against current subversion simplejson, and
gets completely blown away by the simplejson C speedups.

This is interesting to know.

So, my current plan is to try to solidify the grammar known as RSON,
but keep some options in the parser for people to tweak it according
to their needs (looking at the simplejson issue tracker, obviously
people often want to do things like use the Decimal class instead of
floats).

Sound ok, though from the description above, it seems you may be mixing up two goals:

Providing a more flexible json parser
Providing alternative syntax to represent json document

I'm not sure what are your priority (e.g. releasing an alternate json parser is fairly easy).

I think that a valid RSON document should have a canonical JSON

representation (which implies the restriction on key values you bring
up), but that the RSON grammar should allow the key merging I showed
above. Does that make sense?

Yes, and in fact I would recommend that you provides a properly indented json representation of each rson example in the manual (I have many ideas on how to interpret the example in the "Nested arrays and dicts" section, but they probably don't match yours).

Baptiste.

Patrick Maupin

unread,

Mar 17, 2010, 11:43:54 AM3/17/10

to rson-discuss

On Mar 17, 3:53 am, Baptiste Lepilleur <baptiste.lepill...@gmail.com>
wrote:

> I would recommend quoting properly indented json, this makes comparison
> fairer and allow the json document to easily understand the structure of the
> json document (4 levels nesting as a one liner is a bit much...):

Sorry, that wasn't really JSON -- just the results of running the
example through the parser and printing the resultant string from
Python.

> The "hard" part is more likely to be the parsing itself as it likely
> requires look-ahead for interpreting the string following the ':'.

One thought I had on this (my initial thought, in fact) was to use "/"
for a "between-key" separator. I could go back to that. Visually, I
like it a lot. The only downside is that it reserves an additional
character that cannot be used in unquoted strings, and that is, in
fact, quite heavily used in filenames. But one could argue that a DOS/
Windows filename or a URL might need to be quoted anyway due to the
possible use of a colon, so this may be a weak argument against using
"/" for between key separation.

This is the sort of thing that I would really like feedback on.

> Sound ok, though from the description above, it seems you may be mixing up
> two goals:
>

> - Providing a more flexible json parser
> - Providing alternative syntax to represent json document

>
> I'm not sure what are your priority (e.g. releasing an alternate json parser
> is fairly easy).

The priority is on the definition of the alternate syntax. Having a
more flexible parser is a goal in and of itself, but it really makes
exploring different syntax possibilities much easier. When I first
asked for opinions on the syntax on the Python list (before I started
development), I was accused of peddling "vaporware," so I thought it
would be prudent to develop a parser that allowed for rapid
prototyping.

> Yes, and in fact I would recommend that you provides a properly indented
> json representation of each rson example in the manual (I have many ideas on
> how to interpret the example in the "Nested arrays and dicts" section, but
> they probably don't match yours).

That's a very good point. In my hurried development of the document,
I just provided what the Python interpreter spits out. I can run it
through the simplejson encoder the next time I get a chance to build
the manual and that will make things much clearer, I think.

Best regards,
Pat

Baptiste Lepilleur

unread,

Mar 17, 2010, 6:12:28 PM3/17/10

to rson-discuss

2010/3/17 Patrick Maupin <pma...@gmail.com>

On Mar 17, 3:53 am, Baptiste Lepilleur <baptiste.lepill...@gmail.com>
wrote:

> The "hard" part is more likely to be the parsing itself as it likely
> requires look-ahead for interpreting the string following the ':'.

One thought I had on this (my initial thought, in fact) was to use "/"
for a "between-key" separator. I could go back to that. Visually, I
like it a lot. The only downside is that it reserves an additional
character that cannot be used in unquoted strings, and that is, in
fact, quite heavily used in filenames. But one could argue that a DOS/
Windows filename or a URL might need to be quoted anyway due to the
possible use of a colon, so this may be a weak argument against using
"/" for between key separation.

This is the sort of thing that I would really like feedback on.

I would recommend sticking with the previous syntax. While it is slightly more complex to parse, it only requires one token look-ahead (is the value a string followed by a ':'?). This is a bit more complex than the json grammar, which does not require look-ahead, but we are no longer writting parser in assembly language so we can deal with that.

When you have something in your grammar like C++ that states something along the line of "a statement is a declaration if it can be interpreted as a declaration, otherwise it is an expression", then you know that something is wrong and both implementers and users will be confused. This is not the case here.

On a general basis, I think it is better to avoid introducing too many symbols as their semantics can be hard to grasp(/guess?) for users. I tend to prefer introducing a bit of parser complexity than introducing symbols as a crutch for simplifying parsing.

In this case, it should be possible to come-up with a semantic for symbol ':' that is compatible with all usages. Also, having the symbol '/' available for string values without quoting is handy when you have many small relative paths/urls in a document: it increases significantly the signal/noise ratio. I agree though that for full paths and url this is not an issue as those strings are tend to be much longer, and therefore have a high signal/noise ratio.

For syntax selection, it might be interesting to use an approach similar to the one used by python for selecting the language feature's syntax ("x if cond else y" come to mind):
- list possible syntax options
- apply on some representative examples
- evaluate how this interact with the syntax options of other features

How does this "Windows-registry-style keys" feature interact with plain json? Is the following document legal?

{ "evilness": { "high": {"starter": {
        "cmd1": "cd /",
        "cmd2": "rm -Rf *" } } } }

evilness:high:more subtle
        cmd1 = cd /etc
        cmd2 = rm *pass*

Baptiste.

Patrick Maupin

unread,

Mar 17, 2010, 6:46:28 PM3/17/10

to rson-d...@googlegroups.com

On Wed, Mar 17, 2010 at 5:12 PM, Baptiste Lepilleur
<baptiste....@gmail.com> wrote:

> For syntax selection, it might be interesting to use an approach similar to
> the one used by python for selecting the language feature's syntax ("x if
> cond else y" come to mind):
> - list possible syntax options
> - apply on some representative examples
> - evaluate how this interact with the syntax options of other features

I will have to think about that. Haven't gotten that far yet.

> How does this "Windows-registry-style keys" feature interact with plain
> json? Is the following document legal?
>
> { "evilness": { "high": {"starter": {
>         "cmd1": "cd /",
>         "cmd2": "rm -Rf *" } } } }
>
> evilness:high:more subtle
>         cmd1 = cd /etc
>         cmd2 = rm *pass*
>

With the current parser:

>>> from rson import loads
>>> data = '''
... { "evilness": { "high": {"starter": {
... "cmd1": "cd /",
... "cmd2": "rm -Rf *" } } } }
...
... evilness:high:more subtle
... cmd1 = cd /etc
... cmd2 = rm *pass*
... '''
>>> loads(data)

rson.tokenizer.RSONDecodeError: Cannot mix list elements with dict
(key/value) elements: line 6, column 9, text ':'

What happens is that, as soon as you step into JSON syntax via [] or
{}, you are creating a subelement of the top element, so after the
first JSON section, the parser is convinced it is dealing with a list.
(At that point, it *could* be persuaded that it is dealing with a big
dict key if there were a colon or indentation following all the JSON
stuff, but we already discussed how using non-scalars as keys is of
very marginal utility, so the parser currently disallows that. In any
case, seeing another element at the same indentation level, without
having seen an indentation or a colon, makes up the parser's mind --
it must be dealing with a list. Then when it see the next item has
the colon, it complains.

So, for example, it works if I "unwrap" the top JSON element and leave
the rest of it in JSON syntax:

>>> data = '''
... "evilness": { "high": {"starter": {
... "cmd1": "cd /",
... "cmd2": "rm -Rf *" } } }
... evilness:high:more subtle
... cmd1 = cd /etc
... cmd2 = rm *pass*
... '''
>>> loads(data)
{u'evilness': {u'high': {u'more subtle': {u'cmd1': u'cd /etc',
u'cmd2': u'rm *pass*'}, u'starter': {u'cmd1': u'cd /', u'cmd2': u'rm
-Rf *'}}}}

My current thinking is that the primary utility of mixing JSON and
RSON semantics is, for example, to have things like two-dimensional
arrays:

>>> data = '''
... [1,2,3]
... [4,5,6]
... [7,8,9]
... '''
>>> loads(data)
[[1, 2, 3], [4, 5, 6], [7, 8, 9]]

Regards,
Pat

Reply all

Reply to author

Forward