Transit Grammar Questions

Justin Leitgeb

Aug 3, 2014, 1:08:30 PM

to transit...@googlegroups.com

Hi,

Thanks for publishing the transit format! I've started looking into the Transit spec and the initial implementations, and I'd be interested in trying it on a future project.

I started working on a toy implementation of a transit library in Haskell but I'm confused about certain points in the grammar. Here are some issues that I've noticed:

- Emitters of verbose JSON don't send cached values; however, the reader specs don't indicate what should happen when cached values are encountered in the input stream.
- In concise JSON it's not clear what it would mean if an odd number of elements follows the marker for a hash map represented as an array, but this isn't prohibited by the grammar (I tested in the Ruby implementation and `["^ ", 1]` throws an `ArgumentError`).

In general, given that there are reserved characters that indicate that a given string is a token that has an impact on the subsequent acceptable tokens in the stream, there seem to be JSON inputs that are implicitly invalid according to the spec. Since there isn't a BNF, though, it has been difficult for me to think through an exhaustive list of these cases, and even more difficult to write a parser. Is there a plan to put together a BNF for the format? I'd be happy to help if this is something that is desired.

Thanks again for releasing this format, and for any help that you may be able to provide!

Justin

David Nolen

Aug 3, 2014, 3:10:05 PM

to Justin Leitgeb, transit...@googlegroups.com

On Aug 3, 2014, at 1:08 PM, Justin Leitgeb <jus...@stackbuilders.com> wrote:

> Hi,
>
> Thanks for publishing the transit format! I've started looking into the Transit spec and the initial implementations, and I'd be interested in trying it on a future project.
>
> I started working on a toy implementation of a transit library in Haskell but I'm confused about certain points in the grammar. Here are some issues that I've noticed:
>
> - Emitters of verbose JSON don't send cached values; however, the reader specs don't indicate what should happen when cached values are encountered in the input stream.

Readers should be able to consume both verbose and non-verbose JSON without configuration.

> - In concise JSON it's not clear what it would mean if an odd number of elements follows the marker for a hash map represented as an array, but this isn't prohibited by the grammar (I tested in the Ruby implementation and `["^ ", 1]` throws an `ArgumentError`).

Yes, that's an invalid hash-map-as-array representation.
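
As a rough illustration of that rule, a reader could check the element count before pairing keys with values. This is only a sketch in plain Ruby; `map_from_array` is a hypothetical helper and not part of transit-ruby or any other Transit library.

```ruby
# Hypothetical helper (not a transit-ruby API): turn a parsed
# map-as-array such as ["^ ", "k1", 1, "k2", 2] into a Hash.
def map_from_array(array)
  marker, *rest = array
  raise ArgumentError, "not a map-as-array" unless marker == "^ "
  # Keys and values alternate, so an odd count cannot be paired up.
  raise ArgumentError, "odd number of elements after the map marker" if rest.length.odd?
  rest.each_slice(2).to_h
end
```

With this sketch, `map_from_array(["^ ", 1])` raises an `ArgumentError` instead of silently producing a half-formed map.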

> In general, given that there are reserved characters that indicate that a given string is a token that has an impact on the subsequent acceptable tokens in the stream, there seem to be JSON inputs that are implicitly invalid according to the spec. Since there isn't a BNF, though, it has been difficult for me to think through an exhaustive list of these cases, and even more difficult to write a parser. Is there a plan to put together a BNF for the format? I'd be happy to help if this is something that is desired.
>
> Thanks again for releasing this format, and for any help that you may be able to provide!
>
> Justin

Someone else will have to speak to these points.

At a higher level, it sounds like you are trying to parse Transit JSON directly. Transit is, to some degree, designed with existing performant streaming JSON parsers (like YAJL) or browser JS environments in mind. I see that a yajl binding for Haskell exists: https://hackage.haskell.org/package/yajl :)

David

Justin Leitgeb

Aug 3, 2014, 4:46:36 PM

to transit...@googlegroups.com, jus...@stackbuilders.com

Hi David,

Thanks for your response! It's especially useful to know that a reader is supposed to be able to parse concise and verbose JSON without configuration.

That leads me to my next question - can a single stream contain both verbose and concise JSON and still be considered valid? The following looks weird since it switches from verbose JSON to concise JSON in the middle, but it parses just fine, at least in the Ruby client:

> JSON.parse(File.read('maps_four_char_string_keys.weird.json')) 
=> [{"aaaa"=>1, "bbbb"=>2}, ["^ ", "aaaa", 3, "bbbb", 4], ["^ ", "^0", 5, "^1", 6]]
> Transit::Reader.new(:json, File.open('maps_four_char_string_keys.weird.json')).read
=> [{"aaaa"=>1, "bbbb"=>2}, {"aaaa"=>3, "bbbb"=>4}, {"aaaa"=>5, "bbbb"=>6}]

I understand wanting to avoid the need to configure the reader, but if that's the desire, should the reader be smart enough to detect the algorithm used by the writer and reject documents that are invalid according to the spec? The above seems to be invalid according to at least two rules that I've understood from the document - specifically that concise JSON should not contain JSON objects, and that the keys should always be cached in concise JSON based on their length.

The above document seems problematic to me in that if you're streaming, you can't tell until you ingest more data that the first tuple wasn't written out by a well-behaved writer. I understand the desire to have a client that doesn't require configuration, but I think I'd prefer a token to indicate which format is being written. It seems that this could potentially lead to more performant decoding (although this isn't really a concern of mine with the format), as well as earlier detection of mistaken assumptions about the format that is being sent by the other side.

Based on my understanding of Transit so far, it seems like it's really intended to allow for clearer communication about data types between applications, so it feels like the type of the stream itself should be specified as well. At the very least, though, I'd like to know if heterogeneous documents of concise and verbose JSON are acceptable, or if there should be auto-detection by the readers to allow for rejection of invalid data.

Best,

Justin

David Nolen

Aug 3, 2014, 5:08:37 PM

to Justin Leitgeb, transit...@googlegroups.com

On Aug 3, 2014, at 4:46 PM, Justin Leitgeb <jus...@stackbuilders.com> wrote:

> > JSON.parse(File.read('maps_four_char_string_keys.weird.json'))
> => [{"aaaa"=>1, "bbbb"=>2}, ["^ ", "aaaa", 3, "bbbb", 4], ["^ ", "^0", 5, "^1", 6]]
> > Transit::Reader.new(:json, File.open('maps_four_char_string_keys.weird.json')).read
> => [{"aaaa"=>1, "bbbb"=>2}, {"aaaa"=>3, "bbbb"=>4}, {"aaaa"=>5, "bbbb"=>6}]
>
> I understand wanting to avoid the need to configure the reader, but if that's the desire, should the reader be smart enough to detect the algorithm used by the writer and reject documents that are invalid according to the spec? The above seems to be invalid according to at least two rules that I've understood from the document - specifically that concise JSON should not contain JSON objects, and that the keys should always be cached in concise JSON based on their length.

While it's strange to mix verbose and non-verbose JSON encodings like this, I don't actually see a problem with it, and I believe all existing Transit implementations from Cognitect support it just fine. Needing to specify the format on the wire is problematic when you want to consume an existing JSON endpoint, something that mostly works today outside of edge cases, and we might account for those in the future.

David

Rich Hickey

Aug 3, 2014, 5:44:13 PM

to transit...@googlegroups.com

What you are calling 'concise' and 'verbose' are not separate formats; they are just write modes.

There is only one format: transit-JSON.

Each write mode writes a subset as specified, but a reader *must* accept any transit-JSON supported construct at any time. All of the current readers can and do.
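
To make the "one format, several write modes" point concrete, here is a minimal sketch of a reader that accepts JSON objects, map-as-array forms, and cache codes within the same document. It uses only Ruby's standard `json` library; `read_transit_json`, `decode`, `read_key`, and `cache_index` are hypothetical names, not transit-ruby APIs. It assumes, as the spec appears to describe, that map keys longer than three characters are cacheable and that a cache code is `^` followed by base-44 digits starting at character code 48. It also assumes that keys seen in plain JSON objects enter the cache too, and it ignores tagged values, escapes, and cache wraparound.

```ruby
require "json"

MAP_AS_ARRAY = "^ "

# Assumption: cache codes are "^" plus base-44 digits, where each digit
# is (character code - 48). "^0" is index 0, "^1" is index 1, and so on.
def cache_index(code)
  code[1..-1].bytes.reduce(0) { |acc, byte| acc * 44 + (byte - 48) }
end

# Resolve a map key: either look up a cache code or remember a new key.
def read_key(key, cache)
  return key unless key.is_a?(String)
  if key.start_with?("^") && key != MAP_AS_ARRAY
    cache.fetch(cache_index(key)) # raises IndexError for an unknown code
  else
    cache << key if key.length > 3 # assumption: only keys over 3 chars are cached
    key
  end
end

# Walk the parsed JSON, accepting either map representation at any point.
def decode(node, cache)
  case node
  when Hash
    node.each_with_object({}) do |(k, v), out|
      key = read_key(k, cache)
      out[key] = decode(v, cache)
    end
  when Array
    return node.map { |item| decode(item, cache) } unless node.first == MAP_AS_ARRAY
    pairs = node.drop(1)
    raise ArgumentError, "odd number of elements after the map marker" if pairs.length.odd?
    pairs.each_slice(2).each_with_object({}) do |(k, v), out|
      key = read_key(k, cache)
      out[key] = decode(v, cache)
    end
  else
    node
  end
end

def read_transit_json(text)
  decode(JSON.parse(text), [])
end

doc = '[{"aaaa":1,"bbbb":2},["^ ","aaaa",3,"bbbb",4],["^ ","^0",5,"^1",6]]'
p read_transit_json(doc)
```

Fed the mixed document from earlier in the thread, this sketch yields the same three maps that the Ruby reader returned, without needing to know which write mode produced each element.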

Justin Leitgeb

Aug 4, 2014, 2:19:23 PM

to transit...@googlegroups.com

Thanks for the clarification, Rich and David.

Rich, you referred to the format as "transit-JSON," but I assume you just mean "transit" to allow for the inclusion of MessagePack as a write mode as well as the two JSON modes. I verified that at least the Ruby library supports constructs like compression/caching on the reader side for MessagePack, and based on your explanation this makes sense to me.

I've made a suggestion on how to improve the docs in a pull request to the transit-format repo. In one case it looks like we previously referred to the write modes as formats, which originally confused me, so I clarified the wording in that section. I also took Rich's description of how we should understand the write modes and inserted it almost verbatim (most substantially, I added MessagePack, in addition to JSON and JSON-Verbose, as a write mode) at a point where I think it made sense.

I think that a change along these lines would help with others' understanding of the format, but I'd be just as happy if you close the PR and make the change in a way that you find clearer.