oob schemas, re: The Language of the System

Brian Craft

Jan 18, 2014, 3:00:09 PM
to clo...@googlegroups.com
Regarding Rich's talk (http://www.youtube.com/watch?v=ROor6_NGIWU), can anyone explain the points he's trying to make about self-describing and extensible data formats, with the JSON and google examples?

He argues that Google couldn't exist if the web depended on out-of-band schemas. As an example of such a schema he gives a JSON encoding with an out-of-band agreement that field names containing the substring "date" refer to string-encoded dates.

However, this is exactly the sort of thing Google does: it finds dates, and other data types, heuristically, not because the formats of the web are self-describing or extensible.


Jonas

Jan 18, 2014, 4:27:31 PM
to clo...@googlegroups.com
IIRC, in that particular part of the talk he was specifically talking about (non-self-describing) Protocol Buffers, not JSON.

Brian Craft

Jan 18, 2014, 6:08:15 PM
to clo...@googlegroups.com
OK, so consider a different system (besides Google) that handles the JSON example. If it has no prior knowledge of the date field, of what use is it to know that it's a date? What is a situation where a system reading the JSON needs to know a field is a date, but has no idea what the field is for?

Jonah Benton

Jan 19, 2014, 12:03:53 PM
to clo...@googlegroups.com
I read these self-describing, extensible points in the context of EDN, which has a syntax/wire format for some types (maps, strings, etc.) and also has an extensibility syntax:

#myapp/Person {:first "Fred" :last "Mertz"}


These tagged elements are "extensions" because they allow values of types not known to EDN to be included in the stream, and are "self-describing" in two senses:

* if a wire-format reader does know how to create a myapp/Person, that blob of data contains all the information needed to do so
* if a wire-format reader doesn't know how to create a myapp/Person, it can still read past this particular element in the stream, because tags have a defined envelope, so a reader can figure out where the data comprising this element ends (a rough sketch of both cases follows below)
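
For concreteness, here's a rough, untested sketch of both cases with clojure.edn (the handler fns are invented just for illustration):

(require '[clojure.edn :as edn])

;; A reader that knows the tag can build whatever domain value it likes:
(edn/read-string
  {:readers {'myapp/Person (fn [m] (assoc m :type :person))}}
  "#myapp/Person {:first \"Fred\" :last \"Mertz\"}")
;;=> {:first "Fred", :last "Mertz", :type :person}

;; A reader that doesn't know the tag can still read past the element,
;; because the envelope says exactly where its data ends:
(edn/read-string
  {:default (fn [tag value] [:unknown tag value])}
  "[1 #myapp/Person {:first \"Fred\" :last \"Mertz\"} 3]")
;;=> [1 [:unknown myapp/Person {:first "Fred", :last "Mertz"}] 3]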

The JSON example is mostly about the "extensibility" attribute. JSON's format natively supports some types (like strings) but not others (like dates), and for those others it does not include a way to "bucket" or "envelope" the data comprising those unknown types. So JSON is not extensible.
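
For instance (the field name here is just an example), a date has to ride along as a bare string in JSON, while EDN ships the type in-band via the built-in #inst tag:

JSON (the reader has to just "know" this string is a date):
{"modified": "2014-01-18T15:00:09Z"}

EDN (the tag travels with the value):
{:modified #inst "2014-01-18T15:00:09Z"}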

The Google example is mostly about the "self-describing" attribute, and to my mind is more accurately framed as a statement about the Internet as a whole. Hypothetically, if all data exchange occurred using data formats whose details were private arrangements between writers and readers (for instance, if all servers spoke only Protocol Buffers and used a different schema for each client), there would be no Internet at all, much less a Google that, as a third party, is able to broadly read and understand data made available by servers. (Or, to your point, any ability of clients lacking knowledge of the schema to parse something useful from a server's data stream would at best be inferential and heuristic: possible, but infeasible on a large scale.)

With all that said, my read is that Rich bundled those two points together in the JSON date example. JSON doesn't have an extensibility syntax to support dates, but people still have to transmit dates over JSON, so how do they do that? One way is by adopting a "convention", which in some ways is better than an out-of-band schema because, as you say, a convention gives a reader additional information to heuristically interpret the stream, but in other ways is worse because it isn't consistent: some people will want date fields to look like "dateModified", others will want "modifiedDate", and others "modificationDatetime".
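
Here's a rough, untested sketch of what that convention forces on a reader (the key names and the ISO-8601 guess are invented for illustration):

(require '[clojure.instant :as inst])

(defn parse-dates-by-convention [m]
  (into {}
        (for [[k v] m]
          (if (and (string? v) (re-find #"(?i)date" (name k)))
            [k (inst/read-instant-date v)]   ; and hope it's ISO-8601...
            [k v]))))

(parse-dates-by-convention
  {:dateModified "2014-01-19T12:03:53Z"
   :title        "oob schemas"})
;;=> {:dateModified #inst "2014-01-19T12:03:53.000-00:00", :title "oob schemas"}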

So in a broad sense, it is undesirable to use a data format whose extensibility mechanism is missing or not self-describing: a format that lacks extensibility creates a combinatorial explosion of conventions for conveying values the format doesn't know about, and extensions that are not self-describing require out-of-band agreements between readers and writers, which preclude the scalable third-party interoperability that is so important to the Internet.

Hope that helps.



Brian Craft

Jan 19, 2014, 1:48:36 PM
to clo...@googlegroups.com
That helps, thanks. It's still unclear to me that this is important enough to worry about. What application or service is hindered by string encoding a date in JSON? An example would really help. It's not compelling to assert or imagine some hypothetical application that benefits from knowing a field is a date without having any other knowledge of it. I would guess that cases where this matters are vanishingly few.

Matching Socks

Jan 19, 2014, 5:44:59 PM
to clo...@googlegroups.com
Hmm, here's a date field that says "090715".  I wonder what it means... 
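
For instance, depending on which out-of-band convention you assume, those same six characters name three different days (quick, untested interop sketch):

(import 'java.text.SimpleDateFormat)

(for [pattern ["yyMMdd" "MMddyy" "ddMMyy"]]
  (.parse (SimpleDateFormat. pattern) "090715"))
;; yyMMdd -> 2009-07-15, MMddyy -> 2015-09-07, ddMMyy -> 2015-07-09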

Patrick Logan

Jan 19, 2014, 7:36:37 PM
to clo...@googlegroups.com
"finds dates, and other data types, heuristically" -- I'm sure Google would rather not, but that's life on the web.

Google also supports JSON-LD, which is a W3C standard for semi-structured and linked data. JSON-LD defines "in-band" syntax for dates, all XSD data types, and arbitrary data types (including, but not exclusively, those defined at http://schema.org/docs/full.html ).
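
Roughly, a JSON-LD document can declare the coercion in its @context, so the fact that a field is an xsd:dateTime travels with (or is linked from) the document itself. Something like this (field name chosen only for illustration):

{"@context": {"modified": {"@id": "http://schema.org/dateModified",
                           "@type": "http://www.w3.org/2001/XMLSchema#dateTime"}},
 "modified": "2014-01-19T19:36:37Z"}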

Jonah Benton

Jan 19, 2014, 10:41:40 PM
to clo...@googlegroups.com
I hear you; like anything, it's probably a function of context and lifecycle.

Rich, as an uber-architect, is I suspect cognizant of the 100-year-language thought experiment [1]. On that scale, I would agree with his choices. The Y2K problem basically arose because dates were being treated as numbers by lots of old code, rather than as references to a calendar. And back in the day there no doubt were some guys like Mel [2] who would have been clever enough to encode dates as calendar references, and whose code continued to work when the clock turned over. But then there's the rest of us. I think Rich is trying to build the infrastructure so that tools that prevent various of these problems (in concurrency, data flow, data persistence, and so on) lie within relatively easy reach, even if those problems only occur at the margins.

But plenty of production systems also work perfectly fine despite not having made these choices, from those that encode blog posting dates as strings in JSON to the entirety of Google, built on Protocol Buffers. A rule of thumb I've been kicking around has to do with the lifespan of the data in whatever form it's encoded: the longer-lived the data in its encoding, the more it benefits from stronger static type semantics. But it needs a pithier rendering. :)


James Gatannah

Jan 20, 2014, 10:31:48 AM
to clo...@googlegroups.com


On Sunday, January 19, 2014 12:48:36 PM UTC-6, Brian Craft wrote:
That helps, thanks. It's still unclear to me that this is important enough to worry about. What application or service is hindered by string encoding a date in JSON? An example would really help. It's not compelling to assert or imagine some hypothetical application that benefits from knowing a field is a date without having any other knowledge of it. I would guess that cases where this matters are vanishingly few.

This isn't dates, but it's along the same lines:

I'm working on the central hub of a communications distributor/router/thing which has to deal with a wide range of diverse clients that each speak whatever "format" was the most expedient for whoever wrote that piece. The closest thing that we have to a standard is "we mostly want to use JSON."

A huge chunk of our messages include UUIDs. This left me with two real options:
1. Special case every incoming message, based on what I know about the sender, and convert the fields that I know are supposed to represent a UUID (based on extremely informal verbal "specs") as a message is read
2. Take the generic, weakly coupled approach: pass every incoming message through a parser that converts every value string into a UUID (if that conversion is possible), roughly as sketched below.
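
Something like this rough, untested sketch (names invented):

(require '[clojure.walk :as walk])

(def uuid-re
  #"(?i)^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$")

(defn coerce-uuids [message]
  ;; Walk the decoded JSON and coerce any string that looks like a UUID.
  (walk/postwalk
    (fn [x]
      (if (and (string? x) (re-matches uuid-re x))
        (java.util.UUID/fromString x)
        x))
    message))

The obvious weakness is that it's still guessing: any string that merely looks like a UUID gets converted, which is exactly the kind of thing an in-band tag would make unnecessary.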

The pain here could be alleviated with a formalized schema, but we're working too fast and furious for that. We tossed out the last one of those we had almost 2 months ago.

Dates really present the same challenge, but everything that's using those is tied to a relational database, so there, at least, we're forced to stick to something fairly standardized. (Though I *do* have another function for converting those messages, which tends to change about once a week when some other developer decides to change column names... but that's a different story.)

FWIW,
James