Binary vs text protocol for distributed processing

Generic Usenet Account
Sep 15, 2010, 8:50:06 AM
Greetings,

Can someone kindly point me to some good links devoted to the text vs
binary protocol debate for distributed processing? I work in a
standards body that is currently debating this issue for M2M
(machine-to-machine) ecosystems. The question is whether
resource-constrained devices should support a binary protocol or an
XML-free text protocol. Any personal insight from members would also
be welcome.

Regards,
KP Bhat

Pascal J. Bourguignon
Sep 15, 2010, 9:58:19 AM

Depends on the kind of resource that is constrained. If it's
programmer time that is the constraint, then a textual representation
might be more efficient. If it's the computing power, then a binary
representation might be processed faster.

In my experience, I find on one hand that textual representations (not
XML) may be shorter than binary (e.g. some real data sets with a lot
of numbers take less space encoded as text than in binary, because as
text the size of a number is proportional to its number of significant
digits, while in binary it is of fixed size). On the other hand,
communication protocols may need non-context-free data that is more
easily generated by machine, so there a textual representation holds
little advantage over a binary one.
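
To make that concrete (an illustration added here, not from the
original post): the value 7 costs one byte as text but four as a fixed
int32, while 1000000000 costs ten bytes as text and still four in
binary; which side wins depends on the distribution of your values.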

Otherwise, in a 'machine-to-machine' "ecosystem", I would consider
self-describing data important, meaning that the type of the data
elements should not be implicit but explicit. This is orthogonal to
the binary/textual representation question, but notice that in textual
representations you often have more syntactic information about the
type of the data than in binary formats.

1234 --> an integer
"1234" --> a string
12.34 --> a floating point number.

On the other hand, in binary:

d2 04 00 00 --> ?
31 32 33 34 --> ?
a4 70 45 41 --> ?

(But you could tag binary data with types; the tagging just has to be
explicit.)
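
A minimal sketch of such explicit tagging in C (added here for
illustration; the tag values are invented, and a real protocol would
also pin down byte order):

#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* invented tag values, for illustration only */
enum { TAG_INT32 = 0x01, TAG_STR = 0x02, TAG_FLOAT32 = 0x03 };

/* write one explicitly typed integer: 1 tag byte + 4 payload bytes */
size_t put_tagged_int(uint8_t *out, int32_t v)
{
    out[0] = TAG_INT32;
    memcpy(out + 1, &v, sizeof v);  /* host byte order here; a real
                                       protocol would fix endianness */
    return 1 + sizeof v;
}

With this, the d2 04 00 00 above (1234 as a little-endian int32) would
travel as 01 d2 04 00 00 and no longer be ambiguous.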

--
__Pascal Bourguignon__ http://www.informatimago.com/

Mark Storkamp
Sep 15, 2010, 10:24:35 AM
In article <87aanj8...@kuiper.lan.informatimago.com>,

p...@informatimago.com (Pascal J. Bourguignon) wrote:


> In my experience, I find on one hand that textual representations (not
> XML) may be shorter than binary (eg. some real data set with a lot of
> numbers takes less space when encoded in a textual representation than
> in binary, because as text, the size of a number is proportional to
> the number of significant digits, while in binary it is of fixed
> size).

Take a look at how MIDI implements time. It uses a variable-length
binary number in which the MSB of each byte indicates whether a
further 7 bits follow to be concatenated. The resulting number is a
count of ticks, the tick duration having been defined earlier in the
header.
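
A sketch of that encoding in C (added here, not from the original
post; MIDI emits the most significant 7-bit group first, with the high
bit set on every byte except the last):

#include <stdint.h>
#include <stddef.h>

/* encode value as a MIDI-style variable-length quantity;
   returns the number of bytes written (1..5 for 32-bit input) */
size_t vlq_encode(uint32_t value, uint8_t out[5])
{
    uint8_t tmp[5];
    size_t n = 0;
    do {                          /* peel off 7-bit groups, low first */
        tmp[n++] = value & 0x7f;
        value >>= 7;
    } while (value);
    for (size_t i = 0; i < n; i++) {
        uint8_t b = tmp[n - 1 - i];            /* emit high group first */
        out[i] = (i + 1 < n) ? (b | 0x80) : b; /* 0x80 = more follows */
    }
    return n;
}

For example, 127 encodes as the single byte 7f, while 128 encodes as
81 00.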

>
> Otherwise, in a 'machine-to-machine' "ecosystem", I would deem
> important self describing data, which means that the type of the data
> elements should not be implicit, but explicit. This is orthogonal to
> the binary/textual representation question, but notice that in textual
> representations, you often have more syntactical information about the
> type of data than in binary formats.
>
> 1234 --> an integer
> "1234" --> a string
> 12.34 --> a floating point number.
>
> On the other hand, in binary:
>
> d2 04 00 00 --> ?
> 31 32 33 34 --> ?
> a4 70 45 41 --> ?
>
> (But you could tag binary data with types, only it's has to be
> explicit).

You've also explicitly tagged your textual representations. You tagged
the string with " and the floating point number with .

Personally, I think it's more a function of the language I'm writing
in. In assembly language, a textual representation would likely be
more work than a binary one; in a high-level language it's likely
easier to use text.

BGB / cr88192
Sep 15, 2010, 1:52:16 PM

"Generic Usenet Account" <use...@sta.samsung.com> wrote in message
news:779f2be0-180c-4196...@c21g2000vba.googlegroups.com...

text is often simpler and more flexible than binary, and can in most
cases be easily extended (depending on the design of the format).
keeping a binary format flexible is a good deal more work, and when one
does so, it is often no longer simple. text formats can also be easily
compacted if needed via Deflate compression or similar.
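
for instance, with zlib's one-shot API (a sketch added here, assuming
zlib is available; error handling is minimal):

#include <zlib.h>
#include <stdlib.h>

/* deflate-compress a text buffer; returns a malloc'd buffer and
   stores the compressed size in *out_len, or NULL on failure */
unsigned char *pack_text(const char *text, size_t len, size_t *out_len)
{
    uLongf bound = compressBound(len);
    unsigned char *out = malloc(bound);
    if (out && compress(out, &bound,
                        (const unsigned char *)text, len) == Z_OK) {
        *out_len = bound;
        return out;
    }
    free(out);
    return NULL;
}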

as for generalized self-describing binary formats, I have tried them,
but the attempts turned out unworkable.
the best luck I have had here is to take a high-level model (say, XML,
or a simplified LISP-like S-Expression based system), and then create a
serialization specifically for it.

depending on the specifics, a binary-serialized high-level model can be
faster (WRT processing time) than an equivalent textual format, but is
usually similar to a compressed textual format WRT size.

this assumes variable-length encodings, or possibly entropy coding,
since as noted by another poster, fixed-width integers/... in
structures can often be bigger (on average) than their (uncompressed)
textual equivalents. similarly, IME, structural data often doesn't
compress as well (via Deflate or similar) as either variable-length
encodings (bytewise only) or textual data.


fixed-form binary data and schemas are something I personally find
nasty, but some people seem to like them (these were used in RPC and
CORBA and similar).

Google uses a variable-length, schema-driven format along these lines
(Protocol Buffers). however, admittedly, I still personally dislike
schemas, but oh well...


or such...


Ian Collins
Sep 15, 2010, 5:47:13 PM

A lot depends on how constrained the machine is. If you use XML, it
also depends on the complexity of the XML dialect (for example,
compare XML-RPC to SOAP).

I used to use binary protocols in distributed systems, then I switched
to XML-RPC because it was:

Easier to monitor.
Easier to debug.
Easier to extend.

The price was verbosity.

Now I favour JSON, which is widely used in web-based applications (a
common M2M ecosystem). JSON offers all the benefits of an XML-based
text protocol with some significant advantages:

Less verbose.
Easier to generate and parse.
Easier to manipulate in code.

So if you want the convenience of a text-based protocol on
resource-constrained devices, consider JSON.
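
To give a feel for the difference in verbosity (an illustration added
here, not from the original post), the same call in XML-RPC and in
JSON:

<methodCall>
  <methodName>report</methodName>
  <params>
    <param><value><double>21.5</double></value></param>
  </params>
</methodCall>

versus:

{"method": "report", "params": [21.5]}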

--
Ian Collins

Pascal J. Bourguignon
Sep 15, 2010, 5:47:51 PM
Ian Collins <ian-...@hotmail.com> writes:

Or Lisp SEXPs. See for example:
http://people.csail.mit.edu/rivest/sexp.html
(He's the Rivest of RSA).

Ian Collins
Sep 15, 2010, 6:24:38 PM
On 09/16/10 09:47 AM, Pascal J. Bourguignon wrote:

> Ian Collins<ian-...@hotmail.com> writes:
>>
>> So if you want the convenience of a text based protocol on resource
>> constrained devices, consider JSON.
>
> Or Lisp SEXPS. See for example:
> http://people.csail.mit.edu/rivest/sexp.html

I see it sets out to do much the same as JSON and the resulting
expressions are similar in complexity and appearance.

JSON probably has the advantage of wider support: expressions are
native to JavaScript (obviously!) and support is built into common
scripting languages (such as PHP).

It is interesting to see how requirements from two different languages
end up with such similar results.

--
Ian Collins

BGB / cr88192
Sep 16, 2010, 2:20:50 AM

"Pascal J. Bourguignon" <p...@informatimago.com> wrote in message
news:87y6b27...@kuiper.lan.informatimago.com...

I am not as fond of this system: it sort of resembles S-Exps, and uses
the same name, but it seems to be a very different entity in some ways.

for example, it loses the aspect of being a direct transcription of a
LISP-style typesystem, ...

IMO, the typesystem is a much bigger part of the entity than having a
"similar" notation (the addition of size-prefixes, ... is very unlike
the original notation).


so, in my definition of the term:
lists are made out of CONS-cells, and the usual dot-equivalence exists.

(a 1 2 3)
is the same as:
(a . (1 . (2 . (3 . ()))))

and 'a' is a symbol, which is separate from a string, and '1' is a fixnum.

in my uses, I often include keywords, which usually support either the
syntax ':foo' or 'foo:', where the latter form is mostly a personal
innovation that exists for aesthetic reasons (IMO, 'foo:' looks nicer
than ':foo') and does not exist in most Scheme or Common Lisp
implementations.

back when I had my own Scheme implementation, IIRC the two forms had
slightly different semantics, but most of my later uses have treated
them as equivalent.


I use XML a fair amount as well.
a binary XML coding is used for some things, and this seems reasonable
enough (it saves space, and is faster to re-parse than text XML...).


in my case, JSON isn't used as much (in itself) since, although I use
BGBScript (personal/pet scripting language) which has basically the same
syntax as JS, it would be a little wasteful to use the BS parser/interpreter
to parse data (slower, more garbage, ...). granted, a mini parser/printer
dedicated to JSON could make sense (although, in my implementation, there is
some overlap with my S-Exp printer/parser, which handles most of the same
syntax as both).

in this case, the real difference between S-Exps and JSON is the types used,
and the notation allows some hybridization, although BS/JS code fragments
will not parse in the S-Exp parser.

obj={a: (b 1 2 3) c: #(4 5 6) d: [7, 8, 9] }

typeof(obj.a) => "_cons_t"
typeof(obj.c) => "_array_t" ("array")
typeof(obj.d) => "_array_t" ("array")
typeof(car(obj.a)) => "_symbol_t" ("name" or "symbol"?)
typeof(cadr(obj.a)) => "_fixint_t" ("number", note: "_fixint_t"==fixnum)

yeah, the exact names returned by 'typeof' are sort of a currently
unresolved issue between BS and JS (I have not yet thought up an
optimal solution to this one, among other issues WRT standards
conformance: numerical precision, some differences in object
semantics, ...). but, good enough for my uses...

but, yeah, one would have to choose a specific type subset and notation
(either a Lisp-like subset or a JS-like subset, say, for S-Exps or
JSON), never mind the whole matter of "byte[]" or "int[]" arrays, ...
(BS also has these, but lacks built-in syntax support, so things like
"new byte[32]", ... will not work at present, leaving me to call into C
land to create them...).


but, yeah, JSON seems like a sane option as well IMO...

or such...


Pascal J. Bourguignon
Sep 16, 2010, 2:42:34 AM
"BGB / cr88192" <cr8...@hotmail.com> writes:

> "Pascal J. Bourguignon" <p...@informatimago.com> wrote in message
> news:87y6b27...@kuiper.lan.informatimago.com...

>> Or Lisp SEXPS. See for example:
>> http://people.csail.mit.edu/rivest/sexp.html
>> (He's the Rivest of RSA).
>>
>
> I am not as fond of this system, as it sort of resembles S-Exps, and uses
> the same name, but it seems to be a very different entity in some ways.
>
> for example, it loses the aspect of being a direct transcription of a
> LISP-style typesystem, ...
>
> IMO, the typesystem is a much bigger aspect of the entity than that of
> having a "similar" notation (the addition of size-prefixes, ... is very
> unlike the original notation).
>

> [...]


>
> in this case, the real difference between S-Exps and JSON is the types used,
> and the notation allows some hybridization, although BS/JS code fragments
> will not parse in the S-Exp parser.
>
> obj={a: (b 1 2 3) c: #(4 5 6) d: [7, 8, 9] }
>
> typeof(obj.a) => "_cons_t"
> typeof(obj.c) => "_array_t" ("array")
> typeof(obj.d) => "_array_t" ("array")
> typeof(car(obj.a)) => "_symbol_t" ("name" or "symbol"?)
> typeof(cadr(obj.a)) => "_fixint_t" ("number", note: "_fixint_t"==fixnum)
>
> yeah, the exact names returned by 'typeof' are sort of a current unresolved
> issue between BS and JS (not yet thought up an optimal solution to this one,
> among other issues WRT standards conformance: numerical precision issues,
> some differences in object semantics, ...). but, good enough for my uses...
>
> but, yeah, one would have to choose a specific type subset and notation
> (either a Lisp-like subset, or a JS-like subset, say for S-Exps or JSON),
> nevermind the whole matter of "byte[]" or "int[]" arrays, ... (BS also has
> these, it lacks built in syntax support so things like "new byte[32]", ...
> will not work at present, leaving me to call into C land to create them...).
>
>
> but, yeah, JSON seems like a sane option as well IMO...

Well, that's what I don't like in JSON (which I tried to use for
Javascript <-> Common Lisp communications): round-trips don't preserve
the types. There are mappings of vectors to lists, and hash-tables to
a-lists or something; I don't remember the details.

You are correct about Rivest's Sexp; it would have a similar problem.

That said, implementing a Lisp reader (with only a fixed set (or
subset) of reader macro characters defined by the standard) is rather
simple (and wouldn't be harder than implementing Rivest's Sexp).

BGB / cr88192
Sep 16, 2010, 1:14:25 PM

"Pascal J. Bourguignon" <p...@informatimago.com> wrote in message
news:87tylq6...@kuiper.lan.informatimago.com...

yeah.


I am not actually using CL in my case, but rather something more funky:
parts of a Lisp-like typesystem built on top of C (adds GC and dynamic
types, ...);
the current BGBScript VM half-way implements Scheme, and then builds BS on
top of it;
add a C compiler to the mix, and partially implemented logic for a custom
Java and C# implementation (they are close enough that I am mostly treating
them as "conjoined siblings").

and so, the code is large and complex...
and, it can "eval" C fragments, but this is troublesome IME.

implementing a single language is not difficult, but implementing several of
them and shoving them all through a common architecture is awkward.

note: none of this has anything to do with .NET; rather, it is all a
custom implementation (implemented mostly in C and x86/x64 ASM).

so, there are a fair number of options for data types at least (one
actually needs a common superset of everything involved...).


admittedly, none of this helps with the OP's issue...

sadly, which choice is best depends somewhat on the underlying
implementation.
S-Exps are best with a Lisp-like typesystem available (or at least an
analogue thereof);
JSON is best with a JS-like typesystem (or at least an analogue
thereof);
XML is (IMO) best with something DOM-like;
...

or such...


jacko
Sep 17, 2010, 9:27:10 AM
When the evolution of the constrained system is taken into account,
source-level code seems best, and that is best represented as text.
<XML> is not the best delimiter between objects: it is longer than
necessary and makes the parser more complicated. Consider using ASCII
16-31 as delimiters, or use [ and ] with an escape character such as \.
Using a single separator character such as ',' with "" standing for a
literal " (as in CSV) may be basic, but it makes parsing nested
structures more complicated.
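
For what it's worth (a gloss added here, not from the original post):
four of those controls are defined as nesting separators, FS (28),
GS (29), RS (30) and US (31), so two records could be sent as

Julia US 22 US 212-555-1122 RS Maria US 23 US 212-555-1133

(the separator names standing in for the raw control bytes), with no
quoting or escaping needed as long as the payload never contains those
bytes.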

The concession to binary should be in using hexadecimal throughout the
script.

Cheers Jacko

jacko
Sep 17, 2010, 9:35:03 AM
UTF-8 C0 and C1 codes are also a possibility, but may lead to parse
problems in the UTF-8 decoder, as would unprefixed 01xxxxxx codes.
Contact Oracle and ask for a reserved range within the unused
high-value Java bytecodes. This would allow Java ME embedding easily
within the stream...

Pascal J. Bourguignon
Sep 17, 2010, 11:29:59 AM
jacko <jacko...@gmail.com> writes:

> When the evolution of the constrained system is taken into account, a
> source level code seems best. This is best represented as text. <XML>
> is not the best delimiter between objects. It's long beyond necessity,
> and makes the parser more complicated. Consider using ASCII 16-31 as
> delimiters.

What about 32? I like 32. And 10.


> Or use [ and ] and have an escape character such as \.

What have you got against ( and )? They're nice too.

> Using a single character such as , and using "" as a single " may be
> basic, but parsing nested structures is made more complicated.

Agreed.


> The concession to binary should be in using hexadecimal throughout the
> script.

So let's wrap it up:

A record with three numbers:

(1 2 3)

A record containing two sub-records with each a name (string), an age
(number) and a phone number (string):

(("Julia" 22 "212-555-1122")
("Maria" 23 "212-555-1133"))


;-)

Les Cargill
Sep 17, 2010, 9:24:42 PM


Why not do both? It could even be negotiated: the most constrained
device can plead poverty. It's a small matter to have a table that
maps strings onto numbers, loadable into the stack of the "wealthier"
devices. The numbers can even be delimited vectors, much as the
strings can be delimited.
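
A sketch of such a negotiated table in C (added here for illustration;
the token list is invented):

#include <stdint.h>
#include <string.h>

/* dictionary agreed at handshake: rich peers may send the string,
   constrained peers send the one-byte index */
static const char *dict[] = { "get", "set", "report", "ack" };

uint8_t encode_op(const char *op)     /* 0xff = not in the table */
{
    for (uint8_t i = 0; i < sizeof dict / sizeof dict[0]; i++)
        if (strcmp(dict[i], op) == 0)
            return i;
    return 0xff;
}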

I suppose SNMP is widely despised these days, but there was an
extension to Tcl called Scotty that made SNMP very easy to deal with.
What made this truly interesting was the ability to introspect MIB
objects. You could pretty much write Tcl to write SNMP agents,
except for the most fiddly bits. By naming entry points carefully, you
could generate most of the thing.

You could also use "MIB compilers" to generate the skeletons of agents.

You used the word "ecosystem", and I think a tool that fills the role
Scotty did for SNMP is vital. This also supports the "implementations
trump abstraction" principle. That principle may not apply in the
case of a standards body; I'm not sure here. Once upon a time, it
did...

Others have mentioned JSON. That's another interesting choice; I'd
have to look hard at implementing it on, say, a PIC.

--
Les Cargill

Generic Usenet Account
Sep 28, 2010, 4:48:07 PM
On Sep 15, 7:50 am, Generic Usenet Account <use...@sta.samsung.com>
wrote:

Greetings,

We are continuing our discussion on the "bloating" that occurs with
text representation of information, as opposed to binary
representation. Can anyone point to some studies that have been done
on this topic? In my opinion, the bloat is about 150%. Here's how I
arrived at this figure; I am not sure if the analysis is correct.

Assuming uniform distribution of possible values across bytes (i.e. 0
to 255), we find that there are 10 single digit values, 90 double
digit values and 156 triple digit values. For single digit values,
the size of the text and binary representation is the same. For
double digit values, the size of the text representation is double
that of the binary representation. For triple digit values, the size
of the text representation is triple that of the binary
representation.

So the average number of text bytes needed to represent one byte of
binary data is:
((10*1)+(90*2)+(156*3))/256 = 2.5703125
i.e. roughly 157% bloat before any rounding.

There might be some savings if we are dealing with multi-byte entities
(short, long, long long, etc.) in which the most significant nibbles
are zeros, so I am rounding the bloat down to around 150%.

Thanks,
KP Bhat

Pascal J. Bourguignon
Sep 28, 2010, 5:13:32 PM
Generic Usenet Account <use...@sta.samsung.com> writes:

> We are continuing our discussion on the "bloating" that occurs with
> text representation of information, as opposed to binary
> representation. Can anyone point to some studies that have been done
> on this topic? In my opinion, the bloat is about 150%. Here’s how I
> arrived at this figure. I am not sure if this analysis is correct.
>
> Assuming uniform distribution of possible values across bytes (i.e. 0
> to 255), we find that there are 10 single digit values, 90 double
> digit values and 156 triple digit values. For single digit values,
> the size of the text and binary representation is the same. For
> double digit values, the size of the text representation is double
> that of the binary representation. For triple digit values, the size
> of the text representation is triple that of the binary
> representation.
>
> So the average size of one byte of binary data in textual
> representation is:
> ((10*1)+(90*2)+(156*3))/256 = 2.5703125
>
> There might be some savings if we are dealing with multi-byte entities
> (short, long, long long etc.) in which the most significant nibbles
> are zeros. So I am rounding the bloat to around 150%.

Well, you can always design a specific representation that's more
economical in space than a generic representation.

My data point concerned a map of the roads of a small country. IIRC,
the coordinates were given as integers, in meters (with 1 m
resolution). So basically the file was a list of polylines, each
polyline being a list of points (2d IIRC, so 2 integers).

The natural representation, in binary would be:

typedef struct { int x; int y; } Point;
typedef struct { int length; Point points[0]; } Polyline; /* length = point count */
typedef struct { int size; Polyline polylines[0]; } File; /* size = polyline count */

The equivalent structure written in ASCII, with spaces separating the
coordinates within a polyline and newlines separating polylines, was
shorter.
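
So the text file was essentially lines like this (made-up coordinates,
for illustration), one polyline per line, two integers per point:

503214 185332 503280 185401 503355 185474
504002 186110 504099 186180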

The point is that data is not random. Coordinates of road segments in
a small country are not evenly distributed between 0 m and 4e9 m. The
maximum 2d coordinate at 1 m resolution on the surface of the Earth is
20,000,000 m, so 25 bits would have been enough; roughly one byte was
wasted in each 32-bit number... For real data, you may also take
Benford's Law into consideration.


Notice also that text compression algorithms do a very good job of
extracting superfluous information from a text stream. In general,
you won't be able to invent a better scheme, size-wise, than just
writing your data in plain text and having it compressed by bzip2.

Ian Collins
Sep 28, 2010, 5:34:50 PM

I think your initial assumption is flawed, at least for the data I see
and use.

Based on my reference set, small numeric values predominate. Unless
the data maps to bounded types with a small range, it is also more
common to use a "natural" type (such as int) to represent those values
in data structures. Thus the majority of numeric data would require 4
bytes to transmit as binary, but only one or two bytes as text.

If minimising the packet size on the wire is a priority, the data can
be compressed.

--
Ian Collins
