After reading through discussions on the IETF JSON group (where there
is talk of maybe writing a minor update of the JSON spec, to clarify
some of the vague areas), I started thinking again about possibly
writing a more "formal" description of Smile, most likely as an RFC.
To help with that, I asked one of the JSON group members for feedback.
He gave good feedback, much of which aligned with what I have heard so far:
- Having a header to distinguish content (first couple of bytes) is Good.
  o But why allow the header to be optional?
- Back-references can reduce size, but add a non-trivial amount of complexity
There were small terminology things to fix (the zigzag description had
"multiplied by one" instead of "multiplied by two", a basic typo; RFCs
use "octets" instead of "bytes"), and also one question that led me to
think of a minor improvement idea:
- Why not use VInt-length-prefixed Strings?
Now: the reason for not doing this was that figuring out the byte
length of a String in UTF-8 requires one to do the encoding first; and
because VInts are variable in length, the size of the prefix is not
known until then, so one may have to move encoded content around. So
except for the special case of short strings (where the length, 6 bits,
is encoded in the type byte), Smile actually uses an end marker instead
of a length, to avoid moving content.
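To make the trade-off concrete, here is a minimal sketch of both
approaches for an encoder that writes straight into its output buffer.
Everything in it is an assumption for illustration: the function names
are hypothetical, the varint is a generic 7-bits-per-byte layout rather
than Smile's exact VInt layout, and 0xFC is assumed as the end marker
(chosen because that byte never occurs in valid UTF-8).

```python
END_MARKER = 0xFC  # assumed end-of-string byte; never appears in valid UTF-8


def encode_varint(n: int) -> bytes:
    """Generic little-endian 7-bit varint; NOT Smile's exact VInt layout."""
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)  # more bytes follow
        else:
            out.append(low)  # last byte
            return bytes(out)


def write_with_end_marker(buf: bytearray, text: str) -> None:
    """Append the UTF-8 bytes, then the marker; nothing is ever moved."""
    buf += text.encode("utf-8")
    buf.append(END_MARKER)


def write_with_length_prefix(buf: bytearray, text: str, reserved: int = 1) -> int:
    """Reserve `reserved` prefix bytes, encode, then patch in the real prefix.

    Returns how many content bytes had to be shifted because the guess
    for the prefix size turned out to be wrong.
    """
    start = len(buf)
    buf += bytes(reserved)  # placeholder for the length prefix
    buf += text.encode("utf-8")
    body_len = len(buf) - start - reserved
    prefix = encode_varint(body_len)
    moved = body_len if len(prefix) != reserved else 0
    # Slice assignment with a different length shifts the encoded content.
    buf[start:start + reserved] = prefix
    return moved
```

With a one-byte reservation, a two-byte string is written in place,
while a 200-byte string needs a two-byte prefix and so its whole body
gets shifted by one position.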
But: while the end marker has its benefits for some encoders, the
length-prefix option definitely has its benefits for the decoder (it
knows the byte length up front and need not scan for the marker).
Thinking about this, I realized that there is nothing preventing us
from adding an alternate String type where a length prefix is used
instead of an end marker.
This would be beneficial for encoders that either happen to have
byte-serialized content available (some languages store Strings as
UTF-8; I think Perl does this?), or where optimizations can avoid the
move, or where moving is required anyway.
And decoding of such Strings is simple enough that added complexity
for this new type should be very small.
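As a rough illustration of how small the decoder side could be, here is
a sketch of reading both kinds of String. As before, this is only an
assumption-laden example: the names are hypothetical, the varint is a
generic 7-bits-per-byte layout rather than Smile's exact VInt layout,
and 0xFC is assumed as the end marker.

```python
END_MARKER = 0xFC  # assumed end-of-string byte; never appears in valid UTF-8


def decode_varint(data: bytes, pos: int) -> tuple[int, int]:
    """Read a generic little-endian 7-bit varint; returns (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        b = data[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):  # high bit clear: this was the last byte
            return result, pos
        shift += 7


def read_length_prefixed(data: bytes, pos: int) -> tuple[str, int]:
    """Length is known up front, so the body is one slice; no scanning."""
    length, pos = decode_varint(data, pos)
    end = pos + length
    return data[pos:end].decode("utf-8"), end


def read_end_marked(data: bytes, pos: int) -> tuple[str, int]:
    """Has to scan forward for the marker before the length is known."""
    end = data.index(END_MARKER, pos)
    return data[pos:end].decode("utf-8"), end + 1
```

Both readers return the decoded String plus the position of the next
value, so a decoder could dispatch to either one based on the type byte.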
What do you think? This change would require bumping the format
version to 1.2, I think. And if we consider it, we should discuss it in
a bit more detail and so on; I just thought I should mention it right
away.
Further, I am thinking that it would make sense to have a GitHub
project for defining the format, separate from the existing Smile codec
projects (https://github.com/pierre/libsmile and
https://github.com/FasterXML/jackson-dataformat-smile).
And if so, this idea can be logged as an issue.
-+ Tatu +-