[erlang-questions] sortable serialization format

9 views
Skip to first unread message

Ulf Wiger

unread,
Oct 25, 2009, 12:28:22 PM10/25/09
to erlang-questions Questions

A while ago I started hacking on a serialization format that
would have the same sorting properties as Erlang terms.

I didn't quite get it to work (negative floats was the most
difficult part), but when I returned to it today, I realized
that it was only a very small problem. Once fixed, all my
QuickCheck suites passed.

http://svn.ulf.wiger.net/sext/trunk/sext/doc/index.html

In many cases this format is slightly more compact than
the External Term Format. The most notable exception is
negative bignums, but large binaries also take up a little
bit more space (asymptotically approaching 12.5% overhead
for binaries larger than 36 bytes).

(Strings are much bigger, since I don't try to detect that
they are special cases of lists. This should perhaps be fixed.)

You are welcome to give it a spin and report to me if you
find good uses for it. The original idea was to make it
easier to make use of external storage solutions with
ordered set semantics, without foregoing the genericity and
convenience of generic term serialization.

BR,
Ulf W

PS Float comparisons are handled in the QuickCheck suite by
coercing to IEEE 764 binary64 format. This is supposedly what
Erlang uses internally, but I encountered lots of weird behaviour
before doing this coercion.

--
Ulf Wiger
CTO, Erlang Training & Consulting Ltd
http://www.erlang-consulting.com

________________________________________________________________
erlang-questions mailing list. See http://www.erlang.org/faq.html
erlang-questions (at) erlang.org

Ulf Wiger

unread,
Oct 25, 2009, 6:26:12 PM10/25/09
to Tony Finch, erlang-questions Questions
Tony Finch wrote:
>
> How about using "consistent overhead byte stuffing" to encode binaries
> more compactly? http://www.stuartcheshire.org/papers/COBSforToN.pdf
> The Erlang reference manual doesn't say how binaries are ordered.

Interesting, but wouldn't that break the sort order?

Binaries and bit strings are sorted as left-aligned bit arrays.
You are right, it isn't documented. It took me a while to wrap
my brain around it, even though it's perfectly logical:

1> lists:sort([<<2#111:3>>, <<2#1111111:7>>,
<<2#110111111:9>>, <<1:1>>]).
[<<1:1>>,<<223,1:1>>,<<7:3>>,<<127:7>>]

Looking back, I don't remember how I thought I would work...

...oh, but my email archive has better memory than I:

5> lists:sort([<<1:1>>,<<1:2>>,<<1:3>>,<<1:4>>,
<<1:5>>,<<1:6>>,<<1:7>>,<<1:8>>,
<<2:2>>,<<2:3>>,<<2:4>>,<<1>>,
<<16>>, <<128>>, <<255>>]).
[<<1>>,
<<1>>,
<<1:7>>,
<<1:6>>,
<<1:5>>,
<<1:4>>,
<<16>>,
<<1:3>>,
<<2:4>>,
<<1:2>>,
<<2:3>>,
<<1:1>>,
<<2:2>>,
<<128>>,
<<"ÿ">>]

A certain professor used the academic term "Gobsmacking!" to
describe this, but the fallacy is that you fall into thinking
that the sort order above should resemble the order of the
integer values that we have encoded with different lengths.
It makes sense only if you view them as bit streams:

6> lists:sort([{sext:pp(B),B} || B <-
[<<1:1>>,<<1:2>>,<<1:3>>,<<1:4>>,
<<1:5>>,<<1:6>>,<<1:7>>,<<1:8>>,
<<2:2>>,<<2:3>>,<<2:4>>,<<1>>,
<<16>>, <<128>>, <<255>>]]).
[{"00000001",<<1>>},
{"00000001",<<1>>},
{"0000001",<<1:7>>},
{"000001",<<1:6>>},
{"00001",<<1:5>>},
{"0001",<<1:4>>},
{"00010000",<<16>>},
{"001",<<1:3>>},
{"0010",<<2:4>>},
{"01",<<1:2>>},
{"010",<<2:3>>},
{"1",<<1:1>>},
{"10",<<2:2>>},
{"10000000",<<128>>},
{"11111111",<<"ÿ">>}]


BR,
Ulf W

Ulf Wiger

unread,
Oct 31, 2009, 7:17:15 AM10/31/09
to erlang-questions Questions
Ulf Wiger wrote:
>
> A while ago I started hacking on a serialization format that
> would have the same sorting properties as Erlang terms.
>
> I didn't quite get it to work (negative floats was the most
> difficult part), but when I returned to it today, I realized
> that it was only a very small problem. Once fixed, all my
> QuickCheck suites passed.

I just had to try this on Tokyo Tyrant, so I wrote a small
prototype for connecting to TT and encoding a few requests,
using the sext library to encode terms before sending them.

I realized that a new function was needed in sext: prefix(Term),
which encodes a 'prefix' that will match similar terms, and allow
some wildcarding. A prefix can't be decoded (at least, I didn't
write any code for doing so).

Some examples:

Eshell V5.7.1 (abort with ^G)
1> sext:encode({1,2,3}).
<<16,0,0,0,3,10,0,0,0,2,10,0,0,0,4,10,0,0,0,6>>
2> sext:prefix({1,'_','_'}).
<<16,0,0,0,3,10,0,0,0,2>>
3> sext:encode([1,2,3]).
<<17,10,0,0,0,2,10,0,0,0,4,10,0,0,0,6,0>>
4> sext:prefix([1,2|'_']).
<<17,10,0,0,0,2,10,0,0,0,4>>


Armed with this, I opened a B-tree table in Tokyo Tyrant,
and connected to it with my prototype module.

Eshell V5.7.1 (abort with ^G)
1> {ok,TT} = tt_proto:open(tt,[]).
{ok,<0.35.0>}
2> tt_proto:put(TT,{1,a}, 1).
ok
3> tt_proto:get(TT,{1,a}).
{ok,1}
4> tt_proto:put(TT,{1,b}, 2).
ok
5> tt_proto:put(TT,{1,c}, 3).
ok
6> tt_proto:put(TT,{2,a}, 4).
ok

Now, for some prefix matching:

7> tt_proto:keys(TT,{1,'_'}).
{ok,[{1,a},{1,b},{1,c}]}
8> tt_proto:keys(TT,{2,'_'}).
{ok,[{2,a}]}
9> timer:tc(tt_proto,keys,[TT,{1,'_'}]).
{279,{ok,[{1,a},{1,b},{1,c}]}}

I made no real effort to optimize anything. The module starts
a gen_server which keeps a connection open to ttserver. It handles
only one query at a time, but looking at the TCP protocol, it's
hard to see how it could to otherwise, as there is no tagging of
requests. The round trip times are going to be fairly high for simple
requests (compared to dets and mnesia on small data sets), but the
main benefit of using TT in the first place ought to be either that
the data set is uncomfortably large for mnesia and dets, or that one
wants ordered_set semantics on disk-based storage.

I put the tt_proto module in sext/examples/
There is some edoc for it too.

http://svn.ulf.wiger.net/sext/trunk/sext/doc/index.html

BR,
Ulf W

Paul Mineiro

unread,
Oct 31, 2009, 12:18:36 PM10/31/09
to erlang-questions Questions
Awesome. That's 90% of tcerl; most of the complexity had to do with
implementing erlang term order from encoded binaries, some of the rest
with augmenting term_to_binary to allow for prefixes (there's no smallest
or largest term in erlang), and the remainder with query planning. The
query planning part is pure erlang and independent of the term encoding
(http://code.google.com/p/tcerl/source/browse/trunk/tcerl/src/tcbdbmsutil.erl),
so you could hopefully reuse it.

In retrospect, clearly, I should have abandoned term_to_binary, if only
because Erlang is easy and C is hard, so the C side should be just memcmp.

-- p

Reply all
Reply to author
Forward
0 new messages