Confusion over variable size integer encoding in binary protocol


Michael Peterson

Jan 30, 2015, 10:19:47 PM
to orient-...@googlegroups.com
Hi,

I've just recently started an initial effort to write a Go (golang) driver for OrientDB. I'm starting with the binary protocol and I have some questions.

The "field_data serialization by type" section of this document (http://www.orientechnologies.com/docs/last/orientdb.wiki/Record-Schemaless-Binary-Serialization.html) states that variable size integers are "implemented in the same way of UTF-8". But the ranges then given contradict that statement:

-64 < value < 64 1 byte
-8192 < value < 8192 2 byte
-1048576 < value < 1048576 3 byte
-134217728 < value < 134217728 4 byte
-17179869184 < value < 17179869184 5 byte


If you are truly using UTF-8 "marker" bits, then a 2 byte varint would be of the form:

110xxxxx 10yyyyyy


which leaves only 11 bits free, but your range of -8192 < value < 8192 for 2 bytes implies that you have 14 bits available. I then found this reference:

https://groups.google.com/forum/#!searchin/orient-database/varint$20variable$20length$20int/orient-database/8r1ES_LEDxE/rwdpxjMr-BQJ

which indicates that you are using the high bit of every byte to indicate whether another byte follows (1 = yes). Again, that is not how UTF-8 works. UTF-8 lets you determine the total number of bytes by parsing only the first byte, and each subsequent byte has only 6 free bits, not 7.
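To make the difference concrete, here is a minimal Go sketch of the continuation-bit scheme described above, next to the payload-bit counts of UTF-8. It assumes the protobuf convention of least-significant 7-bit group first; the function names (decodeUvarint, utf8PayloadBits, varintPayloadBits) are made up for illustration and are not taken from OrientDB.

```go
package main

import (
	"errors"
	"fmt"
)

// decodeUvarint reads a base-128 varint from buf. The low 7 bits of each byte
// carry payload (least-significant group first, an assumption borrowed from
// protobuf) and a set high bit means another byte follows.
// It returns the value and the number of bytes consumed.
func decodeUvarint(buf []byte) (uint64, int, error) {
	var value uint64
	var shift uint
	for i, b := range buf {
		// The 10th byte may contribute only one bit to a 64-bit value.
		if i == 9 && b > 1 {
			return 0, 0, errors.New("varint overflows 64 bits")
		}
		value |= uint64(b&0x7f) << shift
		if b&0x80 == 0 {
			return value, i + 1, nil
		}
		shift += 7
	}
	return 0, 0, errors.New("truncated varint")
}

// varintPayloadBits is the number of free bits in an n-byte varint.
func varintPayloadBits(n int) int { return 7 * n }

// utf8PayloadBits is the number of free bits in an n-byte UTF-8 sequence
// (n in 1..4); it returns 0 for any other length.
func utf8PayloadBits(n int) int {
	bits := []int{7, 11, 16, 21}
	if n < 1 || n > len(bits) {
		return 0
	}
	return bits[n-1]
}

func main() {
	for n := 1; n <= 4; n++ {
		fmt.Printf("%d byte(s): varint %2d bits, UTF-8 %2d bits\n",
			n, varintPayloadBits(n), utf8PayloadBits(n))
	}
	v, n, err := decodeUvarint([]byte{0xAC, 0x02})
	fmt.Println(v, n, err) // 300 2 <nil>
}
```

Two bytes give 14 payload bits under the continuation-bit scheme and 11 under UTF-8, which is the mismatch with the documented 2-byte range.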

So the documentation should be clarified I think.  (Examples would be even better!)

Also two other questions:

* could you also specify whether all integer types are encoded big-endian, including varints?

* can you confirm that you are using ZigZag encoding for varints?  The first reference above says you are, but the second one (the Google Group link) does not mention it.  If you are using ZigZag encoding, can you confirm that you are using the form used in Google's Protocol Buffers as documented here:  https://developers.google.com/protocol-buffers/docs/encoding?csw=1 ?
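For reference, this is roughly what the Protocol Buffers form of ZigZag looks like in Go. It is only a sketch of the protobuf scheme, under the assumption that OrientDB follows it; zigzagEncode, zigzagDecode and encodedLen are illustrative names, while binary.PutUvarint and binary.MaxVarintLen64 come from Go's standard encoding/binary package.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// zigzagEncode maps signed values to unsigned ones so that small magnitudes
// stay small: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, and so on.
func zigzagEncode(v int64) uint64 {
	return uint64(v<<1) ^ uint64(v>>63)
}

// zigzagDecode is the inverse of zigzagEncode.
func zigzagDecode(u uint64) int64 {
	return int64(u>>1) ^ -int64(u&1)
}

// encodedLen reports how many bytes v occupies as a ZigZag base-128 varint.
func encodedLen(v int64) int {
	buf := make([]byte, binary.MaxVarintLen64)
	return binary.PutUvarint(buf, zigzagEncode(v))
}

func main() {
	for _, v := range []int64{0, -1, 1, 63, -64, 64, -65, 8191, -8192, 8192} {
		fmt.Printf("%6d -> zigzag %6d, %d byte(s)\n",
			v, zigzagEncode(v), encodedLen(v))
	}
}
```

With this mapping the byte counts line up with the documented ranges (63 fits in one byte, 64 needs two, 8192 needs three). Under this scheme -64 and -8192 also fit in one and two bytes respectively, so the strict lower bounds in the table would be off by one if this is indeed what OrientDB does. Go's binary.PutVarint applies the same ZigZag mapping before writing the varint.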

Thanks very much for your help,
Michael

Emanuel

Feb 1, 2015, 3:53:37 PM
to orient-...@googlegroups.com
Hi,

Sorry, the documentation was not accurate; I'm updating it right now. The UTF-8 approach was an experiment, but the protocol uses the varint encoding as described in the protobuf spec :)

bye
Emanuel

Michael Peterson

Feb 2, 2015, 7:48:17 AM
to orient-...@googlegroups.com
Thanks Emanuel.

It looks like the Google Protobuf spec (https://developers.google.com/protocol-buffers/docs/encoding?csw=1) uses little-endian encoding for both varints and non-varint numbers.

Can you clarify what the endian-ness is for the network binary protocol and the schemaless serialization in OrientDB?

From my (limited) experimentation so far, it looks like the network binary protocol uses big-endian.  What about float types in the schemaless serialization?

-Michael