As I understand it, an integer is terminated by a byte whose most-significant bit is zero. Thus, bytes must be read one at a time, and this condition must be checked after each read to determine whether to read another. Why was this encoding chosen over a variable-width encoding that requires at most two reads -- that is, one that specifies in its first byte how many subsequent bytes to read?
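For reference, here is a minimal sketch (in Go, purely my choice of language; the function name is mine) of the one-byte-at-a-time decode loop I mean:

    package varint

    import (
        "errors"
        "io"
    )

    // readVarint is a sketch of the base-128 varint decoding described above:
    // keep reading bytes and accumulating their low 7 bits until a byte
    // arrives whose most-significant bit is zero.
    func readVarint(r io.ByteReader) (uint64, error) {
        var value uint64
        for shift := uint(0); shift < 64; shift += 7 {
            b, err := r.ReadByte()
            if err != nil {
                return 0, err
            }
            value |= uint64(b&0x7F) << shift // low 7 bits are payload
            if b&0x80 == 0 {                 // continuation bit clear: done
                return value, nil
            }
        }
        return 0, errors.New("varint too long")
    }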
No, I don't mean for the first byte's value to be the length of the rest of the integer. Rather, the number of leading ones in the first byte could be the number of following bytes. This would still allow 7 bits of a value to be stored per byte, with the added bonus of a full 64-bit value being encoded in 9 bytes instead of 10.
Examples:
0 leading ones followed by a terminating zero and then 7 bits:
0b0.......
1 leading one followed by a terminating zero, then 6 bits, and then 1 byte:
0b10...... ........
7 leading ones followed by a terminating zero and then 7 bytes:
0b11111110 ........ ........ ........ ........ ........ ........ ........
8 leading ones followed by 8 bytes:
0b11111111 ........ ........ ........ ........ ........ ........ ........ ........
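To make the proposal concrete, here is a decoding sketch of the scheme above (again in Go; the function name and the bit layout -- leftover bits of the first byte as the high-order bits, following bytes appended big-endian -- are my own assumptions, not anything Protocol Buffers defines). It needs at most two reads: one for the first byte and one for everything it announces:

    package prefixvarint

    import (
        "io"
        "math/bits"
    )

    // readPrefixVarint decodes the proposed scheme: the count of leading one
    // bits in the first byte is the number of bytes that follow, so at most
    // two reads are needed.
    func readPrefixVarint(r io.Reader) (uint64, error) {
        var first [1]byte
        if _, err := io.ReadFull(r, first[:]); err != nil {
            return 0, err
        }
        extra := bits.LeadingZeros8(^first[0]) // number of leading ones

        var value uint64
        if extra < 8 {
            // Keep the bits below the prefix and its terminating zero.
            value = uint64(first[0] & (0xFF >> uint(extra+1)))
        }
        if extra == 0 {
            return value, nil // single-byte case: one read total
        }

        var rest [8]byte
        if _, err := io.ReadFull(r, rest[:extra]); err != nil {
            return 0, err
        }
        for _, b := range rest[:extra] { // second and final read
            value = value<<8 | uint64(b)
        }
        return value, nil
    }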
So, such an encoding is clearly possible. Why does Protocol Buffers use something different? Is it to provide some level of protection against dropped bytes? Or has all of the data usually been read into a buffer by the time it is decoded, so that reducing the number of reads would not provide much of a speed boost?