optional bytes vs repeated bytes

Saptarshi

Aug 24, 2009, 10:51:37 AM

to Protocol Buffers

Hello,
Suppose I would like to store a type that could be a sequence of raw
bytes, so

message ...{

optional bytes sdata1=1; //A
repeated bytes sdata2=2; //B
optional BYT sdata3=3; //C
}

message BYT{
uint32 length=1;
bytes data=2;
}
Now I have an unsigned char * array which I wish to store in the
message.

My first approach was to use (B), calling add_sdata2(array[i],i) (or
something similar), but this takes 3 bytes per byte stored.

I then tried option (A), storing the entire set of data in sdata1
(which is actually a string, according to the generated protobuf
header files). But when it comes to reading it, how do I know the
number of bytes stored in the string? Suppose my data looks like
0x00,0x00,0x00, what will be the length?

I am currently using option (C).

Have I missed something? Is there a better approach?
Thank you in advance
Saptarshi

jasonh

Aug 24, 2009, 2:46:26 PM

to Protocol Buffers

On Aug 24, 7:51 am, Saptarshi <saptarshi.g...@gmail.com> wrote:
> Hello,
> Suppose I would like to store a type that could be a sequence of raw
> bytes, so
>
> message ...{
>
> optional bytes sdata1=1; //A
> repeated bytes sdata2=2; //B
> optional BYT sdata3=3; //C
>
> }
>
> message BYT{
> uint32 length=1;
> bytes data=2;}
>
> Now I have an unsigned char * array which I wish to store in the
> message.
>
> My first approach was to use (B), calling add_sdata2(array[i],i) (or
> something similar), but this takes 3 bytes per byte stored.

This shouldn't be 3 bytes per byte unless you are storing each byte
individually. If you only have a single byte string, you should add the
entire thing as one repeated element. Also, in that add_sdata2() call
you are passing the index as the length of the string. Given
const char* array[] = { "foo", "bar" }; you want
add_sdata2(array[i], strlen(array[i])); for text. For binary data that
may contain NUL bytes, pass the real byte count instead of strlen().
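
The 3-bytes-per-byte figure follows from the wire format: every element of a repeated bytes field carries its own tag and length prefix. Below is a minimal sketch of that arithmetic, assuming the standard varint encoding. The helper names (VarintSize, BytesFieldSize, PerByteSize, WholeBufferSize) are made up for illustration and are not part of the protobuf API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of bytes a base-128 varint needs to represent `value`.
std::size_t VarintSize(std::uint64_t value) {
  std::size_t n = 1;
  while (value >= 0x80) {
    value >>= 7;
    ++n;
  }
  return n;
}

// Encoded size of one length-delimited element: tag + varint length + payload.
std::size_t BytesFieldSize(std::uint32_t field_number, std::size_t payload_len) {
  assert(field_number >= 1);
  // Tag = (field number << 3) | wire type 2 (length-delimited).
  const std::uint64_t tag = (static_cast<std::uint64_t>(field_number) << 3) | 2;
  return VarintSize(tag) + VarintSize(payload_len) + payload_len;
}

// Total size when each of the n input bytes becomes its own repeated element.
std::size_t PerByteSize(std::uint32_t field_number, std::size_t n) {
  return n * BytesFieldSize(field_number, 1);
}

// Total size when the whole n-byte buffer is stored as a single element.
std::size_t WholeBufferSize(std::uint32_t field_number, std::size_t n) {
  return BytesFieldSize(field_number, n);
}
```

Under these assumptions, PerByteSize(2, 1000) comes to 3000 bytes, while WholeBufferSize(2, 1000) comes to 1003.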

>
> I then tried option (A), storing the entire set of data in sdata1
> (which is actually a string, according to the generated protobuf
> header files). But when it comes to reading it, how do I know the
> number of bytes stored in the string? Suppose my data looks like
> 0x00,0x00,0x00, what will be the length?
>
> I am currently using option (C).

You should use option a or b: c adds additional overhead. Here is the
wire format for each of the options:

(a): <1-byte tag + wire type (00001010 for field number 1, length-delimited)>
<varint length><raw bytes of sdata1>
If your data contains three NUL bytes, then you'll get
<0x0a><0x03><0x00><0x00><0x00>
When parsing, the size of the data will just be msg.sdata1().size() ==
3.
(b): has the same wire format as (a), except that you can encode
multiple byte arrays with the same tag. I.e., if you had:
const char* array[] = { "foo", "bar", "quux" };
You could add each to the repeated bytes field, and each string would
get encoded to the wire with the format above:
<0x12><0x03>"foo"<0x12><0x03>"bar"<0x12><0x04>"quux"
(c): The wire format for nested messages already encodes the size, so
you are adding extra bytes of overhead by encoding the length
separately. You now have:
[tag for BYT field][BYT message length][tag for length][varint length][tag
for data][length of data][data]
Or for three NUL bytes:
<0x1a><0x07><0x08><0x03><0x12><0x03><0x00><0x00><0x00>
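
The byte sequences above can be reproduced with a small hand-rolled encoder. This is only a sketch for checking the arithmetic, not how the protobuf library is implemented. All function names here (AppendVarint, AppendLengthDelimited, AppendVarintField, EncodeOptionA/B/C) are hypothetical, and the field numbers are taken from the message definition in the original question.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Appends `value` as a base-128 varint (low 7 bits first).
void AppendVarint(std::uint64_t value, std::string* out) {
  while (value >= 0x80) {
    out->push_back(static_cast<char>((value & 0x7F) | 0x80));
    value >>= 7;
  }
  out->push_back(static_cast<char>(value));
}

// Appends a length-delimited field (wire type 2): tag, length, payload.
void AppendLengthDelimited(std::uint32_t field_number,
                           const std::string& payload, std::string* out) {
  assert(field_number >= 1);
  AppendVarint((static_cast<std::uint64_t>(field_number) << 3) | 2, out);
  AppendVarint(payload.size(), out);
  out->append(payload);
}

// Appends a varint field (wire type 0): tag, value.
void AppendVarintField(std::uint32_t field_number, std::uint64_t value,
                       std::string* out) {
  assert(field_number >= 1);
  AppendVarint(static_cast<std::uint64_t>(field_number) << 3, out);
  AppendVarint(value, out);
}

// Option (a): optional bytes sdata1 = 1.
std::string EncodeOptionA(const std::string& data) {
  std::string out;
  AppendLengthDelimited(1, data, &out);
  return out;
}

// Option (b): repeated bytes sdata2 = 2, one element per byte string.
std::string EncodeOptionB(const std::vector<std::string>& items) {
  std::string out;
  for (const std::string& item : items) AppendLengthDelimited(2, item, &out);
  return out;
}

// Option (c): optional BYT sdata3 = 3, where BYT has length = 1 and data = 2.
std::string EncodeOptionC(const std::string& data) {
  std::string inner;
  AppendVarintField(1, data.size(), &inner);
  AppendLengthDelimited(2, data, &inner);
  std::string out;
  AppendLengthDelimited(3, inner, &out);
  return out;
}
```

For three NUL bytes, EncodeOptionA produces 5 bytes and EncodeOptionC produces 9, so the wrapper message in (c) costs 4 extra bytes for information that (a) already carries.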

Hope that helps,
Jason

Kenton Varda

Aug 24, 2009, 3:21:30 PM

to Saptarshi, Protocol Buffers

In C++, "bytes" fields are stored using std::string. This class has a size() method which returns the length of the string. So, the length of sdata1 is:

message.sdata1().size()

So you want to use option (A).

You would use option B if you wanted to store multiple independent byte strings, each one containing an arbitrary number of bytes. There is never a reason to use option C.

Saptarshi Guha

Aug 24, 2009, 3:58:31 PM

to Kenton Varda, Protocol Buffers

Hello,
Thank you! Being new to C++, I had no idea about the size() method. I was
thinking of using strlen on c_str(), which would have given me the wrong
number of bytes.
Nice and clean now.
Regards
Saptarshi

Kenton Varda

Aug 24, 2009, 4:09:38 PM

to sg...@purdue.edu, Protocol Buffers

Yes, unlike C-style char*, C++'s std::string can store strings that contain NUL bytes. You should never use .c_str() in this case -- use .data() and .size() instead.
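
The difference between strlen() on c_str() and size() can be seen with plain std::string, without any protobuf code. The sketch below uses hypothetical helper names (FromRawBytes, CStringLength, CopyOut) and assumes the raw buffer and its length are known to the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Copies `len` raw bytes into a std::string, the type the generated C++
// code uses for bytes fields. The explicit length keeps embedded NULs.
std::string FromRawBytes(const unsigned char* buf, std::size_t len) {
  assert(buf != nullptr || len == 0);
  return std::string(reinterpret_cast<const char*>(buf), len);
}

// Length as C string functions see it: counting stops at the first NUL.
std::size_t CStringLength(const std::string& s) {
  return std::strlen(s.c_str());
}

// Copies up to `capacity` bytes back out using data() and size().
// Returns the number of bytes copied.
std::size_t CopyOut(const std::string& s, unsigned char* dst,
                    std::size_t capacity) {
  const std::size_t n = s.size() < capacity ? s.size() : capacity;
  std::memcpy(dst, s.data(), n);
  return n;
}
```

For the three-NUL-byte buffer from the original question, size() reports 3 while CStringLength() reports 0, which is why the length should always come from size().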

Sapsi

Aug 24, 2009, 6:18:48 PM

to Kenton Varda, sg...@purdue.edu, Protocol Buffers

Thanks, I had no idea about the data method.

Much to learn.

Regards

Saptarshi

On Aug 24, 2009, at 4:09 PM, Kenton Varda <ken...@google.com> wrote:

You should never use .c_str() in this case -- use .data() and .size() instead.
