protoc --decode_raw produces incorrect output


Barry waldbaum

May 19, 2021, 10:40:01 AM
to Protocol Buffers
I've been building tools to generate generic protobuf data from JSON (yes, I'm aware I lose context; this is for testing).

I've found a message I can parse properly, but protoc --decode_raw can't:

 \n  \a   3   4   5   0   0   0   0

protoc --decode_raw prints:
1 {
  6 {
  }
  6: 0x30303030
}

To reproduce it:

echo -n $'\x0a' > binary.dat; echo -n $'\x07' >> binary.dat; echo -n $'3450000' >> binary.dat ; protoc --decode_raw < binary.dat

I'm able to parse this properly using protowire as:
1: "3450000"

Also, if I change the first character of the payload to '1', I get the expected output:

$ echo -n $'\x0a' > binary.dat; echo -n $'\x07' >> binary.dat; echo -n $'1450000' >> binary.dat ; protoc --decode_raw < binary.dat
1: "1450000"

$ protoc --version
libprotoc 3.15.8

I've worked really hard to keep the reproduction as simple as possible. I haven't dug into the code for decode_raw yet; that's my next step.

thanks
-Barry

Ilia Mirkin

May 19, 2021, 11:12:01 AM
to Barry waldbaum, Protocol Buffers
I haven't looked at the decode_raw logic either, but wire type 2 is
used for all length-delimited values. That's strings, bytes, and
embedded messages (and packed repeated fields). decode_raw doesn't
know when it's an embedded message and when it's a string/bytes. So it
has to guess somehow, based on some heuristics (like how well the
values decode, presumably). And here it guesses wrong.
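
To make that concrete, here is a minimal Python sketch of such a guess. It is not protoc's actual implementation, which I haven't read. The names read_varint, parse_fields and guess are made up for illustration, and the rule "treat the payload as a message whenever it parses cleanly" is an assumption about the heuristic.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint at buf[pos]; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf) or shift >= 64:
            raise ValueError("truncated or overlong varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_fields(buf, pos=0, group=None):
    """Parse raw fields from buf[pos:]; raise ValueError if malformed.

    Returns (fields, next_pos), where each field is a
    (field_number, wire_type, value) tuple.
    """
    fields = []
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        number, wire_type = tag >> 3, tag & 7
        if number == 0:
            raise ValueError("field number 0 is invalid")
        if wire_type == 0:                    # varint
            value, pos = read_varint(buf, pos)
        elif wire_type in (1, 5):             # fixed64 / fixed32
            size = 8 if wire_type == 1 else 4
            if pos + size > len(buf):
                raise ValueError("truncated fixed-width value")
            value = int.from_bytes(buf[pos:pos + size], "little")
            pos += size
        elif wire_type == 2:                  # length-delimited
            length, pos = read_varint(buf, pos)
            if pos + length > len(buf):
                raise ValueError("truncated length-delimited value")
            value = buf[pos:pos + length]
            pos += length
        elif wire_type == 3:                  # start group
            value, pos = parse_fields(buf, pos, group=number)
        elif wire_type == 4:                  # end group
            if group != number:
                raise ValueError("unmatched end-group tag")
            return fields, pos
        else:
            raise ValueError("unknown wire type")
        fields.append((number, wire_type, value))
    if group is not None:
        raise ValueError("unterminated group")
    return fields, pos


def guess(payload):
    """Assumed heuristic: a payload that parses cleanly is shown as a message."""
    try:
        fields, _ = parse_fields(payload)
        return fields
    except ValueError:
        return payload  # fall back to showing it as a string/bytes


print(guess(b"3450000"))  # parses: empty group 6, then a fixed32 in field 6
print(guess(b"1450000"))  # '1' asks for a fixed64, but only 6 bytes remain
```

With this sketch, "3450000" happens to be a well-formed message: '3' (0x33) is a start-group tag for field 6, '4' (0x34) is the matching end-group tag, and '5' (0x35) introduces a fixed32 whose four bytes are "0000", i.e. 0x30303030. Starting with '1' (0x31) instead asks for an 8-byte fixed64 with only 6 bytes left, so the parse fails and the payload is kept as a string.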

Cheers,

-ilia

Barry waldbaum

May 19, 2021, 11:25:09 AM
to Protocol Buffers
Hi! Thanks for your response!

Wouldn't wire type 3 be the one used for embedded messages? I'd expect wire type 2 length-delimited data not to be parsed at all, just printed out as-is.

thanks
-Barry




Ilia Mirkin

May 19, 2021, 11:37:03 AM
to Barry waldbaum, Protocol Buffers
Wire types 3 and 4 are for "groups", which date back to proto1 and
have been deprecated ever since (proto2 still accepts them). These
were a lot like messages, but could not stand on their own, and could
only be embedded at the appropriate points as I recall.

Some more info here:
https://developers.google.com/protocol-buffers/docs/encoding
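
As a small illustration of the tag layout described on that page, each field starts with a tag equal to (field_number << 3) | wire_type. The Python sketch below splits the tag bytes from the example in this thread. split_tag and WIRE_TYPE_NAMES are made-up names, and it only handles single-byte tags (field numbers below 16), which is all this example needs.

```python
# Wire type names as listed in the protobuf encoding documentation.
WIRE_TYPE_NAMES = {
    0: "varint",
    1: "64-bit",
    2: "length-delimited",
    3: "start group",
    4: "end group",
    5: "32-bit",
}


def split_tag(tag):
    """Split a tag into (field_number, wire_type)."""
    return tag >> 3, tag & 0x07


# The outer tag byte 0x0a, followed by the first three payload bytes '3', '4', '5'.
for byte in b"\n345":
    number, wire_type = split_tag(byte)
    print(f"0x{byte:02x}: field {number}, {WIRE_TYPE_NAMES[wire_type]}")
```

This prints field 1 as length-delimited for 0x0a, and field 6 as start group, end group and 32-bit for '3', '4' and '5', which lines up with the `6 { }` and `6: 0x30303030` in the decode_raw output.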

Cheers,

-ilia


Barry waldbaum

May 19, 2021, 1:31:03 PM
to Protocol Buffers
Ah, that makes sense. I can see why this change was made, as it might save you a few bytes here and there, but it makes even raw decoding of a protobuf precarious. I guess at Google scale, a few bytes here and there is like $50K every 12 seconds :)