Decode size delimited compressed protobuf file in python

Mr Moose

May 17, 2024, 7:19:58 AM
to Protocol Buffers
Hello everyone,

I hope I can find some advice here.
I have C++ code that writes a number of protobuf messages to a compressed, size-delimited stream like this (simplified):

FILE *ofile = fopen("myfile.bin.gz", "wb");
google::protobuf::io::FileOutputStream ostream(_fileno(ofile));
google::protobuf::io::GzipOutputStream zipstream(&ostream);

while (loop) {
   google::protobuf::util::SerializeDelimitedToZeroCopyStream(my_msg, &zipstream);
}

This works fine. The files are written, and I can read them back in C++ with no issues.
Now I am trying to read them in Python, and I'm having difficulty understanding the structure of the files. Here's what I'm trying:

from google.protobuf.internal.decoder import _DecodeVarint

def read_messages(raw_data: bytes):
    offset = 0
    while offset < len(raw_data):
        # Read the size (4 bytes, little-endian) and decode
        size_bytes = raw_data[offset : offset + 4]
        offset += 4
        size, _ = _DecodeVarint(size_bytes, 0)
        # This reads the correct size of the message (verified in C++)

        message_data = raw_data[offset : offset + size]
        offset += size

        # This causes an "Error parsing message" exception at the first message
        msg = my_messages_protobuf.MyMessage()
        msg.ParseFromString(message_data)

... and ...

with gzip.open("myfile.bin.gz", "rb") as f:
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        read_messages(chunk)

Now, to clarify a bit: I have worked with protobuf for a long time, although not in Python. However, we already have Python code that deserializes such messages arriving from elsewhere, so I assume the general "set up protobuf in Python" part is not the issue here. It should work.

The fact that _DecodeVarint() correctly reads the message size leads me to believe that the reading of the gzipped file is okay too.

Yet when I look at the raw buffer message_data, it looks very different from what the raw message data looks like in the C++ debugger. I have no idea what could cause this difference.

Can anybody give me a hint on what could be wrong here?

Much appreciated,
Moose

Mr Moose

May 17, 2024, 8:06:12 AM
to Protocol Buffers
I have figured it out myself.
The problem is that _DecodeVarint() may consume fewer than the 4 bytes I reserved for it. The second element of the tuple it returns is the position just past the varint (which, since I started decoding at 0, equals the number of bytes it actually consumed). Advancing offset by that returned position rather than by a fixed 4 bytes does the trick.
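
For anyone finding this later, here is a minimal sketch of the corrected reader, reusing MyMessage and my_messages_protobuf from my code above. One caveat: _DecodeVarint lives in google.protobuf.internal.decoder, which is an internal module and may change between protobuf releases.

import gzip

from google.protobuf.internal.decoder import _DecodeVarint

import my_messages_protobuf

def read_messages(raw_data: bytes):
    messages = []
    offset = 0
    while offset < len(raw_data):
        # _DecodeVarint returns (value, new_position). Passing the current
        # offset means new_position already points at the first byte of the
        # message payload, so no fixed 4-byte skip is needed.
        size, offset = _DecodeVarint(raw_data, offset)

        message_data = raw_data[offset : offset + size]
        offset += size

        msg = my_messages_protobuf.MyMessage()
        msg.ParseFromString(message_data)
        messages.append(msg)
    return messages

# Decompressing the whole file at once avoids a message being split
# across two read() chunks.
with gzip.open("myfile.bin.gz", "rb") as f:
    all_messages = read_messages(f.read())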

Cheers,
Moose