Decode size delimited compressed protobuf file in python

Mr Moose

May 17, 2024, 7:19:58 AM
to Protocol Buffers
Hello everyone,

I hope I can find some advice here.
I have C++ code that writes a number of protobuf messages to a compressed, size-delimited stream like this (simplified):

FILE *ofile = fopen("myfile.bin.gz", "wb");
google::protobuf::io::FileOutputStream ostream(_fileno(ofile));
google::protobuf::io::GzipOutputStream zipstream(&ostream);

while (loop) {
   google::protobuf::util::SerializeDelimitedToZeroCopyStream(my_msg, &zipstream);
}

This works fine. The files are written, and I can read them back in C++ with no issues.
Now I am trying to read them in Python, and I'm having difficulty understanding the structure of the files. Here's what I'm trying:

from google.protobuf.internal.decoder import _DecodeVarint

def read_messages(raw_data: bytes):
    offset = 0
    while offset < len(raw_data):
        # Read the size (4 bytes, little-endian) and decode
        size_bytes = raw_data[offset : offset + 4]
        offset += 4
        size, _ = _DecodeVarint(size_bytes, 0)
        # This reads the correct size of the message (verified in C++)

        message_data = raw_data[offset : offset + size]
        offset += size

        # This causes an "Error parsing message" exception at the first message
        msg = my_messages_protobuf.MyMessage()
        msg.ParseFromString(message_data)

... and ...

with gzip.open("myfile.bin.gz", "rb") as f:
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        read_messages(chunk)

Now, to clarify a bit: I have worked with protobuf for a long time, although not in Python. However, we already have Python code that deserializes such messages arriving from elsewhere, so I assume the general "set up protobuf in Python" part is not the issue here. It should work.

The fact that _DecodeVarint() correctly reads the message size leads me to believe that the reading of the gzipped file is okay too.

Yet when I look at the raw buffer message_data, it looks very different from what the raw message data looks like in the C++ debugger. I have no idea what could cause this difference.

Can anybody give me a hint on what could be wrong here?

Much appreciated,
Moose

Mr Moose

May 17, 2024, 8:06:12 AM
to Protocol Buffers
I have figured it out myself.
The problem is that _DecodeVarint() may consume fewer than the 4 bytes I reserved for it. The second element of the tuple it returns is the position just past the varint (which, since I started decoding at 0, equals the number of bytes it actually consumed). Advancing offset by that returned position rather than by a fixed 4 bytes does the trick.
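
For anyone finding this later, here is a minimal sketch of the corrected reader, reusing MyMessage and my_messages_protobuf from my code above. One caveat: _DecodeVarint lives in google.protobuf.internal.decoder, which is an internal module and may change between protobuf releases.

import gzip

from google.protobuf.internal.decoder import _DecodeVarint

import my_messages_protobuf

def read_messages(raw_data: bytes):
    messages = []
    offset = 0
    while offset < len(raw_data):
        # _DecodeVarint returns (value, new_position). Passing the current
        # offset means new_position already points at the first byte of the
        # message payload, so no fixed 4-byte skip is needed.
        size, offset = _DecodeVarint(raw_data, offset)

        message_data = raw_data[offset : offset + size]
        offset += size

        msg = my_messages_protobuf.MyMessage()
        msg.ParseFromString(message_data)
        messages.append(msg)
    return messages

# Decompressing the whole file at once avoids a message being split
# across two read() chunks.
with gzip.open("myfile.bin.gz", "rb") as f:
    all_messages = read_messages(f.read())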

Cheers,
Moose