Reading Multiple Protobufs from a File


Mark Dredze

Dec 9, 2009, 12:23:09 PM
to Protocol Buffers
Hi All,

I have what I believe is a simple task: write multiple protocol
buffers to a single file and then read them back sequentially. When
reading, I should not have to load the entire file into memory, but
instead read each protobuf object one at a time (with some buffering).
An example application would be to store a large number of documents
in a file, where each document is a single protobuf.

This functionality seems to be provided in Java using writeTo
(OutputStream output) and parseFrom(InputStream input). However, it
seems to be missing from Python:
http://groups.google.com/group/protobuf/browse_thread/thread/cfe1955729077132/c8ccf86adecf3b47?lnk=gst
http://groups.google.com/group/protobuf/browse_thread/thread/838eb489871a92df/b2863c8b9ebfc433

Have people come up with a solution to this problem for Python? One
approach would be to port CodedOutputStream and CodedInputStream to
Python.

Best,
Mark

Jason Hsueh

Dec 9, 2009, 1:33:03 PM
to Mark Dredze, Protocol Buffers
Indeed, CodedOutputStream and CodedInputStream aren't presently available in Python. Want to add it?

You should also be able to implement what you want without the help of these classes. http://code.google.com/apis/protocolbuffers/docs/techniques.html#streaming describes what to do - once you have the size delimiters you can read just enough to parse a single message, rather than reading the entire file into memory.
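The size-delimited approach from that techniques page can be sketched in plain Python. This is only an illustration, not library code: the function names (encode_varint, decode_varint, write_delimited, read_delimited) are made up for the example, and the payloads are raw bytes standing in for message.SerializeToString() output. A varint length prefix is used here instead of a fixed-width one, so the frame size isn't capped by a 4-byte header:

```python
import io

def encode_varint(value):
    # Base-128 varint encoding, least-significant group first.
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(stream):
    # Read one varint from a binary stream; return None on clean EOF.
    result, shift = 0, 0
    while True:
        b = stream.read(1)
        if not b:
            if shift == 0:
                return None
            raise EOFError('truncated varint')
        result |= (b[0] & 0x7F) << shift
        if not b[0] & 0x80:
            return result
        shift += 7

def write_delimited(stream, payload):
    # payload would be message.SerializeToString() in real use.
    stream.write(encode_varint(len(payload)))
    stream.write(payload)

def read_delimited(stream):
    # Return the next payload, or None at end of stream.
    size = decode_varint(stream)
    if size is None:
        return None
    data = stream.read(size)
    if len(data) != size:
        raise EOFError('truncated message')
    return data
```

On the reading side you would pass each returned payload to MergeFromString on a fresh message; only one message's worth of bytes is ever held in memory at a time.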


Mark Dredze

Dec 9, 2009, 2:37:32 PM
to Protocol Buffers
I had the same idea (writing the size of the message first and then
the message). Here is a simple reader and writer for Python based on
this idea. Note that I assume the message size fits in a 4-byte
unsigned integer. For very large messages, this won't work. However, I
am relying on the assumption that messages aren't that large. If they
are, then you probably need a different format for storing the data
anyway (e.g., break it up into many protocol buffers).

Comments welcome.
Mark


import struct

class ProtocolBufferFileReader:
    def __init__(self, input_filename, message_constructor):
        self.file = open(input_filename, 'rb')
        self.message_constructor = message_constructor

    def next(self):
        size_bytes = self.file.read(4)
        if len(size_bytes) == 0:
            raise StopIteration
        size = struct.unpack('I', size_bytes)[0]

        message = self.message_constructor()
        message.MergeFromString(self.file.read(size))
        return message

    def __iter__(self):
        return self

    def close(self):
        self.file.close()

class ProtocolBufferFileWriter:
    def __init__(self, output_filename):
        self.file = open(output_filename, 'wb')

    def write(self, message):
        # Serialize once; the length prefix must match the bytes written.
        serialized = message.SerializeToString()
        self.file.write(struct.pack('I', len(serialized)))
        self.file.write(serialized)

    def flush(self):
        self.file.flush()

    def close(self):
        self.file.close()

Kenton Varda

Dec 9, 2009, 2:52:45 PM
to Mark Dredze, Protocol Buffers
In Python, you probably don't have much to gain from Coded{Input,Output}Stream.  Just serialize to a string, and then write it.  On the other end, read the bytes to a string, then parse them.  The extra copy will be fairly cheap compared to the time to parse/serialize, since memory copying is implemented at the C level.


Nick Bolton

Dec 13, 2009, 12:50:12 PM
to Mark Dredze, Protocol Buffers
Hi Mark,

> Comments welcome.
> Mark
>
>
> class ProtocolBufferFileReader:
>        def __init__(self, input_filename, message_constructor):
>                self.file = open(input_filename, 'rb')
>                self.message_constructor = message_constructor

It may also be useful to modify this constructor to accept a "file" argument:

class ProtocolBufferFileReader:
    def __init__(self, file, message_constructor):
        self.file = file
        self.message_constructor = message_constructor

So that you can call the class in two ways:

reader = ProtocolBufferFileReader(open(input_filename, 'rb'), message_constructor)

... or ...

from cStringIO import StringIO
reader = ProtocolBufferFileReader(StringIO(binary_data), message_constructor)

... where binary_data is a Python binary string created by
myMessage.SerializeToString()

This would be useful for receiving several protobuf messages over the
network at once (for example). It also means that you don't need to
modify the existing code in ProtocolBufferFileReader, since StringIO
also has the same read, write, seek, and tell functions as the file
class. It's a bit like using a stream in C++ I suppose.

Nick

Mark Dredze

Dec 22, 2009, 10:46:56 PM
to Nick Bolton, Protocol Buffers
That's a good idea. I'll try that. Thanks.

Mark
