Re: [protobuf] Protocol buffers and large data sets

Jason Hsueh

May 17, 2010, 7:00:36 PM
to sanikumbh, Protocol Buffers
There is a default byte size limit of 64MB when parsing protocol buffers: if a message is larger than that, it will fail to parse. This can be configured if you really need to parse larger messages, but it is generally not recommended. Additionally, ByteSize() returns a 32-bit integer, so there's an implicit limit on the size of data that can be serialized.
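If you do need to raise that limit, here is a minimal sketch of what the C++ side might look like, assuming a file descriptor and a hypothetical MyMessage type, and using the two-argument SetTotalBytesLimit of the 2010-era API (newer releases take a single argument):

  #include <google/protobuf/io/coded_stream.h>
  #include <google/protobuf/io/zero_copy_stream_impl.h>

  using google::protobuf::io::CodedInputStream;
  using google::protobuf::io::FileInputStream;

  // Sketch only: MyMessage and fd are assumptions, not from this thread.
  bool ParseLargeMessage(int fd, MyMessage* message) {
    FileInputStream raw_input(fd);
    CodedInputStream coded_input(&raw_input);
    // Arguments are (total byte limit, threshold at which to warn).
    coded_input.SetTotalBytesLimit(256 << 20, 64 << 20);
    return message->ParseFromCodedStream(&coded_input);
  }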

You can certainly use protocol buffers in large data sets, but it's not recommended to have your entire data set be represented by a single message. Instead, see if you can break it up into smaller messages.
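For illustration, a minimal sketch of the write side of that approach, assuming varint length-prefix framing and a hypothetical Record message type (neither is specified in this thread):

  #include <google/protobuf/io/coded_stream.h>
  #include <google/protobuf/io/zero_copy_stream_impl.h>

  using google::protobuf::io::CodedOutputStream;
  using google::protobuf::io::FileOutputStream;

  // Write one record as a varint length prefix followed by its payload,
  // so a reader can consume the stream one small message at a time.
  void WriteRecord(FileOutputStream* out, const Record& record) {
    CodedOutputStream coded(out);
    coded.WriteVarint32(record.ByteSize());
    record.SerializeToCodedStream(&coded);
    // coded's destructor returns any unused buffer space to *out.
  }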

On Mon, May 17, 2010 at 1:05 PM, sanikumbh <sani...@gmail.com> wrote:
I wanted to get some opinions on large data sets and protocol buffers. The Protocol Buffers project page from Google says that for data larger than a megabyte one should consider something different, but it doesn't mention what happens if you cross that limit. Are there any known failure modes when it comes to large data sets? What are your observations and recommendations from your experience on this front?

Terri

May 24, 2010, 3:21:33 PM
to Protocol Buffers
Hi,

I've been struggling to figure out exactly how to do the many-smaller-messages approach. I've implemented this strategy, and it's working except for a byte limit problem:

http://groups.google.com/group/protobuf/browse_thread/thread/038cc4ad000b4265/95981da7e07ce197?hide_quotes=no

I also raised the byte limit to maxint using SetTotalBytesLimit.

I use a Python program to read my data from disk and package it up into messages that are roughly 110 bytes each. Then I pipe them to a C++ program that reads the messages and crunches them. But I still have a problem: the total number of bytes across all my smaller messages is greater than maxint, and the C++ program fails to read once it hits the limit.

I like the protobuf approach to passing data; I just need to remove that limit.

What can I do?

Thanks,
Terri


Kenton Varda

May 24, 2010, 4:46:10 PM
to Terri, Protocol Buffers
My guess is that you're using a single CodedInputStream to read all your input, repeatedly calling message.ParseFromCodedStream().  Instead, create a new CodedInputStream for each message.  If you construct it on the stack, there is no significant overhead to doing this:

  while (true) {
    CodedInputStream stream(&input);
    // read one message from stream here, or break if at EOF
  }
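Filled out, that loop might look like the sketch below. It assumes varint length-prefix framing and a hypothetical MyMessage type, neither of which is spelled out in the thread:

  #include <cstdint>
  #include <google/protobuf/io/coded_stream.h>
  #include <google/protobuf/io/zero_copy_stream_impl.h>

  using google::protobuf::io::CodedInputStream;
  using google::protobuf::io::FileInputStream;

  bool ReadAllMessages(int fd) {
    FileInputStream input(fd);
    while (true) {
      // A fresh CodedInputStream per message resets its total-bytes
      // counter, so the limit applies per message rather than to the
      // whole stream; unread bytes are handed back to `input` when it
      // goes out of scope.
      CodedInputStream stream(&input);
      uint32_t size;
      if (!stream.ReadVarint32(&size)) break;  // EOF
      CodedInputStream::Limit limit = stream.PushLimit(size);
      MyMessage message;  // assumed message type
      if (!message.ParseFromCodedStream(&stream)) return false;
      stream.PopLimit(limit);
      // ... crunch the message here ...
    }
    return true;
  }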

Terri Kamm

May 27, 2010, 10:28:46 AM
to Kenton Varda, Protocol Buffers
Thanks, that worked!

Terri
