Facilitate Object Streaming


Noble

Jul 9, 2008, 5:01:24 AM7/9/08
to Protocol Buffers
Is it possible to stream objects one by one? This would help reduce
the memory footprint of my server.
For example, when I call
Builder#addAllXXX(Iterable<T>)
it immediately reads all the objects, so if the number of elements is
very large my server may run out of memory.

The best solution would be for it to iterate through the objects only
as it writes them to the stream.
--Noble

shalinsmangar

Jul 9, 2008, 12:54:17 PM7/9/08
to Protocol Buffers
It would be great if we could support this. Any server that sends out
a large number of objects in one request will need streaming to
gracefully handle a large number of requests.

Regards,
Shalin

Kenton Varda

Jul 9, 2008, 1:18:53 PM7/9/08
to Noble, Protocol Buffers

Protocol buffers are mostly designed for messages under 1MB.  We have seen people manipulate messages up to 2GB, but this isn't the intended usage.  When manipulating a large number of messages, it's probably a good idea to represent each individual message as a protocol buffer but create a custom container format to contain them.  This will also allow you to do things like efficient random access, which protocol buffers don't currently provide.

Noble

Jul 9, 2008, 1:26:55 PM7/9/08
to Protocol Buffers
That is fine for small messages, but depending on the application the
messages can be bigger, and sometimes the size of the response is
unpredictable. If a server is serving thousands of users, every extra
MB of RAM adds up to GBs in no time.



JT Olds

Jul 15, 2008, 6:05:41 PM7/15/08
to Protocol Buffers
Most of the large protocol buffers I plan to work with have the
majority of their size contained in repeated, nested protocol buffers.
A great solution to real-time parsing and object streaming, where many
protocol buffers being worked with simultaneously can't all fit in
memory, would be an asynchronous-style parse function that simply
calls a callback each time one of the sub-protobuffer fields is
parsed. A more general solution might be another label (allowasync,
besides optional, required, and repeated) specifying that a field is a
repeated nested protocol buffer whose callback should be invoked as
each element is parsed, perhaps limited to one allowasync field per
protocol buffer. The callback could return true if it wants the nested
field to still be added to the protocol buffer object being built, and
false if the finished protocol buffer should ignore it.

Kenton, is the design of protocol buffers such that something like
this is possible, either for an independent developer willing to break
interoperability with protocol buffers, or as a possible extension to
all protocol buffers?

-JT


Kenton Varda

Jul 15, 2008, 6:21:13 PM7/15/08
to JT Olds, Protocol Buffers
I don't think it would be very hard to write a parser like you describe, but it's not done at present.  Any interest in working on this?

I think, ideally, there should be a visitor interface like:

class Visitor {
 public:
  virtual ~Visitor();

  virtual void AddInt32(const FieldDescriptor* field, int32 value) = 0;
  virtual void AddInt64(const FieldDescriptor* field, int64 value) = 0;
  ...

  // May return NULL to skip the sub-message.  Otherwise, the
  // sub-message's fields are added to the returned Visitor.
  virtual Visitor* BeginSubMessage(const FieldDescriptor* field) = 0;
  // Parameter is the same Visitor object returned by BeginSubMessage().
  virtual void EndSubMessage(const FieldDescriptor* field, Visitor* sub_visitor) = 0;
};

Message::Reflection could pretty easily be made to subclass this interface, which would automatically mean that any Message object can act as a Visitor.  Then wire_format.cc could be refactored to work in terms of Visitor instead of Message::Reflection.
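To make that concrete, here is a minimal self-contained sketch of what a streaming consumer of such an interface could look like. The FieldDescriptor here is a hypothetical stub (the real class is much richer), and the interface is trimmed to one value type so the snippet compiles on its own:

```cpp
#include <cstdint>
#include <string>

// Hypothetical stub standing in for protobuf's FieldDescriptor --
// just enough to exercise the Visitor idea.
struct FieldDescriptor { std::string name; };

class Visitor {
 public:
  virtual ~Visitor() {}
  virtual void AddInt32(const FieldDescriptor* field, int32_t value) = 0;
  // May return NULL to skip the sub-message.
  virtual Visitor* BeginSubMessage(const FieldDescriptor* field) = 0;
  virtual void EndSubMessage(const FieldDescriptor* field, Visitor* sub) = 0;
};

// A streaming consumer: folds each value into a running sum as the
// parser reports it, so no Message object is ever built up in memory.
class SummingVisitor : public Visitor {
 public:
  int64_t sum = 0;
  void AddInt32(const FieldDescriptor* field, int32_t value) override {
    sum += value;
  }
  Visitor* BeginSubMessage(const FieldDescriptor* field) override {
    return this;  // reuse the same visitor for nested messages
  }
  void EndSubMessage(const FieldDescriptor* field, Visitor* sub) override {}
};
```

A parser driving this interface would never need to hold more than one field's worth of data at a time.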

JT Olds

Jul 16, 2008, 8:43:04 PM7/16/08
to Protocol Buffers
Oh, that's a great way to do it. However, I recently realized that the
use case I have in mind also requires a loose ordering guarantee,
which is not currently provided.

For example, assume you are passing a large protocol buffer (large
because there are many instances of a repeated, nested protocol
buffer). You want to parse the nested protocol buffers in a streaming
fashion, discarding previously parsed ones, but you need to know some
additional information in the parent protocol buffer before you can
actually do anything useful with the nested protocol buffers. If there
were a way of asserting that certain fields always come first in the
protocol buffer (not even strictly ordered, just that they come before
repeated fields), streaming would be a piece of cake. However, since
ordering is not guaranteed during parsing in general, relying on it
could break whenever a header field ends up coming after all of the
nested protocol buffers in the stream.

There are two ways streaming like this could still work:
1) The ordering is deterministic as long as the fields you want to
come first existed in every version of the protocol buffer definition
and had lower field numbers than the repeated fields. The downside is
that you may later want to add a field that should be parsed before
any of the repeated nested protocol buffers; if that happens, this
scheme breaks.
2) It is guaranteed that protocol buffers are only ever serialized and
parsed once, so that intermediate parsing and re-serializing by older
parsers can't disturb the ordering. Then streaming with header-like
fields would work.

I don't know if what I just said made any sense, but basically if
there were a way of saying in the protocol buffer definition that
certain (possibly repeated) fields must always come last, it would
allow a ton of new flexibility in protocol buffer parsing design.

If I decide that not being able to safely add new header-type fields
before a large number of repeated fields is a worthwhile sacrifice, or
alternatively that asserting protocol buffers are only ever serialized
and parsed once is a worthwhile sacrifice, I may still implement this
Visitor solution; otherwise I don't have a use for it and will
probably go some other way.

My technical writing skills are quite poor. Sorry.

Kenton Varda

Jul 16, 2008, 8:54:24 PM7/16/08
to JT Olds, Protocol Buffers
Well, presumably if you're using a Visitor for parsing -- because you do not want to parse everything at once -- you're probably going to be doing something similar for serializing.  For example, there could be a class like:

class SerializingVisitor : public Visitor {
 public:
  SerializingVisitor(CodedOutputStream* output);
  ...
};

This class just writes each visited field to the given output stream.  It would then be up to you to feed it fields to serialize.

If you're doing things this way, then it's up to you what order you serialize the fields in.

Another point:

If you define your top-level message like this:

message Stream {
  optional Header header = 1;
  repeated Item item = 2;
  ...
}

message Header {
  ...
}

Then "header" (if present) will always come before any other field when using the standard serialization routines.  You can add new fields to the Header class without changing this, even when the message is parsed and re-serialized by older programs that don't know about the new field.

Alek Storm

Jul 17, 2008, 2:47:39 AM7/17/08
to Protocol Buffers
On Jul 16, 7:54 pm, "Kenton Varda" <ken...@google.com> wrote:
> Then "header" (if present) will always come before any other field when
> using the standard serialization routines.  You can add new fields to the
> Header class without changing this, even when the message is parsed and
> re-serialized by older programs that don't know about the new field.

But placing the header message first isn't required by the Protocol
Buffers definition, meaning anything but the standard protoc
implementation could break it.

JT Olds

Jul 17, 2008, 12:13:19 PM7/17/08
to Protocol Buffers
My need for streaming serialization isn't actually as great as for
streaming parsing, but that's a good point that streaming
serialization would fix the ordering issue to a degree.

However, your suggestion of the Header nested protocol buffer is a
work of pure genius. If I get either Visitor interface done and can
convince my company to release them, I'll let you know. Thanks for all
your help.

JT Olds

Jul 17, 2008, 12:15:49 PM7/17/08
to Protocol Buffers
No one is requiring that a header protocol buffer exists. It would
just be something you would need to assert exists for any streaming
processing you plan to do.

In my case, this would just mean that any protocol buffers I plan to
stream with I need to make sure I write a header as well. We're not
changing the definition. The visitor interface listed above would work
fine with protocol buffers without headers, unless your program logic
needs a field that doesn't end up showing up before all the data
you're streaming, then your program logic has to cache all of the
protocol buffer data until that field arrives, negating the usefulness
of caching.

JT Olds

Jul 17, 2008, 12:16:43 PM7/17/08
to Protocol Buffers
s/caching\.$/streaming\./

Alek Storm

Jul 17, 2008, 12:30:45 PM7/17/08
to Protocol Buffers
On Jul 17, 11:15 am, JT Olds <jto...@gmail.com> wrote:
> No one is requiring that a header protocol buffer exists. It would
> just be something you would need to assert exists for any streaming
> processing you plan to do.
>
> In my case, this would just mean that any protocol buffers I plan to
> stream with I need to make sure I write a header as well. We're not
> changing the definition. The visitor interface listed above would work
> fine with protocol buffers without headers, unless your program logic
> needs a field that doesn't end up showing up before all the data
> you're streaming, then your program logic has to cache all of the
> protocol buffer data until that field arrives, negating the usefulness
> of caching.

That's not what I meant. I meant that, when you're reading in your
message as a stream, it's perfectly legal for the header nested
message to come *last*. If you rely on it being first, you're writing
very fragile code that could be broken by just about anything.
However, if the header can be placed anywhere in the stream, you're
good to go.

Kenton Varda

Jul 17, 2008, 12:58:08 PM7/17/08
to Alek Storm, Protocol Buffers
On Thu, Jul 17, 2008 at 9:30 AM, Alek Storm <alek....@gmail.com> wrote:
That's not what I meant. I meant that, when you're reading in your
message as a stream, it's perfectly legal for the header nested
message to come *last*.  If you rely on it being first, you're writing
very fragile code that could be broken by just about anything.
However, if the header can be placed anywhere in the stream, you're
good to go.

Technically it is valid for the header to come last.  However, all current implementations will write it first and all future implementations are strongly advised to write it first.  I think it is reasonable for JT to specify in the docs for his format that "the header must appear first" and leave it at that.  Alternatively, creating a custom container format that writes the header as a separate message would also work, though it would take more effort.

Alek Storm

Jul 17, 2008, 1:31:00 PM7/17/08
to Protocol Buffers
On Jul 17, 11:58 am, "Kenton Varda" <ken...@google.com> wrote:
> Technically it is valid for the header to come last.  However, all current
> implementations will write it first and all future implementations are
> strongly advised to write it first.  I think it is reasonable for JT to
> specify in the docs for his format that "the header must appear first" and
> leave it at that.  Alternatively, creating a custom container format that
> writes the header as a separate message would also work, though it would
> take more effort.

I just don't like it. I can see this being a problem if a streaming
serializer is ever implemented. Perfectly legal Protocol Buffer
messages would be considered invalid. But hey, it's his
implementation. And I sound grouchy because I got no sleep last night
whatsoever - sorry :)

JT Olds

Jul 17, 2008, 3:47:15 PM7/17/08
to Protocol Buffers, ken...@google.com
As I'm looking into this, a useful piece of documentation would be
something like "Life of a Protocol Buffer" that explains what happens
to a protocol buffer as it's both serialized and parsed. It would be
nice if such a document made reference to the various classes and
parts of the codebase that did actual logic on protocol buffer
processing.

Is that something someone could do and put up on the docs? It's slow
going otherwise, trying to follow the code path with only the existing
public documentation.

Kenton Varda

Jul 17, 2008, 4:38:16 PM7/17/08
to JT Olds, Protocol Buffers
I've put that on my todo list, but I don't know when I'll have time.  If someone else wants to take a crack at it, let me know.

The short story is:

serialization:  Generated code iterates through all fields and calls functions defined in wire_format_inl.h to serialize them to a CodedOutputStream, which in turn writes to a ZeroCopyOutputStream, which is an abstract interface that can send the data wherever it wants.

parsing:  Generated code repeatedly reads a tag from the CodedInputStream, switches on it, then reads the corresponding value by calling functions in wire_format_inl.h.

This is when optimize_for = SPEED.  When optimize_for = CODE_SIZE (the default), then instead of generated code, the generic code in wire_format.cc is used, but otherwise operates the same.

I guess I'm not completely sure what information it is that you want to know.

Jon Skeet

Aug 12, 2008, 9:39:12 AM8/12/08
to Protocol Buffers
Just to revisit this thread, I've been considering the prospect of
streaming, both for reading and writing.

My use case is something like this: I want to write log messages to a
file. I don't want to have to buffer them, I just want to be able to
write them as I generate them (with suitable synchronization between
threads, of course). I want to be able to read them in one at a time
and process or ignore them, potentially from a very big file.

I see two simple ways of doing this:
1) Add a "0" tag between buffers, and change the deserialization code
to allow it to (optionally) finish gracefully at a 0 tag (instead of
throwing an exception, which I believe is the current behaviour). The
same CodedInputStream should then be usable to deserialize the next
message, etc.

2) Add a length prefix before each buffer, and then limit the
CodedInputStream. In fact, I think this can be done on top of
CodedInputStream with PushLimit/PopLimit with no real extra work, as
that's how submessages are already read.

Option 2 is slightly less efficient than option 1 when writing to a
non-seekable stream, due to the length computation.
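For what it's worth, the length-prefix plumbing for option 2 is tiny. A rough, self-contained sketch (plain C++, no protobuf dependency; all names here are made up):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Append a base-128 varint (as protobuf uses) to out.
void WriteVarint(uint64_t value, std::vector<uint8_t>* out) {
  while (value >= 0x80) {
    out->push_back(static_cast<uint8_t>(value) | 0x80);
    value >>= 7;
  }
  out->push_back(static_cast<uint8_t>(value));
}

// Write one record: varint length prefix, then the raw payload bytes.
void WriteRecord(const std::string& payload, std::vector<uint8_t>* out) {
  WriteVarint(payload.size(), out);
  out->insert(out->end(), payload.begin(), payload.end());
}

// Read one record starting at *pos; advances *pos past it.
std::string ReadRecord(const std::vector<uint8_t>& in, size_t* pos) {
  uint64_t len = 0;
  int shift = 0;
  while (in[*pos] & 0x80) {  // continuation bit set: more varint bytes
    len |= static_cast<uint64_t>(in[*pos] & 0x7F) << shift;
    shift += 7;
    ++(*pos);
  }
  len |= static_cast<uint64_t>(in[*pos]) << shift;
  ++(*pos);
  std::string payload(in.begin() + *pos, in.begin() + *pos + len);
  *pos += len;
  return payload;
}
```

Each payload would be one serialized message; the reader can parse or skip each record independently without ever buffering the whole stream.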

It wouldn't take a lot to add this "container" functionality as a pair
of classes on top of CodedInputStream/CodedOutputStream - can anyone
think of a reason not to just try it for my C# port? I could then
reasonably easily port it back to Java if it were deemed useful. The
C++ would be left up to someone else though :)

Jon

Kenton Varda

Aug 12, 2008, 1:02:27 PM8/12/08
to Jon Skeet, Protocol Buffers
Option 2 sounds better to me.  Option 1 might work better if you used an end-group tag instead of zero to delimit the message.  CodedInputStream already has facilities for stopping at an end-group tag, so then you wouldn't have to modify it.

In both cases, if you also wrote something that looked like a tag *before* each message (a start-group tag for option 1 and a length-delimited tag for option 2), then your overall format would be identical to that of a repeated message, but you could still read it in a streaming fashion.  This could be a useful property.

Jon Skeet

Aug 12, 2008, 2:00:02 PM8/12/08
to Protocol Buffers
On Aug 12, 6:02 pm, "Kenton Varda" <ken...@google.com> wrote:
> Option 2 sounds better to me.  Option 1 might work better if you used an
> end-group tag instead of zero to delimit the message.  CodedInputStream
> already has facilities for stopping at an end-group tag, so then you
> wouldn't have to modify it.
> In both cases, if you also wrote something that looked like a tag *before*
> each message (a start-group tag for option 1 and a length-delimited tag for
> option 2), then your overall format would be identical to that of a repeated
> message, but you could still read it in a streaming fashion.  This could be
> a useful property.

Right, yes. So option 2 would mean each record was:
Byte 2 (length delimited, field 0)
Varint: length
Data

If that's right, it does indeed sound pretty much trivial to do. I'll
give it a try at the next opportunity... it would be nice to pseudo-
standardise this across implementations, of course - the client would
need to know that's what they're reading, but there's no reason it
shouldn't be portable.
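Just to spell out the arithmetic: a protobuf tag is (field number << 3) | wire type, and wire type 2 is "length-delimited". For field numbers up to 15 the tag fits in a single byte, so a hypothetical helper would be:

```cpp
#include <cstdint>

// Compose a one-byte protobuf tag: (field_number << 3) | wire_type.
// Wire type 2 means "length-delimited".  Only valid as a single byte
// for field numbers 0..15; larger numbers need a varint-encoded tag.
constexpr uint8_t MakeTag(uint32_t field_number, uint32_t wire_type) {
  return static_cast<uint8_t>((field_number << 3) | wire_type);
}
```

So the record layout above starts with MakeTag(0, 2), i.e. the byte 2; using field 1 instead, the leading byte would be MakeTag(1, 2), i.e. 0x0A.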

Jon

Kenton Varda

Aug 12, 2008, 2:44:09 PM8/12/08
to Jon Skeet, Protocol Buffers
Note that tag zero is not considered valid even with a non-zero wire type.  I think you should just use tag 1.

Kenton Varda

Aug 12, 2008, 2:44:29 PM8/12/08
to Jon Skeet, Protocol Buffers
On Tue, Aug 12, 2008 at 11:44 AM, Kenton Varda <ken...@google.com> wrote:
Note that tag zero is not considered valid even with a non-zero wire type.  I think you should just use tag 1.

Sorry, I meant field number zero is not valid.

Jon Skeet

Aug 12, 2008, 2:55:31 PM8/12/08
to Protocol Buffers
On Aug 12, 7:44 pm, "Kenton Varda" <ken...@google.com> wrote:
> On Tue, Aug 12, 2008 at 11:44 AM, Kenton Varda <ken...@google.com> wrote:
> > Note that tag zero is not considered valid even with a non-zero wire type.
> >  I think you should just use tag 1.
>
> Sorry, I meant field number zero is not valid.

I was just about to check :)

That strikes me as a good reason *to* use it - it makes it crystal
clear that this isn't a field of another message. In fact it means we
might even be able to recover to some extent if we lost track of where
we were - although in my experience as soon as data starts going bad,
it's best to barf. It could help in manual recovery, maybe :)

The Java code certainly checks that the *tag* wasn't 0, but I don't
see anything which would care if the field were 0. In some ways I'd
prefer that things *did* check that, as it would make sure we didn't
"overread" or come in in the middle. I'm not sure what would happen if
something tried to read this as a DynamicMessage - I think we'd just
end up with an UnknownFieldSet with lots of LengthDelimited values.

What would the disadvantages of using field 0 be?

Jon

Kenton Varda

Aug 12, 2008, 3:11:51 PM8/12/08
to Jon Skeet, Protocol Buffers
The disadvantage is that you would not be able to define a regular protocol message whose wire format matches your protocol.  Using field number 1, you could define an outer message which simply contains a repeated message.  This would allow you to use the outer message in cases where you know you don't actually need streaming and want to avoid the extra development cost, e.g. for debugging and testing.

Jon Skeet

Aug 12, 2008, 3:25:17 PM8/12/08
to Protocol Buffers
On Aug 12, 8:11 pm, "Kenton Varda" <ken...@google.com> wrote:
> The disadvantage is that you would not be able to define a regular protocol
> message whose wire format matches your protocol.  Using field number 1, you
> could define an outer message which simply contains a repeated message.
>  This would allow you to use the outer message in cases where you know you
> don't actually need streaming and want to avoid the extra development cost,
> e.g. for debugging and testing.

Ooh, I like it. Yes, that's a very cool feature. As a fortunate
by-product, I think it *may* mean I could use Marc's Northwind sample
data file "as is" - the top layer is just a repeated "Order" field,
which is indeed field 1. It'll make a nice little test case :)

Sold!

Jon

Marc Gravell

Aug 13, 2008, 3:40:36 AM8/13/08
to Protocol Buffers
> >  This would allow you to use the outer message in cases where you know you
> > don't actually need streaming and want to avoid the extra development cost,
> > e.g. for debugging and testing.

Another advantage of using an index of 1 (rather than a deliberate 0
stop) is that it allows you to send different types of messages in the
same stream, exactly as though you had:

message Outer {
  repeated Foo foo = 1;
  repeated Bar bar = 2;
}

Now you can serialize "Foo, Foo, Foo, Bar, Foo, Bar, Bar, Foo" etc.
Which fits very nicely into (for example) an RPC message stream.
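The demultiplexing on the reading side is just the tag arithmetic in reverse; sketching it with a hypothetical helper (plain C++, single-byte tags only):

```cpp
#include <cstdint>

struct Tag {
  uint32_t field_number;
  uint32_t wire_type;
};

// Invert the one-byte tag encoding: the low three bits are the wire
// type, the rest is the field number.  A reader can switch on
// field_number to route Foo (field 1) vs Bar (field 2) records.
inline Tag ParseTag(uint8_t tag_byte) {
  return Tag{static_cast<uint32_t>(tag_byte >> 3),
             static_cast<uint32_t>(tag_byte & 0x07)};
}
```

After reading a tag, the reader consumes the following length-prefixed payload and dispatches it to the handler for that field number.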

This approach seems to work pretty well, and (conveniently for me)
exactly fits the pattern that protobuf-net already uses for streaming
scenarios; for info, I've added a unit test that does precisely this:
it treats the existing NWind binary file as a stream of objects
(rather than buffering them all) and aggregates some values as it goes
(the same tallies etc. previously posted).

I don't know what the code-generated version will make of it, but a
handy side-effect of this for protobuf-net is that the same inner
objects can be re-used, simply swapping the parent from something with
a List<T> (for buffering) to something "IEnumerable<T> and
Add(T)" (for streaming). Which is sweet ;-p

For writing a stream of messages, the length prefix is a minor
inconvenience, but perfectly manageable (it should fit into the
existing code without too much strong-arming); obviously groups are
easier to implement, though...

Marc

edan

Aug 14, 2008, 3:45:27 AM8/14/08
to Protocol Buffers

I am pretty sure that this thread is talking about the issue/question
I have, but I'm not sure, so forgive me if this is off-topic or such a
stupid, obvious question that I should know the answer myself (I am
only getting my feet wet with protobuf since yesterday, so I'm a
n00b).

My problem is summed up in the doc for the "message.h" API:

bool Message::ParseFromIstream(istream* input)

Parse a protocol buffer from a C++ istream. If successful, the entire
input will be consumed.

This is not what I expected - I was expecting that just the next
message would be consumed, and all subsequent messages would be left
on the stream, to be consumed by later calls to ParseFromIstream on
the same istream.

Is getting the behavior I desire what you're talking about?
Is it true that the existing API doesn't support this and I need to
write my own code to do this?
Does anyone have code that already does this?
Is it just me or does it seem like this should Just Work (tm)?
Is this planned to be supported out of the box in the very near
future?

Thanks!

--edan

Jeremy Leader

Aug 14, 2008, 1:46:01 PM8/14/08
to edan, Protocol Buffers
edan, the thing you're missing is that protobuf messages aren't
self-delimiting, since they can have repeated fields and even extra
fields (fields defined in a newer version of the .proto file).

So there's no way for ParseFromIstream to know when it's reached the end
of a message.

--
Jeremy Leader
jle...@oversee.net
