Numerical Formats


hepaminondas

Jul 8, 2008, 8:25:26 AM
to Protocol Buffers
Hi,
Pretty cool stuff!
Could be really useful for data flow in scientific applications.
Just wondering how I'd write a proto which would allow the encoding
of, for example, a Numpy array?

Best,
Miguel

Kenton Varda

Jul 8, 2008, 2:13:57 PM
to hepaminondas, Protocol Buffers
2008/7/8 hepaminondas <hepami...@gmail.com>:

What's a Numpy array, exactly?  Google search suggests it's a multi-dimensional array?

Unfortunately this would have to be pretty ad-hoc at the moment.  You'd do something like:

message Row { repeated int32 element = 1; }
message Matrix { repeated Row row = 1; }

Ugly, I know.  Another problem with this is that a 1-byte tag will be emitted for every element, meaning it's somewhat inefficient.
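
To make the per-element tag cost concrete, here is a minimal Python sketch of how a non-packed `repeated int32` field goes onto the wire. The helper names `encode_varint` and `encode_unpacked` are made up for illustration and are not part of any protobuf library, and the sketch assumes non-negative values only (negative int32 values are encoded differently and take more bytes).

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F          # low 7 bits
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # last byte, high bit clear
            return bytes(out)


def encode_unpacked(field_number: int, values: list[int]) -> bytes:
    """Encode a non-packed repeated varint field: one tag per element."""
    tag = encode_varint((field_number << 3) | 0)  # wire type 0 = varint
    return b"".join(tag + encode_varint(v) for v in values)


if __name__ == "__main__":
    row = [1, 2, 3, 300]
    encoded = encode_unpacked(1, row)
    # Every element is preceded by the same tag byte 0x08.
    print(encoded.hex(" "))
    print(f"{len(row)} tag bytes out of {len(encoded)} total")
```

For small values the tag roughly doubles the size of each element, which is the inefficiency mentioned above.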

Another option is to just pack all the data arbitrarily and store them in a "bytes" field, but that sort of defeats the purpose of protocol buffers.

I have plans to add a new "packed" encoding for repeated fields, which would allow there to be a single tag for an entire repeated field, but it's not there yet, and it still doesn't provide good multi-dimensional support.

hepaminondas

Jul 8, 2008, 6:08:22 PM
to Protocol Buffers


On Jul 8, 8:13 pm, "Kenton Varda" <ken...@google.com> wrote:
> 2008/7/8 hepaminondas <hepaminon...@gmail.com>:
>
>
> What's a Numpy array, exactly?  Google search suggests it's a
> multi-dimensional array?
>
Google's right :)

> Unfortunately this would have to be pretty ad-hoc at the moment.  You do
> something like:
>
> message Row { repeated int32 element = 1; }
> message Matrix { repeated Row row = 1; }
>
> Ugly, I know.  Another problem with this is that a 1-byte tag will be
> emitted for every element, meaning it's somewhat inefficient.

Yeah, it would considerably increase the overhead. 20% more data.
>
> Another option is to just pack all the data arbitrarily and store them in a
> "bytes" field, but that sort of defeats the purpose of protocol buffers.
>
> I have plans to add a new "packed" encoding for repeated fields, which would
> allow there to be a single tag for an entire repeated field, but it's not
> there yet, and it still doesn't provide good multi-dimensional support.

Actually, that'd already be very helpful, since the actual dimensions
of the matrix could be carried with the data, allowing the original
geometry to be recovered in the end.
Smth like (dunno about the syntax):

message Dimensions { repeated int32 dimension = 1; }
message Data { repeated int32 element = 1; }

message NDimMatrix {
  required Dimensions dim = 1;
  required Data data = 2;
}

Just a thought...
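
A rough sketch of what the application side of that idea might look like, in plain Python with nested lists standing in for an array: the dimensions and a flat element list are what would go into the two fields, and the geometry is rebuilt from them on the receiving end. The function names `flatten` and `unflatten` are hypothetical, and the sketch assumes a non-empty, rectangular nested list in row-major order.

```python
from math import prod


def flatten(nested: list) -> tuple[list[int], list]:
    """Return (dims, flat) for a non-empty rectangular nested list."""
    dims = []
    level = nested
    while isinstance(level, list):
        dims.append(len(level))
        level = level[0]
    flat = nested
    for _ in dims[1:]:
        # Remove one level of nesting per extra dimension.
        flat = [x for sub in flat for x in sub]
    return dims, flat


def unflatten(dims: list[int], flat: list) -> list:
    """Rebuild the nested list from its dimensions and flat elements."""
    if prod(dims) != len(flat):
        raise ValueError("dimensions do not match the number of elements")
    # Regroup from the innermost dimension outwards.
    for size in reversed(dims[1:]):
        flat = [flat[i:i + size] for i in range(0, len(flat), size)]
    return flat


if __name__ == "__main__":
    matrix = [[1, 2, 3], [4, 5, 6]]
    dims, data = flatten(matrix)
    print(dims, data)
    print(unflatten(dims, data) == matrix)
```

Nothing in the .proto itself would check that the product of the dimensions matches the number of elements, so a check like the one in `unflatten` has to live in application code.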

Darren Dale

Jul 8, 2008, 6:49:01 PM
to Protocol Buffers
On Jul 8, 2:13 pm, "Kenton Varda" <ken...@google.com> wrote:
> 2008/7/8 hepaminondas <hepaminon...@gmail.com>:
>
>
>
> > Hi,
> > Pretty cool stuff!
> > Could be really useful for data flow in scientific applications.
> > Just wondering how I'd write a proto which would allow the encoding
> > of, for example, a Numpy array?

I also got really excited for exactly the same reason, data flow and
storage in scientific applications.

> What's a Numpy array, exactly?  Google search suggests it's a
> multi-dimensional array?
>
> Unfortunately this would have to be pretty ad-hoc at the moment.  You do
> something like:
>
> message Row { repeated int32 element = 1; }
> message Matrix { repeated Row row = 1; }
>
> Ugly, I know.  Another problem with this is that a 1-byte tag will be
> emitted for every element, meaning it's somewhat inefficient.
>
> Another option is to just pack all the data arbitrarily and store them in a
> "bytes" field, but that sort of defeats the purpose of protocol buffers.
>
> I have plans to add a new "packed" encoding for repeated fields, which would
> allow there to be a single tag for an entire repeated field, but it's not
> there yet, and it still doesn't provide good multi-dimensional support.

Thanks for considering support for such a feature, it would make
protocol buffers extremely useful in our field.

Regards,
Darren

Scott Stafford

Aug 4, 2008, 7:25:19 PM
to Protocol Buffers
This feature would be very useful to us, too. We frequently want to
pack fixed-size, variable-sized, and symmetric matrices (which can be
packed with only the upper triangle). Right now we'd need to use the
form suggested earlier, which would (1) require a one-byte tag before each
double and (2) would not enforce our rules at the specification
level. The packed array type would solve (1), which removes most of
the "efficiency" argument. Solving (2) would require even more
changes to protobuf.

We imagine writing wrapper code to set the protobuf array from the
existing matrix, like: packSymmetric(protoBufMatrix, ourMatrix) and
unpackSymmetric(...). Unfortunately, that leaves the description of
"how to interpret" out of the .proto. But at least it's simpler to
use than ASN.1. ;)
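
For illustration, wrapper functions of that kind could look roughly like the Python sketch below, which stores only the upper triangle of a symmetric matrix in a flat list (the list standing in for a repeated field). The names `pack_symmetric` and `unpack_symmetric` and the row-by-row ordering of the triangle are assumptions made for this sketch, not an existing API.

```python
from math import isqrt


def pack_symmetric(matrix: list[list[float]]) -> list[float]:
    """Flatten the upper triangle (including the diagonal), row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i, n)]


def unpack_symmetric(packed: list[float]) -> list[list[float]]:
    """Rebuild the full n x n symmetric matrix from its upper triangle."""
    # The triangle holds n(n+1)/2 values; solve for n and verify.
    n = (isqrt(8 * len(packed) + 1) - 1) // 2
    if n * (n + 1) // 2 != len(packed):
        raise ValueError("length is not a triangular number")
    matrix = [[0.0] * n for _ in range(n)]
    k = 0
    for i in range(n):
        for j in range(i, n):
            matrix[i][j] = matrix[j][i] = packed[k]  # mirror across diagonal
            k += 1
    return matrix


if __name__ == "__main__":
    m = [[1.0, 2.0, 3.0],
         [2.0, 4.0, 5.0],
         [3.0, 5.0, 6.0]]
    packed = pack_symmetric(m)
    print(packed)
    print(unpack_symmetric(packed) == m)
```

The "how to interpret" rule (upper triangle, row by row) lives only in this code, which is exactly the gap relative to the .proto noted above.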


Kenton Varda

Aug 4, 2008, 10:37:28 PM
to Scott Stafford, Protocol Buffers
Hate to say it, but I don't think it's likely that protobufs will ever support multi-dimensional arrays.  I think it adds too much complication with too few potential users.  If we do too much of that, eventually protobufs will no longer be simpler than ASN.1.

Scott Stafford

Aug 18, 2008, 8:38:30 PM
to Protocol Buffers
Yeah, I'm not surprised that you don't see enough use-cases for
multidimensional arrays, I sort of expected that. We're looking at
going ahead with a custom patch to add in the couple features we need
that are missing.

Have you considered syntaxes for the packed array you're planning?
I'd like to parallel your roadmap to ease integration, and make the
patch available too if possible.

Now, I expect protobuf, if it had packed array support, would only
have variable-length arrays and put them into a std::vector or
something. But we want a bit more for ourselves -- I was considering
a syntax something like this, where if a number is specified in the
[]'s then it would be a fixed-size dimension and otherwise variable-
sized, and it would pack/unpack itself into the datatypes we use
through the codebase: boost::ublas::matrix and/or
boost::ublas::c_matrix.

message Test
{
required double [][3] Nx3matrix = 1;
required int32 [2][2] 2x2matrix = 2;
required double [] Nx1vector = 3;
}

Thoughts?


Kenton Varda

Aug 18, 2008, 8:50:21 PM
to Scott Stafford, Protocol Buffers
Consider using options:

  required double Nx3matrix = 1 [x_size=-1, y_size=3];

You can add new options to the language by just modifying descriptor.proto -- no need to change the parser.  In fact, we're working on a change which will allow you to define options without modifying descriptor.proto (or the protocol compiler) at all, by defining them as extensions of the messages in descriptor.proto.

Packed arrays will use option syntax:

  repeated int32 foo = 1 [packed = true];

The wire format will use the "length-delimited" wire type and pack values inside that using the encoding appropriate for their type.  So, for int32, the length-delimited bytes will contain a sequence of varints.
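
As a sketch of that layout, the Python below builds the bytes for a packed repeated varint field: one tag with the length-delimited wire type, a varint byte count, then the element varints back to back. The helper names are hypothetical, and only non-negative values are handled.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit set
        else:
            out.append(byte)
            return bytes(out)


def encode_packed(field_number: int, values: list[int]) -> bytes:
    """Encode a packed repeated varint field: tag, length, then payload."""
    payload = b"".join(encode_varint(v) for v in values)
    tag = encode_varint((field_number << 3) | 2)  # wire type 2 = length-delimited
    return tag + encode_varint(len(payload)) + payload


if __name__ == "__main__":
    # A single tag and a single length prefix cover the whole array.
    print(encode_packed(1, [1, 2, 3]).hex(" "))
```

With a one-byte tag, the fixed overhead is two bytes for any payload shorter than 128 bytes, instead of one tag byte per element.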

SM

Aug 19, 2008, 2:30:11 AM
to Protocol Buffers
Why not netCDF (PyNIO or other linked to http://www.unidata.ucar.edu/software/netcdf/)
or HDF (pytables, or others at http://en.wikipedia.org/wiki/Hierarchical_Data_Format)?
These are designed for large scientific datasets, have nice python
bindings, and play nice with numpy. They are eminently unsuitable for
the high-speed message passing that PB is in fact exceptionally well
suited for, but it sounds like they might fit your data better. And
they are way nicer to high-precision floats.

S.M.

Alek Storm

Aug 19, 2008, 2:39:37 AM
to Protocol Buffers
On Aug 18, 7:50 pm, "Kenton Varda" <ken...@google.com> wrote:
> Packed arrays will use option syntax:
>
>   repeated int32 foo = 1 [packed = true];
>
> The wire format will use the "length-delimited" wire type and pack values
> inside that using the encoding appropriate for their type.  So, for int32,
> the length-delimited bytes will contain a sequence of varints.

Could you consider using the 'delimiter = LENGTH' option I suggested
in http://groups.google.com/group/protobuf/browse_thread/thread/50aa6cb61a809a3c/5c71b9c5b6da6be0#5c71b9c5b6da6be0?
If we end up implementing that, or some subset of it, I think we
should use consistent syntax.

Kenton Varda

Aug 19, 2008, 3:58:08 PM
to Alek Storm, Protocol Buffers
Your "delimiter = LENGTH" option was for embedded messages.  It seems orthogonal to packed arrays.

Alek Storm

Aug 26, 2008, 11:43:42 PM
to Protocol Buffers
On Aug 19, 2:58 pm, "Kenton Varda" <ken...@google.com> wrote:
> Your "delimiter = LENGTH" option was for embedded messages.  It seems
> orthogonal to packed arrays.

I was under the impression they used the same encoding scheme - prefix
a bunch of blobs with a length specifier.

Kenton Varda

Aug 27, 2008, 1:33:23 PM
to Alek Storm, Protocol Buffers
On Tue, Aug 26, 2008 at 8:43 PM, Alek Storm <alek....@gmail.com> wrote:
> I was under the impression they used the same encoding scheme - prefix
> a bunch of blobs with a length specifier.

Maybe, in a very abstract sense...  if you think of the tag-value pairs in an embedded message as being "blobs" in the same sense as individual values in a packed array.  But I think they're pretty different.  Anyway, it would make sense to define a packed array of messages, but if the "delimiter" option has another meaning when used with messages then that doesn't work.

Alek Storm

Aug 27, 2008, 5:22:03 PM
to Protocol Buffers
On Aug 27, 12:33 pm, "Kenton Varda" <ken...@google.com> wrote:
> Maybe, in a very abstract sense...  if you think of the tag-value pairs in
> an embedded message as being "blobs" in the same sense as individual values
> in a packed array.  But I think they're pretty different.  Anyway, it would
> make sense to define a packed array of messages, but if the "delimiter"
> option has another meaning when used with messages then that doesn't work.

Yeah, they're probably too different. Never mind, thanks.

Tim

Feb 23, 2009, 10:26:24 PM
to Kenton Varda, prot...@googlegroups.com
Hi Kenton. I was wondering if you had any update on implementing
packed repeated fields using wire type 2. I'm evaluating GPB for use
in an embedded device, and love it for its ability to generalize data
storage/serialization/introspection. But having a tag for each
repeated element is kind of a deal-breaker for us due to space
requirements. Especially since we're planning to use 4 byte tags
across the board in our system so that there's enough space for a
unique tag for each attribute in our system.
Curious, and Thanks,
Tim


Kenton Varda

Feb 23, 2009, 10:30:12 PM
to Tim, prot...@googlegroups.com
On Mon, Feb 23, 2009 at 7:26 PM, Tim <timb...@gmail.com> wrote:
> Hi Kenton. I was wondering if you had any update on implementing
> packed repeated fields using wire type 2. I'm evaluating GPB for use
> in an embedded device, and love it for its ability to generalize data
> storage/serialization/introspection. But having a tag for each
> repeated element is kind of a deal-breaker for us due to space
> requirements. Especially since we're planning to use 4 byte tags
> across the board in our system so that there's enough space for a
> unique tag for each attribute in our system.
> Curious, and Thanks,
> Tim

It's implemented in SVN, and will be in the next release (2.0.4), whenever that may be.  Just stick [packed=true] on to any repeated scalar field:
  repeated int32 foo = 1 [packed=true];
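
For a feel of what this saves when tags are large, here is a back-of-the-envelope Python sketch that only counts bytes. It assumes varint-encoded, non-negative elements; the function names are made up for illustration, and the field number 262144 is just an example of one whose tag needs four bytes.

```python
def varint_size(value: int) -> int:
    """Number of bytes a non-negative integer occupies as a varint."""
    size = 1
    while value >= 0x80:
        value >>= 7
        size += 1
    return size


def tag_size(field_number: int) -> int:
    """Size of a field tag; the 3 wire-type bits never add an extra byte."""
    return varint_size(field_number << 3)


def unpacked_size(field_number: int, values: list[int]) -> int:
    """Total size when every element carries its own tag."""
    return sum(tag_size(field_number) + varint_size(v) for v in values)


def packed_size(field_number: int, values: list[int]) -> int:
    """Total size with one tag and one length prefix for the whole field."""
    payload = sum(varint_size(v) for v in values)
    return tag_size(field_number) + varint_size(payload) + payload


if __name__ == "__main__":
    field = 262144          # smallest field number with a 4-byte tag
    values = [1] * 100      # one hundred small elements
    print(tag_size(field))
    print(unpacked_size(field, values), packed_size(field, values))
```

In this example the non-packed form spends four of every five bytes on tags, while the packed form pays for the tag once.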