arrays??


sergei175

Oct 8, 2009, 11:35:48 AM
to Protocol Buffers

Hi,

I've looked at Protocol Buffers, and I've noticed that there is no
support for arrays of values (doubles, integers). This is a significant
drawback; for example, JSON, HDF5, etc. all have this.

One post suggested putting an array into a field as one single string.
I did this, and the performance was very bad in Java and very
memory-consuming (compared to standard Java serialization).
I wrote the same array (10,000 double numbers) 500 times, and after
array 500 my computer was out of memory.

Secondly, all tutorials suggest that the file should be written at
once, i.e. at the end of the program, when the messages are filled.
I want to write data to disk in several steps: say, write one record
first (one array), then append data to the existing file, and so on.
This way I will not need to keep all records in memory. The merge
mechanism shown in the tutorial seems to parse the old file first,
then add the new record and write a new file.

Do I understand this correctly? If yes, then Protocol Buffers is not
very good for large data volumes, especially with numerical arrays.

best wishes, Sergei

Constantinos Michael

Oct 8, 2009, 11:46:46 AM
to sergei175, Protocol Buffers
On Thu, Oct 8, 2009 at 5:35 PM, sergei175 <serg...@googlemail.com> wrote:


> Hi,
>
> I've looked at protocol buffers, and I've noted that there is no support for arrays
> of values (double, integers). This is a significant drawback, for example
> JSOM, HDF5 etc they all have this.

Have you looked at "repeated" fields? You can define one like so:

repeated double my_number = 1;
 

Marc Gravell

Oct 8, 2009, 11:48:42 AM
to Protocol Buffers
For basic types, you can also use packed encoding to reduce the space
required; just add [packed=true] to a "repeated" element.

Marc
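
To make the size question concrete, here is a rough Java sketch of what a packed repeated double field looks like on the wire: one tag, one varint byte count, then 8 little-endian bytes per value. PackedDoubleSketch and its methods are made-up names for illustration and are not part of the protobuf library; normally the generated code does all of this for you.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of the protobuf library: hand-encodes one
// packed repeated double field to show what ends up on the wire.
class PackedDoubleSketch {

    // Base-128 varint: 7 payload bits per byte, high bit set on all but the last byte.
    static void writeVarint(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    static byte[] encodePacked(int fieldNumber, double[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Tag: field number plus wire type 2 (length-delimited).
        writeVarint(out, (fieldNumber << 3) | 2);
        // Payload length in bytes: 8 per double.
        writeVarint(out, values.length * 8);
        // The values themselves, as little-endian IEEE 754 doubles.
        ByteBuffer buf = ByteBuffer.allocate(values.length * 8).order(ByteOrder.LITTLE_ENDIAN);
        for (double v : values) {
            buf.putDouble(v);
        }
        out.write(buf.array(), 0, buf.capacity());
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] encoded = encodePacked(2, new double[10000]);
        System.out.println(encoded.length + " bytes for 10,000 doubles");
    }
}
```

For 10,000 doubles this comes to 80,004 bytes: 4 bytes of framing on top of the raw 80,000, so packed encoding adds no per-element overhead.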


sergei175

Oct 8, 2009, 12:21:04 PM
to Protocol Buffers

Hi,

This is exactly what I did before putting arrays into a string.
When I implemented arrays via repeated fields, the program was
even slower, and the file size was too large (compared to the Java
serialization mechanism + zip).
This is why I moved my array into a string, thinking that there
would be no significant overhead in storing such an object. I guess each
repeated field uses some additional bits to store it.

Yes, I used [packed=true] for the "double" field. I did not check what
will happen after removing it (probably the file size will be even
bigger!)


cheers, Sergei

Henner Zeller

Oct 8, 2009, 12:33:35 PM
to sergei175, Protocol Buffers
Hi,

>  This is exactly what I've done before putting arrays into a string.
>  When I've implemented arrays via repeated fields, the program was
> even slower,
>  and the file size was too large (compare to Java serialization
> mechanism+ zip).

If you put the values in a string and do your own array management on
top, as compared to using a repeated field with the packed option, there
should not be a significant difference, because it is essentially the
same.
Protobufs don't come with compression, so if you compare the sizes,
you need to compare compressed Java serialization with compressed
proto serialization.

If you provide an example of what you want to do and of the current
solutions you are comparing, people on this list might be able to
help.

-h

sergei175

Oct 8, 2009, 1:57:33 PM
to Protocol Buffers

OK, this is a simple example of a proto file.
I want to write 1000 "Records". Each record has its name and a
"NamedArray".

Each array has its name and a set of double numbers. For my example,
I filled the array with 10,000 numbers for all 1000 Records.

There are 3 things you will see:

1) After event 500, even 200MB of memory is not enough.
2) It's slower by a factor of ~5 compared to Java serialization with
compression.
3) The file size is very large. I do not know how to write
compressed records on the fly using this package.

Finally, there is not even a sensible approach to appending new
"Records" to the existing file (without "merge", which in fact has to
parse the existing file first!)

So, I do not see any superiority of Protocol Buffers compared
to the usual file formats; it's actually much worse when it comes to
such situations.


******************************************************
// organize in repeated records
message Record {

  optional string name = 1;

  // a named array of doubles, stored with packed encoding
  message NamedArray {
    required string name = 1 [default = "none"];
    repeated double value = 2 [packed=true];
  }
  optional NamedArray array = 2;
}

// top-level container holding all records
message PBuffer {
  repeated Record record = 1;
}

************************************************

Henner Zeller

Oct 8, 2009, 2:16:19 PM
to sergei175, Protocol Buffers
Hi,

On Thu, Oct 8, 2009 at 10:57, sergei175 <serg...@googlemail.com> wrote:
>
>
>  Ok, this is a simple example of proto buffers file.
>  I want to write 1000 "Records". Each record has its name and
> "NamedArray"
>
>  Each array has its name and a set of double numbers,  For my example,
>  I've filled array with 10 000 numbers for all 1000 Records.
>
>  There are 2 things you will see:
>
>  1) After event 500, even 200MB memory is not enough.
>  2) It's slower by factor ~5 compare to the java serialization with
> the
>    compression.

So for Java serialization, you have a class that contains an
ArrayList<NamedArray>, with the NamedArray objects each containing a
Vector<Double>, and you then serialize the whole ArrayList<NamedArray>
to disk?

>  3) File size is very large. I do not know how to fill
>     compressed recorsd on fly using this package.

If you want to write independent records, you should write them
delimited to a file and not keep everything in memory.
Regarding compression: you write the stuff to a stream eventually, so
you can wrap that with a GZIPOutputStream. I guess that is what you
do with the Java serialization with compression as well.

>  Finally, there is no even sensible approach to append new "Records"
>  to the existing file (without "merge", which in fact has to parse
> the
>  existing file first!)

Protocol buffers don't provide the transport or storage layer. They
provide the encoding. You have to provide the storage yourself. A
simple default implementation might be useful to start with, but many
people would still need to write their own way of storing things.
OTOH, it is only a handful of lines to write it yourself.

For things like this (and it has been discussed many times on this
list), you should write out delimiters telling the size of the next
record, followed by the record itself. I think something has even been
added to the API recently to make this simpler (don't know,
I use my own implementation ;) )
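
The size-prefix idea can be sketched in a few lines of plain Java. This is only an illustration under assumptions: RecordFileSketch, append and readEach are hypothetical names, the prefix is a fixed 4-byte big-endian int (a choice made for this sketch, not something the library mandates), and the payload is any byte array, which in practice would come from a message's toByteArray().

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

// Hypothetical sketch of a record file: each record is a 4-byte length
// followed by that many payload bytes.
class RecordFileSketch {

    // Opens the file in append mode and adds one record at the end.
    // Nothing already in the file is read or parsed.
    static void append(File file, byte[] payload) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(file, true))) {
            out.writeInt(payload.length);   // DataOutputStream always writes big-endian
            out.write(payload);
        }
    }

    // Hands records to the handler one at a time, so only one record
    // needs to be in memory at once. Returns the number of records read.
    static int readEach(File file, Consumer<byte[]> handler) throws IOException {
        int count = 0;
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            while (true) {
                int length;
                try {
                    length = in.readInt();
                } catch (EOFException endOfFile) {
                    break;   // EOF while reading a length prefix ends the loop
                }
                if (length < 0) {
                    throw new IOException("corrupt length prefix: " + length);
                }
                byte[] payload = new byte[length];
                in.readFully(payload);      // throws EOFException if the record is truncated
                handler.accept(payload);
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("records", ".bin");
        file.deleteOnExit();
        append(file, "first".getBytes(StandardCharsets.UTF_8));
        append(file, "second".getBytes(StandardCharsets.UTF_8));   // reopens and appends
        int count = readEach(file, payload -> System.out.println(payload.length + " bytes"));
        System.out.println(count + " records");
    }
}
```

With real messages, the handler would parse each payload with the generated class's parseFrom(byte[]) instead of printing its length.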

-h

Kenton Varda

Oct 8, 2009, 2:24:39 PM
to sergei175, Protocol Buffers
On Thu, Oct 8, 2009 at 10:57 AM, sergei175 <serg...@googlemail.com> wrote:
> 1) After event 500, even 200MB memory is not enough.
> 2) It's slower by factor ~5 compare to the java serialization with the
>    compression.

Protocol Buffers do not include compression, so to make this comparison fair you would need to add compression on top of them too.  If your speed is dominated by file I/O time (likely!) then you might find that this makes protocol buffers faster.
 
> 3) File size is very large. I do not know how to fill
>    compressed recorsd on fly using this package.

Use java.util.zip.GZIPOutputStream.
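
A minimal sketch of that wrapping, using only the JDK. GzipSketch and its method names are hypothetical, and the zero-filled array merely stands in for serialized message bytes; real numeric data will usually compress far less well than zeros do.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: gzip is layered on the stream, independent of
// whatever encoding produced the bytes.
class GzipSketch {

    static byte[] compress(byte[] raw) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // Closing the GZIPOutputStream writes the gzip trailer.
        try (GZIPOutputStream gzip = new GZIPOutputStream(sink)) {
            gzip.write(raw);
        }
        return sink.toByteArray();
    }

    static byte[] decompress(byte[] compressed) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = gzip.read(buffer)) != -1) {
                sink.write(buffer, 0, read);
            }
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = new byte[80000];   // stand-in for one serialized record
        byte[] compressed = compress(raw);
        System.out.println(raw.length + " bytes raw, " + compressed.length + " bytes gzipped");
        System.out.println("round trip ok: " + java.util.Arrays.equals(raw, decompress(compressed)));
    }
}
```

When writing to disk, the same wrapper can sit on top of a FileOutputStream, and the message is then written to the gzip stream rather than to the file directly.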
 
> Finally, there is no even sensible approach to append new "Records"
> to the existing file (without "merge", which in fact has to parse the
> existing file first!)


Protocol Buffers convert between raw bytes and structures.  They are not intended to provide a mechanism for managing multiple individually-loadable records.  If you have a very large data set, you need to split that set into individual records in order to avoid reading/writing the whole thing at once.  Each individual record can be encoded using protobufs, but you should not encode the entire file as a protobuf.
 
> So, I do not see any superiority of Protocol Buffers compare
> to use file formats, it's actually much worst as it come to such
> situations..

By all means, don't use them then.

sergei175

Oct 8, 2009, 10:05:29 PM
to Protocol Buffers

Thanks, I'm starting to understand this better. Indeed, I have to
implement my own approach for I/O; protobuf alone is not enough. I only
worry that my own I/O for reading/writing records will not be
cross-platform, so I could not benefit from the strength of this
package.
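
On the cross-platform worry: a length prefix that is written one byte at a time does not depend on the machine's byte order, so any language can read it back. Below is a rough sketch of a base-128 varint prefix. VarintPrefixSketch and its methods are hypothetical names. As far as I know this is also the framing the Java library's writeDelimitedTo()/parseDelimitedFrom() use, but treat that as an assumption and check the documentation for your version.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical sketch of a byte-order-independent length prefix.
class VarintPrefixSketch {

    // Writes the length 7 bits at a time, lowest bits first; the high bit
    // of each byte says whether another byte follows.
    static void writeLength(OutputStream out, int length) throws IOException {
        while ((length & ~0x7F) != 0) {
            out.write((length & 0x7F) | 0x80);
            length >>>= 7;
        }
        out.write(length);
    }

    // Reads a length written by writeLength. At most 5 bytes are consumed,
    // which is enough for any 32-bit value.
    static int readLength(InputStream in) throws IOException {
        int result = 0;
        for (int shift = 0; shift < 32; shift += 7) {
            int b = in.read();
            if (b < 0) {
                throw new EOFException("stream ended inside a length prefix");
            }
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
        }
        throw new IOException("length prefix longer than 5 bytes");
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeLength(out, 80004);
        byte[] prefix = out.toByteArray();
        System.out.println("prefix uses " + prefix.length + " bytes");
        System.out.println("decoded: " + readLength(new ByteArrayInputStream(prefix)));
    }
}
```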


On Oct 8, 1:24 pm, Kenton Varda <ken...@google.com> wrote:
> On Thu, Oct 8, 2009 at 10:57 AM, sergei175 <sergei...@googlemail.com> wrote:
> >  1) After event 500, even 200MB memory is not enough.
> >  2) It's slower by factor ~5 compare to the java serialization with
> > the
> >    compression.
>
> Protocol Buffers do not include compression, so to make this comparison fair
> you would need to add compression on top of them too.  If your speed is
> dominated by file I/O time (likely!) then you might find that this makes
> protocol buffers faster.
>
> >  3) File size is very large. I do not know how to fill
> >     compressed recorsd on fly using this package.
>
> Use java.util.zip.GZIPOutputStream.
>
> >  Finally, there is no even sensible approach to append new "Records"
> >  to the existing file (without "merge", which in fact has to parse
> > the
> >  existing file first!)
>
> http://code.google.com/apis/protocolbuffers/docs/techniques.html#stre...