For the curious about Google Protocol Buffers


Angel

Oct 22, 2008, 3:00:27 PM
to spctools-discuss
So Google released a project called Protocol Buffers (PB2), which is
basically a semi-structured data format (in the same family as XML,
ASN.1 and JSON) for fast and efficient wire transfer of a given data
type specification. The protocol compiler is a C++ program, and there
are APIs to read and write a given format spec in Java, Python, C++
and others.

I was curious as to what this type of encoding would do for spectra.
Here are my test cases:
1) encode a RAW file to mzXML, MGF and PB2 files
2) encode an MGF file (produced from the same RAW file by DeconMSn)
into PB2, mzML and mzXML

For (1) I dug into the source of Manor Askenazi's Multiplierz project
and used the mzAPI python module to read the RAW file, so I was
limited to whatever data mzAPI could pull out of the RAW file to
encode. The mzXML and mzML files consequently carry a lot more
annotations, so take that into account when comparing sizes.
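
For the curious, reading through mzAPI boils down to something like
this (just a sketch; the mzFile/scan_list/scan names follow the mzAPI
paper, and the exact signatures may differ between Multiplierz
versions):
<code>

# Rough sketch of reading a RAW file through Multiplierz's mzAPI.
# Method names follow the mzAPI paper; exact signatures may vary by version.
from mzAPI import raw

r = raw.mzFile('SCX-A1.RAW')           # wraps the vendor's RAW-reading library
for time, precursor in r.scan_list():  # (retention time, precursor m/z) pairs
    peaks = r.scan(time)               # list of (mz, intensity) tuples
    # ... copy num/retinsec/mz/inten into a PB2 Scan message here ...

</code>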

For (1) and (2) I used pwiz to encode the mzML and mzXML with zlib
compression.
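
(Something along these lines; the msconvert flag names here follow the
current pwiz usage text, so treat the exact invocation as approximate:)
<code>

# Approximate pwiz conversion commands (flag names may have changed since):
msconvert SCX-A1.RAW --mzXML --zlib
msconvert SCX-A1.mgf --mzML  --zlib
msconvert SCX-A1.mgf --mzXML --zlib

</code>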

Here is the protocol buffer specification I used, so you can see the
information that is encoded in the PB2 files:

package mz;

message Scan {
  optional int32 num = 1;
  optional string scan_type = 2;
  optional string scan_mode = 3;
  optional float retinsec = 4;
  repeated double mz = 5;
  repeated double inten = 6;
}

message Mz {
  repeated Scan scan = 1;
}

message Ion {
  optional string title = 1;
  optional int32 charge = 2;
  optional double pepmass = 3;
  optional float retinsec = 4;
  repeated double mz = 5;
  repeated double inten = 6;
}

message MGF {
  repeated string header = 1;
  repeated Ion ion = 2;
}
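
For anyone wanting to reproduce this, the spec compiles with protoc
and the generated Python module gets used roughly like this (a sketch;
mz/mzml.proto and the scans iterable are placeholders for my actual
script):
<code>

# Compile the spec into a Python module (the .proto lives at mz/mzml.proto,
# so protoc emits mz/mzml_pb2.py):
#   protoc --python_out=. mz/mzml.proto
from mz.mzml_pb2 import Mz

m = Mz()
for num, rt, mzs, intens in scans:  # 'scans' = whatever your RAW reader yields
    s = m.scan.add()                # append a new Scan sub-message
    s.num = num
    s.retinsec = rt
    s.mz.extend(mzs)                # repeated double fields take whole lists
    s.inten.extend(intens)

out = open('mzscans.pb2', 'wb')
out.write(m.SerializeToString())    # one big, length-less blob (see below)
out.close()

</code>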


Here are the resulting sizes of the files:

34M SCX-A1.RAW (orig)
41M SCX-A1.mzXML (mzconvert from RAW)
2.9M SCX-A1.mgf (DeconMSn from RAW)
9.1M mzscans.pb2 (PB2 from RAW)
9.3M mzscans.mgf (straight from RAW with same info as PB2 from RAW)
6.9M SCX-A1-mgf.mzML (mzconvert from SCX-A1.mgf from DeconMSn)
4.0M SCX-A1-mgf.mzXML (mzconvert from SCX-A1.mgf from DeconMSn)
3.3M mgf_scans.pb2 (PB2 from SCX-A1.mgf from DeconMSn)

Answer:

If space is your primary concern, zlib-compressed base64 strings
actually encode the peaks better (read: less space) than PB2. The
rest of the space in the mzML/mzXML files is just XML overhead. This
is shown by gzipping the files: SCX-A1.mzXML drops to 26M and
SCX-A1-mgf.mzXML to 2.0M.
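
For context, the peak encoding in the XML formats boils down to the
following (mzXML uses network byte order; a sketch that glosses over
the m/z/intensity pairing details):
<code>

# What "zlib compressed base64 strings" means concretely for a peak list:
# pack doubles -> zlib-compress -> base64 (this is what sits in the XML element).
import struct, zlib, base64

peaks = [173.032318115, 174.04, 175.05]            # m/z values, say
raw = struct.pack('>%dd' % len(peaks), *peaks)     # big-endian 64-bit doubles
b64 = base64.b64encode(zlib.compress(raw))

# And back again:
dec = zlib.decompress(base64.b64decode(b64))
out = struct.unpack('>%dd' % (len(dec) / 8), dec)  # 8 bytes per double

</code>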

But all that zipping and encoding takes CPU. PB2 may or may not be
more efficient at encoding/decoding arrays of doubles; I still have to
test this out. One drawback that I am seeing is that PB2 expects a
complete structure when decoding (i.e. you need the entire encoded
file read into memory to decode it back to arrays of doubles).
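
A first stab at that timing test could be as simple as this (a sketch
with timeit; the 100k-peak scan is made up):
<code>

# Rough timing of PB2 decode for a large repeated-double message.
import timeit

setup = '''
from mz.mzml_pb2 import Mz
m = Mz()
s = m.scan.add()
s.mz.extend([float(i) for i in xrange(100000)])
s.inten.extend([float(i) for i in xrange(100000)])
blob = m.SerializeToString()
'''
# Time constructing and parsing a fresh message 100 times.
t = timeit.Timer('Mz().ParseFromString(blob)', setup)
print t.timeit(number=100)

</code>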

Anyway, I'll be working on this some more, on and off, over the next
couple of months. Let me know if you are interested and I'll post it
up someplace for collaboration.

Matthew Chambers

Oct 22, 2008, 3:10:50 PM
to spctools...@googlegroups.com
Neat stuff, Angel. What do you think about rendering the mzML schema
into HDF (where the binary data can be stored without encoding)? It's
still a self-describing and open rendition of the format and it'd
support the same information; it'd just be a different encoding. I can
see Eric cringing now. ;)

Do I understand you correctly that PB2 does not support random access?
That seems odd.

-Matt

Brian Pratt

Oct 22, 2008, 3:37:56 PM
to spctools...@googlegroups.com
Interesting stuff!

>> One drawback that I am seeing is that PB2 expects a complete
>> structure when decoding (i.e. you need the entire encoded file read
>> into memory to decode it back to arrays of doubles.)

Oof - that probably doesn't scale for mass spec data.

>> But all that zipping and encoding takes CPU.

I've generally found that data compression is better than free in terms of
overall performance. CPUs just continue to get faster but disk and network
speeds are more or less stalled out, so reducing bandwidth usage at the
expense of a bit more CPU usage is usually a win.

I've been steadily working my way through the TPP tools (and pwiz, and
LabKey's CPAS), to make .mzXML.gz, .mzdata.gz, and .mzML.gz behave as
additional native input/output formats (also .pep.xml.gz, .prot.xml.gz,
.fasta.gz, etc). This is of interest since I'm working on making TPP and
CPAS amenable to use in the Amazon compute cloud, where network bandwidth
and disk storage are metered, so less is better (but that's true everywhere,
really).
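
In Python terms the reading side boils down to something like this
(just a sketch; the actual TPP/pwiz work is in C++):
<code>

# Open an mzXML transparently whether or not it is gzipped:
# gzip streams start with the magic bytes 0x1f 0x8b.
import gzip

def xopen(path):
    probe = open(path, 'rb')
    magic = probe.read(2)
    probe.close()
    if magic == '\x1f\x8b':
        return gzip.open(path, 'rb')
    return open(path, 'rb')

</code>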

Brian

Angel

Oct 22, 2008, 3:45:18 PM
to spctools-discuss
Please note that this is in no way, shape or form related to the mzML
spec; I am just playing around with encoding formats for my own
amusement. Eric can relax. ;)

On Oct 22, 3:10 pm, Matthew Chambers <matthew.chamb...@vanderbilt.edu>
wrote:
> Neat stuff, Angel. What do you think about rendering the mzML schema
> into HDF (where the binary data can be stored without encoding)? It's
> still a self-describing and open rendition of the format and it'd
> support the same information; it'd just be a different encoding. I can
> see Eric cringing now. ;)

When I tried HDF5 last time, the complexity of encoding into the
format was considerable. Also, since HDF5 relied on NetCDF for the
matrix of doubles (last time I checked), you inherit some rather
limiting constraints on column headers and annotations. Finally, the
file sizes were huge, at least in my naive implementation.
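
(For what it's worth, a naive per-scan layout via h5py would look
something like the following; whether gzip-compressed datasets tame
those file sizes is exactly the open question:)
<code>

# Naive HDF5 layout for the same information, one group per scan.
import h5py

h5 = h5py.File('mzscans.h5', 'w')
for s in m.scan:                       # m = the decoded Mz message from below
    g = h5.create_group('scan%06d' % s.num)
    g.attrs['retinsec'] = s.retinsec
    g.attrs['scan_type'] = s.scan_type
    g.create_dataset('mz', data=list(s.mz), compression='gzip')
    g.create_dataset('inten', data=list(s.inten), compression='gzip')
h5.close()

</code>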

>
> Do I understand you correctly that PB2 does not support random access?

Sorry, I should have been clearer about this point. PB2 does allow
random access to the data structures, but only once they have been
read into memory and decoded. Here is a small Python example that
reads in a binary file (the PB2 created from the RAW file) and prints
out the 766th scan's information (this is the first scan with more
than 0 peaks). Notice that the arrays are zero-indexed. It uses
something on the order of 50 MB of system memory:
<code>

# Load the generated module and decode the whole file in one shot.
from mz.mzml_pb2 import Mz

m = Mz()
f = open("mzscans.pb2", 'rb')
m.ParseFromString(f.read())   # the entire serialized file must be in memory
f.close()

# 766th scan (zero-indexed):
print m.scan[765].num
print m.scan[765].retinsec
print m.scan[765].scan_type
print m.scan[765].scan_mode
print m.scan[765].mz[0]
print m.scan[765].inten[0]

# OUTPUTS
# angel$ python test_read.py
# 766
# 5.30894851685
# MS2
# c
# 173.032318115
# 1.01018822193

</code>

Angel

Oct 22, 2008, 4:01:09 PM
to spctools-discuss
On Oct 22, 3:37 pm, "Brian Pratt" <brian.pr...@insilicos.com> wrote:
> Interesting stuff!
>
> >> One drawback that I am seeing is that PB2 expects a complete
> >> structure when decoding (i.e. you need the entire encoded file read
> >> into memory to decode it back to arrays of doubles.)
>
> Oof - that probably doesn't scale for mass spec data.
>
As I said, it's a drawback ;) I used the dumbest possible approach
for my tests, which was to wrap the whole set of scans in a single PB2
message. Strictly speaking this is abusing the protocol; you would
probably only encode/decode one spectrum at a time in a production
system.
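
The usual workaround is to length-prefix each message yourself, since
the PB2 wire format is not self-delimiting. A rough sketch:
<code>

# Streaming workaround: prefix each Scan with its byte length so the reader
# can decode one scan at a time instead of slurping the whole file.
import struct
from mz.mzml_pb2 import Scan

def write_scans(path, scans):
    f = open(path, 'wb')
    for s in scans:
        blob = s.SerializeToString()
        f.write(struct.pack('<I', len(blob)))  # 4-byte little-endian length
        f.write(blob)
    f.close()

def read_scans(path):
    f = open(path, 'rb')
    while True:
        hdr = f.read(4)
        if len(hdr) < 4:
            break
        s = Scan()
        s.ParseFromString(f.read(struct.unpack('<I', hdr)[0]))
        yield s                                # one Scan in memory at a time
    f.close()

</code>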

> >> But all that zipping and encoding takes CPU.
>
> I've generally found that data compression is better than free in terms of
> overall performance.   CPUs just continue to get faster but disk and network
> speeds are more or less stalled out, so reducing bandwidth usage at the
> expense of a bit more CPU usage is usually a win.  
>

True, but considering that PB2 is what Google uses internally to talk
between their services, I am assuming that the format is built for
efficient wire transfer and fast encoding/decoding.

> I've been steadily working my way through the TPP tools (and pwiz, and
> LabKey's CPAS), to make .mzXML.gz, .mzdata.gz, and .mzML.gz behave as
> additional native input/output formats (also .pep.xml.gz, .prot.xml.gz,
> .fasta.gz, etc).  This is of interest since I'm working on making TPP and
> CPAS amenable to use in the Amazon compute cloud, where network bandwidth
> and disk storage are metered, so less is better (but that's true everywhere,
> really).  

Yeah, I saw that! You definitely have to watch those data transfers
with AWS, since that's really how they make their money. Anywho,
creating AWS images for the TPP sounds like a great thing for the
community. Keep me abreast of the progress.

As a side-note for all the bit-heads out there, you should check out
ParaMEDIC + mpiBLAST:
http://portal.acm.org/citation.cfm?id=1383444
Awesome stuff.
-angel