Re: [spctools-discuss] Faster mzML parser


Brian Pratt

Aug 30, 2012, 11:37:39 AM
to spctools...@googlegroups.com
Quite a lot of performance work was done in pwiz in late 2011 - do you
know what version of pwiz is in use?

Brian Pratt

On Thu, Aug 30, 2012 at 2:16 AM, Thomas Dybdal Pedersen
<thom...@gmail.com> wrote:
> Hi
>
> I've recently begun to investigate whether it is time to change our
> pipeline from the mzXML to the mzML format. Our pipeline is partly based on
> xcms, which uses RAMP to read in MS data. A quick comparison of the parsing
> speed between mzML and mzXML showed that mzML was slower by a factor of ~7.
> This is quite substantial, especially for larger files.
>
> I understand that mzML parsing is based on pwiz rather than implemented
> from the ground up, which is probably the cause of the difference. Is there
> any effort to create a more efficient parser for the mzML format as this
> format increases in relevance?
>
> with best wishes
>
> Thomas

Magnus....@gmail.com

Aug 30, 2012, 12:03:12 PM
to spctools...@googlegroups.com
Hi Thomas,
 
Was the mzML file indexed or not? This can sometimes make a big difference if the program you use accesses the data randomly (or one spectrum at a time). mzXML files always have indices, but for mzML they are optional. Of course, the mzML parser must then also make use of the indices - see previous comment.
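
To illustrate why an index matters for random access, here is a minimal sketch. It uses a toy in-memory file with made-up `<spectrum>` records rather than real mzML, and the helper names (`build_index`, `read_indexed`, `read_sequential`) are hypothetical, not part of pwiz or RAMP. With an index the reader can seek straight to a record; without one it has to scan from the start of the file.

```python
import io
import re

# Toy stand-in for a spectrum file: three one-line records, NOT real mzML.
DATA = b"".join(
    b'<spectrum id="scan=%d">payload %d</spectrum>\n' % (i, i)
    for i in range(1, 4)
)


def build_index(buf):
    """Map each spectrum id to the byte offset of its opening tag."""
    return {
        m.group(1).decode(): m.start()
        for m in re.finditer(rb'<spectrum id="([^"]+)">', buf)
    }


def read_indexed(stream, index, spectrum_id):
    """Random access: seek directly to the stored byte offset."""
    stream.seek(index[spectrum_id])
    return stream.readline().decode().rstrip("\n")


def read_sequential(stream, spectrum_id):
    """No index: scan from the beginning until the record is found."""
    stream.seek(0)
    needle = f'id="{spectrum_id}"'.encode()
    for line in stream:
        if needle in line:
            return line.decode().rstrip("\n")
    raise KeyError(spectrum_id)


stream = io.BytesIO(DATA)
index = build_index(DATA)
record = read_indexed(stream, index, "scan=3")
print(record)
```

Both functions return the same record. The difference is that the sequential version reads everything before it, which adds up quickly when a program fetches spectra one at a time from a large file.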
 
 
Cheers,
 
Magnus

Brian Pratt

Aug 30, 2012, 12:11:38 PM
to spctools...@googlegroups.com
Actually the index is an optional element in mzXML, too - a proper
mzXML parser will function without it.

- Brian

Matthew Chambers

Aug 30, 2012, 12:17:39 PM
to spctools...@googlegroups.com
Hi Thomas,

lol wut?

ProteoWizard release: 3.0.3916 (2012-8-27)
ProteoWizard MSData: 3.0.3898 (2012-8-22)

msbenchmark spectra binary c:\test\B06-11071.mzXML
Enumerating spectra: 6687/6687 (78477534 data points)
Time elapsed: 00:00:19.681968

msbenchmark spectra binary c:\test\B06-11071.mzXML
Enumerating spectra: 6687/6687 (78477534 data points)
Time elapsed: 00:00:19.630963

msbenchmark spectra binary c:\test\B06-11071.mzML
Enumerating spectra: 6687/6687 (78477534 data points)
Time elapsed: 00:00:14.470447

msbenchmark spectra binary c:\test\B06-11071.mzML
Enumerating spectra: 6687/6687 (78477534 data points)
Time elapsed: 00:00:14.546455

Actually I'm not sure why mzML is faster here - even though mzXML to mzML means copying the binary
data twice instead of just once, I would expect that to be offset by the bloated metadata in mzML.
These were both converted from RAW with the same settings: no peak picking, default precision (which
in mzML is 64-bit for m/z, 32-bit for intensity, but in mzXML it's 64-bit for both).
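
As a rough back-of-the-envelope check on the precision settings described above, the sketch below computes how many bytes of decoded peak data each format carries. It assumes each of the 78477534 "data points" is one m/z-intensity pair, and it ignores base64 overhead, compression, and metadata, so it is only one possible contributing factor, not an explanation of the timings.

```python
POINTS = 78_477_534  # data points reported by msbenchmark above

# Bytes per point under the default precisions described above.
MZML_BYTES_PER_POINT = 8 + 4   # 64-bit m/z + 32-bit intensity
MZXML_BYTES_PER_POINT = 8 + 8  # 64-bit m/z + 64-bit intensity

mzml_bytes = POINTS * MZML_BYTES_PER_POINT
mzxml_bytes = POINTS * MZXML_BYTES_PER_POINT
size_ratio = mzml_bytes / mzxml_bytes

print(f"mzML  decoded arrays: {mzml_bytes:,} bytes")
print(f"mzXML decoded arrays: {mzxml_bytes:,} bytes")
print(f"size ratio mzML/mzXML: {size_ratio:.2f}")
```

Under these assumptions the mzML file carries three quarters of the peak data bytes of the mzXML file. That is in the same neighbourhood as the ratio of the timings above (about 14.5 s versus 19.7 s), but the match alone does not establish the cause.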

It's a good idea to explain your benchmarking practice when you come out with "factor ~7" ;)
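
One way to make a benchmark easy to explain is to time repeated runs and report all of them, as the msbenchmark output above does with two runs per file. The following is a minimal sketch of such a harness; `fake_parse` is a placeholder workload standing in for whatever parser call is being measured, and none of these names come from pwiz or RAMP.

```python
import statistics
import time


def benchmark(func, *args, repeats=3):
    """Call func(*args) `repeats` times and return each wall-clock time."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Report the spread, not just a single number."""
    return {
        "runs": len(timings),
        "min": min(timings),
        "median": statistics.median(timings),
        "max": max(timings),
    }


def fake_parse(n):
    """Placeholder workload; replace with the real parsing call."""
    return sum(range(n))


timings = benchmark(fake_parse, 100_000, repeats=3)
summary = summarize(timings)
print(summary)
```

Alongside the numbers, it helps to state the software versions, the file, how it was converted, and whether it was indexed, since each of those can change the result.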

-Matt

Magnus....@gmail.com

Aug 30, 2012, 12:54:25 PM
to spctools...@googlegroups.com
Absolutely right! I should have said that, at least, we encounter non-indexed mzML files much more often than non-indexed mzXML files (for instance, in compassXport 3.0.5 the default is indices with mzXML, no indices with mzML...). We did some benchmarking with our parsers, and having the indices does make a difference, although not as much as a factor of 7.
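
Since converters differ in their defaults, it can be worth checking whether a given file carries an index before benchmarking it. The sketch below is a heuristic only, not a validating parser, and `looks_indexed` is a hypothetical helper name. It relies on two format conventions: indexed mzML wraps the document in an `<indexedmzML>` root element, and mzXML stores the index position in an `<indexOffset>` element near the end of the file.

```python
import os
import re


def looks_indexed(path, probe=4096):
    """Heuristically report whether an mzML or mzXML file has an index."""
    with open(path, "rb") as fh:
        head = fh.read(probe)          # start of file: root element
        fh.seek(0, os.SEEK_END)
        size = fh.tell()
        fh.seek(max(0, size - probe))  # end of file: mzXML index offset
        tail = fh.read()

    # Indexed mzML uses <indexedmzML> as the root wrapper element.
    if b"<indexedmzML" in head:
        return True

    # mzXML: require an <indexOffset> with a non-zero byte position.
    if b"<mzXML" in head:
        match = re.search(rb"<indexOffset>\s*(\d+)\s*</indexOffset>", tail)
        return match is not None and int(match.group(1)) > 0

    return False
```

A call such as `looks_indexed("B06-11071.mzML")` returns `True` or `False`. The check does not verify that the offsets in the index are actually correct.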

Cheers,

Magnus