msconvert of Agilent .d raw files => wrong scan numbering


lgillet

Feb 7, 2011, 10:56:23 AM
to spctools-discuss
Dear group,
While trying to search files obtained upon conversion of Agilent
QTOF .d files to mzXML with msconvert, I realized that the .dta files
created for Sequest search were very few and had a very strange
numbering.
(Windows XP; msconvert versions tried: Build_12_ProteoWizard_r2127 or
pwiz-bin-windows-x86-vc90-1_6_1386 or
pwiz-bin-windows-x86-vc90-release-shared-2_1_2485).

Upon investigating a bit, it seems that msconvert numbers the scans
from the .d files incorrectly.
Here are the first two scans from an mzXML file produced by
msconvert:

-----msconvert---------

<scan num="2567"
scanType="FULL"
centroided="1"
msLevel="1"
peaksCount="428"
polarity="+"
retentionTime="PT2.56S"
basePeakIntensity="179"
totIonCurrent="26996"
msInstrumentID="IC1">
<peaks compressedLen="0"
precision="32"
byteOrder="network"
pairOrder="m/z-int">Q5PZKkCAzM1DlQ...</peaks>
</scan>
<scan num="3560"
scanType="FULL"
centroided="1"
msLevel="1"
peaksCount="422"
polarity="+"
retentionTime="PT3.553S"
basePeakIntensity="282"
totIonCurrent="26252"
msInstrumentID="IC1">
<peaks compressedLen="0"
precision="32"
byteOrder="network"
pairOrder="m/z-int">Q5QsjkEmAABDlKV/...</peaks>
</scan>

----------------------

Here are the same first two scans from the Trapper conversion:
-------trapper converter-------

<scan num="1"
msLevel="1"
peaksCount="428"
polarity="+"
scanType="MS1SurveyScan"
retentionTime="PT2.56S"
lowMz="295.697"
highMz="1953.19"
basePeakMz="1222"
basePeakIntensity="183.267"
totIonCurrent="26996" >
<nativeScanRef coordinateType="Agilent" >
<coordinate name="scan"
value="2567" />
</nativeScanRef>
<peaks precision="32"
byteOrder="network"
contentType="m/z-int"
compressionType="none"
compressedLen="0" >Q5PZKkCAzM1DlQsCQgn...</peaks>
</scan>
<scan num="2"
msLevel="1"
peaksCount="422"
polarity="+"
scanType="MS1SurveyScan"
retentionTime="PT3.553S"
lowMz="296.348"
highMz="1965.69"
basePeakMz="391.288"
basePeakIntensity="283.565"
totIonCurrent="26252" >
<nativeScanRef coordinateType="Agilent" >
<coordinate name="scan"
value="3560" />
</nativeScanRef>
<peaks precision="32"
byteOrder="network"
contentType="m/z-int"
compressionType="none"
compressedLen="0" >Q5QsjkEmAABDlKV/Qb...</peaks>
</scan>

-------------------


As you can see, the scan numbers from the msconvert conversion are
actually, for whatever reason, roughly the time in milliseconds?! (e.g.
scan 1 <=> 2.56 sec <=> scan num="2567")
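
A quick way to check that observation, as a minimal Python sketch: it takes the two `num` / `retentionTime` pairs from the msconvert output above and compares each scan number with the retention time in milliseconds. The helper name `retention_ms` is made up for illustration, and it only handles the simple `PT<seconds>S` form seen in these files.

```python
# Pairs copied from the msconvert output above: (scan num, retentionTime).
SCANS = [(2567, "PT2.56S"), (3560, "PT3.553S")]


def retention_ms(duration: str) -> int:
    """Convert a 'PT<seconds>S' duration string to whole milliseconds."""
    if not (duration.startswith("PT") and duration.endswith("S")):
        raise ValueError(f"unsupported duration: {duration!r}")
    return round(float(duration[2:-1]) * 1000)


# Difference between each scan number and its retention time in ms.
OFFSETS = [num - retention_ms(rt) for num, rt in SCANS]
print(OFFSETS)
```

For these two scans the scan number equals the retention time in milliseconds plus 7, so the numbers track the time closely without being identical to it.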

So here are my questions:

1) Is there a script among the TPP executables to renumber the scans
consecutively? (I think there is, but I cannot find it anymore.)

2) Could you release a fix for this Agilent file conversion in a
future msconvert release?

Thanks in advance for your input,

Best,

Ludovic

Matthew Chambers

Feb 7, 2011, 11:15:36 AM
to spctools...@googlegroups.com
Hi Ludovic,

The Agilent scanIds are the numbers used by the raw file and the API. It's not a bug.

One of my primary requirements for ids is being able to unambiguously find the raw spectrum given
the id, whether it's been converted to mzML, mzXML, or whatever. Artificially forcing the scan
numbers to be consecutive violates that for several vendor formats (not to mention filtering by MS
level, activation type, or sorting on something other than time, etc.).

I think TPP has some refresh program to force the numbers to be consecutive but I'll let someone
more familiar with TPP answer that; I remember this issue came up a while ago. I never noticed the
correlation between scan time and scanId before, that's interesting! I suppose that at least for
your instrument, the (micro?)scan rate was close to 1000/sec.

HTH,
-Matt



lgillet

Feb 11, 2011, 3:19:35 AM
to spctools-discuss
Dear Matt,
thanks for your answer. I do not know what the Agilent API does or does
not do to the data, but what I can tell you is that the scans are indeed
*consecutively* numbered (from 1 to 5'000 or more) in the raw data
when you browse them with the Agilent MassHunter Qual software. So my
guess is that there might still be something fishy about msconvert
here. My understanding was that the former converter (Trapper, from
Natalie Tasman) relied on the same Agilent API as well!
Maybe Natalie could comment on that. And since Trapper preserved
the proper numbering of the scans as in the raw data, something might
have changed upon switching to msconvert.

My problem (and that of others in our lab) is that, with the current
version of msconvert, you can do almost nothing with the converted
Agilent data. For example, MzXML2Search spits out a "segmentation
fault" error message as soon as one scan number exceeds 27'219 (i.e. if
scan >= 27'220 it crashes; this probably has something to do with
single/double integer stuff?). Second, our Sequest server (Sage-Sorcerer)
also fails on those files (the .dta files created from the mzXML stop
at a well-defined scan number limit, and therefore very few spectra
are actually searched).

I don't know if I am making myself clear, but here are my requests:

1) Could you verify why msconvert behaves differently from Trapper
(while they supposedly use the same Agilent libraries) when exporting
the scan numbers? (Trapper performs the correct conversion by
preserving the same scan numbering as the raw file.)

2) If it is not possible for you to fix msconvert in that respect,
would it be possible to provide an option in msconvert to renumber
the scans consecutively from 1 to the end? I guess such an option may
one day be useful to other people for other applications anyway.

Thanks a lot for your help,

Best,

Ludovic

Matthew Chambers

Feb 11, 2011, 11:58:14 AM
to spctools...@googlegroups.com
Hi Ludovic,

On 2/11/2011 2:19 AM, lgillet wrote:
> Dear Matt,


>
> My problem (and that of others in our lab) is that, with the current
> version of msconvert, you can do almost nothing with the converted
> Agilent data. For example, MzXML2Search spits out a "segmentation
> fault" error message as soon as one scan number exceeds 27'219 (i.e. if
> scan >= 27'220 it crashes; this probably has something to do with
> single/double integer stuff?). Second, our Sequest server (Sage-Sorcerer)
> also fails on those files (the .dta files created from the mzXML stop
> at a well-defined scan number limit, and therefore very few spectra
> are actually searched).

If this is true of MzXML2Search then it's a bug. I thought it was fixed actually. Thermo Velos
instruments easily exceed 30000 spectra. And if it's LTQ, you double that to 60000 (DTAs).


> I don't know if I make myself clear but here are my comments:
>
> 1) could you verify why msconvert is behaving differently than Trapper
> (while they supposedly use the same Agilent libraries) when exporting
> the scan numbers (Trapper performing the correct conversion by
> conserving the same scan numbering as the raw file)
>

> I do not know what the Agilent API does or does not do
> to the data, but what I can tell you is that the scans are indeed
> *consecutively* numbered (from 1 to 5'000 or more) in the raw data
> when you browse them with the Agilent MassHunter Qual software. So my
> guess is that there might still be something fishy about msconvert
> here. My understanding was that the former converter (Trapper, from
> Natalie Tasman) relied on the same Agilent API as well!
> Maybe Natalie could comment on that. And since Trapper preserved
> the proper numbering of the scans as in the raw data, something might
> have changed upon switching to msconvert.

I'll quote my post to the psidev-ms mailing list from 6/30/2009:
> In the MassHunter API there are two ways to uniquely address a spectrum: by
> "row number" or "scan id". Row number is essentially a 0-based index
> that refers to the spectra after the acquisition software has done
> something...perhaps internal merging? Scan id represents the ordinal
> number of acquisitions as they come off the instrument. So, at least on
> their (Q)TOF instruments, the rowNumber is very disparate from the
> scanId, but both of them are unique identifiers that can technically be
> used to refer to a native spectrum. The kink is that the MassHunter API
> only refers to the parent scan by its scan id and doesn't provide a way
> to directly translate a scan id to a row number - translation must be
> done indirectly by enumerating all the row numbers and building a
> mapping of scan id to row number. For this reason I would recommend that
> the nativeID format be defined as "scanId=xsd:nonNegativeInteger" but
> I'm open to comment on this!
This explains why we adopted scanId to be used as the nativeID despite it not being consecutive. It
was not a strong reason for choosing one over the other, but ids being consecutive means even less.
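
The indirect translation described in the quoted post can be sketched as follows. This is not MassHunter API code: `scan_id_for_row` is a hypothetical stand-in for whatever call returns the scan id of a given row, and the sample ids are simply the two seen in Ludovic's file.

```python
from typing import Callable, Dict


def build_scan_id_index(row_count: int,
                        scan_id_for_row: Callable[[int], int]) -> Dict[int, int]:
    """Enumerate every 0-based row once and map scan id -> row number."""
    index: Dict[int, int] = {}
    for row in range(row_count):
        scan_id = scan_id_for_row(row)
        if scan_id in index:
            # Both identifiers are supposed to be unique per spectrum.
            raise ValueError(f"duplicate scan id {scan_id}")
        index[scan_id] = row
    return index


# Hypothetical data: rows 0 and 1 carry the scan ids seen in this thread.
SAMPLE_SCAN_IDS = [2567, 3560]
INDEX = build_scan_id_index(len(SAMPLE_SCAN_IDS), SAMPLE_SCAN_IDS.__getitem__)
print(INDEX[3560])  # row number for scan id 3560
```

The cost is one full pass over the rows before any parent scan can be resolved, which is the inconvenience the post points out.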

However, if it's true that it's impossible to find a scan in MassHunter with the scanId, that's a
major issue of which I was unaware! That's a pretty compelling reason to switch to the row number,
but we've never had to change a nativeID format before. We'll have to discuss it with Agilent and
the PSI-MS working group.


> 2) If it is not possible for you to fix msconvert in that respect,
> would it be possible to provide an option in msconvert to renumber
> the scans consecutively from 1 to the end? I guess such an option may
> one day be useful to other people for other applications anyway.

Yes, it's possible to implement this, but as I said above there is an imminent problem with your
pipeline if you can't support scan numbers over 27219. I have no idea why that number would be a
threshold; 32767 is the max for a signed 16-bit integer and 65535 is the max for unsigned. This
should be an easy bug to fix too (just changing the scan number data type). If the 16-bit integer
problems are fixed, is the consecutive option still necessary?
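
To make the 16-bit limits concrete, here is a small Python sketch that uses `ctypes` to show what a scan number looks like after being stored in a 16-bit field. It only illustrates the integer limits mentioned above; it makes no claim about how MzXML2Search actually stores scan numbers.

```python
import ctypes

INT16_MAX = 2**15 - 1    # 32767, largest signed 16-bit value
UINT16_MAX = 2**16 - 1   # 65535, largest unsigned 16-bit value


def as_int16(value: int) -> int:
    """Value as it would read back from a signed 16-bit field."""
    return ctypes.c_int16(value).value


def as_uint16(value: int) -> int:
    """Value as it would read back from an unsigned 16-bit field."""
    return ctypes.c_uint16(value).value


for scan in (27219, 27220, 32767, 32768, 65536):
    print(scan, as_int16(scan), as_uint16(scan))
```

Both 27219 and 27220 survive either field unchanged, so a plain 16-bit overflow would not by itself explain a threshold at that value.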

Hope this helps,
-Matt

Natalie Tasman

Feb 11, 2011, 2:23:24 PM
to spctools...@googlegroups.com
Hello Ludovic, Matt (and probably Joe, whose comments I'd appreciate),

I have not been involved with MS data conversion for some time now,
and certainly direct anyone to use msconvert and the ProteoWizard
project if they are using the newer standard, mzML.

However, that said, some users still have compelling uses for mzXML.
My first recommendation is that you think about using mzML if
possible; as Matt describes, the Agilent MassHunter format is one
which internally maintains scans with a coordinate system rather than
internally sequential scans. It is of course a good thing to maintain
those full coordinates to maintain tracking of your scans back to the
originals in the raw files.

On to mzXML. As the SPC is the originator of the mzXML format, we
have to take the TPP's requirements as defining 'correct' mzXML. For
some background, first:

mzXML was a very successful *format*, but not a standard. There was a
need for a fast, open format in the proteomics software community, and
mzXML worked well. It was defined as needed by the SPC/ISB and other
developers in Zurich. As such, it mainly served the needs of the TPP
and other related software, but was picked up by many other projects
as well.

As an academic-based format, it often changed, and was documented
fairly well for an academic project-- but not completely. In the
mzXML usage document-- very difficult to find, by the way:
http://sashimi.sourceforge.net/schema_revision/mzXML_2.0/Doc/mzXML_2.0_tutorial.pdf,
page 12, it spells out the assumption that mzXML scan numbers must
be consecutive and ascending: "scan num (required): the
scan number for the current scan element. The values of the num must
start from 1 and increase sequentially!"
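
The rule quoted from the tutorial is easy to test for. A minimal Python sketch, using the scan numbers from the two conversions posted earlier in this thread (the function name is made up for illustration):

```python
from typing import Iterable


def scan_numbers_are_sequential(nums: Iterable[int]) -> bool:
    """True if nums is exactly 1, 2, 3, ... in that order."""
    return all(num == expected for expected, num in enumerate(nums, start=1))


TRAPPER_NUMS = [1, 2]          # from the Trapper output in this thread
MSCONVERT_NUMS = [2567, 3560]  # from the msconvert output in this thread

print(scan_numbers_are_sequential(TRAPPER_NUMS))
print(scan_numbers_are_sequential(MSCONVERT_NUMS))
```

By this check the Trapper numbering satisfies the tutorial's rule and the msconvert numbering does not.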

For mzWiff and Trapper (and maybe others) this requires artificially
renumbering the scans. In some of the later (3.0+) revisions of
mzXML, Chee-Hong (the excellent developer who gave us mzWiff) and I
came up with some additional separate fields in mzXML for storing the
original scan coordinates. However, being somewhat of an ad-hoc
format, I don't think that these requirements were *strongly* encoded
into the mzXML schema.
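
As a rough illustration of that renumbering approach (not Trapper's or mzWiff's actual code), the Python sketch below renumbers the scans of a stripped-down fragment from 1 and keeps the original number in a `nativeScanRef` element like the one in the Trapper output above. The fragment is an assumption made for the example: it has no XML namespace, no peak data and no index.

```python
import xml.etree.ElementTree as ET

# Stripped-down stand-in for an mzXML file, using values from this thread.
FRAGMENT = """<msRun>
  <scan num="2567" msLevel="1" retentionTime="PT2.56S"/>
  <scan num="3560" msLevel="1" retentionTime="PT3.553S"/>
</msRun>"""


def renumber_scans(root: ET.Element) -> dict:
    """Renumber <scan> elements from 1; return {native num: new num}."""
    mapping = {}
    # Materialise the list first so the tree is not modified mid-iteration.
    for new_num, scan in enumerate(list(root.iter("scan")), start=1):
        native = scan.get("num")
        mapping[native] = new_num
        scan.set("num", str(new_num))
        # Keep the original coordinate, as in the Trapper output.
        ref = ET.SubElement(scan, "nativeScanRef", coordinateType="Agilent")
        ET.SubElement(ref, "coordinate", name="scan", value=native)
    return mapping


ROOT = ET.fromstring(FRAGMENT)
MAPPING = renumber_scans(ROOT)
print(ET.tostring(ROOT, encoding="unicode"))
```

A real file would need more care than this: mzXML files normally carry a scan index of byte offsets, and MS2 scans refer to their parent through `precursorScanNum`, so both would have to be updated along with the `num` attributes.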

So in short, I believe that msconvert's reluctance to generate
sequential consecutive scan numbers may result in *valid* mzXML, by
the schema (if mzXML even validates-- I'm honestly not sure; some of
the optional elements were difficult to grammatically encode in the
XML schema if I recall). BUT the TPP, and most other SPC/Aebersold
tools really, honestly expect and require those sequential scan
numbers.

On the other hand, you can get past this dilemma by switching to mzML
if you like, which is a *standard*, not just a format. This means it
is thoroughly curated, tested, and documented by a consortium of
interested parties. mzML files contain MUCH more detailed metadata and
are better for archival purposes. And pwiz/msconvert is certainly the
authoritative software suite for all things mzML.

All this said, it has been some time since I have worked on the SPC
converters in any detail, nor mzXML/mzML formats/standards, so perhaps
the TPP team has relaxed this sequential mzXML scan requirement, and I
hope they correct me if so (Joe?). But last time this came up it was
still a requirement and I think it explains why Ludovic is having
trouble.

Hope this helps with some background and explanation.

Best wishes to all,

Natalie


Natalie Tasman

Feb 11, 2011, 2:34:16 PM
to spctools...@googlegroups.com
For anyone curious, by the way, in the Trapper output that Ludovic
posted, you can see the section:

<nativeScanRef coordinateType="Agilent" >
<coordinate name="scan"
value="3560" />
</nativeScanRef>

for mzXML scan "2". I do remember working with Agilent to request that
they expose at least a single unique ID for each scan through their API,
as it would make things easier in the TPP world; Chee-Hong and I
came up with this nativeScanRef system, and it may be one of the
things that is optional rather than required in the later mzXML schemas.

-Natalie
