The Agilent scanIds are the numbers used by the raw file and the API. It's not a bug.
One of my primary requirements for ids is being able to unambiguously find the raw spectrum given
the id, whether the data has been converted to mzML, mzXML, or whatever. Artificially forcing the
scan numbers to be consecutive violates that for several vendor formats (not to mention after
filtering by MS level or activation type, or after sorting on something other than time).
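To make that concrete (just a sketch of mine, not pwiz code): because the nativeID keeps the
vendor's own identifier, a downstream tool can go straight from a spectrum id back to the raw
scan, even after filtering or sorting. The parsing helper below is hypothetical, not part of any
pwiz API:

    #include <string>
    #include <stdexcept>

    // Sketch: pull the vendor scan identifier out of an mzML spectrum id
    // such as "scanId=3560" (the Agilent nativeID format discussed below).
    int scanIdFromNativeID(const std::string& nativeID)
    {
        const std::string key = "scanId=";
        std::string::size_type pos = nativeID.find(key);
        if (pos == std::string::npos)
            throw std::runtime_error("not an Agilent scanId nativeID: " + nativeID);
        // whatever follows "scanId=" is the number used by the raw file and the API
        return std::stoi(nativeID.substr(pos + key.size()));
    }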
I think the TPP has some refresh program to force the numbers to be consecutive, but I'll let
someone more familiar with the TPP answer that; I remember this issue came up a while ago. I never
noticed the correlation between scan time and scanId before; that's interesting! I suppose that,
at least for your instrument, the (micro?)scan rate was close to 1000/sec.
HTH,
-Matt
On 2/7/2011 9:56 AM, lgillet wrote:
> Dear group,
> While trying to search files obtained upon conversion of Agilent
> QTOF .d files to mzXML with msconvert, I realized that the .dta files
> created for Sequest search were very few and had a very strange
> numbering.
> (Windows XP, msconvert version tried: Build_12_ProteoWizard_r2127 or
> pwiz-bin-windows-x86-vc90-1_6_1386 or pwiz-bin-windows-x86-vc90-
> release-shared-2_1_2485).
>
> Upon investigating a bit, it seems that msconvert numbers the scans
> from the .d files incorrectly.
> Here is an example of the first 2 scans obtained from an mzXML file
> produced by msconvert:
>
On 2/11/2011 2:19 AM, lgillet wrote:
> Dear Matt,
>
> My problem (and others' from our lab) is that, with the current
> version of msconvert, you can hardly do anything with the converted
> Agilent data. For example, MzXML2Search spits out a "segmentation
> fault" error message as soon as one scan number exceeds 27'219 (i.e.
> if scan > 27'220 it crashes; this probably has something to do with
> single/double integer stuff?). Second, our Sequest server
> (Sage-Sorcerer) also crashes on those files (the number of .dta files
> created from the mzXML is again limited to a well-defined scan number
> cutoff, so very few spectra are actually searched).
If this is true of MzXML2Search then it's a bug; I thought it had actually been fixed. Thermo Velos
instruments easily exceed 30000 spectra, and if it's an LTQ, double that to 60000 (DTAs).
> I don't know if I'm making myself clear, but here are my comments:
>
> 1) Could you verify why msconvert behaves differently from Trapper
> (while they supposedly use the same Agilent libraries) when exporting
> the scan numbers? (Trapper performs the conversion correctly,
> preserving the same scan numbering as in the raw file.)
>
> I do not know what the Agilent API does or does not do to the data,
> but what I can tell is that the scans are indeed *consecutively*
> numbered (from 1 to 5'000 or more) in the raw data when you browse
> them with the Agilent MassHunter Qual software. So my guess is that
> there might still be something fishy about msconvert here. My
> understanding was that the former converter (Trapper from Natalie
> Tasman) was actually relying on the same Agilent API as well! Maybe
> Natalie could comment on that. And since Trapper was preserving the
> proper numbering of the scans as in the raw data, something might
> have changed upon switching to msconvert.
I'll quote my post to the psidev-ms mailing list from 6/30/2009:
> In the MassHunter API there are two ways to uniquely address a spectrum: by
> "row number" or "scan id". Row number is essentially a 0-based index
> that refers to the spectra after the acquisition software has done
> something...perhaps internal merging? Scan id represents the ordinal
> number of acquisitions as they come off the instrument. So, at least on
> their (Q)TOF instruments, the rowNumber is very disparate from the
> scanId, but both of them are unique identifiers that can technically be
> used to refer to a native spectrum. The kink is that the MassHunter API
> only refers to the parent scan by its scan id and doesn't provide a way
> to directly translate a scan id to a row number - translation must be
> done indirectly by enumerating all the row numbers and building a
> mapping of scan id to row number. For this reason I would recommend that
> the nativeID format be defined as "scanId=xsd:nonNegativeInteger" but
> I'm open to comment on this!
This explains why we adopted scanId as the nativeID despite it not being consecutive. That kink
wasn't a strong reason for choosing one id over the other, but whether the ids are consecutive
matters even less. However, if it's true that it's impossible to find a scan in MassHunter by its
scanId, that's a major issue of which I was unaware! That would be a pretty compelling reason to
switch to the row number, but we've never had to change a nativeID format before. We'll have to
discuss it with Agilent and the PSI-MS working group.
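To make the indirection concrete, the mapping I described amounts to something like this (a sketch
only; the reader type and method names are placeholders, not the real MHDAC signatures):

    #include <map>

    // Hypothetical stand-in for the vendor reader (names are made up).
    struct MassHunterReader
    {
        int rowCount() const;            // number of spectra after acquisition-side merging
        int scanIdForRow(int row) const; // ordinal acquisition number ("scan id") for that row
    };

    // Build a scanId -> rowNumber map once, so a precursor referenced by
    // scanId can be found without walking the whole file every time.
    std::map<int, int> buildScanIdToRowMap(const MassHunterReader& reader)
    {
        std::map<int, int> scanIdToRow;
        for (int row = 0; row < reader.rowCount(); ++row)
            scanIdToRow[reader.scanIdForRow(row)] = row; // row numbers are 0-based
        return scanIdToRow;
    }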
> 2) If it's not possible for you to fix msconvert in that respect,
> would it be possible to provide an option in msconvert to renumber
> the scans consecutively from 1 to the end? I guess such an option
> may one day be useful for other people and other applications anyway.
Yes, it's possible to implement this, but as I said above there is an imminent problem with your
pipeline if you can't support scan numbers over 27219. I have no idea why that number would be a
threshold; 32767 is the max for a signed 16-bit integer and 65535 is the max for unsigned. This
should be an easy bug to fix too (just changing the scan number data type). If the 16-bit integer
problems are fixed, is the consecutive option still necessary?
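For what it's worth, this kind of bug usually looks as innocuous as the following (a made-up
illustration, not the actual MzXML2Search source):

    #include <cstdint>
    #include <cstdio>

    int main()
    {
        int32_t realScanNum = 40000;              // Velos runs easily go past 32767
        int16_t truncated = (int16_t)realScanNum; // silently wraps on typical platforms:
                                                  // 40000 -> -25536
        std::printf("%d becomes %d\n", (int)realScanNum, (int)truncated);
        return 0;
    }

Widening the scan number variable to a 32-bit (or larger) type is the whole fix.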
Hope this helps,
-Matt
I have not been involved with MS data conversion for some time now,
and I certainly direct anyone using the newer standard, mzML, to
msconvert and the ProteoWizard project.
However, that said, some users still have compelling uses for mzXML.
My first recommendation is to consider mzML if possible; as Matt
describes, the Agilent MassHunter format is one which internally
addresses scans with a coordinate system rather than with sequential
scan numbers. It is of course a good thing to keep those full
coordinates so your scans can be traced back to the originals in the
raw files.
On to mzXML. As the SPC is the originator of the mzXML format, we
have to take the TPP's requirements as defining 'correct' mzXML. For
some background, first:
mzXML was a very successful *format*, but not a standard. There was a
need for a fast, open format in the proteomics software community, and
mzXML worked well. It was defined as needed by the SPC/ISB and other
developers in Zurich. As such, it mainly served the needs of the TPP
and other related software, but was picked up by many other projects
as well.
As an academic-based format, it often changed, and was documented
fairly well for an academic project -- but not completely. The mzXML
usage document -- very difficult to find, by the way:
http://sashimi.sourceforge.net/schema_revision/mzXML_2.0/Doc/mzXML_2.0_tutorial.pdf,
page 12 -- spells out the assumption that mzXML scan numbers must be
consecutive and ascending: "scan num (required): the scan number for
the current scan element. The values of the num must start from 1 and
increase sequentially!"
For mzWiff and Trapper (and maybe others) this requires artificially
renumbering the scans. In some of the later (3.0+) revisions of
mzXML, Chee-Hong (the excellent developer who gave us mzWiff) and I
came up with some additional, separate fields in mzXML for storing
the original scan coordinates. However, mzXML being somewhat of an
ad hoc format, I don't think these requirements were *strongly*
encoded into the mzXML schema.
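To give a feel for it, the renumbering those converters do is essentially this (just a sketch of
the idea, not the actual mzWiff/Trapper code):

    #include <cstddef>
    #include <vector>

    // One entry per spectrum, in acquisition (retention-time) order.
    struct ScanRecord
    {
        int vendorScanId; // e.g. the Agilent scanId from the raw file
        int mzxmlNum;     // consecutive number required by mzXML
                          // ("start from 1 and increase sequentially")
    };

    // Assign consecutive mzXML scan numbers while remembering each scan's
    // original vendor id so it can still be written out alongside the scan.
    std::vector<ScanRecord> renumberConsecutively(const std::vector<int>& vendorScanIds)
    {
        std::vector<ScanRecord> out;
        for (std::size_t i = 0; i < vendorScanIds.size(); ++i)
        {
            ScanRecord rec = { vendorScanIds[i], (int)i + 1 };
            out.push_back(rec);
        }
        return out;
    }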
So in short, I believe that msconvert's reluctance to generate
sequential consecutive scan numbers may result in *valid* mzXML, by
the schema (if mzXML even validates-- I'm honestly not sure; some of
the optional elements were difficult to grammatically encode in the
XML schema if I recall). BUT the TPP, and most other SPC/Aebersold
tools really, honestly expect and require those sequential scan
numbers.
On the other hand, you can get past this dilemma by switching to mzML
if you like, which is a *standard*, not just a format. This means it
is thoroughly curated, tested, and documented by a consortium of
interested parties. mzML files contain MUCH more detailed metadata
and are better for archival purposes. And pwiz/msconvert is certainly
the authoritative software suite for all things mzML.
All this said, it has been some time since I have worked on the SPC
converters in any detail, or on the mzXML/mzML formats/standards, so
perhaps the TPP team has relaxed this sequential mzXML scan
requirement, and I hope they correct me if so (Joe?). But the last
time this came up it was still a requirement, and I think it explains
why Ludovic is having trouble.
Hope this helps with some background and explanation.
Best wishes to all,
Natalie
<nativeScanRef coordinateType="Agilent">
  <coordinate name="scan" value="3560"/>
</nativeScanRef>
for mzXML scan "1". I do remember working with Agilent to request that
they expose at least a single unique ID for each scan through their
API, as it would make things easier in the TPP world; Chee-Hong and I
came up with this nativeScanRef system, and it may be one of the
things that is optional but not required in the later mzXML schemas.
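Emitting that element from a renumbering like the one sketched above is trivial; roughly (a
hypothetical helper, shown only to make the pairing of the mzXML num and the original coordinate
explicit):

    #include <cstdio>

    // Write the optional nativeScanRef element for one scan, pairing the
    // consecutive mzXML num (implicit in the enclosing <scan>) with the
    // original Agilent scan coordinate.
    void writeNativeScanRef(std::FILE* mzxml, int vendorScanId)
    {
        std::fprintf(mzxml,
                     "  <nativeScanRef coordinateType=\"Agilent\">\n"
                     "    <coordinate name=\"scan\" value=\"%d\"/>\n"
                     "  </nativeScanRef>\n",
                     vendorScanId);
    }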
-Natalie