mzXML Support

Joshua Klein

Aug 19, 2016, 6:59:21 PM

to pyteomics

An application requirement I encountered involved supporting mzXML. I've implemented an mzXML parser using the existing tools in `pyteomics.xml`, with adaptations to account for the fact that the format isn't as well-behaved as later HUPO standards.

I've attached a draft of my implementation. I'm looking for comments before drafting unit tests and making a formal pull request.

Weaknesses:

Requires a lot of code duplication to deal with the fact that mzXML uses "num" instead of "id" to uniquely identify scans within a single run.
Completely ignores MALDI-specific details. I don't know if those details even get filled in by ProteoWizard.
I've heard of "scans within scans" in mzXML files, but I didn't encounter any in ProteoWizard-generated files. While this implementation should be able to parse those, it would unspool the nested scans out of order.
Naming of precursor information isn't sensible because of how tags are named.
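
The ordering concern in the "scans within scans" point can be illustrated with a minimal sketch. It uses the standard library's `xml.etree.ElementTree.iterparse` rather than `pyteomics.xml`, and it assumes scans are emitted when their closing tag is read. The sample fragment and the `scan_order` helper are hypothetical and not part of the attached draft.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed mzXML-like fragment with one scan nested in another.
SAMPLE = b"""<msRun>
  <scan num="1" msLevel="1">
    <scan num="2" msLevel="2"/>
  </scan>
  <scan num="3" msLevel="1"/>
</msRun>"""


def scan_order(data):
    """Return scan numbers in the order a streaming parser finishes them."""
    order = []
    # 'end' events fire when an element's closing tag is read, so a nested
    # scan is completed (and emitted) before the scan that contains it.
    for _event, elem in ET.iterparse(io.BytesIO(data), events=("end",)):
        if elem.tag == "scan":
            order.append(elem.get("num"))
    return order


print(scan_order(SAMPLE))
```

Under that assumption, scan 2 comes out before scan 1, which is the out-of-order behavior described above.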

mzxml.py

Lev Levitsky

Aug 20, 2016, 7:35:32 AM

to pyteomics

Thank you! Support for mzXML has been requested several times, so it's great that it will soon be available.

If you feel that some tweaking in pyteomics.xml would improve the DRYness of mzxml and make the class structure more balanced, we can do that. At first glance, the duplication is needed because "id" is hardcoded into IndexedXML._find_by_id_no_reset; if adding another class attribute and referring to it there helps, we can easily do it.

--

Lev Levitsky
Institute for Energy Problems of Chemical Physics RAS
Laboratory of Physical and Chemical Methods for Structure Analysis
Leninsky pr. 38, bld. 2 119334 Moscow Russia
tel: +7 499 1378257 fax: +7 499 1378257, +7 499 1378258

Joshua Klein

Aug 20, 2016, 3:30:14 PM

to pyteomics

The `id` vs `num` issue goes deeper, touching the XML indexer as well. The core indexing machinery was designed to handle this, but the two or three layers of abstraction I wrote on top of it didn't provide a way to parameterize that machinery accordingly. I derive two classes in `mzxml` to work around this, but proper handling could be moved into the main classes to deal with this problem in the future.

Since the tag attribute to be used for looking up elements may be different for different tags, we'd need to adjust _find_by_id_no_reset in a fashion which lets you specify the attribute to match by as an argument, and then provide a default for each class.
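
A minimal sketch of that idea follows. It uses the standard library instead of the real `pyteomics.xml` classes. `IndexedReader`, `MzXMLReader`, `find_by_id`, `_id_attr` and the sample documents are hypothetical names chosen for illustration, not the existing API.

```python
import io
import xml.etree.ElementTree as ET


class IndexedReader:
    """Hypothetical stand-in for an indexed XML reader (not the pyteomics class)."""

    # Class-level default: the attribute that uniquely identifies an element.
    _id_attr = "id"

    def __init__(self, data):
        self._data = data

    def find_by_id(self, elem_id, id_attr=None):
        """Return the first element whose identifying attribute equals elem_id."""
        # Fall back to the class default when the caller gives no attribute.
        attr = self._id_attr if id_attr is None else id_attr
        # Attributes are already available on 'start' events.
        for _event, elem in ET.iterparse(io.BytesIO(self._data), events=("start",)):
            if elem.get(attr) == elem_id:
                return elem
        raise KeyError(elem_id)


class MzXMLReader(IndexedReader):
    # mzXML identifies scans by "num", so only the default needs to change.
    _id_attr = "num"


MZML = b'<run><spectrum id="scan=1" index="0"/></run>'
MZXML = b'<msRun><scan num="1" msLevel="1"/></msRun>'

print(IndexedReader(MZML).find_by_id("scan=1").tag)
print(MzXMLReader(MZXML).find_by_id("1").get("msLevel"))
# The per-call argument overrides the class default for a one-off lookup.
print(IndexedReader(MZML).find_by_id("0", id_attr="index").tag)
```

With this shape, a subclass only overrides one attribute instead of duplicating the lookup method, and a caller can still pick a different attribute for a specific tag.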

There's also no support for automatic schema deduction. The version extractor doesn't work with mzXML, and the schema extractor retrieves nothing but the schema for the offset index, because the actually useful components are inside xsd files added by XInclude, which lxml doesn't expand by default for security reasons. Since mzXML is a dead format that will never be updated again, I'm not too concerned about this part.