Extracting Enzyme/Instrument Name/Tissue/Modifications data from xmls


Vinay Duggineni

Aug 29, 2020, 9:13:15 AM
to Pyteomics
Dear Pyteomics group,
I am new to the field of proteomics. Is there a tool in the Pyteomics collection that can extract the digestive enzyme name, instrument name, tissue name and modifications from the XML output files of different mass-spec raw data processing software (such as MaxQuant, Mascot, etc.)? Any help or advice on how to approach this problem would be appreciated.

Thanks!

Vinay Kumar Duggineni

mobiu...@gmail.com

Aug 29, 2020, 12:15:18 PM
to Pyteomics

Each of the tools you listed outputs a variety of file formats, but a few of those formats are standardized. For example, MASCOT can output mzIdentML (mzid), which includes, in addition to all the peptides and proteins identified, the modifications and protease(s) used to configure the search. pyteomics.mzid can parse that for you from the SpectrumIdentificationProtocol elements. MaxQuant can output mzTab, which records the same sort of information in a different structure, located in the top-level metadata section, and pyteomics.mztab can parse that.

Neither search engine you listed really knows about sample metadata like tissue type, and the vendor binary raw data files barely know it either. If you want to know the instrument name for the raw data searched, you’ll need to convert the raw file to mzML and use pyteomics.mzml to parse the instrumentConfiguration elements. Depending upon the vendor and the converter used, these will either reference a specific referenceableParamGroup that contains the instrument name, or contain that information themselves. This may include the instrument model name and/or its serial number. You can get similar information from mzXML, but that format should be avoided if at all possible.
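
To make the instrumentConfiguration / referenceableParamGroup relationship concrete, here is a minimal sketch that resolves the reference by hand using only the standard library. The mzML fragment is a made-up, heavily trimmed example (the group id, the instrument and the serial number are all assumptions for illustration), and instrument_params is a hypothetical helper name. On a real file pyteomics.mzml would do the parsing for you; this only shows where the information lives.

```python
import xml.etree.ElementTree as ET

MZML_NS = "http://psi.hupo.org/ms/mzml"
NS = {"mz": MZML_NS}

# Hypothetical, heavily trimmed mzML fragment (not taken from a real file).
MZML_FRAGMENT = """<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <referenceableParamGroupList count="1">
    <referenceableParamGroup id="CommonInstrumentParams">
      <cvParam cvRef="MS" accession="MS:1001911" name="Q Exactive" value=""/>
      <cvParam cvRef="MS" accession="MS:1000529" name="instrument serial number" value="SN0001"/>
    </referenceableParamGroup>
  </referenceableParamGroupList>
  <instrumentConfigurationList count="1">
    <instrumentConfiguration id="IC1">
      <referenceableParamGroupRef ref="CommonInstrumentParams"/>
    </instrumentConfiguration>
  </instrumentConfigurationList>
</mzML>"""


def instrument_params(root):
    """Map each instrumentConfiguration id to its cvParam names and values.

    Parameters may sit directly on the configuration or in a
    referenceableParamGroup that the configuration points to.
    """
    # Index every referenceable group by id so references can be resolved.
    groups = {
        group.get("id"): group.findall("mz:cvParam", NS)
        for group in root.iter(f"{{{MZML_NS}}}referenceableParamGroup")
    }
    result = {}
    for conf in root.iter(f"{{{MZML_NS}}}instrumentConfiguration"):
        # Parameters stored directly on the configuration element.
        params = list(conf.findall("mz:cvParam", NS))
        # Parameters pulled in through referenceableParamGroupRef.
        for ref in conf.findall("mz:referenceableParamGroupRef", NS):
            params.extend(groups.get(ref.get("ref"), []))
        result[conf.get("id")] = {p.get("name"): p.get("value") for p in params}
    return result


root = ET.fromstring(MZML_FRAGMENT)
configs = instrument_params(root)
for conf_id, params in configs.items():
    print(conf_id, params)
```

In this fragment the instrument model appears as the name of a cvParam with an empty value, while the serial number is carried in the value of its own cvParam, so both the names and the values are worth inspecting.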

Finding the proteases and modifications used in an mzIdentML file:

import sys
from pprint import pprint
from pyteomics import mzid

mzid_path = sys.argv[1]

print("Proteases")
reader = mzid.MzIdentML(mzid_path, retrieve_refs=True, iterative=True)
for elem in reader.iterfind("Enzyme"):
    pprint(elem)

# At this point the parser has read through the whole document, so we need to reset the file stream
reader.reset()  
print("\nModifications")
for elem in reader.iterfind("ModificationParams"):
    pprint(elem)

Here’s how to get the modifications from mzTab; I don’t have an example with the protease specified.

import sys
from pprint import pprint
from pyteomics import mztab

mztab_path = sys.argv[1]
tab = mztab.MzTab(mztab_path)
for key, value in tab.collapse_properties(tab.metadata).items():
    if "_mod" in key:
        print(key)
        pprint(value)
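
If it helps to see what that metadata section looks like on disk, the sketch below reads the tab-separated MTD lines from a small inline fragment using only the standard library. The fragment is invented for illustration (the specific modifications and instrument are assumptions, not output from a real search), and read_metadata is a hypothetical helper. It is not a substitute for pyteomics.mztab, which also groups the indexed keys for you.

```python
# Hypothetical mzTab metadata fragment; fields are tab-separated.
MZTAB_FRAGMENT = (
    "MTD\tmzTab-version\t1.0.0\n"
    "MTD\tfixed_mod[1]\t[UNIMOD, UNIMOD:4, Carbamidomethyl, ]\n"
    "MTD\tfixed_mod[1]-site\tC\n"
    "MTD\tvariable_mod[1]\t[UNIMOD, UNIMOD:35, Oxidation, ]\n"
    "MTD\tvariable_mod[1]-site\tM\n"
    "MTD\tinstrument[1]-name\t[MS, MS:1001911, Q Exactive, ]\n"
)


def read_metadata(text):
    """Collect the key/value pairs from the MTD lines of an mzTab document."""
    metadata = {}
    for line in text.splitlines():
        fields = line.split("\t")
        # Metadata lines have the form: MTD <tab> key <tab> value
        if len(fields) >= 3 and fields[0] == "MTD":
            metadata[fields[1]] = fields[2]
    return metadata


metadata = read_metadata(MZTAB_FRAGMENT)

# Same "_mod" filter as in the snippet above, applied to the raw keys.
mods = {key: value for key, value in metadata.items() if "_mod" in key}
for key, value in mods.items():
    print(key, value)
```

The same section can also carry entries such as instrument[1]-name, but only if the tool that wrote the file chose to fill them in.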