Each of the tools you listed can write a variety of file formats, but a few of those formats are standardized. For example, MASCOT can output mzIdentML (mzid), which includes, in addition to all identified peptides and proteins, the modifications and protease(s) used to configure the search. pyteomics.mzid can parse that for you from the SpectrumIdentificationProtocol elements. MaxQuant can output mzTab, which records the same sort of information in a different structure. pyteomics.mztab can parse that as well; the search settings are located in the top-level metadata section.
Neither search engine you listed really knows about sample metadata like tissue type, and the vendor binary raw data files barely know that either. If you want to know the instrument name for the raw data searched, you’ll need to convert the raw file to mzML and use pyteomics.mzml to parse the instrumentConfiguration elements. Depending upon the vendor and the converter used, these will either reference a specific referenceableParamGroup that contains the instrument name, or contain that information themselves. This may include the instrument model name and/or its serial number. You can get similar information from mzXML, but that format should be avoided if at all possible.
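To make the reference structure concrete, here is a minimal sketch of how an instrumentConfiguration can point at a referenceableParamGroup. It deliberately uses only the Python standard library rather than pyteomics, so that it runs without any input file. The function name instrument_params and the SAMPLE fragment are hypothetical: the fragment is hand-written and heavily abbreviated for illustration, not the output of a real converter, and the serial number in it is a placeholder.

```python
import io
import xml.etree.ElementTree as ET

MZML_URI = "http://psi.hupo.org/ms/mzml"
NS = {"mz": MZML_URI}

# Hand-written, abbreviated fragment for illustration only (not a complete mzML file).
SAMPLE = """<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <referenceableParamGroupList count="1">
    <referenceableParamGroup id="CommonInstrumentParams">
      <cvParam cvRef="MS" accession="MS:1001911" name="Q Exactive" value=""/>
      <cvParam cvRef="MS" accession="MS:1000529" name="instrument serial number" value="SN-0000"/>
    </referenceableParamGroup>
  </referenceableParamGroupList>
  <instrumentConfigurationList count="1">
    <instrumentConfiguration id="IC1">
      <referenceableParamGroupRef ref="CommonInstrumentParams"/>
    </instrumentConfiguration>
  </instrumentConfigurationList>
</mzML>"""


def instrument_params(source):
    """Map each instrumentConfiguration id to its (name, value) cvParam pairs."""
    root = ET.parse(source).getroot()

    # Index every referenceableParamGroup by id so references can be resolved.
    groups = {
        group.get("id"): group.findall("mz:cvParam", NS)
        for group in root.iter("{%s}referenceableParamGroup" % MZML_URI)
    }

    result = {}
    for conf in root.iter("{%s}instrumentConfiguration" % MZML_URI):
        # Parameters written directly on the element come first ...
        params = list(conf.findall("mz:cvParam", NS))
        # ... followed by those pulled in through a group reference.
        for ref in conf.findall("mz:referenceableParamGroupRef", NS):
            params.extend(groups.get(ref.get("ref"), []))
        result[conf.get("id")] = [(p.get("name"), p.get("value", "")) for p in params]
    return result


print(instrument_params(io.BytesIO(SAMPLE.encode("utf-8"))))
```

The function also accepts a path to a converted mzML file in place of the in-memory sample. Because it searches the whole tree, it should not matter whether the converter wrapped the document in an index element.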
Finding the proteases and modifications used in an mzIdentML file:
import sys
from pprint import pprint

from pyteomics import mzid

mzid_path = sys.argv[1]

print("Proteases")
reader = mzid.MzIdentML(mzid_path, retrieve_refs=True, iterative=True)
for elem in reader.iterfind("Enzyme"):
    pprint(elem)

# At this point the parser has read through the whole document,
# so we need to reset the file stream
reader.reset()

print("\nModifications")
for elem in reader.iterfind("ModificationParams"):
    pprint(elem)
Here’s how to get the modifications from mzTab. I don’t have an example file with the protease specified.
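import sys
from pprint import pprint

from pyteomics import mztab

mztab_path = sys.argv[1]
tab = mztab.MzTab(mztab_path)

# Modification settings live in metadata keys containing "_mod",
# such as the fixed_mod and variable_mod entries
for key, value in tab.collapse_properties(tab.metadata).items():
    if "_mod" in key:
        print(key)
        pprint(value)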