Making a subset of spectra from an mzXML file in Python

Ben Temperton

Jun 14, 2012, 1:02:14 PM
to spctools...@googlegroups.com
Hi there,

I am trying to pull out a subset of spectra from an mzXML file to run against a database using MSGF-db (for instance, to re-run any non-matching spectra against the database while searching for phosphorylation). To generate the subset I am currently using:

import lxml.etree as le

# SASHIMI_NAMESPACE (the mzXML schema namespace URI) is defined elsewhere in the script; not shown here.

def makeHQSpectraFile(spectraFile, spectraList, outputFile):
    """Takes a spectra file, a list of scan ids to include and an output file as parameters"""
    ns = {'t': SASHIMI_NAMESPACE}
    doc = le.parse(spectraFile)
    # Drop every scan whose number is not in the list of scans to keep.
    for elem in doc.xpath('/t:mzXML/t:msRun/t:scan', namespaces=ns):
        if elem.attrib['num'] not in spectraList:
            elem.getparent().remove(elem)
    # Drop the matching index entries; the surviving offset values are left unchanged.
    for elem in doc.xpath('/t:mzXML/t:index/t:offset', namespaces=ns):
        if elem.attrib['id'] not in spectraList:
            elem.getparent().remove(elem)
    with open(outputFile, 'wb') as handle:
        handle.write(le.tostring(doc) + b'\n')

However, when I run MSGF-db on the new file it throws the following exception:

Reading spectra...
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,1]
Message: Premature end of file.
at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(Unknown Source)
at org.systemsbiology.jrap.stax.IndexParser.parseIndexes(IndexParser.java:176)
at org.systemsbiology.jrap.stax.MSXMLParser.randomInits(MSXMLParser.java:117)
at org.systemsbiology.jrap.stax.MSXMLParser.<init>(MSXMLParser.java:134)
at parser.MzXMLSpectraMap.<init>(MzXMLSpectraMap.java:39)
at parser.MzXMLSpectraIterator.<init>(MzXMLSpectraIterator.java:36)
at parser.MzXMLSpectraIterator.<init>(MzXMLSpectraIterator.java:26)
at ui.MSGFDB.runMSGFDB(MSGFDB.java:269)
at ui.MSGFDB.runMSGFDB(MSGFDB.java:106)
at ui.MSGFDB.main(MSGFDB.java:82)

The original (unmodified) file works fine. I can't get the mzXMLValidator to work on our systems (see post here https://groups.google.com/d/msg/spctools-discuss/bAxu-In-ju4/z9_g3mdWSFcJ), so I was wondering if anyone else had ever encountered a similar issue and had any tips.

Many thanks,

Ben

Jimmy Eng

Jun 14, 2012, 1:10:14 PM
to spctools...@googlegroups.com
Ben,

Have you done anything special to handle the scan numbers (which
presumably are not consecutive anymore starting from scan 1) and the
scan index? If not, address those and re-test or find out if those
are important for MSGF-db.
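
For readers following along, a minimal sketch of the renumbering step might look like the following. It is not the script attached later in this thread. It assumes the usual mzXML layout (scan elements carrying a num attribute, precursorMz elements carrying precursorScanNum, and msRun carrying scanCount); the function names local and renumber_scans are made up for illustration.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def renumber_scans(root):
    """Renumber scans 1..N in document order; return the old -> new mapping.

    Assumption: scan numbers live in the 'num' attribute of <scan>.
    """
    scans = [el for el in root.iter() if local(el.tag) == "scan"]
    mapping = {}
    for new_num, scan in enumerate(scans, start=1):
        mapping[scan.get("num")] = str(new_num)
        scan.set("num", str(new_num))
    for el in root.iter():
        name = local(el.tag)
        if name == "precursorMz":
            # Re-point references to the parent scan, if that scan survived.
            old = el.get("precursorScanNum")
            if old in mapping:
                el.set("precursorScanNum", mapping[old])
        elif name == "msRun" and el.get("scanCount") is not None:
            el.set("scanCount", str(len(scans)))
    return mapping
```

A precursorScanNum that points at a removed scan is left untouched by this sketch, so it would still need handling. Whether MSGF-db cares about consecutive scan numbers at all is not established in this thread.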

Ben Temperton

Jun 14, 2012, 3:38:14 PM
to spctools...@googlegroups.com
Hi Jimmy,

I have updated the scan numbers using a new method (attached), but I am stuck at changing the offset values for each of the scans in the /index/offset elements. I think this is what is causing the 'premature end of file' error - by not changing the offsets, MSGF-db can't find the spectra.

Any ideas how I can regenerate the index once the poor-quality spectra have been removed?
broken.script.py
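
One possible way to regenerate the index, sketched under stated assumptions: the offsets in an mzXML index are byte positions of each scan start tag, so they can be recomputed from the bytes that will actually be written to disk. The sketch below assumes the elements are unprefixed (default namespace), that any stale index sits after the closing msRun tag, and that the consumer does not require the sha1 element, which is simply left out. The name rebuild_index is hypothetical.

```python
import re

# Start of a <scan ...> tag and its num attribute (does not match <scanOrigin>).
SCAN_TAG = re.compile(rb'<scan\b[^>]*?\bnum="(\d+)"')


def rebuild_index(xml_bytes):
    """Return mzXML bytes with a freshly computed <index> and <indexOffset>.

    Assumption: tags are unprefixed, and everything from the first '<index'
    onwards (stale index, indexOffset, sha1, closing tag) can be discarded.
    """
    cut = xml_bytes.find(b"<index")
    if cut == -1:
        cut = xml_bytes.rfind(b"</mzXML>")
    if cut == -1:
        raise ValueError("no </mzXML> closing tag found")
    head = xml_bytes[:cut].rstrip() + b"\n"
    index_pos = len(head)  # byte position where the new <index> will start
    lines = [b'<index name="scan">']
    for match in SCAN_TAG.finditer(head):
        # match.start() is the byte offset of the '<' of this scan tag.
        lines.append(b'<offset id="%s">%d</offset>' % (match.group(1), match.start()))
    lines.append(b"</index>")
    lines.append(b"<indexOffset>%d</indexOffset>" % index_pos)
    lines.append(b"</mzXML>")
    return head + b"\n".join(lines) + b"\n"
```

The input should be the final serialized bytes (for example the result of le.tostring(doc)), and the result should be written in binary mode with nothing appended afterwards, since any later change to the bytes shifts the offsets.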

Brian Pratt

Jun 14, 2012, 3:45:14 PM
to spctools...@googlegroups.com
The index is technically optional, so you should be able to just skip it
in your output.

("technically" and "should", as some parsers are brittle and will fail
without it - good luck!)
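
A rough sketch of skipping the index, for illustration only. It uses the standard-library ElementTree rather than lxml, and it assumes that index, indexOffset and sha1 are direct children of the root element; sha1 is dropped as well on the assumption that a checksum of the old file is no longer meaningful. The name strip_index is hypothetical, and nothing here guarantees that a given parser will accept the result.

```python
import xml.etree.ElementTree as ET


def strip_index(xml_bytes):
    """Return the document without its index, indexOffset and sha1 elements."""
    root = ET.fromstring(xml_bytes)
    # Reuse whatever namespace the file declares instead of hard-coding a schema version.
    ns = root.tag[1:].split("}")[0] if root.tag.startswith("{") else ""
    if ns:
        ET.register_namespace("", ns)  # keep element names unprefixed on output
    for child in list(root):
        if child.tag.rsplit("}", 1)[-1] in ("index", "indexOffset", "sha1"):
            root.remove(child)
    declaration = b'<?xml version="1.0" encoding="UTF-8"?>\n'
    return declaration + ET.tostring(root, encoding="utf-8")
```

The returned bytes can be written with open(outputFile, 'wb'). If the target parser turns out to need an index after all, the offsets have to be recomputed from the final bytes instead.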