Hi there,
I am trying to extract a subset of spectra from an mzXML file so that I can search it against a database with MSGF-db (for instance, to re-run any non-matching spectra against the database, this time searching for phosphorylation). To generate the subset I am currently using:
import lxml.etree as le

# SASHIMI_NAMESPACE is defined elsewhere in my script; it holds the mzXML
# namespace URI declared at the top of the input file.

def makeHQSpectraFile(spectraFile, spectraList, outputFile):
    """Takes a spectra file, a list of scan ids to include and an output file as parameters"""
    ns = {'t': SASHIMI_NAMESPACE}
    doc = le.parse(spectraFile)
    # Drop every scan whose number is not in the list of ids to keep.
    for elem in doc.xpath('/t:mzXML/t:msRun/t:scan', namespaces=ns):
        if elem.attrib['num'] not in spectraList:
            elem.getparent().remove(elem)
    # Drop the matching entries from the scan index.
    for elem in doc.xpath('/t:mzXML/t:index/t:offset', namespaces=ns):
        if elem.attrib['id'] not in spectraList:
            elem.getparent().remove(elem)
    # tostring() returns bytes, so write in binary mode with a bytes newline.
    with open(outputFile, 'wb') as handle:
        handle.write(le.tostring(doc) + b'\n')
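To make the behaviour easier to reproduce without my data, here is a minimal sketch of the same filtering step on a toy document, using only the standard library instead of lxml. The namespace URI, the toy document and the name filter_scans are all made up for illustration; a real mzXML file has its own namespace and many more elements. As in the function above, the values inside the offset elements are left untouched.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI for this toy example; a real mzXML file declares its own.
TOY_NS = "http://example.org/toy-mzxml"

# Toy stand-in for an mzXML file: three scans plus a matching scan index.
TOY_DOC = f"""<mzXML xmlns="{TOY_NS}">
  <msRun>
    <scan num="1"/>
    <scan num="2"/>
    <scan num="3"/>
  </msRun>
  <index name="scan">
    <offset id="1">100</offset>
    <offset id="2">200</offset>
    <offset id="3">300</offset>
  </index>
</mzXML>"""


def filter_scans(xml_text, keep_ids):
    """Return the document as bytes, keeping only scans/offsets whose id is in keep_ids."""
    keep = set(keep_ids)
    ns = {"t": TOY_NS}
    root = ET.fromstring(xml_text)
    # Each entry: (path to the parent, child tag, attribute holding the scan id).
    targets = (
        ("t:msRun", "t:scan", "num"),
        ("t:index", "t:offset", "id"),
    )
    for parent_path, child_tag, attr in targets:
        for parent in root.findall(parent_path, ns):
            # Copy the child list first so removal does not disturb iteration.
            for child in list(parent.findall(child_tag, ns)):
                if child.get(attr) not in keep:
                    parent.remove(child)
    # The offset values are NOT recomputed here; they are copied through as-is.
    return ET.tostring(root, encoding="utf-8")


if __name__ == "__main__":
    print(filter_scans(TOY_DOC, ["1", "3"]).decode("utf-8"))
```

Calling filter_scans(TOY_DOC, ["1", "3"]) should leave scans 1 and 3 and their two index entries, which is the same structure my real output file ends up with.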
However, when I run MSGF-db on the new file, it throws the following exception:
Reading spectra...
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,1]
Message: Premature end of file.
at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(Unknown Source)
at org.systemsbiology.jrap.stax.IndexParser.parseIndexes(IndexParser.java:176)
at org.systemsbiology.jrap.stax.MSXMLParser.randomInits(MSXMLParser.java:117)
at org.systemsbiology.jrap.stax.MSXMLParser.<init>(MSXMLParser.java:134)
at parser.MzXMLSpectraMap.<init>(MzXMLSpectraMap.java:39)
at parser.MzXMLSpectraIterator.<init>(MzXMLSpectraIterator.java:36)
at parser.MzXMLSpectraIterator.<init>(MzXMLSpectraIterator.java:26)
at ui.MSGFDB.runMSGFDB(MSGFDB.java:269)
at ui.MSGFDB.runMSGFDB(MSGFDB.java:106)
at ui.MSGFDB.main(MSGFDB.java:82)
Many thanks,
Ben