MzIdentML problem


Mathieu Courcelles

Oct 7, 2015, 7:16:18 PM
to pyte...@googlegroups.com
Hello,
 
I have a problem when I turn on retrieve_refs with MzIdentML.
 
My code:
with mzid.MzIdentML(file, retrieve_refs=True) as psms:
   
    for i, psm in enumerate(psms):
        print i
 
 
Trace:
0
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-71c3a0553018> in <module>()
      1 with mzid.MzIdentML(file, retrieve_refs=True) as psms:
      2 
----> 3     for i, psm in enumerate(psms):
      4         print i
      5 

C:\Python27\lib\site-packages\pyteomics\auxiliary.pyc in __next__(self)
    391     def __next__(self):
    392         try:
--> 393             return next(self._reader)
    394         except StopIteration:
    395             self.__exit__(None, None, None)

C:\Python27\lib\site-packages\pyteomics\xml.pyc in iterfind(self, path, **kwargs)
    393                         if (absolute and elem.getparent() is None) or not absolute:
    394                             for child in get_rel_path(elem, nodes[1:]):
--> 395                                 info = self._get_info_smart(child, **kwargs)
    396                                 if cond is None or satisfied(info, cond):
    397                                     yield info

C:\Python27\lib\site-packages\pyteomics\mzid.pyc in _get_info_smart(self, element, **kwargs)
    118             return self._get_info(element,
    119                     recursive=(rec if rec is not None else True),
--> 120                     **kwargs)
    121 
    122     def _retrieve_refs(self, info, **kwargs):

C:\Python27\lib\site-packages\pyteomics\xml.pyc in _get_info(self, element, **kwargs)
    296                     else:
    297                         info.setdefault(cname, []).append(
--> 298                                 self._get_info_smart(child, **kwargs))
    299 
    300         # process element text

C:\Python27\lib\site-packages\pyteomics\mzid.pyc in _get_info_smart(self, element, **kwargs)
    118             return self._get_info(element,
    119                     recursive=(rec if rec is not None else True),
--> 120                     **kwargs)
    121 
    122     def _retrieve_refs(self, info, **kwargs):

C:\Python27\lib\site-packages\pyteomics\xml.pyc in _get_info(self, element, **kwargs)
    296                     else:
    297                         info.setdefault(cname, []).append(
--> 298                                 self._get_info_smart(child, **kwargs))
    299 
    300         # process element text

C:\Python27\lib\site-packages\pyteomics\mzid.pyc in _get_info_smart(self, element, **kwargs)
    118             return self._get_info(element,
    119                     recursive=(rec if rec is not None else True),
--> 120                     **kwargs)
    121 
    122     def _retrieve_refs(self, info, **kwargs):

C:\Python27\lib\site-packages\pyteomics\xml.pyc in _get_info(self, element, **kwargs)
    316         # resolve refs
    317         if kwargs.get('retrieve_refs'):
--> 318             self._retrieve_refs(info, **kwargs)
    319 
    320         # flatten the excessive nesting

C:\Python27\lib\site-packages\pyteomics\mzid.pyc in _retrieve_refs(self, info, **kwargs)
    125         for k, v in dict(info).items():
    126             if k.endswith('_ref'):
--> 127                 info.update(self.get_by_id(v, retrieve_refs=True))
    128                 del info[k]
    129                 info.pop('id', None)

TypeError: 'NoneType' object is not iterable
 
 
Any idea what the problem is?
 
Thanks
 
Mathieu Courcelles, Ph.D. Bio-informatique
Associé de recherche, Centre d’analyse protéomique avancée
IRIC - Institut de recherche en immunologie et cancérologie
Université de Montréal

Mathieu Courcelles

Oct 7, 2015, 7:16:39 PM
to pyte...@googlegroups.com
Hello,
 
You can get my mzIdentML file here:
 
It is generated by PEAKS.
 
I traced the problem to this value, which is passed to get_by_id(v, retrieve_refs=True):
 
PEPTIDEEVIDENCE_PEPTIDE_2_DBSEQUENCE_22_756:22_752
 
 
I guess it doesn't handle the 22_756:22_752 concatenation.
 

Lev Levitsky

Oct 8, 2015, 9:43:54 AM
to pyteomics, math...@hotmail.com
Dear Mathieu,

thanks for the report! Indeed, the problem is the "PEPTIDEEVIDENCE_PEPTIDE_2_DBSEQUENCE_22_756:22_752" reference.
The reference is not resolved because the file does not contain an element with ID equal to that string, hence get_by_id returns None, which in turn leads to the exception you see.
Here is how I am checking for the presence of a reference:
This reference from the first entry is found fine by the parser: 'PEPTIDEEVIDENCE_PEPTIDE_1_DBSEQUENCE_#DECOY#7_4146'
So if I grep the file for this string, I get:

$ grep -F 'PEPTIDEEVIDENCE_PEPTIDE_1_DBSEQUENCE_#DECOY#7_4146' peptides_1_1_0.mzid 
    <PeptideEvidence isDecoy="true" end="164" start="157" peptide_ref="PEPTIDE_1" dBSequence_ref="DBSEQUENCE_6040" id="PEPTIDEEVIDENCE_PEPTIDE_1_DBSEQUENCE_#DECOY#7_4146" pre="N" post="A"/>
            <PeptideEvidenceRef peptideEvidence_ref="PEPTIDEEVIDENCE_PEPTIDE_1_DBSEQUENCE_#DECOY#7_4146"/>
            <PeptideHypothesis peptideEvidence_ref="PEPTIDEEVIDENCE_PEPTIDE_1_DBSEQUENCE_#DECOY#7_4146">

Note how there is a PeptideEvidence element with id="PEPTIDEEVIDENCE_PEPTIDE_1_DBSEQUENCE_#DECOY#7_4146".

And for the problematic reference:

$ grep -F 'PEPTIDEEVIDENCE_PEPTIDE_2_DBSEQUENCE_22_756:22_752' peptides_1_1_0.mzid 
            <PeptideEvidenceRef peptideEvidence_ref="PEPTIDEEVIDENCE_PEPTIDE_2_DBSEQUENCE_22_756:22_752"/>

There is a PeptideEvidenceRef, but no corresponding PeptideEvidence.
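
The same check can be run over every reference at once instead of grepping for one ID at a time. The following is only a sketch using the standard library, not anything pyteomics provides: the sample XML, the shortened IDs and the find_dangling_refs helper are made up for illustration, and namespaces are left out for brevity.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for an mzIdentML fragment (hypothetical, shortened IDs).
SAMPLE = """<MzIdentML>
  <PeptideEvidence id="PE_2_DBSEQUENCE_22_756"/>
  <PeptideEvidence id="PE_2_DBSEQUENCE_22_752"/>
  <PeptideEvidenceRef peptideEvidence_ref="PE_2_DBSEQUENCE_22_756"/>
  <PeptideEvidenceRef peptideEvidence_ref="PE_2_DBSEQUENCE_22_756:22_752"/>
</MzIdentML>"""


def find_dangling_refs(root):
    """Return the sorted *_ref attribute values that match no element id."""
    ids = set()
    refs = set()
    for elem in root.iter():
        for name, value in elem.attrib.items():
            if name == 'id':
                ids.add(value)
            elif name.endswith('_ref'):
                refs.add(value)
    return sorted(refs - ids)


dangling = find_dangling_refs(ET.fromstring(SAMPLE))
print(dangling)
```

For a real file, ET.parse(path).getroot() could replace ET.fromstring(SAMPLE); for very large files this loads the whole tree into memory, so it is meant as a diagnostic rather than a parser replacement.
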
If my understanding is correct, this is ultimately a problem with the file containing an unresolved reference, and all I can do is make the parser ignore it and move on.
The latest commit in the pyteomics repo should make your code work and print a warning for the references it cannot resolve.
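
To illustrate the failure and the "warn and move on" behavior without pyteomics, here is a minimal sketch. It is not the actual commit: lookup, id_index and resolve_refs are hypothetical stand-ins for the parser's ID lookup and reference resolution.

```python
import warnings

id_index = {}  # hypothetical ID table; the concatenated ID is not in it


def lookup(ref):
    """Mimic a lookup that returns None for an unknown ID."""
    return id_index.get(ref)


info = {'peptideEvidence_ref': 'PE_2_DBSEQUENCE_22_756:22_752'}

# 1. The failure: dict.update(None) raises TypeError.
try:
    info.update(lookup(info['peptideEvidence_ref']))
except TypeError as exc:
    error_message = str(exc)


# 2. A tolerant variant: warn about the unresolved reference and keep going.
def resolve_refs(info, lookup):
    for key, value in dict(info).items():
        if not key.endswith('_ref'):
            continue
        target = lookup(value)
        if target is None:
            warnings.warn('Ignoring unresolved reference: ' + value)
            continue  # leave the raw reference string in place
        info.update(target)
        del info[key]
    return info


print(error_message)
print(resolve_refs(dict(info), lookup))
```

With the guard in place the unresolved reference stays in the record as a plain string, so downstream code can still see which ID was missing.
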
Also note that it's a good idea to pass build_id_cache=True when creating an MzIdentML object if you want to use retrieve_refs, because it saves a lot of time if you can afford the RAM:

with mzid.MzIdentML(file, build_id_cache=True, retrieve_refs=True) as psms:
    for i, psm in enumerate(psms):
        print i

Please let me know if there is anything else I can do.

Best regards,
Lev


--
Lev Levitsky
Institute for Energy Problems of Chemical Physics RAS
Laboratory of Physical and Chemical Methods for Structure Analysis
Leninsky pr. 38, bld. 2 119334 Moscow Russia
tel: +7 499 1378257 fax: +7 499 1378257, +7 499 1378258

Lev Levitsky

Oct 9, 2015, 8:28:10 AM
to Mathieu Courcelles, pyteomics
On Thu, Oct 8, 2015 at 5:08 PM, Mathieu Courcelles <math...@hotmail.com> wrote:
Hello Lev,
 
I think the appropriate correction for this problem would be to search for two references, since PEAKS concatenates protein identifiers and separates them with ":".
 
PEPTIDEEVIDENCE_PEPTIDE_2_DBSEQUENCE_22_756
PEPTIDEEVIDENCE_PEPTIDE_2_DBSEQUENCE_22_752
 
These two exist.
 
So you would need to split 22_756:22_752 on the : delimiter, search for both IDs, and then add the two results to the list x['SpectrumIdentificationItem'][0]['PeptideEvidenceRef']

Thanks for the heads-up!
Looks like PEAKS goes against the mzIdentML standard here, which says that to reference multiple sequences one should use multiple PeptideEvidenceRef elements.
Unfortunately, I don't see a simple way to support this behavior reliably without breaking anything, so the best I can suggest at the moment is using this subclass of MzIdentML (code attached):

import re
import warnings
from pyteomics import mzid, xml

class PEAKSMzIdentML(mzid.MzIdentML):

    # group 1: ID prefix up to and including "_DBSEQUENCE_";
    # group 2: one or more sequence suffixes joined with ":"
    _id_pattern = r'([^:]*_DBSEQUENCE_)(.*)'

    def _retrieve_refs(self, info, **kwargs):
        for k, v in dict(info).items():
            if k.endswith('_ref'):
                by_id = self.get_by_id(v, retrieve_refs=True)
                if by_id is None:
                    # Possibly a PEAKS-style concatenated ID: split it and look up each part
                    m = re.match(PEAKSMzIdentML._id_pattern, v)
                    if m is None:
                        warnings.warn('Ignoring unresolved reference: ' + v)
                        continue  # leave the raw reference string in place
                    info[k] = []
                    pref = m.group(1)
                    for suf in m.group(2).split(':'):
                        info[k].append(self.get_by_id(pref+suf, retrieve_refs=True))
                else:
                    info[k] = [by_id]
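
To see what the pattern does with the ID from this thread, the splitting step can be tried on its own. This is a standalone sketch: split_peaks_ref is a hypothetical helper that mirrors the regex logic above without touching any file.

```python
import re

ID_PATTERN = r'([^:]*_DBSEQUENCE_)(.*)'  # same pattern as _id_pattern above


def split_peaks_ref(ref):
    """Split a colon-concatenated reference into individual IDs."""
    m = re.match(ID_PATTERN, ref)
    if m is None:
        return [ref]  # not a DBSEQUENCE-style ID; return it unchanged
    prefix, suffixes = m.groups()
    return [prefix + suffix for suffix in suffixes.split(':')]


parts = split_peaks_ref('PEPTIDEEVIDENCE_PEPTIDE_2_DBSEQUENCE_22_756:22_752')
print(parts)
# ['PEPTIDEEVIDENCE_PEPTIDE_2_DBSEQUENCE_22_756',
#  'PEPTIDEEVIDENCE_PEPTIDE_2_DBSEQUENCE_22_752']
```

An ID without a colon, such as the decoy ID from the first entry, comes back as a single-element list.
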


The output will be slightly different from the regular MzIdentML, because it will have an extra level of nesting for refs, but at least the data will be there:

from peaks_parser import PEAKSMzIdentML
with PEAKSMzIdentML(fname, retrieve_refs=True, build_id_cache=True) as f:
    for i, x in enumerate(f):
        print i, len(x['SpectrumIdentificationItem'][0]['PeptideEvidenceRef'][0]['peptideEvidence_ref'])

Best regards,
Lev

peaks_parser.py

Lev Levitsky

Oct 9, 2015, 11:49:29 AM
to Mathieu Courcelles, pyteomics
On Fri, Oct 9, 2015 at 5:43 PM, Mathieu Courcelles <math...@hotmail.com> wrote:
Thanks for the subclass.
 
How can I use it with mzid.qvalues method?
 
I tried the following without success:
 
with PEAKSMzIdentML(file, retrieve_refs=True, build_id_cache=True) as psms:
 
    q1 = mzid.qvalues(psms, retrieve_refs=True,
                      key=lambda x: x['SpectrumIdentificationItem'][0]['PEAKS:peptideScore'],
                      is_decoy=my_isDecoy
                      )

mzid.qvalues is just a thin wrapper around the auxiliary.qvalues function that does the work. Its job is to accept MzIdentML file names instead of iterables and to provide some default values for key and is_decoy.
Since you use neither of these features, you can just switch to using auxiliary.qvalues directly. Note that on the current release this exposes a bug that I hadn't noticed before, so until the 3.1.1 fix release please use the latest repo version:

from peaks_parser import PEAKSMzIdentML
from pyteomics import auxiliary as aux

with PEAKSMzIdentML(fname, retrieve_refs=True, build_id_cache=True) as psms:
    q1 = aux.qvalues(psms, key=lambda x: x['SpectrumIdentificationItem'][0]['PEAKS:peptideScore'], is_decoy=my_isDecoy)
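
The my_isDecoy function is not shown in this thread. A possible sketch follows, assuming the nested layout that PEAKSMzIdentML produces (each 'peptideEvidence_ref' is a list of resolved PeptideEvidence dicts) and assuming each resolved dict carries an 'isDecoy' value that is either a boolean or the string "true"/"false". Both assumptions should be checked against real parser output.

```python
def _as_bool(value):
    """Accept a real boolean or the XML strings 'true'/'false'."""
    return value is True or str(value).lower() == 'true'


def my_isDecoy(psm):
    """Hypothetical decoy test: True only if every resolved evidence is a decoy."""
    item = psm['SpectrumIdentificationItem'][0]
    evidences = [
        ev
        for ref in item['PeptideEvidenceRef']
        for ev in ref['peptideEvidence_ref']
        if ev is not None  # a split ID may itself fail to resolve
    ]
    return bool(evidences) and all(_as_bool(ev.get('isDecoy')) for ev in evidences)


# Hand-made record in the assumed shape, for illustration only.
example_psm = {'SpectrumIdentificationItem': [{
    'PeptideEvidenceRef': [
        {'peptideEvidence_ref': [{'isDecoy': True}, {'isDecoy': False}]},
    ],
}]}
print(my_isDecoy(example_psm))
```

Treating a PSM as a decoy only when all of its protein matches are decoys is one common convention; a stricter or looser rule may suit other analyses.
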

Mathieu Courcelles

Aug 3, 2016, 1:53:44 PM
to Lev Levitsky, pyteomics
Hello,
 
I upgraded from pyteomics version 3.1.1 to 3.3.1 and I now have an issue with parsing a PEAKS mzIdentML file. You previously sent me a custom subclass to make this work.
 
I have been getting this error since the upgrade:
 
  File "C:\Users\Mathieu\OneDrive\code\mhc\venv\lib\site-packages\pyteomics\xml.py", line 325, in _get_info
    self._retrieve_refs(info, **kwargs)
  File "C:\Users\Mathieu\OneDrive\code\mhc\src\library\peaks_parser.py", line 89, in _retrieve_refs
    by_id = self.get_by_id(v, retrieve_refs=True)
  File "C:\Users\Mathieu\OneDrive\code\mhc\venv\lib\site-packages\pyteomics\xml.py", line 900, in get_by_id
    data = self._get_info_smart(elem, **kwargs)
  File "C:\Users\Mathieu\OneDrive\code\mhc\src\library\peaks_parser.py", line 30, in _get_info_smart
    name = xml._local_name(element)
  File "C:\Users\Mathieu\OneDrive\code\mhc\venv\lib\site-packages\pyteomics\xml.py", line 51, in _local_name
    if element.tag and element.tag[0] == '{':
AttributeError: 'NoneType' object has no attribute 'tag'
 
 
Thanks for your help
 

Joshua Klein

Aug 3, 2016, 5:01:29 PM
to pyteomics, Lev Levitsky

Hello,

It looks like your example mzIdentML file is no longer hosted, so I can’t test this on your exact file, but I consume mzIdentML files from PEAKS too. Here’s the workaround I use for their dbSequence id problem. It rewrites more of the guts of _get_info and _retrieve_refs to handle the multiple-protein problem. Just use MultiProteinMzIdentML as you would the other versions of MzIdentML.

Because it overrides those internals, this solution may not be as stable. It is also a little more aggressive in pursuit of Param tags.

--

peaks_mzid.py