pyBioLCCC stumbles when non-standard amino acids are in a sequence


Achim

Jul 20, 2011, 10:34:05 AM
to biolccc
Dear Anton,

I was processing a long list of sequences, when pyBioLCCC fell over
due to the presence of an 'X' in the sequence. The error message that
I got was:

========================
Traceback (most recent call last):
  File "<pyshell#20>", line 3, in <module>
    pyBioLCCC.standardChromoConditions)
  File "/usr/local/lib/python2.6/dist-packages/pyBioLCCC.py", line 2022, in calculateRT
    return _pyBioLCCC.calculateRT(*args)
BioLCCCException
========================

It might be worth trapping these occurrences to at least produce a more
meaningful error message. You could use this 'isAminoAcid' function
that was inspired by Katja Schuerer/Catherine Letondal's python
tutorial:

========================
# test if this is an amino acid (Python 2 syntax)

# monoisotopic residue masses in Da, keyed by one-letter code
aa_mass = { 'A': 71.03711, 'R':156.10111, 'N':114.04293,
            'D':115.02694, 'C':103.00919, 'E':129.04259,
            'Q':128.05858, 'G': 57.02146, 'H':137.05891,
            'I':113.08406, 'L':113.08406, 'K':128.09496,
            'M':131.04049, 'F':147.06841, 'P': 97.05276,
            'S': 87.03203, 'T':101.04768, 'W':186.07931,
            'Y':163.06333, 'V': 99.06841
          }


def isAminoAcid(aa):
    # Look the letter up in the mass table. A substring test against
    # 'ACDEFGHIKLMNPQRSTVWY' would also accept '' or 'AC'.
    return aa.upper() in aa_mass


# main program for testing

pepMass = 0
seq = raw_input("Please enter a peptide sequence: ").upper()

for i in range(len(seq)):
    if not isAminoAcid(seq[i]):
        # report the 1-based position of the first unknown residue and stop
        print "This peptide contains an undefined amino acid in position", (i + 1)
        break
    else:
        pepMass += aa_mass[seq[i]]
========================

Best wishes,
Achim

Anton Goloborodko

Jul 21, 2011, 9:16:13 AM
to bio...@googlegroups.com
Dear Achim,

thank you for your bug report.
This behavior is intentional in the libBioLCCC/pyBioLCCC string parser. However, we fully agree with you that the error message needs to be clearer.

libBioLCCC/pyBioLCCC accepts sequence strings in an extended IUPAC notation (http://theorchromo.ru/lib/tutorial.html#peptide-sequence-notation). We call it "modX": each amino acid is written as a label made of optional lowercase letters that define a modification, followed by a single uppercase letter. modX also uses a simple dash notation for terminal peptide modifications, e.g. 'Ac-' means N-terminal acetylation and '-NH2' means C-terminal amidation.
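
To make the notation concrete, here is a rough sketch of how such a string could be split into its parts. It is not libBioLCCC's actual parser, only one reading of the description above. The function name `split_modx`, the two tuples of terminal groups (which contain only the examples named in this thread) and the residue label 'pS' in the usage note are assumptions made for illustration. The sketch checks only the shape of the string, not whether a given label is known to the library.

```python
import re

# A residue label: zero or more lowercase letters (the modification)
# followed by exactly one uppercase letter (the amino acid).
RESIDUE = re.compile(r"[a-z]*[A-Z]")

# Assumption: only the terminal groups mentioned in this thread.
N_TERMINAL_GROUPS = ("Ac-",)
C_TERMINAL_GROUPS = ("-NH2",)


def split_modx(sequence):
    """Split a modX-like string into (n_term, residues, c_term).

    Raises ValueError if the part between the terminal groups is not
    a plain run of residue labels.
    """
    # Strip a recognised N-terminal group, if any.
    n_term = next((g for g in N_TERMINAL_GROUPS if sequence.startswith(g)), "")
    body = sequence[len(n_term):]

    # Strip a recognised C-terminal group, if any.
    c_term = next((g for g in C_TERMINAL_GROUPS if body.endswith(g)), "")
    if c_term:
        body = body[:-len(c_term)]

    # The labels must cover the body completely, with nothing left over.
    residues = RESIDUE.findall(body)
    if "".join(residues) != body:
        raise ValueError("cannot parse %r as a modX-like sequence" % sequence)
    return n_term, residues, c_term
```

With these assumptions, `split_modx("Ac-PEPpSIDE-NH2")` yields the N-terminal group 'Ac-', the residue labels P, E, P, pS, I, D, E, and the C-terminal group '-NH2', while a string such as "PEP*TIDE" raises a ValueError.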

The error you have shown occurs when a sequence cannot be parsed according to the modX rules. This can happen if it contains a non-standard amino acid or uses non-alphabetic symbols incorrectly. libBioLCCC fails rather than trying to work around the problem because we cannot guarantee a correct retention time prediction in this case. A "hard" failure forces the user to either resolve the ambiguous sequence or remove it from consideration. Otherwise the result itself would be ambiguous, and the user would not even notice.
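
When processing a long list, the hard failure can be kept while still finishing the batch: catch the exception per sequence and record which entries were rejected, so they can be fixed or dropped explicitly. The sketch below is only an illustration. `predict_all` and `toy_predict` are hypothetical names, and `toy_predict` is a stand-in that returns a dummy number, not a retention time. In real use you would pass a small function of your own that wraps `pyBioLCCC.calculateRT` with the arguments you already use.

```python
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def toy_predict(sequence):
    """Stand-in for a real predictor: fails hard on unknown residues."""
    for position, residue in enumerate(sequence, start=1):
        if residue not in STANDARD_RESIDUES:
            raise ValueError(
                "unknown residue %r at position %d" % (residue, position))
    return float(len(sequence))  # dummy value, not a retention time


def predict_all(sequences, predict):
    """Call predict(seq) for every sequence without stopping at a failure.

    Returns (results, failures): results maps sequence -> predicted value,
    failures lists (index, sequence, exception) for every rejected entry.
    """
    results, failures = {}, []
    for index, seq in enumerate(sequences):
        try:
            results[seq] = predict(seq)
        except Exception as error:  # broad on purpose: this is a sketch
            failures.append((index, seq, error))
    return results, failures


results, failures = predict_all(["PEPTIDE", "PEPXIDE", "ACDK"], toy_predict)
for index, seq, error in failures:
    print("skipped entry %d (%s): %s" % (index, seq, error))
```

Nothing is silently predicted for a rejected sequence; it ends up in `failures` together with its position in the input list.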

But as you have noticed, the error message is too obscure and does not help to find its source. We will fix this behavior in the next libBioLCCC/pyBioLCCC release.

With best regards,
Anton Goloborodko.