Question regarding fasta.IndexedUniProt

Vartika

Mar 30, 2022, 6:02:35 AM
to pyte...@googlegroups.com
Hi, 

I am Vartika, and I have a question about fasta.IndexedUniProt. Can it read protein isoforms? For example, Q86U42-2, an isoform of Q86U42, cannot be read.
The isoform Q86U42-2 is present in the FASTA file, but when I try to access it, I get a KeyError.

CODE: 
from pyteomics import fasta
human_fasta = fasta.IndexedUniProt('HUMAN.fasta')
human_fasta["Q86U42-2"] 
The last command raises a KeyError, although the protein Q86U42-2 is present in the FASTA file (checked using grep).

Thank you for your time and kind assistance.

Best regards,
Vartika K 


Vartika

Mar 30, 2022, 6:11:48 AM
to pyte...@googlegroups.com
Hi, 

It must be the hyphen in the key that is causing the problem. I suppose the way the dictionary is initialised needs to be changed?

Thanks & Regards,
Vartika K 

Lev Levitsky

Mar 30, 2022, 6:55:57 AM
to pyte...@googlegroups.com
Hi Vartika,

The UniProt parser should work with IDs containing a hyphen. Your code works for me without any changes:

In [1]: from pyteomics import fasta

In [2]: human_fasta = fasta.IndexedUniProt("HUMAN.fasta")

In [3]: human_fasta["Q86U42-2"]
Out[3]: Protein(description={'db': 'sp', 'id': 'Q86U42-2', 'entry': 'PABP2_HUMAN', 'name': 'Isoform 2 of Polyadenylate-binding protein 2', 'gene_id': 'PABP2', 'taxon': 'HUMAN', 'OS': 'Homo sapiens', 'GN': 'PABPN1'}, sequence='MAAAAAAAAAAGAAGGRGSGPGRRRHLVPGAGGEAGEGAPGGAGDYGNGLESEELEPEELLLEPEPEPEPEEEPPRPRAPPGAPGPGPGSGAPGSQEEEEEPGLVEGDPGDGAIEDPELEAIKARVREMEEEAEKLKELQNEVEKQMNMSPPPGNAGPVIMSIEEKMEADARSIYVGNVDYGATAEELEAHFHGCGSVNRVTILCDKFSGHPKGFAYIEFSDKESVRTSLALDESLFRGRQIKVIPKRTNRPGISTTDRGFPRARYRARTTNYNSSRSRFYSGFNSRPRGRVYRSG')

This is with an old file where such an entry actually exists. Right now, this isoform doesn't seem to exist in UniProt, so with a freshly downloaded database I get a KeyError, too (and grep confirms that the entry really is not there).

Can you show how the entry looks in your file? Or maybe share a copy of your file that allows reproducing the problem? It doesn't have to be the full database; an excerpt is enough.
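
In case it helps with checking how the entry looks, here is a minimal sketch of a grep-like search over FASTA headers in plain Python. The function name find_headers and the sample lines are hypothetical and only for illustration: the isoform header is reconstructed from the parsed description shown above, and the sequences are truncated.

```python
def find_headers(lines, accession):
    """Return FASTA header lines (without the leading '>') that contain `accession`."""
    return [line[1:].rstrip() for line in lines
            if line.startswith('>') and accession in line]


# Illustrative in-memory excerpt; a real check would iterate over the file itself.
sample = [
    '>sp|Q86U42|PABP2_HUMAN Polyadenylate-binding protein 2 OS=Homo sapiens GN=PABPN1\n',
    'MAAAAAAAAAAGAAGGRGSGPGRRRHLVPGAGGEAGEGAPGG\n',
    '>sp|Q86U42-2|PABP2_HUMAN Isoform 2 of Polyadenylate-binding protein 2 OS=Homo sapiens GN=PABPN1\n',
    'MAAAAAAAAAAGAAGGRGSGPGRRRHLVPGAGGEAGEGAPGG\n',
]

hits = find_headers(sample, 'Q86U42-2')
for header in hits:
    print(header)
```

Since the function accepts any iterable of lines, an open file handle (for example, the result of open('HUMAN.fasta')) can be passed in place of the sample list. Seeing the full header line, not just the accession, shows every field the header parser has to handle.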

Best regards,
Lev


--
Lev Levitsky
Institute for Energy Problems of Chemical Physics RAS
Laboratory of Physical and Chemical Methods for Structure Analysis
Leninsky pr. 38, bld. 2 119334 Moscow Russia
tel: +7 499 1378257 fax: +7 499 1378257, +7 499 1378258

Lev Levitsky

Apr 4, 2022, 11:47:12 AM
to Vartika, pyteomics
Hi Vartika,

Indeed, with your file the UniProt header parser cannot parse the entries with a hyphen. However, the problematic hyphen is not in the accession, where it is allowed, but in the entry name (in this case, PABP2-2_HUMAN), where it is not allowed.
If you want, you can define your own indexed FASTA class with a relaxed pattern that allows hyphens in the entry name, e.g.:

class MyIndexedUniProt(fasta.IndexedUniProt):
    # Relaxed pattern: the entry name (third field) may also contain hyphens.
    header_pattern = r'^(\w+)\|([-\w]+)\|([-\w]+)\s+([^=]*\S)((\s+\w+=[^=]+(?!\w*=))+)\s*$'

and use it instead of fasta.IndexedUniProt:

human_fasta = MyIndexedUniProt('HUMAN.fasta')
human_fasta["Q86U42-2"]
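
To see the effect of the relaxed pattern without pyteomics, here is a small sketch using only the re module. Two things in it are assumptions: STRICT_PATTERN is a hypothetical stricter variant that differs from the relaxed pattern only in the entry-name field (it may not be identical to the pattern shipped in a given pyteomics version), and the header string is an illustrative example built around the entry name PABP2-2_HUMAN, not a line copied from the actual file.

```python
import re

# Assumption: a stricter variant whose third field (entry name) accepts only \w+.
STRICT_PATTERN = r'^(\w+)\|([-\w]+)\|(\w+)\s+([^=]*\S)((\s+\w+=[^=]+(?!\w*=))+)\s*$'
# The relaxed pattern: the third field also accepts hyphens.
RELAXED_PATTERN = r'^(\w+)\|([-\w]+)\|([-\w]+)\s+([^=]*\S)((\s+\w+=[^=]+(?!\w*=))+)\s*$'

# Illustrative header with a hyphen in both the accession and the entry name.
header = ('sp|Q86U42-2|PABP2-2_HUMAN Isoform 2 of Polyadenylate-binding protein 2 '
          'OS=Homo sapiens GN=PABPN1')

strict_match = re.match(STRICT_PATTERN, header)
relaxed_match = re.match(RELAXED_PATTERN, header)

print('strict pattern matches: ', strict_match is not None)
print('relaxed pattern matches:', relaxed_match is not None)
if relaxed_match:
    # The first four groups are database, accession, entry name and protein name.
    db, accession, entry_name, protein_name = relaxed_match.groups()[:4]
    print(accession, entry_name)
```

A header that matches no pattern is not parsed into an accession, which would explain why lookup by "Q86U42-2" fails even though grep finds the entry in the file.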

Best regards,
Lev

On Mon, Apr 4, 2022 at 6:27 PM Vartika <vartika.kh...@gmail.com> wrote:
Hi Lev,
Sorry, I did not receive your reply. The FASTA file I use combines the main FASTA file and the additional FASTA sequences (isoforms and predicted). I have attached the FASTA file below. Thanks for your time and kind consideration.
Best regards,
Vartika K 

On Mon, Apr 4, 2022 at 7:55 PM Lev Levitsky <lev.le...@phystech.edu> wrote:
Hi Vartika, I am re-sending my reply to your email in case you did not receive it.