Making BLAST database for the family consensus sequences...
Making BLAST database for the reference protein sequences...
BLASTing the consensus family sequences against themselves...
BLASTing the consensus family sequences against the reference protein sequences...
Finding overlap with reference database...
Finding overlap with family consensus database...
Traceback (most recent call last):
  File "./shortbred_identify.py", line 364, in <module>
    dictGOICounts = pb.MarkX(dictGOIGenes,dictGOICounts)
  File "/scicomp/home/kui8/shortbred/src/process_blast.py", line 491, in MarkX
    dictOverlap[strName][i] = dictOverlap[strName][i] + 9999999
KeyError: 'gene_10886|GeneMark_hmm|34_aa|+|22053|22154'
Any thoughts on why I'm still getting this error? Unfortunately, I can't share the data at this point, but below are a few sequences from the MetaGeneMark output:
>gene_5|GeneMark.hmm|24_aa|-|2954|3025 >1BAAD10000013AAF2211ABC0A/A//AA//E/BGF11000B0B/AC//BB11BD112DE222FG21@BG11B/>EEHG>//@/?12B0B>>B>EB112>BF////111B1B11/B@
MPKTGQPFLLSSIPTQKQKSQYTN
>gene_6|GeneMark.hmm|86_aa|-|5786|6046 >1BAAD10000013AAF2211ABC0A/A//AA//E/BGF11000B0B/AC//BB11BD112DE222FG21@BG11B/>EEHG>//@/?12B0B>>B>EB112>BF////111B1B11/B@
LxxxxxxxxxxxxxxxxxxxxxxxxxxxxxPxxQYTNxxKKKKKKKKKKNKKKKNKKKKI
KNKKLKKKNNKLKKKNYNGKNLKFKL
>gene_7|GeneMark.hmm|52_aa|+|16043|16201 >1BAAD10000013AAF2211ABC0A/A//AA//E/BGF11000B0B/AC//BB11BD112DE222FG21@BG11B/>EEHG>//@/?12B0B>>B>EB112>BF////111B1B11/B@
LIETDESSEKKKEKEGEKQTESESEKGNKTGNGDCEIAIQRMADPTDRKSVV
I ran awk '{print $1}' to keep just the first column:
>gene_5|GeneMark.hmm|24_aa|-|2954|3025
MPKTGQPFLLSSIPTQKQKSQYTN
>gene_6|GeneMark.hmm|86_aa|-|5786|6046
LxxxxxxxxxxxxxxxxxxxxxxxxxxxxxPxxQYTNxxKKKKKKKKKKNKKKKNKKKKI
KNKKLKKKNNKLKKKNYNGKNLKFKL
>gene_7|GeneMark.hmm|52_aa|+|16043|16201
LIETDESSEKKKEKEGEKQTESESEKGNKTGNGDCEIAIQRMADPTDRKSVV
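For anyone who wants to reproduce that step without awk, here is a rough Python sketch of what I understand the command to do: keep only the first whitespace-delimited field of every line. The function name keep_first_column and the trailing text in the example header are made up for illustration.

```python
def keep_first_column(lines):
    """Rough equivalent of awk '{print $1}': first whitespace-delimited field per line."""
    out = []
    for line in lines:
        fields = line.split()
        # awk prints an empty line when a line has no fields
        out.append(fields[0] if fields else "")
    return out


# Hypothetical two-line record; "trailing_text" stands in for the extra header text.
example = [
    ">gene_5|GeneMark.hmm|24_aa|-|2954|3025 >trailing_text",
    "MPKTGQPFLLSSIPTQKQKSQYTN",
]
for cleaned in keep_first_column(example):
    print(cleaned)
```

Sequence lines contain no whitespace, so they pass through unchanged; only the headers are shortened.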
I then ran the modified AdjustFasta...:
>gene_5|GeneMark_hmm|24_aa|-|2954|3025
MPKTGQPFLLSSIPTQKQKSQYTN
>gene_6|GeneMark_hmm|86_aa|-|5786|6046
LxxxxxxxxxxxxxxxxxxxxxxxxxxxxxPxxQYTNxxKKKKKKKKKKNKKKKNKKKKI
KNKKLKKKNNKLKKKNYNGKNLKFKL
>gene_7|GeneMark_hmm|52_aa|+|16043|16201
LIETDESSEKKKEKEGEKQTESESEKGNKTGNGDCEIAIQRMADPTDRKSVV
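I am not pasting the actual modified script here. The only change visible in the headers above is that "." became "_", so the sketch below just reproduces that effect and nothing else. The function name replace_dots_in_headers is hypothetical, and the real script may do more than this.

```python
def replace_dots_in_headers(lines):
    """Replace '.' with '_' in FASTA header lines; leave sequence lines untouched."""
    out = []
    for line in lines:
        if line.startswith(">"):
            line = line.replace(".", "_")
        out.append(line)
    return out


# Header and sequence taken from the excerpt above.
example = [
    ">gene_5|GeneMark.hmm|24_aa|-|2954|3025",
    "MPKTGQPFLLSSIPTQKQKSQYTN",
]
for adjusted in replace_dots_in_headers(example):
    print(adjusted)
```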
The adjusted FASTA output is what I used as the goi input for shortbred_identify.py.
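In case it helps narrow things down, this is the kind of check I can run on my side: a small sketch that counts how often the ID from the KeyError appears among the FASTA headers and lists any IDs that occur more than once. The helper names fasta_ids and check_id are my own and not part of ShortBRED, and I am only assuming that a missing or duplicated ID could be relevant.

```python
from collections import Counter


def fasta_ids(lines):
    """Collect the first whitespace-delimited token after '>' from each header line."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.append(fields[0])
    return ids


def check_id(lines, wanted):
    """Return (times `wanted` occurs, sorted list of IDs that occur more than once)."""
    counts = Counter(fasta_ids(lines))
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    return counts.get(wanted, 0), duplicates


# Demo on the three adjusted headers from this post. On the real data the lines
# would come from the goi FASTA file instead of this short excerpt.
excerpt = [
    ">gene_5|GeneMark_hmm|24_aa|-|2954|3025",
    "MPKTGQPFLLSSIPTQKQKSQYTN",
    ">gene_6|GeneMark_hmm|86_aa|-|5786|6046",
    "LxxxxxxxxxxxxxxxxxxxxxxxxxxxxxPxxQYTNxxKKKKKKKKKKNKKKKNKKKKI",
    "KNKKLKKKNNKLKKKNYNGKNLKFKL",
    ">gene_7|GeneMark_hmm|52_aa|+|16043|16201",
    "LIETDESSEKKKEKEGEKQTESESEKGNKTGNGDCEIAIQRMADPTDRKSVV",
]
wanted = "gene_10886|GeneMark_hmm|34_aa|+|22053|22154"
count, duplicates = check_id(excerpt, wanted)
print("occurrences of wanted ID:", count)
print("duplicated IDs:", duplicates)
```

The excerpt does not include gene_10886, so a count of zero is expected there; the check is only meaningful on the full file.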
Please help!
Thanks,
Nsa