Help with generating markers.

454 views
Skip to first unread message

Nsa Dada

unread,
Jun 6, 2016, 10:29:31 AM6/6/16
to shortbred-users
Hi everyone!

I'm new to NGS and bioinformatics so please bear with me.
I am trying to create markers using shortbred.identify and I'm getting the following error message.

Making BLAST database for the family consensus sequences...
Making BLAST database for the reference protein sequences...
BLASTing the consensus family sequences against themselves...
BLASTing the consensus family sequences against the reference protein sequences...
Finding overlap with reference database...
Finding overlap with family consensus database...
Traceback (most recent call last):
  File "./shortbred_identify.py", line 364, in <module>
    dictGOICounts = pb.MarkX(dictGOIGenes,dictGOICounts)
  File "/scicomp/home/kui8/shortbred/src/process_blast.py", line 491, in MarkX
    dictOverlap[strName][i] = dictOverlap[strName][i] + 9999999
KeyError: 'gene_10878|GeneMark.hmm|89_aa|+|21012|21281'

I used predicted proteins (aa seqs output from MetaGeneMark) as the goi for shortbred.identify since I am trying to classify the proteins in my metagenomic library.

Any idea what the error means and how to overcome it will be greatly appreciated.
Thanks,
nsa


Eric Franzosa

unread,
Jun 6, 2016, 5:22:08 PM6/6/16
to Nsa Dada, shortbred-users
Hello Nsa,

This might be due to some of the special characters in your FASTA header. There is a util bundled with ShortBRED for "simplifying" FASTA headers ("util/AdjustFastaHeadersForShortBRED.py"). You would use this program as follows:

python AdjustFastaHeadersForShortBRED.py < special.fasta > nospecial.fasta

Can you try that and let us know if it resolves the error?

Thanks,
Eric



--
You received this message because you are subscribed to the Google Groups "shortbred-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to shortbred-use...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Nsa

unread,
Jun 7, 2016, 2:25:59 PM6/7/16
to shortbred-users, nsa...@gmail.com
Thank you for getting back Eric.

I ran the command (after cp [my protein.faa file] to special.fasta) and i got the following error:

kui8:/scicomp/home/kui8/shortbred/utils: python AdjustFastaHeadersForShortBRED.py < special.fasta > nospecial.fasta

Traceback (most recent call last):
  File "AdjustFastaHeadersForShortBRED.py", line 37, in <module>
    if strLine[0]==">":
IndexError: string index out of range


It produces the nospecial.fasta file with only one sequence.

What does the error mean and what output(s) should I expect? Also which of the outputs do I use as input for shortbred.identify?

Thank you for your help.
Nsa

Jim Kaminski

unread,
Jun 7, 2016, 2:40:24 PM6/7/16
to Nsa, shortbred-users
Hi Nsa,

It looks like my program is having trouble removing the special characters from your file. If you're willing, could you please send me your fasta file? I can update the program to handle these cases.

If you would prefer to not send the sequences, I understand. I can look at typical output from MetaGeneMark and try to adjust the program for those cases.

Thank you,

Jim

Nsa

unread,
Jun 7, 2016, 3:36:56 PM6/7/16
to shortbred-users, nsa...@gmail.com
Thanks for your prompt response Jim.

We added an extra condition to the script (in bold red) and it seemed to solve the problem:
for strLine in sys.stdin:
    strLine = strLine.strip()
    if strLine:
        if strLine[0]==">":

I'm running shortbred.identify with the AdjustFasta... output to see if it works.
I'll keep you posted.

thanks,
Nsa

Nsa

unread,
Jun 8, 2016, 9:14:53 AM6/8/16
to shortbred-users, nsa...@gmail.com
Hi all,

So I ran shortbred.identify using the adjustedfasta out put as goi and I'm still getting the same error as I did before:


Making BLAST database for the family consensus sequences...
Making BLAST database for the reference protein sequences...
BLASTing the consensus family sequences against themselves...
BLASTing the consensus family sequences against the reference protein sequences...
Finding overlap with reference database...
Finding overlap with family consensus database...
Traceback (most recent call last):
  File "./shortbred_identify.py", line 364, in <module>
    dictGOICounts = pb.MarkX(dictGOIGenes,dictGOICounts)
  File "/scicomp/home/kui8/shortbred/src/process_blast.py", line 491, in MarkX
    dictOverlap[strName][i] = dictOverlap[strName][i] + 9999999
KeyError: 'gene_10886|GeneMark_hmm|34_aa|+|22053|22154'

Any thoughts on why I'm still getting this error? Unfortunately, I can't share the data at this point but below are a few sequences from MetaGeneMark output:

>gene_5|GeneMark.hmm|24_aa|-|2954|3025    >1BAAD10000013AAF2211ABC0A/A//AA//E/BGF11000B0B/AC//BB11BD112DE222FG21@BG11B/>EEHG>//@/?12B0B>>B>EB112>BF////111B1B11/B@
MPKTGQPFLLSSIPTQKQKSQYTN

>gene_6|GeneMark.hmm|86_aa|-|5786|6046    >1BAAD10000013AAF2211ABC0A/A//AA//E/BGF11000B0B/AC//BB11BD112DE222FG21@BG11B/>EEHG>//@/?12B0B>>B>EB112>BF////111B1B11/B@
LxxxxxxxxxxxxxxxxxxxxxxxxxxxxxPxxQYTNxxKKKKKKKKKKNKKKKNKKKKI
KNKKLKKKNNKLKKKNYNGKNLKFKL

>gene_7|GeneMark.hmm|52_aa|+|16043|16201    >1BAAD10000013AAF2211ABC0A/A//AA//E/BGF11000B0B/AC//BB11BD112DE222FG21@BG11B/>EEHG>//@/?12B0B>>B>EB112>BF////111B1B11/B@
LIETDESSEKKKEKEGEKQTESESEKGNKTGNGDCEIAIQRMADPTDRKSVV

I ran awk '{print $1}' to keep just the first column:

>gene_5|GeneMark.hmm|24_aa|-|2954|3025
MPKTGQPFLLSSIPTQKQKSQYTN

>gene_6|GeneMark.hmm|86_aa|-|5786|6046
LxxxxxxxxxxxxxxxxxxxxxxxxxxxxxPxxQYTNxxKKKKKKKKKKNKKKKNKKKKI
KNKKLKKKNNKLKKKNYNGKNLKFKL

>gene_7|GeneMark.hmm|52_aa|+|16043|16201
LIETDESSEKKKEKEGEKQTESESEKGNKTGNGDCEIAIQRMADPTDRKSVV

Then ran the modified AdjustFasta...:

>gene_5|GeneMark_hmm|24_aa|-|2954|3025
MPKTGQPFLLSSIPTQKQKSQYTN
>gene_6|GeneMark_hmm|86_aa|-|5786|6046
LxxxxxxxxxxxxxxxxxxxxxxxxxxxxxPxxQYTNxxKKKKKKKKKKNKKKKNKKKKI
KNKKLKKKNNKLKKKNYNGKNLKFKL
>gene_7|GeneMark_hmm|52_aa|+|16043|16201
LIETDESSEKKKEKEGEKQTESESEKGNKTGNGDCEIAIQRMADPTDRKSVV

The output of the adjusted fasta is what I used as goi for shortbred.identify

Please help!
Thanks,
Nsa

Jim Kaminski

unread,
Jun 8, 2016, 11:49:02 AM6/8/16
to Nsa, shortbred-users
Hi Nsa,

I can look into this in more detail later today.

Based on the error, it looks like ShortBRED_Identify is having trouble finding the sequence for 'gene_10886|GeneMark_hmm|34_aa|+|22053|22154' in the dictionary it builds for the input data. Could you please send me the input for that sequence (header and sequence) that you pass to ShortBRED?

Thank you,

Jim

Nsa

unread,
Jun 8, 2016, 1:06:07 PM6/8/16
to shortbred-users, nsa...@gmail.com
All right, thanks Jim.

Below is the header and sequence for the offending sequence:


>gene_10886|GeneMark_hmm|34_aa|+|22053|22154
VSVRxxIRVLTxxxxxxxxxxxxxxxxPAxRxxG
>gene_10887|GeneMark_hmm|96_aa|-|22208|22495
MMRIVIGGCQQLIKNHWSVHEFxxGGxPxxxxxxxxxxxxxQYTNxxDLSSVERCAVSAN
THPADGAARTRAPSCSVCAGQYTNxxxxxxxxxxxx

Thank you!
Nsa

Jim Kaminski

unread,
Jun 8, 2016, 11:22:25 PM6/8/16
to Nsa, shortbred-users
Hi Nsa,

Thank you for sending your sequences, and for your patience.

To track down the problem, I pulled the current version of ShortBRED, added those two sequences to the example sequences in ShortBRED (example/input_prots.faa) and ran the program with the call below:

./shortbred_identify.py --goi  example/input_prots.faa --ref example/ref_prots.faa --markers mytestmarkers.faa --tmp example_identify --blastp /n/sw/ncbi-blast-2.2.28/bin/blastp --muscle /n/sw/muscle-3.8.31/bin/muscle --cdhit /n/sw/cd-hit-v4.6-2012-04-25/cd-hit --usearch /n/home11/jkaminski/usearch/usearch --makeblastdb /n/sw/ncbi-blast-2.2.28/bin/makeblastdb

The mytestmarkers.faa output file included the following (among the markers for the other test sequences):

>gene_10887|GeneMark_hmm|96_aa|-|22208|22495_TM_#01
MMRIVIGGCQQLIKNHWSVHEF
>gene_10887|GeneMark_hmm|96_aa|-|22208|22495_TM_#02
DLSSVERCAVSANTHPADGAARTRAPSCSVCAGQYTN
>gene_10886|GeneMark_hmm|34_aa|+|22053|22154_JM_#01__[gene_10886|GeneMark_hmm|34_aa|+|22053|22154_w=1.000]
VSVRxxIRVLTxxxxxxxxxxxxxxxxPAxRxxG

I believe ShortBRED ran fine, which leads me to think that there may be an issue in one of the programs it calls. Could you please let me know which versions of blastp, muscle, cdhit, and usearch you are using? 

In particular, it may be that we are using different versions of cdhit. If cdhit is changing the name of the sequence to remove special characters when it clusters the input sequences, it could cause problems for ShortBRED. In the tmp folder created by ShortBRED, could you please check the files clust/clust.faa and clust/clust.map and let me know what names are used for "gene_10886"? 

Thank you,

Jim

Nsa

unread,
Jun 9, 2016, 12:08:54 PM6/9/16
to shortbred-users, nsa...@gmail.com
Thank you Jim, I appreciate your help. Below are the versions of the programs I used:

blastp:
/apps/x86_64/blast/2.2.30+/bin/blastp

muscle:
/apps/x86_64/muscle/3.8.31/muscle

cdhit:
/apps/x86_64/CD-HIT/cd-hit-v4.6-stable/

usearch:
usearch v6.0.307


And the names of gene_10886 are below:


clust.faa

 >gene_10886|GeneMark_hmm|34_aa|+|22053|22154
VSVRxxIRVLTxxxxxxxxxxxxxxxxPAxRxxG

clust.map
gene_10886|GeneMark_hmm|34_aa|+|22053|22154    gene_10886|GeneMark_hmm|34_aa|+|22053|22154

Our bioinformatician ran the script with the following modifications and it seemed to work:

i modifed the MarkX function so it looks like this:
        for strName in dictGenes:
                if strName in dictOverlap:
                        for i in range(len(dictGenes[strName])):
                                if dictGenes[strName][i]=="X" or dictGenes[strName][i]=="x":

                                        dictOverlap[strName][i] = dictOverlap[strName][i] + 9999999

        return dictOverlap

added the "if strName in dictOverlap:" line

maybe we should run this by the shortbred authors.. they might have another fix for the selfblast step

But it I think it's throwing out the offending genes (our temp solution).

Thanks and looking forward to hearing back from you.
Cheers,
Nsa



Jim Kaminski

unread,
Jun 10, 2016, 2:06:18 PM6/10/16
to Nsa, shortbred-users
Thanks, Nsa.

I believe that edit will throw out the genes, as you noted, so I'd like to try to fix that. I'll see if I can reproduce the error on my end (this weekend or early next week) using the program versions you listed, and will dig in further.

Jim



Jim Kaminski

unread,
Jun 15, 2016, 10:24:58 PM6/15/16
to Nsa, shortbred-users
Hello Nsa,

I tried to replicate the error using programs with the same versions as yours, but did not encounter any problems. The versions I used are listed in the command below. It's possible my version of cdhit is different, but I also tried "cd-hit-v4.6-2012-04-25" and that one worked fine as well.

./shortbred_identify.py --goi  example/input_prots.faa --ref example/ref_prots.faa --markers mytestmarkers.faa --tmp example_identify --blastp /n/home11/jkaminski/blast/ncbi-blast-2.2.30+/bin/blastp   --muscle /n/sw/muscle-3.8.31/bin/muscle --cdhit /n/home11/jkaminski/cdhit/cd-hit-v4.6.1-2012-08-27/cd-hit  --usearch /n/home11/jkaminski/usearch/usearch --makeblastdb /n/home11/jkaminski/blast/ncbi-blast-2.2.30+/bin/makeblastdb 

/n/home11/jkaminski/usearch/usearch --version
usearch_i86linux32 v6.0.307

One other thing we can try is inspecting the contents of the dictionaries that are causing the error. It may be that the gene name is stored as a key, but has a slightly different format due to the special characters in genemark output, or a difference between how cdhit and blast store the name.

Could you please add the lines below to shortbred_identify.py, at line number 365?

print("Inspecting dict GOICounts:")
print(dictGOICounts.keys())
print("Inspecting dict GOIHits:")
print(dictGOIHits.keys())

These would be inserted right after:

#Get high-identity hits of *all lengths* in GOI database
sys.stderr.write( "Finding overlap with family consensus database...\n")
dictGOICounts, dictGOIHits  = pb.getOverlapCounts(strBlastSelf, args.dID, iLenMin, 1.0, 0,True)

Please then run the program again, and send me the output after "Inspecting dict GOICounts:"? I'd like to see if we find a string similar to the one for the genemark name that causes the problem, or if it is missing entirely. Also, which version of Python are you using?


Thank you,

Jim

Adam Blanchard

unread,
Mar 6, 2017, 9:44:04 AM3/6/17
to shortbred-users, nsa...@gmail.com
Hi Jim,

I've just started playing with ShortBRED and i'm encountering the exact same issue as described here regarding gene10886, which is odd as I'm using my own input proteins. I've added the print command into the script and run that with my inputs and here are the outputs:
GOIHits.txt
GOICounts.txt

Jim Kaminski

unread,
Mar 7, 2017, 2:44:34 PM3/7/17
to Adam Blanchard, shortbred-users, Nsa
Hi Adam,

This might also be a naming issue, but I'm not certain. As a check, could you please try what I just sent to Howard on the message board?

Could you please try running your fasta file through the program /shortbred/utils/AdjustFastaHeadersForShortBRED.py ? It will change any characters that can cause issues. After doing that, I would suggest you try to do a test run of these sequences against a very small reference database.

Thank you,

Jim



To unsubscribe from this group and stop receiving emails from it, send an email to shortbred-users+unsubscribe@googlegroups.com.

Adam Blanchard

unread,
Mar 8, 2017, 6:36:41 AM3/8/17
to shortbred-users, addyb...@gmail.com, nsa...@gmail.com
Hi Jim, 

Thanks for taking a look, I have run the adjustfasta and still get the same output unfortunately.

Traceback (most recent call last):
  File "./shortbred_identify.py", line 365, in <module>
    dictGOICounts = pb.MarkX(dictGOIGenes,dictGOICounts)
  File "/home/svzab1/Programs/ShortBRED/src/process_blast.py", line 491, in MarkX
    dictOverlap[strName][i] = dictOverlap[strName][i] + 9999999
KeyError: 'gene_10886|GeneMark_hmm|34_aa|+|22053|22154'

It does however run fine with the test dataset.

Jim Kaminski

unread,
Mar 10, 2017, 3:42:39 PM3/10/17
to Adam Blanchard, shortbred-users, Nsa
Hi Adam,

My apologies for the late response, I was traveling earlier this week and am now sick.

I have scheduled some time Monday night to look into this problem. For some reason, this fasta header is not being added to the dictionaries. I will try to find out why, and see if I can adjust the code to allow it in.

Best,

Jim

To unsubscribe from this group and stop receiving emails from it, send an email to shortbred-users+unsubscribe@googlegroups.com.

Jim Kaminski

unread,
Mar 14, 2017, 12:49:51 AM3/14/17
to Adam Blanchard, shortbred-users, Nsa
Hi Adam,

Thank you for your patience.

At any point in your run of ShortBRED-Identify, does the following line appear?

FASTA-Reader: Ignoring invalid residues at position(s): On line 44: 1, 13-16

(The line numbers can be something different. )

I'm curious to see if GeneMark is outputting any symbols that are invalid sequence characters (perhaps an accidental tab or carriage return). I experimented with the fasta input to ShortBRED, and I was able to generate the dictionary error by adding in some invalid characters and spaces to my input sequences. 

If it's possible, could you please send me your clust/clust.faa and clust/clust.faa.clstr files from the temporary folder created by ShortBRED? I'm curious to see how cd-hit is grouping this particular sequence (gene_10886|GeneMark_hmm|34_aa|+|22053|22154). Also, does it have its own fasta file in /clust/fams?

Thank you,

Jim

Adam Blanchard

unread,
Mar 15, 2017, 10:41:39 AM3/15/17
to shortbred-users, addyb...@gmail.com, nsa...@gmail.com
Hi Jim, 

No, it doesn't,  clust.faa and clust.faa.clus attached as requested. I've also attached the gene_10886 fasta from the clus/fams folder

Regards
Adam
clust.faa.clstr
clust.faa
gene_10886|GeneMark_hmm|34_aa|+|22053|22154.faa

Dave Yarmosh

unread,
Apr 9, 2018, 11:37:22 AM4/9/18
to shortbred-users
Hi Jim,

Any updates to this? I've just encountered the same issue with this --


Traceback (most recent call last):
  File "/home/src/biobakery/shortbred_identify.py", line 367, in <module>
    dictGOICounts = pb.MarkX(dictGOIGenes,dictGOICounts)
  File "/net/clusterhn.cluster.com/home/src/biobakery/src/process_blast.py", line 499, in MarkX

    dictOverlap[strName][i] = dictOverlap[strName][i] + 9999999
KeyError: 'PAS03207_1_DUF3265_domain-containing_protein__Vibrio_cholerae_'

Thanks,
Dave
Reply all
Reply to author
Forward
0 new messages