Application of ShortBRED to bacterial isolate genomes

165 views
Skip to first unread message

Gary Xie

unread,
Jan 12, 2016, 12:19:51 PM1/12/16
to shortbred-users
I am wondering anyone has some experience on applying shortBRED on bacterial isolate genomes:
Here is what has been described in ShortBRED manuscript, but it is not very clear to me what exact command line I should use if i am searching ABR families in E.coli K-12. Would someone kindly share your command line used?
===========================
ShortBRED can be applied to identify protein families in a bacterial isolate genome given a corresponding set of ShortBRED markers for those families. To do so, ShortBRED first creates a USEARCH database for the genome and then searches the markers against that database (allowing for multiple hits per marker query). For protein families characterized by more than one marker sequence, ShortBRED requires that a critical fraction of the markers map to a gene in the genome before assigning it to that protein family. The default value for this cutoff is 10% [i.e. 1 in 10 markers], but it can be tuned to be more conservative.
===========================
Based on the manuscript, it looks like a 2-step processes:
1)  create a USEARCH database for K-12.  I am assuming I should use the DNA sequences of E.coli K12, right?  If so, should I use the whole genome fna file, or predicted ORF ffn file.

usearch -makeudb_ublast k12.fna -output k12.udb

2) searches the markers against that database (allowing for multiple hits per marker query).  Here, I am assuming I should use the centroid_consensus_sequences.faa as my markers and K12.udb as —wgs file.   If so, how can I enable allowing multiple hits per marker query? Is following command line correct? 

shortbred_quantify.py --markers ShortBRED_ABR_centroid_consensus_sequences.faa --wgs K12.udb --results exampleresults.txt --tmp example_quantify

Thanks for your time.

Gary Xie

Jim Kaminski

unread,
Jan 12, 2016, 5:13:45 PM1/12/16
to Gary Xie, shortbred-users
Dear Gary,

Thank you for calling this to my attention. I have just added a section to the tutorial about searching bacterial genomes.

Regarding your questions:
1) We used AA ORFs in the paper, but ShortBRED can also handle nucleotide files for genomes. I'm inclined toward the ORFs as that uses usearch instead of tblastn.

2) You can still refer to the markers with "--markers" ShortBRED will do the rest. If you include the "--genome" argument as explained below, ShortBRED will handle it, and you don't need to work with usearch manually.

Using ShortBRED to Search ORFs in Isolate Genomes 

ShortBRED can search a file of amino acid ORF's for hits to amino acid markers. The command is similar to running ShortBRED on wgs files, but uses the --genome flag to specify a genome instead of a wgs file.

python shortbred_quantify.py --genome isolate_ORFs.faa --markers sample_markers/ShortBRED_ABR_101bp_markers.faa --tmp bacterialgenome --usearch /path/to/usearch --maxhits 0 --maxrejects 0

There are two additional flags, "--maxhits" and  "--maxrejects" which instruct usearch to allow multiple markers to map to each ORF, and try multiple markers. If you set these to 0, usearch will test the full database of ORFs against each of the markers.

ShortBRED can also search unannotated isolate genome files using tblastn. I will prepare that section as well.

Hope this is helpful,

Jim


--
You received this message because you are subscribed to the Google Groups "shortbred-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to shortbred-use...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Gary Xie

unread,
Jan 13, 2016, 1:18:38 PM1/13/16
to shortbred-users, jet...@gmail.com
Hi, Jim,
Thanks for info. I was able to run ShortBRED_ABR_101bp_markers.faa against Ecoli K-12 protein sequences. I have noticed there are two important result files (defined by —tmp): SBhits.txt and annotated_genomefull_results.tab. It looks like SBhits.txt is the final results after applying certain cut-off to annotated_genomefull_results.tab. Based on ShortBRED paper, By default, ShortBRED-Quantify will record a hit to a marker if the resulting alignment has at least 95% identity, and is at least as long as the minimum of (i) the marker length or (ii) 95% of the read length.
My question is:  In the case for isolated genome, has the same cutoff applied here for protein sequences from bacterial isolate genomes?

Gary

Jim Kaminski

unread,
Jan 19, 2016, 6:09:58 PM1/19/16
to shortbred-users, jet...@gmail.com
Hi Gary,

Yes, ShortBRED still applies these limits when searching isolate genomes. Here, though, the length would be 95% of the marker instead of the read.

Jim

Gary Xie

unread,
Jan 25, 2016, 4:14:27 PM1/25/16
to shortbred-users, jet...@gmail.com
Hi, Jim,
When i searched the AR markers against E.coli K12 protein sequences, i also got an txt file. Instead of having 3 columns of normalized count (C), Usearch Hits to a marker (H), TotMarkerLength, this txt file only have one column of result (see below). My question is how this normalized count is calculated when searching against Ecoli K12 ORF protein sequences. Is anything >0 considered as a significant hit???
thanks,
Gary

NP_390536    0
NP_390993    0
NP_391957    0
NP_414593    0.5
NP_414995    1.0
NP_416758    0
NP_417970    0
NP_418166    0
NP_418504    0.25
NP_418505    0.0666666666667
NP_460867    0
NP_465227    0
NP_478145    0
NP_511223    0
Message has been deleted

Gary Xie

unread,
Feb 24, 2016, 1:15:01 PM2/24/16
to shortbred-users
Just saw ShortBRED source files have been uploaded, now it includes ABR_150bp marker sequences. In the past, we have been recommended to use ABR_101bp marker set for isolated genomes. Right now should we use the 150bp version instead?
BTW., What is the difference between these two in term of applying to isolated genome (by blasting marker sequences against protein sequences of isolated genome)
Thanks,
Gary

Jim Kaminski

unread,
Feb 25, 2016, 5:29:02 PM2/25/16
to Gary Xie, shortbred-users
Hi Gary,

Thank you for your email. I uploaded the 150bp markers in response to a researcher who was working with metagenome reads with an average length of 150bp. The quick answer for their use on isolate genomes is that I expect results will be similar using either set of markers. I have provided a more detailed response below.

I do not anticipate a substantial difference in quantification results between using the 101bp markers or the 150bp markers. The main differences that can arise when changing the expected read length in ShortBRED-Identify are that 1) the Quasi Markers (QM's) and Junction Markers (JM's) will have different lengths, and 2) it can affect how many JM's and QM's are produced.

Regarding #1, I set the QM and JM length to be 1/3 of the expected read length. (The 1/3 comes from converting nucleotide length to expected amino acid marker length.) Thus, the JM's and QM's will be a bit longer (50 AA's vs 34 AA's).

Regarding #2, the change in expected read length can result in more or less JM's. JM's are produced when ShortBRED cannot find a unique region long enough to build a valid True Marker (TM). We call them Junction Markers because each part of a JM overlaps with some other protein family, but the junction of these regions is unique to the family it represents. Having a longer expected read length can result in more JM's, because you can imagine instances where ShortBRED could not find a JM that is 34 AA long, but can find one that is 50 AA long.

There is also an effect that can result in less JM's, though. Early in the ShortBRED-Identify process, the program clusters sequences with CD-Hit, and then constructs consensus sequences for each family. When the AA's in a given column of the aligned sequences of a particular family are not at least 95% identical, ShortBRED represents the AA with an "X". When forming the JM's, ShortBRED first tries to find a region 40% of the expected read length that does not contain any X's. Thus, increasing the expected read length can reduce the number of JM's.

Comparing the two sets of markers, I find that we have 48 JM's in the 101bp set, and 47 in the JM set. We have 6 QM's in the 101bp set, and 9 QM's in the 150bp set. (We build 3 QM's when we cannot find a JM, so this makes sense.) Both sets have 4,078 TM's, and I believe 4,072 of these are identical. (I am not sure why six are different - I may have used a different number for the maximum length of all markers in a family or changed some other parameter between versions of ShortBRED).

The length of these JM's and QM's could affect whether some isolate genome AA ORFs or nucleotide sequences are called as valid hits, which will then affect the final ShortBRED score for the associated protein families. When dealing with metagenomic data, we require an alignment of a read to a marker to be greater than or equal to 95% of the read, or contain the full marker (in addition to the identity requirements). When dealing with genomic data, we switch these, and require either 95% of the marker length, or the full ORF/nuc seq to align. Since markers will very likely be shorter than the sequences from the isolate genome, the limit in most cases will be the marker length. 

The upshot of all this is that I expect the results to be largely the same. The distribution of JM's and QM's has changed very slightly, and they are a bit longer, but 4,072 of the 4,132 original 101bp markers are the same.

Hope this is helpful,

Jim

--
Reply all
Reply to author
Forward
0 new messages