Retrieving species name from the output of blastp against uniprot_uniref90.trinotate_v2.0.pep_2


setar...@gmail.com

Mar 1, 2016, 8:56:56 AM
to trinityrnaseq-users
Hi Brian and all,

I ran blastp with my Trinity assembly (after applying TransDecoder) against the uniprot_uniref90.trinotate_v2.0.pep_2 database. I would like to check the species distribution in the blastp output. I usually get the species name by adding stitle to the blast command (-outfmt '6 std stitle'), but that did not work for this database. Could you please let me know how I can obtain the species name from the blast output?



Thank you in advance

Brian Haas

Mar 1, 2016, 9:14:43 PM
to maryam moazam, trinityrnaseq-users
Hi Maryam,

If you run Trinotate, the report file should include the full taxonomy string for the best blast match.

Otherwise, I'm pretty sure there's a way to run blast and have it include taxonomy information in the output, but there is some additional setup for it.  Either others can help here or it'll involve some Googling.  

best,

~b

--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.
To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at https://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.



--
--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

 

Ken Field

Mar 1, 2016, 11:03:07 PM
to Brian Haas, maryam moazam, trinityrnaseq-users
I think you will want to do this:
Run blastx -help then look for the field called outfmt
When you run blastx, you will include an option like

-outfmt "6 qacc sacc qseq sseq sscinames"

*** Formatting options
 -outfmt <String>
   alignment view options:
     0 = pairwise,
     1 = query-anchored showing identities,
     2 = query-anchored no identities,
     3 = flat query-anchored, show identities,
     4 = flat query-anchored, no identities,
     5 = XML Blast output,
     6 = tabular,
     7 = tabular with comment lines,
     8 = Text ASN.1,
     9 = Binary ASN.1,
    10 = Comma-separated values,
    11 = BLAST archive format (ASN.1) 

   Options 6, 7, and 10 can be additionally configured to produce
   a custom format specified by space delimited format specifiers.
   The supported format specifiers are:
           qseqid means Query Seq-id
              qgi means Query GI
             qacc means Query accesion
          qaccver means Query accesion.version
             qlen means Query sequence length
           sseqid means Subject Seq-id
        sallseqid means All subject Seq-id(s), separated by a ';'
              sgi means Subject GI
           sallgi means All subject GIs
             sacc means Subject accession
          saccver means Subject accession.version
          sallacc means All subject accessions
             slen means Subject sequence length
           qstart means Start of alignment in query
             qend means End of alignment in query
           sstart means Start of alignment in subject
             send means End of alignment in subject
             qseq means Aligned part of query sequence
             sseq means Aligned part of subject sequence
           evalue means Expect value
         bitscore means Bit score
            score means Raw score
           length means Alignment length
           pident means Percentage of identical matches
           nident means Number of identical matches
         mismatch means Number of mismatches
         positive means Number of positive-scoring matches
          gapopen means Number of gap openings
             gaps means Total number of gaps
             ppos means Percentage of positive-scoring matches
           frames means Query and subject frames separated by a '/'
           qframe means Query frame
           sframe means Subject frame
             btop means Blast traceback operations (BTOP)
          staxids means Subject Taxonomy ID(s), separated by a ';'
        sscinames means Subject Scientific Name(s), separated by a ';'
        scomnames means Subject Common Name(s), separated by a ';'
       sblastnames means Subject Blast Name(s), separated by a ';'
                (in alphabetical order)
       sskingdoms means Subject Super Kingdom(s), separated by a ';'
                (in alphabetical order) 
           stitle means Subject Title
       salltitles means All Subject Title(s), separated by a '<>'
          sstrand means Subject Strand
            qcovs means Query Coverage Per Subject
          qcovhsp means Query Coverage Per HSP
Ken Field, Ph.D.
Associate Professor of Biology
Program in Cell Biology/Biochemistry
Bucknell University
Room 203A Biology Building
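
A custom -outfmt string changes the column layout, so anything that reads the table needs the same field list. Below is a minimal Python sketch (not part of BLAST) that turns an -outfmt string into column names and reads tabular output into dictionaries. It assumes that "std" expands to the twelve default columns, in the order Brian spells them out in his blastn command further down the thread; the function names are made up for illustration.

```python
import io

# Assumption: "std" stands for the 12 default tabular columns, in this order.
STD_FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
              "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def expand_fields(outfmt):
    """Turn an -outfmt string such as '6 std stitle' into a list of column names."""
    tokens = outfmt.split()[1:]  # drop the leading format number (6, 7 or 10)
    fields = []
    for token in tokens:
        fields.extend(STD_FIELDS if token == "std" else [token])
    return fields


def parse_blast_tab(handle, outfmt):
    """Yield one dict per hit line of tab-separated BLAST output."""
    fields = expand_fields(outfmt)
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):  # skip blanks and format 7 comments
            continue
        yield dict(zip(fields, line.split("\t")))


# One hit line quoted later in this thread, written with explicit tabs.
SAMPLE = ("c10002_g1_i1|m.3787\tQ94239_CAEEL\t52.00\t50\t22\t2\t18\t67"
          "\t448\t495\t3e-04\t47.4\n")

for hit in parse_blast_tab(io.StringIO(SAMPLE), "6 std"):
    print(hit["qseqid"], hit["sseqid"], hit["evalue"])
```

Passing the same string to blast and to the parser keeps the two in step when fields such as sscinames or stitle are appended.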

Brian Haas

Mar 2, 2016, 7:34:44 AM
to Ken Field, maryam moazam, trinityrnaseq-users
If I remember correctly you also need to have the ncbi taxonomy database installed and some configuration file.  I did this a while ago but don't remember the specifics.  There was good documentation for it though.

-Brian
(by iPhone)

setar...@gmail.com

Mar 2, 2016, 4:52:53 PM
to trinityrnaseq-users
Hi Brian, Ken and all,

Thank you for your responses. At the moment, I have only run the blastp step of Trinotate, and its output looks like this:

c10002_g1_i1|m.3787     Q94239_CAEEL  52.00 50  22 2 18 67      448     495     3e-04   47.4
c10003_g1_i1|m.3788     K8E9J5_9CHLO  27.05 122 76 5 101        217     261     374     1e-04 51.2

However, after adding sscinames and stitle to the output format (-outfmt '6 sscinames stitle'), it changed to:

c10002_g1_i1|m.3787     Q94239_CAEEL  52.00 50  22 2 18 67      448     495     3e-04   47.4  109       705     Q94239_CAEEL SubName: Full=Protein STAU-1 {ECO:0000313|EMBL:CCD62871.1}
c10003_g1_i1|m.3788     K8E9J5_9CHLO  27.05 122 76 5 101        217     261     374     1e-04 51.2      256     471     K8E9J5_9CHLO SubName: Full=Uncharacterized protein {ECO:0000313|EMBL:CCO14304.1}

However, there is no species name in the output. In contrast, when I run blastp against UniProt directly (outside Trinotate), I can get the species name (OS) by using -outfmt '6 std stitle':

tr|B9GG50|B9GG50_POPTR 3-isopropylmalate dehydrogenase OS=Populus trichocarpa tr|B9GG50|B9GG50_POPTR 3 isopropylmalate dehydrogenase OS=Populus trichocarpa GN=POPTR_0001s18630g PE=3 SV=2
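
Even without an OS= field, the subject IDs shown above still say something about the species: UniProt entry names have the form NAME_CODE, where the part after the underscore is a species mnemonic (CAEEL for Caenorhabditis elegans; codes starting with 9, such as 9CHLO, stand for a broader taxonomic group rather than one species). The sketch below is one possible way to tally a species distribution from either kind of title. It assumes the titles follow the UniProt header style, and all function names are made up for illustration.

```python
import re
from collections import Counter

# Assumption: the organism is given as "OS=<name>" and ends at the next
# two-letter tag (GN=, PE=, SV=, ...) or at the end of the title.
OS_PATTERN = re.compile(r"\bOS=(.+?)(?=\s+[A-Z]{2}=|$)")


def species_from_title(stitle):
    """Return the organism name from an 'OS=' tag, or None if there is none."""
    match = OS_PATTERN.search(stitle)
    return match.group(1).strip() if match else None


def mnemonic_from_id(sseqid):
    """Return the species mnemonic of a UniProt entry name (text after the last '_')."""
    entry_name = sseqid.split("|")[-1]
    _, sep, code = entry_name.rpartition("_")
    return code if sep and code else None


def species_label(sseqid, stitle=""):
    """Prefer the full organism name; fall back to the mnemonic code."""
    return species_from_title(stitle) or mnemonic_from_id(sseqid) or "unknown"


def species_distribution(hits):
    """Count hits per species label; 'hits' is an iterable of (sseqid, stitle)."""
    return Counter(species_label(sseqid, stitle) for sseqid, stitle in hits)


# Subject IDs and titles taken from the outputs quoted in this thread.
HITS = [
    ("Q94239_CAEEL",
     "Q94239_CAEEL SubName: Full=Protein STAU-1 {ECO:0000313|EMBL:CCD62871.1}"),
    ("K8E9J5_9CHLO",
     "K8E9J5_9CHLO SubName: Full=Uncharacterized protein {ECO:0000313|EMBL:CCO14304.1}"),
    ("tr|B9GG50|B9GG50_POPTR",
     "3-isopropylmalate dehydrogenase OS=Populus trichocarpa GN=POPTR_0001s18630g PE=3 SV=2"),
]

for label, count in species_distribution(HITS).most_common():
    print(label, count, sep="\t")
```

The mnemonic codes can be translated to full organism names with the species list that UniProt publishes, which is a separate lookup not shown here.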


Another thing, which may not be closely related to this subject, but I would highly appreciate your help with it. As I am working on a non-model plant, I ran blastx against two databases (UniProt and the proteome of the closest species, obtained from Phytozome). Now I have two blastx outputs in tabular format (-outfmt 6). In the UniProt output, many best hits are "uncharacterized protein", and many of these turn out to be known proteins in the Phytozome output.
Could you please advise me how I can integrate the two outputs into a single informative output, so that all uncharacterized proteins are replaced with known proteins?
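
One possible approach to this merge, sketched in Python under several assumptions: both searches are rerun with -outfmt '6 std stitle' so that the 13th column holds the description (plain -outfmt 6 has no description column), the first line listed for each query is its best hit, and a hit counts as uninformative when its title contains "uncharacterized". The function names and the example rows are made up for illustration and are not real results.

```python
def best_hits(lines):
    """Map each query ID to (subject ID, title) of its first-listed hit."""
    best = {}
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 13:  # 12 std columns plus stitle
            best.setdefault(cols[0], (cols[1], cols[12]))
    return best


def is_uninformative(title):
    """Assumption: 'uncharacterized' in the title marks an uninformative hit."""
    return "uncharacterized" in title.lower()


def merge_best_hits(primary, secondary):
    """Keep the primary hit unless it is uninformative and the secondary is not."""
    merged = {}
    for query in sorted(set(primary) | set(secondary)):
        candidates = []
        if query in primary:
            candidates.append(("primary", primary[query]))
        if query in secondary:
            candidates.append(("secondary", secondary[query]))
        informative = [c for c in candidates if not is_uninformative(c[1][1])]
        source, (sseqid, title) = (informative or candidates)[0]
        merged[query] = (source, sseqid, title)
    return merged


def make_row(query, subject, title):
    """Build a placeholder 13-column row; the ten numeric columns are dummies."""
    return "\t".join([query, subject] + ["0"] * 10 + [title])


# Hypothetical rows for illustration only.
UNIPROT_ROWS = [
    make_row("query1", "HYPO1_EXMPL", "Uncharacterized protein"),
    make_row("query2", "HYPO2_EXMPL", "3-isopropylmalate dehydrogenase"),
]
PHYTOZOME_ROWS = [
    make_row("query1", "gene0001.1", "3-isopropylmalate dehydrogenase"),
    make_row("query2", "gene0002.1", "Uncharacterized protein"),
]

merged = merge_best_hits(best_hits(UNIPROT_ROWS), best_hits(PHYTOZOME_ROWS))
for query, (source, sseqid, title) in merged.items():
    print(query, source, sseqid, title, sep="\t")
```

Keeping the source label in the merged table makes it possible to trace each annotation back to the database it came from.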


All the best,
Mary

Brian Haas

Mar 3, 2016, 6:50:16 PM
to maryam moazam, trinityrnaseq-users
Hi Mary,

I had done this before like so:

export BLASTDB=/broad/data/blastdb/taxdb

(the above directory contains files: taxdb.bti and taxdb.btd, and setting up the taxonomy database is supposedly explained in the blast manual - though I just used what was on-site)

blastn -db /broad/data/blastdb/nt/nt -query  trinity_out_dir.Trinity.fasta  -max_target_seqs 1 -num_threads 4 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore staxids sskingdoms scomnames" > blastnt.outfmt6

setar...@gmail.com

Mar 6, 2016, 4:23:38 AM
to trinityrnaseq-users
Hi Brian and all,

Regarding getting the species name from the output of blastp against the uniprot_uniref90.trinotate_v2.0.pep_2 database: I put the taxdb.tar.gz file in the same directory where the other blast databases are installed. Uncompressing taxdb.tar.gz produced two files, taxdb.btd and taxdb.bti. I also added export BLASTDB=$BLASTDB:/home/Mary//ncbi-blast-2.2.30+/bin to my bash_profile. However, I still cannot obtain the species name in the output. Maybe taxdb only works with a preformatted blast database? Could you please help me again to solve this issue?
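
One thing worth checking in a setup like this is whether any directory listed in BLASTDB actually holds both taxdb files. The Python sketch below does only that check. It assumes BLAST+ looks in the current working directory and in each directory of a path-separated BLASTDB value; the function name is made up for illustration. Note also that, as far as I understand, taxdb only translates taxonomy IDs that are already stored in the database, so a database built locally without taxid information (for example without makeblastdb's -taxid_map option) may still report no names even when taxdb is found.

```python
import os

# The two files produced by unpacking taxdb.tar.gz, as named in this thread.
TAXDB_FILES = ("taxdb.bti", "taxdb.btd")


def find_taxdb(blastdb_value, cwd="."):
    """Return the first directory that holds both taxdb files, or None.

    Assumption: the current directory is searched first, followed by each
    directory in the path-separated BLASTDB value.
    """
    directories = [cwd] + [d for d in blastdb_value.split(os.pathsep) if d]
    for directory in directories:
        if all(os.path.isfile(os.path.join(directory, name)) for name in TAXDB_FILES):
            return directory
    return None


if __name__ == "__main__":
    found = find_taxdb(os.environ.get("BLASTDB", ""))
    print("taxdb found in:", found if found else "no directory on the search path")
```

If this reports no directory, pointing BLASTDB at the directory that contains taxdb.bti and taxdb.btd (rather than at the bin directory of the BLAST installation) would be the first thing to try.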


Thank you very much,
Mary






setar...@gmail.com

Mar 7, 2016, 8:42:37 AM
to trinityrnaseq-users
Dear community,

Nothing yet. Could anyone please suggest anything to solve this issue?




Brian Haas

Mar 7, 2016, 9:05:26 AM
to maryam moazam, trinityrnaseq-users

I don't have a good answer for this.  Given that it's not entirely Trinity-specific, you might try SEQanswers and see if the larger community can help here.


best of luck!

~b

setar...@gmail.com

Mar 8, 2016, 6:11:14 AM
to trinityrnaseq-users
Hi Brian and all,

Regarding the current problem, it seems to be specific to the Trinotate project, as I did not have this problem when I ran a local blast against the UniProt database (outside Trinotate). I also contacted the UniProt helpdesk; unfortunately, they could not help me and told me the problem is project-specific. I have tried to solve the problem in other forums; however, this is the best place to resolve such an issue.

 Best regards,
Maryam




Brian Haas

Mar 8, 2016, 6:30:16 AM
to maryam moazam, trinityrnaseq-users
Hi Maryam,

If you run blast according to the Trinotate documentation, and have Trinotate generate its final annotation report, the taxonomy string will be included in that report.  It's not part of the blast output, but rather formulated by Trinotate based on taxonomic information embedded in the original SwissProt records.

best,

~brian

setar...@gmail.com

Mar 8, 2016, 6:43:28 AM
to trinityrnaseq-users
Hi Brian,

Thank you very much for letting me know. Actually, I have just started to use Trinotate and stopped at the blast step because of this issue, so I will go ahead. Thanks again.


All the best,
Maryam







Brian Haas

Mar 8, 2016, 6:44:45 AM
to maryam moazam, trinityrnaseq-users
If all you really want is the taxonomic info, you can just load in the blast results and generate your report. You don't have to go through the full Trinotate protocol (pfam, sigP, tmhmm, etc.)

best,

~b
