TransDecoder significantly reduces the number of complete BUSCOs

Emily Jennings

Apr 10, 2021, 6:21:37 PM
to trinityrnaseq-users
Hello,

I generated my assembly as follows:
1. Run Trinity
2. Filter for transcripts > 500nt
3. Select the longest isoform for each gene
4. Run TransDecoder.LongOrfs
5. Run TransDecoder.Predict

I ran BUSCO on the outputs from steps 3-5. For step 3 I have 71.8% complete BUSCOs. However, for steps 4 & 5 I have 49% complete BUSCOs. I am concerned that TransDecoder may be excluding potentially important transcripts. Is there an explanation for this? Is this common?
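
For readers who want to reproduce steps 2 and 3, here is a minimal Python sketch of the length filter and the longest-isoform selection. It is not necessarily how the assembly in this thread was produced. It assumes Trinity-style transcript IDs in which the gene ID is the transcript ID without its trailing `_i<N>`; the function names and the `longer_than` parameter are made up for illustration.

```python
import re

def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def longest_isoform_per_gene(records, longer_than=500):
    """Keep transcripts strictly longer than `longer_than` nt,
    then keep only the longest transcript for each gene.

    Assumption: the gene ID is the transcript ID minus a trailing _i<N>.
    """
    best = {}
    for header, seq in records:
        if len(seq) <= longer_than:
            continue
        transcript_id = header.split()[0]
        gene_id = re.sub(r"_i\d+$", "", transcript_id)
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (transcript_id, seq)
    return best
```

With a real assembly one would pass an open file handle, e.g. `longest_isoform_per_gene(read_fasta(open("Trinity.fasta")))`, where the file name is a placeholder.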

Thank you!

Brian Haas

Apr 11, 2021, 7:06:25 AM
to Emily Jennings, trinityrnaseq-users
Hi Emily,

That does sound a bit troubling. I'd suggest taking a few examples of transcripts where you have BUSCO hits in step 3 but not in step 4 and checking whether there are actual uninterrupted open reading frames where those BLAST matches occur on the transcript sequence. If you want to privately send me a few examples, I'd be happy to take a look too.


best,

~brian

--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

 

Emily Jennings

Apr 11, 2021, 12:55:50 PM
to trinityrnaseq-users
Hi Brian, thanks for the fast reply! 

As you suggested, I extracted the transcript sequences for all BUSCOs that were present in step 3 but missing from step 4. I am sending you a copy of this file. 
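
For anyone repeating this comparison, the list of BUSCOs that were found before TransDecoder but are missing afterwards can be derived from the two `full_table.tsv` files. The sketch below is one possible way to do it, assuming tab-separated rows of the form `busco_id, status, sequence, score, length` with `#` comment lines, as in the BUSCO output quoted in this thread. The function names are hypothetical.

```python
def busco_status(lines):
    """Map BUSCO id -> (status, set of matched sequence ids).

    Assumes tab-separated full_table.tsv rows; lines starting with '#'
    are treated as comments and skipped.
    """
    table = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        busco_id, status = fields[0], fields[1]
        # Duplicated BUSCOs occupy several rows; collect all their sequences.
        seqs = table.setdefault(busco_id, (status, set()))[1]
        if len(fields) > 2:
            seqs.add(fields[2])
    return table

def lost_buscos(before, after):
    """Return BUSCO ids that are not Missing in `before` but Missing in `after`."""
    return sorted(
        busco_id
        for busco_id, (status, _) in before.items()
        if status != "Missing"
        and after.get(busco_id, ("Missing", set()))[0] == "Missing"
    )
```

The sequence IDs stored for each lost BUSCO in `before` are the transcripts worth inspecting by hand.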

I randomly selected one of these seqs (TRINITY_DN34533_c0_g1_i1, which matched initially with BUSCO 986) and searched it using ORFfinder (NCBI), which was able to identify multiple ORFs, one of which was quite long:

[Image: Screenshot 2021-04-11 123732.jpg]

A BLAST search with ORF4 yielded very good hits, which match BUSCO 986:

[Image: Screenshot 2021-04-11 124018.jpg]

The only thing that I can think of is that I ran TransDecoder using the -S flag, as my libraries are strand-specific. I will re-run without -S and see if I have the same problem. 

Emily Jennings

Apr 11, 2021, 1:35:40 PM
to trinityrnaseq-users
I re-ran TransDecoder without the -S flag and the BUSCO results were unchanged (still ~49% complete after TransDecoder). 

Emily Jennings

Apr 11, 2021, 2:05:05 PM
to trinityrnaseq-users
I think this may be an issue with BUSCO. I searched the actual .cds files output by TransDecoder for several of the transcripts that are 'missing' according to BUSCO and found 'complete' cds present:

$ head BUSCOs_missing_after_TransDecoder_fulllist.txt
4291    Complete        TRINITY_DN33922_c0_g1_i9        945.2   795
4459    Fragmented      TRINITY_DN22208_c0_g1_i5        825.9   424

# BUSCO reports them as missing
$ grep -w '4291' longest_orfscds_Unstranded_BUSCO/run_embryophyta_odb10/full_table.tsv
4291    Missing
$ grep -w '4459' longest_orfscds_Unstranded_BUSCO/run_embryophyta_odb10/full_table.tsv
4459    Missing

# But they are indeed present in the .cds output (whether I ran with -S or not)
$ grep -w TRINITY_DN22208_c0_g1_i5 Trinity_500_longestisoform.fasta.transdecoder.cds
>TRINITY_DN22208_c0_g1_i5|m.5407 TRINITY_DN22208_c0_g1_i5|g.5407  ORF TRINITY_DN22208_c0_g1_i5|g.5407 TRINITY_DN22208_c0_g1_i5|m.5407 type:complete len:169 (+) TRINITY_DN22208_c0_g1_i5:2005-2511(+)
>TRINITY_DN22208_c0_g1_i5|m.5406 TRINITY_DN22208_c0_g1_i5|g.5406  ORF TRINITY_DN22208_c0_g1_i5|g.5406 TRINITY_DN22208_c0_g1_i5|m.5406 type:5prime_partial len:572 (+) TRINITY_DN22208_c0_g1_i5:2-1717(+)
$ grep -w TRINITY_DN33922_c0_g1_i9 Trinity_500_longestisoform.fasta.transdecoder.cds
>TRINITY_DN33922_c0_g1_i9|m.1708 TRINITY_DN33922_c0_g1_i9|g.1708  ORF TRINITY_DN33922_c0_g1_i9|g.1708 TRINITY_DN33922_c0_g1_i9|m.1708 type:complete len:209 (+) TRINITY_DN33922_c0_g1_i9:3159-3785(+)
>TRINITY_DN33922_c0_g1_i9|m.1706 TRINITY_DN33922_c0_g1_i9|g.1706  ORF TRINITY_DN33922_c0_g1_i9|g.1706 TRINITY_DN33922_c0_g1_i9|m.1706 type:complete len:336 (+) TRINITY_DN33922_c0_g1_i9:99-1106(+)
>TRINITY_DN33922_c0_g1_i9|m.1707 TRINITY_DN33922_c0_g1_i9|g.1707  ORF TRINITY_DN33922_c0_g1_i9|g.1707 TRINITY_DN33922_c0_g1_i9|m.1707 type:complete len:286 (+) TRINITY_DN33922_c0_g1_i9:1239-2096(+)
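
The header lines above carry the ORF type, length, strand, and coordinates, so they can be tabulated rather than read by eye. The following is a rough Python sketch that parses headers in exactly the layout shown here; other TransDecoder versions may format headers differently, and the regular expression and field names are my own choices, not part of TransDecoder.

```python
import re

# Pattern written against the header layout quoted in this thread only.
HEADER_RE = re.compile(
    r"^>(?P<orf_id>\S+)\s.*?type:(?P<type>\S+)\s+len:(?P<length>\d+)\s+"
    r"\((?P<strand>[+-])\)\s+(?P<transcript>\S+):(?P<start>\d+)-(?P<end>\d+)\([+-]\)"
)

def parse_header(line):
    """Return a dict of ORF attributes, or None if the line does not match."""
    match = HEADER_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("length", "start", "end"):
        fields[key] = int(fields[key])
    return fields

example = (
    ">TRINITY_DN22208_c0_g1_i5|m.5406 TRINITY_DN22208_c0_g1_i5|g.5406  "
    "ORF TRINITY_DN22208_c0_g1_i5|g.5406 TRINITY_DN22208_c0_g1_i5|m.5406 "
    "type:5prime_partial len:572 (+) TRINITY_DN22208_c0_g1_i5:2-1717(+)"
)
print(parse_header(example))
```

In the headers quoted here, the nucleotide span (`end - start + 1`) is three times the `len` value, which gives a quick sanity check on the parse.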


Brian Haas

Apr 12, 2021, 9:28:47 AM
to Emily Jennings, trinityrnaseq-users
That is pretty interesting.

Note: the file Trinity_500_longestisoform.fasta.transdecoder.cds
contains just the top 500 longest coding regions used for training the Markov model.

There should be a 'longest_orfs.pep' file that contains all candidate ORFs examined by TransDecoder.

Then, of course, there are the final predicted ORFs from TransDecoder.

If your ORFs of interest are in 'longest_orfs.pep' but not in the final TransDecoder result file, then it's a genuine false negative: for some reason, TransDecoder deemed the ORF non-coding and discarded it. This would be very peculiar. When I run TransDecoder on the small input file you provided, it finds that earlier missing entry just fine.
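
A quick way to run this check is to compare FASTA headers between the candidate file and the final prediction file. The sketch below works at the transcript level and is only illustrative: it assumes ORF IDs end in a suffix such as `|m.5407` (as in this thread) or `.p1`, and the helper names are hypothetical.

```python
import re

def orf_ids(fasta_lines):
    """Collect the first whitespace-delimited token of every FASTA header."""
    return {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].split()
    }

def transcript_of(orf_id):
    """Strip an ORF suffix like '|m.5407' or '.p1' (assumed naming schemes)."""
    return re.sub(r"(\|m\.\d+|\.p\d+)$", "", orf_id)

def transcripts_losing_all_orfs(candidate_lines, final_lines):
    """Transcripts with at least one candidate ORF but no final prediction."""
    candidates = {transcript_of(i) for i in orf_ids(candidate_lines)}
    kept = {transcript_of(i) for i in orf_ids(final_lines)}
    return sorted(candidates - kept)
```

Here `candidate_lines` would come from `longest_orfs.pep` and `final_lines` from the final TransDecoder `.pep` or `.cds` file. A transcript of interest that shows up in the returned list would be the false-negative case described above.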

Let me know if I can help troubleshoot this further with you.  There should be a good explanation for any aberrant behavior.

best,

~b





Emily Jennings

Apr 12, 2021, 3:36:50 PM
to trinityrnaseq-users
The file Trinity_500_longestisoform.fasta.transdecoder.cds is actually the full output from TransDecoder.Predict. I included the "500_" as a note to myself that the original .fasta was filtered for transcripts > 500 nt.
My ORFs of interest are present in both the TransDecoder.LongOrfs and TransDecoder.Predict outputs, so I believe this is an issue solely with BUSCO. I have submitted this issue to BUSCO (which you can follow here).

Brian Haas

Apr 12, 2021, 3:43:52 PM
to Emily Jennings, trinityrnaseq-users
Gotcha. OK, I'll add this as a GitHub issue for tracking purposes. If you get some resolution for this and could update this post, that'd be terrific.

many thx

Emily Tallerday

Apr 16, 2021, 1:10:27 PM
to Brian Haas, trinityrnaseq-users

Unfortunately, BUSCO’s issues board is not as active as this one. I mean that as a compliment to you, Brian! If/when I hear back from them I will update here. In the meantime, for anyone interested, I plan to use the stats provided by BUSCO from my (pre-CDS prediction) .fasta file in addition to a BLAST-based assessment of the .cds/.pep file as described at: https://github.com/trinityrnaseq/trinityrnaseq/wiki/Counting-Full-Length-Trinity-Transcripts
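
As a rough illustration of what a BLAST-based full-length assessment measures, the Python sketch below computes, for each database protein, the percentage of its length covered by its best-scoring hit. It is not the script from the wiki page. It assumes tabular BLAST output with the twelve standard columns plus a thirteenth `slen` (subject length) column; the function names and the 80% threshold are arbitrary choices for the example.

```python
def subject_coverage(blast_lines):
    """Return {subject_id: percent of subject covered by its best-scoring hit}.

    Assumed columns: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore slen
    """
    best = {}
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 13:
            continue  # skip rows that lack the assumed slen column
        subject = fields[1]
        sstart, send = int(fields[8]), int(fields[9])
        bitscore, slen = float(fields[11]), int(fields[12])
        coverage = 100.0 * (abs(send - sstart) + 1) / slen
        if subject not in best or bitscore > best[subject][0]:
            best[subject] = (bitscore, coverage)
    return {subject: cov for subject, (_, cov) in best.items()}

def count_full_length(coverage, min_pct=80.0):
    """Count subjects whose best hit covers at least `min_pct` of their length."""
    return sum(1 for cov in coverage.values() if cov >= min_pct)
```

Running this on the transcripts and again on the predicted .cds/.pep sequences, against the same protein database, gives two counts that can be compared directly.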

 

Thanks!

EJT
