TransDecoder significantly reduces the number of complete BUSCOs

Emily Jennings

Apr 10, 2021, 6:21:37 PM

to trinityrnaseq-users

Hello,

I generated my assembly as follows:

1. Run Trinity

2. Filter for transcripts > 500nt

3. Select the longest isoform for each gene

4. Run TransDecoder.LongOrfs

5. Run TransDecoder.Predict

I ran BUSCO on the outputs from steps 3-5. For step 3 I have 71.8% complete BUSCOs. However, for steps 4 & 5 I have 49% complete BUSCOs. I am concerned that TransDecoder may be excluding potentially important transcripts. Is there an explanation for this? Is this common?

Thank you!
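For readers who want to reproduce steps 2 and 3, here is a minimal Python sketch. It is not the procedure used in this post. It assumes Trinity-style transcript IDs in which the gene ID is the transcript ID without its trailing `_i<N>` isoform suffix; the function names are hypothetical.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header without '>', sequence) pairs from FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gene_id(transcript_id: str) -> str:
    """Strip the trailing isoform suffix (_i<N>) from a Trinity transcript ID."""
    return re.sub(r"_i\d+$", "", transcript_id)


def longest_isoform_per_gene(
    records: Iterable[Tuple[str, str]], min_len: int = 500
) -> Dict[str, Tuple[str, str]]:
    """Keep transcripts longer than min_len, then the longest one per gene."""
    best: Dict[str, Tuple[str, str]] = {}
    for header, seq in records:
        tid = header.split()[0]
        if len(seq) <= min_len:  # step 2: keep only transcripts > min_len nt
            continue
        gid = gene_id(tid)
        # step 3: remember the longest isoform seen so far for this gene
        if gid not in best or len(seq) > len(best[gid][1]):
            best[gid] = (tid, seq)
    return best
```

Passing an open file handle to `read_fasta` and the result to `longest_isoform_per_gene` returns a dict keyed by gene ID, with the retained transcript ID and sequence as the value.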

Brian Haas

Apr 11, 2021, 7:06:25 AM

to Emily Jennings, trinityrnaseq-users

Hi Emily,

That does sound a bit troubling. I'd suggest taking a few examples of transcripts that have BUSCO hits in step 3 but not in step 4, and checking whether there are actual uninterrupted open reading frames where those BLAST matches occur on the transcript sequence. If you want to privately send me a few examples, I'd be happy to take a look too.
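One way to build that list is to compare the two BUSCO `full_table.tsv` files. The sketch below assumes the layout visible later in this thread: comment lines start with `#`, the first column is the BUSCO ID, and the second is the status. The function names are hypothetical.

```python
from typing import Dict, Iterable, List


def busco_status(lines: Iterable[str]) -> Dict[str, str]:
    """Map BUSCO ID -> status from full_table.tsv lines, skipping comments."""
    status: Dict[str, str] = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        # A duplicated BUSCO occupies several rows with the same status,
        # so overwriting the entry is harmless.
        status[fields[0]] = fields[1]
    return status


def lost_buscos(before: Dict[str, str], after: Dict[str, str]) -> List[str]:
    """BUSCO IDs that were found in `before` but are Missing in `after`."""
    found = {"Complete", "Duplicated", "Fragmented"}
    return sorted(
        busco
        for busco, state in before.items()
        if state in found and after.get(busco, "Missing") == "Missing"
    )
```

The third column of each "found" row in the first table gives the transcript ID to pull out for inspection.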

bh...@broadinstitute.org

best,

~brian

--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/trinityrnaseq-users/0f850460-3176-4613-ad96-7a4d0ce13e01n%40googlegroups.com.

--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

Emily Jennings

Apr 11, 2021, 12:55:50 PM

to trinityrnaseq-users

Hi Brian, thanks for the fast reply!

As you suggested, I extracted the transcript sequences for all BUSCOs that were present in step 3 but missing from step 4. I am sending you a copy of this file.

I randomly selected one of these sequences (TRINITY_DN34533_c0_g1_i1, which initially matched BUSCO 986) and searched it with NCBI ORFfinder, which identified multiple ORFs, one of which was quite long.

A BLAST search with ORF4 yielded very good hits, which match BUSCO 986.

The only thing that I can think of is that I ran TransDecoder using the -S flag, as my libraries are strand-specific. I will re-run without -S and see if I have the same problem.

Emily Jennings

Apr 11, 2021, 1:35:40 PM

to trinityrnaseq-users

I re-ran TransDecoder without the -S flag and the BUSCO results were unchanged (still ~49% complete after TransDecoder).

Emily Jennings

Apr 11, 2021, 2:05:05 PM

to trinityrnaseq-users

I think this may be an issue with BUSCO. I searched the actual .cds files output by TransDecoder for several of the transcripts that are 'missing' according to BUSCO and found 'complete' cds present:

$ head BUSCOs_missing_after_TransDecoder_fulllist.txt
4291 Complete TRINITY_DN33922_c0_g1_i9 945.2 795
4459 Fragmented TRINITY_DN22208_c0_g1_i5 825.9 424

# BUSCO reports them as missing
$ grep -w '4291' longest_orfscds_Unstranded_BUSCO/run_embryophyta_odb10/full_table.tsv
4291 Missing
$ grep -w '4459' longest_orfscds_Unstranded_BUSCO/run_embryophyta_odb10/full_table.tsv
4459 Missing

# But they are indeed present in the .cds output (whether I ran with -S or not)
$ grep -w TRINITY_DN22208_c0_g1_i5 Trinity_500_longestisoform.fasta.transdecoder.cds
>TRINITY_DN22208_c0_g1_i5|m.5407 TRINITY_DN22208_c0_g1_i5|g.5407 ORF TRINITY_DN22208_c0_g1_i5|g.5407 TRINITY_DN22208_c0_g1_i5|m.5407 type:complete len:169 (+) TRINITY_DN22208_c0_g1_i5:2005-2511(+)
>TRINITY_DN22208_c0_g1_i5|m.5406 TRINITY_DN22208_c0_g1_i5|g.5406 ORF TRINITY_DN22208_c0_g1_i5|g.5406 TRINITY_DN22208_c0_g1_i5|m.5406 type:5prime_partial len:572 (+) TRINITY_DN22208_c0_g1_i5:2-1717(+)
$ grep -w TRINITY_DN33922_c0_g1_i9 Trinity_500_longestisoform.fasta.transdecoder.cds
>TRINITY_DN33922_c0_g1_i9|m.1708 TRINITY_DN33922_c0_g1_i9|g.1708 ORF TRINITY_DN33922_c0_g1_i9|g.1708 TRINITY_DN33922_c0_g1_i9|m.1708 type:complete len:209 (+) TRINITY_DN33922_c0_g1_i9:3159-3785(+)
>TRINITY_DN33922_c0_g1_i9|m.1706 TRINITY_DN33922_c0_g1_i9|g.1706 ORF TRINITY_DN33922_c0_g1_i9|g.1706 TRINITY_DN33922_c0_g1_i9|m.1706 type:complete len:336 (+) TRINITY_DN33922_c0_g1_i9:99-1106(+)
>TRINITY_DN33922_c0_g1_i9|m.1707 TRINITY_DN33922_c0_g1_i9|g.1707 ORF TRINITY_DN33922_c0_g1_i9|g.1707 TRINITY_DN33922_c0_g1_i9|m.1707 type:complete len:286 (+) TRINITY_DN33922_c0_g1_i9:1239-2096(+)
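The same check can be scripted for a whole list of transcripts instead of one grep at a time. This sketch parses only the header layout shown above (`type:`, `len:`, and a trailing `transcript:start-end(strand)` token); other TransDecoder versions may format headers differently. The `Orf` tuple and function names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Optional


class Orf(NamedTuple):
    orf_id: str
    transcript: str
    orf_type: str   # e.g. complete, 5prime_partial
    length: int     # value of len: as reported in the header
    start: int
    end: int
    strand: str


def parse_header(line: str) -> Optional[Orf]:
    """Parse one '>' header line; return None if it does not fit the layout."""
    if not line.startswith(">"):
        return None
    orf_type = re.search(r"\btype:(\S+)", line)
    length = re.search(r"\blen:(\d+)", line)
    coords = re.search(r"(\S+):(\d+)-(\d+)\(([+-])\)\s*$", line)
    if not (orf_type and length and coords):
        return None
    return Orf(
        orf_id=line[1:].split()[0],
        transcript=coords.group(1),
        orf_type=orf_type.group(1),
        length=int(length.group(1)),
        start=int(coords.group(2)),
        end=int(coords.group(3)),
        strand=coords.group(4),
    )


def orfs_by_transcript(lines: Iterable[str]) -> Dict[str, List[Orf]]:
    """Group every parsable ORF header by its parent transcript ID."""
    grouped: Dict[str, List[Orf]] = defaultdict(list)
    for line in lines:
        orf = parse_header(line)
        if orf is not None:
            grouped[orf.transcript].append(orf)
    return dict(grouped)
```

In the headers above, `len:` equals the nucleotide span divided by three (for example, 2005-2511 is 507 nt and `len:169`), so comparing it against the length BUSCO reported for the same transcript shows whether the retained ORF is shorter than the original match.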

Brian Haas

Apr 12, 2021, 9:28:47 AM

to Emily Jennings, trinityrnaseq-users

That is pretty interesting.

Note that the file Trinity_500_longestisoform.fasta.transdecoder.cds contains just the top 500 longest coding regions used for training the Markov model.

There should be a 'longest_orfs.pep' file that contains all candidate orfs examined by TransDecoder.

Then, of course, there's the final predicted orfs from TransDecoder.

If your ORFs of interest are in 'longest_orfs.pep' but not in the final TransDecoder result file, then it's a true false negative: for some reason, TransDecoder deemed the ORF non-coding and discarded it. This would be very peculiar. When I run TransDecoder on just the small input file you provided, it finds that earlier missing entry just fine.
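That comparison can be sketched as a set lookup over the FASTA headers of the two files. This assumes both files use IDs of the form `TRANSCRIPT|m.N`, as in the headers quoted in this thread; the function names and category labels are hypothetical.

```python
from typing import Dict, Iterable, Set


def transcript_ids(fasta_lines: Iterable[str]) -> Set[str]:
    """Collect parent transcript IDs from headers like '>TRANSCRIPT|m.N ...'."""
    ids: Set[str] = set()
    for line in fasta_lines:
        if line.startswith(">"):
            orf_id = line[1:].split()[0]
            ids.add(orf_id.split("|")[0])
    return ids


def classify(
    transcripts_of_interest: Iterable[str],
    candidates: Set[str],
    final: Set[str],
) -> Dict[str, str]:
    """Label each transcript by how far it got through TransDecoder."""
    result: Dict[str, str] = {}
    for transcript in transcripts_of_interest:
        if transcript in final:
            result[transcript] = "predicted"
        elif transcript in candidates:
            # Had a candidate ORF but nothing in the final set:
            # the "true false negative" case described above.
            result[transcript] = "candidate_only"
        else:
            result[transcript] = "no_candidate_orf"
    return result
```

Note that this works at the transcript level: a transcript labelled "predicted" may still have lost the specific ORF of interest while keeping another one.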

Let me know if I can help troubleshoot this further with you. There should be a good explanation for any aberrant behavior.

best,

~b

Emily Jennings

Apr 12, 2021, 3:36:50 PM

to trinityrnaseq-users

The file Trinity_500_longestisoform.fasta.transdecoder.cds is actually the full output from TransDecoder.Predict. I included the "500_" as a note to myself that the original .fasta was filtered for transcripts > 500 nt.

My orfs of interest are present in both the TransDecoder.LongOrfs and TransDecoder.Predict outputs, so I believe this is an issue solely with BUSCO. I have submitted this issue to BUSCO (which you can follow here).

Brian Haas

Apr 12, 2021, 3:43:52 PM

to Emily Jennings, trinityrnaseq-users

Gotcha. OK, I'll add this as a GitHub issue for tracking purposes. If you get some resolution for this and could update this post, that'd be terrific.

many thx

Emily Tallerday

Apr 16, 2021, 1:10:27 PM

to Brian Haas, trinityrnaseq-users

Unfortunately, BUSCO’s issues board is not as active as this one. I mean that as a compliment to you, Brian! If/when I hear back from them I will update here. In the meantime, for anyone interested, I plan to use the stats that BUSCO reports for my (pre-CDS-prediction) .fasta file, in addition to a BLAST-based assessment of the .cds/.pep file as described at: https://github.com/trinityrnaseq/trinityrnaseq/wiki/Counting-Full-Length-Trinity-Transcripts
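As a rough illustration of what a BLAST-based full-length assessment measures, the sketch below bins reference proteins by how much of their length a single hit covers. It is not the script from the wiki page. It assumes the hits have already been parsed into tuples of (query, target, sstart, send, slen), for example from a custom tabular BLAST output; that layout and the function names are assumptions.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Tuple

Hit = Tuple[str, str, int, int, int]  # query, target, sstart, send, slen


def target_coverage(rows: Iterable[Hit]) -> Dict[str, float]:
    """Best percent of each target's length covered by any single hit."""
    best: Dict[str, float] = {}
    for _query, target, sstart, send, slen in rows:
        covered = abs(send - sstart) + 1
        pct = min(100.0, 100.0 * covered / slen)
        if pct > best.get(target, 0.0):
            best[target] = pct
    return best


def bin_counts(coverage: Dict[str, float], width: int = 10) -> Dict[int, int]:
    """Count targets per coverage bin, keyed by the bin's upper bound."""
    counts: Counter = Counter()
    for pct in coverage.values():
        upper = max(width, math.ceil(pct / width) * width)
        counts[upper] += 1
    return dict(counts)
```

Targets that land in the top bins are the ones represented by near full-length ORFs, which gives a completeness estimate that does not depend on BUSCO.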