High duplication in my transcriptome in BUSCO using tr2aacds4.pl

Jeferson Alexis Durán Fuentes

Oct 13, 2025, 4:11:45 PM

to EvidentialGene

Dear Don and other readers,

I have a question and I’m not sure if I did something incorrectly. I hope you can help me.

I have combined assemblies from two different assemblers (SPAdes and Trinity) for 15 replicates of a sea anemone species. Each set of three replicates corresponds to a different thermal stress condition (26°C to 35°C) across five time points. I used tr2aacds.pl to combine all these assemblies (n=30) and generate a pan-transcriptome reference. Currently, I have a very large output with the following results:

# Class Table for Transcriptome_merged.trclass

class      %okay  %drop      okay       drop
althi        0.9   0.01    197808       2540
althi1       8.3   0.03   1778544       7892
althinc     0.26      0     55574          0
altmfrag    0.24      0     51545        118
altmid      0.12      0     26014        439
main         0.5      0    123447        100
mainnc      0.31      0     67522          0
noclass      1.8      0    394398         13
noclassnc   0.33      0     70594          0
parthi         0    1.1         0     244921
parthi1        0    1.3         0     294890
perfdupl       0   59.3         0   12602590
perffrag       0    8.8         0    1872393
smallorf       0   16.2         0    3444943
--------------------------------------------
total         13   86.9   2765446   18470839
============================================

Then, I used cd-hit-est at 0.99 to remove redundant sequences. After that, I ran BUSCO and obtained the following results: (C:97.9%[S:7.3%,D:90.6%],F:1.2%,M:0.9%,n:954).
I notice that there is 90.6% duplication.

My goal is to obtain a reference transcriptome that can later be used for mapping reads with Salmon and for differential expression analysis with DESeq2.

My questions are:

1. How can I reduce this high percentage of duplicates? Would it be a good idea to keep the 123,447 transcripts that are categorized as “main”?

2. Do you think the high level of duplication (90.6%) could affect downstream quantification and differential expression analysis? If so, what strategies would you recommend to reduce redundancy or manage isoforms in this context?

I would greatly appreciate any guidance.

Best regards,

Don Gilbert

Oct 13, 2025, 7:38:25 PM

to Jeferson Alexis Durán Fuentes, EvidentialGene

Jeferson,

The BUSCO software doesn't understand alternate transcripts. Your transcript set has many more alts. than main transcripts, which gives you that spurious D:90.6% value.
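The imbalance between alternates and main transcripts can be checked directly from the "okay" column of the class table posted above. The sketch below assumes that every class whose name starts with "alt" is an alternate and that the main* and noclass* classes are primary transcripts (one per locus); that grouping is an interpretation of the class names, not something stated in the table itself.

```python
# "okay" (retained) transcript counts per class, copied from the class table above.
okay = {
    "althi": 197808, "althi1": 1778544, "althinc": 55574,
    "altmfrag": 51545, "altmid": 26014,
    "main": 123447, "mainnc": 67522,
    "noclass": 394398, "noclassnc": 70594,
}

# Assumption: "alt*" classes are alternates; everything else is a primary transcript.
alts = sum(n for cls, n in okay.items() if cls.startswith("alt"))
primaries = sum(n for cls, n in okay.items() if not cls.startswith("alt"))
total = alts + primaries

print(f"alternates: {alts}")
print(f"primaries:  {primaries}")
print(f"alternates as a fraction of okay transcripts: {alts / total:.3f}")
```

Under that assumption, roughly three quarters of the retained transcripts are alternates, so most BUSCO genes will be hit by several transcripts of the same locus.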

Also, this is likely a mistake that reduces the value of your transcript set:

"Then, I used cd-hit-est at 0.99 to remove redundant sequences.."

since Evigene's tr2aacds already runs cd-hit-est as part of its reduction, but does it in a way that preserves valid alternates and valuable genes.

This script will properly classify BUSCO results, using the main/alt ID tag of evigene 't1, t2, ... tN':

evigene/scripts/omcl/evg_buscogenesum.pl

usage:

env dotab=1 summary=busco.sum.txt evg_buscogenesum.pl buscof/full_table*.tsv

where 'summary=busco.sum.txt' is the summary output file, and 'dotab=1' means rewrite the busco full_table.tsv, changing spurious 'Duplicate' to 'Complete' for cases of alternates of one gene locus.
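The idea behind that reclassification can be sketched in a few lines. This is only an illustration of the logic, not the code of evg_buscogenesum.pl. It assumes that the third column of full_table.tsv holds the transcript ID, that BUSCO writes the status as 'Duplicated', and that stripping the trailing 't<N>' from an evigene ID gives the gene locus ID. The function names and the example IDs are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: evigene transcript IDs end in 't<N>' (t1 = main, t2..tN = alternates).
ALT_SUFFIX = re.compile(r"t\d+$")

def locus_of(transcript_id):
    """Strip the trailing 't<N>' tag to get the gene locus ID."""
    return ALT_SUFFIX.sub("", transcript_id)

def reclassify(rows):
    """rows: (busco_id, status, sequence_id) tuples from full_table.tsv.
    Returns {busco_id: status}, with 'Duplicated' changed to 'Complete'
    when all hits for that BUSCO come from a single gene locus."""
    loci = defaultdict(set)
    status = {}
    for busco_id, stat, seq_id in rows:
        status[busco_id] = stat
        if seq_id:
            loci[busco_id].add(locus_of(seq_id))
    for busco_id, stat in list(status.items()):
        if stat == "Duplicated" and len(loci[busco_id]) == 1:
            status[busco_id] = "Complete"
    return status

def read_full_table(path):
    """Yield (busco_id, status, sequence_id) rows, skipping '#' comment lines."""
    with open(path) as fh:
        for line in fh:
            if line.startswith("#") or not line.strip():
                continue
            cols = line.rstrip("\n").split("\t")
            seq_id = cols[2] if len(cols) > 2 else ""  # Missing rows have no sequence
            yield cols[0], cols[1], seq_id
```

For example, `reclassify(read_full_table("full_table.tsv"))` would report a BUSCO hit by 'XEVm000012t1' and 'XEVm000012t3' as Complete, while one hit by two different loci stays Duplicated.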

One output of tr2aacds is a table of gene locus and alt. transcript IDs, which may help.

You can create a separate sequence set of only main transcripts, i.e. all with ID suffix 't1', but the reason for testing all transcripts for homology (busco, other) is that alternates 't2..tN' sometimes have much greater homology than the 't1' longest protein alternate.

See here

https://sourceforge.net/p/evidentialgene/blog/2018/03/gene-transcript-id-table-from-evgmrna2tsa/
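A main-only sequence set, as mentioned above, could be extracted with a small filter like the following. It is a minimal sketch that assumes the sequence ID is the first word of each FASTA header and that main transcripts are exactly those whose ID ends in 't1'; the function name and the file names in the usage note are hypothetical.

```python
import re

# Assumption: main transcript IDs end in 't1' (so 't11' or 't21' do not match).
MAIN_ID = re.compile(r"t1$")

def main_only(fasta_lines):
    """Yield only the FASTA records whose ID (first word of the header) ends in 't1'."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].split()
            keep = bool(header) and MAIN_ID.search(header[0]) is not None
        if keep:
            yield line
```

Usage would look like `out.writelines(main_only(inp))` with `inp` and `out` opened on, say, `okayset.mrna` and `okayset.main.mrna`. Keep in mind the caveat above: dropping 't2..tN' can discard the alternate with the best homology.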

--

don gilbert - www.bio.net - bioinformatics - indiana.u.

Jeferson Alexis Durán Fuentes

Oct 14, 2025, 11:24:05 AM

to EvidentialGene

Hi Don,

I understand; I omitted the cd-hit-est step.
As part of my methods, I am using Kraken2 and FCS-GX to remove exogenous sequences, since this sea anemone lives in symbiosis with zooxanthellae. Symbiodinium sequences make up 15-40% of the reads and around 5-70 Mb of the assemblies. Then, I run BUSCO.

It will take me a long time, since I am processing 30 transcriptomes.

I have been thinking about merging these 30 decontaminated and processed transcriptomes, then using tr2aacds4.pl, since I am going to annotate each of these transcriptomes and compare the presence of toxins.

If I have more questions, I will ask you.

Thanks.

Best regards,

Jef
