Issue with transcriptome completeness after applying EvidentialGene

berger juliette

Apr 7, 2025, 4:13:47 AM

to EvidentialGene

Hello,

I am currently working on comparing de novo transcriptomes from three species, based on orthologous genes. To do so, after assembly, I clustered the sequences with cd-hit and then ran the EvidentialGene software on the clustered output to filter the sequences.

The BUSCO results are very good at each step (assembly, cd-hit, EvidentialGene), except for one of the transcriptomes, which loses about 60% of its sequences and shows only 54% completeness on BUSCO after the EvidentialGene step.

I have tested several configurations with different parameters, such as:

MINAA=30, MINAA=20, MINAA=15
pHeterozygosity=1

I also tried other parameters, but I still cannot obtain a good completeness score for this particular transcriptome.

I was wondering if there is anything I can adjust or any specific approach I should try to resolve this issue. I would greatly appreciate any advice or suggestions that could help me improve the completeness of this transcriptome after the EvidentialGene step.

Thank you in advance for your help.

Juliette

Don Gilbert

Apr 8, 2025, 4:10:57 PM

to berger juliette, EvidentialGene

If I understand this case, Juliette, you ran a separate cd-hit reduction step and then EvidentialGene (which runs cd-hit internally), and the loss of BUSCO conserved genes appears after the EvidentialGene step.

I have seen papers report using Evigene after a separate cd-hit reduction. This may be your mistake: Evigene uses cd-hit on coding sequences to reduce duplicates, and you should not run cd-hit before that, especially not the cd-hit-est version on full transcripts, because some transcripts are mis-assembled from two or more genes. The loss of conserved proteins at the Evigene step may be due to a large proportion of mis-assemblies in that one sample (or possibly a biological effect), combined with a conflict between the separate cd-hit reduction and Evigene's own cd-hit reduction.

If you did use cd-hit separately, I'd suggest trying without that step: Evigene's tr2aacds with input of original-trassembly.fasta should do what you need.
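
A minimal sketch of that suggestion in Python: it only assembles the argument list for running tr2aacds.pl directly on the raw assembly, with no separate cd-hit step in front. The flags are the ones that appear in the command quoted later in this thread; the script path is a placeholder, and the helper name `build_tr2aacds_cmd` is made up for illustration.

```python
def build_tr2aacds_cmd(script, assembly, ncpu=2, maxmem=5000, minaa=30, logfile=None):
    """Return the argument list for tr2aacds.pl run on an unreduced assembly.

    Only flags seen in the command quoted in this thread are used.
    """
    cmd = [
        "perl", str(script),
        "-mrna", str(assembly),   # raw assembler output, not a cd-hit-reduced file
        f"-NCPU={ncpu}",
        f"-MAXMEM={maxmem}",
        f"-MINAA={minaa}",
    ]
    if logfile:
        cmd += ["-logfile", str(logfile)]
    return cmd


# Placeholder path: point this at your own Evigene install.
cmd = build_tr2aacds_cmd(
    "path/to/evigene/scripts/prot/tr2aacds.pl",
    "original-trassembly.fasta",
    logfile="evigene_raw.log",
)
print(" ".join(cmd))
```

To launch the run rather than print it, the list can be passed to `subprocess.run(cmd, check=True)`.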

If this doesn't resolve this problem, a look at your data and logs from Evigene runs may help me to understand this.

- Don Gilbert

--

don gilbert - www.bio.net - bioinformatics - indiana.u.

berger juliette

Apr 11, 2025, 1:08:40 AM

to Don Gilbert, evident...@googlegroups.com

On Thu, Apr 10, 2025, 15:05, berger juliette <bergerju...@gmail.com> wrote:

Hello Don Gilbert,

After testing Evigene on the raw assembly, I obtained 70% BUSCO gene completeness. To better understand this significant loss compared to BUSCO on the raw assembly (97%), I ran BUSCO on the Evigene classes showing the most loss (dropset). BUSCO found 42% completeness for the perfdupl (perfect duplicate) class. I am hesitant about whether to add these sequences to those in the okayset results, because the dropped perfdupl sequences still represent 15% of the total sequences. I'm really afraid of losing biological information.
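
One way to see exactly which transcripts landed in a given drop class is to filter the Trinity.trclass table by status and class. The sketch below assumes the table is tab-separated with the transcript ID, the okay/drop status, and the class name in the first three columns, and that class names match those in the class summary of the log; please check this against your own file. The sample rows are invented for illustration only.

```python
import csv
import io


def ids_in_class(trclass_lines, status="drop", class_prefix="perfdupl"):
    """Collect transcript IDs whose status and class match.

    Assumed column layout (verify on your own .trclass file):
    column 0 = transcript ID, column 1 = okay/drop, column 2 = class.
    """
    ids = []
    for row in csv.reader(trclass_lines, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        trid, row_status, row_class = row[0], row[1], row[2]
        if row_status == status and row_class.startswith(class_prefix):
            ids.append(trid)
    return ids


# Invented rows, not real Evigene output.
sample = io.StringIO(
    "tr1\tokay\tmain\ttr2\t100/100\t300,90%,complete\n"
    "tr2\tdrop\tperfdupl\ttr1\t100/100\t300,90%,complete\n"
    "tr3\tdrop\tnoclass\tna\t0/0\t25,40%,partial\n"
)
dropped_dups = ids_in_class(sample)
print(dropped_dups)
```

On a real run the same function can be given an open file handle, for example `with open("Trinity.trclass") as fh: ids = ids_in_class(fh)`, and the resulting IDs compared against the BUSCO hits found in the drop set.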

Thank you again for your help,
Have a nice day,
Juliette
On Wed, Apr 9, 2025 at 10:23, berger juliette <bergerju...@gmail.com> wrote:
Dear Don Gilbert,
I hope you're doing well. I am writing to share the results I obtained after performing an analysis with Evigene on the raw assembly of Cryptocercus. I used the following command:
perl /home/juliette/softwares/evigene/evigene/scripts/prot/tr2aacds.pl -mrna Trinity.fasta -NCPU=2 -MAXMEM=5000 -MINAA=30 -logfile evigene_test1.log
Results obtained:

BUSCO score:

After running cd-hit + Evigene, I obtained a 54% completeness score on busco.

After running Evigene on the raw assembly, the score slightly increased to 70%.

However, this score remains relatively low compared to the 97% completeness I obtained from the raw assembly.

Comparison with other species: When comparing with the two other species I assembled (Salganea and Tryonicus), here are the BUSCO scores after raw assembly + Evigene:

S. incerta: 96% completeness

Tryonicus sp.: 96% completeness

In contrast, for Cryptocercus, despite metrics indicating a superior assembly quality (see below), the BUSCO score after Evigene remains relatively low (70%).
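
The gap can be put side by side as a small calculation. This sketch uses only the BUSCO completeness values reported in this message (raw assembly, then raw assembly + Evigene); the function name `completeness_loss` is made up for illustration.

```python
# BUSCO completeness (%) as reported in this thread:
# (raw assembly, raw assembly + Evigene)
busco = {
    "C. punctulatus": (97, 70),
    "S. incerta": (96, 96),
    "Tryonicus sp.": (96, 96),
}


def completeness_loss(raw, filtered):
    """Loss in percentage points between the raw and the filtered set."""
    return raw - filtered


losses = {species: completeness_loss(*scores) for species, scores in busco.items()}
for species, loss in losses.items():
    print(f"{species}: {loss} percentage points lost after Evigene")
```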

Assembly metrics:
Specimen                                  C. punctulatus  S. incerta  Tryonicus sp.
Number of transcripts                     599,525         732,825     873,967
GC content (%)                            39.44           36.08       36.90
N50 (bp)                                  1,012           817         955
Ex90N50 (bp)                              1,545           1,516       1,296
Backmapping rate (%)                      > 97            > 97        > 97
Full-length transcripts (> 80% coverage)  5,369           4,491       4,657
BUSCO completeness (%)                    97              96          96
Issue:
Despite the overall better metrics for Cryptocercus, the BUSCO score after Evigene is significantly lower than for the other species. I am therefore perplexed as to the cause of this discrepancy, and I would like to know if you have any insights or suggestions on additional steps I could take to improve this score.
I am also attaching the complete Evigene log file for Cryptocercus so you can examine it in more detail.
Thank you in advance for your help. I look forward to your suggestions.

Best regards,
Juliette
#t2ac: EvidentialGene tr2aacds.pl VERSION 2022.04.05
#t2ac: CMD: tr2aacds.pl -mrna Trinity.fasta -NCPU=2 -MAXMEM=5000 -MINAA=30 -logfile evigene_test1.log
#t2ac: app=blastn, path=/usr/bin/blastn
#t2ac: app=makeblastdb, path=/usr/bin/makeblastdb
#t2ac: app=fastanrdb, path=/usr/bin/fastanrdb
#t2ac: app=cd-hit-est, path=/home/juliette/cd-hit-v4.8.1-2019-0228/cd-hit-est
#t2ac: app=cd-hit, path=/home/juliette/cd-hit-v4.8.1-2019-0228/cd-hit
#t2ac: evigeneapp=cdna_bestorf.pl, path=/home/juliette/softwares/evigene/evigene/scripts/prot/../cdna_bestorf.pl
#t2ac: evigeneapp=prot/traa2cds.pl, path=/home/juliette/softwares/evigene/evigene/scripts/prot/../prot/traa2cds.pl
#t2ac: evigeneapp=prot/aaqual.sh, path=/home/juliette/softwares/evigene/evigene/scripts/prot/../prot/aaqual.sh
#t2ac: evigeneapp=rnaseq/asmrna_dupfilter4.pl, path=/home/juliette/softwares/evigene/evigene/scripts/prot/../rnaseq/asmrna_dupfilter4.pl
#t2ac: evigeneapp=rnaseq/asmrna_altreclass4.pl, path=/home/juliette/softwares/evigene/evigene/scripts/prot/../rnaseq/asmrna_altreclass4.pl
#t2ac: evigeneapp=makeblastscore.pl, path=/home/juliette/softwares/evigene/evigene/scripts/prot/../makeblastscore.pl
#t2ac: evigeneapp=prot/cdsqual.sh, path=/home/juliette/softwares/evigene/evigene/scripts/prot/../prot/cdsqual.sh
#t2ac: evigeneapp=genes/blasttrset2exons2.pl, path=/home/juliette/softwares/evigene/evigene/scripts/prot/../genes/blasttrset2exons2.pl
#t2ac: evigeneapp=genes/trclass2pubset.pl, path=/home/juliette/softwares/evigene/evigene/scripts/prot/../genes/trclass2pubset.pl
#t2ac: BEGIN with cdnaseq= Trinity.fasta date= Wed Apr 9 07:57:28 CEST 2025
#t2ac: bestorf_cds== Trinity.cds nrec= 607922
#t2ac: isStrandedRNA=0, f:494/r:506
#t2ac: nonredundant_cds== Trinitynr.cds nrec= 550587
#t2ac: nonredundant_reassignbest= 0 of 0
#t2ac: add_consensus_idset n=0 from .//Trinity_clustered.consensus
#t2ac: cds -extend2utr 900,60 bp in Trinitynrxu.cds
#t2ac: nofragments_cds== Trinitynrxucd1.cds nrec= 543803
#t2ac: blastn_cds= Trinitynrcd1x-self98.blastn
#t2ac: skip step4.1 aadup clustering
#t2ac: CMD= /home/juliette/softwares/evigene/evigene/scripts/prot/../rnaseq/asmrna_dupfilter4.pl -aasize Trinity.aa.qual -CDSALIGN -blastab Trinitynrcd1x-self98.blastn -aconsensus Trinity.consensus -pCDSOK=20 -pCDSBAD=20 -ALTFRAG=0.5 -outeqtab Trinity.alntab -outclass Trinity.trclass >Trinity.adupfilt.log 2>&1
#t2ac: asmdupfilter_cds= Trinity.trclass
# Class Table for Trinity.trclass
class      %okay  %drop  okay    drop
althi      2.8    6.1    10602   23090
althi1     3      2.4    11208   9242
althinc    3.1    0      11786   0
altmfrag   0.2    1.7    759     6666
altmid     0.7    5.1    2760    19359
main       4.4    1.4    16403   5338
mainnc     5.5    0      20581   0
noclass    7.1    20.9   26588   78211
noclassnc  11.9   0      44517   0
parthi     0      4.5    0       16884
parthi1    0      1.2    0       4587
perfdupl   0      15.3   0       57335
perffrag   0      1.8    0       6807
smallorf   0      0      0       0
---------------------------------------------
total      38.9   61     145204  227519
=============================================
# AA-quality for okay set of Trinity.aa.qual (no okalt): all and longest 1000 summary
okay.top n=1000; average=1602; median=1430; min,max=1094,5317; nfull=922; sum=1602884; gaps=0,0
okay.all n=108089; average=121; median=68; min,max=20,5317; nfull=83113; sum=13129864; gaps=0,0
#t2ac: asmdupfilter_fileset= Trinity.okay.tr Trinity.okalt.tr Trinity.drop.tr Trinity.okay.aa Trinity.okalt.aa Trinity.drop.aa Trinity.okay.cds Trinity.okalt.cds Trinity.drop.cds
#t2ac: tidyup output folders: okayset dropset inputset tmpfiles
#t2ac: CMD= mv Trinity.okay.tr okayset/Trinity.okay.tr
#t2ac: CMD= mv Trinity.okalt.tr okayset/Trinity.okalt.tr
#t2ac: CMD= mv Trinity.okay.aa okayset/Trinity.okay.aa
#t2ac: CMD= mv Trinity.okalt.aa okayset/Trinity.okalt.aa
#t2ac: CMD= mv Trinity.okay.cds okayset/Trinity.okay.cds
#t2ac: CMD= mv Trinity.okalt.cds okayset/Trinity.okalt.cds
#t2ac: CMD= mv Trinity.drop.tr dropset/Trinity.drop.tr
#t2ac: CMD= mv Trinity.drop.aa dropset/Trinity.drop.aa
#t2ac: CMD= mv Trinity.drop.cds dropset/Trinity.drop.cds
#t2ac: CMD= mv Trinity.cds inputset/Trinity.cds
#t2ac: CMD= mv Trinity.aa inputset/Trinity.aa
#t2ac: CMD= mv Trinity.aa.qual inputset/Trinity.aa.qual
#t2ac: CMD= mv Trinitynr.cds tmpfiles/Trinitynr.cds
#t2ac: CMD= mv Trinitynr.aa tmpfiles/Trinitynr.aa
#t2ac: CMD= mv Trinitynrxu.cds tmpfiles/Trinitynrxu.cds
#t2ac: CMD= mv Trinitynrxucd1.cds tmpfiles/Trinitynrxucd1.cds
#t2ac: CMD= mv Trinitynrxucd1.cds.clstr tmpfiles/Trinitynrxucd1.cds.clstr
#t2ac: CMD= mv Trinitynrxucd1.log tmpfiles/Trinitynrxucd1.log
#t2ac: CMD= mv Trinitynrcd1x.cds tmpfiles/Trinitynrcd1x.cds
#t2ac: CMD= mv Trinitynrcd1x-self98.blastn tmpfiles/Trinitynrcd1x-self98.blastn
#t2ac: CMD= mv Trinitynrcd1x_db.log tmpfiles/Trinitynrcd1x_db.log
#t2ac: CMD= mv Trinity.alntab tmpfiles/Trinity.alntab
#t2ac: CMD= mv Trinity.adupfilt.log tmpfiles/Trinity.adupfilt.log
#t2ac: CMD= env outcds=1 /home/juliette/softwares/evigene/evigene/scripts/prot/../prot/cdsqual.sh tmpfiles/Trinitynrcd1x.cds
#t2ac: CMD= /home/juliette/softwares/evigene/evigene/scripts/prot/../makeblastscore.pl -pIDENTMIN 99.999 -pmin 0.01 -CDSSPAN -showspan=2 -tall -sizes tmpfiles/Trinitynrcd1x.cds.qual tmpfiles/Trinitynrcd1x-self98.blastn > tmpfiles/Trinitynrcd1x-self100.btall
#t2ac: CMD= /home/juliette/softwares/evigene/evigene/scripts/prot/../genes/trclass2pubset.pl -onlypub -norealt -noaltdrops -log -debug -class Trinity.trclass
#t2ac: CMD= sort -k7,7nr -k2,2 -k6,6nr -k1,1 tmpfiles/Trinitynrcd1x-self100.btall | env pubids=publicset/Trinity.pubids debug=1 /home/juliette/softwares/evigene/evigene/scripts/prot/../genes/blasttrset2exons2.pl > tmpfiles/Trinitynrcd1x.exontab
#t2ac: CMD= /home/juliette/softwares/evigene/evigene/scripts/prot/../genes/trclass2pubset.pl -noaltdrops -exontab tmpfiles/Trinitynrcd1x.exontab -log -debug -class Trinity.trclass
#t2ac: tidy okayset => okayset1st, stage2 reduction => okayset
#t2ac: DONE at date= Wed Apr 9 08:03:15 CEST 2025
#t2ac: ======================================
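
To see where the dropped sequences go, the class table in the log above can be tallied with a few lines of Python. This is a sketch only: the rows are copied from the log, and the parser assumes whitespace-separated columns in the order class, %okay, %drop, okay, drop.

```python
# Rows copied from the "Class Table for Trinity.trclass" section of the log.
CLASS_TABLE = """\
althi 2.8 6.1 10602 23090
althi1 3 2.4 11208 9242
althinc 3.1 0 11786 0
altmfrag 0.2 1.7 759 6666
altmid 0.7 5.1 2760 19359
main 4.4 1.4 16403 5338
mainnc 5.5 0 20581 0
noclass 7.1 20.9 26588 78211
noclassnc 11.9 0 44517 0
parthi 0 4.5 0 16884
parthi1 0 1.2 0 4587
perfdupl 0 15.3 0 57335
perffrag 0 1.8 0 6807
smallorf 0 0 0 0
"""


def parse_class_table(text):
    """Map class name -> (okay count, drop count)."""
    rows = {}
    for line in text.splitlines():
        name, _pct_okay, _pct_drop, okay, drop = line.split()
        rows[name] = (int(okay), int(drop))
    return rows


rows = parse_class_table(CLASS_TABLE)
total_okay = sum(okay for okay, _ in rows.values())
total_drop = sum(drop for _, drop in rows.values())
total = total_okay + total_drop

# Rank classes by number of dropped sequences.
top_drops = sorted(rows.items(), key=lambda kv: kv[1][1], reverse=True)[:3]
for name, (_, drop) in top_drops:
    print(f"{name}: {drop} dropped ({100 * drop / total:.1f}% of all sequences)")
```

The sums reproduce the `total` row of the log (145204 okay, 227519 dropped), and the ranking shows that noclass and perfdupl together account for most of the drop set.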

berger juliette

Apr 11, 2025, 1:09:09 AM

to Don Gilbert, evident...@googlegroups.com
