Dear Laureano and others,
This is a common question, that BUSCO scores of Evigene's, or any methods, transcript sets have many "Duplicates", and how to fix that. The general answer is BUSCO needs to be taught about alternate transcripts of genes.
Q1: Given the high amount of duplicates in busco [78% of evigene okay set]..
Q1b: So, all evgclass=main and evgclass=noclass were extracted to a multi-fasta..
A1b: ... this is partway there, see below.
Q1c: In the filtration step, I losted alternative transcripts that were being recognized by busco..
A1c: Yes, this happens as the longest protein (gene0001t1, class=main) is not always the most conserved protein.
A1:
Busco software doesn't recognize that alternate transcripts exist, so its "duplicate" score is meaningless when alternates are included. But any of the alternate transcripts at a locus may be the most conserved variant, not just the longest, so if one ones the most precise measure of conserved genes in a transcriptome with BUSCO methods, one should (a) measure all isoforms (e.g. the Evigene okayset proteins), then (b) recalculate BUSCO's full_table to produce scores per gene locus, rather than per transcript. That is available with an Evigene script: evigene/scripts/omcl/
evg_buscogenesum.plRun as
env dotab=1 summary=busco.sum.txt $evigene/scripts/omcl/
evg_buscogenesum.pl your_busco_full_table.tsv
Output is your_busco_full_table.tsv.evg, and summary in "busco.sum.txt"
There is also a modified BUSCO software script (vers2) that understands Evigene's gene/isoform IDs, to skip this extra step of
evg_buscogenesum.pl :
$evigene/scripts/omcl/evg_busco2v.py
===== Example =================
Input BUSCO table: C:99.4%[S:37.8%,D:61.6%]
Output BUSCO.evg table: C:99.4%[S:93.3%,D:6.1%]
Thus 55% of BUSCO "Duplicates" are alternates at one gene locus. This is normal gene biology.
Input BUSCO full_table.tsv with evigene alternate transcripts -----------------
# zfish17evgpub_buscofull_table.tsv
# BUSCO version is: 2.0.1
# The lineage dataset is: vertebrata_odb9 (Creation date: 2016-02-13, number of species: 65, number of BUSCOs: 2586)
# To reproduce this run: python /home/ux455375/bio/orthodb/BUSCOv2.py -i zfish17evg6m6v.aa -o buvezfish17evg6m6v -l /home/ux455375/scratchn/chrs/pubdata/orthodb/vertebrata/ -m proteins -c 12 -sp human
#
# Busco id
Status
Sequence
Score
Length
EOG090B000N
Duplicated
vDanrer6pEVm000030t1
8261.4
4123
EOG090B000N
Duplicated
vDanrer6pEVm000030t3
8280.6
4124
EOG090B000N
Duplicated
vDanrer6pEVm000030t4
8187.6
4123
EOG090B000N
Duplicated
vDanrer6pEVm000030t5
8217.5
4123
EOG090B0021
Duplicated
vDanrer6pEVm000196t1
927.4
675
Original busco summary
C:99.4%[S:37.8%,D:61.6%],F:0.2%,M:0.4%,n:2586
Run this:
env dotab=1 summary=busco.sum.txt $evigene/scripts/omcl/
evg_buscogenesum.pl zfish17evgpub_buscofull_table.tsv
Evigene busco summary
C:99.4%[S:93.3%,D:6.1%],F:0.3%,M:0.3%,n:2586, ngene:2761,ntr:6197
Complete:2570,Single:2413,Duplicated:157,Fragmented:7,Missing:9,n:2586, ngene:2761,ntr:6197, align:442.3, score:738.3
## output evigene busco table corrected for one transcript per gene ---------------
# zfish17evgpub_buscofull_table.tsv.evg
# BUSCO version is: 2.0.1
# The lineage dataset is: vertebrata_odb9 (Creation date: 2016-02-13, number of species: 65, number of BUSCOs: 2586)
# To reproduce this run: python /home/ux455375/bio/orthodb/BUSCOv2.py -i zfish17evg6m6v.aa -o buvezfish17evg6m6v -l /home/ux455375/scratchn/chrs/pubdata/orthodb/vertebrata/ -m proteins -c 12 -sp human
#
#BuscoID
Status
TranscriptID
Score
Align
GeneID
EOG090B000N
Complete
vDanrer6pEVm000030t3
8280.6
4124
vDanrer6pEVm000030:4alts,t1:4123,t4:4123,t5:4123
EOG090B0021
Complete
vDanrer6pEVm000196t1
927.4
675
vDanrer6pEVm000196:5alts,t2:675,t3:675,t4:675,t5:675
EOG090B002I
Complete
vDanrer6pEVm000141t1
4296.8
2680
vDanrer6pEVm000141:2alts,t2:2679
EOG090B0039
Complete
vDanrer6pEVm000136t1
5188.2
2615
vDanrer6pEVm000136:18alts, t2:2615,t3:2615,t4:2615,t7:2615,t8:2615,t9:2615,t13:2599,t10:2591,t11:2591,t5:2591,t17:2351,t23:2297,t18:2273,t19:2273,t24:2273,t20:2272,t22:2259
EOG090B003S
Complete
vDanrer6pEVm000167t1
4041.1
2325
vDanrer6pEVm000167:1alt
=== END example =========================================
Q2: For instance [ a problem of Evigene gene alternates having different proteins ]
>TF_000003t6 ---> recognized as 28at5204 entry in OrthoDb
>TF_000003t1 ---> missing Busco
Blast search gives different hits for t1 and t6.
A2: Misclassification of transcripts to one gene locus happens, but isn't limited to Evigene's methods.
I would need to look at your transcript sequences to say what this example means, but options include
(a) an artifact/poor transcript assembly TF_000003t1, which has a longer protein than TF_000003t6, but shares exon-size or more sequence spans with 3t6, i.e. this is the classical definition of alternates for one gene: they share the same exons;
(b) true biological alternate transcripts with different proteins. This does occur in nature, though not as often as assembly artifacts.
It is always good to use two or more methods to measure genomic (and gene-omic) entities, as there are flaws in many methods, and comparisons when time permits often help you to decide what the true answer is. I have measured these kinds of mistakes in several gene/transcript construction tools besides EvidentialGene, and the results are that Evigene is more accurate, though none are fully accurate, and all depend on what kind and amount of gene data you use. It may be that for your gene set, CD-Hit or some other method would work better than Evigene, but one needs to measure full gene sets to compare. There are often a few discrepancies in one method that another method resolves, so that a human expert-curated gene set can often improve on any computed method.
One option you have is to include protein homology scores when running Evigene's tr2aacds: that will improve its choices of "best" primary transcript for each gene group with the homology scores.
tr2aacds.pl -cdnaseq transcripts.fasta[.gz]
options:
-ablastab=blastp_table : use blast scores to keep some transcripts (blastp traa x refdb -outfmt 7; see docs)
Please see this paper for detailed comparisons of methods around transcript classification and/or clustering.
Gilbert, DG. (2019). Longest protein, longest transcript or most expression,
for accurate gene reconstruction of transcriptomes? bioRxiv ,
https://doi.org/10.1101/829184