Question about missing genes

62 views

Skip to first unread message

Laureano Jose Giordano

unread,

Apr 18, 2024, 2:30:17 PM4/18/24

to EvidentialGene

Im studying a smut fungi that infect peanut crops. A rna-seq analysis was performed and two transcriptomes were assembled (trinity and spades). The transcriptomes were then collapsed with evigene. Busco assessment of this collapsed transcriptome showed 98.6% of completeness (20% singles and 78% duplicates). Given the high amount of duplicates in busco and the fact that I wanted to do a differential expression analysis, I decided to filter alternative transcripts. So, all evgclass=main and evgclass=noclass were extracted to a multi-fasta. Busco that multi-fasta and I get 97% of completeness (70% singles, 27% duplicates). In the filtration step, I losted alternative transcripts that were being recognized by busco as orthologous and the main transcript representing that alternative transcript could not be recognized by busco.
For instance:
>TF_000003t6 ---> recognized as 28at5204 entry in OrthoDb
>TF_000003t1 ---> missing Busco
Blast search gives different hits for t1 and t6.

When ch-hit was performed, instead of filtering by main or noclass, genes with different locus id were clustered.
>Cluster 4
0 4088nt, >TF_000590t1... at -/100.00%
1 436nt, >TF_009306t1... at +/100.00%
2 16577nt, >TF_000003t6... at +/97.38%
3 16611nt, >TF_000003t7... * ---> also recognized as 28at5204

The question therefore is, Why is evigene classifying as alternatives genes that evidence suggest are different from its main transcript? Is it a good procedure to filter by class? is it convenient to filter raw transcripts (trinity and spades) using cd-hit (Ahmadi et al. 2023) despites what evi docs recommend?
If you could take the time to respond my question, it would be very helpfull.
Laureano Giordano, MSc.

Don Gilbert

unread,

Apr 20, 2024, 4:39:18 PM4/20/24

to EvidentialGene

Dear Laureano and others,

This is a common question, that BUSCO scores of Evigene's, or any methods, transcript sets have many "Duplicates", and how to fix that. The general answer is BUSCO needs to be taught about alternate transcripts of genes.

Q1: Given the high amount of duplicates in busco [78% of evigene okay set]..
Q1b: So, all evgclass=main and evgclass=noclass were extracted to a multi-fasta..
A1b: ... this is partway there, see below.

Q1c: In the filtration step, I losted alternative transcripts that were being recognized by busco..
A1c: Yes, this happens as the longest protein (gene0001t1, class=main) is not always the most conserved protein.

A1:
Busco software doesn't recognize that alternate transcripts exist, so its "duplicate" score is meaningless when alternates are included. But any of the alternate transcripts at a locus may be the most conserved variant, not just the longest, so if one ones the most precise measure of conserved genes in a transcriptome with BUSCO methods, one should (a) measure all isoforms (e.g. the Evigene okayset proteins), then (b) recalculate BUSCO's full_table to produce scores per gene locus, rather than per transcript. That is available with an Evigene script: evigene/scripts/omcl/evg_buscogenesum.pl

Run as
env dotab=1 summary=busco.sum.txt $evigene/scripts/omcl/evg_buscogenesum.pl your_busco_full_table.tsv
Output is your_busco_full_table.tsv.evg, and summary in "busco.sum.txt"

There is also a modified BUSCO software script (vers2) that understands Evigene's gene/isoform IDs, to skip this extra step of evg_buscogenesum.pl :
$evigene/scripts/omcl/evg_busco2v.py

===== Example =================

Input BUSCO table: C:99.4%[S:37.8%,D:61.6%]
Output BUSCO.evg table: C:99.4%[S:93.3%,D:6.1%]
Thus 55% of BUSCO "Duplicates" are alternates at one gene locus. This is normal gene biology.

Input BUSCO full_table.tsv with evigene alternate transcripts -----------------
# zfish17evgpub_buscofull_table.tsv
# BUSCO version is: 2.0.1
# The lineage dataset is: vertebrata_odb9 (Creation date: 2016-02-13, number of species: 65, number of BUSCOs: 2586)
# To reproduce this run: python /home/ux455375/bio/orthodb/BUSCOv2.py -i zfish17evg6m6v.aa -o buvezfish17evg6m6v -l /home/ux455375/scratchn/chrs/pubdata/orthodb/vertebrata/ -m proteins -c 12 -sp human
#
# Busco id Status Sequence Score Length
EOG090B000N Duplicated vDanrer6pEVm000030t1 8261.4 4123
EOG090B000N Duplicated vDanrer6pEVm000030t3 8280.6 4124
EOG090B000N Duplicated vDanrer6pEVm000030t4 8187.6 4123
EOG090B000N Duplicated vDanrer6pEVm000030t5 8217.5 4123
EOG090B0021 Duplicated vDanrer6pEVm000196t1 927.4 675

Original busco summary
C:99.4%[S:37.8%,D:61.6%],F:0.2%,M:0.4%,n:2586

Run this:
env dotab=1 summary=busco.sum.txt $evigene/scripts/omcl/evg_buscogenesum.pl zfish17evgpub_buscofull_table.tsv

Evigene busco summary
C:99.4%[S:93.3%,D:6.1%],F:0.3%,M:0.3%,n:2586, ngene:2761,ntr:6197
Complete:2570,Single:2413,Duplicated:157,Fragmented:7,Missing:9,n:2586, ngene:2761,ntr:6197, align:442.3, score:738.3

## output evigene busco table corrected for one transcript per gene ---------------

# zfish17evgpub_buscofull_table.tsv.evg
# BUSCO version is: 2.0.1
# The lineage dataset is: vertebrata_odb9 (Creation date: 2016-02-13, number of species: 65, number of BUSCOs: 2586)
# To reproduce this run: python /home/ux455375/bio/orthodb/BUSCOv2.py -i zfish17evg6m6v.aa -o buvezfish17evg6m6v -l /home/ux455375/scratchn/chrs/pubdata/orthodb/vertebrata/ -m proteins -c 12 -sp human
#
#BuscoID Status TranscriptID Score Align GeneID
EOG090B000N Complete vDanrer6pEVm000030t3 8280.6 4124 vDanrer6pEVm000030:4alts,t1:4123,t4:4123,t5:4123
EOG090B0021 Complete vDanrer6pEVm000196t1 927.4 675 vDanrer6pEVm000196:5alts,t2:675,t3:675,t4:675,t5:675
EOG090B002I Complete vDanrer6pEVm000141t1 4296.8 2680 vDanrer6pEVm000141:2alts,t2:2679
EOG090B0039 Complete vDanrer6pEVm000136t1 5188.2 2615 vDanrer6pEVm000136:18alts, t2:2615,t3:2615,t4:2615,t7:2615,t8:2615,t9:2615,t13:2599,t10:2591,t11:2591,t5:2591,t17:2351,t23:2297,t18:2273,t19:2273,t24:2273,t20:2272,t22:2259
EOG090B003S Complete vDanrer6pEVm000167t1 4041.1 2325 vDanrer6pEVm000167:1alt
=== END example =========================================

Q2: For instance [ a problem of Evigene gene alternates having different proteins ]

>TF_000003t6 ---> recognized as 28at5204 entry in OrthoDb
>TF_000003t1 ---> missing Busco
Blast search gives different hits for t1 and t6.

A2: Misclassification of transcripts to one gene locus happens, but isn't limited to Evigene's methods.
I would need to look at your transcript sequences to say what this example means, but options include
(a) an artifact/poor transcript assembly TF_000003t1, which has a longer protein than TF_000003t6, but shares exon-size or more sequence spans with 3t6, i.e. this is the classical definition of alternates for one gene: they share the same exons;
(b) true biological alternate transcripts with different proteins. This does occur in nature, though not as often as assembly artifacts.

It is always good to use two or more methods to measure genomic (and gene-omic) entities, as there are flaws in many methods, and comparisons when time permits often help you to decide what the true answer is. I have measured these kinds of mistakes in several gene/transcript construction tools besides EvidentialGene, and the results are that Evigene is more accurate, though none are fully accurate, and all depend on what kind and amount of gene data you use. It may be that for your gene set, CD-Hit or some other method would work better than Evigene, but one needs to measure full gene sets to compare. There are often a few discrepancies in one method that another method resolves, so that a human expert-curated gene set can often improve on any computed method.

One option you have is to include protein homology scores when running Evigene's tr2aacds: that will improve its choices of "best" primary transcript for each gene group with the homology scores.
tr2aacds.pl -cdnaseq transcripts.fasta[.gz]
options:
-ablastab=blastp_table : use blast scores to keep some transcripts (blastp traa x refdb -outfmt 7; see docs)

Please see this paper for detailed comparisons of methods around transcript classification and/or clustering.
Gilbert, DG. (2019). Longest protein, longest transcript or most expression,
for accurate gene reconstruction of transcriptomes? bioRxiv , https://doi.org/10.1101/829184

Reply all

Reply to author

Forward

0 new messages