Some questions before EvidentialGene usage


Yuwei Xiao

Mar 8, 2024, 6:04:32 PM
to EvidentialGene

Dear Dr. Gilbert, 

 First, thank you for allowing me to join the group. I have some questions before I start using EvidentialGene to combine de novo assembly results. I would really appreciate it if you could address them.


 Since I work on two closely related plant species with high ploidy (12X) and no genome sequence, I'm wondering how EvidentialGene will treat those 12 alleles (or maybe as many as 24 for genes with high heterozygosity): as paralogs or as splicing variants? In your paper entitled "Longest protein, longest transcript or most expression, for accurate gene reconstruction of transcriptomes?", you classified reference transcripts based on exon identity. Transcripts sharing exons at >99% identity are considered alternates, and those at >97% and <=99% are considered paralogs (which may have alternates). Is the same principle applied in the EvidentialGene pipeline?


Second, I would like to sum the expression of all the alleles together for comparison, since presumably all the expressed alleles are identical in function. I have seen some papers that already use this strategy. In the paper "The Evolution of Gene Expression Underlying Vision Loss in Cave Animals" (https://academic.oup.com/mbe/article/35/8/2005/5000155), Stern wrote that “multiple transcripts were present for a species in an orthogroup, expected counts were summed across transcripts.” What do you think of this strategy? I know that in most cross-species comparisons, reciprocal pairwise best hits are preferred, but I feel that would be very complicated for the high-ploidy species in my case.


 Thank you in advance for your reply. Hope you have a good weekend.


Bests,

Yuwei

Don Gilbert

Apr 20, 2024, 3:53:11 PM
to EvidentialGene
Dear Yuwei and other readers,

My apologies for the long delay in answering this. I've been stuck trying to understand genomic DNA measures, which are probably more complex than transcriptomic ones, and at least have far more research to wade through.  The problem of measuring duplications in genes and genomes is both important and very difficult, even with the newest DNA sequencing methods.

Q1:

Since I work on two closely related plant species with high ploidy (12X) and no genome sequence, I'm wondering how EvidentialGene will treat those 12 alleles (or maybe as many as 24 for genes with high heterozygosity): as paralogs or as splicing variants?
In your paper entitled "Longest protein, longest transcript or most expression, for accurate gene reconstruction of transcriptomes?", you classified reference transcripts based on exon identity. Transcripts sharing exons at >99% identity are considered alternates, and those at >97% and <=99% are considered paralogs (which may have alternates). Is the same principle applied in the EvidentialGene pipeline?

A1:
Yes, the tr2aacds portion of Evigene uses only intrinsic evidence, i.e. sequence identity over exon-sized spans, to distinguish alternates of one gene locus from paralogs.  There is no better way using only intrinsic evidence, though this approach makes mistakes where paralogs are highly identical.  I have tested variations on this, including identity cut-offs of 97% to 99% for separating paralogs from alternates.  The paper you mention explains some of this work.
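
To make the cut-off idea concrete, here is a minimal sketch of an identity-threshold classifier. It is not the tr2aacds code: the function name, the "distinct" label for pairs at or below 97%, and the default cut-offs (taken from the percentages quoted in the question) are assumptions for illustration only.

```python
def classify_pair(exon_identity_pct, alt_min=99.0, paralog_min=97.0):
    """Label a transcript pair from percent identity over shared exon-sized spans.

    Hypothetical helper, not part of Evigene. Cut-offs follow the question:
    >99% -> alternates of one locus, >97% and <=99% -> paralogs.
    """
    if exon_identity_pct > alt_min:
        return "alternate"
    if exon_identity_pct > paralog_min:
        return "paralog"
    # Assumption: anything at or below the lower cut-off is a separate locus.
    return "distinct"


if __name__ == "__main__":
    for pct in (99.5, 98.0, 90.0):
        print(pct, classify_pair(pct))
```

With cut-offs like these, a pair of highly identical paralogs (say 99.5%) lands in the "alternate" class, which is the kind of mistake described above.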

The "best" way is to add DNA evidence, specifically DNA-coverage copy numbers of transcripts.
This is part of the Gnodes portion of Evigene.  I am still working on Gnodes DNA measurements, and it will be merged with the tr2aacds pipeline for better paralog/alternate classification.   Gnodes can be used now, if you have DNA sequences, to classify your transcripts for duplicates and copy number.  Use inputs of (a) DNA reads, short and/or long (PacBio, Nanopore), and (b) primary coding sequences from Evigene's okayset CDS, the "t1" primary transcript per gene.  Gnodes maps DNA to transcripts and produces a table of those t1 transcripts, each marked as unique or duplicated, with copy number estimates.
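
As a rough illustration of what a coverage-based copy number estimate can look like, the sketch below divides each transcript's mean DNA read depth by the median depth of transcripts assumed to be single-copy. This is not how Gnodes is implemented; the function name, the input dictionary, the list of presumed single-copy IDs, and the 1.5 duplication cut-off are all hypothetical.

```python
from statistics import median


def copy_number_table(depth_by_transcript, single_copy_ids, dup_cutoff=1.5):
    """Estimate copy number per t1 transcript from mean DNA read depth.

    Hypothetical sketch: the baseline is the median depth of transcripts
    assumed to be single-copy, and copy number = depth / baseline.
    """
    baseline = median(depth_by_transcript[t] for t in single_copy_ids)
    if baseline <= 0:
        raise ValueError("baseline depth must be positive")
    table = {}
    for tid, depth in depth_by_transcript.items():
        copies = depth / baseline
        label = "duplicated" if copies >= dup_cutoff else "unique"
        table[tid] = (round(copies, 2), label)
    return table


if __name__ == "__main__":
    # Made-up depths for four primary transcripts.
    depths = {"g1t1": 30.0, "g2t1": 32.0, "g3t1": 28.0, "g4t1": 90.0}
    for tid, row in copy_number_table(depths, ["g1t1", "g2t1", "g3t1"]).items():
        print(tid, *row)
```

In a high-ploidy sample, the choice of baseline and cut-off would need care, since alleles and duplicated loci can both raise depth.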

I will have an updated paper on this subject, hopefully this year, covering ways to discriminate paralogs from alternates using transcript sequence and DNA evidence.  Some statistical measures help; others do not work as well.  The pattern of SNPs or variants differs between alternates with shared exons and high-identity paralogs that don't share exons.


Q2:

I would like to sum the expression of all the alleles together for comparison since presumably all the expressed alleles are identical in function.

A2:
This is outside my area of expertise.  I do suggest measuring expression on all alternate transcripts of a gene, and then reducing or correcting for those alternates in some way, to get the most valid measure of gene expression. If you don't include all the alternate exons of a gene, your expression measure can be biased by mis-mapping of RNA-seq reads to the "wrong" exons.  I can't advise on the best statistical method for doing this.
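
For reference, the simplest reduction, the plain summing of counts mentioned in the question, might be sketched as below. This is only an illustration, not a recommended statistical method; the function name, the transcript IDs, and the transcript-to-gene map are hypothetical.

```python
from collections import defaultdict


def sum_counts_by_gene(transcript_counts, transcript_to_gene):
    """Sum expected read counts over all transcripts assigned to each gene.

    Hypothetical sketch: transcript_to_gene maps every quantified transcript
    (alternates, alleles) to the gene or orthogroup it belongs to.
    """
    gene_counts = defaultdict(float)
    for tid, count in transcript_counts.items():
        gene_counts[transcript_to_gene[tid]] += count
    return dict(gene_counts)


if __name__ == "__main__":
    # Made-up counts: gene g1 has two alternates, g2 has one transcript.
    counts = {"g1t1": 10.0, "g1t2": 5.5, "g2t1": 3.0}
    mapping = {"g1t1": "g1", "g1t2": "g1", "g2t1": "g2"}
    print(sum_counts_by_gene(counts, mapping))
```

Reads would still be mapped against all alternates first; the summing happens only after quantification.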