Dear Dr. Gilbert,
Thank you for allowing me to join the group. I have some questions before I start using EvidentialGene to incorporate de novo assembly results, and I would really appreciate it if you could address them.
First, I work on two closely related plant species with high ploidy (12X) and no genome sequence, so I'm wondering how EvidentialGene will treat those 12 alleles (or perhaps as many as 24 for genes with high heterozygosity): as paralogs or as splicing variants? In your paper entitled "Longest protein, longest transcript or most expression, for accurate gene reconstruction of transcriptomes?", you classified reference transcripts based on exon identity. Transcripts sharing exons at >99% identity are considered alternates, and those at >97% and <=99% are considered paralogs (which may have alternates). Is the same principle applied in the EvidentialGene pipeline?
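To check that I have read the rule correctly, here is a minimal Python sketch of the thresholds as I understand them from the paper. The function name and the "unclassified" label for pairs at or below 97% are my own assumptions, and none of this is taken from the EvidentialGene code itself.

```python
def classify_pair(exon_identity_pct):
    """Classify a transcript pair by shared-exon identity (percent).

    Thresholds follow my reading of the paper; the label for pairs
    at or below 97% is an assumption, since the rule does not cover them.
    """
    if exon_identity_pct > 99.0:
        return "alternate"      # >99%: alternates of the same locus
    if exon_identity_pct > 97.0:
        return "paralog"        # >97% and <=99%: paralogs
    return "unclassified"       # <=97%: not covered by the rule as quoted

# Hypothetical identity values for three transcript pairs.
labels = [classify_pair(p) for p in (99.5, 98.2, 95.0)]
print(labels)
```

My concern is that alleles in a 12X genome could fall on either side of the 99% boundary depending on heterozygosity.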
Second, I would like to sum the expression of all the alleles for comparison, since presumably all the expressed alleles are identical in function. I have seen papers that already use this strategy in their analyses. For example, in the paper “The Evolution of Gene Expression Underlying Vision Loss in Cave Animals” (https://academic.oup.com/mbe/article/35/8/2005/5000155), Stern wrote that “multiple transcripts were present for a species in an orthogroup, expected counts were summed across transcripts.” What do you think of this strategy? I know that in most cross-species comparisons reciprocal pairwise best hits are preferred, but I feel that approach would be very complicated for the high-ploidy species in my case.
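To make the question concrete, here is a minimal Python sketch of the summing I have in mind. The transcript IDs, orthogroup IDs, and counts are made up for illustration, and the function name is my own; it is not the code used in the paper above.

```python
from collections import defaultdict

def sum_counts_by_orthogroup(transcript_to_orthogroup, expected_counts):
    """Sum expected counts over all transcripts assigned to the same orthogroup."""
    totals = defaultdict(float)
    for transcript_id, count in expected_counts.items():
        orthogroup = transcript_to_orthogroup.get(transcript_id)
        if orthogroup is None:
            continue  # skip transcripts not assigned to any orthogroup
        totals[orthogroup] += count
    return dict(totals)

# Hypothetical data for one species: three allelic transcripts in one
# orthogroup, one single-copy transcript, and one unassigned transcript.
transcript_to_orthogroup = {
    "tr1_a": "OG0001",
    "tr1_b": "OG0001",
    "tr1_c": "OG0001",
    "tr2": "OG0002",
}
expected_counts = {"tr1_a": 10.0, "tr1_b": 5.5, "tr1_c": 4.5, "tr2": 7.0, "tr3": 2.0}

summed = sum_counts_by_orthogroup(transcript_to_orthogroup, expected_counts)
print(summed)
```

The same summing would be done separately for each species and each sample, so that the orthogroup totals can then be compared across the two species.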
Thank you in advance for your reply. I hope you have a good weekend.
Best,
Yuwei