Re: Queries related to Evigene for my work

Don Gilbert

Apr 20, 2024, 4:50:14 PM

to Deepti Rao, gilb...@indiana.edu, EvidentialGene

"..We generated RNA-seq (Illumina) and Iso-seq (PacBio) data .. built transcripts with these datasets using RNA-STAR and Stringtie [and] Iso-seq3.. I have a non-redundant dataset from stringtie merge .. I've also tried running MAKER.. "

Q1: Can I use my non-redundant transcript set as input, along with the genome assembly and avoid the trinity assembly step?

A1: Yes.
I don't know how stringtie merge operates. It may be worth the time and effort to compare its result with what Evigene produces from your full set of transcripts, since reducing such a set is what Evigene was designed to do.

I don't know how MAKER operates these days; I tried it and compared gene data against it some years ago, and have nothing useful to say about it. Given your RNA transcript assemblies from a variety of sources, I'd say Evigene will produce a non-redundant gene set as good as, or better than, what you are likely to get via other methods.
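To give Evigene the full set rather than the already-merged one, the transcript FASTA files from each assembler can first be pooled into a single file. The following is a minimal sketch of that step only, not part of Evigene; the function name `pool_transcripts` and the source tags are hypothetical, and it assumes plain uncompressed FASTA input. It prefixes every record ID with a tag for its source so that IDs from different assemblers cannot collide.

```python
def pool_transcripts(sources, out_path):
    """Concatenate FASTA files, prefixing each record ID with a source tag.

    sources:  dict mapping a short tag (hypothetical, e.g. "stringtie")
              to the path of an uncompressed FASTA file.
    out_path: path of the pooled FASTA file to write.
    Returns the number of records written.
    """
    seen = set()
    count = 0
    with open(out_path, "w") as out:
        for tag, fasta in sources.items():
            with open(fasta) as fh:
                for line in fh:
                    if line.startswith(">"):
                        # Split the header into the ID and any description.
                        parts = line[1:].strip().split(None, 1)
                        if not parts:
                            raise ValueError(f"empty FASTA header in {fasta}")
                        new_id = f"{tag}_{parts[0]}"
                        if new_id in seen:
                            raise ValueError(f"duplicate transcript ID: {new_id}")
                        seen.add(new_id)
                        rest = f" {parts[1]}" if len(parts) > 1 else ""
                        out.write(f">{new_id}{rest}\n")
                        count += 1
                    elif line.strip():
                        # Sequence line: copy it, ensuring a trailing newline.
                        out.write(line if line.endswith("\n") else line + "\n")
    return count
```

A call such as `pool_transcripts({"stringtie": "stringtie.fa", "isoseq": "isoseq.fa"}, "pooled.fa")` (file names are placeholders) would write one pooled file whose IDs show where each transcript came from, which also makes it easier to see afterwards which source contributed to the final gene set.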

Q2: How are the runtimes of evigene compared to MAKER?
A2:
The tr2aacds portion of Evigene is fairly quick. It reduces large over-assemblies of transcripts to a non-redundant subset of coding genes with alternate transcripts. You can use many CPUs, but be advised that it needs about 2 GB of memory per CPU (memory use depends on the complexity of your transcript set), so with 32 CPUs and roughly 64 GB of memory it will reduce several hundred thousand assemblies to a non-redundant set in under 1 hour of run time. There are additional portions of Evigene, notably tr2ncrna for non-coding genes, and Gnodes if you care to measure the copy numbers of your genes using DNA data.
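The memory rule of thumb above is simple arithmetic, sketched here for sizing a cluster job. The function names are hypothetical and the 2 GB per CPU default is only the rough figure quoted above; actual use depends on the transcript set.

```python
def estimate_memory_gb(ncpu, gb_per_cpu=2.0):
    """Memory to request (GB) for a given CPU count, at ~2 GB per CPU."""
    if ncpu < 1:
        raise ValueError("ncpu must be at least 1")
    return ncpu * gb_per_cpu


def max_cpus_for_memory(mem_gb, gb_per_cpu=2.0):
    """Largest CPU count that fits in the available memory (at least 1)."""
    if mem_gb <= 0:
        raise ValueError("mem_gb must be positive")
    return max(1, int(mem_gb // gb_per_cpu))


print(estimate_memory_gb(32))    # 64.0
print(max_cpus_for_memory(24))   # 12
```

If a run fails for lack of memory on a complex transcript set, the same functions can be called with a larger `gb_per_cpu` to choose a lower CPU count for the same machine.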

PS: as I've been spending much time comparing long- and short-read DNA data for measuring gene/genome copy numbers, one observation is that some of the PacBio results (esp. for HiFi CCS data) under-perform other data types, esp. Oxford Nanopore. Some years back, when I first compared transcriptome data from PacBio Iso-seq with Illumina, the tradeoff was that PacBio Iso-seq, while producing long transcripts, was deficient in recovering *all* of the biological transcripts that well-assembled Illumina short reads would find. So you are going about this well by merging transcripts from both sources to get a complete gene-ome.

-- Don Gilbert

On Thu, Apr 4, 2024 at 2:41 AM Deepti Rao <deep...@csirccmb.org> wrote:

Dear Don,
I'm writing to get your perspective on whether I can use Evigene for my work. I have a rice genome assembly that I am trying to annotate. We generated RNA-seq (Illumina) and Iso-seq (PacBio) data for this rice variety. I have built transcripts with these datasets using RNA-STAR and Stringtie (for the short reads), and the Iso-seq3 pipeline for the long reads. I have a non-redundant dataset from stringtie merge that's given me 24k genes and 35k transcripts. I've also tried running MAKER using this data and transcript data from 2 other rice genomes, and protein evidence from Swiss-Prot. I have a little over 5k transcripts after round 1 of MAKER.

I have some questions:
Can I use my non-redundant transcript set as input, along with the genome assembly and avoid the trinity assembly step?
How are the runtimes of evigene compared to MAKER?

--
Regards,

Deepti Rao
Graduate Student,
CCMB, Hyderabad
India