How to identify isoforms post-Evigene

65 views

Skip to first unread message

Eric Edsinger

unread,

Jul 17, 2023, 3:37:28 PM7/17/23

to Don Gilbert, EvidentialGene

Hi Don,

I saw your new tools Gnomes and am looking forward to working with it on several genome projects where we are getting chromosome-scale assemblies!

I have a question regarding EvidentialGene tr2cdsaa4 - which I think is the latest version (2020) for using EvidentialGene to merge multiple transcriptomes. It runs to completion but there is no structure in the okay transcript headers where I can work out isoforms.

>evgUnnamedSample_HQ_transcript_13 type=protein; aalen=4454,98%,complete; clen=13619; strand=+; offs=58-13422; codepot=Code/0.003; evgclass=main,okay,match:evgUnnamedSample_HQ_transcript_975,pct:100/100/.;

For older versions there was something in the headers along the lines of t1, t2, t3 etc - which I could parse to get the top isoform per "gene/locus".

The total transcripts or proteins in the okay transcriptomes are several hundred thousand - versus the genome should have 20-30,000 coding genes. Its over a million in the alternates, so there is clearly some cleanup going on by tr2cdsaa4.

Looking at the log report it says it was unable to do consensus:

#t2ac: CMD= touch output/82-all-trinity.assemblies-uniquenr.cds.isbest

#t2ac: too many cds have consensus to be usable: 2095809 of 3359082 (62%); skipping consensus..

#t2ac: nonredundant_reassignbest= 2149814 of 2233007

I'm wondering if this might be related to either the transcriptome size or lack of identifying information to sort out isoforms. I've attached the log report.

It's been a while since I used EvidentialGene, so maybe there is something obvious I'm forgetting. If you have any advice, I would love to hear it.

Thank you very much,
Eric

83-log-evigene-all-trinity-assemblies

Don Gilbert

unread,

Jul 17, 2023, 4:35:08 PM7/17/23

to Eric Edsinger, EvidentialGene

Eric,

Something is wrong, but your log file doesn't make it clear what.
Your output
>evgUnnamedSample_HQ_transcript_13 type=protein; aalen=4454,98%,complete; clen=13619

Expected default
>NonamEVm000123t1 type=protein; aalen=4454,98%,complete; clen=13619; evgclass=main;
where t1 indicates 1st or main transcript of gene m000123, t2 .. t99 are alternates.

This pipeline of scripts is somewhat sensitive to input transcript file name, and sometimes the IDs in the file. I suggest you try running your input transcript set thru Evigene's trformat, that may fix this problem. Also try starting tr2aacds in a new, folder empty of all but the input transcript file:

mkdir evg2run
$evigene/rnaseq/trformat.pl -format trinity \
-input output/82-all-trinity.assemblies-unique.fasta \
-output evg2run/alltrin82uniq.tr

cd evg2run/
$evigene/prot/tr2aacds.pl -NCPU=45 MAXMEM=9500000 -MINAA=30 -log -cdnaseq alltrin82uniq.tr

The consensus method in tr2aacds.pl requires different assemblers, or different assembly parameters, and it uses ID patterns to decide when to measure consensus. The default consensus detection won't work for all-trinity input data.

There is a newer version of tr2aacds4.pl, '2022.04.05', that automatically detects stranded-rna inputs

Your okayset/ folder should also have a "pubids" table of transcript, gene IDs and other info, to give you main and alternate transcript info. E.g. okayset/alltrin82uniq.pubids

#Public_mRNA_ID originalID PublicGeneID AltNum Class AAqual pIdAln Notes Oids

HspbcEVm000001t1 SRR12519036.1175616 HspbcEVm000001 1 main 5388,96%,complete 100/100/. pflag:0,selfbest:SRR12519035.565790,exonch:100%cs,100%cx,20cs,icalt:l492.t7,a45s45,tscore:274788, SRR12519036.1175616
HspbcEVm000001t2 SRR12519035.565790 HspbcEVm000001 2 alt 5338,92%,complete 100/100/. pflag:0,selfbest:SRR12519036.1175616,exonch:100%cs,95%cx,20cs,icsub:l492.t7,a43s43,tscore:272238, SRR12519035.565790