Question about duplicate ID warning in Evigene


Laureano Jose Giordano

Mar 6, 2025, 8:15:15 AM
to EvidentialGene

Dear Don,

I've been using EvidentialGene for a couple of months, and it has proven to be an amazing tool. So, thank you for the great work!

I'm creating this thread because I've been encountering a warning/error in some of my Evigene runs. The message looks like this:

#ta2c: ERR: output skip dup id "some_id"

For instance:

#ta2c: ERR: output skip dup id TRINITY_GG_99004_c0_g2_i1

I know for a fact that the ID is not duplicated in my FASTA file.
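For anyone who wants to repeat that check on their own input, here is a minimal sketch in Python. It assumes the sequence ID is the first whitespace-delimited token of each FASTA header line; the function names are made up for this example and are not part of Evigene.

```python
from collections import Counter


def fasta_ids(lines):
    """Yield the ID of each FASTA record.

    Assumption: the ID is the first whitespace-delimited token
    after the '>' on a header line.
    """
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                yield tokens[0]


def duplicate_ids(lines):
    """Return {id: count} for every ID that occurs more than once."""
    counts = Counter(fasta_ids(lines))
    return {seq_id: n for seq_id, n in counts.items() if n > 1}
```

To run it on a file, pass an open file handle, for example `duplicate_ids(open("trin1set.fasta"))`. An empty dictionary means every header ID in the file is unique.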

To investigate further, I modified traa2cds.pl to print @$rinfo, and here's what I found:

$VAR1 = [ 'TRINITY_GG_99004_c0_g2_i1', 'TRINITY_GG_99004_c0_g2_i1', 'na', '34,30%,complete', '347', '-', '116-12', '' ];
$VAR1 = [ 'TRINITY_GG_99004_c0_g2_i1', 'TRINITY_GG_99004_c0_g2_i1', 'na', '34,30%,complete', '347', '+', '232-336', '' ];

How does EvidentialGene handle transcripts with ORFs in both orientations?
Should this transcript be kept or discarded in the final dataset?

Thanks in advance for your help!

Best,
Laureano

Don Gilbert

Mar 6, 2025, 3:06:38 PM
to EvidentialGene
Laureano,

I suspect this is an ID format problem: the same Trinity ID is being used for two ORFs found in this transcript.
Evigene does look for multiple ORFs in each transcript, using metrics to decide if they may be real.
It tries to append a new tag 'utrorf' to the original ID to keep these distinct. In your data this may have failed.
Would you try this simple test, if you haven't?

   $evigene/scripts/rnaseq/trformat.pl -format trinity -output trin1reformat.fa -input trin1set.fasta [.. tr2set .. ]

Then rerun your evigene tr2aacds script on that reformatted input file (with the -log option). If this dup-id error persists after the reformat, send me an example transcript FASTA that causes it, and a log file from your tr2aacds -log ... run.

The -format trinity option shouldn't be needed, but those ID formats keep changing and I don't always catch what the changes are.

Artifacts and rarer biology produce 2 or more proteins in one transcript, and it is best to look for them,
then decide with other measures if they are real.
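To illustrate the ID tagging idea, here is a rough sketch in Python. It is not Evigene's actual code. The (id, strand, span) tuple layout, the function name, and the numbering of third and later ORFs are assumptions made for this example. Only the 'utrorf' tag itself and the two records (taken from the @$rinfo dump in the original post) come from this thread.

```python
def disambiguate_orf_ids(records, tag="utrorf"):
    """Return one output ID per ORF record, keeping IDs unique.

    records: list of (transcript_id, strand, span) tuples (hypothetical layout).
    The first ORF of a transcript keeps the original ID; the second gets
    the tag appended; any further ORFs get the tag plus a counter
    (an illustrative choice, not necessarily what Evigene does).
    """
    seen = {}
    output_ids = []
    for transcript_id, _strand, _span in records:
        n = seen.get(transcript_id, 0)
        seen[transcript_id] = n + 1
        if n == 0:
            output_ids.append(transcript_id)
        elif n == 1:
            output_ids.append(transcript_id + tag)
        else:
            output_ids.append(f"{transcript_id}{tag}{n}")
    return output_ids


# The two ORFs reported for the same transcript, on opposite strands.
records = [
    ("TRINITY_GG_99004_c0_g2_i1", "-", "116-12"),
    ("TRINITY_GG_99004_c0_g2_i1", "+", "232-336"),
]
ids = disambiguate_orf_ids(records)
```

With tagging like this the two ORFs get different output IDs, so a duplicate-ID check on output would not fire. If the tag is not applied, both records carry the same ID and the second one is skipped, which matches the "output skip dup id" message.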

- Don Gilbert