Hi Mary,
As far as I've seen, different isoforms of a gene align to the same subject sequence (at least using BLAST), although the alignments themselves will differ between isoforms. I think that removing the isoform information (the _i suffix) shouldn't be such a big deal.
Running align_and_estimate_abundance.pl without a gene-to-transcript map makes the script generate its own map (this example was obtained using RSEM). The generated gene_trans_map is as follows:
TR55145|c0_g1 TR55145|c0_g1_i1
TR55145|c0_g1 TR55145|c0_g1_i2
TR55145|c0_g1 TR55145|c0_g1_i3
TR55145|c0_g1 TR55145|c0_g1_i4
TR55145|c0_g1 TR55145|c0_g1_i5
This map is then used by RSEM for the gene-level analysis. Since all isoforms are mapped to the same gene, all the isoform annotations would end up associated with that gene at that level. That suggests to me that removing _i from the annotation wouldn't be a big deal when merging annotations at the gene level (please correct me if I'm wrong).
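To make the gene/isoform relationship concrete, here is a minimal Python sketch that rebuilds a map like the one above by stripping the trailing _i<N> suffix from each transcript ID. The function names (gene_id, build_gene_trans_map) are my own, and the assumption that every ID ends in _i followed by digits is taken only from the example IDs shown here; this is not how align_and_estimate_abundance.pl does it internally.

```python
import re
from collections import defaultdict

# Assumption: isoform IDs end in "_i" followed by digits, as in the map above.
ISOFORM_SUFFIX = re.compile(r"_i\d+$")


def gene_id(transcript_id):
    """Return the gene-level ID by dropping the trailing _i<N> suffix."""
    return ISOFORM_SUFFIX.sub("", transcript_id)


def build_gene_trans_map(transcript_ids):
    """Group transcript IDs under their gene-level ID."""
    genes = defaultdict(list)
    for tid in transcript_ids:
        genes[gene_id(tid)].append(tid)
    return dict(genes)


# The five isoform IDs from the map shown above.
transcripts = [f"TR55145|c0_g1_i{n}" for n in range(1, 6)]
gene_map = build_gene_trans_map(transcripts)

# Print one "gene<TAB>transcript" pair per line.
for gene, isoforms in gene_map.items():
    for iso in isoforms:
        print(f"{gene}\t{iso}")
```

All five isoforms collapse onto the single gene ID TR55145|c0_g1, which is why annotations merged at this level cannot be told apart by isoform.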
The issue I see with performing the analysis at the gene level is: which isoform should we keep/report afterwards? For example, in this case we would have five different isoforms associated with the same gene. If we decide to remove the _i from the FASTA file, we would have five sequences with the same header. Which sequence (isoform) should we keep, and which ones should we discard? The longest one? The best-aligned one?
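As one possible answer to the question above, here is a rough Python sketch of the "keep the longest isoform" option. It is only an illustration of that single criterion, not a recommendation over the best-aligned one. The helper names (read_fasta, longest_isoform_per_gene) are hypothetical, and the sequences are toy data made up for the example, not real Trinity output.

```python
import re

# Assumption: isoform IDs end in "_i" followed by digits.
ISOFORM_SUFFIX = re.compile(r"_i\d+$")


def read_fasta(lines):
    """Yield (sequence_id, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # Keep only the first whitespace-delimited token as the ID.
            header, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_isoform_per_gene(records):
    """Map each gene ID to the (isoform_id, sequence) of its longest isoform.

    Ties are resolved in favour of the isoform seen first.
    """
    best = {}
    for tid, seq in records:
        gene = ISOFORM_SUFFIX.sub("", tid)
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (tid, seq)
    return best


# Toy input: three made-up isoforms of the same gene.
example = """\
>TR55145|c0_g1_i1 len=8
ACGTACGT
>TR55145|c0_g1_i2 len=12
ACGTACGTACGT
>TR55145|c0_g1_i3 len=4
ACGT
""".splitlines()

representatives = longest_isoform_per_gene(read_fasta(example))

# Write one record per gene, noting which isoform was kept.
for gene, (tid, seq) in representatives.items():
    print(f">{gene} representative={tid}\n{seq}")
```

Recording the chosen isoform in the header keeps the gene-level FASTA traceable back to the original transcript, so the choice can be revisited later if a different criterion turns out to be better.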
Best regards,
Gabriel