Hi, Carson.
I’ve downloaded a number of genes from GenBank using Entrez Direct, which I’m using as est and protein evidence to annotate a plant mitochondrion. Most of these reference sequences have sensible and consistent gene names, so I’m using est_forward to retain the gene names. This workflow is working well for me. However, some of the genes pulled in from GenBank have less useful names like orf1234 or other numeric IDs. When multiple evidence sequences map to the same location, how does est_forward choose which name to use? If the name is chosen arbitrarily, would it be possible to choose the most common name instead?
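To illustrate what I mean by "the most common name", here is a rough Python sketch. It isn't MAKER code and says nothing about how est_forward actually works; most_common_name is a hypothetical helper, and the input list is a made-up example of the names of evidence sequences overlapping one locus.

```python
from collections import Counter


def most_common_name(names):
    """Return the most frequent name in `names`.

    Ties go to the name that appears first in the list, because
    Counter.most_common keeps first-seen order for equal counts.
    """
    if not names:
        raise ValueError("no names given")
    return Counter(names).most_common(1)[0][0]


# Hypothetical names of evidence sequences that overlap the same locus.
overlapping = ["cox1", "orf1234", "cox1", "cox1", "orf1234"]
print(most_common_name(overlapping))
```

With a rule like this, a single oddly named reference such as orf1234 would be outvoted by the consistently named ones.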
Thanks,
Shaun
Hi, Carson. Could you give an example of how to add gene_id= to the header of the FASTA file? I’m not clear on what you mean by this. In the FASTA header, what portion is the transcript name, and what portion is the gene name?
Cheers,
Shaun
Interesting. Thanks for the clarification. I’m working on a plant mitochondrion, and so as far as I know, there’s no alternative splicing. My protein FASTA file is composed of the protein sequences of ~100 species downloaded from GenBank. It looks like this:
>cox1|lcl|KJ461445.1_cdsid_AHY20320.1 [gene=cox1] [protein=cytochrome c oxidase subunit 1] [protein_id=AHY20320.1] [location=complement(59212..60795)]
…
>cox1|lcl|EU534409.1_cdsid_ACA62629.1 [gene=cox1] [protein=cox1] [protein_id=ACA62629.1] [location=245282..246856]
…
>cox1|lcl|NC_023103.1_cdsid_YP_008964124.1 [gene=cox1] [protein=cytochrome c oxidase subunit 1] [protein_id=YP_008964124.1] [location=join(317824..318438,319511..320368)]
…
I’m not sure that I actually want the fancy behaviour that you describe, though it probably wouldn’t hurt anything. Will this FASTA format trigger the fancy behaviour?
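For reference, this is how I read my own headers, as a minimal Python sketch. It assumes the usual FASTA convention that the sequence ID is everything up to the first space, and it treats the bracketed [key=value] pairs as plain description text. parse_header is a hypothetical helper of mine; I don't know whether MAKER interprets the header the same way.

```python
import re


def parse_header(header):
    """Split a FASTA header into its sequence ID and its [key=value] tags."""
    line = header.lstrip(">").strip()
    # The ID is the first whitespace-delimited token; the rest is description.
    seq_id, _, description = line.partition(" ")
    # Collect every [key=value] pair found in the description.
    tags = dict(re.findall(r"\[([^=\]]+)=([^\]]*)\]", description))
    return seq_id, tags


header = (">cox1|lcl|EU534409.1_cdsid_ACA62629.1 [gene=cox1] [protein=cox1] "
          "[protein_id=ACA62629.1] [location=245282..246856]")
seq_id, tags = parse_header(header)
print(seq_id)
print(tags["gene"])
```

So in my file the gene name appears twice: as the prefix of the ID before the first "|", and in the [gene=...] tag.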
Cheers,
Shaun