[maker-devel] est_forward and conflicting names

Shaun Jackman

Apr 30, 2014, 7:25:17 PM
to maker...@yandell-lab.org

Hi, Carson.

I’ve downloaded a number of genes from GenBank using Entrez Direct, which I’m using with est and protein to annotate a plant mitochondrion. Most of these reference sequences have sensible and consistent gene names, so I’m using est_forward to retain the gene names. This workflow is working well for me. Some of the genes pulled in from GenBank have less useful names, like orf1234 or other numeric IDs. When multiple evidence sequences map to the same location, how does est_forward choose which name to use? If it’s chosen arbitrarily, would it be possible to choose the most common name instead?

Thanks,
Shaun
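
The "most common name" idea in the question above can be illustrated with a simple majority vote. This is only a sketch of the request, not how MAKER or est_forward chooses names; the function name `most_common_name` and the input list are hypothetical.

```python
from collections import Counter


def most_common_name(names):
    """Return the most frequent name in a list of candidate gene names.

    Hypothetical helper: it illustrates the majority-vote idea only and
    is not part of MAKER.
    """
    if not names:
        raise ValueError("no candidate names given")
    # Counter tallies each name; most_common(1) yields the top (name, count) pair.
    (name, _count), = Counter(names).most_common(1)
    return name


# Names of the evidence sequences overlapping one locus (made-up input).
candidates = ["cox1", "orf1234", "cox1", "cox1", "orf77"]
print(most_common_name(candidates))
```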


Carson Holt

May 2, 2014, 2:55:27 PM
to Shaun Jackman, maker...@yandell-lab.org
Whichever has the best AED score, I believe, but you can add gene_id= to the header of each fasta file to ensure MAKER doesn't try to cluster unrelated transcripts into a single gene. Then the transcript name and gene name will be guaranteed to match up.

--Carson



Shaun Jackman

May 8, 2014, 6:26:34 PM
to Carson Holt, maker...@yandell-lab.org

Hi, Carson. Could you give an example of how to add gene_id= to the header of the FASTA file? I’m not clear on what you mean by this. In the FASTA header, what portion is the transcript name, and what portion is the gene name?

Cheers,
Shaun

Carson Holt

May 8, 2014, 6:33:36 PM
to Shaun Jackman, maker...@yandell-lab.org
When moving transcripts onto a new assembly, you may have multiple transcripts of the same gene. Because your transcript name should be your FASTA ID, there is no way for MAKER to know that they go together when moving the models forward, so you can use the gene= option to make MAKER aware that these belong to the same gene. They will be grouped, and you recover all splice forms as a group.

Example:

>SMEDT_00004   gene=dpp
AAAAAAA

>SMEDT_00005 gene=dpp
AAAAAAA
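
For a large FASTA file, headers in the format of the example above could be produced with a short script. The following is a minimal sketch, assuming a hypothetical dictionary `gene_of` that maps each transcript ID to its gene name; the function name `tag_fasta_headers` is also made up for illustration, and none of this is part of MAKER.

```python
import re

# Matches a gene= tag that starts a whitespace-separated field,
# so a bracketed [gene=...] field is not counted.
HAS_GENE_TAG = re.compile(r"\sgene=")


def tag_fasta_headers(lines, gene_of):
    """Append ' gene=<name>' to FASTA headers whose ID appears in gene_of.

    lines   -- iterable of FASTA lines (headers and sequence)
    gene_of -- hypothetical mapping of transcript ID to gene name
    """
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            gene = gene_of.get(seq_id)
            # Leave headers alone if they are unmapped or already tagged.
            if gene is not None and not HAS_GENE_TAG.search(line):
                line = f"{line} gene={gene}"
        out.append(line)
    return out


records = [">SMEDT_00004", "AAAAAAA", ">SMEDT_00005", "AAAAAAA"]
mapping = {"SMEDT_00004": "dpp", "SMEDT_00005": "dpp"}
print("\n".join(tag_fasta_headers(records, mapping)))
```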

Shaun Jackman

May 8, 2014, 6:41:41 PM
to Carson Holt, maker...@yandell-lab.org

Interesting. Thanks for the clarification. I’m working on a plant mitochondrion, and so as far as I know, there’s no alternative splicing. My protein FASTA file is composed of the protein sequences of ~100 species downloaded from GenBank. It looks like this:

>cox1|lcl|KJ461445.1_cdsid_AHY20320.1 [gene=cox1] [protein=cytochrome c oxidase subunit 1] [protein_id=AHY20320.1] [location=complement(59212..60795)]
…
>cox1|lcl|EU534409.1_cdsid_ACA62629.1 [gene=cox1] [protein=cox1] [protein_id=ACA62629.1] [location=245282..246856]
…
>cox1|lcl|NC_023103.1_cdsid_YP_008964124.1 [gene=cox1] [protein=cytochrome c oxidase subunit 1] [protein_id=YP_008964124.1] [location=join(317824..318438,319511..320368)]
…

I’m not sure that I actually want the fancy behaviour that you describe, though it probably wouldn’t hurt anything. Will this FASTA format trigger the fancy behaviour?

Cheers,
Shaun

Carson Holt

May 8, 2014, 6:43:40 PM
to Shaun Jackman, maker...@yandell-lab.org
Only if you were to remove the brackets around gene=.
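
If the grouping is wanted, the brackets around gene= in GenBank-style headers like the ones quoted above could be stripped with a regular expression. This is a rough sketch: the function name `unbracket_gene` is hypothetical, the other bracketed fields are left untouched, and how MAKER treats the rest of the header is not confirmed in this thread.

```python
import re

# Matches only the [gene=NAME] field; NAME may not contain whitespace or ']'.
BRACKETED_GENE = re.compile(r"\[gene=([^\]\s]+)\]")


def unbracket_gene(header):
    """Rewrite '[gene=NAME]' as 'gene=NAME' in a single FASTA header line."""
    return BRACKETED_GENE.sub(r"gene=\1", header)


header = (
    ">cox1|lcl|EU534409.1_cdsid_ACA62629.1 [gene=cox1] [protein=cox1] "
    "[protein_id=ACA62629.1] [location=245282..246856]"
)
print(unbracket_gene(header))
```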