[maker-devel] est_forward and conflicting names

Shaun Jackman

Apr 30, 2014, 7:25:17 PM

to maker...@yandell-lab.org

Hi, Carson.

I’ve downloaded a number of genes from GenBank using Entrez Direct, which I’m using with the est and protein options to annotate a plant mitochondrion. Most of these reference sequences have sensible and consistent gene names, so I’m using est_forward to retain the gene names. This workflow is working well for me. Some of the genes pulled in from GenBank have less useful names, like orf1234 or other numeric IDs. When multiple evidence sequences map to the same location, how does est_forward choose which name to use? If it’s chosen arbitrarily, would it be possible to choose the most common name instead?

Thanks,
Shaun

Carson Holt

May 2, 2014, 2:55:27 PM

to Shaun Jackman, maker...@yandell-lab.org

Whichever has the best AED score, I believe. But you can add gene_id= to the header of each FASTA record to ensure MAKER doesn't try to cluster unrelated transcripts into a single gene. Then the transcript name and gene name will be guaranteed to match up.

--Carson

Shaun Jackman

May 8, 2014, 6:26:34 PM

to Carson Holt, maker...@yandell-lab.org

Hi, Carson. Could you give an example of how to add gene_id= to the header of the FASTA file? I’m not clear on what you mean by this. In the FASTA header, what portion is the transcript name, and what portion is the gene name?

Cheers,
Shaun

http://sjackman.ca

Carson Holt

May 8, 2014, 6:33:36 PM

to Shaun Jackman, maker...@yandell-lab.org

When moving transcripts onto a new assembly, you may have multiple transcripts of the same gene. Because your transcript name should be your FASTA ID, there is no way for MAKER to know that they go together when moving the models forward. You can use the gene= option to make MAKER aware that these belong to the same gene. They will be grouped, and you recover all splice forms as a group.

Example:

>SMEDT_00004 gene=dpp
AAAAAAA
>SMEDT_00005 gene=dpp
AAAAAAA
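
To make the header layout in the example above concrete, here is a minimal Python sketch of how such a header could be split into a transcript ID and a gene name. It is an illustration only, not MAKER's own parsing code: the function names `parse_header` and `group_by_gene` are hypothetical, and the assumption that the ID is the first whitespace-delimited token and that the tag is written as a bare `gene=NAME` comes from the example, not from the MAKER source.

```python
import re
from collections import defaultdict


def parse_header(header):
    """Split a FASTA header into (sequence_id, gene_name or None).

    Assumption: the ID is the first whitespace-delimited token and the
    gene tag is a bare "gene=NAME" token in the rest of the header.
    """
    text = header.lstrip(">").strip()
    parts = text.split(None, 1)
    seq_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    # Only match "gene=" at the start of a token, so "[gene=cox1]" is ignored.
    match = re.search(r"(?:^|\s)gene=(\S+)", description)
    return seq_id, (match.group(1) if match else None)


def group_by_gene(headers):
    """Group sequence IDs by gene tag; untagged records form their own group."""
    groups = defaultdict(list)
    for header in headers:
        seq_id, gene = parse_header(header)
        groups[gene if gene is not None else seq_id].append(seq_id)
    return dict(groups)
```

With the two example headers, both SMEDT_00004 and SMEDT_00005 end up in a single group keyed by dpp, which mirrors the grouping of splice forms described above.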

Shaun Jackman

May 8, 2014, 6:41:41 PM

to Carson Holt, maker...@yandell-lab.org

Interesting. Thanks for the clarification. I’m working on a plant mitochondrion, and so as far as I know, there’s no alternative splicing. My protein FASTA file is composed of the protein sequences of ~100 species downloaded from GenBank. It looks like this:

>cox1|lcl|KJ461445.1_cdsid_AHY20320.1 [gene=cox1] [protein=cytochrome c oxidase subunit 1] [protein_id=AHY20320.1] [location=complement(59212..60795)]
…
>cox1|lcl|EU534409.1_cdsid_ACA62629.1 [gene=cox1] [protein=cox1] [protein_id=ACA62629.1] [location=245282..246856]
…
>cox1|lcl|NC_023103.1_cdsid_YP_008964124.1 [gene=cox1] [protein=cytochrome c oxidase subunit 1] [protein_id=YP_008964124.1] [location=join(317824..318438,319511..320368)]
…

I’m not sure that I actually want the fancy behaviour that you describe, though it probably wouldn’t hurt anything. Will this FASTA format trigger the fancy behaviour?

Cheers,
Shaun

http://sjackman.ca

Carson Holt

May 8, 2014, 6:43:40 PM

to Shaun Jackman, maker...@yandell-lab.org

Only if you were to remove the brackets around gene=.
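
If the grouping behaviour is wanted for GenBank-style headers like the ones quoted above, the brackets could be stripped with a small script. The following Python sketch is one hedged way to do it: `unbracket_gene_tag` and `convert_fasta` are hypothetical helper names, and the script assumes the bare `gene=NAME` form from the earlier example is what MAKER looks for (the thread also mentions `gene_id=`, so check which tag your MAKER version expects).

```python
import re

# Matches a bracketed GenBank-style qualifier such as "[gene=cox1]".
# Assumption: gene names contain no whitespace or closing bracket.
_BRACKETED_GENE = re.compile(r"\[gene=([^\]\s]+)\]")


def unbracket_gene_tag(line):
    """Rewrite "[gene=NAME]" as "gene=NAME" on FASTA header lines only."""
    if not line.startswith(">"):
        return line  # sequence lines pass through unchanged
    return _BRACKETED_GENE.sub(r"gene=\1", line, count=1)


def convert_fasta(src, dst):
    """Copy FASTA text from file object src to dst, unbracketing gene tags."""
    for line in src:
        dst.write(unbracket_gene_tag(line))
```

The other bracketed qualifiers ([protein=...], [protein_id=...], [location=...]) are left untouched, so only the gene tag changes form.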
