[maker-devel] Mapping gene names

221 views
Skip to first unread message

Shaun Jackman

unread,
Feb 25, 2014, 7:06:03 PM2/25/14
to maker...@yandell-lab.org

Hi,

I’m annotating a genome using a closely related genome from Genbank, using the .frn (RNA) and .faa (protein) files from Genbank as evidence to annotate my genome. I’ve run Maker, and the annotation seems to have worked well. Is it possible to map the names of the genes from the related species to my annotation? I see the map_forward option, which applies to the model_gff parameter. Is there a similar option for est and protein?

maker_opts.ctl

est=NC_123456.frn
protein=NC_123456.faa
est2genome=1
protein2genome=1

Thanks,
Shaun

Carson Holt

unread,
Feb 25, 2014, 7:58:08 PM2/25/14
to Shaun Jackman, maker...@yandell-lab.org
There is a way.  It’s not a standard option and it’s undocumented, but if you add est_forward=1 to the maker_opts.ctl file, then it will do just that.  The option won’t already be there so you’ll have to type it in.

There is also a feature designed to work with this option.  If you add tags to your fasta headers, those can be used to guide the mapping and naming.  For example, gene_id=<some_gene>  will ensure different isoforms that share a common gene_id get clustered into the same gene, and maker_coor=chr1:1-10000 in the fasta header will force a particular sequence to only be mapped against chr1 within the range of 1-10000 bp  and just using maker_coor=chr1 will force it to only be mapped against chr1.

This is an undocumented way to remap genes onto new assemblies using blast alignments of earlier transcript or protein annotations as a guide.

—Carson




_______________________________________________ maker-devel mailing list maker...@box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Carson Holt

unread,
Feb 25, 2014, 8:04:48 PM2/25/14
to Shaun Jackman, maker...@yandell-lab.org
One more note.  When using this option, the score column of mRNA features will represent how completely this gene matches the source EST/protein (fraction coverage multiplied by % identity).  So a value of 100 means there is perfect match.  This way if the same transcript maps to multiple locations, then you can identify which locations is the closest match (also works for identifying likly orthologs vs. paralogs).

—Carson

Mikael Brandström Durling

unread,
Feb 26, 2014, 7:30:44 AM2/26/14
to Carson Holt, maker...@yandell-lab.org
Can this use of maker_coor be used only to hint about the placement of the ests, without affecting the naming of the final genes? Ie if I have a database of EST where I have a priori knowledge of their rough placement, can this placement be given to maker without providing est_forward=1?

Thanks,
Mikael

Carson Holt

unread,
Feb 26, 2014, 8:22:34 AM2/26/14
to Mikael Brandström Durling, maker...@yandell-lab.org
Yes.  That should work as well as an accidental feature.

--Carson 

Sent from my iPhone

Mikael Brandström Durling

unread,
Feb 26, 2014, 8:37:29 AM2/26/14
to Carson Holt, maker...@yandell-lab.org
That might be a useful and time saving accidental feature. But, reading the code, it seems that I need to supply maker_coor but not gene_id, as well as the configuration option est_forward for this to work. Any occurrences of maker_coor in GI.pm seems to be conditioned on set_forward=1 right?

Mikael

Carson Holt

unread,
Feb 26, 2014, 11:09:14 AM2/26/14
to Mikael Brandström Durling, maker...@yandell-lab.org
It will still work without est_forward.  It just works a little differently.  Keep in mind this was a hidden feature I used to find stubborn or hard to find missing genes after reassembly of a genome.

If est_forward is provided, MAKER will parse the database to look for the maker_coor tags early in the pipeline.  Then it will create a list of locations to search, and it will search them even if there are no BLAST results to seed the search (normally MAKER gets a BLAST result first and then polishes it with exonerate).  So maker_coor=chr1 will cause MAKER to look for a match using all of chr1 as the input to exonerate even when BLAST finds nothing (this is a very very slow search, but can help pick up one or two stubborn genes that don’t remap well).  To allow this, MAKER gives exonerate looser matching parameters (i.e. allows for single base pair introns perhaps caused by assembly errors).  The logic here is that given the fact that I already told MAKER that with some degree of confidence I expect sequence A to map to to location X, it will try its hardest to make it match. 

Without est_forward set, the maker_coor= flag still gets read in GI.pm at line 1563, but only after a BLAST alignment has already seeded it to the region (that BLAST result has the information in its description parameter).  MAKER will then ignore seeds completely outside of maker_coor. In addition any BLAST seeds that overlap maker_coor will get the search space for alignment polishing adjusted to match maker_coor exactly.  Also match parameters for exonerate will not be relaxed as they were with est_forward.

As you can see the behavior, is slightly different (because it’s an accidental feature).

Thanks,
Carson



From: Mikael Brandström Durling <mikael....@slu.se>
Date: Wednesday, February 26, 2014 at 6:37 AM
To: Carson Holt <cars...@gmail.com>
Cc: "maker...@yandell-lab.org" <maker...@yandell-lab.org>
Subject: Re: [maker-devel] Mapping gene names

Mikael Brandström Durling

unread,
Feb 26, 2014, 5:04:37 PM2/26/14
to Carson Holt, maker...@yandell-lab.org
It seems that this could be a very useful option in those cases where you have firm a priori knowledge of the placement of ESTs. However, while trying it I note that est_forward implies that the est2genome predictor is turned on, implicitly. Is this necessary for this to work? I’m after the behavior you describe below where exonerate is made to try really hard within a limited region to align an est, but I would not like maker to produce est2genome predictions.

In general, I think this maker_coor and est_forward is a feature set that is worthy to be promoted into a documented feature.

THanks,
Mikael

Carson Holt

unread,
Feb 26, 2014, 5:50:30 PM2/26/14
to Mikael Brandström Durling, maker...@yandell-lab.org
What you can do is run it once with just est_forward=1 and est2genome/protein2genome set to 1.  Then take those results, pass them in as model_gff and use the map_forward option to then filter the results based on mRNA score and that would copy names onto new gene under the standard MAKER pipeline.  Eventually it’s really supposed to go into a separate tool that will map genes onto new assemblies (but under the hood the tool will just be calling MAKER with certain parameters restricted).  I do this because if people commonly use it mixed with things like SNAP I can start to get some very weird behaviors. 

Carson Holt

unread,
Feb 26, 2014, 6:45:30 PM2/26/14
to Mikael Brandström Durling, maker...@yandell-lab.org
Sorry I meant to say prefilter on the score in the mRNA column before passing the gff3 to model_gff.

--Carson 

Sent from my iPhone

Shaun Jackman

unread,
Feb 27, 2014, 6:17:22 PM2/27/14
to Carson Holt, maker...@yandell-lab.org
Is there a corresponding protein_forward=1 option to map forward protein names from protein2genome?

Cheers,
Shaun

Shaun Jackman

unread,
Feb 27, 2014, 7:27:30 PM2/27/14
to Carson Holt, maker...@yandell-lab.org

Sorry, ignore my previous question. est_forward also carries forward the names of protein evidence and works like a charm. Thank you!

The larger rrn16 and rrn23 genes annotated perfectly, but the smaller rrn4.5 and rrn5 and tRNA genes didn’t make it into the all.gff file. They are in the blastn output, and in the evidence_0.gff. rrn5 has perfect identity, sufficient bits (242 > bit_blastn=40) and sufficient E Value (2e-66 < eval_blastn=1e-10). How should I debug which filter is removing these hits?

organism_type=prokaryotic
est2genome=1
protein2genome=1
est_forward=1

Cheers,
Shaun

Carson Holt

unread,
Feb 27, 2014, 8:13:06 PM2/27/14
to Shaun Jackman, maker...@yandell-lab.org
Set single_exon=1, and the minimum size to a smaller value.  I think it's set to 250 right now.  Also est2genome is looking for ORF, so if there is none (as with tRNAs) they probably won't get picked up.

--Carson 

Sent from my iPhone

Shaun Jackman

unread,
Mar 4, 2014, 9:10:42 PM3/4/14
to Carson Holt, maker...@yandell-lab.org

Hi, Carson. I set single_length=50, and it worked like a charm. Thanks for the tip.

The rRNA genes that are found with est2genome have the feature type set to mRNA and have corresponding five_prime_UTR, CDS and three_prime_UTR features. Ideally the feature type would be set to rRNA or tRNA as appropriate, and would omit the UTR and CDS features. Is that a feature that you would be interested in adding to MAKER? The rRNA gene names all start with “rrn” and the tRNA gene names with “trn”, as is standard, so determining the appropriate type should be straight forward.

Thanks again for your help with this. Cheers,
Shaun

Carson Holt

unread,
Mar 4, 2014, 9:33:12 PM3/4/14
to Shaun Jackman, maker...@yandell-lab.org
Trying to call non-coding RNA from ESTs or even sequence homology is extremely messy (non-trivial problem in most organisms with high false positive rate), so MAKER for the most part doesn’t even try to do that.  It focuses only on the coding genes.  You can now use tRNAscan and snoscan in the newest version for some non-coding RNA support (those features were only added a couple of months ago).  So just like other prediction tools (snap, augustus etc.), the primary focus has always been the coding genes.  We’ve only started adding non-coding RNA support recently for iPlant, so it’s still relatively immature.

Thanks,

Shaun Jackman

unread,
Mar 6, 2014, 3:56:34 PM3/6/14
to Carson Holt, maker...@yandell-lab.org
Hi, Carson. I agree that identifying non-coding RNA by homology in general is a non-trivial problem. In my particular case, I have a well annotated reference species that is very closely related (99.2% sequence identity), so lifting over the annotations from that reference species to my species should be pretty straight forward. It would be great if MAKER had an option for RNA sequence homology similar to est2genome that does not imply the sequence is coding.

The integration of MAKER-P with tRNAscan is very useful. The identified genes are named e.g. `trnascan-205522-processed-gene-0.38`.  tRNA genes are conventionally named according to the amino acid and anticodon, such as `trnW-CCA`. Would it be possible for MAKER to name or perhaps prefix the names with that convention?

Cheers,
Shaun
_______________________________________________ maker-devel mailing list maker...@box290.bluehost.comhttp://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
_______________________________________________
maker-devel mailing list
maker...@box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Carson Holt

unread,
Mar 6, 2014, 3:58:41 PM3/6/14
to Shaun Jackman, maker...@yandell-lab.org
Yes.  I’ll fix the naming.

Shaun Jackman

unread,
Mar 6, 2014, 6:33:04 PM3/6/14
to Carson Holt, maker...@yandell-lab.org
Fantastic. Thanks, Carson. When I use both est2genome and tRNAscan to identify tRNA, I was hoping that both forms of evidence would be used to create a single gene model, which doesn’t seem to be the case. I get duplicate overlapping gene models (one mRNA from est and one tRNA from tRNAscan). Could MAKER merge these models?

Cheers,
Shaun

Carson Holt

unread,
Mar 6, 2014, 6:38:48 PM3/6/14
to Shaun Jackman, maker...@yandell-lab.org
Well… not really.  I have no plans to add est2genome support for noncoding genes (non-trivial), so you would either have to remove the ncRNA from your input, or filter it out downstream.

Shaun Jackman

unread,
May 2, 2014, 3:40:42 PM5/2/14
to Carson Holt, maker...@yandell-lab.org

Hi, Carson. Do you happen to have a patch that I could test out that fixes the naming of the tRNA identified by tRNAscan?

Is the MAKER subversion repository public, and if so, what’s its URL?

Cheers,
Shaun

Shaun wrote…

The integration of MAKER-P with tRNAscan is very useful. The identified genes are named e.g. trnascan-205522-processed-gene-0.38. tRNA genes are conventionally named according to the amino acid and anticodon, such as trnW-CCA. Would it be possible for MAKER to name or perhaps prefix the names with that convention?

Carson Holt

unread,
May 2, 2014, 3:50:23 PM5/2/14
to Shaun Jackman, maker...@yandell-lab.org
That should already be fixed in the current 2.31.3 download. I'll also send you the subversion credentials in a separate e-mail.

Thanks,
Carson


From: Shaun Jackman <sjac...@gmail.com>
Reply-To: Shaun Jackman <sjac...@gmail.com>
Date: Friday, May 2, 2014 at 1:40 PM
To: Carson Holt <cars...@gmail.com>
Cc: "maker...@yandell-lab.org" <maker...@yandell-lab.org>
Subject: Re: [maker-devel] Mapping gene names

Shaun Jackman

unread,
May 2, 2014, 4:00:22 PM5/2/14
to Carson Holt, maker...@yandell-lab.org

Fantastic. Thanks, Carson. I didn’t realize that there was a point release of MAKER. It’s not announced on the MAKER home page, which still reports Last Software Update v2.31 (Feb 11, 2014). Where are point releases announced?

The static link for MAKER 2.31 reports 403 Forbidden. Is there a new static link for MAKER 2.31.3?

Cheers,
Shaun

Carson Holt

unread,
May 2, 2014, 4:14:11 PM5/2/14
to maker...@yandell-lab.org
I need to fix that last update tag.  I did a point release, because there were a couple of very minor fixes that didn't justify a full release (tRNA naming and a fasta_merge bug for tRNAs - I think three lines total of code).  There won't be another major version release for a while because we're working on MAKER-EVM which will be version 3.0 (joint project for full MAKER integration with EVM).  So just point releases on 2.31 (which will be the very last version of MAKER2).  I'll fix the static link and add an new one for 2.31.3.

--Carson

Shaun Jackman

unread,
May 14, 2014, 5:07:52 PM5/14/14
to Carson Holt, maker...@yandell-lab.org

Hi, Carson. Perhaps MAKER could integrate Barrnap to predict rRNA.

Cheers,
Shaun

Carson Holt

unread,
May 14, 2014, 5:18:52 PM5/14/14
to Shaun Jackman, maker...@yandell-lab.org
Thanks.  Looks interesting. Also since output is already GFF3, you could probably just use it with gff passthrough.  It doesn't appear to support eukaryotes though.

--Carson


Sent from my iPhone

Shaun Jackman

unread,
May 14, 2014, 5:25:21 PM5/14/14
to Carson Holt, Torsten Seemann, maker...@yandell-lab.org

Hi, Carson, Torsten.

It doesn’t appear to support eukaryotes though.

Barrnap supports bacteria, archaea, mitochondria and eukaryotes. The barrnap --help output seems to be out of date.

Barrnap predicts the location of ribosomal RNA genes in genomes. It supports bacteria (5S,23S,16S), archaea (5S,5.8S,23S,16S), mitochondria (12S,16S) and eukaryotes (5S,5.8S,28S,18S).

barrnap --help
…
  --kingdom [X]     Kingdom: [b]acteria [a]rchaea (default 'bacteria')

Cheers,
Shaun

Shaun Jackman

unread,
May 14, 2014, 8:06:31 PM5/14/14
to Carson Holt, maker...@yandell-lab.org

Hi, Carson. I used other_gff to pass the following four-line GFF file of Barrnap rRNA annotations through. The output of gff3_merge is quite bizarre. See below.

Input:

##gff-version 3
200408_86    barrnap:0.4    rRNA    2171785    2173036    .    +    .    Name=12S_rRNA;product=12S ribosomal RNA
200408_86    barrnap:0.4    rRNA    3665772    3666686    .    -    .    Name=16S_rRNA;product=16S ribosomal RNA (partial);note=aligned only 57 percent of the 16S ribosomal RNA
200408_86    barrnap:0.4    rRNA    3826637    3827887    .    -    .    Name=12S_rRNA;product=12S ribosomal RNA
200408_86    barrnap:0.4    rRNA    4355857    4357119    .    +    .    Name=12S_rRNA;product=12S ribosomal RNA

Output:

###
ARRAY(0x7feceb928780)
###
ARRAY(0x7feceaa548a0)
###
ARRAY(0x7feceeb01c60)
###
ARRAY(0x7fecedf6fef8)
###

Cheers,
Shaun

On 14 May 2014 14:18, Carson Holt <cars...@gmail.com> wrote:

Carson Holt

unread,
May 14, 2014, 8:19:43 PM5/14/14
to Shaun Jackman, maker...@yandell-lab.org
That should be fixed in the current download?  It came up on the mailing list a couple of weeks ago.  I'll check.

--Carson

Shaun Jackman

unread,
May 14, 2014, 8:22:37 PM5/14/14
to Carson Holt, maker...@yandell-lab.org
I'm using MAKER 2.31.4.

Torsten Seemann

unread,
May 14, 2014, 7:33:55 PM5/14/14
to Shaun Jackman, maker...@yandell-lab.org
Carson & Shaun 

It doesn’t appear to support eukaryotes though.

Barrnap supports bacteria, archaea, mitochondria and eukaryotes. The barrnap --help output seems to be out of date.

Barrnap predicts the location of ribosomal RNA genes in genomes. It supports bacteria (5S,23S,16S), archaea (5S,5.8S,23S,16S), mitochondria (12S,16S) and eukaryotes (5S,5.8S,28S,18S).

It does support eukaryota and mitochondria, I just forgot to push the documentation changes. This has been resolved now in the 0.4.2 release.

  --kingdom [X]     Kingdom: euk arc bac mito (default 'bac')

Next release 0.5 will have an 'accurate' mode which will fine tune the predictions using cmalign glocal alignment.

Thanks for your interest!

--
--Dr Torsten Seemann
--Victorian Bioinformatics Consortium, Monash University, AUSTRALIA
--Life Sciences Computation Centre, VLSCI, Parkville, AUSTRALIA
--http://www.bioinformatics.net.au/


Fields, Christopher J

unread,
May 14, 2014, 11:23:03 PM5/14/14
to Torsten Seemann, maker...@yandell-lab.org
\o/

(now I can get rid of rnammer forever!)

chris

Sajeet Haridas

unread,
May 15, 2014, 1:36:00 PM5/15/14
to Fields, Christopher J, Torsten Seemann, maker...@yandell-lab.org
My brief test of barrnap suggests that it does not perform well on rRNA genes with introns such as those found in fungal mitochondria. Setting a lower threshold for --reject and --evalue helps, but is not enough.
Looks like I cannot abandon rnammer for now.
FYI - if you want to test barrnap with fungal mitochondria, use --kingdom bacteria because they have 23S and 16S unlike the human mitochondria.

Sajeet

Torsten Seemann

unread,
May 15, 2014, 6:42:53 PM5/15/14
to Sajeet Haridas, maker...@yandell-lab.org
Sajeet,

Brief test of barrnap suggests that it does not perform well on rRNA genes with introns such as those found in fungal mitochondria. Setting a lower threshold for --reject and --evalue helps, but is not enough.
Looks like I cannot abandon rnammer for now.
FYI - if you want to test barrnap with fungal mitochondria, use --kingdom bacteria because they have 23S and 16S unlike the human mitochondria.

This is good feedback. Paul Gardner also mentioned the intron issue. A "fungi" kingdom is clearly needed. I am not a mycologist so any assistance is coming up with a detailed rRNA architecture for eukaryotict phyla etc is something I have started but need assistance with. Adjustment of nhmmer alignment parameters could be done to improve the intronic rRNAs too.

Here is what I have so far in terms of models:

- do i need to split euk into protist / plant / animal / fungi?
- should the current 'mito' be places inside the current 'euk' ? as mito data is likely to end up in assemblies, but keep separate for mito-only data?
- plastids, chloroplasts, apicoplasts; i am not sure of the subtleties of these organelles' rRNA but am willing to learn.

Thank you again for testing. Any help appreciated,
Reply all
Reply to author
Forward
0 new messages