[maker-devel] iterative Maker2

Dennis, Alice

Dec 12, 2014, 10:32:01 AM

to maker...@yandell-lab.org

Hi all,

I am a relatively new user of Maker2, and I’m looking for advice on running many iterations of the same dataset in Maker2.

I have a relatively small genome (~124 MB) from a wasp that is assembled into ~1,500 scaffolds. I have run several iterations of Maker2 by re-generating .hmms in SNAP and feeding them into the next round, and my gene predictions keep increasing (in number and in size). The only thing that changes at each round is the .hmm.

The evidence that I give is:

- de novo assembled ESTs from a different strain of the same species (70,000 contigs… I am currently working on improving this assembly with the hope that this will be helpful here)

- 610 proteins extracted from the genome scaffolds using CEGMA and HaMSTr

For my 1st iteration, I used the Nasonia .hmm from SNAP, and the est2genome/protein2genome option.

For the 2nd, 3rd and 4th rounds I have used .hmms generated from the previous round, all without the est2genome/protein2genome option. All other files are the same as in the original run.

As I understand it, after the second round, nothing should change in Maker2. But the differences are obvious between runs. Some entirely new exons are annotated. For example, just counting “exon” in the .gff file gives me 73,000 after the third iteration and 96,000 after the fourth! Actually the biggest leap in this number is between the third and fourth round. I can also see that many features are longer when I look at the files in Geneious.
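A minimal sketch of counting features by the GFF3 type column rather than searching for the text “exon”, which may also match the word inside attribute values. It assumes that final annotations carry the source `maker` in column 2 while evidence alignments and raw ab initio predictions carry other sources; the function name and the example rows are hypothetical.

```python
from collections import Counter


def count_feature_types(gff_lines, source=None):
    """Count GFF3 features by the type column (column 3).

    If `source` is given, only rows whose column 2 equals it are counted.
    """
    counts = Counter()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; no more feature rows
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a well-formed 9-column feature row
        if source is not None and cols[1] != source:
            continue
        counts[cols[2]] += 1
    return counts


# Hypothetical rows, for illustration only (not real MAKER output).
example = [
    "##gff-version 3\n",
    "scaf1\tmaker\tgene\t1\t900\t.\t+\t.\tID=g1\n",
    "scaf1\tmaker\texon\t1\t400\t.\t+\t.\tID=g1:exon:0;Parent=m1\n",
    "scaf1\tmaker\texon\t600\t900\t.\t+\t.\tID=g1:exon:1;Parent=m1\n",
    "scaf1\tsnap_masked\tmatch_part\t1\t400\t.\t+\t.\tID=hit1;Name=exon-like\n",
]
exon_count = count_feature_types(example, source="maker")["exon"]
```

Running the same count on each round's GFF, restricted to one source, keeps the comparison between rounds like for like.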

Is this sort of change possible after the second round of Maker2? Is there something I have done wrong in my runs, or am I understanding this output incorrectly?

Thank you,

Alice

Carson Holt

Dec 12, 2014, 10:42:03 AM

to Dennis, Alice, maker...@yandell-lab.org

The gene models are actually produced by SNAP, Augustus, or whatever gene predictor you are using, so if you change the HMM every round, then the models will change too. But I have one concern. You are using a very sparse protein evidence dataset. The protein dataset is very important to MAKER’s performance, and for iterative training of the ab initio predictors. Normally after the second iteration, additional training should not be beneficial, but if you are getting wildly different results on the 3rd and 4th rounds, then you probably aren’t getting enough good models to train with.

For a protein dataset you should be using the entire proteome from a minimum of two related species and perhaps all of UniProt/Swiss-Prot to get a broad protein database. Don’t use the proteins extracted by CEGMA and HaMSTr. CEGMA can be used to guide the first HMM creation (the cegma2zff script that comes with MAKER), but don’t give the proteins to MAKER as evidence. Also, the HaMSTr results will be redundant with the ESTs. You need proteins from related species to look for homology not found in the EST dataset.

Also, repeat masking is important for any genome and has a huge effect on ab initio predictor performance. Make sure you run something like RepeatModeler to look for species-specific repeats that will not already be in RepBase. Then add those results to the rmlib= option in the maker control files.

Thanks,

Carson

_______________________________________________
maker-devel mailing list
maker...@box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Alice Dennis

Mar 26, 2015, 6:34:39 AM

to Carson Holt, maker...@yandell-lab.org, Dennis, Alice

Hello again,

I posted a while ago about a genome I'm running through the Maker2
pipeline. I was concerned because my results were still changing with
3 and 4 iterations.

Following the very useful advice of Carson (below), I've made a few
modifications (adding a RepeatModeler run, using a big protein
database), but my gene predictions are still changing between the 3rd
and 4th iterations. Perhaps this is ok, but these increasing gene
lengths make me worry that I haven't built stable models.

Here is the short version of what I've done.
1. Run RepeatModeler, but this only produced 47 sequences in the
resulting .fasta... so that seemed a bit small.

2. Run Maker2 using:
- RepeatModeler output + "model_org=all" and "softmask=1" in the
Repeat Masking section.
- protein evidence from 2 distantly related species AND all of UniProt
- ESTs from a different strain of my species (a parasitoid wasp)
- the .hmm from Nasonia, one of the 2 distantly related species whose
proteome I also provided as protein evidence
- my assembled genome of 1,509 scaffolds.

3. After this, I did three subsequent rounds of Maker2 (cleverly named
Rounds 2, 3 and 4). Each one used the same input, except the Nasonia
.hmm was replaced by a SNAP generated .hmm from the previous round.
Also, est2genome and protein2genome were changed from 1 to 0 in all
runs after the first.
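The per-round changes described in step 3 could be scripted along these lines. This is a sketch only: it assumes the `key=value #comment` line layout of the MAKER control file, and that the SNAP model is set through the `snaphmm` option; the helper name and the file names are hypothetical.

```python
import re


def set_ctl_options(ctl_text, updates):
    """Return control-file text with the given key=value options replaced.

    Any trailing '#comment' on a changed line is preserved. A KeyError is
    raised if a requested option does not appear in the text.
    """
    remaining = dict(updates)
    out = []
    for line in ctl_text.splitlines():
        match = re.match(r"^(\w+)=([^#]*)(#.*)?$", line)
        if match and match.group(1) in remaining:
            key = match.group(1)
            comment = match.group(3) or ""
            line = f"{key}={remaining.pop(key)} {comment}".rstrip()
        out.append(line)
    if remaining:
        raise KeyError(f"options not found: {sorted(remaining)}")
    return "\n".join(out) + "\n"


# Hypothetical round-1 settings; file names are placeholders.
round1 = (
    "snaphmm=nasonia.hmm #SNAP HMM file\n"
    "est2genome=1 #infer gene predictions directly from ESTs\n"
    "protein2genome=1 #infer predictions from protein homology\n"
)
# Round 2: swap in the retrained HMM and switch off direct inference.
round2 = set_ctl_options(
    round1,
    {"snaphmm": "round1_snap.hmm", "est2genome": 0, "protein2genome": 0},
)
```

Editing a copy of the control file per round, rather than the file in place, leaves a record of exactly which settings produced which output.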

Here are some results:
Round1: 14,647 genes, average length 2,491
Round2: 12,158 genes, average length 3,760
Round3: 13,515 genes, average length 3,090
Round4: 12,169 genes, average length 3,918
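For reference, one way such per-round summaries could be computed from each round's GFF3 file. This sketch assumes that a gene is a row with type `gene` and source `maker`, and that length means genomic span (end - start + 1, introns included); the thread does not say how the figures above were derived, so results from this function may not match them.

```python
def summarize_genes(gff_lines, source="maker"):
    """Return (gene_count, mean_gene_span) for rows of type 'gene'."""
    spans = []
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; no more feature rows
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[1] != source or cols[2] != "gene":
            continue
        start, end = int(cols[3]), int(cols[4])
        spans.append(end - start + 1)  # GFF3 coordinates are 1-based, inclusive
    count = len(spans)
    mean_span = sum(spans) / count if count else 0.0
    return count, mean_span


# Hypothetical rows, for illustration only.
example = [
    "scaf1\tmaker\tgene\t101\t1100\t.\t+\t.\tID=g1\n",
    "scaf1\tmaker\tgene\t5001\t8000\t.\t-\t.\tID=g2\n",
    "scaf1\tsnap_masked\tmatch\t101\t1100\t.\t+\t.\tID=hit1\n",
]
gene_count, mean_span = summarize_genes(example)
```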

This is a bit confusing because the number of genes predicted goes up
and down, as do their lengths. I've double-checked the dates of my
files, and they are all labeled such that I don't think anything could
be swapped.

So my questions are:
Is this an indication that my models are unstable and I shouldn't
trust these predictions?
Is the decreasing number of genes, while they also get longer, perhaps a
good thing?
How do I know when to stop if genes keep getting longer?

Thanks very much,
Alice

--

Alice Dennis
aliceb...@gmail.com

Postdoctoral Researcher
Institute for Integrative Biology, ETH Zürich & EAWAG
Überlandstrasse 133
P.O. Box 611
8600 Dübendorf, Switzerland

https://adennis5.wordpress.com/

Michael Campbell

Mar 26, 2015, 11:50:55 AM

to Alice Dennis, maker...@yandell-lab.org, Dennis, Alice

Hi Alice,

In my experience, having fewer, longer genes is generally a good thing (and very normal), resulting from the merging of split models and the extension of incomplete models. I find it helpful to load the annotations and evidence into a browser to get a visual idea of what is happening.

Mike

--

Michael Campbell MS, RD.
Doctoral Candidate
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
ph:585-3543
