[maker-devel] repeats masking

Quanwei Zhang

Jul 31, 2017, 12:46:00 PM

to maker...@yandell-lab.org

Hello:

We are using the MAKER2 pipeline to annotate a new genome. We just read about repeat masking in the RepeatMasker documentation. It suggests leaving low complexity regions unmasked and running gene annotation on both the masked and unmasked genome. What are your opinion and suggestions on this? Many thanks

The paragraph below is from http://www.binfo.ncku.edu.tw/RM/webrepeatmaskerhelp.html

Use in association with gene prediction programs

Predicting genes from a masked sequence faces several problems. First, one should not mask low complexity regions, e.g. to avoid masking trinucleotide repeats in coding regions. But even with only interspersed repeats masked, gene prediction programs may fail to identify exons correctly. As mentioned above, sometimes tail ends of coding regions may have originated from transposable elements. Even if no coding regions have been masked, splice sites may be compromised; e.g. the polypyrimidine region that is part of the acceptor splice site may be contained within a repeat.

Thus, I generally recommend to run a gene prediction program on unmasked DNA (as well) and compare the predicted genes and exons with the RepeatMasker output. Some gene prediction program allow you to force certain exons out of the predictions (e.g. often the old ORFs of LINE1 elements and endogenous retroviruses are included in genes). Work is also in progress at several sites to incorporate RepeatMasker into gene prediction programs, in which cases matches to repeats are weighted in along with the other parameters used.

Best

Quanwei

Carson Holt

Jul 31, 2017, 12:49:10 PM

to Quanwei Zhang, maker...@yandell-lab.org

MAKER uses the masking primarily for the evidence alignment step. Low complexity regions are soft masked which means alignments can extend through them but must seed outside of the masked region first. Successful BLAST alignments are then polished using exonerate on the unmasked region.

Also, for the gene predictors, the first run is done with hard masking of the transposons only, so they can still predict in low complexity regions. The second round of hint based prediction is done on the unmasked assembly. So MAKER already handles all the issues you are mentioning.

--Carson
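
To make the soft versus hard masking distinction above concrete, here is a minimal Python sketch. It only illustrates the sequence-level convention (soft masking lowercases bases so aligners can avoid seeding there, hard masking replaces them with N); it is not MAKER's own code, and the sequence and interval coordinates are made up for illustration.

```python
def soft_mask(seq, intervals):
    """Lowercase the bases inside each (start, end) interval (0-based, end-exclusive)."""
    chars = list(seq.upper())
    for start, end in intervals:
        for i in range(start, min(end, len(chars))):
            chars[i] = chars[i].lower()
    return "".join(chars)


def hard_mask(seq, intervals):
    """Replace the bases inside each (start, end) interval with N."""
    chars = list(seq.upper())
    for start, end in intervals:
        for i in range(start, min(end, len(chars))):
            chars[i] = "N"
    return "".join(chars)


# Hypothetical example: a CAG trinucleotide repeat inside a coding region.
seq = "ATGGCGCAGCAGCAGCAGTTTACGT"
low_complexity = [(6, 18)]  # assumed coordinates of the repeat

print(soft_mask(seq, low_complexity))  # ATGGCGcagcagcagcagTTTACGT
print(hard_mask(seq, low_complexity))  # ATGGCGNNNNNNNNNNNNTTTACGT
```

In the soft-masked string the repeat is still readable, so an alignment seeded in the flanking uppercase sequence can extend through it. In the hard-masked string the underlying bases are gone.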

Daniel Ence

Jul 31, 2017, 12:57:27 PM

to Quanwei Zhang, maker...@yandell-lab.org

Hi Quanwei, Running MAKER on the unmasked genome will probably give you more genes, but won’t be helpful in the end. MAKER soft-masks repeats, which prevents BLAST alignments from being seeded in the masked regions but still allows them to extend into those regions. This solves the problem of missing exons mentioned in the text you sent. There’s an option in the control file to run the ab initio programs on the unmasked sequence (“unmask”), which is set to false (0) by default.

Hope this helps,

Daniel Ence

Quanwei Zhang

Jul 31, 2017, 12:59:22 PM

to Carson Holt, maker...@yandell-lab.org

Hi Carson:

I see. Thank you for your explanation!

Best

Quanwei

Carson Holt

Jul 31, 2017, 1:02:43 PM

to Daniel Ence, maker...@yandell-lab.org, Quanwei Zhang

Please note that the unmask option Dan is talking about is a feature to run both masked and unmasked raw predictions in the first round of prediction (it does not affect alignment or the second round of prediction). It tends to increase the false positive rate but can be a quick test when you believe you are missing a gene because of overmasking from a user created library and protein/EST evidence is overly sparse (so the gene cannot be recovered through evidence alignment and the second round of unmasked prediction).

—Carson

Quanwei Zhang

Aug 16, 2017, 4:02:05 PM

to Carson Holt, maker...@yandell-lab.org, Daniel Ence

Dear Carson and Daniel:

Thank you for explaining the details of repeat masking. We still have some concerns. Would you please give us some suggestions? Thanks

(1) We are annotating the genome of a new rodent species. Should we use the repeat library for "Mammalia" or "rodent"? Which is more appropriate if we have not constructed a species-specific repeat library for the new genome?

#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker

repeat_protein=/gs/gsfs0/hpc01/apps/MAKER/2.31.9/data/te_proteins.fasta #provide a fasta file of transposable element proteins for RepeatRunner

(2) Because of the concerns discussed in the emails above, we did not train a species-specific repeat library. We have finished the annotation using only the repeat libraries from RepeatMasker and MAKER2. Is it worth first training a species-specific repeat library and then redoing the genome annotation? Will training a species-specific repeat library significantly affect the gene annotation and downstream analyses (e.g., gene family expansion analysis, positive selection)?

(3) We identified some gene families under contraction, but we want to confirm that those gene families really lost copies in our new genome. Do you think it is worth running the genome annotation without repeat masking, so that no genes are missing from the annotation because of repeat masking?

Many thanks.

Best

Quanwei

Carson Holt

Aug 18, 2017, 11:36:09 AM

to Quanwei Zhang, maker...@yandell-lab.org, Daniel Ence

Hi Quanwei,

> (1) We are annotating the genome of a new rodent species. Should we use the repeat library for "Mammalia" or "rodent"? Which is more appropriate if we have not constructed a species-specific repeat library for the new genome?

Overmasking can occur, but you should really only worry about it if there is a specific gene or gene family you are looking for and you don’t care about false positive gene models. On a genome wide level you will find that undermasking is almost always the greater danger. So I’d recommend using Mammalia. Also, you should always build a species specific library when working with repeat rich organisms like mammals.

> (2) Because of the concerns discussed in the emails above, we did not train a species-specific repeat library. We have finished the annotation using only the repeat libraries from RepeatMasker and MAKER2. Is it worth first training a species-specific repeat library and then redoing the genome annotation? Will training a species-specific repeat library significantly affect the gene annotation and downstream analyses (e.g., gene family expansion analysis, positive selection)?

It might be ok. Both Mammalia and rodent are already rich in related species repeats in RepBase. But you still may have a lot of false positives because of missed repeats. Repeats and transposable elements tend to create false regions of high evidence homology (make it look like you are getting evidence for a gene in the region, but when you look at the underlying sequence you realize it is a spurious alignment).

> (3) We identified some gene families under contraction, but we want to confirm that those gene families really lost copies in our new genome. Do you think it is worth running the genome annotation without repeat masking, so that no genes are missing from the annotation because of repeat masking?

Without repeat masking you will get a lot of false alignments. If you find anything without repeat masking you will need to do heavy manual review of the alignment and perhaps even domain identification to further weed out the many false positives you are sure to get.

—Carson
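
As a rough way to see how much of an assembly a given library actually masks (and so whether undermasking is a concern), one can count masked bases in a masked FASTA file. The sketch below assumes a soft- or hard-masked FASTA is available, for example from a standalone RepeatMasker run; the function name and the inline example record are hypothetical. It counts lowercase letters and N as masked, so assembly gaps written as N are counted as well.

```python
import io


def masked_fraction(fasta_handle):
    """Return (masked_bases, total_bases) for a masked FASTA stream.

    Lowercase letters (soft masking) and N (hard masking) count as masked.
    Note: N runs that are assembly gaps are counted too.
    """
    masked = 0
    total = 0
    for line in fasta_handle:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA headers
        total += len(line)
        masked += sum(1 for base in line if base.islower() or base == "N")
    return masked, total


# Tiny made-up record; with a real file use: open("genome.masked.fa")
example = io.StringIO(">scaffold_1\nACGTacgtNNNN\nACGT\n")
masked, total = masked_fraction(example)
print(f"{masked}/{total} bases masked ({masked / total:.1%})")  # 8/16 bases masked (50.0%)
```

Comparing this number between a RepBase-only run and a run that adds a species specific library gives a quick, if crude, indication of how many repeats the first run missed.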

Quanwei Zhang

Aug 21, 2017, 12:58:07 PM

to Carson Holt, maker...@yandell-lab.org, Daniel Ence

Dear Carson:

I am trying to build a species specific repeat library for our new rodent species, following "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic". But some things are not clear to us. Would you please explain? Thanks

(1) For the predicted unknown (unclassified) repeat sequences (those in Modelerunknown.lib), the page says "Sequences in Modelerunknown.lib were searched against a transposase database (derived from RepeatMasker) and sequences matching transposase were considered as transposons belonging to the relevant superfamily".

I wonder how to do this search. Should I annotate the "unknown" repeat sequences using RepeatMasker? And what should I do if only part of an "unknown" repeat sequence matches known repeat elements?

(2) To exclude gene fragments, I need to map the predicted repeat sequences against a protein database and then run the package "ProtExcluder". Right? I wonder how to get such a protein database. Since I am working on a new rodent species, can I use all the rodent proteins from UniProt (both Swiss-Prot and TrEMBL)?

(3) After I generate the species specific repeat library, do I still need to select a model organism for RepBase masking (as shown below)?

In the file "maker_opts.ctl"

#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker

rmlib=myRepeat.fa #provide an organism specific repeat library in fasta format for RepeatMasker

Many thanks

Best

Quanwei

Carson Holt

Aug 23, 2017, 2:11:04 PM

to Quanwei Zhang, maker...@yandell-lab.org, Daniel Ence

> (1) For the predicted unknown (unclassified) repeat sequences (those in Modelerunknown.lib), the page says "Sequences in Modelerunknown.lib were searched against a transposase database (derived from RepeatMasker) and sequences matching transposase were considered as transposons belonging to the relevant superfamily".
> I wonder how to do this search. Should I annotate the "unknown" repeat sequences using RepeatMasker? And what should I do if only part of an "unknown" repeat sequence matches known repeat elements?

You can match against RepBase, I guess, but I would not be overly worried about classification. MAKER won’t use any classification info you give it.

> (2) To exclude gene fragments, I need to map the predicted repeat sequences against a protein database and then run the package "ProtExcluder". Right? I wonder how to get such a protein database. Since I am working on a new rodent species, can I use all the rodent proteins from UniProt (both Swiss-Prot and TrEMBL)?

Try Swiss-Prot. That is a well curated cross species set.

> (3) After I generate the species specific repeat library, do I still need to select a model organism for RepBase masking (as shown below)?
>
> In the file "maker_opts.ctl"
> #-----Repeat Masking (leave values blank to skip repeat masking)
> model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
> rmlib=myRepeat.fa #provide an organism specific repeat library in fasta format for RepeatMasker

Yes. Supply both.

—Carson
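
For anyone scripting this, the following is a minimal Python sketch of setting both options in a control file while keeping the trailing `#` comments. The helper name `set_ctl_options` is hypothetical and not part of MAKER, and the template text is an assumed excerpt of the repeat masking section of maker_opts.ctl; the values `Mammalia` and `myRepeat.fa` are the ones used in this thread.

```python
import re


def set_ctl_options(ctl_text, updates):
    """Return ctl_text with `key=value` lines rewritten for each key in updates.

    Trailing `#comment` text on a rewritten line is preserved.
    Raises KeyError if a requested key is not present in the text.
    """
    remaining = dict(updates)
    out = []
    for line in ctl_text.splitlines():
        # Section headers start with '#', so they never match a key here.
        match = re.match(r"^(\w+)=([^#]*)(#.*)?$", line)
        if match and match.group(1) in remaining:
            key = match.group(1)
            comment = match.group(3) or ""
            line = f"{key}={remaining.pop(key)} {comment}".rstrip()
        out.append(line)
    if remaining:
        raise KeyError(f"options not found in control file: {sorted(remaining)}")
    return "\n".join(out) + "\n"


# Assumed excerpt of a freshly generated maker_opts.ctl
template = (
    "#-----Repeat Masking (leave values blank to skip repeat masking)\n"
    "model_org=all #select a model organism for RepBase masking in RepeatMasker\n"
    "rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker\n"
)

print(set_ctl_options(template, {"model_org": "Mammalia", "rmlib": "myRepeat.fa"}), end="")
```

Raising on a missing key is a deliberate choice here: a misspelled option name would otherwise be silently ignored and the run would proceed with the old settings.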

Quanwei Zhang

Aug 30, 2017, 10:01:24 AM

to Carson Holt, maker...@yandell-lab.org, Daniel Ence

Dear Carson:

Thank you again for all your valuable suggestions. I am now generating the species specific repeat library. Do I need to remove the regions masked by the existing RepeatMasker library before I run RepeatModeler? I think there may be some redundancy if I run RepeatModeler directly on the genome and then use both the existing RepeatMasker library and the RepeatModeler library to mask the genome. Does such redundancy matter?

Thanks

Best

Quanwei

Quanwei Zhang

Aug 30, 2017, 10:22:10 AM

to Daniel Ence, maker...@yandell-lab.org

Dear Daniel:

Thank you! I am running it on an unmasked genome. Just want to make sure it is the correct way.

Have a nice day!

Best

Quanwei

2017-08-30 10:19 GMT-04:00 Daniel Ence <dand...@gmail.com>:

> Hi Quanwei,
>
> I think you should run it on an unmasked genome. I don’t think that redundancy between repeat libraries will be an issue.
>
> Thanks,
> Daniel

Daniel Ence

Aug 30, 2017, 10:53:26 AM

to Quanwei Zhang, maker...@yandell-lab.org

Hi Quanwei,

I think you should run it on an unmasked genome. I don’t think that redundancy between repeat libraries will be an issue.

Thanks,

Daniel

Carson Holt

Aug 30, 2017, 10:54:09 AM

to Quanwei Zhang, maker...@yandell-lab.org, Daniel Ence

You don’t need to worry about redundancy.

—Carson
