Predicting genes from a masked sequence faces several problems. First,
one should not mask low complexity regions, e.g. to avoid masking
trinucleotide repeats in coding regions. But even with only
interspersed repeats masked, gene prediction programs may fail to
identify exons correctly. As mentioned above, sometimes tail ends of
coding regions may have originated from transposable elements. Even if
no coding regions have been masked, splice sites may be compromised;
e.g. the polypyrimidine region that is part of the acceptor splice
site may be contained within a repeat.
Thus, I generally recommend running a gene prediction program on
unmasked DNA (as well) and comparing the predicted genes and exons with
the RepeatMasker output. Some gene prediction programs allow you to
force certain exons out of the predictions (e.g. often the old ORFs of
LINE1 elements and endogenous retroviruses are included in
genes). Work is also in progress at several sites to incorporate
RepeatMasker into gene prediction programs, in which cases matches to
repeats are weighted in along with the other parameters used.
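
The comparison described above (predicted exons versus RepeatMasker output) could be sketched roughly as follows. This is only a minimal illustration, not part of RepeatMasker or MAKER: the function names `merge_intervals` and `repeat_fraction`, the scaffold names, and the coordinates are all made up, and parsing the real RepeatMasker and gene predictor output files into (seqid, start, end) tuples is assumed to happen elsewhere.

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or adjacent 1-based inclusive (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def repeat_fraction(exons, repeats):
    """Return {exon: fraction of its bases covered by annotated repeats}.

    Both arguments are lists of (seqid, start, end), 1-based inclusive.
    """
    by_seq = defaultdict(list)
    for seqid, start, end in repeats:
        by_seq[seqid].append((start, end))
    # Merge repeats per sequence so overlapping annotations are not counted twice.
    merged = {seqid: merge_intervals(ivs) for seqid, ivs in by_seq.items()}

    result = {}
    for seqid, start, end in exons:
        covered = 0
        for r_start, r_end in merged.get(seqid, []):
            overlap = min(end, r_end) - max(start, r_start) + 1
            if overlap > 0:
                covered += overlap
        result[(seqid, start, end)] = covered / (end - start + 1)
    return result


# Hypothetical coordinates, for illustration only.
exons = [("scaffold_1", 100, 199), ("scaffold_1", 500, 599), ("scaffold_2", 10, 59)]
repeats = [("scaffold_1", 150, 250), ("scaffold_1", 240, 300), ("scaffold_2", 1, 100)]

fractions = repeat_fraction(exons, repeats)
for exon, frac in fractions.items():
    print(exon, f"{frac:.2f}")
```

Exons with a high repeat fraction would be the ones to inspect by hand (for example old LINE1 ORFs pulled into a gene model); where to set the cutoff is a judgment call.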
Best
Quanwei
_______________________________________________
maker-devel mailing list
maker...@box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
> (1) We are doing genome annotation for a new rodent species, we wonder whether we should use repeat library for "Mammalia" or "rodent"? Which is more proper, if we did not construct a species-specific repeat library for the new genome?
Overmasking can occur, but you should really only worry about it if there is a specific gene or gene family you are looking for and you don’t care about false positive gene models. On a genome-wide level you will find that undermasking is almost always the greater danger. So I’d recommend using Mammalia. Also, you should always build a species-specific library when working with repeat-rich organisms like mammals.
> (2) Because of some concerns discussed in the emails above, we did not train a species-specific repeat library. Since we have finished the annotation using only the repeat library from RepeatMasker and Maker2, we wonder whether it is worth first training a species-specific repeat library and then doing the genome annotation again. Will training a species-specific repeat library significantly affect the gene annotation and downstream analysis (e.g., gene family expansion analysis, positive selection)?
It might be ok. Both Mammalia and rodent are already rich in repeats from related species in RepBase. But you still may have a lot of false positives because of missed repeats. Repeats and transposable elements tend to create false regions of high evidence homology (they make it look like you are getting evidence for a gene in the region, but when you look at the underlying sequence you realize it is a spurious alignment).
> (3) We identified some gene families under contraction, but we want to confirm that those gene families really lost copies in our new genome. Do you think it is worth doing the genome annotation without repeat masking, so that no genes are missing from the annotation due to repeat masking?
Without repeat masking you will get a lot of false alignments. If you find anything without repeat masking you will need to do heavy manual review of the alignment and perhaps even domain identification to further weed out the many false positives you are sure to get.
—Carson
(1) For the predicted unknown (unclassified) repeat sequences (those in Modelerunknown.lib), it mentions "Sequences in Modelerunknown.lib were searched against a transposase database (derived from RepeatMasker) and sequences matching transposase were considered as transposons belonging to the relevant superfamily". I wonder how to do this search. Should I annotate the "unknown" repeat sequences using RepeatMasker? And what should I do if only part of an "unknown" repeat sequence matches the known repeat elements?
(2) To exclude gene fragments, I need to map the predicted repeat sequences against a protein database and then run the package "ProtExcluder", right? I wonder how to get such a protein database. Since I am working on a new rodent species, can I use all the rodent proteins from UniProt (both Swiss-Prot and TrEMBL)?
(3) After I generate the species-specific repeat library, do I still need to select a model organism for RepBase masking, as shown below in the file "maker_opts.ctl"?
#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
rmlib=myRepeat.fa #provide an organism specific repeat library in fasta format for RepeatMasker
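
To check which repeat masking sources a control file actually has set, a small script along these lines might help. It is only a sketch: `parse_ctl` is a hypothetical helper, not part of MAKER, and it assumes each option is a single `key=value` line where `#` starts a comment and values never contain `#`.

```python
def parse_ctl(text):
    """Parse MAKER-style control file text into a dict of option -> value."""
    options = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or "=" not in line:
            continue  # skip blank lines and comment-only lines
        key, value = line.split("=", 1)
        options[key.strip()] = value.strip()
    return options


# The excerpt quoted above, used here as sample input.
ctl_text = """\
#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
rmlib=myRepeat.fa #provide an organism specific repeat library in fasta format for RepeatMasker
"""

opts = parse_ctl(ctl_text)
for key in ("model_org", "rmlib"):
    print(key, "->", opts.get(key) or "(blank)")
```

With the excerpt above, both `model_org` and `rmlib` come back non-blank, i.e. both the RepBase organism and the custom library are configured.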
Hi Quanwei,

I think you should run it on an unmasked genome. I don’t think that redundancy between repeat libraries will be an issue.

Thanks,
Daniel
On Aug 30, 2017, at 10:01 AM, Quanwei Zhang <qwzha...@gmail.com> wrote: