_______________________________________________
maker-devel mailing list
maker...@box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
Hi Ray,
I think you’re on the right track with training Genemark with RNAseq data. That should only change the training steps, which are external to MAKER, not how MAKER runs Genemark. You’ll still give MAKER the path to the “es.mod” file made by Genemark.
For the 2nd question, in the MAKER beta 3, MAKER creates a control file for EVM, in which you set your weights for the various inputs, and then MAKER runs EVM alongside all the other gene predictors and chooses the model that is best supported by the evidence.
~Daniel
On Feb 14, 2017, at 7:38 AM, Ray Cui <rc...@age.mpg.de> wrote:
On Mar 16, 2017, at 3:07 AM, Ray Cui <rc...@age.mpg.de> wrote:
Dear Carson,
thank you so much! I am now peeking into the results for the finished scaffolds. In the GFF file, the gene ID confuses me a bit. In this file, column 2 is always "maker", but the "ID" attribute in the annotation is prefixed with "snap", "maker", "evm", "augustus", etc. Does that mean the final annotation is a superset of all gene predictors? If EVM was used to obtain a consensus gene model, why would the other models still show up in the final result set?
Best Regards,
Ray
Dr. Rongfeng (Ray) Cui
Max-Planck-Institut für Biologie des Alterns / Max Planck Institute for Biology of Ageing
Wissenschaftlicher MA / Postdoctoral researcher
Office: Joseph-Stelzmann 9b, D-50931 Köln / Cologne
Postal address: Postfach 41 06 23, D-50866 Köln / Cologne
Tel.: +49 (0)221 496
Mobile: +49 0221 37970 496
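One way to inspect the mix Ray describes is to tally the leading part of each gene ID in the output GFF3. This is only a minimal sketch: the function name is hypothetical, and it assumes the predictor name is the leading alphanumeric run of the ID attribute, which this thread does not confirm for every MAKER version.

```python
import re
from collections import Counter


def id_prefix_counts(gff_lines, feature_type="gene"):
    """Count the leading alphanumeric prefix of the ID attribute per feature."""
    counts = Counter()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; no more features
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        # Column 9 is a ';'-separated list of key=value attributes.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        match = re.match(r"[A-Za-z0-9]+", attrs.get("ID", ""))
        if match:
            counts[match.group(0)] += 1
    return counts
```

Passing an open file handle for the merged GFF3 as `gff_lines` gives a per-prefix count of gene features, which shows at a glance how many final models came from each source.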
On Wed, Mar 15, 2017 at 3:52 PM, Carson Holt <cars...@gmail.com> wrote:
Maybe. I haven’t tested this, but it should work. MAKER supports labels for input by placing a ‘:’ and a label after each file name.
Example -->
est=file1.fasta:label_1,file2.fasta:label_2
If you label your files, then the label will go into the GFF3. So instead of est2genome in column 2, you will get est2genome:label_1 in column 2.
As a result, you should be able to add that label to the EVM settings like so and it will match column 2 of the GFF3 -->
evmtrans:est2genome:label_1=10
I don’t know if the label will force any raw analysis to rerun, but it shouldn’t.
--Carson

On Mar 15, 2017, at 5:13 AM, Ray Cui <rc...@age.mpg.de> wrote:
Hi Carson,
currently I am partitioning the protein evidence based on phylogenetic relationship into several datasets, supplied as a comma-delimited list. Is it possible then to specify a higher weight for protein2genome models from more closely related species than for more distantly related taxa?
Ray
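Carson's labelling suggestion can be sketched as two small helpers that keep the labels in the input option and in the EVM weights in sync. Both function names are hypothetical, the weight of 5 for label_2 is purely illustrative, and, as Carson notes, the approach itself was untested at the time.

```python
def labelled_inputs(files_to_labels):
    """Build the comma-separated 'file:label' value for an input option such as est=."""
    return ",".join(f"{path}:{label}" for path, label in files_to_labels.items())


def evm_weight_lines(prefix, weights_by_label):
    """Build one '<prefix>:<label>=<weight>' control-file line per label."""
    return [
        f"{prefix}:{label}={weight}" for label, weight in weights_by_label.items()
    ]


if __name__ == "__main__":
    # Same file names and labels as in the example above.
    print("est=" + labelled_inputs(
        {"file1.fasta": "label_1", "file2.fasta": "label_2"}
    ))
    for line in evm_weight_lines(
        "evmtrans:est2genome", {"label_1": 10, "label_2": 5}
    ):
        print(line)
```

Generating both from the same label names avoids a mismatch between the label in column 2 of the GFF3 and the key in the EVM settings.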
On Wed, Mar 15, 2017 at 11:47 AM, Ray Cui <rc...@age.mpg.de> wrote:
Dear Carson,
thank you for the pointers! Before running the first round of MAKER, I mapped conspecific Trinity-assembled proteins (long, "full length" subset) to an earlier version of the genome assembly using my own pipeline and trained Augustus and SNAP that way. I also trained Genemark-ET using TopHat alignments per their instructions. I'm wondering if it will be worth doing a second round, but I guess I will see.
It is good to know that MAKER will reuse the old results.
Best Regards,
Ray
On Tue, Mar 14, 2017 at 5:58 PM, Carson Holt <cars...@gmail.com> wrote:
You can find lots of info in the devel archives on training. Example --> https://groups.google.com/forum/#!topic/maker-devel/FWMSTdqWQqI
Also an example of training SNAP on the wiki --> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors
MAKER will reuse old raw results if you rerun in the same directory (only deleting what would be different given altered settings between runs). It will see the existing alignments archived in the datastore as raw reports and just reuse them. The exception to this is the exonerate alignments. They are generated relatively quickly compared to the BLAST runs, so rerunning them is not too much overhead. They are also not archived because doing so created IO issues (exonerate does not run in bulk batches like BLAST, but as multiple small separate runs for each polished read, and archiving a lot of small raw reports can happen so fast when using MPI that it crashes storage servers). So we decided to just not archive exonerate rather than develop a database-like bundling/compression mechanism to get around the IO issues.
Thanks,
Carson

On Mar 14, 2017, at 10:44 AM, Ray Cui <rc...@age.mpg.de> wrote:
Hi Carson,
Thanks for your prompt response!
I have a somewhat unrelated question. After the first run of MAKER, I want to train Augustus, SNAP and Genemark-ET using the most reliable gene models produced in the first round. What would be a good way to select these gene models?
After retraining the ab initio predictors, I also wonder if it's necessary to redo all the alignments (blastx, est2genome, protein2genome etc.) in the second iteration, since they are exactly the same as in the first run. Perhaps MAKER can take in the alignment results from the previous run?
Best Regards,
Ray
On Tue, Mar 14, 2017 at 5:37 PM, Ray Cui <rc...@age.mpg.de> wrote:
I see. If my EVM config looks like this:
evmab=5 #default weight for source unspecified ab initio predictions
evmab:snap=5 #weight for snap sourced predictions
evmab:augustus=10 #weight for augustus sourced predictions
evmab:fgenesh=10 #weight for fgenesh sourced predictions
evmab:genemark=5 #weight for genemark sourced predictions
and column 2 in the genemark.gff is "GeneMark.hmm", then the value from "evmab" (=5) will be used, is that correct?
Best Regards,
Ray
On Tue, Mar 14, 2017 at 5:29 PM, Carson Holt <cars...@gmail.com> wrote:
Column 2 in the GFF3 file is the source column. It is used to specify the source of the data. That column will also be used by EVM to bin features by their source and apply weights based on source.
--Carson

On Mar 14, 2017, at 10:26 AM, Ray Cui <rc...@age.mpg.de> wrote:
Thanks! I didn't know you can also name the GFF, but I think using the default is fine; that's what I'm doing now.
Best Regards,
Ray
On Tue, Mar 14, 2017 at 5:11 PM, Carson Holt <cars...@gmail.com> wrote:
These are set in the maker_evm.ctl file.
Use whatever you used in the source column of the input GFF3. For example, if column 2 is set as GENEMARK, then do this -->
evmab:GENEMARK=7
This also works -->
evmab:pred_gff:GENEMARK=7
Or just set the default -->
evmab=7
--Carson
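Since the weight key has to match column 2 of the GFF3 passed to pred_gff, it can help to list the source values actually present in the file before editing maker_evm.ctl. The following is a minimal sketch with hypothetical function names; the only control-file syntax it emits is the evmab:<source>=<weight> form Carson gives above.

```python
from collections import Counter


def gff3_sources(gff_lines):
    """Count features per source (column 2) in a GFF3 file."""
    counts = Counter()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; no more features
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        if len(cols) >= 2:
            counts[cols[1]] += 1
    return counts


def evmab_lines(sources, weight):
    """Build one 'evmab:<source>=<weight>' line per distinct source."""
    return [f"evmab:{source}={weight}" for source in sorted(sources)]
```

For example, `evmab_lines(gff3_sources(open("genemark.gff")), 7)` would yield `evmab:GENEMARK=7` for a file whose column 2 is GENEMARK; the file name here is only a placeholder.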
On Mar 10, 2017, at 8:48 AM, Ray Cui <rc...@age.mpg.de> wrote:
Dear Carson,
I think it may be most straightforward to input the GFF3 instead.
What is the correct way of setting a weight in the EVM step for the GFF3 models passed through the pred_gff option?
Ray
On Mon, Feb 20, 2017 at 10:53 AM, Carson Holt <cars...@gmail.com> wrote:
It may work as is, as long as you don’t need any of the additional options that have been added. If not, you can also just run it outside of MAKER and then provide the result in GFF3 format to pred_gff.
--Carson

On Feb 20, 2017, at 2:51 AM, Ray Cui <rc...@age.mpg.de> wrote:
I see. Are there any plans to incorporate it into MAKER soon?
If not, I could try to see if I can adapt the current MAKER script.
Ray
On Mon, Feb 20, 2017 at 10:46 AM, Carson Holt <cars...@gmail.com> wrote:
Yes. This is a recent update. It’s an attempt to merge GeneMark-ET and GeneMark-EP into GeneMark-ES scripts.
--Carson

On Feb 20, 2017, at 2:43 AM, Ray Cui <rc...@age.mpg.de> wrote:
I see, I will take a look at the wrapper gmhmm_wrap.
I think there must have been a big update between different Genemark versions. It seems that they now also support evidence being fed into the prediction stage.
The name of the latest version of the Genemark script has been changed to "gmes_petap.pl", with the following command line options:

Usage: /beegfs/group_dv/software/source/gm_et_linux_64/gmes_petap/gmes_petap.pl [options] --sequence [filename]

GeneMark-ES Suite version 4.33
includes transcript (GeneMark-ET) and protein (GeneMark-EP) based training and prediction

Input sequence/s should be in FASTA format

Algorithm options
  --ES to run self-training
  --fungus to run algorithm with branch point model (most useful for fungal genomes)
  --ET [filename]; to run training with introns coordinates from RNA-Seq read alignments (GFF format)
  --et_score [number]; 4 (default) minimum score of intron in initiation of the ET algorithm
  --evidence [filename]; to use in prediction external evidence (RNA or protein) mapped to genome
  --training_only to run only training step
  --prediction_only to run only prediction step
  --predict_with [filename]; predict genes using this file species specific parameters (bypass regular training and prediction steps)

Sequence pre-processing options
  --max_contig [number]; 5000000 (default) will split input genomic sequence into contigs shorter then max_contig
  --min_contig [number]; 50000 (default); will ignore contigs shorter then min_contig in training
  --max_gap [number]; 5000 (default); will split sequence at gaps longer than max_gap
      Letters 'n' and 'N' are interpreted as standing within gaps
  --max_mask [number]; 5000 (default); will split sequence at repeats longer then max_mask
      Letters 'x' and 'X' are interpreted as results of hard masking of repeats
  --soft_mask [number] to indicate that lowercase letters stand for repeats; utilize only lowercase repeats longer than specified length

Run options
  --cores [number]; 1 (default) to run program with multiple threads
  --pbs to run on cluster with PBS support
  --v verbose

Customizing parameters:
  --max_intron [number]; default 10000 (3000 fungi), maximum length of intron
  --max_intergenic [number]; default 10000, maximum length of intergenic regions
  --min_gene_prediction [number]; default 300 (120 fungi) minimum allowed gene length in prediction step

Developer options:
  --usr_cfg [filename]; to customize configuration file
  --ini_mod [filename]; use this file with parameters for algorithm initiation
  --test_set [filename]; to evaluate prediction accuracy on the given test set
  --key_bin
  --debug
# -------------------
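For running the training step outside of MAKER, the options listed above can be assembled into an argument list. This is a minimal sketch, not a tested recipe: the function name and the input file names are hypothetical, only flags that appear in the usage text are used, and it assumes that --ES and --ET are alternative training modes, which the usage text suggests but does not state.

```python
def gmes_petap_command(sequence, introns_gff=None, et_score=None, cores=1,
                       training_only=False, script="gmes_petap.pl"):
    """Assemble an argument list for gmes_petap.pl without running it."""
    cmd = [script, "--sequence", sequence]
    if introns_gff is None:
        # No intron evidence supplied: fall back to self-training.
        cmd.append("--ES")
    else:
        # Train with intron coordinates from RNA-Seq alignments (GFF format).
        cmd += ["--ET", introns_gff]
        if et_score is not None:
            cmd += ["--et_score", str(et_score)]
    if cores != 1:
        cmd += ["--cores", str(cores)]
    if training_only:
        cmd.append("--training_only")
    return cmd


if __name__ == "__main__":
    # genome.fa and introns.gff are placeholder file names.
    print(" ".join(gmes_petap_command(
        "genome.fa", introns_gff="introns.gff", cores=8, training_only=True
    )))
```

The returned list can be passed to `subprocess.run` as is; building it separately makes it easy to check the exact command before launching a long training job.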
On Mon, Feb 20, 2017 at 10:28 AM, Carson Holt <cars...@gmail.com> wrote:
Also note that the gmhmme3 executable distributed with different flavors of Genemark has kept the same name but has differed considerably between flavors in both command line structure and output.
--Carson

On Feb 20, 2017, at 2:08 AM, Ray Cui <rc...@age.mpg.de> wrote:
Thanks.
Are the "--max_intron" and "--max_intergenic" parameters automatically set by MAKER when calling Genemark?
If you can point me to the part of the MAKER source code that constructs the final Genemark command line, I can also take a look.
Best Regards,
Ray
On Mar 16, 2017, at 11:22 AM, Ray Cui <rc...@age.mpg.de> wrote:
Hi Carson,
for some reason I can't seem to post on the Google group anymore.
After looking at the results, it appears that SNAP performs poorly compared to Genemark-ET and Augustus. It looks like it is very prone to fusing neighboring genes and producing false positives. Is that something you generally see with SNAP in vertebrate genomes? I saw that you didn't recommend SNAP for primates; perhaps the issue is similar?
Attached is a screenshot from the IGV browser, with all evidence tracks separated.
Ray
<example.pdf>