protein= <just try and add more protein evidence in general>
est2genome=0
protein2genome=1
split_hit=20000
min_contig=50000
Your ESTs are very short, especially if this is a lamprey species, which has very long introns and really short exons.  In lamprey (i.e. Petromyzon marinus), genes tend to be very long (remember that gene length includes introns and UTRs, not just the coding sequence), so contigs shorter than 50kb are useless for training, as you are unlikely to get nice complete gene models on them (the min_contig parameter).  Because the introns are so long, you also have to allow for bigger introns in alignments (the split_hit parameter).  Finally, add as much protein evidence from as many sources as possible.  Your MAKER training run will take a long time, as proteins take forever to align, but because of the evolutionary distance of lamprey from everything else and the short exon structure of its genome, very little from other deuterostome and vertebrate species aligns directly to its genome.  I'm assuming this is a lamprey species because of what you said about the Augustus species file you are using.  Unfortunately, the only things closely related to lampreys are other lampreys.  Lancelets, hagfish, and sharks are not closely related to lamprey (they branch close together on the tree of life, but too many years have passed since the last common ancestor).  So while they may have similar annotation issues (long introns, short exons, etc.), they will not match that well for the gene predictor or even for protein alignments.
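Before committing to a min_contig cutoff, it can help to see how much of the assembly survives it. The following is a minimal sketch, not part of MAKER: the function names `contig_lengths` and `summarize` are made up for illustration, and it assumes a plain FASTA file and that contigs with length below min_contig are the ones skipped.

```python
import io


def contig_lengths(fasta_handle):
    """Yield (name, length) for each record in a FASTA stream."""
    name, length = None, 0
    for line in fasta_handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # The first whitespace-delimited token of the header is the contig name.
            parts = line[1:].split()
            name, length = (parts[0] if parts else ""), 0
        else:
            length += len(line)
    if name is not None:
        yield name, length


def summarize(fasta_handle, min_contig=50000):
    """Return (contigs_kept, contigs_total, bases_kept, bases_total)."""
    n_kept = n_total = bases_kept = bases_total = 0
    for _, length in contig_lengths(fasta_handle):
        n_total += 1
        bases_total += length
        if length >= min_contig:  # assumed: shorter contigs are skipped
            n_kept += 1
            bases_kept += length
    return n_kept, n_total, bases_kept, bases_total


if __name__ == "__main__":
    # Toy data only; replace with open("your_assembly.fasta") for a real run.
    toy = io.StringIO(">contig1 example\nACGTACGT\nACGT\n>contig2\nACG\n")
    print(summarize(toy, min_contig=10))
```

If only a small fraction of the bases sits on contigs above the cutoff, that is a sign the training set will be thin, which is the situation described above.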
I have training files for the lamprey species Petromyzon marinus for both Augustus and SNAP that I could share with you in a few weeks, when the genome publication is released.  Before that happens, new gene models will be available through the UCSC browser (hopefully within a couple of weeks), and gene models are already available through ENSEMBL.  Get those protein files for training; they may be a big help for you.  If you want early access to the lamprey training files for Augustus and SNAP, you would have to request them from Weiming Li at Michigan State University (the head of that genome project).
Optimally you would be doing de novo training using mRNA-seq results, but with only sparse protein alignments and such a fragmented assembly, you are probably better off just trying to adapt the human HMM files.  They won't match that well, but you probably won't have the evidence for de novo training.  First make a copy of the Augustus human species directory and rename it to lamprey (cp -R  …/augustus/config/species/human …/augustus/config/species/lamprey).  Use it as the base species for retraining Augustus with your new models.  You will have to edit multiple files in the directory after you copy it so that they no longer say human or homo sapiens, either internally or in the file name.  Use maker2zff to generate the filtered ZFF file for training SNAP, but don't train SNAP.  Instead, use that training file to better train Augustus; the info is here (just ignore the CEGMA part) --> http://sourceforge.net/mailarchive/message.php?msg_id=29361270
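The copy-and-rename step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `clone_species` is a hypothetical helper, `species_root` stands for your own …/augustus/config/species directory, and the sketch replaces only the exact lowercase string given as `old`. Other spellings (for example "Homo sapiens" or capitalized names) still need a manual check.

```python
import shutil
from pathlib import Path


def clone_species(species_root, old="human", new="lamprey"):
    """Copy species_root/old to species_root/new, renaming files and contents."""
    src = Path(species_root) / old
    dst = Path(species_root) / new
    shutil.copytree(src, dst)  # raises FileExistsError if dst already exists

    # Collect the file list first, because files are renamed inside the loop.
    for path in [p for p in dst.rglob("*") if p.is_file()]:
        try:
            text = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            text = None  # leave binary or oddly encoded files untouched
        if text is not None and old in text:
            path.write_text(text.replace(old, new), encoding="utf-8")
        if old in path.name:
            path.rename(path.with_name(path.name.replace(old, new)))
    return dst
```

The original human directory is left untouched, so you can always start over from it. The same helper could be reused later to copy lamprey to lamprey2 before another round of training.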
Make a backup of the step 1 MAKER output directory and run step 2 in the old step 1 directory (this allows you to change the parameters and reuse files from step 1, so you don't have to recalculate all the protein and EST alignments).  The control files for step 2 are identical to step 1 except for these parameters.
protein2genome=0
augustus_species=lamprey
snaphmm=lamprey.hmm #optional if you decide to use SNAP
Don't bother training SNAP here, as you probably won't have enough data and your assembly is too fragmented for it to work well, so just stick to Augustus.  Try SNAP if you want, just to see how well it works.  Manually open up the largest contigs in a viewer to look at the models produced by the MAKER run to see if they look reasonable (this will also help you decide whether to keep SNAP).
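Since each step changes only a handful of settings, one way to avoid hand-editing mistakes is to rewrite just those keys in a copy of the control file. The sketch below assumes the control file uses `key=value #comment` lines; `set_ctl_options` is a hypothetical helper, not a MAKER tool, and the control-file text in the demo is toy data.

```python
import re


def set_ctl_options(ctl_text, updates):
    """Return ctl_text with the given key=value settings replaced.

    Trailing '#comment' text on a changed line is kept. Raises KeyError if
    a requested option does not appear in the control file at all.
    """
    remaining = dict(updates)
    out = []
    for line in ctl_text.splitlines():
        match = re.match(r"^(\w+)=([^#]*)(#.*)?$", line)
        if match and match.group(1) in remaining:
            key = match.group(1)
            comment = match.group(3) or ""
            new_line = f"{key}={remaining.pop(key)}"
            if comment:
                new_line += " " + comment
            out.append(new_line)
        else:
            out.append(line)  # comments, blank lines, untouched options
    if remaining:
        raise KeyError(f"options not found in control file: {sorted(remaining)}")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    step1 = (
        "protein2genome=1 #toy comment\n"
        "augustus_species= #toy comment\n"
        "min_contig=50000\n"
    )
    print(set_ctl_options(step1, {"protein2genome": 0,
                                  "augustus_species": "lamprey"}))
```

Raising on a missing key is a deliberate choice here: a misspelled option name fails loudly instead of silently leaving the step 1 value in place.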
Step 3 should just be a clone of step 2, as it is bootstrapping.  But make a copy of …/augustus/config/species/lamprey and save it to …/augustus/config/species/lamprey2 (editing all the files and names as you did in step 2). This way you don't lose that training data if you decide to step back.  Also give your SNAP HMM a new name (e.g. lamprey2.hmm).
augustus_species=lamprey2
snaphmm=lamprey2.hmm  #optional if you decide to use SNAP
Make a backup of step 2 and run step 3 in the old step 2 directory (this is for file reuse, so the step will run fast).  This must be the exact same directory as step 2 for the reuse trick to work.
Manually review the models and, if you are satisfied, move to step 4.  Also note that most parameters, including the protein, EST, and repeat settings, should not change from step 1 to step 3, and should not be removed for step 4 either. You can add more evidence, but don't remove evidence (like the repeats).
For this step (step 4), just set min_contig=10000 and rerun MAKER inside the step 3 directory to get the smaller contigs annotated.  This should be your final step, although you can try altering other parameters or adding more evidence sources here, etc.
Thanks for the detailed explanation and help. Now we know the exact parameters
that should help us generate good gene models.
The genome for which we are working on gene predictions is an
echinoderm (sea urchin). Lamprey was the closest organism for training
Augustus that I could find. A quick question about training Augustus: there
is the augustus_species option. How would you go from the training data
generated by zff2augustus_gbk.pl > train.gb, as specified here
http://sourceforge.net/mailarchive/message.php?msg_id=29361270, to
generating the species folder that can be passed to the augustus_species
option?
The average expected gene-length in our case is ~15kb.
We also have a good repeat library for our genome.
I believe
autoAugTrain.pl [OPTIONS] --species=sname --trainingset=training.gb
should work better than using lamprey to train Augustus.
Thanks and regards,
Parul Kudtarkar
--Carson
Could you please advise how to combine the results from protein2genome-derived
models and CEGMA-based training into a training species for Augustus?
Or are these supposed to be run as separate MAKER2 runs (i.e.
run MAKER2 with the training set from protein2genome first, followed by
the CEGMA-based training set)?
Thanks and regards,
Parul
--Carson