[maker-devel] AED score

1,817 views
Skip to first unread message

Parul Kudtarkar

unread,
Nov 26, 2012, 4:35:48 PM11/26/12
to maker...@yandell-lab.org
Dear Maker community,

For gene-prediction I get training data-set from evidence based
prediction, I use this data-set to train SNAP as well as Augustus
predictions, followed by boot-strapping. I would typically expect 20-30K
genes however I am getting 8 times the expected gene count indicating too
many false positives. Is there a way to further refine these
predication/script to retain predictions with AED score 1 and if yes how
to go about this?

Thanks and regards,
Parul Kudtarkar

--
Scientific Programmer
Center for Computational Regulatory Genomics
Beckman Institute,
California Institute of Technology
http://www.spbase.org


_______________________________________________
maker-devel mailing list
maker...@box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Carson Holt

unread,
Nov 26, 2012, 4:46:48 PM11/26/12
to Parul Kudtarkar, maker...@yandell-lab.org
AED score with 1 are the ones you don't want. 0 is best and 1 is worst as
it is a distance metric. You can use the AED_threshold parameter to
require better matching to the evidence by setting it closer to 0. You can
also try to increase protein homology evidence as some of your calls may
be split genes due to lack of evidence linking them.

--Carson

Parul Kudtarkar

unread,
Nov 27, 2012, 5:39:18 PM11/27/12
to Carson Holt, maker...@yandell-lab.org
Hello Carson,

Just to confirm, Is there a script that would filter gene models at
specific AED score.
Alternatively if I were to do this within maker with regards to parameters
in maker_opts.ctl file I would have to provide my predicted genes gff3
file to model_gff and set AED_threshold at desired threshold?

Parul Kudtarkar

unread,
Nov 27, 2012, 5:41:12 PM11/27/12
to Parul Kudtarkar, maker...@yandell-lab.org
Also, are there any other parameters that are required when filtering
based on AED score?

Daniel Ence

unread,
Nov 27, 2012, 5:55:49 PM11/27/12
to maker...@yandell-lab.org
Hi Parul,

I think the way you described (with the maker_opts.ctl file) is how you want to proceed. You still need to give the genome too.

Daniel


Daniel Ence
Graduate Student
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
________________________________________
From: maker-dev...@yandell-lab.org [maker-dev...@yandell-lab.org] on behalf of Parul Kudtarkar [par...@caltech.edu]
Sent: Tuesday, November 27, 2012 3:41 PM
To: Parul Kudtarkar
Cc: maker...@yandell-lab.org
Subject: Re: [maker-devel] AED score

Parul Kudtarkar

unread,
Nov 27, 2012, 5:59:22 PM11/27/12
to Daniel Ence, maker...@yandell-lab.org
Thanks for quick response Daniel, I'll do it through maker.

Carson Holt

unread,
Nov 27, 2012, 10:39:56 PM11/27/12
to Daniel Ence, maker...@yandell-lab.org
Use the AED_threshold option in the maker_opts.ctl file if you just want
to restrict final gene models to close matches directly within maker. On
the other hand, if you are trying to build a dataset for training gene
predictors, use the maker2zff script for generating a filtered dataset for
SNAP training. There are a number of filters available. Just call the
script once without parameters to see the options.

Thanks,
Carson

Carson Holt

unread,
Nov 29, 2012, 9:22:42 AM11/29/12
to Parul Kudtarkar, maker...@yandell-lab.org
There are certain characteristics that are apparent in this contig. First
it seems to be repeat rich with a very low gene density. You also have
very short ESTs, and because of the lengths you are probably getting many
of them to align spuriously which produces very short gene models that are
more than likely false positives or at the very least just a piece of a
gene. I would turn off est2genome as a predictor for this reason unless
you can get longer EST assemblies (i.e. From mRNAseq). Your protein
alignments also seem to be few and far between. You probably need to add
more proteins from a couple of related species, and you might consider
using protein2genome rather than est2genome as a predictor if you are
still working to generate a training set. Also est2genome produced models
almost always have an AED score near 0 so mixing est2genome with the
AED_threshold with such limited protein support does create an artificial
bias to get back very short and incomplete models.

How many contigs do you have in total and what is the N50 value for the
assembly? If you have a large number of very short contigs, you will get
very inflated gene counts because you get genes split across contigs and
many contigs tend t be subtle rearrangements of other contigs just
assembled in a slightly different way (so you can get bits and pieces of
the same genes just rearranged). This scenario is another confounding
factor if using the est2genome predictor with short ESTs. I would
recommend running CEGMA to get an estimate for the genome completeness as
well as get an estimate of fragmentation as one of the statistics produced
is a percent of genes that are found complete (end to end) vs those that
are partial. CEGMA identifies house keeping genes that tend to be shorter
and less intron rich than other genes in the genome, so if CEGMA gives a
high partial percentage and a low complete percentage, then this pattern
can be expected to be even more exaggerated for other genes in the genome.

If your genome is highly fragmented or proteins do not align well then
there are other strategies. For example, some vertebrate genomes end up
having extremely fragmented assemblies (on the order of 100,000 contigs),
and if they are distantly related to other annotated species few proteins
may align to the contigs because the introns in the alignments tend to be
so long and exons so short that it pushes down the significance scores too
much. In those cases heavy mRNAseq seems to be the best if not only way
to get enough evidence to stitch gene models together.

Thanks,
Carson



On 12-11-28 4:40 PM, "Parul Kudtarkar" <par...@caltech.edu> wrote:

>Dear Carson and Daniel,
>
>Thanks. I ran sample file for filtering genes based on AED score. The
>input gff3 file was provided to option model_pred(see attached file
>Scaffold1.gff), the cutoff AED score was set to 0.75. There are at least 5
>genes with AED score less than 0.75. However there were no genes predicted
>in the output file(see attached file Scaffold1_out). I have also attached
>the maker_opts.ctl. Could you please advice on this.

Parul Kudtarkar

unread,
Nov 28, 2012, 4:40:51 PM11/28/12
to Carson Holt, maker...@yandell-lab.org
Dear Carson and Daniel,

Thanks. I ran sample file for filtering genes based on AED score. The
input gff3 file was provided to option model_pred(see attached file
Scaffold1.gff), the cutoff AED score was set to 0.75. There are at least 5
genes with AED score less than 0.75. However there were no genes predicted
in the output file(see attached file Scaffold1_out). I have also attached
the maker_opts.ctl. Could you please advice on this.

Scaffold1.gff
Scaffold1_out.gff
maker_opts.ctl

Parul Kudtarkar

unread,
Nov 29, 2012, 7:31:40 PM11/29/12
to Carson Holt, maker...@yandell-lab.org
Thanks for the guidance Carson, total contig size is 330,611 with N50 of
39.17kb. I agree we have short ESTs. So this is the possible reason when
filtering based on AED score 0.75 there are no gene models predicted
despite the model_gff file has few genes with scores less than 0.75?

Carson Holt

unread,
Nov 29, 2012, 8:39:31 PM11/29/12
to Parul Kudtarkar, maker...@yandell-lab.org
Wow 330,000 is a lot. a large portion of genes are likely to be partial at
best. You should seriously consider using mRNAseq to capture those by
using maker's est_gff option to pass in results from cufflinks or trinity.
Also I wouldn't even try to annotate contigs less than 10kb in size, just
have maker skip them by setting the min_contig filter in the
maker_opts.ctl file.

Thanks,
Carson

Parul Kudtarkar

unread,
Dec 4, 2012, 7:27:18 PM12/4/12
to Carson Holt, maker...@yandell-lab.org
Dear Carson,

Thanks once again, we have limited experimental data with very short ESTs.
CEGMA is useful for us to gauge our gene-model.
On a different note to re-annotate genome(post evidence based
prediction(used as training dataset)and abinitio gene-prediction). Here
are the control parameters I am using with AED score set to 0.5, however I
get predictions that includes the ones with AED score of 1.00 in resulting
gff3 file. Though I do see the number of genes reduced to 1/3 of initial
gff3 file.

#-----Genome (Required for De-Novo Annotation)
genome=Scaffold1.fa #genome sequence file in fasta format
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic

#-----Re-annotation Using MAKER Derived GFF3
genome_gff= Scaffold1.gff #re-annotate genome based on this gff3 file
est_pass=1 #use ests in genome_gff: 1 = yes, 0 = no
altest_pass=0 #use alternate organism ests in genome_gff: 1 = yes, 0 = no
protein_pass=1 #use proteins in genome_gff: 1 = yes, 0 = no
rm_pass=0 #use repeats in genome_gff: 1 = yes, 0 = no
model_pass=1 #use gene models in genome_gff: 1 = yes, 0 = no
pred_pass=1 #use ab-initio predictions in genome_gff: 1 = yes, 0 = no
other_pass=0 #passthrough everything else in genome_gff: 1 = yes, 0 = no

#-----MAKER Behavior Options
AED_threshold=0.5 #Maximum Annotation Edit Distance allowed (bound by 0
and 1)

Carson Holt

unread,
Dec 5, 2012, 10:44:52 AM12/5/12
to Parul Kudtarkar, maker...@yandell-lab.org
You should always get all predictions, but should only get models
(match/match_part) with AED scores less than 0.5. You should never get
models (gene/mRNA/CDS) with an AED score of 1.00 unless you have
keep_preds set.

Do you have keeps_preds set or gene models with AED values above your
threshold (not match/match_part features)? If the first one is the case,
just set it to 0. If the second one in the case, could you send me
example GFF3?

Thanks,
Carson

Carson Holt

unread,
Dec 5, 2012, 3:31:26 PM12/5/12
to Parul Kudtarkar, maker...@yandell-lab.org
Your output has no genes. They've all been filtered out. The gene
predictions are left for reference purposes, but there are no gene models
in the file. You need to look at the type columns in the GFF3 file -->
match/match_part features are evidence and reference data but not models.
All models will have types gene/mRNA/exon/CDS. For example if you just
want gene models in your file you can use the gff3_merge script with the
-g option, and it will only print out the gene models.

I think you may be misinterpreting what is happening at different steps,
as well as how to read the result files. Could you give me a detailed
explanation of what you expect to get back together with your control
files and I can walk you through the configuration, and indicate what to
expect. Also what was the report like from CEGMA. Could you include the
report file that shows how complete your genome is and how fragmented it
is?

Thanks,
Carson



On 12-12-05 3:03 PM, "Parul Kudtarkar" <par...@caltech.edu> wrote:

>Dear Carson,
>
>Thanks for a quick response. keep_preds is set to 0
>Though for previous step(ab-intio gene predictions keep_preds was set to
>1, see Scaffold1_input.gff). I have also attached the output file
>Scaffold1_out.gff.
>Please advice.

Parul Kudtarkar

unread,
Dec 5, 2012, 3:03:46 PM12/5/12
to Carson Holt, maker...@yandell-lab.org
Dear Carson,

Thanks for a quick response. keep_preds is set to 0
Though for previous step(ab-intio gene predictions keep_preds was set to
1, see Scaffold1_input.gff). I have also attached the output file
Scaffold1_out.gff.
Please advice.

Scaffold1_input.gff
Scaffold1_out.gff

Daniel Ence

unread,
Dec 6, 2012, 2:17:57 PM12/6/12
to maker...@yandell-lab.org
Hi Parul,

In step one, if protein2genome isn't turned on, then the protein alignments wont be used to generate gene models. Carson said before to use protein2genome to generate gene models for your trainings set, but you should know that those data aren't being used right now.

In step 2, you can include the est and protein evidence that you used in step one, and the ab-initio predictors will takes those data as hints to guide their predictions. Also, are you reannotating human here? I'm just wondering why the snap file is called Pult, but the augustus species model is human. Also, I think you should include the same masking files from step 1, otherwise the ab-initio predictors will be predicting on the unmasked sequence which will give you many spurious predictions.

In step 3, I'm not certain what you mean by boot-strapping. The control file you sent for step 3 will just pass-through all of the data from before.

In step 4, I think what you'll get is pretty close to what you already had in step 3. The gene models from step 3 are already based on the est and protein data from step 1, so without giving any different evidence or ab-initio predictors, I think that you will just get the same gene models that you got from step 3. Those gene models are what you should be using to train snap. Also, is this the same genome as in the other steps? The genome file here is called genome.linear.fa, but before it was called genome.fa.

Step 5. So maker uses both evidence and ab-initio predictions to create models. If you don't give any evidence (EST or protein) maker will not annotate any gene models. You have also changed the augustus species model in this step, but I don't know what you've gained by going from one species to another. You should be training augustus with the gene models and then creating a new species model in augustus. As I understand it, the filter only operates on gene models, not ab-initio predictions or alignments, so it probably isn't doing anything the way you have it set.

Step 6. I think step 5 and 6 should be combined. I don't know what you mean by boot-strapping here too.

Hopefully that clears up some of the confusion with how maker works. Carson will probably have a lot of suggestions too.

Thanks,
Daniel

Daniel Ence
Graduate Student
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
________________________________________
From: Parul Kudtarkar [par...@caltech.edu]
Sent: Thursday, December 06, 2012 10:08 AM
To: cars...@gmail.com
Cc: Daniel Ence; maker...@yandell-lab.org
Subject: Re: [maker-devel] AED score

Hi Carson,

Once again thanks for a quick response. We expect to get 25,000 to 30,000
genes.
Here are the step
Step1. EST and proteins are used to predict gene-models. the resulting
file is used as training data-set for SNAP in step2. est2genome is turned
ON
Step2. Augustus and SNAP is used to predict genes.
Step3. The results are re-annoated(boot-strap)
Step4. EST and proteins are used to predict gene-models. The gff file from
step3. is used as model_gff. The resulting file is used as training
data-set for SNAP in step2
Step 5. Augustus and SNAP is used to predict genes. with AED set to 0.5
Step6. The results are re-annoated(boot-strap) with AED set to 0.5 and
contig_size to 10kb

Thanks and regards,
Parul Kudtarkar

I have attached the configuration file for each step.
> Your output has no genes. They've all been filtered out. The gene
predictions are left for reference purposes, but there are no gene models
> in the file. You need to look at the type columns in the GFF3 file -->
match/match_part features are evidence and reference data but not
models.
> All models will have types gene/mRNA/exon/CDS. For example if you just
want gene models in your file you can use the gff3_merge script with the
-g option, and it will only print out the gene models.
> I think you may be misinterpreting what is happening at different steps,
as well as how to read the result files. Could you give me a detailed
explanation of what you expect to get back together with your control
files and I can walk you through the configuration, and indicate what to
expect. Also what was the report like from CEGMA. Could you include the
report file that shows how complete your genome is and how fragmented it
is?
> Thanks,
> Carson
>>>>--
>>>>Scientific Programmer
>>>>Center for Computational Regulatory Genomics
>>>>Beckman Institute,
>>>>California Institute of Technology
>>>>http://www.spbase.org
>>--
>>Scientific Programmer
>>Center for Computational Regulatory Genomics
>>Beckman Institute,
>>California Institute of Technology
>>http://www.spbase.org


--
Scientific Programmer
Center for Computational Regulatory Genomics
Beckman Institute,
California Institute of Technology
http://www.spbase.org

--
Scientific Programmer
Center for Computational Regulatory Genomics
Beckman Institute,
California Institute of Technology
http://www.spbase.org



Parul Kudtarkar

unread,
Dec 6, 2012, 3:28:18 PM12/6/12
to Daniel Ence, maker...@yandell-lab.org
Dear Daniel,

Thanks for clearing doubts. For augustus I am using the closest
species(lamprey) to species we are annotating. For SNAP the training
set(hmm file) is generated using predictions made from evidence based
gene-predictions(sorry the file name was mis-leading). I think STEP 3-6
are not required(more or less repetitive without further improving
genemodel). Post Step 1(generating training data for SNAP using evidence
based prediction) and Step2(ab-initio gene-prediction using SNAP and
augustus) do you have recommendations for further improving the gene
model?

Daniel Ence

unread,
Dec 6, 2012, 4:35:04 PM12/6/12
to maker...@yandell-lab.org
Hi Parul,
I dont' really have many suggestions for improving the gene models after the annotation is done. Annotation is very dependent on the data you have at hand (the assembly, ESTs and proteins). One thing you could do to get more annotations is to run something like iprscan on the ab-initio predictions that didn't overlap any evidence and look for predictions that contain a pfam domain. Then you can send those predictions back through maker and promote them to gene model status.

Do you have the CEGMA results that Carson asked about? That really will tell you what kind of annotation results you can expect. If the assembly doesn't have an N50 greater than the expected median gene size, then you can't expect very good results from automated annotation.

Thanks,
Daniel

Daniel Ence
Graduate Student
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
________________________________________
From: Parul Kudtarkar [par...@caltech.edu]
Sent: Thursday, December 06, 2012 1:28 PM
To: Daniel Ence

Parul Kudtarkar

unread,
Dec 6, 2012, 5:51:42 PM12/6/12
to Daniel Ence, maker...@yandell-lab.org
Dear Daniel,

Once again thanks for extensive help. We get over estimation of number of
genes at the end of step 2. So was wondering if there is a way to pick
only the best annotated. That been said assembly, ESTs and proteins are
not good and I am currently in process of running cegma.

Also for step 2(ab-initio gene-prediction using SNAP and Augustus) is it
best to turn off keep_preds to remove any possible false-positives. Also
should est2genome turned ON? Of course as pointed in previous email use
est,proteins apart from training data to generate better gene-model and
repeat masking as well to mask repeats for ab-initio predictors?
I believe while running gff3_merge it is best to use option -g to get only
the gene-predictions and filter evidence.

Parul Kudtarkar

unread,
Dec 6, 2012, 12:08:23 PM12/6/12
to cars...@gmail.com, maker...@yandell-lab.org
maker_opts_step1.ctl
maker_opts_step2.ctl
maker_opts_step3.ctl
maker_opts_step4.ctl
maker_opts_step5.ctl
maker_opts_step6.ctl

Carson Holt

unread,
Dec 7, 2012, 10:22:10 AM12/7/12
to Daniel Ence, maker...@yandell-lab.org, Parul Kudtarkar
Just to add to Daniels comments.

Things to change in step 1:
protein= <just try and add more protein evidence in general>
est2genome=0
protein2genome=1
split_hit=20000
min_contig=50000

Reasoning:
Your ESTs are very short especially if this is a lamprey species which have very long introns and really short exons.  In lamprey (i.e. Petromyzon marinus), genes tend to be very long (remember gene lengths include introns and UTR and is not just the size of the coding sequence), so contigs shorter than 50kb are useless for training as you are unlikely to get nice complete gene models on those.  Also lampreys have very long introns, so you have to allow for bigger introns in alignments (split_hit parameter).  Finally add as much protein evidence from as many sources as possible.  Your maker training run will take a long time as proteins take forever to align, but because of the evolutionary distance of lamprey from everything else and the short exon structure of its genome, very little aligns directly to its genome from other deuterstome and vertebrate species.  I'm assuming this is a lamprey species because of what you said about the  augustus species file you are using.   Really the only thing closely related to lampreys unfortunately are other lampreys.  Lancelets, hagfish, and sharks are not closely related to lamprey (while they branch closely together on the tree of life, there are too many years since the last common ancestor).  So while they may have similar issues related to annotation (long introns and short exons etc.) they will not really match that well for the gene predictor or even protein alignments.

Additional note:
I have training files for the lamprey species Petromyzon marinus for both Augustus and SNAP that I could share with you in a few week, when the genome publication is is released.  But before that happens, new gene models will be available  through the UCSC browser (hopefully within a couple of weeks), and  gene models are already available through ENSEMBL.  Get those protein files for training, it may be a big help for you.  If you want early access to the lamprey training files for Augustus and SNAP, you would have to request it from Weiming Li at Michigan State University (the head of that genome project).



Things to change in step 2:
Optimally you would be doing de novo training using mRNAseq results, but with on;ly sparse protein alignments and such a fragmented assembly, you are probably better off just trying to adapt the human HMM files.  They won't match that well, but you probably won't have the evidence for De Novo training.  First make a copy of the augustus human species directory and rename it to lamprey (cp -R  …/augustus/config/species/human …/augustus/config/species/lamprey).  Use it as the base species for retraining augustus using your new models.  You will have to edit multiple files in the directory after you copy it so that they no longer say human or homo sapiens internally or in the file name.  Use maker2zff to generate the filtered ZFF file for training SNAP, but don't train SNAP.  Rather use the training file to better train Augustus info here (just ignore the CEGMA part) --> http://sourceforge.net/mailarchive/message.php?msg_id=29361270

MAKE a backup of the step 1 maker output directory and run step 2 in the old step 1 directory (this allows you to change the parameters and reuse files form step 1 so you don't have to recalculate all the protein and EST alignments).  So control files for step 2 are identical to step 1 except for these parameters.

protein2genome=0
augustus_species=lamprey
snaphmm=lamprey.hmm #optional if you decide to use SNAP

Don't both training SNAP here as you probably won't have enough data and you assembly is too fragmented for it to work well, so just stick to augustus.  Try SNAP if you want just to see how well it works.  Manually open up the largest contigs in a viewer to look at the models produced from the MAKER run to see if they look reasonable (this will also help you decide whether to keep SNAP).


Things to change in step 3:
Step 3 should just be a clone of step 2 as it is bootstrapping.  But make copies of …/augustus/config/species/lamprey and save it to …/augustus/config/species/lamprey2 (editing all the files and names as you did in step 2). This way you don't loose that training data if you decide to step back.  Also Give your SNAP HMM a new name (I.e. lamprey2.hmm)

augustus_species=lamprey2
snaphmm=lamprey2.hmm  #optional if you decide to use SNAP

Make a backup of Step 2 and run step 3 in the old Step 2 directory (This is for file reuse, so the step will run fast).  This must be the exact same step directory as step 2 for the reuse trick to work.

Manually review the models and if you are satisfied move to step 4.  Also note that most parameters including the protein, EST, and repeats should not change from step1-step3, and should not be removed for step 4 either, you can add more evidence, but don't remove evidence (like the repeats).


Things to change in step 4:
For this step, just set min_contig=10000 and rerun MAKER inside the step 3 directory to get the smaller contigs annotated.  This should be your final step, although you can try altering other parameters or adding more evidence sources here etc.  


For other things to keep in mind, you should consider taking extra time to build a comprehensive library of repeats for your species, I know that Petromyzon marinus was virtually unannotatable until we had a very deep repeat library built for it.  Any repeat library should be used in all steps.  Also for lamprey, you will expect between 20,000-30,000 genes both because of your assemblies fragmentation and because of some ancestral genome duplication.  Also be aware that lampreys appear to undergo programmed genome loss in somatic tissues, so any gene count you get is only going to represent a maximum of 75-80% of all genes unless the assembly is derived from germline tissue.

Thanks,
Carson



Carson Holt

unread,
Dec 7, 2012, 10:27:22 AM12/7/12
to Parul Kudtarkar, Daniel Ence, maker...@yandell-lab.org
Don't use keep_preds on any of your steps. That should really only be
used for Fungi, Oomycetes, or some insect species. Other genomes with
long introns and short exons like tend to overcall so you will get way to
many false positives. I gave a more detailed explanation of what I would
do in my other e-mail.

--Carson

Carson Holt

unread,
Dec 7, 2012, 10:47:39 AM12/7/12
to Daniel Ence, maker...@yandell-lab.org, Parul Kudtarkar
For step 2 and 3, use lamprey rather than human in augustus.  For some reason the current version of augustus doesn't diplay it as an option for the species help menu (so I didn't see it), but it's there in …/augustus/config/species/lamprey

For any additional training, make copies into …/augustus/config/species/lamprey2 and …/augustus/config/species/lamprey3

Thanks,
Carson



From: Carson Holt <cars...@gmail.com>
Date: Friday, 7 December, 2012 10:22 AM
To: Daniel Ence <de...@genetics.utah.edu>, "maker...@yandell-lab.org" <maker...@yandell-lab.org>, Parul Kudtarkar <par...@caltech.edu>
Subject: Re: [maker-devel] AED score

.o

Parul Kudtarkar

unread,
Dec 7, 2012, 5:06:26 PM12/7/12
to Carson Holt, maker...@yandell-lab.org
Dear Carson,

Thanks for detailed explanation and help. Now we know the exact parameters
that should help us with generating good gene-model.

The genome for which we are working on gene predictions is
Echinoderm(sea-urchin). lamprey was the closest organism for training
augustus that I could find. A quick question for training augustus, there
is augustus_species option how would you go from training data generated
by zff2augustus_gbk.pl > train.gb as specified here
http://sourceforge.net/mailarchive/message.php?msg_id=29361270 to
generating the species folder that can be specified to augustus_species
option.

The average expected gene-length in our case is ~15kb.
We also have a good repeat library for our genome.

Parul Kudtarkar

unread,
Dec 7, 2012, 8:09:53 PM12/7/12
to Carson Holt, Daniel Ence, maker...@yandell-lab.org, Parul Kudtarkar
Sorry for not checking this before, just to answer my own question I just
found after reading augustus documentation that autoAugTrain.pl should
work for generating training dataset for Augustus from using the
gene-model hints from first pass run of maker and using
http://sourceforge.net/mailarchive/message.php?msg_id=29361270 script to
generate training.gb

autoAugTrain.pl [OPTIONS] --species=sname --trainingset=training.gb

I believe should work better than using lamprey to train augustus.

Thanks and regards,
Parul Kudtarkar

> Dear Carson,


>
> Thanks for detailed explanation and help. Now we know the exact parameters
> that should help us with generating good gene-model.
>
> The genome for which we are working on gene predictions is
> Echinoderm(sea-urchin). lamprey was the closest organism for training
> augustus that I could find. A quick question for training augustus, there
> is augustus_species option how would you go from training data generated
> by zff2augustus_gbk.pl > train.gb as specified here
> http://sourceforge.net/mailarchive/message.php?msg_id=29361270 to
> generating the species folder that can be specified to augustus_species
> option.
>
> The average expected gene-length in our case is ~15kb.
> We also have a good repeat library for our genome.
>

Carson Holt

unread,
Dec 7, 2012, 8:23:18 PM12/7/12
to Parul Kudtarkar, Daniel Ence, maker...@yandell-lab.org
Yes. Lamprey is not a good match for sea urchin. When training Augusuts
you can sometimes use other genomes as starting points and let Augustus
then modify that species to match the models you provided, which can be
better than de novo training under certain circumstances. But given sea
urchin's evolutionary distance from all the other species bundled with
augustus, its probably not a good starting point. So protein2genome
derived models and cegma based training are probably your best options.

--Carson

Parul Kudtarkar

unread,
Dec 10, 2012, 10:28:24 PM12/10/12
to Carson Holt, maker...@yandell-lab.org
Dear Carson, Thanks a lot for detailed explanation and all the help with
running and understanding maker2.

Carson Holt

unread,
Dec 11, 2012, 9:21:44 AM12/11/12
to Parul Kudtarkar, maker...@yandell-lab.org
Thanks Parul. If you need any more help, just let us know.

--Carson

Parul Kudtarkar

unread,
Dec 13, 2012, 5:16:53 PM12/13/12
to Carson Holt, maker...@yandell-lab.org
Dear Carson,

Could you please advice how to combine results from protein2genome
derived models and cegma based training to be provided as training species
for augustus or are theses supposed to be run as separate maker2 run(i.e.
run maker2 with training set from protein2genome first followed by
cegma-based training set)

Thanks and regards,
Parul

Carson Holt

unread,
Dec 14, 2012, 4:17:55 PM12/14/12
to Parul Kudtarkar, maker...@yandell-lab.org
You can either train with cegma first and then train a second time with
the protein2genome calls, or combine the cegma and protein2genome ZFF
files (from maker2zff and cegma2zff), then use that one file to generate
the training data for augustus. Augustus allows you to keep training, so
rather than training a new species each time, you can specify an existing
species you trained before, and then refine the existing model using the
additional training data.

--Carson

Reply all
Reply to author
Forward
0 new messages