[maker-devel] Maker output


Panos Ioannidis

Apr 23, 2015, 9:58:04 AM
to maker-devel
Hello Maker community,

I have, at last, finished annotating my genome with Maker (!) and have a few questions on the final output.

1. I used gff3_merge and fasta_merge to merge all the GFFs and all the different FASTA files that were produced during the runs (I split my assembly into smaller chunks that ran in parallel). Are these two scripts the only ones I have to run after Maker has finished? Am I leaving anything important behind?

2. I noticed that all my transcripts (both in the FASTA files and in the GFF) have the name "XXX-mRNA-1". Does the fact that I can't find any containing "mRNA-2" mean that there are no splice variants from the same gene?

3. In my *maker.proteins.fasta file I see that some proteins have a name like

snap_masked-XXX

whereas others (apparently also predicted by SNAP) have a name like

maker-XXX-snap-gene-XXX

What is the difference between these two genes that are both predicted by SNAP? From reading other posts on this list, I was left with the impression that all genes predicted by SNAP/Augustus that lie in a masked region (as the first name implies) are put in another FASTA file, named *maker.snap_masked.proteins.fasta.

4. By looking at a few genes in the *maker.transcripts.fasta file I came to the conclusion that only complete genes (i.e. with a start and a stop codon) are reported in this file. Am I right?

Thanks in advance,
Panos

Carson Holt

Apr 23, 2015, 11:36:41 AM
to Panos Ioannidis, maker-devel
Hi Panos,

See below.

1. I used gff3_merge and fasta_merge to merge all the GFFs and all the different FASTA files that were produced during the runs (I split my assembly into smaller chunks that ran in parallel). Are these two scripts the only ones I have to run after Maker has finished? Am I leaving anything important behind?

These scripts collect the results into a single file for convenience. Any additional analysis you want to perform downstream can be done using programs from GMOD or tools like BLAST2GO or InterProScan. Whether you want to do additional analysis, and which analyses you run, will depend entirely on your project.

Here is an example of analyses you may want to perform downstream of annotation, and how to integrate them into MAKER results: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Post_Processing_of_Annotations


2. I noticed that all my transcripts (both in the FASTA files and in the GFF) have the name "XXX-mRNA-1". Does the fact that I can't find any containing "mRNA-2" mean that there are no splice variants from the same gene?

Ab initio gene predictors (and MAKER itself) will not attempt to predict alternative splicing. You can tell MAKER to try to use the evidence alignments to identify potential instances of alternative splicing by setting alt_splice=1 in the control files. Most of the time you still won’t get back any alternative transcripts, though, because this option requires long, high-quality EST/mRNA-seq evidence to work.
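If you want to confirm whether a merged FASTA file contains any alternative transcripts, something like the following could be used. It is only a rough sketch: it assumes the "XXX-mRNA-N" naming quoted in the question, and the function names are made up for illustration (they are not part of MAKER).

```python
import re
from collections import defaultdict

# Assumed naming pattern, taken from the names quoted in this thread:
# "<gene name>-mRNA-<number>".
MRNA_SUFFIX = re.compile(r"^(?P<gene>.+)-mRNA-(?P<n>\d+)$")


def transcripts_per_gene(fasta_lines):
    """Map each gene name to the sorted transcript numbers seen in the headers."""
    genes = defaultdict(set)
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split()
        if not fields:
            continue  # empty header
        match = MRNA_SUFFIX.match(fields[0])  # first word of the header is the name
        if match:
            genes[match.group("gene")].add(int(match.group("n")))
    return {gene: sorted(numbers) for gene, numbers in genes.items()}


def genes_with_isoforms(fasta_lines):
    """Return the genes that have more than one transcript."""
    counts = transcripts_per_gene(fasta_lines)
    return sorted(gene for gene, numbers in counts.items() if len(numbers) > 1)
```

To run it on a real file, pass the open file handle, for example genes_with_isoforms(open(path)) where path points to your merged transcripts FASTA. An empty list means no gene has a second transcript.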


3. In my *maker.proteins.fasta file I see that some proteins have a name like

snap_masked-XXX

whereas others (apparently also predicted by SNAP) have a name like

maker-XXX-snap-gene-XXX

What is the difference between these two genes that are both predicted by SNAP? From reading other posts on this list, I was left with the impression that all genes predicted by SNAP/Augustus that lie in a masked region (as the first name implies) are put in another FASTA file, named *maker.snap_masked.proteins.fasta.

The maker-snap tag means that this result is from SNAP after it received hints about evidence alignments from MAKER (the hints actually alter SNAP's call). The snap tag without maker means that this is a raw ab initio call made by SNAP without MAKER input from evidence alignments (the model is based strictly on the mathematical probabilities in the HMM provided to SNAP). The masked tag means it came from SNAP run over the repeat-masked assembly. To run SNAP on the unmasked assembly, you can either turn off repeat masking or set the unmask=1 option in the control files (not recommended).
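The naming rules above can be turned into a quick check on model names. This is a sketch based only on the two name shapes quoted in this thread ("snap_masked-XXX" and "maker-XXX-snap-gene-XXX"). The function name and the returned keys are hypothetical, and contig names that themselves contain "-" or "_masked" could confuse it.

```python
def classify_model(name):
    """Describe a model name using the tags explained in this thread.

    evidence_hints: the name starts with "maker-", i.e. the predictor
                    was given hints from MAKER's evidence alignments.
    masked_tag:     one of the "-"-separated parts ends in "_masked",
                    i.e. a raw call on the repeat-masked assembly.
    """
    parts = name.split("-")
    return {
        "evidence_hints": parts[0] == "maker",
        "masked_tag": any(part.endswith("_masked") for part in parts),
    }
```

For example, classify_model("snap_masked-XXX") reports no evidence hints and a masked tag, while classify_model("maker-XXX-snap-gene-XXX") reports evidence hints and no masked tag.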


4. By looking at a few genes in the *maker.transcripts.fasta file I came to the conclusion that only complete genes (i.e. with a start and a stop codon) are reported in this file. Am I right?

No. SNAP, Augustus, and other programs can sometimes produce partial models. This usually happens at the edges of contigs or near stretches of Ns in the assembly (places where only partial models are available). You can try to force full end-to-end models by setting always_complete=1 in the control files.
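To see which models are partial, one option is to test each coding sequence for a start and a stop codon. The sketch below makes several assumptions: the input is the coding sequence only (a transcript that includes UTRs would need trimming first), the standard genetic code applies, and the function name is made up for illustration.

```python
# Stop codons of the standard genetic code (assumption: your organism uses it).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def cds_completeness(cds):
    """Report whether a coding sequence has a start codon, a stop codon, and a whole number of codons."""
    seq = cds.upper()
    has_start = seq.startswith("ATG")
    has_stop = len(seq) >= 3 and seq[-3:] in STOP_CODONS
    in_frame = len(seq) % 3 == 0
    return {
        "has_start": has_start,
        "has_stop": has_stop,
        "complete": has_start and has_stop and in_frame,
    }
```

A model missing either end would show up with "complete" set to False, which is what you would expect near contig edges or runs of Ns.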

Also look at the other sections of the MAKER tutorial linked above for more information on the annotation process.

Thanks,
Carson