[maker-devel] maker annotation with cufflinks output

920 views
Skip to first unread message

dhivya arasappan

unread,
Jan 30, 2014, 1:18:48 PM1/30/14
to maker...@yandell-lab.org
Hello,

I am trying to annotate a 200 mb plant genome for which I have a very
good assembly.

I tried to denovo assemble RNA-seq data using trinity and ran maker
using my genome assembly and the trinity results. I did not get as
many transcripts as expected, around 10,000 transcripts.

So, I decided to try a different approach. I did a genome assisted
assembly of the RNA-seq data using tophat/cufflinks. This pipeline
generated 21,000 genes, 29,000 transcripts. I then ran maker using my
genome assembly and the cufflinks result. I get much less number of
transcripts as a result.

If cufflinks found 29000 transcripts by mapping to the genome, I'm
confused as to why maker is not finding the same.

Any suggestions would be appreciated.

Thanks
Dhivya


_______________________________________________
maker-devel mailing list
maker...@box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Daniel Ence

unread,
Jan 30, 2014, 3:51:10 PM1/30/14
to dhivya arasappan, maker...@yandell-lab.org
Hi Dhivya,

I think there a few numbers that could be helpful to understand what's happening here.

How many transcripts did Trinity assembly the RNA-seq data into? Also, you had 29,000 transcripts from cufflinks, but fewer from MAKER when you gave it the cufflinks data. How many transcripts did MAKER identify with the cufflinks data? Did you still get more than the 10,000 transcripts that you found with just the Trinity data?

A key part of MAKER's approach to genome annotation that might be affecting it's performance is that it only annotates a gene where there is both evidence (like your RNA-seq data) and an ab-initio prediction. If a prediction is unsupported by the evidence, then MAKER won't annotate a gene and if evidence aligns where there's no prediction, MAKER won't annotate a gene either. What ab-initio predictors are you using and have they been trained specific genome?

You can force MAKER to automatically promote evidence alignments to a gene model by setting the est2genome option to 1, but that will usually give you many false positives.

Try rerunning it with either the Trinity data or the Cufflinks data and with est2genome set to 1, and let us know how that affects the MAKER results.

Thanks,
Daniel

Daniel Ence
Graduate Student
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
________________________________________
From: maker-devel [maker-dev...@yandell-lab.org] on behalf of dhivya arasappan [daras...@gmail.com]
Sent: Thursday, January 30, 2014 11:18 AM
To: maker...@yandell-lab.org
Subject: [maker-devel] maker annotation with cufflinks output

dhivya arasappan

unread,
Jan 30, 2014, 4:00:07 PM1/30/14
to maker...@yandell-lab.org

Carson Holt

unread,
Jan 30, 2014, 4:14:53 PM1/30/14
to Daniel Ence, dhivya arasappan, maker...@yandell-lab.org
What you get back from cufflinks should not necessarily be considered a
transcript count, and you should always expect the count given by
cufflinks to be high relative to assembly methods like trinity (especially
in plants). This is because repetitive elements, spurious alignments, and
pseudogenes will all inflate the count because it is an alignment based
method which can be more sensitive but will also generate a lot of false
positives. Fortunately the false positives will mostly be singe exon
results and will be filtered out by maker. Also your mRNA-seq data from
cufflinks will contribute to hints that can generate genes in the absence
of an ab-intio gene prediction, but if the gene finder doesn’t think the
hints make sense it will ignore them. So a lot of cufflinks results that
don’t make sense with respect to ORF etc., will fall into the category of
being ignored.

In addition, you should try running your pipeline through CEGMA
(http://korflab.ucdavis.edu/datasets/cegma/) to identify the expected
completeness of the genome. For example if a genome of 70% completeness
then you only expect to recover 70% of the genes. I believe CEGMA can also
be run online from the iPlant discovery environment and iPlant atmosphere
images. Also make sure you are including proteins with your MAKER run,
as not all genes will be expressed, so mRNAseq will only capture a portion
of the genes and that portion can be as low as 50%.

Thanks,
Carson

Sivaranjani Namasivayam

unread,
Jan 30, 2014, 7:59:56 PM1/30/14
to Carson Holt, Daniel Ence, dhivya arasappan, maker...@yandell-lab.org
Hi All,

This is a problem I have been having for quite some time; maker predicts much lower number of genes or proteins than in my evidence RNA-seq transcripts. My genome is not repetitive and is atleast 90% complete.

I tried setting est2genome to 1, but that still doesn't seem to increase the predicted gene set too much. If I input ~13000 genes(21000 transcripts) as evidence I get predictions of ~5000 genes(6000 transcripts).
I ran MAKER again with the transcripts that didn't have a gene model predicted in the first run, and this time MAKER predicted gene models for ~20-30% of those transcripts.

Is there anything that can be done to increase the predicted gene count?

Thanks,
Ranjani
________________________________________
From: maker-devel <maker-dev...@yandell-lab.org> on behalf of Carson Holt <cars...@gmail.com>
Sent: Thursday, January 30, 2014 4:14 PM
To: Daniel Ence; dhivya arasappan; maker...@yandell-lab.org
Subject: Re: [maker-devel] maker annotation with cufflinks output

Carson Holt

unread,
Jan 31, 2014, 2:20:40 AM1/31/14
to Sivaranjani Namasivayam, Daniel Ence, dhivya arasappan, maker...@yandell-lab.org
So just a few suggestions. If you are getting fewer genes than you expect,
that is usually an indication that the evidence provided is insufficient,
the gene predictors need to be retrained, the repeat masking is
insufficient, or the assembly has problems.

Here is more explanation on each point:

1. In addition to any mRNA/EST data, you should provided full proteomes
from a minimum of two species as closely related as possible, and perhaps
a comprehensive database such as UniProt/Swissprot. Note that based on
experience the comprehensive database cannot substitute for a related
species proteome, they can complement it, but not substitute for it. So
you need to supply full proteomes from something. mRNA/EST data is not
sufficient by itself, so make sure you have enough protein evidence.

2. All models are ultimately generated by the predictors (maker doesn’t
generate these), so care should be taken to train the predictors as best
as possible. Also train at least two predictors (SNAP and Augustus are
recommended). If they are both well trained, then they will be in general
concordance with one another. If they are not well trained, then each
program will produce very different models. So visually inspecting their
concordance can give you an idea of if they need to be retrained.

3. More often than not, poor predictor performance is actually the result
of repeat related complications. Many genomes that at first may seem
repeat poor may actually contain novel repeats that can affect the
performance of the gene predictors. If you are getting fewer genes than
you expect or ab initio models are not in concordance from two independent
predictors, run something like RepeatScout to generate species specific
libraries. This may seem minor, but I have seen predictions go from
apparently random to textbook perfect just by producing a species specific
library of novel repeats.

4. You can’t have gene models if you don’t have open reading frames to
translate through. Also gene predictors need sequence upstream and
downstream of genes to work correctly, so if contigs are too short they
won’t be useful for prediction even if the sum of the contigs is large
enough to encompass the whole genome. In general any contig smaller than
10kb is not annotatable, so you should aim for as high an N50 value as
possible.


Annotating a new genome is sort of like a moving target. No two organisms
are alike, so you usually have to to identify what deficiencies exist
based on preliminary runs and then correct for them in subsequent runs.

Thanks,
Carson

dhivya arasappan

unread,
Jan 30, 2014, 4:22:14 PM1/30/14
to Carson Holt, maker...@yandell-lab.org
Thank you for this information. Our server is currently down, so I'm
unable to get you all the statistics you've asked for. Once the server
is back up, I'll email again with more numbers

But I can tell you that I did run cegma first and got around 92%
completeness (full genes) and 98% completeness (partial genes). This
is why I'm even more puzzled by maker results.

Thanks again
Dhivya

Sujai

unread,
Jan 31, 2014, 3:39:26 AM1/31/14
to Carson Holt, dhivya arasappan, maker...@yandell-lab.org
Many thanks, Carson - for this fabulous post describing general
principles. You've hinted at some of these tips in other posts, but
it's great to have them in one place.

Thanks especially for the kinds of things that make a big difference
(point 3 below). More such tips always welcome!

Best wishes, and thanks for an amazing piece of kit.

- Sujai

dhivya arasappan

unread,
Feb 3, 2014, 11:31:16 AM2/3/14
to Daniel Ence, maker...@yandell-lab.org
Hi Daniel,

I was able to check on some of those questions.

1. From trinity assembly: I started with 102000 contigs. I used trinotate to annotate proteins in this.

I ran maker on this data with est2genome set to 1. The output looks like this (most important parts on top):

    6653 gene
   46675 exon
 280534 protein_match
59934 CDS
    969 contig
 105388 expressed_sequence_match
  12584 five_prime_UTR
  78565 match
1401369 match_part
  10180 mRNA
  11545 three_prime_UTR

2. From cufflinks assembly: I started with 133380 entries (out of which there are 29,000 transcripts).  I used the protein sequences from trinity assembly.

I ran maker on this data with est2genome set to 1. The output looks like this:
     29 gene
     75 exon
 573659 protein_match
67 CDS
   1099 contig
 269298 expressed_sequence_match
     23 five_prime_UTR
 173844 match
2221846 match_part
     29 mRNA
     23 three_prime_UTR

The genes annotated using the trinity assembly is lower than expected, so I went the cufflinks route. I dont understand why when using the cufflinks transcripts, even less genes are being found.

3. Training SNAP:  I used the results of maker from 1 to train SNAP.  I then used that training set to rerun maker:
snaphmm=/scratch/01184/daras/jansen/RHA/allpaths/maker_mpi_withAlltrinity/snap/RHA.hmm
est2genome=0

And again I got results with no entries for gene, exon, CDS etc.
957 contig
  46555 expressed_sequence_match
  43651 match
 553633 match_part
 113738 protein_match

As I mentioned in another email, cegma results indicated that the genome was more than 90% complete. Any suggestions would be helpful.

Thank you
Dhivya

dhivya arasappan

unread,
Feb 4, 2014, 5:43:19 PM2/4/14
to maker...@yandell-lab.org
Resending this since it didnt make it to the mailing list before.

Carson Holt

unread,
Feb 5, 2014, 1:38:50 PM2/5/14
to dhivya arasappan, Daniel Ence, maker...@yandell-lab.org
Do you have any features of type snap in your results from step 3?  We’ve had a couple of recent posts where after training snap was giving no results, and as a result maker couldn’t give any genes.  One cause of something like that may be your step 2.  Make sure the ZFF wasn’t empty you used to train with.  The maker2zff script uses filters to only put the best genes in the off file, and if all your genes fail the filtering then you are training with an empty ZFF.

Also you should use proteins from a related species as your protein file.  I see that you protein marches are varying wildly from run to run? So is your contig count?  Were the subset of contigs you have results for long enough to contain genes?

—Carson

Daniel Ence

unread,
Feb 5, 2014, 2:28:48 PM2/5/14
to Carson Holt, dhivya arasappan, maker...@yandell-lab.org
Hi Dhivya, Are the protein matches in your results coming from your annotations of the transcriptome? You should really use amino-acid sequences from related organisms and some kind of omnibus source like SwissProt. 

Thanks,
Daniel

Daniel Ence
Graduate Student
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

From: Carson Holt [cars...@gmail.com]
Sent: Wednesday, February 05, 2014 11:38 AM
To: dhivya arasappan; Daniel Ence
Cc: maker...@yandell-lab.org

dhivya arasappan

unread,
Feb 5, 2014, 3:13:57 PM2/5/14
to Daniel Ence, maker...@yandell-lab.org
Hello Daniel and Carson,

Thanks for your replies.

Yes I used the the protein sequences resulting from annotation of trinity assembly (using trinotate).  I'll try using protein sequences from related species (though there arent sequences from closely related orgs).  Could you tell me a little about why protein data from annotating my rnaseq data would not work best here?

Thanks
Dhivya

Daniel Ence

unread,
Feb 5, 2014, 3:36:26 PM2/5/14
to dhivya arasappan, maker...@yandell-lab.org
Hi Dhivya, 

In genome annotation, often you want to use as many sources for evidence as is reasonable, but those sources should be distinct.  It will confuse downstream annotation efforts if your protein evidence is actually based on the RNA-seq data. 

Using the trinotate results for protein evidence here restricts you first to the proteins coded by the transcripts in the RNA-seq data, which may be incomplete, and secondly to the proteins that trinotate could annotate from among the transcripts. 

The problem that Carson mentioned with the SNAP HMM file is a real possibility also. 

Thanks,
Daniel

Daniel Ence
Graduate Student
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

From: dhivya arasappan [daras...@gmail.com]
Sent: Wednesday, February 05, 2014 1:13 PM
To: Daniel Ence
Cc: Carson Holt; maker...@yandell-lab.org

Carson Holt

unread,
Feb 5, 2014, 3:38:44 PM2/5/14
to dhivya arasappan, Daniel Ence, maker...@yandell-lab.org
Protein data doesn’t have to be from that closely a related species.  This is because genes maintain homology at the amino acid level across even very large evolutionary distances.  Having a closer related species just ensures that genome contents are similar (fewer losses/gains relative to each other). And use the entire proteome of at least one related species (just using a database like swiss-prot is not sufficient).

Using translated mRNA-seq data will not give you any new information that was not already available from the untranslated sequence.  Plus it will introduce the complicating artifacts that mRNA-seq generates into the protein part of the pipeline (gene merging, incorrect assembly, and false calls caused by background transcription).  A big gotcha with mRNA-seq is that all of your genome gets transcribed at a low level, not just the genes, so you will always have contamination that does not represent real gene models.  Also in the end you really only expect to capture about 50% of the genes with mRNA-seq (maybe 70% if you are fortunate - and most of those will be partial). So using the proteins from another species, is important to improve sensitivity, and fix many of the issues that arise from the noisy nature of mRNA-seq.  In fact if you were forced to use only one (either protein evidence or mRNA-seq) you will actually get better annotations from the protein evidence in most cases. You get better annotations when you use both, but if using only one of them, the proteins from another species are better, and noisy mRNA-seq will be the primary source of annotation error.

Thanks,

dhivya arasappan

unread,
Feb 6, 2014, 12:16:43 AM2/6/14
to Carson Holt, maker...@yandell-lab.org
Thank you both for those explanations. I'll get back to you after I try rerunning maker.

Dhivya

dhivya arasappan

unread,
Feb 6, 2014, 9:52:12 AM2/6/14
to maker...@yandell-lab.org
Hello,

I does appear than my genome.ann file from maker2zff script has data in it. However, the SNAP steps after that have created empty files.  The following are all empty:

alt.dna  err.dna  export.dna  genome.dna  olp.dna  uni.dna  wrn.dna
alt.ann  err.ann  export.ann  genome.ann  olp.ann  uni.ann  wrn.ann

When I tried to get gene stats or validate genome.ann, I get errors like this for all of them:

fathom genome.ann genome.dna -gene-stats |more
MODEL5547 1 1 6 + errors(6): exon-1:out_of_bounds exon-2:out_of_bounds exon-3:out_of_bounds exon-4:out_of_bounds exon-5:out_of_bounds exon-6:out_of_bounds
MODEL5568 1 1 6 - errors(6): exon-6:out_of_bounds exon-5:out_of_bounds exon-4:out_of_bounds exon-3:out_of_bounds exon-2:out_of_bounds exon-1:out_of_bounds
MODEL5589 1 1 5 + errors(5): exon-1:out_of_bounds exon-2:out_of_bounds exon-3:out_of_bounds exon-4:out_of_bounds exon-5:out_of_bounds
MODEL5195 1 1 21 + errors(21): exon-1:out_of_bounds exon-2:out_of_bounds exon-3:out_of_bounds exon-4:out_of_bounds exon-5:out_of_bounds exon-6:out_of_bounds exon-7:out_of_bounds exon-8:out_of_bounds exon-9:out_of_bounds exon-10:out_of_bounds exon-11:out_of_bounds exon-12:out_of_bounds exon-13:out_of_bounds exon-14:out_of_bounds exon-15:out_of_bounds exon-16:out_of_bounds exon-17:out_of_bounds exon-18:out_of_bounds exon-19:out_of_bounds exon-20:out_of_bounds exon-21:out_of_bounds

I'm not sure why the annotation I'm seeing in genome.ann are all showing up as errors. I realize this may be an issue with snap, but are you familiar with anything like this? Snippet of my genome.ann file is attached (since its too big for the list) for reference.

Thanks
Dhivya

head.genome.ann
head.genome.dna

Carson Holt

unread,
Feb 6, 2014, 11:05:05 AM2/6/14
to dhivya arasappan, maker...@yandell-lab.org
Your genome.dna file has no sequence?  Did you by any chance strip the fasta sequence from the GFF3 you are using as input to maker2zff?  There should be fasta sequence at the end of that file.  Also can I see the GFF3 file you are using as input to maker2zff.

Thanks,
Carson

From: dhivya arasappan <daras...@gmail.com>
Date: Thursday, February 6, 2014 at 7:47 AM
To: Carson Holt <cars...@gmail.com>
Cc: Daniel Ence <de...@genetics.utah.edu>, "maker...@yandell-lab.org" <maker...@yandell-lab.org>
Subject: Re: [maker-devel] maker annotation with cufflinks output

I'm not sure why the annotation I'm seeing in genome.ann are all showing up as errors. I realize this may be an issue with snap, but are you familiar with anything like this? My genome.ann file is attached for reference.

Thanks
Dhivya

Carson Holt

unread,
Feb 6, 2014, 12:04:25 PM2/6/14
to dhivya arasappan, maker...@yandell-lab.org
Could you give me the file without using 'head’ to trim it, its cutting it before it reaches the part I’m interested in.

—Carson


From: dhivya arasappan <daras...@gmail.com>
Date: Thursday, February 6, 2014 at 10:01 AM
To: Carson Holt <cars...@gmail.com>
Cc: Daniel Ence <de...@genetics.utah.edu>, "maker...@yandell-lab.org" <maker...@yandell-lab.org>
Subject: Re: [maker-devel] maker annotation with cufflinks output

Oh yes I did- I took just the non sequence entries in the gff file and used that as my input.  I will rerun snap with the gff file containing the sequences as well. 

I'm attaching a snippet of the gff file that I used as input to maker2zff.

Thanks for your help
Dhivya

dhivya arasappan

unread,
Feb 6, 2014, 12:01:44 PM2/6/14
to Carson Holt, maker...@yandell-lab.org
head.cat.formatted.gff

dhivya arasappan

unread,
Feb 11, 2014, 1:48:23 PM2/11/14
to Carson Holt, maker...@yandell-lab.org
With your suggested changes (using a protein file not derived from the RNA-seq data and fixing the gff file for training SNAP), I was able to increase the number of genes from 6000+ to 18116.  

I'm now trying to evaluate the quality of the annotation.  I have a question about the usage for mpi_evaluator.

In the maker tutorial,  the usage is given as:

 mpi_evaluator [options] <eval_opts> <eval_bopts> <eval_exe>
What files are being referred to in the input parameters: eval_opts, eval_bopts and eval_exe?

Thanks 
Dhivya

On Feb 6, 2014, at 11:47 AM, Carson Holt wrote:

Ok.  Content looks good.  Just make sure to use gff3_merge to join the GFF3’s without stripping out the fasta sequence at the end when training SNAP.

Thanks,
Carson


From: dhivya arasappan <daras...@gmail.com>
Date: Thursday, February 6, 2014 at 10:29 AM
To: Carson Holt <cars...@gmail.com>
Cc: Daniel Ence <de...@genetics.utah.edu>
Subject: Re: [maker-devel] maker annotation with cufflinks output

Sorry I was just trying to make it small enough to be approved by the mailing list.

Here is the whole file:



Carson Holt

unread,
Feb 11, 2014, 1:55:38 PM2/11/14
to dhivya arasappan, maker...@yandell-lab.org
I wouldn’t use mpi_evaluator.  It is buggy and has virtually no documentation.  The AED values are the best way to identify which genes are higher and lower quality.  You can also run interproscan to identify protein domain content as an independent evaluation. Look at this paper here —> http://www.biomedcentral.com/1471-2105/12/491 

Figure 4 has a nice example of how AED, domain content, and gene orthology correlate to show the quality of different subsets of genes in seven ant genomes.

If you choose to try mpi_evaluator it uses the -CTL option to generate empty files that you then fill in.
Reply all
Reply to author
Forward
0 new messages