[maker-devel] Too many genes?

824 views
Skip to first unread message

Annabel Beichman

unread,
Oct 17, 2016, 3:20:56 PM10/17/16
to maker...@yandell-lab.org
Hi Carson et al.,

Thanks so much for such a great pipeline, tutorials and advice pages.

I have just finished four rounds of annotation in Maker on the sea otter genome which we assembled using Meraculous shotgun assembly + Dovetail Genomics HiRise scaffolding.

Rounds I & II: In the first two rounds, I trained Augustus and Snap on 400 scaffolds > 500kb using mRNA-seq data assembled in Trinity, and protein data from Ensembl for ferret, dog and cat.

Round III: Then, using the trained gene predictors (Augustus showed spec/sens > 90%), I annotated all scaffolds >50kb.

Round IV: Based on reading emails in this group, I then decided to make a custom repeat library, and re-run maker one last time using my trained gene predictors, custom repeat library, and 1200 scaffolds >15kb.

I found my number of genes dropping each round, as you suggest they should (47465 after Round I, 27289 after round II, 25847 after round III, and 25031 after round IV).

However, this final gene count (25,031) still seems to high too me, and I was wondering if you had some advice for filtering? Using BUSCO, our assembly is 78% complete, and the final annotation is 72% complete. However, I am getting 25,000+ annotated genes; 22,000+ of which are below an AED and eAED cutoff of 0.5. This seems like far too many genes for a mammal genome that is only ~75% complete. I would have expected to get something more like 15-20,000 genes.

22870 of the Maker-annotated proteins have BLAST hits to SwissProt/UniProt (e value 1e-03), but only 13,000 annotated proteins have orthologs in the ferret, the otter’s closest relative (e value 1e-05 using ProteinOrtho). 900 genes do not have any BLAST hits in SwissProt/UniProt, but have AED/eAED scores of 0.00 – when I visualize them in Jbrowse they have a Trinity read as evidence, but nothing else. Could these be Trinity artefacts? I also notice that my SNAP tracts are very long (some almost as long as the whole scaffold).

I am designing an exome-capture array based on this annotation, and so am trying to filter the gene models to have a set of genes that we can be fairly confident in, but also trying not to miss real gene models. Could you please advise me on how to filter down the gene models, or what might be happening to cause the excess of genes? The most conservative gene list would be the 13,000 genes that are ferret orthologs. But I would like to salvage more genes if possible, if you can suggest a way to parse out real genes from among the ones that do not have ferret orthologs, but do have Blast hits to SwissProt? Would you recommend any additional filters on gene length, etc.?


Not sure if this is significant, but one thing I’ve noticed is that many of the genes with Blast hits in SwissProt but no ferret orthologs often have several similar genes in a row along the same scaffold:
ScbS9RH_101185 30796 38760 + ELUT_00004195-RA ELUT_00004195 Name=ELUT_00004195-RA 0.08 0.17 Similar to ANO3: Anoctamin-3 (Homo sapiens)
ScbS9RH_101185 42617 51087 + ELUT_00004196-RA ELUT_00004196 Name=ELUT_00004196-RA 0.25 0.26 Similar to ANO3: Anoctamin-3 (Homo sapiens)
ScbS9RH_101185 87006 87827 + ELUT_00004198-RA ELUT_00004198 Name=ELUT_00004198-RA 0.18 0.18 Similar to Ano3: Anoctamin-3 (Mus musculus)
ScbS9RH_101185 110043 122523 + ELUT_00004199-RA ELUT_00004199 Name=ELUT_00004199-RA 0.09 0.09 Similar to ANO3: Anoctamin-3 (Homo sapiens)

Thank you all so much for your help and advice!

[I also want to report an odd behavior, that may be specific to our server – when the number of scaffolds being annotated using maker drops below the number of cores (e.g. usning openmpi with 45 cores available, but there are only 44 scaffolds left), maker crashes. I then have to restart it with fewer cores, and it will crash again once the number of remaining scaffolds drops below the new lower number of cores. This makes finishing a run of Maker a bit like Zeno’s paradox, where it gets very slow for the last two days of the run due to the stopping and restarting.]

Best wishes,
Annabel Beichman
Wayne Lab/Lohmueller Lab
Ecology & Evolutionary Biology
UCLA
Annabelbeichman.com
_______________________________________________
maker-devel mailing list
maker...@box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Carson Holt

unread,
Oct 17, 2016, 4:25:57 PM10/17/16
to Annabel Beichman, maker...@yandell-lab.org
Better training and repeat masking will result in fewer false positive gene calls. Depending on how many contigs there are in the genome, you may also get gene fragmentation (genes split across contigs or genes split due to short runs of NNNNN within a contig). Fragmented genes tend to lack start or stop codons. Finally pick a few of the contigs with the highest gene density and look at them in a browser. If one of the gene predictors you are using (SNAP or Augustus) does not have good concordance with the models, you may want to drop the predictor (sometimes a predictor does not work well on a particular genome for one reason or another - SNAP tends to have issues with mammalian genomes for example). Also when looking at the contig, if you see contig consisting of only single exon genes then you may have some prokaryotic contamination (they assemble as independent gene dense contigs - so a good thing to look at if gene counts are high). Finally high gene counts can mean that repeats are still under masked (repeats encode real proteins like transposases).

You can also scan all resulting models with InterProScan to see what fraction contain identifiable protein domains (a well annotated genome will have ~75-85% of genes with an InterPro domain).

—Carson

Annabel Beichman

unread,
Oct 17, 2016, 7:13:38 PM10/17/16
to Carson Holt, maker...@yandell-lab.org
Thank you so much for all these suggestions, Carson! I will give them a try, particularly dropping SNAP as it definitely doesn’t show great concordance compared to Augustus.

Do you have any additional recommendations for improving my repeat masking? I have already made a custom repeat library in repeatmodeler following this tutorial: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic and have model_org=all and repeat_protein=/home/opt/maker/data/te_proteins.fasta

My interproscan results have ~73% of my total genes (including genes with high AED scores) with Pfam domains, so it at least seems like I’m on the right track.

Thanks so much again,

~ Annabel

Carson Holt

unread,
Oct 17, 2016, 8:10:23 PM10/17/16
to Annabel Beichman, maker...@yandell-lab.org
It sounds like your repeat masking is probably sufficient. Perhaps just the change of removing SNAP this time will give you what you want.

—Carson

Annabel Beichman

unread,
Oct 28, 2016, 7:11:30 PM10/28/16
to maker...@yandell-lab.org
re-sending this to the list without attachments as they were too large

Cheers,
Annabel
> On Oct 28, 2016, at 4:04 PM, Annabel Beichman <annabel....@gmail.com> wrote:
>
> Hi Carson,
> Re-running Maker without SNAP definitely improved things, as did filtering out fragmented genes without start/stop codons. Thank you!
>
> However, I’m still seeing an odd pattern that I wonder if you have any ideas about:
>
> For the set of ~6000 genes that do not have orthologs in the ferret, but do have start/stop codons and are below AED/eAED of 0.5, I am seeing duplication of BLAST annotations for ~2,600 of the gene models, particularly gene models that are in a row on a scaffold. I’ve thrown the genes with duplicate blast annotations into the attached excel file so you can see the patterns I’m describing.
> For example, there is a similar annotation for two genes in a row on a scaffold, both of which have low AED/eAED scores and start/stop codons (also visualized in attached Jbrowse screenshot):
>
> Scaffold Start Stop Strand GeneID mRNALength #Exons BlastInfo
> ScbS9RH_82700 41318 49503 + ELUT_00017706 8185 3 Similar to Cdh13: Cadherin-13 (Mus musculus)
> ScbS9RH_82700 99358 103910 + ELUT_00017707 4552 3 Similar to Cdh13: Cadherin-13 (Mus musculus)
>
> I am trying to filter out false positive gene models as I make my exome capture design so wondered if you had any tips on what might be going on here. Paralogs? Artifacts of the assembly? Is the gene with the most exons likely to be the original gene? Should I filter sets of duplicates by those that have IPR domains?
>
> Secondly, I also notice 250 of these repeat genes are annotated as 40S or 60S ribosomal protein genes. Do you expect to see this many (I know there are usually many rDNA genes) or could this number be inflated due to ribosomal RNA in the RNA-seq reads? (I carried out poly-A selection prior to sequencing)
>
> Thanks so much again for your help!
>
> ~ Annabel

Carson Holt

unread,
Oct 28, 2016, 7:23:20 PM10/28/16
to Annabel Beichman, maker...@yandell-lab.org
You need to look at some of the contigs in a browser. Look at the most gene dense ones first (density = gene_count/contig_length). You may have prokaryiotic contamination if you are seeing a lot of contigs containing primarily single exon gene models. Also make sure you still left model_org=all on after adding the species specific library (the species specific library is to supplement RepBase as opposed to replace it).

Some locations where you are seeing neighboring genes with similar blast hits (Cadherin) may infact be one gene that was split, either because evidence insufficiently clusters (perhaps the max intron size is set too low in the control files), or perhaps the assembly has runs of NNNN that do not permit the gene predictor to create a spanning model (not uncommon). If you are using Apollo to view the genes you can zoom in around evidence alignments until you see the sequence, and often you will see clusters of NNNN in the sequence around evidence HSP breakpoints.

—Carson

> On Oct 28, 2016, at 5:04 PM, Annabel Beichman <annabel....@gmail.com> wrote:
>
> Hi Carson,
> Re-running Maker without SNAP definitely improved things, as did filtering out fragmented genes without start/stop codons. Thank you!
>

> <Examples_DuplicateBlastEntries_forCarson.xlsx>However, I’m still seeing an odd pattern that I wonder if you have any ideas about:


>
> For the set of ~6000 genes that do not have orthologs in the ferret, but do have start/stop codons and are below AED/eAED of 0.5, I am seeing duplication of BLAST annotations for ~2,600 of the gene models, particularly gene models that are in a row on a scaffold. I’ve thrown the genes with duplicate blast annotations into the attached excel file so you can see the patterns I’m describing.
> For example, there is a similar annotation for two genes in a row on a scaffold, both of which have low AED/eAED scores and start/stop codons (also visualized in attached Jbrowse screenshot):
>
> Scaffold Start Stop Strand GeneID mRNALength #Exons BlastInfo
> ScbS9RH_82700 41318 49503 + ELUT_00017706 8185 3 Similar to Cdh13: Cadherin-13 (Mus musculus)
> ScbS9RH_82700 99358 103910 + ELUT_00017707 4552 3 Similar to Cdh13: Cadherin-13 (Mus musculus)
>
> I am trying to filter out false positive gene models as I make my exome capture design so wondered if you had any tips on what might be going on here. Paralogs? Artifacts of the assembly? Is the gene with the most exons likely to be the original gene? Should I filter sets of duplicates by those that have IPR domains?
>
> Secondly, I also notice 250 of these repeat genes are annotated as 40S or 60S ribosomal protein genes. Do you expect to see this many (I know there are usually many rDNA genes) or could this number be inflated due to ribosomal RNA in the RNA-seq reads? (I carried out poly-A selection prior to sequencing)
>

> Thanks so much again for your help!<Screen Shot 2016-10-28 at 4.02.46 PM.png>
>
> ~ Annabel

Carson Holt

unread,
Oct 28, 2016, 7:28:19 PM10/28/16
to Annabel Beichman, maker...@yandell-lab.org
Also if you labeled putative function using BLAST results, make sure you set the expect value sufficiently low to filter out false homology. Otherwise you will be labeling off the best hit, which may in fact have a very poor score, but because it’s the best one. The threshold value should never be higher than 1e-6. You can go all the way down to 1e-10 if necessary.

—Carson

Annabel Beichman

unread,
Oct 28, 2016, 7:36:23 PM10/28/16
to Carson Holt, maker...@yandell-lab.org
Thank you so much, Carson, for such a rapid reply!

I have checked the prokaryotic issue and it looks okay — my most gene-dense contigs all have multi-exon genes. I will re-blast with a more stringent cutoff as well. I think your theory about the NNNNNNs might be spot on. The assembly is by Dovetail Genomics and they insert many NNNNNs as they join contigs together into the long scaffolds, which would disrupt the gene models. Is there any way to salvage the genes that are split around the NNNNs? Or should I just leave them out of my analyses?

Thanks again,
~ Annabel

Carson Holt

unread,
Oct 28, 2016, 7:49:47 PM10/28/16
to Annabel Beichman, maker...@yandell-lab.org
The NNNN’s both preclude alignment and prediction, so unless they occur in an intron, it results in a split model (many times runs of NNN may just be a few base pairs long, but if they occur in the exon, you can’t really work around it). The predictors work off of a maximum score, so the ab initio predictor ends up finding some way of terminating the model around the NNN’s that scores well even though it does not reflect the biology. Sometimes you can try and force things in manually (non-canonical splice sites etc.) if it is an important gene (Web-Apollo even allows you to insert SNPs and INDELS to correct the ORF, but it’s a labor intensive manual process).

So short answer. You should investigate if you see these in a browser. If you do have them, then you will have to decide how to handle them depending on the analysis (perhaps take the longer one?).

Take some time just viewing alignments and models to get a feel of how evidence and gene models should correlate. There really is no substitute for visual manual review.

—Carson

Reply all
Reply to author
Forward
0 new messages