<protein_match_example.png>
_______________________________________________
maker-devel mailing list
maker...@box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
On Jul 6, 2017, at 6:45 AM, Tim Fallon <tfa...@mit.edu> wrote:

Hi Carson,

This region is definitely entirely correct at the genomic nucleotide level, with no misassemblies. Would you have any strong reservations about ditching the ab-initio prediction and sticking entirely with the est2genome and protein2genome predictions? Right now this is what I’m thinking, as troubleshooting the ab-initio training seems like it could be a long road.

All the best,
-Tim
On Jun 26, 2017, at 6:00 PM, Carson Holt <cars...@gmail.com> wrote:
On Aug 19, 2017, at 10:38 AM, Tim Fallon <tfa...@mit.edu> wrote:
Hi Carson,

Just a follow-up to this, for posterity. I was able to do what I wanted by using just est2genome=1 and turning off protein2genome. The input to est2genome is a Trinity de novo transcriptome assembly (strand-specific libraries, strand-specific assembly, and jaccard clip). The results seem quite reliable, and I’m no longer getting the problem where similar tandem genes were being fused (the original problem behind this inquiry). I expect this is because there are enough nucleotide differences in the est2genome alignments of two similar tandem transcripts to effectively distinguish them.

In any event, it wasn’t clear to me that est2genome=1 alone would produce ORF/CDS predictions (for the genes), and I’ve done a lot of reading around the Maker documentation and papers. It might be worth considering making the documentation clearer in this respect in the future. I know that est2genome and protein2genome were originally intended more as an intermediate step for ab-initio gene predictor training, but in my opinion, given the quality and cost-effectiveness of RNA-Seq transcript discovery, it seems reasonable to ditch the ab-initio gene prediction and go entirely with an “est2genome=1”-like approach. It might be worthwhile to document what your thought process would be for reliable ab-initio-free gene annotation with Maker. I’ll mention that I haven’t looked into the PASA pipeline for this, which is the only other major publicly available gene structure annotation pipeline known to me, as the parallelization in Maker has been working quite well for me.

Are the heuristics for this ORF prediction in est2genome=1 documented anywhere? E.g., does it only pick the longest ORF per transcript? Or if there are multiple “good” ORFs (>200 amino acids) per transcript, will it try to split those into different genes?
I ask because my current task is to merge the previously mentioned de novo transcriptome-derived gene models from est2genome with est2genome gene models from a reference-guided transcriptome assembly. Although the reference-guided transcript assembly captures more genes than the Trinity assembly (by tblastn), its transcripts are notably artifactually chimeric, sometimes containing 4-5 CDSs, so the heuristics for the Maker est2genome could be pretty influential.

All the best,
-Tim

On Jul 13, 2017, at 1:05 PM, Carson Holt <cars...@gmail.com> wrote:

est2genome and protein2genome take BLAST hits, polish them with exonerate around splice sites, and then turn the alignment directly into a gene model. So if the alignment is partial because the EST or mRNA-seq does not cross the entire transcript, or the protein homology does not cross the entire CDS, then the resulting model will be partial. It can be end to end, but partial tends to be more common than not unless you are using a protein evidence library with limited divergence.

—Carson
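The ORF question above (longest ORF per transcript versus several ORFs over 200 amino acids) can be checked on a transcript set independently of MAKER. The following is a minimal Python sketch under stated assumptions: it scans only the forward strand (as would suit a strand-specific assembly), treats an ORF as ATG through the first in-frame stop, and uses hypothetical helper names `find_orfs` and `summarize`. It does not reproduce MAKER's est2genome heuristic, which this thread does not document.

```python
# Illustrative only: this is NOT MAKER's est2genome heuristic.
# Assumptions: forward strand only, ORF = first ATG through the next in-frame stop.
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_aa=200):
    """Return sorted (start, end, frame) tuples for forward-strand ORFs.

    Coordinates are 0-based and end-exclusive, and include the stop codon.
    An ORF is kept if it encodes at least min_aa amino acids (stop excluded).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open an ORF at the first ATG seen in this frame
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_aa:
                    orfs.append((start, i + 3, frame))
                start = None  # close the ORF and look for the next ATG
    return sorted(orfs)


def summarize(seq, min_aa=200):
    """Report the ORF count, the longest ORF, and whether more than one ORF passed."""
    orfs = find_orfs(seq, min_aa)
    longest = max(orfs, key=lambda o: o[1] - o[0], default=None)
    # More than one qualifying ORF is only a hint of a chimeric transcript:
    # overlapping ORFs in different frames are counted here as well.
    return {"n_orfs": len(orfs), "longest": longest, "multiple_orfs": len(orfs) > 1}
```

Transcripts for which `multiple_orfs` is true are candidates for closer inspection before they are passed to est2genome as evidence.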
On Jul 10, 2017, at 2:00 PM, Tim Fallon <tfa...@mit.edu> wrote:
Hi Carson,

So far, what I've noticed with just est2genome and protein2genome, using only de novo assembled transcripts with transdecoder-predicted peptides (both mapped in maker with blast evalue limit = 1e-50), is that the gene models (for the genes where I have enough information about the "correct" gene structure) have been full length. Is this unexpected?

Will try Apollo, though I'd like to avoid manual curation. Perhaps it is worth talking to the Augustus developers to see why Augustus was making the exon error in my key gene that led me to ditch it altogether.

Agreed that there are varying qualities of draft assemblies. In our case, we did a 100X Illumina hybrid assembly with 50X PacBio. The local structure so far seems to be pretty good.

Good to know that even the human and mouse assemblies have gene errors; it makes me feel better about how much time I've put into trying to get my genome annotation perfect :)

All the best,
-Tim
We don’t have an externally editable wiki, but the mailing list is archived on google groups and is searchable.
> As I was mentioning earlier in the thread, the ab-initio predictor (augustus) was making subtle errors (the splice donor site being ~12 nt downstream of the supported position), despite being trained (I trained through BUSCO, for ease) and having an aligned transcript “hint” which had the correct structure. I believe the maker configuration was correct. Beyond troubleshooting the augustus training, which seems a bit complicated, and doing manual curation / fixing of the gene models (which seems to be a bandaid over my potentially misconfigured augustus training?), going with a purely est2genome=1 approach seems to be a nice way to do it. Better, in my opinion, to have known unknowns (obvious errors, fragmented genes that are supported by transcript evidence) than unknown unknowns (subtle errors in exon-exon junctions from augustus).
Related to this, I just got off a conference call where we were looking at Augustus behavior, and a student did an experiment where they introduced early stop codons into 100 genes, then let Augustus predict again. 80% of the time Augustus altered splicing patterns to try to jump over the stop codon, 11% of the time it would truncate the transcript, and 9% of the time it would refuse to call anything. So when you see splicing errors, it is usually because something is affecting the ORF somewhere, so it alters splicing to extend the ORF and get the maximum scoring bonus by capturing downstream parts of features and hints.
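One way to act on this when a splice site looks shifted is to check whether the transcript-supported CDS contains an in-frame stop that a predictor might be splicing around. The following is a minimal Python sketch, assuming the CDS is supplied already spliced and in frame; the helper names `internal_stops` and `introduce_stop` are hypothetical, and how the student's experiment was actually implemented is not described in this thread.

```python
# Illustrative sketch: helper names are hypothetical, not part of Augustus or MAKER.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def internal_stops(cds):
    """Return 0-based codon indices of in-frame stops before the final codon."""
    cds = cds.upper()
    n_codons = len(cds) // 3
    # The final codon is skipped because a complete CDS is expected to end in a stop.
    return [k for k in range(n_codons - 1) if cds[3 * k:3 * k + 3] in STOP_CODONS]


def introduce_stop(cds, codon_index, stop="TAA"):
    """Return a copy of cds with the codon at codon_index replaced by a stop codon."""
    i = 3 * codon_index
    return cds[:i] + stop + cds[i + 3:]
```

An empty list from `internal_stops` means the supported structure has an open reading frame through to its last codon, so a shifted splice site would need another explanation.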
> A quick question: Could you confirm / deny that Maker doesn’t annotate non-coding RNA genes? E.g. I’ve picked up some rRNAs and ncRNAs in my de novo transcriptome, but my understanding is that est2genome and the ab-initio approach require that an ORF be present, hence no non-coding RNA genes (beyond the tRNAs and whatnot that can be specifically included).
MAKER only annotates tRNAs and whatever snoscan annotates. It does not annotate any other non-coding features.
—Carson
—Carson