errors in evm.out.log files

mtbr...@ucdavis.edu

Jan 23, 2019, 9:30:31 PM

to EVidenceModeler-users

Hi Brian:

Following up on my previous posts to the PASA group, I have pieced together the input for EVM. I have two gff3 files:

gene_predictions.gff3 was generated with transdecoder to have the "full" features (gene/mRNA/exon/CDS). These are remapped RefSeq transcripts, so I have weighted them high.

transcript_alignments.gff3 is the concatenation of the PASA-generated gff3 files ("cDNA_match" format) from alignments of an old, filtered Trinity assembly and new Isoseq high-quality clusters. Each of those is designated differently in field #2 of the gff3. I made some edits to make sure that the IDs were unique between the two datasets. I reduced the weighting of the Trinity transcripts in weights.txt.
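Because the weights are matched to the evidence by the value in field #2, a quick consistency check between the gff3 sources and weights.txt can rule out a mismatch. The following is only a sketch: it assumes weights.txt is whitespace-delimited with three columns (evidence class, source, weight), and the source names and weights in the example data are hypothetical, not taken from the files in this thread.

```python
def gff3_sources(gff3_text):
    """Return the set of values found in column 2 (source) of GFF3 feature lines."""
    sources = set()
    for line in gff3_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) >= 9:
            sources.add(cols[1])
    return sources


def weighted_sources(weights_text):
    """Return {source: weight} from a weights file.

    Assumed layout: evidence class, source, weight (whitespace-delimited).
    """
    weights = {}
    for line in weights_text.splitlines():
        parts = line.split()
        if len(parts) >= 3 and not parts[0].startswith("#"):
            weights[parts[1]] = float(parts[2])
    return weights


def unweighted(gff3_text, weights_text):
    """Sources that appear in the GFF3 but have no entry in the weights file."""
    return sorted(gff3_sources(gff3_text) - set(weighted_sources(weights_text)))


# Made-up example data; the source names and weights are hypothetical.
example_gff3 = "\n".join([
    "Chr02\tpasa_trinity\tcDNA_match\t100\t200\t.\t+\t.\tID=t1;Target=a 1 101 +",
    "Chr02\tpasa_isoseq\tcDNA_match\t300\t400\t.\t+\t.\tID=i1;Target=b 1 101 +",
])
example_weights = "TRANSCRIPT\tpasa_isoseq\t10\n"

print(unweighted(example_gff3, example_weights))
```

With real files, the two text arguments would simply be the contents of transcript_alignments.gff3 (or gene_predictions.gff3) and weights.txt.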

I partitioned everything, and am running the partitions on our cluster. The contents of a representative partition (plus a few related files) can be accessed at https://bioshare.bioinformatics.ucdavis.edu/bioshare/view/n2jb7o039u9zpi9/. (See Chr02_1-500000.command.txt for the command run.)

When I look at evm.out.log, there are several lines stating that particular predictions "fail validation", and there are also multiple "Error with prediction" lines at the end of this file. I did validate the original (genome-wide) gene_predictions.gff3 with gff3_gene_prediction_file_validator.pl and nothing was flagged. However, if I run this validation within the partition, I get a fatal error:

gff3_gene_prediction_file_validator.pl gene_predictions.gff3

Fatal Error, cannot locate data entry for ID: [NW_017442718.1_mrna_XM_018957817.1_43458.p1] at /software/evidence_modeler/1.1.1/lssc0-linux/EvmUtils/gff3_gene_prediction_file_validator.pl line 125.

Is there a problem with one or more of the partitioned gff3 files that is causing these errors, and if so, how do I fix it?
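One way to narrow this down is to look for features in the partitioned file whose Parent points at an ID that is not present in that same file, for example because a gene record was split across a partition boundary. Whether that is what triggers the validator's "cannot locate data entry" message is an assumption; the sketch below does not reproduce the validator's logic, and the feature IDs in the example are made up.

```python
def dangling_parents(gff3_text):
    """Return (line_number, parent_id) pairs whose parent ID is never defined."""
    ids = set()
    parents = []  # (line_number, parent_id) for every Parent reference seen
    for n, line in enumerate(gff3_text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 9:
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(
            field.split("=", 1)
            for field in cols[8].strip().split(";")
            if "=" in field
        )
        if "ID" in attrs:
            ids.add(attrs["ID"])
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                parents.append((n, parent))
    return [(n, p) for n, p in parents if p not in ids]


# Hypothetical partition fragment: the gene g1 and the mRNA m2 are missing.
example = "\n".join([
    "Chr02\ttransdecoder\tmRNA\t100\t900\t.\t+\t.\tID=m1;Parent=g1",
    "Chr02\ttransdecoder\texon\t100\t900\t.\t+\t.\tID=m1.exon1;Parent=m1",
    "Chr02\ttransdecoder\tCDS\t100\t900\t.\t+\t0\tID=cds.m2;Parent=m2",
])

for line_number, parent_id in dangling_parents(example):
    print(f"line {line_number}: Parent={parent_id} has no matching ID")
```

Running the same function on the genome-wide file and on the partition file would show whether the partitioning step dropped any parent records.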

I looked at that region in IGV, and for the most part the predicted gene structures are very similar to those in gene_predictions.gff3. However, I do see that for some genes with multiple isoforms, only one isoform is kept by EVM. See Chr02_igv_snapshot1.png for an example of one of these genes, where only the shortest isoform is retained. Is the "loss" of the longer isoforms related to these errors, or something else?

Any suggestions? Other info that I could provide that would be helpful?

Thanks so much,

Monica

Brian Haas

Jan 24, 2019, 9:46:11 AM

to Monica Britton, EVidenceModeler-users

Hi Monica,

I downloaded your example and ran it through to see what kinds of errors popped up. My command was:

~/GITHUB/EVidenceModeler/evidence_modeler.pl -G genome.fa -g gene_predictions.gff3 -w weights.txt -e transcript_alignments.gff3 -r RepeatModeler.gff

The stderr and stdout are attached.

The non-fatal errors that I see showing up are pretty regular ones, like this for example:

Sorry, prediction transdecoder_NW_017388807.1_mrna_XM_018977167.1_649.p1 fails validation.
(CAT) GG 271739-272313 GT TGG
(GCC) AG 272816-272931 GC AGG
(GTT) AG 273047-273169 GT AGG
(GGT) AG 273271-273456 GT AGG
(GTC) AG 273716-273931 GT AGG
(TGT) AG 274970-275187 GT AGG
(TGA) AG 276219-276294 GT AGG
(CTA) AG 276379-276468 GT AGG
(GCT) AG 276829-278003 GT GTG
-recovered transdecoder_NW_017388807.1_mrna_XM_018977167.1_649.p1, internal, 272816, 272931
-recovered transdecoder_NW_017388807.1_mrna_XM_018977167.1_649.p1, internal, 273047, 273169
-recovered transdecoder_NW_017388807.1_mrna_XM_018977167.1_649.p1, internal, 273271, 273456
-recovered transdecoder_NW_017388807.1_mrna_XM_018977167.1_649.p1, internal, 273716, 273931
-recovered transdecoder_NW_017388807.1_mrna_XM_018977167.1_649.p1, internal, 274970, 275187
-recovered transdecoder_NW_017388807.1_mrna_XM_018977167.1_649.p1, internal, 276219, 276294
-recovered transdecoder_NW_017388807.1_mrna_XM_018977167.1_649.p1, internal, 276379, 276468
-recovered transdecoder_NW_017388807.1_mrna_XM_018977167.1_649.p1, internal, 276829, 278003

The above is a case where the coding prediction is not full-length, so it doesn't start with a start codon, and it looks like it doesn't end at a stop codon either.

In these cases, EVM uses what it can and incorporates the exons that it can classify as internal, initial, or terminal exons, based on splice dinucleotides and start/stop codons. That's where the 'recovered...' messages come in.
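To make the idea concrete, here is a simplified illustration of classifying an exon from its flanking splice dinucleotides and its first and last codons. It is not EVM's actual code: it only handles the plus strand, ignores reading frame, and the accepted donor set (GT or GC) is an assumption drawn from the dinucleotides visible in the log above. The toy genome is made up.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
DONORS = {"GT", "GC"}   # assumed set of accepted donor dinucleotides
ACCEPTOR = "AG"


def classify_exon(genome, start, end):
    """Classify a plus-strand coding exon given 1-based inclusive coordinates."""
    exon = genome[start - 1:end]
    upstream = genome[max(0, start - 3):start - 1]  # two bases before the exon
    downstream = genome[end:end + 2]                # two bases after the exon
    has_start = exon[:3] == "ATG"
    has_stop = exon[-3:] in STOP_CODONS
    has_acceptor = upstream == ACCEPTOR
    has_donor = downstream in DONORS
    if has_start and has_stop:
        return "single"
    if has_start and has_donor:
        return "initial"
    if has_acceptor and has_donor:
        return "internal"
    if has_acceptor and has_stop:
        return "terminal"
    return "unclassified"


# Toy genome: three exons separated by introns with GT...AG boundaries.
genome = (
    "CC"
    + "ATGAAA"   # exon 1, positions 3-8
    + "GT" + "CCCCAG"
    + "GGGCCC"   # exon 2, positions 17-22
    + "GT" + "TTTTAG"
    + "CCCTAA"   # exon 3, positions 31-36
    + "GG"
)

for start, end in [(3, 8), (17, 22), (31, 36), (11, 16)]:
    print(start, end, classify_exon(genome, start, end))
```

An exon that fits none of the categories, like the last span in the loop, is the kind of piece that could not be reused from a prediction that fails validation.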

In regard to cases where there are multiple isoforms, EVM is only capable of modeling a single isoform structure and only the coding regions for that isoform. In this case, it should pick out the 'best' (as defined by the scoring system and weights), and it'll lack any UTR exon annotations.

What we would normally do is run EVM, and then use the EVM predictions as an annotation input to PASA, which adds UTRs and models the alternative splicing isoforms that are well supported by the transcriptome data.

I hope this helps. I'm happy to continue to look into any issues. Hopefully my stderr/stdout files will provide a reference for hunting things down too.

best,

~brian

evm.stdout

evm.stderr

Brian Haas

Jan 30, 2019, 8:59:22 AM

to Monica Britton, EVidenceModeler-users

Hi Monica,

You've stumped me on this one. I haven't seen this before and it's not obvious to me where it would have come from. If you find out the location of the specific evm.out.gff3 that it came from, that might offer more clues, followed by looking at the specific command that was used to generate that corresponding evm.out file.
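A minimal sketch of how that search could be done: walk the partition directories, read every file whose name ends in evm.out.gff3, and report any feature line whose first column is not a sequence name in genome.fa. The directory layout in the demo (including the "stray" directory) is made up for illustration, and the function names are hypothetical helpers, not part of EVM.

```python
import os
import tempfile


def fasta_ids(fasta_path):
    """Collect sequence names (first token of each '>' header) from a FASTA file."""
    ids = set()
    with open(fasta_path) as fh:
        for line in fh:
            if line.startswith(">"):
                tokens = line[1:].split()
                if tokens:
                    ids.add(tokens[0])
    return ids


def unknown_seqid_lines(root, known_ids, suffix="evm.out.gff3"):
    """Return (path, line_number, seqid) for lines whose seqid is not in known_ids."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a stable order
        for name in sorted(filenames):
            if not name.endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            with open(path) as fh:
                for n, line in enumerate(fh, start=1):
                    if not line.strip() or line.startswith("#"):
                        continue
                    seqid = line.split("\t", 1)[0]
                    if seqid not in known_ids:
                        hits.append((path, n, seqid))
    return hits


# Demo on a tiny made-up directory tree standing in for the partition layout.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "genome.fa"), "w") as fh:
        fh.write(">Chr02 example\nACGTACGT\n")
    good_dir = os.path.join(root, "Chr02", "Chr02_1-500000")
    stray_dir = os.path.join(root, "stray")
    os.makedirs(good_dir)
    os.makedirs(stray_dir)
    with open(os.path.join(good_dir, "evm.out.gff3"), "w") as fh:
        fh.write("Chr02\tEVM\tgene\t1\t8\t.\t+\t.\tID=g1\n")
    with open(os.path.join(stray_dir, "evm.out.gff3"), "w") as fh:
        fh.write("genome.fa\tEVM\tgene\t1\t8\t.\t+\t.\tID=g2\n")

    known = fasta_ids(os.path.join(root, "genome.fa"))
    demo_hits = [
        (os.path.relpath(path, root), n, seqid)
        for path, n, seqid in unknown_seqid_lines(root, known)
    ]

print(demo_hits)
```

Pointing `unknown_seqid_lines` at the top-level EVM working directory would list the file and line for each "genome.fa" entry, which in turn identifies the command file to inspect.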

best,

~brian

On Tue, Jan 29, 2019 at 9:35 PM <mtbr...@ucdavis.edu> wrote:

Hi Brian:

Thanks, this was very helpful. One more thing I just noticed with the EVM output ... when I generate the combined gff3 file with the command

find . -regex ".*evm.out.gff3" -exec cat {} \; > EVM.all.gff3

there are some strange lines in the gff3 file. In addition to the gff3 lines for all the chromosomes and scaffolds that are in genome.fa, it also includes some lines with genome.fa as the Chr (see attached). This causes an error when I run gff3_file_to_proteins.pl because it can't find a chromosome called "genome.fa". I haven't been able to figure out where these lines are coming from. I'm guessing something went wrong with one of my partitions, but I can't find "genome" in any of the partition-specific gff3 files.

Any suggestions on that? I don't want to just delete them if they are real genes.

Thanks,

Monica
