Using miniprot output with EVM

270 views
Skip to first unread message

Krešimir Križanović

unread,
Feb 14, 2023, 6:22:15 AM2/14/23
to EVidenceModeler-users
Hello,

I'm trying to do gene discovery and annotation on a large plant genome that I've assembly as a part of a project.

I've completed PASA with Trinity transcriptome assembly (both de-novo and genome guided),

Now I'm trying to to ab-initio gene discovery with BRAKER/Augustus and homology based discovery with an appropriate protein set. I've mapped proteins to my genome using recently published miniprot tool (https://github.com/lh3/miniprot).

How I would like to put that all into EVM. However, miniprot gff output does not seem to match what EVM requires. Example of miniprot output:

##PAF   YP_010239391.1  510     0       510     -       HiC_scaffold_1543       48322   15314   17543   1530    1530    0       AS:i:2579       ms:i:2629       np:i:510        da:i:0  do:i:0  cg:Z:259M699N251M       cs:Z::259~gt699ac:251
HiC_scaffold_1543       miniprot        mRNA    15312   17543   2629    -       .       ID=MP002337;Rank=22;Identity=1.0000;Positive=1.0000;Target=YP_010239391.1 1 510
HiC_scaffold_1543       miniprot        CDS     16767   17543   1353    -       0       ID=MP002337.1;Parent=MP002337;Rank=22;Identity=1.0000;Target=YP_010239391.1 1 259
HiC_scaffold_1543       miniprot        CDS     15315   16067   1276    -       0       ID=MP002337.2;Parent=MP002337;Rank=22;Identity=1.0000;Acceptor=AC;Target=YP_010239391.1 260 510
HiC_scaffold_1543       miniprot        stop_codon      15312   15314   0       -       0       ID=MP002337.3;Parent=MP002337;Rank=22

I have two questions.

1. In my weights file, how should I classify miniprot output, by format it seems it should be OTHER_PREDICTION or ABINITIO_PREDICTION.
2. I obviously need to reformat the output. I can write my own Python script to do it, but I need a few pointers. Correct me if I'm wrong:
- I need to add a gene record for each sett of GFF lines, instead of comment line in the original file, and set it as the parent of mRNA record
- for each CDS record, I need to add corresponding exon record, both CDS and exon have mRNA as the parent record
- I can delete the stop_codon record

Let me know if this is correct and if I need to also change anything else.

Krešimir Križanović

Brian Haas

unread,
Feb 14, 2023, 9:11:21 AM2/14/23
to Krešimir Križanović, EVidenceModeler-users

Hi Kresimir,

Usually this kind of information would be included as the alignment-gff3 formatting and specified in the weights as PROTEIN type.

If miniprot is giving you actual gene predictions and these look to be high quality, then you could consider representing them as OTHER_PREDICTION type and providing the gene-prediction type gff3 formatting.

I haven't experimented with miniprot yet but looking forward to doing so.  If you end up writing a converter for either or both formatting options, and want to contribute to the codebase, contributions are greatly appreciated!

best,

~b


--
You received this message because you are subscribed to the Google Groups "EVidenceModeler-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to evidencemodeler-...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/evidencemodeler-users/33d0c17c-2289-4c3b-bad6-a6cd73027c63n%40googlegroups.com.


--
--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

 

Krešimir Križanović

unread,
Feb 16, 2023, 3:49:25 AM2/16/23
to EVidenceModeler-users
Hello,

I've run EVM with miniprot data two times now. In the first run, I've specified the weight as PROTEIN and in the second run I've used OTHER_PREDICTION. However, I've used the same GFF file format both times, the one for prediction. Aside from miniprot data, I've use only passa output (sample_mydb_pasa.sqlite.pasa_assemblies.gff3) as TRANSCRIPT. Pasa output was obtained by combining two Trinity outputs (de-novo and reference-based) and running Pasa alignAssembly.

I'm not sure about the results, they seem to be roughly the same in number. About 15k gene records in EVM.gff3 output file for both runs. I still have to find a ways to asses this better.

I have a couple of questions

1. I would like to see how many genes I've discovered :). When I look at PASA output (gtf format), there seems to be about 105k different gene_id fields, named PASA_cluster_XXXXX. I have just looked for the largest cluster number in the gene_id field, and it is a bit over 105k. However, in the EVM output there is about 15300 gene records. I was expecting 30k-40k genes, based on what a colleague got using Maker. Can you perhaps comment on that, and maybe give me some advice on what to do to try and get more genes predicted?

2. I want to run BUSCO on the results, should I use .EVM.cds file for BUSCO?

3. I also have RNA data from 4 different parts of the plant. However, i've used that for Trinity and PASA. Is there any point in mapping that data to my genome and then also passing the mapping to EVM?

I'll try to finish the scripts to convert miniprot data by the end of the week and send them to you.

K.K.

Brian Haas

unread,
Feb 16, 2023, 9:10:25 AM2/16/23
to Krešimir Križanović, EVidenceModeler-users
Hi Kresimir,

responses below

On Thu, Feb 16, 2023 at 3:49 AM Krešimir Križanović <kkriz...@gmail.com> wrote:
Hello,

I've run EVM with miniprot data two times now. In the first run, I've specified the weight as PROTEIN and in the second run I've used OTHER_PREDICTION. However, I've used the same GFF file format both times, the one for prediction. Aside from miniprot data, I've use only passa output (sample_mydb_pasa.sqlite.pasa_assemblies.gff3) as TRANSCRIPT. Pasa output was obtained by combining two Trinity outputs (de-novo and reference-based) and running Pasa alignAssembly.


There are several issues to consider here. 
- EVM will work poorly given alignment data alone, as alignment data won't provide for complete sets of exons needed to model most genes.
- The PROTEIN and OTHER_PREDICTION require very different gff3 formatting. If they aren't what's expected, then it's not going to leverage the corresponding data type effectively or at all.
- only use the OTHER_PREDICTION type if you're getting complete gene models that begin with start codons and end with stop codons.
- the best way to incorporate PASA as both OTHER_PREDICTION and TRANSCRIPT evidence types is now indicated here: https://github.com/EVidenceModeler/EVidenceModeler/wiki#incorporating-rna-seq-via-pasa


 
I'm not sure about the results, they seem to be roughly the same in number. About 15k gene records in EVM.gff3 output file for both runs. I still have to find a ways to asses this better.


Take a look at the evm.out file generated for one of your larger contigs.  It will indicate the evidence chosen as support for the various gene structure components. This will give you insight into how the evidence was being leveraged. If an evidence type is lacking where you would expect it to exist, then it could indicate a problem with the input file formatting or other EVM setting.
 
I have a couple of questions

1. I would like to see how many genes I've discovered :). When I look at PASA output (gtf format), there seems to be about 105k different gene_id fields, named PASA_cluster_XXXXX. I have just looked for the largest cluster number in the gene_id field, and it is a bit over 105k. However, in the EVM output there is about 15300 gene records. I was expecting 30k-40k genes, based on what a colleague got using Maker. Can you perhaps comment on that, and maybe give me some advice on what to do to try and get more genes predicted?


Try generating the input as indicated above for PASA and including that. If you have other evidence types that are going into Maker, or want to include the Maker predictions too, that would be fine.

You can send me your inputs for one of the larger contigs and I'm happy to take a look to see if there are any issues. Just send privately to bhaas at broadinstitute dot org

 
2. I want to run BUSCO on the results, should I use .EVM.cds file for BUSCO?

Yes
 

3. I also have RNA data from 4 different parts of the plant. However, i've used that for Trinity and PASA. Is there any point in mapping that data to my genome and then also passing the mapping to EVM?

It would be fine to include transcript alignments in addition to the PASA data. This would help in areas where the alignments weren't perfect according to PASA's validation criteria, but could still be useful for modeling parts of a gene.
 

I'll try to finish the scripts to convert miniprot data by the end of the week and send them to you.


No rush. I'll be out of the office from tomorrow through most of next week.  Definitely looking forward to incorporating your contributions into the codebase, though!   Thx in advance!

 
Reply all
Reply to author
Forward
0 new messages