Hello,
I'm trying to do gene discovery and annotation on a large plant genome that I've assembled as part of a project.
I've completed PASA with Trinity transcriptome assemblies (both de novo and genome-guided).
Now I'm trying to do ab initio gene discovery with BRAKER/Augustus and homology-based discovery with an appropriate protein set. I've mapped proteins to my genome using the recently published miniprot tool (
https://github.com/lh3/miniprot).
Now I would like to put all of that into EVM. However, miniprot's GFF output does not seem to match what EVM requires. Example of miniprot output:
##PAF YP_010239391.1 510 0 510 - HiC_scaffold_1543 48322 15314 17543 1530 1530 0 AS:i:2579 ms:i:2629 np:i:510 da:i:0 do:i:0 cg:Z:259M699N251M cs:Z::259~gt699ac:251
HiC_scaffold_1543 miniprot mRNA 15312 17543 2629 - . ID=MP002337;Rank=22;Identity=1.0000;Positive=1.0000;Target=YP_010239391.1 1 510
HiC_scaffold_1543 miniprot CDS 16767 17543 1353 - 0 ID=MP002337.1;Parent=MP002337;Rank=22;Identity=1.0000;Target=YP_010239391.1 1 259
HiC_scaffold_1543 miniprot CDS 15315 16067 1276 - 0 ID=MP002337.2;Parent=MP002337;Rank=22;Identity=1.0000;Acceptor=AC;Target=YP_010239391.1 260 510
HiC_scaffold_1543 miniprot stop_codon 15312 15314 0 - 0 ID=MP002337.3;Parent=MP002337;Rank=22
I have two questions.
1. In my weights file, how should I classify miniprot output? Judging by the format, it seems it should be either OTHER_PREDICTION or ABINITIO_PREDICTION.
2. I obviously need to reformat the output. I can write my own Python script to do it, but I need a few pointers. Correct me if I'm wrong:
- I need to add a gene record for each set of GFF lines, in place of the comment line in the original file, and set it as the parent of the mRNA record
- for each CDS record, I need to add a corresponding exon record; both the CDS and the exon have the mRNA as their parent record
- I can delete the stop_codon record
Let me know if this is correct and if I need to also change anything else.
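For reference, here is a minimal sketch of the conversion I have in mind. It only implements the three steps listed above and rests on assumptions I have not verified: that the real miniprot file is tab-separated with nine columns, that each mRNA line precedes its CDS lines, and that the ID suffixes I invented (`.gene`, `.exon`) are acceptable to EVM. The function name `convert_miniprot_gff` is just a placeholder. One thing I am unsure about: once the stop_codon record is dropped, the mRNA (and so the gene) still spans those 3 bp, so it extends past the last exon.

```python
def convert_miniprot_gff(lines):
    """Yield gene/mRNA/exon/CDS lines built from miniprot GFF lines.

    Assumes tab-separated, 9-column GFF with each mRNA line appearing
    before its CDS children (as in the example above).
    """
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comments, including the ##PAF lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        ftype = cols[2]
        # Parse "key=value;key=value" attributes into a dict.
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)

        if ftype == "mRNA":
            mrna_id = attrs["ID"]
            gene_id = mrna_id + ".gene"  # hypothetical naming scheme
            # New gene record with the same coordinates as the mRNA.
            gene = cols[:2] + ["gene"] + cols[3:5] + [".", cols[6], ".", "ID=" + gene_id]
            yield "\t".join(gene)
            # Original mRNA record, now pointing at the gene as its parent.
            yield "\t".join(cols[:8] + [cols[8] + ";Parent=" + gene_id])
        elif ftype == "CDS":
            # New exon record with the same coordinates and parent as the CDS.
            exon_attrs = "ID={}.exon;Parent={}".format(attrs["ID"], attrs["Parent"])
            exon = cols[:2] + ["exon"] + cols[3:5] + [".", cols[6], ".", exon_attrs]
            yield "\t".join(exon)
            yield line  # CDS record kept unchanged
        # stop_codon (and any other feature type) is dropped.
```

I would call it as `for out in convert_miniprot_gff(open("miniprot.gff")): print(out)`, where `miniprot.gff` is a placeholder file name.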
Krešimir Križanović