generate gff/gtf from the okay set

65 views

Skip to first unread message

June Ordoñez

unread,

Feb 14, 2024, 8:35:37 AM2/14/24

to EvidentialGene

Dear Gilbert,

I used tr2aacds to build a reference transcriptome for scRNAseq data (working with non-model organism). The scRNAseq pipeline required the gtf of the reference. I ended up using transdecoder to generate this as this is the only tool I know that could generate the gff. However, there's a difference in the final protein count between okay.aa and transdecoder.aa, which I am not sure how much it affects the downstream analyses. And I am aware that transdecoder might not give the most accurate protein set (as you have mentioned in some of your posts).

I would like to ask for some advice on how to generate the gff/gtf of the okayset.

Thank you!

Kind regards,

Don Gilbert

unread,

Feb 20, 2024, 2:25:58 PM2/20/24

to EvidentialGene

You may want to use the gtf from transdecoder, as I don't know what data this software scRNAseq is looking for. Transcripts don't have any annotation locations but for the simple coding start-stop points, unless some other method adds such. Is that what you are looking for? E.g transcriptID .. coding-start coding-stop ... in GTF or GFF format?

The CDS start-stop span is in Evigene's okayset sequence file headers, e.g. offs=83-583

>BemtraEVm011246t1 type=mRNA; aalen=166,60%,complete; clen=833; offs=83-583; organism=Bembidion_haplogonum; evgclass=main;

One can use a perl one-liner like this to make a table in GTF format of the CDS start-stop span if that is your need:

zgrep '>' okayset/bbeet1r.okay.mrna.gz | perl -ne '($id)=m/>(\S+)/; ($len)=m/clen=(\d+)/; ($ob,$oe)=m/offs=(\d+).(\d+)/; ($gd=$id)=~s/t\d+$//; print join("\t",$id,"evg","CDS",$ob,$oe,1,"+",".","gene_id \"".$gd."\"; transcript_id \"".$id."\"")."\n";' > bbeet1r.okay.mrna.gtf

Looks like this:

BemtraEVm028654t1 evg CDS 749 913 1 + . gene_id "BemtraEVm028654"; transcript_id "BemtraEVm028654t1"

BemtraEVm035533t1 evg CDS 349 480 1 + . gene_id "BemtraEVm035533"; transcript_id "BemtraEVm035533t1"

BemtraEVm021797t1 evg CDS 127 339 1 + . gene_id "BemtraEVm021797"; transcript_id "BemtraEVm021797t1"