generate gff/gtf from the okay set

64 views
Skip to first unread message

June Ordoñez

unread,
Feb 14, 2024, 8:35:37 AM2/14/24
to EvidentialGene
Dear Gilbert,

I used tr2aacds to build a reference transcriptome for scRNAseq data (working with non-model organism). The scRNAseq pipeline required the gtf of the reference. I ended up using transdecoder to generate this as this is the only tool I know that could generate the gff.  However, there's a difference in the final protein count between okay.aa and transdecoder.aa, which I am not sure how much it affects the downstream analyses. And I am aware that transdecoder might not give the most accurate protein set (as you have mentioned in some of your posts).

I would like to ask for some advice on how to generate the gff/gtf of the okayset. 

Thank you!

Kind regards,
J.

Don Gilbert

unread,
Feb 20, 2024, 2:25:58 PM2/20/24
to EvidentialGene
You may want to use the gtf from transdecoder, as I don't know what data this software scRNAseq is looking for.  Transcripts don't have any annotation locations but for the simple coding start-stop points, unless some other method adds such.  Is that what you are looking for?  E.g  transcriptID .. coding-start  coding-stop ... in GTF or GFF format?

The CDS start-stop span is in Evigene's okayset sequence file headers, e.g.  offs=83-583 
    >BemtraEVm011246t1 type=mRNA; aalen=166,60%,complete; clen=833; offs=83-583; organism=Bembidion_haplogonum; evgclass=main; 

One can use a perl one-liner like this to make a table in GTF format of the CDS start-stop span if that is your need:

zgrep '>'  okayset/bbeet1r.okay.mrna.gz |  perl -ne  '($id)=m/>(\S+)/; ($len)=m/clen=(\d+)/; ($ob,$oe)=m/offs=(\d+).(\d+)/; ($gd=$id)=~s/t\d+$//; print join("\t",$id,"evg","CDS",$ob,$oe,1,"+",".","gene_id \"".$gd."\"; transcript_id \"".$id."\"")."\n";'  > bbeet1r.okay.mrna.gtf

Looks like this:

BemtraEVm028654t1 evg CDS 749 913 1 + . gene_id "BemtraEVm028654"; transcript_id "BemtraEVm028654t1"

BemtraEVm035533t1 evg CDS 349 480 1 + . gene_id "BemtraEVm035533"; transcript_id "BemtraEVm035533t1"

BemtraEVm021797t1 evg CDS 127 339 1 + . gene_id "BemtraEVm021797"; transcript_id "BemtraEVm021797t1"

-- Don Gilbert

Reply all
Reply to author
Forward
0 new messages